Object detection and recognition face a chicken-and-egg dilemma: before an object can be IDENTIFIED, it must be DETECTED; yet to DETECT an object, we must have a system that can IDENTIFY it.
To crack this problem, I prefer a two-stage architecture proposed by Rensink and implemented by Walther. In the first stage, "perceptual blobs", or "proto-objects", are detected by plain detection algorithms. Identification mechanisms then work directly on these proto-objects.
This architecture implies two things:
1. To mimic this early-stage visual processing, we may not resort to information that would be available only in late stages.
2. The result of the first stage may be crude; in some cases the proto-objects are not actual objects. However, if we expect otherwise, we are asking the detecting system to IDENTIFY, which is obviously unattainable given the computational constraints.
The motivation for Spectral Residual is plain and straightforward: I am building an early-stage attention model, so the model must be as simple as possible, free of training or hand-labeling, and, most importantly, free of parameter tuning.
Given these extremely tough constraints, I failed to find a common property shared by different targets, but there is one property shared by backgrounds in the spectral domain. By eliminating the homogeneous background, the "residual" parts are detected as proto-objects.
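
As a rough illustration of the idea above, the Python sketch below follows the computation of the published spectral residual method as far as I understand it: take the log amplitude spectrum, subtract a locally averaged copy of it (standing in for the smooth "background" part), and transform the residual back to the image domain with the original phase. The function names, the 3x3 averaging window, the replicated-edge padding, and the box blur used for the final smoothing (in place of a Gaussian) are assumptions of this sketch, not details given in the text here.

```python
import numpy as np


def box_blur(a, k=3):
    """Mean filter with a k x k window; borders are padded by replication."""
    pad = k // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)


def spectral_residual_saliency(image, k=3):
    """Return a saliency map scaled to [0, 1] for a 2-D grayscale array.

    k is the averaging window size (an assumed default for this sketch).
    """
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)  # epsilon avoids log(0)
    phase = np.angle(spectrum)

    # The smoothed log amplitude plays the role of the homogeneous background;
    # what is left after subtracting it is the spectral residual.
    residual = log_amplitude - box_blur(log_amplitude, k)

    # Back to the image domain, reusing the original phase.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = box_blur(saliency, k)  # assumed stand-in for Gaussian smoothing

    span = saliency.max() - saliency.min()
    if span == 0:
        return np.zeros_like(saliency)
    return (saliency - saliency.min()) / span
```

Bright regions of the returned map would be the candidate proto-objects; thresholding the map is one possible way to extract them, but the threshold choice is outside what this sketch covers.
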
Spatial local cues are the primary information used in present-day models. What makes my spectral residual approach unique is its SPECTRAL representation. In the field of vision, the spectral representation receives much less attention than it deserves. As far as I know, the best-known use of the Fourier spectrum is Oliva's Gist of the Scene.
The neglect of the Fourier spectrum may be due to a long-held idea that the information behind the amplitude spectrum is only trivial (see Chapter 7 of Computer Vision by Forsyth for reference). But as Spectral Residual demonstrates, I believe there is gold hidden in the mess of the Fourier spectrum.