1 Introduction
1.1 Motivation
Imagine the following scenario: you are visiting the street carnival in Cologne,
Germany for the first time. Fascinated by the colorful and imaginative costumes
of the people around you, your gaze wanders from one exciting spot
to the next: here a clown with a fancy dress, there a small boy masqueraded
as Harry Potter. But not only visual cues capture your attention: over there
a band starts to play the new hit of the year and the smell of fresh cookies
from the right also revives your interest. Suddenly you remember that you did
not come here alone: where has your friend gone? You start to look around,
finding her is not easy in the crowd. You remember that she wears a yellow
hat, a clue that could make the search easier and you start to watch out for
yellow hats. After your gaze has been distracted by some other yellow spots,
you detect the hat, recognize your friend who is just dancing with a group of
witches, and you start to push through the crowd to join them.
This scenario gives an insight into the complexity of human perception.
A wealth of information is perceived at each moment, much more than can
be processed efficiently by the brain. Nevertheless, detection and recognition
of objects usually succeed with little conscious effort. In contrast, in computer
vision and robotics the detection and recognition of objects is one of
the hardest problems [Forsyth and Ponce, 2003]. There are several sophisticated
systems for specialized tasks such as the detection of faces [Viola and
Jones, 2004] or pedestrians [Papageorgiou et al., 1998] – although even these
approaches usually fail if the target is not viewed frontally – but developing
a general system able to match the human ability to recognize thousands of
objects from different viewpoints, under changing illumination conditions and
with partial occlusions still seems to lie far in the future. A promising way
to improve the performance of technical systems is therefore to seek inspiration
from biological systems and to simulate their mechanisms: the brain is the
proof that solving the task is possible.
S. Frintrop: VOCUS: A Visual Attention System..., LNAI 3899, pp. 1-5, 2006.
© Springer-Verlag Berlin Heidelberg 2006
One of the mechanisms that make humans so effective in acting in everyday
life is the ability to extract the relevant information at an early processing
stage, a mechanism called selective attention. The extracted information is
then directed to higher brain areas where complex processes such as object
recognition take place. Restricting these processes to a limited subset of the
sensory data enables efficient processing.
A central question in selecting the relevant information is what counts as
relevant. There is no general answer, since the relevance of information
depends on the situation. With no special goal except exploring
the environment, certain cues with strong contrasts attract our attention,
for example the clown in the fancy dress. Saliency also depends on the
surroundings: the clown is much more salient in a crowd of black witches than
among other clowns. In addition to these bottom-up cues, attention is
also influenced by top-down cues, that is, cues from higher brain areas
such as knowledge, motivations, and emotions. For example, if you are hungry, the
smell of fresh cookies might capture your attention and cause you to ignore
the clown. Even more demanding is a goal: when you start to search for the
yellow hat of your friend you concentrate on yellow things on the heads of the
people around you. Other cues, even if salient, lose importance. Both bottom-up
and top-down cues compete for attention and direct your gaze to the most
interesting region. The choice of this region is not only based on visual cues
but, as suggested in the carnival example, sounds, smells, tactile sensations,
and tastes also compete for attention.
In computer vision and robotics, object detection and recognition is a field
of high interest. Applications in computer vision range from video surveillance,
traffic monitoring, driver assistance systems, and industrial inspection
to human-computer interaction, image retrieval in digital libraries, and medical
image analysis. In robotics, the detection of obstacles, the manipulation
of objects, the creation of semantic maps, and the detection of landmarks for
navigation profit considerably from object recognition.
The further the development of such systems proceeds and the more general
their tasks become, the more urgent becomes the need for a pre-selecting system
that sorts out the bulk of irrelevant information and helps to concentrate on
the currently relevant data. A system that meets these requirements is the
visual attention system VOCUS (Visual Object detection with a CompUtational
attention System) that will be presented in this work.
1.2 Scope
In this monograph, a computational attention system, VOCUS, is presented,
which detects regions of potential interest in images. First, fast and rough
mechanisms compute saliencies according to different features like intensity,
color, and orientation in parallel. If target information is available, the features
are weighted according to the properties of the target. Second, the resulting
information is fused and the most salient region is determined, yielding the
focus of attention. Finally, the focus region is passed on to complex processes
such as object recognition, which are usually costly and time-consuming. By restricting
the complex tasks to small portions of the input data, the system is
able to achieve considerable performance gains.
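
The three steps described above (per-feature saliency, optional target-dependent weighting, fusion and selection of the focus) can be illustrated with a minimal sketch. It is not the VOCUS implementation: the function names, the toy contrast measure (distance from the image mean), the two example channels, and the weight values are all assumptions chosen for illustration, and orientation is omitted for brevity.

```python
from typing import Dict, List, Optional, Tuple

Map = List[List[float]]  # a 2D map stored as rows of pixel values


def contrast_map(channel: Map) -> Map:
    """Toy saliency (assumption): absolute difference of each pixel from the image mean."""
    pixels = [v for row in channel for v in row]
    mean = sum(pixels) / len(pixels)
    return [[abs(v - mean) for v in row] for row in channel]


def fuse(feature_maps: Dict[str, Map],
         weights: Optional[Dict[str, float]] = None) -> Map:
    """Weighted sum of the feature maps; without weights every map counts equally."""
    names = list(feature_maps)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for name in names:
        w = 1.0 if weights is None else weights.get(name, 1.0)
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += w * feature_maps[name][r][c]
    return fused


def focus_of_attention(saliency: Map) -> Tuple[int, int]:
    """Return (row, col) of the most salient location."""
    best = max(
        ((v, r, c) for r, row in enumerate(saliency) for c, v in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


# Hypothetical 3x3 scene: a bright spot top-left, a yellow spot bottom-right.
intensity = [[9.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
yellow = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 6.0]]
maps = {"intensity": contrast_map(intensity), "yellow": contrast_map(yellow)}

# Without target information the stronger contrast wins.
bottom_up_focus = focus_of_attention(fuse(maps))
# With (made-up) target weights favoring yellow, the focus moves to the yellow spot.
top_down_focus = focus_of_attention(fuse(maps, {"intensity": 0.2, "yellow": 3.0}))
```

In this toy scene the unweighted fusion selects the bright spot, while the yellow-favoring weights move the focus to the yellow spot, which mirrors the search for the yellow hat in the carnival example. A costly recognizer would then only need to run on a small window around the returned location.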
The introductory example presented above already contains the four main
aspects of the monograph which are examined in the four main chapters:
first, VOCUS detects regions of interest from bottom-up cues such as strong
contrasts and uniqueness (e.g., the fancy clown); second, top-down influences
such as goal-dependent properties influence the processing and enable goal-directed
search (e.g., the yellow hat); third, information from different sensor
modes attracts the attention and is fused to yield a single focus of attention
(as the music and the smell of cookies compete for attention with the visual
cues) and finally, after directing the focus of attention to a region of interest,
object recognition takes place (e.g., recognition of the hat).
A few words to categorize the present work: computational attention systems
usually pursue one of two objectives. The first is to better understand
human perception and to provide a tool for testing whether
psychological models are plausible. The second objective is to build a
technical system that serves as a useful front end for higher-level tasks such as
object recognition and thus helps to achieve faster and more robust
recognition. This monograph concentrates on the second objective; that
is, the aim of the work is to build a system that improves the recognition
performance in computer vision and robotics.
1.3 Contributions
This monograph presents a new approach for robust object detection and
goal-directed search in images. The work is based on a well-known and widely
accepted bottom-up attention system [Itti et al., 1998]. This architecture is
extended and improved in several respects, the major one being the extension of the
system to deal with top-down influences and perform goal-directed search. A
detailed discussion of how this work differs from existing work follows in the respective
chapters; here we present a short summary of the main contributions:
• Introduction of the computational attention system VOCUS, which extends
and improves one of the standard approaches of computational attention
systems [Itti et al., 1998] in several respects, ranging from implementation
details to conceptual revisions. These improvements enable a considerable
gain in performance and robustness (chapter 4, also published in [Mitri
et al., 2005, Frintrop et al., 2005c, Frintrop et al., 2005b]).
• Presentation of a new top-down extension of VOCUS to enable goal-directed
search. Learning of target-specific properties as well as searching
for the target in a test scene are performed by the same attention system.
Detailed experiments and evaluations of the method illustrate the
behavior of the system and demonstrate its robustness in various settings.
This is the main contribution of the monograph (chapter 5, also published
in [Frintrop et al., 2005a, Mitri et al., 2005, Frintrop et al., 2005b]).
• Extension of the attention model to enable operation on different sensor
modes. Application of the system to range and reflection data from a 3D
laser scanner and investigation of the advantages of the respective sensor
modes (chapter 6, also published in [Frintrop et al., 2005c, Frintrop et al.,
2003a, Frintrop et al., 2003b]).
• Combination of the attention system with a classifier that enables object
recognition. Evaluation of the time and quality performance that is
achieved by combining the systems (chapter 7, also published in [Frintrop
et al., 2004b, Frintrop et al., 2004a, Mitri et al., 2005]).
Several aspects of these contributions were carried out in cooperation with
some of my colleagues: the data acquisition with the laser scanner (chapters
6 and 7) was performed by Andreas Nüchter and Hartmut Surmann.
The object recognition with the classifier (chapter 7) was done in cooperation
with Andreas Nüchter, Sara Mitri, and Kai Pervölz. Some of the
experiments concerning goal-directed search (chapter 5) were performed
by Uwe Weddige. Furthermore, many valuable hints and suggestions were
given by Joachim Hertzberg, Erich Rome, and Gerriet Backer.
1.4 Outline
The remainder of this monograph is structured into six chapters. The first
two are concerned with the psychological and neuro-scientific background of
visual attention (chapter 2) and with the state of the art of computational
attention systems (chapter 3), whereas the following four chapters each deal
with one of the main contributions of this work:
Chapter 4 introduces the computational attention system, describing the
details that enable the computation of a region of interest. Particular emphasis
is placed on the improvements with respect to other systems and on the
discussion of how bottom-up systems of attention may be evaluated.
Chapter 5 elaborates on top-down influences as a new approach to bias
the processing of visual input according to the properties of a target object.
It is shown how these properties are learned from one or a small selection of
training images, and how the learned information is used to find the target in
a test scene. A wide variety of experiments on artificial as well as on real-world
scenes show the effectiveness of the system.
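
To make the idea of learning target properties from a training image more concrete, the following sketch derives one weight per feature from a marked target region. The details of the actual method are given in chapter 5 and are not reproduced here; the ratio used below (mean saliency inside the target region divided by mean saliency in the rest of the image), the function name `learn_weights`, and the example values are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Map = List[List[float]]  # a 2D feature saliency map stored as rows
Region = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def learn_weights(feature_maps: Dict[str, Map], target: Region) -> Dict[str, float]:
    """Assumed rule: weight = mean saliency inside the target / mean saliency outside."""
    top, left, bottom, right = target
    weights: Dict[str, float] = {}
    for name, fmap in feature_maps.items():
        inside: List[float] = []
        outside: List[float] = []
        for r, row in enumerate(fmap):
            for c, value in enumerate(row):
                in_target = top <= r < bottom and left <= c < right
                (inside if in_target else outside).append(value)
        mean_in = sum(inside) / len(inside)
        mean_out = sum(outside) / len(outside)
        # Guard against division by zero for a perfectly dark background.
        weights[name] = mean_in / max(mean_out, 1e-9)
    return weights


# Hypothetical 2x2 training maps; the target occupies the bottom-right pixel.
training_maps = {
    "yellow": [[0.1, 0.1], [0.1, 0.9]],     # target stands out in this feature
    "intensity": [[0.8, 0.8], [0.8, 0.2]],  # background dominates this feature
}
learned = learn_weights(training_maps, target=(1, 1, 2, 2))
```

Under this assumed rule, a feature in which the target stands out receives a weight above 1 (here about 9 for "yellow"), and a feature dominated by the background receives a weight below 1 (here about 0.25 for "intensity"). Applying such weights when fusing the feature maps of a test scene biases the focus of attention toward regions that resemble the target.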
Chapter 6 examines the extension of VOCUS to several sensor modes.
The application of the attention system to range and reflection data from a
3D laser scanner illustrates how the information may be processed separately
and finally fused into a combined representation from which a single focus of
attention is computed. The advantages of each sensor mode are discussed and
the differences between saliencies in laser and camera data are highlighted.
Chapter 7 combines the attention system with a fast and powerful classifier
to enable recognition in the region of interest. It is shown how time
and quality performance improve when the two systems are combined. Finally,
chapter 8 concludes the work by summarizing the main concepts, discussing
the strengths and limitations, and giving an outlook on future work.