<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Using a Particle Filter for Gesture Recognition</TITLE>
<META http-equiv=Content-Type content="text/html; charset=gb2312">
<META content="MSHTML 6.00.2743.600" name=GENERATOR></HEAD>
<BODY>Back to my <A href="http://www.mit.edu/~alexgru/">homepage</A>
<H1>Using a Particle Filter for Gesture Recognition</H1>
<H2>Alexander Gruenstein </H2>
<H3>Introduction</H3>For my final project, I experimented with applying a
particle filter to the problem of gesture recognition. Specifically, I attempted
to differentiate between the following two American Sign Language (ASL) signs
(special thanks goes to Anna for allowing me to film her!) (please click on the
signs to see a sample video):
<UL>
<LI><A href="http://www.mit.edu/~alexgru/vision/sign2_3.mov">Leftover</A>
<LI><A href="http://www.mit.edu/~alexgru/vision/sign3_3.mov">Paddle</A>
</LI></UL>My procedure was as follows:
<OL>
<LI>Film 7 examples of each sign
<LI>Find skin-colored pixels in every frame
<LI>Find the three largest 'blobs' of skin-colored pixels in each frame (the
head, right hand, and left hand)
<LI>Calculate the motion trajectories of each blob over time
<LI>Create a model of the average trajectories of the blobs for each of the
two signs, using 5 of the examples
<LI>Use a particle filter to classify both the test and training sets of video
sequences as one of the two signs </LI></OL>
<H3>Related Work </H3>
<P>There has been a lot of work in recognizing sign language. My work builds
mainly on two papers: </P>
<P>Black and Jepson have previously applied particle filters (CONDENSATION) to
the problem of gesture recognition. Their work differs from mine in that they
recognized only 'whiteboard' style gestures like "save," "cut," and "paste" made
with a distinctly colored object (a "phicon"). In my project, I recognize actual
ASL signs made with two hands. </P>
<P>Yang and Ahuja also use motion trajectories to recognize ASL signs. However,
they use a Time-Delay Neural Network (TDNN) to classify signs, while I use a
particle filter. </P>
<P>My work, then, builds on Yang and Ahuja's in the sense that I use the motion
trajectories of the hands to recognize ASL signs. Unlike Yang and Ahuja, I don't
attempt to robustly solve the problem of tracking the hands in the first
place. I extend the work of Black and Jepson by applying CONDENSATION to
multiple motion trajectories simultaneously.</P>
<H3>Filming The Examples </H3>The image sequences were filmed using a Sony
DCR-TRV900 MiniDV camera. They were manually aligned and then converted into
sequences of TIFs to be processed in MATLAB. Each TIF was 243x360 pixels, 24-bit
color. The lighting and background in each sequence are held constant; the
background is not cluttered. The focus of my project was not to solve the
tracking problem, hence I wanted the hands to be relatively easy to track. I
collected 7 film sequences of each sign.
<H3>Finding Skin-Colored Pixels </H3>
<P>In order to segment out skin-colored pixels, I used the color_segment routine
we developed in MATLAB for our last homework assignment. Every image in each
sequence was divided into the following regions: skin, background, clothes,
and outliers. The source code is here:
<UL>
<LI><A
href="http://www.mit.edu/~alexgru/vision/color_segment.m">color_segment.m</A>
<LI><A
href="http://www.mit.edu/~alexgru/vision/gaussdensity.m">gaussdensity.m</A>
<LI><A
href="http://www.mit.edu/~alexgru/vision/image_to_hsv2data.m">image_to_hsv2data.m</A>
</LI></UL>The original image: <BR><IMG height="50%"
src="Using a Particle Filter for Gesture Recognition.files/segment.jpg"
width="50%"> <BR>The skin pixel mask: <BR><IMG height="50%"
src="Using a Particle Filter for Gesture Recognition.files/segment_skin.jpg"
width="50%"> <BR>
<H3>Finding Skin-Colored Blobs </H3>
<P>I then calculated the centroids of the three largest skin colored 'blobs' in
each image. Blobs were calculated by processing the skin pixel mask generated in
the previous step. A blob is defined to be a connected region of 1's in the
mask. Finding blobs turned out to be a bit more difficult than I had originally
thought. My first implementation was a straightforward recursive algorithm which
scans the image top down, left to right, until it comes across a skin pixel which
has yet to be assigned to a blob. It then recursively checks each of that
pixel's neighbors to see if they too are skin pixels. If they are, it assigns
them to the same blob and recurses. On such large images, this quickly led to
stack overflow and huge inefficiency in MATLAB. </P>
<P>The working algorithm I eventually came up with is an iterative one that
scans the skin pixel mask left to right, top down. When it comes across a
skin pixel that has yet to be assigned to a blob, it first checks that pixel's
neighbors (to the left and above) to see if they are in a blob. If they aren't,
it creates a new blob and adds the newly found pixel to it. If any of the
neighbors are in a blob, it assigns the pixel to that neighbor's blob. However,
the two neighbors might lie in two different blobs, so these blobs must be
merged into a single blob.</P>
<P>Finally, the algorithm searches for the 3 largest blobs and calculates each
of their respective centroids. </P>
<P>The MATLAB code can be found here: <A
href="http://www.mit.edu/~alexgru/vision/blob2.m">blob2.m</A> </P>
<P>These videos show the 3 largest skin-colored blobs tracked for the two
example video sequences above (ignore the funny colors -- they are just an
artifact of the blob search).</P>
<UL>
<LI><A href="http://www.mit.edu/~alexgru/vision/tracking2_3.mov">Leftover</A>
<LI><A href="http://www.mit.edu/~alexgru/vision/tracking3_3.mov">Paddle</A>
</LI></UL>
<H3>Calculating the Blobs' Motion Trajectories over Time </H3>At this point,
tracking the trajectories of the blobs over time was fairly simple. For a given
video sequence, I made a list of the position of the centroid for each of the 3
largest blobs in each frame (source code: <A
href="http://www.mit.edu/~alexgru/vision/centroids.m">centroids.m</A>). Then, I
examined the first frame in the sequence and determined which centroid was
farthest to the left and which was farthest to the right. The one on the left
corresponds to the right hand of the signer; the one on the right corresponds to
the left hand of the signer. Then, for each successive frame, I simply determined
which centroid was closest to the previous left centroid and called this
the new left centroid; I did the same for the blob on the right. Once the two
blobs were labeled, I calculated the horizontal and vertical velocity of both
blobs across the two frames as (change in position)/(change in time). I recorded these
values for each sequential frame pair in the sequence. The source code is here:
<A
href="http://www.mit.edu/~alexgru/vision/split_centroids.m">split_centroids.m</A>.
<H3>Creating the Motion Models </H3>I then created models of the hand motions
involved in each sign. Specifically, for each frame in the sign, I used 5
training instances to calculate the average horizontal and vertical velocities
of both hands in that particular frame. The following graphs show the models
derived for both signs (these turned out a bit grainy as JPEGs; there are PDF
versions here: <A
href="http://www.mit.edu/~alexgru/vision/model1.pdf">model1.pdf</A> and <A
href="http://www.mit.edu/~alexgru/vision/model2.pdf">model2.pdf</A>).<BR><BR>Model
1 "leftover": <BR><IMG height="80%"
src="Using a Particle Filter for Gesture Recognition.files/model1.jpg"
width="80%"> <BR>Model 2 "paddle": <BR><IMG height="80%"
src="Using a Particle Filter for Gesture Recognition.files/model2.jpg"
width="80%"> <BR><BR>Sample source code for creating a model is here: <A
href="http://www.mit.edu/~alexgru/vision/make_model2.m">make_model2.m</A>
<H3>Using CONDENSATION to Classify New Video Sequences</H3>All the image
preprocessing is now finished, and the two motion models have been created. What
follows is a brief description of the Condensation algorithm and then a
description of how I applied it to this specific task.
<H3><A name=SECTION00010000000000000000>The Basics of Condensation</A></H3>
<P>The Condensation algorithm (Conditional Density Propagation over time) makes
use of random sampling in order to model arbitrarily complex probability density
functions. That is, rather than attempting to fit a specific equation to
observed data, it uses <I>N</I> weighted samples to approximate the curve
described by the data. Each sample consists of a <EM>state</EM> and a
<EM>weight</EM> proportional to the probability that the state accounts for the
observed data. As the number of samples increases, the precision with which the
samples model the observed pdf increases.
<P>Now assume that a series of observations are made during time steps 1, 2,
..., <I>t</I>. In order to generate the new sample set at time <I>t</I>+1,
states are randomly selected (with replacement) from the sample set at <I>t</I>,
based on their weight; that is, the weight of each sample determines the
probability it will be chosen. Given such a randomly sampled state
<I>s<SUB>t</SUB></I>, a prediction of a new state <I>s<SUB>t+1</SUB></I> at time
step <I>t</I>+1 is made based on a predictive model. This corresponds to
sampling from the process density
<I>p</I>(<I>s<SUB>t+1</SUB></I>&nbsp;|&nbsp;<I>s<SUB>t</SUB></I>), where
<I>s<SUB>t</SUB></I> is a vector of parameters describing the object's state.
Finally, <I>s<SUB>t+1</SUB></I> is assigned a weight proportional to the
probability <I>p</I>(<I>z<SUB>t+1</SUB></I>&nbsp;|&nbsp;<I>s<SUB>t+1</SUB></I>),
where <I>z<SUB>t+1</SUB></I> is a set of parameters describing the observed
state of the object at time <I>t</I>+1. Then the process iterates for the next
observation. In this way, predicted states that correspond better to the data
receive larger weights. Since arbitrarily complex pdfs can be modeled, an
arbitrary number of competing hypotheses (assuming sufficiently large <I>N</I>)
can be maintained until a single hypothesis dominates.
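<P>A minimal sketch of one iteration of this resample / predict / reweight loop
is shown below; the predict and likelihood function handles stand in for
whatever process density and observation model a particular application
supplies, and the names are my own.</P>
<PRE>
% Illustrative sketch of one Condensation (particle filter) iteration.
% states:     N x D matrix, one sample per row
% weights:    N x 1 vector of normalized sample weights
% predict:    handle drawing s_{t+1} from the process density p(s_{t+1} | s_t)
% likelihood: handle returning p(z_{t+1} | s_{t+1}) for observation z_next
function [states, weights] = condensation_step(states, weights, ...
                                               predict, likelihood, z_next)
  N = size(states, 1);
  % 1) resample with replacement, with probability proportional to weight
  cdf = cumsum(weights);  cdf = cdf / cdf(end);
  picks = arrayfun(@(u) find(cdf >= u, 1), rand(N, 1));
  states = states(picks, :);
  % 2) predict: propagate each chosen sample through the process density
  for n = 1:N
    states(n, :) = predict(states(n, :));
  end
  % 3) reweight by how well each predicted state explains the new observation
  for n = 1:N
    weights(n) = likelihood(z_next, states(n, :));
  end
  weights = weights / sum(weights);        % renormalize
end
</PRE>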
<P>
<H3><A name=SECTION00020000000000000000>Applying Condensation to Recognizing
ASL</A></H3>
<P>In order to apply the Condensation Algorithm to sign-language recognition, I
extend the methods described by Black and Jepson. Specifically, a <EM>state</EM>
at time <I>t</I> is described as a parameter vector
<I>s<SUB>t</SUB></I> = (&mu;, &phi;<SUP><I>i</I></SUP>, &alpha;<SUP><I>i</I></SUP>,
&rho;<SUP><I>i</I></SUP>), where: <BR><BR>&mu; is the integer index of the
predictive model; <BR>&phi;<SUP><I>i</I></SUP> indicates the current position in
the model; <BR>&alpha;<SUP><I>i</I></SUP> refers to an amplitude scaling factor;
<BR>&rho;<SUP><I>i</I></SUP> is a scale factor in the time dimension;
<BR>where <I>i</I> &isin; {1, 2}. <BR><BR>Note that <I>i</I> indicates which
hand's motion trajectory the &phi;<SUP><I>i</I></SUP>, &alpha;<SUP><I>i</I></SUP>,
or &rho;<SUP><I>i</I></SUP> refers to. My models contain data about the motion
trajectory of both the left hand and the right hand; by allowing two sets of
parameters, I allow the motion trajectory of the left hand to be scaled and
shifted separately from the motion trajectory of the right hand (so, for example,
&phi;<SUP>1</SUP> refers to the current position in the model for the left hand's
trajectory, while &phi;<SUP>2</SUP> refers to the current position in the model
for the right hand's trajectory).</P>
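<P>To make the state vector concrete, here is a minimal sketch of a prediction
(diffusion) step for a single sample, in the general spirit of Black and
Jepson's parameter dynamics. The noise scales, the clamping of the positions,
and the field names are illustrative placeholders, not the values or code I
actually used.</P>
<PRE>
% Illustrative sketch: diffuse one sample's parameters to predict its state at
% the next frame. s is a struct with fields mu (model index), phi (2x1 current
% positions in the model), alpha (2x1 amplitude scales), and rho (2x1 time
% scales); model_lengths(mu) is the number of frames in model mu.
function s = predict_state(s, model_lengths)
  sigma_phi = 1.0;  sigma_alpha = 0.05;  sigma_rho = 0.05;   % placeholders
  s.phi   = s.phi + s.rho + sigma_phi * randn(2, 1);   % advance along model
  s.alpha = s.alpha + sigma_alpha * randn(2, 1);       % drift the amplitude
  s.rho   = s.rho + sigma_rho * randn(2, 1);           % drift the time scale
  s.phi   = min(max(s.phi, 1), model_lengths(s.mu));   % stay inside the model
end
</PRE>
<P>The weight of the predicted sample would then be based on how well the
observed hand velocities over a recent window match &alpha;<SUP><I>i</I></SUP>
times the model velocities around positions &phi;<SUP><I>i</I></SUP>.</P>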
Ctrl + -