<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Using a Particle Filter for Gesture Recognition</TITLE>
<META http-equiv=Content-Type content="text/html; charset=gb2312">
<META content="MSHTML 6.00.2743.600" name=GENERATOR></HEAD>
<BODY>Back to my <A href="http://www.mit.edu/~alexgru/">homepage</A>
<H1>Using a Particle Filter for Gesture Recognition</H1>
<H2>Alexander Gruenstein </H2>
<H3>Introduction</H3>For my final project, I experimented with applying a
particle filter to the problem of gesture recognition. Specifically, I attempted
to differentiate between the following two American Sign Language (ASL) signs
(special thanks goes to Anna for allowing me to film her!) (please click on the
signs to see a sample video):
<UL>
<LI><A href="http://www.mit.edu/~alexgru/vision/sign2_3.mov">Leftover</A>
<LI><A href="http://www.mit.edu/~alexgru/vision/sign3_3.mov">Paddle</A>
</LI></UL>My procedure was as follows:
<OL>
<LI>Film 7 examples of each sign
<LI>Find skin-colored pixels in every frame
<LI>Find the three largest 'blobs' of skin-colored pixels in each frame (the
head, right hand, and left hand)
<LI>Calculate the motion trajectories of each blob over time
<LI>Create a model of the average trajectories of the blobs for each of the
two signs, using 5 of the examples
<LI>Use a particle filter to classify both the test and training sets of video
sequences as one of the two signs </LI></OL>
<H3>Related Work </H3>
<P>There has been a lot of work in recognizing sign language. My work builds
mainly on two papers: </P>
<P>Black and Jepson have previously applied particle filters (CONDENSATION) to
the problem of gesture recognition. Their work differs from mine in that they
recognized only 'whiteboard' style gestures like "save," "cut," and "paste" made
with a distinctly colored object (a "phicon"). In my project, I recognize actual
ASL signs made with two hands. </P>
<P>Yang and Ahuja also use motion trajectories to recognize ASL signs. However,
they use a Time-Delay Neural Network (TDNN) to classify signs, while I use a
particle filter. </P>
<P>My work, then, builds on Yang and Ahuja's in the sense that I use the motion
trajectories of the hands to recognize ASL signs. Unlike Yang and Ahuja, I don't
attempt to robustly solve the problem of tracking the hands in the first
place. I extend the work of Black and Jepson by applying CONDENSATION to
multiple motion trajectories simultaneously.</P>
<H3>Filming The Examples </H3>The image sequences were filmed using a Sony
DCR-TRV900 MiniDV camera. They were manually aligned and then converted into
sequences of TIFs to be processed in MATLAB. Each TIF was 243x360 pixels, 24-bit
color. The lighting and background in each sequence are held constant; the
background is not cluttered. The focus of my project was not to solve the
tracking problem, hence I wanted the hands to be relatively easy to track. I
collected 7 film sequences of each sign.
<H3>Finding Skin-Colored Pixels </H3>
<P>In order to segment out skin-colored pixels, I used the color_segment routine
we developed in MATLAB for our last homework assignment. Every image in each
sequence was divided into the following regions: skin, background, clothes,
and outliers. The source code is here:
<UL>
<LI><A
href="http://www.mit.edu/~alexgru/vision/color_segment.m">color_segment.m</A>
<LI><A
href="http://www.mit.edu/~alexgru/vision/gaussdensity.m">gaussdensity.m</A>
<LI><A
href="http://www.mit.edu/~alexgru/vision/image_to_hsv2data.m">image_to_hsv2data.m</A>
</LI></UL>The original image: <BR><IMG height="50%"
src="Using a Particle Filter for Gesture Recognition.files/segment.jpg"
width="50%"> <BR>The skin pixel mask: <BR><IMG height="50%"
src="Using a Particle Filter for Gesture Recognition.files/segment_skin.jpg"
width="50%"> <BR>
<H3>Finding Skin-Colored Blobs </H3>
<P>I then calculated the centroids of the three largest skin colored 'blobs' in
each image. Blobs were calculated by processing the skin pixel mask generated in
the previous step. A blob is defined to be a connected region of 1's in the
mask. Finding blobs turned out to be a bit more difficult than I had originally
thought. My first implementation was a straightforward recursive algorithm which
scans the image top down, left to right, until it comes across a skin pixel which
has yet to be assigned to a blob. It then recursively checks each of that
pixel's neighbors to see if they too are skin pixels. If they are, it assigns
them to the same blob and recurses. On such large images, this quickly led to
stack overflow and huge inefficiency in MATLAB. </P>
<P>The working algorithm I eventually came up with is an iterative one that
scans the skin pixel mask left to right, top down. When it comes across a
skin pixel that has yet to be assigned to a blob, it first checks that pixel's
neighbors (to the left and above) to see if they are in a blob. If they aren't,
it creates a new blob and adds the newly found pixel to it. If any of the
neighbors are in a blob, it assigns the pixel to that neighbor's blob. However,
the two neighbors might lie in two different blobs, so these blobs must be
merged into a single blob.</P>
<P>Finally, the algorithm searches for the 3 largest blobs and calculates each
of their respective centroids. </P>
<P>The MATLAB code can be found here: <A
href="http://www.mit.edu/~alexgru/vision/blob2.m">blob2.m</A> </P>
<P>These videos show the 3 largest skin-colored blobs tracked for the two
example video sequences above (ignore the funny colors -- they are just an
artifact of the blob search).</P>
<UL>
<LI><A href="http://www.mit.edu/~alexgru/vision/tracking2_3.mov">Leftover</A>
<LI><A href="http://www.mit.edu/~alexgru/vision/tracking3_3.mov">Paddle</A>
</LI></UL>
<H3>Calculating the Blobs' Motion Trajectories over Time </H3>At this point,
tracking the trajectories of the blobs over time was fairly simple. For a given
video sequence, I made a list of the position of the centroid for each of the 3
largest blobs in each frame (source code: <A
href="http://www.mit.edu/~alexgru/vision/centroids.m">centroids.m</A>). Then, I
examined the first frame in the sequence and determined which centroid was
farthest to the left and which was farthest to the right. The one on the left
corresponds to the right hand of the signer; the one on the right corresponds to
the left hand of the signer. Then, for each successive frame, I simply determined
which centroid was closest to the previous left centroid and called this
the new left centroid; I did the same for the blob on the right. Once the two
blobs were labeled, I calculated the horizontal and vertical velocity of both
blobs across the two frames as (change in position)/(change in time). I recorded these
values for each sequential frame pair in the sequence. The source code is here:
<A
href="http://www.mit.edu/~alexgru/vision/split_centroids.m">split_centroids.m</A>.
<H3>Creating the Motion Models </H3>I then created models of the hand motions
involved in each sign. Specifically, for each frame in the sign, I used 5
training instances to calculate the average horizontal and vertical velocities
of both hands in that particular frame. The following graphs show the models
derived for both signs (these turned out a bit grainy as JPEGs; there are PDF
versions here: <A
href="http://www.mit.edu/~alexgru/vision/model1.pdf">model1.pdf</A> and <A
href="http://www.mit.edu/~alexgru/vision/model2.pdf">model2.pdf</A>).<BR><BR>Model
1 "leftover": <BR><IMG height="80%"
src="Using a Particle Filter for Gesture Recognition.files/model1.jpg"
width="80%"> <BR>Model 2 "paddle": <BR><IMG height="80%"
src="Using a Particle Filter for Gesture Recognition.files/model2.jpg"
width="80%"> <BR><BR>Sample source code for creating a model is here: <A
href="http://www.mit.edu/~alexgru/vision/make_model2.m">make_model2.m</A>
<H3>Using CONDENSATION to Classify New Video Sequences</H3>All the image
preprocessing is now finished, and the two motion models have been created. What
follows is a brief description of the Condensation algorithm and then a
description of how I applied it to this specific task.
<H3><A name=SECTION00010000000000000000>The Basics of Condensation</A></H3>
<P>The Condensation algorithm (Conditional Density Propagation over time) makes
use of random sampling in order to model arbitrarily complex probability density
functions. That is, rather than attempting to fit a specific equation to
observed data, it uses <I>N</I> weighted samples to approximate the curve
described by the data. Each sample consists of a <EM>state</EM> and a
<EM>weight</EM> proportional to the probability that the state accounts for the
observed data. As the number of samples increases, the precision with which the
samples model the observed pdf increases.
<P>Now assume that a series of observations are made during time steps 1, 2,
..., <I>t</I>. In order to generate the new sample set at time <I>t</I>+1,
states are randomly selected (with replacement) from the sample set at <I>t</I>,
based on their weight; that is, the weight of each sample determines the
probability it will be chosen. Given such a randomly sampled state
<I>s<SUB>t</SUB></I>, a prediction of a new state <I>s<SUB>t+1</SUB></I> at time
step <I>t</I>+1 is made based on a predictive model. This corresponds to
sampling from the process density
<I>p</I>(<I>s<SUB>t+1</SUB></I>&nbsp;|&nbsp;<I>s<SUB>t</SUB></I>), where
<I>s<SUB>t</SUB></I> is a vector of parameters describing the object's state.
Finally, <I>s<SUB>t+1</SUB></I> is assigned a weight proportional to the
probability <I>p</I>(<I>z<SUB>t+1</SUB></I>&nbsp;|&nbsp;<I>s<SUB>t+1</SUB></I>),
where <I>z<SUB>t+1</SUB></I> is a set of parameters describing the observed
state of the object at time <I>t</I>+1. Then the process iterates for the next
observation. In this way, predicted states that correspond better to the data
receive larger weights. Since arbitrarily complex pdfs can be modeled, an
arbitrary number of competing hypotheses (assuming sufficiently large <I>N</I>)
can be maintained until a single hypothesis dominates.
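<P>A minimal sketch of one iteration of this resample / predict / reweight loop
is shown below; the predict and likelihood function handles stand in for
whatever process density and observation model a particular application
supplies, and the names are my own.</P>
<PRE>
% Illustrative sketch of one Condensation (particle filter) iteration.
% states:     N x D matrix, one sample per row
% weights:    N x 1 vector of normalized sample weights
% predict:    handle drawing s_{t+1} from the process density p(s_{t+1} | s_t)
% likelihood: handle returning p(z_{t+1} | s_{t+1}) for observation z_next
function [states, weights] = condensation_step(states, weights, ...
                                               predict, likelihood, z_next)
  N = size(states, 1);
  % 1) resample with replacement, with probability proportional to weight
  cdf = cumsum(weights);  cdf = cdf / cdf(end);
  picks = arrayfun(@(u) find(cdf >= u, 1), rand(N, 1));
  states = states(picks, :);
  % 2) predict: propagate each chosen sample through the process density
  for n = 1:N
    states(n, :) = predict(states(n, :));
  end
  % 3) reweight by how well each predicted state explains the new observation
  for n = 1:N
    weights(n) = likelihood(z_next, states(n, :));
  end
  weights = weights / sum(weights);        % renormalize
end
</PRE>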
<P>
<H3><A name=SECTION00020000000000000000>Applying Condensation to Recognizing
ASL</A></H3>
<P>In order to apply the Condensation Algorithm to sign-language recognition, I
extend the methods described by Black and Jepson. Specifically, a <EM>state</EM>
at time <I>t</I> is described as a parameter vector
<I>s<SUB>t</SUB></I> = (&mu;, &phi;<SUP><I>i</I></SUP>, &alpha;<SUP><I>i</I></SUP>,
&rho;<SUP><I>i</I></SUP>), where: <BR><BR>&mu; is the integer index of the
predictive model; <BR>&phi;<SUP><I>i</I></SUP> indicates the current position in
the model; <BR>&alpha;<SUP><I>i</I></SUP> refers to an amplitude scaling factor;
<BR>&rho;<SUP><I>i</I></SUP> is a scale factor in the time dimension;
<BR>where <I>i</I> &isin; {1, 2}. <BR><BR>Note that <I>i</I> indicates which
hand's motion trajectory the &phi;<SUP><I>i</I></SUP>, &alpha;<SUP><I>i</I></SUP>,
or &rho;<SUP><I>i</I></SUP> refers to. My models contain data about the motion
trajectory of both the left hand and the right hand; by allowing two sets of
parameters, I allow the motion trajectory of the left hand to be scaled and
shifted separately from the motion trajectory of the right hand (so, for example,
&phi;<SUP>1</SUP> refers to the current position in the model for the left hand's
trajectory, while &phi;<SUP>2</SUP> refers to the current position in the model
for the right hand's trajectory).</P>
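<P>To make the state vector concrete, here is a minimal sketch of a prediction
(diffusion) step for a single sample, in the general spirit of Black and
Jepson's parameter dynamics. The noise scales, the clamping of the positions,
and the field names are illustrative placeholders, not the values or code I
actually used.</P>
<PRE>
% Illustrative sketch: diffuse one sample's parameters to predict its state at
% the next frame. s is a struct with fields mu (model index), phi (2x1 current
% positions in the model), alpha (2x1 amplitude scales), and rho (2x1 time
% scales); model_lengths(mu) is the number of frames in model mu.
function s = predict_state(s, model_lengths)
  sigma_phi = 1.0;  sigma_alpha = 0.05;  sigma_rho = 0.05;   % placeholders
  s.phi   = s.phi + s.rho + sigma_phi * randn(2, 1);   % advance along model
  s.alpha = s.alpha + sigma_alpha * randn(2, 1);       % drift the amplitude
  s.rho   = s.rho + sigma_rho * randn(2, 1);           % drift the time scale
  s.phi   = min(max(s.phi, 1), model_lengths(s.mu));   % stay inside the model
end
</PRE>
<P>The weight of the predicted sample would then be based on how well the
observed hand velocities over a recent window match &alpha;<SUP><I>i</I></SUP>
times the model velocities around positions &phi;<SUP><I>i</I></SUP>.</P>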
Ctrl + -