				output = 1.0;  // this is the bias weight
			}
			else
			{
				output = m_pPrevLayer->m_Neurons[ kk ]->output;
			}
			
			dErr_wrt_dWn[ (*cit).WeightIndex ] += dErr_wrt_dYn[ ii ] * output;
		}
		
		ii++;
	}
	
	
	// calculate equation (5): dErr_wrt_Xnm1 = Wn * dErr_wrt_dYn, which is needed as the input value of
	// dErr_wrt_Xn for backpropagation of the next (i.e., previous) layer
	// For each neuron in this layer
	
	ii = 0;
	for ( nit=m_Neurons.begin(); nit<m_Neurons.end(); nit++ )
	{
		NNNeuron& n = *(*nit);  // for simplifying the terminology
		
		for ( cit=n.m_Connections.begin(); cit<n.m_Connections.end(); cit++ )
		{
			kk=(*cit).NeuronIndex;
			if ( kk != ULONG_MAX )
			{
				// we exclude ULONG_MAX, which signifies the phantom bias neuron with
				// constant output of "1", since we cannot train the bias neuron
				
				nIndex = kk;
				
				dErr_wrt_dXnm1[ nIndex ] += dErr_wrt_dYn[ ii ] * m_Weights[ (*cit).WeightIndex ]->value;
			}
			
		}
		
		ii++;  // ii tracks the neuron iterator
		
	}
	
	
	// calculate equation (6): update the weights in this layer using dErr_wrt_dW (from
	// equation (4)) and the learning rate eta

	for ( jj=0; jj<m_Weights.size(); ++jj )
	{
		oldValue = m_Weights[ jj ]->value;
		newValue = oldValue - etaLearningRate * dErr_wrt_dWn[ jj ];
		m_Weights[ jj ]->value = newValue;
	}
}
</PRE>


<BR><A HREF="#topmost"><FONT SIZE="-6" COLOR="">go back to top</FONT></A>

<BR><BR>
<A name="SecondOrder"/>
<h3>Second Order Methods</h3>

<P>All second order techniques have one goal in mind: to increase the speed with which backpropagation converges to optimal weights.  All second order techniques (at least in principle) accomplish this in the same fundamental way: by adjusting each weight differently, e.g., by applying a learning rate <I>eta</I> that differs for each individual weight.</P>

<P>In his <A HREF="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf" target=_newwin>&quot;Efficient BackProp,&quot;&nbsp;<IMG SRC="Images/ExternalLink.gif" WIDTH="14" HEIGHT="14" BORDER="0" ALT="External Link"></A> article, Dr. LeCun proposes a second order technique that he calls the &quot;stochastic diagonal Levenberg-Marquardt method&quot;.  He compares the performance of this method with a &quot;carefully tuned stochastic gradient algorithm&quot;, which is an algorithm that does not rely on second order techniques, but which does apply different learning rates to each individual weight.  According to his comparisons, he concludes that &quot;the additional cost <I>[of stochastic diagonal Levenberg-Marquardt]</I> over regular backpropagation is negligible and convergence is - as a rule of thumb - about three times faster than a carefully tuned stochastic gradient algorithm.&quot; (See page 35 of the article.)</P>

<P>It was clear to me that I needed a second order algorithm.  Convergence without it was tediously slow.  Dr. Simard, in his article titled <A HREF="http://research.microsoft.com/~patrice/PDF/fugu9.pdf" target=_newwin>&quot;Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis,&quot;&nbsp;<IMG SRC="Images/ExternalLink.gif" WIDTH="14" HEIGHT="14" BORDER="0" ALT="External Link"></A>, indicated that he wanted to keep his algorithm as simple as possible and therefore did <I><B>not</B></I> use second order techniques.  He also admitted that he required hundreds of epochs for convergence (my guess is that he required a few thousand).</P>

<P>With the MNIST database, each epoch requires 60,000 backpropagations, and on my computer each epoch took around 40 minutes.  I did not have the patience (or confidence in the correctness of the code) to wait for thousands of epochs.  It was also clear that, unlike Dr. LeCun, I did not have the skill to design &quot;a carefully tuned stochastic gradient algorithm&quot;.  So, in keeping with the advice that stochastic diagonal Levenberg-Marquardt would be around three times faster than that anyway, my neural network implements this second order technique.</P>

<P>I will not go into the math or the code for the stochastic diagonal Levenberg-Marquardt algorithm.  It's actually not too dissimilar from standard backpropagation.  Using this technique, I was able to achieve good convergence in around 20-25 epochs.  In my mind this was terrific for two reasons.  First, it increased my confidence that the network was performing correctly, since Dr. LeCun also reported convergence in around 20 epochs.  Second, at 40 minutes per epoch, the network converged in around 14-16 hours, which is a palatable amount of time for an overnight run.</P>

<P>If you have the inclination to inspect the code on this point, the functions you want to focus on are named <CODE>CMNistDoc::CalculateHessian()</CODE> (which is in the document class - yes, the program is an MFC doc/view program), and <CODE>NeuralNetwork::BackpropagateSecondDervatives()</CODE>.  In addition, you should note that the <CODE>NNWeight</CODE> class includes a <CODE>double</CODE> member that was not mentioned in the simplified view above.  This member is named &quot;<CODE>diagHessian</CODE>&quot;, and it stores the curvature (in weight space) that is calculated by Dr. LeCun's algorithm.  Basically, when <CODE>CMNistDoc::CalculateHessian()</CODE> is called, 500 MNIST patterns are selected at random.  For each pattern, the <CODE>NeuralNetwork::BackpropagateSecondDervatives()</CODE> function calculates the Hessian for each weight caused by the pattern, and this number is accumulated in <CODE>diagHessian</CODE>.  After the 500 patterns are run, the value in <CODE>diagHessian</CODE> is divided by 500, which results in a unique value of <CODE>diagHessian</CODE> for each and every weight.  During actual backpropagation, the <CODE>diagHessian</CODE> value is used to amplify the current learning rate <I>eta</I>, such that in highly curved areas in weight space, the learning rate is slowed, whereas in flat areas in weight space, the learning rate is amplified.</P>
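<P>As a minimal sketch of how the stored curvature feeds back into training (the member names follow the description above, but the small safety constant <CODE>dMicron</CODE> and the exact form of the update are assumptions for illustration, not the program's literal code), the per-weight learning rate could look like this:</P>

<PRE>// sketch only: scale the learning rate by the per-weight curvature estimate.
// dMicron is an assumed small safety constant that keeps the effective rate
// bounded where the estimated curvature (diagHessian) is near zero.

double dMicron = 0.10;  // assumed value

for ( jj=0; jj&lt;m_Weights.size(); ++jj )
{
	// a large diagHessian (highly curved region) gives a smaller step;
	// a small diagHessian (flat region) gives a larger step
	double etaEffective = etaLearningRate / ( m_Weights[ jj ]-&gt;diagHessian + dMicron );
	m_Weights[ jj ]-&gt;value -= etaEffective * dErr_wrt_dWn[ jj ];
}</PRE>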


<BR><A HREF="#topmost"><FONT SIZE="-6" COLOR="">go back to top</FONT></A>

<BR><BR>
<A name="ConvolutionalStructure"/>
<h2>Structure of the Convolutional Neural Network</h2>

<P>As indicated above, the program does not implement a generalized neural network, and it is not a neural network workbench.  Rather, it is a very specific neural network, namely, a five-layer convolutional neural network.  The input layer takes the grayscale data of a 29x29 image of a handwritten digit, and the output layer is composed of ten neurons, exactly one of which (hopefully) has a value of +1, corresponding to the answer, while the other nine neurons have an output of -1.</P>
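<P>For concreteness, here is a minimal sketch (not code taken from the program) of how the ten output values can be decoded into a digit; <CODE>outputLayerOutputs</CODE> is a placeholder name for the ten output neuron values:</P>

<PRE>// sketch only: decode the ten output neurons into a digit.
// The recognized digit is the index of the neuron with the largest output,
// which ideally is near +1 while the other nine outputs are near -1.

int bestIndex = 0;
double bestValue = outputLayerOutputs[ 0 ];

for ( int ii = 1; ii &lt; 10; ++ii )
{
	if ( outputLayerOutputs[ ii ] &gt; bestValue )
	{
		bestValue = outputLayerOutputs[ ii ];
		bestIndex = ii;
	}
}

// bestIndex now holds the recognized digit, 0 through 9</PRE>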

<P>Convolutional neural networks are also known as &quot;shared weight&quot; neural networks.  The idea is that a small kernel window is moved over the neurons of the prior layer.  In this network, I use a kernel sized to 5x5 elements.  Each element in the 5x5 kernel window has its own weight, independent of the weights of the other elements, so there are 25 weights (plus one additional weight for the bias term).  This one kernel is shared across all positions in the prior layer, hence the name &quot;shared weight&quot;.  A more detailed explanation follows, and a small sketch of the idea is given immediately below.</P>
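<P>As a rough sketch of the idea (this is not the program's actual forward-propagation code, and the names are placeholders), the output of one feature-map neuron is a weighted sum over a 5x5 window of the prior layer, plus the bias, passed through an activation (squashing) function.  The same 26 weights are reused at every window position:</P>

<PRE>// sketch only: one feature-map neuron computed from a 5x5 window of the
// prior layer.  kernelWeight[] (25 values) and biasWeight are the shared
// weights; prevOutput[][] holds the prior layer's outputs; Squash() stands
// in for the activation function.

double sum = biasWeight;

for ( int kr = 0; kr &lt; 5; ++kr )        // kernel row
{
	for ( int kc = 0; kc &lt; 5; ++kc )    // kernel column
	{
		sum += kernelWeight[ kr*5 + kc ] * prevOutput[ windowTop + kr ][ windowLeft + kc ];
	}
}

double neuronOutput = Squash( sum );  // the same 26 weights are used at every (windowTop, windowLeft)</PRE>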


<BR><A HREF="#topmost"><FONT SIZE="-6" COLOR="">go back to top</FONT></A>

<BR><BR>
<A name="Illustration"/>
<h3>Illustration and General Description</h3>

<P>Here is an illustration of the neural network:</P>


<TABLE>
<TR>
	<TD WIDTH="599" COLSPAN="5"><IMG SRC="Images/IllustrationNeuralNet.gif" WIDTH="599" HEIGHT="300" BORDER="0" ALT="Illustration of the Neural Network"></TD>
</TR>
<TR>
	<TD WIDTH="229" ALIGN="center">Input Layer<BR>29x29</TD>
	<TD WIDTH="140" ALIGN="center">Layer #1<BR>6 Feature Maps<BR>Each 13x13</TD>
	<TD WIDTH="75" ALIGN="center">Layer #2<BR>50 Feature Maps<BR>Each 5x5</TD>
	<TD WIDTH="75" ALIGN="center">Layer #3<BR>Fully Connected<BR>100 Neurons</TD>
	<TD WIDTH="80" ALIGN="center">Layer #4<BR>Fully Connected<BR>10 Neurons</TD>
</TR>
</TABLE>

<P>The input layer (Layer #0) is the grayscale image of the handwritten character.  The MNIST image database has images whose size is 28x28 pixels each, but because of the considerations described by Dr. Simard in his article <A HREF="http://research.microsoft.com/~patrice/PDF/fugu9.pdf" target=_newwin>&quot;Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis,&quot;&nbsp;<IMG SRC="Images/ExternalLink.gif" WIDTH="14" HEIGHT="14" BORDER="0" ALT="External Link"></A>, the image size is padded to 29x29 pixels.  There are therefore 29x29 = 841 neurons in the input layer.</P>
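<P>As a minimal sketch of the padding step (the placement of the extra border, the background value, and the <CODE>grayscaleToInput()</CODE> helper are assumptions for illustration, not the program's actual choices), the 28x28 MNIST image is copied into a 29x29 buffer that becomes the 841-element input vector:</P>

<PRE>// sketch only: pad a 28x28 MNIST image into the 29x29 input vector.
// grayscaleToInput() is a hypothetical helper that maps a 0..255 gray level
// into the network's input range.

double inputVector[ 29*29 ];

for ( int ii = 0; ii &lt; 29*29; ++ii )
{
	inputVector[ ii ] = 1.0;  // assumed background value for the padded border
}

for ( int row = 0; row &lt; 28; ++row )
{
	for ( int col = 0; col &lt; 28; ++col )
	{
		inputVector[ row*29 + col ] = grayscaleToInput( mnistPixel[ row*28 + col ] );
	}
}</PRE>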

<P>Layer #1 is a convolutional layer with six (6) feature maps.  Each feature map is sized to 13x13 pixels/neurons.  Each neuron in each feature map is a 5x5 convolutional kernel of the input layer, but every other pixel of the input layer is skipped (as described in Dr. Simard's article).  As a consequence, there are 13 positions where the 5x5 kernel will fit in each row of the input layer (which is 29 neurons wide), and 13 positions where the 5x5 kernel will fit in each column of the input layer (which is 29 neurons high).  There are therefore 13x13x6 = 1014 neurons in Layer #1, and (5x5+1)x6 = 156 weights.  (The  &quot;+1&quot; is for the bias.)</P>
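<P>The arithmetic behind the 13x13 size can be written out directly; this is just the standard output-size calculation for a 5x5 kernel stepped two pixels at a time across a 29-pixel dimension:</P>

<PRE>// the 5x5 kernel fits at columns 0, 2, 4, ..., 24 of a 29-pixel-wide row,
// i.e., at (29 - 5)/2 + 1 = 13 positions; the same holds for the rows
int nInputSize  = 29;
int nKernelSize = 5;
int nStepSize   = 2;   // every other pixel is skipped

int nFeatureMapSize = ( nInputSize - nKernelSize ) / nStepSize + 1;   // = 13</PRE>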

<P>On the other hand, since each of the 1014 neurons has 26 connections, there are 1014x26 = 26364 connections from this Layer #1 to the prior layer.  At this point, one of the benefits of a convolutional &quot;shared weight&quot; neural network should become more clear: because the weights are shared, even though there are 26364 connections, only 156 weights are needed to control those connections.  As a consequence, only 156 weights need training.  In comparison, a traditional &quot;fully connected&quot; neural network would have needed a unique weight for each connection, and would therefore have required training for 26364 different weights.  None of that excess training is needed here.</P>

<P>Layer #2 is also a convolutional layer, but with 50 feature maps.  Each feature map is 5x5, and each unit in the feature maps is a 5x5 convolutional kernel applied to corresponding areas of all 6 of the feature maps of the previous layer, each of which is a 13x13 feature map.  There are therefore 5x5x50 = 1250 neurons in Layer #2, (5x5+1)x6x50 = 7800 weights, and 1250x26 = 32500 connections.</P>

<P>Before proceeding to Layer #3, it's worthwhile to mention a few points on the architecture of the neural network in general, and on Layer #2 in particular.  As mentioned above, each feature map in Layer #2 is connected to all 6 of the feature maps of the previous layer.  This was a design decision, but it's not the only decision possible.  As far as I can tell, the design is the same as Dr. Simard's design.  But it's distinctly different from Dr. LeCun's design.  Dr. LeCun deliberately chose not to connect each feature map in Layer #2 to all of the feature maps in the previous layer.  Instead, he connected each feature map in Layer #2 to only a few selected ones of the feature maps in the previous layer.  Each feature map was, in addition, connected to a different combination of feature maps from the previous layer.  As Dr. LeCun explained it, his non-complete connection scheme would force the feature maps to extract different and (hopefully) complementary information, by virtue of the fact that they are provided with different inputs.  One way of thinking about this is to imagine that you are forcing information through fewer connections, which should result in the connections becoming more meaningful.  I think Dr. LeCun's approach is correct.  However, to avoid additional complications to programming that was already complicated enough, I chose the simpler approach of Dr. Simard.</P>

<P>Other than this, the architectures of all three networks (i.e., the one described here and those described by Drs. LeCun and Simard) are largely similar.</P>

<P>Turning to Layer #3: this is a fully-connected layer with 100 units.  Since it is fully-connected, each of the 100 neurons in the layer is connected to all 1250 neurons in the previous layer.  There are therefore 100 neurons in Layer #3, 100x(1250+1) = 125100 weights, and 100x1251 = 125100 connections.</P>
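<P>As a minimal sketch (placeholder names, not the program's code), the forward computation of one Layer #3 neuron is just a weighted sum over all 1250 previous outputs plus the bias, passed through the activation function:</P>

<PRE>// sketch only: one fully connected neuron in Layer #3.
// weight[] holds this neuron's 1250 weights, biasWeight is its bias weight,
// prevOutput[] holds the 1250 outputs of Layer #2, and Squash() stands in
// for the activation function.

double sum = biasWeight;

for ( int ii = 0; ii &lt; 1250; ++ii )
{
	sum += weight[ ii ] * prevOutput[ ii ];
}

double neuronOutput = Squash( sum );</PRE>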

<P>Layer #4 is the final, output layer.  This layer is a fully-connected layer with 10 units.  Since it is fully-connected, each of the 10 neurons in the layer is connected to all 100 neurons in the previous layer.  There are therefore 10 neurons in Layer #4, 10x(100+1) = 1010 weights, and 10x101 = 1010 connections.</P>

<P>Like Layer #2, this output Layer #4 also warrants an architectural note.  Here, Layer #4 is implemented as a standard, fully connected layer, which is the same as the implementation in Dr. Simard's network.  Again, however, it's different from Dr. LeCun's implementation.  Dr. LeCun implemented his output layer as a &quot;radial basis function&quot;, which basically measures the Euclidean distance between the actual inputs and a desired input (i.e., a target input).  This allowed Dr. LeCun to experiment with tuning his neural network so that the output of Layer #3 (the previous layer) matched idealized forms of handwritten digits.  This was clever, and it also yields some very impressive graphics.  For example, you can basically look at the outputs of his Layer #3 to determine whether the network is doing a good job at recognition.  In my network (and Dr. Simard's), the outputs of Layer #3 are meaningful only to the network; looking at them will tell you nothing.  Nevertheless, the implementation of a standard, fully connected layer is far simpler than the implementation of radial basis functions, both for forward propagation and for training during backpropagation.  I therefore chose the simpler approach.</P>
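<P>To make the contrast concrete, here is a rough sketch (not Dr. LeCun's actual code) of what a single radial basis function output unit computes; <CODE>targetPattern</CODE> stands for the stored, idealized pattern for one digit:</P>

<PRE>// sketch only: one radial-basis-function output unit.  It measures the
// squared Euclidean distance between the previous layer's outputs and a
// stored target pattern; a small distance means a good match for that digit.

double rbfOutput = 0.0;

for ( int ii = 0; ii &lt; numPrevOutputs; ++ii )
{
	double diff = prevLayerOutput[ ii ] - targetPattern[ ii ];
	rbfOutput += diff * diff;
}</PRE>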

<P>Altogether, adding the above numbers, there are a total of 3215 neurons in the neural network, 134066 weights, and 184974 connections.</P>

<P>The object is to train all 134066 weights so that, for an arbitrary input of a handwritten digit at the input layer, there is exactly one neuron at the output layer whose value is +1 whereas all other nine (9) neurons at the output layer have a value of -1.  Again, the benchmark was an error rate of 0.82% or better, corresponding to the results obtained by Dr. LeCun.</P>


<BR><A HREF="#topmost"><FONT SIZE="-6" COLOR="">go back to top</FONT></A>

<BR><BR>
<A name="CodeToBuild"/>
<h3>Code For Building the Neural Network</h3>

<P>The code for building the neural network is found in the <CODE>CMNistDoc::OnNewDocument()</CODE> function of the document class.  Using the above illustration, together with its description, it should be possible to follow the code which is reproduced in simplified form below:</P>

<PRE>// simplified code

BOOL CMNistDoc::OnNewDocument()
{
	if (!COleDocument::OnNewDocument())
		return FALSE;
	
	// grab the mutex for the neural network
	
	CAutoMutex tlo( m_utxNeuralNet );
	
	// initialize and build the neural net
	
	NeuralNetwork& NN = m_NN;  // for easier nomenclature
	NN.Initialize();
	
	NNLayer* pLayer;
	
	int ii, jj, kk;
	double initWeight;
	
	// layer zero, the input layer.
	// Create neurons: exactly the same number of neurons as the input
	// vector of 29x29=841 pixels, and no weights/connections
	
	pLayer = new NNLayer( _T("Layer00") );
	NN.m_Layers.push_back( pLayer );
	
	for ( ii=0; ii&lt;841; ++ii )
	{
		pLayer-&gt;m_Neurons.push_back( new NNNeuron() );
	}

	
	// layer one:
	// This layer is a convolutional layer that has 6 feature maps.  Each feature 
	// map is 13x13, and each unit in the feature maps is a 5x5 convolutional kernel
	// of the input layer.
	// So, there are 13x13x6 = 1014 neurons, (5x5+1)x6 = 156 weights
	
	pLayer = new NNLayer( _T("Layer01"), pLayer );
	NN.m_Layers.push_back( pLayer );
	
	for ( ii=0; ii&lt;1014; ++ii )
	{
		pLayer-&gt;m_Neurons.push_back( new NNNeuron() );
	}
	
	for ( ii=0; ii&lt;156; ++ii )
	{
		initWeight = 0.05 * UNIFORM_PLUS_MINUS_ONE;  // uniform random distribution
		pLayer-&gt;m_Weights.push_back( new NNWeight( initWeight ) );
	}
	
	// interconnections with previous layer: this is difficult
	// The previous layer is a top-down bitmap image that has been padded to size 29x29
	// Each neuron in this layer is connected to a 5x5 kernel in its feature map, which 
	// is also a top-down bitmap of size 13x13.  We move the kernel by TWO pixels, i.e., we
	// skip every other pixel in the input image
	
	int kernelTemplate[25] = {
		0,  1,  2,  3,  4,
		29, 30, 31, 32, 33,
		58, 59, 60, 61, 62,
		87, 88, 89, 90, 91,
		116,117,118,119,120 };
