Saving the weights of a network made with "m 2 1 1 x" produces a
weights file like this:

59r  m 2 1 1 x aahs aos bh 1.000000 bo 1.000000 Dh 1.000000 Do 1.000000  file = ../xor3.new
 8.926291e+000  1  1 1 to 2 1
-7.945858e+000  1  1 2 to 2 1
 3.898432e+000  2  2 b to 2 1
 5.382575e+000  1  1 1 to 3 1
-4.862383e+000  1  1 2 to 3 1
-1.086713e+001  1  2 1 to 3 1
 7.715632e+000  2  3 b to 3 1

To write the weights the program starts with the second layer, writes
out the weights leading into these units in order with the threshold
weight last, then it moves on to the third layer, and so on.  In
addition to the weight values, the second column lists whether or not
each weight is in use: a weight that is in use is marked with a 1, a
bias unit weight is marked with a 2 and a weight that is not in use
is marked with a 0.  (Marking weights as not in use is not used in
this free version.)  The last 4 numbers on each line tell which units
the weight runs between.  The first weight listed runs from layer 1
unit 1 to layer 2 unit 1, and the letter b indicates the weight comes
from a bias unit.  These last 4 values on a line are ignored when the
file is read, so in fact if you want to make up your own weights file
you don't need to type them in; they are there only for human
convenience.  However, the in-use values must be present if you write
your own weights file, and you must use only one weight per line.
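
   As a concrete illustration of the format, here is a minimal C
sketch of how one such line could be written; the names are made up
for the example and this is not the program's actual code:

#include <stdio.h>

/* Write one weights-file line: the weight value, the inuse flag
   (1 = in use, 2 = bias unit weight, 0 = not in use), then the four
   human-readable location values. */
void write_weight_line(FILE *fp, double w, int inuse,
                       int from_layer, int from_unit,
                       int to_layer, int to_unit)
{
    if (inuse == 2)   /* bias weights print the letter b */
        fprintf(fp, "%15.6e  %d  %d b to %d %d\n",
                w, inuse, from_layer, to_layer, to_unit);
    else
        fprintf(fp, "%15.6e  %d  %d %d to %d %d\n",
                w, inuse, from_layer, from_unit, to_layer, to_unit);
}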

   To restore these weights type `rw' for restore weights.  At this time
the program reads the header line and sets the total number of
iterations the program has gone through to the first number it finds
on the header line.  It then reads the character immediately after
the number; the `r' indicates that the weights will be real numbers
represented as character strings.
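
   In standard C terms the start of the header read amounts to
something like this sketch, where fp is assumed to be the open
weights file:

/* read the iteration count and the format character right after it;
   for the header above this gives 59 and 'r' */
int iterations;
char format;
fscanf(fp, "%d%c", &iterations, &format);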

   The remaining text on the first line of a weight file is not used by
the restore weights command at this time; it is there to give you a
record of what size and type the network was.  Because the rest of
this line is not read by the restore weights command, you have to
make the proper size network with the "m" command before you read in
weights.  The "m 2 1 1 x" of course means there are 2 units in the
first layer, one in the second, one in the third and the x means
there are extra connections from the input units to the output unit.
Following that, the name of the initial command file that was read in
is given.

   To save weights to a file other than "weights" you can say: "sw
<filename>", where, of course, <filename> is the file you want to save
to.  To continue saving to the same file you can just do "sw".  If you
type "rw" to restore weights they will come from this current weights
file as well.  You can restore weights from another file by using: "rw
<filename>".  Of course this also sets the name of the file to write
to, so if you're not careful you could lose your original weights
file.


8. Initializing Weights (c,ci)
------------------------------
   All the weights in the network initially start out at 0 and they are
also set to 0 by using the clear (c) command.  In some problems where
all the weights are 0 the weight changes may cancel themselves out so
that no learning takes place.  Moreover, in most problems the training
process will usually converge faster if the weights start out with small
random values.  To do this use the clear and initialize command as in:

ci 0.5

where the random initial weights will run from -0.5 to +0.5.  If the
value is omitted the last range specified will be used.  The initial
range value is 1.
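
   A minimal sketch of the idea behind ci, using the standard C
generator (the program's own random number generator may differ):

#include <stdlib.h>

/* map rand() onto the range [-v, +v], e.g. v = 0.5 for "ci 0.5" */
double random_weight(double v)
{
    return v * (2.0 * rand() / (double) RAND_MAX - 1.0);
}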


9. The Seed Value (s)
---------------------
   The initial seed value is set to 0 and this value is as good as any
other; however, networks often do not converge quickly, or at all,
with some sets of initial weights.  To get some other initial random
weights use the seed command as in:

s 7

where the seed is set to 7.  The seed value is of type unsigned.
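
   In standard C terms the command amounts to something like the
following, although the program's own generator may differ:

srand((unsigned) 7);   /* reseed, as "s 7" does */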


10. The Algorithm Command (a)
-----------------------------
   A number of different variations on the original back-propagation
algorithm have been proposed in order to speed up convergence and some
of these have been built into these simulators.  These options are set
using the `a' command and a number of options can go on one line.

Activation Functions

   To set the activation functions use:

a a <char>  * to set the activation function for all layers to <char>.
a ah <char> * to set the hidden layer(s) function to <char>.
a ao <char> * to set the output layer function to <char>.

where <char> can be:

   l  for the linear activation function:  x
   s  for the traditional smooth activation function:
      1.0 / (1.0 + exp(-x))

   The s function is the standard smooth activation function originally
used by researchers and it is still the most commonly used one.  In the
bp program it is implemented by a table look-up (the default); if the
compiler variable LOOKUP is undefined in the file ibp.h, the regular
time-consuming real-valued calculations are done instead.
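
   Here is a minimal sketch of the table look-up idea, with a made-up
table size and input range; the actual LOOKUP implementation will
differ in its details:

#include <math.h>

#define TBLSIZE 1024
#define RANGE   8.0          /* table covers x in [-RANGE, +RANGE] */

static double tbl[TBLSIZE];

void build_table(void)       /* fill the table once at start-up */
{
    int i;
    for (i = 0; i < TBLSIZE; i++) {
        double x = -RANGE + 2.0 * RANGE * i / (TBLSIZE - 1);
        tbl[i] = 1.0 / (1.0 + exp(-x));
    }
}

double smooth(double x)      /* the s function by table look-up */
{
    int i;
    if (x <= -RANGE) return 0.0;
    if (x >=  RANGE) return 1.0;
    i = (int) ((x + RANGE) * (TBLSIZE - 1) / (2.0 * RANGE));
    return tbl[i];
}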

   The linear activation function gives networks only a very limited
ability to learn patterns, so it is hardly ever used by itself in a
network; however, it is often used in the output layer of networks
with 3 or more layers so that the network can give output values
beyond the range of the other activation functions.  For instance,
suppose you need to train a network to compute some non-linear
function but you need to produce outputs in the range -10 to 10.  The
usual activation functions are restricted to the range 0 to 1 or -1
to 1, but you can choose a non-linear function for the network's
hidden layers, and with linear neurons in the output layer the
network can produce values in the range -10 to 10.


The Derivatives

   The correct derivative for the standard activation function is
s(1-s), where s is the activation value of a unit; however, when s is
near 0 or 1 this term gives only very small weight changes during the
learning process.  To counter this problem Fahlman proposed the
following one for the output layer:

0.1 + s(1-s)

(For the original description of this method see "Faster-Learning
Variations on Back-Propagation:  An Empirical Study", by Scott E.
Fahlman, in Proceedings of the 1988 Connectionist Models Summer School,
Morgan Kaufmann, 1989.)

   Besides Fahlman's derivative and the original one, the differential
step size method (see "Stepsize Variation Methods for Accelerating the
Back-Propagation Algorithm", by Chen and Mars, in IJCNN-90-WASH-DC,
Lawrence Erlbaum, 1990) takes the derivative to be 1 in the layer going
into the output units and uses the correct derivative term for all other
layers.  The learning rate for the inner layers is normally set to some
smaller value.  To set a value for eta2 give two values in the `e'
command as in:

e 0.1 0.01

To set the derivative use the `a' command as in:

a dc   * use the correct derivative for whatever function
a dd   * use the differential step size derivative (default)
a df   * use Fahlman's derivative in only the output layer
a do   * use the original derivative (same as `c' above)
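
   Putting these options together, the derivative term for a unit
with activation s can be sketched as below, where into_output says
whether the weight leads into an output unit; the names are
illustrative, not the program's own:

/* derivative term used in the weight update, by option letter */
double deriv(double s, char option, int into_output)
{
    if (option == 'f' && into_output)   /* Fahlman's derivative */
        return 0.1 + s * (1.0 - s);
    if (option == 'd' && into_output)   /* differential step size */
        return 1.0;
    return s * (1.0 - s);               /* correct/original term */
}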

Update Methods

   The choices are the periodic (batch) method, the continuous (online)
method, delta-bar-delta and quickprop.  The following commands set the
update methods:

a uC   * for the "right" continuous update method
a uc   * for the "wrong" continuous update method
a ud   * for the delta-bar-delta method
a up   * for the original periodic update method (default)
a uq   * for the quickprop algorithm
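
   The difference between the periodic and continuous methods can be
sketched as follows, where accumulate_changes and apply_changes are
hypothetical helpers standing in for the real forward/backward pass
and weight update:

/* periodic (batch): collect the changes from every pattern in the
   training set, then update the weights once per pass */
for (p = 0; p < npatterns; p++)
    accumulate_changes(p);
apply_changes();

/* continuous (online): update the weights after every pattern */
for (p = 0; p < npatterns; p++) {
    accumulate_changes(p);
    apply_changes();
}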


11. The Delta-Bar-Delta Method (d)
----------------------------------
   The delta-bar-delta method attempts to find a learning rate, eta, for
each individual weight.  The parameters are the initial value for the
etas, the amount by which to increase an eta that seems to be too small,
the rate at which to decrease an eta that is too large, a maximum value
for each eta and a parameter used in keeping a running average of the
slopes.  Here are examples of setting these parameters:

d d 0.5    * sets the decay rate to 0.5
d e 0.1    * sets the initial etas to 0.1
d k 0.25   * sets the amount to increase etas by (kappa) to 0.25
d m 10     * sets the maximum eta to 10
d n 0.005  * an experimental noise parameter
d t 0.7    * sets the history parameter, theta, to 0.7

These settings can all be placed on one line:

d d 0.5  e 0.1  k 0.25  m 10  t 0.7

The version implemented here does not use momentum.  The symmetric
versions sbp and srbp do not implement delta-bar-delta.

   The idea behind the delta-bar-delta method is to let the program find
its own learning rate for each weight.  The `e' sub-command sets the
initial value for each of these learning rates.  When the program sees
that the slope of the error surface averages out to be in the same
direction for several iterations for a particular weight the program
increases the eta value by an amount, kappa, given by the `k' parameter.
The network will then move down this slope faster.  When the program
finds that the slope has changed sign, the assumption is that it has
stepped over to the other side of the minimum and so it cuts down the
learning rate by the decay factor given by the `d' parameter.  For
instance, a d value of 0.5 cuts the learning rate for the weight in
half.  The `m' parameter specifies the maximum allowable value for an
eta.  The `t' parameter (theta) is used to compute a running average of
the slope of the weight and must be in the range 0 <= t < 1.  The
running average at iteration i, a[i], is defined as:

a[i] = (1 - t) * slope[i] + t * a[i-1],

so small values for t make the most recent slope more important than the
previous average of the slope.  Determining the learning rate for
back-propagation automatically is, of course, very desirable and this
method often speeds up convergence by quite a lot.  Unfortunately, bad
choices for the delta-bar-delta parameters give bad results and a lot of
experimentation may be necessary.  If you have n patterns in the
training set try starting e and k around 1/n.  The n parameter is an
experimental noise term that is only used in the integer version.  It
changes a weight in the wrong direction by the amount indicated when the
previous weight change was 0 and the new weight change would be 0 and
the slope is non-zero.  (I found this to be effective in an integer
version of quickprop so I tossed it into delta-bar-delta as well.  If
you find this helps please let me know.)  For more on delta-bar-delta
see "Increased Rates of Convergence" by Robert A. Jacobs, in Neural
Networks, Volume 1, Number 4, 1988.
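
   In code, one delta-bar-delta step for a single weight can be
sketched as follows, where grad is the current error derivative for
this weight; all the names are illustrative, not the program's own:

/* per-weight state: the weight, its own eta and the slope average */
struct dbd { double w, eta, bar; };

void dbd_step(struct dbd *u, double grad, double kappa,
              double decay, double max_eta, double theta)
{
    if (grad * u->bar > 0.0)        /* slope keeps its direction, */
        u->eta += kappa;            /* so raise the learning rate */
    else if (grad * u->bar < 0.0)   /* slope changed sign, */
        u->eta *= decay;            /* so cut the learning rate */
    if (u->eta > max_eta)
        u->eta = max_eta;
    u->bar = (1.0 - theta) * grad + theta * u->bar;   /* a[i] */
    u->w  -= u->eta * grad;         /* step down the error surface */
}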


12. Quickprop (qp)
------------------
    Quickprop (see "Faster-Learning Variations on Back-Propagation: An
Empirical Study", by Scott E. Fahlman, in Proceedings of the 1988
Connectionist Models Summer School, Morgan Kaufmann, 1989, or ftp to
archive.cis.ohio-state.edu and look in the directory pub/neuroprose
for the file fahlman.quickprop-tr.ps.Z) may be one of the fastest
network training algorithms.  It is loosely based on Newton's method.

   The parameter mu is used to limit the size of the weight change to
less than or equal to mu times the previous weight change.  Fahlman
suggests that mu = 1.75 is generally quite good, so this is the
initial value for mu, but slightly larger or slightly smaller values
are sometimes better.

   To get the process started quickprop makes the typical backprop
weight change of -eta * slope.  I have found that a good value for
the quickprop eta value is around 1/n or 2/n, where n is the number
of patterns in the training set.  Other sources often use much larger
values.  In addition, Fahlman adds this term in at other times as
well.  I had to wonder if this was a good idea, so in this code I've
included the capability to add it in or leave it out.  So far it
seems to me that sometimes adding in this extra term helps and
sometimes it doesn't.  The default is to use the extra term.

   Another factor involved in quickprop comes about from the fact that
the weights often grow very large very quickly.  To minimize this
problem there is a decay factor designed to keep the weights small.
This weight decay is implemented by decreasing the value of the slope
and it is different from the general weight decay that people use,
which is also implemented in this software.  Fahlman recently
mentioned that he now does not use this unless the weights get very
large.  I've found that too large a decay factor can stall out the
learning process, so if your network isn't learning fast enough or
isn't learning at all, one possible fix is to decrease the decay
factor.  Note:  in the old free version the value of the weight decay
constant was the value you enter divided by 1000, in order to allow
small weight decay values in the integer version; in this version the
problem is handled differently, so what you enter is exactly what you
get, not the value divided by 1000.

   I built in one additional feature for the integer version.  I found
that by adding small amounts of noise the time to convergence can be
brought down and the number of failures can be decreased somewhat.  This
seems to be especially true when the weight changes get very small.  The
noise consists of moving uphill in terms of error by a small amount when
the previous weight change was zero.  Good values for the noise seem to
be around 0.005.

   The parameters for quickprop are all set in the `qp' command like
so:

qp d <value>  * set the weight decay factor for all layers to <value>
qp d h 0      * the default weight decay for hidden layer units
qp d o 0.0001 * the default weight decay for output layer units
qp e 0.5      * the default value for eta
qp m 1.75     * the default value for mu
qp n 0        * the default value for noise
qp s+         * the default value is to always include the slope

or a whole series can go on one line:

qp d 0.1 e 0.5 m 1.75 n 0 s+
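
   As a sketch of the core update, one simplified quickprop step for
a single weight might look like this; slope is the current error
derivative, the clipping is a simplification of Fahlman's full rule
and the names are illustrative:

#include <math.h>

double qp_step(double slope, double prev_slope, double prev_dw,
               double eta, double mu, int add_slope)
{
    double dw;
    if (prev_dw == 0.0)             /* getting started */
        return -eta * slope;
    /* the quadratic, loosely Newton-like, step */
    dw = prev_dw * slope / (prev_slope - slope);
    /* limit the change to mu times the previous change */
    if (fabs(dw) > mu * fabs(prev_dw))
        dw = (dw > 0.0 ? mu : -mu) * fabs(prev_dw);
    if (add_slope)                  /* the optional -eta * slope term */
        dw -= eta * slope;
    return dw;
}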


13. Making a Network (m)
------------------------
   In the simplest form of the make a network command you type an `m'
followed by the number of units in each layer as in:

m 8 4 4 2

Most of the time this type of network is all you will ever need, but
there are others that can be tried and which may sometimes work
better.  One innovation that often speeds up learning is to include
extra connections between the input and output layers.  To get this
type of network you add an x to the end of the m command as in:

m 8 4 2 x

These extra connections are said to be important when the problem to
be solved is almost linear; the hidden layer units then only have to
provide some extra corrections to the output neurons to distort the
results away from a purely linear model.
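
   As a worked example, counting one bias weight per hidden and
output unit as in the weight file example above, "m 8 4 2 x" gives
8*4 = 32 weights from the first layer to the second, 4*2 = 8 from the
second layer to the third, 8*2 = 16 extra connections from the input
units straight to the output units and 4 + 2 = 6 bias weights, or 62
weights in all.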
