<HTML
><HEAD
><TITLE
>Learning Rules</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.74b"><LINK
REL="HOME"
HREF="index.html"><LINK
REL="PREVIOUS"
HREF="index.html"><LINK
REL="NEXT"
TITLE="Conclusion"
HREF="conc.html"></HEAD
><BODY
CLASS="sect1"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
></TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="index.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="conc.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="sect1"
><H1
CLASS="sect1"
><A
NAME="rules"
>Learning Rules</A
></H1
><P
>The following learning rules are divided into supervised and
  unsupervised rules and also by their architecture.  These
  divisions follow those suggested in the comp.ai.neural-nets FAQ,
  but some of these distinctions are ambiguous, especially where
  hybrid rules are considered (see Kohonen or RBF networks).</P
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="sv"
>Supervised Learning</A
></H2
><P
>Supervised learning means that target values for the output
    are presented to the network, in order that the network can update
    its weights such that the euclidean distance of the output vector
    and the target vector is minimised.  Supervised learning attempts
    to match the output of the network to values that have already
    been defined.  After training network verification is used, in
    which only the input values are presented to the network so that
    the success of the training can be established.</P
><DIV
CLASS="sect3"
><H3
CLASS="sect3"
><A
NAME="ff"
>Feedforward Learning Rules</A
></H3
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="quick"
>Quickprop
        [<SPAN
CLASS="citation"
>fahlman88empirical</SPAN
>]</A
></H4
><P
><DIV
CLASS="equation"
><A
NAME="AEN79"
></A
><P
><B
>Equation 1. Error derivative at previous epoch</B
></P
><IMG
SRC="../images/ewrtweight2.jpg"></DIV
>
	<DIV
CLASS="equation"
><A
NAME="AEN82"
></A
><P
><B
>Equation 2. Error derivative at this epoch</B
></P
><IMG
SRC="../images/ewrtweight.jpg"></DIV
></P
><P
>The Quickprop algorithm is loosely based on Newton's
	method.  It is quicker than standard backpropagation because
	it uses an approximation to the error curve, and second order
	derivative information which allow a quicker evaluation.
	Training is similar to backprop except for a copy of (eq. 1)
	the error derivative at a previous epoch.  This, and the
	current error derivative (eq. 2), are used to minimise an
	approximation to this error curve.  The update rule is given
	in equation 3:</P
><P
><DIV
CLASS="equation"
><A
NAME="AEN87"
></A
><P
><B
>Equation 3. Quickprop update rule</B
></P
><IMG
SRC="../images/quickdeltaw.jpg"></DIV
></P
><P
>This equation uses no learning rate.  If the slope of
	the error curve is less than that of the previous one, then
	the weight will change in the same direction (positive or
	negative).  However, there needs to be some controls to prevent
	the weights from growing too large.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="rprop"
>RProp (Resilient Propagation)
        [<SPAN
CLASS="citation"
>riedmiller93direct</SPAN
>]</A
></H4
><P
>RProp uses a technique that differs from backpropagation
	by the weight update which is done without using the gradient
	information. or a learning rate parameter.  A quantity,
	called the update value, is added to the weight at each
	iteration.  Its value and sign are governed by the gradient
	sign technique, such that the error function is minimised:
	<DIV
CLASS="equation"
><A
NAME="AEN95"
></A
><P
><B
>Equation 4. RProp Updating</B
></P
><IMG
SRC="../images/rpropdelta.jpg"></DIV
>
	If the derivative retains its sign, the update
	value is slightly increased in order to accelerate convergence
	in the shallow regions of the error functions.  Once the
	update value for each weight is adapted, the weight update
	itself follows a simple rule: if the derivative is positive,
	the weight is decreased by its update value.  If negative, the
	update value is added.</P
><P
>RProp is a batch updating algorithm.  It is very fast.
	The weight updating algorithm is expressed in figure 5:
	<DIV
CLASS="figure"
><A
NAME="AEN99"
></A
><P
><B
>Figure 5. RProp Weight Updating</B
></P
><P
><IMG
SRC="../images/rpropal.jpg"></P
></DIV
>Note here that there
	are upper and lower limits to the deltas ... MAX and MIN are
	self explanatory ... SGN is the signum function ...</P
><P
>Quickprop and RProp are both algorithms for feedforward
	neural networks, and both will require the
	<FONT
COLOR="RED"
><TT
CLASS="classname"
>FeedforwardArchitecture</TT
></FONT
>
	class.  As far as implementation is concerned, both of these
	could inherit
	<FONT
COLOR="RED"
><TT
CLASS="classname"
>BackpropagationLearningRule</TT
></FONT
>
	as the theories themselves derive from backpropagation so
	apart from the weight updating methods which differ, the
	implementation remains the same.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="casc"
>Cascade Correlation
	[<SPAN
CLASS="citation"
>fahlman91cascadecorrelation</SPAN
>]</A
></H4
><P
>The cascade correlation algorithm constructs an ANN by
	adding hidden units one by one.  There are two training
	stages: (1) maximising correlation between the network
	residual error and the newly added hidden neuron's output
	(learning the weights from input to hidden layers); (2)
	minimising network error while freezing all weights to the
	hidden node (from the hidden to output layers).</P
><P
>The correlation (covariance) is defined as the
	function in eq. 6:
	<DIV
CLASS="equation"
><A
NAME="AEN112"
></A
><P
><B
>Equation 5. Correlation (covariance)</B
></P
><IMG
SRC="../images/cascade.jpg"></DIV
>a<SUB
>n</SUB
> is the hidden unit output
	for the n<SUP
>th</SUP
> and
	E<SUB
>k</SUB
><SUP
>n</SUP
> is the
	network error without this hidden node.  The algorithm is
	quick as only single network training is involved.</P
><P
>Cascade correlation uses another learning rule, such as
	backpropagation or its variants in conjunction with adding
	neurons.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="braindam"
>Optimal Brain Damage (OBD)
        [<SPAN
CLASS="citation"
>lecun90optimal</SPAN
>]</A
></H4
><P
>OBD is the name given to the technique of removing
	unimportant weights during network training (often called
	pruning).  OBD can be applied in particular to those networks
	previously discussed.  Networks that used pruning have been
	found to have better generalisation, require less training
	patterns and have better/faster overall performance.</P
><P
>OBD is a technique that selectively deletes weights from
	a trained neural network, reducing the network size.  This
	works because it reduces the dimensionality of the network.
	OBD is viewed as a minimisation procedure.  Essentially, take
	a good network, remove half of the weights and end up with a
	network that works just as well (no reduction in
	generalisation ability) or better (quicker, reducing network
	complexity).</P
><P
>OBD can be expressed as follows:
	<P
></P
><OL
TYPE="1"
><LI
><P
>Choose a network;
	  architecture</P
></LI
><LI
><P
>Train the network;</P
></LI
><LI
><P
>Compute the second derivatives of the
	  Hessian weight matrix for each parameter;</P
></LI
><LI
><P
>Compute the <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>saliencies</I
></SPAN
>
	  for each parameter;</P
></LI
><LI
><P
>Sort the parameters by saliency and remove
	  the lowest salient parameters</P
></LI
><LI
><P
>return to stage 2.</P
></LI
></OL
></P
><P
>The <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>saliencies</I
></SPAN
> are parameters which
	represent the efficacy of individual connections.  It is
	defined as the change in the objective function if the
	connection were to be removed</P
><P
>As mentioned before, the OpenAI classes don't currently
	include a facility for adding or removing network
	connections, so the architecture class needs to allow for
	this, either by adding a method to the abstract base class or
	including it in a derived class.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="rbfn"
>Radial Basis Function Networks
        [<SPAN
CLASS="citation"
>orr96rbf</SPAN
>]</A
></H4
><P
>RBFNs are two layer feedforward networks.  Network
        training is divided into two stages: The first determines
        weights from the input to hidden nodes.  These weights
        represent the centres of <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>radial basis
        functions</I
></SPAN
>.  A radial basis function is a function
        for each data point, of the form in equation
        6:<DIV
CLASS="equation"
><A
NAME="AEN148"
></A
><P
><B
>Equation 6. Radial Basis Function</B
></P
><IMG
SRC="../images/rbfunction.jpg"></DIV
>  The first
        training stage is an unsupervised technique to find suitable
        centres of these basis functions.  Then the hidden nodes
        outputs are a sum of the weighted basis functions.  The
        second stage is a supervised learning stage for a problem
        which is now linearly seperable.  The input to hidden layer
        learning, using the radial basis functions, self organises the
        data in order to map a nonlinear function to a linear space.
        Training data is presented only once and this makes RBFN
        training very quick.</P
><P
>RBFN networks do not have anything that is the same as
	the bias term present in the MLP.</P
><P
>There are many techniques that are involved with RBFNs,
	and the an implementation will have to consider all of them.
	For example, the choice of radial basis function, fixed
	centres selected at random or normalised RB functions.</P
><P
>It is not necessary to go into depth into the stage two
	algorithms, it is enough to say that they are straightfoward
	linear, supervised algorithms.</P
><P
>Problems which are applicable to MLPs are also
	applicable to RBFNs.  The roots of RBFNs are in Cover's
	theorem and universal function approximation, and its
	mathematical foundation is very strong.  This has made the
	RBFN a popular choice of network.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="gen"
>Genetic Learning Rule G-Prop
          [<SPAN
CLASS="citation"
>castillo99gprop</SPAN
>]</A
></H4
><P
>G-prop uses a combined genetic algorithm  GA and
        backpropagation approach for network training.   the GA
        selects the initial set of weights and number of hidden units
        by applying genetic operators.  The theory goes that a global
        GA search is combined with a local BP search for a more
        effective training procedure.</P
><P
>GAs can be applied to neural network learning in several
	ways (for a comprehensive review of evolutionary methods with
	neural networks see [<SPAN
CLASS="citation"
>yao_ie3_proc_online</SPAN
>]):
	<P
></P
><UL
><LI
><P
>Searching for an optimal set of
	  weights</P
></LI
><LI
><P
>Search over topology space</P
></LI
><LI
><P
>Search for optimal learning
	  parameters</P
></LI
><LI
><P
>Genetic approach to modify a training
	  algorithm, such as BP</P
></LI
></UL
>
	G-Prop limits itself to a search for the initial weights and
	hidden layer size, but these are not the only restrictions and
	an OpenAI implementation can be made to be far more
	comprehensive.  For now, these will be passed over in order to
	describe G-Prop, but this must be kept in mind.</P
><P
>The full G-Prop algorithm is as follows:
	  <P
></P
><UL
><LI
><P
>Generate an initial population with random
	    weights and a random number of hidden units within a
	    defined range (from 2 to MAXHIDDEN)</P
></LI
><LI
><P
>Repeat for G generations
	      <P
></P
><OL
TYPE="1"
><LI
><P
>Evaluate the individuals of the population
	        - train them and assign a fitness value based on their
	        error and/or number of training
	        epochs</P
></LI
><LI
><P
>Select the N fittest individuals and apply
		a mating or crossover function.  Doing this guarantees
		diversity.</P
></LI
><LI
><P
>Replace the unfit individuals with new
		ones</P
></LI
></OL
></P
></LI
><LI
><P
>Use the best individuals to obtain the
	    testing error</P
></LI
></UL
></P
><P
>The speed of a GProp network can be inhibitive, but is
	an interesting method, especially considering its combined
	application with the OpenAI genetic algorithm classes.</P
></DIV
></DIV
><DIV
CLASS="sect3"
><H3
CLASS="sect3"
><A
NAME="rna"
>Recurrent Network Algorithms</A
></H3
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="tdnn"
>Time Delay Neural Networks (TDNNs)</A
></H4
><P
>The TDNN is a specific kind of neural network, discussed
        here for its obvious practical implications and its relative
        simplicity.  A TDNN is essentially the same as an MLP except
        with a single feedback connection from the single output
        neuron to the input (see figure 2).  An input is given as f(t
        - n) ... f(t), where f is an arbitrary time series problem for
        which the value f(t + 1) is needed.  n is the number of
        inputs.  The network is used to predict f(t + 1) which is the
        value of the output node.  The next epoch is used to predict
        f(t + 2) with inputs now f(t _ n + 1) ... f(t + 1), and so on.
        The feedback connection may or may not have a weight different
        to unity.</P
></DIV
></DIV
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="ul"
>Unsupervised Learning</A
></H2
><P
>Unsupervised learning, in contrast to supervised learning,
    does not provide the network with target output values.  This
    isn't strictly true, as often (and for the cases discussed in the
    this section) the output is identical to the input.  Unsupervised
    learning usually performs a mapping from input to output space,
    data compression or clustering.</P
><DIV
CLASS="sect3"
><H3
CLASS="sect3"
><A
NAME="soms"
>Self Organising Maps</A
></H3
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="ksom"
>Kohonen Self Organising Maps</A
></H4
><P
>This is an unsupervised learning algorithm which maps an
	input vector to an output array of nodes.  A reference vector,
	w<SUB
>ji</SUB
>, is assigned to each of
	these nodes.  An input vector, x<SUB
>i</SUB
>, is
	compared with it and the best match of all the neurons is the
	response.  This is what is involved in competitive learning.
	The winner is the neuron with the best response; the neuron
	where the Euclidean distance between its weight s and the
	input pattern is the smallest. <DIV
CLASS="equation"
><A
NAME="AEN201"
></A
><P
><B
>Equation 7. Euclidean
	distance</B
></P
><IMG
SRC="../images/somdistance.jpg"></DIV
> . (possible
	implementation...)</P
><P
>The next stage is cooperation.  A neighbourhood of
	neurons is defined around the winning neuron.  The neurons in
	this neighbourhood will all have the chance to update their
	weights, though not as much as the winner, and the size of the
	update decreases for neurons as they are further away.
	Neighbourhoods are usually symmetrical.</P
><P
>The neighbourhood can be defined in one of a number of
	ways, including defining a square around the winning neuron
	or commonly using a Gaussian
	function.<DIV
CLASS="equation"
><A
NAME="AEN206"
></A
><P
><B
>Equation 8. Gaussian Function to Define
	Neighbourhood Size</B
></P
><IMG
SRC="../images/neighgauss.jpg"></DIV
>
	In this way, the weight changes can be made to improve the
	winning neuron, that is make it closer to the input.  This is
	adaption of the excited neurons, minimising the expression in
	equation 7 further through adjusting the synaptic weights
	between the connections in the output layer.</P
></DIV
><DIV
CLASS="sect4"
><H4
CLASS="sect4"
><A
NAME="lvq"
>Learning Vector Quantisation (LVQ)</A
></H4
><P
>Learning Vector Quantisation is a technique which
	improves the performance of SOMs for classification.  SOM is
	an unsupervised learning algorithm, but can be used
	for supervised learning.  For example, each neuron of the SOM
	can be labelled after training.  By further supervised
	learning using LVQ, a two stage learning adaptively is used to
	solve a classification problem.</P
><P
>The algorithm for SOM learning with LVQ is as follows:
	<P
></P
><UL
><LI
><P
>execute SOM algorithm to obtain learned set
	  of vectors;</P
></LI
><LI
><P
>label the neurons;</P
></LI
><LI
><P
>LVQ:</P
></LI
><LI
><P
>Find the weight that is closest to the
	  input pattern X;</P
></LI
><LI
><P
>If the classes of the output neuron vector and
	  the input are the same:
	    <P
></P
><UL
><LI
><P
>Move the neuron vector (weight) closer to the
	      input;</P
></LI
><LI
><P
>Otherwise moving them away from the input
	      (equation).</P
></LI
></UL
></P
></LI
></UL
>
	<DIV
CLASS="equation"
><A
NAME="AEN229"
></A
><P
><B
>Equation 9. Learning Vector Quantisation</B
></P
><IMG
SRC="../images/lvq.jpg"></DIV
>
	LVQ improves classification especially when classes overlap.
	A combined SOM and LVQ approach is much better in defining
	decision boundaries.</P
><P
>SOM is used for unsupervised learning, classification
	and optimisation.  Also, returning to the RBFN example, SOM
	learning can be used for the first (input to hidden layer)
	stage of learning.</P
></DIV
></DIV
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="opt"
>Learning Using Optimisation Procedures</A
></H2
><P
>There are two approaches that can be taken using
    optimisation for neural networks: 1) optimising the set of
    weights, thereby using an optimsation procedure as a learning rule
    or 2) training the network using an existing learning rule and
    optimising the learning rule parameters and network topology.</P
><P
>Network training can be seen as finding the optimal
    weights for a network for a given problem.  Therefore well
    known optimisation procedures have been used to do this,
    including Newton's method, conjugate gradient, simulated
    annealing and evolutionary methods (these have been discussed
    earlier).  Outlined next is a simulated annealing approach.</P
><DIV
CLASS="sect3"
><H3
CLASS="sect3"
><A
NAME="simann"
>Simulated Annealing
      [<SPAN
CLASS="citation"
>boese93simulated</SPAN
>]</A
></H3
><P
>This optimisation technique has a basis in a physical
      analogy, of slow cooling of materials which increases the order
      of the molecules as it goes from a hot (high disorder) to a cold
      solid (high order).  Successively
      <SPAN
CLASS="emphasis"
><I
CLASS="emphasis"
>quenching</I
></SPAN
>, or gradually minimising the
      energy of the system increases its order and allows it to find
      an optimal state.  Often, successive   heating and cooling the
      system can achieve better performance as so the energy barriers
      can be overcome, giving the chance of escaping from local
      optima.  This technique has been successful in many NP hard
      optimisation problems (such as the travelling salesman problem)
      and has been applied in optimising the topology and learning
      rule of a neural network acting on a given data set.</P
></DIV
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="cm"
>Combining Networks</A
></H2
><P
>Committee machines use a well known paradigm of computer
    science, the divide and conquer strategy.  A complex task can be
    subdivided into smaller ones and solved individually.  A group of
    ANNs each perform a role as part of the complex task.  This
    combination of expert knowledge can provide a powerful method for
    complex problem solving.</P
><P
>The structure of committee machines can be static (the output of
    expert NNs are combined without involving inputs, which is
    involved in ensemble averaging and boosting), or dynamic (the
    outputs involve the inputs, which are used when considering a
    mixture of experts or hiearchies)</P
><P
>A committee machine structure can be built over a
    distributed network, with individual experts taking advantage of
    multiprocessing power, creating a very powerful and effective
    tool.</P
><P
>At this stage, it is not necessary to delve to far into
    combining networks, but it can be seen how this strategy can be an
    effective method of machine learning.</P
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="conc.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
>&nbsp;</TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Conclusion</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>