Training Arbitrarily Connected Neural Networks
with Second Order Algorithms
Except where reference is made to the work of others, the work described in this
thesis is my own or was done in collaboration with my advisory committee. This
thesis does not include proprietary or classified information.
Nicholas Jay Cotton
Certificate of Approval:
Thaddeus A. Roppel
Associate Professor
Electrical and Computer Engineering
Bogdan M. Wilamowski, Chair
Professor
Electrical and Computer Engineering
John Y. Hung
Professor
Electrical and Computer Engineering
George T. Flowers
Interim Dean, Graduate School
Training Arbitrarily Connected Neural Networks
with Second Order Algorithms
Nicholas Jay Cotton
A Thesis
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Master of Science
Auburn, Alabama
August 9, 2008
Training Arbitrarily Connected Neural Networks
with Second Order Algorithms
Nicholas Jay Cotton
Permission is granted to Auburn University to make copies of this thesis at its
discretion, upon the request of individuals or institutions and at
their expense. The author reserves all publication rights.
Signature of Author
Date of Graduation
Thesis Abstract
Training Arbitrarily Connected Neural Networks
with Second Order Algorithms
Nicholas Jay Cotton
Master of Science, August 9, 2008
(B.E.E., Auburn University, 2006)
68 Typed Pages
Directed by Bogdan M. Wilamowski
Neural networks have been an active area of research and application for many
years. Today they are gaining popularity with the growing processing power of mod-
ern computers. With neural networks gaining popularity comes a demand for a simple
and reliable method of training all types of networks, which is the focus of this
thesis. Neural Network Trainer is a training package that allows the user to create a
simple netlist-style network architecture in a text file and quickly begin training. Sev-
eral algorithms are implemented, including Error Back Propagation as well as mod-
ified versions of the Levenberg-Marquardt algorithm. The software is demonstrated
with results verifying the implemented algorithms and the trained neural networks.
Acknowledgments
This project would not have been possible without the support and guidance of
the people around me. I would like to sincerely thank my advisor Dr. Bogdan M.
Wilamowski for his guidance and inspiration through my educational career. I would
also like to thank the members of my committee Dr. John Y. Hung and Dr. Thaddeus
A. Roppel for their feedback. I would also like to thank all of the members of the
Auburn University Electrical Engineering Department for their contribution to my
overall education. Finally, I would like to thank my wife, parents, and grandparents
for their support over the years throughout my education.
Style manual or journal used: Journal of Approximation Theory (together with
the style known as "aums"). Bibliography follows van Leunen's A Handbook for
Scholars.
Computer software used: The document preparation package TeX (specifically
LaTeX) together with the departmental style-file aums.sty.
Table of Contents
List of Figures ix
1 Introduction 1
2 Common Neural Network Training Algorithms 3
2.1 Error Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Levenberg-Marquardt . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Modified Levenberg-Marquardt Algorithm 8
3.1 Advantages of Arbitrarily Connected Neural Networks . . . . . . . . 8
3.2 Calculation of Gradient and Jacobian . . . . . . . . . . . . . . . . . . 13
3.2.1 Forward computation . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Backward computation . . . . . . . . . . . . . . . . . . . . . . 16
3.2.3 Calculation of Jacobian Elements . . . . . . . . . . . . . . . . 18
3.3 Self Aware Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Enhanced Self Aware Algorithm . . . . . . . . . . . . . . . . . . . . . 20
4 Developed Software 22
4.1 Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Continuous Training Control . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Nonlinear Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Interfacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.2 Synchronizing Plots . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Neural Network Trainer 33
5.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Input File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Training Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3.1 Implemented algorithms . . . . . . . . . . . . . . . . . . . . . 38
5.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Experimental Results 43
6.1 Fully Connected Networks . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 EBP Compared to NBN . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3 ESA Compared to NBN . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.4 Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 55
Bibliography 57
List of Figures
3.1 Network architectures to solve the parity-3 problem with 3 neurons in
the hidden layer (MLP) and a network with 1 neuron in the hidden
layer (FCN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Network architectures to solve the parity-9 problem with a 3 layer
MLP network (left) and with a 3 layer FCN network (right). . . . . . 10
3.3 The desired nonlinear control surface to which the neural networks are
trained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Resulting control surface obtained with an MLP architecture having 7
neurons in one hidden layer. The total MSE is 0.00234. . . . . . . . . 12
3.5 Resulting control surface obtained with an FCN architecture having 4
hidden neurons. The total MSE is 0.00014 . . . . . . . . . . . . . . . 13
3.6 Example of an ACN network. The network has five neurons numbered
from 1 to 5 and eight nodes, 3 of which are input nodes from 1 to 3 and
five neuron nodes from 4 to 8. . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Typical hyperbolic tangent activation function. . . . . . . . . . . . . . 21
4.1 Early example of NNT in the GUIDE toolbox. . . . . . . . . . . . . 23
4.2 Neural network simulation environment for three dimensional surfaces. 29
5.1 Front end of Neural Network Trainer (NNT) . . . . . . . . . . . . . . 34
5.2 Three neuron architecture for the parity-3 problem. . . . . . . . . . . 36
6.1 Results of EBP algorithm for a parity-3 problem using two neurons
fully connected as in Figure 3.1 . . . . . . . . . . . . . . . . . . . . . 45
6.2 Results of EBP algorithm for a parity-5 problem using three neurons
fully connected; EBP was never able to converge. . . . . . . . . . . . 46
6.3 Results of SA algorithm for a parity-3 problem using two neurons fully
connected as in Figure 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4 Results of SA algorithm for a parity-5 with a fully connected 3 neuron
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.5 Results of ESA algorithm for a parity-3 with two neurons fully connected. 50
6.6 Results of ESA algorithm for a parity-5 with a fully connected four
neuron network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.7 Results of ESA algorithm for a parity-7 with a fully connected 4 neuron
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.8 Results of ESA algorithm for a parity-9 with a fully connected 4 neuron
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.9 The fully connected architecture used for the models in Figure 6.10.
Note that this architecture does not show each neuron's biasing weight
in order to simplify the drawing. . . . . . . . . . . . . . . . . . . . . . 52
6.10 The desired surface used for training. . . . . . . . . . . . . . . . . . . 53
6.11 The surface obtained from the output of the successfully trained net-
work in Figure 6.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.12 The difference between Figures 6.10 and 6.11 plotted on the same scale
as the original figures. . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.13 The difference between Figures 6.10 and 6.11 plotted on a much smaller
axis to show details. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 1
Introduction
Neural networks have been used for years in a wide range of Electrical Engi-
neering disciplines. As computers' processing speeds increase, software-implemented
neural networks can be used in many nontraditional places. Traditionally, neural
networks were very difficult to train and implement: the processing power required
to train them was more than most computers could handle, but that is no longer the
case. This evolution of computers is also allowing the neural network to evolve in
unpredicted ways. The change involves not only the method by which neural networks
are trained, but also where they can be implemented. Computers are used everywhere
today, from controlling production machines in factories to controlling the lights in
the factory. Neural networks can be implemented on any of these computers with
very little effort. The computation required is trivial for modern computers and the
time required is in many cases negligible. This ease of implementation can be taken
a step further by embedding neural networks.
Neural networks have become a growing area of research over the last few decades
and have taken hold in many branches of industry. One example in the field of indus-
trial electronics is motor drives [1, 2, 3]. Neural networks are also helping with power
distribution problems such as harmonic distortion [4, 5, 6, 7]. These papers show how
valuable neural networks are becoming in industry. This thesis offers a method of
training fully connected networks to help solve some of these real world problems.
Fully connected networks are much more powerful and easier to train when using the
Neural Network Trainer [8, 9]. Cross layer connections reduce the number of neurons
by increasing the number of connections and corresponding weights. This in turn
reduces the total number of calculations required for the network.
Unfortunately, with all of the research being done, neural networks are often
overlooked because they are widely misunderstood, difficult to train, or difficult to
implement. Chapter 2 discusses pre-existing training algorithms that do not simplify
these problems. In Chapter 3, a new training algorithm is presented that is demon-
strated to be more effective and faster than those discussed in Chapter 2. Chapters
4 and 5 describe the Neural Network Trainer and how it was created to assist in the
training process. Chapter 6 demonstrates the results of the various algorithms and
training methods.
Chapter 2
Common Neural Network Training Algorithms
Various methods for neural network training have been developed, ranging from
random search to gradient-based methods. The best-known gradient method is Error
Back Propagation (EBP) [10, 11, 12]. This method is characterized by very poor
convergence. Several improvements to EBP were developed, such as the Quickprop
algorithm, Resilient Error Back Propagation, Back Percolation, Delta-bar-Delta, etc.
Much better results can be obtained using second order methods such as Newton or
Levenberg-Marquardt (LM) [12]. In the latter, not only the gradient but also the
Hessian or Jacobian must be found. The EBP algorithms propagate errors layer by
layer from the outputs to the inputs. If a neural network has connections across
layers, the EBP algorithm becomes more complicated, and this added complexity is
even greater for second order algorithms such as the LM algorithm. Another method
of calculating gradients is forward calculation, also known as the perturbation method.
In this case small changes on neuron net values are introduced and the resultant
changes on the outputs are recorded. The perturbation method is especially useful
for VLSI implemented network structures, where instead of analytical computation
of gradients only various signal gains are measured [13, 14]. EBP works best for
feedforward neural networks. Similar to EBP, the LM algorithm was adapted only to
feedforward networks organized in layers (without weights across layers).
2.1 Error Back Propagation
This section demonstrates how EBP performs gradient descent to train neural
networks with hidden layers. The method begins by calculating the error at each
pattern, where the error is defined as:

err = d_{out} − out   (2.1)

where d_{out} is the desired output (the output given by the training data) and out
is the actual output. After the error has been calculated, the derivative through the
layers in reverse order is needed. The derivative is defined as:

f' = gain × (1 − out^2)   (2.2)

Next the change in weights is calculated by:

δ = f' × err   (2.3)

Δw = α × δ × x   (2.4)

where α is a user-defined constant used for adjusting the step size and x is the value
at that neuron connection. The new weight is then represented by:
w_{new} = w + Δw   (2.5)
This process is repeated for all output neurons and then for each layer, starting
at the output layer and ending at the input layer. The entire process is then repeated
for each training pattern. The total error for the network is then calculated using all
the patterns and outputs as follows:

Total_{err} = \sum_{p=1}^{np} [d_p − o_p]^2   (2.6)

This entire process is then repeated thousands of times in hopes of reaching the
desired error. EBP works in many cases, but it has several drawbacks. The first is
that the algorithm is very slow to converge, if it converges at all. It typically
approaches the vicinity of the solution very quickly but then drastically slows, taking
considerably longer to close in on the final solution than it took to get near it. This
process continues to slow as the error shrinks, making it very difficult to obtain a
precise answer. However, other algorithms, such as Levenberg-Marquardt, are able
to reach a much more precise answer very quickly.
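The per-pattern update of equations 2.1 through 2.5 can be sketched for a single tanh output neuron. The following Python fragment is an illustration only (the thesis software itself is written in MATLAB); the function name, arguments, and default values are assumptions, not taken from the thesis code.

```python
import math

def ebp_output_update(w, x, dout, alpha=1.5, gain=1.0):
    """One EBP weight update for a single tanh output neuron (eqs. 2.1-2.5).

    w    : weight list, w[0] is the biasing weight
    x    : neuron inputs for this pattern (bias input added internally)
    dout : desired output for this pattern
    Names and defaults are illustrative, not from the thesis software.
    """
    xb = [1.0] + list(x)                          # bias input plus pattern inputs
    out = math.tanh(gain * sum(wi * xi for wi, xi in zip(w, xb)))
    err = dout - out                              # eq. 2.1
    fprime = gain * (1.0 - out ** 2)              # eq. 2.2
    delta = fprime * err                          # eq. 2.3
    dw = [alpha * delta * xi for xi in xb]        # eq. 2.4
    return [wi + dwi for wi, dwi in zip(w, dw)]   # eq. 2.5
```

Calling this repeatedly over all patterns, with the total error of equation 2.6 checked after each pass, gives the basic EBP loop.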
2.2 Levenberg-Marquardt
The Levenberg-Marquardt (LM) algorithm is a second order algorithm that is
often overlooked by those attempting to train neural networks, possibly because it
is more complex to implement than EBP. However, it definitely makes up for this
in superior performance. LM is similar to EBP in that it requires the calculation of
the gradient vector, but in addition, LM also computes the Jacobian. The gradient
vector is represented as:
g = \left[ \frac{\partial E}{\partial W_1} \;\; \frac{\partial E}{\partial W_2} \;\; \cdots \;\; \frac{\partial E}{\partial W_N} \right]^T   (2.7)

where E is the error of the network for the given pattern and the W_i are the weights.
The Jacobian is essentially every gradient for every training pattern and network
output. The Jacobian is shown below.
J = \begin{bmatrix}
\frac{\partial E_{11}}{\partial W_1} & \frac{\partial E_{11}}{\partial W_2} & \cdots & \frac{\partial E_{11}}{\partial W_N} \\
\frac{\partial E_{21}}{\partial W_1} & \frac{\partial E_{21}}{\partial W_2} & \cdots & \frac{\partial E_{21}}{\partial W_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial E_{M1}}{\partial W_1} & \frac{\partial E_{M1}}{\partial W_2} & \cdots & \frac{\partial E_{M1}}{\partial W_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial E_{12}}{\partial W_1} & \frac{\partial E_{12}}{\partial W_2} & \cdots & \frac{\partial E_{12}}{\partial W_N} \\
\frac{\partial E_{22}}{\partial W_1} & \frac{\partial E_{22}}{\partial W_2} & \cdots & \frac{\partial E_{22}}{\partial W_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial E_{MP}}{\partial W_1} & \frac{\partial E_{MP}}{\partial W_2} & \cdots & \frac{\partial E_{MP}}{\partial W_N}
\end{bmatrix}   (2.8)
where N is the number of weights, M is the number of outputs, and P is the
number of patterns. In other words, the Jacobian has as many columns as there
are weights, and its number of rows is equal to the product of M and P. Once the
Jacobian is calculated, the LM algorithm can be represented by the following:

W_{k+1} = W_k − (J_k^T J_k + μI)^{−1} J_k^T E   (2.9)

where E is the total error vector for all patterns, I is the identity matrix, and μ is
a learning parameter. The learning parameter μ is adjusted several times each
iteration and the result with the greatest reduction of error is selected. When μ is
very large the LM algorithm becomes steepest descent (EBP), and when μ is equal
to zero it becomes the Newton method.
The entire process is then repeated until the error is reduced to the required
value. This second order algorithm is significantly faster than EBP, as will be dis-
cussed in more detail in Chapter 6. However, as successful as the LM algorithm is,
it does require the inversion of the matrix J_k^T J_k + μI. For small networks with
few training patterns this is not a major issue, but for networks with many training
patterns it is very computationally intensive. This inversion causes each LM training
iteration to take longer than an EBP iteration. As long as there is enough memory
to perform the inversion, the time required for training will still be far less than that
of EBP, because LM requires so few iterations.
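One LM iteration of equation 2.9, with the learning parameter tried several times and the best result kept, can be sketched as follows. This is an illustrative NumPy fragment, not the thesis's MATLAB implementation; the function names, signatures, and retry count are assumptions.

```python
import numpy as np

def lm_step(w, residuals, jacobian, mu=0.01, mu_scale=10.0, tries=5):
    """One Levenberg-Marquardt iteration (eq. 2.9), sketched with NumPy.

    residuals(w) -> error vector e (one entry per pattern/output)
    jacobian(w)  -> Jacobian of e with respect to the weights
    mu is retried several times per iteration and the trial giving the
    lowest error is kept, as described in the text.  All names here are
    illustrative assumptions.
    """
    e = residuals(w)
    J = jacobian(w)
    best_w, best_sse = w, float(np.dot(e, e))
    for _ in range(tries):
        # solve (J^T J + mu I) step = J^T e instead of forming the inverse
        step = np.linalg.solve(J.T @ J + mu * np.eye(len(w)), J.T @ e)
        w_try = w - step
        e_try = residuals(w_try)
        sse = float(np.dot(e_try, e_try))
        if sse < best_sse:
            best_w, best_sse = w_try, sse
        mu *= mu_scale
    return best_w, best_sse
```

For a linear residual one step essentially solves the problem; for a neural network the residuals and Jacobian would be recomputed every iteration using the methods of Chapter 3.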
Chapter 3
Modified Levenberg-Marquardt Algorithm
The LM algorithm was used as the basis for the following algorithms. This second
order approach is modified to train any feedforward architecture. Even though LM
is a very powerful algorithm, it is prone to stalling in local minima and becoming
unable to proceed in the training process. This problem is addressed in the following
sections.
3.1 Advantages of Arbitrarily Connected Neural Networks
Comparing FCN (fully connected neuron) networks with MLP networks, one
may conclude that the latter require about twice as many neurons to perform a
similar task [8]. For example, Figures 3.1 and 3.2 show the minimum architectures
required to solve parity-3 and parity-9 problems. It is relatively simple to design
neural networks to solve parity-N problems [15]. However, to find a solution by
training is much more difficult. For a three layer MLP network to solve a parity-N
problem the required number of neurons is [15]:

N_{MLP} = N + 1   (3.1)

while for three layer FCN networks the minimum number of neurons is:
Figure 3.1: Network architectures to solve the parity-3 problem with 3 neurons in the
hidden layer (MLP) and a network with 1 neuron in the hidden layer (FCN).
N_{FCN} = \begin{cases} (N + 1)/2 & \text{for odd parity problems} \\ N/2 + 1 & \text{for even parity problems} \end{cases}   (3.2)
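The two neuron-count formulas can be written down directly. This small Python sketch is for illustration; in particular, the even-parity branch (N/2 + 1) is a reconstruction from a garbled source line and should be checked against [15]:

```python
def min_neurons_mlp(n):
    # eq. 3.1: minimum neurons for a three layer MLP solving parity-n
    return n + 1

def min_neurons_fcn(n):
    # eq. 3.2: minimum neurons for a three layer FCN solving parity-n;
    # the even case (n//2 + 1) is reconstructed, not verbatim from the text
    return (n + 1) // 2 if n % 2 == 1 else n // 2 + 1
```

For parity-3 these give 4 (MLP) versus 2 (FCN), matching Figure 3.1, and for parity-9 they give 10 versus 5, matching Figure 3.2.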
Another example in which an arbitrarily connected network outperforms a tra-
ditionally connected network is a nonlinear control system. Figure 3.3 shows a desired
highly nonlinear control surface for two variables. With a 3 layer, 8 neuron MLP
network, it was not possible to reach the desired surface (Figure 3.4). However, with
the 5 neuron FCN architecture, it was possible to find a satisfactory solution, the
control surface obtained (Figure 3.5) being very close to the required surface of Fig-
ure 3.3. One may notice that the FCN topology with 5 neurons produces significantly
smaller error than the 8 neuron MLP topology with one hidden layer. The MSE
(mean squared error) is defined as:
MSE = \frac{1}{np \, no} \sum_{i=1}^{np} \sum_{j=1}^{no} e_{ij}^2   (3.3)
where:
Figure 3.2: Network architectures to solve the parity-9 problem with a 3 layer MLP
network (left) and with a 3 layer FCN network (right).
Figure 3.3: The desired nonlinear control surface to which the neural networks are
trained.
e_{ij} = out_{ij} − dout_{ij}   (3.4)

where dout is the desired output, out is the actual output, np is the number of
patterns, and no is the number of outputs.
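Equations 3.3 and 3.4 amount to averaging the squared output errors over all patterns and outputs, as in this small sketch (Python for illustration; the pattern-by-output list layout is an assumption):

```python
def mse(out, dout):
    """Mean squared error of eqs. 3.3-3.4.

    out, dout : lists of patterns, each a list of no output values
    (this layout is an illustrative assumption, not the thesis format)
    """
    num_p = len(out)        # np, number of patterns
    num_o = len(out[0])     # no, number of outputs
    total = sum((out[i][j] - dout[i][j]) ** 2
                for i in range(num_p) for j in range(num_o))
    return total / (num_p * num_o)
```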
Comparing the 3 layer networks shown in Figures 3.1 and 3.2, one may also con-
clude that FCN networks are more transparent than MLP networks. With connec-
tions across layers in ACN networks, there are fewer neurons (nonlinear elements) on
the signal paths, which results in the learning algorithms converging faster. Unfortu-
nately, most neural network learning software, such as the popular MATLAB Neural
Network Toolbox, is developed for MLP networks and is not able to handle FCN or
ACN networks. It is also much easier to write computer software for regular archi-
tectures, organized layer by layer, than for neural networks with arbitrarily
Figure 3.4: Resulting control surface obtained with an MLP architecture having 7
neurons in one hidden layer. The total MSE is 0.00234.
Figure 3.5: Resulting control surface obtained with an FCN architecture having 4
hidden neurons. The total MSE is 0.00014.
connected neurons (ACN). Both MLP and FCN networks are, of course, a subset of
ACN networks.
3.2 Calculation of Gradient and Jacobian
The EBP algorithm requires only the computation of the error gradient. Second
order algorithms, such as the LM algorithm or DFP (Davidon-Fletcher-Powell) [16],
require the computation of the Jacobian. EBP follows the concept of the steepest
descent optimization algorithm, where the global error is reduced by following the
steepest descent path (moving in the direction opposite to the gradient g). The
weight updating rule is:
Δw = −α g   (3.5)

where α is the experimentally selected "learning constant" and g is the gradient
vector. For the LM algorithm the weight updating rule is:

Δw = −(J^T J + μI)^{−1} J^T e   (3.6)

where I is the identity matrix, e is the error vector with elements given by equation
3.4, J is the Jacobian matrix, and μ is a learning parameter [8, 16]. If the Jacobian
J is known, then the gradient g can be found as:

g = 2 J^T e   (3.7)

Therefore the weight updates Δw can be found for both the EBP and LM algorithms
(equations 3.5 and 3.6) if the error vector e and the Jacobian matrix J are evaluated.
In the Jacobian matrix, each row corresponds to the p-th input pattern and o-th
network output; therefore the number of rows of the Jacobian is equal to the product
no × np, where np is the number of training patterns and no is the number of outputs.
The number of columns is equal to the number of weights nw in the neural network.
Each neuron contributes a group of elements to a row, equal in number to its inputs
plus one. For the p-th pattern, o-th output, and n-th neuron with K inputs, the
fragment of the Jacobian matrix has the form:
Figure 3.6: Example of an ACN network. The network has five neurons numbered
from 1 to 5 and eight nodes, 3 of which are input nodes from 1 to 3 and five neuron
nodes from 4 to 8.
\left[ \cdots \; \frac{\partial e_{po}}{\partial w_{n0}} \; \frac{\partial e_{po}}{\partial w_{n1}} \; \frac{\partial e_{po}}{\partial w_{n2}} \; \frac{\partial e_{po}}{\partial w_{n3}} \; \cdots \; \frac{\partial e_{po}}{\partial w_{nK}} \; \cdots \right]   (3.8)
where the weight with index 0 is the biasing weight and e_{po} is the error on the
o-th network output. In this thesis, a new NBN (neuron by neuron) method for cal-
culating the gradients and the Jacobians for arbitrarily connected feedforward neural
networks is presented. The rest of the computations for weight updates follow the
LM algorithm. In order to explain the computation algorithm, consider an arbitrarily
connected neural network with one output, as shown in Figure 3.6.
Row elements of the Jacobian matrix for a given pattern are being computed in
three steps:
1. Forward computations
2. Backward computations
3. Calculation of Jacobian elements
3.2.1 Forward computation
Forward and backward calculations are done using neuron by neuron (NBN)
calculations. In the forward calculation, the neurons connected to the network inputs
are first processed so that their outputs can be used as inputs to the subsequent
neurons. The following neurons are then processed when their input values become
available. In other words, the selected computing sequence has to follow the concept
of feedforward networks and the signal propagation. If a signal reaches the inputs
of several neurons at the same time, then these neurons can be processed in any
sequence. For the network in Figure 3.6 there are only four possible ways in which
neurons can be processed in the forward direction: 12345; 21345; 12435; or 21435.
When the forward pass is concluded, two temporary vectors are stored: vector o with
the values of the signals on the neuron outputs and the second vector s with the
values of slopes of the neuron activation functions, which are signal dependent.
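The forward computation described above can be sketched as a single neuron-by-neuron loop. This Python fragment is illustrative only; the netlist-style (weights, input-node) representation is an assumption, not the thesis's actual format:

```python
import math

def nbn_forward(inputs, neurons, gain=1.0):
    """Forward NBN pass over an arbitrarily connected feedforward network.

    `neurons` is a netlist-style list: each entry is (weights, in_nodes),
    where weights[0] is the biasing weight and in_nodes are indices into
    the node list (network inputs first, then earlier neurons' outputs).
    Neurons must appear in a valid feedforward order, as required in the
    text.  This data layout is an illustrative assumption.
    Returns the node values o and the activation slopes s, both saved for
    the backward computation.
    """
    node = list(inputs)
    slopes = []
    for weights, in_nodes in neurons:
        net = weights[0] + sum(w * node[i] for w, i in zip(weights[1:], in_nodes))
        out = math.tanh(gain * net)
        node.append(out)                          # usable by later neurons
        slopes.append(gain * (1.0 - out * out))   # slope of tanh at this net
    return node, slopes
```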
3.2.2 Backward computation
The sequence of the backward computation is opposite to the forward compu-
tation sequence. The process starts with the last neuron and continues toward the
input. In the case of the network of Figure 3.6 these are the possible sequences
(backward signal paths): 54321; 54312; 53421; or 53412. To demonstrate, let us
select the sequence 54321. The attenuation vector a represents the signal attenuation
from a network output to the outputs of all other neurons. The size of this vector is
equal to the number of neurons.
The process starts with the value one assigned to the last element of the a vector
and zeros to the elements for the remaining neurons. During backward processing,
for each neuron the delta value of this neuron is multiplied by the slope of the neuron
activation function (the element of the s vector calculated during forward computa-
tion) and then multiplied by the neuron input weights. The results are added to the
elements of the a vector for neurons which are not yet processed. The second step in
the example updates only the elements of the a vector associated with neurons 3 and
4, because only these neurons are directly connected to the inputs of neuron 5. In
the next step neuron 4 is processed and the elements of the a vector associated with
neurons 1 and 2 are updated. Next, neuron 3 is processed and again the elements of
the a vector that correspond to neurons 1 and 2 are updated. There is no reason to
continue the process beyond this point because there are no other neurons connected
to the inputs of neurons 1 or 2. As the result of backward processing, the elements
of the a vector are obtained.
One may notice that the backward computation is done only for a limited number
of neurons. For example in the case of the 4 topologies shown in Figures 3.1 and 3.2
only one output neuron is processed.
The size of the a vector is equal to the number of neurons while the number of
Jacobian elements in one row is much larger and is equal to the number of weights in
the network. In order to obtain all row elements of the Jacobian for p-th pattern and
o-th output, a very simple formula can be used to obtain the element of the Jacobian
matrix associated with input k of neuron n:

\frac{\partial e_{po}}{\partial w_{nk}} = d(n)_{po} \times s(n)_{po} \times node(k)_{po}   (3.9)

where d(n) is the element of the a vector and s(n) is the slope calculated during
forward computation, both associated with neuron n, and node(k) is the value on
the k-th input of this neuron.
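The backward computation and equation 3.9 can be sketched in the same spirit. This Python fragment is for illustration only; the netlist-style data layout, the single-output assumption, and all names are mine rather than the thesis code:

```python
def nbn_backward(neurons, slopes, n_inputs):
    """Backward NBN pass for a single-output network (Section 3.2.2).

    `neurons` is a netlist-style list: each entry (weights, in_nodes) with
    weights[0] the biasing weight; in_nodes index a node list in which
    indices below n_inputs are network inputs and n_inputs + j is the
    output of neuron j.  This layout is an illustrative assumption.
    Returns the attenuation vector a, one element per neuron.
    """
    n = len(neurons)
    a = [0.0] * n
    a[-1] = 1.0                          # value one at the output neuron
    for j in range(n - 1, -1, -1):       # reverse of the forward sequence
        delta = a[j] * slopes[j]         # attenuation times activation slope
        weights, in_nodes = neurons[j]
        for w, idx in zip(weights[1:], in_nodes):
            k = idx - n_inputs           # neuron index, if not an input node
            if k >= 0:
                a[k] += delta * w        # propagate through the input weight
    return a

def jacobian_element(a, slopes, n, node_k):
    # eq. 3.9: de_po/dw_nk = d(n) * s(n) * node(k), with d(n) taken as a[n]
    return a[n] * slopes[n] * node_k
```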
3.2.3 Calculation of Jacobian Elements
The process is repeated for every pattern, and if a neural network has several
outputs it is also repeated for every output. The process of gradient computation in
an arbitrarily connected neuron (ACN) network is exactly the same, but instead of
storing values in the Jacobian matrix, they are summed into one element of the
gradient vector:

g(n,k) = \sum_{p=1}^{np} \sum_{o=1}^{no} \frac{\partial e_{po}}{\partial w_{nk}} e_{po}   (3.10)

If the Jacobian is already computed, then the gradient can also be calculated
using equation 3.7. The latter approach, with the Jacobian calculation, has similar
computational complexity, but it requires much more memory to store the Jacobian
matrix.
3.3 Self Aware Algorithm
The Self Aware algorithm (SA) is a modification of the NBN algorithm. It
takes a very large step in automating the training process. The motivation for this
algorithm is that when training problems with deep local minima, such as parity
problems, the algorithm often saturates and cannot escape the depression. This
may happen after a few iterations or after tens of thousands of iterations, and the
algorithm must restart with different initial weights. Traditionally the user must
manually restart the algorithm, but this means being present to observe the situation.
This is far from ideal because the training process may take several hours and consume
the complete system resources of the computer on which it is operating.
The SA algorithm alleviates the need for a user to be present by monitoring its
own progress. This task may seem trivial, but many variables actually need to be
taken into account. The mean squared error (MSE) will decrease, but at drastically
different rates during a successful run. The restart criterion is shown in the following
equation:

\frac{|MSE_{ite} − MSE_{ite−5}|}{MSE_{ite}} < MSE_{max}   (3.11)

where ite stands for the current iteration and MSE_{max} refers to the maximum
error set by the user for training the network. This takes into account the rate at
which the algorithm is reducing the error, relative to the total error. This is
important because if the total error is several orders of magnitude larger than the
change in error, then under most circumstances the algorithm will never converge and
should be restarted. However, the error decreasing by only a very small amount is
acceptable when the MSE is on the same order of magnitude. This method also takes
into account the desired error for the problem at hand. If a rough solution is all
that is desired, the algorithm tends to restart sooner in order to reach that rough
solution as quickly as possible, whereas when a very precise solution is requested the
algorithm is allowed to slow significantly without restarting.
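The restart test of equation 3.11 can be sketched as follows (an illustrative Python fragment; the history-list interface is an assumption):

```python
def should_restart(mse_history, mse_max):
    """Self Aware restart test (eq. 3.11): restart when the relative
    progress over the last five iterations falls below the target MSE.

    mse_history : list of MSE values recorded so far, one per iteration
    mse_max     : maximum error set by the user for training
    (Interface is an illustrative assumption, not the thesis code.)
    """
    if len(mse_history) < 6:
        return False                     # not enough history yet
    mse_now = mse_history[-1]
    mse_prev = mse_history[-6]           # five iterations earlier
    return abs(mse_now - mse_prev) / mse_now < mse_max
```

A stagnant run (no change over five iterations) triggers a restart, while a run still making progress relative to its current error is left alone.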
3.4 Enhanced Self Aware Algorithm
The Enhanced Self Aware algorithm (ESA) incorporates both the NBN and SA
algorithms in addition to a modification of the Jacobian matrix. A common problem
with many training algorithms is that local minima often cause the weights to get
pushed into the saturated portion of the hyperbolic tangent curve (see Figure 3.7).
In the saturated region, the slope is nearly zero and therefore the derivative is
nearly zero. This is a problem because even if a pattern has a large error, that error
is lost when it is multiplied by the very small slope. The pattern no longer causes the
algorithm to adjust its weights to correct the error, which frequently leaves the
algorithm unable to move out of a local minimum. In order to prevent the patterns
from saturating prematurely, the Jacobian matrix is scaled by a constant. This allows
the patterns to remain in the linear region longer, so that all of the patterns can
be correctly classified. It does not create a problem for correctly classified patterns
Figure 3.7: Typical hyperbolic tangent activation function.
because even though the derivative is larger, their error is very small and they are
still able to saturate. This method of delaying the saturation process does slow the
entire process slightly. When the Jacobian is modified in this manner it becomes
more difficult to find the exact solution once the errors are small. To account
for this, the algorithm monitors the errors for each pattern. When the errors for all
patterns are small, the algorithm no longer scales the Jacobian matrix. This allows
the algorithm to converge to a very small error just as NBN does. This process may
take slightly longer, but it is more likely to converge in difficult cases.
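The ESA scaling decision can be sketched as follows. The scale factor and small-error threshold here are illustrative assumptions; the thesis does not state specific constants:

```python
def esa_scale_jacobian(J, errors, scale=5.0, small_error=0.001):
    """ESA Jacobian modification (Section 3.4), sketched in Python.

    While any pattern still has a sizeable error, the Jacobian is scaled
    by a constant so that large-error patterns keep influencing the
    weights before the tanh slope saturates.  Once every pattern's error
    is small, the Jacobian is left alone so the algorithm can converge
    precisely, as NBN does.  `scale` and `small_error` are illustrative
    values; J is a list of rows, errors a list of per-pattern errors.
    """
    if max(abs(e) for e in errors) < small_error:
        return J                         # all patterns nearly correct
    return [[scale * v for v in row] for row in J]
```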
Chapter 4
Developed Software
Prior to the work reported herein, there was very little software available to train
fully connected neural networks. As a significant contribution of this thesis research,
a package with a graphical user interface has been developed in MATLAB for that
purpose. This software allows the user to easily enter very complex architectures
with initial weights, training parameters, data sets, and a choice of several powerful
algorithms. The front end of the Neural Network Trainer (NNT) package is shown
in Figure 5.1. The first section discusses the internal workings of NNT, followed by
a tutorial on how to create and train a network. The final section demonstrates an
addition to NNT that allows the user to verify networks used for nonlinear mapping.
The motivation behind NNT was to create a simple to use yet extremely powerful
tool for training neural networks. NNT started out as a very rudimentary training
method that required modifying several m-files and creating arrays of parameters in
order to train a network, and has since evolved into an entire training package that
is easy to use.
4.1 Graphical User Interface
The Graphical User Interface was built using the Matlab toolbox GUIDE. It
allows the programmer to create a series of buttons, textboxes, figures, and pull-down
menus with relative ease. Figure 4.1 shows an early version of NNT in the GUIDE

Figure 4.1: Early example of NNT in the GUIDE toolbox.

interface with the property inspector window open on the right side. This window
is displaying the properties of the TRAIN button, which executes the TRAIN
callback function shown at the top of the window. When this function is called, it
initiates the training process.
One important feature of NNT is that it allows the user to train the same network
with multiple algorithms through the simple selection of a drop-down menu. This creates
a problem from the programmer's standpoint because the algorithms have different
training parameters. A work-around was implemented by manipulating
the typically constant text boxes in the interface using the set function. When the
graphical user interface (GUI) is created, it automatically generates a structure variable
called handles, and each text box has an entry in the handles structure. When
the user selects a different algorithm, the callback function is executed, which contains
a switch statement with a case for every algorithm. This statement then sets all the
parameter names and updates the default values of each. A snippet of the callback
function is shown below.
switch handles.algorithm
    case 1
        set(handles.label1,'String','Alpha');
        set(handles.input1,'String','1.5');
        set(handles.label2,'String','N/A');
        set(handles.input2,'String','0');
        set(handles.input3,'String','10');
        set(handles.input4,'String','5000');
        set(handles.gain,'String','.1');
        handles.input1_usr=1.5;
        handles.input2_usr=0;
        handles.input3_usr=10;
        handles.input4_usr=5000;
        handles.gain_usr=0.1;
    case {2,3}
        set(handles.label1,'String','Mu');
        set(handles.input1,'String','0.01');
        set(handles.input2,'String','10');
        set(handles.label2,'String','Mu scale');
        set(handles.label3,'String','Print Scale');
        set(handles.input3,'String','1');
        set(handles.input4,'String','200');
        set(handles.gain,'String','.1');
        handles.input1_usr=.01;
        handles.input2_usr=10;
        handles.input3_usr=1;
        handles.input4_usr=200;
        handles.gain_usr=0.1;
This is just the first three cases, but it shows the pattern for the remaining cases.
Notice that the last five lines of code in each case update the variables to the corresponding
values that are displayed. This is necessary because changing the text box display using the
set function is not the same as the user updating the value. This feature allows
different default training variables for each algorithm, helping the user select
reasonable values.
4.2 Continuous Training Control
After the user has entered all of the required data, the training process can
begin, but first this data needs to be interpreted. The parameter data from the GUI is stored
in a structure that is easily passed to the training engine. However, the input file
must be parsed. The input file is read using a series of standard Matlab text-file read
commands. These individual commands are fairly straightforward
but closely intertwined. Once the data is read, several variables are created which
the training engine uses directly. These details can be found in Chapter 3, which
pertains to calculations and training algorithms. However, many features have been
implemented in NNT beyond simply sending training data to the engine. It also
extracts information about the training process, monitors it, and gives the user as
much feedback as possible. Once the training is finished, it writes a text file with the
weights that were the best for that set of runs. The following code block shows the
handling of the self-aware algorithms, successes, failures, saving data, and plotting.
RESTART=0;
total_ite=1;
total_TERR=0;
num_runs=1;
TERR_best=10000000000;
while RESTART==0 && CANCEL==0
    [ww, nodes, TERR, ite] = LM_BMW(inp,dout,topo,ww,param,nmod);
    dlmwrite('wwtemp.ww',ww);
    if (external_plot)
        axes(handles.graph);
        figure(1);
        if RESTART==1 || CANCEL==1
            semilogy(TERR,'r--');   % failed or canceled run: dashed red
            hold on
        else
            semilogy(TERR,'k');     % successful run: solid black
            hold on
        end
    end
    axes(handles.graph);
    if RESTART==1 || CANCEL==1
        semilogy(TERR,'r--');
        hold on
    else
        semilogy(TERR,'k');
        hold on
    end
    if TERR(end) < TERR_best        % keep the weights from the best run so far
        TERR_best = TERR(end);
        dlmwrite('best.ww',ww);
    end
    num_runs = num_runs + 1;
    % stop when the error goal or the iteration limit is reached
    if TERR(end) <= max_err || ite >= max_ite; break; end
end
This section of code is the link between the GUI and the training engine, as well
as providing the user important usable data. The first thing to note is the presence
of two global variables, RESTART and CANCEL. Using global variables is always
a last resort, but in this situation it is an absolute must, especially in the case of
the CANCEL variable. The reason for this is that the CANCEL variable is
only set inside a function within the GUI, but it is needed after the GUI has already
called the training engine, making it impossible to pass it in the traditional
manner. Another work-around that was needed to allow the cancel button to function
was to place a 0.1 ms pause inside of the training algorithm that allows the GUI to
update. Without this pause, even though the cancel button has been pressed, the
callback function that sets the CANCEL flag would not get called until the training
process is complete, which would be too late to be useful. The other global variable
is RESTART, which is for the self-aware algorithms. It allows any of the algorithms,
from within any function, to smoothly cause a restart of the training session. Again,
the reason for using a global variable is that the complex set of nested functions the
restart command would need to propagate through is too extensive to be efficiently
passed. These parameters are very important in controlling when to restart training
and when to exit the training process.

The outer while loop controls this feature based on the RESTART and CANCEL
flags. The loop continues to train the network until it is canceled by the user or
until certain conditions are met, which will be discussed shortly. The first line within
the loop sends the training data and parameters to the engine to be trained, and it
returns the weights, nodes, the error array, and the number of iterations performed.
If either the RESTART or CANCEL flag is set high, then the training process was a
failure and is plotted with a dashed red line. The trainer then goes a step further to
check the final value of the mean squared error and compare it to the previous best
run. It then exports the best weights to a text file and stores the number of iterations
required, the error vector, and the run number. If the process was not canceled by
the user, another attempt is made until the RESTART flag returns a zero, indicating
a successful run. The MSE is plotted with a solid black line, and the weights are
written to a text file. After the training is finished, the weights are also printed to
the command window along with the other data regarding the best run.
4.3 Nonlinear Mapping

When training neural networks for control surfaces, it is important to be able to
see the three-dimensional shape of the network output, not simply the mean squared
error. This was the motivation for creating a tool that works alongside NNT and
allows the user to plot networks with two inputs and one output. The tool was created
to aid in the process of analyzing and testing neural networks.

4.3.1 Interfacing

The software was created as a direct link to NNT; it requires two files, an input
file and a weight file. The plotting software uses the same input file that was used
by NNT for training. The weight file is simply an array of weights in a text file,
typically separated by line returns. This is further simplified for the user because
NNT by default generates weight files when it finishes training, called tempww.ww
and best.ww, which contain the weights for the most recent run and the best run
respectively. The user needs only to select these two files, set the appropriate
gain used for training, and then simulate. The simulator can be seen in Figure 4.2.

Figure 4.2: Neural network simulation environment for three dimensional surfaces.

As seen in Figure 4.2 there are four graphs. The graph in the top left is the
desired surface that the network was trained from, and the graph in the top right
is the output of the trained network. The third graph is the difference between the
previous two surfaces, plotted on axes with dimensions identical to the first two in
order to give perspective on the relative size of the error. The last plot is also the difference of the first two plots,
but its axes are a closer fit, allowing the user to see the shape of the error more closely.
4.3.2 Synchronizing Plots
One of the key features of the software is that it not only allows the user to easily
generate these plots but also to view them easily. When the user clicks the rotate
button, the first figure can be manipulated to be seen from any angle. Once
the user is satisfied with the current angle, he must simply press the rotate button
again, which then aligns the remaining figures with an identical viewing angle. This
task requires the use of the axes and view commands in Matlab. The following
block of code shows the process of matching the viewing angles.
function Rotate_Callback(hObject, eventdata, handles)
% handles    structure with handles and user data (see GUIDATA)
axes(handles.graph);
rotate3d                 % toggle interactive rotation on the first graph
axes(handles.graph)
temp=view;               % read the current viewing angle
axes(handles.graph2)
view(temp);              % copy the angle to the remaining graphs
axes(handles.graph3)
view(temp);
axes(handles.graph4)
view(temp);
When the rotate button is pressed, the callback function is executed, which reads
the new viewing angle of the first graph and then overwrites the views of the other
three graphs. This step is taken even further with the auto-rotate button. This
button performs a similar task, but instead of the user rotating the view manually, the view
slowly rotates automatically for one rotation or until stopped by the user. The function
for rotating is shown below.
function auto_rotate(handles)
global STOP_ROTATE
deg=360;
[az, el]=view;               % start from the current azimuth and elevation
rotvec=0:deg/180:deg;        % 2-degree steps through one full rotation
for i=1:length(rotvec)
    if STOP_ROTATE==-1; break; end
    axes(handles.graph)
    view([az+rotvec(i) el])
    axes(handles.graph2)
    view([az+rotvec(i) el])
    axes(handles.graph3)
    view([az+rotvec(i) el])
    axes(handles.graph4)
    view([az+rotvec(i) el])
    drawnow                  % flush pending graphics so the motion animates
    pause(0.1)
end
The auto-rotate function must again use a global variable in order to interrupt
the process and allow the user to stop the rotation if so desired. The code creates a
rotation vector between 0 and 360 degrees with a step size of 2 degrees. This allows
the graph to move at a reasonable pace and still be fluent. The rotation starts at
the current azimuth of the plot, not necessarily zero, which is why the rotation vector is
added to the original azimuth at each step. At the end of the loop there is a drawnow
call, which is required to allow the current function to pause and draw anything
that is in the buffer; otherwise Matlab would wait for the entire function to finish and
the rotation would not be animated.
Chapter 5
Neural Network Trainer
The user will first notice there is an empty plot on the left side of the trainer,
where the iterations versus mean squared error will be displayed, as well as training
parameters on the right-hand side. In using NNT the user must follow a few simple
steps before training a network. The user prepares a data file that contains the
training data and an input file that describes the network architecture, then
sets the training parameters.

5.1 Training Data

The user must create a training file with all the data sets required to train the
NN. This data may be created in various ways, such as by hand, in a spreadsheet, or
directly through Matlab. A simple parity-3 problem will be used for demonstration
purposes. This demonstration will use bipolar neurons, so the extremes for the data will

Figure 5.1: Front end of Neural Network Trainer (NNT)

be ±1. The training data for parity-3 is represented by the following matrix:
In1   In2   In3   Out1
-1    -1    -1    -1
-1    -1     1     1
-1     1    -1     1
-1     1     1    -1
 1    -1    -1     1
 1    -1     1    -1
 1     1    -1    -1
 1     1     1     1
As with any parity-n problem, there are 2^n possible patterns. As the top row
indicates, the first three columns are the inputs and the last column is the output for
that row. The top row of the matrix is for demonstration purposes only and is not
needed in the actual data file. This data is then copied to a text file and saved with the
file extension ".dat". No delimiters other than white space are required. Once the
data file is finished it can be referenced by numerous architecture input files.
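For larger parity-n problems, writing the data file by hand becomes tedious. A short script along the following lines (a hypothetical helper, not part of NNT) generates the 2^n bipolar patterns and writes them in the whitespace-delimited .dat format described above:

```python
from itertools import product

def parity_data(n):
    """Bipolar parity-n training rows: n inputs followed by one output.

    Output is +1 when the number of +1 inputs is odd, -1 otherwise.
    """
    rows = []
    for bits in product((-1, 1), repeat=n):
        out = 1 if bits.count(1) % 2 == 1 else -1
        rows.append(list(bits) + [out])
    return rows

def write_dat(filename, rows):
    # NNT's .dat files are whitespace-delimited numbers, one pattern per line
    with open(filename, "w") as f:
        for row in rows:
            f.write(" ".join(str(v) for v in row) + "\n")

# Example: write_dat("parity3.dat", parity_data(3))
```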
5.2 Input File
The input file contains the network architecture, neuron models, data file reference,
and optional initial weights. Each input file will be unique to each architecture
but not necessarily to each data set. In other words, the same data set can be used
for several different architectures simply by creating a new input file. The input file
contains three sections: the architecture, the model parameters, and the data file definition. The
following is an example of an input file for the parity-3 problem discussed in the
previous section.
// Parity-3 input file (parity3.in)
n 4 mbip 1 2 3
n 5 mbip 1 2 3
n 6 mbip 1 2 3 4 5
W 5.17 20.08 -10.01 -4.23
W 1.0 10.81 2.20 19.84
.model mbip fun=bip, der=0.01
.model mu fun=uni, der=0.01
.model mlin fun=lin, der=0.05
datafile=parity3.dat
The first line is a comment. Either a double slash, as in C, or a percent sign,
as in Matlab, is acceptable as a comment delimiter. After the comment comes the
network architecture for a 3-neuron fully connected network, as shown in Figure 5.2.

Figure 5.2: Three-neuron architecture for the parity-3 problem.

The neurons are listed in a netlist type of layout that is very similar to a SPICE
netlist. This way of listing the layout is node based. The first nodes are reserved
for the input nodes. The first character of the line is an n, to signify that this line
describes a neuron. The n is followed by the neuron output node number. Looking
at Figure 5.2, the first neuron is neuron 4 because 4 is the first available number after
the three inputs, and it is connected to nodes 1, 2, and 3, which are inputs. The same
is true for neuron 5, the second neuron; it is also connected to all three inputs. The
output neuron is slightly different, but it follows the same concept. It is connected to
all three inputs as well as to the outputs of the first two neurons. Based on this it
should be straightforward to see the connection between the input file listed above
and Figure 5.2.

One will also notice on the line of each neuron the model of the neuron, which
allows the user to specify a unique model for each neuron. This network is set up
to solve a parity-3 problem using three bipolar neurons. This is not the minimal
architecture for this problem, but it serves as a good demonstration of the tool.
Following the architecture of the network are the optional starting weights. If
no starting weights are given, the trainer will choose random weights. The weights
need to be listed in the same format as the architecture. Each line of weights starts
with the capital letter W. The biasing weight goes in place of the output node of the
neuron. In other words, the first weight listed for a particular neuron is the biasing
weight, followed by the remaining input weights in their respective order. See the
input file above for an example.

The user specifies a model for each neuron, and these models are defined on a
single line each. The user has the ability to specify the activation function and neuron
type (unipolar, bipolar, or linear) for each model. The user may include neurons with
different activation functions in the same network.

The final line of the input file is a reference to the data file. This line
simply needs to read datafile= followed by the file name; in this example it is
parity3.dat, which can be seen on the last line of the example input file.
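A reader for this netlist format can be sketched as follows. This is a hypothetical parser written for illustration, not NNT's own input routine; it simply collects the neuron lines, weight lines, model definitions, and data file reference described above:

```python
def parse_netlist(text):
    """Parse an NNT-style netlist (illustrative sketch, not NNT's parser).

    Returns (neurons, weights, models, datafile).
    """
    neurons, weights, models, datafile = [], [], {}, None
    for line in text.splitlines():
        # strip // (C-style) and % (Matlab-style) comments
        line = line.split("//")[0].split("%")[0].strip()
        if not line:
            continue
        parts = line.split()
        key = parts[0].lower()
        if key == "n":
            # n <output-node> <model> <input nodes...>
            neurons.append({"out": int(parts[1]), "model": parts[2],
                            "inputs": [int(p) for p in parts[3:]]})
        elif key == "w":
            # W <bias weight> <input weights...>
            weights.append([float(p) for p in parts[1:]])
        elif key == ".model":
            # .model <name> fun=..., der=...
            models[parts[1]] = dict(p.strip(",").split("=") for p in parts[2:])
        elif line.lower().startswith("datafile="):
            datafile = line.split("=", 1)[1].strip()
    return neurons, weights, models, datafile
```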
5.3 Training Parameters
Once the network architecture has been decided and the input files created, the
next step is to select the training algorithm and parameters. When NNT is loaded,
there is an orange panel full of adjustable parameters on the right side of the window.
These parameters change for each algorithm, so they will be addressed accordingly
in the following section. The algorithms themselves are explained in more detail in
Chapter 3, but first the user should select the input file just created.
5.3.1 Implemented algorithms
The algorithm is chosen from the pull-down menu in the training parameters.
Four of the parameters are the same for all algorithms: Print Scale, Max.
Iterations, Max. Error, and Gain. The Print Scale refers to how often the mean
squared error is printed to the Matlab command window. This can be important
because in certain situations the longest calculation time is that of displaying the data,
so increasing this number can significantly decrease training time. Max Iterations is
the number of times the algorithm will attempt to solve the problem before it is
considered a failure. An iteration is defined as one adjustment of the weights, which
includes calculating the error for every training pattern and adjusting at the end.
The Max Error is the mean squared error that the user considers to be an acceptable
value. When this number is reached the algorithm stops calculating and displays the
final weights. The last common parameter is the Gain, which is the factor by which
the sum of the inputs is multiplied for all the neurons before the activation function
is applied; see Equation (5.1).

output = tanh(gain · net)    (5.1)
Error Back Propagation (EBP)
This algorithm is the traditional EBP with the added ability to handle fully connected
neural networks. The Alpha parameter is the learning constant. This value is a
multiplier that acts as the numerical step size in the direction of the
gradient. If Alpha is too large, the algorithm can oscillate instead of reducing the error.
However, if Alpha is too small, the algorithm can move toward the solution too slowly
and prematurely level off. This parameter should be adjusted by the user until an
optimal value is found, one that produces some oscillation that diminishes while the error
continues to decrease.
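As a toy illustration of the EBP update and the gain parameter of Equation (5.1), the following sketch trains a single bipolar (tanh) neuron by stepping down the error gradient with learning constant Alpha. The function name, the single-neuron restriction, and the default values are assumptions for illustration; NNT's engine handles arbitrary architectures:

```python
import numpy as np

def train_ebp(X, d, alpha=0.1, gain=1.0, max_ite=5000, max_err=1e-4):
    """EBP for a single bipolar (tanh) neuron; a sketch, not NNT's engine.

    X: patterns x inputs, d: desired bipolar outputs, alpha: learning constant.
    Weights are [bias, w1, ..., wn], matching NNT's bias-first convention.
    """
    rng = np.random.default_rng(0)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the bias input
    w = rng.uniform(-1, 1, Xb.shape[1])
    for ite in range(max_ite):
        out = np.tanh(gain * (Xb @ w))               # Equation (5.1)
        e = d - out
        mse = np.mean(e ** 2)
        if mse <= max_err:
            break
        # gradient of the MSE w.r.t. weights (chain rule through tanh)
        der = gain * (1 - out ** 2)
        w += alpha * (Xb.T @ (e * der)) / len(d)
    return w, mse, ite
```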
Neuron By Neuron (NBN)

NBN is a modified Levenberg-Marquardt algorithm for arbitrarily connected neural
networks. The NBN algorithm, formerly known as the BMW algorithm, was briefly
described in [8]. It has two training parameters, μ and μ Scale. The learning parameter
of the LM algorithm is μ. Its use can be seen in Equation (5.2).

w_{k+1} = w_k − (J_k^T J_k + μI)^{−1} J_k^T e    (5.2)

If μ = 0, then the algorithm becomes the Gauss-Newton method. For very large values
of μ, the algorithm approaches the steepest descent method, or EBP. The μ parameter
is automatically adjusted at each iteration to ensure convergence. The amount by which it is
adjusted each time is μ Scale, which is the last parameter for the NBN algorithm.
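One iteration of Equation (5.2), together with the usual accept/reject adjustment of μ by the μ Scale factor, can be sketched as follows. This is an illustrative sketch, not the NNT engine; the `error_fn` callback and the step-acceptance rule are assumptions:

```python
import numpy as np

def lm_step(w, J, e, mu, mu_scale, error_fn):
    """One Levenberg-Marquardt iteration following Equation (5.2).

    mu -> 0 gives the Gauss-Newton method; very large mu approaches
    steepest descent (EBP-like behavior). error_fn maps a weight vector
    to its total squared error and is used to accept or reject the step.
    """
    w_new = w - np.linalg.solve(J.T @ J + mu * np.eye(len(w)), J.T @ e)
    if error_fn(w_new) < error_fn(w):
        # success: trust the quadratic model more next time
        return w_new, mu / mu_scale
    # failure: keep the old weights and move toward steepest descent
    return w, mu * mu_scale
```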
Self Aware (SA)
The SA algorithm is a modification of NBN. It evaluates the progression of the
training and determines whether the algorithm is failing to converge. If the
algorithm begins to fail, the weights are reset and another trial is attempted. In
this situation the program displays its progress to the user as a dotted red line on the
display and begins again. The algorithm continues to attempt to solve the problem
until either it is successful or the user cancels the process. The SA algorithm uses
the same training parameters as NBN.
Enhanced Self Aware algorithm (ESA)
ESA is also a modification of the NBN algorithm, used in order to increase
the chances of convergence. The modification was made to the Jacobian matrix in order
to allow the algorithm to be much more successful in solving very difficult problems
with deep local minima. It is also aware of its current solving status and will reset
when necessary.

The ESA algorithm uses a fixed value of 10 for the μ Scale parameter and instead
allows the user to adjust the LM parameter. The LM parameter is essentially a scale
factor applied to the Jacobian matrix before it is used in calculating the weight
adjustment. This scale factor is typically a positive number between 1 and 10, or
possibly greater. The more local minima the problem has, the larger the LM factor
should be.
Forward-Enhanced Self Aware (F-ESA)
F-ESA is another modification of the NBN algorithm, in which an alternative method
for calculating the Jacobian matrix is used. The calculation of the Jacobian is unique
in the sense that only feed-forward computations are needed. This approach is then
paired with the Enhanced Self-Aware LM algorithm. The F-ESA algorithm is included
in NNT but was written by Joel Hewlett. The F-ESA algorithm uses the same
training parameters as the ESA algorithm.
Evolutionary Gradient
Evolutionary Gradient is a newly developed algorithm which evaluates gradients
from randomly generated weight sets and uses the gradient information to generate a new
population of weights. This is a hybrid algorithm which combines the use of random
populations with an approximated gradient approach. Like standard methods of evolutionary
computation, the algorithm is better suited for avoiding local minima than
common gradient methods such as EBP. What sets the method apart
is the use of an approximated gradient which is calculated with each population. By
generating successive populations in the gradient direction, the algorithm is able to
converge much faster than other forms of evolutionary computation. This combination
of gradient and evolutionary methods essentially offers the best of both worlds.
The training parameters are very different from those of the LM-based algorithms previously
discussed. They include Alpha, Beta, Min. Radius, Max Radius, and Population.
This algorithm was written by Joel Hewlett, and details regarding these parameters
may be found in [17].
5.3.2 Training
Once the user selects the appropriate training algorithm, the parameter boxes
will change to the corresponding parameters and default values will fill the boxes.
After the user sets the parameters, there are two other options that can be selected.
When checked, the clear plot box will overwrite any existing plot with the new one;
if it is left unchecked, the following plots will be drawn on the same axes as all
of the previous plots. The last option is the external plot, which draws the plots
both inside NNT and in a separate figure, allowing for easy printing or modification of the plot.
The train button begins the training process, which prints the error to the Matlab
Command Window as it trains. At any time the process can be halted and the
results plotted by pressing the cancel button. Example plots are shown in Chapter 6.
Chapter 6
Experimental Results
The Neural Network Trainer has been tested on several types of problems in order
to show its success at training neural networks. Several tests were created to test both
overall training ability as well as the individual algorithms. The following sections
will compare the training speeds and success rates of the algorithms, and demonstrate
the trainer's ability to train networks for nonlinear mapping as well as networks with a large
number of inputs and deep local minima.
6.1 Fully Connected Networks
Most of the examples throughout the results sections will use fully connected
networks because they produce better training results, as measured by success rates
and the number of iterations required to converge. This is according to [8] as well as
evidenced by the following results. The following experiment was done using the
NBN algorithm on networks with different numbers of neurons in the hidden layer,
using both fully connected and traditional multilayer architectures. The right side of
Figure 3.1 is an example of a network with one neuron in the hidden layer, whereas the
left side has three neurons in the hidden layer. Parity-3 and parity-5 problems were
tested, and the results are shown in Tables 6.1 and 6.2. The results show that fully
connected networks converge in fewer iterations and in significantly more
cases.
Hidden Neurons      1      2      3      4      5
MLP   SR            0%     80%    98%    100%   100%
      ANI           NA     19     10     9      8
FCN   SR            100%   100%   100%   100%   100%
      ANI           7.8    7.2    7.0    6.8    6.7

Table 6.1: Parity-3. SR - Success Rate, ANI - Average Number of Iterations, using the NBN algorithm.

Hidden Neurons      2      3      4      5      6      7
MLP   SR            0%     4.9%   42%    74%    89%    96%
      ANI           NA     45     57.5   46.7   34.3   28.5
FCN   SR            24%    57%    69%    86%    95%    95%
      ANI           24     21     19     18     17.7   16.8

Table 6.2: Parity-5. SR - Success Rate, ANI - Average Number of Iterations, using the NBN algorithm.
6.2 EBP Compared to NBN
Error Back Propagation is used by many people around the world for training
neural networks. Therefore it will be used as the benchmark for training speed as
well as ability to converge. The following results show that EBP is outperformed by
the NBN, SA, and ESA algorithms in all aspects. EBP was used to train a parity-3
problem, and the results can be seen in Figure 6.1. EBP was also used in efforts to
train parity-5 and parity-7 problems but was never able to converge (see Figure 6.2).
From examining Figure 6.1 it is apparent that EBP took several thousand iterations
to approach an error of 10^-6, where it was asymptotically approaching a limit.

SA was used to solve the same problem, and the results are shown in Figure 6.3.
From the graph it can be seen that the SA algorithm converged to an error of 10^-6 in
just tens of iterations and without leveling off. Similar results were produced for parity-5
Figure 6.1: Results of the EBP algorithm for a parity-3 problem using two neurons fully
connected as in Figure 3.1.

problems with a comparable number of iterations. These results can be seen in Figure
6.4. However, with parity-5 NBN converges less frequently, and it rarely converges with
parity-7. These situations are where ESA produces the best results.
6.3 ESA Compared to NBN
The ESA algorithm was compared with the NBN algorithm. At the beginning of
each training session, random weights are generated as the starting weights for the
calculation. These weights can play a significant role in whether or not the algorithm
converges. This makes it difficult to test whether the algorithm succeeded
because it had a good starting point or because the algorithm is working
properly.

Figure 6.2: Results of the EBP algorithm for a parity-5 problem using three neurons fully
connected; EBP was never able to converge.

In order to remove this variable, a library of random weight sets was generated
for each neural network and saved. Each algorithm was tested with each starting
set, and its convergence was recorded. The success rate of the ESA algorithm
was better than that of the NBN algorithm on every training set compared. The
ESA in many cases took more iterations, but this cost is heavily outweighed by its sheer
ability to converge with a reasonable success rate. Although the ESA algorithm
converged with better success rates, it always converged more slowly than the
NBN algorithm. Therefore, it should be used only for very difficult cases. Table 6.3
compares the ESA and NBN algorithms for a variety of parity problems.

The ESA algorithm was used to solve parity-3 and parity-5 problems, and the
results are shown in Figures 6.5 and 6.6. These figures can be compared to the
Figure 6.3: Results of the SA algorithm for a parity-3 problem using two neurons fully
connected as in Figure 3.1.

results from the NBN algorithm in Figures 6.3 and 6.4. It can be seen that the ESA
does converge more slowly, but it is more reliable, as will be seen in the next
comparison.
The following figures show successful runs for very difficult parity problems.
These problems are very difficult for several reasons. The number of training patterns
grows exponentially with each higher-order problem, so more patterns must be
classified. Also, the Jacobian becomes large, and therefore much more time is required at
each step.
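To make this growth concrete, the Jacobian for a parity-n problem has 2^n rows (one per pattern, for a single output) and one column per weight. A small helper can count the dimensions; it is hypothetical and assumes a fully connected cascade topology in which each neuron sees all inputs and all earlier neuron outputs, plus a bias:

```python
def jacobian_shape(n_inputs, hidden, outputs=1):
    """Rows x cols of the Jacobian for a parity-n problem, assuming a
    fully connected cascade network (each neuron connects to all inputs
    and all previous neuron outputs, plus a bias weight)."""
    patterns = 2 ** n_inputs          # one training pattern per input combination
    weights = 0
    prev = 0
    for _ in range(hidden + outputs):
        weights += 1 + n_inputs + prev    # bias + inputs + earlier neurons
        prev += 1
    return patterns * outputs, weights
```

For example, parity-7 already yields 128 Jacobian rows, so each iteration must process sixteen times as many patterns as parity-3.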
Figure 6.4: Results of the SA algorithm for a parity-5 problem with a fully connected
three-neuron network.
6.4 Surfaces
Another network was trained to solve a nonlinear control problem. The surface
was generated using the Matlab function peaks. The network was then trained, and
the figures were created using the nonlinear mapping software described in Section
4.3. Figure 6.9 shows the network that was trained to a mean squared error of 10^-4 using
NNT. The surface used for training is shown in Figure 6.10, and the output of the
trained network in Figure 6.11. The difference between the two figures is shown in
Figure 6.12 on the same axes to help put the error in perspective. Figure 6.13 shows
           Algorithm   Success Rate   Scale
Parity-3   NBN         79%            -
           ESA         94%            2.9
Parity-5   NBN         21%            -
           ESA         70%            2.5
Parity-7   NBN         <1%            -
           ESA         43%            4
Parity-9   NBN         17%            -
           ESA         59%            4

Table 6.3: Comparison of the NBN and ESA algorithms.
the error on a much smaller scale to illustrate the locations of the largest errors, which
are typically at the tips of the peaks and depressions.
Figure 6.5: Results of the ESA algorithm for a parity-3 problem with two neurons fully connected.
Figure 6.6: Results of the ESA algorithm for a parity-5 problem with a fully connected four-neuron
network.
Figure 6.7: Results of the ESA algorithm for a parity-7 problem with a fully connected four-neuron
network.
Figure 6.8: Results of the ESA algorithm for a parity-9 problem with a fully connected four-neuron
network.
Figure 6.9: The fully connected architecture used for the models in Figures 6.10 and 6.11.
Note that this architecture does not show each neuron's biasing weight, in order to
simplify the drawing.
Figure 6.10: The desired surface used for training.
Figure 6.11: The surface obtained from the output of the successfully trained network
in Figure 6.9.
Figure 6.12: The difference between Figures 6.10 and 6.11, plotted on the same scale
as the original figures.
Figure 6.13: The difference between Figures 6.10 and 6.11, plotted on a much smaller
axis to show details.
Chapter 7
Conclusion
This thesis presents a method that eases the process of creating, training, and
testing neural networks. Common training algorithms are compared, such as Error
Back Propagation and Levenberg-Marquardt. It is shown that EBP is easily
implemented to train networks but is not powerful enough for more difficult problems.
The LM algorithm is much more powerful, but it is more difficult to implement.
Currently there is very little software available to train fully connected networks. Fully
connected networks are shown to converge more often and faster than traditionally
connected networks. A solution for training fully connected neural networks with the
LM algorithm is presented.

The Neural Network Trainer is a training package that allows the user to easily
create different topologies for the same problem, train them, and make the appropriate
adjustments. NNT also allows the user to train networks with a variety of algorithms
and easily compare the results.

Included as an additional tool for NNT is software to map nonlinear surfaces.
This software is ideal for testing three-dimensional nonlinear control problems. It
allows the user to easily compare the ideal surface with the actual surface produced.
It simplifies this task by allowing the surfaces to be rotated simultaneously for easier
viewing.
The included algorithms were tested and the results compared to show the advantages
and disadvantages of each. The training results of NNT are shown solving
several different problems to demonstrate the versatility of the trainer.
Bibliography
[1] J. Martins, P. Santos, A. Pires, L. da Silva, and R. Mendes, \Entropy-based
choice of a neural network drive model," vol. 54, no. 1, pp. 110{116, Feb. 2007.
[2] H. Zhuang, K.-S. Low, and W.-Y. Yau, \A pulsed neural network with on-chip
learning and its practical applications," vol. 54, no. 1, pp. 34{42, Feb. 2007.
[3] B. K. Bose, \Neural network applications in power electronics and motor drives
an introduction and perspective," Industrial Electronics, IEEE Transactions on,
vol. 54, no. 1, pp. 14{33, February 2007.
[4] H. C. Lin, "Intelligent neural network-based fast power system harmonic detec-
tion," vol. 54, no. 1, pp. 43–52, Feb. 2007.
[5] B. Singh, V. Verma, and J. Solanki, "Neural network-based selective compen-
sation of current quality problems in distribution system," vol. 54, no. 1, pp.
53–60, Feb. 2007.
[6] W. Qiao and R. Harley, "Indirect adaptive external neuro-control for a series
capacitive reactance compensator based on a voltage source PWM converter in
damping power oscillations," vol. 54, no. 1, pp. 77–85, Feb. 2007.
[7] S. Chakraborty, M. Weiss, and M. Simoes, "Distributed intelligent energy man-
agement system for a single-phase high-frequency AC microgrid," vol. 54, no. 1,
pp. 97–109, Feb. 2007.
[8] B. Wilamowski, N. Cotton, O. Kaynak, and G. Dundar, "Method of computing
gradient vector and Jacobean matrix in arbitrarily connected neural networks,"
in Proc. IEEE International Symposium on Industrial Electronics ISIE 2007,
June 4–7, 2007, pp. 3298–3303.
[9] B. M. Wilamowski, N. Cotton, J. Hewlett, and O. Kaynak, "Neural network
trainer with second order learning algorithms," in Proc. International Confer-
ence on Intelligent Engineering Systems, June 29–July 1, 2007, pp. 127–132.
[10] L. da Silva, B. Bose, and J. Pinto, "Recurrent-neural-network-based
implementation of a programmable cascaded low-pass filter used in stator flux
synthesis of vector-controlled induction motor drive," vol. 46, no. 3, pp. 662–665,
1999.
[11] M. Embrechts and S. Benedek, "Hybrid identification of nuclear
power plant transients with artificial neural networks," vol. 51, no. 3, pp. 686–
693, 2004.
[12] M. Hagan and M. Menhaj, "Training feedforward networks with the
Marquardt algorithm," vol. 5, no. 6, pp. 989–993, 1994.
[13] T. J. Andersen and B. Wilamowski, "A modified regression algorithm for fast
one layer neural network training," in Proc. World Congress of Neural Networks,
vol. 1, July 1995, pp. 687–690.
[14] R. Sandige and B. Wilamowski, "Methods of removing single variable static
hazards in Boolean functions," vol. 38, no. 3, pp. 274–278, Aug. 1995.
[15] B. Wilamowski, D. Hunter, and A. Malinowski, "Solving parity-n problems with
feedforward neural networks," in Proc. International Joint Conference on Neural
Networks, vol. 4, 2003, pp. 2546–2551.
[16] B. Wilamowski, "Neural network architectures and learning," in Proc. IEEE
International Conference on Industrial Technology, vol. 1, Dec. 10–12, 2003, pp.
TU1–T12.
[17] J. Hewlett, B. Wilamowski, and G. Dundar, "Merge of evolutionary computation
with gradient based method for optimization problems," in Proc. IEEE Inter-
national Symposium on Industrial Electronics ISIE 2007, June 4–7, 2007, pp.
3304–3309.