METHODS FOR IMPROVING GENERALIZATION AND CONVERGENCE IN
ARTIFICIAL NEURAL CLASSIFIERS
by
Joel David Hewlett
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
December 12, 2011
Keywords: Neural Networks, Training Algorithms, Learning Machines, Pattern Classification
Copyright 2011 by Joel David Hewlett
Approved by
Bogdan M. Wilamowski, Chair, Associate Professor of Electrical Engineering
Thaddeus Roppel, Associate Professor of Electrical Engineering
Robert Dean, Assistant Professor of Electrical Engineering
Vitaly Vodyanoy, Professor of Veterinary Medicine
ii
ABSTRACT
Artificial neural networks have proven to be quite powerful for solving nonlinear
classification problems. However, the complex error surfaces encountered in such problems
often contain local minima in which gradient based algorithms may become trapped, causing
improper classification of the training data. As a result, the success of the training process
depends largely on the initial weight set, which is generated at random. Furthermore, attempting
to analytically determine a set of initial weights that will achieve convergence is not feasible
since the shape of the error surface is generally unknown.
Another challenge which may be faced when using neural classifiers is poor
generalization once additional data points are introduced. This can be especially problematic
when dealing with training data that is poorly distributed, or in which the number of data points
in each respective class is unbalanced. In such cases, proper classification may still be achieved,
but the orientation of the separating plane and its corresponding margin of separation may be less
than optimal.
In this dissertation, a set of methods designed to improve both the generalization and
convergence rate for neural classifiers is presented. To improve generalization, a single neuron
pseudoinversion technique is presented that guarantees optimal separation and orientation of the
separating plane with respect to the training data. This is done by iteratively reducing the size of
the training set until a minimal set is reached. The final set represents those points which lie on
the boundaries of the data classes. Finally, a quadratic program formulation of the margin of
iii
separation is defined for the reduced data set, and an optimal separating plane is obtained. A
method is then described by which the presented technique may be applied to nonlinear
classification by systematically optimizing each of the neurons in the network individually.
Next, a modified training technique is discussed, which significantly improves the
success rate in gradient based searches. To do this, the proposed method monitors the state of
the gradient search in order to determine if the algorithm has become trapped in a false
minimum. Once entrapment is detected, a set of desired outputs are defined using the current
outputs of the hidden layer neurons. The desired values of the remaining misclassified patterns
are then inverted in an attempt to reconfigure the hidden layer mapping, and the hidden layer
neurons are retrained one at a time. Linear separation is then attempted on the updated mapping
using pseudoinversion of the output neuron. The process is repeated until separation is achieved.
The second method is compared with other popular algorithms using a set of 8 nonlinear
classification benchmarks, and the proposed method is shown to produce the highest success rate
for all of the tested problems. Therefore, the proposed method does, in fact, achieve the desired,
which is to improve the rate of convergence of the gradient search by overcoming the challenge
presented by local minima. Furthermore, the resulting improvement is shown to have a relatively
low cost in terms of the number of required iterations.
iv
ACKNOWLEDGMENTS
I would like to thank my parents, Marvin and Patricia Hewlett, for their unwavering love
and support throughout my time in graduate school. I have taken great comfort in knowing that
whenever discouragement, doubt or uncertainty may arise, I always have someone to turn to for
encouragement and sound advice. They are a true blessing, and I love them both dearly.
I would also like to thank Dr. Wilamowski for everything he has done on my behalf.
Whenever I have needed guidance or assistance his door has always been open, and any
questions I may have had, no matter how trivial, have always been treated with patience and
sincerity. The knowledge I have gained from him has been an invaluable resource, and is
certainly not limited to the subject engineering. I am truly grateful and fortunate to have had such
a wonderful adviser, for I honestly don?t believe I could have made it this far without him.
v
TABLE OF CONTENTS
Abstract ..................................................................................................................................... ii
Acknowledgments ..................................................................................................................... iv
List of Figures ........................................................................................................................ viii
List of Tables ............................................................................................................................. x
List of Abbreviations ................................................................................................................. xi
Chapter 1 Introduction ................................................................................................................1
Chapter 2 Artificial Neural Networks ..........................................................................................3
2.1 Fundamental Concepts ..................................................................................................4
2.1.1 Biological Neuron .......................................................................................................4
2.1.2 Artificial Models.........................................................................................................6
2.2 Training ........................................................................................................................9
2.2.2 Single Neuron Techniques ..........................................................................................9
2.2.3 Error Backpropagation .............................................................................................. 13
2.2.4 LevenbergMarquardt ............................................................................................... 17
2.2.5 NeuronbyNeuron Method ....................................................................................... 23
2.3 Using ParityN Problems as A Benchmark for Performance ........................................ 28
2.3.1 The ParityN Problem ............................................................................................... 28
2.3.2 Parity2 Example ...................................................................................................... 29
vi
2.3.3 Comparison of Algorithm Performance..................................................................... 33
Chapter 3 Single Neuron Training Using Iterative PseudoInversion Techniques ...................... 37
3.1 PseudoInversion Training for Nonlinear Activation Functions ................................... 37
3.1.1 Derivation ................................................................................................................. 37
3.1.2 Improving Computational Efficiency ........................................................................ 38
3.1.3 Discussion ................................................................................................................ 39
3.2 Improved Generalization Using Active Set Pseudoinversion ...................................... 41
3.3 Maximizing the Margin of Separation ......................................................................... 48
3.3.1 Formulating a Margin Maximizing Objective ........................................................... 50
3.4 Improving Generalization in Full scale Networks ........................................................ 52
Chapter 4 Training Neural Classifiers using a Search of the Hidden Space ................................ 54
4.1 Overview .................................................................................................................... 54
4.2 Algorithm Description ................................................................................................ 55
4.2.1 Weighted PseudoInversion Training ........................................................................ 58
4.3 Detecting Entrapment.................................................................................................. 61
4.4 Avoiding redundant feature Selection .......................................................................... 63
4.5 Graphical Representation of the Training Process ....................................................... 64
4.6 HLPI Training Example .............................................................................................. 67
4.7 Experimental Results .................................................................................................. 73
Chapter 5 Conclusion ................................................................................................................ 78
Bibliography ............................................................................................................................. 81
Appendix A MATLAB Code for the PseudoInversion Techniques ........................................... 84
vii
Appendix B MATLAB Code for the HLPI Algorithm ............................................................... 90
viii
LIST OF FIGURES
Figure 2.1: Schematic diagram of a biological neuron. ................................................................4
Figure 2.2: General form of the artificial neuron model. ..............................................................7
Figure 2.3: A comparison of common activation functions. .........................................................8
Figure 2.4: Comparison of the Newton and gradient search directions. ...................................... 20
Figure 2.5: Pseudocode for the learning parameter update. ....................................................... 23
Figure 2.6: Example network for using NBN topology description. ........................................... 25
Figure 2.7: Pseudocode for the NBN forward calculation phase. .............................................. 26
Figure 2.8: Pseudocode for the NBN Jacobian calculation........................................................ 27
Figure 2.9: Graphical representation of the parity2 problem. .................................................... 30
Figure 2.10: The two neuron cascade used to solve the parity2 problem. .................................. 31
Figure 2.11: Hidden layer separating plane for the parity2 problem. ......................................... 32
Figure 2.12: Graphical representation of the augmented XOR data set....................................... 33
Figure 3.1: Nonlinear pseudoinversion. .................................................................................... 41
Figure 3.2: Separation of an unbalanced and poorly distributed data set. ................................... 41
Figure 3.3: Separation using pseudoinversion with linear activation function. .......................... 44
Figure 3.4: Eight iterations using ASPI. Patterns in the active set are shown in blue. ................. 45
Figure 3.5: An illustration of error reduction as a function of gain. ............................................ 47
Figure 3.6: Multiple solutions for under determined systems. .................................................... 49
Figure 4.1: Flow chart of the HLPI algorithm. ........................................................................... 56
ix
Figure 4.2: Pseudocode for the weighted pseudoinversion technique....................................... 60
Figure 4.3: Output image for the parity3 problem trained with EBP. ........................................ 67
Figure 4.4: Output image for the parity4 problem trained using the proposed method. ............. 68
Figure 4.5: Single layer bridge architecture used to solve the parity4 problem. ......................... 69
Figure 4.6: Training surface for the checker3 problem. ............................................................ 74
x
LIST OF TABLES
Table 2.1: Truth table for the parity2 problem. ......................................................................... 29
Table 2.2: Truth table for the parity2 problem. ......................................................................... 32
Table 2.3: ParityN performance of the EBP algorithm. ............................................................. 34
Table 2.4: ParityN performance of the LM algorithm. .............................................................. 35
Table 4.1: Performance comparison for the proposed method. ................................................... 75
Table 4.2: Algorithm parameters. .............................................................................................. 77
xi
LIST OF ABBREVIATIONS
ANN Artificial Neural Network
RBFN Radial Basis Function Networks
SVM Support Vector Machines
BNN Biological Neural Network
EBP Error BackPropagation
LM LevenbergMarquardt
NBN NeuronbyNeuron
LMS Least Mean Squares
ASPI Active Set PseudoInversion
QP Quadratic Program
HLPI Hidden Layer PseudoInversion
1
Chapter 1
INTRODUCTION
There are a number of well developed strategies for solving classification problems.
These include, but are not limited to, artificial neural networks (ANN)[1], radial basis function
networks (RBFN)[2] and support vector machines (SVM)[3] . These methods are especially
useful for solving problems which are nonlinearly separable. Although the specifics of the
various methods differ, they operate on the same basic principle known as Cover?s theorem [4].
The theorem states that a classification problem which is not linearly separable in the input space
may become separable when cast into a higher dimensional space. From this it is evident that the
primary challenge when solving a nonlinear classification problem is not in the separation itself,
but rather in choosing a linearly separable mapping. While RBFN and SVM methods take direct
advantage of this, ANN methods generally do not. Instead, the nonlinear mapping and separation
are combined into a single optimization problem which is most often solved using a least squares
gradient approach. Though this method has proven successful, it is not necessarily the most
efficient solution.
Each neuron in an ANN can be categorized as either a hidden unit or an output unit. Each
output unit serves as a linear classifier whose inputs are supplied by the hidden units that
together form a nonlinear mapping from the input space. In addition, for a bridged or fully
connected architecture[5], the mapped space is guaranteed to be of higher dimension than the
input space. Therefore, it follows from Cover?s theorem that the likelihood of achieving
2
separation in the output layer is much greater in this higher dimensional space than in the
original input space. Viewing the network in this way gives rise to an interesting and useful
observation: if the output layer is solely composed of linear classifiers, then the primary
challenge for nonlinear classification must lie in the training of the hidden units. This fact
suggests that training methods designed to focus more heavily on the hidden layers than on the
output layer might represent a more efficient and effective approach, which is the goal of the
work presented in this dissertation.
In Chapter 2, a general overview of ANNs is presented. This includes the biological
inspiration for the artificial neuron, as well as the computational details of the artificial models.
An overview of some common training methods is also included, with a short discussion and
comparison of the first and second order search directions. In Chapter 3, a set of pseudo
inversion techniques for performing linear classification with a single neuron are discussed. The
strengths and weaknesses of the various methods are discussed, and an example case is used to
illustrate the performance characteristics of each. Optimal linear separation is also defined. Next,
a two phase method is presented that makes direct use of Cover?s theorem via a periodic search
of the hidden space to achieve separation at the output. The method seeks to escape entrapment
by individually retraining the hidden layer neurons using the pseudoinversion techniques
described in Chapter 3. The performance of the proposed method is also discussed, using the
methods from Chapter 2 for comparison. Finally, a short conclusion is offered in Chapter 5.
3
Chapter 2
ARTIFICIAL NEURAL NETWORKS
An artificial neural network is a biologically inspired computational model composed of
a collection of interconnected functional units known as neurons[1]. The behavior of a network
is characterized by the strengths of the interconnections, which are represented by multiplicative
constants known as weights. Together, the weights of a network form a multidimensional search
space known as a weight space. Networks are designed to perform specific tasks by
systematically searching the weight space for those values which produce the best fit for a
desired data set; or in the case of dynamic networks, yield equilibrium points which act as a form
of memory. This search process is commonly referred to as training and the search methods
themselves are known as training algorithms.
There are a number of different classes of networks that are most often differentiated by
either the characteristics of their neurons or the structure of their interconnections. In non
dynamic networks, where the weights define a nonlinear mapping from the input space to the
output space, the interconnections between neurons are only made in the forward direction.
These are known as feedforward networks. Allowing connections in the reverse direction, e.g.
feedback, may cause instability or high frequency oscillation in the network; however, damping
such systems via integration can produce stable dynamic behaviors which have proven useful for
some applications.
4
2.1 FUNDAMENTAL CONCEPTS
To better understand the details of the training techniques described later, a short
overview of some of the underlying concepts of ANNs are introduced first.
2.1.1 BIOLOGICAL NEURON
Because artificial neurons are loosely modeled on the nerve cells of living organisms [6],
a basic understanding of biological neural networks (BNN?s) may offer a useful starting point for
understanding their artificial counterpart. Many aspects of the operation of BNN?s are still not
fully understood, and remain the subject of ongoing research. Therefore, the discussion presented
here is limited only to the basic operation of the nerve cells that make up these networks rather
than the operation of the networks as a whole.
Cell Body
Dendrites
Axon
Axon
Terminals
Signal Flow
Figure 2.1: Schematic diagram of a biological neuron.
5
Biological neurons are elementary nerve cells which act as the fundamental building
blocks of BNN?s. A schematic diagram is shown in Figure 2.1. The typical neuron has three
major parts:
1. Cell Body: The cell body is surrounded by a thin membrane which is capable holding an
electrical potential, and the mechanisms which dictate the neurons responses to stimuli
are housed inside with the nucleus.
2. Dendrites: The dendrites are thin branchlike fibers protruding from the cell body that
form what is known as the dendritic tree. Electrical impulses from neighboring neurons
are received by the dendrites, which act as inputs for the neuron.
3. Axon: The axon is a long cablelike projection that carries impulses generated in the cell
body, and may extend up to hundreds or even thousands of times the diameter of the cell
body. The axon?s terminals act as the outputs of the neuron.
Each neuron receives impulses from neighboring cells via its dendrites, which build a cumulative
potential on the cell membrane. Once the magnitude of this potential reaches a specific level,
known as the threshold, the neuron fires its own impulse, which is then transmitted out through
the axon terminals to be received by other neurons.
The strengths of a neuron?s interconnections vary, making it more sensitive to some
neurons than others. This variance on connection strength combined with the structure of the
interconnections is what determines the behavior of the network as a whole. Since the cell
structure remains unchanged of over the lifetime of the neuron, the learning process relies almost
entirely on the manipulation of these interconnections.
6
2.1.2 ARTIFICIAL MODELS
Artificial neurons are loosely modeled after their biological counterparts[6]. In reality,
the artificial neuron is little more than a multivariate function with weighted inputs. Still, despite
their relative simplicity, ANNs have proven to be quite powerful[7], and like their biological
analogs, their behavior is entirely dependent on the structure and strength of their
interconnections. Because the neurons that make up the ANN are identical in form, they are
highly modular systems, which is one of their most attractive aspects.
Although a number of different artificial models have been devised[1], most share the
same general form, which is shown in Figure 2.2. The mathematical description for models of
this form is
=(?), where
=
+
.
(2.1)
The values
are known as the neuron?s inputs, and may be supplied by input pattern values or
by the outputs of other neurons, depending on the unit?s location in the network. Each input
is
scaled by a corresponding weight denoted by
, and the threshold of the neuron is controlled by
the biasing weight
. The weighted sum of all inputs is commonly referred to as the value
of the neuron, and is analogous to the cumulated potential on the cell membrane of the biological
neuron. The function (?) is known as the activation function, and plays the most important role
in the behavior of the system.
7
w
1
w
3
Figure 2.2: General form of the artificial neuron model.
ACTIVATION FUNCTIONS
The activation function is a mathematical mapping that describes the relationship
between the weighted sum of the neuron?s inputs and its output. A number of different functions
have been proposed. A selection of these, shown in Figure 2.3, is discussed here.
The earliest of activation functions used the binary valued sign operator[8]. That is,
=sign().
(2.2)
Activation functions of this type are referred to as ?hard? activation functions due to their sharp
discontinuity. One of the major drawbacks to using functions of this type is the fact that they are
not differentiable. This lack of differentiability means that they cannot be trained using gradient
based methods.
8
Figure 2.3: A comparison of common activation functions.
To overcome the training difficulties faced when using hard activation function, a
number of differentiable functions were introduced. The simplest of these is the linear activation
function. Technically speaking, for the general model given by (2.1), the linear activation
function is not an activation function at all. Instead, the output value is equal only to the gain
multiplied by the net value, with the activation function (?) removed entirely. Therefore, for the
linear model,
=?.
(2.3)
Although the linear model is continuously differentiable, it is not a suitable choice for
solving problems binary desired outputs. Simply saturating the linear function at ?1 will not
work either since the resulting function would not be differentiable in the saturated region. To
solve this problem, a sigmoidal function is used, which saturates asymptotically, allowing it to
maintain its differentiability. The most common of these functions is the tangent hyperbolic,
9
=tanh(?).
(2.4)
Another commonly used function is the Gaussian. The Gaussian activation function is
most useful for nonlinear classification and surface fitting problems. The general form of this
function is
=
(?)
(2.5)
Gaussian functions are also the basis for an entire class of ANNs known as radial basis function
networks, however the form of the RBF neuron model differs from that of (2.1) in the way that
the net value is defined.
2.2 TRAINING
In the field of ANN?s, the process of determining the proper set of weights for achieving
a desired mapping is known as training. This task is commonly formulated as an optimization
problem involving the minimization of the output error over the network?s weights. That is,
?
=min
().
(2.6)
where is a vector containing the weights if the network, is the total error of the network as a
function of , and
?
is the weight vector that minimizes the value of (). As a result, a
number of optimization methods have been adapted for this purpose. In this section, some of the
more notable methods are presented.
2.2.2 SINGLE NEURON TECHNIQUES
Some of the simplest and earliest training methods were developed for optimizing the
weights of single neurons. However, despite their simplicity, many of these methods played an
10
influential role in the development the generalized algorithms that came later. The methods
presented in this section form the foundation for many of the techniques discussed in later
chapters.
DELTA LEARNING RULE
The delta learning rule is a steepest descent method used to train individual neurons for
linearly separable classification problems[9]. As a steepest descent method, the delta learning
rule operates using the gradient of the output error with respect to the weights. Here, the output
error is defined as the one half the square of the difference between the desired output and the
observed output for a given pattern. That is,
=(
?
)
,
(2.7)
where
is the desired output for input pattern , and
is the observed output. The values of the
observed outputs are calculated using
=(
?),
(2.8)
where (?) is the neuron?s activation function,
is
input vector, and is the weight vector.
To determine the gradient of the error, the partial derivatives of
with respect to the weights
must be found. Therefore, substituting (2.8) into (2.7), the partial derivatives of the error are
determined by
=?(
?
)
,
(2.9)
where
is the slope of the activation function for pattern ,
is the
element of the weight
vector, and
is the corresponding input. Together, the partial derivatives computed using (2.9)
form the gradient,
11
?
=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
,
(2.10)
where is the total number of weights. Finally, using (2.10), a steepest descent update rule is
defined,
?=??
(2.11)
The scalar is known as the learning constant, and is used to control the step size.
PSEUDOINVERSION FOR LINEAR ACTIVATION FUNCTIONS
Pseudoinversion training is a regression technique used to solve for the weights of a
single neuron[10]. Suppose a single neuron is to be used to classify a set of training patterns
with dimension , and let
be the
input of the
pattern, where =1,?, and =
1?. The squared errors for the individual training patterns can then be written as
=[
?(
+
+?+
)]
=[
?(
+
+?+
)]
?
=[
?(
+
+?+
)]
,
(2.12)
where
are the weights associated with each of the neuron?s inputs. Defining the total error
as the sum of the squared errors for all patterns yields
=
.
(2.13)
Using (2.13), the training process can be described as the following optimization problem:
12
min
.
(2.14)
There is, however, an alternative formulation of the problem which allows for a simpler
and more direct solution. The method makes use of the property that at the minimum of the error
function, the components of the gradient must all be equal to 0. This leads to the following set of
equations and unknowns:
()
=?2[
?
]
=0
()
=?2[
?
]
=0
?
()
=?2[
?
]
=0,
(2.15)
where
and
are the respective values of the desired and actual outputs for input pattern , and
? is the derivative of the activation function. Assuming the activation function is linear and has a
gain of 1 helps to simplify things by making the ? terms in (2.15) equal to 1, and thereby
allowing them to be ignored. Now, (2.15) can be rearranged to yield a set of linear equations in
matrix form.
?
?
?
?
?
?
?
?
?
?
?
?
? ? ? ?
?
?
?
?
?
?
?
?
?
?
?
?
?
=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
(2.16)
13
Defining X as the ? matrix of input patterns, w as the ?1 vector of weights, and d as the
?1 vector of desired outputs, (2.16) can be solved for w and expressed in more concise matrix
form as
=(
)
.
(2.17)
2.2.3 ERROR BACKPROPAGATION
Error backpropagation (EBP) is a supervised learning method which is a multineuron
generalization of the delta learning rule described in section 2.2.2 [11]. The method is primarily
intended for feedforward networks, and is one of the most widely used training algorithms. The
method calculates the gradient of the error surface by using the chain rule for differentiation to
exploit the structure of feedforward networks. The resulting gradient is then used in a simple
steepest descent update rule.
Since the algorithm utilizes the gradient of the error, it is only applicable for networks
whose neurons have differentiable activation functions. Therefore, it cannot be used for networks
with hard activation functions. Still, it can be used to design such networks by using sigmoid
functions during training and then replacing them with the sign operator once training is
complete.
DERIVATION
Before moving on to the derivation, some useful assumptions can be made. First, assume
the network architecture is such that the hidden units are in layers with no cross layer
connections, and the inputs of the neurons in a given layer are connected to the outputs of all
neurons in the previous layer, with the exception of the input layer, whose inputs are the
individual elements of the training patterns. Also assume that there is only one training pattern in
14
the data set, and that all neurons have sigmoid activation functions. This will greatly simplify
the notation moving forward.
The following notation will be used:
=
input to
neuron,
= weight for
,
= net value x
w
for neuron ,
= output for unit ,
= desired output for unit .
= activation slope for unit .
In order to determine the direction of steepest descent, the gradients of the errors for each of the
output units is required. Since the process is the same for each output, only one output is
assumed for the purpose this derivation. Therefore, the goal is to determine the values of the
partial derivatives of the total error with respect to the each of the weights. To begin, note that
because
is a function of
regardless of the location of unit in the network, then by the
chain rule,
=
?
=
(2.18)
Since the differential term on the right side is the same for all input weights of unit , it can be
denoted as
for short.
Now, consider the case in which unit is in the output layer. In this case, the total error is
defined as
=
1
2
?
(2.19)
Thus,
15
=
,
= ?
?
.
(2.20)
Substituting this into (2.18) yields
=
=?
?
.
(2.21)
Next, the case in which neuron is in the hidden layer must be considered. From here, the
derivation follows a similar line of reasoning as before, only this time the two following
observations should be made:
1. Since the direction of signal flow is from the input layer to the output, for hidden unit ,
all units whose inputs are directly connected to the output of can be referred to as being
downstream. Using this terminology, for each neuron downstream from ,
is a
function of
.
2. For all units ? in the same layer as , the contribution to the total error is independent
of
.
As before, the goal is to determine
, only this time is a hidden unit rather than an output
unit. Notice that the hidden input weight
only effects the value of
, and
only
influences the value of
. Furthermore,
effects the values of
for all downstream of ,
all of which influence the total error . So, once again applying the chain rule,
=
?
?
?
,
?
=
?
?
?
?
.
(2.22)
16
Again, as was the case for the output layer, all the terms in (2.22) except
are the same
for all input weights of unit , and can be combined to form the single term
. Also note that the
term
=
,
=
and
=
. Therefore, by substitution,
=
?
?
,
?
=
?
,
=
?
?
.
(2.23)
Now, using the above notation and definitions, the backpropagation algorithm can be stated
formally.
BACKPROPAGATION ALGORITHM
Let be a learning constant which controls the step size, and let
,
and
be the
number of inputs, hidden units and output units respectively. Also let the training data be
denoted by the input vectors
and the desired output vectors
, for =1,2,?,, where is
the number of training patterns in the data set. The training process is implemented by the
following steps.
1. Apply the training pattern
to the input of the network and compute the
corresponding output vector
.
2. For each unit in the output layer, calculate
= ?(
?
)
.
3. For each hidden neuron ?, compute the values of
=
?
?
.
17
4. For each
, compute the corresponding partial derivative of the error using
?
=
.
(2.24)
5. Update the weights using the gradient descent rule
=
+?
(2.25)
6. Repeat steps 15 until all input patterns have been applied.
7. Repeat steps 17 until the total error becomes sufficiently low.
The method presented above uses an incremental update, meaning that the patterns are applied
one at a time and the weights are updated for each pattern. There is an alternative method by
which the weights are simultaneously updated for all patterns at the same time. This is known as
a cumulative update. The derivation of the latter is not presented here, however, the
backpropagation process is very similar, and the derivation follows the same basic logic as that
of the incremental version.
2.2.4 LEVENBERGMARQUARDT
The LevenbergMarquardt (LM) algorithm is a second order algorithm whose update rule
is a combination of both the Newton method and the method of steepest descent[12][13]. As with
the backpropagation algorithm, the forward calculations are made first, after which the resulting
outputs are used to define the error. The error is then propagated back through the network to
determine the partial derivatives of the error with respect to the system?s weights. The resulting
partial derivatives are used to form the Jacobian matrix, which is in turn used to approximate the
Hessian as well as calculate the gradient of the total error. An adjustable learning parameter is
then used to control the contributions of the gradient and Newton directions to the direction of
18
the final step. A short summery and comparison of these two search directions is presented in the
next section.
NEWTON AND GRADIENT SEARCH DIRECTIONS
The most obvious of all search directions is the gradient direction ??
, also known as
the direction of steepest descent. As the name implies, ??
represents the direction in which the
objective function () decreases most rapidly, thereby making it a natural choice. Although it is
sometimes referred to as the gradient direction, the steepest descent direction actually points in
the opposite direction of the true gradient, which corresponds to the direction of greatest
increase. The true gradient is, by definition, the vector of all first partial derivatives of the
objective function or, more formally,
?
()?
?
?
?
?
?
?
?
?
?
?
?
, where x=(
,
,? ,
)
(2.26)
In optimization, where the goal is to find the point at which a given function yields the lowest
possible value, knowing the direction in which that function changes most rapidly is of obvious
value, making the gradient a particularly useful tool.
Another commonly used search direction is the Newton direction[14]. Though it is more
difficult to compute than the gradient, it provides significantly more information about the local
behavior of the objective, which in turn yields a much higher rate of convergence. This is
achieved through the use of a secondorder model
(
) of the objective function, derived
from its Taylor series approximation,
(
+
)?
+
?
+
1
2
?
?
(
).
(2.27)
19
From this, the Newton direction is defined as the vector
which minimizes the quadratic model
(
). Assuming for now that ?
is positive definite, it is possible to solve for
by setting
(2.27) equal to zero. Doing so yields the following explicit formula for the Newton direction:
=?(?
)
?
.
(2.28)
Whereas the steepest decent direction revolves around the computation of the gradient,
the Newton direction relies primarily on the calculation of the Hessian ?
, which is a matrix
containing all second partial derivatives of the objective function. That is,
?
()?
?
?
?
?
?
?
?
?
?
? ? ? ?
?
?
?
?
?
?
?
?
, where =(
,
,? ,
)
(2.29)
As alluded to earlier, the computation of the Hessian can be a rather expensive operation, which
is one of the Newton method's major drawbacks. Still, the benefits of this approach are generally
considered to outweigh the costs.
In addition, a number of strategies have been devised which serve to reduce the
computational requirement by replacing the true Hessian matrix with a close approximation that
does not need to be fully reevaluated at each iteration. Search directions which operate in this
way are commonly referred to as quasiNewton methods. Some of the more common
implementations include the Broyden, Fletcher, Goldfarb and Shanno (BFGS) algorithm[15],
and the closely related Davidon, Fletcher and Powell (DFP) algorithm[16].
20
Contours
of f(x)
Contours
Contours
of m
k
Contours
of
x
*
?
N
x
k
)(xf
NewtonNewton
GradientGradient
Figure 2.4: Comparison of the Newton and gradient search directions.
Another attractive attribute of many quasiNewton methods is that they can be
reformulated to operate on the inverse of the approximated Hessian instead of on the
approximation itself, which alleviates the burden of performing costly matrix inversions when
solving (2.28). This ability can prove especially advantageous on problems with high
dimensionality.
A comparison of the two steps is shown in Figure 2.4. The lines shown in grey represent
the contours of the objective function (), while those in orange are the contours of the second
order model
. The points x
and x
?
represent the location of the current iterate and the
minimum of the objective respectively. The vector s
corresponds to the Newton step, and ??
represents the gradient.
21
It is clear from Figure 2.4 that the additional information provided by the second order
Newton approximation yields greater improvement in the objective. However, it is also evident
that neither of the two search directions is optimal. Instead, the optimal direction appears to lie
somewhere between the two. This would suggest that some combination of the two may produce
the best result, which is exactly what the LM algorithm seeks to do by actively varying the
learning parameter mentioned in the previous section.
ALGORITHM DESCRIPTION
The primary difference between the backward computations in LM and those of EBP is
the necessity for determining the elements of the Jacobian rather than the total error gradient.
Because the LM step is formulated as a least squares problem, the Jacobian is needed, which
contains the first partial derivatives of the errors associated with the individual patterns rather
than the total error used in EBP. Aside from this, the approach uses the chain rule of
differentiation in the same manner as EBP. The resulting Jacobian matrix is of the form
=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
? ? ?
?
? ? ?
?
?
? ? ?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
,
(2.30)
22
where
is the error at the
output for the
input pattern, is the number of training
patterns, and is the number of outputs.
Once the Jacobian matrix has been constructed, the only pieces missing are the gradient
and the Hessian matrix, which contains the second partial derivates of the error with respect to
the weights. In the case of the latter, it is possible to avoid the cost of computing these values
directly by using a convenient approximation in place of the actual Hessian matrix. The
convenience arises from the fact that the approximation can be computed using only the Jacobian
matrix, which is already available. That is,
?
.
(2.31)
Replacing the Hessian with (2.31) results in a quasiNewton formulation of the LM algorithm.
Similarly, the gradient vector can also be calculated using the Jacobian. This is
accomplished using the following formula:
=
.
(2.32)
where is a vector containing the error for each pattern at each output, and has length equal to
the product of the number of patterns and the number of outputs.
In general, the LM step is computed in the following manner:
?=?(+)
.
(2.33)
Therefore, substituting (2.31) and (2.32) into (2.33), the following formula is obtained for the
weight update rule:
=
? (
+)
(2.34)
where is the identity matrix of dimension +.
23
The learning parameter in (2.34) is used to control the contributions of the Newton and
gradient search directions to the overall step. For smaller values of , the algorithm behaves
more like a first order gradient descent method, whereas larger values cause the algorithm to
behave more like the second order Newton method. During each iteration, the parameter is
varied and the value which produces the greatest reduction in the objective is used in the final
step. The pseudocode in Figure 2.5 details the process used to select new values of .
Figure 2.5: Pseudocode for the learning parameter update.
2.2.5 NEURONBYNEURON METHOD
The neuronbyneuron (NBN) method is a method for calculating the partial derivatives of the
error with respect to the weights in arbitrarily connected feedforward neural networks[17]. This
is in contrast to the traditional backpropagation method which is only applicable for the standard
multilayer perceptron architectures with no crosslayer connections. The ability to handle
arbitrarily connected networks offers a significant advantage since the introduction of crosslayer
connections can greatly reduce the number of neurons required to achieve specific nonlinear
mappings and classifications, as well as diminishing the likelihood of overfitting.
E0 = totalerror();
= *10;
E1 = totalerror();
for i = 1:5
if E1 <= E0
= *10;
else
= /10;
end
E0 = E1;
E1= totalerror();
end
24
TOPOLOGY REPRESENTATION
The NBN method makes use of a netlist style notation for describing the network
topology. The order of the listing is used to determine the order in which the reverse calculations
are to be made. A simple example will be used to explain the syntax and format of the listing.
Take, for example, the network shown in Figure 2.6. Notice that each node in the
network is individually numbered. There are 3 input nodes (1,2,3), 2 hidden nodes (4,5) and 2
output nodes (6,7). Also note that the lowest numbers are given to the input nodes, while the
higher numbers are reserved for the output nodes. The hidden nodes are labeled with the
interstitial numbers. The convention is that for any neuron in the network, the numbers
associated with the nodes connected at its input are lower than those of its output. For this
particular network, the topology would be described by the following listing:
N 4 model1 1 2 3
N 5 model2 1 2 3
N 6 model3 1 2 3 4
N 7 model4 1 2 3 5
Each line beginning with ?N? describes the connections of a single neuron. The first
number represents the output node of the neuron described by that line. Next, the neuron model
is supplied. Each model has an activation function and gain value associated with it which
describe how the neuron output is be computed. The remaining numbers represent the nodes
connected to the neuron?s input. The neurons should be arranged so that the output nodes are in
ascending order when moving down the list.
25
Figure 2.6: Example network for using NBN topology description.
FORWARD COMPUTATION
Before the partial derivatives of the error can be calculated, the values at all nodes must
be known. Thus, the first phase of the NBN method involves making the forward computations
for each input pattern, and determining the corresponding values at all nodes in the hidden and
output layers. These values are then stored in matrix from, allowing them to be easily retrieved
during the backpropagation phase. The basic process for the forward calculation phase is
presented in pseudo code form in Figure 2.7.
for each pattern (p=1:P)
for each neuron (n=1:N)
net=0;
for each input i
net=net+w(i)*node(p,ndx(i,n));
end
node(p,n)=act(net);
der(p,n)=1node(p,n)^2;
end
for each output (o=NO:N)
err(p,o)=dout(p,o)node(p,o);
end
end
26
Figure 2.7: Pseudocode for the NBN forward calculation phase.
The variables and constants in Figure 2.7 are defined as follows:
? P is the number of input patterns
? N is the number of neurons
? O is the number of output nodes
? node(p,n) is a ? 2dimensional array containing the node values for all patterns at all
nodes.
? w(i) is the weight associated with neuron input i
? ndx(i,n) is an integer array which stores the indices of the input nodes for each neuron.
? der(p,n) is a ? 2dimensional array containing the derivatives of the activation
function of all neurons for all patterns. These values will be needed for the Jacobian and
gradient calculations.
? dout(p,o) is a ? 2dimensional array containing the desired outputs for all input
patterns at each node in the output layer.
? err(p,o) is a ? 2dimensional array containing the error associated with each pattern
at each node in the output layer.
The forward calculation routine returns three arguments for each input pattern: the
resulting error (err) at each output, the derivatives of the activation functions (der) for all
neurons, and the output value (node) of each node in the network. Therefore, once the forward
calculation routine is complete, all information needed for calculating the error gradient and
Jacobian matrix is available.
27
JACOBIAN COMPUTATION
Since the error gradient can be calculated from the Jacobian using (2.32), the NBN
routine for calculating it directly has been omitted, and instead, only the Jacobian routine is
included in the present discussion. When used with first order methods this method of
calculating the gradient does require more memory; nevertheless, for second order methods in
which the Jacobian must be computed anyway, using (2.32) is actually more efficient because it
removes the additional computation time required to calculate the gradient directly. The Pseudo
code for the Jacobian calculation is presented in Figure 2.8.
Figure 2.8: Pseudocode for the NBN Jacobian calculation.
for each pattern (p=1:P)
for each output (o=NO:N)
errnode(p,o,n)=err(p,o);
for each neuron (n=N:1:1)
delta(p,o,n)=der(p,n)*errnode(p,o,n);
for each input i
errnode(p,o,ndx(i,n))=
errnode(p,o,ndx(i,n))+delta(p,o,n)*w(i);
end
end
end
end
for each pattern (p=1:P)
for each output (o=NO:N)
for each input i
J(n,i)=delta(n)*node(p,ndx(I,n));
end
end
end
28
The variables and constants in Figure 2.8 are defined in the same way as in Figure 2.7, with three
additions:
? errnode(p,o,n) is a ???1 array containing the sum of the error at the output of
each neuron over all patterns.
? delta(p,o,n) is a ???1 array containing the sum of the error at the inputs of each
neuron over all patterns.
? J(n,i) is the ?? Jacobian matrix, where M is the number of weights in the network.
2.3 USING PARITYN PROBLEMS AS A BENCHMARK FOR PERFORMANCE
In this section, the classic parityN problems are introduced and used as a performance
metric for the comparison of the EBP and LM training. The algorithms are used to solve a set of
six problems using cascade architectures of varying size. For each test, performance is measured
based on the overall success rate and the average number of iterations required for convergence.
2.3.1 THE PARITYN PROBLEM
The parityN problems are Ndimensional binary classification problems, which are
commonly used as performance benchmarks for neural classifiers[18][19]. The problems are
notoriously difficult to solve due to their high degree of nonlinearity, and the difficulty is known
to increase significantly with respect to dimensionality.
In the general parityN case, the training patterns are comprised of all Nbit binary
numbers from 0 to 2
?1. The desired output for each pattern is assigned based on the number
of input bits that have values equal to +1. The rules for this assignment are as follows:
? If the number of input bits equal to +1 is even, a desired output of 1 is assigned.
? If the number is odd, a desired output of +1 is specified instead.
29
In addition to the inherent difficulty, another attractive aspect of the parityN problems is
the existence of analytical solutions [20]. The benefit of the analytical solutions is that the
minimum number of neurons required for a given architecture can be determined prior to testing,
which provides a trustworthy baseline for comparison. Without this information, it would be
impossible to determine if tests which produce a 0% convergence rate were the result of the
failure of the algorithm or simply the nonexistence of a solution for the chosen network.
Furthermore, the difficulty of the problem for a particular architecture is inversely related to the
size of the network. Therefore, knowing the minimum number of neurons for a given network
structure offers a worst case scenario for testing.
2.3.2 PARITY2 EXAMPLE
Take, for example, the parity2 problem. Since =2, the training data is defined as the
set of 2bit binary values from 0 to 3, and the 4 corresponding output values are chosen
according to the rules defined in the previous section. The resulting truth table is shown in Table
2.1. One may also notice that for the 2dimensional case, the parity problem is equivalent to the
familiar XOR operator used in digital logic.
Table 2.1: Truth table for the parity2 problem.
1 1 1
1 1 1
1 1 1
1 1 1
For classification problems, the weights of a neuron can be viewed as defining a plane of
separation for the input patterns. For a single neuron, this means that proper classification of a
given data set depends on the linear separability of the data. For data which is not linearly
30
separable, additional neurons are required. In the case of the parity2 problem, the separability
can be determined visually by plotting the data on a 2dimensional plane using the symbols ?
and o to represent desired outputs of +1 and ?1 respectively, as shown in Figure 2.9.
1
x
2
x
Figure 2.9: Graphical representation of the parity2 problem.
It is evident from Figure 2.9 that no single plane of separation will produce the desired
classification. This confirms that the data set in Table 2.1 is in fact not linearly separable. To
separate the data, the input patterns must first be mapped into a higher dimensional space
through the addition of a hidden layer. To do this, the two neuron cascade architecture shown in
Figure 2.10 is chosen.
31
Figure 2.10: The two neuron cascade used to solve the parity2 problem.
The next step is to determine a set of weight values for the hidden layer neuron that
produce a separable mapping at the output layer. To do this, the separating plane shown in Figure
2.11 is chosen. From this, the value of
can be computed using the slope intercept form
equation
=?
+
.
(2.35)
From Figure 2.11, it is apparent by inspection that the slope of this line is equal to 1. Therefore,
form (2.35),
must be equal to one. It is also clear from (2.35) that the plane passes through the
points (1,0) and (0,1). Since the net values for the neuron are equal to 0 for all points which lie
on the separating plane, plugging these values into the net equation and setting it equal to zero
results in a set of two equations with two unknowns. That is,
1?1
+0
= 0, and
1+0
?1
= 0.
(2.36)
Solving the system in (2.36) for
and
yields the following weight vector for the hidden
neuron:
32
w=
1
1
1
.
(2.37)
Figure 2.11: Hidden layer separating plane for the parity2 problem.
Now, the output of the hidden neuron is calculated for each of the patterns in Table 2.2,
and a new truth table is defined with respect to the output layer. The augmented data set is shown
in 0.
Table 2.2: Truth table for the parity2 problem.
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
The data in Table 2.2 can be visualized in a similar manner as before, only this time three
dimensions are required due to the addition of the hidden layer output values. This is illustrated
in Figure 2.12.
33
Figure 2.12: Graphical representation of the augmented XOR data set.
It should be apparent from Figure 2.12 that, using the augmented data set provided in Table 2.2,
the input patterns can now be linearly separated by the remaining neuron in the output layer.
2.3.3 COMPARISON OF ALGORITHM PERFORMANCE
In this section, the performance of the previously discussed EBP and LM algorithms is
compared using the set of parity problems from =2 to =7. Each problem was attempted
using fully connected cascade architectures with sizes ranging from 2 to 6 neurons for parity 2
and 3, and 3 to 6 neurons for parity 4 through 7, resulting in a total of 26 test cases per algorithm.
The performance for each test case was measured based on the overall success rate and average
number iterations required for convergence after 100 attempts. The EBP and LM algorithms
were allowed respective maximums of 100,000 and 10,000 iterations per training attempt, with
the success rate defined as the number of attempts that reached an average total error of 0.001
per pattern within the allotted number of iterations.
34
The test results for the EBP algorithm are shown in Table 2.3. Training was done using
the following parameters.
? Learning Constant (): 0.7
? Activation Function: TangentHyperbolic (Bipolar)
? Neuron gain: 0.1
It is evident from the table that difficulty presented by the parity problems is directly
proportional to the problem dimensionality, and inversely proportional to the size of the network.
Despite the increased success rate when using larger networks, in general, optimal architecture
achieve better overall performance since they are less susceptible to over fitting. This behavior
may not be evident when solving parity problems, however some applications, such as function
approximation, this can become a significant issue. Therefore, the network size must be taken
into account when assessing algorithm performance.
Table 2.3: ParityN performance of the EBP algorithm.
Problem
Parity2 Parity3 Parity4 Parity5 Parity6 Parity7
Neurons
2 100%
12,175
100%
6,060
N/A N/A N/A N/A
3 100%
6,428
100%
3,139
40%
46,882
49%
35,424
0%

0%

4 100%
4,485
100%
2,009
89%
40,240
100%
25,148
13%
62,223
22%
55,309
5 100%
2,886
100%
1,706
95%
34,665
100%
19,402
33%
55,773
50%
51,361
6 100%
2,638
100%
1,530
99%
28,344
100%
13,130
100%
36,215
64%
45,151
35
Table 2.4: ParityN performance of the LM algorithm.
Problem
Parity2 Parity3 Parity4 Parity5 Parity6 Parity7
Neurons
2 100%
9
98%
9
N/A N/A N/A N/A
3 100%
9
99%
9
36%
20
51%
22
0%

0%

4 100%
8
99%
8
81%
20
85%
19
29%
34
42%
35
5 100%
8
98%
8
95%
19
93%
19
65%
32
71%
34
6 100%
8
100%
8
98%
18
99%
18
84%
34
76%
33
The test results for the LM algorithm are shown in Table 2.4. For training, the following
set of parameters was used:
? Initial Learning Parameter (
): 0.1
? Activation Function: TangentHyperbolic (Bipolar)
? Neuron Gain: 0.1
Comparing the results in Table 2.4 with those in Table 2.3, the superiority of the LM
algorithm with respect to the speed of convergence is clear. This is not surprising since the
secondorder model used by the LM algorithm offers a more accurate description of the error
surface. It is also clear, however, that the LM algorithm does not fare as well in terms of the
success rate. The reason for this is that the LM algorithm?s speed comes at a price. Because it
converges so quickly, the neuron?s are susceptible to being driven in to early saturation, before
reaching the desired solution. Once saturation is reached, the derivatives of the activation
function for the misclassified patterns become very small, despite having a relatively high error.
Because the misclassified patterns make up a minority of the total set of input patterns, their
36
contribution to the step is not significant enough to overcome the influence of the correctly
classified patterns. Therefore, the algorithm becomes trapped.
The EBP algorithm, on the other hand, benefits to some degree from its relatively slow
rate of convergence since the algorithm is not as easily driven into saturation, which allows the
misclassified patterns to have greater influence on the direction of the search.
37
Chapter 3
SINGLE NEURON TRAINING USING ITERATIVE PSEUDOINVERSION
TECHNIQUES
3.1 PSEUDOINVERSION TRAINING FOR NONLINEAR ACTIVATION FUNCTIONS
The pseudoinversion method described in section 2.2.2 was a regression technique used
for neurons with linear activation functions. Here, an iterative method is presented that is capable
of training neurons with nonlinear activation functions[10]. Two forms of the update rule are
discussed. Although the two forms are mathematically equivalent, they differ computationally in
terms of software implementation. The advantages and disadvantages of each formulation are
discussed following the derivation.
3.1.1 DERIVATION
Let
,
and
be the neuron error, desired output and observed output for pattern
respectively, where =1,?, is the pattern index, and let
and
be the weight and input
values associated with the
input of the neuron. Next, define the error for pattern as
=
?
(),
(3.1)
where =
+
+?+
, and is the dimension of the input. The derivative of
this error with respect to the
weight can then be described as
38
=
=?
,
(3.2)
where
is the slope of the neuron?s activation function as a function of the value. Next, let
the error in (3.1) be replaced by the first order Taylor approximation
=
+
?
+
?
+?+
?
.
(3.3)
Now, (3.2) and (3.3) can be combined to form the following overdetermined linear system of
equations in matrix form:
?
?
?
?
?
?
? ? ?
?
?
?
?
?
?
?
?
?
=
?
?
?
?
.
(3.4)
Solving (3.4) for ?w yields the least mean squares (LMS) formulation of the step,
?w=[(FX)
FX]
(FX)
E,
(3.5)
where X is the ? matrix of input patterns, and
F=
?
?
?
?
?
0 0 ? 0
0
0 ? 0
0 0
? 0
? ? ? ? ?
0 0 0 ?
?
?
?
?
?
.
Note that the term [(FX)
FX]
(FX)
in (3.5) is longhand for the pseudoinverse of FX, for
which the method gets its name.
3.1.2 IMPROVING COMPUTATIONAL EFFICIENCY
The computational requirements of the method derived in section 3.1.1 using a simple
reformulation. Instead of multiplying the patterns in by the corresponding values of
on the
39
left side of (3.4), the values are divided through both sides of the equation, which produces the
following algebraically equivalent formulation:
?
?
? ? ?
?
?
?
?
?
=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
.
(3.6)
Now, solving for ?w yields the reformulated update rule
?=(
)
= ,
(3.7)
Where
=(
)
and =
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
.
The reason for reformulating the update rule in this way is that the value of X does not
change from one iteration to the next since the values of the input patterns are constants. This
means that Y. the pseudoinverse of X, need only be computed once. Since matrix inversion is a
computationally intensive operation, the formulation in (3.7) is much more efficient than (3.5),
which requires inversion to be performed at each iteration due to the changes in F.
3.1.3 DISCUSSION
Although (3.5) and (3.7) appear algebraically equivalent, in practice, they differ in terms
of performance. The disparity is a result of the fact that the underlying system of equations is
overdetermined. Therefore, while they appear algebraically equivalent, they are in fact not. This
40
is due to the nature of the pseudoinverse. In order for equivalence to be maintained, the
following identity must hold:
(FX)
=X
F
.
Although the relation is valid whenever F and X are nonsingular square matrices, it does not hold
in the case of pseudoinversion. Therefore, the formulations given in (3.5) and (3.7) are not truly
equivalent mathematically. Despite this fact, both methods do yield convergence; however, the
rate of convergence for the method outlined in section 3.1.2 is significantly slower. This is
evidenced by the comparison of the training error in Figure 3.1.
(a)
(b)
50 100 150 200 250 300
10
4
10
3
10
2
10
1
10
0
iteration
1 2 3 4 5 6 7 8
10
4
10
3
10
2
10
1
10
0
iteration
41
Figure 3.1: Ten nonlinear pseudoinversion runs using (a) the iterative inversion method; and (b)
the single inversion method.
Figure 3.2: Separation of an unbalanced and poorly distributed data set using pseudoinversion.
3.2 IMPROVED GENERALIZATION USING ACTIVE SET PSEUDOINVERSION
When solving classification problems, the optimal orientation of the separating plane
does not depend solely on the correct classification of the training data. It must also exhibit
strong generalization. That is, it should also be capable of classifying additional data points with
as high a degree of accuracy as possible given the information contained in the training set.
While the pseudoinversion techniques outlined in section 3.1 are capable of reaching
very low error levels in a relatively low number of iterations, closer inspection may reveal
significantly less than optimal separating plane orientations when applied to training sets with
unbalanced or poorly distributed data. Take, for instance, the data set shown in Figure 3.2. The
separating plane represented by the dashed line was generated using the pseudoinversion
method described in Section 3.1.2 . Despite a total error of just 10
, it is clear from the figure
that the placement of the separating plane is quite poor. This is due to the imbalance and poor
42
distribution of the training data. The larger number of data points on the left has resulted in a
shift of the separating plane in that direction, and the poor distribution has had an undesirable
influence on the plane?s orientation.
Although it is true that the placement of the plane would improve if the training process
were allowed to continue, the rate of the improvement would be quite slow due to the fact that
the neuron is in saturation. The reason for this is that, in saturation, the derivatives of the
activation function become quite small. Since the magnitude of the step is proportional to that of
the derivatives, smaller derivatives mean smaller steps. To make matters worse, as the position
of the separating plane improves, the total error continues to drop, causing the step size to do the
same. Therefore, the rate of progress is inversely proportional to the number of iterations.
To overcome this issue, a modified pseudoinversion training technique is presented that
not only produces optimal or near optimal placement of the separating plane, but also achieves a
higher rate of convergence. The method revolves around the use of a variably sized training set
referred to as the active set.
The first major difference in the active set pseudoinversion technique (ASPI) is that it
uses the neuron?s net values rather than the output values to determine the placement of the
separating plane. To do this, a trick is employed which does not require any modification to the
original update rule. All that is required is the replacement of the nonlinear activation function by
a linear model with unity gain, which is equivalent to operating on the net value of the original
model. Therefore, mathematically speaking, the update rule is of the same form as,
?=(
)
,
(3.8)
where
43
=
?
?
?
?
?
0 ? 0
0
?
? 0
? ? ? ?
0 0 ?
?
?
?
?
?
,=
?
?
? ? ?
?
,?=
?
?
?
?
,and =
?
?
?
?
.
The only differences being that the effect of the derivative values
are ignored since the linear
model has unity gain, and the output values of the original model are replaced with the linear
model such that,
=
=.
(3.9)
Therefore, ignoring and substituting (3.9) into (3.8), the linear pseudoinversion rule from
section 2.2.2 is obtained. That is,
= +(
)
(?),
= +(
)
?,
= (
)
.
(3.10)
From this result, the relationship between the linear and nonlinear methods can be clearly seen.
Using the linear model offers two advantages. First, it restricts the magnitude of the
weights, and second it removes the possibility of saturation. However, while this effectively
removes the two primary restrictions faced by the nonlinear model, it also introduces a restriction
of its own. Because the value of the output is no longer limited to the interval between plus and
minus one, only those patterns which lie on or near the ?1 contours of the plane defined by the
net value are assigned the proper classification. The remaining patterns may contribute relatively
large error values regardless of their position with respect to the separating plane, which
adversely affects the placement of the separating plane in much the same way as the nonlinear
model. This behavior is illustrated in Figure 3.3 using the data set from the previous example.
44
Figure 3.3: Separation using pseudoinversion with linear activation function.
Fortunately, the undesirable behavior introduced by the linear model can be overcome by
selectively reducing the number of training patterns at each iteration. The result is a variably
sized subset referred to as the active set. During each subsequent iteration, only those patterns
indexed by the active set are used for training. The remaining patterns are simply ignored. To
do this, an active input matrix is defined such that
=
?
?
? ? ?
?
(3.11)
for all
? from =1,? ,, where is the number of indices contained in .
The method used to determine which patterns should be included in the active set is quite
simple. During each iteration, the active set is updated using the indices of all input patterns
whose net values lie on the interval between ?1. That is, they must lie within the region bounded
by the ?1 contours of the activation function. As it turns out, due to the LMS formulation of the
45
update rule, reduction of the size of the active set from one iteration to the next is guaranteed
until a minimal set is reached, for which there exists a feasible separating plane such that all data
points in the remaining set are collinear and run parallel to the plane. Thus, using the reduced
dataset, an exact solution to the linear system in (3.10) can be found.
Figure 3.4: Eight iterations using ASPI. Patterns in the active set are shown in blue.
Six iterations of this process are shown in Figure 3.4, using the data set from the previous
examples. The active set used at each iteration is represented by the blue data points. The
remaining data points, shown in red, are ignored. The ?1 contours used for determining the each
subsequent set are represent by the gray dashed lines, and the current separating plane is shown
46
in black. At the sixth iteration, the two remaining blue points represent the minimal active set. It
is important to note that for the minimal set, all active patterns lie on the ?1 contours, which
signifies that an exact solution to (3.10) has been found, and the training process is complete.
Once the proper separating plane has been determined, the resulting weight vector can
then be scaled to produce as low a total error as may be desired. The reason for this is that the
orientation of the separating plane for a neuron with a tangent hyperbolic activation function is
entirely determined by the equation for its net value, which is obtained from the active set
solution to (3.10). The only remaining parameter for the activation function is the gain, which
controls the slope of the activation function in the linear region. For the tangent hyperbolic
function, as the value of the gain is increased, the shape of the activation function approaches
that of the sign operator. This means that once the separating plane is set, increasing the gain of
the activation function effectively reduces the total error. This behavior is illustrated Figure 3.5.
From the figure, it is clear that as the gain increases, the data points represented by the orange x
and o converge toward their desired binary output values, thereby reducing the associated errors.
Therefore, any arbitrary desired value for the maximum error per pattern can be set by simply
scaling the weights.
47
)( netkf ?
net
k
error
Figure 3.5: An illustration of error reduction as a function of gain.
The reason for this is straight forward. First, since the patterns in the active set are known
to be those closest to the separating plane, they must also produce the largest error values. It
follows, then, that by setting the errors for the active patterns equal to
, it is guaranteed that
the errors for remaining patterns will be less strictly less than
. Therefore, a scalar value is
desired such that that
=
?tanh(??
),??.
(3.12)
Conveniently, due to the fact that the separating plane resulting from (3.10) is equidistant from
all patterns in , satisfying the above equation for any pattern in a is equivalent to satisfying it
for all patterns in . In addition, it is also known from (3.10) that the net value values for the
active patterns are equal to their desired outputs. Therefore, (3.12) can be simplified by choosing
any pattern such that
=
=1. This results in the equivalent relation
=1?tanh(?)
(3.13)
Finally, solving (3.13) for , the following equation for the weight scale factor is obtained:
48
=
tanh
(1?
)
.
(3.14)
In summary, not only is the active set method exceptionally fast, but it also offers three
distinct advantages over the previous methods:
1. The placement of the separating plane is guaranteed to be equidistant from the nearest
patterns in ether class.
2. By reducing the number of patterns used to compute the step during each subsequent
iteration, the computational complexity is also reduced.
3. The maximum pattern error can be directly specified by the user.
3.3 MAXIMIZING THE MARGIN OF SEPARATION
While the active set method presented in the previous section offers a number of
significant improvements, it still does not guarantee optimal separation in every sense. This is
due to what is known as the margin of separation, which refers to the distance between the
separating plane and the contour lines passing through the data points that comprise the optimal
active set. For optimal separation, the orientation of the separating plane should achieve the
largest possible margin of separation for which (3.10) remains satisfied. For a classification with
an dimensional input space, the active set solution will, in fact, produce the maximum margin
of separation, so long as the number of active patterns in either class is greater than or equal to
. However, if the number of active patterns in both classes is less than , the maximum margin
is no longer guaranteed. This is because for the latter case, there is more than one unique solution
to (3.10). To state this formally, if
and
are the active set patterns whose desired outputs
are plus and minus 1 respectively, then
49
1. If the number of patterns in either
or
is greater than or equal to , then there is
only one unique solution to (3.10).
2. If the number of patterns in both
and
is less than , then more than one unique
solution exist.
To illustrate this, a two dimensional example is shown in Figure 3.6. The active set in
Figure 3.6a, represented in blue, is an example of case one. As expected, the solution shown here
is the only possible solution for which all points in lie on the ?1 contour lines. For the second
case, represented by the example in Figure 3.6b, two possible solutions are shown. The latter
case also illustrates the difference in the separating margin from one solution to the next. Clearly,
solution number 2 has the largest margin of separation, and is the more preferable solution.
Figure 3.6: (a) If the number of patterns in either A_+ or A_ is greater than or equal to N, then
only one unique solution can be found; (b) When the number of patterns in both A_+ and A_ is
less than N, more than one unique solution exist.
50
In this section, a technique is presented that, when used in conjunction with the active set
method, guarantees a maximal margin of separation. Therefore, using the proposed method, a
truly optimal separation is obtained.
3.3.1 FORMULATING A MARGIN MAXIMIZING OBJECTIVE
The goal of the proposed method is to determine a separating plane which achieves a
maximum margin of separation for the data points in the active set. To do this, a functional
relationship between the separating margin and the weights must be found, beginning with the
general description of the separating plane,
?+=0,
(3.15)
where is the weight vector, is the input vector, and is the biasing weight. Furthermore, in
order that the patterns in
lie on the ?1 contours, a set of equality constraints must also be
defined,
?
+ =
,
(3.16)
where
is the active pattern matrix and
is the corresponding vector of desired outputs.
Thus, for two vectors
?
and
?
on opposite sides of the separating plane,
?(
?
)=2,and
(
?
)=2.
(3.17)
Using (3.17), the margin can be defined as the projection of (
?
) onto the vector normal to
the separating plane, which is equivalent to the normalized weight vector =/??.
Therefore,
= proj
(
?
)=
2
??
.
(3.18)
51
It follows. Then, that maximizing the margin described by (3.18) is equivalent to minimizing the
following equality constrained quadratic:
min
,
1
2
??
, subject to ?
+=
.
(3.19)
The formulation in (3.19) represents an equality constrained quadratic program (QP).
The Lagrangian of the equality constrained QP in (3.19) is defined as
(,)=
1
2
??
?
(?
)+?1,
(3.20)
where is the number of patterns and
are the Lagrange multipliers. Solving for the saddle
points of the Lagrangian and substituting the result back into (3.20) yields the equivalent Wolfe
dual formulation of the QP,
min
1
2
?
, subject to
?0.
(3.21)
For the Lagrangian formulation of a constrained QP, the objective is defined as the
gradient of the original objective minus the gradients of the constraints. Therefore, solutions are
characterized by points for which the combined direction of the gradients of the active
constraints points in the opposite direction of the gradient of the objective[14]. From a
geometrical standpoint, this makes sense. If the gradients are in opposition, no direction exists by
which the original objective function can be reduced while still satisfying the active set
constraints. However, simply minimizing the difference in the gradients is not sufficient since
their magnitudes may differ. To alleviate this problem, the Lagrange multipliers are used to scale
the constraint gradients so that their combined magnitudes will be equal to that of the objective.
52
Normally, for inequality constrained problems, the influence of the inactive constraints is
ignored by setting the corresponding multipliers to 0. If the Lagrange multiplier for any of the
active constraints is found to be negative, then it is no longer a blocking constraint, and it is
removed from the working set of constraints. This process is repeated until a solution is found
for which all of the multipliers are greater than or equal to zero. However, for the set of points
found using the active set method in the previous section, all of the data points yield strict
equality constraints. This means that the all the resulting constraints will be active, and their
corresponding Lagrange multipliers will be positive and nonzero. Therefore, the dual constraints
on _ in (3.21) may be ignored, resulting in an unconstrained minimization over .
Quadratic programs of this form are encountered in a wide range of applications, and a
number of efficient methods have developed for solving them.
3.4 IMPROVING GENERALIZATION IN FULL SCALE NETWORKS
Although the presented pseudoinversion techniques are only directly applicable for
training individual neurons, they can be used to optimize the weights found by nonlinear gradient
methods using a simple retraining process. The resulting set of weights yield optimal
generalization for the given set of selected features.
The first step in the process is to train the network using any standard gradient based
search method. Once convergence is achieved, the gradient search process is terminated as usual.
The outputs of all neurons are then recorded using the final weights of the gradient search, and a
set of desired outputs is defined for each neuron by taking the signs of the measured values.
Next, the neurons are ordered such that the position of each neuron in the order is greater
than that of all neurons connected to its input. In other words, for a given neuron, all neurons
downstream have higher value in the order. This ensures that when it comes time to retrain each
53
neuron, the outputs of all neurons connected to it will be available to use as input patterns during
training. After the ordering is determined, each individual neuron is retrained using the margin
maximizing technique described in the previous section. In this way, the separating margin for
each neuron is maximized, resulting in optimal generalization for the network as a whole. Since
the ability to guarantee optimal separation is one of the most commonly cited factors when
discussing the motivation for using SVM methods over ANN methods, this represents a
significant step for neural classification.
54
Chapter 4
TRAINING NEURAL CLASSIFIERS USING A SEARCH OF THE HIDDEN
SPACE
Generally speaking, neural networks are trained using the current output values to
compute or approximate the first and second partial derivatives of the network error using
backpropagation of the error through the network. This information is then used to determine a
step direction which effectively reduces the total error of the network. Once an adequate step is
found, all of the neurons are updated simultaneously. While this method has proven successful,
the error functions it introduces can be rather complicated, making the training process slower
and more susceptible to entrapment[21]. A faster and more efficient approach would be to train
each neuron independently using the techniques described in the previous chapter; however this
is not usually possible since the desired outputs are typically known only for those neurons that
reside in the output layer. In this chapter, an approach is presented which seeks to determine
these hidden outputs using a systematic realignment of the feature space. The result is a more
robust algorithm which is not only able to escape entrapment in false minima, but also makes
direct use of the information they provide in order to proceed.
4.1 OVERVIEW
If the desired outputs are known for all of the nodes in a given network, each neuron may
be trained independently. This method of training is attractive for two reasons: one, because the
training of an individual neuron requires relatively little computational effort, making
convergence quite rapid, and two, because the error surface does not contain any false minima,
55
which eliminates the possibility of entrapment. Unfortunately, as previously stated, the desired
outputs for the hidden layer neurons are not generally available a priori, and in most cases, the
number of possible combinations makes a sequential search impractical. For some perspective, a
problem with input patterns and hidden neurons would result in 2
possible combinations
of hidden layer outputs. Thus, even a relatively small problem such as parity7, which has 128
input patterns and 3 hidden neurons (using a bridged architecture) would have nearly 4?10
possible choices! While some combinations could be ruled out immediately, the number of
remaining choices would still be prohibitively large. The only way to proceed would be to
somehow narrow the scope of the search. To do this, the proposed method uses entrapment in
local minima, one of the weaknesses of gradient based approaches, to its advantage.
Entrapment occurs when an algorithm converges towards a solution which is not optimal.
Unable to reduce the error any further, the process must be terminated. When this occurs,
training is reinitiated using a randomly generated set of new weights, and the old weights are
discarded. However, while the discarded weights may not produce the desired classification for
all patterns, the number of misclassified patterns may be quite small. Therefore the
corresponding hidden layer outputs might offer a useful starting point for a search of the feature
space. If this information could be used to determine the desired outputs of the hidden layer
neurons, then each neuron in the network could be trained independently, which is the goal of the
proposed method.
4.2 ALGORITHM DESCRIPTION
The proposed method, referred to as the hidden layer pseudoinversion algorithm (HLPI),
begins by using a second order learning algorithm to train the network in the usual manner.
Where it differs is primarily in the way it handles entrapment. If the algorithm fails to converge,
56
the second order process is halted, and the second phase of training begins. In the second phase,
the values of the current hidden layer outputs are used as a starting point for determining the
correct set of hidden layer outputs. To do this, the hidden neurons are systematically retrained in
an attempt to reclassify the remaining misclassified patterns using pseudoinversion. This
process is performed using the following steps.
Figure 4.1: Flow chart of the HLPI algorithm.
Step 1: Train the network using the gradient method until the process converges. If all patterns
are classified correctly, terminate training and return the final weights. If not, proceed to
step 2.
57
Step 2: Find the hidden unit for which the net values of the misclassified patterns are closest to
0. Define a set of desired outputs by taking the sign of the current outputs and then
multiplying the outputs for any misclassified pattern by 1. That is, let the set of desired
outputs
be defined as
={sign(
)}??
,
where
is the set of current outputs for hidden neuron , contains the indices of all
correctly classified patterns, and contains the indices of all misclassified patterns.
Step 3: Perform weighted pseudoinversion (described in the next section). Incrementally scale
the derivatives for the misclassified patterns and repeat until one or more of the
misclassified patterns is reassigned.
Step 4: Let be the indices of all correctly classified patterns whose outputs are unchanged and
all misclassified patterns whose outputs were reassigned, and let contain the indices of
all remaining patterns. Find the neuron whose net values for the patterns denoted by are
closest to 0, and define a new set of desired outputs
={sign(
)}?{?(
)}.
Step 5: Repeat steps 2 and 3 until the number of indices in reaches 0, or until all hidden units
have been retrained.
Step 6: Perform pseudoinversion for the output layer neurons using the updated outputs of the
hidden units. If all patterns are classified correctly, terminate training and return the final
weights. If not, repeat steps 1 through 6.
A flowchart of this process is depicted in Figure 4.1.
The pseudoinversion technique used in step 3 is a weighted variation of the nonlinear
method presented in Section 3.1.2 . The details of this method are presented in the next section.
58
4.2.1 WEIGHTED PSEUDOINVERSION TRAINING
In Chapter 3 a series of pseudoinversion techniques were presented for optimizing the
weights of a single neuron with a nonlinear activation function. A variation of this technique is
presented here, and is used during the realignment phase of the proposed learning algorithm.
The goal of the realignment process is to systematically adjust the feature detectors in the
hidden space in order to achieve better separation in the output layer. This is accomplished by
retraining the hidden layer neurons one at a time in an attempt to invert the signs of the current
node values for all misclassified patterns, while preserving the signs of the remaining patterns.
Thus, a desired output vector is defined using the current node values for all correctly classified
patterns and the inverted values for all misclassified patterns. Unfortunately, there is no
guarantee that the newly specified classification will be linearly separable. Therefore, in the
event that linear separation is not achievable, a modified objective must be used. The primary
goal in such cases remains the inversion of the node values of the misclassified patterns.
However, the node values of the correctly classified patterns should not be ignored either since
they play an important role in separation at the output. Therefore, simply performing pseudo
inversion is not sufficient since there is no guarantee that the misclassified patterns will be
reassigned; and placing emphasis on the misclassified patterns by ignoring the remaining
patterns is also not advisable since it may have an adverse effect on separability at the output
layer. Clearly, a compromise must be made. An approach is needed that seeks to invert as many
of the misclassified output values as possible while at the same time preserving as many of the
correctly classified values as possible. To do this, an incremental weighting scheme is introduced
that gradually increases the emphasis on the misclassified patterns until one or more of the
corresponding node values is inverted. In this way, a subset of the misclassified patterns can be
59
inverted while maintaining the influence of the correctly classified patterns on the feature
realignment process. The resulting update rule is given by
?=
(
)
,
(4.1)
where
=
?
?
?
?
?
?
0 ? 0
0
?
? 0
? ? ? ?
0 0 ?
?
?
?
?
?
?
,=
?
?
? ? ?
?
,?=
?
?
?
?
,
and =
?
?
?
?
.
The values
are the weight values used to scale the derivatives of the activation function for
each pattern. The weight values are determined according to which patterns are to be inverted.
For correctly classified patterns, the value of
is set to 1. For all misclassified patterns,
is
set equal to the current value of the incrementing scale. Pseudocode for the complete process is
shown in Figure 4.2.
60
Figure 4.2: Pseudocode for the weighted pseudoinversion technique.
The variables and constants in Figure 4.2 are defined as follows:
? scale stores the incremental weight values used to scale the derivatives of the activation
function for all misclassified patterns.
? reclassified is a binary value that represents the state of the reclassification process. If
any of the misclassified patterns has been reclassified, the value is true, otherwise the
value is false.
? w is a vector containing the weights on the neuron?s inputs.
? F? is a ? diagonal matrix containing the scaled activation function derivatives for all
patterns.
scale = 1;
reclassified = false;
while reclassified == false;
for ite = 1to nite
w = w + (F?*X)/E;
out = tanh(k*X*w);
E = dout ? out;
der = k*(1out^2);
for p = 1 to P
F?(p,p) = der(p);
if p ? ndx
F?(p,p) = F?(p,p)*scale;
if E(p) < margin
reclassified = true;
end
end
end
end
scale = scale + step;
end
61
? X is a matrix whose rows contain the input patterns to the neuron.
? E is a ?1 vector containing the errors associated with each of the input patterns.
? out is a ?1 vector containing the output values for each of the input patterns.
? dout is a ?1 vector containing the desired output value for each input pattern.
? k is the gain of the neuron.
? nite is the number of pseudoinversion updates performed between increments of the
derivative scaling factor.
? der is a ?1 vector containing the derivative of the activation function for each of the
input patterns.
? margin is a constant value which determines how close the outputs for the misclassified
patterns must be to the desired values before they are considered reclassified.
? step is a constant that determines size of the increments in the derivative scaling factor.
? P is the number of input patterns
4.3 DETECTING ENTRAPMENT
The purpose of the proposed method is to improve the success rate of training by
realigning the hidden layer neurons whenever the gradient search fails to converge. In order to do
this effectively, a method is needed for determining when the algorithm becomes trapped. The
proposed method tests for three conditions to determine the current state of the training process.
The first test checks for changes in the pattern separation at the output of the network
over a preset interval. If, at the end of the specified number of iterations (), the separation of the
training data is unchanged, it is likely that the algorithm has become stuck. A binary valued flag
62
(
) is used to track the current status of the classification at each iterate (k). Therefore, the
test for this condition is defined by
=
1 if
?
=,
0, otherwise.
(4.2)
As an example, suppose a value of 20 was chosen for . This would mean that at the end of each
iteration, the resulting output values would be compared with those generated 20 iterations
before. If the values were found to be the same, then the
flag would be set equal to one.
The separation test alone is not sufficient for detecting entrapment. One reason for this is
that once a correct separation has been reached, it may still require a number of iterations to
reduce the total error to the desired level, during which time the separation test may return true.
To avoid the possibility of false detection, a second condition is tested by comparing the current
separation to that of the desired outputs. This is done by defining a second condition flag
(
), which is defined as
=
0 if
?
=,
1, otherwise.
(4.3)
The final condition check for the percent decrease in the total error from one iteration to
the next. This is important in the early stages of training when the convergence may be slow due
to the low output values produced by the initial weights. In order to avoid false detection during
this stage, the percent improvement in the total error is monitored. If the percent difference is
less than the desired value (), the third flag (
) is set. That is,
=
1 if
?
<,
0 otherwise,
(4.4)
63
where
is the total error for the
iterate, and is the size of the interval used in the
comparison. For instance, if a value of =5 was selected, the
condition would be
determined by the percent difference in total error between the current iterate and the one
generated five iterations before it.
Used together, these three conditions offer a reliable method for detecting entrapment in
local minima. If all three flags came back true during the same iteration, the gradient search is
terminated and the realignment process begins.
4.4 AVOIDING REDUNDANT FEATURE SELECTION
There are a number of conditions which may lead to false convergence, the majority of
which can be difficult or even impossible to detect. There is, however, one such condition that
can be detected quite easily. The condition is known as redundant feature selection, and occurs
when two or more hidden units attempt to perform the same classification task[22]. Obviously,
this is a problem when training networks with optimized architectures. For certain classes of
problems, such as analogtodigital conversion, this issue can be especially significant. A simple
method for detecting and avoiding redundancy is presented in this section.
There are two means by which redundant hidden neurons can be detected. One is to
compare the weights. The drawback to using this method is that different weights do not
guarantee a different behavior. Weight vectors may differ a great deal in magnitude, and even
orientation, and still produce a very similar separation of the test patterns. In order to make the
test reliable, the neurons must be allowed to reach saturation, and their weights must be
normalized. Allowing the neurons to reach saturation removes the possibility for early detection,
and limits the network?s flexibility once any similar neurons have been realigned. This lack of
flexibility tends to draw the network back to the previous state.
64
The second method of detecting redundant hidden neurons is to compare the sign of the
neurons? outputs. This offers a more reliable test and does not require the neurons to be in
saturation. Although the dimensions of the neurons? output vectors are generally greater than that
of the weight vector, there is no need to normalize the output vectors since the dot product of two
identical outputs will be equal to the sum of either vector?s elements. In addition, the
computational requirements can be further reduced by only performing the test periodically,
rather than after each iterate.
4.5 GRAPHICAL REPRESENTATION OF THE TRAINING PROCESS
During the development of the proposed training algorithm, it became apparent that to
verify the method was operating correctly, a significant amount of data would have to be
analyzed after each training attempt. To complicate matters further, in order to determine the
state of the training process, much of the data generated in a given iteration needed to be
compared with that from previous iterations. This process was time consuming and made
debugging quite difficult. To solve this problem, a visual representation was devised that allows
large amounts data to be combined into a single graph, and offers a concise and easy to read
summary of the training process.
To do this, the outputs of each neuron are represented as bitmap images. Each pixel
represents the output of a single neuron at a specific iteration. Therefore, the image generated for
each neuron has dimensions equal to the number of patterns by the number of iterations. The
color of each pixel signifies the sign of the output, with negative values drawn in blue and
positive values in red. The intensities of the pixels are proportional to their respective values.
Lighter pixels correspond to smaller values, and darker pixels are used to represent larger ones.
Once the output images have been generated for all neurons in the network, they are stacked
65
vertically to form a combined image. The resulting composite contains nearly all of the
information needed to evaluate the performance of the algorithm. A list of the available
information is given below.
? Classification: The position of each pattern with respect to the separating plane can be
determined by the sign of its corresponding output, which is denoted by the color of its
representative pixel. Therefore, since each column of pixels represents one iteration, the
classification of the input patterns at any given iteration can be determined by the colors
of the pixels in that column.
? Movement of the separating plane: Since the pixels in each row represent the output
values of a single pattern from one iteration to the next, if two consecutive pixels share a
common row but are of different colors, then the position of the separating plane has
changed.
? Entrapment: If the colors and intensities of the pixels in each row become static, then the
position of the separating plane must also be static. Therefore, if the desired classification
has not yet been satisfied, the algorithm is most likely trapped in a false minimum.
? Redundant feature detection: If the columns of two or more hidden layer neurons have
matching pixel colors, they are performing the same task.
? Pattern saturation: If the intensity of the pixels in a given row becomes high, the output
for the corresponding pattern has reached saturation. In general, as the number of
saturated patterns becomes larger, the mobility of the separating plane is reduced.
Not only does the generated plot provide the information listed above, it also allows the data for
all iterations to be evaluated simultaneously. This greatly simplifies the debugging and
verification process.
66
An example image of the training of a two neuron cascade for parity3 using EBP is
shown in Figure 4.3. Looking at the total error alone, there is little if any noticeable change over
the first 30 iterations. However, the output images tell a much different story. First, the low
intensity of the pixels means that the patterns have not yet reached saturation. Therefore, there is
still a great deal of mobility for the separating plane. This is confirmed by the changes in the
pixel color, which signify changes in the orientation of the separating plane. Furthermore, the
first 5 columns of the bitmap for the output neuron show that the algorithm is experiencing some
oscillation. This would suggest that the selected learning constant may be too large, and could
cause instability. It is also evident that despite the low change in error, by the 30
th
iteration the
hidden neuron has achieved a linearly separable mapping, after which an additional 6 iterations
are required for separation of the augmented set of inputs. After 36 iterations the patterns have
been correctly classified, and the remaining error is slowly reduced as the outputs are driven
further into saturation. This can be seen by the gradual increase in the intensity of the remaining
pixels.
Clearly, representing the training process in this way provides a great deal of additional
insight. Furthermore, all of these observations can be determined directly from Figure 4.3, with
no additional analysis required, making it an efficient and powerful tool for debugging and
verification.
67
?
?
Figure 4.3: Output image for the parity3 problem trained with EBP.
4.6 HLPI TRAINING EXAMPLE
In this section, a stepbystep description of the training process is presented, and the
results are verified using the graphical representation described in section 4.5. For the purpose of
this demonstration, the parity4 problem is selected. The reason for this choice is that, despite
being relatively difficult, the problem has only 16 training patterns and requires just 2 neurons in
the hidden layer. Thus, the problem?s manageable size makes it well suited for a detailed
analysis. The output image for this problem, shown in Figure 4.4, is used as a reference for the
remainder of this discussion. In addition to the node values and total error, the training summary
in Figure 4.4also specifies which training patterns remain misclassified following each iteration.
The output pixels of the misclassified patterns are outlined in green and crossed out.
68
?
?
?
?
Figure 4.4: Output image for the parity4 problem trained using the proposed method.
69
OVERVIEW
The training process is carried out in two phases. The first phase involves a gradient
search using the LM algorithm, and comprises the first 58 iterations shown in Figure 4.4. After
the 58
th
iteration, all 3 condition flags have been set, signifying that the algorithm has become
trapped in a false minimum. At this time the second phase begins, during which a search of the
hidden space is performed using pseudoinversion of the hidden layer neurons. At the end of the
pseudoinversion process, linear separation is achieved resulting in the successful classification
of the training patterns. The training process is then terminated, and the final weights are
returned.
For this example, the parameters used in determining the state of the condition flags are
chosen to be =50, =5 and =10
. For the network topology, a 3neuron single layer
bridged architecture, shown in Figure 4.5, is used. The activation functions for all three neurons
are tangent hyperbolic, and have gains of =0.1.
Figure 4.5: Single layer bridge architecture used to solve the parity4 problem.
70
PHASE 1: GRADIENT SEARCH
The gradient search is initialized using a set of randomly selected starting weights with
values between ?1. The small magnitude of the initial weights helps ensure that the initial
outputs are not in saturation, which allows more flexibility for the separating planes during the
early stages of the training process. This is clearly seen when looking at the first 10 iterations of
the output images in Figure 4.4. The low intensities of the pixels confirm that the outputs have
not yet reached saturation, and the resulting flexibility is evidenced by the frequent changes in
the coloration of the pixels from one iteration to the next. It is also clear that as the number of
patterns in saturation increases, indicated by darkly shaded pixels, the changes in pixel color
becomes less frequent, which indicates less movement in the separating plane.
As the first phase of training progresses, the condition flags are repeatedly updated in
order to detect entrapment. The
flag is only cleared once separability is reached in the
hidden space, so it is always set equal to 1 when training begins. The
and
flags, on
the other hand, are always cleared when training starts. Since =5, at least 5 iterations must be
completed before the error condition can be tested. Therefore, at the beginning of training, the
flag is cleared, and remains cleared for at least the first 5 iterations. Looking at the error
plot at the top of Figure 4.4, the total error reaches a fixed value of around 0.25 after 13
iterations. Five iterations later, the error condition test yields
?
=
0.2524738?0.2524733
0.2524738
=1.98?10
<10
,
and the
flag is set. Checking the value of the error condition for the 18
th
iteration in Figure
4.4 confirms that the flag has been updated as expected.
With the error and separation flags set, the only remaining condition is the classification
condition. As with the error condition, the classification test cannot be performed until the
71
number of total iterations reaches the necessary value. In the latter case, this is determined by the
parameter, which is equal to 50 for this example. Therefore, the
flag remains cleared for
at least the first 50 iterations of the gradient search phase. Returning to the training summary in
Figure 4.4, it is apparent that, by the 8
th
iteration, the classification of the data has become fixed.
Since =50, it is expected that if the classification remains unchanged for 50 iterations, the test
for the classification condition should come back positive, and the
flag should then be set.
Checking the state of
at the 58
th
iteration in Figure 4.4 confirms that the flag has been
updated correctly.
PHASE 2: HIDDEN LAYER PSEUDOINVERSION
At the 58
th
iteration, all three condition flags are set, which means that the gradient search
has failed to converge. Thus, the first phase of training is terminated and the second phase
begins. During the second phase of training, the hidden layer neurons are retrained using pseudo
inversion in an attempt to invert the output values for the misclassified patterns.
In order to maximize the probability of correctly reclassifying the misclassified patterns,
the hidden neuron with the lowest net values for the designated patterns is chosen to be retrained
first. The reason for this choice is that the lower net values mean that the misclassified patterns
are closer to the separating plane. Therefore, reassigning these patterns will require less
movement in the separating plane, which reduces the likelihood of unintentionally reclassifying
any of the correctly classified patterns in the process.
For the example in Figure 4.4, it is clear that the only pattern which remains misclassified
is pattern 15. Looking at the output pixels of the hidden layer neurons for the 15
th
pattern, it is
apparent from the comparison of the pixel intensities that the output of the first hidden neuron is
lower than that of the second. Since both neurons use the same activation function, the lower
72
output value also corresponds to lower net value. Therefore, hidden neuron 1 is chosen to be
retrained first.
In order to retrain the selected neuron, a set of desired outputs must be defined. To do
this, the sign of the current outputs are used, but with the output values of the misclassified
patterns inverted. For the present example, this results in
=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?0.89832
?0.41765
?0.98549
?0.89387
?0.98369
?0.88135
?0.99776
?0.98294
+0.99775
+0.99971
+0.98361
+0.99785
+0.98542
+0.99809
+0.89783
+0.98606
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?sign(
)=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?1
?1
?1
?1
?1
?1
?1
?1
+1
+1
+1
+1
+1
+1
+1
+1
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?=
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?1
+1
?1
?1
?1
?1
?1
?1
+1
+1
+1
+1
+1
+1
+1
+1
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
.
Using the designated set of desired outputs, the first hidden neuron is retrained according
to the weighted pseudoinversion technique described in section 4.2.1 . An initial weighting
value of one is used. Since the desired set of outputs is linearly separable, the misclassified
pattern is reclassified after the first pseudoinversion attempt, without incrementing the scaling
factor. Furthermore, no unintentional reclassification of the remaining patterns has occurred, so
there is no need to retrain the second hidden neuron. This marks the end of the hidden layer
search. The resulting output can be seen in the last column of the output image for the first
hidden neuron in Figure 4.4. Notice that all output assignments remain unchanged except for the
inversion of the 15
th
pattern, which was the desired result.
73
Once pseudo inversion of the hidden layer neurons is complete, their weights are frozen
and linear separation of the updated mapping is attempted using pseudoinversion of the output
neuron. Here, the margin maximizing active set method described in section 3.3 is adopted. The
resulting classification, represented in Figure 4.4 by the last column of pixels in the bitmap of the
output neuron, matches that of the desired classification. Therefore, training is complete, with a
final mean squared error of 10
.
4.7 EXPERIMENTAL RESULTS
For verification, the proposed method was tested against the EBP and LM algorithms
described in 0. The EBP algorithm was chosen because it remains the most widely used method
for training feedforward neural networks. The LM algorithm was chosen for its speed. All three
algorithms were applied to a range of test problems, and performance was measured according to
success rate and average number of iterations per successful run. Each algorithm was assigned a
maximum number of iterations per run, and a maximum acceptable error was chosen for each
problem. Success rate was calculated as the number of runs that reached the maximum allowable
error within the allotted number of iterations.
The classification problems used for testing were chosen for their high degrees of
nonlinearity. The chosen problems are listed below, with a brief description of each.
? ParityN: The parity problems, described in section 2.3, were chosen for their high
degree of nonlinearity and their reputation as benchmarks for nonlinear classification.
Five cases were selected, ranging from parity3 to parity7.
? N mod 2: The N mod 2 problem is a single input single output problem that produces a
binary mapping of the digits from ? to that determines if the input is even or odd.
74
The problem can also be viewed as a parityN problem with the individual inputs
replaced by their sum.
? CheckerN: The checkerN problem consists of an 10?10 grid whose desired
outputs form an alternating pattern like that of the colored squares on a chess board.
Mathematically speaking, the desired outputs are assigned as
(,)=2[ ??+?? (mod 2)]?1,
where ,?{0,0.1,0.2,? ,} are the ?rank? and ?file? respectively, mod is the
modulo operator, and ??? is the floor operator. Thus, for an integer , the resulting
?board? is comprised of
1?1 ?squares?, each containing 100 points. The difficulty
of the problem increases with , as does the size of the data set, which grows at a rate
of 100(2+1). A 3dimensional example of the training surface for the checker3
problem is shown in Figure 4.6.
1
x
2
x
out
Figure 4.6: Training surface for the checker3 problem.
75
Table 4.1: Performance comparison for the proposed method.
Algorithm
Hidden
Neurons
EBP LM HLPI
Pr
oblem
Parity3
Success: 100%
Ave. ite: 6,060
Success: 98%
Ave. ite: 9
Success: 100%
Ave. ite: 10
1
Parity4
Success: 81%
Ave.ite: 77,988
Success: 7%
Ave. ite: 27
Success: 100%
Ave. ite: 114
2
Parity5
Success: 83%
Ave. ite: 103,999
Success: 4%
Ave. ite: 34.25
Success: 98%
Ave. ite: 150
2
Parity6
Success: 1%
Ave. ite: 190,634
FAILED TO
CONVERGE
Success: 98%
Ave. ite: 276
3
Parity7
FAILED TO
CONVERGE
FAILED TO
CONVERGE
Success: 96%
Ave. ite: 291
3
15 mod 2
FAILED TO
CONVERGE
FAILED TO
CONVERGE
Success: 95%
Ave. ite: 2,553
7
Checker  3
FAILED TO
CONVERGE
Success: 11%
Ave. ite: 111
Success: 100%
Ave. ite: 596
4
Checker  4
FAILED TO
CONVERGE
Success: 4%
Ave. ite: 1,472
Success: 75%
Ave. ite: 3,770
6
The training results for all 8 of the test cases are shown in Table 4.1. Single layer bridged
architectures were used for all problems, and the number of hidden neurons used in each case is
listed in the right hand column of Table 4.1. Numbers in bold font represent the best performers
for each problem. The training parameters used by each algorithm are presented in Table 4.2.
The training results in Table 4.1 show that the HLPI method offers the highest rate of
convergence for all of the tested problems, with a rate of 95% or better in all but one of the cases,
and is the only one of the three algorithms to converge successfully for on all 8 problems. It is
also evident that the average number of iterations required for the HLPI method is around 2 to 3
orders of magnitude lower than that of the EBP algorithm, and shows comparable performance
to with the LM algorithm in that regard. Moreover, the comparison with the LM algorithm is
somewhat misleading since only the successful training attempts are considered in the
calculation of the average. Therefore, the significant portion of attempts in which the number of
76
iterations required by the LM algorithm exceeded the allowable value must also be taken into
account when making a direct comparison. Furthermore, since the first phase of the HLPI
method is equivalent to the LM method, the average number of iterations for those cases in
which the LM algorithm was able to converge successfully should be identical for HLPI.
Most importantly, the test results indicate that the proposed method does, in fact, achieve
the objective for which it was originally designed: to improve the rate of convergence of the
gradient search by overcoming the challenge presented by local minim. Clearly, judging by the
success rates in Table 4.1, this goal has been achieved. Furthermore, the resulting improvement
has come at a relatively low cost in terms of the number of required iterations.
77
HLPI LM EBP
Table
4.2: Algori
th
m
parameters.
Gain
Ma
x
err
or
Ma
x I
te.
Gain
Ma
x
err
or
Ma
x i
te.
Gain
Ma
x
err
or
Ma
x i
te.
10
10 50
0.
01
0.1
0.001
500
0.
01
0.1
0.001
500
0.7 0.1
0.001
150,000 Parit
y3
Pr
oblem
10
10 50
0.01
0.1
0.
001
500
0.01
0.1
0.
001
500
0.7
0.1
0.
001
150,000
Parit
y4
10
10 50
0.
01
0.1
0.001
500
0.
01
0.1
0.001
500
0.7 0.1
0.001
150,000 Parit
y5
10
10 50
0.01
0.
1
0.
001
500
0.01
0.
1
0.
001
500
0.
7
0.
1
0.
001
250,
000
Parit
y6
10
10 50
0.01
0.
1
0.
001
500
0.01
0.
1
0.
001
500
0.
7
0.
1
0.
001
250,
000
Parit
y7
10
20 30
0.
01
0.
1
0.
001
5000
0.
01
0.
1
0.
001
5000
0.
7
0.
1
0.
001
250,
000
15 m
od
2
10
20
100
0.01
0.25
0.
001
1000
0.01
0.
025
0.
001
1000
0.
7
0.
025
0.
001
250,
000
Che
cker3
10
20
100
0.01 0.25
0.001
5000
0.01
0.025
0.001
5000
0.7
0.025
0.001
250,000
Che
cker4
78
Chapter 5
CONCLUSION
In Chapter 3, three pseudoinversion techniques were presented for use in training
individual neurons. The first technique used an algebraic manipulation of a least squares
formulation of the weight update rule to improve the computational efficiency of the training
process. This was done by eliminating the need for matrix inversion at all but the first iteration,
which significantly reduced the necessary computation time, especially for problems with large
data sets. However, despite the method?s efficiency, it was shown that the modified version was
not truly equivalent to the original least squares method from which it was derived, which caused
a reduction in the rate of convergence.
Next, an active set method was presented that successfully reduced both the computation
time and the rate of convergence. The method used a linear approximation of the activation
function to iteratively reduce the size of the training set until a minimal set of training patterns
was reached. Once the optimal set of patterns was found, the tangent hyperbolic activation
function was restored, and the total error was directly set by scaling the final weight vector. The
method was also shown to be more robust, and was better able to handle problems in which the
training data was poorly distributed or unbalanced.
Despite the decided improvement represent by the active set method, it was shown that
the method did not to guarantee an optimal solution in terms of the margin of separation. To
solve this problem, a third technique was developed that uses the solution from the active set
method to obtain an equality constrained QP formulation of the separating margin. The solution
of the resulting QP guarantees maximum separation.
79
At the end of Chapter 3, a method for using single neuron pseudoinversion techniques to
improve the generalization of full scale networks is discussed. This is done by first finding a
solution using a gradient search. Once the separation is found, the signs of the current outputs for
all hidden neurons are used to define their desired outputs. Each neuron is then retrained starting
with the units nearest the input layer and moving towards the output. Using the margin
maximizing QP method, the resulting weight set will achieve optimal separation for the feature
set determined during the initial training process.
In Chapter 4, a modified training technique, known as hidden layer pseudoinversion,
was proposed that significantly improves the success rate in gradient based searches. The method
is similar to the generalization technique in Chapter 3, in that it has two training phases
consisting of a gradient search followed by a hidden layer pseudoinversion phase. However,
unlike the generalization technique, the proposed method does not require convergence before
beginning the second training phase. Instead the method monitors the state of the algorithm in
order to determine if the algorithm has become trapped in a false minimum. Once entrapment is
detected, the current hidden layer outputs are used to define a set of desired outputs. The desired
values of the remaining misclassified patterns are inverted in an attempt to reconfigure the
hidden layer mapping. The hidden layer neurons are then retrained one at a time. Once a new
mapping is found, linear separation is attempted using pseudoinversion of the output neuron.
The process is repeated until separation is achieved.
The HLPI method was compared with the popular EBP and LM algorithms using a set of
8 nonlinear classification benchmarks. The proposed method was shown to offer the highest
success rate for all of the tested problems. It was also shown to require significantly fewer
iterations than the EBP algorithm, and performed comparably with the LM algorithm. However,
80
the latter comparison was somewhat misleading due to the fact that only the successful training
attempts were considered in this calculation. Therefore the significant portion of attempts in
which the required number of iterations exceeded the number allowable was not considered.
Furthermore, since the first phase of the HLPI method is equivalent to the LM method, the
average number of iterations for those cases in which the LM algorithm is able to converge
successfully should be identical.
81
BIBLIOGRAPHY
[1] J. M. Zurada, Artificial Neural Systems, 1st ed. Boston, MA: PWS Publishing Company,
1995.
[2] A. J. Eide, T. Lindblad, and G. Paillet, "The RadialBasisFunction Type of Neural Network
and Its Implementations in Hardware," in Industrial Electronics Handbook, 2nd ed.: CRC
Press, 2011, vol. 5, ch. 12, pp. 112.
[3] K.P. Bennet and C. Campbell, "Support vector machines: hype or hallelujah?," ACM
SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 113, Desember 2000.
[4] Thomas M. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities
with Applications in Pattern Recognition," IEEE Transactions on Electronic Computers,
vol. EC14, no. 3, pp. 326334, June 1965.
[5] B. M. Wilamowski, "Neural network architectures and learning algorithms, ," IEEE
Industrial , vol. 3, no. 5, pp. 5663, 2009.
[6] R. Durbin, "On the correspondence between network models and the nervous system," in
The computing neuron, R. Durbin, C. Miall, and G Mitchison, Eds. Boston, MA, USA:
AdisonWesley, 1989, ch. 1, pp. 110.
[7] H. T. Siegelmann and E. D. Sontag, "On The Computational Power Of Neural Nets,"
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, vol. 50, no. 1, pp. 132150, 1995.
[8] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and
organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386408, November
82
1958.
[9] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing  Vol. 1:
Foundations.: MIT Press, 1987.
[10] T. J. Anderson and B. M. Wilamowski, "A Modified Regression Algorithm for Fast One
Layer Neural Network Training," in World Congress of Neural Networks, Washington DC,
USA, 1995, pp. 687690.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back
propagating errors," in Neurocomputing: foundations of research , J. A. Anderson and E.
Rosenfeld, Eds. Cambridge, MA: MIT Press, 1988, pp. 696699.
[12] K. Levenberg, "A method for the solution of certain nonlinear problems in leas squares,"
Quarterly of Applied Mathematics, no. 2, pp. 164168, 1944.
[13] D. W. Marquardt, "An algorithm for least squares estimation of nonlinear parameters,"
SIAM Journal, no. 11, pp. 431441, 1963.
[14] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., T. V. Mikosch, S. I. Resnick,
and S. M. Robinson, Eds. New York, NY: Springer, 2006.
[15] C. G. Broyden, "The convergence of a class of doublerank minimization algorithms,"
Journal of the Institute of Mathematics and Its Applications, vol. 6, pp. 7690, 1970.
[16] W. C. Davidon, "Variable metric method for minimization," SIAM Journal on Optimization,
vol. 1, pp. 117, 1991.
[17] B. M. Wilamowski, N. J. Cotton, O. Kaynak, and G. Dunda, "Computing Gradient Vector
and Jacobian Matrix in Arbitrarily Connected Neural Networks," IEEE Trans. on Industrial
Electronics, vol. 55, no. 10, pp. 37843790, October 2008.
83
[18] B.M. Wilamowski, H Yu, and K T. Chung, "ParityN Problems as a Vehicle to Compare
Efficiency of Neural Network Architectures," in Industrial Electronics Handbook, vol. 5 ?
Intelligent Systems, 2nd ed.: CRC Press, 2011, vol. 5, ch. 10, pp. 18.
[19] M. E. Hohil, D. Liu, and S. H. Smith, "Solving the Nbit parity problem using neural
networks," Neural Networks, vol. 12, no. 9, pp. 13211323, November 1999.
[20] B.M. Wilamowski and D. Hunter, "Solving Parityn Problems with Feedforward Neural
Networks," in Proc. of the IJCNN'03 International Joint Conference on Neural Networks,
Portland, 2003, pp. 25462551.
[21] M. Gori and A. Tesi, "On the Problem of Local Minima in Backpropagation," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 1, pp. 7686,
January 1992.
[22] T. D. Gedeon, "Indicators of hidden neuron functionality: the weight matrix versus neuron
behaviour," in ANNES '95 Proceedings of the 2nd New Zealand TwoStream International
Conference on Artificial Neural Networks and Expert Systems, Washington DC, 1995, p. 26.
84
Appendix A
MATLAB CODE FOR THE PSEUDOINVERSION TECHNIQUES
A.1 NLPI.m
function NLPI;
clear all; format compact;
clc; figure(1); clf;
warning('off');
global data input output;
clc; figure(1); clf;
input=[1 1; 1 1; 1.5 1; 4 1];
output=[1; 1; 1; 1];
data.X = [1 1;1 1];
data.Y = [1.5 1;4 1];
data.axs = [0 5 2 2];
gain=0.1;
act_n=2;
ite_num=1000;
[m,n]=size(input)
X=cat(2,input,ones(m,1));
X
Y=X'*X\X';
for ii=1:1
w=generate_weights(n+1);
for iteration=1:ite_num
[del,err(iteration),outs]=compute_del(input,output,w,act_n,gain);
delta_w=Y*del;
w=w+delta_w';
w=1.2*w;
if err(iteration)<1e4 break; end;
end
figure(1);
semilogy(err); hold on;
drawnow;
axis tight
disp(['Total Error: ',num2str(err(end))]);
disp(['Final Weights: ',num2str(w)]);
xlabel('iteration')
ylabel('total error')
end;
%%
function [weight] = generate_weights(n);
weight=zeros(1,n);
for i = 1:n % number of weights
ra = 2*rand(1)1; % generate random weights between 1 and 1
while( ra == 0 )
85
ra = 2*rand(1)1;
end;
weight(i) = ra;
end;
%%compute delta using error backpropagation%%%%%%%%%%
function [del,err,outs]=compute_del(input,output,w,act_n,gain)
[m,n]=size(input);
err=0;
for p=1:m
temp=input(p,:);
net=w(n+1);
for i=1:n
net=temp(i)*w(i)+net;
end
[out,der]=actFuncDer(net,act_n,gain);
outs(p)=out;
del(p,1)=(output(p)out)/(der+0.0001);
err=err+(outoutput(p))^2;
end
%%
function [out,der]=actFuncDer(net,act_n,gain)
switch act_n
case 0, out = gain*net; der = gain;
% linear neuron
case 1, out = 1/(1+exp(gain*net)); der = gain*(1out)*out;
% unipolar neuron
case 2, out = 2/(1+exp(gain*net))1; der = gain*(1out*out)/2;
% bipolar neuron
case 3, out = gain*net/(1+gain*abs(net));der = gain/(gain*abs(net)+1)^2;
% unipolar elliot neuron
case 4, out = 2*gain*net/(1+gain*abs(net))1; der =
2*gain/(gain*abs(net)+1)^2; % bipolar elliot neuron
end;
A.2 ASPINV.m
function [w,err] = aspinv(inp,dout,gain,err_max);
[np,ni]=size(inp);
X=cat(2,inp,ones(np,1));
A = [1:length(dout)];
Xndx = find(dout==1);
w = 2*rand(ni+1,1)1;
for ite=1:100
Xa = X(A,:);
w = Xa'*Xa\Xa'*dout(A);
out = X*w;
B = find((abs(out))<=1);
if (length(B)>=2)
cls0 = length(find(Xndx==B(1)));
for i=2:length(B)
cls = length(find(Xndx==B(i)));
if cls0~=cls
A = B;
break
86
end
end
end
E = dout(A)  out(A);
if (E'*E<1e4)&(sign(out)==dout),
break;
end;
end
scale = atanh(1err_max)/gain;
w = w*scale;
E = dout  tanh(gain*X*w);
err = E'*E/np;
return
A.3 PINVQP.m
function [w,err] = pinvqp(inp,dout,gain,err_max);
[np,ni]=size(inp);
X=cat(2,inp,ones(np,1));
A = [1:length(dout)];
Xndx = find(dout==1);
w = 2*rand(ni+1,1)1;
for ite=1:100
Xa = X(A,:);
w = Xa'*Xa\Xa'*dout(A);
out = X*w;
E = dout(A)  out(A);
if (E'*E<1e4)&(sign(out)==dout),
break;
end;
B = find((abs(out))<=1);
if (length(B)>=2)
cls0 = length(find(Xndx==B(1)));
for i=2:length(B)
cls = length(find(Xndx==B(i)));
if cls0~=cls
A = B;
break
end
end
end
end
y = dout(A);
out = inp(A,:);
% V  Formulate Wolfe Dual  V %
nump = length(y);
G = (out*out').*(y*y');
c = ones(nump,1);
Ai = eye(nump);
Ae = y';
Bi = zeros(nump,1);
Be = 0;
%  Solve Wolfe Dual  %
87
x = rand(nump,1);
[x] = QPsolve(G,c,Ai,Ae,Bi,Be,x);
wo = (y'.*x')*out
S = find(abs(x)>=1e10);
nums = length(S);
s1 = 0;
for i = 1:nums
s2 = 0;
for j = 1:nums
s2 = s2 + x(S(j))*y(S(j))*out(S(j),:)*out(S(i),:)';
end
s1 = s1 + y(S(i))  s2;
end
b = 1/length(S)*s1;
w = [wo,b]';
scale = atanh(1err_max)/gain;
w = w*scale;
E = dout  tanh(gain*X*w);
err = E'*E/np;
return
A.4 QPSOLVE.m
function [x] = QPsolve(G,c,Ai,Ae,Bi,Be,x)
% Initialize working set
W = find(Ai*xBi<=0); % Find active
constraints for initial x
a = Ai(W,:);
a = [a;Ae];
b = Bi(W,:);
b = [b;Be];
%  Apply Algorithm  %
while(1)
numc = length(b); % Number of
constraints in working set
% Solve for step direction (p):
g = G*x+c; % Gradient of the
Objective
h = a*xb; % Vector of
constraint values
O = zeros(numc,numc); % Zero matrix for
the KKT matrix
KKT = [G a'; % KKT Matrix
a O];
gb = [g;
h];
88
if cond(KKT)>1e5 % Solve KKT
problem
sol = (KKT+diag(.001*ones(1,length(KKT(1,:)))))\gb;
1
else
sol = KKT\gb;
2
end
p = sol(1:length(x)); % Step direction
if length(W)>0
lam = sol(length(x)+1:end1); % Lagrange
Multipliers
else
lam = 0;
end
% Test step direction:
if sum(abs(p))<=1e10,
ndx = find(lam<0);
x = x + p;
if isempty(ndx)
break;
else
ndx = find(lam==min(lam(ndx)));
W(ndx) = [];
a = Ai(W,:);
a = [a;Ae];
b = Bi(W,:);
b = [b;Be];
end
else
an = Ai;
an(W,:) = [];
bn = Bi;
bn(W,:) = [];
ndx = 1:length(Bi);
ndx(W)=[];
apgz = find(an*p>=0);
an(apgz,:)=[];
bn(apgz,:)=[];
ndx(apgz)=[];
blk = (bnan*x)./(an*p);
alpha = min([1,min(blk)]);
x = x + alpha*p
if alpha<1
ndx = ndx(find(blk==min(blk)));
W = sort([ndx,W]);
a = Ai(W,:);
a = [a;Ae];
b = Bi(W,:);
b = [b;Be];
else
W = W;
89
end
end
end
%  %
90
Appendix B
MATLAB CODE FOR THE HLPI ALGORITHM
B.1 NRA.m
function [ww, nodes, TERR, ite, dbug] = NRA(Nparam,Aparam,Tparam)
global CANCEL
global RESTART
unpackparam(Nparam);
unpackparam(Aparam);
unpackparam(Tparam);
dbug = [];
if length(ww) ~= length(topo),
ww = (2*rand(size(topo))1);
ww = ww_init(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act);
nw=length(topo);
end;
if Tparam.op == 1
disp(' ');
disp('Initial Weights');
dispf(0,ww', 8,4,'ww');
disp(' ');
disp('Training started...');
[nodes,nets,ooo,der,E] =
cal_forward(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act,Nparam);
TER=E'*E/np; TERR(1)=TER;
if strcmp(alg,'NRA_NBN.m')
addpath('C:\Users\Joel\Documents\MATLAB\my_libs\NNT\alg_files\NRA_files\');
I=eye(nw); x=ww; x_new = x; jite=0; RESTART=2;
mu1 = 0.01;
flag = 0;
count = 0;
% Inititialize debugging output
dbug = Nparam; dbug.wwi = ww; dbug.ww = []; dbug.rlnite = [];
dbug.rlntyp = []; dbug.nodes = []; dbug.der = []; dbug.ndxo = {}; dbug.ord =
[];
% Entrapment test parameters
nlast = 100;
outlast = zeros(np,nlast);
91
ncount = 50;
errite = 50;
derr = 1e4;
% Weight reset parameters
sim_mes = 0.95; % similarity between neuron outputs. range: [0,1]
test_freq = 10; % Number of iterations between tests
nite_init = 100;
ite_last = 0;
for ite=2:ite_max,
if CANCEL == 1; break; end;
run Jstandard
x = x_new;
gra = J'*E; %gradient
JJ=J'*J;
jw=0;
while (1),
x_new=x((JJ+mu1*I)\gra)';
ww=x_new';
if (ite==1), TERR(ite)=terr; break; end;
%calculate new performance
[nodes,nets,ooo,der,E] =
cal_forward(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act,Nparam);
TER=E'*E/np; % evaluate the objective function
TERR(ite)=TER;
if TER<=TERR(ite1)
if mu1>mu_min, mu1=mu1/LM_scal; end;
break;
end;
if mu15, break; end;
end; % while (1),
% Reset weights for redundant hidden neurons
if (~rem(ite,test_freq))&&((ite
ite_last)>(nite_init+2))&&(sum(sign(nodes(:,end))~=dout)=sim_mes);
pairs = [n1,n2];
[h1,h2] = find(abs(cmp)1)
h1 = ceil(rand*(length(hndx)));
mv1 = ww(iw(hndx(h1)):iw(hndx(h1)+1)1);
hndx(h1) = [];
h2 = ceil(rand*(length(hndx)));
92
mv2= ww(iw(hndx(h2)):iw(hndx(h2)+1)1);
hndx(end+1) = h1;
else
[mv1,mv2] =
find(abs(cmp)==max(max(abs(cmp))));
mv1 = ww(iw(mv1(1)):iw(mv1(1)+1)1);
mv2 = ww(iw(mv2(1)):iw(mv2(1)+1)1);
end
if rand>0.6;
ww(iw(rsndx):iw(rsndx+1)1) = pct*mv1+(1
pct)*mv2;
else
ww(iw(rsndx):iw(rsndx+1)1) = (1
2*rand(size(mv1)));
end
pairs(find(pairs==rsndx),:)=[];
hndx(end+1) = rsndx;
end
end
mu1 = 0.01;
end
if
((count>=ncount)&&isempty(find(sign(outlast(:,1))~=sign(nodes(:,end)))))((i
te>errite+2)&&((TERR(iteerrite)
TERR(ite))=sim_mes);
pairs = [n1,n2];
[h1,h2] = find(abs(cmp)1)
h1 = ceil(rand*(length(hndx)));
mv1 = ww(iw(hndx(h1)):iw(hndx(h1)+1)1);
hndx(h1) = [];
h2 = ceil(rand*(length(hndx)));
mv2= ww(iw(hndx(h2)):iw(hndx(h2)+1)1);
hndx(end+1) = h1;
93
else
[mv1,mv2] =
find(abs(cmp)==max(max(abs(cmp))));
mv1 = ww(iw(mv1(1)):iw(mv1(1)+1)1);
mv2 = ww(iw(mv2(1)):iw(mv2(1)+1)1);
end
if rand>0.9;
ww(iw(rsndx):iw(rsndx+1)1) = pct*mv1+(1
pct)*mv2;
else
ww(iw(rsndx):iw(rsndx+1)1) = (1
2*rand(size(mv1)));
end
pairs(find(pairs==rsndx),:)=[];
hndx(end+1) = rsndx;
end
end
[nodes,nets,ooo,der,E] =
cal_forward(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act,Nparam);
inpo = nodes(:,topo(iw(end1)+1:iw(end)1));
if act(end)==4,
funo = Nparam.fun{end};
dero = Nparam.der{end};
else
funo = 0;
dero = 0;
end
[wwo,outn,rln] =
pseudo_inv(inpo,dout,gain(end),act(end),funo,dero,[],10);
ww(iw(end1):iw(end)1) = wwo/sqrt(sum(wwo.^2));
dbug.rlntyp(end+1) = 1;
mu1 = 0.01;
x_new = ww';
dbug.rlnite(end+1) = ite;
[nodes,nets,ooo,der,E] =
cal_forward(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act,Nparam);
if (E'*E/np>=TER),
ww(iw(end1):iw(end)1) = 12*rand(1,iw(end)iw(end
1));
dbug.rlntyp(end) = 0;
end
[nodes,nets,ooo,der,E] =
cal_forward(gain,np,ni,no,nn,inp,dout,ww,iw,topo,act,Nparam);
TERR(ite) = E'*E/np;
end;
if count>=ncount, count = 0; end;
end;
if nlast>1,
outlast = [outlast(:,2:nlast),nodes(:,end)];
else
outlast = nodes(:,end);
end
count = count + 1;
94
if rem(ite,pscale)==0; dispf(0,TER,18,12,'total error ',ite);
end;
if TER0.9, break; end;
end
return