Advanced Learning Algorithms of Neural Networks

by

Hao Yu

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 12, 2011

Keywords: Artificial Neural Networks, Levenberg Marquardt Algorithm, Neuron-by-Neuron Algorithm, Forward-Only Algorithm, Improved Second Order Computation

Copyright 2011 by Hao Yu

Approved by
Bogdan M. Wilamowski, Chair, Professor of Electrical and Computer Engineering
Hulya Kirkici, Professor of Electrical and Computer Engineering
Vishwani D. Agrawal, Professor of Electrical and Computer Engineering
Vitaly Vodyanoy, Professor of Anatomy, Physiology and Pharmacy

Abstract

The concept of "learn to behave" gives a vivid description of the functionality of neural networks. Specifically, a group of observations, each of which consists of inputs and desired outputs, is applied directly to a neural network, and the network parameters (called "weights") are adjusted iteratively according to the differences (called "errors") between the desired network outputs and the actual network outputs. This parameter adjustment process is called "learning" or "training". After the errors converge to the expected accuracy, the trained network can be used to analyze input data lying in the same range as the observations, for classification, recognition and prediction.

In the neural network realm, network architectures and learning algorithms are the major research topics, and both of them are essential in designing well-behaved neural networks. In this dissertation, we focus on the computational efficiency of learning algorithms, especially second order algorithms. Two algorithms are proposed to solve the memory limitation problem and the computational redundancy problem in second order computations, including the famous Hagan and Menhaj Levenberg Marquardt algorithm and the recently developed neuron-by-neuron algorithm.

The dissertation consists of seven chapters. The first chapter demonstrates the attractive properties of neural networks with two examples, by comparison with several other methods of computational intelligence and with human beings. The second chapter introduces the background of neural networks, including the history of neural networks, basic concepts, network architectures, learning algorithms, generalization ability and the recently developed neuron-by-neuron algorithm. The third chapter discusses the current problems in second order algorithms. The fourth chapter describes another way of computing the gradient vector and the quasi Hessian matrix for implementing the Levenberg Marquardt algorithm. With similar computational complexity, the improved second order computation solves the memory limitation in second order algorithms. The fifth chapter presents the forward-only algorithm. By replacing the backpropagation process with extra calculations in the forward process, the forward-only algorithm improves the training efficiency, especially for networks with multiple outputs. Also, the forward-only algorithm can handle networks consisting of arbitrarily connected neurons. The sixth chapter introduces the computer software implementation of neural networks, written in C++ on the Visual C++ 6.0 platform. All the algorithms introduced in the dissertation are implemented in the software. The seventh chapter concludes the dissertation and also introduces our recent work.

Acknowledgments

First of all, I would like to sincerely thank my supervisor, Prof.
Bogdan M. Wilamowski, for his great patience and knowledgeable guidance during the past three years of my Ph.D. study. His professional research experience has taught me how to be creative, and how to find problems and solve them. His active attitude toward life encourages me to work hard towards my goals. His kindness and great sense of humor make me feel warm and happy. Everything I have learned from him is marked deeply in my memory and will benefit the rest of my life. Without his help, I could not have finished my dissertation and Ph.D. study successfully. Besides, I would like to express my special appreciation to both Mr. Bogdan Wilamowski and his wife, Mrs. Barbara Wilamowski, for their kindness and care, which made me feel at home while studying here.

Special thanks are also given to my committee members, Prof. Hulya Kirkici, Prof. Vishwani D. Agrawal and Prof. Vitaly Vodyanoy, and the outside reader, Prof. Weikuan Yu. From their critical and valuable comments, I noticed the weaknesses in my dissertation and made the necessary improvements according to their suggestions.

I would like to express my appreciation to my good friends who have helped me with my studies and my life in Auburn. They are Joel Hewlett, Nam Pham, Nicholas Cotton, Pradeep Dandamudi, Steven Surgnier, Yuehai Jin, Haitao Zhao, Hua Mu, Jinho Hyun, Qing Dai, Yu Zhang, Chao Han, Xin Jin, Jia Yao, Pengcheng Li, Fang Li and Jiao Yu. I am very lucky to be their friend.

Special thanks to Charles Ellis, Prof. David Irwin and Prof. Michael Hamilton, for their professional guidance on the projects and papers we worked on together. It was my great honor to have worked with them. I also would like to thank Prof. John Hung, Prof. Fa Foster Dai, Prof. Hulya Kirkici, Prof. Vishwani Agrawal, Prof. Stanley Reeves, Prof. Adit Singh, Prof. Bogdan Wilamowski and Prof. Thomas Baginski, for their excellent teaching skills and professional knowledge in their courses.

Last but not least, I am greatly indebted to my wife, Dr. Tiantian Xie, my newborn daughter, Amy X. Yu, and my parents and parents-in-law. They are the backbone and the origin of my happiness. Being both a father and a mother while I was struggling with my dissertation was not easy for my wife. Without her support and encouragement, I could never have finished my Ph.D. study successfully. I owe my every achievement to my family. Thanks to everyone.

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
Chapter 1 Why Neural Networks
  1.1 Introduction
  1.2 Comparison of Different Nonlinear Approximators
  1.3 Neural Networks for Image Recognition
  1.4 Conclusion
Chapter 2 Background
  2.1 History
  2.2 Basic Concepts
  2.3 Network Architectures
  2.4 Learning Algorithms
  2.5 Generalization Ability
  2.6 Neuron-by-Neuron Algorithm
Chapter 3 Problems in Second Order Algorithms
Chapter 4 Improved Second Order Computation
  4.1 Problem Description
  4.2 Improved Computation
  4.3 Implementation
  4.4 Experiments
  4.5 Conclusion
Chapter 5 Forward-Only Computation
  5.1 Computational Fundamentals
  5.2 Forward-Only Computation
  5.3 Computation Comparison
  5.4 Experiments
  5.5 Conclusion
Chapter 6 C++ Implementation of Neural Network Trainer
  6.1 File Instruction
  6.2 Graphic User Interface Instruction
  6.3 Implemented Algorithms
  6.4 Strategies for Improving Training Performance
  6.5 Case Study Using NBN 2.0
  6.6 Conclusion
Chapter 7 Conclusion
References

List of Tables

Table 1-1 Comparison of approximation accuracy using different methods of computational intelligence
Table 1-2 Success rates of the designed counterpropagation neural network for digit image recognition
Table 2-1 Different architectures for solving parity-N problem
Table 2-2 Specifications of different learning algorithms
Table 2-3 Comparison among different learning algorithms for parity-3 problem
Table 2-4 Training/testing SSEs of different sizes of FCC networks
Table 4-1 Computation cost analysis
Table 4-2 Memory cost analysis
Table 4-3 Memory comparison for parity problems
Table 4-4 Memory comparison for MNIST problem
Table 4-5 Time comparison for parity problems
Table 5-1 Analysis of computation cost in Hagan and Menhaj LM algorithm and forward-only computation
Table 5-2 Comparison for ASCII problem
Table 5-3 Analytical relative time of the forward-only computation of problems
Table 5-4 Training results of the two-spiral problem with the proposed forward-only implementation of LM algorithm, using MLP networks with two hidden layers; maximum iteration is 1,000; desired error=0.01; there are 100 trials for each case
Table 5-5 Training results of the two-spiral problem with the proposed forward-only implementation of LM algorithm, using FCC networks; maximum iteration is 1,000; desired error=0.01; there are 100 trials for each case
Table 5-6 Training results of peak surface problem using FCC architectures
Table 5-7 Comparison for ASCII characters recognition problem
Table 5-8 Comparison for error correction problem
Table 5-9 Comparison for forward kinematics problem
Table 6-1 Parameters for training
Table 6-2 Three types of neurons in the software
Table 6-3 Available commands and related functionalities
Table 6-4 Comparison of different EBP algorithms for solving XOR problem
Table 6-5 Testing results of parity problems using update rules (6-3) and (6-4)
Table 6-6 Testing results of parity-N problems using different activation functions with the minimal network architecture analyzed in section 2.3

List of Figures

Figure 1-1 Surface approximation problem
Figure 1-2 Block diagram of the two types of fuzzy systems
Figure 1-3 Result surfaces obtained using fuzzy inference systems
Figure 1-4 Neuro-Fuzzy System
Figure 1-5 Result surface of neuro-fuzzy systems, SSE=27.3356
Figure 1-6 Result surfaces obtained using support vector machine
Figure 1-7 Result surfaces obtained using interpolation methods
Figure 1-8 Neural network architecture and related testing result
Figure 1-9 Neural network architecture and related testing result
Figure 1-10 Digit images with different noise levels from 0 to 7 in left-to-right order (one data in 100 groups)
Figure 1-11 The designed counterpropagation neural network architecture for the digit image recognition problem
Figure 1-12 Retrieval results of 7th level noised digit images
Figure 2-1 Neural cell in human brain and its simplified model in neural networks
Figure 2-2 Different types of activation functions
Figure 2-3 Training patterns simplification for parity-3 problem
Figure 2-4 Two equivalent networks for parity-3 problem
Figure 2-5 Analytical solution of parity-2 problem
Figure 2-6 Analytical solution of parity-3 problem
Figure 2-7 Solving parity-7 problem using MLP network with one hidden layer
Figure 2-8 Solve parity-7 problem using BMLP networks with one hidden layer
Figure 2-9 Solve parity-11 problem using BMLP networks with single hidden layer
Figure 2-10 Solve parity-11 problem using BMLP networks with two hidden layers, 11=2=1=1
Figure 2-11 Solve parity-11 problem using BMLP networks with two hidden layers, 11=1=2=1
Figure 2-12 Solve parity-7 problem using FCC networks
Figure 2-13 Solve parity-15 problem using FCC networks
Figure 2-14 Searching process of the steepest descent method with different learning constants: yellow trajectory (left) is for a small learning constant, which leads to slow convergence; purple trajectory (right) is for a large learning constant, which causes oscillation (divergence)
Figure 2-15 Parity-3 data and network architecture
Figure 2-16 Training results of parity-3 problem
Figure 2-17 Two-spiral problem: separation of two groups of points
Figure 2-18 Comparison between EBP algorithm and LM algorithm, for different numbers of neurons in fully connected cascade networks
Figure 2-19 Training results of the two-spiral problem with 16 neurons in fully connected cascade network
Figure 2-20 Function approximation problem
Figure 2-21 Approximation results of FCC networks with different numbers of neurons
Figure 2-22 Arbitrarily connected neural network indexed by NBN algorithm
Figure 4-1 Two ways of multiplying matrices
Figure 4-2 Parity-2 problem: 4 patterns, 2 inputs and 1 output
Figure 4-3 Three neurons in MLP network used for training parity-2 problem; weight and neuron indexes are marked in the figure
Figure 4-4 Pseudo code of the improved computation for quasi Hessian matrix and gradient vector
Figure 4-5 Some testing results for digit "2" recognition
Figure 5-1 Connection of a neuron j with the rest of the network. Nodes y_{j,i} could represent network inputs or outputs of other neurons. F_{m,j}(y_j) is the nonlinear relationship between the neuron output node y_j and the network output o_m
Figure 5-2 Structure of Jacobian matrix: (1) the number of columns is equal to the number of weights; (2) each row corresponds to a specified training pattern p and output m
Figure 5-3 Pseudo code using traditional backpropagation of delta in second order algorithms (code in bold will be removed in the proposed computation)
Figure 5-4 Interpretation of δ_{k,j} as a signal gain, where in a feedforward network neuron j must be located before neuron k
Figure 5-5 Four neurons in fully connected neural network, with 5 inputs and 3 outputs
Figure 5-6 The δ_{k,j} parameters for the neural network of Fig. 5-5. Input and bias weights are not used in the calculation of gain parameters
Figure 5-7 The nn × nn computation table; gain matrix Δ contains all the signal gains between neurons; weight array w presents only the connections between neurons, while network input weights and biasing weights are not included
Figure 5-8 Three different architectures with 6 neurons
Figure 5-9 Pseudo code of the forward-only computation, in second order algorithms
Figure 5-10 Comparison of computation cost for MLP networks with one hidden layer; x-axis is the number of neurons in the hidden layer; y-axis is the time consumption ratio between the forward-only computation and the forward-backward computation
Figure 5-11 Peak surface approximation problem
Figure 5-12 The best training result in 100 trials, using LM algorithm, 8 neurons in FCC network (52 weights); maximum training iteration is 1,000; SSE_Train=0.0044, SSE_Verify=0.0080 and training time=0.37 s
Figure 5-13 The best training result in 100 trials, using EBP algorithm, 8 neurons in FCC network (52 weights); maximum training iteration is 1,000,000; SSE_Train=0.0764, SSE_Verify=0.1271 and training time=579.98 s
Figure 5-14 The best training result in 100 trials, using EBP algorithm, 13 neurons in FCC network (117 weights); maximum training iteration is 1,000,000; SSE_Train=0.0018, SSE_Verify=0.4909 and training time=635.72 s
Figure 5-15 The first 90 images of ASCII characters
Figure 5-16 Using neural networks to solve an error correction problem; errors in input data can be corrected by well trained neural networks
Figure 5-17 Two-link planar manipulator
Figure 6-1 Commands and related neural network topologies
Figure 6-2 Weight initialization for parity-3 problem with 2 neurons in FCC network
Figure 6-3 Extract the number of inputs and the number of outputs from the data file and topology
Figure 6-4 A sample of training result file
Figure 6-5 A sample of training verification file for parity-3 problem
Figure 6-6 The user interface of NBN 2.0
Figure 6-7 Training process with and without momentum
Figure 6-8 Network architecture used for XOR problem
Figure 6-9 Training results of XOR problem
Figure 6-10 The "flat spot" problem in sigmoidal activation function
Figure 6-11 Test the modified slope by "worst case" training
Figure 6-12 Parameter adjustment in update rule (6-4)
Figure 6-13 Failures of gradient based optimization
Figure 6-14 Two equivalent networks
Figure 6-15 Network construction commands: 15 neurons in FCC network with 2 inputs and 5 outputs
Figure 6-16 Data classification
Figure 6-17 X-dimension surface of forward kinematics
Figure 6-18 Y-dimension surface of forward kinematics
Figure 6-19 X-dimension testing results
Figure 6-20 Y-dimension testing results

CHAPTER 1

WHY NEURAL NETWORKS

1.1 Introduction

With the rapid development of computational intelligence, the tendency becomes more and more apparent that humans are going to be replaced by intelligent systems in many tasks. Various algorithms of computational intelligence have been well developed based on different biological or statistical models [1-4], and they have received great attention in both scientific research and industrial applications, such as nonlinear compensation [5-7], motor control [8-12], dynamic distribution systems [13], robotic manipulators [14-16], pattern recognition [17-19] and fault diagnosis [20-21].

Artificial neural networks (ANNs) were derived from the complicated interconnections of biological neurons and inherit the learning and reasoning properties of human brains. It has been proven that neural networks can be considered as a general model capable of building arbitrary linear or nonlinear relationships between stimulus and response [22]. The internal computations of neural networks are still not well understood, so it is hard to design them directly; instead, researchers have developed smart algorithms to train neural networks. The error backpropagation (EBP) algorithm [23], developed by David E. Rumelhart, was the first algorithm with the ability to train multilayer perceptron (MLP) networks. The Levenberg Marquardt (LM) algorithm [24-25] is regarded as one of the most efficient algorithms for neural network learning. The recently developed second order neuron-by-neuron (NBN) algorithm [26-27] is capable of training arbitrarily connected neural (ACN) networks, which can be more efficient and powerful than traditional MLP networks.
Fault tolerance and generalization ability are improved when efficient network architectures are applied for training [28].

Fuzzy inference systems are designed based on fuzzy logic rules [29]. All parameters for designing fuzzy inference systems can be extracted from the problems themselves, and no training process is required. However, the tradeoff for this very simple design process is the accuracy of approximation. Some hybrid architectures [30], inherited from both neural networks and fuzzy inference systems, have been proposed to improve the performance of fuzzy inference systems. Another disadvantage of fuzzy inference systems is that, as the number of input dimensions increases, the computation cost increases exponentially.

Support vector machines (SVMs) were developed from statistical learning theory [31] to solve data classification problems. The concept of SVMs is very similar to three-layer MLP networks. The difference is that the layer-by-layer architecture in SVMs is organized based on Cover's theorem and each layer performs a different computation. Unlike other learn-by-example systems, SVMs do not face the local minima problem and they can find optimized solutions through a constrained learning process. Later improvements [32] make SVMs also suitable for solving function approximation problems.

Other methods of computational intelligence, such as self-organizing maps (SOMs) [33], principal component analysis (PCA) [34], particle swarm optimization [35], ant colony optimization [36] and genetic algorithms [37], have also attracted great interest for solving special optimization problems. These methods are often combined with training algorithms so as to improve their performance [38-40].

In the following two sections, two examples illustrate the potential advantages of neural networks over (1) several other methods for function approximation, and (2) human beings for image recognition.

1.2 Comparison of Different Nonlinear Approximators

In this section, different methods of computational intelligence, including fuzzy inference systems, neuro-fuzzy systems, support vector machines, interpolation and neural networks, are compared on a nonlinear surface approximation problem. The task is to use the given 5×5=25 points (Fig. 1-1a, uniformly distributed in [0, 4] in both the x and y directions) to approximate the 41×41=1,681 points (Fig. 1-1b) in the same input range. All the training and testing points are generated by equation (1-1), a single-peak exponential surface z(x, y), and are visualized in Fig. 1-1. The approximation quality is evaluated by the sum of squared errors (SSE).

(a) Training data, 5×5=25 points  (b) Testing data, 41×41=1,681 points
Fig. 1-1 Surface approximation problem

1.2.1 Fuzzy Inference Systems

The most commonly used architectures for fuzzy system development are the Mamdani fuzzy system [41] and the TSK (Takagi, Sugeno and Kang) fuzzy system [42]. Both of them consist of three blocks: a fuzzification block, a fuzzy rule block and a defuzzification/normalization block, as shown in Fig. 1-2 below.

(a) Mamdani fuzzy system  (b) TSK fuzzy system
Fig. 1-2 Block diagram of the two types of fuzzy systems
For the given surface approximation problem, with 5 triangular membership functions in each direction, the two fuzzy inference systems obtain the approximated surfaces shown in Fig. 1-3.

(a) Mamdani fuzzy system, SSE=319.7334  (b) TSK fuzzy system, SSE=35.1627
Fig. 1-3 Result surfaces obtained using fuzzy inference systems

The roughness of the control surfaces (Fig. 1-3) produced by fuzzy controllers leads to rough control and instabilities [43]. Therefore, for resilient control systems, fuzzy controllers are not used directly in the control loop. Instead, traditional PID controllers [44-45] are often used, and fuzzy inference systems are only applied to automatically adjust the parameters of the PID controllers [46].

1.2.2 Neuro-Fuzzy Systems

The neuro-fuzzy system, as shown in Fig. 1-4, attempts to present a fuzzy inference system in the form of a neural network [47]. It consists of four layers: fuzzification, multiplication, summation and division. Notice that, in the second layer, product operations are performed on the fuzzy variables (from the first layer), instead of the fuzzy rules (MIN/MAX operations) used in classic fuzzy inference systems. The multiplication process improves the performance of the neuro-fuzzy system, but it is more difficult for hardware implementation.

Fig. 1-4 Neuro-Fuzzy System

For the given problem, with the same membership functions chosen for the fuzzy inference system design in section 1.2.1, Fig. 1-5 shows the approximation result of the neuro-fuzzy system.

Fig. 1-5 Result surface of neuro-fuzzy systems, SSE=27.3356

1.2.3 Support Vector Machines (SVMs)

For the given problem, using the software LIBSVM [48], the best results we obtained with the radial basis function kernel (exp(−γ|u−v|²) with γ=0.7) and with the polynomial kernel ((0.1·uᵀv+0.1)ⁿ with n=7) are shown in Fig. 1-6. For other kernels, such as linear and sigmoid, the SVM does not work at all.

(a) Radial basis function kernel, SSE=28.9595  (b) Polynomial kernel, SSE=176.1520
Fig. 1-6 Result surfaces obtained using support vector machine

1.2.4 Interpolation

Interpolation is a commonly used method for function approximation. MATLAB provides the function "interp2" for two-dimensional interpolation, with four approximation methods: nearest (nearest neighbor interpolation), linear (bilinear interpolation), spline (spline interpolation) and cubic (bicubic interpolation, as long as the data is uniformly distributed). Fig. 1-7 presents the approximation results using the four different ways of interpolation.

(a) Nearest interpolation, SSE=197.7494  (b) Linear interpolation, SSE=28.6683
(c) Spline interpolation, SSE=11.0874  (d) Cubic interpolation, SSE=3.2791
Fig. 1-7 Result surfaces obtained using interpolation methods

1.2.5 Neural Networks

For the given problem in Fig. 1-1, Figs. 1-8 and 1-9 show the result surfaces obtained using fully connected cascade (FCC) networks with different numbers of neurons. All the hidden neurons use unipolar sigmoidal activation functions and the output neuron is linear. The software NBN 2.0 [49-51] was used in the experiment and the neuron-by-neuron (NBN) algorithm [52-53] in the software was selected for training.
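The SSE values quoted with each result surface (and summarized below in Table 1-1) can be made concrete with a short sketch. The code below is not taken from NBN 2.0; it only illustrates how such an SSE value could be computed, assuming SSE is the plain sum of squared differences over the 41×41 testing grid and assuming a hypothetical Approximator interface for the compared methods.

```cpp
// Minimal sketch: SSE of an approximator over the 41x41 testing grid in [0,4] x [0,4].
// Illustrative only; "Approximator" and "reference" are hypothetical names.
#include <cstdio>

typedef double (*Approximator)(double x, double y);

double testSSE(Approximator f, Approximator reference) {
    const int n = 41;                    // 41 x 41 = 1,681 testing points
    const double step = 4.0 / (n - 1);   // uniform grid over [0, 4]
    double sse = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double x = i * step, y = j * step;
            double e = reference(x, y) - f(x, y);  // error at one testing point
            sse += e * e;                          // accumulate squared error
        }
    }
    return sse;
}
```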
(a) Four neurons in FCC network  (b) Result surface with SSE=2.3628
Fig. 1-8 Neural network architecture and related testing result

(a) Five neurons in FCC network  (b) Result surface with SSE=0.4648
Fig. 1-9 Neural network architecture and related testing result

Table 1-1 summarizes the experimental results of the different nonlinear approximators. One may notice that, from the point of view of approximation accuracy, neural networks are the best choice for this problem.

Table 1-1 Comparison of approximation accuracy using different methods of computational intelligence

Method of Computational Intelligence             Sum of Squared Errors
Fuzzy inference system - Mamdani                 319.7334
Fuzzy inference system - TSK                     35.1627
Neuro-fuzzy system                               27.3356
Support vector machine - RBF kernel              28.9595
Support vector machine - polynomial kernel       176.1520
Interpolation - nearest                          197.7494
Interpolation - linear                           28.6683
Interpolation - spline                           11.0874
Interpolation - cubic                            3.2791
Neural network - 4 neurons in FCC network        2.3628
Neural network - 5 neurons in FCC network        0.4648

1.3 Neural Networks for Image Recognition

It is common knowledge that computers are much superior to human beings in numerical computation; however, it is still believed that human beings are superior to computers in the area of image processing. In this part, an example is used to show the ability of specially designed neural networks to recognize noisy images which cannot be handled by normal people.

The experiment was carried out with the following scheme. As shown in Fig. 1-10, each column contains 10 digit images, from "0" to "9", each of which consists of 8×7=56 pixels with normalized intensity (jet color scale) between -1 and 1 (-1 for blue and 1 for red). The first column is the original image data without noise; for the noised data from the 2nd column to the 8th column, the strength of the noise is increased according to equation (1-2):

$$NP_i = P_0 + i \cdot \alpha \tag{1-2}$$

where P_0 is the original image data (the 1st column); NP_i is the image data with i-th level noise; i is the noise level; and α is randomly generated noise in [-0.5, 0.5].

The purpose of this problem is to design neural networks based on the image data in the 1st column and then test the generalization ability of the designed neural networks using the noised image data, from the 2nd column to the 8th column. For each noise level, the test is repeated 100 times with randomly generated noise, in order to obtain the recognition success rate statistically.

Fig. 1-10 Digit images with different noise levels from 0 to 7 in left-to-right order (one data in 100 groups)

Using the enhanced counterpropagation neural network [54] as shown in Fig. 1-11, the testing results are presented in Table 1-2 below. One may notice that recognition errors appear when patterns with level-three noise are applied.

Fig. 1-11 The designed counterpropagation neural network architecture for the digit image recognition problem
Table 1-2 Success rates of the designed counterpropagation neural network for digit image recognition

Digit     Level 1  Level 2  Level 3  Level 4  Level 5  Level 6  Level 7
Digit 0   100%     100%     100%     100%     100%     96%      97%
Digit 1   100%     100%     100%     100%     100%     100%     94%
Digit 2   100%     100%     100%     95%      91%      77%      82%
Digit 3   100%     100%     99%      92%      88%      84%      65%
Digit 4   100%     100%     100%     100%     100%     98%      96%
Digit 5   100%     100%     100%     100%     100%     95%      93%
Digit 6   100%     100%     100%     100%     92%      91%      88%
Digit 7   100%     100%     100%     100%     100%     98%      88%
Digit 8   100%     100%     99%      98%      83%      76%      67%
Digit 9   100%     100%     100%     100%     94%      91%      72%

Comparing human beings and computers in the recognition of these noisy characters, Fig. 1-12 presents the retrieval results for the 7th-level noised digit images. Obviously, it is almost a gamble for human beings to identify most of these images, but the designed counterpropagation neural network can do the job correctly.

Fig. 1-12 Retrieval results of 7th level noised digit images

1.4 Conclusion

The two examples above show the potentially good performance of neural networks in function approximation and pattern recognition problems. Because of this attractive and powerful nonlinear mapping ability, we are very interested in the research of neural networks, including both network architectures and learning algorithms. Besides, for a better understanding of neural networks, we have also extended our research scope to several other methods of computational intelligence, such as fuzzy inference systems and radial basis function neural networks. Our recent publications, as listed at the end of the dissertation, demonstrate our achievements in these areas.

In this dissertation, we discuss how to design efficient and powerful algorithms for neural network learning. In particular, we focus on second order algorithms, considering their high training efficiency and powerful search ability compared with first order algorithms. Our recently developed improved second order computation and the forward-only algorithm are introduced as the recommended solutions to the memory limitation problem and the computational redundancy problem, respectively, in second order algorithms.

CHAPTER 2

BACKGROUND

2.1 History

The history of neural networks can be traced back to 1943, when Warren McCulloch and Walter Pitts proposed the McCulloch-Pitts model, named the Threshold Logic Unit (TLU) [55]. Originally, the TLU was designed to perform simple logic operations, such as "&" and "|". In 1949, Donald Hebb introduced the concept of "synaptic modification" in his book "The Organization of Behavior" [56]. This concept is considered a milestone in the development of neural networks, and it is very similar to the analytical neuron models used today. In 1956, Albert Uttley reported that he had successfully solved simple binary pattern classification problems using neural networks [57]. In 1958, Frank Rosenblatt introduced the important concept of the "perceptron"; in the following four years, Rosenblatt designed several learning algorithms for the perceptron model for binary pattern classification [58]. As another milestone, in 1960, Bernard Widrow and his student Ted Hoff proposed the "ADALINE" model, which consisted of linear neurons. The least mean squares (LMS) method was designed to adjust the parameters of the ADALINE model.
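As a concrete illustration of the idea behind ADALINE training, the following is a minimal sketch of an LMS-style weight update for a single linear neuron. It is an assumption-level illustration of the general method only, not code from this dissertation; the learning constant alpha and the function name are arbitrary choices.

```cpp
// Minimal sketch of the least mean squares (LMS) idea for one linear neuron:
// weights are nudged in proportion to the output error and the inputs.
#include <cstddef>

void lmsUpdate(double w[], const double x[], std::size_t n,
               double desired, double alpha /* learning constant */) {
    double out = w[0];                    // w[0] acts as the bias weight
    for (std::size_t i = 0; i < n; ++i)   // linear neuron: weighted sum of inputs
        out += w[i + 1] * x[i];
    double e = desired - out;             // output error for this pattern
    w[0] += alpha * e;                    // bias update
    for (std::size_t i = 0; i < n; ++i)
        w[i + 1] += alpha * e * x[i];     // LMS rule: change proportional to e * x
}
```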
Two years later, in 1962, as an expansion of ADALINE, Widrow and Hoff introduced the "MADALINE" model, which had a two-layer architecture: multiple ADALINE units arranged in parallel as the input layer and a single processor as the output layer [59]. Based on the ADALINE and MADALINE models, neural networks attracted many researchers and went through very fast development, until 1969, when Marvin Minsky and Seymour Papert demonstrated the very limited power of single layer perceptrons in their book "Perceptrons" [60]. They showed that the single layer perceptron model was only capable of classifying patterns which were linearly separable; for linearly inseparable patterns, such as the very simple XOR problem, the single layer perceptron model is helpless. The argument of Minsky and Papert stopped the development of neural networks for almost 10 years, until the invention of the error backpropagation algorithm, proposed by David E. Rumelhart in 1986 [61]. The error backpropagation algorithm dispersed the dark clouds over the field of neural networks and can be regarded as one of the most significant breakthroughs in neural network training. By using sigmoidal shape activation functions, such as the hyperbolic tangent function, and incorporating the gradient descent concept from numerical methods, the error backpropagation algorithm enhanced the power of neural networks significantly. Neural networks can not only be used for classifying linearly separable binary patterns, but can also be applied to approximate arbitrary nonlinear relationships. In the following 10 years, various learning algorithms [62-68] and network models [69-70] came out like bamboo shoots after a spring rain. Currently, the error backpropagation (EBP) algorithm and multilayer perceptron (MLP) networks are still the most popular learning algorithm and network architecture in practical applications.

2.2 Basic Concepts

As the basic units of the human brain, neural cells play the roles of signal transmission and storage. A neural cell mainly consists of a cell body with many synapses around it, as shown in Fig. 2-1a. Extracted from this model, a single artificial neuron is made up of a linear or nonlinear activation function f(x) (like the cell body) and weighted connections (like synapses), as shown in Fig. 2-1b.

(a) Neural cell [71]  (b) Neural model
Fig. 2-1 Neural cell in human brain and its simplified model in neural networks

Taking the neuron in Fig. 2-1b as an example, the two fundamental operations in a single neuron can be described as follows:

(1) Calculate the net value as the sum of weighted input signals:

$$net = \sum_{i=1}^{7} x_i w_i + w_0 \tag{2-1}$$

(2) Calculate the output y:

$$y = f(net) \tag{2-2}$$

The activation function f(x) in equation (2-2) can be either a linear function (equation 2-3) or a sigmoidal shape function (equation 2-4), as shown in Fig. 2-2:

$$y = gain \cdot x \tag{2-3}$$

$$y = \tanh(gain \cdot x) = \frac{2}{1+\exp(-2 \cdot gain \cdot x)} - 1 \tag{2-4}$$

(a) Linear function  (b) Sigmoidal function
Fig. 2-2 Different types of activation functions

It is quite straightforward that linear neurons (Fig. 2-2a), such as the ADALINE model, have very limited power and can only handle patterns which are linearly separable. On the other hand, sigmoidal shape functions (Fig. 2-2b), such as the hyperbolic tangent function (equation 2-4), can be applied in nonlinear situations. It can also be noticed that, for sigmoidal shape functions, as the gain value becomes larger, the function behaves more like a step function.
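To make equations (2-1) through (2-4) concrete, the short sketch below computes the output of one neuron with a tanh activation. It is an illustrative fragment written for this description (the array sizes, example values and gain are arbitrary), not code taken from the NBN trainer.

```cpp
// Single neuron with 7 inputs: net value per (2-1), bipolar sigmoidal output per (2-2), (2-4).
#include <cmath>
#include <cstdio>

double neuronOutput(const double x[7], const double w[8], double gain) {
    double net = w[0];                 // w[0] is the bias weight (the "+1" input)
    for (int i = 0; i < 7; ++i)
        net += x[i] * w[i + 1];        // net = sum of weighted inputs + bias, eq. (2-1)
    return std::tanh(gain * net);      // y = tanh(gain * net), eq. (2-4)
}

int main() {
    double x[7] = {1, 0, 1, 1, 0, 0, 1};
    double w[8] = {-0.5, 1, 1, 1, 1, 1, 1, 1}; // arbitrary example weights, w[0] = bias
    std::printf("y = %f\n", neuronOutput(x, w, 1.0));
    return 0;
}
```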
When more than one neuron is interconnected, the two basic computations in equations (2-1) and (2-2) remain the same for each neuron as for a single neuron. The only difference is that the inputs of a neuron can be either network inputs or the outputs of neurons from previous layers.

2.3 Network Architectures

Neural networks consist of neurons and their interconnections. Technically, the interconnections among neurons can be arbitrary. In this dissertation, we only discuss feedforward neural networks, where signals are propagated from the input layer to the output layer without feedback.

In this section, different types of neural network architectures are studied and compared from the point of view of network efficiency, based on parity problems. The N-bit parity function can be interpreted as a mapping (defined by 2^N binary vectors) that indicates whether the sum of the N elements of each binary vector is odd or even. The parity-N problem is also considered to be one of the most difficult problems in neural network training [72-74].

2.3.1 Simplified Patterns for Parity Problems

One may notice that, in parity problems, input patterns which have the same sum of inputs produce the same output. Therefore, by setting all the weights on the network inputs to 1, the number of training patterns of the parity-N problem can be reduced from 2^N to N+1. Fig. 2-3 shows both the original 8 training patterns and the simplified 4 training patterns of the parity-3 problem. The two sets of training patterns are equivalent.

(a) Original patterns  (b) Simplified patterns
Fig. 2-3 Training patterns simplification for parity-3 problem

Based on this pattern simplification strategy, for the parity-3 problem, instead of the network architecture in Fig. 2-4a, a linear neuron (with slope equal to 1) can be used as the network input (see Fig. 2-4b). The linear neuron works as a summator and it does not have a bias input. All weights connected to the linear neuron, including input weights and output weights, are fixed to 1.

(a) Original parity-3 inputs  (b) Simplified linear neuron inputs
Fig. 2-4 Two equivalent networks for parity-3 problem

2.3.2 MLP Networks with One Hidden Layer

Multilayer perceptron (MLP) networks are the most popular networks because they are regularly formed and easy to program. In MLP networks, neurons are organized layer by layer and there are no connections across layers. Both the parity-2 (XOR) and parity-3 problems can be illustrated visually in two and three dimensions respectively, as shown in Figs. 2-5 and 2-6. For the parity-2 problem, each hidden neuron in Fig. 2-5b works as a separating line as shown in Fig. 2-5a, and the output unit decides the value of each separation area. Similarly, for the parity-3 problem, each hidden unit in Fig. 2-6b represents a separating plane in Fig. 2-6a and the values of the separation areas are determined by the output unit.

(a) Graphical interpretation of separation  (b) Designed neural network
Fig. 2-5 Analytical solution of parity-2 problem
(a) Graphical interpretation of separation  (b) Designed neural network
Fig. 2-6 Analytical solution of parity-3 problem

Using an MLP network with one hidden layer to solve the parity-7 problem, at least 7 neurons are needed in the hidden layer to separate the 8 simplified training patterns (using the pattern simplification strategy described in Figs. 2-3 and 2-4), as shown in Fig. 2-7a. In Fig. 2-7a, the 8 patterns {0, 1, 2, 3, 4, 5, 6, 7} are separated by 7 neurons (bold lines). The thresholds of the hidden neurons are {0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5}. Then, summing the outputs of the hidden neurons weighted by {1, -1, 1, -1, 1, -1, 1}, the net input at the output neuron can only be {0, 1}, which can be separated by a neuron with threshold 0.5. Therefore, the parity-7 problem can be solved by the architecture shown in Fig. 2-7b.

(a) Analysis  (b) Architecture
Fig. 2-7 Solving parity-7 problem using MLP network with one hidden layer

Generally, if there are n neurons in an MLP network with a single hidden layer, the largest parity-N problem that can be solved is

$$N = n - 1 \tag{2-5}$$

where n is the number of neurons and N is the dimension of the parity problem.

2.3.3 BMLP Networks

In MLP networks, if connections across layers are permitted, then the networks have bridged multilayer perceptron (BMLP) topologies. BMLP networks are more powerful than traditional MLP networks with the same number of neurons.

2.3.3.1 BMLP Networks with One Hidden Layer

In BMLP networks with only one hidden layer, all network inputs are connected to both the hidden neurons and the output neuron or neurons. For the parity-7 problem, the 8 simplified training patterns can be separated by 3 neurons into four sub-patterns {0, 1}, {2, 3}, {4, 5} and {6, 7}. The thresholds of the hidden neurons should be {1.5, 3.5, 5.5}. In order to transfer all sub-patterns to the unique pattern {0, 1} for separation, patterns {2, 3}, {4, 5} and {6, 7} should be reduced by 2, 4 and 6 respectively, which determines the weight values on the connections between the hidden neurons and the output neuron. After the pattern transformation, the unique pattern {0, 1} can be separated by the output neuron with threshold 0.5. The design process is shown in Fig. 2-8a and the corresponding solution architecture is shown in Fig. 2-8b.

(a) Analysis  (b) Architecture
Fig. 2-8 Solve parity-7 problem using BMLP networks with one hidden layer

For the parity-11 problem, a similar analysis and the related BMLP solution with a single hidden layer are presented in Fig. 2-9.
(a) Analysis  (b) Architecture
Fig. 2-9 Solve parity-11 problem using BMLP networks with single hidden layer

Generally, for n neurons in a BMLP network with one hidden layer, the largest parity-N problem that can possibly be solved is

$$N = 2n - 1 \tag{2-6}$$

2.3.3.2 BMLP Networks with Multiple Hidden Layers

If BMLP networks have more than one hidden layer, then a further reduction of the number of neurons is possible for solving the same problem. For the parity-11 problem, solutions can be found using 4 neurons, in both the 11=2=1=1 and the 11=1=2=1 architectures. Considering the 11=2=1=1 network, the 12 simplified training patterns are first separated by two neurons into {0, 1, 2, 3}, {4, 5, 6, 7} and {8, 9, 10, 11}; the thresholds of the two neurons are 3.5 and 7.5 respectively. Then, sub-patterns {4, 5, 6, 7} and {8, 9, 10, 11} are transformed to {0, 1, 2, 3} by subtracting 4 and 8 respectively, which determines the weight values on the connections between the first hidden layer and the following layers. In the second hidden layer, one neuron is introduced to separate {0, 1, 2, 3} into {0, 1} and {2, 3}, with threshold 1.5. After that, sub-pattern {2, 3} is transferred to {0, 1} by setting the weight value to -2 on the connection between the second hidden layer and the output layer. At last, the output neuron with threshold 0.5 separates the pattern {0, 1}. The whole procedure is presented in Fig. 2-10 below.

(a) Analysis  (b) Architecture
Fig. 2-10 Solve parity-11 problem using BMLP networks with two hidden layers, 11=2=1=1

Fig. 2-11 shows the 11=1=2=1 BMLP network with two hidden layers for solving the parity-11 problem.

(a) Analysis  (b) Architecture
Fig. 2-11 Solve parity-11 problem using BMLP networks with two hidden layers, 11=1=2=1

Generally, for a BMLP network with two hidden layers, the largest parity-N problem that can possibly be solved is

$$N = 2(m+1)(n+1) - 1 \tag{2-7}$$

where m and n are the numbers of neurons in the two hidden layers, respectively. For further derivation, one may notice that if there are k hidden layers and n_i is the number of neurons in the i-th hidden layer, where i ranges from 1 to k, then

$$N = 2(n_1+1)(n_2+1)\cdots(n_k+1) - 1 \tag{2-8}$$

2.3.4 FCC Networks

Fully connected cascade (FCC) networks can solve problems using the smallest possible number of neurons. In FCC networks, all possible feedforward connections are weighted, and each neuron constitutes its own layer. For the parity-7 problem, the simplified 8 training patterns are first divided by one neuron into {0, 1, 2, 3} and {4, 5, 6, 7}; the threshold of the neuron is 3.5. Then the sub-pattern {4, 5, 6, 7} is transferred to {0, 1, 2, 3} by weights equal to -4, connected to the following neurons. Again, using another neuron, the patterns in the second hidden layer {0, 1, 2, 3} can be separated into {0, 1} and {2, 3}; the threshold of this neuron is 1.5. In order to transfer the sub-pattern {2, 3} to {0, 1}, 2 should be subtracted from sub-pattern {2, 3}, which determines that the weight between the second layer and the output layer is -2. At last, an output neuron with threshold 0.5 is used to separate the pattern {0, 1}, see Fig. 2-12.

(a) Analysis  (b) Architecture
Fig. 2-12 Solve parity-7 problem using FCC networks

Fig. 2-13 shows the solution of the parity-15 problem using FCC networks.

(a) Analysis  (b) Architecture
Fig. 2-13 Solve parity-15 problem using FCC networks

Considering FCC networks as special BMLP networks with only one neuron in each hidden layer, for n neurons in an FCC network, the largest N for the parity-N problem can be derived from equation (2-8) as

$$N = 2\underbrace{(1+1)(1+1)\cdots(1+1)}_{n-1\ \text{factors}} - 1 \tag{2-9}$$

or

$$N = 2^n - 1 \tag{2-10}$$

2.3.5 Comparison of Different Topologies

Table 2-1 summarizes the network efficiency analysis above, giving the largest parity-N problem that can be solved with a given network structure. For example, with 5 neurons: an MLP network with only one hidden layer can solve the parity-4 problem (4-4-1 network); a BMLP network with a single hidden layer can solve the parity-11 problem (11=4=1 network); a BMLP network with two hidden layers can solve the parity-15 problem (15=3=1=1 network or 15=1=3=1 network) or the parity-17 problem (17=2=2=1 network); an FCC network can solve up to the parity-31 problem (31=1=1=1=1=1 network).

Table 2-1 Different architectures for solving parity-N problem

Network Architecture               Parameters                                 Largest Parity-N Problem
MLP with single hidden layer       n neurons                                  N = n - 1
BMLP with single hidden layer      n neurons                                  N = 2n - 1
BMLP with multiple hidden layers   k hidden layers, each with n_i neurons     N = 2(n_1+1)(n_2+1)...(n_k+1) - 1
FCC                                n neurons                                  N = 2^n - 1

Based on the comparison results shown in Table 2-1, one may draw the conclusion that, with more connections across layers, the networks become more powerful. The FCC architecture is the most powerful and can solve problems with a much smaller number of neurons.

2.4 Learning Algorithms

Many methods have already been developed for neural network training [62-68]. In this dissertation, we focus on gradient descent based optimization methods.

2.4.1 Introduction

The steepest descent algorithm, also known as the error backpropagation algorithm [61], is the most popular algorithm for neural network training; however, it is also known as an inefficient algorithm because of its slow convergence. There are two main reasons for the slow convergence. The first reason is that its step sizes need to be adjusted to the local gradients, as shown in Fig. 2-14.
Logically, a small step size should be taken where the gradient is steep, so that the search does not rattle out of the required minimum because of oscillation. If the step size is a constant, it therefore has to be chosen small, and then in regions where the gradient is gentle the training process becomes very slow. The second reason is that the curvature of the error surface may not be the same in all directions, as for the Rosenbrock function, so the classic "error valley" problem [75] may arise and can also result in slow convergence.

Fig. 2-14 Searching process of the steepest descent method with different learning constants: the yellow trajectory (left) is for a small learning constant, which leads to slow convergence; the purple trajectory (right) is for a large learning constant, which causes oscillation (divergence)

The slow convergence of the steepest descent method can be greatly improved by the Gauss-Newton algorithm [75]. Using second order derivatives of the error function to "naturally" evaluate the curvature of the error surface, the Gauss-Newton algorithm can find proper step sizes for each direction and converge very fast. In particular, if the error function has a quadratic surface, it can converge directly in the first iteration. This improvement, however, only happens when the quadratic approximation of the error function is reasonable; otherwise, the Gauss-Newton algorithm is mostly divergent.

The Levenberg Marquardt algorithm [24-25][76] blends the steepest descent method and the Gauss-Newton algorithm. Fortunately, it inherits the speed advantage of the Gauss-Newton algorithm and the stability of the steepest descent method. It is more robust than the Gauss-Newton algorithm, because in many cases it converges well even when the error surface is much more complex than the quadratic situation. Although the Levenberg Marquardt algorithm tends to be a bit slower than the Gauss-Newton algorithm (when the latter converges), it converges much faster than the steepest descent method. The basic idea of the Levenberg Marquardt algorithm is that it performs a combined training process: around areas with complex curvature it switches to the steepest descent algorithm, until the local curvature is proper for a quadratic approximation; then it approximately becomes the Gauss-Newton algorithm, which speeds up the convergence significantly.

In the following sections, the four basic gradient methods are introduced: (1) the steepest descent method; (2) the Newton method; (3) the Gauss-Newton algorithm; and (4) the Levenberg Marquardt algorithm.

The sum square error (SSE) E is defined as the objective function to evaluate the training process. For all training patterns and network outputs, it is calculated by

E(\mathbf{x}, \mathbf{w}) = \frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p,m}^2    (2-11)

Where: x and w are the input vector and weight vector respectively; p is the index of training patterns, from 1 to P, where P is the number of training patterns; m is the index of outputs, from 1 to M, where M is the number of outputs; e_{p,m} is the training error at output m when applying pattern p, defined as

e_{p,m} = d_{p,m} - o_{p,m}    (2-12)

Where: d is the desired output vector and o is the actual output vector.

2.4.2 Steepest Descent Algorithm

The steepest descent algorithm is a first order algorithm. It uses the first order derivative of the total error function to find the minimum in error space. Normally, the gradient g is defined as the first order derivative of the total error function (2-11):

\mathbf{g} = \frac{\partial E(\mathbf{x}, \mathbf{w})}{\partial \mathbf{w}} = \left[ \frac{\partial E}{\partial w_1}\;\; \frac{\partial E}{\partial w_2}\;\; \cdots\;\; \frac{\partial E}{\partial w_N} \right]^T    (2-13)

Where: N is the number of weights.
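As a concrete illustration of definitions (2-11)-(2-13), the following minimal sketch (not part of the dissertation's software) evaluates the SSE and accumulates the gradient for a hypothetical single linear neuron o = w^T x, for which the derivative of each error is simply -x_{p,i}; the data and weight values are placeholders.

```cpp
// Minimal sketch: SSE objective (2-11)-(2-12) and gradient (2-13) for a
// hypothetical linear neuron o = w^T x, where de_p/dw_i = -x_{p,i}.
#include <cstdio>
#include <vector>

int main() {
    // Placeholder data: P = 4 patterns, bias input plus 2 inputs, one output.
    std::vector<std::vector<double>> x = {{1,-1,-1},{1,-1,1},{1,1,-1},{1,1,1}};
    std::vector<double> d = {-1, 1, 1, -1};    // desired outputs
    std::vector<double> w = {0.1, -0.2, 0.3};  // current weights (bias first)

    double E = 0.0;
    std::vector<double> g(w.size(), 0.0);
    for (size_t p = 0; p < x.size(); ++p) {
        double o = 0.0;                            // actual output o = w^T x
        for (size_t i = 0; i < w.size(); ++i) o += w[i] * x[p][i];
        double e = d[p] - o;                       // error (2-12)
        E += 0.5 * e * e;                          // SSE contribution (2-11)
        for (size_t i = 0; i < w.size(); ++i)      // dE/dw_i = -e * x_{p,i}
            g[i] += -e * x[p][i];
    }
    std::printf("E = %f, g = [%f %f %f]\n", E, g[0], g[1], g[2]);
    return 0;
}
```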
With the definition of the gradient g in (2-13), the update rule of the steepest descent algorithm can be written as

\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha \mathbf{g}_k    (2-14)

Where: α is the learning constant (step size) and k is the index of training iterations. The steepest descent algorithm converges only asymptotically, so it never exactly reaches the minimum; around the solution, all the elements of the gradient vector g become very small and the weight changes become tiny.

2.4.3 Newton Method

The Newton method assumes that all the gradient components g = {g_1, g_2, ..., g_N} are functions of the weights and that all weights are linearly independent:

g_i = F_i(w_1, w_2, \cdots, w_N), \quad i = 1, 2, \cdots, N    (2-15)

Where: {F_1, F_2, ..., F_N} are nonlinear relationships between the weights and the related gradient components. Expanding each g_i in (2-15) by a Taylor series and taking the first order approximation gives

g_i \approx g_{i,0} + \sum_{j=1}^{N} \frac{\partial g_i}{\partial w_j} \Delta w_j, \quad i = 1, 2, \cdots, N    (2-16)

From the definition of the gradient vector g in (2-13), it follows that

\frac{\partial g_i}{\partial w_j} = \frac{\partial}{\partial w_j}\!\left(\frac{\partial E}{\partial w_i}\right) = \frac{\partial^2 E}{\partial w_i \partial w_j}    (2-17)

Where: i and j are indices of weights, from 1 to N. Inserting equation (2-17) into (2-16):

g_i \approx g_{i,0} + \sum_{j=1}^{N} \frac{\partial^2 E}{\partial w_i \partial w_j} \Delta w_j, \quad i = 1, 2, \cdots, N    (2-18)

Compared with the steepest descent method, the second order derivatives of the total error function have to be calculated for each component of the gradient vector. In order to reach the minimum of the total error function E, each element of the gradient vector should be zero. Setting the left-hand sides of equations (2-18) to zero gives

0 \approx g_{i,0} + \sum_{j=1}^{N} \frac{\partial^2 E}{\partial w_i \partial w_j} \Delta w_j, \quad i = 1, 2, \cdots, N    (2-19)

By combining equation (2-13) with (2-19):

-\frac{\partial E}{\partial w_i} = -g_{i,0} \approx \sum_{j=1}^{N} \frac{\partial^2 E}{\partial w_i \partial w_j} \Delta w_j, \quad i = 1, 2, \cdots, N    (2-20)

There are N equations for N parameters, so all Δw_i can be calculated, and with these solutions the weight vector can be updated iteratively. Equations (2-20) can also be written in matrix form:

-\begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_N \end{bmatrix}
= -\begin{bmatrix} \frac{\partial E}{\partial w_1} \\ \frac{\partial E}{\partial w_2} \\ \vdots \\ \frac{\partial E}{\partial w_N} \end{bmatrix}
= \begin{bmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_N} \\
\frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_N \partial w_1} & \frac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_N^2}
\end{bmatrix}
\begin{bmatrix} \Delta w_1 \\ \Delta w_2 \\ \vdots \\ \Delta w_N \end{bmatrix}    (2-21)

Where: the square matrix is the N×N Hessian matrix H:

\mathbf{H} = \begin{bmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_N} \\
\frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_N \partial w_1} & \frac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_N^2}
\end{bmatrix}    (2-22)

Combining equations (2-13) and (2-22) with (2-21):

-\mathbf{g} = \mathbf{H} \Delta \mathbf{w}    (2-23)

so that

\Delta \mathbf{w} = -\mathbf{H}^{-1} \mathbf{g}    (2-24)
Therefore, the update rule for the Newton method is

\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}_k^{-1} \mathbf{g}_k    (2-25)

As the matrix of second order derivatives of the total error function, the Hessian matrix H gives a proper evaluation of the change of the gradient vector. Comparing equations (2-14) and (2-25), one may notice that well-matched step sizes are provided by the inverted Hessian matrix.

2.4.4 Gauss-Newton Algorithm

If the Newton method is applied for weight updating, the second order derivatives of the total error function have to be calculated in order to obtain the Hessian matrix H, and this can be very complicated. In order to simplify the calculation, the Jacobian matrix J is introduced as

\mathbf{J} = \begin{bmatrix}
\frac{\partial e_{1,1}}{\partial w_1} & \frac{\partial e_{1,1}}{\partial w_2} & \cdots & \frac{\partial e_{1,1}}{\partial w_N} \\
\frac{\partial e_{1,2}}{\partial w_1} & \frac{\partial e_{1,2}}{\partial w_2} & \cdots & \frac{\partial e_{1,2}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{1,M}}{\partial w_1} & \frac{\partial e_{1,M}}{\partial w_2} & \cdots & \frac{\partial e_{1,M}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{P,1}}{\partial w_1} & \frac{\partial e_{P,1}}{\partial w_2} & \cdots & \frac{\partial e_{P,1}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{P,M}}{\partial w_1} & \frac{\partial e_{P,M}}{\partial w_2} & \cdots & \frac{\partial e_{P,M}}{\partial w_N}
\end{bmatrix}    (2-26)

By differentiating (2-11) as in (2-13), the elements of the gradient vector can be calculated as

g_i = \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\!\left(\frac{1}{2}\sum_{p=1}^{P}\sum_{m=1}^{M} e_{p,m}^2\right) = \sum_{p=1}^{P}\sum_{m=1}^{M} \frac{\partial e_{p,m}}{\partial w_i}\, e_{p,m}    (2-27)

Combining equations (2-26) and (2-27), the relationship between the Jacobian matrix J and the gradient vector g is

\mathbf{g} = \mathbf{J}^T \mathbf{e}    (2-28)

Where: the error vector e has the form

\mathbf{e} = \left[ e_{1,1}\;\; e_{1,2}\;\; \cdots\;\; e_{1,M}\;\; \cdots\;\; e_{P,1}\;\; e_{P,2}\;\; \cdots\;\; e_{P,M} \right]^T    (2-29)

Inserting equation (2-11) into (2-22), the element in the i-th row and j-th column of the Hessian matrix can be calculated as

h_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{P}\sum_{m=1}^{M} \frac{\partial e_{p,m}}{\partial w_i}\frac{\partial e_{p,m}}{\partial w_j} + S_{i,j}    (2-30)

Where: S_{i,j} is equal to

S_{i,j} = \sum_{p=1}^{P}\sum_{m=1}^{M} \frac{\partial^2 e_{p,m}}{\partial w_i \partial w_j}\, e_{p,m}    (2-31)

The basic assumption of the Gauss-Newton approximation is that S_{i,j} is close to zero, so the relationship between the Hessian matrix H and the Jacobian matrix J can be rewritten as

\mathbf{H} \approx \mathbf{J}^T \mathbf{J}    (2-32)

By combining equations (2-25), (2-28) and (2-32), the update rule of the Gauss-Newton algorithm is

\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}_k^T \mathbf{J}_k\right)^{-1} \mathbf{J}_k^T \mathbf{e}_k    (2-33)

Obviously, the advantage of the Gauss-Newton algorithm over the standard Newton method (equation 2-25) is that, by introducing the Jacobian matrix J, it does not require the calculation of second order derivatives of the total error function. However, the Gauss-Newton algorithm still faces the same convergence problem as the Newton algorithm for complex error-surface optimization; mathematically, the problem can be interpreted as the matrix J^T J possibly not being invertible.

2.4.5 Levenberg Marquardt Algorithm

In order to make sure that the approximated Hessian matrix J^T J is invertible, the Levenberg Marquardt algorithm introduces another approximation to the Hessian matrix:

\mathbf{H} \approx \mathbf{J}^T \mathbf{J} + \mu \mathbf{I}    (2-34)

Where: μ is always positive and is called the combination coefficient, and I is the identity matrix. From equation (2-34), one may notice that the elements on the main diagonal of the approximated Hessian matrix are larger than zero; therefore, with this approximation (2-34), the matrix H is always invertible. By combining equations (2-33) and (2-34), the update rule of the Levenberg Marquardt algorithm can be written as

\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}_k^T \mathbf{J}_k + \mu \mathbf{I}\right)^{-1} \mathbf{J}_k^T \mathbf{e}_k    (2-35)

As the combination of the steepest descent algorithm and the Gauss-Newton algorithm, the Levenberg Marquardt algorithm switches between the two algorithms during the training process.
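The following is a minimal sketch of one Levenberg Marquardt weight update according to (2-35), written under simplifying assumptions: the Jacobian and error values are placeholders, and the linear system is solved with naive Gauss-Jordan elimination rather than a production-quality solver.

```cpp
// Minimal sketch: one LM weight change dw = (J^T J + mu*I)^{-1} J^T e  (2-35),
// for placeholder J and e; the weights would then be updated as w <- w - dw.
#include <cstdio>
#include <vector>
using Mat = std::vector<std::vector<double>>;

std::vector<double> solve(Mat A, std::vector<double> b) {   // naive Gauss-Jordan, A x = b
    int n = (int)b.size();
    for (int i = 0; i < n; ++i) {
        double piv = A[i][i];
        for (int j = i; j < n; ++j) A[i][j] /= piv;
        b[i] /= piv;
        for (int k = 0; k < n; ++k) if (k != i) {
            double f = A[k][i];
            for (int j = i; j < n; ++j) A[k][j] -= f * A[i][j];
            b[k] -= f * b[i];
        }
    }
    return b;
}

int main() {
    Mat J = {{0.5, 1.0}, {-0.3, 0.8}, {0.9, -0.2}};  // placeholder Jacobian: P*M = 3 rows, N = 2 weights
    std::vector<double> e = {0.2, -0.1, 0.4};        // placeholder error vector
    double mu = 0.01;                                // combination coefficient
    const int N = 2;

    Mat A(N, std::vector<double>(N, 0.0));           // A = J^T J + mu*I
    std::vector<double> g(N, 0.0);                   // g = J^T e
    for (size_t r = 0; r < J.size(); ++r)
        for (int i = 0; i < N; ++i) {
            g[i] += J[r][i] * e[r];
            for (int j = 0; j < N; ++j) A[i][j] += J[r][i] * J[r][j];
        }
    for (int i = 0; i < N; ++i) A[i][i] += mu;

    std::vector<double> dw = solve(A, g);            // weight change of (2-35)
    std::printf("dw = [%f %f]\n", dw[0], dw[1]);
    return 0;
}
```

In a full implementation, the combination coefficient mu would be increased or decreased between iterations depending on whether the update reduces the SSE; this adaptation is what realizes the switching behavior described above and is omitted from the sketch.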
When the combination coefficient μ is very small (nearly zero), equation (2-35) approaches equation (2-33) and the Gauss-Newton algorithm is used. When the combination coefficient μ is very large, equation (2-35) approximates equation (2-14) and the steepest descent method is used; in that case μ can be interpreted through the learning coefficient of the steepest descent method (2-14):

\alpha = \frac{1}{\mu}    (2-36)

2.4.6 Comparison of Different Algorithms

Table 2-2 summarizes the update rules of the four algorithms above and their properties.

Table 2-2 Specifications of different learning algorithms
Learning Algorithms           | Update Rules                                        | Convergence    | Computation Complexity
EBP algorithm                 | w_{k+1} = w_k - \alpha g_k                          | Stable, slow   | Gradient
Newton algorithm              | w_{k+1} = w_k - H_k^{-1} g_k                        | Unstable, fast | Gradient and Hessian
Gauss-Newton algorithm        | w_{k+1} = w_k - (J_k^T J_k)^{-1} J_k^T e_k          | Unstable, fast | Jacobian
Levenberg Marquardt algorithm | w_{k+1} = w_k - (J_k^T J_k + \mu I)^{-1} J_k^T e_k  | Stable, fast   | Jacobian

In order to compare the behavior of the different learning algorithms, let us use the parity-3 problem as an example. The training patterns of the parity-3 problem are shown in Fig. 2-15a; inputs and outputs are bipolar:

Inputs        Output
-1 -1 -1       -1
-1 -1  1        1
-1  1 -1        1
-1  1  1       -1
 1 -1 -1        1
 1 -1  1       -1
 1  1 -1       -1
 1  1  1        1

(a) Training patterns (b) MLP network 3-2-1
Fig. 2-15 Parity-3 data and network architecture

Three neurons in the 3-2-1 MLP network of Fig. 2-15b are used for training, and the required training error is 0.01. Convergence rates are tested by repeating each case for 100 trials with randomly generated initial weights.

(a) EBP algorithm (α=1) (b) EBP algorithm (α=100) (c) Gauss-Newton algorithm (d) Levenberg Marquardt algorithm
Fig. 2-16 Training results of the parity-3 problem

Table 2-3 Comparison among different learning algorithms for the parity-3 problem
Algorithms             | Convergence Rate | Average Iterations | Average Time (ms)
EBP algorithm (α=1)    | 100%             | 1646.52            | 320.6
EBP algorithm (α=100)  | 79%              | 171.48             | 36.5
Gauss-Newton algorithm | 3%               | 4.33               | 1.2
LM algorithm           | 100%             | 6.18               | 1.6

The training results are shown in Fig. 2-16 and the comparison is presented in Table 2-3. It can be concluded that:
- For the EBP algorithm, the larger the training constant α is, the faster but less stable the training process becomes (Figs. 2-16a and 2-16b);
- The Gauss-Newton algorithm computes very fast, but it seldom converges (Fig. 2-16c);
- The Levenberg Marquardt algorithm is much faster than the EBP algorithm and more stable than the Gauss-Newton algorithm (Fig. 2-16d).

For more complex parity-N problems, the Gauss-Newton algorithm cannot converge at all, and the EBP algorithm also becomes more time-consuming and finds solutions with greater difficulty, while the Levenberg Marquardt algorithm can still perform successful training.

Another example is the two-spiral classification problem [77], which is often considered a very complex benchmark for evaluating the efficiency of learning algorithms and network architectures. As shown in Fig. 2-17, the purpose of the two-spiral problem is to separate two groups of twisted points (red circles and blue stars).

Fig. 2-17 Two-spiral problem: separation of two groups of points

Fig. 2-18 presents the training results for the two-spiral problem, using the EBP and LM algorithms. In both cases, fully connected cascade (FCC) networks were used; the desired sum squared error was 0.01; the maximum number of iterations was 1,000,000 for the EBP algorithm and 1,000 for the LM algorithm.
The LM algorithm was implemented through the NBN algorithm [78-79], so as to be able to handle FCC networks. The EBP algorithm not only requires much more time than the LM algorithm (Fig. 2-18a), but is also unable to solve the problem unless an excessive number of neurons is used. The EBP algorithm requires at least 12 neurons, while the second order algorithm can solve the problem with much smaller networks, for example with 7 neurons (Fig. 2-18b).

(a) Average training time (b) Success rate
Fig. 2-18 Comparison between the EBP algorithm and the LM algorithm for different numbers of neurons in fully connected cascade networks

Fig. 2-19 shows the training results of the two-spiral problem, using 16 neurons in a fully connected cascade network, for both the EBP algorithm and the LM algorithm. One may notice that, with the same topology, the LM algorithm is able to find better solutions than those found using the EBP algorithm.

(a) EBP algorithm (b) LM algorithm
Fig. 2-19 Training results of the two-spiral problem with 16 neurons in a fully connected cascade network

In conclusion, the Levenberg Marquardt algorithm is the most efficient of the gradient-based algorithms and is recommended for neural network learning; however, its computation is much more demanding than that of first order gradient methods.

2.5 Generalization Ability

Neural networks can work as universal approximators [22], but only after a successful training/learning process. Generalization is defined to evaluate the ability of trained neural networks to handle new patterns which were not used for training. In order to obtain neural networks with good generalization ability, the over-fitting problem [28] should be avoided during the training process.

2.5.1 The Over-fitting Problem

The over-fitting problem is critical for designing neural networks with good generalization ability. When over-fitting happens, the trained neural networks fit the training patterns very precisely, but they respond poorly to new patterns which were not used for training.

Let us use an example to illustrate the existence of the over-fitting problem in neural network training. The purpose of the example is to approximate a two-dimensional Gaussian-like peak function f(x, y). As shown in Fig. 2-20, the training patterns consist of 6×6=36 points (Fig. 2-20a) uniformly distributed in the sampling range x in [0, 10] and y in [0, 10]. After training, another 101×101=10,201 points (Fig. 2-20b, also uniformly distributed) in the same sampling range are applied to test the trained neural networks.

(a) Training patterns, 6×6=36 points (b) Testing patterns, 101×101=10,201 points
Fig. 2-20 Function approximation problem

Using the most powerful neural network architecture (as analyzed in section 2.3.5), the fully connected cascade (FCC) network, the testing results of trained networks consisting of different numbers of neurons are shown in Fig. 2-21.

(a) 2 neurons (b) 3 neurons (c) 4 neurons (d) 5 neurons (e) 6 neurons (f) 7 neurons (g) 8 neurons (h) 9 neurons
Fig. 2-21 Approximation results of FCC networks with different numbers of neurons

Table 2-4 presents the training and testing sum square errors (SSEs) of FCC networks with different numbers of neurons.

Table 2-4 Training/testing SSEs of different sizes of FCC networks
Number of Neurons | Training SSE | Testing SSE
2                 | 2.43055      | 678.7768
3                 | 1.17843      | 346.0761
4                 | 0.13682      | 49.6832
5                 | 0.00591      | 1.7712
6                 | 0.00022      | 0.2809
7                 | 0.00008      | 7.3590
8                 | 0.00003      | 249.3378
9                 | 0.00000008   | 142.3883
From the results presented in Fig. 2-21 and Table 2-4, one may notice that, as the network size increases, the training errors keep decreasing steadily; the testing errors, however, decrease at first (while the number of neurons is below 6) and then become larger and unpredictable as more neurons are added. When the FCC networks consist of 5 or 6 neurons, very good approximation results are obtained.

2.5.2 Analytical Solutions

Based on the experiment above, one may notice that the basic reason for the over-fitting problem in neural network design can be ascribed to the mismatch between the size of the training data set and the size of the network. Normally, training very simple patterns with an improperly large network may result in over-fitting. In other words, there are two very straightforward ways to reduce the probability of over-fitting in neural network design:
- Increase the size of the training data set
- Decrease the size of the neural network

For the first method, it is always good to obtain as many training patterns as possible; however, this strategy is only applicable in practice when extra measurements can be performed. For the second method, one may notice that, in order to preserve the generalization ability of neural networks, the size of the networks should be as small as possible. From this point of view, the EBP algorithm is not a good choice for designing compact neural networks because of its slow convergence and poor search ability; in order to overcome these two main disadvantages of the EBP algorithm, networks much larger than the optimal size are often used for training. The Levenberg Marquardt (LM) algorithm is very efficient for neural network training and has a much more powerful search ability. With these properties, the LM algorithm is well suited to designing compact neural networks in practical applications. However, the most famous implementation of the LM algorithm, the Hagan and Menhaj LM algorithm [80], works only for MLP networks, which are much less efficient than networks with connections across layers, such as BMLP networks and FCC networks. The recently developed neuron-by-neuron (NBN) algorithm [27] removes the architecture limitation of the Hagan and Menhaj LM algorithm and can handle arbitrarily connected neural networks using the second order update rule. Therefore, the combination of the NBN algorithm and BMLP/FCC networks is recommended in the literature [28] for designing compact neural networks, so as to reduce the probability of over-fitting.

2.6 Neuron-by-Neuron Algorithm

The neuron-by-neuron (NBN) algorithm [27] was proposed to remove the network architecture limitation of the Hagan and Menhaj LM algorithm, so that second order algorithms can be applied to train very efficient network architectures with connections across layers [74]. The NBN algorithm adopts the node-indexing method used in the SPICE circuit simulator, and it consists of two steps, forward computation and backward computation, to gather the information required for the Jacobian matrix computation in equation (2-26), using

j_{p,m,n} = \frac{\partial e_{p,m}}{\partial w_n} = \frac{\partial (d_{p,m} - o_{p,m})}{\partial w_n} = -\frac{\partial o_{p,m}}{\partial o_i} \cdot \frac{\partial o_i}{\partial net_i} \cdot \frac{\partial net_i}{\partial w_n} = -\delta_i \, s_i \, y_{i,n}    (2-38)

Where: j_{p,m,n} is the element of the Jacobian matrix in (2-26) related to pattern p, output m and weight n, and w_n is a weight of neuron i. Equation (2-38) is derived from (2-12) and (2-26), using the chain rule of differentiation.
Vector δ is defined to measure the error backpropagation process [81]; vector y_i consists of the inputs of neuron i, which may be either network inputs or outputs of other neurons; s_i is the slope (derivative of the activation function) of neuron i; and i is the index of neurons. In the forward computation, neurons are organized according to the direction of signal propagation, while in the backward computation the analysis follows the error backpropagation procedure, as in first order algorithms.

In order to illustrate the computation process of the NBN algorithm, let us consider the network architecture with arbitrary connections shown in Fig. 2-22.

Fig. 2-22 Arbitrarily connected neural network indexed by the NBN algorithm

For the network in Fig. 2-22, using the NBN algorithm, the network topology can be described as

N1 3 1 2
N2 4 1 2
N3 5 3 4
N4 6 1 2 4 5
N5 7 3 5 6

Notice that each line represents the connections to a given neuron. The first entry, N1 to N5, is the neuron index. Then, the first digit of each line is the node index of that neuron, and the remaining digits are the nodes connected to the specified neuron as inputs. With these rules, one may notice that, for each neuron, the input nodes must have smaller indices than the index of the neuron itself.

In the forward computation, the neurons connected to the network inputs are processed first, so that their outputs can be used as inputs to the subsequent neurons. The following neurons are then processed as soon as all their input values become available; in other words, the selected computing sequence has to follow the concept of feedforward signal propagation. If a signal reaches the inputs of several neurons at the same time, these neurons can be processed in any order. In the example of Fig. 2-22, there are two possible sequences in which the neurons can be processed in the forward direction: N1 N2 N3 N4 N5 or N2 N1 N3 N4 N5. The two procedures follow different computing orders but lead to exactly the same results. When the forward computation is done, both the vector y and the slope vector s in equation (2-38) are obtained.

The sequence of the backward computation is opposite to the forward computation sequence: the process starts with the last neuron and continues toward the inputs. In the case of the network in Fig. 2-22, there are two possible backpropagation paths, N5 N4 N3 N2 N1 and N5 N4 N3 N1 N2; again, different paths lead to the same results. In this example, let us use the sequence N5 N4 N3 N2 N1 to illustrate how the vector δ is calculated in the backward computation. Notice that the vector δ represents a signal propagating from a network output back to the inputs of all other neurons, so the size of the vector δ is equal to the number of neurons. For the output neuron N5, it is initialized as δ5 = 1. For neuron N4, δ5 is propagated through the slope of neuron N5 and then through the weight w4,5 connected between neurons N4 and N5, so as to obtain δ4 = δ5 s5 w4,5. For neuron N3, both δ4 and δ5 are propagated along two separate paths to the output of neuron N3 and then summed, giving δ3 = δ4 s4 w3,4 + δ5 s5 w3,5. Following the same rule, one obtains δ2 = δ3 s3 w2,3 + δ4 s4 w2,4 and δ1 = δ3 s3 w1,3 + δ5 s5 w1,5. After the backward computation, all the elements of vector δ in equation (2-38) are calculated.

With the forward and backward computation, all the neuron outputs y, the slopes s, and the vector δ are calculated.
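The backward computation described above fits in a few lines of code. The sketch below uses hypothetical slope and weight values for the network of Fig. 2-22 and simply evaluates the δ formulas in the N5-N4-N3-N2-N1 order; the numbers are placeholders, not values from the dissertation.

```cpp
// Minimal sketch: backward computation of the delta vector for the 5-neuron
// network of Fig. 2-22, following the N5-N4-N3-N2-N1 sequence from the text.
#include <cstdio>

int main() {
    // Slopes s_i obtained in the forward pass (placeholder values).
    double s3 = 0.22, s4 = 0.18, s5 = 0.24;
    // Inter-neuron weights w_{j,k} between neuron j and neuron k (placeholders).
    double w13 = 0.5, w23 = -0.4, w24 = 0.3, w34 = 0.7,
           w15 = -0.2, w35 = 0.6, w45 = 0.9;

    double d5 = 1.0;                              // output neuron: delta initialized to 1
    double d4 = d5 * s5 * w45;                    // single path N4 -> N5
    double d3 = d4 * s4 * w34 + d5 * s5 * w35;    // two paths: through N4 and directly to N5
    double d2 = d3 * s3 * w23 + d4 * s4 * w24;    // paths through N3 and N4
    double d1 = d3 * s3 * w13 + d5 * s5 * w15;    // paths through N3 and N5

    std::printf("delta = [%g %g %g %g %g]\n", d1, d2, d3, d4, d5);
    return 0;
}
```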
Then, using equation (2-38), all the elements of the Jacobian matrix can be obtained. In the NBN computation above, neurons are analyzed one by one, following a sequence determined by the network architecture. This property makes the NBN algorithm capable of handling networks consisting of arbitrarily connected neurons.

CHAPTER 3
PROBLEMS IN SECOND ORDER ALGORITHMS

The very efficient second order Levenberg Marquardt (LM) algorithm [24-25] was adapted for neural network training by Hagan and Menhaj [80] and later implemented in the MATLAB Neural Network Toolbox [82]. The LM algorithm uses significantly more information describing the error surface than just the gradient elements used in the EBP algorithm. As a consequence, the LM algorithm is not only fast but can also train neural networks for which the EBP algorithm has difficulty converging [28]. Many researchers now use the Hagan and Menhaj LM algorithm for neural network training, but this implementation has several disadvantages:

(1) The Hagan and Menhaj LM algorithm requires the inversion of a quasi Hessian matrix of size nw×nw in every iteration, where nw is the number of weights. Because of this matrix inversion in every iteration, the speed advantage of the LM algorithm over the EBP algorithm becomes less evident as the network size increases.

(2) The Hagan and Menhaj LM algorithm was developed only for multilayer perceptron (MLP) neural networks. Therefore, much more powerful and efficient networks, such as fully connected cascade (FCC) or bridged multilayer perceptron (BMLP) architectures, cannot be trained.

(3) The Hagan and Menhaj LM algorithm cannot be used for problems with many training patterns, because the Jacobian matrix becomes prohibitively large.

(4) The implementation of the Hagan and Menhaj LM algorithm calculates the elements of the Jacobian matrix using basically the same routines as in the EBP algorithm. The difference is that the error backpropagation process (for Jacobian matrix computation) must be carried out not only for every pattern but also for every output separately, so for networks with multiple outputs the backpropagation process has to be repeated for each output.

Problem (1) is inherited from the original Levenberg Marquardt algorithm and remains unsolved, so the LM algorithm can be used only for small and medium size neural networks. Since the LM algorithm often solves problems with very compact networks, problem (1) is partially compensated by this powerful search ability. Problem (2) was solved by the recently developed neuron-by-neuron (NBN) algorithm, as discussed in section 2.6, but this algorithm requires very complex computation; the NBN algorithm also inherits problems (3) and (4) of the Hagan and Menhaj LM algorithm. Problem (3) is the memory limitation, which makes second order algorithms unsuitable for problems with a large number of patterns. This is a serious issue for second order algorithms, since in practical problems the number of training patterns is very large and is encouraged to be as large as possible. Problem (4) is the computational redundancy, which makes second order algorithms relatively complicated and inefficient for training networks with multiple outputs. Moreover, when the backward computation of problem (4) is not needed, networks with arbitrarily connected neurons become easier to handle.
In the following two chapters, we introduce two methods, the improved second order computation and the forward-only algorithm, as potential solutions to the memory limitation of problem (3) and the computational redundancy of problem (4), respectively.

CHAPTER 4
IMPROVED SECOND ORDER COMPUTATION

The improved second order computation presented in this chapter aims to optimize the neural network learning process with the Levenberg Marquardt (LM) algorithm. The quasi Hessian matrix and the gradient vector are computed directly, without Jacobian matrix multiplication and storage, which solves the memory limitation problem of LM training. Considering the symmetry of the quasi Hessian matrix, only the elements in its upper (or lower) triangular part need to be calculated. Therefore, the training speed is improved significantly, not only because of the smaller arrays stored in memory, but also because of the reduced number of operations in the quasi Hessian matrix calculation. The memory and time savings are especially pronounced for training with large numbers of patterns.

In this chapter, the computational fundamentals of the LM algorithm are first reviewed to expose the memory problem. Second, the improved computations for both the quasi Hessian matrix and the gradient vector are described in detail. Third, a simple problem is used to illustrate the implementation of the improved computation. Finally, several experimental results are presented, comparing the memory and training time of the traditional computation and the improved computation.

4.1 Problem Description

Derived from the steepest descent method and the Newton algorithm, the update rule of the Levenberg Marquardt algorithm is [76]

\Delta \mathbf{w} = \left(\mathbf{J}^T \mathbf{J} + \mu \mathbf{I}\right)^{-1} \mathbf{J}^T \mathbf{e}    (4-1)

Where: w is the weight vector, I is the identity matrix, μ is the combination coefficient, and the (P×M)×N Jacobian matrix J and (P×M)×1 error vector e are defined as

\mathbf{J} = \begin{bmatrix}
\frac{\partial e_{11}}{\partial w_1} & \frac{\partial e_{11}}{\partial w_2} & \cdots & \frac{\partial e_{11}}{\partial w_N} \\
\frac{\partial e_{12}}{\partial w_1} & \frac{\partial e_{12}}{\partial w_2} & \cdots & \frac{\partial e_{12}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{1M}}{\partial w_1} & \frac{\partial e_{1M}}{\partial w_2} & \cdots & \frac{\partial e_{1M}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{P1}}{\partial w_1} & \frac{\partial e_{P1}}{\partial w_2} & \cdots & \frac{\partial e_{P1}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\frac{\partial e_{PM}}{\partial w_1} & \frac{\partial e_{PM}}{\partial w_2} & \cdots & \frac{\partial e_{PM}}{\partial w_N}
\end{bmatrix}
\qquad
\mathbf{e} = \begin{bmatrix} e_{11} \\ e_{12} \\ \vdots \\ e_{1M} \\ \vdots \\ e_{P1} \\ \vdots \\ e_{PM} \end{bmatrix}    (4-2)

Where: P is the number of training patterns, M is the number of outputs and N is the number of weights. The elements of the error vector e are calculated by

e_{pm} = d_{pm} - o_{pm}    (4-3)

Where: d_{pm} and o_{pm} are the desired output and the actual output, respectively, at network output m when training pattern p is applied.

Traditionally, the Jacobian matrix J is calculated and stored first, and the Jacobian matrix multiplications are then performed for weight updating using (4-1). For problems with small and medium numbers of patterns this method works smoothly; however, for problems with a large number of patterns there is a memory limitation for storing the Jacobian matrix J. For example, the pattern recognition problem on the MNIST handwritten digit database [83] consists of 60,000 training patterns, 784 inputs and 10 outputs. Using only the simplest possible neural network with 10 neurons (one neuron per output), the memory cost for storing the entire Jacobian matrix is nearly 35 gigabytes. This huge memory requirement cannot be satisfied by any 32-bit Windows compiler, where a single array is limited to roughly 3 gigabytes. At this point, with the traditional computation, one may conclude that the Levenberg Marquardt algorithm cannot be used for problems with a large number of patterns.
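The 35-gigabyte figure follows directly from the Jacobian dimensions; the short calculation below reproduces it under the assumption of 8-byte double precision storage.

```cpp
// Minimal sketch: memory estimate behind the MNIST example of Section 4.1.
// One output neuron per digit gives N = 10*(784+1) weights; the full Jacobian
// holds (P*M)*N elements.
#include <cstdio>

int main() {
    double P = 60000, M = 10, N = 10 * (784 + 1);   // patterns, outputs, weights
    double bytes = P * M * N * sizeof(double);      // 8 bytes per element
    std::printf("Jacobian: %.0f x %.0f elements, about %.1f GB\n",
                P * M, N, bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```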
4.2 Improved Computation

In the following derivation, the sum squared error (SSE) is used to evaluate the training process:

E(\mathbf{w}) = \frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{pm}^2    (4-4)

Where: e_{pm} is the error at output m obtained by training pattern p, defined by (4-3). The N×N Hessian matrix H is

\mathbf{H} = \begin{bmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_N} \\
\frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_N \partial w_1} & \frac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_N^2}
\end{bmatrix}    (4-5)

Where: N is the number of weights. Combining (4-4) and (4-5), the elements of the Hessian matrix H can be obtained as

\frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{P} \sum_{m=1}^{M} \left( \frac{\partial e_{pm}}{\partial w_i} \frac{\partial e_{pm}}{\partial w_j} + \frac{\partial^2 e_{pm}}{\partial w_i \partial w_j}\, e_{pm} \right)    (4-6)

Where: i and j are weight indices. For the LM algorithm, equation (4-6) is approximated as [76]

\frac{\partial^2 E}{\partial w_i \partial w_j} \approx \sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{pm}}{\partial w_i} \frac{\partial e_{pm}}{\partial w_j} = q_{ij}    (4-7)

Where: q_{ij} is the element of the quasi Hessian matrix in row i and column j. Combining (4-2) and (4-7), the quasi Hessian matrix Q can be calculated as an approximation of the Hessian matrix:

\mathbf{H} \approx \mathbf{Q} = \mathbf{J}^T \mathbf{J}    (4-8)

The N×1 gradient vector g is

\mathbf{g} = \left[ \frac{\partial E}{\partial w_1}\;\; \frac{\partial E}{\partial w_2}\;\; \cdots\;\; \frac{\partial E}{\partial w_N} \right]^T    (4-9)

Inserting (4-4) into (4-9), the elements of the gradient can be calculated as

g_i = \frac{\partial E}{\partial w_i} = \sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{pm}}{\partial w_i}\, e_{pm}    (4-10)

From (4-2) and (4-10), the relationship between the gradient vector g and the Jacobian matrix J is

\mathbf{g} = \mathbf{J}^T \mathbf{e}    (4-11)

Combining (4-8), (4-11) and (4-1), the update rule of the Levenberg Marquardt algorithm can be rewritten as

\Delta \mathbf{w} = \left(\mathbf{Q} + \mu \mathbf{I}\right)^{-1} \mathbf{g}    (4-12)

One may notice that the sizes of the quasi Hessian matrix Q and the gradient vector g are proportional to the number of weights in the network, but they do not depend on the number of training patterns and outputs. Equations (4-1) and (4-12) produce identical results for weight updating; the major difference is that in (4-12) the quasi Hessian matrix Q and the gradient vector g are calculated directly, without the need to calculate and store the Jacobian matrix J as is done in (4-1).

4.2.1 Review of Matrix Algebra

There are two ways to multiply the rows and columns of two matrices. If a row of the first matrix is multiplied by a column of the second matrix, a scalar is obtained, as shown in Fig. 4-1a. When a column of the first matrix is multiplied by a row of the second matrix, the result is a partial matrix q (Fig. 4-1b) [84]. The number of scalars is N×N, while the number of partial matrices q, which later have to be summed, is P×M.

(a) Row-column multiplication results in a scalar (b) Column-row multiplication results in a partial matrix q
Fig. 4-1 Two ways of multiplying matrices

When J^T is multiplied by J using the routine shown in Fig. 4-1b, the partial matrices q (of size N×N) need to be calculated P×M times and then all P×M matrices q must be summed together. The routine of Fig. 4-1b seems complicated, and therefore almost all matrix multiplication processes use the routine of Fig. 4-1a, where only one element of the resulting matrix is calculated and stored at a time. However, although the routine of Fig. 4-1b seems more complicated and is used very seldom, a detailed analysis shows that the numbers of numerical multiplications and additions in the two routines are exactly the same; they are only performed in a different order. The computation cost analysis is presented in Table 4-1.

Table 4-1 Computation cost analysis of the J^T J multiplication
Computation of J^T J | Addition        | Multiplication
Original LM          | (P×M) × N × N   | (P×M) × N × N
Improved LM          | N × N × (P×M)   | N × N × (P×M)
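The equivalence of the two routines in Fig. 4-1 is easy to verify numerically. The sketch below computes J^T J for a small placeholder matrix both ways; routine (b) touches only one row of J at a time, which is the property exploited in the next section.

```cpp
// Minimal sketch: the two multiplication routines of Fig. 4-1 give the same
// J^T J. Routine (a) computes one scalar element at a time; routine (b) sums
// P*M rank-one partial matrices, one per Jacobian row.
#include <cstdio>
#include <vector>
using Mat = std::vector<std::vector<double>>;

int main() {
    Mat J = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 0, -1}};   // placeholder: P*M = 4 rows, N = 3
    const int N = 3;
    Mat A(N, std::vector<double>(N, 0)), B(N, std::vector<double>(N, 0));

    // (a) row-column: element (i,j) accumulated over all rows of J
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (size_t r = 0; r < J.size(); ++r) A[i][j] += J[r][i] * J[r][j];

    // (b) column-row: one full partial matrix per row of J, summed up
    for (size_t r = 0; r < J.size(); ++r)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) B[i][j] += J[r][i] * J[r][j];

    std::printf("A[0][2] = %g, B[0][2] = %g (identical)\n", A[0][2], B[0][2]);
    return 0;
}
```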
In the specific case of neural network training, only one row of the Jacobian matrix J (N elements, or equivalently one column of J^T) is calculated at a time, as each pattern is applied. Therefore, if the routine of Fig. 4-1b is used, the process of building the quasi Hessian matrix can start immediately, without the need to compute and store the entire Jacobian matrix for all patterns and all outputs.

Table 4-2 Memory cost analysis
Multiplication Methods | Elements for storage
Row-column (Fig. 4-1a) | (P×M) × N + N × N + N
Column-row (Fig. 4-1b) | N × N + N
Difference             | (P×M) × N

Where: P is the number of training patterns, M is the number of outputs and N is the number of weights. The analytical results in Table 4-2 show that the column-row multiplication (Fig. 4-1b) can save a large amount of memory.

4.2.2 Improved Quasi Hessian Matrix Computation

Let us introduce the quasi Hessian sub-matrix q_{pm} (size N×N):

\mathbf{q}_{pm} = \begin{bmatrix}
\left(\frac{\partial e_{pm}}{\partial w_1}\right)^2 & \frac{\partial e_{pm}}{\partial w_1}\frac{\partial e_{pm}}{\partial w_2} & \cdots & \frac{\partial e_{pm}}{\partial w_1}\frac{\partial e_{pm}}{\partial w_N} \\
\frac{\partial e_{pm}}{\partial w_2}\frac{\partial e_{pm}}{\partial w_1} & \left(\frac{\partial e_{pm}}{\partial w_2}\right)^2 & \cdots & \frac{\partial e_{pm}}{\partial w_2}\frac{\partial e_{pm}}{\partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial e_{pm}}{\partial w_N}\frac{\partial e_{pm}}{\partial w_1} & \frac{\partial e_{pm}}{\partial w_N}\frac{\partial e_{pm}}{\partial w_2} & \cdots & \left(\frac{\partial e_{pm}}{\partial w_N}\right)^2
\end{bmatrix}    (4-13)

Using (4-7) and (4-13), the N×N quasi Hessian matrix Q can be calculated as the sum of the sub-matrices q_{pm}:

\mathbf{Q} = \sum_{p=1}^{P} \sum_{m=1}^{M} \mathbf{q}_{pm}    (4-14)

By introducing the 1×N vector j_{pm}

\mathbf{j}_{pm} = \left[ \frac{\partial e_{pm}}{\partial w_1}\;\; \frac{\partial e_{pm}}{\partial w_2}\;\; \cdots\;\; \frac{\partial e_{pm}}{\partial w_N} \right]    (4-15)

the sub-matrices q_{pm} in (4-13) can also be written in vector form (Fig. 4-1b):

\mathbf{q}_{pm} = \mathbf{j}_{pm}^T \mathbf{j}_{pm}    (4-16)

One may notice that, for the computation of the sub-matrix q_{pm}, only the N elements of the vector j_{pm} need to be calculated and stored. All the sub-matrices can be calculated for each pattern p and output m separately and summed together to obtain the quasi Hessian matrix Q. Considering the independence among all patterns and outputs, there is no need to store all the quasi Hessian sub-matrices q_{pm}; each sub-matrix can be added to a running total immediately after its computation. Therefore, during the direct computation of the quasi Hessian matrix Q using (4-14), memory for only N elements is required, instead of memory for the whole Jacobian matrix with (P×M)×N elements (Table 4-2).

From equation (4-13), one may notice that all the sub-matrices q_{pm} are symmetrical. With this property, only the upper (or lower) triangular elements of these sub-matrices need to be calculated; therefore, in the improved quasi Hessian matrix computation, both the multiplication operations in (4-16) and the sum operations in (4-14) can be reduced roughly by half.

4.2.3 Improved Gradient Vector Computation

The gradient sub-vector η_{pm} (size N×1) is

\boldsymbol{\eta}_{pm} = \left[ \frac{\partial e_{pm}}{\partial w_1} e_{pm}\;\; \frac{\partial e_{pm}}{\partial w_2} e_{pm}\;\; \cdots\;\; \frac{\partial e_{pm}}{\partial w_N} e_{pm} \right]^T    (4-17)

Combining (4-10) and (4-17), the gradient vector g can be calculated as the sum of the gradient sub-vectors η_{pm}:

\mathbf{g} = \sum_{p=1}^{P} \sum_{m=1}^{M} \boldsymbol{\eta}_{pm}    (4-18)

Using the same vector j_{pm} defined in (4-15), the gradient sub-vector can be calculated as

\boldsymbol{\eta}_{pm} = \mathbf{j}_{pm}^T e_{pm}    (4-19)

Similarly, the gradient sub-vector η_{pm} can be calculated for each pattern and output separately and summed into a running total. Since the same vector j_{pm} is already calculated during the quasi Hessian matrix computation above, only one extra scalar e_{pm} needs to be stored.

With the improved computation, both the quasi Hessian matrix Q and the gradient vector g can be computed directly, without Jacobian matrix storage and multiplication.
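A minimal sketch of this accumulation is given below; the rows j_pm and errors e_pm are placeholders standing for the values produced pattern by pattern, and only the upper triangle of Q is computed before being mirrored.

```cpp
// Minimal sketch: accumulating the quasi Hessian matrix (4-14)/(4-16) and the
// gradient vector (4-18)/(4-19) from one Jacobian row j_pm at a time, without
// storing the Jacobian; only the upper triangle of Q is computed.
#include <cstdio>
#include <vector>

int main() {
    const int N = 3;                                   // number of weights
    std::vector<std::vector<double>> Q(N, std::vector<double>(N, 0.0));
    std::vector<double> g(N, 0.0);

    // Placeholder rows j_pm and errors e_pm, as produced pattern by pattern.
    std::vector<std::vector<double>> rows = {{0.5, -1.0, 0.2}, {0.1, 0.3, -0.4}};
    std::vector<double> errs = {0.7, -0.2};

    for (size_t r = 0; r < rows.size(); ++r) {
        const std::vector<double>& j = rows[r];
        for (int a = 0; a < N; ++a) {
            g[a] += j[a] * errs[r];                    // eta_pm = j_pm^T * e_pm   (4-19)
            for (int b = a; b < N; ++b)                // upper triangle of j_pm^T j_pm (4-16)
                Q[a][b] += j[a] * j[b];
        }
    }
    for (int a = 0; a < N; ++a)                        // mirror to the lower triangle
        for (int b = 0; b < a; ++b) Q[a][b] = Q[b][a];

    std::printf("Q[0][1] = %g, g[0] = %g\n", Q[0][1], g[0]);
    return 0;
}
```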
During the process, only the temporary vector j_{pm} with N elements needs to be stored; in other words, the memory cost for Jacobian matrix storage is reduced by a factor of P×M. For the MNIST problem mentioned in section 4.1, the memory cost for the storage of Jacobian elements is reduced from more than 35 gigabytes to nearly 30.7 kilobytes.

4.2.4 Simplified Computation of the Derivatives ∂e_{pm}/∂w_i

The key point of the improved computation of the quasi Hessian matrix Q and gradient vector g is to calculate the vector j_{pm} defined in (4-15) for each pattern and output. This vector is equivalent to one row of the Jacobian matrix J. The elements of the vector j_{pm} can be calculated by

\frac{\partial e_{pm}}{\partial w_i} = \frac{\partial (d_{pm} - o_{pm})}{\partial w_i} = -\frac{\partial o_{pm}}{\partial net_{pn}} \cdot \frac{\partial net_{pn}}{\partial w_i}    (4-20)

Where: d is the desired output and o is the actual output; net_{pn} is the sum of weighted inputs of neuron n, described as

net_{pn} = \sum_{i=1}^{I} x_{pi}\, w_i    (4-21)

Where: x_{pi} and w_i are the inputs and the related weights, respectively, of neuron n, and I is the number of inputs of neuron n. Inserting (4-20) and (4-21) into (4-15), the vector j_{pm} can be calculated by

\mathbf{j}_{pm} = \left[ -\frac{\partial o_{pm}}{\partial net_{p1}} \left( x_{p1,1}\; \cdots\; x_{p1,i} \right)\;\; \cdots\;\; -\frac{\partial o_{pm}}{\partial net_{pn}} \left( x_{pn,1}\; \cdots\; x_{pn,i} \right) \right]    (4-22)

Where: x_{pn,i} is the i-th input of neuron n when training pattern p is applied. Using the neuron-by-neuron computation [27], the elements x_{pn,i} in (4-22) are obtained in the forward computation, while the derivatives ∂o_{pm}/∂net_{pn} are obtained in the backward computation. Again, since only one vector j_{pm} needs to be stored for each pattern and output in the improved computation, the memory cost for all these temporary parameters is reduced by a factor of P×M, and all matrix operations are simplified to vector operations.

4.3 Implementation

In order to better illustrate the direct computation process for both the quasi Hessian matrix Q and the gradient vector g, let us analyze the parity-2 problem as a simple example. The parity-2 problem is also known as the XOR problem; it has 4 training patterns, 2 inputs and 1 output, see Fig. 4-2.

Fig. 4-2 Parity-2 problem: 4 patterns, 2 inputs and 1 output

A structure of 3 neurons in MLP topology (see Fig. 4-3) is used.

Fig. 4-3 Three neurons in an MLP network used for training the parity-2 problem; weight and neuron indices are marked in the figure

As shown in Fig. 4-3, all weight values are initialized as the vector w = {w1, w2, w3, w4, w5, w6, w7, w8, w9}, and all elements of both the quasi Hessian matrix Q and the gradient vector g are set to 0. For the first pattern (-1, -1), with desired output 1, the forward computation consists of:

a) net11 = 1×w1 + (-1)×w2 + (-1)×w3
b) o11 = f(net11)
c) net12 = 1×w4 + (-1)×w5 + (-1)×w6
d) o12 = f(net12)
e) net13 = 1×w7 + o11×w8 + o12×w9
f) o13 = f(net13)
g) e11 = 1 - o13

Then the backward computation is performed to calculate ∂e11/∂net11, ∂e11/∂net12 and ∂e11/∂net13 in the following steps:

h) With the results of steps (f) and (g),

\frac{\partial e_{11}}{\partial net_{13}} = -\frac{\partial f(net_{13})}{\partial net_{13}}    (4-23)

i) With the results of steps (b) to (g), using the chain rule of differentiation, one obtains

\frac{\partial e_{11}}{\partial net_{12}} = -\frac{\partial f(net_{12})}{\partial net_{12}} \cdot w_9 \cdot \frac{\partial f(net_{13})}{\partial net_{13}}    (4-24)

\frac{\partial e_{11}}{\partial net_{11}} = -\frac{\partial f(net_{11})}{\partial net_{11}} \cdot w_8 \cdot \frac{\partial f(net_{13})}{\partial net_{13}}    (4-25)

In this example, using (4-22), the vector j11 is calculated as

\mathbf{j}_{11} = \left[ \frac{\partial e_{11}}{\partial net_{11}} \left( 1\; -1\; -1 \right)\;\; \frac{\partial e_{11}}{\partial net_{12}} \left( 1\; -1\; -1 \right)\;\; \frac{\partial e_{11}}{\partial net_{13}} \left( 1\; o_{11}\; o_{12} \right) \right]    (4-26)
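Steps (a)-(i) and the assembly of j11 can be written directly in code. The sketch below assumes a tanh activation function and arbitrary placeholder weights, which are not the values used in the dissertation's experiments.

```cpp
// Minimal sketch: forward computation (a)-(g) and the row vector j11 of (4-26)
// for the first parity-2 pattern (-1,-1) with desired output 1, assuming tanh
// activation and placeholder weights w1..w9.
#include <cstdio>
#include <cmath>

int main() {
    double w[10] = {0, 0.1, -0.2, 0.3, 0.2, 0.1, -0.3, 0.1, 0.4, -0.2}; // w[1..9]
    double x1 = -1, x2 = -1, d = 1;

    double net11 = w[1] + x1 * w[2] + x2 * w[3];      // step (a)
    double o11   = std::tanh(net11);                  // step (b)
    double net12 = w[4] + x1 * w[5] + x2 * w[6];      // step (c)
    double o12   = std::tanh(net12);                  // step (d)
    double net13 = w[7] + o11 * w[8] + o12 * w[9];    // step (e)
    double o13   = std::tanh(net13);                  // step (f)
    double e11   = d - o13;                           // step (g)

    auto fp = [](double net) { double t = std::tanh(net); return 1 - t * t; }; // f'(net)
    double de_dnet13 = -fp(net13);                    // (4-23)
    double de_dnet12 = -fp(net12) * w[9] * fp(net13); // (4-24)
    double de_dnet11 = -fp(net11) * w[8] * fp(net13); // (4-25)

    // Row vector j11 of (4-26): derivatives of e11 with respect to w1..w9.
    double j11[9] = {de_dnet11, de_dnet11 * x1, de_dnet11 * x2,
                     de_dnet12, de_dnet12 * x1, de_dnet12 * x2,
                     de_dnet13, de_dnet13 * o11, de_dnet13 * o12};
    for (int i = 0; i < 9; ++i) std::printf("%s%g", i ? " " : "j11 = [", j11[i]);
    std::printf("], e11 = %g\n", e11);
    return 0;
}
```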
With (4-16) and (4-19), the sub-matrix q11 and the sub-vector η11 can then be calculated separately as

\mathbf{q}_{11} = \mathbf{j}_{11}^T \mathbf{j}_{11}    (4-27)

\boldsymbol{\eta}_{11} = \mathbf{j}_{11}^T e_{11}    (4-28)

One may notice that only the upper triangular elements of the sub-matrix q11 need to be calculated, since all sub-matrices are symmetrical; this saves nearly half of the computation. The last step is to add the sub-matrix q11 and the sub-vector η11 to the quasi Hessian matrix Q and the gradient vector g.

The analysis above covers only the training of the first pattern; for the other patterns, the computation process is almost identical. During the whole process there is no Jacobian matrix computation; only the derivatives and outputs of the activation functions have to be computed, and all the temporary parameters are stored in vectors whose sizes have no relationship with the number of patterns or outputs. Generally, for a problem with P patterns and M outputs, the improved computation can be organized as the pseudo code shown in Fig. 4-4.

% Initialization
Q = 0; g = 0
% Improved computation
for p = 1:P                          % number of patterns
    % Forward computation ...
    for m = 1:M                      % number of outputs
        % Backward computation ...
        calculate vector j_pm;       % Eq. (4-22)
        calculate sub matrix q_pm;   % Eq. (4-16)
        calculate sub vector eta_pm; % Eq. (4-19)
        Q = Q + q_pm;                % Eq. (4-14)
        g = g + eta_pm;              % Eq. (4-18)
    end;
end;

Fig. 4-4 Pseudo code of the improved computation for the quasi Hessian matrix and gradient vector

The same quasi Hessian matrices and gradient vectors are obtained with both the traditional computation (equations 4-8 and 4-11) and the proposed computation (equations 4-14 and 4-18); therefore, the proposed computation does not affect the success rate.

4.4 Experiments

Several experiments are designed to test the memory and time efficiency of the improved computation in comparison with the traditional computation. They are divided into two parts: (1) memory comparison and (2) time comparison.

4.4.1 Memory Comparison

Three problems, each with a huge number of patterns, are selected to test the memory cost of both the traditional computation and the improved computation. The LM algorithm is used for training, and the test results are shown in Tables 4-3 and 4-4. In order to make the comparison more precise, the memory cost for program code and input files is not included.

Table 4-3 Memory comparison for parity problems
Parity-N Problems     | N=14       | N=16
Patterns              | 16,384     | 65,536
Structures*           | 15 neurons | 17 neurons
Jacobian matrix sizes | 5,406,720  | 27,852,800
Weight vector sizes   | 330        | 425
Average iterations    | 99.2       | 166.4
Success rate          | 13%        | 9%
Actual memory cost:
Traditional LM        | 79.21 Mb   | 385.22 Mb
Improved LM           | 3.41 Mb    | 4.30 Mb
*All neurons are in fully connected cascade networks

Table 4-4 Memory comparison for the MNIST problem
Problem               | MNIST
Patterns              | 60,000
Structures            | 784=1 single layer network*
Jacobian matrix sizes | 47,100,000
Weight vector sizes   | 785
Actual memory cost:
Traditional LM        | 385.68 Mb
Improved LM           | 15.67 Mb
*In order to perform efficient matrix inversion during training, only one of the ten digits is classified at a time.

From the test results in Tables 4-3 and 4-4, it is clear that the memory cost for training is significantly reduced by the improved computation. In the MNIST problem [83], there are 60,000 training patterns, each of which is an image of a digit (from 0 to 9) made up of 28 by 28 grayscale pixels; another 10,000 patterns are used to test the training results.
With the trained network, our testing error rate over all digits is 7.68%. In this result, compressed, stretched and shifted digits are classified correctly by the trained neural network (see Fig. 4-5a), while severely rotated or distorted images are hard to recognize (see Fig. 4-5b).

(a) Recognized patterns (b) Unrecognized patterns
Fig. 4-5 Some testing results for recognition of the digit "2"

4.4.2 Time Comparison

Parity-N problems are used to test the training time of both the traditional computation and the improved computation with the LM algorithm. The structures used for testing are all fully connected cascade networks. For each problem, the initial weights and training parameters are the same.

Table 4-5 Time comparison for parity problems
Parity-N Problems  | N=9   | N=11  | N=13    | N=15
Patterns           | 512   | 2,048 | 8,192   | 32,768
Neurons            | 10    | 12    | 14      | 16
Weights            | 145   | 210   | 287     | 376
Average iterations | 38.51 | 59.02 | 68.08   | 126.08
Success rate       | 58%   | 37%   | 24%     | 12%
Averaged training time (s):
Traditional LM     | 0.78  | 68.01 | 1508.46 | 43,417.06
Improved LM        | 0.33  | 22.09 | 173.79  | 2,797.93

From Table 4-5, one may notice that the improved computation can not only handle much larger problems, but also computes much faster than the traditional one, especially for training with large numbers of patterns; the larger the pattern set, the more time-efficient the improved computation becomes. The simplified quasi Hessian matrix computation is clearly one reason for the improved computing speed (nearly two times faster for small problems). The further, more significant reductions obtained for larger problems are most likely due to the simpler way of addressing elements in vectors, in comparison with addressing elements in huge matrices.

With the presented experimental results, one may notice that the improved computation is much more efficient than the traditional computation for training with the Levenberg Marquardt algorithm, not only in memory requirements but also in training time.

4.5 Conclusion

In this chapter, the improved computation is introduced to increase the training efficiency of the Levenberg Marquardt algorithm. The proposed method does not require storing and multiplying the large Jacobian matrix. As a consequence, the memory requirement for the quasi Hessian matrix and gradient vector computation is decreased by a factor of P×M, where P is the number of patterns and M is the number of outputs. An additional benefit of the memory reduction is a significant reduction in computation time. Based on the proposed computation, the calculation of the quasi Hessian matrix is further simplified using its symmetry; therefore, the training speed of the improved algorithm becomes much faster than with the traditional computation.

In the proposed computation process, the quasi Hessian matrix can be calculated on the fly as training patterns are applied. Moreover, the proposed method has a special advantage for applications which require dynamically changing the number of training patterns: there is no need to repeat the entire multiplication J^T J; the quasi Hessian matrix can simply be modified, by adding or subtracting sub-matrices, as patterns are added or removed.

Second order algorithms have many advantages, but at each iteration they require the solution of a large set of linear equations with a number of unknowns equal to the number of weights. Since in the case of first order algorithms the computing time is only proportional to the problem size, first order algorithms could (in theory) be more useful for large neural networks.
However, as discussed in the previous chapters, the first order EBP algorithm is not able to solve some problems unless an excessive number of neurons is used; and with an excessive number of neurons, networks lose their generalization ability, so the trained networks do not respond well to new patterns that were not used for training. One may conclude that both first order and second order algorithms have their disadvantages, and the problem of training extremely large networks with second order algorithms is still unsolved. The method presented in this chapter at least solves the problem of training neural networks with a second order algorithm using a practically unlimited number of training patterns.

CHAPTER 5
FORWARD-ONLY ALGORITHM

The forward-only algorithm [78], introduced in this chapter and following the neuron-by-neuron (NBN) computation procedure [27], also allows training of arbitrarily connected neural networks; therefore, more powerful network architectures with connections across layers, such as bridged multilayer perceptron (BMLP) networks and fully connected cascade (FCC) networks, can be trained efficiently. A further advantage of the proposed forward-only algorithm is that the learning process requires only the forward computation, without the backward computation. The information needed for the gradient vector (for first order algorithms) and the Jacobian or Hessian matrix (for second order algorithms) is obtained during the forward computation. In this way the forward-only method may, in many cases, also reduce the computation time, especially for networks with multiple outputs.

In this chapter, we first review the traditional gradient vector and Jacobian matrix computation, in order to expose the computational redundancy problem for networks with multiple outputs. Then the forward-only algorithm is proposed to solve this problem by removing the backward computation process. Third, both analytical and experimental comparisons are performed between the proposed forward-only algorithm and the Hagan and Menhaj Levenberg Marquardt algorithm. The experimental results also show the ability of the forward-only algorithm to train networks consisting of arbitrarily connected neurons.

5.1 Computational Fundamentals

Before the derivation, let us introduce some indices commonly used in this chapter:
- p is the index of patterns, from 1 to np, where np is the number of patterns;
- m is the index of outputs, from 1 to no, where no is the number of outputs;
- j and k are the indices of neurons, from 1 to nn, where nn is the number of neurons;
- i is the index of neuron inputs, from 1 to ni, where ni is the number of inputs; ni may differ from neuron to neuron.

Other indices will be explained where they are used. The sum square error (SSE) E is defined to evaluate the training process; for all patterns and outputs it is calculated by

E = \frac{1}{2} \sum_{p=1}^{np} \sum_{m=1}^{no} e_{p,m}^2    (5-1)

Where: e_{p,m} is the error at output m, defined as

e_{p,m} = o_{p,m} - d_{p,m}    (5-2)

Where: d_{p,m} and o_{p,m} are the desired output and the actual output, respectively, at network output m for training pattern p.

In all training algorithms, the same computations are repeated for one pattern at a time. Therefore, in order to simplify the notation, the index p for patterns is skipped in the following derivations, unless it is essential.

5.1.1 Review of Basic Concepts in Neural Network Training

Let us consider neuron j with ni inputs, as shown in Fig. 5-1.
If neuron j is in the first layer, all its inputs are connected to the inputs of the network; otherwise, its inputs can be connected to outputs of other neurons or to network inputs if connections across layers are allowed.

Fig. 5-1 Connection of a neuron j with the rest of the network. The nodes y_{j,i} can represent network inputs or outputs of other neurons. F_{m,j}(y_j) is the nonlinear relationship between the neuron output node y_j and the network output o_m

The node value y is an important and flexible concept. It can be y_{j,i}, meaning the i-th input of neuron j, or it can be used as y_j to denote the output of neuron j. In this chapter, if a node y has one index it is a neuron output node; if it has two indices (neuron and input), it is a neuron input node. The output node of neuron j is calculated using

y_j = f_j(net_j)    (5-3)

Where: f_j is the activation function of neuron j and the net value net_j is the sum of the weighted input nodes of neuron j:

net_j = \sum_{i=1}^{ni} y_{j,i}\, w_{j,i} + w_{j,0}    (5-4)

Where: y_{j,i} is the i-th input node of neuron j, weighted by w_{j,i}, and w_{j,0} is the bias weight. Using (5-4), one may notice that the derivative of net_j is

\frac{\partial net_j}{\partial w_{j,i}} = y_{j,i}    (5-5)

and the slope s_j of the activation function f_j is

s_j = \frac{\partial y_j}{\partial net_j} = \frac{\partial f_j(net_j)}{\partial net_j}    (5-6)

Between the output node y_j of a hidden neuron j and the network output o_m there is a complex nonlinear relationship (Fig. 5-1):

o_m = F_{m,j}(y_j)    (5-7)

Where: o_m is the m-th output of the network. The complexity of this nonlinear function F_{m,j}(y_j) depends on how many other neurons lie between neuron j and network output m. If neuron j is at network output m, then o_m = y_j and F'_{m,j}(y_j) = 1, where F'_{m,j} is the derivative of the nonlinear relationship between neuron j and output m.

5.1.2 Gradient Vector and Jacobian Matrix Computation

For every pattern, the EBP algorithm needs only one backpropagation process, while in second order algorithms the backpropagation process has to be repeated for every output separately, in order to obtain the consecutive rows of the Jacobian matrix (Fig. 5-2). Another difference in second order algorithms is that the concept of backpropagating the δ parameter [81] has to be modified. In the EBP algorithm, the output errors are part of the δ parameter:

\delta_j = s_j \sum_{m=1}^{no} F'_{m,j}\, e_m    (5-8)

In second order algorithms, the δ parameters are calculated for each neuron j and each output m separately; also, in the backpropagation process [80] the error is replaced by a unit value:

\delta_{m,j} = s_j\, F'_{m,j}    (5-9)

Knowing δ_{m,j}, the elements of the Jacobian matrix are calculated as

\frac{\partial e_{p,m}}{\partial w_{j,i}} = y_{j,i}\, \delta_{m,j} = y_{j,i}\, s_j\, F'_{m,j}    (5-10)

In the EBP algorithm, the elements of the gradient vector are computed as

g_{j,i} = \frac{\partial E}{\partial w_{j,i}} = y_{j,i}\, \delta_j    (5-11)

Where: δ_j is obtained with the error backpropagation process. In second order algorithms, the gradient can be obtained from partial results of the Jacobian calculations:

g_{j,i} = \sum_{m=1}^{no} y_{j,i}\, \delta_{m,j}\, e_m    (5-12)

Where: m indicates a network output and δ_{m,j} is given by (5-9). The update rule of the EBP algorithm is

\mathbf{w}_{n+1} = \mathbf{w}_n - \alpha \mathbf{g}_n    (5-13)

Where: n is the index of iterations, w is the weight vector, α is the learning constant and g is the gradient vector. Derived from the Newton algorithm and the steepest descent method, the update rule of the Levenberg Marquardt (LM) algorithm is [80]

\mathbf{w}_{n+1} = \mathbf{w}_n - \left(\mathbf{J}_n^T \mathbf{J}_n + \mu \mathbf{I}\right)^{-1} \mathbf{g}_n    (5-14)

Where: μ is the combination coefficient, I is the identity matrix and J is the Jacobian matrix shown in Fig. 5-2.
Fig. 5-2 Structure of the Jacobian matrix: (1) the number of columns is equal to the number of weights; (2) each row corresponds to a specified training pattern p and output m

From Fig. 5-2, one may notice that, for every pattern p, there are no rows of the Jacobian matrix, where no is the number of network outputs. The number of columns is equal to the number of weights in the network, and the number of rows is equal to np×no. The traditional backpropagation computation of the delta matrix (of size np×no×nn) in second order algorithms can be organized as shown in Fig. 5-3.

for all patterns
    % Forward computation
    for all neurons (nn)
        for all weights of the neuron (nx)
            calculate net;                               % Eq. (5-4)
        end;
        calculate neuron output;                         % Eq. (5-3)
        calculate neuron slope;                          % Eq. (5-6)
    end;
    for all outputs (no)
        calculate error;                                 % Eq. (5-2)
        % Backward computation
        initialize delta as slope;
        for all neurons, starting from output neurons (nn)
            for the weights connected to other neurons (ny)
                multiply delta through weights;
                sum the backpropagated delta at proper nodes;
            end;
            multiply delta by slope (for hidden neurons);
        end;
    end;
end;

Fig. 5-3 Pseudo code using the traditional backpropagation of delta in second order algorithms (the backward computation part is what the proposed method removes)

5.2 Forward-Only Computation

The proposed forward-only method is designed to improve the efficiency of the Jacobian matrix computation by removing the backpropagation process.

5.2.1 Derivation

The concept of δ_{m,j} was defined in equation (5-9). One may notice that δ_{m,j} can also be interpreted as a signal gain between the net input of neuron j and the network output m. Let us extend this concept to gain coefficients between all neurons in the network (Fig. 5-4 and Fig. 5-6). The notation δ_{k,j} is an extension of equation (5-9); it can be interpreted as the signal gain between neurons j and k, and it is given by

\delta_{k,j} = \frac{\partial F_{k,j}(y_j)}{\partial net_j} = \frac{\partial F_{k,j}(y_j)}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} = F'_{k,j}\, s_j    (5-15)

Where: k and j are indices of neurons and F_{k,j}(y_j) is the nonlinear relationship between the output node of neuron k and the output node of neuron j. Naturally, in feedforward networks, k ≥ j. If k = j, then δ_{k,k} = s_k, where s_k is the slope of the activation function (5-6). Fig. 5-4 illustrates this extended concept of the δ_{k,j} parameter as a signal gain.

Fig. 5-4 Interpretation of δ_{k,j} as a signal gain; in a feedforward network, neuron j must be located before neuron k
The matrix δ has a triangular shape and its elements can be calculated in the forward-only process. The elements of the gradient vector and of the Jacobian matrix can then be obtained using equations (5-10) and (5-12), where only the last rows of the matrix δ, associated with the network outputs, are used. The key issue of the proposed algorithm is the method of calculating the δk,j parameters in the forward calculation process; it is described in the next section.

5.2.2 Calculation of the δ Matrix for Fully Connected Cascade Architectures

Let us start the analysis with fully connected neural networks (Fig. 5-5). Any other architecture can be considered as a simplification of a fully connected neural network obtained by eliminating connections (setting weights to zero). If the feedforward principle is enforced (no feedback), fully connected neural networks must have cascade architectures.
Fig. 5-5 Four neurons in a fully connected neural network, with 5 inputs and 3 outputs
Fig. 5-6 The δk,j parameters for the neural network of Fig. 5-5. Input and bias weights are not used in the calculation of the gain parameters

Slopes of the neuron activation functions sj can also be written in the form of the δ parameter as δj,j = sj. By inspecting Fig. 5-6, the δ parameters can be written as follows. For the first neuron there is only one δ parameter

\delta_{1,1} = s_1    (5-16)

For the second neuron there are two δ parameters

\delta_{2,2} = s_2, \qquad \delta_{2,1} = s_2\, w_{1,2}\, s_1    (5-17)

For the third neuron there are three δ parameters

\delta_{3,3} = s_3, \qquad \delta_{3,2} = s_3\, w_{2,3}\, s_2, \qquad \delta_{3,1} = s_3\, w_{1,3}\, s_1 + s_3\, w_{2,3}\, s_2\, w_{1,2}\, s_1    (5-18)

One may notice that all δ parameters for the third neuron can also be expressed as a function of the δ parameters calculated for the previous neurons. Equations (5-18) can be rewritten as

\delta_{3,3} = s_3, \qquad \delta_{3,2} = \delta_{3,3}\, w_{2,3}\, \delta_{2,2}, \qquad \delta_{3,1} = \delta_{3,3}\, w_{1,3}\, \delta_{1,1} + \delta_{3,3}\, w_{2,3}\, \delta_{2,1}    (5-19)

For the fourth neuron there are four δ parameters

\delta_{4,4} = s_4, \qquad \delta_{4,3} = \delta_{4,4}\, w_{3,4}\, \delta_{3,3}, \qquad \delta_{4,2} = \delta_{4,4}\, w_{2,4}\, \delta_{2,2} + \delta_{4,4}\, w_{3,4}\, \delta_{3,2}, \qquad \delta_{4,1} = \delta_{4,4}\, w_{1,4}\, \delta_{1,1} + \delta_{4,4}\, w_{2,4}\, \delta_{2,1} + \delta_{4,4}\, w_{3,4}\, \delta_{3,1}    (5-20)

The last parameter δ4,1 can also be expressed in a compact form by summing all terms connected to the other neurons (from 1 to 3)

\delta_{4,1} = \delta_{4,4} \sum_{i=1}^{3} w_{i,4}\, \delta_{i,1}    (5-21)

The universal formula for calculating the δk,j parameters, using the data already calculated for the previous neurons, is

\delta_{k,j} = \delta_{k,k} \sum_{i=j}^{k-1} w_{i,k}\, \delta_{i,j}    (5-22)

where, in a feedforward network, neuron j must be located before neuron k, so k ≥ j; δk,k = sk is the slope of the activation function of neuron k; wj,k is the weight between neuron j and neuron k; and δk,j is the signal gain through weight wj,k and through the other part of the network connected to wj,k.

In order to organize the process, an nn×nn computation table is used for calculating the signal gains between neurons, where nn is the number of neurons (Fig. 5-7). Natural indices (from 1 to nn) are given to the neurons according to the direction of signal propagation. For the signal-gain computation only the connections between neurons need to be considered, while the weights connected to the network inputs and the biasing weights of all neurons are used only at the end of the process. For a given pattern, a sample of the nn×nn computation table is shown in Fig. 5-7. One may notice that the indices of rows and columns are the same as the indices of neurons. In the following derivation, the neuron indices k and j are used to specify the rows and columns of the computation table.
In a feedforward network k ≥ j, so the matrix δ has a triangular shape.
Fig. 5-7 The nn×nn computation table; the gain matrix δ contains all the signal gains between neurons; the weight array w presents only the connections between neurons, while network input weights and biasing weights are not included

The computation table consists of three parts: the weights between neurons in the upper triangle, the vector of slopes of the activation functions on the main diagonal, and the signal gain matrix δ in the lower triangle. Only the main diagonal and the lower triangular elements are computed for each pattern. Initially, the elements on the main diagonal, δk,k = sk, are known as the slopes of the activation functions, and the values of the signal gains δk,j are computed subsequently using equation (5-22). The computation is processed neuron by neuron, starting with the neuron closest to the network inputs. First row number one is calculated, and then the elements of the subsequent rows; the calculation of each row uses the elements of the rows above it, according to (5-22). After completion of the forward computation process, all elements of the δ matrix, in the form of the lower triangle, are obtained.

In the next step the elements of the gradient vector and of the Jacobian matrix are calculated using equations (5-10) and (5-12). In the case of neural networks with one output, only the last row of the δ matrix is needed for the gradient vector and Jacobian matrix computation. If a network has no outputs, then the last no rows of the δ matrix are used. For example, if the network shown in Fig. 5-5 has 3 outputs, the following elements of the δ matrix are used

\begin{bmatrix}
\delta_{2,1} & \delta_{2,2}=s_2 & 0 & 0 \\
\delta_{3,1} & \delta_{3,2} & \delta_{3,3}=s_3 & 0 \\
\delta_{4,1} & \delta_{4,2} & \delta_{4,3} & \delta_{4,4}=s_4
\end{bmatrix}    (5-23)

and then, for each pattern, the three rows of the Jacobian matrix corresponding to the three outputs are calculated in one step using (5-10), without additional propagation of δ:

\begin{bmatrix}
\underbrace{\delta_{2,1}\,\mathbf{y}_1}_{neuron\ 1} & \underbrace{s_2\,\mathbf{y}_2}_{neuron\ 2} & \underbrace{0\cdot\mathbf{y}_3}_{neuron\ 3} & \underbrace{0\cdot\mathbf{y}_4}_{neuron\ 4} \\
\delta_{3,1}\,\mathbf{y}_1 & \delta_{3,2}\,\mathbf{y}_2 & s_3\,\mathbf{y}_3 & 0\cdot\mathbf{y}_4 \\
\delta_{4,1}\,\mathbf{y}_1 & \delta_{4,2}\,\mathbf{y}_2 & \delta_{4,3}\,\mathbf{y}_3 & s_4\,\mathbf{y}_4
\end{bmatrix}    (5-24)

where the neurons' input vectors y1 through y4 have 6, 7, 8 and 9 elements respectively (Fig. 5-5), corresponding to the number of weights connected. Therefore, each row of the Jacobian matrix has 6+7+8+9 = 30 elements. If the network has 3 outputs, then from 6 elements of the δ matrix and 3 slopes, 90 elements of the Jacobian matrix are calculated. One may notice that the size of the newly introduced δ matrix is relatively small, and it is negligible in comparison to the other matrices used in the calculation.

The proposed method gives all the information needed to calculate both the gradient vector (5-12) and the Jacobian matrix (5-10) without the backpropagation process; instead, the δ parameters are obtained in a relatively simple forward computation (see equation (5-22)).
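To make the table-filling procedure concrete, the following minimal C++ sketch applies equations (5-25), (5-26) and (5-24) to the four-neuron FCC network of Fig. 5-5. The slope values, inter-neuron weights and neuron input vectors are arbitrary placeholder numbers used only for illustration; in the actual trainer they are produced by the forward pass for each training pattern.

#include <cstdio>
#include <vector>

int main() {
    const int nn = 4;
    double s[nn] = {0.9, 0.8, 0.7, 0.6};        // slopes s_k from Eq. (5-6), assumed values
    double w[nn][nn] = {};                      // w[i][k]: weight from neuron i+1 to neuron k+1
    w[0][1] = 0.5;  w[0][2] = -0.3; w[0][3] = 0.2;
    w[1][2] = 0.4;  w[1][3] = -0.1; w[2][3] = 0.6;

    // Fill the lower-triangular gain matrix delta with Eqs. (5-25) and (5-26).
    double delta[nn][nn] = {};
    for (int k = 0; k < nn; ++k) {
        delta[k][k] = s[k];                     // diagonal: delta_{k,k} = s_k
        for (int j = 0; j < k; ++j) {
            double x = 0.0;                     // Eq. (5-25): sum over previous neurons
            for (int i = j; i < k; ++i)
                x += w[i][k] * delta[i][j];
            delta[k][j] = s[k] * x;             // Eq. (5-26): multiply the sum by the slope
        }
    }

    // One Jacobian row per output (Eq. 5-24): the columns belonging to neuron j
    // receive delta_{m,j} * y_j, where y_j is the input vector (bias included)
    // of neuron j; sizes 6, 7, 8 and 9 as in Fig. 5-5 (placeholder values here).
    std::vector<std::vector<double>> y = {
        std::vector<double>(6, 1.0), std::vector<double>(7, 1.0),
        std::vector<double>(8, 1.0), std::vector<double>(9, 1.0) };
    int m = 3;                                  // output neuron 4 (index 3)
    std::vector<double> jac_row;
    for (int j = 0; j < nn; ++j)
        for (double yi : y[j])
            jac_row.push_back(delta[m][j] * yi);
    std::printf("Jacobian row length: %zu\n", jac_row.size());   // 6+7+8+9 = 30
    return 0;
}

With the lower triangle of δ filled, one Jacobian row per network output is obtained by simple multiplications, which is exactly the property exploited in the comparison of section 5.3.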
5.2.3 Training Arbitrarily Connected Neural Networks

The forward-only computation above was derived for fully connected neural networks. If a network is not fully connected, then some elements of the computation table are zero. Fig. 5-8 shows computation tables for different neural network topologies with 6 neurons each. Notice that the zero elements correspond to neurons that are not connected (neurons in the same layer). This can further simplify the computation process for the popular MLP topologies (Fig. 5-8b).
Fig. 5-8 Three different architectures with 6 neurons: (a) fully connected cascade network; (b) multilayer perceptron network; (c) arbitrarily connected neural network

Most neural networks in use have many zero elements in the computation table (Fig. 5-8). In order to reduce the storage requirements (weights with zero values are not stored) and to reduce the computation (no operations are performed on zero elements), a part of the NBN algorithm [27], introduced in chapter 2, was adopted for the forward computation. In order to further simplify the computation process, equation (5-22) is completed in two steps

x_{k,j} = \sum_{i=j}^{k-1} w_{i,k}\, \delta_{i,j}    (5-25)

and

\delta_{k,j} = \delta_{k,k}\, x_{k,j} = s_k\, x_{k,j}    (5-26)

The complete algorithm with forward-only computation is shown in Fig. 5-9. By adding the two additional steps using equations (5-25) and (5-26) (the gain-computation steps in Fig. 5-9), all computation can be completed in the forward-only computing process.

for all patterns (np)
    % Forward computation
    for all neurons (nn)
        for all weights of the neuron (nx)
            calculate net;                               % Eq. (5-4)
        end;
        calculate neuron output;                         % Eq. (5-3)
        calculate neuron slope;                          % Eq. (5-6)
        set current slope as delta;
        for weights connected to previous neurons (ny)
            for previous neurons (nz)
                multiply delta through weights then sum; % Eq. (5-25)
            end;
            multiply the sum by the slope;               % Eq. (5-26)
        end;
        related Jacobian elements computation;           % Eq. (5-12)
    end;
    for all outputs (no)
        calculate error;                                 % Eq. (5-2)
    end;
end;

Fig. 5-9 Pseudo code of the forward-only computation in second order algorithms

5.3 Computation Comparison

The proposed forward-only computation removes the backpropagation part, but it includes additional calculations in the forward computation (the gain-computation steps in Fig. 5-9). Let us compare the computation cost of the forward part and the backward part of each method in the LM algorithm. Naturally, such a comparison can be done only for traditional MLP architectures, which can be handled by both algorithms.

As shown in Fig. 5-3 and Fig. 5-9, the computation cost of the traditional computation and of the forward-only computation depends on the neural network topology. In order to make the analytical comparison, for each neuron let us consider:

nx, the average number of weights per neuron,

nx = nw / nn    (5-27)

ny, the average number of weights between neurons,

ny = nh × no / nn    (5-28)

nz, the average number of previous neurons,

nz = nh / nn    (5-29)

where nw is the number of weights, nn is the number of neurons, no is the number of outputs and nh is the number of hidden neurons. The estimation of ny depends on the network structure; equation (5-28) gives the ny value for MLP networks with one hidden layer. The comparison below is for training one pattern.
From the analytical results in Table 5-1, one may notice that, for the backward part, the time cost of the backpropagation computation is tightly associated with the number of outputs, while in the forward-only computation the number of outputs is almost irrelevant.

Table 5-1 Analysis of the computation cost of the Hagan and Menhaj LM algorithm and of the forward-only computation

  Hagan and Menhaj computation
    Operation   Forward part             Backward part
    +/-         nn×nx + 3nn + no         no×nn×ny
    ×/÷         nn×nx + 4nn              no×nn×ny + no×(nn - no)
    exp*        nn                       0

  Forward-only computation
    Operation   Forward part                         Backward part
    +/-         nn×nx + 3nn + no + nn×ny×nz          0
    ×/÷         nn×nx + 4nn + nn×ny + nn×ny×nz       0
    exp         nn                                   0

  Subtraction of forward-only from traditional
    +/-         nn×ny×(no - 1)
    ×/÷         nn×ny×(no - 1) + no×(nn - no) - nn×ny×nz
    exp         0
  *Exponential operation.

Table 5-2 shows the computation cost for the neural network which will be used for the ASCII problem in section 5.4, calculated with the equations of Table 5-1. On a typical PC with an arithmetic coprocessor, based on experimental results, if the time cost of a "+/-" operation is set as unit "1", then "×/÷" and "exp" operations cost nearly 2 and 65 units respectively.

Table 5-2 Comparison for the ASCII problem

                  Hagan and Menhaj computation     Forward-only computation
                  Forward       Backward           Forward       Backward
  +/-             4,088         175,616            7,224         0
  ×/÷             4,144         178,752            8,848         0
  exp             7,280         0                  7,280         0
  Total           552,776                          32,200
  Relative time   100%                             5.83%
  *Network structure: 112 neurons in an 8-56-56 MLP network.

For the computation speed testing in the next section, the analytical relative times of the problems are presented in Table 5-3.

Table 5-3 Analytical relative time of the forward-only computation

  Problem               nn     no    nx     ny     nz     Relative time
  ASCII conversion      112    56    33     28     0.50   5.83%
  Error correction      42     12    18.1   8.57   2.28   36.96%
  Forward kinematics    10     3     5.9    2.10   0.70   88.16%

For MLP networks with one hidden layer, using the estimation rules of Table 5-1, the computation costs of the forward-only method and of the traditional forward-backward method are compared in Fig. 5-10. All networks have 20 inputs. Based on the analytical results, it can be seen that, in the LM algorithm, the forward-only computation is similar to the traditional computation for single-output networks, while for networks with multiple outputs the forward-only computation is expected to be more efficient.
Fig. 5-10 Comparison of computation cost for MLP networks with one hidden layer and 1 to 10 outputs; the x-axis is the number of neurons in the hidden layer; the y-axis is the time consumption ratio between the forward-only computation and the forward-backward computation
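The analytical comparison above can also be scripted directly from the formulas of Table 5-1. The following C++ sketch assumes the operation weights quoted in the text (1 for addition/subtraction, 2 for multiplication/division, 65 for an exponential) and uses the 8-56-56 ASCII network as an example; the weight count nw = 3696 is derived from that topology and is an assumption of this sketch rather than a value taken from Table 5-2.

#include <cstdio>

struct Cost { double add, mul, exp_; };

// Weighted total, using the relative costs 1 : 2 : 65 quoted in the text.
static double weighted(const Cost &c) { return c.add + 2.0 * c.mul + 65.0 * c.exp_; }

int main() {
    // Example parameters: the 8-56-56 MLP used for the ASCII problem.
    double nn = 112, no = 56, nh = 56, nw = 3696;   // nw assumed as 56*9 + 56*57
    double nx = nw / nn;                            // Eq. (5-27)
    double ny = nh * no / nn;                       // Eq. (5-28)
    double nz = nh / nn;                            // Eq. (5-29)

    // Hagan-Menhaj computation (forward + backward), rows of Table 5-1.
    Cost trad = { nn*nx + 3*nn + no + no*nn*ny,
                  nn*nx + 4*nn + no*nn*ny + no*(nn - no),
                  nn };
    // Forward-only computation (backward part is zero).
    Cost fo   = { nn*nx + 3*nn + no + nn*ny*nz,
                  nn*nx + 4*nn + nn*ny + nn*ny*nz,
                  nn };

    std::printf("estimated relative time = %.2f%%\n",
                100.0 * weighted(fo) / weighted(trad));
    return 0;
}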
5.4 Experiments

The experiments were organized in three parts: (1) ability to handle various network topologies; (2) training neural networks with generalization abilities; (3) computational efficiency.

5.4.1 Ability of Handling Various Network Topologies

The ability of the proposed forward-only computation to train arbitrarily connected networks is illustrated with the two-spiral problem. The two-spiral problem is considered a good benchmark for training algorithms [77]. Depending on the neural network architecture, different numbers of neurons are required for successful training. For example, using standard MLP networks with one hidden layer, 34 neurons are required for the two-spiral problem [85].

Using the proposed computation in the LM algorithm, two types of topologies, MLP networks with two hidden layers and fully connected cascade (FCC) networks, were tested on the two-spiral patterns, and the results are presented in the tables below. In the MLP networks with two hidden layers, the number of neurons is assumed to be equal in both hidden layers.

The results for the MLP architectures shown in Table 5-4 are identical whether the Hagan and Menhaj LM algorithm or the proposed LM algorithm is used (assuming the same initial weights). In other words, the proposed algorithm has the same success rate and the same number of iterations as the Hagan and Menhaj LM algorithm. The difference is that the proposed algorithm can also handle architectures other than MLP, and in many cases (especially with multiple outputs) the computation time is shorter.

Table 5-4 Training results for the two-spiral problem with the proposed forward-only implementation of the LM algorithm, using MLP networks with two hidden layers; maximum iteration is 1,000; desired error = 0.01; 100 trials for each case

  Hidden neurons   Success rate   Average number of iterations   Average time (s)
  12               failing        /                               /
  14               13%            474.7                           5.17
  16               33%            530.6                           8.05
  18               50%            531.0                           12.19
  20               63%            567.9                           19.14
  22               65%            549.1                           26.09
  24               71%            514.4                           34.85
  26               81%            544.3                           52.74

Table 5-5 Training results for the two-spiral problem with the proposed forward-only implementation of the LM algorithm, using FCC networks; maximum iteration is 1,000; desired error = 0.01; 100 trials for each case

  Hidden neurons   Success rate   Average number of iterations   Average time (s)
  7                13%            287.7                           0.88
  8                24%            261.4                           0.98
  9                40%            243.9                           1.57
  10               69%            231.8                           1.62
  11               80%            175.1                           1.70
  12               89%            159.7                           2.09
  13               92%            137.3                           2.40
  14               96%            127.7                           2.89
  15               99%            112.0                           3.82

From the testing results presented in Table 5-5, one may notice that the fully connected cascade (FCC) networks are much more efficient than the other networks for solving the two-spiral problem, succeeding with as few as 8 neurons. The proposed LM algorithm is also more efficient than the well-known cascade correlation algorithm, which requires 12-19 hidden neurons in FCC architectures to converge [86].

5.4.2 Training Neural Networks with Generalization Abilities

To compare generalization abilities, FCC networks, shown to be the most efficient architecture in section 2.3, are used for training. These architectures can be trained by both the EBP algorithm and the forward-only implementation of the LM algorithm. The slow convergence of the EBP algorithm is not the issue in this experiment; the generalization abilities of the networks trained with the two algorithms are compared. The Hagan and Menhaj LM algorithm was not used for comparison here because it cannot handle FCC networks.
Fig. 5-11 Peak surface approximation problem: (a) testing surface with 37×37 = 1,369 points; (b) training surface with 10×10 = 100 points

Let us consider the peak surface [85] as the required surface (Fig. 5-11a) and let us use 10×10 = 100 equally spaced patterns (Fig. 5-11b) to train the neural networks. The quality of the trained networks is evaluated using errors computed for 37×37 = 1,369 equally spaced patterns. In order to make a valid comparison between training and verification errors, the sum squared error (SSE), as defined in (5-1), is divided by 100 and 1,369 respectively.
Table 5-6 Training results for the peak surface problem using FCC architectures

  Neurons   Success rate       Average iteration        Average time (s)
            EBP      LM        EBP          LM          EBP        LM
  8         0%       5%        failing      222.5       failing    0.33
  9         0%       25%       failing      214.6       failing    0.58
  10        0%       61%       failing      183.5       failing    0.70
  11        0%       76%       failing      177.2       failing    0.93
  12        0%       90%       failing      149.5       failing    1.08
  13        35%      96%       573,226      142.5       624.88     1.35
  14        42%      99%       544,734      134.5       651.66     1.76
  15        56%      100%      627,224      119.3       891.90     1.85

For the EBP algorithm, the learning constant is 0.0005 and the momentum is 0.5; the maximum iteration is 1,000,000 for the EBP algorithm and 1,000 for the LM algorithm; desired error = 0.5; there are 100 trials for each case. The proposed version of the LM algorithm is used in this experiment.

The training results are shown in Table 5-6. One may notice that it was possible to find an acceptable solution (Fig. 5-12) with 8 neurons (52 weights). Unfortunately, with the EBP algorithm it was not possible to find acceptable solutions in 100 trials of 1,000,000 iterations each. Fig. 5-13 shows the best result out of the 100 trials with the EBP algorithm. When the network size was significantly increased from 8 to 13 neurons (117 weights), the EBP algorithm was able to reach a training error similar to that of the LM algorithm, but the network lost its ability to respond correctly to new patterns (between the training points). Notice that with the enlarged number of neurons (13 neurons) the EBP algorithm was able to train the network to a small error, SSETrain = 0.0018, but, as one can see in Fig. 5-14, the result is unacceptable, with verification error SSEVerify = 0.4909.

Fig. 5-12 The best training result in 100 trials, using the LM algorithm with 8 neurons in an FCC network (52 weights); maximum training iteration is 1,000; SSETrain = 0.0044, SSEVerify = 0.0080 and training time = 0.37 s
Fig. 5-13 The best training result in 100 trials, using the EBP algorithm with 8 neurons in an FCC network (52 weights); maximum training iteration is 1,000,000; SSETrain = 0.0764, SSEVerify = 0.1271 and training time = 579.98 s
Fig. 5-14 The best training result in 100 trials, using the EBP algorithm with 13 neurons in an FCC network (117 weights); maximum training iteration is 1,000,000; SSETrain = 0.0018, SSEVerify = 0.4909 and training time = 635.72 s

From the presented examples one may notice that, in simple (close to optimal) networks, the EBP algorithm often cannot converge to the required training error (Fig. 5-13). When the size of the network increases, the EBP algorithm can reach the required training error, but the trained networks lose their generalization ability and cannot process new patterns well (Fig. 5-14). On the other hand, the implementation of the LM algorithm proposed in this chapter not only works significantly faster, but it can also find good solutions with close-to-optimal networks (Fig. 5-12).

5.4.3 Computational Speed

Several problems are presented to test the computation speed of both the Hagan and Menhaj LM algorithm and the proposed LM algorithm. The time costs are measured separately for the forward part and the backward part. In order to compare with the analytical results in section 5.3, MLP networks with one hidden layer are used for training.

5.4.3.1 ASCII Codes to Images Conversion

This problem is to associate 256 ASCII codes with 256 character images, each of which is made up of 7×8 pixels (Fig. 5-15). So there are 8-bit inputs (the inputs of the parity-8 problem), 256 patterns and 56 outputs. In order to solve the problem, a network of 112 neurons in an 8-56-56 MLP structure is used to train those patterns with the LM algorithm.
The computation time is presented in Table 5-7. The analytical result is 5.83%, as shown in Table 5-3.

Table 5-7 Comparison for the ASCII characters recognition problem

  Computation method   Time cost (ms/iteration)      Relative time
                       Forward       Backward
  Traditional          8.24          1,028.74        100.0%
  Forward-only         61.13         0.00            5.9%

Fig. 5-15 The first 90 images of ASCII characters

The testing results in Table 5-7 show that, for this multiple-output problem, the forward-only computation is much more efficient than the traditional computation in LM training.

5.4.3.2 Error Correction

Error correction is an extension of parity-N problems [74] to multiple parity bits. In Fig. 5-16, the left side is the input data, made up of signal bits and their parity bits, while the right side is the related corrected signal bits and parity bits as outputs. The number of inputs is equal to the number of outputs.
Fig. 5-16 Using neural networks to solve an error correction problem; errors in the input data can be corrected by well-trained neural networks

The error correction problem in the experiment has an 8-bit signal and 4 parity bits as inputs, 12 outputs and 3,328 patterns (256 correct patterns and 3,072 patterns with errors), using 42 neurons in a 12-30-12 MLP network (762 weights). Error patterns with one incorrect bit must be corrected. Both the traditional computation and the forward-only computation were performed with the LM algorithm. The testing results are presented in Table 5-8. The analytical result is 36.96%, as shown in Table 5-3.

Table 5-8 Comparison for the error correction problem

  Problem        Computation method   Time cost (ms/iteration)      Relative time
                                      Forward       Backward
  8-bit signal   Traditional          40.59         468.14          100.0%
                 Forward-only         175.72        0.00            34.5%

Compared with the traditional forward-backward computation in the LM algorithm, the forward-only computation again shows considerably improved efficiency. With the trained neural network, all patterns with a one-bit error are corrected successfully.

5.4.3.3 Forward Kinematics

Neural networks are successfully used to solve many practical problems in industry, such as control problems, compensation of nonlinearities in objects and sensors, identification of parameters which cannot be measured directly, and sensorless control [87-89]. Forward kinematics is an example of these types of practical applications [43][90-92]. The purpose is to calculate the position and orientation of a robot's end effector as a function of its joint angles. Fig. 5-17 shows the two-link planar manipulator. As shown in Fig. 5-17, the end effector coordinates of the manipulator are calculated by

x = L_1 \cos\alpha + L_2 \cos(\alpha + \beta)    (5-30)

y = L_1 \sin\alpha + L_2 \sin(\alpha + \beta)    (5-31)

where (x, y) are the coordinates of the end effector, determined by the angles α and β, and L1 and L2 are the arm lengths. In order to avoid scanning the "blind area", let us assume L1 = L2 = 1.
Fig. 5-17 Two-link planar manipulator

In this experiment, 224 patterns are applied to train a 3-7-3 MLP network (59 weights) using the LM algorithm. The comparison of computation cost between the forward-only computation and the traditional computation is shown in Table 5-9. In 100 trials with different starting points, the experiment achieved a 22.2% success rate, and the average number of iterations to converge was 123.4. The analytical result is 88.16%, as shown in Table 5-3.
Table 5-9 Comparison for the forward kinematics problem

  Computation method   Time cost (ms/iteration)      Relative time
                       Forward       Backward
  Traditional          0.307         0.771           100.0%
  Forward-only         0.727         0.00            67.4%

The presented experimental results match the analysis in section 5.3 well: for networks with multiple outputs, the forward-only computation is more efficient than the traditional backpropagation computation.

5.5 Conclusion

One of the major features of the proposed forward-only algorithm is that it can be easily adapted to train arbitrarily connected neural networks, not just MLP topologies. This is very important because neural networks with connections across layers are much more powerful than the commonly used MLP architectures. For example, if the number of neurons in the network is limited to 8, then the popular MLP topology with one hidden layer is capable of solving only the parity-7 problem; if the same 8 neurons are connected in a fully connected cascade, then the parity-255 problem can be solved with this network [93].

It was shown (Figs. 5-13 and 5-14) that, in order to secure training convergence with first order algorithms, an excessive number of neurons must be used, and this results in a failure of the neural network's generalization ability. This was the major reason for frustration in industrial practice, when neural networks were trained to small errors but responded very poorly to patterns not used for training. The presented forward-only computation for second order algorithms can be applied to train arbitrarily connected neural networks, so it is capable of training neural networks with a reduced number of neurons; as a consequence, good generalization abilities are secured (Fig. 5-12).

The proposed forward-only computation gives the same number of training iterations and the same success rates as the Hagan and Menhaj implementation of the LM algorithm, since the same Jacobian matrix is obtained by both methods. By removing the backpropagation process, the proposed method is much simpler than the traditional forward and backward procedure for calculating the elements of the Jacobian matrix. The whole computation can be described by a regular table (Fig. 5-7) and a general formula (equation (5-22)). Additionally, for networks with multiple outputs, the proposed method is less computationally intensive and faster than the traditional forward and backward computations [27][80].

CHAPTER 6
C++ IMPLEMENTATION OF NEURAL NETWORK TRAINER

Currently there are some excellent tools for neural network training, such as the MATLAB Neural Network Toolbox (MNNT) and the Stuttgart Neural Network Simulator (SNNS). The MNNT can do both EBP and LM training, but only for standard MLP networks, which are not as efficient as networks with connections across layers. Furthermore, it is also well known that MATLAB is very inefficient in executing "for" loops, which may slow down the training process. SNNS can handle FCN networks well, but the training methods it contains are all developed based on the EBP algorithm, such as the QuickPROP algorithm [94] and Resilient EBP [68], which makes the training still somewhat slow.

In this chapter, the neural network trainer named NBN 2.0 [44-46] is introduced as a powerful training tool. It contains the EBP algorithm with momentum [93], the LM algorithm [80], the NBN algorithm [26] and a newly developed second order algorithm. Based on neuron-by-neuron computation [27] and forward-only computation [78-79], all of these algorithms can handle arbitrarily connected neuron (ACN) networks.
Compared with the former MATLAB version [96], the revised trainer is expected to perform more efficient and stable training. NBN 2.0 is developed with C++ on the Visual Studio 6.0 platform. Its main graphical user interface (GUI) is shown in Fig. 6-6. In the following part of this chapter, detailed instructions for the software are presented; then several examples are used to illustrate the functionalities of NBN 2.0.

6.1 File Instruction

The software is made up of six types of files: executing files, the parameter file (unique), topology files, training pattern files, training result files and training verification files.

6.1.1 Executing Files

The executing files contain three files: the files "FauxS-TOON.ssk" and "skinppwtl.dll" for the GUI design, and the file "NBN 2.0.exe" used for running the software. Other files, such as the user instruction, the correction log and accessory tools (the Matlab code "PlotFor2D.m" for 2-D plotting), are also included.

6.1.2 Parameter File

This file is named "Parameters.dat" and it is necessary for running the software. It contains the initial values of the important parameters shown in Table 6-1.

Table 6-1 Parameters for training

  Parameter        Description
  algorithm        Index of the algorithm in the combo box
  alpha            Learning constant for EBP
  scale            Parameter for LM/NBN
  mu               Parameter for LM/NBN
  max mu           Parameter for LM/NBN (fixed)
  min mu           Parameter for LM/NBN (fixed)
  max error        Maximum error
  ITE_FOR_EBP      Maximum iteration for EBP
  ITE_FOR_LM       Maximum iteration for LM/NBN
  ITE_FOR_PO       Maximum iteration for the new algorithm
  momentum         Momentum for EBP
  po alpha         Parameter for the improved NBN
  po beta          Parameter for the improved NBN
  po gama          Parameter for the improved NBN
  training times   Number of training trials for automatic running

There are two ways to set these parameters: (1) edit the parameter file manually, according to the descriptions of the parameters in Table 6-1; (2) edit the parameters in the user interface; they are saved in the parameter file automatically once training is executed, as the initial values for the next run of the software.

6.1.3 Topology Files

Topology files are named "*.in" and they are mainly used to construct the neural network topologies for training. Topology files consist of four parts: (1) topology design; (2) weight initialization (optional); (3) neuron type instruction; (4) training data specification.

(1) Topology design: the topology design is aimed at creating neural structures. The general format of the command is "n [b] [type] [a1 a2 ... an]", which means that the inputs/neurons indexed with a1, a2 ... an are connected to neuron b with the specified neuron type (bipolar, unipolar or linear). Fig. 6-1 presents the commands and the related neural network topologies. Notice that the neuron type mbip stands for bipolar neurons, mu for unipolar neurons and mlin for linear neurons; these types are defined in the neuron type instructions.
Fig. 6-1 Commands and related neural network topologies: (a) 7-7-7 MLP network; (b) 7=3=1 BMLP network; (c) 7=1=1=1=1 FCC network

(2) Weight initialization: the weight initialization part is used to specify initial weights for training, and this part is optional. If there is no weight initialization in the topology file, the software generates all the initial weights randomly (from -1 to 1) before training. The general command is "w [wbias] [w1 w2 ... wn]", corresponding to the topology design. Fig. 6-2 shows an example of weight initialization for the parity-3 problem with 2 neurons in an FCC network.
Fig. 6-2 Weight initialization for the parity-3 problem with 2 neurons in an FCC network
(3) Neuron type: in the neuron type instruction part, three different types of neurons are defined: bipolar ("mbip"), unipolar ("mu") and linear ("mlin"). Bipolar neurons have both positive and negative outputs, while unipolar neurons have only positive outputs. The outputs of both bipolar and unipolar neurons are no larger than 1; if the desired outputs are larger than 1, linear neurons should be considered as the output neurons. The general command is ".model [mbip/mu/mlin] fun=[bip/uni/lin], gain=[value], der=[value]". Table 6-2 presents the three types of neurons used in the software.

Table 6-2 Three types of neurons in the software

  Neuron type   Activation function                               Command
  bipolar       f_b(net) = 2/(1 + exp(-gain·net)) - 1 + der·net   .model mbip fun=bip, gain=0.5, der=0.001
  unipolar      f_u(net) = 1/(1 + exp(-gain·net)) + der·net       .model mu fun=uni, gain=0.5, der=0.001
  linear        f_l(net) = gain·net                               .model mlin fun=lin, gain=1, der=0.005

From Table 6-2 it can be seen that "gain" and "der" are parameters of the activation functions. The parameter "der" is introduced to adjust the slope of the activation function (for unipolar and bipolar neurons); it is a trick used in the software to keep the training process from entering the saturation region, where the slope approaches zero.

(4) Training data: the training data specification part is used to set the name of the training pattern file, so that the correct training data can be loaded. The general command is "datafile=[file name]"; the file name has to be specified by the user. With at least three of the four parts set (weight initialization is optional), the topology file is correctly defined.

6.1.4 Training Pattern Files

The training pattern files include the input patterns and the related desired outputs. In a training pattern file, the number of rows is equal to the number of patterns, while the number of columns is equal to the sum of the number of inputs and the number of outputs. However, from the data in a training pattern file alone one cannot tell the number of inputs and the number of outputs, so the topology file has to be considered together with it in order to determine those two parameters (Fig. 6-3). The training pattern files are specified in the topology files as mentioned above, and they should be placed in the same folder as the related topology files.
Fig. 6-3 Extracting the number of inputs and the number of outputs from the data file and the topology

As described in Fig. 6-3, the number of inputs is obtained from the first command line of the topology design and is equal to the index of the first neuron minus 1. After that, the number of outputs is calculated as the number of columns in the training pattern file minus the number of inputs.
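As an illustration of the file formats described above, a hypothetical topology file and a matching training pattern file for the parity-3 problem (2 neurons in an FCC network, cf. Fig. 6-2) could look as follows. The command syntax follows the general formats given in section 6.1.3; the concrete indices, weight values, file names and the -1/+1 pattern encoding are assumptions made for this example and are not taken from the software package.

parity3.in (hypothetical):
n 4 mbip 1 2 3
n 5 mbip 1 2 3 4
w 0.1 -0.2 0.3 0.4
w 0.2 0.1 -0.3 0.5 -0.4
.model mbip fun=bip, gain=0.5, der=0.001
datafile=parity3.dat

parity3.dat (hypothetical; output = +1 for an odd number of +1 inputs):
-1 -1 -1 -1
-1 -1  1  1
-1  1 -1  1
-1  1  1 -1
 1 -1 -1  1
 1 -1  1 -1
 1  1 -1 -1
 1  1  1  1

Here the three inputs carry indices 1-3, so the two neurons receive indices 4 and 5; neuron 5 is additionally connected to neuron 4, which gives the FCC topology. The two optional w lines provide the initial weights (bias weight first), and the data file has one row per pattern with three input columns followed by one output column, consistent with the rule of Fig. 6-3.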
6.1.5 Training Result Files

Training result files are used to store the training information and results. Once the "save data" function is enabled by the user, important information about the current training, such as the training algorithm, training pattern file, topology, parameters, initial weights, resulting weights and training results, is saved after the training is finished. The name of the training result file is generated automatically from the starting time, in the format "date_time_result.txt". Fig. 6-4 shows a sample training result file.
Fig. 6-4 A sample of a training result file

6.1.6 Training Verification Files

Training verification files are generated by the software when the verification function is invoked by the user. The resulting weights from the current training are verified by computing the actual outputs for the related patterns. The name of the training verification file is also created from the system time when the verification starts, in the format "date_time_verification.txt". Fig. 6-5 gives a sample training verification file for the parity-3 problem.
Fig. 6-5 A sample of a training verification file for the parity-3 problem

6.2 Graphic User Interface Instruction

As shown in Fig. 6-6, the user interface consists of 8 areas: (1) plotting area; (2) training information area; (3) plot modes setting area; (4) execute modes setting area; (5) control area; (6) parameter setting area; (7) verification area; (8) command consoler area.
Fig. 6-6 The user interface of NBN 2.0

6.2.1 Plotting Area

This area is marked as (1) in Fig. 6-6 and is used to depict the sum squared error (SSE) during training. The log-scaled vertical axis presents SSE values from 0.0001 to 10000, while the horizontal axis, which is linearly scaled automatically with the coefficient shown at the bottom of the plotting area ("×[value]" in Fig. 6-6), shows the number of iterations used for training. The current training process or previous training processes can be recorded in the same plot, depending on the user's settings.

6.2.2 Training Information Area

This area is marked as (2) in Fig. 6-6; instantaneous training data are presented in this area, including the sum squared error (SSE) and the number of iterations for the current training, the average number of iterations and time spent in solving the same problem, and the success rate for multiple trials.

6.2.3 Plot Modes Setting Area

This area is marked as (3) in Fig. 6-6. Three plot modes are available in the software: multi curves, one curve and delayed curve. In the multi curves mode, all the training curves are plotted together and updated instantaneously. In the one curve mode, only the current training is plotted, while the other curves are erased. The delayed curve mode is used only in automatic training for multiple trials: during the training process there is no plotting, and all the curves are shown together after the whole training process is finished. The delayed curve mode is designed for training processes which need a huge number of iterations and for which plotting is time-consuming.

6.2.4 Execute Modes Setting Area

This area is marked as (4) in Fig. 6-6 and is used to control the training mode: training one time or automatic training for several trials, and saving the training results or not. If it is set to run automatically, the training will not stop unless it reaches the required number of training trials or the "Stop to Train" button is clicked. The default number of automatic training trials is 100, and it can be changed through the command consoler (area (8)) or the parameter settings (area (6)).

If it is set to save training data, all the important information, such as algorithm type, topology, training parameters, initial weights, resulting weights and training results (SSE and number of iterations), is saved in training result files (discussed in section 6.1.5).

6.2.5 Control Area

This area is marked as (5) in Fig. 6-6. The combo box is used to select training algorithms. There are 6 choices: (1) "EBP"; (2) "LM"; (3) "NBN"; (4) "EBP, forward-only"; (5) "NBN, forward-only"; (6) "NBN, improved". All algorithms are introduced in section 6.3. The button "Load Data File" is used to choose a topology file for training; the button "Set Parameters" is used to set training parameters for the selected training algorithm;
the button "Start To Train/Stop To Train" helps control the training process; and the button "Clear Plotting" is used to erase the current plot in the plotting area.

6.2.6 Parameter Setting Area

This area is marked as (6) in Fig. 6-6 and is used to set training-related parameters, such as the number of training trials, the maximum error and the maximum iteration. The training times parameter specifies the number of trials if automatic training is enabled. The maximum error is used to judge whether the training process has reached convergence. The maximum iteration limits the number of iterations for each training case: the training process stops when it reaches the maximum iteration. All the settings are saved in the parameter file once training is executed, and they are loaded as the initial values the next time the software is used.

6.2.7 Verification Area

This area is marked as (7) in Fig. 6-6 and is used for verification of the training results, by calculating the actual outputs for each pattern with the resulting weights. The verifying patterns can be training patterns, testing patterns or user-created patterns for the 2-D situation. The verification results are stored automatically in training verification files (discussed in section 6.1.6) and shown in pop-up windows. The verification results can be easily loaded into Matlab, MS Excel, Origin or other software for analysis and plotting. For 2-D input patterns, the verification data can be plotted in Matlab by the accessory tool included in the software, named "PlotFor2D.m".

6.2.8 Command Consoler Area

The list box is used to show important information or hints for the user's operations. It is also a command consoler: most of the operations can be performed with the related commands. Table 6-3 presents the available commands and their functions.

Table 6-3 Available commands and related functionalities

  Command           Function
  help              list all the available commands and instructions
  clr               clear the content of the list box
  cc                clear all the curves in the plotting area
  sus               suspend training and save the current status
  res               resume training with the status at the suspending point
  tra               start/stop training
  sav               data saving control
  aut               automatic training control
  pm=               select plotting mode, e.g. pm=2 (one curve mode)
  load              load input file
  para              set training parameters for the selected algorithm, e.g. para
  info              show current training settings
  iw                show current iw value (used for debugging)
  topo              show current topology
  alpha? / alpha=   get/set learning constant, e.g. alpha=1
  mom? / mom=       get/set momentum, e.g. mom=0.001
  tt? / tt=         get/set training times for automatic training, e.g. tt?
  me? / me=         get/set maximum error, e.g. me=0.01
  mi? / mi=         get/set maximum iteration, e.g. mi=100
  th? / th=         get/set parameter for the "NBN, improved" algorithm
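A typical interaction with the command consoler, assuming the commands behave as listed in Table 6-3, might look like the hypothetical session below. The text after each "%" is only an annotation, not part of the command, and the plot-mode index 3 for the delayed curve mode is an assumption, since Table 6-3 documents only pm=2 (one curve mode) explicitly.

load        % choose the topology file, e.g. parity3.in
info        % show the current training settings
me=0.01     % set the maximum (desired) error
mi=1000     % set the maximum number of iterations
tt=100      % set the number of trials for automatic training
pm=3        % select the delayed curve plotting mode (assumed index)
sav         % enable saving of training result files
aut         % enable automatic training
tra         % start training (issue again to stop)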
6.3 Implemented Algorithms

As introduced above, there are 6 algorithms available in the software for training. The following part introduces the characteristics and limitations of each algorithm.

EBP: This is the EBP algorithm with traditional forward-backward computation. For the EBP algorithm, the forward-backward computation may work a little faster than the forward-only computation. It is currently used only for standard MLP networks. The EBP algorithm can be used for training with huge numbers of patterns because of its simplicity; the tradeoff is slow convergence.

LM: This is the LM algorithm with traditional forward-backward computation. For the LM (and NBN) algorithm, the improved forward-only computation [78] (discussed in chapter 5) performs faster training than the forward-backward computation for networks with multiple outputs. This implementation is also currently used only for standard MLP networks. The LM (and NBN) algorithm converges much faster than the EBP algorithm for small and medium sized pattern sets. For huge pattern sets (huge Jacobian matrix, a problem we have solved) and huge networks (huge Hessian matrix), it may work more slowly than the EBP algorithm.

NBN: This is the NBN algorithm with forward-backward computation. The NBN algorithm is developed based on the LM algorithm, but it can handle arbitrarily connected neuron (ACN) networks, and its convergence is improved.

EBP, forward-only: This is the EBP algorithm with forward-only computation [78]. It can work on arbitrarily connected neuron networks.

NBN, forward-only: This is the NBN algorithm with forward-only computation [78]. It can handle arbitrarily connected neuron networks and, as mentioned above, it works faster than the "NBN" algorithm, especially for networks with multiple outputs.

NBN, improved: This is a newly developed second order algorithm, implemented with forward-only computation, so it can handle arbitrarily connected neuron networks. In this algorithm the Hessian matrix is inverted only once per iteration, so it is expected to compute faster than the LM (and NBN) algorithm, which may perform several Hessian matrix inversions per iteration. The training ability (convergence) is also improved in this algorithm. Furthermore, a local minima detector is implemented: when the detector diagnoses that the training is trapped in a local minimum, all the weights are regenerated randomly for further training. The detailed description of this algorithm is given in section 6.4.3.

For all second order algorithms, the improved second order computation [79] discussed in chapter 4 is applied in the implementation.

6.4 Strategies for Improving Training Performance

Besides the neuron-by-neuron algorithm [27], the forward-only algorithm [78] and the improved second order computation [79], several other strategies and algorithms are implemented in the software in order to improve the computational efficiency and the convergence ability: (1) in first order algorithms, momentum [95] is incorporated to stabilize the training process, so that large learning constants can be applied to speed up convergence; (2) in first order algorithms, the slope of the activation function is modified [97] to enhance the convergence ability; (3) in second order algorithms, the Levenberg Marquardt update rule is modified to reduce the number of matrix inversion operations, so as to accelerate the training process; (4) the sigmoidal activation function is modified to enhance the computation, so as to improve the convergence.

6.4.1 Momentum in First Order Gradient Search

It was shown in section 2.4 that, for the EBP algorithm, applying large learning constants may speed up the convergence, or lead to oscillation (Fig. 2-14). The momentum concept proposed in [95] is used to stabilize the training process, so that large learning constants can be applied to accelerate the training without oscillation. As discussed in section 2.4, the original update rule of the EBP algorithm is

\Delta w_k = -\alpha\, g_k    (6-1)

where k is the index of the iteration and α is the learning constant. By incorporating the momentum, the update rule can be modified to

\Delta w_k = -(1-\eta)\,\alpha\, g_k + \eta\, \Delta w_{k-1}    (6-2)

where η is the momentum, ranged between [0, 1].
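A minimal C++ sketch of the update rules (6-1) and (6-2) is given below; the variable names and the example values of α and η are illustrative only, not taken from the software.

#include <cstddef>
#include <vector>

// EBP weight update with momentum, Eqs. (6-1)/(6-2).
// "grad" is the gradient g_k for the current iteration; "dw_prev" stores
// the previous update Delta-w_{k-1} and is updated in place.
void ebp_update(std::vector<double> &w, const std::vector<double> &grad,
                std::vector<double> &dw_prev, double alpha, double eta) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Eq. (6-2); with eta = 0 this reduces to Eq. (6-1): dw = -alpha * grad.
        double dw = -(1.0 - eta) * alpha * grad[i] + eta * dw_prev[i];
        w[i] += dw;
        dw_prev[i] = dw;
    }
}

With η = 0 the rule reduces to the plain EBP update (6-1); values of η closer to 1 give more weight to the previous update and therefore damp the oscillation discussed below.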
Fig. 6-7 interprets how the momentum stabilizes the training process. As shown in Fig. 6-7a, the training process oscillates because the direction of the current weight update Δwk points far away from the minimum (the center point). By adding the momentum η, the current weight update Δwk is recalculated as the combination of part of the previous update Δwk-1 and part of the current gradient gk, as shown in Fig. 6-7b. One may notice that the Δwk in Fig. 6-7b approaches the center point more stably than the Δwk in Fig. 6-7a.
Fig. 6-7 Training process with and without momentum: (a) oscillation; (b) stabilization

Let us use the XOR problem as an example to illustrate the advantage of the modified update rule (6-2) in comparison with update rule (6-1). Fig. 6-8 shows the neural network used in the experiment; it consists of 2 inputs, 2 neurons in an FCC network and 1 output.
Fig. 6-8 Network architecture used for the XOR problem
Fig. 6-9 Training results of the XOR problem: (a) update rule (6-1), α=0.1; (b) update rule (6-1), α=10; (c) update rule (6-2), α=0.1 and η=0.5; (d) update rule (6-2), α=10 and η=0.5

Table 6-4 Comparison of different EBP algorithms for solving the XOR problem

                        Update rule (6-1)        Update rule (6-2), η=0.5
                        α=0.1       α=10         α=0.1        α=10
  Success rate          100%        18%          100%         100%
  Average iteration     17845.44    179.00       18415.84     187.76
  Average time (ms)     3413.26     46.83        4687.79      49.27

For each test case, the training process is repeated for 100 trials with randomly generated initial weights in the range [-1, 1]. Fig. 6-9a and Fig. 6-9b present the training results using update rule (6-1). It can be seen that the bigger the learning constant is, the fewer iterations are needed for convergence; however, too large a learning constant also causes divergence and lowers the rate of successful training. Fig. 6-9c and Fig. 6-9d show the training results using update rule (6-2). One may notice that, with the same learning constant, the update rule (6-2) with momentum converges a little more slowly than the update rule (6-1), but it clearly stabilizes the training process with large learning constants (α=10 in the example).

6.4.2 Modified Slope

The well-known "flat spot" problem occurs when the slope of the neuron is very small while the error is large; the training is then pushed into the saturation region and gets stuck. In order to avoid being trapped in the saturation region, an equivalent slope is introduced instead of the derivative of the activation function (see Fig. 6-10), and it is calculated by the following algorithm [97]:

Compute the derivative of the activation function;
Compute the slope of line AD, SAD;
If derivative > SAD
    improved slope = derivative;
Else
    improved slope = SAD;

Fig. 6-10 The "flat spot" problem in the sigmoidal activation function

In order to test the modification, the "worst case" training is performed. The "worst case" training uses a successful training result as the initial status to train the same patterns with all outputs reversed. For the parity-3 problem, the training results of the "worst case" are shown in Fig. 6-11.
Fig. 6-11 Testing the modified slope by "worst case" training: (a) a successful training result of the parity-3 problem; (b) "worst case" training result without slope modification; (c) "worst case" training with slope modification

From the results shown in Fig. 6-11, it is clear that the modified slope works well for the "worst case" training, which means that the convergence ability of first order algorithms is enhanced.

6.4.3 Modified Second Order Update Rule

The update rule of the Levenberg Marquardt algorithm is
\Delta w_k = \left(J_k^T J_k + \mu I\right)^{-1} J_k^T e_k    (6-3)

During the training process using the Levenberg Marquardt update rule, the parameter μ may be adjusted (multiplied or divided by a constant) several times in each iteration. From (6-3) it follows that the matrix (JTJ + μI) needs to be inverted for each adjustment of the parameter μ. In order to avoid multiple matrix inversions in a single iteration, the update rule can be modified as

\Delta w_k = \alpha\left(1 - \tanh\beta\right) J_k^T e_k + \tanh\left(\beta\right)\left(J_k^T J_k + \gamma I\right)^{-1} J_k^T e_k    (6-4)

In the update rule (6-4) there are three parameters: the learning constant α, the combination coefficient β and the conjugate coefficient γ. For each iteration, the learning constant α is fixed as a small constant, while β and γ are adjusted only once, according to the algorithm in Fig. 6-12.

If SSEk < SSEk-1
Then β = β × P and γ = γ / P;
Else β = β / P and γ = γ × P;

Fig. 6-12 Parameter adjustment in update rule (6-4)

Several parity problems are applied to test the performance of the modified update rule (6-4) in comparison with (6-3). In the experiment, each test case is repeated for 100 trials with randomly generated initial conditions between -1 and 1. The desired training error is 0.01 and the maximum iteration is 500. The gain parameter of the activation function is set to 0.5. All neurons are connected in FCC networks, and the NBN computation is applied for training.

Table 6-5 Testing results of parity problems using update rules (6-3) and (6-4)

  Parity-N   Neurons   Ave. iterations       Ave. training time (ms)     Success rate
                       (6-3)      (6-4)      (6-3)        (6-4)          (6-3)   (6-4)
  5          3         21.1176    24.9403    63.7843      14.3881        51%     67%
  7          4         21.1667    39.6176    103.3333     90.9118        22%     36%
  9          5         32.1250    41.3333    527.5000     459.3667       8%      30%
  11         6         51.3333    56.1250    8052.0000    4548.7500      3%      16%

From the testing results presented in Table 6-5, it can be noticed that, even though the update rule (6-4) takes more iterations than (6-3), the computation time of (6-4) is reduced because there is only one matrix inversion per iteration. Moreover, a somewhat improved success rate is obtained for each test case, which means that the update rule (6-4) has a more powerful search ability than (6-3).

6.4.4 Modified Activation Function

The convergence ability of gradient descent methods is limited by local minima and "flat regions". For the one-dimensional case, a local minimum and a flat region are visualized in Fig. 6-13.
Fig. 6-13 Failures of gradient based optimization: (a) local minima; (b) flat region

From the standpoint of computation, when the training process is trapped in a local minimum or a flat region, the elements of the gradient approach zero and the weight updating gets stuck. In real neural network training cases, the local minima problem can be overcome by properly increasing the network size. Considering the sigmoidal shape of the activation function, there is no absolutely flat region on the error surface. Normally, training stuck in a "flat region" means that the elements of the gradient vector are approximated to zero by the compiler because of the limited computer precision: the "double" type offers only about 15-16 significant decimal digits. Therefore, when the elements of the gradient are not exactly 0 but smaller than the precision limit, they are rounded to 0. This is a hardware limitation on the software computation and cannot be avoided. However, the computation can be pushed further by programming strategies, such as enlarging the output range of the activation functions.

The commonly used sigmoidal activation function for neural network training is

f(x) = \frac{2}{1+\exp(-\lambda x)} - 1    (6-5)

Equation (6-5) describes the behavior of bipolar neurons and its range is [-1, 1]. In order to enlarge the output range, a factor θ can be applied and (6-5) is rewritten as

f(x) = \theta\left(\frac{2}{1+\exp(-\lambda x)} - 1\right)    (6-6)

where the parameter θ is larger than 1, so the output of the enlarged bipolar neuron spans [-θ, θ].
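A direct C++ transcription of the two activation functions, using the symbols λ (gain) and θ (enlargement factor) as reconstructed in (6-5) and (6-6), is given below for reference; the example arguments in main are arbitrary.

#include <cmath>
#include <cstdio>

// Standard bipolar activation, Eq. (6-5); output range [-1, 1].
double f_bip(double x, double lambda) {
    return 2.0 / (1.0 + std::exp(-lambda * x)) - 1.0;
}

// Enlarged bipolar activation, Eq. (6-6); output range [-theta, theta].
double f_bip_enlarged(double x, double lambda, double theta) {
    return theta * (2.0 / (1.0 + std::exp(-lambda * x)) - 1.0);
}

int main() {
    std::printf("standard: %f\n", f_bip(1.0, 0.1));
    std::printf("enlarged: %f\n", f_bip_enlarged(1.0, 0.1, 20.0));
    return 0;
}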
However, the computation can be furthered by programming strategies, such as enlarging the output range of activation functions. The common used sigmoidal activation function for neural network training is: ? ? ? ? 1e x p1 2 ????? xxf ? (6-5) Equation (6-5) presents the behaviors of bipolar neurons and it is ranged from [-1, 1]. In order to enlarge the output range, a factor can be applied and (6-6) is rewritten as ? ? ? ? ?? ????????? ????? 1e x p1 2 xxf (6-6) Where: the parameter ? is larger than ?1?. So the output of the enlarged bipolar neuron will be [- ?, ?]. Several parity-N problems are applied to test the performance of modified activation function (6-6), by comparing with the commonly used (6-5). In the experiments, all neurons are in fully connected cascade (FCC) networks and NBN algorithm is used for training. The desired 113 training accuracy is 0.01 and the maximum iteration is 1000. For each case, the testing is repeated for 100 trials with randomly generated initial weights between [-1, 1]. Table 6-6 Testing results of parity-N problems using different activation functions with the minimal network architecture analyzed in section 2.3 Parity-N Neurons Parameters Iteration Times (ms) Success Rate 2 2 ?=20, ?=0.1 13.10 (8.83) 17.60 (6.94) 100% (100%) 3 2 ?=20, ?=0.1 12.22(8.08) 14.94 (7.13) 100% (99%) 4 3 ?=20, ?=0.1 54.24 (22.65) 76.55 (38.69) 100% (45%) 5 3 ?=20, ?=0.1 50.32 (21.29) 75.84 (91.91) 100% (37%) 6 3 ?=25, ?=0.25 502.39 (/) 998.60 (/) 98% (0%) 7 3 ?=25, ?=0.25 117.06 (/) 459.87 (/) 98% (0%) 8 4 ?=25, ?=0.25 292.79 (/) 1869.47 (/) 93% (0%) 9 4 ?=25, ?=0.3 383.48 (/) 4367.27 (/) 77% (0%) 10 4 ?=25, ?=0.3 1260.12 (/) 30180.31 (/) 21% (0%) Table 6-6 presents the testing results. Testing results for activation function (6-5) are in the parenthesis. One may notice that, with the modified activation function (6-6), the training success rates are significantly improved and the tradeoff is extra training time. Notice that, in chapter 2, the analytical results show that using FCC networks, parity-7 problem can be solved with 3 neurons. Networks with activation function (6-5) cannot solve the parity-7 problem at all; while with the modified activation function (6-6), the training success rate can reach 98% for parity-7 problem with 3 neurons in FCC network. It is worth to mention that the trained network using the modified activation function (6-6) can be scaled back as a trained network using the activation function (6-5). As shown in Fig. 6- 14, the two networks are equivalent: network in Fig. 6-14a is trained using the modified activation function (6-6); while network in Fig. 6-14b consists of the neurons with activation function (6-7). 114 I n p u t 1 I n p u t 2 I n p u t 3 + 1 O u t p u t I n p u t 4 I n p u t 5 w 1 w 2 w 3 I n p u t 1 I n p u t 2 I n p u t 3 + 1 O u t p u t I n p u t 4 I n p u t 5 ? w 1 ? w 2 ? w 3 ? K e e p u n c h a n g e d (a) Neurons with activation functions in (6-6) (b) Neurons with activation functions in (6-7) Fig. 6-14 Two equivalent networks 6.5 Case Studies Using NBN 2.0 In industrial application, dataset is often obtained by sampling and testing process. In order to get the responses of inputs which are not sampled for testing, an approximated function has to be build to represent the relationship between stimulus and response, based on the known sampled testing results. Neural networks can always be considered as a candidate of approximator and the software NBN 2.0 is recommended for neural network design. 
6.5 Case Studies Using NBN 2.0

In industrial applications, datasets are often obtained by a sampling and testing process. In order to get the responses for inputs which were not sampled during testing, an approximating function has to be built to represent the relationship between stimulus and response, based on the known sampled testing results. Neural networks can always be considered as a candidate approximator, and the software NBN 2.0 is recommended for neural network design.

In this section, two examples are presented to test the abilities of NBN 2.0 from the standpoints of data classification and function approximation. Two-dimensional examples are considered so that the verification results can be visualized.

6.5.1 Data Classification

Neural networks can be considered a supervised data classification method. In this example, we have a two-dimensional dataset of 21×21 = 441 points uniformly sampled from x ∈ [-1, 1] and y ∈ [-1, 1]. As shown in Fig. 6-16a, there are 5 groups, A, B, C, D and E, in the sampling range. The purpose is to build a classification system which can classify non-sampled inputs into the correct groups.

The first step is to define the output coding of each group. Considering a network with five outputs, and without losing generality, let us define the output of group A as "0 0 0 0 1", group B as "0 0 0 1 0", group C as "0 0 1 0 0", group D as "0 1 0 0 0" and group E as "1 0 0 0 0". For all other cases, the output is defined as "0 0 0 0 0". A network of 15 neurons in FCC topology is used to build the system, and the NBN algorithm is applied for adjusting the system parameters. Fig. 6-15 shows the network construction commands in the topology file.
Fig. 6-15 Network construction commands: 15 neurons in an FCC network with 2 inputs and 5 outputs

After successful training (training sum squared error < 0.01), 50×50 = 2,500 patterns in the same range as the sampled patterns are applied to test the trained network. The testing patterns are generated by clicking the button "Generalization", as introduced in section 6.2.7. Fig. 6-16b shows the visualized testing results; it can be noticed that the five groups are classified correctly.
Fig. 6-16 Data classification: (a) location of the five groups; (b) generalization result with 50×50 = 2,500 points

6.5.2 Function Approximation

In this experiment, the purpose is to approximate the forward kinematics described in Fig. 5-17. The location (x, y) of the end effector is given by the two equations (5-30) and (5-31). In order to avoid scanning the "blind area", let us assume the arm lengths L1 = L2 = 1. The sampling ranges of the angles α and β are α ∈ [0, 6] and β ∈ [0, 6]. For each dimension, 13×13 = 169 points are uniformly sampled as the training dataset; after training, 61×61 = 3,721 points in the same range as the sampling are uniformly generated as the testing dataset. All the training/testing points are visualized in Fig. 6-17 and Fig. 6-18 for the X and Y dimensions, respectively. The approximation results are evaluated using the averaged sum squared error E defined as

E = \frac{1}{P}\sum_{p=1}^{P} e_p^2    (6-7)

where ep is the difference between the desired output (from equation (5-30) or (5-31)) and the actual output (from the designed neural network) for a given pattern p, and P is the number of training/testing patterns.
Fig. 6-17 X-dimension surface of forward kinematics: (a) training patterns, 13×13 = 169 points; (b) testing patterns, 61×61 = 3,721 points
Fig. 6-18 Y-dimension surface of forward kinematics: (a) training patterns, 13×13 = 169 points; (b) testing patterns, 61×61 = 3,721 points

For the x-dimension approximation, using the NBN algorithm, an FCC network with 8 neurons can reach the training accuracy ETrain = 5.7279e-005; the related test result is shown in Fig. 6-19. For the y-dimension approximation, using the NBN algorithm, an FCC network with 6 neurons can obtain the training accuracy ETrain = 5.7281e-005; the related test result is shown in Fig. 6-20.
Fig. 6-19 X-dimension testing results: (a) testing surface, SSETest = 5.3567e-005; (b) error surface
Fig. 6-20 Y-dimension testing results: (a) testing surface, SSETest = 5.1324e-005; (b) error surface
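For reference, the 13×13 training patterns of the x-dimension could be generated with a short program such as the following sketch. The file name kinematics_x.dat is hypothetical, the program assumes the reconstructed form of equation (5-30) with L1 = L2 = 1, and the row layout (two input columns followed by one output column) follows the pattern-file convention of section 6.1.4.

#include <cmath>
#include <cstdio>

int main() {
    const int n = 13;                                   // 13x13 = 169 training points
    std::FILE *fp = std::fopen("kinematics_x.dat", "w");
    if (!fp) return 1;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double alpha = 6.0 * i / (n - 1);           // uniform sampling of [0, 6]
            double beta  = 6.0 * j / (n - 1);
            // Eq. (5-30) with L1 = L2 = 1: x-coordinate of the end effector.
            double x = std::cos(alpha) + std::cos(alpha + beta);
            std::fprintf(fp, "%f %f %f\n", alpha, beta, x);
        }
    }
    std::fclose(fp);
    return 0;
}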
6.6 Conclusion

In this chapter, the software NBN 2.0 is introduced for neural network training. The software contains both first order and second order training algorithms, implemented with the traditional forward-backward computation and the newly developed forward-only computation, respectively. It can handle not only MLP networks, but also more powerful networks with connections across layers, such as BMLP networks and FCC networks. Detailed instructions for the files and GUI operations of NBN 2.0 are presented. Several strategies for improving the training performance are also built into the software. The momentum strategy is incorporated in the first order algorithms in order to stabilize the training process with a large learning constant and thus speed up convergence. The modified slope of the activation function enhances the convergence ability of first order algorithms, and the improvement is demonstrated by the "worst case" test. The other two strategies, the modified second order update rule in section 6.4.3 and the modified activation function in section 6.4.4, are used to accelerate the computation and to enhance the convergence, respectively; both are our recent, as yet unpublished, work.

The software NBN 2.0 can be used for data analysis tasks such as data classification, pattern recognition and function approximation, as shown by the examples in section 6.5. NBN 2.0 is available at http://www.eng.auburn.edu/users/wilambm/nnt/, and the data for all the examples presented in this chapter are included in the software package.
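As a concrete reminder of the momentum strategy mentioned above, a commonly used form of the first order weight update with momentum is sketched below; the learning constant alpha, the momentum term m and the routine itself are illustrative and are not taken from the NBN 2.0 source.

#include <cstddef>
#include <vector>

// One first order update step with momentum: each new step keeps a fraction m
// of the previous step, which damps oscillations when the learning constant
// alpha is large and therefore helps to stabilize and speed up convergence.
void updateWeightsWithMomentum(std::vector<double>& w,
                               const std::vector<double>& gradient,
                               std::vector<double>& previousStep,
                               double alpha, double m) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        const double step = -alpha * gradient[i] + m * previousStep[i];
        w[i] += step;
        previousStep[i] = step;   // remembered for the next iteration
    }
}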
CHAPTER 7

CONCLUSION

The dissertation started with two experiments that illustrate the attractive behavior of neural networks for pattern recognition and function approximation. Then, the efficiency of different neural network architectures was evaluated and compared based on parity problems. Analytically, we proved that neural networks with connections across layers, such as bridged multilayer perceptron (BMLP) networks and fully connected cascade (FCC) networks, are much more powerful and efficient than the popular multilayer perceptron (MLP) networks. Both first and second order gradient descent based learning algorithms were then studied and derived from scratch. The comparison between first order algorithms, such as the error backpropagation (EBP) algorithm [81], and second order algorithms, such as the Gauss-Newton method and the Levenberg Marquardt (LM) algorithm [80], led to the conclusion that the Levenberg Marquardt algorithm is indeed one of the most efficient and stable algorithms for neural network learning. Experimental results demonstrated the existence of the over-fitting problem in neural network training, which is mainly due to an improperly chosen network size. It was proposed that, by utilizing efficient neural network architectures and second order learning algorithms, very compact neural networks can be designed [28], so as to reduce the probability of over-fitting. The recently developed second order neuron-by-neuron (NBN) algorithm [27] is recommended for neural network training because it was designed to train efficient neural network architectures, such as BMLP networks and FCC networks.

The purpose of the dissertation was addressed by analyzing the disadvantages and computational inefficiencies in second order algorithms, including both the famous Hagan and Menhaj Levenberg Marquardt algorithm [80] and the powerful NBN algorithm [27]. There are three main disadvantages in the Hagan and Menhaj Levenberg Marquardt algorithm: (1) network limitation; (2) inefficient repetition in the backpropagation computation; (3) memory limitation. The original NBN algorithm was highlighted because it solved the first problem. Inherited from the NBN algorithm, the proposed forward-only computation [78] can handle not only traditional MLP networks, but also very efficient network architectures with connections across layers (solving problem one). By replacing the backpropagation process with extra computation in the forward pass of the Jacobian matrix computation, the forward-only computation is more efficient than the forward-backward computation (adopted by both the Hagan and Menhaj LM algorithm and the NBN algorithm) for networks with multiple outputs. The forward-only computation is designed around a regular table (Fig. 5-7) and a general formula (5-22). The complex neural network training process with the second order update rule is thereby reduced to a puzzle-like procedure: filling in the unknown elements of the table according to the given rule (formula (5-22)), which is very easy to implement in software.

Another improvement in second order computation was proposed to solve the memory limitation (problem three) [79]. Storage of the Jacobian matrix is avoided, and the quasi Hessian matrix is calculated directly. Experimental results showed that the training speed was also improved significantly because of the memory reduction. The improved second order computation makes it possible to use second order algorithms to design neural networks for problems with an essentially unlimited number of patterns. All the algorithms are implemented in the software NBN 2.0, developed based on Visual C++ 6.0. NBN 2.0 is designed for neural network training and can also be used to test the generalization ability of the trained neural networks.

So far, very efficient second order algorithms can be applied to train the powerful networks with connections across layers. Therefore, compact (close to optimal) neural networks can be obtained in practical applications to enhance the generalization ability. However, several unsolved problems remain. For example, we know that compact neural networks should be applied to solve problems, but we do not know how compact they should be. Trial and error is often used to find a compact neural architecture, but it is very time-consuming for complex problems. Some pruning/growing algorithms have been introduced to find the optimal size of neural networks [11-12].

Recently, we have moved our research to a special type of neural networks: radial basis function (RBF) networks [99-100]. Compared with traditional feedforward neural networks, RBF networks have the advantages of easy design, stable and good generalization ability, good tolerance to input noise, and online learning ability. RBF networks are strongly recommended as an efficient and reliable way of designing dynamic systems [98]. Our recently developed error correction (ErrCor) algorithm [101] and improved second order computation [102] provide an efficient and robust way to design compact RBF networks.
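To make the memory-saving computation summarized above concrete, the following minimal sketch accumulates the quasi Hessian matrix Q and the gradient vector g pattern by pattern, so that the full Jacobian matrix never has to be stored; a single network output is assumed for simplicity, and the variable names and routine are illustrative rather than taken from the NBN 2.0 source.

#include <cstddef>
#include <vector>

// Accumulate Q = sum over p of j_p^T * j_p and g = sum over p of j_p^T * e_p,
// one training pattern at a time. Only one row j of the Jacobian (one value
// per weight) and the scalar error e are needed in memory at any moment.
void accumulatePattern(const std::vector<double>& j, double e,
                       std::vector<std::vector<double> >& Q,
                       std::vector<double>& g) {
    const std::size_t n = j.size();   // number of weights
    for (std::size_t a = 0; a < n; ++a) {
        g[a] += j[a] * e;
        for (std::size_t b = 0; b < n; ++b) {
            Q[a][b] += j[a] * j[b];
        }
    }
}

After all patterns have been processed, Q and g take the places of the products J^T·J and J^T·e in the Levenberg Marquardt update, while the memory cost no longer grows with the number of training patterns.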
REFERENCES

[1] P. J. Werbos, "Back-propagation: Past and Future," Proceedings of the International Conference on Neural Networks, San Diego, CA, vol. 1, pp. 343-354, 1988.
[2] J. Moody and C. J. Darken, "Fast Learning in Networks of Locally-Tuned Processing Units," Neural Computation, vol. 1, no. 2, pp. 281-294, 1989.
[3] R. Hecht-Nielsen, "Counterpropagation Networks," Applied Optics, vol. 26, no. 23, pp. 4979-4984, 1987.
[4] A. G. Ivakhnenko and J. A. Mueller, "Self-Organizing of Nets of Active Neurons," System Analysis Modeling Simulation, vol. 20, pp. 93-106, 1995.
[5] Y. L. Chow, L. L. Frank et al., "Disturbance and Friction Compensations in Hard Disk Drives Using Neural Networks," IEEE Trans. on Industrial Electronics, vol. 57, no. 2, pp. 784-792, Feb. 2010.
[6] M. A. M. Radzi and N. A. Rahim, "Neural Network and Bandless Hysteresis Approach to Control Switched Capacitor Active Power Filter for Reduction of Harmonics," IEEE Trans. on Industrial Electronics, vol. 56, no. 5, pp. 1477-1484, May 2009.
[7] M. Cirrincione, M. Pucci, G. Vitale and A. Miraoui, "Current Harmonic Compensation by a Single-Phase Shunt Active Power Filter Controlled by Adaptive Neural Filtering," IEEE Trans. on Industrial Electronics, vol. 56, no. 8, pp. 3128-3143, Aug. 2009.
[8] Q. N. Le and J. W. Jeon, "Neural-Network-Based Low-Speed-Damping Controller for Stepper Motor With an FPGA," IEEE Trans. on Industrial Electronics, vol. 57, no. 9, pp. 3167-3180, Sep. 2010.
[9] C. Xia, C. Guo and T. Shi, "A Neural-Network Identifier and Fuzzy-Controller-Based Algorithm for Dynamic Decoupling Control of Permanent-Magnet Spherical Motor," IEEE Trans. on Industrial Electronics, vol. 57, no. 8, pp. 2868-2878, Aug. 2010.
[10] F. F. M. E. Sousy, "Hybrid H∞-Based Wavelet-Neural-Network Tracking Control for Permanent-Magnet Synchronous Motor Servo Drives," IEEE Trans. on Industrial Electronics, vol. 57, no. 9, pp. 3157-3166, Sep. 2010.
[11] Z. Li, "Robust Control of PM Spherical Stepper Motor Based on Neural Networks," IEEE Trans. on Industrial Electronics, vol. 56, no. 8, pp. 2945-2954, Aug. 2009.
[12] S. M. Gadoue, D. Giaouris and J. W. Finch, "Sensorless Control of Induction Motor Drives at Very Low and Zero Speeds Using Neural Network Flux Observers," IEEE Trans. on Industrial Electronics, vol. 56, no. 8, pp. 3029-3039, Aug. 2009.
[13] V. Machado, A. D. D. Neto and J. D. D. Melo, "A Neural Network Multiagent Architecture Applied to Industrial Networks for Dynamic Allocation of Control Strategies Using Standard Function Blocks," IEEE Trans. on Industrial Electronics, vol. 57, no. 5, pp. 1823-1834, May 2010.
[14] H. P. Huang, J. L. Yan and T. H. Cheng, "Development and Fuzzy Control of a Pipe Inspection Robot," IEEE Trans. on Industrial Electronics, vol. 57, no. 3, pp. 1088-1095, March 2010.
[15] H. Chaoui, P. Sicard and W. Gueaieb, "ANN-Based Adaptive Control of Robotic Manipulators With Friction and Joint Elasticity," IEEE Trans. on Industrial Electronics, vol. 56, no. 8, pp. 3174-3187, Aug. 2009.
[16] L. Y. Wang, T. Y. Chai and L. F. Zhai, "Neural-Network-Based Terminal Sliding-Mode Control of Robotic Manipulators Including Actuator Dynamics," IEEE Trans. on Industrial Electronics, vol. 56, no. 9, pp. 3296-3304, Sep. 2009.
[17] F. Moreno, J. Alarcón, R. Salvador and T. Riesgo, "Reconfigurable Hardware Architecture of a Shape Recognition System Based on Specialized Tiny Neural Networks With Online Training," IEEE Trans. on Industrial Electronics, vol. 56, no. 8, pp. 3253-3263, Aug. 2009.
[18] Y. J. Lee and J. Yoon, "Nonlinear Image Upsampling Method Based on Radial Basis Function Interpolation," IEEE Trans. on Image Processing, vol. 19, issue 10, pp. 2682-2692, 2010.
[19] S. Ferrari, F. Bellocchio, V. Piuri and N. A. Borghese, "A Hierarchical RBF Online Learning Algorithm for Real-Time 3-D Scanner," IEEE Trans. on Neural Networks, vol. 21, issue 2, pp. 275-285, 2010.
[20] K. Meng, Z. Y. Dong, D. H. Wang and K. P. Wong, "A Self-Adaptive RBF Neural Network Classifier for Transformer Fault Analysis," IEEE Trans. on Power Systems, vol. 25, issue 3, pp. 1350-1360, Feb. 2010.
[21] S. Huang and K. K. Tan, "Fault Detection and Diagnosis Based on Modeling and Estimation Methods," IEEE Trans. on Neural Networks, vol. 20, issue 5, pp. 872-881, Apr. 2009.
[22] K. Hornik, M. Stinchcombe and H. White, "Multilayer Feedforward Networks Are Universal Approximators," Neural Networks, vol. 2, issue 5, pp. 359-366, 1989.
[23] P. J. Werbos, "Back-propagation: Past and Future," Proceedings of the International Conference on Neural Networks, San Diego, CA, vol. 1, pp. 343-354, 1988.
[24] K. Levenberg, "A Method for the Solution of Certain Problems in Least Squares," Quarterly of Applied Mathematics, vol. 5, pp. 164-168, 1944.
[25] D. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear Parameters," SIAM J. Appl. Math., vol. 11, no. 2, pp. 431-441, Jun. 1963.
[26] B. M. Wilamowski, H. Yu and N. Cotton, "Neuron by Neuron Algorithm," Industrial Electronics Handbook, vol. 5 - INTELLIGENT SYSTEMS, 2nd Edition, CRC Press, 2010, chapter 13, pp. 13-1 to 13-24.
[27] B. M. Wilamowski, N. J. Cotton, O. Kaynak and G. Dundar, "Computing Gradient Vector and Jacobian Matrix in Arbitrarily Connected Neural Networks," IEEE Trans. on Industrial Electronics, vol. 55, no. 10, pp. 3784-3790, Oct. 2008.
[28] B. M. Wilamowski, "Neural Network Architectures and Learning Algorithms: How Not to Be Frustrated with Neural Networks," IEEE Ind. Electron. Mag., vol. 3, no. 4, pp. 56-63, 2009.
[29] T. Takagi and M. Sugeno, "Fuzzy Identification of Systems and Its Application to Modeling and Control," IEEE Transactions on Systems, Man, and Cybernetics, vol. 15, no. 1, pp. 116-132, 1985.
[30] T. T. Xie, H. Yu and B. M. Wilamowski, "Neuro-fuzzy System," Industrial Electronics Handbook, vol. 5 - INTELLIGENT SYSTEMS, 2nd Edition, CRC Press, 2010, chapter 20, pp. 20-1 to 20-9.
[31] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[32] C. Saunders, A. Gammerman and V. Vovk, "Ridge Regression Learning Algorithm in Dual Variables," Proceedings of the 15th International Conference on Machine Learning, ICML-98, Madison, Wisconsin, 1998.
[33] T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.
[34] I. T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, 2nd edition, Springer, Oct. 2002.
[35] Q. K. Pan, M. F. Tasgetiren and Y. C. Liang, "A Discrete Particle Swarm Optimization Algorithm for the No-Wait Flowshop Scheduling Problem," Computers & Operations Research, vol. 35, issue 9, pp. 2807-2839, Sep. 2008.
[36] M. Dorigo, V. Maniezzo and A. Colorni, "Ant System: Optimization by a Colony of Cooperating Agents," IEEE Transactions on Systems, Man, and Cybernetics - Part B, vol. 26, issue 1, pp. 29-41, Feb. 1996.
[37] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, Reading, MA, 1989.
[38] S. Wu and T. W. S. Chow, "Induction Machine Fault Detection Using SOM-Based RBF Neural Networks," IEEE Trans. on Industrial Electronics, vol. 51, no. 1, pp. 183-194, Feb. 2004.
[39] C. K. Goh, E. J. Teoh and K. C. Tan, "Hybrid Multiobjective Evolutionary Design for Artificial Neural Networks," IEEE Trans. on Neural Networks, vol. 19, no. 9, pp. 1531-1548, Sep. 2008.
[40] Y. Song, Z. Q. Chen and Z. Z. Yuan, "New Chaotic PSO-Based Neural Network Predictive Control for Nonlinear Process," IEEE Trans. on Neural Networks, vol. 18, no. 2, pp. 595-601, Feb. 2007.
[41] T. T. Xie, H. Yu and B. M. Wilamowski, "Replacing Fuzzy Systems with Neural Networks," in Proc. 3rd IEEE Human System Interaction Conf. HSI 2010, Rzeszow, Poland, May 13-15, 2010, pp. 189-193.
[42] M. Sugeno and G. T. Kang, "Structure Identification of Fuzzy Model," Fuzzy Sets and Systems, vol. 28, no. 1, pp. 15-33, 1988.
[43] A. Malinowski and H. Yu, "Comparison of Embedded System Design for Industrial Applications," IEEE Trans. on Industrial Informatics, vol. 7, issue 2, pp. 244-254, May 2011.
[44] R. J. Wai, J. D. Lee and K. L. Chuang, "Real-Time PID Control Strategy for Maglev Transportation System via Particle Swarm Optimization," IEEE Trans. on Industrial Electronics, vol. 58, no. 2, pp. 629-646, Jan. 2011.
[45] M. A. S. K. Khan and M. A. Rahman, "Implementation of a Wavelet-Based MRPID Controller for Benchmark Thermal System," IEEE Trans. on Industrial Electronics, vol. 57, no. 12, pp. 4160-4169, Nov. 2010.
[46] Y. Z. Li and K. M. Lee, "Thermohydraulic Dynamics and Fuzzy Coordination Control of a Microchannel Cooling Network for Space Electronics," IEEE Trans. on Industrial Electronics, vol. 58, no. 2, pp. 700-708, Feb. 2011.
[47] R. Masuoka, N. Watanabe, A. Kawamura, Y. Owada and K. Asakawa, "Neurofuzzy System - Fuzzy Inference Using a Structured Neural Network," Proceedings of the International Conference on Fuzzy Logic & Neural Networks, Iizuka, Japan, pp. 173-177, July 20-24, 1990.
[48] LIBSVM link: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[49] H. Yu and B. M. Wilamowski, "Efficient and Reliable Training of Neural Networks," IEEE Human System Interaction Conference, HSI 2009, Catania, Italy, May 21-23, 2009, pp. 109-115.
[50] H. Yu and B. M. Wilamowski, "C++ Implementation of Neural Networks Trainer," 13th IEEE Intelligent Engineering Systems Conference, INES 2009, Barbados, April 16-18, 2009, pp. 237-242.
[51] N. Pham, H. Yu and B. M. Wilamowski, "Neural Network Trainer through Computer Networks," 24th IEEE International Conference on Advanced Information Networking and Applications, AINA 2010, Perth, Australia, April 20-23, 2010, pp. 1203-1209.
[52] H. Yu and B. M. Wilamowski, "Fast and Efficient Training of Neural Networks," in Proc. 3rd IEEE Human System Interaction Conf. HSI 2010, Rzeszow, Poland, May 13-15, 2010, pp. 175-181.
[53] H. Yu and B. M. Wilamowski, "Neural Network Training with Second Order Algorithms," in Human-Computer Systems Interaction: Background and Applications, Springer, October 2010 (accepted).
[54] B. M. Wilamowski, "Human Factor and Computational Intelligence Limitations in Resilient Control Systems," ISRCS-10, 3rd International Symposium on Resilient Control Systems, Idaho Falls, Idaho, August 10-12, 2010, pp. 5-11.
[55] W. S. McCulloch and W. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[56] D. O. Hebb, The Organization of Behavior. John Wiley & Sons, New York, 1949.
[57] A. M. Uttley, "A Theory of the Mechanism of Learning Based on the Computation of Conditional Probabilities," Proceedings of the First International Conference on Cybernetics, Namur, Gauthier-Villars, Paris, France, 1956.
[58] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, pp. 386-408, 1959.
[59] B. Widrow and M. E. Hoff, Jr., "Adaptive Switching Circuits," IRE WESCON Convention Record, pp. 96-104, 1960.
[60] M. Minsky and S. Papert, Perceptrons. Oxford, England: M.I.T. Press, 1969.
[61] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Representations by Back-Propagating Errors," Nature, vol. 323, pp. 533-536, 1986.
[62] S. I. Gallant, "Perceptron-Based Learning Algorithms," IEEE Trans. on Neural Networks, vol. 1, no. 2, pp. 179-191, Feb. 1990.
[63] K. S. Narendra and S. Mukhopadhyay, "Associative Learning in Random Environments Using Neural Networks," IEEE Trans. on Neural Networks, vol. 2, no. 1, pp. 20-31, Jan. 1991.
[64] R. C. Frye, E. A. Rietman and C. C. Wong, "Back-Propagation Learning and Nonidealities in Analog Neural Network Hardware," IEEE Trans. on Neural Networks, vol. 2, no. 1, pp. 110-117, Jan. 1991.
[65] J. N. Hwang, J. J. Choi, S. Oh and R. J. Marks, "Query-Based Learning Applied to Partially Trained Multilayer Perceptrons," IEEE Trans. on Neural Networks, vol. 2, no. 1, pp. 131-136, Jan. 1991.
[66] P. A. Shoemaker, "A Note on Least-Squares Learning Procedures and Classification by Neural Network Models," IEEE Trans. on Neural Networks, vol. 2, no. 1, pp. 158-160, Jan. 1991.
[67] R. Batruni, "A Multilayer Neural Network with Piecewise-Linear Structure and Back-Propagation Learning," IEEE Trans. on Neural Networks, vol. 2, no. 3, pp. 395-403, March 1991.
[68] M. Riedmiller and H. Braun, "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm," Proc. International Conference on Neural Networks, San Francisco, CA, 1993, pp. 586-591.
[69] R. B. Allen and J. Alspector, "Learning of Stable States in Stochastic Asymmetric Networks," IEEE Trans. on Neural Networks, vol. 1, no. 2, pp. 233-238, Feb. 1990.
[70] T. M. Martinetz, H. J. Ritter and K. J. Schulten, "Three-Dimensional Neural Net for Learning Visuomotor Coordination of a Robot Arm," IEEE Trans. on Neural Networks, vol. 1, no. 1, pp. 131-136, Jan. 1990.
[71] Image link: http://koaboi.wordpress.com/category/uncategorized/
[72] M. E. Hohil, D. Liu and S. H. Smith, "Solving the N-bit Parity Problem Using Neural Networks," Neural Networks, vol. 12, pp. 1321-1323, 1999.
[73] B. M. Wilamowski, D. Hunter and A. Malinowski, "Solving Parity-N Problems with Feedforward Neural Networks," Proc. 2003 IEEE IJCNN, pp. 2546-2551, IEEE Press, 2003.
[74] B. M. Wilamowski, H. Yu and K. T. Chung, "Parity-N Problems as a Vehicle to Compare Efficiency of Neural Network Architectures," Industrial Electronics Handbook, vol. 5 - INTELLIGENT SYSTEMS, 2nd Edition, CRC Press, 2010, chapter 10, pp. 10-1 to 10-8.
[75] M. R. Osborne, "Fisher's Method of Scoring," Internat. Statist. Rev., vol. 86, pp. 271-286, 1992.
[76] H. Yu and B. M. Wilamowski, "Levenberg Marquardt Training," Industrial Electronics Handbook, vol. 5 - INTELLIGENT SYSTEMS, 2nd Edition, CRC Press, 2010, chapter 12, pp. 12-1 to 12-16.
[77] J. X. Peng, K. Li and G. W. Irwin, "A New Jacobian Matrix for Optimal Learning of Single-Layer Neural Networks," IEEE Trans. on Neural Networks, vol. 19, no. 1, pp. 119-129, Jan. 2008.
[78] B. M. Wilamowski and H. Yu, "Neural Network Learning without Backpropagation," IEEE Trans. on Neural Networks, vol. 21, no. 11, Nov. 2010.
[79] B. M. Wilamowski and H. Yu, "Improved Computation for Levenberg Marquardt Training," IEEE Trans. on Neural Networks, vol. 21, no. 6, pp. 930-937, June 2010.
[80] M. T. Hagan and M. B. Menhaj, "Training Feedforward Networks with the Marquardt Algorithm," IEEE Trans. on Neural Networks, vol. 5, no. 6, pp. 989-993, Nov. 1994.
[81] R. Hecht-Nielsen, "Theory of the Back Propagation Neural Network," Proc. 1989 IEEE IJCNN, pp. 1593-1605, IEEE Press, New York, 1989.
[82] H. B. Demuth and M. Beale, Neural Network Toolbox: For Use with MATLAB, The MathWorks, Natick, MA, USA, 2000.
[83] L. J. Cao, S. S. Keerthi, C. J. Ong, J. Q. Zhang, U. Periyathamby, X. J. Fu and H. P. Lee, "Parallel Sequential Minimal Optimization for the Training of Support Vector Machines," IEEE Trans. on Neural Networks, vol. 17, no. 4, pp. 1039-1049, April 2006.
[84] D. C. Lay, Linear Algebra and Its Applications, 3rd edition, Addison-Wesley, p. 124, July 2005.
[85] S. Wan and L. E. Banta, "Parameter Incremental Learning Algorithm for Neural Networks," IEEE Trans. on Neural Networks, vol. 17, no. 6, pp. 1424-1438, June 2006.
[86] S. E. Fahlman and C. Lebiere, "The Cascade-Correlation Learning Architecture," in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, 1990.
[87] B. K. Bose, "Neural Network Applications in Power Electronics and Motor Drives - An Introduction and Perspective," IEEE Trans. on Industrial Electronics, vol. 54, no. 1, pp. 14-33, Feb. 2007.
[88] J. A. Farrell and M. M. Polycarpou, "Adaptive Approximation Based Control: Unifying Neural, Fuzzy and Traditional Adaptive Approximation Approaches [Book Review]," IEEE Trans. on Neural Networks, vol. 19, no. 4, pp. 731-732, April 2008.
[89] W. Qiao, R. G. Harley and G. K. Venayagamoorthy, "Fault-Tolerant Indirect Adaptive Neurocontrol for a Static Synchronous Series Compensator in a Power Network With Missing Sensor Measurements," IEEE Trans. on Neural Networks, vol. 19, no. 7, pp. 1179-1195, July 2008.
[90] J. Y. Goulermas, A. H. Findlow, C. J. Nester, P. Liatsis, X.-J. Zeng, L. P. J. Kenney, P. Tresadern, S. B. Thies and D. Howard, "An Instance-Based Algorithm With Auxiliary Similarity Information for the Estimation of Gait Kinematics From Wearable Sensors," IEEE Trans. on Neural Networks, vol. 19, no. 9, pp. 1574-1582, Sep. 2008.
[91] N. J. Cotton and B. M. Wilamowski, "Compensation of Nonlinearities Using Neural Networks Implemented on Inexpensive Microcontrollers," IEEE Trans. on Industrial Electronics, vol. 58, no. 3, pp. 733-740, March 2011.
[92] N. J. Cotton and B. M. Wilamowski, "Compensation of Sensors Nonlinearity with Neural Networks," 24th IEEE International Conference on Advanced Information Networking and Applications, AINA 2010, pp. 1210-1217, April 2010.
[93] B. M. Wilamowski, "Challenges in Applications of Computational Intelligence in Industrial Electronics," keynote, ISIE 2010, IEEE International Symposium on Industrial Electronics, Bari, Italy, July 5-7, 2010, pp. 15-22.
[94] F. H. C. Tivive and A. Bouzerdoum, "Efficient Training Algorithms for a Class of Shunting Inhibitory Convolutional Neural Networks," IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 541-556, March 2005.
[95] V. V. Phansalkar and P. S. Sastry, "Analysis of the Back-Propagation Algorithm with Momentum," IEEE Trans. on Neural Networks, vol. 5, no. 3, pp. 505-506, March 1994.
[96] B. M. Wilamowski, N. Cotton, J. Hewlett and O. Kaynak, "Neural Network Trainer with Second Order Learning Algorithms," Proc. International Conference on Intelligent Engineering Systems, June 29 - July 1, 2007, pp. 127-132.
[97] B. M. Wilamowski, "Modified EBP Algorithm with Instant Training of the Hidden Layer," Proc. 23rd International Conference on Industrial Electronics, Control, and Instrumentation, vol. 3, pp. 1098-1101, 1997.
[98] H. Yu, T. T. Xie, S. Paszczynski and B. M. Wilamowski, "Advantages of Radial Basis Function Networks for Dynamic System Design," IEEE Trans. on Industrial Electronics (accepted).
[99] H. Yu, T. T. Xie, M. Hamilton and B. M. Wilamowski, "Comparison of Different Neural Network Architectures for Digit Image Recognition," in Proc. 3rd IEEE Human System Interaction Conf. HSI 2011, Yokohama, Japan, May 19-21, 2011, pp. 98-103.
[100] T. T. Xie, H. Yu and B. M. Wilamowski, "Comparison of Traditional Neural Networks and Radial Basis Function Networks," in Proc. 20th IEEE International Symposium on Industrial Electronics, ISIE 2011, Gdansk, Poland, June 27-30, 2011 (accepted).
[101] H. Yu, T. T. Xie and B. M. Wilamowski, "Error Correction - A Robust Learning Algorithm for Designing Compact Radial Basis Function Networks," IEEE Trans. on Neural Networks (major revision).
[102] T. T. Xie, H. Yu, B. M. Wilamowski and J. Hewlett, "Fast and Efficient Second Order Method for Training Radial Basis Function Networks," IEEE Trans. on Neural Networks (major revision).