Novel Approaches to Creating Robust Globally Convergent Algorithms for Numerical Optimization

by

Joel David Hewlett

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 14, 2010

Approved by:

Bogdan M. Wilamowski, Chair, Associate Professor of Electrical Engineering
Thaddeus Roppel, Associate Professor of Electrical Engineering
Robert Dean, Assistant Professor of Electrical Engineering
Vitaly Vodyanoy, Professor of Veterinary Medicine

Abstract

Two optimization algorithms are presented, each of which seeks to effectively combine the desirable characteristics of gradient descent and evolutionary computation into a single robust algorithm. The first method, termed Quasi-Gradient Directed Migration (QGDM), is based on a quasi-gradient approach which utilizes the directed migration of a bounded set of randomly distributed points. The algorithm progresses by iteratively updating the center location and radius of the given population. The second method, while similar in spirit, takes this concept one step further by using a "variable scale gradient approximation" (VSGA), which allows it to recognize surface behavior on both the large and small scale. By adjusting the population radius between iterations, the algorithm is able to escape local minima by shifting its focus onto global trends rather than local behavior. Both algorithms are compared experimentally with existing methods and are shown to be competitive, if not superior, in each of the tested cases.

Acknowledgments

I would like to thank my parents, Bud and Patsy Hewlett. Without their love and support, I would have likely never begun this endeavor. I would also like to express my deepest gratitude to my advisor, Dr. Wilamowski. His guidance and teaching have been an invaluable resource in my academic development, without which I would have likely never finished.

Table of Contents

Abstract
List of Figures
1 Introduction
  1.1 Thesis Outline
2 Numerical Optimization
  2.1 Mathematical Formulation of Optimization Problems
  2.2 The Anatomy of an Algorithm
    2.2.1 Common Search Directions
    2.2.2 Choosing a Step Length
3 Quasi-gradient Optimization Using Directed Migration
  3.1 Proposed Method
    3.1.1 Detailed Description
  3.2 Possible Modifications
    3.2.1 Conically Shaped Regions
    3.2.2 A Modified Population Distribution
  3.3 Experimental Results
4 Variable Scale Gradient Approximation
  4.1 Proposed Method
    4.1.1 Approximating the gradient
    4.1.2 A modified LM update rule
    4.1.3 Assembling the population
    4.1.4 The selection process
    4.1.5 Adjustment of the population radius
  4.2 Test Functions
    4.2.1 Test Function 1
    4.2.2 Test Function 2
    4.2.3 Test Function 3
    4.2.4 Test Function 4
  4.3 Experimental Results
5 Conclusion
Bibliography
Appendices
A MATLAB Implementation of the QGDM Algorithm
  A.1 QGDM.m
B MATLAB Implementation of the VSGA Algorithm
  B.1 VSGA.m
  B.2 popgrad.m

List of Figures

2.1 Comparison of the trust-region and Newton steps
3.1 Four iterations in two dimensions
3.2 Search region shapes as a function of β
3.3 A plot of 1000 random points using (3.3)
3.4 EBP using α = 0.5
3.5 QGDM algorithm using modified distribution with k = 0.1, p = 5
3.6 QGDM algorithm using modified distribution with k = 0.1, p = 10
3.7 QGDM algorithm using modified distribution with k = 0.1, p = 20
3.8 QGDM algorithm using normal distribution with k = 0.1, p = 5
3.9 QGDM algorithm using normal distribution with k = 0.1, p = 10
3.10 QGDM algorithm using normal distribution with k = 0.1, p = 20
4.1 A simple illustration of the basic principle behind VSGA
4.2 Flowchart for the proposed method
4.3 Test Function 2
4.4 Test Function 3
4.5 Test Function 4
4.6 A graphical summary of the test results

Chapter 1

Introduction

The use of optimization algorithms for engineering applications has become quite common in recent years. In fact, the impact of such methods can be seen in nearly every branch of the discipline, and it is not hard to understand why. In a field driven by progress and innovation, optimization algorithms offer a powerful means to that end. While the analytical approach is generally preferred, the fact remains that for many classes of complex problems, the mathematical tools needed to solve them analytically are yet to be developed. In the meantime, optimization has become an attractive alternative which effectively spans the gap between theory and innovation, making it a staple in the modern practice of engineering.

As of late, a rift has begun to develop within the field of optimization, giving rise to two very distinct schools of thought. The net effect of this split is that the large majority of optimization algorithms now come in one of two flavors, generally referred to as gradient descent and evolutionary computation. To put it plainly, a method is either fast, or it is powerful.
Although this distinction is not necessarily valid for every case, it is a widely accepted generalization nonetheless. Each method has its own niche, and comparison of any two is highly subjective on the basis of application. Still, the use of optimization algorithms continues to see rapid growth in a number of diverse fields ranging from robotics [6] and computational intelligence [7][8][9][10] to wireless transmission [11][12] and digital filter design [13][14]. With this continued increase in interest and application, the demand for a newer and more versatile approach has become evident.

Still, despite the abundance of literature and attention garnered by the field in recent years, the discipline appears to have reached an impasse. This is due in large part to the fact that while a number of algorithms have been devised, the middle ground remains largely unexplored. That is to say that the polarization of the field with respect to gradient and evolutionary methods has resulted in very little effort spent in combining the two. While some attempts have been made at bringing the two sides together [15][16][17], most have experienced rather limited success. It is the purpose of this work to help further this cause, with the ultimate goal being a single robust algorithm which encompasses the strengths of both methods, making itself useful over a wider range of problems.

1.1 Thesis Outline

An overview of numerical optimization is offered in Chapter 2. This includes a light review of some relevant concepts and terminology, as well as a brief look at the workings of some commonly used algorithms. Two novel algorithms are introduced in Chapters 3 and 4, which comprise the body of the work. The first algorithm, presented in Chapter 3, seeks to combine some of the strengths of the gradient and evolutionary approaches in a simple but effective manner. The details of the algorithm are presented along with some experimental data. A comparison of the algorithm's performance with one of the most popular methods in current use is also included. In Chapter 4, the details of the second algorithm are discussed. Though similar in principle to the first, the second algorithm offers an even more powerful and robust alternative. The algorithm is compared with a number of competing methods, and is shown to have equal if not superior performance over a varied assortment of test cases. Finally, the thesis concludes with some closing thoughts in Chapter 5.

Chapter 2

Numerical Optimization

Engineering is the practice of analyzing, designing, and manipulating physical systems. This is achieved by way of detailed mathematical models, which are used to approximate the behavior of their physical counterparts to the most accurate degree possible. In most cases these models can be obtained analytically; however, occasionally there arise situations in which certain attributes of a physical system are hidden and cannot be directly measured or observed. Consequently, the resulting lack of necessary information prohibits the use of a purely analytical solution. Instead, the modeling of such a system requires a more heuristic approach, which is the realm of numerical optimization.

In the most general sense, numerical optimization is the process of manipulating the variables of a given function in order to achieve an optimal desired output. The ultimate goal is to obtain a unique set of values which produce the best possible output with respect to the predetermined ideal.
In the case of modeling, it is the errors, or residuals, between the measured outputs of the physical system and its corresponding model that are of interest. Here, the objective is to minimize the mean square value of the error, which is a function of the model coefficients. Doing so yields a unique set of coefficients which offer the most accurate representation of the model's physical counterpart. Thus numerical optimization offers a powerful heuristic alternative for acquiring the desired set of coefficients, even when an analytical solution is unattainable.

It is important to note that similar heuristic methods are used to solve problems in nearly every discipline imaginable. Therefore, while system identification and modeling may represent the predominant applications of optimization within the field of engineering, similar processes are utilized in nearly every facet of applied science. In fact, the average person unknowingly employs heuristic problem solving methods on an almost day-to-day basis; a fact which emphasizes the power of such methods for overcoming situations in which detailed knowledge of a specific system may be lacking.

A familiar example of this is the 3-band equalizer on a home stereo system. The system has 3 inputs, corresponding to the sliders or knobs which control the individual bands, and 1 output in the form of the sound being produced by the speakers. What is interesting about this example is that despite having no practical understanding of the internal workings of the system, the average person is nonetheless capable of achieving near optimal sound quality by ear alone. What may appear to be only a simple attempt at achieving the most faithful reproduction of one's favorite record is, in reality, the heuristic process of solving a 3-dimensional optimization problem. And it is this same powerful ability to essentially do more with less that makes the study of optimization so attractive.

Although optimization algorithms are, by definition, processes of trial and error, they should not be confused with simple brute-force methods which employ primitive techniques such as blind or random searches. On the contrary, the study and practice of numerical optimization rests quite heavily on a thorough and rigorous mathematical foundation. Naturally, to cover the subject in its entirety would far exceed the practical scope of this thesis. Thus this chapter is intended only as a supplement to the material presented later, and is by no means a comprehensive introduction to the subject of optimization. In keeping with this objective, only the most fundamental concepts and terminology are covered. However, should a more comprehensive treatment be desired, the reader is directed to [24], which represents an excellent technical reference on the subject of optimization.

2.1 Mathematical Formulation of Optimization Problems

Mathematically speaking, optimization is the process of searching for an extremum $x^*$ of some function $f(x)$. Ideally, given some initial value $x_0$, the algorithm will converge to the function's global minimizer or maximizer, depending on the desired case (for simplicity's sake, the term optimization will be considered synonymous with minimization¹ from here on). The general mathematical formulation for an optimization problem can be expressed as
\[ \min_{x} f(x) \tag{2.1} \]
where $x \in \mathbb{R}^n$ with $n \ge 1$. The function $f : \mathbb{R}^n \to \mathbb{R}$ is commonly known as the objective function, or simply the objective.

¹ Note that maximizing a function $f$ is equivalent to minimizing $-f$.
2.2 The Anatomy of an Algorithm

An optimization algorithm, as the name implies, is an iterative process which seeks to minimize some objective $f(x)$. At each iteration $k$, the goal is to generate an iterate $x_{k+1}$ such that $f(x_{k+1}) < f(x_k)$, with the expectation that as $k$ increases, $x_k \to x^*$. While there are numerous methods by which these iterates may be generated, most numerical methods can be expressed in the form
\[ x_{k+1} = x_k + \delta_k \tag{2.2} \]
where $\delta_k \in \mathbb{R}^n$ is known as the step. Looking at (2.2), it is clear that $\delta_k$ is responsible for determining how the algorithm progresses from one iterate to the next, which ultimately defines the algorithm itself. Thus it is the way in which this step is generated that separates one algorithm from another.

The step is nothing more than a vector extending from one iterate to the next, which can be uniquely expressed by its length and direction. Thus, in a sense, the generation of the step can be seen as the twofold process of a) determining the best direction in which to proceed, and b) deciding how far to travel in the direction chosen. In general, so long as $\delta_k$ faces "downhill," and the distance traveled is sufficiently short, there are an infinite number of possible combinations which will result in convergence. However, the rate of convergence depends quite heavily on the length and direction of the step. Thus convergence alone is not sufficient. A well designed algorithm should not only guarantee² local convergence, but it should do so in as efficient and effective a manner as possible. Therefore, the choice of search direction and step length are of utmost importance to an algorithm's success.

² This guarantee does not necessarily apply to cases in which the objective function is non-smooth, and therefore not continuously differentiable.

2.2.1 Common Search Directions

The most obvious of all search directions is the gradient direction $-\nabla f_k$, also known as the direction of steepest descent. As the name implies, $-\nabla f_k$ represents the direction in which the objective function $f(x)$ decreases most rapidly, making it a natural choice. Though sometimes referred to as the gradient direction, the steepest descent direction actually points in the opposite direction of the true gradient, which corresponds to the direction of greatest increase. The true gradient is, by definition, the vector of all first partial derivatives of the objective function, or more formally,
\[ \nabla f(x) \triangleq \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}, \quad \text{where } x = (x_1, x_2, \ldots, x_n) \tag{2.3} \]
In optimization, where the goal is to find the point at which a given function yields the lowest possible value, knowing the direction in which that function changes most rapidly is of obvious value, making the gradient a particularly useful tool.
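To make the generic update (2.2) concrete, the following MATLAB fragment (an illustrative sketch only; the objective, step length, and iteration count are assumed values, not taken from this thesis) performs steepest descent with the fixed step $\delta_k = -\alpha \nabla f_k$:

    % Steepest descent on an assumed quadratic objective, using the generic
    % update x_{k+1} = x_k + delta_k of (2.2) with delta_k = -alpha*grad(f).
    f     = @(x) 0.5*(x(1)^2 + 10*x(2)^2);   % example objective (assumed)
    gradf = @(x) [x(1); 10*x(2)];            % its analytic gradient
    x     = [4; -2];                         % initial iterate x_0
    alpha = 0.05;                            % fixed step-length parameter
    for k = 1:200
        x = x - alpha*gradf(x);              % step along -grad(f_k)
    end
    % x now lies near the minimizer at the origin

Note how slowly the iterates crawl along the poorly scaled axis; this asymptotic behavior of steepest descent is revisited in Chapter 3.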
Another commonly used search direction is the Newton direction. Though it is more difficult to compute than the gradient, it provides significantly more information about the local behavior of the objective, which in turn yields a much higher rate of convergence. This is achieved through the use of a second-order model $m_k(\delta_k)$ of the objective function, derived from its Taylor series approximation
\[ f(x_k + \delta_k) \approx f_k + \delta_k^T \nabla f_k + \tfrac{1}{2}\, \delta_k^T \nabla^2 f_k\, \delta_k \triangleq m_k(\delta_k) \tag{2.4} \]
From this, the Newton direction is defined as the vector $\delta_k$ which minimizes the quadratic model $m_k(\delta_k)$. Assuming for now that $\nabla^2 f_k$ is positive definite, it is possible to solve for $\delta_k$ by setting the derivative of (2.4) equal to zero. Doing so yields the following explicit formula for the Newton direction:
\[ \delta_k = -(\nabla^2 f_k)^{-1} \nabla f_k \tag{2.5} \]
Whereas the steepest descent direction revolves around the computation of the gradient, the Newton direction relies primarily on the calculation of the Hessian $\nabla^2 f_k$, which is the matrix containing all second partial derivatives of the objective function. That is,
\[ \nabla^2 f(x) \triangleq \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \quad \text{where } x = (x_1, x_2, \ldots, x_n) \tag{2.6} \]
As alluded to earlier, the computation of the Hessian can be a rather expensive operation, which is one of the Newton method's major drawbacks. Still, the benefits of this approach are typically considered to outweigh the costs.

In addition, a number of strategies have been devised which serve to reduce the computational requirement by replacing the true Hessian matrix with a close approximation that does not need to be fully reevaluated at each iteration. Search directions which operate in this way are commonly referred to as quasi-Newton methods. Some of the more common implementations include the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, and the closely related Davidon-Fletcher-Powell (DFP) algorithm.

Another attractive attribute of many quasi-Newton methods is that they can be reformulated to operate on the inverse of the approximated Hessian instead of on the approximation itself, which alleviates the burden of performing costly matrix inversions when solving (2.5). This ability can prove especially valuable on problems with high dimensionality.
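As a brief illustration (not from the thesis), the sketch below computes a Newton step via (2.5) for the same assumed quadratic used above; because the Hessian of a quadratic is constant, a single step lands on the minimizer:

    % One Newton step, delta_k = -(hessian_k)^(-1)*grad(f_k), eq. (2.5).
    gradf = @(x) [x(1); 10*x(2)];   % gradient of 0.5*(x1^2 + 10*x2^2)
    H     = [1 0; 0 10];            % Hessian (2.6), positive definite
    x0    = [4; -2];
    delta = -H \ gradf(x0);         % solve H*delta = -grad(f), avoiding inv(H)
    x1    = x0 + delta;             % x1 = [0; 0], the exact minimizer

Solving the linear system rather than forming $(\nabla^2 f_k)^{-1}$ explicitly reflects the same cost concern that motivates the inverse-updating quasi-Newton variants mentioned above.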
2.2.2 Choosing a Step Length

While the Newton and steepest descent directions can both be shown to guarantee a reduction in the value of the objective [24], this is only true for sufficiently small step lengths. This is due to the fact that the reliability of the corresponding linear and quadratic models diminishes as a function of distance. While there are a number of ways of selecting proper step lengths, in general, most methods follow one of two basic strategies.

Line Search Strategies

The first of the two strategies is known as the line search method. In this approach, the search direction is chosen first, which yields a one-dimensional sub-problem in which the objective is treated as a function of the step length. This is equivalent to solving
\[ \alpha_k = \min_{\alpha} f(x_k + \alpha d_k), \quad \alpha > 0 \tag{2.7} \]
where $d_k$ is the search direction and $\alpha$ is a scalar which controls the length of the step. In practice, it is not practical to provide an exact solution to (2.7), and in fact, an approximate solution is generally sufficient for convergence. The most common approach is to generate a finite set of trial steps and choose the value $\alpha_k$ which yields the greatest reduction in the objective $f(x)$. The generation of the trial steps can be a rather complex process in and of itself, and is often subject to one or more sufficient decrease conditions which are used to determine if a given step produces an adequate improvement in the objective.

Trust Region Strategies

The second major strategy for handling step length is known as the trust region approach. For this approach, the step is chosen as the solution to a constrained sub-problem in which the objective function $f(x)$ is replaced by some approximate model $m_k(x)$. The resulting sub-problem is written
\[ \delta_k = \min_{\delta}\, m_k(x_k + \delta), \quad \text{where } x_k + \delta \text{ lies within the trust region.} \tag{2.8} \]
Usually, the trust region is spherical in shape, which requires the chosen step to satisfy the constraint $\|\delta\| \le r_k$, where $r_k$ is the radius of the constraining region. If the solution to the constrained sub-problem does not produce sufficient improvement in the objective, the trust region is adjusted accordingly and the process is repeated.

Typically, a quadratic model is chosen for $m_k(x)$, making most trust region strategies Newton or quasi-Newton in nature. Therefore, in most cases, $m_k$ is of the form
\[ m_k(x_k + \delta) = f_k + \delta^T \nabla f_k + \tfrac{1}{2}\, \delta^T B_k\, \delta \tag{2.9} \]
where $B_k$ is either the true Hessian or some approximation to it.

Combining (2.8) and (2.9), while also introducing the constraint imposed by the spherical trust region, results in the following reformulation of the trust region sub-problem:
\[ \delta_k = \min_{\delta}\; f_k + \delta^T \nabla f_k + \tfrac{1}{2}\, \delta^T B_k\, \delta, \quad \text{s.t. } \|\delta\| \le r_k \tag{2.10} \]
It may be shown [24] that the solution $\delta_k^*$ to the constrained sub-problem in (2.10) must satisfy
\[ (B_k + \lambda I)\, \delta_k^* = -\nabla f_k \tag{2.11} \]
for some scalar $\lambda \ge 0$. This characteristic is particularly useful since it implies
\[ \delta_k^*(\lambda) = -(B_k + \lambda I)^{-1} \nabla f_k \tag{2.12} \]
which reduces to a one-dimensional function of $\lambda$. As with $\alpha$ for line search methods, it is not practical to seek an exact value for $\lambda$. Instead, an approximate value of $\lambda$ is obtained in much the same way as the step length $\alpha$ is chosen in a line search.

[Figure 2.1: Comparison of the trust-region and Newton steps, showing the contours of f(x) and of m_k, the minimizer x*, the trust region of radius r_k, the constrained step, and the full Newton step.]

An interesting characteristic of the trust region approach is the fact that the choice of the trust region radius affects not only the step length, but the direction of the search as well. An example of this can be seen in Fig. 2.1. Here, one vector represents the full Newton step, whereas $\delta_k^*$ represents the constrained trust-region step. Notice that as the radius of the trust region decreases, the direction of the step converges toward the gradient direction. Therefore, the trust-region step is essentially a linear combination of the gradient and Newton directions.
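The damped system (2.11)-(2.12) suggests a simple way to compute a trust-region step in practice. The following hedged sketch (illustrative values only; real implementations update λ far more carefully [24]) increases λ until the step fits within the radius:

    % One trust-region step via (2.12): increase lambda until ||delta|| <= r_k.
    g = [3; -1];                        % gradient at the current iterate (assumed)
    B = [2 0; 0 0.5];                   % Hessian or approximation B_k (assumed)
    r = 0.5;                            % trust-region radius r_k
    lambda = 0;
    delta  = -(B + lambda*eye(2)) \ g;  % lambda = 0 gives the full Newton-like step
    while norm(delta) > r
        lambda = max(2*lambda, 1e-4);   % crude geometric increase (one of many rules)
        delta  = -(B + lambda*eye(2)) \ g;
    end

As λ grows, $(B_k + \lambda I)^{-1}$ approaches a scaled identity and the step turns toward $-\nabla f_k$, which is precisely the blending of the Newton and gradient directions visible in Fig. 2.1.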
Chapter 3

Quasi-gradient Optimization Using Directed Migration

Gradient based search methods are known to be very efficient, especially for cases in which the surface of the solution space is relatively smooth, with few local minima. Unfortunately, in cases where the search space lacks this relative uniformity, gradient methods become easily trapped in the local minima commonly found on rough or complex surfaces. Methods of evolutionary computation (including genetic algorithms), on the other hand, are much more effective for traversing these more complex surfaces, and are inherently better suited for avoiding local minima. However, like the gradient based approach, evolutionary computation has its own significant weakness. This stems from the fact that despite its reliability, solutions are often not optimal. Furthermore, both methods are known to converge very slowly [2][3][5].

The objective behind this chapter is to take advantage of both methods by combining the desirable characteristics of each. Unlike standard evolutionary computation, populations are generated using the gradient, which is not directly calculated but is instead extracted from the properties of the existing population. Several similar approaches have been undertaken along this path [1][16][17][2][3][4], but the method which is proposed here has less computational complexity and is more suitable for on-line hardware implementation. Simple computations are repeated with every iteration, and the gradient is updated simply by subtracting the coordinates of the best solutions from the current and previous populations. The steepest descent method, which is the most commonly used gradient method, tends to approach the solution asymptotically, which results in a much slower rate of convergence. By comparison, the proposed method converges much more rapidly.

3.1 Proposed Method

The proposed method is a hybrid algorithm which offers both the relative speed of gradient descent and the methodical power of evolutionary computation. Like the latter, this hybrid algorithm relies on a randomly generated population of initial points. It also shares the advantage of being an exclusively feed-forward process. What separates this approach from standard methods of evolutionary computation is the way in which the successive populations are generated. This is where the proposed method borrows more from the gradient based approach. The hybrid approach relies on the calculation of a "rough" gradient using the individual errors associated with a given population. The minimum of these errors is determined and an entirely new population is generated about the corresponding point. This offers a more directional approach to the generation of successive populations. Thus this approach can be viewed as a sort of migration rather than recombination, for which it is termed Quasi-Gradient Directed Migration (QGDM). The result is an algorithm which converges much more rapidly than the combinational approach commonly associated with genetic algorithms, while at the same time reducing the risks presented by local minima.

3.1.1 Detailed Description

The QGDM method is best represented as a four step process. Although the algorithm technically involves only three steps per iteration, an additional step is required to begin the process.

Step 1: Choosing a Starting Point and Radius - Let $f$ be a function defined in $\mathbb{R}^n$. Choose an initial center point $c$ and radius $r$. An initial minimum $m$ must also be selected. For the first iteration, let $m = c$.

Step 2: Generating a Random Population - With $n$ being the number of points per population, define a set of vectors $V = \{v_1, v_2, \ldots, v_n\}$ such that
\[ v_i^k = r_k X Y + c_k \quad \text{for } i = 1, \ldots, n \tag{3.1} \]
where $X$ is a random number ranging from 0 to 1, $Y$ is a normalized random vector, and $k$ is the current iteration. Using (3.1) creates a set of random vectors whose members are centered about $c_k$ with magnitudes ranging from 0 to $r_k$. Therefore the $v_i^k$ represent positional vectors of the points $p_i^k$ which lie within the defined region.

Step 3: Approximation of the Gradient - Evaluate $f(p_i^k)$ for $i = 1, \ldots, n$. If $\min_i f(p_i^k) > f(m_k)$, repeat step two. Otherwise, let $m_{k+1}$ be the point $p$ for which $f(p) = \min_i f(p_i^k)$. An approximate gradient vector $g_k$ can then be defined by taking the difference between $m_{k+1}$ and $m_k$:
\[ g_k = m_{k+1} - m_k \]

Step 4: Creating a New Center and Radius - The new center should be shifted slightly so that the following population will lie in the general direction of the approximated gradient. Using
\[ c_{k+1} = \alpha g_k + m_k \quad \text{for } \alpha \ge 1 \]
allows the size of the shift to be controlled by $\alpha$. If no shift is desired, $\alpha$ can be set to 1, in which case $c_{k+1} = m_{k+1}$. In order to ensure that $m_{k+1}$ lies within the new region, it is necessary that $r_{k+1} \ge \|c_{k+1} - m_{k+1}\|$. This can be set using
\[ r_{k+1} = \beta \|c_{k+1} - m_{k+1}\| \quad \text{for } \beta \ge 1 \]
Once $r_{k+1}$ and $c_{k+1}$ are determined, steps two through four are repeated until $f(m)$ is within the desired tolerance.
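A minimal MATLAB sketch of one QGDM iteration (Steps 2 through 4) is given below. The complete implementation appears in Appendix A; the fragment here is illustrative only, with an assumed objective and assumed parameter values:

    % One QGDM iteration (Steps 2-4) on an assumed objective; see Appendix A.
    f = @(x) sum(x.^2);                 % example objective (assumed)
    c = [3; 4]; r = 1; m = c;           % center, radius, and minimum from Step 1
    alpha = 2; beta = 1.5; n = 10;      % shift, radius, and population parameters
    % Step 2: random population within the sphere of radius r about c, eq. (3.1)
    P = zeros(2, n);
    for i = 1:n
        Y = randn(2, 1); Y = Y/norm(Y); % normalized random direction
        P(:, i) = r*rand*Y + c;         % v_i = r*X*Y + c
    end
    % Step 3: quasi-gradient from the best population member
    fp = arrayfun(@(i) f(P(:, i)), 1:n);
    [fbest, ibest] = min(fp);
    if fbest < f(m)                     % otherwise Step 2 would be repeated
        m_new = P(:, ibest);
        g = m_new - m;                  % g_k = m_{k+1} - m_k
        % Step 4: shift the center along g and set the new radius
        c = alpha*g + m;                % c_{k+1} = alpha*g_k + m_k
        r = beta*norm(c - m_new);       % r_{k+1} = beta*||c_{k+1} - m_{k+1}||
        m = m_new;
    end

Repeating this block until f(m) falls below the desired tolerance yields the full algorithm.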
The two-dimensional example case in Fig. 3.1 illustrates the process through four iterations. For this particular example, $\alpha = 2$ and $\beta = 3/2$. This can be clearly seen by the way $r_k$, represented by the large circles, perfectly bisects $g_{k-1}$ for each iteration. Also, notice that for the first iteration, the gradient vector $g$ extends from $c$ to $m$, whereas in all subsequent iterations $g$ is defined from $m_k$ to $m_{k+1}$.

[Figure 3.1: Four iterations in two dimensions, showing the centers c1 through c4 and the minima m1 through m4.]

3.2 Possible Modifications

Spherically shaped regions may not be optimal when generating successive populations. Using conical regions which extend in the direction of the approximate gradient might greatly increase the speed of convergence. The following is just one of many similar modifications.

3.2.1 Conically Shaped Regions

It is desired that the vector $v$ be used as the axis of symmetry. This is best done by switching to polar coordinates, which only require the zenith angle $\phi$ and radial component $\rho$ in order to define the region's orientation. This is an especially useful way of describing a hyperconical region since it is valid for any subspace of $\mathbb{R}^n$. The drawback of this method is that it requires that the axis of symmetry lie along the zenith axis. In order to extend this method to cases using arbitrary axes of symmetry, it is necessary to change to a basis $U$ whose zenith axis is in the direction of $v$. The region can then be defined using the new set of coordinate axes. With this, representing a point or vector within the defined region in terms of the standard basis $E$ can be done quite easily by changing bases from $U$ back to $E$ using a simple transition matrix.

Defining a new basis

The process described below first requires the formation of a new basis using the n-dimensional axial vector $v$. To do this, an additional $n - 1$ linearly independent vectors are required. The first step is to devise a generalized method for generating this set of vectors.

1. Generating an ordinary basis

In order to generate an ordinary basis, it is first necessary to acquire the proper tools for determining linear independence. The following three theorems are particularly useful.

Theorem 3.1 Let $x_1, x_2, \ldots, x_n$ be $n$ vectors in $\mathbb{R}^n$ and let $x_i = (x_{1i}, x_{2i}, \ldots, x_{ni})^T$ for $i = 1, \ldots, n$. If $X = (x_1, x_2, \ldots, x_n)$, then the vectors $x_1, x_2, \ldots, x_n$ will be linearly independent if and only if $X$ is nonsingular.

Theorem 3.2 An $n \times n$ matrix $A$ is singular if and only if $\det(A) = 0$.

Theorem 3.3 Let $A$ be an $n \times n$ matrix. (i) If $A$ has a row or column consisting entirely of zeros, then $\det(A) = 0$. (ii) If $A$ has two identical rows or two identical columns, then $\det(A) = 0$.

Combining these three theorems yields a particularly useful result.

Corollary 3.4 Let $A$ be an $n \times n$ matrix. The column space of $A$ forms an ordinary basis for $\mathbb{R}^n$ only if (i) $A$ contains no rows or columns consisting entirely of zeros, and (ii) no two rows or columns of $A$ are identical.

The conditions from Corollary 3.4 can nearly always be satisfied by taking the standard basis $E$ and simply replacing $e_1$ with $v$. The only case in which this will not result in an ordinary basis of $\mathbb{R}^n$ is when $v_1 = 0$, which can be easily remedied by replacing $e_{i1}$ with $(v_1 + 1)$ for $i = 2, \ldots, n$. This method will work for any $v$ except $0$. To summarize,

(a) Let $E = \{e_1, e_2, \ldots, e_n\}$. Replace $e_1$ with $v$.
(b) Replace $e_{i1}$ with $(v_1 + 1)$ for $i = 2, \ldots, n$.

The resulting set of vectors $S$ is an ordinary basis for $\mathbb{R}^n$ with $s_1 = v$. Although this method does produce a basis for $\mathbb{R}^n$ relative to $v$, it still lacks one essential characteristic.
In order for the vector lengths and angles of separation to be preserved when changing bases, $S$ must be orthogonal. Clearly, $S$ is not an orthogonal basis; however, this can be remedied using the Gram-Schmidt Orthogonalization Process.

2. The Gram-Schmidt Process

The Gram-Schmidt Process is a method for orthogonalizing a set of vectors in an inner product space, most commonly the Euclidean space $\mathbb{R}^n$. The Gram-Schmidt process takes a finite, linearly independent set $V = \{v_1, \ldots, v_n\}$ and generates an orthogonal set $V' = \{u_1, \ldots, u_n\}$ that spans the same subspace as $V$.

Theorem 3.5 (The Gram-Schmidt Process) Let $\{x_1, x_2, \ldots, x_n\}$ be a basis for the inner product space $V$. Let
\[ u_1 = \frac{1}{\|x_1\|}\, x_1 \]
and define $u_2, \ldots, u_n$ recursively by
\[ u_{k+1} = \frac{1}{\|x_{k+1} - p_k\|}\,(x_{k+1} - p_k) \quad \text{for } k = 1, \ldots, n-1 \]
where
\[ p_k = \langle x_{k+1}, u_1 \rangle u_1 + \langle x_{k+1}, u_2 \rangle u_2 + \cdots + \langle x_{k+1}, u_k \rangle u_k \]
is the projection of $x_{k+1}$ onto $\mathrm{Span}(u_1, u_2, \ldots, u_k)$. The set $\{u_1, u_2, \ldots, u_n\}$ is an orthonormal basis for $V$.

By applying Theorem 3.5 to the ordinary basis $S$, the orthogonal basis $S'$ is obtained, which spans the same subspace as $S$, and whose first element $u_1$ shares the same axis as $s_1$, and therefore lies in the direction of the initial vector $v$. Thus the net result is an orthonormal basis $S'$ which allows for orthonormal transformations with the standard basis $E$.

Defining a hyperconically bounded region

With the basis $S'$, defining a hyperconically bounded region about $v$ is a trivial matter. Using hyperspherical coordinates, the region can be described using only the radial component $\rho$ and the zenith angle $\phi$. Any point obtained by manipulating the other angular components will lie on the boundary of this region. For a region described by a maximum radial component $\rho_{max}$ and a maximum zenith angle $\phi_{max}$, any point satisfying $0 \le \rho \le \rho_{max}$ and $0 \le \phi \le \phi_{max}$ is guaranteed to lie within the desired region. All that is required now is to perform an orthonormal transformation from $S'$ back to $E$.

Changing bases from $S'$ to $E$

Changing bases is a relatively simple matter. A transition matrix from $S'$ to $E$ can be found with relatively little effort. However, before it can be used for changing the basis of a point within the predefined region, its coordinates must be converted from a spherical to a Cartesian representation. Thus some generalized method for making this conversion is needed.

Hyperspherical to Cartesian coordinate conversion

A coordinate system has now been defined for n-dimensional Euclidean space, which is analogous to the spherical coordinate system defined for 3-dimensional Euclidean space. These coordinates consist of a radial coordinate $\rho$, and $n - 1$ angular coordinates $\phi_1, \phi_2, \ldots, \phi_{n-1}$. If the $x_k$ are the Cartesian coordinates, then
\[ x_k = \begin{cases} \rho\,\cos(\phi_1), & k = 1 \\[4pt] \rho\,\Bigl(\prod_{i=1}^{k-1}\sin(\phi_i)\Bigr)\cos(\phi_k), & k = 2, \ldots, n-1 \\[4pt] \rho\,\prod_{i=1}^{n-1}\sin(\phi_i), & k = n \end{cases} \tag{3.2} \]
This offers a generalized method by which an n-dimensional vector or point in hyperspherical coordinates may be converted into its Cartesian representation. Once the conversion has been made, the given point or vector may be freely converted from one basis to the other using nothing more than a simple transition matrix.

Creating a transition matrix from $S'$ to $E$

Theorem 3.6 (Transition matrices) Let $\{u_1, u_2, \ldots, u_n\}$ be an ordered basis for $\mathbb{R}^n$, and let $c$ be a coordinate vector with respect to $\{u_1, u_2, \ldots, u_n\}$. To find the corresponding coordinate vector $x$ with respect to $\{e_1, e_2, \ldots, e_n\}$, simply multiply the matrix $U = [u_1, u_2, \ldots, u_n]$ by $c$:
\[ x = Uc \]

Summary

A step-by-step summary of the combined process.

Step 1: Using $[v, e_2, \ldots, e_n]$, create a new basis for $\mathbb{R}^n$ by replacing $e_{i1}$ with $v_1 + 1$ for $i = 2, \ldots, n$. Call the resulting matrix $S$ and refer to its column space as $\{s_1, s_2, \ldots, s_n\}$.

Step 2: Using $S = \{s_1, s_2, \ldots, s_n\}$, let
\[ u_1 = \frac{1}{\|s_1\|}\, s_1 \]
and define $u_2, \ldots, u_n$ recursively by
\[ u_{k+1} = \frac{1}{\|s_{k+1} - p_k\|}\,(s_{k+1} - p_k) \quad \text{for } k = 1, \ldots, n-1 \]
The resulting matrix $[u_1, u_2, \ldots, u_n]$ will form an orthonormal basis for $\mathbb{R}^n$, which is denoted as $U$.

Step 3: Define a hyperconical region about $u_1$ by choosing a maximum radial component $\rho_{max}$ and a maximum zenith angle $\phi_{max}$.

Step 4: Generate vectors within the defined region by choosing $\rho$ and $\phi$ such that $0 \le \rho \le \rho_{max}$ and $0 \le \phi \le \phi_{max}$.

Step 5: Let $x$ be one of the vectors chosen in step four. Convert $x$ from hyperspherical to Cartesian coordinates using (3.2).

Step 6: Express $x$ with respect to $[e_1, e_2, \ldots, e_n]$ using $U$ as a transition matrix:
\[ x' = Ux \]

This newly generated vector $x'$ has two very important characteristics, which make it particularly useful. They are:

(i) The length of $x'$ is no greater than $\rho_{max}$.
(ii) The angle of separation between $x'$ and the original vector $v$ is no greater than $\phi_{max}$.
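The entire process can be condensed into a few lines of MATLAB. The sketch below is a hedged illustration for n = 3, with assumed values for v, ρ_max, and φ_max; it follows Steps 1 through 6 directly:

    % Generate one random vector inside a hypercone about v (Steps 1-6, n = 3).
    n = 3; v = [1; 2; -1];                 % dimension and axial vector (assumed)
    S = eye(n); S(:, 1) = v;               % Step 1: replace e_1 with v ...
    S(1, 2:n) = v(1) + 1;                  % ... and set e_i1 = v_1 + 1 for i >= 2
    U = zeros(n); U(:, 1) = S(:, 1)/norm(S(:, 1));
    for k = 1:n-1                          % Step 2: Gram-Schmidt (Theorem 3.5)
        p = U(:, 1:k)*(U(:, 1:k)'*S(:, k+1));  % projection onto span(u_1..u_k)
        w = S(:, k+1) - p;
        U(:, k+1) = w/norm(w);
    end
    rho_max = 2; phi_max = pi/6;           % Step 3: cone parameters (assumed)
    rho  = rho_max*rand;                   % Step 4: radial component
    phi1 = phi_max*rand; phi2 = 2*pi*rand; % zenith angle, then a free angle
    x = [rho*cos(phi1);                    % Step 5: eq. (3.2) for n = 3
         rho*sin(phi1)*cos(phi2);
         rho*sin(phi1)*sin(phi2)];
    xprime = U*x;                          % Step 6: transition back to basis E

Because U is orthonormal, the transformation preserves lengths and angles, so xprime satisfies properties (i) and (ii) above by construction.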
Controlling the Defined Region's Shape

It might be desirable to have $\rho$ diminish as $\phi$ increases. This can be done by expressing $\rho$ as a function of $\phi$:
\[ \rho(\phi) = \rho_{max}\left[1 - \left(\frac{\phi}{\phi_{max}}\right)^{\beta}\right] \]
The variable $\beta$ can be adjusted to control the shape of the given region. As $\beta$ becomes larger, the shape of the region becomes increasingly conical. Fig. 3.2 offers a graphical representation of how $\beta$ affects the region's shape.

[Figure 3.2: Search region shapes with ρ_max = 2 and φ_max = 30°, shown as polar plots for β = 1, 3, 10, and 100.]

3.2.2 A Modified Population Distribution

One of the dangers of using a uniform population distribution is its susceptibility to local minima. If the algorithm gets into a low-lying area which is larger than the current maximum population radius, the algorithm will be permanently stuck. In order to avoid this scenario, occasional points must be plotted outside the desired maximum radius. One effective and efficient way of doing this is to use a population density which decreases exponentially as a function of the radius. This can be achieved by replacing $r_k X$ in (3.1) with
\[ \frac{1}{X + \frac{1}{r_k}} - \frac{r_k}{r_k + 1} \tag{3.3} \]
which takes values between 0 and
\[ \frac{r_k}{1 + \frac{1}{r_k}} \]
Although the upper bound for $\|v_i^k\|$ is no longer $r_k$, for larger values of $r_k$,
\[ \frac{r_k}{1 + \frac{1}{r_k}} \approx r_k \]
Fig. 3.3 shows how (3.3) affects the resulting density.

[Figure 3.3: A plot of 1000 random points generated using (3.3).]
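As a brief illustration of the effect of (3.3), the fragment below (with an assumed value of $r_k$, in the spirit of Fig. 3.3) draws 1000 radii under the modified distribution; most samples fall near zero, with occasional points reaching almost all the way out to $r_k$:

    % Radii drawn via (3.3) instead of the uniform draw r_k*X from (3.1).
    r = 10;                            % population radius r_k (assumed)
    X = rand(1000, 1);                 % 1000 uniform draws, as in Fig. 3.3
    radii = 1./(X + 1/r) - r/(r + 1);  % eq. (3.3); values lie in [0, r/(1 + 1/r))
    histogram(radii)                   % density decays rapidly with radius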
3.3 Experimental Results

Two variations of the algorithm were used for testing. The first set of tests was performed using the standard algorithm as described in Section 3.1. The second set used the modified population distribution from Section 3.2.2. The algorithms were then tested using population sizes of 5, 10, and 20 points per iteration. Statistical data was compiled for both algorithms using 100 runs for each population size.

Due to the common practical application of optimization in the field of artificial neural networks, initial tests were made using the algorithm for training a simple two-layer network. The network was then retrained for comparison using error back-propagation (EBP), which is the most popular training algorithm in current use. Statistical data and error plots are included for each test.

All figures and statistics were taken for successful runs only, with the exception of the success rate, which was defined as the ratio of successful runs to the total number of runs. From the results, it is clear that both algorithms perform comparably to EBP, with the modified version being by far the most consistent.

[Figure 3.4: Error back-propagation using α = 0.5; total error vs. iterations for 25 runs.]

    Error back-propagation
    Population           20
    Gain                 5
    Learning constant    0.5
    Min run              138 iterations
    Max run              259 iterations
    Ave. run             152 iterations
    Standard deviation   15 iterations
    Success rate         66%

[Figure 3.5: QGDM algorithm using the modified distribution with k = 0.1, population size 5; total error vs. iterations.]

    Modified QGDM algorithm
    Population           5
    Initial radius       5
    Total runs           100
    Min run              57 iterations
    Max run              376 iterations
    Ave. run             105 iterations
    Standard deviation   47 iterations
    Success rate         82%

[Figure 3.6: QGDM algorithm using the modified distribution with k = 0.1, population size 10; total error vs. iterations.]

    Modified QGDM algorithm
    Population           10
    Initial radius       5
    Total runs           100
    Min run              21 iterations
    Max run              302 iterations
    Ave. run             67 iterations
    Standard deviation   45 iterations
    Success rate         81%

[Figure 3.7: QGDM algorithm using the modified distribution with k = 0.1, population size 20; total error vs. iterations.]

    Modified QGDM algorithm
    Population           20
    Initial radius       5
    Total runs           100
    Min run              17 iterations
    Max run              84 iterations
    Ave. run             36 iterations
    Standard deviation   11 iterations
    Success rate         81%

[Figure 3.8: Standard QGDM algorithm using β = 1.5, α = 2, population size 5; total error vs. iterations.]

    Standard QGDM algorithm
    Population           5
    Min run              17 iterations
    Max run              301 iterations
    Ave. run             82 iterations
    Success rate         55%

[Figure 3.9: Standard QGDM algorithm using β = 1.5, α = 2, population size 10; total error vs. iterations.]

    Standard QGDM algorithm
    Population           10
    Min run              11 iterations
    Max run              159 iterations
    Ave. run             34 iterations
    Success rate         66%

[Figure 3.10: Standard QGDM algorithm using β = 1.5, α = 2, population size 20; total error vs. iterations.]

    Standard QGDM algorithm
    Population           20
    Min run              9 iterations
    Max run              197 iterations
    Ave. run             29 iterations
    Success rate         70%

Chapter 4

Variable Scale Gradient Approximation

Gradient methods are often compared to a ball rolling along some surface. Neglecting momentum and assuming the ball is infinitely small, it will follow the path of steepest descent exactly. This is a powerful illustration which, consequently, also highlights the method's fundamental flaws. With little effort one can visualize any number of surfaces on which the ball might become permanently trapped before ever reaching its desired destination. Take, for example, a flight of stairs as in Fig. 4.1. Even though the behavior of the surface is relatively simple, it is clear that no matter where the ball on the left is placed, it will never find its way to the foot of the stairs. Though the staircase clearly exhibits a global trend which may be determined with only a limited amount of information, the ball is still incapable of proceeding.

The advantage of this illustration is that it makes the solution quite obvious: use a bigger ball. Though changing the size of the ball won't change the gradient of the surface, it will make the ball more sensitive to larger scale behavior, allowing it to successfully descend the stairs. After all, all of the relevant information for descending a staircase can be found at the corners of the individual steps.
While steps may be an essential part of a staircase, from the perspective of minimization they are nothing more than noise. Furthermore, by actively allowing the ball to expand and contract as it descends, it is possible to navigate on as large or small a scale as necessary. This is the basic principle by which the proposed method operates, and from which it derives its name: Variable Scale Gradient Approximation (VSGA).

[Figure 4.1: Ordinary gradient methods (left) are easily trapped on complex or piecewise constant optimization surfaces. The proposed method (right) is able to overcome these obstacles by expanding the radius of its population-based gradient approximation.]

4.1 Proposed Method

The key feature of the proposed method is the way in which the gradient is approximated. During each iteration, the algorithm generates a "population" of points within a spherically bounded region. A portion of the generated set lies on the region's perimeter, and it is these members of the population which are used in the approximation of the gradient. Next, an additional point is generated in the direction of the approximated gradient and added to the current population. Finally, the fittest member of the population is compared to the current best solution, and the population's center and radius are updated accordingly. The basic flow of the algorithm is presented in Fig. 4.2.

[Figure 4.2: A simplified flow chart depicting the basic operation of the proposed algorithm: compute gradient, apply LM update rule, assemble population, perform selection; terminate when f < f_Tol, and adjust the radius on lack of progress.]

4.1.1 Approximating the gradient

As previously noted, the gradient approximation is the defining feature of the proposed method. The generalized process for computing it is presented in a step-by-step manner. Let $x_0 \in \mathbb{R}^n$ be the center of a population $P$ with radius $r$, and let $y_0 = f(x_0)$ be the value of the objective function $f : \mathbb{R}^n \to \mathbb{R}$ at $x_0$.

1. First, generate a set of $n$ random vectors $\{\Delta x_1, \Delta x_2, \ldots, \Delta x_n\}$, each of length $r$, and define an $n \times n$ matrix
\[ \Delta X = \begin{bmatrix} \Delta x_1 \\ \vdots \\ \Delta x_n \end{bmatrix} \]

2. Now, let $x_i = \Delta x_i + x_0$, and define a vector $y = [y_1 \ldots y_n]^T$ where
\[ y_i = f(x_i) \tag{4.1} \]
for $i = 1, \ldots, n$. It should be clear from (4.1) that each value $y_i$ corresponds to the value of the objective function at a randomly generated point $x_i$, which lies on the perimeter of the population.

3. Finally, compute the gradient approximation using
\[ \nabla f = \Delta X^{-1} \cdot (y - y_0) \tag{4.2} \]

It may appear, and understandably so, that a significant leap has been made between steps 2 and 3. So for the sake of clarity, a short derivation of (4.2) is included. Let $f$ be an n-dimensional scalar function. A linear approximation of $f$ with respect to a point $x_0 = (x_{01}, x_{02}, \ldots, x_{0n})$ may be found using the first two terms of the Taylor series expansion:
\[ f(x) \approx f(x_0) + \frac{\partial f}{\partial x_1}\Delta x_1 + \frac{\partial f}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}\Delta x_n = f(x_0) + \nabla f|_{x_0} \cdot \Delta x \tag{4.3} \]
Solving this equation for the gradient yields
\[ \nabla f|_{x_0} \approx \frac{\Delta f}{\Delta x} = \frac{f(x) - f(x_0)}{x - x_0} \tag{4.4} \]
By generating a point $x$ in the vicinity of $x_0$, it is possible to obtain a numerical approximation of $\nabla f|_{x_0}$ using (4.4). However, for $n > 1$, (4.4) is an invalid expression since it implies division by $\Delta x$, which is a vector and therefore has no inverse. Thus, in order to generalize (4.4), $f$ is evaluated for a set of $n$ linearly independent points, allowing $\Delta x$ and $\Delta f$ to be replaced by the $n \times n$ matrix $\Delta X$ and the $n \times 1$ vector of corresponding function values $\Delta f$. This leads to the generalized form
\[ \nabla f_{n \times 1} \approx \begin{bmatrix} \Delta x_{11} & \cdots & \Delta x_{1n} \\ \vdots & \ddots & \vdots \\ \Delta x_{n1} & \cdots & \Delta x_{nn} \end{bmatrix}^{-1}_{n \times n} \begin{bmatrix} f(x_1) - f(x_0) \\ \vdots \\ f(x_n) - f(x_0) \end{bmatrix}_{n \times 1} = \Delta X^{-1} \cdot \Delta f \tag{4.5} \]
which is equivalent to (4.2).
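In the spirit of the popgrad.m routine listed in Appendix B (the version below is an illustrative sketch, not the appendix code), the approximation (4.2) can be computed as follows:

    % Population-based gradient approximation, eq. (4.2).
    function g = popgrad_sketch(f, x0, r)
        n  = numel(x0);
        y0 = f(x0);
        dX = zeros(n, n);              % rows hold the perturbations dx_i
        y  = zeros(n, 1);
        for i = 1:n
            d = randn(n, 1);
            d = r*d/norm(d);           % random vector of length r
            dX(i, :) = d';
            y(i) = f(x0 + d);          % objective on the population perimeter
        end
        g = dX \ (y - y0);             % solves dX*g = y - y0, i.e. eq. (4.2)
    end

For example, popgrad_sketch(@(x) sum(x.^2), [1; 2], 1e-3) returns a close approximation to the true gradient [2; 4], while larger values of r blur the estimate toward the large-scale trend of the surface.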
4.1.2 A modified LM update rule

Once the gradient is computed, an additional point $x_{n+1}$ is generated in the direction of the approximation. The proposed method uses a modified version of the Levenberg-Marquardt (LM) algorithm [19] update rule for this purpose. For a scalar valued function $f : \mathbb{R}^n \to \mathbb{R}$, the LM update rule may be written
\[ x_{n+1}^{(k+1)} = x_0^{(k)} - \left(\nabla f\, \nabla f^T + \mu I\right)^{-1} \nabla f \cdot y_0^{(k)} \tag{4.6} \]
where $k$ is the current iteration, and $\mu$ acts as a damping factor which reduces oscillation by actively controlling the step size.

One who is familiar with the LM algorithm may notice that the Jacobian has been replaced by the gradient in (4.6). This is because the Jacobian is defined as the matrix of all first-order partial derivatives of a vector valued function. Thus, for a scalar valued function, the Jacobian is, by definition, the transpose of the gradient. Although the proposed method may be extended to handle vector valued functions, the version presented here is intended for use with scalar valued problems. Thus the Jacobian is replaced by the gradient.

This is not the only modification. For the proposed implementation, the rule is also modified to account for the population radius $r$. This is done by adding a term which ensures that the length of the update step will be no smaller than $r$. This modification is needed to ensure that when the algorithm encounters local minima, the step size will exceed the radius of the current population, thus providing greater diversity. The modified version of the rule is written
\[ x_{n+1}^{(k+1)} = x_0^{(k)} - \left(\nabla f\, \nabla f^T + \mu I\right)^{-1} \nabla f \cdot y_0^{(k)} - \frac{\nabla f}{\|\nabla f\|}\, r^{(k)} \tag{4.7} \]
The added term in (4.7) is simply a vector of length $r$ in the direction of $\nabla f$.

It is important to note that the algorithm presented here, using the modified version of the LM update rule, is but one of many possible implementations. In other words, the shaded box in Fig. 4.2 may be replaced with any number of existing gradient methods without compromising the underlying principles. The only requirement is that the gradient be calculated in the described fashion. In fact, the proposed method was specifically designed with this sort of flexibility in mind. Therefore, the update rule may be readily exchanged with another second order method such as the BFGS method [20] or the closely related Davidon-Fletcher-Powell algorithm, which has been shown to outperform LM in certain applications [21].
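A hedged one-step sketch of (4.7) in MATLAB, assuming x0, y0 = f(x0), the approximate gradient g (e.g. from popgrad_sketch above), a damping factor mu, and the current radius r are already in scope:

    % Modified LM step, eq. (4.7): damped rank-one update plus a length-r push.
    n = numel(x0);
    A = g*g' + mu*eye(n);                    % damped (grad*grad') matrix of (4.6)
    x_next = x0 - A\(g*y0) - (g/norm(g))*r;  % new point x_{n+1}

The final term guarantees a component of at least length r along the descent direction, which is what lets the method step clear of minima smaller than the current population radius.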
4.1.3 Assembling the population

The population $P$ is generated in two parts whose combined size is denoted by the parameter $N$. The primary population,
\[ A = \{x_1, \ldots, x_{n+1}\}, \tag{4.8} \]
is created in the two previous steps, and is of size $n + 1$, while the secondary population,
\[ B = \{x_{n+2}, \ldots, x_N\}, \tag{4.9} \]
consists of a set of $m$ points which are randomly distributed within the region defined by $x_0$ and $r$. Here, the value of $m$ is a user defined parameter, and may be assigned any non-negative integer value including zero. This leads to the following definitions for $N$ and $P$:
\[ N = n + 1 + m, \tag{4.10} \]
\[ P = A \cup B = \{x_1, \ldots, x_N\}. \tag{4.11} \]
While the inclusion of $B$ in $P$ can provide a noticeable increase in the rate of convergence when applied to relatively complex optimization surfaces, in general, it is sufficient to let $m = 0$.

4.1.4 The selection process

Once $P$ has been constructed according to (4.11), its members are ranked with respect to fitness. Next, the fittest member $x_{opt} \in P$ is chosen, and the following conditional, also shown in Fig. 4.2, is evaluated:

    IF f(x_opt) < f_Tol THEN
        Terminate algorithm.
    ELSE IF f(x_opt) >= f(x_0) THEN
        Adjust population radius r.
    ELSE
        x_0 = x_opt.
    END IF

where $f_{Tol}$ is the maximum acceptable value for the objective. Updating in this way essentially recycles the information used in the approximation of the gradient, which would otherwise be thrown out. This process also helps to guarantee the algorithm's stability, since it is equivalent to running two methods in parallel. That is, if the generated step does not result in a lower value of the objective, the method will resort to selecting the best of the points used in the gradient approximation. In this way, it is still possible for the method to proceed. Furthermore, because the current best solution is included in the selection process, it is guaranteed that the value of the objective will not increase from one iteration to the next. Therefore, while the instability of the update rule may affect the algorithm's convergence, the algorithm itself will remain stable as a result of selection.

4.1.5 Adjustment of the population radius

Any number of radius update rules may be devised; however, the version presented here is perhaps the simplest. The method uses a fixed step size $\Delta$ to modify the radius over a bounded, user defined interval $[r_{min}, r_{max}]$, and is subject to the following conditions:

    IF r <= r_max THEN
        r^(k+1) = r^(k) + Delta.
    ELSE
        r^(k+1) = r_min.
    END IF

where $k$ is the number of the current iteration. Although this method is simple, it proves to be adequate. The two features which make the method so basic are the use of a resetting value of $r$ and the constant step size $\Delta$. The constant step size serves the purpose of increasing the population radius when the algorithm encounters local minima, while the increase in $r$ causes the gradient approximation to become less sensitive to local behavior, thereby allowing it to escape the traps introduced by complex behaviors. Conversely, if greater accuracy of approximation becomes necessary for local search, the algorithm is still able to proceed once $r$ is reset.
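Rendered in MATLAB, the rule is a two-line conditional (the variable names dr, r_min, and r_max are illustrative, not taken from the appendix code):

    % Radius update of Section 4.1.5: grow by a fixed step, then reset.
    if r <= r_max
        r = r + dr;    % expand: the approximation sees larger-scale trends
    else
        r = r_min;     % reset: return to fine-grained local search
    end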
4.2 Test Functions

Four unique functions were used for testing. Each function exhibits features deemed relevant for the purpose of comparison. All four functions are presented in generalized form, with $n$ being the dimension of the search space.

4.2.1 Test Function 1
\[ T_1(x) = \sum_{i=1}^{n} \left(\frac{x_i}{4}\right)^4, \quad x_i \in [-10, +10] \tag{4.12} \]
The first test function, $T_1$, has a simple convex, continuous, parabolic surface with a minimum of $T_1(x) = 0$ located at the origin. The function is used for two primary purposes. First, it offers a fair comparison of the proposed method with some of the more common gradient methods. Second, it highlights the strengths of such methods, which are superior to evolutionary methods when applied to problems of this type. Also, because the function is fourth order, it will highlight the difference between algorithms of higher and lower orders.

4.2.2 Test Function 2
\[ T_2(x) = \sum_{i=1}^{n} \left(\frac{\lfloor x_i \rfloor}{4}\right)^4, \quad x_i \in [-10, +10] \tag{4.13} \]
The second test function, inspired by the staircase illustration at the beginning of this chapter, is of an identical form to that of $T_1$, except that the variables have been floored in order to make it piecewise constant. $T_2$ has a minimum value of 0 for all $x = [x_1 \ldots x_n]$ which satisfy $x_i \in [0, 1)$ for $i = 1, \ldots, n$. The features of this function are useful for testing the hypothesis made at the start of the chapter: if the proposed method operates as intended, there should be little difference in performance between $T_1$ and $T_2$.

[Figure 4.3: Test Function 2 (surface plot for n = 2).]

4.2.3 Test Function 3
\[ f(x) = \frac{1}{2}\sqrt{\sum_{i=1}^{n} \lfloor x_i \rfloor^2}, \quad x_i \in [-100, +100] \]
\[ g(x) = \frac{x}{4} + (1 - \cos(\pi x))\left(\tanh\left(\frac{x}{4}\right) - 1\right)^2 \]
\[ T_3 = g \circ f \tag{4.14} \]
Like $T_2$, $T_3$ is also piecewise constant, but with the addition of local minima, which are the most common challenge faced by gradient methods. Though evolutionary methods may also become susceptible to these traps if population diversity becomes too low, they are much better suited for this type of problem. $T_3$ has a minimum value of 0 for all $x = [x_1 \ldots x_n]$ which satisfy $x_i \in [0, 1)$ for $i = 1, \ldots, n$.

[Figure 4.4: Test Function 3 (surface plot for n = 2).]

4.2.4 Test Function 4
\[ T_4(x) = \frac{1}{2}\sum_{i=1}^{n}\left(x_i^2 + \tan(x_i)^2 - 10\cos(2\pi x_i) + 10\right), \quad x_i \in [-100, +100] \tag{4.15} \]
$T_4$ is the most extreme of the test functions. Though it is piecewise continuous, each of the regions of continuity is a deep local minimum. Thus, gradient methods are only capable of minimizing $T_4$ on a local basis. Due to the large number of local minima, evolutionary methods which rely on modified mutation strength for accelerated convergence may also be susceptible to entrapment. The surface of $T_4$ is shown in Fig. 4.5. In reality, the "walls" which separate the local minima in Fig. 4.5 are infinitely high. The minimum of $T_4$ is 0, and is located at the origin.

[Figure 4.5: Test Function 4 (surface plot for n = 2).]
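Under the reconstructed forms (4.12)-(4.15) above, all four test functions can be written as one-line vectorized MATLAB anonymous functions, which is convenient when benchmarking:

    % The four test functions of Section 4.2 (x is a column vector).
    T1 = @(x) sum((x/4).^4);                    % smooth quartic bowl
    T2 = @(x) sum((floor(x)/4).^4);             % floored: piecewise constant
    g  = @(s) s/4 + (1 - cos(pi*s)).*(tanh(s/4) - 1).^2;
    T3 = @(x) g(0.5*sqrt(sum(floor(x).^2)));    % T3 = g o f, eq. (4.14)
    T4 = @(x) 0.5*sum(x.^2 + tan(x).^2 - 10*cos(2*pi*x) + 10);
    T4([0; 0])                                  % returns 0, the global minimum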
-SA-ES: Self-Adaptive Evolution Strategy [22], a member of the larger family of al- gorithms known as Evolutionary Strategies [22], is a powerful evolutionary method which uses self-adapted mutation strength for optimal convergence as well as an ac- celerated local search. The method was chosen as a strong representative of the power of evolutionary methods. 4. CMA-ES: Covariance Matrix Adaptation Evolutionary Strategy [23], also a member of the family of Evolutionary Strategies, is an evolution path related technique which uses search space information in a highly e?cient manner, making it exceptionally fast with respect to other evolutionary methods. 5. PM: The proposed method. The performance of each algorithm on each of the test functions was evaluated over a series of 100 simulation runs using the algorithm parameters listed in Table 4.1. Each algorithm was then evaluated with respect to rate of success as well as the mean number of function evaluations per solution. The tabulated results are presented in Table 4.2, which represents the case n = 2 for all four functions. For each of the test functions, success was deflned by two conditions: 45 1. f(x) < 10?6 2. No more than 105 total evaluations of the objective. The flelds in Table 4.2 highlighted in bold font denote the top performers for each of the test cases. One may notice the similarity between the proposed method and the LM algorithm with regard to the flrst test function. Notice that for this particular function, the behaviors of the two algorithms are nearly identical. This validates the earlier hypothesis that, in the absence of local minima, the two methods should behave in a similar manner. Note too the striking difierence in the rate of convergence between the two second order gradient methods when compared to the flrst order method of steepest descent. Another point of interest is the performance of the proposed method on the second test. It was suggested in Section IV-B that, assuming proper performance, little difierence should be observed between T1 and T2. This is conflrmed by the results in Table 4.2. In fact, the algorithm actually performs better! This behavior is a result of the increased population radius coupled with the modifled LM update rule, which, when used together, increase the rate at which the proposed method traverses the optimization surface. Finally, notice the similarity between the proposed method and that of CMA-ES with respect to T3 and T4. In fact, for T4, the similarities are striking. When applied to these more complex surfaces, the behavior of the algorithm changes drastically, and yet there seems to be little, if any, efiect on the level of performance. What is most striking about the results in Fig. 4.6 is the change in behavior between T1 and T4. As the nature of the test functions becomes more complex, the behavior of the proposed method changes from a second order gradient method to one which bears a striking resemblance to that of evolutionary computation! 
Figure 4.6: A graphical summary of the test results: successful runs versus total function evaluations for PM, CMA-ES, SA-ES, LM, and MSD on Test Functions 1-4 (failed cases marked FAILURE)

Chapter 5

Conclusion

In this thesis, two novel algorithms for numerical optimization were presented. In both cases, the objective was to modify the general form of the traditional first and second order approaches in order to admit global solutions to complex problems, while also retaining the desirable local behavior commonly associated with such approaches. This was motivated by the fact that traditional gradient based approaches are susceptible to entrapment in local minima, and will only produce global convergence if the initial guess happens to lie within the global minimizer's region of attraction. Conversely, while newer methods such as evolutionary computation are quite adept at obtaining global solutions, they are very inefficient with respect to local convergence, making them prohibitively time consuming for many applications.

The first approach, presented in Chapter 3, borrows from evolutionary computation through the use of successive populations of pseudo-randomly generated test points. However, instead of using the mechanisms of combination and random mutation to generate successive populations, the proposed method uses the cumulative information contained within each generation to "migrate" from one point to the next. This migratory behavior allows the algorithm to progress in a way which is similar to gradient methods, while the use of randomly distributed populations makes it possible to overcome the challenges presented by local minima. The latter is done by actively controlling the population's radius, allowing test points to venture beyond the region of local attraction. Not only does this method require very little computational effort, but it was also shown to outperform a common first order method for the purpose of training a neural network.

In Chapter 4, a method was presented by which nearly any first or second order algorithm can be modified to overcome the risk of entrapment. This was done through the use of a variable scale gradient approximation (VSGA), which effectively controls the sensitivity of the search direction with respect to the small and large scale behaviors of the objective. Like its predecessor, the VSGA is extracted using the information contained within a pseudo-randomly generated set of test points. The difference, however, is that the VSGA yields a far more accurate model of the objective function's behavior with respect to scale. Thus, as the radius of the population increases, the VSGA becomes less sensitive to local behavior, allowing it to overcome the challenge of local minima. Conversely, when the radius of the population is quite small, the VSGA becomes a very close approximation to the true gradient, making it quite efficient in local searches as well. To confirm this, the algorithm was tested using a series of test functions chosen to typify some of the most commonly faced difficulties.
The performance of the proposed method was also compared with a set of competing algorithms which included gradient as well as evolutionary methods. The results showed that the proposed method was not only able to emulate both classes of algorithms depending on the behavior of each function, but also to outperform the compared methods in terms of rate of success and average number of evaluations of the objective.

Bibliography

[1] H. A. Abbass, A. M. Bagirov, and J. Zhang, "The discrete gradient evolutionary strategy method for global optimization," Congress on Evolutionary Computation - CEC'03, vol. 1, 8-12 Dec. 2003, pp. 435-442.

[2] M. Manic and B. Wilamowski, "Random Weights Search in Compressed Neural Networks using Overdetermined Pseudoinverse," Proc. of the IEEE International Symposium on Industrial Electronics - ISIE'03, Rio de Janeiro, Brazil, June 9-11, 2003, pp. 678-683.

[3] M. Manic and B. Wilamowski, "Robust Neural Network Training Using Partial Gradient Probing," IEEE International Conference on Industrial Informatics - INDIN'03, Banff, Alberta, Canada, August 21-24, 2003, pp. 175-180.

[4] J. Y. Wen, Q. H. Wu, L. Jiang, and S. J. Cheng, "Pseudo-gradient based evolutionary programming," Electronics Letters, vol. 39, no. 7, 3 April 2003, pp. 631-632.

[5] B. Wilamowski, "Neural Networks and Fuzzy Systems," The Microelectronic Handbook, CRC Press, 2006, chapters 124.1 to 124.8.

[6] J. C. Derenick and J. R. Spletzer, "Convex Optimization Strategies for Coordinating Large-Scale Robot Formations," IEEE Transactions on Robotics, vol. 23, no. 6, pp. 1252-1259, Dec. 2007.

[7] Seok-Beom Roh, W. Pedrycz, and Sung-Kwun Oh, "Genetic Optimization of Fuzzy Polynomial Neural Networks," IEEE Trans. on Industrial Electronics, vol. 54, no. 4, pp. 2219-2238, Aug. 2007.

[8] L. dos Santos Coelho and B. M. Herrera, "Fuzzy Identification Based on a Chaotic Particle Swarm Optimization Approach Applied to a Nonlinear Yo-yo Motion System," IEEE Trans. on Industrial Electronics, vol. 54, no. 6, pp. 3234-3245, Dec. 2007.

[9] S.-K. Oh, W. Pedrycz, and H.-S. Park, "A New Approach to the Development of Genetically Optimized Multilayer Fuzzy Polynomial Neural Networks," IEEE Trans. on Industrial Electronics, vol. 53, no. 4, pp. 1309-1321, Aug. 2006.

[10] Faa-Jeng Lin, Po-Kai Huang, and W.-D. Chou, "Recurrent-Fuzzy-Neural-Network-Controlled Linear Induction Motor Servo Drive Using Genetic Algorithms," IEEE Trans. on Industrial Electronics, vol. 54, no. 3, pp. 1449-1461, June 2007.

[11] F. Grimaccia, M. Mussetta, P. Pirinoli, and R. E. Zich, "Genetical Swarm Optimization (GSO): a class of Population-based Algorithms for Antenna Design," First International Conference on Communications and Electronics - ICCE'06, pp. 467-471, 10-11 Oct. 2006.

[12] F. Grimaccia, M. Mussetta, P. Pirinoli, and R. E. Zich, "Optimization of a reflectarray antenna via hybrid evolutionary algorithms," 17th International Zurich Symposium on Electromagnetic Compatibility - EMC-Zurich 2006, pp. 254-257, 27 Feb.-3 March 2006.

[13] Jinn-Tsong Tsai, Jyh-Horng Chou, and Tung-Kuan Liu, "Optimal design of digital IIR filters by using hybrid taguchi genetic algorithm," IEEE Trans. on Industrial Electronics, vol. 53, no. 3, pp. 867-879, June 2006.
[14] Yang Yu and Yu Xinjie, "Cooperative Coevolutionary Genetic Algorithm for Digital IIR Filter Design," IEEE Transactions on Industrial Electronics, vol. 54, no. 3, pp. 1311-1318, June 2007.

[15] R. Salomon, "Evolutionary algorithms and gradient search: similarities and differences," IEEE Transactions on Evolutionary Computation, vol. 2, no. 2, pp. 45-55, Jul. 1998.

[16] M. Bundzel and P. Sincak, "Combining Gradient and Evolutionary Approaches to the Artificial Neural Networks Training According to Principles of Support Vector Machines," IEEE International Joint Conference on Neural Networks - IJCNN'06, pp. 2068-2074, 16-21 July 2006.

[17] X. Hu, Z. Huang, and Z. Wang, "Hybridization of the multi-objective evolutionary algorithms and the gradient-based algorithms," Congress on Evolutionary Computation - CEC'03, vol. 2, pp. 870-877, 8-12 Dec. 2003.

[18] J. Hewlett, B. Wilamowski, and G. Dundar, "Merge of Evolutionary Computations with Gradient Based Method for Optimization Problems," IEEE International Symposium on Industrial Electronics - ISIE'07, Vigo, Spain, June 4-7, 2007.

[19] D. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear Parameters," SIAM Journal on Applied Mathematics, vol. 11, pp. 431-441, 1963.

[20] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., New York: Springer, 2006, pp. 136-143.

[21] S. Abid, A. Mouelhi, and F. Fnaiech, "Accelerating the Multilayer Perceptron Learning with the Davidon Fletcher Powell Algorithm," International Joint Conference on Neural Networks - IJCNN'06, pp. 3389-3394, 2006.

[22] H.-G. Beyer and H.-P. Schwefel, "Evolution Strategies: A comprehensive introduction," Natural Computing, vol. 1, pp. 3-52, 2002.

[23] N. Hansen, S. D. Müller, and P. Koumoutsakos, "Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES)," Evolutionary Computation, vol. 11, no. 1, pp. 1-18, 2003.

[24] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., New York: Springer-Verlag, 2006.

Appendices

A.1 QGDM.m

function [WW1,WW2,OP,ser,ite,k]=QGDM(fun,IP,DP,WW1,WW2,k1,k2,emax,nmax,nnp,r1)
% [WW1,WW2,OP,ser,ite,k]=QGDM(fun,IP,DP,WW1,WW2,k1,k2,emax,nmax,nnp,r1)
%   - error backpropagation for 2 layers
% uses: versize2, augm, forw1l, oerror, fsbip, fslin, fsuni, newwt3
%
% Arguments:
% fun  - activation function 'fsbip'
% IP   - [np,ni] matrix of inputs (not augmented).
% DP   - [np,no] matrix of desired outputs
% WW1  - [nh,ni] weight matrix for the first layer
% WW2  - [no,nh] weight matrix for the second layer
% ni   - number of inputs
% nh   - number of hidden neurons
% no   - number of outputs
% np   - number of patterns
%        not working for unipolar
% k1   - neuron gain at net=0 for first layer
% k2   - neuron gain at net=0 for second layer
% emax - max normalized error
% nmax - max number of iterations
% nnp  - number of points per population
% r1   - starting radius
% Returns:
% WW1  - [nh,ni] weight matrix for the first layer
% WW2  - [no,nh] weight matrix for the second layer
% OP   - [np,no] matrix of outputs
% ser  - vector of normalized errors as function of iterations
% ite  - number of total iterations

if nargin ~= 11
    error('Wrong number of arguments');
end;
[ni,no,np,nh]=versize2(IP,DP,WW1,WW2);
IP1=augm(IP);
step=1; je=0; c1=WW1; c2=WW2; r=r1; k=0;
%---------------Calculate network output (OP2)-----------------%
[NET1,OP1,GAIN1] = forw1l(fun,IP1,WW1,k1);
IP2=augm(OP1);
[NET2,OP2,GAIN2] = forw1l(fun,IP2,WW2,k2);
%--------------------------------------------------------------%
%-----------------Calculate initial error (e)------------------%
e = oerror(DP,OP2);
%--------------------------------------------------------------%
for ite=1:nmax
    [WW1,WW2,r,e,OP,c1,c2]=newwt3(WW1,WW2,r,e,DP,IP1,fun,k1,k2,c1,c2,nnp);
    if (ite/step == round(ite/step)) | (ite==1)
        je=je+1;
        ser(je)=e;
        %disp(sprintf('ite=%5d error=%12.10f',ite,e));
    end;
    %WW1
    %WW2
    if e < emax
        break;              % terminate once the error target is reached
    end;
end;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fsbip(net,k)
% [o,gain] = fsbip(net,k) - bipolar sigmoidal function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
o = 2*ones./(ones+exp(-2*k*net))-ones;
gain = k.*(1-o.*o);
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fslin(net,k)
% [o,gain] = fslin(net,k) - semi linear function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
o = k*net;
for i=1:length(o)
    if o(i)<0
        o(i)=0;
    end;
end
gain = k;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fsuni(net,k)
% [o,gain] = fsuni(net,k) - unipolar sigmoidal function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
if nargin == 1
    k=1;
end
%k=4*k;
[nr,nc] = size(net);
o = ones./(ones+exp(-k.*net));
gain = k.*(1-o).*o;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [ni,no,np,nh]=versize2(IP,DP,WW1,WW2)
% [ni,no,np,nh]=versize2(IP,DP,WW1,WW2) - verification of matrix sizes
% for two layer backpropagation input
% uses:
%
% Arguments:
% IP  - np*ni matrix of input patterns (not in augmented space)
% DP  - np*no matrix of desired patterns
% WW1 - weight matrix for first layer (for augmented space)
% WW2 - weight matrix for second layer (for augmented space)
%       network structure is extracted from the weight matrices
% Returns:
% ni - number of inputs (in augmented space)
% no - number of outputs
% np - number of patterns
% nh - number of hidden neurons
if nargin ~= 4
    error('Wrong number of arguments.');
end
[npi,ni]=size(IP);
[npo,no]=size(DP);
[nh,niw1]=size(WW1);
[now,niw2]=size(WW2);
if npi~=npo
    npi, npo, error('Number of input and output patterns differs');
end
if ni~=niw1-1
    ni, niw1, error('WW1 does not match input patterns');
end
if no~=now
    no, now, error('WW2 does not match output patterns');
end
if nh~=niw2-1
    nh, niw2, error('WW2 does not match hidden neurons');
end
np=npi;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
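For illustration, QGDM might be invoked as in the following sketch, which trains a 2-2-1 network on the XOR problem. The patterns, gains, error target, and radius below are illustrative values, not settings reported in the thesis.

% Hypothetical usage of QGDM (illustrative values throughout).
IP  = [0 0; 0 1; 1 0; 1 1];   % [np,ni] input patterns
DP  = [-1; 1; 1; -1];         % [np,no] desired (bipolar) outputs
WW1 = 2*rand(2,3)-1;          % [nh,ni+1] first layer weights (augmented)
WW2 = 2*rand(1,3)-1;          % [no,nh+1] second layer weights (augmented)
[WW1,WW2,OP,ser,ite] = QGDM('fsbip',IP,DP,WW1,WW2,1,1,1e-4,1000,10,0.5);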
B.1 VSGA.m

function [x0,y0] = vsga(obj,x0,y_max,k_max,r_lim,delta,m)
%VSGA Optimizes objective functions with complex surfaces.
%   VSGA uses a "(V)ariable (S)cale (G)radient (A)pproximation" to solve
%   problems of the form:
%
%       min f(X)
%        x
%
%   where X and the values returned by f can be scalars or vectors.
%
%   [X0,Y0] = VSGA(OBJ,X0,Y_MAX,K_MAX,R_LIM,DELTA,M) returns the minimizer
%   X0 and its corresponding objective value Y0.
%
%   Input arguments:
%   OBJ    [handle] : Handle of the objective function
%   X0     [1xn]    : Initial starting point
%   Y_MAX  [scalar] : Maximum allowable tolerance for Y0
%   K_MAX  [scalar] : Maximum number of iterations
%   R_LIM  [1x2]    : Radius limits [r_min,r_max]
%   DELTA  [scalar] : Radius increment used when progress stalls
%   M      [scalar] : Secondary population size
%
%   For details of the algorithm's operation, see
%
%   Hewlett, J.D.; Wilamowski, B.M.; Dundar, G., "Optimization Using a
%   Modified Second-Order Approach With Evolutionary Enhancement,"
%   Industrial Electronics, IEEE Transactions on, vol. 55, no. 9,
%   pp. 3374-3380, Sept. 2008
%
%   OBJ - The objective function is defined as a separate MATLAB function.
%   For example,
%
%       function [y]=testT1(x)
%       y=((.25*x(1))^4+(.25*x(2))^4);
%       return
%
%   Example:
%
%       [x0,y0] = vsga(@testT2,[100,100],1e-6,100,[1e-6,3],1,10);
%
%   minimizes testT2.

%   Joel Hewlett, Mar. 2009
%   Auburn University Department of Electrical and Computer Engineering
%   $Revision: 1.0 $  $Date: 2009/03/27 13:42:26 $

%---------------------------------------%
%    Initialize constants/variables     %
%---------------------------------------%
output = ['%1.8e <== Total Error ',...
          '%1.1e <== radius %d <== iteration'];
y0 = feval(obj,x0);              % Initial value of the objective
y_best = [y0,zeros(1,k_max-1)];  % Holds value of y0 for each iteration
r = r_lim(1);                    % Set initial radius equal to r_min
I = eye(length(x0));             % nxn identity matrix
mu = 0.1;                        % Learning rate for LM
mut = mu;                        % Stores mu when radius changes
chk = 0;                         % Set when radius is growing
k = 1;                           % Iteration counter
rcount = 0;

%---------------------------------------%
%            Begin Main Loop            %
%---------------------------------------%
while k <= k_max,
    k = k + 1;
    %---------------------------------------%
    %            Compute Gradient           %
    %---------------------------------------%
    [grad,x_opt] = popgrad(x0,obj,r,m); % Calculate gradient approximation
    y_opt=feval(obj,x_opt);
    Hessian = grad'*grad;
    if rcond(Hessian)<0.5     % If Hessian is not well conditioned,
        Hessian=0*Hessian;    % set it equal to 0.
    end
    count = 0;                % Counter for LM mu update
    temp = y0;
    %---------------------------------------%
    %           Apply Update Rule           %
    %---------------------------------------%
    while (1),
        step=((Hessian+mu*I)\grad'*y0)';  % Levenberg-Marquardt step
        dir_step=step/sqrt(step*step');   % Diversity offset
        x_np1=x0-(step+r*dir_step);
        y_np1=feval(obj,x_np1);
        if chk == 0;
            if y_np1<=temp
                temp = y_np1;
                if mu>1e-50, mu=mu/10; end;
                break;
            end;
            if mu<1e+50, mu=mu*10; end;
        end
        count = count+1;
        if count>2, break; end;
    end;
    %---------------------------------------%
    %          Assemble Population          %
    %---------------------------------------%
    P=[x_opt;x0;x_np1];
    y=[y_opt,y0,y_np1];
    %---------------------------------------%
    %           Perform Selection           %
    %---------------------------------------%
    ndex=find(y==min(y));        % Find index of best y value
    y0=y(ndex(1));               % Update y0
    x0=P(ndex(1),:);             % Update x0
    y_best(k) = y0;
    disp(sprintf(output,y0,r,k)); % Display progress to the command window
    if y0 < y_max                % Check if y0 is sufficiently small
        break;
    end
    %---------------------------------------%
    %     Check Progress/Adjust Radius      %
    %---------------------------------------%
    if (y_best(k) == y_best(k-1))  % Check progress
        chk=1;
        if r >= r_lim(2)           % If r has reached its upper limit,
            r=r_lim(1);            % reset r.
        else
            r = r + delta;         % Update r
        end
        mu = 0.1;
    else
        if chk == 1;
            mu = mut;              % Restore mu
        else
            mut = mu;              % Store current mu
        end
        chk=0;
    end
end

%---------------------------------------%
%        Plot y0 as function of k       %
%---------------------------------------%
semilogy(y_best)
xlim([1 k]);
xlabel('k');
ylabel('y0');
title('Run Summary')
return

B.2 popgrad.m

function [grad,x_opt] = popgrad(x0,obj,r,m)
%POPGRAD Calculate a population based gradient approximation.
%   GRAD = POPGRAD(X0,OBJ,RADIUS,M) returns the gradient approximation of
%   OBJ evaluated at the point X0, using a population radius of RADIUS.
%
%   [GRAD,X_OPT] = POPGRAD(X0,OBJ,RADIUS,M) also returns the vector X_OPT,
%   which corresponds to the member of the generated population with the
%   lowest value for the objective function OBJ. If M = 0, X_OPT is simply
%   chosen from the N points used to calculate GRAD. If M > 0, M randomly
%   generated points within the region defined by X0 and RADIUS are also
%   considered.
%
%   Input arguments:
%   X0     : 1xn vector
%   OBJ    : function handle
%   RADIUS : scalar (>0)
%   M      : scalar (>=0)
%
%   OBJ - The objective function is defined as a separate MATLAB function.
%   For example,
%
%       function [y]=testT1(x)
%       y=((.25*x(1))^4+(.25*x(2))^4);
%       return
%
%   Example:
%
%       [g,x] = popgrad([10 10],@testT1,1e-6,10);
%
%   finds the values of g and x for testT1 using a population radius of
%   1e-6 and a secondary population size of 10.

%   Joel Hewlett, Mar. 2009
%   Auburn University Department of Electrical and Computer Engineering
%   $Revision: 1.0 $  $Date: 2009/03/27 10:07:26 $

%---------------------------------------%
%        Generate Population (P)        %
%---------------------------------------%
n = length(x0);              % Size of primary population
A = 2*rand(n)-1;
A = A./(ones(n,1)*sqrt(sum(A'.^2)))'*r;
A = A+ones(n,1)*x0;          % Primary population (A)
if m > 0
    B=(2*rand(m,n)-1);
    B=B./(ones(n,1)*sqrt(sum(B'.^2)))'*r;
    B=B + ones(m,1)*x0;      % Secondary population (B)
    P=[A;B];                 % Combined population (P)
else
    P=A;                     % Combined population (P)
end
%---------------------------------------%
%    Evaluate Population Fitness (y)    %
%---------------------------------------%
y0 = feval(obj,x0);
lambda = n+m;                % Size of combined population
for i = 1:lambda,
    y(i) = feval(obj,P(i,:));
end
%---------------------------------------%
%      Approximate Gradient (grad)      %
%---------------------------------------%
dy = y(1:n)-y0;
dx = P(1:n,:)-ones(n,1)*x0;
grad = (pinv(dx)*dy')';
%---------------------------------------%
%     Perform preliminary selection     %
%---------------------------------------%
y(end+1) = y0;
P(end+1,:)= x0;
index = find(y==min(y));
x_opt = P(index(1),:);
return
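As a brief illustration of the variable-scale behavior, popgrad can be called with different radii on a multimodal objective. The sketch below is not part of the original appendix; the Rastrigin-like objective and the starting point are illustrative.

% Illustration: the same routine yields a local or a large-scale view of
% the surface depending on the population radius r.
rast = @(x) sum(x.^2 - 10*cos(2*pi*x) + 10);  % hypothetical multimodal objective
x0 = [3.1, -2.9];                             % a point near a local minimum
g_local  = popgrad(x0,rast,1e-6,0);           % tiny r: close to the true gradient
g_global = popgrad(x0,rast,3,0);              % large r: differences span several
                                              % ripples, emphasizing the x.^2 bowl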