Novel Approaches to Creating Robust Globally Convergent Algorithms for Numerical Optimization

by

Joel David Hewlett

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 14, 2010

Approved by:

Bogdan M. Wilamowski, Chair, Associate Professor of Electrical Engineering
Thaddeus Roppel, Associate Professor of Electrical Engineering
Robert Dean, Assistant Professor of Electrical Engineering
Vitaly Vodyanoy, Professor of Veterinary Medicine

Abstract

Two optimization algorithms are presented, each of which seeks to effectively combine the desirable characteristics of gradient descent and evolutionary computation into a single robust algorithm. The first method, termed Quasi-Gradient Directed Migration (QGDM), is based on a quasi-gradient approach which utilizes the directed migration of a bounded set of randomly distributed points. The algorithm progresses by iteratively updating the center location and radius of the given population. The second method, while similar in spirit, takes this concept one step further by using a "variable scale gradient approximation" (VSGA), which allows it to recognize surface behavior on both the large and small scale. By adjusting the population radius between iterations, the algorithm is able to escape local minima by shifting its focus onto global trends rather than local behavior. Both algorithms are compared experimentally with existing methods and are shown to be competitive, if not superior, in each of the tested cases.

Acknowledgments

I would like to thank my parents, Bud and Patsy Hewlett. Without their love and support, I would have likely never begun this endeavor. I would also like to express my deepest gratitude to my advisor, Dr. Wilamowski. His guidance and teaching have been an invaluable resource in my academic development, without which I would have likely never finished.

Table of Contents

Abstract
List of Figures
1 Introduction
  1.1 Thesis Outline
2 Numerical Optimization
  2.1 Mathematical Formulation of Optimization Problems
  2.2 The Anatomy of an Algorithm
    2.2.1 Common Search Directions
    2.2.2 Choosing a Step Length
3 Quasi-gradient Optimization Using Directed Migration
  3.1 Proposed Method
    3.1.1 Detailed Description
  3.2 Possible Modifications
    3.2.1 Conically Shaped Regions
    3.2.2 A Modified Population Distribution
  3.3 Experimental Results
4 Variable Scale Gradient Approximation
  4.1 Proposed Method
    4.1.1 Approximating the gradient
    4.1.2 A modified LM update rule
    4.1.3 Assembling the population
    4.1.4 The selection process
    4.1.5 Adjustment of the population radius
  4.2 Test Functions
    4.2.1 Test Function 1
    4.2.2 Test Function 2
    4.2.3 Test Function 3
    4.2.4 Test Function 4
  4.3 Experimental Results
5 Conclusion
Bibliography
Appendices
A MATLAB Implementation of the QGDM Algorithm
  A.1 QGDM.m
B MATLAB Implementation of the VSGA Algorithm
  B.1 VSGA.m
  B.2 popgrad.m

List of Figures

2.1 Comparison of the trust-region and Newton steps
3.1 Four iterations in two dimensions
3.2 Search region shapes as a function of β
3.3 A plot of 1000 random points using (3.3)
3.4 EBP using α = 0.5
3.5 QGDM algorithm using modified distribution with k = 0.1, p = 5
3.6 QGDM algorithm using modified distribution with k = 0.1, p = 10
3.7 QGDM algorithm using modified distribution with k = 0.1, p = 20
3.8 QGDM algorithm using normal distribution with k = 0.1, p = 5
3.9 QGDM algorithm using normal distribution with k = 0.1, p = 10
3.10 QGDM algorithm using normal distribution with k = 0.1, p = 20
4.1 A simple illustration of the basic principle behind VSGA
4.2 Flowchart for the proposed method
4.3 Test Function 2
4.4 Test Function 3
4.5 Test Function 4
4.6 A graphical summary of the test results

Chapter 1

Introduction

The use of optimization algorithms for engineering applications has become quite common in recent years. In fact, the impact of such methods can be seen in nearly every branch of the discipline, and it is not hard to understand why. In a field driven by progress and innovation, optimization algorithms offer a powerful means to that end. While the analytical approach is generally preferred, the fact remains that for many classes of complex problems, the mathematical tools needed to solve them analytically are yet to be developed. In the meantime, optimization has become an attractive alternative which effectively spans the gap between theory and innovation, making it a staple in the modern practice of engineering.

As of late, a rift has begun to develop within the field of optimization, giving rise to two very distinct schools of thought. The net effect of this split is that the large majority of optimization algorithms now come in one of two flavors, generally referred to as gradient descent and evolutionary computation. To put it plainly, a method is either fast, or it is powerful.
Although this distinction is not necessarily valid for every case, it is a widely accepted generalization nonetheless. Each method has its own niche, and comparison of any two is highly subjective on the basis of application. Still, the use of optimization algorithms continues to see rapid growth in a number of diverse fields ranging from robotics [6] and computational intelligence [7][8][9][10] to wireless transmission [11][12] and digital filter design [13][14]. With this continued increase in interest and application, the demand for a newer and more versatile approach has become evident.

Still, despite the abundance of literature and attention garnered by the field in recent years, the discipline appears to have reached an impasse. This is due in large part to the fact that while a number of algorithms have been devised, the middle ground remains largely unexplored. That is to say that the polarization of the field with respect to gradient and evolutionary methods has resulted in very little effort spent in combining the two. While some attempts have been made at bringing the two sides together [15][16][17], most have experienced rather limited success. It is the purpose of this work to help further this cause, with the ultimate goal being a single robust algorithm which encompasses the strengths of both methods, making itself useful over a wider range of problems.

1.1 Thesis Outline

An overview of numerical optimization is offered in Chapter 2. This includes a light review of some relevant concepts and terminology, as well as a brief look at the workings of some commonly used algorithms. Two novel algorithms are introduced in Chapters 3 and 4, which comprise the body of the work. The first algorithm, presented in Chapter 3, seeks to combine some of the strengths of the gradient and evolutionary approaches in a simple but effective manner. The details of the algorithm are presented along with some experimental data. A comparison of the algorithm's performance with one of the most popular methods in current use is also included. In Chapter 4, the details of the second algorithm are discussed. Though similar in principle to the first, the second algorithm offers an even more powerful and robust alternative. The algorithm is compared with a number of competing methods, and is shown to have equal if not superior performance over a varied assortment of test cases. Finally, the thesis concludes with some closing thoughts in Chapter 5.

Chapter 2

Numerical Optimization

Engineering is the practice of analyzing, designing, and manipulating physical systems. This is achieved by way of detailed mathematical models, which are used to approximate the behavior of their physical counterparts to the most accurate degree possible. In most cases these models can be obtained analytically; however, occasionally there arise situations in which certain attributes of a physical system are hidden and cannot be directly measured or observed. Consequently, the resulting lack of necessary information prohibits the use of a purely analytical solution. Instead, the modeling of such a system requires a more heuristic approach, which is the realm of numerical optimization.

In the most general sense, numerical optimization is the process of manipulating the variables of a given function in order to achieve an optimal desired output. The ultimate goal is to obtain a unique set of values which produce the best possible output with respect to the predetermined ideal.
In the case of modeling, it is the errors, or residuals, between the measured outputs of the physical system and its corresponding model that are of interest. Here, the objective is to minimize the mean square value of the error, which is a function of the model coefficients. Doing so yields a unique set of coefficients which offer the most accurate representation of the model's physical counterpart. Thus numerical optimization offers a powerful heuristic alternative for acquiring the desired set of coefficients, even when an analytical solution is unattainable.

It is important to note that similar heuristic methods are used to solve problems in nearly every discipline imaginable. Therefore, while system identification and modeling may represent the predominant applications of optimization within the field of engineering, similar processes are utilized in nearly every facet of applied science. In fact, the average person unknowingly employs heuristic problem solving methods on an almost day-to-day basis; a fact which emphasizes the power of such methods for overcoming situations in which detailed knowledge of a specific system may be lacking.

A familiar example of this is the 3-band equalizer on a home stereo system. The system has 3 inputs, corresponding to the sliders or knobs which control the individual bands, and 1 output in the form of the sound being produced by the speakers. What is interesting about this example is that despite having no practical understanding of the internal workings of the system, the average person is nonetheless capable of achieving near optimal sound quality by ear alone. What may appear to be only a simple attempt at achieving the most faithful reproduction of one's favorite record is, in reality, the heuristic process of solving a 3-dimensional optimization problem. And it is this same powerful ability to essentially do more with less that makes the study of optimization so attractive.

Although optimization algorithms are, by definition, processes of trial and error, they should not be confused with simple brute-force methods which employ primitive techniques such as blind or random searches. On the contrary, the study and practice of numerical optimization rests quite heavily on a thorough and rigorous mathematical foundation. Naturally, to cover the subject in its entirety would far exceed the practical scope of this thesis. Thus this chapter is intended only as a supplement to the material presented later, and is by no means a comprehensive introduction to the subject of optimization. In keeping with this objective, only the most fundamental concepts and terminology are covered. However, should a more comprehensive treatment be desired, the reader is directed to [24], which represents an excellent technical reference on the subject of optimization.

2.1 Mathematical Formulation of Optimization Problems

Mathematically speaking, optimization is the process of searching for an extremum $x^*$ of some function $f(x)$. Ideally, given some initial value $x_0$, the algorithm will converge to the function's global minimizer or maximizer, depending on the desired case (for simplicity's sake, the term optimization will be considered synonymous with minimization¹ from here on). The general mathematical formulation for an optimization problem can be expressed as
\[ \min_{x} f(x) \tag{2.1} \]
where $x \in \mathbb{R}^n$ with $n \ge 1$. The function $f : \mathbb{R}^n \to \mathbb{R}$ is commonly known as the objective function, or simply the objective.

¹ Note that maximizing a function $f$ is equivalent to minimizing $-f$.
2.2 The Anatomy of an Algorithm

An optimization algorithm, as the name implies, is an iterative process which seeks to minimize some objective $f(x)$. At each iteration $k$, the goal is to generate an iterate $x_{k+1}$ such that $f(x_{k+1}) < f(x_k)$, with the expectation that as $k$ increases, $x_k \to x^*$. While there are numerous methods by which these iterates may be generated, most numerical methods can be expressed in the form
\[ x_{k+1} = x_k + \delta_k \tag{2.2} \]
where $\delta_k \in \mathbb{R}^n$ is known as the step. Looking at (2.2), it is clear that $\delta_k$ is responsible for determining how the algorithm progresses from one iterate to the next, which ultimately defines the algorithm itself. Thus it is the way in which this step is generated that separates one algorithm from another.

The step is nothing more than a vector extending from one iterate to the next, which can be uniquely expressed by its length and direction. Thus, in a sense, the generation of the step can be seen as the twofold process of a) determining the best direction in which to proceed, and b) deciding how far to travel in the direction chosen. In general, so long as $\delta_k$ faces "downhill," and the distance traveled is sufficiently short, there are an infinite number of possible combinations which will result in convergence. However, the rate of convergence depends quite heavily on the length and direction of the step. Thus convergence alone is not sufficient. A well designed algorithm should not only guarantee² local convergence, but it should do so in as efficient and effective a manner as possible. Therefore, the choice of search direction and step length are of utmost importance to an algorithm's success.

² This guarantee does not necessarily apply to cases in which the objective function is non-smooth, and therefore not continuously differentiable.

2.2.1 Common Search Directions

The most obvious of all search directions is the gradient direction $-\nabla f_k$, also known as the direction of steepest descent. As the name implies, $-\nabla f_k$ represents the direction in which the objective function $f(x)$ decreases most rapidly, making it a natural choice. Though sometimes referred to as the gradient direction, the steepest descent direction actually points in the opposite direction of the true gradient, which corresponds to the direction of greatest increase. The true gradient is, by definition, the vector of all first partial derivatives of the objective function, or more formally,
\[ \nabla f(x) \triangleq \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}, \quad \text{where } x = (x_1, x_2, \ldots, x_n) \tag{2.3} \]
In optimization, where the goal is to find the point at which a given function yields the lowest possible value, knowing the direction in which that function changes most rapidly is of obvious value, making the gradient a particularly useful tool.
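To make the generic update (2.2) concrete, the following MATLAB fragment (an illustrative sketch only; the objective, step length, and iteration count are assumed values, not taken from this thesis) performs steepest descent with the fixed step $\delta_k = -\alpha \nabla f_k$:

    % Steepest descent on an assumed quadratic objective, using the generic
    % update x_{k+1} = x_k + delta_k of (2.2) with delta_k = -alpha*grad(f).
    f     = @(x) 0.5*(x(1)^2 + 10*x(2)^2);   % example objective (assumed)
    gradf = @(x) [x(1); 10*x(2)];            % its analytic gradient
    x     = [4; -2];                         % initial iterate x_0
    alpha = 0.05;                            % fixed step-length parameter
    for k = 1:200
        x = x - alpha*gradf(x);              % step along -grad(f_k)
    end
    % x now lies near the minimizer at the origin

Note how slowly the iterates crawl along the poorly scaled axis; this asymptotic behavior of steepest descent is revisited in Chapter 3.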
Another commonly used search direction is the Newton direction. Though it is more difficult to compute than the gradient, it provides significantly more information about the local behavior of the objective, which in turn yields a much higher rate of convergence. This is achieved through the use of a second-order model $m_k(\delta_k)$ of the objective function, derived from its Taylor series approximation
\[ f(x_k + \delta_k) \approx f_k + \delta_k^T \nabla f_k + \tfrac{1}{2}\, \delta_k^T \nabla^2 f_k\, \delta_k \triangleq m_k(\delta_k) \tag{2.4} \]
From this, the Newton direction is defined as the vector $\delta_k$ which minimizes the quadratic model $m_k(\delta_k)$. Assuming for now that $\nabla^2 f_k$ is positive definite, it is possible to solve for $\delta_k$ by setting the derivative of (2.4) equal to zero. Doing so yields the following explicit formula for the Newton direction:
\[ \delta_k = -(\nabla^2 f_k)^{-1} \nabla f_k \tag{2.5} \]
Whereas the steepest descent direction revolves around the computation of the gradient, the Newton direction relies primarily on the calculation of the Hessian $\nabla^2 f_k$, which is the matrix containing all second partial derivatives of the objective function. That is,
\[ \nabla^2 f(x) \triangleq \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \quad \text{where } x = (x_1, x_2, \ldots, x_n) \tag{2.6} \]
As alluded to earlier, the computation of the Hessian can be a rather expensive operation, which is one of the Newton method's major drawbacks. Still, the benefits of this approach are typically considered to outweigh the costs.

In addition, a number of strategies have been devised which serve to reduce the computational requirement by replacing the true Hessian matrix with a close approximation that does not need to be fully reevaluated at each iteration. Search directions which operate in this way are commonly referred to as quasi-Newton methods. Some of the more common implementations include the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, and the closely related Davidon-Fletcher-Powell (DFP) algorithm.

Another attractive attribute of many quasi-Newton methods is that they can be reformulated to operate on the inverse of the approximated Hessian instead of on the approximation itself, which alleviates the burden of performing costly matrix inversions when solving (2.5). This ability can prove especially valuable on problems with high dimensionality.
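As a brief illustration (not from the thesis), the sketch below computes a Newton step via (2.5) for the same assumed quadratic used above; because the Hessian of a quadratic is constant, a single step lands on the minimizer:

    % One Newton step, delta_k = -(hessian_k)^(-1)*grad(f_k), eq. (2.5).
    gradf = @(x) [x(1); 10*x(2)];   % gradient of 0.5*(x1^2 + 10*x2^2)
    H     = [1 0; 0 10];            % Hessian (2.6), positive definite
    x0    = [4; -2];
    delta = -H \ gradf(x0);         % solve H*delta = -grad(f), avoiding inv(H)
    x1    = x0 + delta;             % x1 = [0; 0], the exact minimizer

Solving the linear system rather than forming $(\nabla^2 f_k)^{-1}$ explicitly reflects the same cost concern that motivates the inverse-updating quasi-Newton variants mentioned above.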
2.2.2 Choosing a Step Length

While the Newton and steepest descent directions can both be shown to guarantee a reduction in the value of the objective [24], this is only true for sufficiently small step lengths. This is due to the fact that the reliability of the corresponding linear and quadratic models diminishes as a function of distance. While there are a number of ways of selecting proper step lengths, in general, most methods follow one of two basic strategies.

Line Search Strategies

The first of the two strategies is known as the line search method. In this approach, the search direction is chosen first, which yields a one-dimensional sub-problem in which the objective is treated as a function of the step length. This is equivalent to solving
\[ \alpha_k = \min_{\alpha} f(x_k + \alpha d_k), \quad \alpha > 0 \tag{2.7} \]
where $d_k$ is the search direction and $\alpha$ is a scalar which controls the length of the step. In practice, it is not practical to provide an exact solution to (2.7), and in fact, an approximate solution is generally sufficient for convergence. The most common approach is to generate a finite set of trial steps and choose the value $\alpha_k$ which yields the greatest reduction in the objective $f(x)$. The generation of the trial steps can be a rather complex process in and of itself, and is often subject to one or more sufficient decrease conditions which are used to determine if a given step produces an adequate improvement in the objective.

Trust Region Strategies

The second major strategy for handling step length is known as the trust region approach. For this approach, the step is chosen as the solution to a constrained sub-problem in which the objective function $f(x)$ is replaced by some approximate model $m_k(x)$. The resulting sub-problem is written
\[ \delta_k = \min_{\delta}\, m_k(x_k + \delta), \quad \text{where } x_k + \delta \text{ lies within the trust region.} \tag{2.8} \]
Usually, the trust region is spherical in shape, which requires the chosen step to satisfy the constraint $\|\delta\| \le r_k$, where $r_k$ is the radius of the constraining region. If the solution to the constrained sub-problem does not produce sufficient improvement in the objective, the trust region is adjusted accordingly and the process is repeated.

Typically, a quadratic model is chosen for $m_k(x)$, making most trust region strategies Newton or quasi-Newton in nature. Therefore, in most cases, $m_k$ is of the form
\[ m_k(x_k + \delta) = f_k + \delta^T \nabla f_k + \tfrac{1}{2}\, \delta^T B_k\, \delta \tag{2.9} \]
where $B_k$ is either the true Hessian or some approximation to it.

Combining (2.8) and (2.9), while also introducing the constraint imposed by the spherical trust region, results in the following reformulation of the trust region sub-problem:
\[ \delta_k = \min_{\delta}\; f_k + \delta^T \nabla f_k + \tfrac{1}{2}\, \delta^T B_k\, \delta, \quad \text{s.t. } \|\delta\| \le r_k \tag{2.10} \]
It may be shown [24] that the solution $\delta_k^*$ to the constrained sub-problem in (2.10) must satisfy
\[ (B_k + \lambda I)\, \delta_k^* = -\nabla f_k \tag{2.11} \]
for some scalar $\lambda \ge 0$. This characteristic is particularly useful since it implies
\[ \delta_k^*(\lambda) = -(B_k + \lambda I)^{-1} \nabla f_k \tag{2.12} \]
which reduces to a one-dimensional function of $\lambda$. As with $\alpha$ for line search methods, it is not practical to seek an exact value for $\lambda$. Instead, an approximate value of $\lambda$ is obtained in much the same way as the step length $\alpha$ is chosen in a line search.

[Figure 2.1: Comparison of the trust-region and Newton steps, showing the contours of f(x) and of m_k, the minimizer x*, the trust region of radius r_k, the constrained step, and the full Newton step.]

An interesting characteristic of the trust region approach is the fact that the choice of the trust region radius affects not only the step length, but the direction of the search as well. An example of this can be seen in Fig. 2.1. Here, one vector represents the full Newton step, whereas $\delta_k^*$ represents the constrained trust-region step. Notice that as the radius of the trust region decreases, the direction of the step converges toward the gradient direction. Therefore, the trust-region step is essentially a linear combination of the gradient and Newton directions.
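The damped system (2.11)-(2.12) suggests a simple way to compute a trust-region step in practice. The following hedged sketch (illustrative values only; real implementations update λ far more carefully [24]) increases λ until the step fits within the radius:

    % One trust-region step via (2.12): increase lambda until ||delta|| <= r_k.
    g = [3; -1];                        % gradient at the current iterate (assumed)
    B = [2 0; 0 0.5];                   % Hessian or approximation B_k (assumed)
    r = 0.5;                            % trust-region radius r_k
    lambda = 0;
    delta  = -(B + lambda*eye(2)) \ g;  % lambda = 0 gives the full Newton-like step
    while norm(delta) > r
        lambda = max(2*lambda, 1e-4);   % crude geometric increase (one of many rules)
        delta  = -(B + lambda*eye(2)) \ g;
    end

As λ grows, $(B_k + \lambda I)^{-1}$ approaches a scaled identity and the step turns toward $-\nabla f_k$, which is precisely the blending of the Newton and gradient directions visible in Fig. 2.1.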
Chapter 3

Quasi-gradient Optimization Using Directed Migration

Gradient based search methods are known to be very efficient, especially for cases in which the surface of the solution space is relatively smooth, with few local minima. Unfortunately, in cases where the search space lacks this relative uniformity, gradient methods become easily trapped in the local minima commonly found on rough or complex surfaces. Methods of evolutionary computation (including genetic algorithms), on the other hand, are much more effective for traversing these more complex surfaces, and are inherently better suited for avoiding local minima. However, like the gradient based approach, evolutionary computation has its own significant weakness. This stems from the fact that despite its reliability, solutions are often not optimal. Furthermore, both methods are known to converge very slowly [2][3][5].

The objective behind this chapter is to take advantage of both methods by combining the desirable characteristics of each. Unlike standard evolutionary computation, populations are generated using the gradient, which is not directly calculated but is instead extracted from the properties of the existing population. Several similar approaches have been undertaken along this path [1][16][17][2][3][4], but the method which is proposed here has less computational complexity and is more suitable for on-line hardware implementation. Simple computations are repeated with every iteration, and the gradient is updated simply by subtracting the coordinates of the best solutions from the current and previous populations. The steepest descent method, which is the most commonly used gradient method, tends to approach the solution asymptotically, which results in a much slower rate of convergence. By comparison, the proposed method converges much more rapidly.

3.1 Proposed Method

The proposed method is a hybrid algorithm which offers both the relative speed of gradient descent and the methodical power of evolutionary computation. Like the latter, this hybrid algorithm relies on a randomly generated population of initial points. It also shares the advantage of being an exclusively feed-forward process. What separates this approach from standard methods of evolutionary computation is the way in which the successive populations are generated. This is where the proposed method borrows more from the gradient based approach. The hybrid approach relies on the calculation of a "rough" gradient using the individual errors associated with a given population. The minimum of these errors is determined and an entirely new population is generated about the corresponding point. This offers a more directional approach to the generation of successive populations. Thus this approach can be viewed as a sort of migration rather than recombination, for which it is termed Quasi-Gradient Directed Migration (QGDM). The result is an algorithm which converges much more rapidly than the combinational approach commonly associated with genetic algorithms, while at the same time reducing the risks presented by local minima.

3.1.1 Detailed Description

The QGDM method is best represented as a four step process. Although the algorithm technically involves only three steps per iteration, an additional step is required to begin the process.

Step 1: Choosing a Starting Point and Radius - Let $f$ be a function defined in $\mathbb{R}^n$. Choose an initial center point $c$ and radius $r$. An initial minimum $m$ must also be selected. For the first iteration, let $m = c$.

Step 2: Generating a Random Population - With $n$ being the number of points per population, define a set of vectors $V = \{v_1, v_2, \ldots, v_n\}$ such that
\[ v_i^k = r_k X Y + c_k \quad \text{for } i = 1, \ldots, n \tag{3.1} \]
where $X$ is a random number ranging from 0 to 1, $Y$ is a normalized random vector, and $k$ is the current iteration. Using (3.1) creates a set of random vectors whose members are centered about $c_k$ with magnitudes ranging from 0 to $r_k$. Therefore the $v_i^k$ represent positional vectors of the points $p_i^k$ which lie within the defined region.

Step 3: Approximation of the Gradient - Evaluate $f(p_i^k)$ for $i = 1, \ldots, n$. If $\min_i f(p_i^k) > f(m_k)$, repeat step two. Otherwise, let $m_{k+1}$ be the point $p$ for which $f(p) = \min_i f(p_i^k)$. An approximate gradient vector $g_k$ can then be defined by taking the difference between $m_{k+1}$ and $m_k$:
\[ g_k = m_{k+1} - m_k \]

Step 4: Creating a New Center and Radius - The new center should be shifted slightly so that the following population will lie in the general direction of the approximated gradient. Using
\[ c_{k+1} = \alpha g_k + m_k \quad \text{for } \alpha \ge 1 \]
allows the size of the shift to be controlled by $\alpha$. If no shift is desired, $\alpha$ can be set to 1, in which case $c_{k+1} = m_{k+1}$. In order to ensure that $m_{k+1}$ lies within the new region, it is necessary that $r_{k+1} \ge \|c_{k+1} - m_{k+1}\|$. This can be set using
\[ r_{k+1} = \beta \|c_{k+1} - m_{k+1}\| \quad \text{for } \beta \ge 1 \]
Once $r_{k+1}$ and $c_{k+1}$ are determined, steps two through four are repeated until $f(m)$ is within the desired tolerance.
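A minimal MATLAB sketch of one QGDM iteration (Steps 2 through 4) is given below. The complete implementation appears in Appendix A; the fragment here is illustrative only, with an assumed objective and assumed parameter values:

    % One QGDM iteration (Steps 2-4) on an assumed objective; see Appendix A.
    f = @(x) sum(x.^2);                 % example objective (assumed)
    c = [3; 4]; r = 1; m = c;           % center, radius, and minimum from Step 1
    alpha = 2; beta = 1.5; n = 10;      % shift, radius, and population parameters
    % Step 2: random population within the sphere of radius r about c, eq. (3.1)
    P = zeros(2, n);
    for i = 1:n
        Y = randn(2, 1); Y = Y/norm(Y); % normalized random direction
        P(:, i) = r*rand*Y + c;         % v_i = r*X*Y + c
    end
    % Step 3: quasi-gradient from the best population member
    fp = arrayfun(@(i) f(P(:, i)), 1:n);
    [fbest, ibest] = min(fp);
    if fbest < f(m)                     % otherwise Step 2 would be repeated
        m_new = P(:, ibest);
        g = m_new - m;                  % g_k = m_{k+1} - m_k
        % Step 4: shift the center along g and set the new radius
        c = alpha*g + m;                % c_{k+1} = alpha*g_k + m_k
        r = beta*norm(c - m_new);       % r_{k+1} = beta*||c_{k+1} - m_{k+1}||
        m = m_new;
    end

Repeating this block until f(m) falls below the desired tolerance yields the full algorithm.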
The two-dimensional example case in Fig. 3.1 illustrates the process through four iterations. For this particular example, $\alpha = 2$ and $\beta = 3/2$. This can be clearly seen by the way $r_k$, represented by the large circles, perfectly bisects $g_{k-1}$ for each iteration. Also, notice that for the first iteration, the gradient vector $g$ extends from $c$ to $m$, whereas in all subsequent iterations $g$ is defined from $m_k$ to $m_{k+1}$.

[Figure 3.1: Four iterations in two dimensions, showing the centers c1 through c4 and the minima m1 through m4.]

3.2 Possible Modifications

Spherically shaped regions may not be optimal when generating successive populations. Using conical regions which extend in the direction of the approximate gradient might greatly increase the speed of convergence. The following is just one of many similar modifications.

3.2.1 Conically Shaped Regions

It is desired that the vector $v$ be used as the axis of symmetry. This is best done by switching to polar coordinates, which only require the zenith angle $\phi$ and radial component $\rho$ in order to define the region's orientation. This is an especially useful way of describing a hyperconical region since it is valid for any subspace of $\mathbb{R}^n$. The drawback of this method is that it requires that the axis of symmetry lie along the zenith axis. In order to extend this method to cases using arbitrary axes of symmetry, it is necessary to change to a basis $U$ whose zenith axis is in the direction of $v$. The region can then be defined using the new set of coordinate axes. With this, representing a point or vector within the defined region in terms of the standard basis $E$ can be done quite easily by changing bases from $U$ back to $E$ using a simple transition matrix.

Defining a new basis

The process described below first requires the formation of a new basis using the n-dimensional axial vector $v$. To do this, an additional $n - 1$ linearly independent vectors are required. The first step is to devise a generalized method for generating this set of vectors.

1. Generating an ordinary basis

In order to generate an ordinary basis, it is first necessary to acquire the proper tools for determining linear independence. The following three theorems are particularly useful.

Theorem 3.1 Let $x_1, x_2, \ldots, x_n$ be $n$ vectors in $\mathbb{R}^n$ and let $x_i = (x_{1i}, x_{2i}, \ldots, x_{ni})^T$ for $i = 1, \ldots, n$. If $X = (x_1, x_2, \ldots, x_n)$, then the vectors $x_1, x_2, \ldots, x_n$ will be linearly independent if and only if $X$ is nonsingular.

Theorem 3.2 An $n \times n$ matrix $A$ is singular if and only if $\det(A) = 0$.

Theorem 3.3 Let $A$ be an $n \times n$ matrix. (i) If $A$ has a row or column consisting entirely of zeros, then $\det(A) = 0$. (ii) If $A$ has two identical rows or two identical columns, then $\det(A) = 0$.

Combining these three theorems yields a particularly useful result.

Corollary 3.4 Let $A$ be an $n \times n$ matrix. The column space of $A$ forms an ordinary basis for $\mathbb{R}^n$ only if (i) $A$ contains no rows or columns consisting entirely of zeros, and (ii) no two rows or columns of $A$ are identical.

The conditions from Corollary 3.4 can nearly always be satisfied by taking the standard basis $E$ and simply replacing $e_1$ with $v$. The only case in which this will not result in an ordinary basis of $\mathbb{R}^n$ is when $v_1 = 0$, which can be easily remedied by replacing $e_{i1}$ with $(v_1 + 1)$ for $i = 2, \ldots, n$. This method will work for any $v$ except $0$. To summarize,

(a) Let $E = \{e_1, e_2, \ldots, e_n\}$. Replace $e_1$ with $v$.
(b) Replace $e_{i1}$ with $(v_1 + 1)$ for $i = 2, \ldots, n$.

The resulting set of vectors $S$ is an ordinary basis for $\mathbb{R}^n$ with $s_1 = v$. Although this method does produce a basis for $\mathbb{R}^n$ relative to $v$, it still lacks one essential characteristic.
In order for the vector lengths and angles of separation to be preserved when changing bases, $S$ must be orthogonal. Clearly, $S$ is not an orthogonal basis; however, this can be remedied using the Gram-Schmidt Orthogonalization Process.

2. The Gram-Schmidt Process

The Gram-Schmidt Process is a method for orthogonalizing a set of vectors in an inner product space, most commonly the Euclidean space $\mathbb{R}^n$. The Gram-Schmidt process takes a finite, linearly independent set $V = \{v_1, \ldots, v_n\}$ and generates an orthogonal set $V' = \{u_1, \ldots, u_n\}$ that spans the same subspace as $V$.

Theorem 3.5 (The Gram-Schmidt Process) Let $\{x_1, x_2, \ldots, x_n\}$ be a basis for the inner product space $V$. Let
\[ u_1 = \frac{1}{\|x_1\|}\, x_1 \]
and define $u_2, \ldots, u_n$ recursively by
\[ u_{k+1} = \frac{1}{\|x_{k+1} - p_k\|}\,(x_{k+1} - p_k) \quad \text{for } k = 1, \ldots, n-1 \]
where
\[ p_k = \langle x_{k+1}, u_1 \rangle u_1 + \langle x_{k+1}, u_2 \rangle u_2 + \cdots + \langle x_{k+1}, u_k \rangle u_k \]
is the projection of $x_{k+1}$ onto $\mathrm{Span}(u_1, u_2, \ldots, u_k)$. The set $\{u_1, u_2, \ldots, u_n\}$ is an orthonormal basis for $V$.

By applying Theorem 3.5 to the ordinary basis $S$, the orthogonal basis $S'$ is obtained, which spans the same subspace as $S$, and whose first element $u_1$ shares the same axis as $s_1$, and therefore lies in the direction of the initial vector $v$. Thus the net result is an orthonormal basis $S'$ which allows for orthonormal transformations with the standard basis $E$.

Defining a hyperconically bounded region

With the basis $S'$, defining a hyperconically bounded region about $v$ is a trivial matter. Using hyperspherical coordinates, the region can be described using only the radial component $\rho$ and the zenith angle $\phi$. Any point obtained by manipulating the other angular components will lie on the boundary of this region. For a region described by a maximum radial component $\rho_{max}$ and a maximum zenith angle $\phi_{max}$, any point satisfying $0 \le \rho \le \rho_{max}$ and $0 \le \phi \le \phi_{max}$ is guaranteed to lie within the desired region. All that is required now is to perform an orthonormal transformation from $S'$ back to $E$.

Changing bases from $S'$ to $E$

Changing bases is a relatively simple matter. A transition matrix from $S'$ to $E$ can be found with relatively little effort. However, before it can be used for changing the basis of a point within the predefined region, its coordinates must be converted from a spherical to a Cartesian representation. Thus some generalized method for making this conversion is needed.

Hyperspherical to Cartesian coordinate conversion

A coordinate system has now been defined for n-dimensional Euclidean space, which is analogous to the spherical coordinate system defined for 3-dimensional Euclidean space. These coordinates consist of a radial coordinate $\rho$, and $n - 1$ angular coordinates $\phi_1, \phi_2, \ldots, \phi_{n-1}$. If the $x_k$ are the Cartesian coordinates, then
\[ x_k = \begin{cases} \rho\,\cos(\phi_1), & k = 1 \\[4pt] \rho\,\Bigl(\prod_{i=1}^{k-1}\sin(\phi_i)\Bigr)\cos(\phi_k), & k = 2, \ldots, n-1 \\[4pt] \rho\,\prod_{i=1}^{n-1}\sin(\phi_i), & k = n \end{cases} \tag{3.2} \]
This offers a generalized method by which an n-dimensional vector or point in hyperspherical coordinates may be converted into its Cartesian representation. Once the conversion has been made, the given point or vector may be freely converted from one basis to the other using nothing more than a simple transition matrix.

Creating a transition matrix from $S'$ to $E$

Theorem 3.6 (Transition matrices) Let $\{u_1, u_2, \ldots, u_n\}$ be an ordered basis for $\mathbb{R}^n$, and let $c$ be a coordinate vector with respect to $\{u_1, u_2, \ldots, u_n\}$. To find the corresponding coordinate vector $x$ with respect to $\{e_1, e_2, \ldots, e_n\}$, simply multiply the matrix $U = [u_1, u_2, \ldots, u_n]$ by $c$:
\[ x = Uc \]

Summary

A step-by-step summary of the combined process.

Step 1: Using $[v, e_2, \ldots, e_n]$, create a new basis for $\mathbb{R}^n$ by replacing $e_{i1}$ with $v_1 + 1$ for $i = 2, \ldots, n$. Call the resulting matrix $S$ and refer to its column space as $\{s_1, s_2, \ldots, s_n\}$.

Step 2: Using $S = \{s_1, s_2, \ldots, s_n\}$, let
\[ u_1 = \frac{1}{\|s_1\|}\, s_1 \]
and define $u_2, \ldots, u_n$ recursively by
\[ u_{k+1} = \frac{1}{\|s_{k+1} - p_k\|}\,(s_{k+1} - p_k) \quad \text{for } k = 1, \ldots, n-1 \]
The resulting matrix $[u_1, u_2, \ldots, u_n]$ will form an orthonormal basis for $\mathbb{R}^n$, which is denoted as $U$.

Step 3: Define a hyperconical region about $u_1$ by choosing a maximum radial component $\rho_{max}$ and a maximum zenith angle $\phi_{max}$.

Step 4: Generate vectors within the defined region by choosing $\rho$ and $\phi$ such that $0 \le \rho \le \rho_{max}$ and $0 \le \phi \le \phi_{max}$.

Step 5: Let $x$ be one of the vectors chosen in step four. Convert $x$ from hyperspherical to Cartesian coordinates using (3.2).

Step 6: Express $x$ with respect to $[e_1, e_2, \ldots, e_n]$ using $U$ as a transition matrix:
\[ x' = Ux \]

This newly generated vector $x'$ has two very important characteristics, which make it particularly useful. They are:

(i) The length of $x'$ is no greater than $\rho_{max}$.
(ii) The angle of separation between $x'$ and the original vector $v$ is no greater than $\phi_{max}$.
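The entire process can be condensed into a few lines of MATLAB. The sketch below is a hedged illustration for n = 3, with assumed values for v, ρ_max, and φ_max; it follows Steps 1 through 6 directly:

    % Generate one random vector inside a hypercone about v (Steps 1-6, n = 3).
    n = 3; v = [1; 2; -1];                 % dimension and axial vector (assumed)
    S = eye(n); S(:, 1) = v;               % Step 1: replace e_1 with v ...
    S(1, 2:n) = v(1) + 1;                  % ... and set e_i1 = v_1 + 1 for i >= 2
    U = zeros(n); U(:, 1) = S(:, 1)/norm(S(:, 1));
    for k = 1:n-1                          % Step 2: Gram-Schmidt (Theorem 3.5)
        p = U(:, 1:k)*(U(:, 1:k)'*S(:, k+1));  % projection onto span(u_1..u_k)
        w = S(:, k+1) - p;
        U(:, k+1) = w/norm(w);
    end
    rho_max = 2; phi_max = pi/6;           % Step 3: cone parameters (assumed)
    rho  = rho_max*rand;                   % Step 4: radial component
    phi1 = phi_max*rand; phi2 = 2*pi*rand; % zenith angle, then a free angle
    x = [rho*cos(phi1);                    % Step 5: eq. (3.2) for n = 3
         rho*sin(phi1)*cos(phi2);
         rho*sin(phi1)*sin(phi2)];
    xprime = U*x;                          % Step 6: transition back to basis E

Because U is orthonormal, the transformation preserves lengths and angles, so xprime satisfies properties (i) and (ii) above by construction.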
Controlling the Defined Region's Shape

It might be desirable to have $\rho$ diminish as $\phi$ increases. This can be done by expressing $\rho$ as a function of $\phi$:
\[ \rho(\phi) = \rho_{max}\left[1 - \left(\frac{\phi}{\phi_{max}}\right)^{\beta}\right] \]
The variable $\beta$ can be adjusted to control the shape of the given region. As $\beta$ becomes larger, the shape of the region becomes increasingly conical. Fig. 3.2 offers a graphical representation of how $\beta$ affects the region's shape.

[Figure 3.2: Search region shapes with ρ_max = 2 and φ_max = 30°, shown as polar plots for β = 1, 3, 10, and 100.]

3.2.2 A Modified Population Distribution

One of the dangers of using a uniform population distribution is its susceptibility to local minima. If the algorithm gets into a low-lying area which is larger than the current maximum population radius, the algorithm will be permanently stuck. In order to avoid this scenario, occasional points must be plotted outside the desired maximum radius. One effective and efficient way of doing this is to use a population density which decreases exponentially as a function of the radius. This can be achieved by replacing $r_k X$ in (3.1) with
\[ \frac{1}{X + \frac{1}{r_k}} - \frac{r_k}{r_k + 1} \tag{3.3} \]
which takes values between 0 and
\[ \frac{r_k}{1 + \frac{1}{r_k}} \]
Although the upper bound for $\|v_i^k\|$ is no longer $r_k$, for larger values of $r_k$,
\[ \frac{r_k}{1 + \frac{1}{r_k}} \approx r_k \]
Fig. 3.3 shows how (3.3) affects the resulting density.

[Figure 3.3: A plot of 1000 random points generated using (3.3).]
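As a brief illustration of the effect of (3.3), the fragment below (with an assumed value of $r_k$, in the spirit of Fig. 3.3) draws 1000 radii under the modified distribution; most samples fall near zero, with occasional points reaching almost all the way out to $r_k$:

    % Radii drawn via (3.3) instead of the uniform draw r_k*X from (3.1).
    r = 10;                            % population radius r_k (assumed)
    X = rand(1000, 1);                 % 1000 uniform draws, as in Fig. 3.3
    radii = 1./(X + 1/r) - r/(r + 1);  % eq. (3.3); values lie in [0, r/(1 + 1/r))
    histogram(radii)                   % density decays rapidly with radius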
3.3 Experimental Results

Two variations of the algorithm were used for testing. The first set of tests was performed using the standard algorithm as described in Section 3.1. The second set used the modified population distribution from Section 3.2.2. The algorithms were then tested using population sizes of 5, 10, and 20 points per iteration. Statistical data was compiled for both algorithms using 100 runs for each population size.

Due to the common practical application of optimization in the field of artificial neural networks, initial tests were made using the algorithm for training a simple two-layer network. The network was then retrained for comparison using error back-propagation (EBP), which is the most popular training algorithm in current use. Statistical data and error plots are included for each test.

All figures and statistics were taken for successful runs only, with the exception of the success rate, which was defined as the ratio of successful runs to the total number of runs. From the results, it is clear that both algorithms perform comparably to EBP, with the modified version being by far the most consistent.

[Figure 3.4: Error back-propagation using α = 0.5; total error vs. iterations for 25 runs.]

    Error back-propagation
    Population           20
    Gain                 5
    Learning constant    0.5
    Min run              138 iterations
    Max run              259 iterations
    Ave. run             152 iterations
    Standard deviation   15 iterations
    Success rate         66%

[Figure 3.5: QGDM algorithm using the modified distribution with k = 0.1, population size 5; total error vs. iterations.]

    Modified QGDM algorithm
    Population           5
    Initial radius       5
    Total runs           100
    Min run              57 iterations
    Max run              376 iterations
    Ave. run             105 iterations
    Standard deviation   47 iterations
    Success rate         82%

[Figure 3.6: QGDM algorithm using the modified distribution with k = 0.1, population size 10; total error vs. iterations.]

    Modified QGDM algorithm
    Population           10
    Initial radius       5
    Total runs           100
    Min run              21 iterations
    Max run              302 iterations
    Ave. run             67 iterations
    Standard deviation   45 iterations
    Success rate         81%

[Figure 3.7: QGDM algorithm using the modified distribution with k = 0.1, population size 20; total error vs. iterations.]

    Modified QGDM algorithm
    Population           20
    Initial radius       5
    Total runs           100
    Min run              17 iterations
    Max run              84 iterations
    Ave. run             36 iterations
    Standard deviation   11 iterations
    Success rate         81%

[Figure 3.8: Standard QGDM algorithm using β = 1.5, α = 2, population size 5; total error vs. iterations.]

    Standard QGDM algorithm
    Population           5
    Min run              17 iterations
    Max run              301 iterations
    Ave. run             82 iterations
    Success rate         55%

[Figure 3.9: Standard QGDM algorithm using β = 1.5, α = 2, population size 10; total error vs. iterations.]

    Standard QGDM algorithm
    Population           10
    Min run              11 iterations
    Max run              159 iterations
    Ave. run             34 iterations
    Success rate         66%

[Figure 3.10: Standard QGDM algorithm using β = 1.5, α = 2, population size 20; total error vs. iterations.]

    Standard QGDM algorithm
    Population           20
    Min run              9 iterations
    Max run              197 iterations
    Ave. run             29 iterations
    Success rate         70%

Chapter 4

Variable Scale Gradient Approximation

Gradient methods are often compared to a ball rolling along some surface. Neglecting momentum and assuming the ball is infinitely small, it will follow the path of steepest descent exactly. This is a powerful illustration which, consequently, also highlights the method's fundamental flaws. With little effort one can visualize any number of surfaces on which the ball might become permanently trapped before ever reaching its desired destination. Take, for example, a flight of stairs as in Fig. 4.1. Even though the behavior of the surface is relatively simple, it is clear that no matter where the ball on the left is placed, it will never find its way to the foot of the stairs. Though the staircase clearly exhibits a global trend which may be determined with only a limited amount of information, the ball is still incapable of proceeding.

The advantage of this illustration is that it makes the solution quite obvious: use a bigger ball. Though changing the size of the ball won't change the gradient of the surface, it will make the ball more sensitive to larger scale behavior, allowing it to successfully descend the stairs. After all, all of the relevant information for descending a staircase can be found at the corners of the individual steps.
While steps may be an essential part of a staircase, from the perspective of minimization they are nothing more than noise. Furthermore, by actively allowing the ball to expand and contract as it descends, it is possible to navigate on as large or small a scale as necessary. This is the basic principle by which the proposed method operates, and from which it derives its name: Variable Scale Gradient Approximation (VSGA).

[Figure 4.1: Ordinary gradient methods (left) are easily trapped on complex or piecewise constant optimization surfaces. The proposed method (right) is able to overcome these obstacles by expanding the radius of its population-based gradient approximation.]

4.1 Proposed Method

The key feature of the proposed method is the way in which the gradient is approximated. During each iteration, the algorithm generates a "population" of points within a spherically bounded region. A portion of the generated set lies on the region's perimeter, and it is these members of the population which are used in the approximation of the gradient. Next, an additional point is generated in the direction of the approximated gradient and added to the current population. Finally, the fittest member of the population is compared to the current best solution, and the population's center and radius are updated accordingly. The basic flow of the algorithm is presented in Fig. 4.2.

[Figure 4.2: A simplified flow chart depicting the basic operation of the proposed algorithm: compute gradient, apply LM update rule, assemble population, perform selection; terminate when f < f_Tol, and adjust the radius on lack of progress.]

4.1.1 Approximating the gradient

As previously noted, the gradient approximation is the defining feature of the proposed method. The generalized process for computing it is presented in a step-by-step manner. Let $x_0 \in \mathbb{R}^n$ be the center of a population $P$ with radius $r$, and let $y_0 = f(x_0)$ be the value of the objective function $f : \mathbb{R}^n \to \mathbb{R}$ at $x_0$.

1. First, generate a set of $n$ random vectors $\{\Delta x_1, \Delta x_2, \ldots, \Delta x_n\}$, each of length $r$, and define an $n \times n$ matrix
\[ \Delta X = \begin{bmatrix} \Delta x_1 \\ \vdots \\ \Delta x_n \end{bmatrix} \]

2. Now, let $x_i = \Delta x_i + x_0$, and define a vector $y = [y_1 \ldots y_n]^T$ where
\[ y_i = f(x_i) \tag{4.1} \]
for $i = 1, \ldots, n$. It should be clear from (4.1) that each value $y_i$ corresponds to the value of the objective function at a randomly generated point $x_i$, which lies on the perimeter of the population.

3. Finally, compute the gradient approximation using
\[ \nabla f = \Delta X^{-1} \cdot (y - y_0) \tag{4.2} \]

It may appear, and understandably so, that a significant leap has been made between steps 2 and 3. So for the sake of clarity, a short derivation of (4.2) is included. Let $f$ be an n-dimensional scalar function. A linear approximation of $f$ with respect to a point $x_0 = (x_{01}, x_{02}, \ldots, x_{0n})$ may be found using the first two terms of the Taylor series expansion:
\[ f(x) \approx f(x_0) + \frac{\partial f}{\partial x_1}\Delta x_1 + \frac{\partial f}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}\Delta x_n = f(x_0) + \nabla f|_{x_0} \cdot \Delta x \tag{4.3} \]
Solving this equation for the gradient yields
\[ \nabla f|_{x_0} \approx \frac{\Delta f}{\Delta x} = \frac{f(x) - f(x_0)}{x - x_0} \tag{4.4} \]
By generating a point $x$ in the vicinity of $x_0$, it is possible to obtain a numerical approximation of $\nabla f|_{x_0}$ using (4.4). However, for $n > 1$, (4.4) is an invalid expression since it implies division by $\Delta x$, which is a vector and therefore has no inverse. Thus, in order to generalize (4.4), $f$ is evaluated for a set of $n$ linearly independent points, allowing $\Delta x$ and $\Delta f$ to be replaced by the $n \times n$ matrix $\Delta X$ and the $n \times 1$ vector of corresponding function values $\Delta f$. This leads to the generalized form
\[ \nabla f_{n \times 1} \approx \begin{bmatrix} \Delta x_{11} & \cdots & \Delta x_{1n} \\ \vdots & \ddots & \vdots \\ \Delta x_{n1} & \cdots & \Delta x_{nn} \end{bmatrix}^{-1}_{n \times n} \begin{bmatrix} f(x_1) - f(x_0) \\ \vdots \\ f(x_n) - f(x_0) \end{bmatrix}_{n \times 1} = \Delta X^{-1} \cdot \Delta f \tag{4.5} \]
which is equivalent to (4.2).
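In the spirit of the popgrad.m routine listed in Appendix B (the version below is an illustrative sketch, not the appendix code), the approximation (4.2) can be computed as follows:

    % Population-based gradient approximation, eq. (4.2).
    function g = popgrad_sketch(f, x0, r)
        n  = numel(x0);
        y0 = f(x0);
        dX = zeros(n, n);              % rows hold the perturbations dx_i
        y  = zeros(n, 1);
        for i = 1:n
            d = randn(n, 1);
            d = r*d/norm(d);           % random vector of length r
            dX(i, :) = d';
            y(i) = f(x0 + d);          % objective on the population perimeter
        end
        g = dX \ (y - y0);             % solves dX*g = y - y0, i.e. eq. (4.2)
    end

For example, popgrad_sketch(@(x) sum(x.^2), [1; 2], 1e-3) returns a close approximation to the true gradient [2; 4], while larger values of r blur the estimate toward the large-scale trend of the surface.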
4.1.2 A modified LM update rule

Once the gradient is computed, an additional point $x_{n+1}$ is generated in the direction of the approximation. The proposed method uses a modified version of the Levenberg-Marquardt (LM) algorithm [19] update rule for this purpose. For a scalar valued function $f : \mathbb{R}^n \to \mathbb{R}$, the LM update rule may be written
\[ x_{n+1}^{(k+1)} = x_0^{(k)} - \left(\nabla f\, \nabla f^T + \mu I\right)^{-1} \nabla f \cdot y_0^{(k)} \tag{4.6} \]
where $k$ is the current iteration, and $\mu$ acts as a damping factor which reduces oscillation by actively controlling the step size.

One who is familiar with the LM algorithm may notice that the Jacobian has been replaced by the gradient in (4.6). This is because the Jacobian is defined as the matrix of all first-order partial derivatives of a vector valued function. Thus, for a scalar valued function, the Jacobian is, by definition, the transpose of the gradient. Although the proposed method may be extended to handle vector valued functions, the version presented here is intended for use with scalar valued problems. Thus the Jacobian is replaced by the gradient.

This is not the only modification. For the proposed implementation, the rule is also modified to account for the population radius $r$. This is done by adding a term which ensures that the length of the update step will be no smaller than $r$. This modification is needed to ensure that when the algorithm encounters local minima, the step size will exceed the radius of the current population, thus providing greater diversity. The modified version of the rule is written
\[ x_{n+1}^{(k+1)} = x_0^{(k)} - \left(\nabla f\, \nabla f^T + \mu I\right)^{-1} \nabla f \cdot y_0^{(k)} - \frac{\nabla f}{\|\nabla f\|}\, r^{(k)} \tag{4.7} \]
The added term in (4.7) is simply a vector of length $r$ in the direction of $\nabla f$.

It is important to note that the algorithm presented here, using the modified version of the LM update rule, is but one of many possible implementations. In other words, the shaded box in Fig. 4.2 may be replaced with any number of existing gradient methods without compromising the underlying principles. The only requirement is that the gradient be calculated in the described fashion. In fact, the proposed method was specifically designed with this sort of flexibility in mind. Therefore, the update rule may be readily exchanged with another second order method such as the BFGS method [20] or the closely related Davidon-Fletcher-Powell algorithm, which has been shown to outperform LM in certain applications [21].
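A hedged one-step sketch of (4.7) in MATLAB, assuming x0, y0 = f(x0), the approximate gradient g (e.g. from popgrad_sketch above), a damping factor mu, and the current radius r are already in scope:

    % Modified LM step, eq. (4.7): damped rank-one update plus a length-r push.
    n = numel(x0);
    A = g*g' + mu*eye(n);                    % damped (grad*grad') matrix of (4.6)
    x_next = x0 - A\(g*y0) - (g/norm(g))*r;  % new point x_{n+1}

The final term guarantees a component of at least length r along the descent direction, which is what lets the method step clear of minima smaller than the current population radius.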
4.1.3 Assembling the population

The population $P$ is generated in two parts whose combined size is denoted by the parameter $N$. The primary population,
\[ A = \{x_1, \ldots, x_{n+1}\}, \tag{4.8} \]
is created in the two previous steps, and is of size $n + 1$, while the secondary population,
\[ B = \{x_{n+2}, \ldots, x_N\}, \tag{4.9} \]
consists of a set of $m$ points which are randomly distributed within the region defined by $x_0$ and $r$. Here, the value of $m$ is a user defined parameter, and may be assigned any non-negative integer value including zero. This leads to the following definitions for $N$ and $P$:
\[ N = n + 1 + m, \tag{4.10} \]
\[ P = A \cup B = \{x_1, \ldots, x_N\}. \tag{4.11} \]
While the inclusion of $B$ in $P$ can provide a noticeable increase in the rate of convergence when applied to relatively complex optimization surfaces, in general, it is sufficient to let $m = 0$.

4.1.4 The selection process

Once $P$ has been constructed according to (4.11), its members are ranked with respect to fitness. Next, the fittest member $x_{opt} \in P$ is chosen, and the following conditional, also shown in Fig. 4.2, is evaluated:

    IF f(x_opt) < f_Tol THEN
        Terminate algorithm.
    ELSE IF f(x_opt) >= f(x_0) THEN
        Adjust population radius r.
    ELSE
        x_0 = x_opt.
    END IF

where $f_{Tol}$ is the maximum acceptable value for the objective. Updating in this way essentially recycles the information used in the approximation of the gradient, which would otherwise be thrown out. This process also helps to guarantee the algorithm's stability, since it is equivalent to running two methods in parallel. That is, if the generated step does not result in a lower value of the objective, the method will resort to selecting the best of the points used in the gradient approximation. In this way, it is still possible for the method to proceed. Furthermore, because the current best solution is included in the selection process, it is guaranteed that the value of the objective will not increase from one iteration to the next. Therefore, while the instability of the update rule may affect the algorithm's convergence, the algorithm itself will remain stable as a result of selection.

4.1.5 Adjustment of the population radius

Any number of radius update rules may be devised; however, the version presented here is perhaps the simplest. The method uses a fixed step size $\Delta$ to modify the radius over a bounded, user defined interval $[r_{min}, r_{max}]$, and is subject to the following conditions:

    IF r <= r_max THEN
        r^(k+1) = r^(k) + Delta.
    ELSE
        r^(k+1) = r_min.
    END IF

where $k$ is the number of the current iteration. Although this method is simple, it proves to be adequate. The two features which make the method so basic are the use of a resetting value of $r$ and the constant step size $\Delta$. The constant step size serves the purpose of increasing the population radius when the algorithm encounters local minima, while the increase in $r$ causes the gradient approximation to become less sensitive to local behavior, thereby allowing it to escape the traps introduced by complex behaviors. Conversely, if greater accuracy of approximation becomes necessary for local search, the algorithm is still able to proceed once $r$ is reset.
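Rendered in MATLAB, the rule is a two-line conditional (the variable names dr, r_min, and r_max are illustrative, not taken from the appendix code):

    % Radius update of Section 4.1.5: grow by a fixed step, then reset.
    if r <= r_max
        r = r + dr;    % expand: the approximation sees larger-scale trends
    else
        r = r_min;     % reset: return to fine-grained local search
    end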
4.2 Test Functions

Four unique functions were used for testing. Each function exhibits features deemed relevant for the purpose of comparison. All four functions are presented in generalized form, with $n$ being the dimension of the search space.

4.2.1 Test Function 1
\[ T_1(x) = \sum_{i=1}^{n} \left(\frac{x_i}{4}\right)^4, \quad x_i \in [-10, +10] \tag{4.12} \]
The first test function, $T_1$, has a simple convex, continuous, parabolic surface with a minimum of $T_1(x) = 0$ located at the origin. The function is used for two primary purposes. First, it offers a fair comparison of the proposed method with some of the more common gradient methods. Second, it highlights the strengths of such methods, which are superior to evolutionary methods when applied to problems of this type. Also, because the function is fourth order, it will highlight the difference between algorithms of higher and lower orders.

4.2.2 Test Function 2
\[ T_2(x) = \sum_{i=1}^{n} \left(\frac{\lfloor x_i \rfloor}{4}\right)^4, \quad x_i \in [-10, +10] \tag{4.13} \]
The second test function, inspired by the staircase illustration at the beginning of this chapter, is of an identical form to that of $T_1$, except that the variables have been floored in order to make it piecewise constant. $T_2$ has a minimum value of 0 for all $x = [x_1 \ldots x_n]$ which satisfy $x_i \in [0, 1)$ for $i = 1, \ldots, n$. The features of this function are useful for testing the hypothesis made at the start of the chapter: if the proposed method operates as intended, there should be little difference in performance between $T_1$ and $T_2$.

[Figure 4.3: Test Function 2 (surface plot for n = 2).]

4.2.3 Test Function 3
\[ f(x) = \frac{1}{2}\sqrt{\sum_{i=1}^{n} \lfloor x_i \rfloor^2}, \quad x_i \in [-100, +100] \]
\[ g(x) = \frac{x}{4} + (1 - \cos(\pi x))\left(\tanh\left(\frac{x}{4}\right) - 1\right)^2 \]
\[ T_3 = g \circ f \tag{4.14} \]
Like $T_2$, $T_3$ is also piecewise constant, but with the addition of local minima, which are the most common challenge faced by gradient methods. Though evolutionary methods may also become susceptible to these traps if population diversity becomes too low, they are much better suited for this type of problem. $T_3$ has a minimum value of 0 for all $x = [x_1 \ldots x_n]$ which satisfy $x_i \in [0, 1)$ for $i = 1, \ldots, n$.

[Figure 4.4: Test Function 3 (surface plot for n = 2).]

4.2.4 Test Function 4
\[ T_4(x) = \frac{1}{2}\sum_{i=1}^{n}\left(x_i^2 + \tan(x_i)^2 - 10\cos(2\pi x_i) + 10\right), \quad x_i \in [-100, +100] \tag{4.15} \]
$T_4$ is the most extreme of the test functions. Though it is piecewise continuous, each of the regions of continuity is a deep local minimum. Thus, gradient methods are only capable of minimizing $T_4$ on a local basis. Due to the large number of local minima, evolutionary methods which rely on modified mutation strength for accelerated convergence may also be susceptible to entrapment. The surface of $T_4$ is shown in Fig. 4.5. In reality, the "walls" which separate the local minima in Fig. 4.5 are infinitely high. The minimum of $T_4$ is 0, and is located at the origin.

[Figure 4.5: Test Function 4 (surface plot for n = 2).]
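Under the reconstructed forms (4.12)-(4.15) above, all four test functions can be written as one-line vectorized MATLAB anonymous functions, which is convenient when benchmarking:

    % The four test functions of Section 4.2 (x is a column vector).
    T1 = @(x) sum((x/4).^4);                    % smooth quartic bowl
    T2 = @(x) sum((floor(x)/4).^4);             % floored: piecewise constant
    g  = @(s) s/4 + (1 - cos(pi*s)).*(tanh(s/4) - 1).^2;
    T3 = @(x) g(0.5*sqrt(sum(floor(x).^2)));    % T3 = g o f, eq. (4.14)
    T4 = @(x) 0.5*sum(x.^2 + tan(x).^2 - 10*cos(2*pi*x) + 10);
    T4([0; 0])                                  % returns 0, the global minimum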
-SA-ES: Self-Adaptive Evolution Strategy [22], a member of the larger family of al- gorithms known as Evolutionary Strategies [22], is a powerful evolutionary method which uses self-adapted mutation strength for optimal convergence as well as an ac- celerated local search. The method was chosen as a strong representative of the power of evolutionary methods. 4. CMA-ES: Covariance Matrix Adaptation Evolutionary Strategy [23], also a member of the family of Evolutionary Strategies, is an evolution path related technique which uses search space information in a highly e?cient manner, making it exceptionally fast with respect to other evolutionary methods. 5. PM: The proposed method. The performance of each algorithm on each of the test functions was evaluated over a series of 100 simulation runs using the algorithm parameters listed in Table 4.1. Each algorithm was then evaluated with respect to rate of success as well as the mean number of function evaluations per solution. The tabulated results are presented in Table 4.2, which represents the case n = 2 for all four functions. For each of the test functions, success was deflned by two conditions: 45 1. f(x) < 10?6 2. No more than 105 total evaluations of the objective. The flelds in Table 4.2 highlighted in bold font denote the top performers for each of the test cases. One may notice the similarity between the proposed method and the LM algorithm with regard to the flrst test function. Notice that for this particular function, the behaviors of the two algorithms are nearly identical. This validates the earlier hypothesis that, in the absence of local minima, the two methods should behave in a similar manner. Note too the striking difierence in the rate of convergence between the two second order gradient methods when compared to the flrst order method of steepest descent. Another point of interest is the performance of the proposed method on the second test. It was suggested in Section IV-B that, assuming proper performance, little difierence should be observed between T1 and T2. This is conflrmed by the results in Table 4.2. In fact, the algorithm actually performs better! This behavior is a result of the increased population radius coupled with the modifled LM update rule, which, when used together, increase the rate at which the proposed method traverses the optimization surface. Finally, notice the similarity between the proposed method and that of CMA-ES with respect to T3 and T4. In fact, for T4, the similarities are striking. When applied to these more complex surfaces, the behavior of the algorithm changes drastically, and yet there seems to be little, if any, efiect on the level of performance. What is most striking about the results in Fig. 4.6 is the change in behavior between T1 and T4. As the nature of the test functions becomes more complex, the behavior of the proposed method changes from a second order gradient method to one which bears a striking resemblance to that of evolutionary computation! 
Figure 4.6: A graphical summary of the test results: successful runs versus total function evaluations for PM, CMA-ES, SA-ES, LM, and MSD on Test Functions 1-4 (failed cases marked FAILURE)

Chapter 5

Conclusion

In this thesis, two novel algorithms for numerical optimization were presented. In both cases, the objective was to modify the general form of the traditional first and second order approaches in order to admit global solutions to complex problems, while also retaining the desirable local behavior commonly associated with such approaches. This was motivated by the fact that traditional gradient based approaches are susceptible to entrapment in local minima, and will only produce global convergence if the initial guess happens to lie within the global minimizer's region of attraction. Conversely, while newer methods such as evolutionary computation are quite adept at obtaining global solutions, they are very inefficient with respect to local convergence, making them prohibitively time consuming for many applications.

The first approach, presented in Chapter 3, borrows from evolutionary computation through the use of successive populations of pseudo-randomly generated test points. However, instead of using the mechanisms of combination and random mutation to generate successive populations, the proposed method uses the cumulative information contained within each generation to "migrate" from one point to the next. This migratory behavior allows the algorithm to progress in a way which is similar to gradient methods, while the use of randomly distributed populations makes it possible to overcome the challenges presented by local minima. The latter is done by actively controlling the population's radius, allowing test points to venture beyond the region of local attraction. Not only does this method require very little computational effort, but it was also shown to outperform a common first order method for the purpose of training a neural network.

In Chapter 4, a method was presented by which nearly any first or second order algorithm can be modified to overcome the risk of entrapment. This was done through the use of a variable scale gradient approximation (VSGA), which effectively controls the sensitivity of the search direction with respect to the small and large scale behaviors of the objective. Like its predecessor, the VSGA is extracted using the information contained within a pseudo-randomly generated set of test points. The difference, however, is that the VSGA yields a far more accurate model of the objective function's behavior with respect to scale. Thus, as the radius of the population increases, the VSGA becomes less sensitive to local behavior, allowing it to overcome the challenge of local minima. Conversely, when the radius of the population is quite small, the VSGA becomes a very close approximation to the true gradient, making it quite efficient in local searches as well. To confirm this, the algorithm was tested using a series of test functions chosen to typify some of the most commonly faced difficulties.
The performance of the proposed method was also compared with a set of competing algorithms which included gradient as well as evolutionary methods. The results showed that the proposed method was not only able to emulate both classes of algorithms depending on the behavior of each function, but also to outperform the compared methods in terms of rate of success and average number of evaluations of the objective.

Bibliography

[1] H. A. Abbass, A. M. Bagirov, and J. Zhang, "The discrete gradient evolutionary strategy method for global optimization," Congress on Evolutionary Computation - CEC'03, vol. 1, 8-12 Dec. 2003, pp. 435-442.

[2] M. Manic and B. Wilamowski, "Random Weights Search in Compressed Neural Networks using Overdetermined Pseudoinverse," Proc. of the IEEE International Symposium on Industrial Electronics - ISIE'03, Rio de Janeiro, Brazil, June 9-11, 2003, pp. 678-683.

[3] M. Manic and B. Wilamowski, "Robust Neural Network Training Using Partial Gradient Probing," IEEE International Conference on Industrial Informatics - INDIN'03, Banff, Alberta, Canada, August 21-24, 2003, pp. 175-180.

[4] J. Y. Wen, Q. H. Wu, L. Jiang, and S. J. Cheng, "Pseudo-gradient based evolutionary programming," Electronics Letters, vol. 39, no. 7, 3 April 2003, pp. 631-632.

[5] B. Wilamowski, "Neural Networks and Fuzzy Systems," The Microelectronic Handbook, CRC Press, 2006, chapters 124.1 to 124.8.

[6] J. C. Derenick and J. R. Spletzer, "Convex Optimization Strategies for Coordinating Large-Scale Robot Formations," IEEE Transactions on Robotics, vol. 23, no. 6, pp. 1252-1259, Dec. 2007.

[7] Seok-Beom Roh, W. Pedrycz, and Sung-Kwun Oh, "Genetic Optimization of Fuzzy Polynomial Neural Networks," IEEE Trans. on Industrial Electronics, vol. 54, no. 4, pp. 2219-2238, Aug. 2007.

[8] L. dos Santos Coelho and B. M. Herrera, "Fuzzy Identification Based on a Chaotic Particle Swarm Optimization Approach Applied to a Nonlinear Yo-yo Motion System," IEEE Trans. on Industrial Electronics, vol. 54, no. 6, pp. 3234-3245, Dec. 2007.

[9] S.-K. Oh, W. Pedrycz, and H.-S. Park, "A New Approach to the Development of Genetically Optimized Multilayer Fuzzy Polynomial Neural Networks," IEEE Trans. on Industrial Electronics, vol. 53, no. 4, pp. 1309-1321, Aug. 2006.

[10] Faa-Jeng Lin, Po-Kai Huang, and W.-D. Chou, "Recurrent-Fuzzy-Neural-Network-Controlled Linear Induction Motor Servo Drive Using Genetic Algorithms," IEEE Trans. on Industrial Electronics, vol. 54, no. 3, pp. 1449-1461, June 2007.

[11] F. Grimaccia, M. Mussetta, P. Pirinoli, and R. E. Zich, "Genetical Swarm Optimization (GSO): a class of Population-based Algorithms for Antenna Design," First International Conference on Communications and Electronics - ICCE'06, pp. 467-471, 10-11 Oct. 2006.

[12] F. Grimaccia, M. Mussetta, P. Pirinoli, and R. E. Zich, "Optimization of a reflectarray antenna via hybrid evolutionary algorithms," 17th International Zurich Symposium on Electromagnetic Compatibility - EMC-Zurich 2006, pp. 254-257, 27 Feb.-3 March 2006.

[13] Jinn-Tsong Tsai, Jyh-Horng Chou, and Tung-Kuan Liu, "Optimal design of digital IIR filters by using hybrid taguchi genetic algorithm," IEEE Trans. on Industrial Electronics, vol. 53, no. 3, pp. 867-879, June 2006.
[14] Yang Yu and Yu Xinjie, "Cooperative Coevolutionary Genetic Algorithm for Digital IIR Filter Design," IEEE Transactions on Industrial Electronics, vol. 54, no. 3, pp. 1311-1318, June 2007.

[15] R. Salomon, "Evolutionary algorithms and gradient search: similarities and differences," IEEE Transactions on Evolutionary Computation, vol. 2, no. 2, pp. 45-55, Jul. 1998.

[16] M. Bundzel and P. Sincak, "Combining Gradient and Evolutionary Approaches to the Artificial Neural Networks Training According to Principles of Support Vector Machines," IEEE International Joint Conference on Neural Networks - IJCNN'06, pp. 2068-2074, 16-21 July 2006.

[17] X. Hu, Z. Huang, and Z. Wang, "Hybridization of the multi-objective evolutionary algorithms and the gradient-based algorithms," Congress on Evolutionary Computation - CEC'03, vol. 2, pp. 870-877, 8-12 Dec. 2003.

[18] J. Hewlett, B. Wilamowski, and G. Dundar, "Merge of Evolutionary Computations with Gradient Based Method for Optimization Problems," IEEE International Symposium on Industrial Electronics - ISIE'07, Vigo, Spain, June 4-7, 2007.

[19] D. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear Parameters," SIAM Journal on Applied Mathematics, vol. 11, pp. 431-441, 1963.

[20] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., New York: Springer, 2006, pp. 136-143.

[21] S. Abid, A. Mouelhi, and F. Fnaiech, "Accelerating the Multilayer Perceptron Learning with the Davidon Fletcher Powell Algorithm," International Joint Conference on Neural Networks - IJCNN'06, pp. 3389-3394, 2006.

[22] H.-G. Beyer and H.-P. Schwefel, "Evolution Strategies: A comprehensive introduction," Natural Computing, vol. 1, pp. 3-52, 2002.

[23] N. Hansen, S. D. Müller, and P. Koumoutsakos, "Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES)," Evolutionary Computation, vol. 11, no. 1, pp. 1-18, 2003.

[24] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., New York: Springer-Verlag, 2006.

Appendices

A.1 QGDM.m

function [WW1,WW2,OP,ser,ite,k]=QGDM(fun,IP,DP,WW1,WW2,k1,k2,emax,nmax,nnp,r1)
% [WW1,WW2,OP,ser,ite,k]=QGDM(fun,IP,DP,WW1,WW2,k1,k2,emax,nmax,nnp,r1)
%   - error backpropagation for 2 layers
% uses: versize2, augm, forw1l, oerror, fsbip, fslin, fsuni, newwt3
%
% Arguments:
% fun  - activation function 'fsbip'
% IP   - [np,ni] matrix of inputs (not augmented).
% DP   - [np,no] matrix of desired outputs
% WW1  - [nh,ni] weight matrix for the first layer
% WW2  - [no,nh] weight matrix for the second layer
% ni   - number of inputs
% nh   - number of hidden neurons
% no   - number of outputs
% np   - number of patterns
%        not working for unipolar
% k1   - neuron gain at net=0 for first layer
% k2   - neuron gain at net=0 for second layer
% emax - max normalized error
% nmax - max number of iterations
% nnp  - number of points per population
% r1   - starting radius
% Returns:
% WW1  - [nh,ni] weight matrix for the first layer
% WW2  - [no,nh] weight matrix for the second layer
% OP   - [np,no] matrix of outputs
% ser  - vector of normalized errors as function of iterations
% ite  - number of total iterations

if nargin ~= 11
    error('Wrong number of arguments');
end;
[ni,no,np,nh]=versize2(IP,DP,WW1,WW2);
IP1=augm(IP);
step=1; je=0; c1=WW1; c2=WW2; r=r1; k=0;
%---------------Calculate network output (OP2)-----------------%
[NET1,OP1,GAIN1] = forw1l(fun,IP1,WW1,k1);
IP2=augm(OP1);
[NET2,OP2,GAIN2] = forw1l(fun,IP2,WW2,k2);
%--------------------------------------------------------------%
%-----------------Calculate initial error (e)------------------%
e = oerror(DP,OP2);
%--------------------------------------------------------------%
for ite=1:nmax
    [WW1,WW2,r,e,OP,c1,c2]=newwt3(WW1,WW2,r,e,DP,IP1,fun,k1,k2,c1,c2,nnp);
    if (ite/step == round(ite/step)) | (ite==1)
        je=je+1;
        ser(je)=e;
        %disp(sprintf('ite=%5d error=%12.10f',ite,e));
    end;
    %WW1
    %WW2
    if e < emax
        break;              % terminate once the error target is reached
    end;
end;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fsbip(net,k)
% [o,gain] = fsbip(net,k) - bipolar sigmoidal function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
o = 2*ones./(ones+exp(-2*k*net))-ones;
gain = k.*(1-o.*o);
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fslin(net,k)
% [o,gain] = fslin(net,k) - semi linear function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
o = k*net;
for i=1:length(o)
    if o(i)<0
        o(i)=0;
    end;
end
gain = k;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [o,gain] = fsuni(net,k)
% [o,gain] = fsuni(net,k) - unipolar sigmoidal function
% uses:
%
% Arguments:
% net - input variable
% k   - gain at net=0
% Returns:
% o    - output value
% gain - neuron gain
if nargin > 2
    error('Wrong number of arguments.');
end
if nargin == 1
    k=1;
end
%k=4*k;
[nr,nc] = size(net);
o = ones./(ones+exp(-k.*net));
gain = k.*(1-o).*o;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [ni,no,np,nh]=versize2(IP,DP,WW1,WW2)
% [ni,no,np,nh]=versize2(IP,DP,WW1,WW2) - verification of matrix sizes
% for two layer backpropagation input
% uses:
%
% Arguments:
% IP  - np*ni matrix of input patterns (not in augmented space)
% DP  - np*no matrix of desired patterns
% WW1 - weight matrix for first layer (for augmented space)
% WW2 - weight matrix for second layer (for augmented space)
%       network structure is extracted from the weight matrices
% Returns:
% ni - number of inputs (in augmented space)
% no - number of outputs
% np - number of patterns
% nh - number of hidden neurons
if nargin ~= 4
    error('Wrong number of arguments.');
end
[npi,ni]=size(IP);
[npo,no]=size(DP);
[nh,niw1]=size(WW1);
[now,niw2]=size(WW2);
if npi~=npo
    npi, npo, error('Number of input and output patterns differs');
end
if ni~=niw1-1
    ni, niw1, error('WW1 does not match input patterns');
end
if no~=now
    no, now, error('WW2 does not match output patterns');
end
if nh~=niw2-1
    nh, niw2, error('WW2 does not match hidden neurons');
end
np=npi;
return
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
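For illustration, QGDM might be invoked as in the following sketch, which trains a 2-2-1 network on the XOR problem. The patterns, gains, error target, and radius below are illustrative values, not settings reported in the thesis.

% Hypothetical usage of QGDM (illustrative values throughout).
IP  = [0 0; 0 1; 1 0; 1 1];   % [np,ni] input patterns
DP  = [-1; 1; 1; -1];         % [np,no] desired (bipolar) outputs
WW1 = 2*rand(2,3)-1;          % [nh,ni+1] first layer weights (augmented)
WW2 = 2*rand(1,3)-1;          % [no,nh+1] second layer weights (augmented)
[WW1,WW2,OP,ser,ite] = QGDM('fsbip',IP,DP,WW1,WW2,1,1,1e-4,1000,10,0.5);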
B.1 VSGA.m

function [x0,y0] = vsga(obj,x0,y_max,k_max,r_lim,delta,m)
%VSGA Optimizes objective functions with complex surfaces.
%   VSGA uses a "(V)ariable (S)cale (G)radient (A)pproximation" to solve
%   problems of the form:
%
%       min f(X)
%        x
%
%   where X and the values returned by f can be scalars or vectors.
%
%   [X0,Y0] = VSGA(OBJ,X0,Y_MAX,K_MAX,R_LIM,DELTA,M) returns the minimizer
%   X0 and its corresponding objective value Y0.
%
%   Input arguments:
%   OBJ    [handle] : Handle of the objective function
%   X0     [1xn]    : Initial starting point
%   Y_MAX  [scalar] : Maximum allowable tolerance for Y0
%   K_MAX  [scalar] : Maximum number of iterations
%   R_LIM  [1x2]    : Radius limits [r_min,r_max]
%   DELTA  [scalar] : Radius increment used when progress stalls
%   M      [scalar] : Secondary population size
%
%   For details of the algorithm's operation, see
%
%   Hewlett, J.D.; Wilamowski, B.M.; Dundar, G., "Optimization Using a
%   Modified Second-Order Approach With Evolutionary Enhancement,"
%   Industrial Electronics, IEEE Transactions on, vol. 55, no. 9,
%   pp. 3374-3380, Sept. 2008
%
%   OBJ - The objective function is defined as a separate MATLAB function.
%   For example,
%
%       function [y]=testT1(x)
%       y=((.25*x(1))^4+(.25*x(2))^4);
%       return
%
%   Example:
%
%       [x0,y0] = vsga(@testT2,[100,100],1e-6,100,[1e-6,3],1,10);
%
%   minimizes testT2.

%   Joel Hewlett, Mar. 2009
%   Auburn University Department of Electrical and Computer Engineering
%   $Revision: 1.0 $  $Date: 2009/03/27 13:42:26 $

%---------------------------------------%
%    Initialize constants/variables     %
%---------------------------------------%
output = ['%1.8e <== Total Error ',...
          '%1.1e <== radius %d <== iteration'];
y0 = feval(obj,x0);              % Initial value of the objective
y_best = [y0,zeros(1,k_max-1)];  % Holds value of y0 for each iteration
r = r_lim(1);                    % Set initial radius equal to r_min
I = eye(length(x0));             % nxn identity matrix
mu = 0.1;                        % Learning rate for LM
mut = mu;                        % Stores mu when radius changes
chk = 0;                         % Set when radius is growing
k = 1;                           % Iteration counter
rcount = 0;

%---------------------------------------%
%            Begin Main Loop            %
%---------------------------------------%
while k <= k_max,
    k = k + 1;
    %---------------------------------------%
    %            Compute Gradient           %
    %---------------------------------------%
    [grad,x_opt] = popgrad(x0,obj,r,m); % Calculate gradient approximation
    y_opt=feval(obj,x_opt);
    Hessian = grad'*grad;
    if rcond(Hessian)<0.5     % If Hessian is not well conditioned,
        Hessian=0*Hessian;    % set it equal to 0.
    end
    count = 0;                % Counter for LM mu update
    temp = y0;
    %---------------------------------------%
    %           Apply Update Rule           %
    %---------------------------------------%
    while (1),
        step=((Hessian+mu*I)\grad'*y0)';  % Levenberg-Marquardt step
        dir_step=step/sqrt(step*step');   % Diversity offset
        x_np1=x0-(step+r*dir_step);
        y_np1=feval(obj,x_np1);
        if chk == 0;
            if y_np1<=temp
                temp = y_np1;
                if mu>1e-50, mu=mu/10; end;
                break;
            end;
            if mu<1e+50, mu=mu*10; end;
        end
        count = count+1;
        if count>2, break; end;
    end;
    %---------------------------------------%
    %          Assemble Population          %
    %---------------------------------------%
    P=[x_opt;x0;x_np1];
    y=[y_opt,y0,y_np1];
    %---------------------------------------%
    %           Perform Selection           %
    %---------------------------------------%
    ndex=find(y==min(y));        % Find index of best y value
    y0=y(ndex(1));               % Update y0
    x0=P(ndex(1),:);             % Update x0
    y_best(k) = y0;
    disp(sprintf(output,y0,r,k)); % Display progress to the command window
    if y0 < y_max                % Check if y0 is sufficiently small
        break;
    end
    %---------------------------------------%
    %     Check Progress/Adjust Radius      %
    %---------------------------------------%
    if (y_best(k) == y_best(k-1))  % Check progress
        chk=1;
        if r >= r_lim(2)           % If r has reached its upper limit,
            r=r_lim(1);            % reset r.
        else
            r = r + delta;         % Update r
        end
        mu = 0.1;
    else
        if chk == 1;
            mu = mut;              % Restore mu
        else
            mut = mu;              % Store current mu
        end
        chk=0;
    end
end

%---------------------------------------%
%        Plot y0 as function of k       %
%---------------------------------------%
semilogy(y_best)
xlim([1 k]);
xlabel('k');
ylabel('y0');
title('Run Summary')
return

B.2 popgrad.m

function [grad,x_opt] = popgrad(x0,obj,r,m)
%POPGRAD Calculate a population based gradient approximation.
%   GRAD = POPGRAD(X0,OBJ,RADIUS,M) returns the gradient approximation of
%   OBJ evaluated at the point X0, using a population radius of RADIUS.
%
%   [GRAD,X_OPT] = POPGRAD(X0,OBJ,RADIUS,M) also returns the vector X_OPT,
%   which corresponds to the member of the generated population with the
%   lowest value for the objective function OBJ. If M = 0, X_OPT is simply
%   chosen from the N points used to calculate GRAD. If M > 0, M randomly
%   generated points within the region defined by X0 and RADIUS are also
%   considered.
%
%   Input arguments:
%   X0     : 1xn vector
%   OBJ    : function handle
%   RADIUS : scalar (>0)
%   M      : scalar (>=0)
%
%   OBJ - The objective function is defined as a separate MATLAB function.
%   For example,
%
%       function [y]=testT1(x)
%       y=((.25*x(1))^4+(.25*x(2))^4);
%       return
%
%   Example:
%
%       [g,x] = popgrad([10 10],@testT1,1e-6,10);
%
%   finds the values of g and x for testT1 using a population radius of
%   1e-6 and a secondary population size of 10.

%   Joel Hewlett, Mar. 2009
%   Auburn University Department of Electrical and Computer Engineering
%   $Revision: 1.0 $  $Date: 2009/03/27 10:07:26 $

%---------------------------------------%
%        Generate Population (P)        %
%---------------------------------------%
n = length(x0);              % Size of primary population
A = 2*rand(n)-1;
A = A./(ones(n,1)*sqrt(sum(A'.^2)))'*r;
A = A+ones(n,1)*x0;          % Primary population (A)
if m > 0
    B=(2*rand(m,n)-1);
    B=B./(ones(n,1)*sqrt(sum(B'.^2)))'*r;
    B=B + ones(m,1)*x0;      % Secondary population (B)
    P=[A;B];                 % Combined population (P)
else
    P=A;                     % Combined population (P)
end
%---------------------------------------%
%    Evaluate Population Fitness (y)    %
%---------------------------------------%
y0 = feval(obj,x0);
lambda = n+m;                % Size of combined population
for i = 1:lambda,
    y(i) = feval(obj,P(i,:));
end
%---------------------------------------%
%      Approximate Gradient (grad)      %
%---------------------------------------%
dy = y(1:n)-y0;
dx = P(1:n,:)-ones(n,1)*x0;
grad = (pinv(dx)*dy')';
%---------------------------------------%
%     Perform preliminary selection     %
%---------------------------------------%
y(end+1) = y0;
P(end+1,:)= x0;
index = find(y==min(y));
x_opt = P(index(1),:);
return
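As a brief illustration of the variable-scale behavior, popgrad can be called with different radii on a multimodal objective. The sketch below is not part of the original appendix; the Rastrigin-like objective and the starting point are illustrative.

% Illustration: the same routine yields a local or a large-scale view of
% the surface depending on the population radius r.
rast = @(x) sum(x.^2 - 10*cos(2*pi*x) + 10);  % hypothetical multimodal objective
x0 = [3.1, -2.9];                             % a point near a local minimum
g_local  = popgrad(x0,rast,1e-6,0);           % tiny r: close to the true gradient
g_global = popgrad(x0,rast,3,0);              % large r: differences span several
                                              % ripples, emphasizing the x.^2 bowl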