Improved Nelder Mead's Simplex Method and Applications
by
Nam Dinh Pham
A dissertation submitted to the Graduate Faculty of
Auburn University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
Auburn, Alabama
May 7, 2012
Keywords: Nelder Mead's simplex method, quasi gradient method,
lossy filter, Error Back Propagation, Levenberg Marquardt, neural networks
Copyright 2012 by Nam Dinh Pham
Approved by
Bogdan Wilamowski, Chair, Professor of Electrical and Computer Engineering
Lloyd Stephen Riggs, Professor of Electrical and Computer Engineering
Thaddeus Roppel, Associate Professor of Electrical and Computer Engineering
Michael Baginski, Associate Professor of Electrical and Computer Engineering
Wei-Shinn Ku, Assistant Professor of Computer Science and Software Engineering
Abstract
Derivative-free optimization algorithms are often used when function derivatives are
difficult or time-consuming to obtain. Nelder Mead's simplex method is one of the most popular
derivative-free optimization algorithms in engineering, statistics, and the sciences. It is favored
and widely used because of its fast convergence and simplicity. The simplex method converges
very well on small-scale problems with a few variables, but it has little success on large-scale
problems with many variables, and this shortcoming has significantly reduced its popularity in
optimization. Two quasi-gradient solutions are introduced to improve its convergence rate and
convergence speed. The key contribution of this work is an improved algorithm with a higher
success rate and faster convergence that still maintains the simplicity of the original method.
The improved algorithm is compared on several benchmark functions with the original simplex
method and with other popular optimization algorithms such as the genetic algorithm, the
differential evolution algorithm, and the particle swarm algorithm, and the comparison results
are reported and discussed.
Acknowledgments
I would like to express my deepest appreciation to my advisor, Professor B. M.
Wilamowski. Without his patience and guidance I would not be in the position I am today. From
him I have learned a multitude of things, of which engineering is only the tip of the iceberg.
Table of Contents
Abstract .......................................................................................................................................... ii
Acknowledgements ........................................................................................................................ iii
List of Tables ............................................................................................................................... vii
List of Figures ............................................................................................................................. viii
List of Abbreviations ......................................................................................................................x
1 Chapter 1: Introduction .............................................................................................................1
1.1 Genetic Algorithm ...............................................................................................................5
1.2 Differential Evolution Algorithm .......................................................................................6
1.3 Particle Swarm Optimization ..............................................................................................7
1.4 Nelder Mead's Simplex Algorithm ......................................................................8
2 Chapter 2: Improved Simplex Method with Quasi Gradient Methods ...................................13
2.1 Deficiency of Nelder Mead's Simplex Method ................................................................13
2.2 Quasi Gradient Methods ...................................................................................................15
2.2.1 Quasi Gradient Method Using an Extra vertex ......................................................16
2.2.2 Quasi Gradient Method Using a Hyper Plane Equation .......................................17
2.3 Testing Functions ...............................................................................................................20
2.4 Experimental Results ........................................................................................................24
3 Chapter 3: Synthesize Lossy Ladder Filters with Improved Simplex Method .......................35
3.1 Filter Synthesis Algorithms ..............................................................................................35
3.1.1 Butterworth Low-pass Filter ................................................................................35
3.1.2 Chebyshev Low-pass Filter...................................................................................37
3.1.3 Inverse Chebyshev Low-pass Filter .....................................................................38
3.1.4 Cauer Elliptic Low-pass Filter .............................................................................39
3.2 Ladder Prototype Synthesis Algorithms ...........................................................................41
3.2.1 Design Ladder Low-pass Prototype without Zeros ..............................................42
3.2.2 Design Ladder Low-pass Prototype with Zeros ....................................................44
3.3 Transformation from Low-pass Filters to Other Type Filters ..........................................46
3.4 Lossy Filter Synthesis with Improved Simplex Method ...................................................47
4 Chapter 4: Training Neural Networks with Improved Simplex Method ................................53
4.1 Artificial Neural Networks ...............................................................................................53
4.1.1 Neural Network Architectures ..............................................................................54
4.1.2 Error Back Propagation Algorithm .......................................................................58
4.1.3 Levenberg Marquardt Algorithm ..........................................................................63
4.2 Training Neural Networks with Improved Simplex Method ............................................67
4.3 Control Robot Arm Kinematics with Improved Simplex Method ....................................71
5 Chapter 5: Conclusions ...........................................................................................................77
References: ....................................................................................................................................79
Appendix 1: Nelder Mead's simplex method ...............................................................................86
Appendix 2: Improved simplex method with quasi-gradient method using an extra point .........90
Appendix 3: Improved simplex method with quasi-gradient method using a hyper plane ..........94
Appendix 4: Test function .............................................................................................................98
List of Tables
Table 2.1: Coefficients of Kowalik and Osborne function ...........................................................22
Table 2.2: Coefficients of Bard Functions ....................................................................................23
Table 2.3: Comparison of derivative free optimization algorithms of 20-dimensional function ..24
Table 2.4: Comparison of derivative free optimization algorithms of 10-dimensional function ..31
Table 2.5: Evaluation of success rate and computing time of 15-dimensional functions .............31
Table 2.6: Evaluation of success rate and computing time of 20-dimensional functions ..................32
Table 2.7: Evaluation of success rate and computing time of 40-dimensional functions .............32
Table 3.1: Transformation of the low-pass immittances L and C to ladder arms for high-pass,
band-pass, band-reject, and multiple pass-band filters .....................................................46
Table 4.1: Number of neurons/weights required for different parity problems using neural
network architectures ........................................................................................................58
Table 4.2: Comparison of training algorithms with MLP architecture .........................................69
Table 4.3: Comparison of training algorithms with BMLP architecture ......................................70
Table 4.4: Comparison of training algorithms with fully connected cascade architecture ...........71
List of Figures
Fig. 1.1: Triangular simplex △BGW with midpoint M, reflected point R and extended point E ..10
Fig. 1.2: Contracted points C1 and C2, shrinking points S and M toward B ................................11
Fig. 2.1: The triangular simplex △BGW with similar function values at W and G (case (a)) and the
triangular simplex △BGW with similar function values at B and G (case (b)) ......................15
Fig. 2.2: The simplex △BGW with extra point E .............................................................................17
Fig. 2.3: Simulated error curves of De Jong function 1 (a) and De Jong function 1 with moved
axis (b) ..............................................................................................................................27
Fig. 2.4: Simulated error curves of Quadruple function (a) and Powell function (b) ...................27
Fig. 2.5: Simulated error curves of Parallel hyperellipsoid function (a) and Zakarov function (b)
......................................................................................................................................28
Fig. 2.6: Simulated error curves of Schwefel function (a) and sum of different power function (b)
......................................................................................................................................28
Fig. 2.7: Simulated error curves of Step function (a) and Box function (b) .................................29
Fig. 2.8: Simulated error curves of Rosenbrock function (a) and Biggs Exp6 function (b) .........29
Fig. 2.9: Simulated error curves of Kowalik Osborne function (a) and Colville function (b) ......30
Fig. 2.10: Simulated error curves of Wood function (a) and Bard function (b) ..........................30
Fig. 3.1: Butterworth filter response ...............................................................................................36
Fig. 3.2: Chebyshev filter response ...............................................................................................37
Fig. 3.3: Inverse Chebyshev filter response ..................................................................................38
Fig. 3.4: Cauer Elliptic filter response ..........................................................................................39
Fig. 3.5: Doubly terminated ladder network without zeros ..........................................................42
Fig. 3.6: General ladder circuit with presence of zeros ................................................................44
Fig. 3.7: Models of real capacitors and real inductors ..................................................................47
Fig. 3.8: Chebyshev filter circuit with ideal elements ..................................................................48
Fig. 3.9: Chebyshev filter circuit with lossy elements .....................................................................48
Fig. 3.10: Magnitude and phase responses (a) and pole locations (b) of filters ................................49
Fig. 3.11: 4th Chebyshev filter circuit with ideal elements ...........................................................51
Fig. 3.12: 4th Chebyshev filter circuit with lossy elements ...........................................................51
Fig. 3.13: Magnitude and phase responses (a) and pole locations (b) of filters ...........................52
Fig. 4.1: Multilayer perceptron architecture 3-3-4-1 (MLP) ........................................................56
Fig. 4.2: Bridged multilayer perceptron architecture 3-3-4-1 (BMLP) ........................................56
Fig. 4.3: Fully connected cascade architecture 3-1-1-1 (FCC) .....................................................57
Fig. 4.4: Neural network with one input layer and one output layer ............................................59
Fig. 4.5: Neural network with one hidden layer ...........................................................................60
Fig. 4.6: Multilayer perceptron neural network to train parity-N problems .................................68
Fig. 4.7: Bridged multilayer perceptron neural network to train parity-N problems ....................69
Fig. 4.8: Fully connected cascade neural network to train parity-N problems .............................70
Fig. 4.9: Two-link planar manipulator ..........................................................................................72
Fig. 4.10: Neural network architecture to control robot arm kinematics ......................................73
Fig. 4.11: Desired output (a) and actual output (b) from the neural network in x direction .........73
Fig. 4.12: Desired output (a) and actual output (b) from the neural network in y direction .........74
Fig. 4.13: Desired output of a function f .......................................................................................75
Fig. 4.14: Output from fuzzy system (a), output from neural network (b) ...................................76
List of Abbreviations
SIM: Nelder Mead's Simplex Method
EBP: Error Back Propagation
LM: Levenberg Marquardt
LMS: Least Mean Square
MLP: Multilayer Perceptron
BMLP: Bridged Multilayer Perceptron
FCC: Fully Connected Cascade
RBF: Radial Basis Function
LVQ: Learning Vector Quantization
Chapter 1
Introduction
The desire for optimality is inherent in human nature: a manufacturer wants to produce
its products at the lowest cost, and a delivery company wants to route deliveries to all its
distributors over the shortest distance to save fuel, time, etc. These are typical examples to
which optimization theory can be applied to give optimal solutions. Since the advent of
computers, mathematical theories of optimization have been developed and applied widely.
With its computing power, the computer can implement optimization methods very efficiently
in terms of time and cost. The goal of optimization theory is the creation of reliable methods to
optimize models through an intelligent process. Applications of these theories play increasingly
important roles in modern engineering, planning, etc.
In real life, scientists, engineers, and managers often collect a great deal of data and then
face the difficult question of how to trade off different factors to obtain the desired results.
Optimization is the process of finding the best solution by evaluating combinations of these
factors. Many engineering problems can be formulated as optimization problems, such as
process design, logistics, process synthesis and analysis, telecommunication networks, finding
an optimal trajectory for a robot arm, the optimal thickness of steel in pressure vessels, etc. [1].
In practice, optimization algorithms are able to solve these problems, but finding the best
solution is often neither easy nor straightforward because the problems involve large search
spaces. The challenge is even greater in real-life systems, which require optimal solutions in an
acceptable amount of time.
Optimization is a useful and important tool in decision science and in the analysis of
physical systems. To use this tool, an objective function has to be defined; it can be cost, profit,
time, etc. Normally, an objective function is modeled with unknown variables that describe its
characteristics, and optimization algorithms determine values of these variables that meet the
requirements of the objective function. If the model is too simplistic, the solution will not
reflect useful insights into the practical system. If the model is too complex, optimization
algorithms may fail to give a solution. Therefore, models usually have to be rich enough to be
useful yet simple enough to be handled by the computer. There are numerous optimization
algorithms. Each is developed to solve a particular class of problems, and each has its own
strengths and weaknesses. Users usually have to evaluate a model and decide which algorithm
is suited to it [2].
Discrete and continuous optimization: discrete optimization problems are also known as
integer programming problems. In a discrete optimization problem, a solution makes sense only
if its variables are integers. A common strategy for meeting this constraint is to solve the
problem with real variables and then round them to the closest integers, but this approach is by
no means guaranteed to give optimal solutions. In contrast, continuous optimization problems
are easier to solve because of the smoothness of continuous functions: the objective is defined
over an infinite set of real-valued points, so information at any one point can be used to infer
the function's behavior nearby. The same reasoning cannot be applied to discrete optimization
problems with a finite set of solutions, where points that are close together may have very
different function values.
Constrained and unconstrained optimization: constrained optimization problems arise
from models that place constraints on the variables. These can be bounds on the input variables
or constraints that reflect relationships among variables, etc. Unconstrained optimization
problems can be considered particular cases of constrained optimization in which the
constraints can be ignored without affecting the solution. Alternatively, the constraints can be
folded into the objective function of an unconstrained problem as penalty terms.
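The penalty idea just mentioned can be sketched in a few lines of Python (an illustration, not taken from the dissertation; the objective, the constraint, the grid optimizer, and the penalty weights are all invented): minimize x² subject to x ≥ 1 by adding a quadratic penalty for constraint violation and increasing its weight μ.

```python
def penalized(x, mu):
    # objective x^2 plus a quadratic penalty for violating x >= 1
    violation = max(0.0, 1.0 - x)
    return x * x + mu * violation ** 2

def minimize_1d(f, lo=-5.0, hi=5.0, steps=100001):
    # a crude grid search stands in for any unconstrained optimizer
    return min((lo + i * (hi - lo) / (steps - 1) for i in range(steps)), key=f)

# As the penalty weight mu grows, the unconstrained minimizer mu/(1+mu)
# of the penalized objective approaches the constrained optimum x* = 1.
solutions = {mu: minimize_1d(lambda x: penalized(x, mu))
             for mu in (1.0, 10.0, 1000.0)}
```

With μ = 1 the unconstrained minimizer sits at 0.5, well inside the infeasible region; by μ = 1000 it is within about 0.001 of the constrained optimum.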
Global and local optimization: local optimization algorithms converge much faster than
global ones. However, a local solution is only the minimum within its neighborhood and is not
guaranteed to be the global solution, i.e., the best of all minima.
Stochastic and deterministic optimization: in some optimization problems the model
cannot be fully defined because it depends on quantities that are unknown at the time of
formulation. Normally, a modeler can predict the unknown quantities with some degree of
confidence. Stochastic optimization algorithms use these quantifications of the uncertainty to
produce solutions that optimize the expected performance of the model. Deterministic
optimization algorithms, in contrast, assume that the model is fully specified.
Each optimization algorithm uses different techniques to converge iteratively to an
optimal solution. Some use first derivatives, second derivatives, or function values, etc.; some
accumulate information from previous iterations to guide subsequent steps toward the target.
The optimization technique is the key that differentiates one algorithm from another. A good
optimization algorithm should possess the following properties:
• Robustness: the algorithm has the ability to converge on a wide range of problems in its
category.
• Efficiency: the algorithm can converge without excessive computing cost, understood as
both computing time and storage.
• Accuracy: the algorithm can give solutions with precision and is not overly sensitive to
errors when implemented on computers.
The Nelder Mead's simplex method [3] is a popular derivative-free optimization
algorithm and a method of choice for many practitioners; it is, for example, the algorithm
behind MATLAB's fminsearch function. It converges relatively fast and can be implemented
relatively easily compared with classical algorithms relying on gradients or on evolutionary
computation. Unlike gradient methods [4], [5], the simplex method can optimize a function
without calculating its derivatives, which usually require considerable computing power and
become expensive in high-dimensional problems. This property gives it an advantage over such
methods.
Although the simplex method is simple and robust in small-scale optimization, it easily
fails in large-scale optimization. To become a reliable optimization tool, it has to overcome this
shortcoming by improving its convergence rate and convergence speed. This dissertation gives
some new insight into why the simplex method may become inefficient in high-dimensional
optimization, namely its lack of gradient information. This view explains the low convergence
rate without relying on the descent property that holds when the objective function is uniformly
convex, as presented in other literature [6], [7]. The dissertation then presents how to improve
the simplex method by combining it with two different quasi-gradient methods. The improved
algorithm, which requires no complex mathematical computation, can optimize
multi-dimensional problems with a higher success rate and faster convergence speed.
The genetic algorithm [8], the differential evolution algorithm [9], [10], and the particle
swarm algorithm [11] are other popular derivative-free optimization tools that are widely
applied and familiar to researchers and practitioners [12]-[14]. These algorithms perform well
in both global and local search and are able to find the optimum without getting trapped in local
minima, a capability that local search algorithms such as calculus-based methods or the simplex
method largely lack. The main drawback of global search algorithms is their computational
cost, which often makes their convergence much slower than that of local search algorithms.
Particle swarm optimization is a global, probabilistic search technique, different from
deterministic and stochastic techniques. Compared with the genetic algorithm and the
differential evolution algorithm, particle swarm optimization is computationally simpler
because its crossover-like and mutation-like updates are performed simultaneously on each
particle, whereas in the genetic algorithm and the differential evolution algorithm crossover and
mutation are performed pairwise across the whole population. With the improvements
contributed in this work, the simplex method can be considered another optional optimization
algorithm, one that can work much more efficiently than other well-known derivative-free
optimization algorithms in many different fields of engineering and science.
1.1 Genetic Algorithm
The genetic algorithm (GA) was invented to mimic natural evolution according to the
Darwinian principle of survival and reproduction [15]. Unlike calculus-based methods, GA
does not require derivatives, and it searches many points of the solution space in parallel; it is
therefore less likely to get trapped in local minima. Like the particle swarm algorithm and the
differential evolution algorithm, GA starts from an initial population in which each individual,
called a chromosome, represents a solution. During each generation, chromosomes are
evaluated according to their fitness values and evolved to create new chromosomes for the next
generation. New child chromosomes can be produced in two ways: by merging two parent
chromosomes from the current generation with the crossover operator, or by modifying
chromosomes with the mutation operator. To maintain the population size, all chromosomes
must go through a natural selection process: chromosomes with better genes, i.e., better fitness,
have a higher probability of surviving into the next generation, while those with worse genes
are more likely to be rejected. This procedure is repeated until a best chromosome close to the
optimum solution is obtained. Another big advantage of GA is that it can be applied in many
domains, not only in optimization problems. However, it still suffers from premature
convergence and a low local convergence speed, so GA continues to be improved by
researchers [16], [17].
1.2 Differential Evolution Algorithm
The differential evolution algorithm (DE) was introduced by R. Storn and K. Price in
1997 [9], [10]. It has since become one of the most robust function minimizers, with a
relatively simple self-adapting mutation, and is able to solve a wide range of optimization
problems. The central idea of DE is a new scheme for computing trial parameter vectors: a trial
vector is formed by adding the weighted difference between two population members to a third
member. If the resulting vector has a lower objective function value than a predefined
population member, the newly generated vector replaces the vector with which it was
compared. Over time, the algorithm has been adapted to increase its efficiency. In 2007, the
concept of multiple trial vectors [18] was introduced; this approach aims to make DE converge
for a broader range of problems, because a single scheme for calculating trial vectors may work
well on certain types of problems but not on others. Another approach, in which the choice of
learning strategy and the two control parameters F (weighting factor) and CR (crossover
constant) are dynamically adjusted, also brought a significant improvement [19]. Recently, an
adaptive differential evolution algorithm with multiple trial vectors has trained artificial neural
networks successfully, showing results competitive with the error back propagation algorithm
and the Levenberg-Marquardt algorithm [20].
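The trial-vector scheme described above, a weighted difference of two members added to a third, followed by crossover and greedy replacement, can be sketched as follows (a minimal DE/rand/1/bin in Python; the function name and all parameter values are illustrative assumptions):

```python
import random

def differential_evolution(f, dim, pop_size=20, F=0.8, CR=0.9,
                           generations=150, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    fit = [f(x) for x in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # pick three distinct members, all different from member i
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            # mutation: weighted difference of two members added to a third
            mutant = [pop[a][k] + F * (pop[b][k] - pop[c][k]) for k in range(dim)]
            # binomial crossover with the current member
            jrand = rng.randrange(dim)   # guarantees at least one mutant gene
            trial = [mutant[k] if (rng.random() < CR or k == jrand) else pop[i][k]
                     for k in range(dim)]
            # greedy selection: the trial replaces the member it was compared with
            ft = f(trial)
            if ft < fit[i]:
                pop[i], fit[i] = trial, ft
    return pop[fit.index(min(fit))]

sphere = lambda x: sum(v * v for v in x)
best = differential_evolution(sphere, dim=5)
```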
1.3 Particle Swarm Optimization
Particle swarm optimization (PSO) simulates the social swarming behavior of a flock of
birds or a school of fish searching for food [21]. The main idea is that each particle uses
communication with the best particle to update its own position and velocity. The algorithm
operates on a randomly created population of potential solutions and searches for the optimum
by creating successive populations of solutions. PSO may sound similar to the differential
evolution algorithm or the genetic algorithm in its strategy of selecting the best individual, but
it is quite different. In PSO, the potential solutions, called particles, move toward the current
(dynamically changing) optimum in the solution space. Each particle has its own position,
best-known position, velocity, and fitness. In each generation, each particle consults the best
particle in the swarm and follows it when updating its own information. During its motion, if a
particle finds a better position than its own best-known one, it automatically updates itself; and
if a particle finds a new overall best position, that position immediately becomes the current
best. Because of its global search ability and fast convergence compared with other global
search algorithms, PSO is widely applied in optimization.
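The position and velocity updates described above can be sketched as follows (a minimal PSO in Python using the common inertia-weight formulation; the coefficients w, c1, c2 and all other values are illustrative assumptions):

```python
import random

def pso_minimize(f, dim, swarm_size=25, iterations=150,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]                   # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = pbest_val.index(min(pbest_val))
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # the swarm's best position

    for _ in range(iterations):
        for i in range(swarm_size):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward own best + pull toward swarm best
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                pos[i][k] += vel[i][k]
            val = f(pos[i])
            if val < pbest_val[i]:                # particle improved its own best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:               # and possibly the swarm's best
                    gbest, gbest_val = pos[i][:], val
    return gbest

sphere = lambda x: sum(v * v for v in x)
best = pso_minimize(sphere, dim=5)
```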
1.4 Nelder Mead's Simplex Algorithm
The simplex method [3] is a direct downhill search method. It is a simple algorithm for
finding local minima and is applicable to multidimensional optimization problems. Unlike
classical gradient methods, this algorithm does not calculate derivatives; instead it creates a
geometric simplex and uses the simplex's movement to guide its convergence. A simplex is a
geometrical figure formed by (n+1) vertices, where n is the number of variables of the
optimization function and the vertices are points selected to form the simplex. In each iteration,
the simplex method calculates the reflection of the worst vertex through the centroid of the
remaining vertices. Depending on the function value at this new vertex, the algorithm performs
a reflection, expansion, contraction, or shrink operation to form a new simplex. In other words,
the function values at the vertices are evaluated iteratively, and the worst vertex, the one with
the highest function value, is replaced by a newly found better vertex; otherwise the simplex is
shrunk around the best vertex. This process continues until a desired minimum is met.
Moreover, the convergence speed of the algorithm is influenced by three parameters α, β, γ:
α is the reflection coefficient, which defines how far the reflected point lies from the centroid;
β is the contraction coefficient, which defines where the contracted point lies between the
centroid and the worst point (or the reflected point, when the reflected point's function value is
smaller than the worst point's); γ is the expansion coefficient, which defines how far to expand
beyond the reflected point when the simplex is moving in the right direction. Depending on
these coefficients α, β, γ, the volume of the simplex is changed by the reflection, contraction,
and expansion operations respectively. Nelder Mead's simplex method can be summarized as
follows; more details can be found in the original paper [3].
• Step 1: choose α, β, γ; select an initial simplex with random vertices x0, x1, …, xn and
calculate their function values.
• Step 2: sort the vertices x0, x1, …, xn of the current simplex so that the function values
f0, f1, …, fn are in ascending order.
• Step 3: calculate the reflected point xr and its function value fr.
• Step 4: if fr < f0:
(a) calculate the extended point xe and its function value fe
(b) if fe < f0, replace the worst point by the extended point: xn = xe, fn = fe
(c) if fe ≥ f0, replace the worst point by the reflected point: xn = xr, fn = fr
• Step 5: if fr ≥ f0:
(a) if fr < fi for at least one i ≠ n, replace the worst point by the reflected point:
xn = xr, fn = fr
(b) if fr ≥ fi for all i ≠ n:
(b1) if fr > fn, calculate the contracted point xc and its function value fc
(c1) if fc > fn, shrink the simplex toward the best point
(c2) if fc ≤ fn, replace the worst point by the contracted point: xn = xc, fn = fc
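For comparison with the step list above, the method can be sketched compactly in Python (an illustration, not the dissertation's MATLAB listing; the shrink coefficient 0.5, the initial simplex built with a 0.1 offset along each axis, and the stopping rule are extra assumptions):

```python
def nelder_mead(f, x0, alpha=1.0, beta=0.5, gamma=2.0,
                max_iter=2000, tol=1e-8):
    n = len(x0)
    # initial simplex: x0 plus a small step along each coordinate axis
    simplex = [list(x0)] + [
        [x0[j] + (0.1 if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)                       # Step 2: ascending order
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        # Step 3: reflect the worst vertex through the centroid
        xr = [centroid[j] + alpha * (centroid[j] - worst[j]) for j in range(n)]
        if f(xr) < f(best):                       # Step 4: try expansion
            xe = [centroid[j] + gamma * (xr[j] - centroid[j]) for j in range(n)]
            simplex[-1] = xe if f(xe) < f(best) else xr
        elif any(f(xr) < f(v) for v in simplex[:-1]):   # Step 5(a)
            simplex[-1] = xr
        else:                                     # Step 5(b): contraction
            xh = worst if f(worst) < f(xr) else xr
            xc = [centroid[j] + beta * (xh[j] - centroid[j]) for j in range(n)]
            if f(xc) < f(worst):                  # (c2): accept the contraction
                simplex[-1] = xc
            else:                                 # (c1): shrink toward the best
                simplex = [best] + [
                    [(v[j] + best[j]) / 2 for j in range(n)] for v in simplex[1:]
                ]
    return min(simplex, key=f)

rosenbrock = lambda x: 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
best = nelder_mead(rosenbrock, [-1.2, 1.0])
```

From the classic starting point (-1.2, 1.0) the simplex works its way along the Rosenbrock valley toward the minimum at (1, 1).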
Appendix 1: Nelder Mead's simplex method

check=0; % reconstructed: the loop header is garbled in the source
for i=1:n
if f_Ps>f(i) % f_P*>f_i and i~=h
check=1;
break;
end
end
if check==0
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
else
if f_Ps>f(end) %f_P*>f_h
Pss=c*SIMPLEX(end,:)+(1-c)*Pb; %calculate P** by contraction toward the centroid Pb
f_Pss=feval(obj,Pss);
if f_Pss>f(end) %f(P**)>f(h)
for i=1:n+1
SIMPLEX(i,:)=(SIMPLEX(i,:)+SIMPLEX(1,:))/2; %replace all Pi' by (Pi+Pl)/2
f(i)=feval(obj,SIMPLEX(i,:));
end
else
SIMPLEX(end,:)=Pss; %replace Ph by P**
f(end)=f_Pss;
end
else
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
end
end
end
% reorder and display iteration output
SIMPLEX(:,end+1)=f';
SIMPLEX=sortrows(SIMPLEX,n+1);
f=SIMPLEX(:,end)';
SIMPLEX(:,end)=[];
%calculate error
error(ite)=f(1);
t(ite)=ite;
% termination condition for neural network training (the end of this listing is garbled in the source)
% if f(1) < ...

Appendix 2: Improved simplex method with quasi-gradient method using an extra point
%% using composite point
for i=1:n,
ord(i)=SIMPLEX(i,i);
end
f_ord=feval(obj,ord);
for i=1:n,
if mod(i,2)==0
grad(i)=(f(i-1)-f_ord)/(SIMPLEX(i-1,i)-ord(i));
else
grad(i)=(f(i+1)-f_ord)/(SIMPLEX(i+1,i)-ord(i));
end
end
Gs=SIMPLEX(1,:)-alpha*grad/sqrt(sum(grad.^2)); % step from the best vertex along the normalized quasi gradient
% calculate the reflected point and project it onto the line through the best vertex and Gs
P3=(1+a)*SIMPLEX(1,:)-SIMPLEX(end,:);
P1=SIMPLEX(1,:);
P2=Gs;
PP=(P3-P1).*(P2-P1);
u=sum(PP)/sum((P2-P1).^2);
Gs=P1+u*(P2-P1);
f_Gs=feval(obj,Gs);
% if f_Gs < ... (garbled in the source: the acceptance test for the quasi-gradient
% point and the lines computing the reflected point P* are missing here)
check=0; % reconstructed loop header
for i=1:n
if f_Ps>f(i) % f_P*>f_i and i~=h
check=1;
break;
end
end
if check==0
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
else
if f_Ps>f(end) %f_P*>f_h
Pss=c*SIMPLEX(end,:)+(1-c)*Pb; %calculate P** by contraction toward the centroid Pb
f_Pss=feval(obj,Pss);
if f_Pss>f(end) %f(P**)>f(h)
for i=1:n+1
SIMPLEX(i,:)=(SIMPLEX(i,:)+SIMPLEX(1,:))/2; %replace all Pi' by (Pi+Pl)/2
f(i)=feval(obj,SIMPLEX(i,:));
end
else
SIMPLEX(end,:)=Pss; %replace Ph by P**
f(end)=f_Pss;
end
else
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
end
end
end
% reorder and display iteration output
SIMPLEX(:,end+1)=f';
SIMPLEX=sortrows(SIMPLEX,n+1);
f=SIMPLEX(:,end)';
SIMPLEX(:,end)=[];
error(ite)=f(1);
t(ite)=ite;
% termination condition for neural network training (the end of this listing is garbled in the source)
% if f(1) < ...

Appendix 3: Improved simplex method with quasi-gradient method using a hyper plane
% using hyper plane equation
I=SIMPLEX(:,1:n);
A=ones(1,n+1)'; % first column of ones for the plane's constant term
A(:,2:n+1)=I; % remaining columns hold the vertex coordinates
B=f';
P=pinv(A)*B; % least-squares fit of the function values over the simplex vertices
grad=P'; % the fitted plane's slope coefficients approximate the gradient
Gs=SIMPLEX(1,:)-alpha*grad(1,2:n+1); % step from the best vertex against the fitted gradient
% calculate the reflected point and project it onto the line through the best vertex and Gs
P3=(1+a)*SIMPLEX(1,:)-SIMPLEX(end,:);
P1=SIMPLEX(1,:);
P2=Gs;
PP=(P3-P1).*(P2-P1);
u=sum(PP)/sum((P2-P1).^2);
Gs=P1+u*(P2-P1);
f_Gs=feval(obj,Gs);
% if f_Gs < ... (garbled in the source: the acceptance test for the quasi-gradient
% point and the lines computing the reflected point P* are missing here)
check=0; % reconstructed loop header
for i=1:n
if f_Ps>f(i) % f_P*>f_i and i~=h
check=1;
break;
end
end
if check==0
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
else
if f_Ps>f(end) %f_P*>f_h
Pss=c*SIMPLEX(end,:)+(1-c)*Pb; %calculate P** by contraction toward the centroid Pb
f_Pss=feval(obj,Pss);
if f_Pss>f(end) %f(P**)>f(h)
for i=1:n+1
SIMPLEX(i,:)=(SIMPLEX(i,:)+SIMPLEX(1,:))/2; %replace all Pi' by (Pi+Pl)/2
f(i)=feval(obj,SIMPLEX(i,:));
end
else
SIMPLEX(end,:)=Pss; %replace Ph by P**
f(end)=f_Pss;
end
else
SIMPLEX(end,:)=Ps; %replace Ph by P*
f(end)=f_Ps;
end
end
end
% reorder and display iteration output
SIMPLEX(:,end+1)=f';
SIMPLEX=sortrows(SIMPLEX,n+1);
f=SIMPLEX(:,end)';
SIMPLEX(:,end)=[];
error(ite)=f(1);
t(ite)=ite;
% termination condition for neural network training (the listing is truncated in the source)
% if f(1) < ...