Adaptive Scheduling Using Support Vector Machine on Heterogeneous Distributed Systems

by Yongwon Park

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 6, 2011

Keywords: Heterogeneous Computing, Task mapping, Support Vector Machine

Copyright 2011 by Yongwon Park

Approved by
Sanjeev Baskiyar, Chair, Associate Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering
Dean Hendrix, Associate Professor of Computer Science and Software Engineering

Abstract

Since the advent of the modern von Neumann computer in the 1940s, startling advances have been made in computing technology with the creation of innovative, reliable, and faster electronic components, from vacuum tubes to transistors. Computing power has risen exponentially over relatively brief periods of time as Moore's law projected a salient trend for the growth in memory-chip performance, estimating that the capacity of the integrated circuit would double every 18 months. Although these developments were essential in solving computationally intensive problems, faster devices were not the sole contributing factor to high performance computing. Since electronic processing speeds began to approach limitations imposed by the laws of physics, it became evident that the performance of uniprocessor computers was limited. This has led to the prominent rise of parallel and distributed computing. Such systems could be homogeneous or heterogeneous. In the past decade homogeneous computing has solved large computationally intensive problems by harnessing a multitude of computers via a high-speed network. However, as the amount of homogeneous parallelism in applications decreases, the homogeneous system cannot offer the desired speedups. Therefore, heterogeneous architectures that exploit the heterogeneity in computations came to be a critical research issue. In heterogeneous computing (HC) systems consisting of a multitude of autonomous computers, a mechanism that can harness these computing resources efficiently is essential to maximize system performance. This is especially true in mapping tasks to heterogeneous computers according to the task computation type, so as to maximize the benefits from the heterogeneous architecture.

The general problem of mapping tasks onto machines is known to be NP-complete; as such, many good heuristics have been developed. However, the performance of most heuristics is susceptible to the dynamic environment and affected by various system variables. Such susceptibility makes it difficult to choose an appropriate heuristic. Furthermore, an adaptable scheduler has been elusive to researchers. In this research, we show that using a support vector machine (SVM) an elegant scheduler can be constructed which is capable of making heuristic selections dynamically and which adapts to the environment as well. To the best of our knowledge, this research is the first use of SVM to perform schedule selections in distributed computing. We call the novel meta-scheduler the support vector scheduler (SVS). Once trained, SVS can perform the schedule selections in O(n) complexity, where n is the number of tasks. Using rigorous simulations, we evaluated the effectiveness of SVS in making the best heuristic selection. We find that the average improvement of SVS over random selection is 29%, and over worst selection is 49%.
Indeed, SVS is only 5% worse than the theoretical best selection. Since SVS contains a structural generalization of the system, the heuristic selections are adaptive to the dynamic environment in terms of task heterogeneity and machine heterogeneity. Furthermore, our simulations show that SVS is highly scalable with the number of tasks as well as the number of machines.

Acknowledgments

Looking back on my studies in America, I am surprised and at the same time very grateful for all I have achieved throughout them. They have certainly shaped me as a computer scientist and have led me to where I am now. I would like to first express my deepest gratitude to my adviser, Dr. Sanjeev Baskiyar, for his excellent guidance and his invaluable comments from the beginning of this study to its completion. His valuable comments about the need for this dissertation helped me to keep going when times were tough. He inspired me and encouraged me toward a deeper understanding of research work before I thought I could do any research at all. He has enlightened me through his wide knowledge of scientific work and his deep insights into where it should go and what is necessary to get there. I would like to thank my dissertation committee members, Dr. Cheryl Seals and Dr. Dean Hendrix, for many suggestions that have improved this dissertation. I also wish to acknowledge Dr. Adit Singh, who provided interesting discussions and reviews for the dissertation. Thanks to Dr. Caio Soares for the careful reading of an early draft of this document. I would like to thank Dr. Umphress for his many years of support of my assistantship in the doctoral program. I would like to thank the graduate school and the Department of Computer Science and Software Engineering staff members for assisting me with administrative tasks for completing my doctoral program. A very special thanks goes to Dr. Kai Chang and Dr. Homer Carlisle for their support of my academic career. I thank my cousin, Taewon Kim, who is currently pursuing postdoctoral research at Stanford University, for his help with preparing for my studies in America. Many thanks to those friends in Auburn who shared friendship with me over many years playing sports together.

Finally, I wish to express my special gratitude to my parents for their emotional support. I can't possibly thank my parents enough. They were always supporting me and encouraging me with their best wishes. I finish with a final silence of gratitude for my life.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
2 Background
2.1 Heuristic algorithm
2.1.1 Immediate mode mapping heuristics
2.1.2 Batch mode heuristics
2.2 GA (Genetic Algorithm)
2.3 Load sharing algorithm
2.4 Machine Learning
3 Related work
4 Parallel and Distributed Computing (PDC)
4.1 Von Neumann computer
4.2 Parallel Processing Paradigms
4.3 Organization of PDC systems
4.4 PDC Systems performance
4.5 Partitioning and Scheduling
4.5.1 Dynamic scheduling
4.5.2 Static scheduling
4.5.3 Task Allocation
5 Heterogeneous Computing
5.1 Heterogeneous Systems
5.1.1 Heterogeneous problem model
5.1.2 SIMD vs. MIMD mode
5.2 Task Profiling and Analytical Benchmarking
5.3 Matching and Scheduling for HC systems
6 Support vector machine
6.1 Overview of Support Vector Machine
6.2 Linear Support Vector Machines
6.2.1 Non-linear Support Vector Machines
6.3 Kernels
7 Support Vector Machine & Particle Swarm Optimization
7.1 Particle Swarm Optimization
7.1.1 Background
7.1.2 CPSO model
7.1.3 Simulations & Results
7.1.4 Conclusions
7.2 Implementing Support Vector Machine using Particle Swarm Optimization
7.3 Application of SVM
8 Support Vector Scheduler (SVS) framework
8.1 Phase 1: Creating sample training data
8.2 Phase 2: Constructing a SV model
8.3 Phase 3: Heuristic selection
9 Simulation Procedure
10 Results and Discussion
11 Conclusion
Bibliography

List of Figures

4.1 The von Neumann computer
4.2 Flynn's taxonomy
4.3 The organization of parallel computers
4.4 Matrix multiplication using a systolic array
4.5 Traditional shared memory multiprocessor model
4.6 Shared memory multiprocessor implementation
4.7 Message-passing multiprocessor model (multicomputer)
4.8 Dynamic scheduling
5.1 An example heterogeneous computing environment
5.2 A hypothetical example of the advantage of using heterogeneous computing
5.3 Heterogeneous application model
6.1 Example of large margin hyperplane with support vectors circled
6.2 Non-linear samples inseparable in the input space, rendered separable in feature space
6.3 Linear samples
6.4 Non-linear samples
6.5 Decision boundary and support vectors when using a Gaussian kernel
7.1 Cluster particles using SOM. Particles (circles) move toward random vectors (dark circles)
7.2 Rastrigin's Function
7.3 Griewank Function
7.4 CPSO Vs PSO for De Jong's F2
7.5 CPSO Vs PSO for Schaffer F6
7.6 CPSO Vs PSO for Rastrigin F1
7.7 CPSO Vs PSO for Griewank
7.8 SVMpso Vs SVMlight by kernel
8.1 Support Vector Scheduler (SVS) Framework
8.2 The format of the input data
8.3 Several batches of n tasks
8.4 The procedure of heuristic selection
8.5 Evaluation of an input vector x by the SV model
9.1 Average improvement in makespan from SVS
10.1 Average improvement by machine heterogeneity in batch size 10
10.2 Average improvement by machine heterogeneity in batch size 20
10.3 Average improvement by machine heterogeneity in batch size 50
10.4 Average improvement by machine heterogeneity in batch size 100
10.5 Average improvement by no. of processors in batch size 10
10.6 Average improvement by no. of processors in batch size 20
10.7 Average improvement by no. of processors in batch size 50
10.8 Average improvement by no. of processors in batch size 100
10.9 Average improvement by no. of tasks in batch size 10
10.10 Average improvement by no. of tasks in batch size 20
10.11 Average improvement by no. of tasks in batch size 50
10.12 Average improvement by no. of tasks in batch size 100
10.13 Average improvement by task heterogeneity in batch size 10
10.14 Average improvement by task heterogeneity in batch size 20
10.15 Average improvement by task heterogeneity in batch size 50
10.16 Average improvement by task heterogeneity in batch size 100
10.17 Average improvement by batch task size
10.18 Normalized makespan and accuracy by machine heterogeneity
10.19 Normalized makespan and accuracy by task heterogeneity
10.20 Normalized makespan and accuracy by no. of machines
10.21 Normalized makespan and accuracy by no. of tasks
10.22 Normalized makespan and accuracy by batch size
10.23 Accuracy by random number generator
10.24 Average improvement over MinMin and MaxMin by task heterogeneity in Uni-Uni
10.25 Average improvement over MinMin and MaxMin by machine heterogeneity in Uni-Uni
10.26 Average improvement over MinMin and MaxMin by no. of tasks in Uni-Uni
10.27 Average improvement over MinMin and MaxMin by no. of machines in Uni-Uni
10.28 Average improvement over MinMin and MaxMin by batch size in Uni-Uni
10.29 Average improvement over MinMin and MaxMin by task heterogeneity in Uni-Norm
10.30 Average improvement over MinMin and MaxMin by machine heterogeneity in Uni-Norm
10.31 Average improvement over MinMin and MaxMin by no. of tasks in Uni-Norm
10.32 Average improvement over MinMin and MaxMin by no. of machines in Uni-Norm
10.33 Average improvement over MinMin and MaxMin by batch size in Uni-Norm
10.34 Average improvement over MinMin and MaxMin by task heterogeneity in Norm-Uni
10.35 Average improvement over MinMin and MaxMin by machine heterogeneity in Norm-Uni
10.36 Average improvement over MinMin and MaxMin by no. of tasks in Norm-Uni
10.37 Average improvement over MinMin and MaxMin by no. of machines in Norm-Uni
10.38 Average improvement over MinMin and MaxMin by batch size in Norm-Uni
10.39 Average improvement over MinMin and MaxMin by task heterogeneity in Norm-Norm
10.40 Average improvement over MinMin and MaxMin by machine heterogeneity in Norm-Norm
10.41 Average improvement over MinMin and MaxMin by no. of tasks in Norm-Norm
10.42 Average improvement over MinMin and MaxMin by no. of machines in Norm-Norm
10.43 Average improvement over MinMin and MaxMin by batch size in Norm-Norm

List of Tables

5.1 Trade-offs between SIMD and MIMD
6.1 List of relevant symbols
8.1 ETC for a batch of tasks
10.1 Combinations of input and training distribution functions

Chapter 1 Introduction

Since the advent of the modern von Neumann computer in the 1940s, startling advancements have been made in computing technology with the availability of innovative, reliable, and fast electronic components, from vacuum tubes to transistors. The clock speed of early computers, at the millisecond level, increased by up to a factor of a million in fifty years; accordingly, uniprocessors' speed increased by a factor of 10 every seven years. The technical advancement achieved in memory-chip performance roughly followed Moore's law, which projected that the capacity of the integrated circuit would double every 18 months. Concurrent with these developments, there has been increasing demand for computing resources. Since computers make it possible to perform scientific computations in far less time than would otherwise be possible, their use extended greatly to various areas, and the dependence on computers increased rapidly. High performance computing has a history that has been developed within crisis management, placing demand on high-performance computations for modeling natural phenomena relevant to crises, such as severe storms, earthquakes, and atmospheric dispersion of toxic substances [110]. Similarly, there have been various efforts to improve the performance of computers to meet diverse needs for computing resources, giving rise to great interest in high-performance computing. In one such effort, the desire to speed up computation led to the development of a variety of types of supercomputers. They were usually devised for highly calculation-intensive tasks, grand challenge problems such as weather forecasting, climate research, biological macromolecules, and so forth. When speculating on the evolution of supercomputers, we can find a certain trend in the speed gains of computers. In early supercomputers, the speed-up was achieved by the development of fast scalar processors, serial processors that operated in the simplest operation mode. The next powerful form of multiprocessing was provided by vector processors, which are particularly good at performing vector and matrix operations, and further by parallelizing computations through the connection of a massive collection of processing units. The computer architecture based on a uniprocessor, whose performance mainly depends on its clock speed, has changed to the multiprocessor form. Also, as commodity computers with powerful microprocessors became available, relatively inexpensive computers such as personal computers and workstations came to be exploited in high-performance computing by combining these computing resources in a cooperative way, using software techniques, to create the computing power required for computation-intensive tasks; e.g., the Beowulf clustering system [101] was able to create the great computing power needed by connecting personal computers via widely available networking technology running any one of several open-source operating systems such as Linux. Nowadays, high-performance computing seems to continue to favor such an integration of existing computing resources, influenced by the fact that electronic processing speeds began to approach limitations imposed by the laws of physics, rather than the development of new, faster processors.
Furthermore, the advent of high-speed networks such as the Fiber Distributed Data Interface made the construction of large-scale parallel and distributed computing (PDC) systems with the least possible communication overhead feasible. Thus, even computers placed a physically long distance apart can be seamlessly connected with the help of data communication and system operating technology. Therefore, it is expected that in the future the main method of obtaining the computing power needed for computation-demanding tasks will be an integrated collection of computing resources using high-speed networks. In this sense, PDC is currently considered a promising technique that can provide a vast amount of computing power in a cost-effective way, and is able to provide an underlying environment on which high-profile technologies such as pervasive and nomadic computing [99] can flourish. For example, millions of computers can be harnessed cooperatively, each dedicating part of its computing resources via the Internet. The collected computing power can be so formidable that it would surpass even the fastest supercomputers. With this gradual change in the computing paradigm, it is tempting to think that PDC is such a magical technology that it can solve computationally demanding problems by connecting a multitude of computers via a high-speed network, and further breach barriers posed by the physical limitations of a single processor. However, this optimistic vision may not be achievable by just connecting a multitude of computers; in fact, beyond this vision we are faced with many challenging problems that must be solved in order to make it a reality.

PDC has been developed in the form of homogeneous and heterogeneous computing architectures. Homogeneous computing, which uses one or more machines of the same type, has provided adequate performance for many applications in the past. Many of these applications had more than one type of embedded parallelism, such as single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). Most current parallel machines are suited only for homogeneous computing; however, numerous applications that have more than one type of embedded parallelism are now being considered for parallel implementations. As the amount of homogeneous parallelism in applications decreases, the distributed homogeneous system cannot offer the desired speedups. Therefore, a suite of heterogeneous architectures to exploit the heterogeneity in computations came to be a critical research issue [5]. Heterogeneous computing (HC) is the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines, including parallel machines, to provide super-speed processing for computationally demanding tasks with diverse computing needs. Vision processing, which has different computation requirements at each processing level, is an example of the effective use of heterogeneous computing [1]. HC systems include heterogeneous machines, high-speed networks, interfaces, operating systems, communication protocols, and programming environments, all combining to produce a positive impact on ease of use and performance. Heterogeneous computing should be distinguished from network computing or high-performance distributed computing, which have generally come to mean either clusters of workstations or ad hoc connectivity among computers using little more than opportunistic load-balancing.
HC is a plausible, novel technique for solving computationally intensive problems that have several types of embedded parallelism. However, in order to best use heterogeneous computers to our advantage, a well-constructed mechanism that orchestrates the heterogeneous computing resources effectively is essential, especially if the purpose of the system is to minimize total completion time (called "makespan"). Therefore, the problem of mapping and scheduling tasks to heterogeneous machines is important in order to maximize system throughput in an HC system [3][12][11]. The general problem of assigning independent tasks onto heterogeneous computers is known to be NP-complete. Thus, most approaches so far have developed new heuristics, specifically tailored to the problem [12][109][93][10][104][3]. Researchers have developed a variety of heuristic algorithms. However, often a heuristic that, by its very nature, works well in a specific environment does not work as well in another. This effect is amplified in a dynamic system where new machines may join or depart the system [3]. The performance of heuristic algorithms is influenced by dynamic system variables including system heterogeneity. Therefore, the choice of a mapping heuristic in a dynamic system environment remains a difficult problem. In this research, rather than create another new heuristic, we develop a method of dynamically selecting a heuristic from among a group of conventional heuristics. The conventional heuristics should be chosen in a way that is suited to the particular system. We propose a support vector scheduler (SVS) framework, based on the support vector machine (SVM), that selects a heuristic from the available heuristics to map independent tasks onto heterogeneous machines. In this dissertation, we use the finish time as the objective function; however, our research is easily extensible to other objective functions. We generalize the correlation between system variables and heuristics' performance through learning, resulting in a parameterized quadratic function, namely the support vector (SV) model. The SVS is able to make a heuristic selection using the SV model. To the best of our knowledge, such an approach to heuristic selection for task mapping that accounts for dynamic system variables has not been studied before.

The remainder of the dissertation is structured as follows. Chapter 2 describes background. Chapter 3 discusses related work. In Chapter 4 we address PDC systems. Chapter 5 addresses heterogeneous computing. Chapter 6 introduces the SVM. Chapter 7 describes the design and mechanism of Particle Swarm Optimization and SVMpso, which is implemented here. Chapter 8 presents the SVS framework. In Chapter 9 the simulation procedure is presented. Chapter 10 shows results with analysis. Chapter 11 has the concluding remarks.

Chapter 2 Background

In this chapter, we introduce a variety of heuristic and machine learning algorithms for solving the task mapping problem.

2.1 Heuristic algorithm

Heuristic algorithms can be categorized into batch mode and immediate mode algorithms. In immediate mode, a task is mapped onto a machine immediately upon arrival, whereas in batch mode a task is not scheduled until a mapping event occurs.

2.1.1 Immediate mode mapping heuristics

The minimum completion time (MCT) heuristic is a variant of the fast-greedy heuristic. It has been used as a benchmark for the immediate mode [3].
MCT assigns each task to the machine on which that task will complete earliest. Braun et al. [20] compared 11 heuristic algorithms and found MCT to perform around the median of the heuristics. MCT requires O(m) time to find the machine that can finish a task earliest, where m is the number of machines. In the minimum execution time (MET) heuristic, as a job arrives, each task is assigned to the machine that provides the least execution time for that task. Although the MET heuristic is very simple, with complexity O(m), it may result in severe load imbalance across the machines [20]. All of the computing nodes in the cluster are examined to determine the node that gives the best execution time for the job. As mentioned, MET may result in load imbalance at some point because it does not consider the ready time of each machine. With respect to this, the switching algorithm (SA) alternates from MET to MCT when it senses load imbalance. SA has the same complexity as MCT and its performance is close to MCT. The k-percent best (KPB) heuristic implements the idea that too much selection pressure may lead to a local optimum, since it suppresses the diversity of the search. The parameter K determines the selection pressure. Therefore, in KPB, a subset of machines, where K is less than 100, is selected based on the earliest completion time. A task is assigned to the machine in the reduced set whose completion time is the least. That is, KPB aims for improvement in the long run by considering task heterogeneity instead of expecting an immediate marginal improvement. In a similar way, feasible robust k-percent best (FRKPB) first finds the feasible set of machines for the newly arrived task. From this set, FRKPB identifies the k-percent that has the smallest execution times for the task [30]. Opportunistic load balancing (OLB), known as a naive O(n) algorithm, simply places each job, in order of arrival, on the next available machine regardless of its completion time. The performance of OLB is worse than the other aforementioned algorithms.

2.1.2 Batch mode heuristics

Whereas tasks are mapped onto machines immediately upon arrival in immediate mode, in batch mode they are collected into a set that is examined for mapping at prescheduled times. This enables mapping heuristics to make better decisions because the heuristics have the resource requirement information for a whole meta-task. When the task arrival rate is high, there is a sufficient number of tasks to keep the machines busy between the mapping events. The Min-Min heuristic algorithm uses the expected time to compute (ETC) matrix to decide how to map a task to a suitable machine. The expected completion time is defined as

C_ij = E_ij + R_j    (2.1)

That is, the completion time of task i on machine j is calculated by adding the ready time R_j of machine j to the expected execution time E_ij of task i on machine j (2.1). Basically, a task is assigned to the machine that provides the minimum completion time. When there is contention for the same machine, to which two or more tasks are eligible, a task is assigned to the machine that results in the least change in ready time, where ready time is the time when the machine becomes available after having executed the tasks previously assigned to it. In this algorithm, it is expected that a smaller makespan can be obtained if a larger number of tasks are assigned to the machines that not only complete them earliest but also execute them fastest.
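To make the batch-mode mechanics above concrete, the following Python sketch applies the Min-Min rule to an ETC matrix. It is a minimal illustrative sketch under our own naming (min_min, etc, ready); the tie-breaking on least change in ready time described above is omitted for brevity, and this is not the simulator code used later in this dissertation.

    def min_min(etc, ready):
        """Map one batch of tasks with the Min-Min rule (illustrative sketch).
        etc[i][j]  -- expected time to compute task i on machine j
        ready[j]   -- time at which machine j becomes available
        Returns the (task, machine) assignments and the batch makespan.
        """
        unmapped = set(range(len(etc)))
        schedule = []
        while unmapped:
            best = None
            for i in unmapped:
                # Machine giving the minimum completion time C_ij = E_ij + R_j.
                j = min(range(len(ready)), key=lambda m: etc[i][m] + ready[m])
                c = etc[i][j] + ready[j]
                if best is None or c < best[0]:
                    best = (c, i, j)
            c, i, j = best
            ready[j] = c              # machine j is now busy until time c
            schedule.append((i, j))
            unmapped.remove(i)
        return schedule, max(ready)

The same skeleton extends to the other batch heuristics discussed next, with only the task-selection step changing.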
The Max-Min heuristic is similar to the Min-Min heuristic except that, among the candidate tasks whose completion times are minimized over all the machines, the task with the maximum completion time is chosen. The Max-Min heuristic is likely to be better when there are more short tasks than long tasks, since it can execute many short tasks concurrently along with the long tasks. The main idea of the Sufferage heuristic is to assign a task to the machine that the task would suffer most from not being assigned to. The Sufferage algorithm uses the same ETC matrix as the Min-Min heuristic. An arbitrary task is selected from the ETC matrix and assigned to its best machine. If that machine has already been assigned a task, however, the two tasks' sufferage values, each defined as the difference between its earliest completion time and its second-earliest completion time, are compared; the task with the larger sufferage value keeps the machine, and the other task is returned to the task queue.

2.2 GA (Genetic Algorithm)

Genetic Algorithms (GAs), devised by Holland [116], are adaptive heuristic search algorithms based on the evolutionary ideas of Darwin's principle of natural selection and genetics. In a GA, a group of solutions evolves toward a good area in the search space. A population (a group of individual solutions) evolves by surviving or dying through natural selection using various operators. In its simple form, a GA recursively applies the concepts of selection, crossover, and mutation to a randomly generated population of promising solutions, as described in Procedure 1. In [12] a GA was used to minimize the makespan, and the algorithm outperformed six other heuristic algorithms, by about 10% over the Early First algorithm.

Procedure 1 The Procedure of Genetic Algorithm
    Initialize Population
    Evaluate Population
    while stopping condition is not met do
        Select parents
        Create offspring
        Evaluate offspring
        Select survivors
    end while

2.3 Load sharing algorithm

In a heterogeneous computing environment with two processor classes, fast and slow, a job migration mechanism can be used. This mechanism distributes loads fairly over all processors, from slow to fast, using six scheduling strategies: probabilistic; probabilistic with migration of generic jobs; shortest queue (SQ); shortest queue with migration of generic jobs (SQM); least expected response time for generic jobs with maximum wait for dedicated jobs; and least expected response time for generic jobs with maximum wait for dedicated jobs, with migration [9]. In overall performance, the SQ and SQM methods were better than all the other methods.

2.4 Machine Learning

Scheduling plays an important role in production control in a flexible manufacturing system (FMS), which involves several real-time decisions, such as part type and machine selection [13]. Consequently, a scheduled FMS is able to improve machine utilization, enhance throughput, and reduce the number of work-in-process parts, the mean flow time, and the number of tardy parts. Assigning correct dispatching rules dynamically is critical for the scheduling problem. After receiving useful information from an FMS, a good scheduler should be able to make the right decision, i.e., output the right dispatching rule, for the next period to gain good performance. This requires as much expert knowledge stored in the scheduler as possible. For such reasons, machine learning, based on simulated sample data, has been used with good results.
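To make the idea in Section 2.4 concrete, the short Python sketch below shows one way a trained classifier could output a dispatching rule from real-time system attributes. The feature set, the rule list, and the use of scikit-learn's SVC are illustrative assumptions on our part; this is not the FMS scheduler of [13], nor the SVS implementation described later.

    from sklearn.svm import SVC

    RULES = ["FIFO", "SPT", "EDD"]   # hypothetical dispatching rules

    # Hypothetical training data: each row is a snapshot of system attributes
    # (queue length, mean utilization, work-in-process), labeled with the rule
    # that gave the best simulated performance for that state.
    X_train = [[12, 0.81, 30], [3, 0.40, 8], [25, 0.95, 60], [7, 0.55, 15]]
    y_train = [1, 0, 2, 0]

    clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

    def next_rule(state):
        # Predict which dispatching rule to apply for the coming period.
        return RULES[clf.predict([state])[0]]

    print(next_rule([10, 0.75, 22]))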
Chapter 3 Related work

This chapter introduces conventional heuristics for solving the mapping problem and machine learning approaches to various scheduling problems as related to our work. Much work has been done in the study of the dynamic task mapping problem for distributed computing systems [35][36][12][2][11][20][8]. A number of systems have been developed for managing the execution of tasks on a network of machines, scheduling tasks in order to evenly balance the load on the machines in the meta-computer [32][33][34][9][37][10]. The Switching Algorithm (SA) [3] utilizes the MCT and MET heuristics in an alternating manner, considering the load distribution across machines, which is determined by computing a load balance index based on machine ready times. However, the performance of SA is very close to MCT. The SmartNet scheduler [10] used a variety of scheduling algorithms to obtain near-optimal schedules for different problems. However, the selection was done statically and was based upon whether the tasks had dependencies or not. It used the MinMin and MaxMin heuristics when all of the tasks were computationally intensive and independent, and other heuristic algorithms when there were data dependencies among tasks. It does not select between the MinMin and MaxMin heuristics. Evolutionary techniques approach the task mapping problem by formulating it as an optimization problem in which an optimal or near-optimal solution is sought by exploring the solution space to find the best available values of an objective function over a defined domain. Page and Naughton [12] used genetic algorithms (GAs) to dynamically schedule heterogeneous tasks on non-identical processors in a distributed system. However, the complexity of GA techniques is high, and therefore they are difficult to use at run time.

Researchers have also attempted to apply machine learning techniques to solve scheduling problems arising in a variety of domains. Liu, Huang, and Lin [13] used a multi-class SVM to schedule tasks in a flexible manufacturing system by dynamically applying a set of dispatching rules according to real-time system attributes. Their scheduler outperformed all static dispatching rules, with its mean throughput outperforming a first-in-first-out scheduler by about 7%. Park, Casey, and Baskiyar [114] used a multi-class SVM to dispatch tasks dynamically onto heterogeneous machines by directly choosing appropriate machines to run tasks. The approach proved plausible, with performance very close to the Early First [12] algorithm. However, it does not select heuristics; rather, it directly maps tasks to machines, so it cannot be used to synthesize heuristics in the same way as the work in this research. It also functions in immediate mode, whereas the research in this dissertation addresses tasks in batch mode. Burke, Petrovic, and Qu [38] built a case-based reasoning (CBR) knowledge-based system to solve a timetabling problem, specifically a university course and exam timetabling system with various constraints. Their system operates by memorizing heuristics that worked well in similar situations in a case base and retrieving the best heuristic for solving any new problem by searching the knowledge database. It provides better-quality solutions than a single heuristic by using different heuristics selected dynamically.
In order to evaluate the quality of solutions, they used the following formula: P = Σ_s W_s S_s, where S_s is the number of situations in which a violation of constraint s occurred and W_s is a weight that reflects the importance of the constraint. We note that in the CBR system the knowledge database is updated with new cases, whereas an SVM does not require updating until new training is needed. Arbelaez, Hamadi, and Sebag [40] applied a combination of heuristics dynamically, using SVM, to find solutions to constraint satisfaction problems (CSPs). It outperformed a single-heuristic solution on a collection of nurse scheduling problems, solving on average 50% more instances than a single heuristic. The CSP was solved by searching a tree using correctly chosen heuristics to find a solution. The selection of a heuristic is based upon its execution time vis-a-vis a default heuristic; if worse, the default is used. The domain of this problem is different from the research presented here. Also, the objective function in SVS is makespan and it schedules in batch mode rather than immediate mode; SVS also addresses issues of adaptability, task and machine heterogeneity, and scalability.

Ravikumar and Vedi used a special class of neural networks, called Boltzmann machines, for solving the problem of statically mapping tasks onto processors in a reconfigurable environment [39]. The neural algorithm was much slower than heuristic algorithms, but provided solutions superior to those of the heuristic algorithms. Their neural networks rely on the principle of empirical risk minimization, whereas our approach develops a meta-scheduler based on the principle of structural risk minimization, making the latter adaptive. Beck and Siewiorek represented a task allocation problem for bus-based multicomputers as a vector packing problem. The goal was to obtain an assignment of tasks to nodes that is feasible with respect to the constraints and that minimizes the number of processing nodes and the utilization level of the broadcast bus. It was further formulated as a multi-dimensional problem [91]. The task allocation problem has also been formulated as the minimization of a quadratic pseudo-boolean function with linear constraints [92]. Hong and Prasanna scheduled a large set of equal-sized independent tasks on heterogeneous computing systems to maximize the throughput of the system by using an extended network flow representation [93]. Khalifa et al. created the utlMinMin heuristic, a revision of the original MinMin heuristic, which employs a preemptive mapping approach that allows tasks in execution to be interrupted, moved to another processor, and resumed in a different environment [94]. This algorithm was motivated by the fact that the original MinMin algorithm can cause some tasks to be starved of machines due to the expected heterogeneous nature of the tasks. That is, some tasks may be remapped at each successive mapping event without actually beginning execution. Meanwhile, other machines may remain idle during the whole mapping session. Braun et al. discussed the factors that make it difficult to select the best heuristic in a given scenario in a heterogeneous computing (HC) environment [11]. In [96], a heuristic algorithm based on a genetic algorithm for the task mapping problem was used in the context of local-memory multiprocessors with a hypercube interconnection topology. Chen et al.
solved a task scheduling problem in a grid environment as a task-resource assignment graph, and thus mapped the task scheduling problem into a graph optimal selection problem. They solved the optimization problem using Particle Swarm Optimization [97]. Huang et al. solved a task mapping problem as a master-slave model in a grid environment, where the master node delivers a set of equal-sized and independent tasks to available slave nodes in a single-port pattern, which means tasks are transmitted sequentially. Each slave begins processing tasks after receiving them, and then returns computation results upon completion [98]. In [95], the mapping problem was solved by randomized mapping, whose key idea is to randomly choose a valid mapping event and evaluate its performance. By repeating this process a large number of times and picking the best solution found over all iterations, it is possible to achieve a good approximation to the global optimum. The intuition behind this is that any algorithm that does not consider all possible solutions with a non-zero probability might get stuck in a local optimum. With the randomized approach any possible solution is considered and chosen with a small, but non-zero, probability.

Chapter 4 Parallel and Distributed Computing (PDC)

Although the essence of the internal design of the computer and the way information flows within a computer have not changed since the von Neumann computer [60], noticeable advancements in computing technology have been made by the changes in implementation from vacuum tubes to integrated circuits, and further by the availability of fast, reliable, and cheap electronic components, thus bringing revolutionary changes to the computing paradigm. These developments in computing technology enabled the solution of a wide range of computationally intensive problems requiring great computational speed, such as numerical modeling and simulation of scientific and engineering problems [58][59]. However, whatever the computational speed of current processors, there are still many problems not possible to process on a single processor. For example, the SETI@home project [56] is the first attempt to use large-scale distributed computing, with over 3 million users, to perform a sensitive search for radio signals from extraterrestrial civilizations. Radio telescope signals consist primarily of noise (from celestial sources and the receiver's electronics) and man-made signals such as TV stations, radar, and satellites. Modern radio SETI projects analyze the data digitally. More computing power enables searches to cover greater frequency ranges with more sensitivity. Radio SETI, therefore, has an insatiable appetite for computing power. It is evident that the performance of a single-processor computer is limited, because a processor has its clock speed constrained by limitations of device technology, such as CPU power and dissipation problems that result in thermal hotspots caused by non-uniform temperature distribution across the processor chip [21]. Therefore, the evolutionary transition from sequential to parallel and distributed computing (PDC) is seen as an inevitable step in solving computing-power-demanding problems, overcoming the limitations of a single processor. Furthermore, the PDC system is recognized as an important means for the solution of many grand challenge problems [61], really difficult tasks that stretch the limits of cognitive ability in science and engineering.
These include climate modeling, human genome mapping, semiconductor and superconductor modeling, pollution dispersion, and pharmaceutical design. It is common for these applications to require computing power greater than that obtainable with many conventional computers. Although this technology has risen to prominence during the last decades, many relevant issues still remain unresolved. The field is in a state of rapid flux, with ongoing work on several open issues. In this chapter we review the subjects that are essential to making effective use of PDC systems.

4.1 Von Neumann computer

The overwhelming majority of today's computers follow the same basic design principles, architectures, and modes of operation formulated in the late 1940s by John von Neumann and coworkers [62]. The central idea of the von Neumann computer is that it fetches an instruction and its operands from a memory unit, and saves in memory the results of executing the instruction in the processing unit. The von Neumann architecture is depicted in Fig. 4.1. In such a computer, the instructions are executed sequentially, one operation at a time. One problem of conventional von Neumann machines is the CPU-to-memory bottleneck, which mainly results from the disparity between high CPU speeds and much lower memory speeds. Over time, techniques to improve the early designs were introduced, such as cache memories and pipelining [64][65]. Although these efforts concentrated on improving the single-processor mechanism, the development of parallel and distributed computing is slowly leading to the end of sequential processing that depends on the performance of a single processor.

Figure 4.1: The von Neumann computer

4.2 Parallel Processing Paradigms

Computers operate simply by executing instructions on data. A stream of instructions informs the computer of what to do at each step. Flynn [44] created a classification for the architecture of a computer on the basis of how the machine relates its instructions to the data being processed. The multiplicity of instruction streams and data streams in the system produces the following four classifications:

single instruction stream, single data stream (SISD)
single instruction stream, multiple data stream (SIMD)
multiple instruction stream, single data stream (MISD)
multiple instruction stream, multiple data stream (MIMD)

Figure 4.2: Flynn's taxonomy

All von Neumann machines belong to the SISD class. A SIMD machine consists of N processors, N memories, an interconnection network, and a control unit. All of the identical processing elements are supervised by the same control unit, and the processors operate on different data sets from distinct data streams. For example, for a SIMD computer with N processors, each processor will be executing the same instruction at the same time, but each on its own data. SIMD machines can be classified into two categories: shared-memory (SM) and local-memory (LM) [66]. High performance on SIMD machines requires rewriting conventional algorithms to manipulate many data items simultaneously by sending instructions to all processors.
SIMD machines can efficiently solve problems that require parallel manipulation of large data sets, for example, algorithms based on vector and matrix operations [78][79], and image processing algorithms where operations are performed on a large number of image pixels [80][81][1]. Furthermore, many large-scale SIMD parallel machines have been implemented, including:

Illiac IV: The Illiac IV SIMD machine was implemented with 64 processing elements (PEs) with 64-bit words; an N-bit vector is used as the mask. Each vector bit represents the state of one PE. If the bit is a 1, the corresponding PE will be active; otherwise, the PE will be inactive. While this scheme permits enabling and disabling of arbitrary PE sets, it is prohibitively expensive for large N. It was used to run large-scale computations needed for climate simulations, signal processing, seismic research, and physics calculations.

Massively Parallel Processor (MPP): MPP, which incorporates 16,384 bit-serial PEs, was designed for high-speed satellite imagery processing [82], and was delivered to NASA in 1982.

The Connection Machine CM-2: A fully configured CM-2 machine comprises 65,536 PEs that are divided into four groups. Each group of 16,384 PEs is controlled by a sequencer, and the four sequencers are connected to up to four front-end computers via a 4 x 4 Nexus switch.

The GF11 parallel computer: The GF11 SIMD machine, designed at the IBM T.J. Watson Research Center, contains 566 very powerful processing elements.

The MasPar MP-1 and MP-2: The machines built by MasPar Computer Corporation contain up to 16,384 processing elements. The two machine types have the same basic architecture and differ mostly in processing element complexity; while the PEs in the MP-1 are 4-bit wide, those of the MP-2 are 32-bit CPUs.

In a MISD computer, each of the processors has its own control unit and shares a common memory unit where data reside. Therefore, parallelism is realized by enabling each processor to perform different operations on the same datum at the same time. Systolic arrays are known to belong to this class of architectures [67][68]. In MIMD machines there are N processors, N streams of instructions, and N streams of data. The processors of MIMD machines are of the same type used in a SISD computer; that is, each processor has its own control unit in addition to its local memory and arithmetic and logic unit. A MIMD computer is considered tightly coupled or loosely coupled according to the degree of interaction among processors. As an addition to Flynn's taxonomy, another class known as single program, multiple data (SPMD) can be included within the MIMD classification to describe cases where many programs of the same process type are executed on different data sets, synchronously or asynchronously; e.g., a single source program is written and each processor executes its own copy of this program [69].

4.3 Organization of PDC systems

PDC systems can be organized in two ways: general-purpose architectures and special-purpose architectures. Theoretically, PDC systems can be constructed for general purposes by building parallel computers for a wide variety of applications based on the MIMD model. In practice, it is reasonable to assemble several processors in a configuration specifically designed for the problem under consideration. Fig. 4.3 provides an overall organization of parallel computers.
Figure 4.3: The organization of parallel computers

In SIMD machines, the parallel operations are synchronized at the machine language level, and scheduling and allocation need to be done by the programmer. In MIMD machines, the processes that run in parallel need to be synchronized whenever they communicate with each other. In general, paradigms for building parallel computers can be divided into the following categories:

Pipelining: This is a classical way to exploit parallelism and concurrency in computer architectures. Pipelining was stimulated by the need for faster and more cost-effective systems. It refers to a segmentation of a computational process. Ramamoorthy et al. [70] defined this technique as organizing repeated sequential processes in an overlapped mode. In a pipelined architecture, computational processes are decomposed into several sub-processes and processed in an overlapped mode by dedicated autonomous modules. As an illustration, consider the process of executing an instruction in computers. Processing instructions involves several repetitive steps: fetching, decoding, and execution. Thus, each step can be pipelined by a dedicated unit, like an industrial assembly line.

Systolic array processors: The word systolic has been borrowed from the medical field: just as the heart pumps blood, information is pumped through a systolic array in various directions and at regular intervals. These are systems that consist of a number of processing elements with nearest-neighbor connections that operate in parallel under the direction of a single control unit. Array processors can be viewed as a subclass of SIMD computers in that each of the processing elements performs the same operation at the same time, on different data elements. Fig. 4.4 shows a systolic array used to multiply 4 x 4 matrices, A and B. The elements of A enter from the left and the elements of B enter from the top. The final product terms of C will be held in the processors as shown. Each processor, P_ij, repeatedly performs the same algorithm on different data elements.

Figure 4.4: Matrix multiplication using a systolic array

Vector processors: A vector processor is a processor that can operate on an entire vector in one instruction. The operands to the instructions are complete vectors instead of single elements. Therefore, vector architectures exhibit SIMD behavior by having vector operations that are applied to all elements in a vector. Most vector machines have a pipelined structure and specially designed register sets (vector-register architecture) that can hold vectors.

Multiprocessors: These are small-scale MIMD machines. The single-processor model is extended to have multiple processors connected to multiple memory modules, such that each processor can access any memory module, in a so-called shared memory configuration, as shown in Fig. 4.5. This architecture, which provides processors with a uniform memory access (UMA) structure, is attractive because of the convenience of sharing data. However, it is especially difficult to implement the hardware to achieve fast access to all the shared memory by all the processors. Hence, most large, practical shared-memory systems have some form of hierarchical or distributed memory structure, such that processors can access physically nearby memory locations much faster than more distant memory locations, thus resulting in a non-uniform memory access (NUMA) structure. The NUMA model is described in Fig. 4.6.
Figure 4.5: Traditional shared memory multiprocessor model
Figure 4.6: Shared memory multiprocessor implementation

Message-passing multiprocessors, an alternative form of multiprocessor system, are created by connecting complete computers through an interconnection network, as shown in Fig. 4.7. In this architecture, a processor can only access a location in its own memory; the interconnection network is thus provided for processors to send messages to other processors.

Figure 4.7: Message-passing multiprocessor model (multicomputer)

Dataflow machines & Reduction machines: Dataflow machines directly contrast with the traditional von Neumann architecture in that instructions are executed out of order, without either control flow or a program counter. Normally, instructions are organized sequentially without recording dependency information between them into binaries. It is because binaries compiled for a dataflow machine contain this dependency information that out-of-order execution is enabled. Therefore, in dataflow machines scheduling is based on the availability of data. Reduction machines differ from dataflow machines in that scheduling is based on need. Therefore, reduction machines are known as a demand-driven execution model, while dataflow machines are a data-driven execution model [71].

Neural networks: Neural computers learn or model the behaviors of complex systems through the use of a large number of processors and interconnections.

Clustering: This is a paradigm that enables multiple computers to cooperate simultaneously to solve a computationally intensive problem. The rapid increase in microprocessor performance and network bandwidth has made clustering a practical, cost-effective computing solution which is readily available to the masses. At the hardware level, a cluster is simply a collection of independent systems, typically workstations, connected via a commodity network. Cluster computers communicate through software methods such as message passing and distributed shared memory. Mega problems, such as grand challenge applications, that require all the computational capabilities of any available system, including memory and CPU, historically have been used as the principal justification for the adoption of massively parallel processing (MPP) on such capability-demand problems. On the other hand, clusters are considered the most cost-effective solution for capacity-demand problems requiring substantial, but far from ultimate, performance and making moderate demands on memory.

PDC systems are created by connecting multiple computers, using various forms of local area networks (LANs) and wide area networks (WANs), to provide high performance that approaches supercomputer levels [72]. The type and topology of the interconnection network, therefore, is an important design issue in parallel processor and multicomputer systems. Many network topologies, such as bus architectures, ring networks, crossbars, meshes, shuffle exchange networks, and hypercube ensembles, are described in the literature [73]. Choosing appropriate parallelizing algorithms that can make excellent use of a certain network topology is essential for achieving high performance, as is the availability of more efficient and reliable networks. Important factors that characterize a parallel algorithm in relation to the host parallel architecture are the number of processors, the capabilities of these processors, and so forth.
In addition to a parallel algorithm, a parallel architecture is characterized by the control mode in which computers within the architecture operate. In most computers, the control mode is command driven, which means that different events are driven by the sequence of instructions. Other computers employ a data-driven approach (e.g., dataflow machines); in this case, the control-flow mechanism is triggered by the availability of data. Another mode of control is demand driven, whereby computations take place only if their results are requested by other events. Alternate modes of control are based on combinations of these approaches.

4.4 PDC Systems performance

With respect to system performance, PDC systems have several issues to be solved in relation to parallelization that do not arise in sequential systems. One of these issues is task allocation, which is the breakdown of the total workload into smaller tasks assigned to different processors, and the proper sequencing of the tasks when some of them are interdependent and cannot be executed simultaneously. A PDC system can achieve the highest level of performance when it sustains high per-processor utilization through proper scheduling or load balancing. The scheduling problem is NP-complete [74][75][76][77].

4.5 Partitioning and Scheduling

Partitioning deals with how to detect embedded parallelism in the program. For this, the program is first analyzed to determine the ideal parallelism revealed by the control and data dependences. Two operations which are not related by control or data dependences can potentially be executed in parallel. This kind of parallelism is usually referred to as fine-grain parallelism because the unit of parallelism is normally an operation or a single statement. Several operations or statements may be combined into a larger grain to reduce communication overhead. The problem is to find the best grain size that maximizes parallelism while reducing overhead. In the context of parallel and distributed computing, partitioning must be followed by scheduling, in which the concurrent parts of a parallel program are arranged in time and space so that the overall execution time of the program is minimized. The problem of scheduling program tasks onto multiprocessor systems is known to be NP-complete in general. The intractability of the scheduling problem has led to a large number of heuristics, each of which may work under different circumstances. Partitioning and scheduling can be conducted implicitly or explicitly. In the implicit approach, compilers are required to analyze the application to explore embedded parallelism and perform partitioning, for example, when the underlying computing environment is entirely concealed from the programmer. On the other hand, in the explicit approach, the programmer is responsible for identifying parallelism within the application. In this case, existing languages are extended, or entirely new languages are required, to express parallelism directly.

4.5.1 Dynamic scheduling

Dynamic scheduling is any runtime technique for placing tasks onto processors and deciding when it is best to execute them so that parallel programs finish in the earliest possible time. The most elementary approach to dynamic scheduling attempts to balance processor load across m processors using very local information.
In its simplest form, m + 1 processors are used: one processor runs a scheduler that dispatches tasks on a first-in-first-out (FIFO) basis to all other m processors, as shown in Fig. 4.8. Each of the m processors maintains a private list of waiting tasks called the waiting queue. This FIFO list holds all tasks assigned to the processor. As soon as one task is finished, the processor takes the next waiting task from its queue and processes it to completion. As tasks are processed, they may require the services of other tasks. Thus, requests for new tasks are made by the m processors as they do their work. These requests are placed on the scheduler queue, maintained by the scheduler processor. The scheduler processor also dispatches tasks in first-come-first-served order. However, the decision of which waiting queue to select is made by various heuristics. The simplest heuristic attempts to balance the load on all processors.

Figure 4.8: Dynamic scheduling

4.5.2 Static scheduling

Static scheduling of nondeterministic programs requires the use of a different program model that distinguishes between conditional branching and precedence relations among parallel program tasks. The branch graph represents the control dependences and the different execution instances of the program, while the precedence graph represents the data dependence among tasks. An execution instance of a program can be defined as the set of tasks that are selected for execution at one time for some input. For scheduling, all execution instances within a program are first defined by the branch graph. Precedence graphs are used to construct task graphs based on execution instances from the branch graph, each of which consists of the nodes given in the corresponding execution instance and the precedence relations among them. Each task graph also shows the amount of computation needed at each node as well as the size of the data messages passed among the nodes. Each task graph is scheduled independently using one of the techniques used in scheduling branch-free task graphs. Given a task graph and a target machine description, a schedule is created in the form of a Gantt chart [100]. Finally, a number of Gantt charts are combined into a unified schedule. The schedule is given in the form of the processor allocation and the execution order of the tasks allocated to the same processor.

4.5.3 Task Allocation

In a distributed computing system made up of several processors, the interacting tasks constituting a distributed program must be assigned to the processors so as to make use of the system efficiently. To improve the performance of a distributed system, it is necessary that task allocation minimize both the inter-processor communication between tasks and the execution cost. Thus, the purpose of a task allocation technique is to find a task assignment in which the total cost due to inter-processor communication and task execution is minimized.

Chapter 5 Heterogeneous Computing

Homogeneous computing usually uses one mode of parallelism in a given machine (like SIMD, MIMD, or vector processing) and thus cannot adequately meet the requirements of applications that require more than one type of machine. Therefore, homogeneous systems consisting of a single type of machine architecture might suffer from inherent limitations. For example, vector machines employ interleaved memory with a pipelined arithmetic logic unit, leading to performance in the high millions of floating-point operations per second (Mflops).
However, if the data distribution of an application and the resulting computations cannot exploit these features, the performance degrades severely. In general, it is known that a single machine architecture is not able to satisfy all the computational requirements of various subtasks in certain applications in a feasible amount of time [24]. Thus, the limitations of homogeneous computing arising from the use of a single machine architecture can be overcome by constructing a heterogeneous computing environment. A heterogeneous computing (HC) system provides a variety of architectural capabilities, orchestrated to perform an application whose subtasks have diverse execution requirements. Fig. 5.1 shows an example of a heterogeneous computing environment. HC systems can be constructed in two different modes. A mixed-mode HC system is a single parallel processing machine that is capable of operating in either the synchronous SIMD or the asynchronous MIMD mode of parallelism, and can dynamically switch between modes at instruction-level granularity with generally negligible overhead [84]. A mixed-machine HC system consists of a heterogeneous suite of independent machines of different types interconnected by a high-speed network [85].

Figure 5.1: An example heterogeneous computing environment

To fully exploit HC systems, a task must be decomposed into subtasks, which may have different machine architectural requirements. Fig. 5.2 helps support the notion that the performance of a system running a hypothetical application containing diverse computational types differs distinctly according to the system architecture. Executing the whole program on a vector supercomputer only gives twice the performance achieved by a baseline serial machine, owing to the speedup of the vector portion of the program but only slight improvement in the non-vector portion. However, the use of five different machines, each matched with the computational requirements of the subtasks for which it is used, can execute 20 times as fast as the baseline serial machine.

Figure 5.2: A hypothetical example of the advantage of using heterogeneous computing

5.1 Heterogeneous Systems

MIMD and SIMD machines are the most common heterogeneous systems [86].

5.1.1 Heterogeneous problem model

To exploit the HC system, application tasks should be decomposed into subtasks. Fig. 5.3 describes the decomposition of a task into subtasks, code segments, and code blocks. The parallel task T is divided into subtasks t_i, 1 ≤ i ≤ N. Each subtask t_i is further divided into code segments t_ij, 1 ≤ j ≤ S, which can be executed concurrently. Each code segment within a subtask can belong to a different type of parallelism (i.e., SIMD, MIMD, vector, and so forth) and should be mapped onto a machine with a matching type of parallelism. Each code segment may further be decomposed into several concurrent code blocks with the same type of parallelism. These code blocks t_ijk, 1 ≤ k ≤ B, are suited for parallel execution on machines having the same type of parallelism.

Figure 5.3: Heterogeneous application model

Table 5.1: Trade-offs between SIMD and MIMD

  Operation                  SIMD   MIMD
  Synchronization overhead   No     Yes
  Data parallelization       No     Yes
  Control parallelism        No     Yes
  Memory cost                Low    High

5.1.2 SIMD vs. MIMD mode

A comparison of SIMD and MIMD machines is shown in Table 5.1.
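As a software analogy of the trade-offs in Table 5.1 (an analogy only, not a hardware model), the sketch below first applies one operation to every element of an array, in the spirit of SIMD data parallelism, and then runs two different operations under independent threads of control, in the spirit of MIMD; all names are illustrative.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    // Software analogy: one operation over all elements (SIMD-style) versus
    // different operations running concurrently under separate control threads (MIMD-style).
    public class SimdVsMimdAnalogy {
        public static void main(String[] args) throws Exception {
            int[] data = IntStream.rangeClosed(1, 8).toArray();

            // "SIMD": the same operation applied to every data element.
            int[] doubled = Arrays.stream(data).parallel().map(v -> v * 2).toArray();
            System.out.println(Arrays.toString(doubled));

            // "MIMD": independent threads of control performing different operations.
            Thread sumThread = new Thread(() -> System.out.println("sum = " + Arrays.stream(data).sum()));
            Thread maxThread = new Thread(() -> System.out.println("max = " + Arrays.stream(data).max().getAsInt()));
            sumThread.start(); maxThread.start();
            sumThread.join(); maxThread.join();
        }
    }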
In SIMD mode, processing elements are implicitly synchronized at the instruction level. In MIMD mode, by contrast, an explicit synchronization scheme is required, which generally incurs overhead. MIMD allows different operations to be performed on different processing elements simultaneously through multiple threads of control, thus enabling independent functional parallelism, while the SIMD machine might incur time delays for a sequence of data-dependent instructions due to the lack of concurrent operations.

5.2 Task Profiling and Analytical Benchmarking

To execute a task on a mixed-machine HC system, the task must be decomposed into a collection of subtasks, where each subtask is a homogeneous code block, such that the computations within a given code block have similar processing requirements (see Fig. 5.3). These homogeneous code blocks are then assigned to different types of machines to minimize the overall execution time. Task profiling is a method used to estimate the code's execution time on a particular machine and also to identify the code's type [87]. Code types that can be identified include vectorizable, SIMD/MIMD parallel, scalar, and special purpose (such as fast Fourier transform). Traditional program profiling involves testing a program, assumed to consist of several modules, by executing it on suitable test data. The profiler monitors the execution of the program and gathers statistics, including the execution time of each program module. This information is then used to modify the modules to improve the overall execution time.

Analytical benchmarking is a test procedure that measures how well the available machines in the heterogeneous suite perform on a given code type [87]. While code-type profiling identifies the type of code, analytical benchmarking ranks the available machines in terms of their efficiency in executing a given code type. Experimental results obtained by analytical benchmarking show that SIMD machines are well suited for operations such as matrix computations and low-level image processing, while MIMD machines are most efficient when an application can be partitioned into a number of tasks that have limited intercommunication [5]. It can be noted that analytical benchmark results are used in partitioning and mapping.

5.3 Matching and Scheduling for HC systems

For HC systems, matching involves deciding on which machine each code block should be executed, and scheduling involves deciding when to execute a code block on the machine to which it was mapped. Mapping and scheduling for general distributed computing systems have focused on how to effectively execute multiple subtasks across a network of sequential processors. In such an environment, a load-balancing scheme can be an effective way to improve system throughput. However, there is a fundamental distinction between mapping and scheduling for distributed systems consisting of sequential processors and mapping and scheduling for HC systems consisting of various types of heterogeneous computers. In the latter case, a subtask may execute most effectively on a particular type of parallel architecture, and matching subtasks to machines of the appropriate type is a more important factor than merely balancing the load across systems. Matching and scheduling can also be viewed as a mechanism used to efficiently and effectively manage the access and use of a resource by its various consumers under specific policies [88]. In the context of HC systems, the consumers are represented by the code blocks.
The resources include the suite of computers, the network that interconnects these computers, and the I/O devices. The policy is the set of rules used by the scheduler to determine how to allocate resources to consumers, based on knowledge of the availability of the resources and the suitability of the available resources for each consumer.

Chapter 6 Support vector machine

In this chapter, we introduce the central ideas of the Support Vector Machine (SVM) and describe the support vector machine-particle swarm optimization (SVMpso) we implemented using the Java programming language.

6.1 Overview of Support Vector Machine

In this section, we present an overview of SVM, as related to this research, for solving a binary classification problem. The SVM was originally proposed as a statistical learning theory based on the principle of structural risk minimization, and has recently gained wide acceptance, especially in the area of pattern recognition [14], [19]. An example application is recognizing a pattern in text documents, e.g., classifying text documents automatically into pre-defined categories on the basis of previously seen patterns [102]. SVM is basically a supervised learning method in which samples are required to learn a specific pattern. A sample x_i and its label y_i are usually represented as a tuple (x_i, y_i), where x_i ∈ R^N has N-dimensional attributes, y_i ∈ {−1, +1}, i = 1, ..., l, and l is the number of samples. (A list of relevant symbols has been presented in Table 6.1 for quick reference.) Learning from these samples means finding a general function f : R^N → {−1, +1} which best estimates the pattern of the samples and can classify unknown samples (x, y), generated from the same underlying probability distribution P(x, y), into +1 or −1. Such a general function can be expressed in the hyperplane form as:

(w · x) + b = 0,   w ∈ R^N, b ∈ R   (6.1)

Figure 6.1: Example of large margin hyperplane with support vectors circled

where the vector w is orthogonal to the hyperplane, and varying b moves the hyperplane along w. To classify an unknown input vector x in a binary fashion, we define the function f as simply yielding the sign (+ or −) of the function defined above:

f(x) = sign((w · x) + b)   (6.2)

Fig. 6.1 shows hyperplanes separating two distinct classes of samples, represented by dots (•) and crosses (×). The support vectors are circled. There may exist an infinite number of hyperplanes separating the samples. Training the SVM amounts to finding the optimal hyperplane among these hyperplanes. The optimal hyperplane yields the maximal margin between the two distinct classes of samples, and is parallel to the two hyperplanes passing through the support vectors, as shown in Fig. 6.1. The optimal hyperplane can be found directly within the input space if the samples are separable, otherwise in a higher-dimensional feature space where they may be easily separable. Whether or not samples are separable is determined by the distribution of the samples. If non-linear samples are inseparable in the input space, we first transform the coordinates via a mapping function to a higher-dimensional feature space, as shown in Fig. 6.2, and then find the optimal hyperplane in the feature space. In the feature space, the dots and crosses represent the '+' or '−' outcomes of the evaluation of f in equation (6.2).
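As a purely illustrative fragment (not part of the SVMpso implementation), the Java sketch below evaluates the linear decision function of Equation (6.2); the weight vector and bias are invented values.

    // Illustrative sketch of the linear decision function f(x) = sign((w . x) + b) of Eq. (6.2).
    public class LinearDecision {
        static int classify(double[] w, double b, double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) dot += w[i] * x[i];    // (w . x)
            return (dot + b) >= 0 ? +1 : -1;                          // sign((w . x) + b)
        }

        public static void main(String[] args) {
            double[] w = {0.8, -0.5};     // hypothetical hyperplane normal
            double b = 0.1;               // hypothetical bias
            System.out.println(classify(w, b, new double[]{1.0, 0.2}));    // prints 1
            System.out.println(classify(w, b, new double[]{-1.0, 1.0}));   // prints -1
        }
    }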
Table 6.1: List of relevant symbols

  Φ   Function mapping from input to feature space
  α   Lagrange multiplier
  K   Radial Basis Function kernel
  x   Input vector
  y   Label for input vector
  n   Number of tasks
  X   Support vectors
  s   Number of support vectors
  l   Number of samples
  m   Number of machines
  p   Number of tasks within a batch
  q   Number of batches
      Heterogeneity of negative samples
      Heterogeneity of positive samples

In order to find the optimal hyperplane, we have to solve the constrained optimization problem [19] which maximizes the margin 2/||w|| (equivalently, minimizes ||w||) subject to:

y_i (x_i · w + b) ≥ 1  for all i   (6.3)

Using the Lagrangian, one can reformulate this optimization problem as a convex optimization problem, which consists of minimizing L_D with respect to α:

L_D = (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j) − Σ_{i=1}^{l} α_i   (6.4)

subject to the constraint:

Σ_{i=1}^{l} α_i y_i = 0   (6.5)

where x_i represents the sample vector, y_i its class, α_i the Lagrange multiplier of x_i, and C (0 ≤ α_i ≤ C) is a parameter that determines the upper boundary for training errors (l is the number of samples, as stated earlier). A larger C penalizes training errors more heavily. We used C = 1, which provides normalized values for α. Those training vectors for which α_i > 0 are termed support vectors.

6.2 Linear Support Vector Machines

Samples are linearly separable if it is possible to partition the space between the two different classes of samples using straight lines, as shown in Fig. 6.3. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin, as shown in Fig. 6.1.

6.2.1 Non-linear Support Vector Machines

Samples are non-linearly separable if the two different classes of samples are separable, but it is not possible to partition them using straight lines, as shown in Fig. 6.4. In the case where the decision function is not a linear function of the sample data, a kernel trick can be used to accomplish the separation in an astonishingly straightforward way. For the non-linearly separable case, we map the training data nonlinearly into a higher-dimensional feature space via a mapping function Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision function in input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

6.3 Kernels

The use of kernel functions provides a powerful and principled way of detecting nonlinear relations using well-understood linear algorithms in an appropriate feature space. The kernel maps the data into some other dot-product space F, called the feature space, via a nonlinear map

Φ : R^N → F,   (6.6)

and the linear algorithm is performed in F. This only requires the evaluation of dot products:

K(x, y) := (Φ(x) · Φ(y))   (6.7)

If F is high-dimensional, the evaluation of dot products will be very expensive. There is a computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces to ensure adequate representational power. The shortcut is what we call a kernel function. In some cases, we can use simple kernels that can be evaluated efficiently. For example, the polynomial kernel

K(x, y) = (x · y)^d   (6.8)

can be shown to correspond to a map into the space spanned by all products of exactly d dimensions of the input. For d = 2 and x, y ∈ R^2, for example, we have

(x · y)^2 = ((x_1, x_2) · (y_1, y_2))^2 = ((x_1^2, √2 x_1 x_2, x_2^2) · (y_1^2, √2 y_1 y_2, y_2^2)) = (Φ(x) · Φ(y)),   (6.9)

defining Φ(x) = (x_1^2, √2 x_1 x_2, x_2^2). For every kernel we can construct a map Φ such that Equation (6.7) holds.
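To make the kernel shortcut concrete, the following illustrative Java fragment checks numerically that, for the degree-2 polynomial kernel in two dimensions, (x · y)^2 equals the dot product of the explicitly mapped vectors Φ(x) and Φ(y) of Equation (6.9); it is a verification of the identity only, with invented sample values.

    // Illustrative check of Eq. (6.9): (x . y)^2 == Phi(x) . Phi(y)
    // with Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
    public class KernelTrickDemo {
        static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        static double[] phi(double[] x) {            // explicit feature map for d = 2
            return new double[]{ x[0] * x[0], Math.sqrt(2) * x[0] * x[1], x[1] * x[1] };
        }

        public static void main(String[] args) {
            double[] x = {1.5, -2.0};
            double[] y = {0.5,  3.0};
            double kernel  = Math.pow(dot(x, y), 2);    // evaluated in input space
            double feature = dot(phi(x), phi(y));       // evaluated in feature space
            System.out.println(kernel + " ~= " + feature);   // both approximately 27.5625
        }
    }

Computing K(x, y) directly avoids ever materializing Φ, which is what makes very high-dimensional feature spaces, such as that induced by the RBF kernel below, tractable.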
Radial Basis Function (RBF) kernels and sigmoid kernels are defined as follows. RBF kernels:

K(x, y) = exp(−||x − y||^2 / (2σ^2))   (6.10)

Sigmoid kernels:

K(x, y) = tanh(κ (x · y) + θ)   (6.11)

In input space, the hyperplane corresponds to a nonlinear decision function whose form is determined by the kernel (see Fig. 6.5).

Figure 6.2: Non-linear samples inseparable in the input space, rendered separable in feature space
Figure 6.3: Linear samples
Figure 6.4: Non-linear samples
Figure 6.5: Decision boundary and support vectors when using a Gaussian kernel

Chapter 7 Support Vector Machine & Particle Swarm Optimization

In this chapter, we will introduce a Particle Swarm Optimization (PSO) method which can be employed to solve the quadratic programming problem that arises when creating a support vector machine. We will also discuss a modification for preventing premature convergence, a problem commonly observed in evolutionary computation algorithms such as PSO and Genetic Algorithms (GA).

7.1 Particle Swarm Optimization

Image synthesis of dynamic objects such as clouds, smoke, water, and fire is known to be difficult due to their fuzziness. Researchers have studied particle systems in which particles have their own behavior [41] and have tried to simulate such natural dynamic objects. Many scientists have also been interested in the movement of a flock of birds or a school of fish to discover the underlying rules that make their aggregate motion possible. The aggregate motion enables them to find food quickly, and protects them from predators through early detection and the spread of information. Intrigued by such social behavior, Kennedy and Eberhart [31] developed the particle swarm optimization (PSO) method to apply this natural behavior to problem solving. In PSO, particles simulate the social behavior of a flock of birds, which cooperate with one another to find food (the goal), in such a way that each one remembers its own best location ever visited, called the "local best," and shares this local information with its neighbors to identify the "global best" location within the group. The local information corresponds to individual cognition and the global information to social interaction. Using the local and global information, a swarm of particles is able to cooperate and explore the solution space effectively to find an optimal solution. One of the advantages of PSO is the particles' prompt convergence on solutions, exploring the solution space at high speed like a flock of birds moving fast yet in astonishingly perfect harmony. However, if convergence happens prematurely, the solution obtained may be only a local one. This drawback of PSO can be attributed to its lack of capability to indicate premature convergence. The premature convergence phenomenon is commonly observed in evolutionary methods such as Genetic Algorithms (GA). In the case of a GA, it is known that high selection pressure, in choosing only superior offspring for new generations, results in premature convergence. This is because the high selection pressure restricts the diversity of the new population, limiting the search to local areas of the solution space. Therefore, a search may get stuck in a local optimum. Riget and Vesterstrøm attributed the premature convergence of PSO to the fast information flow between particles, which causes particles to cluster early around local minima [43].
Since low diversity among the population is considered the main cause of premature convergence, many researchers [105] [112] [103] have tried to avoid premature convergence by providing sufficient diversity to the particles at the first indication of premature convergence. However, it remains a very difficult problem to identify the sign of premature convergence. Furthermore, such a reactive approach may find it difficult to escape local minima in optimization problems. In this research, we propose a novel cluster-PSO (CPSO) using the self-organizing map (SOM), an unsupervised learning method based on "Kohonen vectors" [45]. Mohan and Al-kazemi [103], and Seo et al. [117], used clustered particles to search the solution space concurrently. However, their search may not be adaptive to the type of solution space, since the population of each group is static. On the other hand, CPSO using the SOM technique is able to make the search adaptive to the solution space by dynamically updating the population of clustered particles. SOM, being an unsupervised learning method, is able to cover a large solution space effectively by periodically clustering particles around randomly created Kohonen vectors, and thereby prevents particles from getting stuck in local optima.

7.1.1 Background

The PSO model simulates a swarm of particles moving in an n-dimensional solution space, where a particle corresponds to a candidate solution characterized by n attributes. A particle is represented in the solution space by its position vector x_i and a velocity vector v_i. The velocity of the i-th particle of the swarm and its projected position in the d-th dimension are defined by the following two equations:

v_id = v_id + c_1 · rand() · (l_id − x_id) + c_2 · rand() · (g_id − x_id)   (7.1)

x_id = x_id + v_id   (7.2)

where i = 1, ..., n and n is the size of the swarm, and d = 1, ..., m and m is the number of dimensions of the solution space. Each particle remembers its own best position in l_id. Furthermore, the best position found by any particle is stored in g_i to share the information with neighbors within the swarm. The new velocity of a particle is determined by (7.1), where c_1 and c_2 control the learning rates of particles for individual cognition and social interaction, respectively, and rand() is a random function in the range [0, 1]. Shi and Eberhart [113] introduced an inertia weight parameter ω into the basic PSO:

v_id = ω · v_id + c_1 · rand() · (l_id − x_id) + c_2 · rand() · (g_id − x_id)   (7.3)

where ω determines the magnitude of the old velocity v_id. They found the range [0.9, 1.2] a good area from which to choose ω. Many researchers have tackled the premature convergence problem in PSO. They tried to overcome the premature convergence problem in a reactive way that provides diversity to particles at the indication of premature convergence so as to escape a local optimum. Krink and Riget [105] provided diversity to particles upon detecting their collision, which was determined based on the radius between particles; the colliding particles then bounced away. The direction in which particles should bounce away was determined randomly or by simple velocity-line bouncing, in which particles continue to move in the direction of the old velocity vector with a scaled speed, resulting in a U-turn. The tailored PSO outperformed the basic PSO for several benchmark functions. However, the reactive method may lead to trouble for multi-objective problems.
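For reference, a minimal Java sketch of the inertia-weight update of Equations (7.2) and (7.3), the baseline that the reactive variants above modify, is given below; the swarm size, parameter values, and sphere objective are arbitrary choices for illustration and are not taken from our simulations.

    import java.util.Random;

    // Illustrative sketch of the basic inertia-weight PSO update (Eqs. 7.2 and 7.3)
    // minimizing a simple sphere objective; all parameter values are example choices.
    public class BasicPso {
        public static void main(String[] args) {
            final int n = 20, dims = 2, generations = 100;
            final double w = 0.9, c1 = 2.0, c2 = 2.0;     // inertia weight and learning rates
            Random rnd = new Random(42);

            double[][] x = new double[n][dims], v = new double[n][dims], lbest = new double[n][dims];
            double[] lbestVal = new double[n];
            double[] gbest = new double[dims];
            double gbestVal = Double.MAX_VALUE;

            for (int i = 0; i < n; i++) {
                for (int d = 0; d < dims; d++) x[i][d] = rnd.nextDouble() * 10 - 5;
                lbest[i] = x[i].clone();
                lbestVal[i] = fitness(x[i]);
                if (lbestVal[i] < gbestVal) { gbestVal = lbestVal[i]; gbest = x[i].clone(); }
            }
            for (int g = 0; g < generations; g++) {
                for (int i = 0; i < n; i++) {
                    for (int d = 0; d < dims; d++) {
                        v[i][d] = w * v[i][d]
                                + c1 * rnd.nextDouble() * (lbest[i][d] - x[i][d])   // cognitive term
                                + c2 * rnd.nextDouble() * (gbest[d]   - x[i][d]);   // social term
                        x[i][d] += v[i][d];                                         // Eq. (7.2)
                    }
                    double f = fitness(x[i]);
                    if (f < lbestVal[i]) { lbestVal[i] = f; lbest[i] = x[i].clone(); }
                    if (f < gbestVal)    { gbestVal = f;    gbest = x[i].clone(); }
                }
            }
            System.out.println("best value found: " + gbestVal);
        }

        static double fitness(double[] p) {          // sphere function, minimized at the origin
            double s = 0.0;
            for (double c : p) s += c * c;
            return s;
        }
    }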
Once converged at a local optimum, particles, which by their nature cluster around the local best, would struggle to escape the local optimum without enormous added diversity. On the other hand, CPSO explores the solution space concurrently by explicitly clustering particles around randomly created sample vectors, and is thus able to escape local optima in multi-objective problems. Wei et al. presented Elite Particle Swarm with Mutation (EPSM) [111]. EPSM tries to take full advantage of outstanding particles in order to avoid wasting a considerable amount of time visiting regions of the solution space with poor fitness values. To do this, particles with poor fitness are substituted by elite particles with better fitness. In EPSM the diversity of particles is decreased due to the elitism; to compensate for the reduced diversity, a mutation operator is employed so that the global best individual may be mutated to generate a new particle. In contrast to this elitism, Wang and Qiu [112] tried to give inferior particles opportunities to search the solution space. Their approach was motivated by the observation that a search process is very likely to be dominated by several super particles, which often turn out not to be really good in the long term. In order to alleviate the dominance of super particles, they introduced a roulette-wheel selection operator in which the selection probability of a particle is ensured to be in inverse proportion to its original fitness; particles are then chosen in the roulette-wheel manner, which is expected to mitigate the high selection pressure exerted by super particles. Their approach outperformed other published algorithms in terms of solution quality, at the cost of additional computational time for fitness scaling and roulette selection. Veeramachaneni and Osadciw [106] claimed that particles by nature oscillate between local optima and a global optimum, wasting most of their time moving in the same direction to converge at a global optimum. They made particles attracted toward the best positions visited by their neighbors, whereby particles are influenced by successful neighbors to explore all other possible regions of the solution space. This algorithm was improved by concurrent PSO algorithms [107] in which two particle groups work concurrently, each group tracing particles independently and sharing the information about the best particle. The Multi-Phase Particle Swarm Optimization algorithm employed multiple groups of particles, each changing its search direction every phase to increase population diversity [103].

7.1.2 CPSO model

The CPSO is a modified version of PSO with an additional process of clustering particles. In the CPSO model, particles are periodically clustered around sample vectors using SOM to provide particles with enough diversity to prevent them from prematurely converging to local optima. The CPSO procedure is described in Procedure 2.

Procedure 2: The Procedure of CPSO
Step 1) Initialize particles.
Step 2) Randomly create sample vectors (particles) y_i in the solution space, where 1 ≤ i ≤ k, k being the maximum number of sample vectors.
Step 3) Traverse each particle p_i, 0 ≤ i < n, to find the best matching particle (BMP) for each sample vector, measuring the similarity between the sample vector and the particles using Euclidean distance.
Step 4) Update the velocity of particles in the neighborhood of the BMP by pulling them closer to the sample vectors using the following formula:

V_p(t + 1) = V_p(t) + h(p, t) · η(t) · (y(t) − V_p(t))   (7.4)

Step 5) Evolve particles using PSO.
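A minimal, illustrative Java sketch of the clustering update in Step 4 (Eq. (7.4)) follows; the Gaussian neighborhood radius and learning rate are example values only and do not reproduce the actual CPSO implementation.

    // Illustrative sketch of Step 4 of the CPSO procedure: pulling the velocity of
    // particles near a best-matching particle (BMP) toward a random sample vector,
    // following the form of Eq. (7.4). Schedules and radii are example choices only.
    public class CpsoClusterStep {
        // Gaussian neighborhood: close to 1 for particles near the BMP, decaying with distance.
        static double neighborhood(double[] particle, double[] bmp, double radius) {
            double d2 = 0.0;
            for (int i = 0; i < particle.length; i++) {
                double diff = particle[i] - bmp[i];
                d2 += diff * diff;
            }
            return Math.exp(-d2 / (2.0 * radius * radius));
        }

        // Eq. (7.4): V_p(t+1) = V_p(t) + h(p,t) * eta(t) * (y(t) - V_p(t))
        static void pullToward(double[] velocity, double[] position, double[] bmp,
                               double[] sample, double eta, double radius) {
            double h = neighborhood(position, bmp, radius);
            for (int d = 0; d < velocity.length; d++) {
                velocity[d] = velocity[d] + h * eta * (sample[d] - velocity[d]);
            }
        }

        public static void main(String[] args) {
            double[] v = {0.0, 0.0};                       // current velocity of one particle
            double[] pos = {1.0, 1.0};                     // its position
            double[] bmp = {1.2, 0.9};                     // best matching particle for the sample
            double[] sample = {4.0, -3.0};                 // randomly created sample vector
            pullToward(v, pos, bmp, sample, 0.5, 2.0);     // example learning rate and radius
            System.out.println(java.util.Arrays.toString(v));
        }
    }

In the full procedure this clustering step alternates with ordinary PSO evolution (Step 5), so particles are periodically re-seeded around fresh sample vectors.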
In Eq. (7.4), V_p(t + 1) is the new velocity of particle p, η(t) controls the learning rate, where t is the generation number of the particles, and h(p, t) is the neighborhood function, which determines the degree of neighborhood between the BMP and particle p. We took a Gaussian function as the neighborhood function for particles, which denotes the lateral particle interaction and the degree of excitation of the particle. The Gaussian function, which returns values between 0 and 1, is commonly used as a simple model simulating a large number of random values. The Gaussian function for a particle returns a value close to 1 if the particle is close to the BMP; particles for which the Gaussian function returns values close to 1 are considered neighbors of the BMP. The number of neighbors is reduced as the generation number grows. From Step 1 through Step 4, particles are clustered around sample vectors. This process enables particles to be relocated around the sample vectors, thus giving them a chance to explore a new region of the solution space which may contain an optimal or near-optimal solution. In Step 5, each cluster of particles is evolved using PSO. These processes of clustering and evolving particles are iterated until the generation limit is exhausted or a stopping condition, which identifies no further improvement in fitness on the objective functions, is met. Fig. 7.1 shows particles (clear circles) moving toward three randomly generated sample vectors (dark circles) to form clusters.

7.1.3 Simulations & Results

To run the simulations, we implemented the PSO and CPSO using the Java programming language. In the simulations, PSO and CPSO were run on three well-known objective functions, namely DeJong's F2, Schaffer's F6, and Rastrigin's functions (also used by Kennedy [115]). The optimization performance of PSO and CPSO was compared. The objective functions are described as follows:

f(x, y) = 100(x^2 − y)^2 + (1 − x)^2,  (−2.048