ENERGY AWARE TASK SCHEDULING ON HETEROGENEOUS SYSTEMS

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classified information.

Rabab Farouk Abdel-Kader

Certificate of Approval:

Sanjeev Baskiyar, Chair, Associate Professor, Computer Science and Software Engineering
Cheryl Seals, Assistant Professor, Computer Science and Software Engineering
Levent Yilmaz, Assistant Professor, Computer Science and Software Engineering
George T. Flowers, Interim Dean, Graduate School

ENERGY AWARE TASK SCHEDULING ON HETEROGENEOUS SYSTEMS

Rabab F. Abdel-Kader

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 17, 2007

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author
Date of Graduation

DISSERTATION ABSTRACT
ENERGY AWARE TASK SCHEDULING ON HETEROGENEOUS SYSTEMS
Rabab F. Abdel-Kader
Doctor of Philosophy, December 17, 2007
(M.S., Tuskegee University, 2002)
(B.S., Suez Canal University, 1998)
128 Typed Pages
Directed by Sanjeev Baskiyar

We consider the problem of scheduling directed a-cyclic task graphs (DAGs) on heterogeneous distributed processor systems with the twin objectives of minimizing finish time and energy consumption. Previous scheduling heuristics have assigned DAGs to processors to minimize the overall run time of applications. However, many new applications on embedded systems, such as high-performance DSP in image processing, multimedia, and wireless security, create a strong need for scheduling algorithms that lower energy consumption and yet attain good finish times. In this research, we employ dynamic voltage scaling (DVS) within the scheduling heuristics to achieve the twin objectives. The processors used can run at different discrete operating voltages; whenever they idle due to task dependencies, processors can scale down their voltages and slow down in order to reduce energy consumption. Specifically, we combine Decisive Path Scheduling (DPS) and Heterogeneous N-Predecessor Duplication (HNPD) with DVS. Using simulations, we show average energy consumption reductions of 40% over DPS and 28% over HNPD. The simulations used a large number of randomly generated DAGs with various characteristics as well as DAGs of real-world problems. Energy savings increased with an increasing number of nodes or increasing Communication to Computation Ratios (CCR), whereas they decreased with increasing parallelism (out-degree) or an increasing number of available processors. Increasing the number of nodes increases task dependencies and thus idle times, and when CCR increases, processors are idle longer due to communication between tasks; our algorithms used such idle times to achieve energy savings. An increase in out-degree resulted in smaller average energy savings, because a larger out-degree allows more processors to run in parallel, reducing idle times.
Style manual: IEEE Standard. Software used: Microsoft Word 2007, Microsoft Excel 2007, Microsoft Visual Studio 2003, and C++.

TABLE OF CONTENTS

LIST OF FIGURES ... ix
LIST OF TABLES ... xiii
1. INTRODUCTION ... 1
1.1 Contributions ... 3
2. RELATED WORK ... 5
2.1 Power Estimation Techniques ... 6
2.2 Power Optimization Techniques ... 8
2.3 Software power optimization on system level ... 9
2.4 Scheduling to lower energy consumption ... 10
3. DYNAMIC VOLTAGE SCALING ... 15
3.1 Power Management Implementation ... 19
4. DIRECTED A-CYCLIC GRAPH ... 21
4.1 Performance ... 24
5. SCHEDULING ALGORITHMS ... 28
5.1 EADAGS algorithm ... 28
5.2 EAGS-D algorithm ... 37
5.3 Voltage scaling strategies ... 46
6. RESULTS AND DISCUSSION ... 51
6.1 Results for random DAGs ... 51
6.1.1 Results for EADAGS ... 52
6.1.2 Results for EAGS-D ... 60
6.2 Results for real world problems ... 68
6.2.1 Gaussian elimination ... 68
6.2.2 Molecular dynamic code ... 79
6.2.3 Fast Fourier transforms ... 88
6.2.4 Sieve of Eratosthenes ... 98
7. CONCLUSIONS ... 108
BIBLIOGRAPHY ... 110

LIST OF FIGURES

3.1. Different operating rates for the same task ... 17
3.2. Voltage scaling graph ... 19
4.1. A sample DAG ... 21
5.1. EADAGS algorithm ... 36
5.2. Timeline for machine k ... 36
5.3. EAGS-D algorithm ... 45
5.4. Timeline for machine k (case 1) ... 45
5.5. Timeline for machine k (case 2) ... 46
5.6. Timeline for machine k (case 3) ... 46
5.7. Different finish times with different execution rates ... 49
6.1. Average power savings with respect to number of nodes ... 53
6.2. Average power savings with respect to CCR ... 55
6.3. Average power savings with respect to PNR ... 56
6.4. Average power savings with respect to shape parameter ... 57
6.5. Average power savings with respect to out-degree ... 59
6.6. Average power savings with respect to number of nodes ... 61
6.7. Average power savings with respect to CCR ... 63
6.8. Average power savings with respect to PNR ... 64
6.9. Average power savings with respect to shape parameter ... 65
6.10. Average power savings with respect to out-degree ... 67
6.11. Gaussian elimination task graph ... 71
6.12. Average power savings for Gaussian elimination algorithm with EADAGS with different number of processors ... 73
6.13. Average power savings for Gaussian elimination algorithm with EADAGS with different CCR values ... 73
6.14. Average power savings for Gaussian elimination algorithm with EAGS-D with different number of processors ... 76
6.15. Average power savings for Gaussian elimination algorithm with EAGS-D with different CCR values ... 77
6.16. Average power savings for Gaussian elimination algorithm with EADAGS and EAGS-D with voltage scaling levels ... 78
6.17. Directed acyclic graph (DAG) for a molecular dynamics code ... 80
6.18. Average power savings for molecular dynamic code with EADAGS with different number of processors ... 82
6.19. Average power savings for molecular dynamic code with EADAGS with different CCR values ... 83
6.20. Average power savings for molecular dynamic code with EAGS-D with different number of processors ... 85
6.21. Average power savings for molecular dynamic code with EAGS-D with different CCR values ... 86
6.22. Average power savings for molecular dynamic code with EADAGS and EAGS-D with voltage scaling levels ... 88
6.23. The generated DAG for FFT with four points ... 90
6.24. Average power savings for FFT with EADAGS with different number of processors ... 92
6.25. Average power savings for FFT with EADAGS with different CCR values ... 93
6.26. Average power savings for FFT with EAGS-D with different number of processors ... 95
6.27. Average power savings for FFT with EAGS-D with different CCR values ... 96
6.28. Average power savings for FFT with EADAGS and EAGS-D with voltage scaling levels ... 97
6.29. Sieve of Eratosthenes task graph for N=32 ... 99
6.30. Average power savings for different number of processors for Sieve of Eratosthenes with EADAGS ... 101
6.31. Average power savings for different CCR values for Sieve of Eratosthenes with EADAGS ... 102
6.32. Average power savings for Sieve of Eratosthenes with EAGS-D with different number of processors ... 104
6.33. Average power savings for Sieve of Eratosthenes with EAGS-D with different CCR values ... 105
6.34. Average power savings for Sieve of Eratosthenes with EADAGS and EAGS-D with voltage scaling levels ... 107

LIST OF TABLES

3.1. The relationship between clock frequency, supply voltage, and power dissipation of Transmeta's Crusoe TM5400 microprocessor ... 18
5.1. EADAGS algorithm ... 32
5.2. EAGS-D algorithm ... 40
6.1. Makespan and average power savings with respect to number of nodes for EADAGS ... 54
6.2. Makespan and average power savings for different CCR values for EADAGS ... 55
6.3. Makespan and average power savings for different shape parameters for EADAGS ... 58
6.4. Makespan and average power savings with respect to out-degree for EADAGS ... 59
6.5. Makespan and average power savings with respect to number of nodes for EAGS-D ... 61
6.6. Makespan and average power savings for different CCR values for EAGS-D ... 63
6.7. Makespan and average power savings for different shape parameters for EAGS-D ... 66
6.8. Makespan and average power savings with respect to out-degree for EAGS-D ... 67
6.9. Gaussian elimination algorithm ... 69
6.10. Makespan and average power savings for different number of processors for Gaussian elimination with EADAGS ... 74
6.11. Makespan and average power savings with respect to CCR for Gaussian elimination with EADAGS ... 75
6.12. Makespan and average power savings for different number of processors for Gaussian elimination with EAGS-D ... 77
6.13. Makespan and average power savings with respect to CCR for Gaussian elimination with EAGS-D ... 78
6.14. Makespan and average power savings for different number of processors for molecular dynamic code with EADAGS ... 82
6.15. Makespan and average power savings with respect to CCR for molecular dynamic code with EADAGS ... 84
6.16. Makespan and average power savings for different number of processors for molecular dynamic code with EAGS-D ... 85
6.17. Makespan and average power savings with respect to CCR for molecular dynamic code with EAGS-D ... 87
6.18. Makespan and average power savings with respect to number of processors for FFT with EADAGS ... 92
6.19. Makespan and average power savings with respect to CCR for FFT with EADAGS ... 94
6.20. Makespan and average power savings for different number of processors for FFT with EAGS-D ... 96
6.21. Makespan and average power savings with respect to CCR for FFT with EAGS-D ... 97
6.22. Makespan and average power savings for different number of processors for Sieve of Eratosthenes with EADAGS ... 101
6.23. Makespan and average power savings for different CCR values for Sieve of Eratosthenes with EADAGS ... 103
6.24. Makespan and average power savings for different number of processors for Sieve of Eratosthenes with EAGS-D ... 104
6.25. Makespan and average power savings for different CCR values for Sieve of Eratosthenes with EAGS-D ... 106

CHAPTER 1
INTRODUCTION

We consider scheduling on heterogeneous distributed computing systems interconnected by high-speed networks. Such systems are promising for fast processing of computationally intensive applications with diverse computation needs. One of the challenges in heterogeneous computing is to develop scheduling algorithms that assign the tasks of applications to processors [Reut97]. Therefore, researchers have proposed many static, dynamic, and even hybrid algorithms to minimize the execution time of applications running on a heterogeneous system [Iver98][Kwok99][Radu00][Seig97][Topc99][Wang97].

Another challenge facing distributed computing is energy consumption [Dong01]. There are many applications which require both low finish time and low energy consumption. Energy consumption is a major issue in many real-time distributed embedded systems. Furthermore, most applications running on an energy-limited system inherently constrain the finish time. Low-cost, low-energy sensor networks composed of Smart Dust motes [Smar06] are examples of such systems, and new wireless communication systems are expected to evolve using this technology. These networks are distributed networks operating under energy constraints, also called power-aware distributed systems (PADS). Hence there is a need for scheduling algorithms which would effectively reduce the overall energy consumed and yet attain the best possible finish time.

We consider the problem of scheduling a directed a-cyclic task graph (DAG) on a heterogeneous distributed processor system with the twin objectives of minimizing finish time and energy consumption. The DAG structure is important as it occurs in many regular and irregular applications, in forms such as Cholesky factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms, and instruction-level parallelism. Low-energy schedules can help run such applications in multi-hop sensor radio networks.

Traditionally, priority has been on performance, and consequently the supply voltage has been set at the maximum allowable level based on device breakdown potentials to enable fast operation. However, applications may not require the maximum achievable speed at all times. The top energy consumers in a computer system [Kump94] are the display (68%), disk (20%), and CPU (12%). There seems little that can be done to minimize screen energy consumption, beyond employing a screen saver and relying on hardware improvements. Disk energy consumption is minimized by spinning down the disk when it has been inactive. However, in the future, we may well see ubiquitous computing devices with neither disks nor conventional displays. For such devices, minimizing the energy consumed by the CPU will be critical if the replacements of disks and displays consume relatively smaller fractions of total energy.
A study by Argonne National Laboratory indicated that a 2.5-petaflop supercomputer, made of over a hundred thousand CPUs, would be available by 2010. The study predicted that such a system would cost $16 million and would require 8 megawatts of power to operate, at a cost of about $8 million per year. Hence, high energy prices and rising concerns about the environmental impact of electronic systems highlight the importance of incorporating low-energy design schemes at all levels of such systems. Current microprocessors from AMD, Intel, and Transmeta allow the speed of the microprocessor to be set dynamically. Another study, by NASA in 1998, predicted an energy need of 25 megawatts for Japan's NEC Earth Simulator, which is capable of executing 40 Tflops. That amount increased to 100 megawatts in a more recent study; that much power is enough to light 1.6 million 60-watt light bulbs, the lighting requirement of a small city. Reducing the energy consumption of system components decreases operating expenses. Assuming a rate of $100 per megawatt-hour, a Pflops machine consuming 100 megawatts of power would cost $10,000 per hour, approximately $85 million a year. These estimates do not include air-cooling expenses, which are commonly 40% of a system's operating cost. For such systems, even a small reduction in overall energy consumption would significantly impact operational costs.

1.1 Contributions

In this dissertation we present two algorithms for scheduling directed acyclic graphs on a distributed computing system for low energy. The first proposed algorithm combines Decisive Path Scheduling (DPS) with dynamic voltage scaling for the twin objectives of low energy consumption and minimum execution time; we call it Energy Aware DAG Scheduling (EADAGS). The second proposed algorithm combines Heterogeneous N-Predecessor Duplication (HNPD) with dynamic voltage scaling to minimize both finish time and consumed energy; we identify that algorithm as Energy Aware Graph Scheduling with Duplication (EAGS-D). In both cases, an initial schedule is first computed using either DPS or HNPD for minimum finish time; then the amount of consumed energy is estimated, and the voltage scaling algorithm is simulated to minimize the consumed energy without affecting the finish time.

The remainder of this dissertation is organized as follows. Chapter 2 describes different types of scheduling on both homogeneous and heterogeneous systems, energy estimation and energy optimization techniques, and prior work on scheduling for low energy. Chapter 3 explains the concept of dynamic voltage scaling and how it can be used to reduce the energy consumed in computing systems. Chapter 4 defines the DAG and explains some of the definitions and terminology used by the scheduling algorithms. Chapter 5 introduces both scheduling algorithms presented in this work, EADAGS and EAGS-D. Chapter 6 presents the simulation and analysis of results. Finally, conclusions and suggestions for future work are presented in Chapter 7.

CHAPTER 2
RELATED WORK

Traditional scheduling algorithms did not consider energy consumption; instead, they focused on performance or fairness. Recently, low-energy system design has gained significant attention, largely due to demands from the portable electronics industry. System design for low energy is also very important for other industries such as automotive, telecommunications, and information technology.
This is because low-energy designs can offer significant reductions in system packaging costs and improvements in reliability.

Two main design aspects of scheduling are how to build the scheduling queue and how to choose the best processor. List and cluster scheduling are the primary techniques to schedule tasks on heterogeneous systems. In list scheduling, tasks are ordered in a scheduling queue based on the priority assigned to free tasks. List scheduling algorithms have been shown to have good cost-performance trade-offs. Cluster scheduling involves merging nodes/paths to form clusters that can be scheduled on the same processor so as to get closer to the objectives of schedule length, number of processors, etc.

Several algorithms for static scheduling on heterogeneous multiprocessor systems are available: Dynamic Level Scheduling (DLS), Generalized Dynamic Level Scheduling (GDLS), Best Imaginary Level (BIL), Mapping Heuristics (MH), Heterogeneous Earliest Finish Time (HEFT), Task Duplication Scheduling (TDS), Static Task Duplication Scheduling (STDS), Fast Critical Path (FCP), and Fast Load Balancing (FLB). Among the above, TDS and STDS employ task duplication to suppress communication, whereas the others do not. A brief description of these algorithms is available in [Topc02]. Heterogeneous N-Predecessor Duplication (HNPD) is based on DPS but adds task duplication. HNPD was proven to outperform two of the best existing heuristics, Heterogeneous Earliest Finish Time (HEFT) and Static Task Duplication Scheduling (STDS), in terms of finish time and the number of processors employed over a wide range of parameters.

Low-energy scheduling research can be classified into two major categories: (1) energy estimation techniques (energy models) and (2) energy optimization techniques.

2.1 Power Estimation Techniques

Existing energy estimation methodologies can be classified based on their level of abstraction, namely instruction level, architecture level, and gate level.

- Instruction level: one application is embedded processing systems, as presented by Nikoladis in [Niko02], which takes an assembly or machine-level program as input and gives an estimate of the energy consumed by that specific program on a specific processing system. This provides an accurate estimate of energy consumption even in the presence of instantaneous energy supply variation.
- Architecture level: provides cycle-by-cycle energy consumption data of the architecture on the basis of the instruction/data flow stream. At the architecture level, a technique presented by Landman uses a black-box energy model for architecture-level components to estimate the energy consumed while preserving the accuracy of gate- or circuit-level estimation [Land94].
- Gate level: Ishehara summarized and compared different techniques for energy estimation and showed that gate-level estimation of energy consumption is the most accurate [Ishe96]. The techniques for energy estimation at the gate level and lower levels of abstraction can be classified into:
  o Simulation-based techniques: the earliest proposed techniques monitored both the supply voltage and current waveforms. These techniques were too slow to handle very large circuits, so other techniques were introduced that assume the supply and ground voltages are constant and estimate only the supply current waveform.
  o Probabilistic techniques: most probabilistic techniques are applicable to combinational circuits only. In these techniques, user-supplied input signal probabilities are propagated into the circuit. To achieve this, special models for the components have to be developed and stored in a module library.
  o Statistical techniques: they do not require any specialized model for the components. The idea is to simulate the circuit with randomly generated input vectors until the energy converges to the average energy. The convergence is tested by statistical mean estimation techniques, as in the sketch after this list.
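To make the statistical approach concrete, the following minimal C++ sketch (not part of the original work; the "circuit" here is a stand-in random model) draws random input vectors, tracks the running mean of the simulated power, and stops once a 95% confidence interval is tight enough:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

// Stand-in for a gate-level simulation: returns the power drawn by the
// circuit for one random input vector. Here it is just a random draw.
double simulate_power(std::mt19937& rng) {
    std::normal_distribution<double> d(5.0, 1.5);  // hypothetical circuit
    return std::max(0.0, d(rng));
}

int main() {
    std::mt19937 rng(42);
    double sum = 0.0, sumSq = 0.0;
    const double z = 1.96, tol = 0.01;             // 95% confidence, 1% error
    long n = 0;
    do {
        double p = simulate_power(rng);
        sum += p; sumSq += p * p;
        ++n;
        if (n < 30) continue;                      // warm up before testing
        double mean = sum / n;
        double var = (sumSq - n * mean * mean) / (n - 1);
        double halfWidth = z * std::sqrt(var / n);
        if (halfWidth < tol * mean) {              // converged to average power
            std::printf("mean power %.3f after %ld vectors\n", mean, n);
            break;
        }
    } while (true);
}

The stopping rule is exactly the "statistical mean estimation" test described above: simulation ends when the estimated mean is unlikely to move by more than the chosen tolerance.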
In these techniques user supplied input signal probabilities are propagated into the circuit. To achieve this, special models for the components have to be developed and stored in a module library. o Statistical techniques: they do not require any specialized model for the components. The idea is to simulate the circuit with randomly generated input vectors until energy converges to the average energy. The convergence is tested by statistical mean estimation techniques. 2.2 Power Optimization Techniques Competition is driving the requirement for energy optimization. Systems are designed with low energy consumption as one of the important criteria. Energy optimization can be achieved through both hardware and software. ? Hardware Optimization: o Behavior level: transformations, scheduling, resource allocation, etc. o Architecture level: low energy flip-flops, low energy adder, etc. o Circuit level: low energy circuits. ? Software Optimization o Instruction level: low energy compiling, low energy instruction scheduling. 9 o System level: Dynamic energy management, low energy memory management, etc. 2.3 Software energy optimization on system level There are two techniques that can reduce energy consumption on system level scheduling: Dynamic Energy Management (DPM) and Dynamic Voltage Scaling (DVS). The DPM technique dynamically reconfigures an electronic system by reducing number of active components and/or load on such components while providing services. DPM is used in various forms usually in portable devices. However, the complexity of interfacing heterogeneous components has limited designers to simple solutions in DPM. An example of a simple policy, mostly applied to laptops and PDA, is a timeout policy, which turns off a component after a fixed inactivity period, under the assumption that it is highly likely that a component remains idle if it has been idle for the timeout period. The fundamental premise for the applicability of DPM is that systems (and their components) experience non-uniform workloads during operation time. Such an assumption is valid for most systems, both when considered in isolation and when systems are connected via internet. A second assumption of DPM is that it is possible to predict, with a certain degree of confidence, the fluctuations of workload. The analytical process of prediction should not consume significant energy. Typically, a energy manager (PM) implements a control procedure based on some observations and/or assumptions on the workload. In DVS technique, computation and communication tasks are run at reduced voltages and clock frequencies which fill idle periods but reduce energy dissipation 10 while providing required performance. The key idea of DVS is to dynamically scale the supply voltage of CPU while meeting total computation time and/or throughput. It is a trade-off between processor speed and energy consumption which is especially useful in real-time systems. The energy consumption of a processor running at high speed and high voltage is much larger than running at low speed and low voltage. For example, reducing the supply voltage from 5 V to 3.3 V in some cases has reduced energy by 56% [Pouw01]. DVS essentially fills the slack times by elongated computation or communication times. There are two types of slack times: Worst Slack Time (WST) and Workload- Variation Slack time (WVST). WST results from low processor utilization. WVST occurs due to execution time variations caused by data-dependent computation. 
2.4 Scheduling to lower energy consumption

For uniprocessor real-time systems, many schemes have been proposed to manage energy consumption. Mosse et al. [Moss00] proposed and analyzed several schemes to dynamically adjust processor speed for slack reclamation. They used a compiler to assist the operating system in changing the CPU operating levels to reduce energy consumption.

Weiser [Weis94] discussed several methods for varying the clock speed dynamically under control of the operating system. He showed that by adjusting the clock speed at a fine grain, substantial CPU energy can be saved with a limited impact on performance. Chandrakasan [Chan96] has shown that for periodic tasks, a few voltage/frequency levels are sufficient to achieve almost the same energy savings as infinite voltage/speed levels. Yang [Yang01] proposed a two-phase scheduling scheme, containing a design-time scheduler and a runtime scheduler, that minimizes energy consumption while meeting timing constraints; by choosing different scheduling options at compile time they achieved 20-40% average energy savings. Zhang and Chen [Zhan02] proposed priority-based task mapping and scheduling for a fixed task graph, applying earliest-deadline-first scheduling and formulating the voltage scaling problem as an integer programming problem. They showed that their framework can slow 8% of cycles in a very short time.

The main concern in DVS is to utilize slack time as much as possible, making the resulting energy consumption as low as possible. As noted above, WST results from processor utilization that is less than 100%; low processor utilization is the common case even if all tasks exhibit their worst-case execution times. WVST occurs due to execution time variations caused by data-dependent computation. WST can be estimated from the scheduling results before task execution, whereas WVST can be known only after actual task execution. Shin et al. [Shin01] proposed low-energy, priority-based scheduling which consists of two parts: an off-line component which determines the minimum processor speed while guaranteeing the deadlines of all tasks, and an online component which dynamically varies the processor speed to utilize both WST and WVST.

Shang et al. [Shan03] proposed history-based DVS for interconnection networks. Their technique leverages network history to predict future network needs, judiciously controlling the frequency (and voltage) of links to track actual network utilization. Such mechanisms resulted in 46% average energy savings at the cost of a 15.2% increase in network latency and a 2.5% decrease in network throughput.

Theoretical investigations of speed scaling algorithms were initiated by Yao, Demers, and Shenker [Yao95]. Yao et al. propose formulating speed scaling problems as scheduling problems. They assumed that each task has a release time when it arrives into the system, an amount of work that must be performed to complete the task, and a deadline that specifies the time by which the task should be completed. A schedule specifies which task to run at each time, and at what speed that task should be run. In some settings, for example the playing of a video or other multimedia presentation, there may be natural deadlines for the various tasks imposed by the application.
In other settings, the system may impose deadlines to better manage tasks or to ensure a certain quality of service for each task. They studied the problem of minimizing the total energy used subject to the deadline feasibility constraints.

A few other works have considered energy savings in addition to performance. Lu, Benini, and De Micheli [Lu00] presented a greedy online scheduling algorithm to facilitate energy management for multiple devices. They ordered the execution of tasks so that devices can have continuous long idle periods during which they can be shut down, achieving average energy savings of 33%.

Mishra et al. [Mish03] proposed two novel techniques for energy management in distributed systems. The first is a static technique which uses a greedy algorithm to manage energy in the presence of parallelism. The second technique uses task reallocation, enhancing the first algorithm by allowing out-of-order execution where preemption is allowed. Their techniques saved an average of 10-20% more energy than a simple static energy management technique.

Shiue and Chakrabarti [Shiu00] presented polynomial-time algorithms for (i) resource-constrained scheduling and (ii) latency-constrained scheduling for the case when the resources operate at multiple voltages. Both scheduling schemes try to reduce the overall energy consumption. The resource-constrained scheme balances the conflicting requirements of reducing latency and maximally utilizing resources operating at reduced voltages. The latency-constrained scheme assigns as many nodes as possible to resources operating at low voltages without violating the timing constraint.

Kirovski and Potkonjak [Kiro97] developed a system-level approach for energy minimization of cost-constrained hard real-time designs. The approach simultaneously optimizes all three degrees of freedom for energy minimization, namely switching activity, effective capacitance, and supply voltage.

Srivastava [Sriv96] described three broad architectural approaches for energy-efficient programmable computation: predictive shutdown, concurrency-driven supply voltage reduction, and switching activity reduction. A significant reduction in energy consumption can be obtained by employing these architectural techniques. They showed that an aggressive shutdown strategy based on a predictive technique can reduce energy consumption by a large factor compared to straightforward conventional schemes in which the shutdown decision is based solely on a predetermined idle-time threshold. An example where these techniques apply is the combination of parallelism-driven voltage reduction with switching activity reduction to increase the energy efficiency of memory operations when the access pattern is sequential in nature. Such sequential access patterns occur, for example, when fetching video data from a frame buffer memory or when writing a page of virtual memory back to disk. Instead of accessing data from memory serially, several words can be read at once and the memory can be clocked at a lower rate for the same throughput. If the serial implementation runs at a supply of 3 V to meet a given throughput, the parallel version can run at a supply voltage of 1.3 V while meeting the same throughput requirement.

Im and Ha [ImHa04] proposed energy-efficient real-time multi-task scheduling using buffers with DVS. They saved an average energy of 44% with reasonable machine specifications. The buffers increase CPU utilization by averaging the workload. Their technique was designed for multimedia applications where a slight buffering delay is tolerable.
CHAPTER 3
DYNAMIC VOLTAGE SCALING

For most energy-conscious designs, a major source of energy savings is voltage scaling, which scales the operating voltages of processors and the corresponding maximum clock speeds. The dominant source of energy consumption in digital CMOS circuits is the dynamic power dissipation P, characterized by

P ∝ C V^2 f

where C is the effective switching capacitance, V is the supply voltage, and f is the clock speed [Burd96]. Since power varies linearly with the clock speed and with the square of the voltage, adjusting the voltage can result in significant energy reductions, at least in theory. However, reducing the supply voltage requires a corresponding decrease in clock speed and an increase in task execution latency. The settling time of a gate grows as the voltage drops: the lower the voltage across the gate, the longer the gate takes to stabilize. To lower the voltage and still operate correctly, the clock rate must be lowered first; when raising the clock rate, the voltage must be increased first. Given that the voltage and the cycle time of a chip can be adjusted together, it should be clear that a lower-voltage, slower-clock chip will dissipate less energy per cycle. If the voltage level can be reduced linearly as the clock rate is reduced, the energy savings per instruction will be proportional to the square of the voltage reduction. Of course, for a real chip it may not be possible to reduce the voltage linearly with the clock reduction. However, if it is possible to reduce the voltage at all by running slower, there will be a net energy savings per cycle.

Currently, manufacturers do not test and rate their chips across a smooth range of voltages. However, some data is available for chips at a set of voltage levels. For example, a Motorola CMOS 6805 microcontroller is rated at 6 MHz at 5.0 Volts, 4.5 MHz at 3.3 Volts, and 3 MHz at 2.2 Volts, a close-to-linear relationship between voltage and clock rate. Thus there is seemingly no technical objection to designing a variable-voltage system, provided that the input reference voltage to the processor's voltage regulator is a digital word writable by the processor. The other important factor is the time it takes to change the voltage. The main time cost is for the converter or regulator to ramp the supply voltage up or down; the ramping time is determined by the time constants of the converter. The switching frequency of voltage regulators is on the order of 200 kHz [Weis94], so we speculate that it will take a few tens of microseconds to boost the voltage on the chip. Moreover, the CPU should be able to continue working during a voltage ramp, and ramping should not have any substantial energy cost.

Finally, why run slower? Suppose a task has a deadline in 100 milliseconds, but will take only 50 milliseconds of CPU time to complete when running at full speed. A normal system would run at full speed for 50 milliseconds and then idle for 50 milliseconds (assuming there were no other ready tasks), as in Figure 3.1. During the idle time the CPU can be stopped altogether by putting it into a mode that wakes up upon an interrupt, such as from a periodic clock or an I/O completion.

Figure 3.1. Different operating rates for the same task
Now, compare this to a system that runs the task at half speed so that it completes just before its deadline. If it can also reduce the voltage by half, the task will consume one quarter of the energy of the normal system, even taking into account stopping the CPU during the idle time. This is because the same number of cycles is executed in both systems, but the modified system reduces energy use by reducing the operating voltage. Another way to view this is that idle time represents wasted energy, even if the CPU is stopped.

The relationship between clock speed, supply voltage, and power dissipation for Transmeta's Crusoe TM5400 microprocessor, as reported in its data sheet [Tiwa96], is shown in Table 3.1. For a program running for a time duration T, the total energy consumption E is approximately E = P_avg × T, where P_avg is the average power consumed.

Researchers have proposed many ways of determining an "appropriate" operating clock rate. The basic idea behind these energy-saving approaches is to slow down the tasks as much as possible without violating the deadline. This "just-in-time" strategy can be illustrated through a voltage scheduling graph as in Figure 3.2 [Much97]. In a voltage scheduling graph, the X-axis represents time and the Y-axis represents processor speed. The total amount of work for a task is defined by the area of the task "box." For example, task 1 in Figure 3.2 has a total workload of 8,000 cycles. By "stretching" it out all the way to the deadline without changing its area, we are able to decrease the CPU speed from 600 MHz down to 400 MHz. As a result, 23.4% of total (CPU) energy may be saved on a Crusoe TM5400 processor.

Table 3.1. The relationship between clock frequency, supply voltage, and power dissipation of Transmeta's Crusoe TM5400 microprocessor

Figure 3.2. Voltage scheduling graph: (a) original schedule, (b) voltage-scaled schedule

3.1 Power management implementation

There are several ways of achieving energy reduction. In this section, we study the main approaches.

3.1.1 Simple shutdown

Shutting off power to a component is a radical solution that eliminates all sources of energy dissipation, including leakage. The energy consumption of idle processors can be avoided by powering off the unit. This radical solution requires controllable switches. An advantage of this approach is its wide applicability to all kinds of electronic components. A major disadvantage is the wake-up or recovery time, because the processor operation must be reinitialized.

3.1.2 Multiple and variable power supplies

Dynamic power management is also applicable to processors that are not idle but whose performance requirement varies with time. The implementation can then be based on slowdown, achieved by lowering the supply voltage to the point where the machine becomes performance-critical. Dynamically varying supply voltages may be quantized [Chan96], and thus restricted to a finite number of values, or may take values in a continuous range. In the former case it is possible to identify a finite number of power states for the system. A sketch of choosing among quantized voltage levels appears below.
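The following C++ fragment is a minimal sketch, not from the dissertation; the voltage/frequency table is the Motorola 6805 rating quoted earlier, and the task figures are hypothetical. It picks the lowest quantized level that still finishes a known cycle count by its deadline, falling back to the fastest level otherwise:

#include <cstdio>

struct Level { double volts; double mhz; };

// Quantized operating points (Motorola CMOS 6805 ratings quoted in the text),
// ordered from slowest/lowest-energy to fastest.
const Level levels[] = { {2.2, 3.0}, {3.3, 4.5}, {5.0, 6.0} };

// Pick the slowest level that still completes `cycles` within `deadlineMs`.
Level pickLevel(double cycles, double deadlineMs) {
    for (const Level& l : levels) {
        double runtimeMs = cycles / (l.mhz * 1e6) * 1e3;
        if (runtimeMs <= deadlineMs) return l;     // just-in-time: stretch to deadline
    }
    return levels[2];                              // cannot meet deadline; run flat out
}

int main() {
    // Hypothetical task: 300,000 cycles due in 100 ms (50 ms at full speed).
    Level l = pickLevel(300000.0, 100.0);
    std::printf("chosen level: %.1f V @ %.1f MHz\n", l.volts, l.mhz);
    // Energy per cycle scales roughly as V^2, so 3 MHz @ 2.2 V uses about
    // (2.2/5.0)^2 = 19% of the per-cycle energy of 6 MHz @ 5.0 V.
}

For the example task, the chooser stretches execution to exactly fill the deadline at 2.2 V and 3 MHz, which is the "just-in-time" strategy of Figure 3.2 applied to discrete levels.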
3.1.3 OnNow approach

The OnNow approach uses power-management hardware to put the PC into a low-power sleep state instead of shutting it down completely, so that the system can quickly resume working [Micr04]. While in the sleep state, the PC's processor is not executing code and thus no work is being accomplished for the user. However, events from both hardware devices (such as a modem ring or a network request) and the real-time clock can be enabled to wake the system up. Each device in the system has its own power states, and these are independently managed by the device driver (or other policy owner) while the system is in the working state. The device's policy integrates any particular application's needs with device capabilities and other operating system information to conserve energy without adversely affecting the work that the user is doing.

CHAPTER 4
DIRECTED A-CYCLIC GRAPH (DAG)

We define a DAG as an a-cyclic graph with nodes representing tasks and edges representing execution precedence between tasks. A weight is associated with each node and edge: the node weight represents the task execution time, and the edge weight represents the communication time between connected tasks. This communication time is zero if the tasks are executed on the same processor. Each DAG has a root node, which is a node with no incoming edges, and a sink node, which is a node with no outgoing edges.

Figure 4.1. A sample DAG

The DAG structure occurs in many regular and irregular applications such as Cholesky factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms, and instruction-level parallelism. Along the lines of [Bask03], a DAG is represented by the tuple G = (V, E, M, T, C, P) where:

- V is the set of n nodes.
- E is the set of e edges between the nodes; E(n, c) is an edge between nodes n and c.
- M is a set of m machines or processors.
- T is the set of costs T(n, k), where T(n, k) represents the computation time of task n on machine k.
- C is the set of costs C(n, c), where C(n, c) represents the communication cost associated with the edge E(n, c). Since intra-processor communication is insignificant compared to inter-processor communication, C(n, c) is taken to be zero if n and c are executed on the same processor.
- P is the set of costs P(n, k), where P(n, k) represents the power consumed when task n is executed on processor k.

The length of a path is defined as the sum of the node and edge weights on that path. Node n is a predecessor of node c if there is a directed edge originating at n and ending at c; in Figure 4.1, node 1 is a predecessor of nodes 2, 3, 4, and 7. Likewise, node s is a successor of node n if there is a directed edge originating at n and ending at s; in Figure 4.1, node 6 is a successor of nodes 1, 2, and 3. We further define pred(n) as the set of all predecessors of n and succ(n) as the set of all successors of n; for example, pred(6) = {1, 2, 3} and succ(6) = {8}. An ancestor of node n is any node c contained in pred(n), or any node a that is also an ancestor of some node c contained in pred(n).

The earliest execution start time of node n on processor k is denoted EST(n, k); likewise, the earliest execution finish time of node n on processor k is denoted EFT(n, k). EST(n) and EFT(n) represent the earliest start time and the earliest finish time upon any processor, respectively. R_nk is defined as the earliest time at which processor k will be ready to begin executing task n. We can define these terms mathematically as follows:

EST(n, k) = max{ R_nk, max_{c ∈ pred(n)} [ EFT(c, m) + C(c, n) ] },
    where EFT(c, m) = EFT(c), and C(c, n) = 0 when k = m,
EFT(n, k) = T(n, k) + EST(n, k),
EST(n) = min_{k ∈ M} EST(n, k),
EFT(n) = min_{k ∈ M} EFT(n, k).

The maximum clause finds the latest time at which a predecessor's data will arrive at processor k; if the predecessor finishes on a processor other than k, the communication cost must also be included in this time. In other words, the earliest start time of task n on processor k, EST(n, k), is the maximum of the time at which processor k becomes available and the time at which the last message arrives from any of the predecessors of n.
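As a concrete rendering of these definitions, the sketch below is illustrative; the graph representation and the processor-ready bookkeeping are assumptions, not the dissertation's implementation. It computes EST and EFT for one node given its already-scheduled predecessors:

#include <algorithm>
#include <vector>

struct Pred {
    double eft;     // EFT(c): scheduled finish time of predecessor c
    int proc;       // processor that c was scheduled on
    double comm;    // C(c, n): edge cost from c to n
};

// EST(n, k) = max( R_nk, max over predecessors of EFT(c) + C(c, n) ),
// with C(c, n) = 0 when the predecessor ran on the same processor k.
double est(double readyTime, int k, const std::vector<Pred>& preds) {
    double t = readyTime;                       // R_nk
    for (const Pred& p : preds) {
        double comm = (p.proc == k) ? 0.0 : p.comm;
        t = std::max(t, p.eft + comm);
    }
    return t;
}

// EFT(n, k) = T(n, k) + EST(n, k); EFT(n) minimizes this over all k in M.
double eft(double execTime, double readyTime, int k,
           const std::vector<Pred>& preds) {
    return execTime + est(readyTime, k, preds);
}

int main() {
    std::vector<Pred> preds = { {10.0, 0, 4.0}, {12.0, 1, 7.0} };  // toy data
    // Node n on processor 0 (ready at t = 9): the second predecessor pays
    // communication. EST = max(9, 10, 12 + 7) = 19, so EFT = 19 + 5 = 24.
    double f = eft(5.0, 9.0, 0, preds);
    return f > 0 ? 0 : 1;
}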
The main goal of any scheduling technique is to minimize the makespan of the DAG. The makespan is defined as the time at which all nodes finish executing; in our case, the makespan equals EFT(y), where y is the exit node of the graph. In Figure 4.1 the makespan is EFT(8).

The critical path (CP) is the longest path from an entry node to an exit node. The critical path excluding communication cost (CPEC) is the longest path from an entry node to an exit node, not including the communication cost of any edge traversed. In our work we assume that each task's mean execution cost across all processors is used to calculate the CP, while each task's minimum execution cost on any processor is used to calculate the CPEC. The top distance of a given node is the longest distance from an entry node to the node, excluding the computation cost of the node itself. The bottom distance of a given node is the longest distance from the node to an exit node. Again, each task's mean execution cost is used to calculate the top distance and the bottom distance. The bottom distance is also referred to as the upper rank or blevel. The Decisive Path (DP) of a node is defined as the node's top distance plus its bottom distance; the DP is defined for every node in the DAG. The critical path, CP, is then the largest DP of an exit node.
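The top and bottom distances admit a simple recursion over the DAG. The sketch below is illustrative C++; the adjacency-list layout and mean-cost arrays are assumptions. The decisive path of each node is then just the sum of the two distances:

#include <algorithm>
#include <vector>

struct Dag {
    int n;                                        // number of nodes
    std::vector<std::vector<int>> preds, succs;   // adjacency lists
    std::vector<std::vector<double>> comm;        // comm[u][v]: edge cost, 0 if none
    std::vector<double> meanCost;                 // mean execution cost per node
};

// Top distance: longest entry-to-node distance, excluding the node's own cost.
// Assumes `order` is a topological order of the nodes.
std::vector<double> topDistance(const Dag& g, const std::vector<int>& order) {
    std::vector<double> top(g.n, 0.0);
    for (int v : order)
        for (int u : g.preds[v])
            top[v] = std::max(top[v], top[u] + g.meanCost[u] + g.comm[u][v]);
    return top;
}

// Bottom distance (blevel): longest node-to-exit distance, including own cost.
std::vector<double> bottomDistance(const Dag& g, const std::vector<int>& order) {
    std::vector<double> bot(g.n, 0.0);
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        int v = *it;
        bot[v] = g.meanCost[v];
        for (int s : g.succs[v])
            bot[v] = std::max(bot[v], g.meanCost[v] + g.comm[v][s] + bot[s]);
    }
    return bot;
}

// DP(v) = top[v] + bot[v]; the CP length is the largest DP of an exit node.
int main() {
    Dag g{2, {{}, {0}}, {{1}, {}}, {{0, 3}, {0, 0}}, {5, 7}};  // 0 -> 1, comm 3
    std::vector<int> order = {0, 1};
    auto top = topDistance(g, order), bot = bottomDistance(g, order);
    // DP(1) = top 8 + bottom 7 = 15, the length of the only path.
    return (top[1] + bot[1] == 15.0) ? 0 : 1;
}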
4.1 Performance

Our algorithms schedule a directed acyclic weighted task graph on a bounded number of heterogeneous processors with the twin objectives of minimizing the energy consumed and minimizing the finish time. In other words, tasks arrive with given execution times and need to meet execution deadlines while minimizing the consumed energy. Using simulations, we evaluated the performance of EADAGS and EAGS-D. The first test suite consists of random directed a-cyclic graphs. The input parameters used to generate the graphs were as follows (a sketch of such a generator appears at the end of this section):

- Number of nodes (tasks) in the graph, n.
- Shape parameter of the graph, α. If α = 1.0, the graph is balanced. A DAG with high parallelism can be generated by selecting α >> 1, whereas α << 1 generates a long DAG with a small degree of parallelism.
- Out-degree of a node, out-degree, represents the average number of outgoing edges from each node. Each node's out-degree is randomly generated from a uniform distribution with mean equal to out-degree.
- Communication to Computation Ratio, CCR, is the ratio of the average communication cost to the average computation cost. If a DAG's CCR is less than 1, it is a computation-intensive application; if its CCR is much greater than 1, it is communication-intensive.
- Computation range, β, represents the range of computation costs on processors. A high β causes a significant difference in a node's computation costs among processors, whereas a low β means that the expected execution times of a node on any processor are almost equal.
- Processor to node ratio, PNR, represents the availability of processors with respect to the number of nodes. A PNR of 100% means the number of processors equals the number of nodes.

In generating random DAGs, the parameters were varied as follows:

n = {10, 20, 40, 60, 80, 100, 500, 1000}
CCR = {0.1, 0.5, 1, 5, 10}
α = {0.5, 1, 2}
out-degree = {1, 2, 3, 4, 5, 100}
β = {0.1, 0.25, 0.5, 0.75, 1.0}
PNR = {25%, 50%, 100%}

These values produced 10,800 DAGs, which were run with both presented algorithms, EADAGS and EAGS-D. The second test suite used to evaluate the performance of EADAGS and EAGS-D consisted of task graphs of real-world problems, specifically Gaussian elimination [Wu90], molecular dynamic code [Chun92], the Sieve of Eratosthenes, and the Fast Fourier Transform (FFT). For these problems the shape of the DAG is fixed, and the only parameters we varied were the number of available processors and the CCR.

Processors were assumed to have three different operating voltage levels and five operating strategies, based on the Motorola CMOS 6805 microcontroller, which is rated at 6 MHz at 5.0 Volts, 4.5 MHz at 3.3 Volts, and 3 MHz at 2.2 Volts. The first operating voltage is 5 V; when using this voltage, an idle processor shuts down. We refer to this operating voltage as 5V/off. This level is used for reference only, since it has physical limitations. The other two operating voltages are 2.2 V and 3.3 V, which slow the processor during task execution and during the processor's idle times.
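To make the parameterization concrete, here is a compact generator sketch. This is an illustration only; the dissertation's actual generator is not specified beyond the parameters above, and the layering scheme and uniform draws below are assumptions:

#include <cmath>
#include <random>
#include <vector>

struct Edge { int from, to; double cost; };

struct RandomDag {
    std::vector<double> meanCost;   // mean computation cost per node
    std::vector<Edge> edges;
};

// Generate a layered random DAG from n, shape alpha, mean out-degree,
// CCR, and computation range beta.
RandomDag generate(int n, double alpha, double outDeg, double ccr,
                   double beta, std::mt19937& rng) {
    RandomDag g;
    const double meanComp = 50.0;                         // arbitrary scale
    std::uniform_real_distribution<double> u(0.0, 1.0);

    // Height ~ sqrt(n)/alpha: larger alpha gives a shorter, wider DAG.
    int height = std::max(1, (int)std::round(std::sqrt((double)n) / alpha));
    std::vector<int> layerOf(n);
    for (int v = 0; v < n; ++v) layerOf[v] = (v * height) / n;

    for (int v = 0; v < n; ++v) {
        // Cost drawn in [mean*(1 - beta/2), mean*(1 + beta/2)]: beta spreads costs.
        g.meanCost.push_back(meanComp * (1.0 + beta * (u(rng) - 0.5)));
        // Roughly outDeg edges per node to nodes in later layers.
        int d = (int)std::round(2.0 * outDeg * u(rng));   // uniform, mean = outDeg
        for (int e = 0; e < d; ++e) {
            int w = v + 1 + (int)(u(rng) * (n - v - 1));
            if (w < n && layerOf[w] > layerOf[v])
                g.edges.push_back({v, w, meanComp * ccr * (0.5 + u(rng))});
        }
    }
    return g;
}

int main() {
    std::mt19937 rng(7);
    RandomDag g = generate(100, 1.0, 3.0, 0.5, 0.5, rng);
    return g.meanCost.size() == 100u ? 0 : 1;
}

Note how each parameter maps to one decision: alpha sets the height, beta the spread of node costs, CCR the scale of edge costs relative to node costs, and out-degree the expected edge count per node.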
CHAPTER 5
SCHEDULING ALGORITHMS

Many algorithms for scheduling DAGs in a distributed computing environment have been proposed, as mentioned in Chapter 2. However, the problem of discovering the schedule that gives the minimum finish time is NP-complete. Among these scheduling algorithms, some perform better than others in terms of finish time, or makespan. In this work, two new algorithms for scheduling DAGs on distributed computing systems are presented: Energy Aware DAG Scheduling (EADAGS) and Energy Aware Graph Scheduling with Duplication (EAGS-D).

5.1 EADAGS algorithm

EADAGS combines Decisive Path Scheduling (DPS) with DVS to minimize both finish time and energy consumption. DPS [Park97] was chosen because it is one of the most efficient algorithms. EADAGS consists of two phases. In the first phase, after DPS is run on the DAG to provide a low finish time, the energy consumed is estimated for all processors. In the second phase, voltage scaling is applied during slack times to reduce energy while maintaining the schedule length.

EADAGS transforms a DAG into one with a single entry node and a single exit node, if it is not so already. This transformation is accomplished by adding a dummy entry node and/or exit node with zero costs. Next, the top and bottom distances of each node are calculated, using the mean computation cost of each node. After building the DP for each node, EADAGS begins creating the scheduling queue, ScheduleQ, in a top-down fashion, starting with the DAG's entry node and traversing down the CP (which is the DP of the exit node). Nodes are prioritized based on the lengths of their DPs. The priorities are decided as follows: EADAGS puts the CP nodes into ScheduleQ in ascending order of their top distances. A node is added to the queue only if all its predecessors have been added; if not, EADAGS attempts to schedule its predecessors first. The first predecessors added to the queue are those included in the node's DP; the others are sorted and added to ScheduleQ in increasing top distance.

Next, EADAGS assigns the tasks in ScheduleQ to processors. At each step of the assignment, the selected processor is the one providing the earliest finish time for the task under consideration, taking into account all the communications from the task's parents. If the EFT of the exit node is larger than the sum of all the computation costs of the nodes on the best processor, EADAGS assigns all nodes to that processor and exits. The time complexity of the first phase of EADAGS is O(n^2).

Next, EADAGS computes the consumed energy. The total energy consumed when no voltage scaling is used, E_1, is calculated as:

E_1 = T × Σ_{k ∈ M} P_1(k),    where P_1(k) = (1/2) V_1^2 f

and where:
- P_1(k) is the power consumed by processor k ∈ M,
- f is the operating frequency of machine k,
- V_1 is the operating voltage of machine k,
- T is the makespan.

In the second phase of EADAGS, voltage scaling is applied to all processors during their idle times by reducing the execution rate to f_2, achieved by lowering the voltage to a predetermined level V_2. Such voltage scaling is applied to a task only if slowing its execution would not increase the makespan. We also reduce the voltage level of processors during all remaining slack times. The total energy consumed after applying voltage scaling is E_2 = Σ_{k ∈ M} E_2(k), where:

E_2(k) = (T_1 f_1 V_1^2 + T_2 f_2 V_2^2) / 2

represents the energy consumed by processor k ∈ M when voltage scaling is used (f_1 being the operating frequency at V_1 and f_2 the frequency at V_2),

T_1 = Σ_i T(i) + Σ_{i,j} C(i, j)

is the total task and communication time when operating at V_1, and where:
- T(i) is the computation time of task i on the chosen processor,
- C(i, j) is the communication cost between tasks i and j if i and j are not scheduled on the same processor,
- T_2 is the total time processor k operates at V_2 (including idle times during which the processor operates at V_2).

A non-blocking send protocol is assumed, in which only the sending processor has to process the communication, while the receiving processor has a buffer to receive all transmitted data without interrupting its job. The difference between E_1 and E_2 represents the energy that can be saved; the percentage average energy savings is

(E_1 − E_2) / E_1 × 100.
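As a worked check of these formulas (all numbers below are hypothetical and the schedule is reduced to a single processor), the fragment computes E_1, E_2, and the percentage savings for a processor whose busy time is T_1 and whose remaining makespan is spent scaled down:

#include <cstdio>

int main() {
    // Hypothetical single-processor schedule, in consistent units.
    const double T  = 100.0;                 // makespan
    const double V1 = 5.0,  f1 = 6.0e6;      // full-speed level
    const double V2 = 3.3,  f2 = 4.5e6;      // scaled level
    const double T1 = 60.0;                  // busy (task + comm) time at V1
    const double T2 = T - T1;                // idle time, spent scaled at V2

    // E1: the whole makespan billed at full voltage, P1 = 0.5 * V1^2 * f1.
    double e1 = T * 0.5 * V1 * V1 * f1;
    // E2(k) = (T1*f1*V1^2 + T2*f2*V2^2) / 2, per the text.
    double e2 = (T1 * f1 * V1 * V1 + T2 * f2 * V2 * V2) / 2.0;

    std::printf("savings = %.1f%%\n", 100.0 * (e1 - e2) / e1);
    // ~26.9%: idle time spent at 3.3 V/4.5 MHz instead of 5 V/6 MHz.
}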
A high-level description of EADAGS appears in Table 5.1.

Table 5.1. EADAGS algorithm

Let G represent a DAG
Let M be the set of m processors in the system

EADAGS
  Transform G to a DAG with a single entry node and a single exit node
  Compute DP of each node n ∈ G  // DP of the exit node is the critical path, CP
  Fill ScheduleQ with nodes  // starting from the entry node, traversing CP in increasing top-distance
  while ScheduleQ ≠ ∅ do
    i ← head(ScheduleQ)
    Schedule i on the processor p ∈ M that provides the earliest finish time of i
    Remove i from ScheduleQ
  end while
  if scheduling all nodes on the fastest processor provides a shorter makespan, do so and discard the prior schedule
  T ← makespan
  Total energy consumed before voltage scaling E1 = T·V1²·f/2
  Total energy consumed when employing voltage scaling E2 = ScaledEnergy()
end EADAGS

ScaledEnergy()
  // Returns the total energy consumption on all processors when voltage scaling has been applied
  for each processor p ∈ M do
    for each node n ∈ G scheduled on p do  // traverse first scheduled to last
      if executing n at the scaled voltage fits within the next slack then
        Scale down the operating voltage during the execution of n
      end if
    end for
    Energy consumed by processor p = sum of energy consumed by all nodes scheduled on p
  end for
  E = sum of energy consumed by all processors
  return E
end ScaledEnergy

Figure 5.1 lists the notations used by EADAGS (Figure 5.1.a) and the procedures of EADAGS (EADAGS, addQ, StartTime, and ScaledEnergy).

Let
- G represent a DAG
- y ∈ G be the exit node of G
- M be the set of m processors in the system
- Rk represent the ready time of machine k
- rn represent the ready time of node n
- f be the frequency of operation
- sik represent the start time of node i on machine k
- succ(n) represent the list of all successor nodes of node n ∈ G
- pred(n) represent the list of all predecessor nodes of node n ∈ G
- ScheduleQ be the queue of tasks in order of execution
- T(i, k) represent the execution time of node i ∈ G on machine k
- C(n, c) represent the communication cost from node n to node c
- EFT(i, k) represent the earliest finish time of node i ∈ G on machine k
- EFT(i) represent the scheduled finish time of node i ∈ G
- EFT2(i) represent the finish time of node i with the scaled-down voltage
- EST(i) represent the scheduled start time of node i ∈ G
- T represent the makespan of G
- V1 be the voltage of operation
- V2 be the scaled-down voltage of operation
- T1(n) and T2(n) represent the execution time of any node n ∈ G before and after voltage scaling, respectively
- E(k, n) represent the energy consumed by processor k to execute node n
- E1 represent the total energy consumption before voltage scaling
- E2 represent the total energy consumption after scaling down the voltage
- E1(k) and E2(k) represent the energy consumed by processor k ∈ M before and after voltage scaling, respectively

Figure 5.1.a. Notations for EADAGS

EADAGS
  Transform G to a DAG with a single entry node and a single exit node
  Compute DP for each node n ∈ G  // O(n²); DP of the exit node is the critical path, CP
  // Fill ScheduleQ with the nodes of CP in increasing top-distance
  ScheduleQ ← ∅
  for each node n ∈ CP do  // O(n²), traversed in increasing top-distance
    ScheduleQ = addQ(n)
  end for
  // Schedule nodes in ScheduleQ on processors
  while ScheduleQ ≠ ∅ do  // O(n)
    Pick the head node i of ScheduleQ
    for each processor k ∈ M do  // O(m)
      sik = StartTime(i, k)
      EFT(i, k) = sik + T(i, k)
    end for
    EFT(i) = min over k ∈ M of EFT(i, k)
    Schedule i on the processor p ∈ M that gave the minimum earliest finish time
    Remove i from ScheduleQ
  end while
  if EFT(y) ≥ min over k ∈ M of Σ over i ∈ G of T(i, k) then
    Schedule all nodes on the processor p ∈ M that provided the minimum
  end if
  T ← EFT(y)  // makespan
  E1 = T·V1²·f/2
  E2 = ScaledEnergy()
end EADAGS

Figure 5.1.b. EADAGS procedure

addQ(n)
  // Adds the unvisited parents of node n ∈ G, then n, to ScheduleQ; returns ScheduleQ
  for each parent b of n not visited do  // in decreasing DP
    addQ(b)
  end for
  Add n to ScheduleQ and mark it visited
  return ScheduleQ
end addQ

Figure 5.1.c. addQ procedure

StartTime(node n, machine k)
  // Returns the earliest available start time of node n ∈ G on machine k ∈ M
  snk ← Rk
  for each parent b of n do  // O(n)
    snk = max(snk, EFT(b) + C(b, n))  // if b is scheduled on k, C(b, n) = 0
  end for
  return snk
end StartTime

Figure 5.1.d. StartTime procedure

ScaledEnergy()
  // Returns the total energy consumption after applying voltage scaling
  for each processor k ∈ M do  // O(m)
    for each node n ∈ G scheduled on k do  // O(n), from first to last scheduled
      T2(n) = (V1²/V2²)·T1(n)
      EFT2(n) = snk + T2(n)
      if EFT2(n) + C(n, c) < min EST(c) for each c ∈ succ(n) then
        // if c is scheduled on k, C(n, c) = 0
        EFT(n) = EFT2(n)  // update EFT(n)
        Let i be the node scheduled immediately after n on k
        // use the scaled voltage, see Figure 5.2
        E(k, n) = T2(n)·V2² + (EST(i) − EFT(n))·V2²
      else
        // original voltage for execution and scaled voltage during idle time, see Figure 5.2
        E(k, n) = T1(n)·V1² + (EST(i) − EFT(n))·V2²
      end if
    end for
    E2(k) = Σ over n of E(k, n)
  end for
  E2 = Σ over k ∈ M of E2(k)
  return E2
end ScaledEnergy

Figure 5.1.e. ScaledEnergy procedure

Figure 5.2. Timeline for machine k: node n followed by node i, executed either at V1 for T1(n) (finishing at EFT(n)) or at V2 for T2(n) (finishing at EFT2(n)), with the remaining time before EST(i) spent at the scaled voltage.
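As an illustration of the ScaledEnergy accounting above, here is a hedged C++ sketch of the per-node decision for one processor's schedule, following Figure 5.2. The Node fields fold the pseudocode's communication term into a single bound, and endOfSchedule closes the last node's trailing gap; both are assumptions for the sketch, not the dissertation's code:

#include <vector>

// One scheduled node on machine k, as used in ScaledEnergy (Figure 5.1.e):
// its start time s_nk, its unscaled execution time T1(n), and the earliest
// moment any successor needs its data (min over c of EST(c) - C(n, c)).
struct Node {
    double start;       // s_nk
    double t1;          // T1(n)
    double childBound;  // latest finish time that delays no successor
};

// Energy of one processor's schedule per Figure 5.2: run a node at V2 when
// the stretched execution still meets every successor, otherwise at V1;
// the idle gap up to the next node's start is always spent at V2.
double scaledEnergy(const std::vector<Node>& sched,
                    double v1, double v2, double endOfSchedule) {
    double e = 0.0;
    for (size_t i = 0; i < sched.size(); ++i) {
        const Node& n = sched[i];
        double t2 = (v1 * v1) / (v2 * v2) * n.t1;      // T2(n)
        bool scaled = (n.start + t2 < n.childBound);   // fits the slack?
        double eft = n.start + (scaled ? t2 : n.t1);   // EFT(n)
        double exec = scaled ? t2 * v2 * v2 : n.t1 * v1 * v1;
        double nextStart = (i + 1 < sched.size()) ? sched[i + 1].start
                                                  : endOfSchedule;
        e += exec + (nextStart - eft) * v2 * v2;       // E(k, n)
    }
    return e;
}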
5.2 EAGS-D algorithm

Energy Aware Graph Scheduling with Duplication (EAGS-D) combines HNPD and dynamic voltage scaling. The Heterogeneous N-Predecessor Duplication (HNPD) algorithm combines the techniques of insertion-based list scheduling with multiple task duplication to minimize schedule length. HNPD was shown to outperform two of the best existing heuristics, Heterogeneous Earliest Finish Time (HEFT) and Static Task Duplication Scheduling (STDS), in terms of finish time and the number of processors employed, over a wide range of parameters [Bask03].

EAGS-D works as follows. EAGS-D assigns tasks to the best available processor according to the order of tasks in a scheduling queue called ScheduleQ. EAGS-D assigns the highest priority to critical path nodes (CPNs) and then to those predecessors of CPNs that include the CPN in their DP. Among these predecessors it gives higher priority to nodes with higher DP values, because nodes with higher DP values are likely to be on longer paths. In the order of tasks in ScheduleQ, EAGS-D uses EFT(n) to select the processor for each task n. As it is an insertion-based algorithm, it calculates the ready time of any machine k ∈ M, Rk, to be the earliest idle time slot large enough to execute T(n, k). In other words, it looks for a possible insertion between two already scheduled tasks on the given processor without violating precedence relationships.

Once tasks have been assigned to processors, EAGS-D attempts to duplicate the predecessors of the tasks. Predecessors are selected for duplication from most favorite to least and by descending top-distance. The goal of duplicating predecessors is to decrease the length of time for which a node awaits data, by making use of the processor's slack time. Predecessors of node n are duplicated on processor k, the processor on which node n is scheduled. If a duplication results in a lower finish time, the duplication is retained; otherwise it is discarded. If there is idle time between the recently assigned task n and the preceding task on processor k, EAGS-D attempts to duplicate each predecessor j. If j is not already scheduled on processor k, it is duplicated if EFT(j, k) is less than EFT(j) + C(j, n). The duplication is retained if EFT(n, k) decreases; otherwise, it is discarded. The same duplication procedure is repeated for each predecessor in order of most favorite to least. After EAGS-D attempts to duplicate each predecessor, it recursively attempts to duplicate the predecessors of any duplicated tasks. Duplication continues recursively until no further duplication is possible.
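A minimal C++ sketch of the duplication acceptance test just described: a predecessor j is duplicated on n's processor k only if its local copy would finish before its remote data would arrive, and the copy is kept only if n's finish time actually improves. The inputs are plain times read off the current schedule; the names are illustrative, not the dissertation's code:

// Decision core of EAGS-D duplication for one predecessor j of task n
// on machine k.
struct DuplicationCheck {
    double eftJonK;     // EFT(j, k): finish time of a copy of j inserted on k
    double eftJ;        // EFT(j): finish time of j on its current machine
    double commJN;      // C(j, n); 0 if j and n share a machine
    double eftNbefore;  // EFT(n, k) before the duplication
    double eftNafter;   // EFT(n, k) recomputed with the copy in place
};

// Returns true when the duplication should be retained.
bool keepDuplication(const DuplicationCheck& d) {
    // Duplicate only if the local copy beats the remote data arrival,
    if (d.eftJonK >= d.eftJ + d.commJN) return false;
    // and keep it only if n's finish time actually improves.
    return d.eftNafter < d.eftNbefore;
}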
Then, EAGS-D computes the consumed energy. The total energy consumed when no voltage scaling is used, $E_1$, is first calculated exactly as for EADAGS:

$E_1 = T \sum_{k \in M} P_1(k)$, where $P_1(k) = \frac{1}{2} V_1^2 f$

where:
- $P_1(k)$ is the amount of power consumed by processor $k \in M$,
- $f$ is the operating frequency of machine $k$,
- $V_1$ is the operating voltage of machine $k$,
- $T$ is the makespan.

In the second phase of EAGS-D, voltage scaling is applied to all processors during their idle times by reducing the execution rate to $f_2$, which is done by lowering the voltage to a predetermined level $V_2$. Such voltage scaling is applied to a task only if slowing its execution would not increase the makespan. We also reduce the voltage level of processors during all remaining slack times. The total energy consumed after applying voltage scaling is $E_2 = \sum_{k \in M} E_2(k)$, where:

- $E_2(k) = \frac{T_1 f_1 V_1^2 + T_2 f_2 V_2^2}{2}$ represents the energy consumed by processor $k \in M$ when voltage scaling is used,
- $T_1 = \sum_i T(i) + \sum_{i,j} C(i, j)$ is the total task and communication time when operating at $V_1$,
- $T(i)$ is the computation time of task $i$ on the chosen processor,
- $C(i, j)$ is the communication cost between tasks $i$ and $j$ if $i$ and $j$ are not scheduled on the same processor,
- $T_2$ is the total time processor $k$ operates at $V_2$ (including idle times during which the processor operates at $V_2$).

A non-blocking send protocol has been assumed, in which only the sending processor has to process the communication while the receiving processor has a buffer to receive all transmitted data without interrupting its job. The difference between $E_1$ and $E_2$ represents the energy that could be saved. The percentage average energy savings $= \frac{E_1 - E_2}{E_1} \times 100$. A high-level description of EAGS-D appears in Table 5.2.

Table 5.2. EAGS-D algorithm

Let G represent a DAG
Let M be the set of m processors in the system

EAGS-D
  Transform G to a DAG with a single entry node and a single exit node
  Compute DP of each node n ∈ G  // DP of the exit node is the critical path, CP
  Fill ScheduleQ with nodes  // starting from the entry node, traversing CP in increasing top-distance
  while ScheduleQ ≠ ∅ do
    i ← head(ScheduleQ)
    Schedule i on the processor p ∈ M that provides the earliest finish time of i
    Remove i from ScheduleQ
    Duplicate predecessors of i on p if doing so results in a shorter schedule
    // duplicate executions are performed within slacks of the schedule of p
  end while
  if scheduling all nodes on the fastest processor provides a shorter makespan, do so and discard the prior schedule
  T ← makespan
  Total energy consumed before voltage scaling E1 = T·V1²·f/2
  Total energy consumed when employing voltage scaling E2 = ScaledEnergy()
end EAGS-D

ScaledEnergy()
  // Returns the total energy consumption on all processors when voltage scaling has been applied
  for each processor p ∈ M do
    for each node n ∈ G scheduled on p do  // traverse first scheduled to last
      if executing n at the scaled voltage fits within the next slack then
        Scale down the operating voltage during the execution of n
      end if
    end for
    Energy consumed by processor p = sum of energy consumed by all nodes scheduled on p
  end for
  E = sum of energy consumed by all processors
  return E
end ScaledEnergy
Figure 5.3.a lists the notations used by EAGS-D, while Figures 5.3.b, 5.3.c, 5.3.d, 5.3.e, 5.3.f, and 5.3.g show the detailed procedures: EAGS-D, addQ, StartTime, DuplicatePred, Duplicate, and ScaledEnergy, respectively.

Let
- G represent a DAG
- y ∈ G be the exit node of G
- M be the set of m processors in the system
- Rk represent the ready time of machine k
- rn represent the ready time of node n
- f be the frequency of operation
- sik represent the start time of node i on machine k
- succ(n) represent the list of all successor nodes of node n ∈ G
- pred(n) represent the list of all predecessor nodes of node n ∈ G
- ScheduleQ be the queue of tasks in order of execution
- T(i, k) represent the execution time of node i ∈ G on machine k
- C(n, c) represent the communication cost from node n to node c
- EFT(i, k) represent the earliest finish time of node i ∈ G on machine k
- EFT(i) represent the scheduled finish time of node i ∈ G
- EFT2(i) represent the finish time of node i with the scaled-down voltage
- EST(i) represent the scheduled start time of node i ∈ G
- T represent the makespan of G
- V1 be the voltage of operation
- V2 be the scaled-down voltage of operation
- T1(n) and T2(n) represent the execution time of any node n ∈ G before and after voltage scaling, respectively
- E(k, n) represent the energy consumed by processor k to execute node n
- E1 represent the total energy consumption before voltage scaling
- E2 represent the total energy consumption after scaling down the voltage
- E1(k) and E2(k) represent the energy consumed by processor k ∈ M before and after voltage scaling, respectively

Figure 5.3.a. EAGS-D notations

EAGS-D
  Transform G to a DAG with a single entry node and a single exit node
  Compute DP for each node n ∈ G  // O(n²); DP of the exit node is the critical path, CP
  // Fill ScheduleQ with the nodes of CP in increasing top-distance
  ScheduleQ ← ∅
  for each node n ∈ CP do  // O(n²), traversed in increasing top-distance
    ScheduleQ = addQ(n)
  end for
  // Schedule nodes in ScheduleQ on processors
  while ScheduleQ ≠ ∅ do  // O(n)
    Pick the head node i of ScheduleQ
    for each processor k ∈ M do  // O(m)
      sik = StartTime(i, k)
      EFT(i, k) = sik + T(i, k)
    end for
    EFT(i) = min over k ∈ M of EFT(i, k)
    Schedule i on the processor p ∈ M that gave the minimum earliest finish time
    Remove i from ScheduleQ
    // Duplicate predecessors of i on p if it results in a shorter schedule
    DuplicatePred(i, pred(i), p)
  end while
  if EFT(y) ≥ min over k ∈ M of Σ over i ∈ G of T(i, k) then
    Schedule all nodes on the processor p ∈ M that provided the minimum
  end if
  T ← EFT(y)  // makespan
  E1 = T·V1²·f/2
  E2 = ScaledEnergy()
end EAGS-D

Figure 5.3.b. EAGS-D procedure

addQ(n)
  // Adds the unvisited parents of node n ∈ G, then n, to ScheduleQ; returns ScheduleQ
  for each parent b of n not visited do  // in decreasing DP
    addQ(b)
  end for
  Add n to ScheduleQ and mark it visited
  return ScheduleQ
end addQ

Figure 5.3.c. addQ procedure

StartTime(node n, machine k)
  // Returns the earliest available start time of node n ∈ G on machine k ∈ M
  snk ← Rk
  for each parent b of n do  // O(n)
    snk = max(snk, EFT(b) + C(b, n))  // if b is scheduled on k, C(b, n) = 0
  end for
  return snk
end StartTime

Figure 5.3.d. StartTime procedure

DuplicatePred(node n, pred(n), machine k)
  for each q ∈ pred(n) do  // in order from most favorite to least
    Duplicate(n, q, k)
    if q was duplicated then
      DuplicatePred(q, pred(q), k)
    end if
  end for
end DuplicatePred

Figure 5.3.e. DuplicatePred procedure
Duplicate(node n, node i, machine k)
  // Attempts to duplicate node n, a predecessor of node i, on machine k, which executes node i
  if n has not been scheduled on k then
    rnk = StartTime(n, k)
    // Attempt to insert node n in the first available slack in the schedule of processor k ∈ M
    for each node j already scheduled on k ∈ M do  // scan from last to first scheduled
      if j ∈ pred(n) then
        if j is the first node scheduled on k then
          if EST(j, k) ≥ rnk + T(n, k) then  // see Figure 5.4
            snk = rnk
          end if
        else
          Let p be the node scheduled immediately prior to j on k
          if (EST(j, k) − EFT(p, k) ≥ T(n, k)) and (EFT(p, k) ≥ rnk) then
            snk = EFT(p, k)  // see Figure 5.5
          end if
        end if
        EFT(n, k) = snk + T(n, k)
      end if
    end for
    // If EFT(i) does not decrease due to the duplication, discard the duplication
    if EFT(n, k) < min over succ(n) of (EFT(n) + C(n, succ(n))) then
      Insert and schedule n on k
      Recalculate EFT(i, k)
      if EFT(i, k) improves then
        Keep n scheduled on k
      else
        Discard the duplication of node n on k
      end if
    end if
  end if
end Duplicate

Figure 5.3.f. Duplicate procedure

ScaledEnergy()
  // Returns the total energy consumption after applying voltage scaling
  for each processor k ∈ M do  // O(m)
    for each node n ∈ G scheduled on k do  // O(n), from first to last scheduled
      T2(n) = (V1²/V2²)·T1(n)
      EFT2(n) = snk + T2(n)
      if EFT2(n) + C(n, c) < min EST(c) for each c ∈ succ(n) then
        // if c is scheduled on k, C(n, c) = 0
        EFT(n) = EFT2(n)  // update EFT(n)
        Let i be the node scheduled immediately after n on k
        // use the scaled voltage, see Figure 5.6
        E(k, n) = T2(n)·V2² + (EST(i) − EFT(n))·V2²
      else
        // original voltage for execution and scaled voltage during idle time, see Figure 5.6
        E(k, n) = T1(n)·V1² + (EST(i) − EFT(n))·V2²
      end if
    end for
    E2(k) = Σ over n of E(k, n)
  end for
  E2 = Σ over k ∈ M of E2(k)
  return E2
end ScaledEnergy

Figure 5.3.g. ScaledEnergy procedure

Figure 5.4. Timeline for machine k (case 1): node n is inserted before j, the first node scheduled on k, when EST(j, k) ≥ rn + T(n, k).

Figure 5.5. Timeline for machine k (case 2): node n is inserted in the slack between p and j when the slack is at least T(n, k) and EFT(p, k) ≥ rn.

Figure 5.6. Timeline for machine k: node n executed at V1 for T1(n) or at V2 for T2(n), with the remaining time before EST(i) spent at the scaled voltage.

5.3 Voltage scaling strategies

Executing any task at a slower rate requires a longer time to finish, and since we are dealing with DAGs, in which tasks depend on other tasks, changing the finish time of a task may affect the start times of its successors, resulting in a change of the schedule makespan. In our algorithms we want to make use of the processors' slack time to save energy without changing the makespan produced by the original scheduling algorithms.

Processors were assumed to have three different operating voltage levels based on the Motorola CMOS 6805 microcontroller, which is rated at 6 MHz at 5.0 Volts, 4.5 MHz at 3.3 Volts, and 3 MHz at 2.2 Volts. We tested five different strategies:

1. 5V/off: execute tasks at the regular voltage level and turn processors off during their idle time. This technique is used for comparison purposes only, since turning processors off each time they incur slack has physical limitations.
2. 2V during idle: execute tasks at the regular execution rate and lower the processor's voltage level to 2V only during its idle time.
3. 3.3V during idle: execute tasks at the regular execution rate and lower the processor's voltage level to 3.3V during all its idle time.
4. 2V scale: check whether the slack time is long enough to scale down the execution rate of a task to 2V without altering the finish time. If so, execute the task at the 2V rate and also lower the voltage during the remaining idle time. If not, execute the task at the regular voltage and lower the voltage during the idle time.
5. 3.3V scale: check whether the idle time is long enough to lower the execution rate of a task to 3.3V without affecting the makespan. If so, execute the task at the 3.3V rate and also lower the voltage during the remaining idle time. If not, execute the task at the regular voltage and lower the voltage during the idle time.

5.3.1 Voltage scaling algorithm

The task in Figure 5.7 has a finish time of t1 when executed at V1 and a finish time of t2 when executed at V2. The task can be executed at the lower voltage level only if t2 < min(start time of all its children). For the task in Figure 5.7(a), the execution rate can be slowed down to V2, reducing the amount of energy consumed; the task in Figure 5.7(b) cannot be scaled down to V2, since the new finish time would be greater than the earliest child's start time, so the task must be executed at the original voltage level and the processor's voltage is scaled down only during the idle time x. A sketch of this feasibility test appears at the end of this section.

Figure 5.7. Different finish times with different execution rates: (a) the scaled finish time t2 precedes the children's earliest start time; (b) t2 exceeds it, leaving only the idle time x for scaling.

Although we chose to do the energy scaling as a second phase after the scheduling of all tasks, it could have been done in parallel with the scheduling phase, resulting in the same schedule and the same amount of energy savings. This can be explained as follows: each task is scheduled on the processor that results in the earliest finish time. The finish time of each node on each processor is determined by the first available start time of that node on that processor and the execution time of the node on that specific processor, taking into consideration the finish times of its predecessors and all communication costs between the task and any other node. Its successors are then scheduled based on the same criteria. So, immediately after all of a task's children have been scheduled on the appropriate processors, the decision to scale that task down can be made based on the finish time of the task and the start times of its successors. A task can be scaled down only if its finish time when executed at the lower voltage level is earlier than the start times of all its successors. Executing the task at the scaled-down voltage level does not affect the scheduling decisions for the next nodes, since it only affects the finish time of the scaled-down task. The advantage of doing it as a second phase is that we do not have to check whether all of a task's successors have been scheduled before deciding whether to scale that task down, resulting in a lower-complexity algorithm.
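A minimal C++ sketch of the per-task test behind the 2V scale and 3.3V scale strategies; the time-stretch factor follows the dissertation's T2 = (V1²/V2²)·T1 relation, and the names are illustrative:

// Decide the execution voltage of one task under a "scale" strategy.
// start: scheduled start time; t1: execution time at the regular
// voltage v1; childMinStart: earliest start time over all children.
// Returns the voltage the task should run at.
double chooseTaskVoltage(double start, double t1, double childMinStart,
                         double v1, double vScaled) {
    // Stretched execution time at the scaled voltage, T2 = (V1^2/V2^2)*T1.
    double t2 = (v1 * v1) / (vScaled * vScaled) * t1;
    // Scale only if the stretched task still finishes before every child
    // needs its data; otherwise keep the regular voltage (the processor
    // is still scaled down during the remaining idle time).
    return (start + t2 < childMinStart) ? vScaled : v1;
}

For the 2V scale strategy, vScaled would be 2.0, and for 3.3V scale it would be 3.3; in either case the remaining idle time up to the next task is spent at the scaled voltage.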
CHAPTER 6
RESULTS AND DISCUSSION

The experimental simulation estimates the energy gains of using voltage scaling techniques when scheduling DAG tasks on multiple processors, on top of two of the best known high-performing scheduling algorithms, DPS and HNPD. As stated in Chapter 5, processors were assumed to have three different operating voltage levels based on the Motorola CMOS 6805 microcontroller, which is rated at 6 MHz at 5.0 Volts, 4.5 MHz at 3.3 Volts, and 3 MHz at 2.2 Volts [Weis94]. The first operating voltage was 5V; when using this voltage, if the processor becomes idle, it shuts down. We will refer to this operating voltage as 5V/off. This level is used for reference only since it has physical limitations. The other two operating voltages are 2V and 3.3V, which slow the processor during task execution. The algorithms were tested using random DAGs and DAGs of real-world problems; specifically, we tested on DAGs for molecular dynamics code, Gaussian elimination, the Sieve of Eratosthenes, and the Fast Fourier Transform.

6.1 Results for random DAGs

A set of random DAGs was generated to evaluate the voltage scaling technique. Those DAGs had different input parameters (number of nodes, communication to computation ratio, shape parameter, out-degree, and computation range), varied as follows:

n = {10, 20, 40, 60, 80, 100, 500, 1000}
CCR = {0.1, 0.5, 1, 5, 10}
α = {0.5, 1, 2}
out-degree = {1, 2, 3, 4, 5, 100}
β = {0.1, 0.25, 0.5, 0.75, 1.0}
PNR = {25%, 50%, 100%}

The above values produce 10,800 DAGs, which were repeated for both presented scheduling algorithms, EADAGS and EAGS-D.

6.1.1 Results for EADAGS

The amount of energy consumed was measured for DPS and EADAGS for different test sets of random DAGs. Test sets were created by combining results from DAGs with similar properties, such as the number of nodes or the CCR. The amount of energy consumed was measured for five different strategies: 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale (explained in Chapter 5).

The first test set was obtained by combining DAGs with respect to number of nodes. The energy savings were averaged over DAGs with varying CCR, α, β, out-degree, and PNR. Figure 6.1 shows the average energy saved by EADAGS over DPS with respect to number of nodes. The percentage of energy savings increased with increasing number of nodes: average energy savings were 30% for a DAG of 10 nodes and gradually increased to 46% for 1000-node DAGs. Larger numbers of nodes increase the chance of dependency between tasks, causing a slight increase in processor wait time and accordingly increasing the energy savings. Table 6.1 lists the makespan and the average energy savings with respect to number of nodes. Average energy savings range between 30% and 46% for the 5V/off technique, 26% and 46% when 2V during idle is used, 24% and 46% when using 3.3V during idle, 29% and 46% when 2V scale is used, and 28% and 46% when 3.3V scale is used.

Figure 6.1. Average energy savings with respect to number of nodes

Table 6.1. Makespan and average energy savings with respect to number of nodes

Nodes  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
10     174.7     30.28%   26.53%          24.09%            29.98%    28.84%
20     175.27    34.32%   32.6%           31.48%            33.63%    32.45%
40     321.67    40.2%    39.18%          38.51%            40.12%    39.57%
60     404.36    41.2%    40.54%          40.11%            40.93%    40.47%
80     525.5     43.14%   42.56%          42.19%            42.91%    42.51%
100    674.65    43.8%    43.39%          43.12%            43.64%    43.35%
500    835.29    45.94%   45.85%          45.8%             45.82%    45.28%
1000   1127      46.29%   46.25%          46.22%            46.88%    46.03%

The second test set combines DAGs with respect to CCR. The average energy savings were averaged over DAGs with varying n, α, β, out-degree, and PNR. In Figure 6.2 the average energy savings are plotted with respect to CCR. The average energy savings increased with increasing CCR. When CCR increases, processors incur longer idle times due to communication between tasks. Our algorithm was able to use such idle times to achieve energy savings.
Table 6.2 shows the makespan and the average energy savings for five different CCR values. The average energy savings over DPS range from 35% for CCR = 0.1 to 51% for CCR = 10 with the 5V/off technique. Savings are smaller for 2V during idle; they range between 25% and 38%. Savings are smaller still for 3.3V during idle: 19% for CCR = 0.1 and 30% for CCR = 10. Average energy savings are 28% for CCR = 0.1 and 44.7% for CCR = 10 with 2V scale, while for 3.3V scale the savings are 22.8% for CCR = 0.1 and 40% for CCR = 10.

Figure 6.2. Average energy savings with respect to CCR

Table 6.2. Makespan and average energy savings for different CCR values

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  117       35.17%   25.53%          19.27%            28.33%    22.87%
0.5  124       36.93%   26.95%          20.45%            29.42%    24.77%
1    133.89    37.25%   27.2%           20.66%            29.69%    25.06%
5    209.89    45.84%   34.07%          26.42%            41.98%    38.05%
10   280       51.4%    38.52%          30.15%            44.73%    40.11%

The third test set combines DAGs with respect to processor to node ratio, PNR. The average energy savings were averaged over randomly generated DAGs with varying n, CCR, α, β, and out-degree. Figure 6.3 shows the average energy savings with respect to PNR; they decrease with increasing PNR. This is because an increasing number of processors allows more parallel task executions, thus minimizing the wait times. The average energy savings measured were 47% for PNR = 25% with the 5V/off technique, 41% when processors operate at 2V during idle, 38% if they operate at 3.3V when idle, 42% when 2V scale is used, and 39% when 3.3V scale is used. But for PNR = 100% the average energy savings for all five strategies are almost equal, at about 33%, which is due to the significant decrease in processor wait time.

Figure 6.3. Average energy savings with respect to PNR

Another test set combined DAGs with respect to the shape parameter α. The average energy savings were computed over 36,000 randomly generated DAGs with varying n, PNR, CCR, β, and out-degree. In Figure 6.4 the average energy savings for the five operating strategies are plotted with respect to α. The results indicate that the overall energy savings marginally increased with increasing α. Increasing α increases parallelism in the DAG, resulting in more idle time for the processors due to task dependencies; this time can be used to reduce the consumed energy. The average energy savings were measured as 24% for α = 0.5, 25% for α = 1, and 25.5% for α = 2.

Figure 6.4. Average energy savings with respect to α

Table 6.3. Makespan and average energy savings for different shape parameters

α    Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.5  110.87    34.19%   24.75%          18.62%            28.47%    22.54%
1    189.87    34.38%   24.9%           18.74%            29.01%    23.71%
2    219.4     35.22%   25.58%          19.31%            29.33%    24.37%

The last test set was for out-degree. The average energy savings were averaged over 21,600 randomly generated DAGs with varying n, α, β, CCR, and PNR.
Figure 6.5 shows the average energy savings with respect to the values of out-degree. The results indicate that an increase in out-degree results in smaller average energy savings, because a larger out-degree allows more processors to run in parallel. Table 6.4 lists the makespan and average energy savings with respect to out-degree. The amount of energy that could be saved ranges between 44% and 13% for processors using 5V/off, 32% and 8.5% for 2V during idle, 25% and 5% for 3.3V during idle, 36% and 10% when 2V scale is used, and 29% and 6% for 3.3V scale.

Figure 6.5. Average energy savings with respect to out-degree

Table 6.4. Makespan and average energy savings with respect to out-degree

Out-degree  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
1           233.97    44.24%   32.79%          25.35%            36.29%    29.44%
2           321.67    42.67%   31.54%          24.3%             34.07%    26.36%
3           166.17    32.55%   23.44%          17.52%            26.18%    21.54%
4           194.29    29.53%   21.02%          15.5%             23.37%    19.28%
5           203.6     16.17%   10.33%          6.54%             14.71%    9.39%
100         157.13    13.83%   8.47%           4.98%             10.23%    6.41%

6.1.2 Results for EAGS-D

The amount of energy consumed was measured for HNPD and EAGS-D for different test sets of random DAGs. Test sets were created by combining results from DAGs with similar properties, such as the number of nodes or CCR. The amount of energy consumed was measured for the five different strategies explained before: 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale.

The first test set was obtained by combining DAGs with respect to number of nodes. The energy savings were averaged over DAGs with varying CCR, α, β, out-degree, and PNR. Figure 6.6 shows the average energy saved by EAGS-D over HNPD with respect to number of nodes. The percentage of energy savings increased with increasing number of nodes: average energy savings were 15% for a DAG of 10 nodes and gradually increased to 30% for 1000-node DAGs. Larger numbers of nodes increase the chance of dependency between tasks, causing a slight increase in processor wait time and accordingly increasing energy savings. Average energy savings range between 15% and 30% for processors using the 5V/off technique, 13% and 30% for 2V during idle, 12% and 30% when using 3.3V during idle, 14% and 31% for 2V scale, and 12% and 30% for 3.3V scale.

Figure 6.6. Average energy savings with respect to number of nodes

Table 6.5. Makespan and average energy savings with respect to number of nodes

Nodes  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
10     164.3     15.24%   13.36%          12.13%            14.61%    12.75%
20     169.02    15.65%   14.79%          14.05%            15.31%    14.55%
40     318.07    26.02%   25.36%          24.85%            15.75%    15.57%
60     416.31    28.81%   28.33%          27.99%            28.61%    28.27%
80     537.6     30.43%   30%             29.41%            30.26%    31.07%
100    668.29    29.88%   29.6%           29.8%             29.77%    30.58%
500    799.83    30.67%   30.66%          30.3%             31.89%    30.78%
1000   1345.7    30.88%   30.78%          30.55%            31.09%    30.89%

The second test set combines DAGs with respect to CCR. The average energy savings were averaged over DAGs with varying n, α, β, out-degree, and PNR. In Figure 6.7 the average energy savings are plotted with respect to CCR. The average energy savings increased with increasing CCR.
When CCR increases, processors incur longer idle times due to communication between tasks, and our algorithm is able to use such idle times to achieve energy savings. The average energy savings over HNPD listed in Table 6.6 range from 8% for CCR = 0.1 to 43% for CCR = 10 with the 5V/off technique. Savings are smaller for 2V during idle; they range between 4% and 39%. Savings are smaller still for 3.3V during idle: 2% for CCR = 0.1 and 32% for CCR = 10. Savings are higher for 2V scale than for 2V during idle; they range from 6% for CCR = 0.1 to 50% for CCR = 10. Finally, the savings are 4% for CCR = 0.1 and 48% for CCR = 10 with 3.3V scale.

Figure 6.7. Average energy savings with respect to CCR

Table 6.6. Makespan and average energy savings for different CCR values

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  108.44    8.1%     4.25%           2.37%             6.92%     4.41%
0.5  119.11    13.11%   8.81%           5.62%             10.33%    9.38%
1    140.56    24.38%   21.03%          18.8%             21.67%    16.01%
5    206.78    40.58%   33.86%          30.74%            48.67%    47.03%
10   261.67    43.08%   38.98%          32.6%             50.71%    48.42%

The third test set combines DAGs with respect to processor to node ratio, PNR. The average energy savings were averaged over randomly generated DAGs with varying n, CCR, α, β, and out-degree. Figure 6.8 shows the average energy savings with respect to PNR; they decrease with increasing PNR. This is because an increasing number of processors allows more parallel task executions, thus minimizing the wait times. The average energy savings measured were 28% for PNR = 25% with the 5V/off technique, 27% if processors operate at 2V during idle, 27% for 3.3V during idle, and 27% if they operate at either 2V scale or 3.3V scale. But for PNR = 100% the average energy savings for all five strategies were almost equal, at about 25%, which is due to the significant decrease in processor wait time.

Figure 6.8. Average energy savings with respect to PNR

Another test set combined DAGs with respect to the shape parameter α. The average energy savings were computed over 36,000 randomly generated DAGs with varying n, PNR, CCR, β, and out-degree. In Figure 6.9 the average energy savings for the five operating strategies are plotted with respect to α. The results indicate that the overall energy savings marginally increased with increasing α from 0.5 to 1 and significantly decreased when α = 2. Increasing α increases parallelism in the DAG, resulting in more idle time for the processors due to task dependencies; but when α = 2 all this idle time is used by the processors to duplicate tasks, so the average energy savings decreased, which was not the case with EADAGS. The average energy savings were measured as 18% when α = 0.5, 22% when α = 1, and 3% when α = 2, as in Table 6.7.

Figure 6.9. Average energy savings with respect to α
Table 6.7. Makespan and average energy savings for different shape parameters

α    Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.5  97.73     19.75%   18.44%          17.59%            19.33%    18.01%
1    190.6     23.82%   22.62%          21.84%            22.89%    22.15%
2    218.73    3.38%    3.3%            3.25%             3.41%     3.33%

The last test set combines DAGs with respect to out-degree. The average energy savings were averaged over 21,600 randomly generated DAGs with varying n, α, β, CCR, and PNR. Figure 6.10 shows the average energy savings with respect to the values of out-degree. The results indicate that an increase in out-degree results in smaller average energy savings, because a larger out-degree allows more processors to run in parallel. The amount of energy that could be saved ranged between 25% and 10% for processors using the 5V/off technique, 23% and 7% for 2V during idle, 20% and 5% for 3.3V during idle, 24% and 9% for 2V scale, and finally 21% and 6% when 3.3V scale is used, as shown in Table 6.8.

Figure 6.10. Average energy savings with respect to out-degree

Table 6.8. Makespan and average energy savings with respect to out-degree

Out-degree  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
1           158.9     25.44%   23.45%          20.74%            24.72%    21.64%
2           218.06    22.68%   21.04%          19.35%            22.19%    20.38%
3           144.93    21.09%   18.44%          16.84%            19.07%    17.33%
4           164.03    19.33%   16.17%          13.97%            17.41%    14.86%
5           138.27    12.16%   10.86%          9.46%             12.03%    10.82%
100         106.71    9.67%    7.36%           5.47%             9.11%     6.71%

Regardless of the changing parameters, the average energy savings for EADAGS are higher than the average energy savings measured for EAGS-D. That is due to the task duplication in EAGS-D. EAGS-D attempts to duplicate the predecessors of tasks to decrease the length of time for which a node awaits data, by making use of the processor's idle time; the same task may therefore be executed on several processors to reduce the makespan. And since we use the processor's idle time to save energy, reduction of that time reduces the average energy that can be saved.

6.2 Results for real world problems

To evaluate the performance of the proposed algorithms, we used task graphs of four real-world problems: Gaussian elimination [Wu90], molecular dynamics code [Chun92], Fast Fourier Transform [Chun92], and the Sieve of Eratosthenes [Bask00].

6.2.1 Gaussian elimination

In mathematics, Gaussian elimination (or Gauss-Jordan elimination), named after Carl Friedrich Gauss and Wilhelm Jordan, is an algorithm in linear algebra for determining the solution of a system of linear equations, for determining the rank of a matrix, and for calculating the inverse of an invertible square matrix [Corm90]. Gaussian elimination is a systematic application of elementary row operations to a system of linear equations in order to convert the system to upper triangular form; once the coefficient matrix is in upper triangular form, back substitution is used to find a solution. The general procedure is summarized in Table 6.9; a code sketch of these steps follows the table.

Table 6.9. Gaussian elimination algorithm steps

1. Write the augmented matrix for the system of linear equations.
2. Use elementary row operations on the augmented matrix [A|b] to transform A into upper triangular form. If a zero is located on the diagonal, switch rows until a nonzero is in that place; if you are unable to do so, stop, as the system has either infinitely many or no solutions.
3. Use back substitution to find the solution of the problem.
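A minimal C++ sketch of these steps on an augmented matrix, assuming a square system with a unique solution; this illustrates the numerical algorithm of Table 6.9, not the dissertation's task graph:

#include <vector>
#include <cmath>
#include <utility>
#include <stdexcept>

// Solve A x = b by Gaussian elimination with row swapping (Table 6.9).
// a is the n x (n+1) augmented matrix [A|b]; returns the solution x.
std::vector<double> gaussianSolve(std::vector<std::vector<double>> a) {
    const size_t n = a.size();
    for (size_t k = 0; k < n; ++k) {
        // Step 2: if the diagonal entry is zero, switch rows.
        size_t piv = k;
        while (piv < n && std::fabs(a[piv][k]) < 1e-12) ++piv;
        if (piv == n) throw std::runtime_error("no unique solution");
        std::swap(a[k], a[piv]);
        // Eliminate the entries below the pivot (update rows).
        for (size_t i = k + 1; i < n; ++i) {
            double factor = a[i][k] / a[k][k];
            for (size_t j = k; j <= n; ++j) a[i][j] -= factor * a[k][j];
        }
    }
    // Step 3: back substitution on the upper triangular system.
    std::vector<double> x(n);
    for (size_t i = n; i-- > 0; ) {
        double s = a[i][n];
        for (size_t j = i + 1; j < n; ++j) s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }
    return x;
}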
The computational complexity of Gaussian elimination is O(n³); that is, the number of operations required is approximately proportional to n³ for a matrix of size n × n. The DAG for the Gaussian elimination algorithm for n = 3, n = 4, and n = 5 is shown in Figure 6.11, where n is the matrix size. Each Tk,k represents a pivot column operation and each Tk,j an update operation. The total number of tasks in the graph is (n² + n − 2)/2, where n is the size of the matrix; a small sketch verifying this count follows Figure 6.11.

In the simulation, a matrix of size 8 × 8 was used to evaluate EADAGS. Since the structure of the graph is fixed, only the number of processors and the CCR values were varied. For a matrix of size 8 the total number of tasks in the graph is 35, and the largest number of tasks at a single level is 7, so the number of processors is bounded by 7. CCR values were 0.1, 0.5, 1.0, 5.0, and 10. In this experiment, since the same operation is executed on every processor and the same information is communicated from one processor to another, a uniform computation cost for all tasks and an equal communication cost for all communication links were assumed.

Figure 6.11. Gaussian elimination task graph: (a) matrix of size 3, (b) matrix of size 4, (c) matrix of size 5
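As a quick check of the task-count formula above, a small C++ sketch that enumerates the nodes of the Gaussian elimination task graph (pivot tasks Tk,k and update tasks Tk,j, as in Figure 6.11) and compares the total with (n² + n − 2)/2; everything beyond that naming is illustrative:

#include <cstdio>
#include <initializer_list>

// Count the tasks of the Gaussian elimination DAG of Figure 6.11:
// each elimination step k = 1..n-1 contributes one pivot task T(k,k)
// and update tasks T(k,j) for j = k+1..n.
int countTasks(int n) {
    int tasks = 0;
    for (int k = 1; k < n; ++k) {
        ++tasks;                                   // pivot task T(k,k)
        for (int j = k + 1; j <= n; ++j) ++tasks;  // update task T(k,j)
    }
    return tasks;
}

int main() {
    // Matches (n*n + n - 2) / 2: n = 5 gives 14, and n = 8 gives the
    // 35 tasks used in the simulation.
    for (int n : {3, 4, 5, 8})
        std::printf("n = %d: %d tasks (formula: %d)\n",
                    n, countTasks(n), (n * n + n - 2) / 2);
    return 0;
}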
6.2.1.1 Results for EADAGS

Figure 6.12 and Figure 6.13 show the average energy savings using EADAGS over DPS with respect to number of processors and CCR values. Figure 6.12 shows an increase in the average energy savings with increasing number of processors. This is because at each level only a certain number of tasks can be executed at the same time, so increasing the number of available processors produces more idle time, thus increasing the energy savings. The average energy savings measured were 32% for 2 processors using the 5V/off technique, 27% if processors operate at 2V during idle, 18% if processors use 3.3V during idle, 29% for 2V scale, and 19% for 3.3V scale. When three processors are used, the average energy savings were 52% for the 5V/off technique, 43% and 29% when processors are scaled down during their idle time to 2V and 3.3V respectively, 45% when processors use 2V scale, and 30% for 3.3V scale. For four processors the average energy savings were 63%, 53%, 36%, 54%, and 36% for 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively. For five processors the average energy savings were 69%, 57%, 38%, 58%, and 39% for the same five strategies; for 6 processors they were 74%, 62%, 42%, 63%, and 42%; and finally, with 7 processors, 77%, 64%, 43%, 65%, and 44%.

Figure 6.12. Average energy savings for Gaussian elimination algorithm with EADAGS with different numbers of processors

Figure 6.13. Average energy savings for Gaussian elimination algorithm with EADAGS with different CCR values

Figure 6.13 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. The average energy savings over DPS range from 52% for CCR = 0.1 to 74% for CCR = 10 with the 5V/off technique. Savings are smaller for 2V during idle; they range between 44% and 62%. Savings are smaller still for 3.3V during idle: 29% for CCR = 0.1 and 41% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over DPS range from 45% for CCR = 0.1 to 62% for CCR = 10, and from 30% for CCR = 0.1 to 42% for CCR = 10, respectively. Table 6.10 and Table 6.11 list the makespan and the average energy savings for different numbers of processors and different CCR values for the EADAGS algorithm.

Table 6.10. Makespan and average energy savings for different numbers of processors for Gaussian elimination with EADAGS

Processors  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
2           820       32.19%   27.04%          18.17%            28.97%    19.46%
3           824.4     52.1%    43.77%          29.41%            45.15%    30.34%
4           860       63.91%   53.69%          36.07%            54.75%    36.79%
5           784.8     68.19%   57.28%          38.49%            58.22%    39.12%
6           851.2     74.63%   62.69%          42.12%            63.44%    42.63%
7           831.2     77.3%    64.93%          43.63%            65.61%    44.08%

Table 6.11. Makespan and average energy savings with respect to CCR for Gaussian elimination with EADAGS

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  381.33    52.69%   44.26%          29.74%            45.94%    30.87%
0.5  421.67    54.19%   45.52%          30.58%            47.06%    31.62%
1    480       56.41%   47.38%          31.84%            48.76%    32.76%
5    1063.33   69.55%   58.42%          39.25%            59.06%    39.68%
10   1796.67   74.11%   62.25%          41.83%            62.63%    42.08%

6.2.1.2 Results for EAGS-D

The same numbers of processors and CCR values were tested in the second part of this experiment to evaluate EAGS-D. Figure 6.14 shows the average energy savings for EAGS-D with respect to number of processors. The average energy savings measured were 21% for 2 processors with the 5V/off technique, 17% when processors operate at 2V during idle, 11% if they operate at 3.3V during idle, 20% for 2V scale, and 13% for 3.3V scale. These values increase gradually with increasing number of processors, due to increasing idle time. The average energy savings were 69%, 58%, 38%, 59%, and 39% when 7 processors are used, for the 5V/off technique, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively.

Figure 6.14. Average energy savings for Gaussian elimination algorithm with EAGS-D with different numbers of processors

Figure 6.15 plots the average energy savings with respect to different CCR values. The average savings increase with increasing CCR. The average energy savings over HNPD range from 33% for CCR = 0.1 to 69% for CCR = 10 with the 5V/off technique. Savings are smaller for the 2V during idle operating voltage; they range between 27% and 58%.
Savings are smaller still for the 3.3V operating voltage: 18% for CCR = 0.1 and 39% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over HNPD range from 30% for CCR = 0.1 to 59% for CCR = 10, and from 20% for CCR = 0.1 to 40% for CCR = 10, respectively. Table 6.12 and Table 6.13 list the makespan and the average energy savings for different numbers of processors and different CCR values for the EAGS-D algorithm.

Figure 6.15. Average energy savings for Gaussian elimination algorithm with EAGS-D with different CCR values

Table 6.12. Makespan and average energy savings for different numbers of processors for Gaussian elimination with EAGS-D

Processors  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
2           484       21.15%   17.76%          11.94%            20.48%    13.76%
3           442.4     32.93%   27.66%          18.59%            29.79%    20.02%
4           432       45.68%   38.37%          25.78%            40.04%    26.91%
5           432       56.78%   47.69%          32.05%            49.03%    32.95%
6           432       63.98%   53.74%          36.11%            54.86%    36.86%
7           432       69.13%   58.07%          39.02%            59.02%    39.66%

Table 6.13. Makespan and average energy savings with respect to CCR for Gaussian elimination with EAGS-D

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  303.67    33.21%   27.9%           18.74%            30.03%    20.17%
0.5  305       33.59%   28.22%          18.96%            30.33%    20.38%
1    350       42.13%   35.39%          23.78%            37.27%    25.04%
5    580       62.75%   52.67%          35.39%            53.82%    36.16%
10   673.33    69.73%   58.58%          39.36%            59.58%    40.03%

Figure 6.16. Average energy savings for Gaussian elimination algorithm with EADAGS and EAGS-D for the five voltage scaling strategies

Figure 6.16 shows the average energy savings for both EADAGS and EAGS-D with respect to the five voltage scaling strategies tested for the Gaussian elimination algorithm with a matrix of size 8. The overall average energy savings for EAGS-D are lower than the average energy savings measured for EADAGS. That is due to the nature of the EAGS-D algorithm, which uses task duplication to minimize the makespan; duplication uses a large portion of the processor's idle time, and that reduces the energy savings.

6.2.2 Molecular dynamics code

Figure 6.17 represents the DAG of a molecular dynamics code as given in [Chun92]. We used this graph to evaluate the performance of EADAGS and EAGS-D. Since the graph has a fixed structure and a fixed number of nodes, the only parameters that could be varied were the number of processors and the CCR values. Since there are at most seven tasks at any level in Figure 6.17, the number of processors was bounded by seven. The amount of energy consumed was measured for the five different voltage scaling strategies explained earlier. We assumed that the computation costs of the nodes are not all equal, and that the communication costs are likewise not equal for all links, since the task computed at each node and the data communicated from one node to another differ. Five values of CCR were used in our experiments: 0.25, 0.5, 1, 5, and 10.

Figure 6.17. Directed a-cyclic graph (DAG) for a molecular dynamics code
6.2.2.1 Results for EADAGS

Figure 6.18 and Figure 6.19 show the average energy savings for EADAGS with respect to number of processors and CCR values, respectively. Figure 6.18 shows an increase in the average energy savings with increasing number of processors. The average energy savings measured were 22% for 2 processors with the 5V/off technique, 18% when processors operate at 2V during idle, 12% if they operate at 3.3V during idle, 19% for 2V scale, and 13% for 3.3V scale. When three processors are used, the average energy savings were 50% for the 5V/off technique, 40% when processors are scaled down to 2V during idle, 29% for 3.3V during idle, 44% for 2V scale, and 30% for 3.3V scale. For four processors, the average energy savings were 52%, 43%, 29%, 44%, and 30% for 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively; for five processors, 59%, 49%, 33%, 50%, and 33%; for six processors, 63%, 52%, 35%, 53%, and 36%; and finally, for seven processors, 67%, 57%, 38%, 57%, and 38%. The makespan and the average energy savings for different numbers of processors are listed in Table 6.14.

Figure 6.18. Average energy savings for molecular dynamics code with EADAGS with different numbers of processors

Table 6.14. Makespan and average energy savings for different numbers of processors for molecular dynamics code with EADAGS

Processors  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
2           740.8     22.08%   18.55%          12.46%            19.79%    13.30%
3           859.2     52.42%   44.04%          29.59%            44.82%    30.11%
4           67.2      52.19%   43.84%          29.46%            44.64%    30.00%
5           663.2     59.19%   49.72%          33.41%            50.41%    33.87%
6           634       63.07%   52.98%          35.6%             53.61%    36.02%
7           631.2     67.09%   57.09%          38.36%            57.64%    38.73%

Figure 6.19 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. The average energy savings over DPS range from 45% for CCR = 0.1 to 64% for CCR = 10 with the 5V/off technique. Savings are smaller for 2V during idle; they range between 38% and 54%. Savings are smaller still for 3.3V during idle: 25% for CCR = 0.1 and 36% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over DPS range from 39% for CCR = 0.1 to 54% for CCR = 10, and from 26% for CCR = 0.1 to 36% for CCR = 10, respectively. The makespan and the average energy savings for different CCR values are listed in Table 6.15.

Figure 6.19. Average energy savings for molecular dynamics code with EADAGS with different CCR values
Table 6.15. Makespan and average energy savings with respect to CCR for molecular dynamics code with EADAGS

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  391.33    45.99%   38.63%          25.96%            39.72%    26.69%
0.5  411.67    46.21%   38.81%          26.08%            39.85%    26.78%
1    453.33    48.29%   40.56%          27.25%            41.50%    27.89%
5    853.33    59.11%   49.65%          33.36%            50.17%    33.71%
10   1396.67   64.5%    54.18%          36.41%            54.51%    36.63%

6.2.2.2 Results for EAGS-D

Figure 6.20 and Figure 6.21 show the average energy savings for EAGS-D with respect to number of processors and CCR values, respectively. Figure 6.20 shows an increase in the average energy savings with increasing number of processors. The average energy savings measured were 18% for 2 processors using the 5V/off technique, 15% when processors operate at 2V during idle, 10% if they operate at 3.3V during idle, 17% for 2V scale, and 11% for 3.3V scale. The average energy savings increase to 61%, 51%, 34%, 52%, and 35% when 7 processors are used, for the 5V/off technique, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively. The makespan and the average energy savings for different numbers of processors are listed in Table 6.16.

Figure 6.20. Average energy savings for molecular dynamics code with EAGS-D with different numbers of processors

Table 6.16. Makespan and average energy savings for different numbers of processors for molecular dynamics code with EAGS-D

Processors  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
2           566.4     18.87%   15.85%          10.65%            17.44%    11.71%
3           488.8     27.58%   23.17%          15.57%            24.48%    16.45%
4           467.2     41.6%    34.95%          23.48%            36.02%    24.20%
5           459.2     52.96%   44.49%          29.89%            45.38%    30.49%
6           445.2     58.15%   48.85%          32.82%            49.64%    33.35%
7           438.4     61.72%   51.85%          34.84%            52.56%    35.31%

Figure 6.21 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. The average energy savings over HNPD range from 24% for CCR = 0.1 to 74% for CCR = 10 with the 5V/off technique. Savings are smaller for 2V during idle; they range between 20% and 62%. Savings are smaller still for 3.3V during idle: 13% for CCR = 0.1 and 41% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over HNPD range from 22% for CCR = 0.1 to 62% for CCR = 10, and from 15% for CCR = 0.1 to 42% for CCR = 10, respectively. The makespan and the average energy savings for different CCR values are listed in Table 6.17.

Figure 6.21. Average energy savings for molecular dynamics code with EAGS-D with different CCR values

Table 6.17. Makespan and average energy savings with respect to CCR for molecular dynamics code with EAGS-D

CCR  Makespan  5V/off   2V during idle  3.3V during idle  2V scale  3.3V scale
0.1  299.33    25.37%   21.31%          14.32%            22.73%    15.27%
0.5  321.67    30.13%   25.31%          17.01%            26.64%    17.90%
1    360       38.28%   32.16%          21.61%            33.35%    22.41%
5    506.67    49.51%   41.59%          27.94%            42.46%    28.53%
10   900       74.12%   62.26%          41.83%            62.75%    42.16%
Figure 6.22 shows the average energy savings for both EADAGS and EAGS-D with respect to the five voltage scaling strategies tested for the molecular dynamics code. The overall average energy savings for EAGS-D are lower than the average energy savings measured for EADAGS. That is due to the nature of the EAGS-D algorithm, which uses task duplication to minimize the makespan; duplication uses a large portion of the processor's idle time, and that reduces the energy savings.

Figure 6.22. Average energy savings for molecular dynamics code with EADAGS and EAGS-D for the five voltage scaling strategies

6.2.3 Fast Fourier Transform (FFT)

FFT is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. FFTs are of great importance to a wide variety of applications, from digital signal processing, to solving partial differential equations, to algorithms for multiplying large integers. The DFT and FFT are used to generate the frequency analysis of a discrete non-periodic signal. The computation of the DFT is complicated; it involves many additions and multiplications of complex numbers. Even a simple eight-sample signal requires 49 complex multiplications and 56 complex additions to work out the DFT. At this level it is still manageable; however, a realistic signal could have 1024 samples, which requires over 2,000,000 complex multiplications and additions (a quick check appears at the end of this subsection). The FFT is a simpler way of laying out the computation, and it is much faster for larger numbers of samples. The idea behind the FFT is the divide-and-conquer approach: break the original N-point sample into two (N/2)-point sequences, because a series of smaller problems is easier to solve than one large one. The DFT requires (N−1)² complex multiplications and N(N−1) complex additions, as opposed to the FFT's approach, which requires only 1 multiplication and 2 additions per step, plus the recombination of the points, which is minimal [Corm90].

The recursive, one-dimensional FFT task graph for 4 data points is shown in Figure 6.23 [Chun92]. The FFT algorithm consists of two parts: the recursive calls and the butterfly operations. The task graph in Figure 6.23 can accordingly be divided into two parts: the tasks above the dashed line are the recursive-call tasks, while the tasks below the dashed line are the butterfly-operation tasks.

Figure 6.23. The generated DAG for FFT with four points (tasks T1 through T15)

We used this task graph to evaluate the performance of EADAGS and EAGS-D. Since the graph has a fixed structure and a fixed number of nodes, the only parameters we changed were the number of processors and the CCR values. Since there are at most four tasks at any level in Figure 6.23, the number of processors was bounded to four, starting with 2 processors in the system and incrementing by 1 up to 4. Each path from the entry node to an exit node is a critical path, since the computation costs of tasks at any level are equal and the communication costs of all edges between two consecutive levels are equal. The amount of energy consumed was measured for the five different strategies listed before.
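As a quick check of the operation counts quoted above, a small C++ sketch evaluating the DFT cost formulas, (N−1)² complex multiplications and N(N−1) complex additions, for the two sample sizes mentioned (illustrative only):

#include <cstdio>
#include <initializer_list>

int main() {
    // DFT cost per the text: (N-1)^2 complex multiplications and
    // N(N-1) complex additions.
    for (long n : {8L, 1024L}) {
        long mults = (n - 1) * (n - 1);
        long adds  = n * (n - 1);
        // N = 8 gives 49 and 56; N = 1024 gives about 2.1 million
        // operations in total, which the FFT avoids.
        std::printf("N = %ld: %ld multiplications, %ld additions\n",
                    n, mults, adds);
    }
    return 0;
}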
The recursive, one-dimensional FFT task graph for 4 data points is shown in Figure 6.23 [Chun92]. The FFT algorithm consists of two parts: the recursive calls and the butterfly operations. The task graph in Figure 6.23 can be divided into two parts: the tasks above the dashed line are the recursive-call tasks, while the tasks below the dashed line are the butterfly-operation tasks.

Figure 6.23. The generated DAG for FFT with four points

We used this task graph to evaluate the performance of EADAGS and EAGS-D. Since the graph has a fixed structure and a fixed number of nodes, the only parameters we changed were the number of processors and the CCR values. Since there are at most four tasks at any level in Figure 6.23, the number of processors was bounded to four, starting with 2 processors and going up to 4, incrementing by 1. Each path from the entry node to an exit node is a critical path, since the computation costs of tasks at any level are equal and the communication costs of all edges between two consecutive levels are equal. The amount of energy consumed was measured for the five strategies listed before.

6.2.3.1 Results for EADAGS

Figure 6.24 shows the average energy savings for EADAGS with respect to the number of processors. It shows a decrease in the average energy savings with increasing number of processors. This is because increasing the number of processors allows several parallel task executions, thus minimizing the wait times which were used by our algorithm to save energy. The average energy savings measured were 34% for 2 processors using the 5V/off technique, 29% when processors operate at 2V during idle, 19% if they operate at 3.3V during idle, 30% for 2V scale, and 20% for 3.3V scale. When three processors are used, the average energy savings were 51% for the 5V/off technique, 42% for 2V during idle, 27% for 3.3V during idle, 44% for 2V scale, and 29% for 3.3V scale. For four processors, the average energy savings are 53%, 45%, 31%, 47%, and 33% for the 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively. Table 6.18 lists the average energy savings and the makespan for EADAGS over DPS for different numbers of processors.

Figure 6.24. Average energy savings for FFT with EADAGS with different number of processors

                         Percentage of energy savings
Processors   Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
2            233         34.97 %   29.37 %           19.74 %             30.08 %     20.43 %
3            296         55.89 %   46.95 %           31.54 %             44.02 %     29.11 %
4            232.8       51.32 %   43.11 %           28.96 %             47.66 %     33.59 %
Table 6.18. Makespan and average energy savings with respect to number of processors for FFT with EADAGS

Figure 6.25 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. When CCR increases, processors incur longer idle times due to communication between tasks. Our algorithm was able to use such idle times to achieve energy savings. The average energy savings over DPS ranges from 23% for CCR = 0.1 to 77% when CCR = 10 for the 5V/off technique. Savings are smaller for 2V during idle; they range between 19% and 64%. Savings are even smaller for 3.3V during idle; they are 12% for CCR = 0.1 and 43% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over DPS range from 23% for CCR = 0.1 to 64% when CCR = 10, and from 16% for CCR = 0.1 to 43% for CCR = 10, respectively.

Figure 6.25. Average energy savings for FFT with EADAGS with different CCR values

Table 6.19 lists the average energy savings and the makespan for EADAGS over DPS for different CCR values.

             Percentage of energy savings
CCR    Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
0.1    155.33      25.30 %   21.25 %           14.28 %             23.13 %     16.23 %
0.5    163.33      29.95 %   25.16 %           16.90 %             25.32 %     16.67 %
1      180         36.50 %   30.67 %           20.61 %             34.67 %     23.89 %
5      340         66.42 %   55.79 %           37.48 %             55.58 %     38.81 %
10     540         78.79 %   66.19 %           44.47 %             64.22 %     43.32 %
Table 6.19. Makespan and average energy savings with respect to CCR for FFT with EADAGS

6.2.3.2 Results for EAGS-D

Figure 6.26 and Figure 6.27 show the average energy savings for EAGS-D with respect to number of processors and CCR values respectively. Figure 6.26 shows an increase in the average energy savings with increasing number of processors. The average energy savings measured were 18% for 2 processors for the 5V/off technique, 14% for 2V during idle, 9% for 3.3V during idle, 15% for 2V scale, and 10% for 3.3V scale.
When three processors are used, the average energy savings were 39% for the 5V/off technique, 33% for 2V during idle, 21% for 3.3V during idle, 34% for 2V scale, and 22% for 3.3V scale. For four processors, the average energy savings were 49%, 41%, 27%, 42%, and 29% for the 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively.

Figure 6.26. Average energy savings for FFT with EAGS-D with different number of processors

Figure 6.27 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. When CCR increases, processors incur longer idle times due to communication between tasks. Our algorithm was able to use such idle times to achieve energy savings. The average energy savings over HNPD ranges from 16% for CCR = 0.1 to 61% when CCR = 10 for the 5V/off technique. Savings are smaller for 2V during idle; they range between 13% and 51%. Savings are even smaller for 3.3V during idle; they are 9% for CCR = 0.1 and 34% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over HNPD range from 15% for CCR = 0.1 to 51% when CCR = 10, and from 11% for CCR = 0.1 to 35% for CCR = 10, respectively. Tables 6.20 and 6.21 list the makespan and average energy savings for EAGS-D over HNPD for different numbers of processors and for different CCR values, respectively.

Figure 6.27. Average energy savings for FFT with EAGS-D with different CCR values

                         Percentage of energy savings
Processors   Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
2            218.4       18.09 %   14.88 %           9.34 %              15.82 %     10.90 %
3            208         39.96 %   33.25 %           21.68 %             34.18 %     22.21 %
4            199.2       49.93 %   41.62 %           27.31 %             42.15 %     29.23 %
Table 6.20. Makespan and average energy savings for different numbers of processors for FFT with EAGS-D

             Percentage of energy savings
CCR    Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
0.1    136         16.52 %   13.88 %           9.33 %              15.45 %     11.74 %
0.5    146.67      22.71 %   19.08 %           12.82 %             20.05 %     13.46 %
1      160         27.78 %   23.33 %           15.67 %             24.21 %     16.56 %
5      300         51.48 %   41.64 %           24.70 %             42.53 %     26.09 %
10     300         61.48 %   51.64 %           34.70 %             51.48 %     35.91 %
Table 6.21. Makespan and average energy savings with respect to CCR for FFT with EAGS-D

Figure 6.28 shows the average energy savings for both EADAGS and EAGS-D with respect to the five voltage scaling strategies tested for a four-point FFT DAG. The overall average energy savings for EAGS-D is lower than the average energy savings measured for EADAGS. This is due to the nature of the EAGS-D algorithm, which uses task duplication to minimize the makespan; duplication consumes a large portion of the processors' idle time and reduces the amount of energy savings.

Figure 6.28. Average energy savings for FFT with EADAGS and EAGS-D with voltage scaling levels

6.2.4 Sieve of Eratosthenes

The Sieve of Eratosthenes is a method of identifying all prime numbers in a sequence of numbers up to a certain N. A prime number is a natural number greater than 1 that can be divided only by itself and by 1, while a composite number n is a natural number that can be divided by a number less than n and greater than 1.
The Sieve of Eratosthenes identifies all prime numbers up to a given number N as follows [Corm90]:
1. Write down all numbers 1, 2, 3, ..., N. We will eliminate composites by marking them. Initially all numbers are unmarked.
2. Mark number 1 as special (it is neither prime nor composite).
3. Set k = 1. While k is less than the square root of N, do this:
   a. Find the first number in the list greater than k that has not been identified as composite (the first number found is 2) and call it m. Mark the numbers 2m, 3m, 4m, ... as composites. (Thus in the first run we mark all even numbers greater than 2. In the second run we mark all multiples of 3 greater than 3.)
   b. m is a prime number; put it on your list.
   c. Set k = m and repeat.
4. Put the remaining unmarked numbers in the list of prime numbers.
A short implementation sketch of these steps follows.
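The four steps above translate directly into code; the following C++ sketch is an illustration only, since our simulator schedules the task graph of Figure 6.29 rather than executing the sieve.

// Illustrative C++ transcription of the four steps above.
#include <cstdio>
#include <vector>

std::vector<int> sieve(int n) {
    std::vector<bool> composite(n + 1, false);  // steps 1-2: all unmarked
    for (int m = 2; m * m <= n; ++m) {          // step 3: while m <= sqrt(N)
        if (composite[m]) continue;             // 3a: first unmarked number
        for (int j = 2 * m; j <= n; j += m)     //     mark 2m, 3m, 4m, ...
            composite[j] = true;
    }
    std::vector<int> primes;                    // steps 3b and 4: unmarked
    for (int i = 2; i <= n; ++i)                // numbers are the primes
        if (!composite[i]) primes.push_back(i);
    return primes;
}

int main() {
    std::vector<int> p = sieve(32);             // N = 32, as in Figure 6.29
    for (std::size_t i = 0; i < p.size(); ++i)
        std::printf("%d ", p[i]);               // prints 2 3 5 7 ... 31
    std::printf("\n");
    return 0;
}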
The Sieve of Eratosthenes algorithm can be represented by a task graph (DAG). The DAG for the Sieve of Eratosthenes for N = 32 is shown in Figure 6.29 [Bask00].

Figure 6.29. Sieve of Eratosthenes task graph for N = 32

In the simulation, we used the graph of the Sieve of Eratosthenes for a sequence of 32 numbers (N = 32) to test both EADAGS and EAGS-D. Since the structure of the graph is fixed, only the number of processors and the CCR values were changed. For a sequence of 32 numbers, the total number of tasks in the graph is 32 and the largest number of tasks at a single level is 16, so the number of processors was bounded to 8. CCR took five different values: 0.1, 0.5, 1.0, 5.0, and 10. The amount of energy consumed was measured for the five operating strategies explained earlier in Chapter 5.

6.2.4.1 Results for EADAGS

Figure 6.30 shows a decrease in the average energy savings with increasing number of processors for the Sieve of Eratosthenes with EADAGS over DPS. This is because increasing the number of processors allows several parallel task executions, thus minimizing the wait times, which are used by our algorithm to save energy. The average energy savings measured were 23% for 2 processors using the 5V/off technique, 19% for 2V during idle, 13% for 3.3V during idle, 21% for 2V scale, and 14% for 3.3V scale. When three processors are used, the average energy savings were 41% for the 5V/off technique, 33% for 2V during idle, 22% for 3.3V during idle, 35% for 2V scale, and 25% for 3.3V scale. For eight processors, the average energy savings are 65%, 54%, 36%, 56%, and 37% for the 5V/off, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively. Table 6.22 lists the makespan and the average energy savings for each number of processors tested.

Figure 6.30. Average energy savings for different number of processors for Sieve of Eratosthenes with EADAGS

                         Percentage of energy savings
Processors   Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
2            502.4       23.57 %   19.73 %           13.26 %             21.19 %     14.89 %
3            472         41.35 %   33.14 %           22.37 %             35.58 %     25.82 %
4            384.8       43.12 %   34.22 %           24.34 %             37.18 %     27.62 %
5            382.4       53.78 %   45.18 %           30.36 %             47.97 %     32.20 %
6            382.4       61.49 %   51.65 %           34.70 %             54.98 %     35.91 %
7            382         62.89 %   54.19 %           36.16 %             56.91 %     38.22 %
8            391.2       65.26 %   54.82 %           36.83 %             56.33 %     37.83 %
Table 6.22. Makespan and average energy savings for different numbers of processors for Sieve of Eratosthenes with EADAGS

Figure 6.31 plots the average energy savings with respect to CCR. The average energy savings increased with increasing CCR. When CCR increases, processors incur longer idle times due to communication between tasks. Our algorithm is able to use such idle times to achieve energy savings. The average energy savings over DPS ranges from 29% for CCR = 0.1 to 77% when CCR = 10 for the 5V/off technique. Savings are smaller for 2V during idle; they range between 24% and 65%. Savings are even smaller for 3.3V during idle; they are 16% for CCR = 0.1 and 43% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over DPS range from 31% for CCR = 0.1 to 65% when CCR = 10, and from 20% for CCR = 0.1 to 45% for CCR = 10, respectively. Table 6.23 lists the average energy savings and the makespan for EADAGS over DPS for different CCR values.

Figure 6.31. Average energy savings for different CCR values for Sieve of Eratosthenes with EADAGS

             Percentage of energy savings
CCR    Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
0.1    233.71      29.69 %   24.53 %           16.05 %             31.01 %     20.84 %
0.5    244.29      35.23 %   29.50 %           19.75 %             33.69 %     21.29 %
1      268.57      41.57 %   33.49 %           23.46 %             34.19 %     23.97 %
5      505.71      66.74 %   56.40 %           38.47 %             57.28 %     39.77 %
10     817.14      77.81 %   65.30 %           43.70 %             65.34 %     45.84 %
Table 6.23. Makespan and average energy savings for different CCR values for Sieve of Eratosthenes with EADAGS

6.2.4.2 Results for EAGS-D

Figure 6.32 and Figure 6.33 show the average energy savings for EAGS-D with respect to the number of processors and CCR values respectively for the Sieve of Eratosthenes DAG. Figure 6.32 shows an increase in the average energy savings with increasing number of processors. The average energy savings measured were 12% for 2 processors using the 5V/off technique, 10% for 2V during idle, 6% for 3.3V during idle, 13% for 2V scale, and 8% for 3.3V scale. The average energy savings increase to 63%, 53%, 35%, 54%, and 37% when 8 processors are used for the 5V/off technique, 2V during idle, 3.3V during idle, 2V scale, and 3.3V scale respectively. The makespan and the average energy savings for different numbers of processors are listed in Table 6.24.

Figure 6.32. Average energy savings for Sieve of Eratosthenes with EAGS-D with different numbers of processors

                         Percentage of energy savings
Processors   Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
2            416         12.27 %   10.75 %           6.71 %              13.16 %     8.81 %
3            354.4       26.93 %   22.56 %           16.72 %             24.63 %     18.49 %
4            314.4       38.49 %   31.78 %           20.84 %             33.84 %     21.67 %
5            304         45.34 %   37.39 %           26.35 %             39.45 %     27.76 %
6            288.8       51.67 %   43.03 %           28.59 %             44.10 %     29.99 %
7            284         60.37 %   50.71 %           34.07 %             52.74 %     35.60 %
8            276.8       63.65 %   53.47 %           35.92 %             54.30 %     37.79 %
Table 6.24. Makespan and average energy savings for different numbers of processors for Sieve of Eratosthenes with EAGS-D

Figure 6.33 plots the average energy savings with respect to different CCR values. The average savings increased with increasing CCR. The average energy savings over HNPD ranges from 29% for CCR = 0.1 to 62% when CCR = 10 for the 5V/off technique.
Savings are smaller for 2V during idle; they range between 25% and 51%. Savings are even smaller for 3.3V during idle; they are 16% for CCR = 0.1 and 35% for CCR = 10. For 2V scale and 3.3V scale, the average energy savings over HNPD range from 30% for CCR = 0.1 to 51% when CCR = 10, and from 21% for CCR = 0.1 to 35% for CCR = 10, respectively. The makespan and the average energy savings for different CCR values are listed in Table 6.25.

Figure 6.33. Average energy savings for Sieve of Eratosthenes with EAGS-D with different CCR values

             Percentage of energy savings
CCR    Makespan    5V/off    2V during idle    3.3V during idle    2V Scale    3.3V Scale
0.1    236         29.94 %   25.23 %           16.59 %             30.15 %     21.21 %
0.5    242.86      34.05 %   28.02 %           18.98 %             31.77 %     21.03 %
1      257.14      38.66 %   32.50 %           21.80 %             32.32 %     21.71 %
5      342.86      48.32 %   40.72 %           27.77 %             41.28 %     29.37 %
10     520         62.38 %   51.87 %           35.74 %             51.71 %     35.34 %
Table 6.25. Makespan and average energy savings for different CCR values for Sieve of Eratosthenes with EAGS-D

Figure 6.34 shows the average energy savings for both EADAGS and EAGS-D with respect to the five voltage scaling strategies tested for the Sieve of Eratosthenes. The overall average energy savings for EAGS-D is lower than the average energy savings measured for EADAGS. This is due to the nature of the EAGS-D algorithm, which uses task duplication to minimize the makespan; duplication consumes a large portion of the processors' idle time and reduces the amount of energy savings.

Figure 6.34. Average energy savings for Sieve of Eratosthenes with EADAGS and EAGS-D with voltage scaling levels

CHAPTER 7
CONCLUSIONS

We have proposed two new scheduling algorithms, EADAGS and EAGS-D, which aim to minimize finish time as well as energy consumption through the use of dynamic voltage scaling. The results were based on a two-part software simulation study. The first part consists of a large set of randomly generated DAGs with various characteristics such as number of nodes, CCR, shape parameter, processor-to-node ratio, and out-degree. For each parameter the results were averaged across all other variables. A total of 10,800 randomly generated DAGs were used to evaluate each algorithm. The second part of the simulation contained DAGs for real-world problems, namely Gaussian elimination, molecular dynamics code, the Sieve of Eratosthenes, and the fast Fourier transform. These DAGs have specific structures, so the number of nodes, shape parameter, and out-degree are fixed. We tested both EADAGS and EAGS-D with different numbers of available processors and different CCR values. The results from the randomly generated DAGs showed that the EADAGS algorithm yielded an average energy saving of 40% over simple DPS, while the EAGS-D algorithm yielded a 28% reduction in energy, which is less than that for EADAGS due to the task duplication involved in EAGS-D. For the second test set, Gaussian elimination achieved average energy savings of 44% with EADAGS and 37% with EAGS-D. For the molecular dynamics code problem, the average energy savings with EADAGS and EAGS-D were 46% and 38% respectively. For FFT, the average energy savings measured were 42% and 33% for EADAGS and EAGS-D respectively.
For the Sieve of Eratosthenes, the average energy savings for EADAGS and EAGS-D were 40% and 36% respectively. The effect of different DAG characteristics on the amount of energy savings is summarized next. For both EADAGS and EAGS-D, the amount of energy savings increased with the number of nodes, because additional task dependencies create more idle time. The rate of increase is lower for EAGS-D than for EADAGS due to task duplication. An increase in CCR resulted in an increase in the average energy savings for both EADAGS and EAGS-D: when CCR increases, processors incur longer idle times due to communication between tasks, and both algorithms were able to use such idle times to achieve energy savings. The results also showed that the overall energy savings increased marginally with increasing shape parameter. Increasing the shape parameter increases parallelism in the DAG, resulting in more idle time for the processors due to task dependencies; this time was used by both EADAGS and EAGS-D to reduce the energy consumed. The last parameter tested was out-degree. An increase in out-degree resulted in smaller average energy savings, because a larger out-degree allows more processors to run in parallel, reducing the idle time on all processors and hence the energy that can be saved. Future work can involve applying the voltage scaling technique to other scheduling algorithms, seeking optimal solutions, especially for the real-world problems, and physically implementing the algorithms to save energy.

BIBLIOGRAPHY

[Bask00] S. Baskiyar, "Scheduling DAGs on message passing m-processor systems," IEICE Transactions on Information and Systems, vol. E-38-D, July 2000.
[Bask03] S. Baskiyar and C. Dickinson, "Scheduling Directed A-cyclic Graphs on a Bounded Set of Heterogeneous Processors Using Task Duplication," LNCS, vol. 2913, pp. 259-267, Springer-Verlag, 2003.
[Burd96] T. Burd and R. Brodersen, "Processor Design for Portable Systems," Journal of VLSI Signal Processing, 13(2-3), pp. 203-222, 1996.
[Chae04] C. Im and S. Ha, "Dynamic Voltage Scaling for Real-time Multi-task Scheduling using Buffers," Proc. ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 88-94, 2004.
[Chan96] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, "Data Driven Signal Processing: An Approach for Energy Efficient Computing," International Symposium on Low Power Electronics and Design, pp. 347-352, Aug. 1996.
[Choi04] K. Choi, R. Soma, and M. Pedram, "Dynamic Voltage and Frequency Scaling based on Workload Decomposition," Proceedings of the 2004 International Symposium on Low Power Electronics and Design, August 2004.
[Chun92] Y. Chung and S. Ranka, "Applications and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed Memory Multiprocessors," Proc. Supercomputing, pp. 512-521, Nov. 1992.
[Corm01] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, "Introduction to Algorithms," MIT Press, 2001.
[Dong01] J. J. Dongarra and D. W. Walker, "The Quest for Petascale Computing," IEEE Transactions on Computing in Science and Engineering, pp. 32-39, May 2001.
[Gebo96] C. H. Gebotys and R. J. Gebotys, "Power-Minimization in Heterogeneous Processing," Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS'96), vol. 1, pp. 330-337, January 1996.
[ImHa04] C. Im and S. Ha, "Dynamic Voltage Scaling for Real-time Multi-task Scheduling using Buffers," Proc.
ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 88-94, 2004.
[Iver98] M. A. Iverson and F. Ozguner, "Dynamic Competitive Scheduling of Multiple DAGs in a Distributed Heterogeneous Environment," Proceedings of the 1998 Workshop on Heterogeneous Processing, pp. 70-78, March 1998.
[Khok93] A. Khokhar, V. K. Prasanna, M. E. Shaaban, and C. Wang, "Heterogeneous Computing: Challenges and Opportunities," IEEE Transactions on Computers, vol. 26, pp. 18-27, June 1993.
[Kiro97] D. Kirovski and M. Potkonjak, "System-level Synthesis of Low-power Hard Real-time Systems," Proceedings of the 34th Annual Conference on Design Automation, 1997.
[Kump94] K. Li, R. Kumpf, P. Horton, and T. Anderson, "A Quantitative Analysis of Disk Drive Power Management in Portable Computers," Proc. Winter 1994 USENIX Conference, pp. 279-292, Jan. 1994.
[Kwok99] Y.-K. Kwok and I. Ahmad, "Link Contention-Constrained Scheduling and Mapping of Tasks and Messages to a Network of Heterogeneous Processors," Proceedings of the 1999 International Conference on Parallel Processing, pp. 551-558, September 1999.
[Lu00] Y. H. Lu, L. Benini, and G. De Micheli, "Low-Power Task Scheduling for Multiple Devices," International Workshop on Hardware/Software Codesign, pp. 39-43, May 2000.
[Mart02] S. Martin, K. Flautner, T. Mudge, and D. Blaauw, "Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads," Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, November 2002.
[Micr04] "OnNow Power Management Architecture for Applications," available at http://www.microsoft.com/hwdev/desinit/onnowapp.HTM
[Mish03] R. Mishra, N. Rastogi, D. Zhu, D. Mossé, and R. Melhem, "Energy Aware Scheduling for Distributed Real-Time Systems," Proc. Int'l Parallel and Distributed Processing Symposium, pp. 9-16, April 2003.
[Much97] S. Muchnick, "Advanced Compiler Design and Implementation," Morgan Kaufmann Publishers, Inc., 1997.
[Pouw01] J. Pouwelse, K. Langendoen, and H. Sips, "Dynamic Voltage Scaling on a Low-Power Microprocessor," Proc. of the 7th Annual International Conference on Mobile Computing and Networking, pp. 251-259, July 2001.
[Radu00] A. Radulescu and A. J. C. van Gemund, "Fast and Effective Task Scheduling in Heterogeneous Systems," 9th Heterogeneous Computing Workshop, pp. 229-239, May 2000.
[Reut97] C. Reuter, M. Schwiegershausen, and P. Pirsch, "Heterogeneous Multiprocessor Scheduling and Allocation using Evolutionary Algorithms," Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 294-303, July 1997.
[Seig97] M. Tan, H. J. Siegel, J. K. Antonio, and Y. A. Li, "Minimizing the Application Execution Time Through Scheduling of Subtasks and Communication Traffic in a Heterogeneous Computing System," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 8, pp. 857-870, August 1997.
[Shan03] L. Shang, L.-S. Peh, and N. K. Jha, "Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks," Proc. of the 9th International Symposium on High-Performance Computer Architecture, pp. 91-102, February 2003.
[Shin01] D. Shin, S. Lee, and J. Kim, "Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications," IEEE Design and Test of Computers, vol. 18, pp. 20-30, March 2001.
[Shiu00] W. T. Shiue and C. Chakrabarti, "Low-Power Scheduling with Resources Operating at Multiple Voltages,"
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 6, pp. 536-543, June 2000.
[Smar06] http://www.bsac.eecs.berkeley.edu/archive/users/warneke-brett/SmartDust/, accessed February 2006.
[Sriv96] M. Srivastava, A. Chandrakasan, and R. Brodersen, "Predictive System Shutdown and Other Architectural Techniques for Energy Efficient Programmable Computation," IEEE Transactions on VLSI Systems, vol. 4, pp. 42-55, March 1996.
[Tiwa94] V. Tiwari, S. Malik, and A. Wolfe, "Power Analysis of Embedded Software: A First Step toward Software Power Minimization," Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, November 1994.
[Tiwa96] V. Tiwari, S. Malik, A. Wolfe, and M. Lee, "Instruction Level Power Analysis and Optimization of Software," Journal of VLSI Signal Processing, 13(2/3), pp. 1-18, 1996.
[Topc99] H. Topcuoglu, S. Hariri, and M. Y. Wu, "Task Scheduling Algorithms for Heterogeneous Processors," Proceedings of the 8th Heterogeneous Computing Workshop, pp. 3-14, April 1999.
[Topc02] H. Topcuoglu, S. Hariri, and M. Y. Wu, "Performance-effective and Low Complexity Task Scheduling for Heterogeneous Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, March 2002.
[Wang97] L. Wang, H. J. Siegel, V. P. Roychowdhury, and A. A. Maciejewski, "Task Matching and Scheduling in Heterogeneous Computing Environments Using a Genetic-Algorithm-Based Approach," Journal of Parallel and Distributed Computing, vol. 47, pp. 8-22, November 1997.
[Weis94] M. Weiser, B. Welch, A. Demers, and S. Shenker, "Scheduling for reduced CPU energy," Proc. 1st USENIX Symposium on Operating Systems Design and Implementation, pp. 13-23, Nov. 1994.
[Yang01] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Verkest, and R. Lauwereins, "Energy-Aware Runtime Scheduling for Embedded-Multiprocessor SOCs," IEEE Design and Test of Computers, vol. 18, no. 5, pp. 46-58, Sept./Oct. 2001.
[Zhan02] Y. Zhang, X. S. Hu, and D. Z. Chen, "Task Scheduling and Voltage Selection for Energy Minimization," Design Automation Conference, pp. 183-188, June 2002.