ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS
COMPUTING ENVIRONMENTS
Except where reference is made to the work of others, the work described in this
dissertation is my own or was done in collaboration with my advisory committee. This
dissertation does not include proprietary or classified information.
__________________________
Prashanth C. Sai Ranga
Certificate of Approval:
__________________________
Homer W. Carlisle
Associate Professor
Computer Science and Software Engineering

__________________________
Sanjeev Baskiyar, Chair
Associate Professor
Computer Science and Software Engineering

__________________________
Yu Wang
Assistant Professor
Computer Science and Software Engineering

__________________________
Joe F. Pittman
Interim Dean
Graduate School
ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS
COMPUTING ENVIRONMENTS
Prashanth C. Sai Ranga
A Dissertation
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Doctor of Philosophy
Auburn, Alabama
December 15, 2006
ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS
COMPUTING ENVIRONMENTS
Prashanth C. Sai Ranga
Permission is granted to Auburn University to make copies of this dissertation at its
discretion, upon request of individuals or institutions and at their expense. The author
reserves all publication rights.
________________________
Signature of Author
________________________
Date of Graduation
DISSERTATION ABSTRACT
ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS
COMPUTING ENVIRONMENTS
Prashanth C. Sai Ranga
Doctor of Philosophy, December 15, 2006
(M.S., University of Texas at Dallas, December 2001)
(B.E., Bangalore University, India, August 1998)
136 Typed pages
Directed by Sanjeev Baskiyar
Current heterogeneous metacomputing systems, such as computational clusters and grids, offer a low-cost alternative to supercomputers. In addition, they are highly scalable and flexible. They consist of a host of diverse computational devices which collaborate via a high-speed network and may execute high-performance applications. Many high-performance applications are an aggregate of modules. Efficient scheduling of such applications on metacomputing systems is critical to meeting deadlines. In this dissertation, we introduce three new algorithms: the Heterogeneous Critical Node First (HCNF) algorithm, the Heterogeneous Largest Task First (HLTF) algorithm and the Earliest Finish Time with Dispatch Time (EFTDT) algorithm. HCNF is used to schedule parallel applications represented by directed acyclic graphs onto networks of workstations to minimize their finish times. We compared the performance of HCNF with those of the Heterogeneous Earliest Finish Time (HEFT) and Scalable Task Duplication based Scheduling (STDS) algorithms. In terms of Schedule Length Ratio (SLR) and speedup, HCNF outperformed HEFT on average by 13% and 18%, respectively. HCNF outperformed STDS in terms of SLR and speedup on average by 8% and 12%, respectively. The HLTF algorithm is used to schedule a set of independent tasks onto a network of heterogeneous processors to minimize finish time. We compared the performance of HLTF with that of the Sufferage algorithm. In terms of makespan, HLTF outperformed Sufferage on average by 4.5%, while running in about one-tenth of the time. The EFTDT algorithm schedules a set of independent tasks onto a network of heterogeneous processors to minimize finish time when considering the dispatch times of tasks. We compared the performance of EFTDT with that of a First-In-First-Out (FIFO) schedule. In terms of minimizing makespan, EFTDT outperformed FIFO on average by 30%.
ACKNOWLEDGMENTS
The author is highly indebted to his advisor, Dr. Sanjeev Baskiyar, for his clear
vision, encouragement, persistent guidance and stimulating technical inputs. His patience,
understanding and support are deeply appreciated. Thanks to Dr. Homer Carlisle and Dr.
Yu Wang, for their review and comments on this research work. Their invaluable time
spent on serving on my graduate committee is sincerely appreciated. Special thanks to
Mr. Victor Beibighauser, Mr. Basil Manly and Mr. Ron Moody of South University,
Montgomery, for their concern, understanding and cooperation. Finally, the author
would like to thank his parents, sister and brother-in-law for their constant support and
encouragement.
Style manual or journal used: IEEE Transactions on Parallel and Distributed Systems
Computer software used: Microsoft Word, Adobe PDF
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Cluster Computing
1.3 Grid Computing
1.4 Task Scheduling in Heterogeneous Computing Environments
1.5 NP-Complete Problems
1.6 Research Objectives and Outline
CHAPTER 2 LITERATURE REVIEW
2.1 Scheduling a Parallel Application Represented by a Directed Acyclic Graph onto a Network of Heterogeneous Processors to Minimize the Makespan
2.1.1 Directed Acyclic Graphs
2.1.2 Problem Statement
2.1.3 The Best Imaginary Level Algorithm
2.1.4 The Generalized Dynamic Level Algorithm
2.1.5 The Levelized Min Time Algorithm
2.1.6 The Heterogeneous Earliest Finish Time Algorithm
2.1.7 The Critical Path on Processor Algorithm
2.1.8 The Fast Critical Path Algorithm
2.1.9 The Fast Load Balancing Algorithm
2.1.10 The Hybrid Remapper Algorithm
2.1.11 Performance Comparison
2.2 Scheduling a Parallel Application Represented by a Set of Independent Tasks onto a Network of Heterogeneous Processors to Minimize the Makespan
2.2.1 Problem Statement
2.2.2 The Min-Min and the Max-Min Algorithms
2.2.3 The Sufferage Algorithm
CHAPTER 3 THE HETEROGENEOUS CRITICAL NODE FIRST ALGORITHM
3.1 Motivation
3.2 The HCNF Algorithm
3.3 Running Trace
3.4 Simulation Study
3.4.1 Performance Parameters
3.4.2 Randomly Generated Graphs
3.4.3 Gaussian Elimination Graphs
3.4.4 Benchmark Graphs
3.4.5 Parametric Random Graph Generator
3.5 Conclusion
CHAPTER 4 THE HETEROGENEOUS LARGEST TASK FIRST ALGORITHM
4.1 Motivation
4.2 The HLTF Algorithm
4.3 Theoretical Non-Equivalence of Sufferage and HLTF
4.4 Simulation Study
4.4.1 Comparison of Makespan
4.4.2 Comparison of Running Times
CHAPTER 5 SCHEDULING INDEPENDENT TASKS WITH DISPATCH TIMES
5.1 Motivation
5.2 The EFTDT Algorithm
5.3 Example Run of EFTDT
5.4 Simulation Study
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES

1.1 Architecture of Cluster Computing Systems
1.2 Grid Architecture
2.1 A sample DAG G_1
2.2 The BIL algorithm
2.3 The GDL algorithm
2.4 The LMT algorithm
2.5 The HEFT algorithm
2.6 The CPOP algorithm
2.7 The FCP algorithm
2.8 The FLB algorithm
2.9 The Hybrid Remapper algorithm
2.10 The Min-Min algorithm
2.11 The Sufferage algorithm
3.1 The HCNF algorithm
3.2 Sample DAG (G_1)
3.4 Gantt chart for G_1
3.5 HCNF running trace - step 1
3.6 HCNF running trace - step 2
3.7 HCNF running trace - step 3
3.8 HCNF running trace - step 4
3.9 HCNF running trace - step 5
3.10 HCNF running trace - step 6
3.11 HCNF running trace - step 7
3.12 HCNF running trace - step 8
3.13 HCNF running trace - step 9
3.14 HCNF running trace - step 10
3.15 Random graphs - Average SLR vs. number of nodes
3.16 Random graphs - Average speedup vs. number of nodes
3.17 Random graphs - Average SLR vs. CCR (0.1 to 1)
3.18 Random graphs - Average SLR vs. CCR (1 to 5)
3.19 Random graphs - Average speedup vs. CCR (0.1 to 1)
3.20 Random graphs - Average speedup vs. CCR (1 to 5)
3.21 Gaussian Elimination - Average SLR vs. matrix size
3.22 Gaussian Elimination - Efficiency vs. no. of processors
3.23 Trace Graphs - SLR
3.24 Trace Graphs - Speedup
3.25 RGBOS SLR (CCR = 0.1)
3.26 RGBOS SLR (CCR = 1.0)
3.27 RGBOS SLR (CCR = 10.0)
3.28 RGBOS Speedup (CCR = 0.1)
3.29 RGBOS Speedup (CCR = 1.0)
3.30 RGBOS Speedup (CCR = 10.0)
3.31 RGPOS SLR (CCR = 0.1)
3.32 RGPOS SLR (CCR = 1.0)
3.33 RGPOS SLR (CCR = 10.0)
3.34 RGPOS Speedup (CCR = 0.1)
3.35 RGPOS Speedup (CCR = 1.0)
3.36 RGPOS Speedup (CCR = 10.0)
3.37 Fast Fourier Transform SLR vs. CCR
3.38 Fast Fourier Transform Speedup vs. CCR
3.39 Cholesky Factorization Speedup vs. CCR
3.40 Gaussian Elimination Speedup vs. CCR
3.41 Laplace Transform Speedup vs. CCR
3.42 LU Decomposition Speedup vs. CCR
3.43 MVA Speedup vs. CCR
3.44 Cholesky SLR vs. CCR
3.45 Gaussian Elimination SLR vs. CCR
3.46 Laplace Transform SLR vs. CCR
3.47 LU Decomposition SLR vs. CCR
3.48 MVA SLR vs. CCR
3.49 Parametric random graphs - SLR vs. number of nodes
3.50 Parametric random graphs - Speedup vs. number of nodes
3.51 Parametric random graphs - SLR vs. CCR (0.1 to 0.9)
3.52 Parametric random graphs - SLR vs. CCR (1.0 to 5.0)
3.53 Parametric random graphs - Speedup vs. CCR (0.1 to 0.9)
3.54 Parametric random graphs - Speedup vs. CCR (1.0 to 5.0)
4.1 Running times of the Sufferage Algorithm
4.2 HLTF Algorithm
4.3 The Sufferage algorithm
4.4 Average Makespan of Metatasks, std_dev = 5
4.5 Average Makespan of Metatasks, std_dev = 10
4.6 Average Makespan of Metatasks, std_dev = 15
4.7 Average Makespan of Metatasks, std_dev = 20
4.8 Average Makespan of Metatasks, std_dev = 25
4.9 Average Makespan of Metatasks, std_dev = 30
4.10 Running Times {n = 50, 100, 200}
4.10 Running Times {n = 500, 1000, 2000}
4.11 Running Times {n = 3000, 4000, 5000}
5.1 The EFTDT Algorithm
5.2 Gantt Chart for the Metatask
5.3 Average Makespan, std_dev = 5, proc_dev = 2
5.4 Average Makespan, std_dev = 10, proc_dev = 2
5.5 Average Makespan, std_dev = 15, proc_dev = 2
5.6 Average Makespan, std_dev = 20, proc_dev = 2
5.7 Average Makespan, std_dev = 25, proc_dev = 2
5.8 Average Makespan, std_dev = 30, proc_dev = 2
5.9 Average Makespan, std_dev = 5, proc_dev = 4
5.10 Average Makespan, std_dev = 10, proc_dev = 4
5.11 Average Makespan, std_dev = 15, proc_dev = 4
5.12 Average Makespan, std_dev = 20, proc_dev = 4
5.13 Average Makespan, std_dev = 25, proc_dev = 4
5.14 Average Makespan, std_dev = 30, proc_dev = 4
5.15 Average Makespan, std_dev = 5, proc_dev = 6
5.16 Average Makespan, std_dev = 10, proc_dev = 6
5.17 Average Makespan, std_dev = 15, proc_dev = 6
5.18 Average Makespan, std_dev = 20, proc_dev = 6
5.19 Average Makespan, std_dev = 25, proc_dev = 6
5.20 Average Makespan, std_dev = 30, proc_dev = 6
LIST OF TABLES

2.1 Table of values for G_1
2.2 Definition of terms used in BIL
2.3 Definition of terms used in GDL
2.4 Definition of terms used in LMT
2.5 Definition of terms used in HEFT
2.6 Definition of terms used in CPOP
2.7 Definition of terms used in FCP
2.8 Definition of terms used in FLB
2.9 Definition of terms used in Hybrid Remapper
2.10 Performance Comparison
2.11 Definition of terms used in Min-Min
2.12 Definition of terms used in Sufferage
3.1 HCNF - definition of terms
3.2 Task execution times of G_1 on three different processors
3.3 Runtime values for G_1
3.4 Trace graph details
4.1 Definition of Terms used in Sufferage and HLTF
4.2 Theoretical Non-equivalence of the Sufferage and the HLTF Algorithms
5.1 EFTDT Algorithm - Definition of Terms
5.2 A sample metatask
5.3 Metatask Dispatch Times
CHAPTER 1
INTRODUCTION
This chapter provides an introduction to our research work and discusses a few
relevant topics. Section 1.1 discusses our research motivation. Section 1.2 describes the
architecture of cluster computing systems. Section 1.3 describes the architecture of grid
computing systems. Section 1.4 provides an overview of task scheduling in
heterogeneous computing systems. Section 1.5 provides an introduction to NP-complete
problems and Section 1.6 discusses the organization of this dissertation.
1.1 Motivation
Information Technology has revolutionized the way we share and use information. The IT revolution has witnessed a myriad of applications with a wide range of objectives, including small personal-computer-based applications like a calculator program, medium-sized applications like Microsoft Word, large applications like Computer-Aided Design software, and very large applications like weather forecasting. Some of these programs can run efficiently on a normal personal computer and some may need a more powerful workstation. However, there are applications like weather forecasting, earthquake analysis, particle simulation and a host of other engineering and scientific applications that require computing capabilities beyond those of personal computers or workstations. They are called "High-Performance Applications".
How do we run these high-performance applications efficiently, given that sequential computers (PCs, workstations) are too slow to handle them? There are three ways to improve efficiency [1]: work harder, work smarter, or get help. In this context, working harder refers to increasing the speed of sequential uniprocessor computers. In the last two decades, microprocessor speed has on average doubled every 18 months. Today's microprocessor chips are faster than the mainframes of yesteryear, owing to phenomenal advances in Very Large Scale Integration (VLSI) technology. Even though this trend is expected to continue, microprocessor speed is severely limited by the laws of physics and thermodynamics [2]. There is a very high probability that it will eventually hit a plateau in the near future.
Working smarter refers to designing efficient algorithms and programming
environments to deal with highperformance applications. By working smarter, we can
definitely improve the overall efficiency, but will not be able to overcome the speed
bottleneck of sequential computers.
Getting help refers to involving multiple processors to solve the problem. The idea of multiple processors working together simultaneously to run an application is called "Parallel Processing." Most applications consist of thousands of modules or subprograms that may or may not interact with each other, depending on the nature of the application. In either case, there are usually a number of modules that are independent of one another and could run simultaneously on different processors. The parallel nature of many applications is what makes parallel processing very appealing. In other words, if applications were one large sequential module, parallel processing would not be feasible.
Parallel processing has captivated researchers for a long time. The initial trend in parallel processing was to create tightly coupled multiprocessor systems with shared memory, running proprietary software. These systems were generally referred to as "Supercomputers." Supercomputers were extremely fast and expensive. In the 1960s, Seymour Cray created the world's first commercial supercomputer, the CDC 6600. Other companies like IBM, Digital and Texas Instruments created their own proprietary versions of supercomputers. The 1970s and 1980s witnessed major companies and research labs across the world vie with one another to create the world's fastest supercomputer. Even though the trend continues to this day, parallel processing has slowly drifted away from supercomputing for a number of reasons. Supercomputers are extremely expensive systems that run on proprietary technology, and because they run on proprietary technology, they offer less flexibility with respect to developing software solutions to execute high-performance applications. Since supercomputers are very expensive to lease, purchase and maintain, it is beyond the reach of many organizations to deploy them. Also, in view of today's technological growth, it is important for systems to be readily scalable; owing to factors like proprietary hardware and software technologies, most supercomputers are not. To summarize, supercomputers have a very high cost/performance ratio.
This very high cost/performance ratio made supercomputers unattractive to a number of organizations. Most organizations (business, academic, military, etc.) were interested in high-performance computing but were seeking systems with a low cost/performance ratio, which supercomputers could not offer. In the meantime, PCs and workstations became extremely powerful and significant advances were made in networking technologies. Researchers began to explore the possibility of connecting low-cost PCs with a high-speed network to mimic the functioning of a supercomputer, albeit with a low cost/performance ratio.
Extensive research has been carried out to create high-performance systems by connecting PCs/workstations with a high-speed network. Most of the research has focused on creating viable parallel programming environments, developing high-speed network protocols and devising effective scheduling algorithms. Initially, the PCs/workstations had uniform hardware characteristics and thus the systems were termed "Homogeneous." However, due to rapid advances in PC technology, computers and other hardware items had to be continuously upgraded, and it was no longer the case that all the machines had identical hardware characteristics. This led to the notion of "Heterogeneous Systems," in which the individual PCs/workstations in a network can have different hardware characteristics. Researchers today focus on creating a high-performance system with a low cost/performance ratio using a heterogeneous Network of Workstations (NOW).
So, what goes into creating a viable high-performance computing system with a low cost/performance ratio out of a NOW, given that we have powerful workstations and very high-speed networks? First, an efficient runtime environment must be provided for high-performance applications. Extensive research has been done in this area and has led to the creation of efficient technologies like the Message Passing Interface (MPI) [2] and the Parallel Virtual Machine (PVM) [2]. Second, in order to provide a low cost/performance ratio, these systems must optimize the overall execution time (or turnaround time) of high-performance applications. This requires efficient scheduling of the subtasks of high-performance applications onto the individual machines of a NOW. The subtasks of a parallel application may either be independent or may have precedence constraints. In either case, the problem of scheduling these subtasks to optimize the overall execution time of an application is a well-known NP-complete problem [3].
The focus of our research is to devise efficient algorithms for scheduling parallel applications, represented either by independent tasks or by tasks with precedence constraints, onto heterogeneous computing systems to minimize the overall execution time. We strongly believe that efficient task scheduling is the most important factor in creating a low-cost, high-performance computing system. We now discuss the architectures of two very popular heterogeneous computing systems, the Cluster and the Grid.
1.2 Cluster Computing
A cluster is a heterogeneous parallel computing system which consists of several stand-alone systems that are interconnected to function as an integrated computing resource. A cluster generally refers to two or more computers interconnected via a local area network. A cluster of computers can appear as a single system to users and applications. It provides a low-cost alternative to supercomputers with reasonable performance.
Figure 1.1 describes the architecture and the main components of a cluster computing system [2]. The individual nodes of a cluster could be PCs or high-speed workstations connected through a high-speed network. The network interface hardware acts as a communication processor and is responsible for transmitting and receiving packets of data between cluster nodes. The cluster communication software provides a means for fast and reliable data communication among cluster nodes and to the outside world. Clusters often use communication protocols such as "Active Messages" [2] for fast communication among their nodes. These protocols usually bypass the operating system, removing the communication overhead normally involved, by providing direct user-level access to the network interface.
Figure 1.1 Architecture of Cluster Computing Systems
The cluster nodes can either work as individual computers or work collectively as an integrated computing resource. The cluster middleware is responsible for offering the illusion of a unified system image (Single System Image) and availability out of a collection of independent but interconnected computers. Parallel programming environments offer portable, efficient, and easy-to-use tools for the development of applications. They include message passing libraries, debuggers, and profilers. Clusters also run resource management and scheduling software such as LSF (Load Sharing Facility) and CODINE (Computing in Distributed Networked Environments) [2]. The individual nodes of a cluster can have different hardware characteristics, and new nodes can be seamlessly integrated into existing clusters, making them easily scalable. Clusters make use of these hardware and software resources to execute high-performance applications and typically provide a very low cost/performance ratio.
1.3 Grid Computing
The massive growth of the Internet in recent years has encouraged many scientists to explore the possibility of harnessing idle CPU clock cycles and other unutilized computational resources spread across the Internet. The idea was to harness idle CPU cycles and other computational resources and provide a unified computational resource to those in need of high-performance computation. This led to the notion of "Grid Computing."

The concept of grid computing is similar to that of "Electrical Grids." In electrical grids, power generation stations in different geographical locations are integrated to provide a unified power resource for consumers to plug into on demand. In the same fashion, computational grids allow users to plug into a virtual unified resource for their computational needs.
1.3.1 Architecture of a Grid Computing System
Grid systems are highly complex and comprise a host of integrated hardware and software features, as illustrated in Figure 1.2. The following subsections describe the major components of a grid.
Figure 1.2 Grid Architecture
1.3.1.1 Interface
Grid systems are designed to shield their internal complexities from users. User
interfaces can come in many forms and can be application specific. Typically grid
interfaces are similar to web portals. A grid portal provides an interface to launch
applications which would use its resources and services. Through this interface, users see
the grid as a virtual computing resource.
1.3.1.2 Security
Security is a critical issue in grid computing. A grid environment should provide security mechanisms, which include authentication, authorization, data encryption, etc. Most grid implementations include an OpenSSL [4] implementation. They also provide a single sign-on mechanism, so that once a user is authenticated, a proxy certificate is created and used while performing actions within the grid.
1.3.1.3 Broker
A grid system typically consists of a diverse range of resources spread across the Internet. When a user desires to launch an application through the portal, the system needs to identify the appropriate resources to use, depending on the application and other parameters provided by the user. This task is accomplished by the grid broker system. The broker makes use of the services provided by the Grid Information Service (GIS), which is also known as the Monitoring and Discovery Service (MDS). It provides information about the available resources within the grid and their status. Upon identifying available resources, the broker needs to choose the most viable resource based on the requirements of the user. Resource brokering is a major research topic in grid computing and forms the focus of what is known as "G-Commerce."
1.3.1.4 Scheduler
Applications requiring the services of a grid could be one large module or could consist of several independent modules with or without data dependencies. Depending on the nature of the application, the scheduler must be able to effectively map the application or its components onto the best available resources. Most grid schedulers use different algorithms to deal with different cases, and have a number of algorithms to choose from depending on scheduling parameters and user requirements. However, the most common criterion for schedulers is to minimize the turnaround time of an application.
1.3.1.5 Data Management
Scheduling high-performance applications onto grids constantly requires movement of data files from one node to another. The grid environment should provide a reliable and secure means for data exchange. The Data Management component of the grid system commonly uses the Grid Access to Secondary Storage (GASS) [4] component to move data files across the grid. GASS incorporates GridFTP, which is a protocol built on top of the standard FTP in the TCP/IP protocol suite. The GridFTP protocol adds a layer of encryption and other security features on top of the standard FTP protocol.
1.3.1.6 Job Management
This component includes the core set of services that perform the actual work in a grid environment. It provides services to launch a job on a particular resource, check its status, and retrieve results when it is complete. The component is also responsible for ensuring fault tolerance.
1.4 Overview of Task Scheduling in Heterogeneous Computing Environments
There are a number of reasons why scheduling programs, or the tasks that comprise them, is important. For users, it is important that the programs they wish to run are executed as quickly as possible (faster turnaround times). On the other hand, the owners of computing resources would ideally wish to optimize their machine utilization. These two objectives, faster turnaround times and optimal resource utilization, are not always complementary. Owners are not usually willing to let a single user utilize all their resources (especially in grid systems), and users are not usually willing to wait an arbitrarily long time before they are allowed access to particular resources. Scheduling, from both points of view, is the process by which both the users and the owners achieve a satisfactory quality of service.
1.4.1 Scheduling Strategies
There are different approaches to the selection of processors onto which the subtasks of a program are placed for execution. In the static model, each subtask is assigned to a processor before the execution of a program commences. In the dynamic scheduling model, subtasks are assigned to processors at runtime. In the hybrid scheduling model, a combination of both static and dynamic scheduling strategies is used.
1.4.1.1 Static Scheduling
In the static model, all subtasks of a program are assigned once to a processing element. An estimate of the cost of computation can be made a priori. Heuristic models for static task scheduling are discussed in Chapter 2. One of the main benefits of the static model is that it is easier to implement from a scheduling and mapping point of view. Since the mapping of tasks is fixed a priori, it is easy to monitor the progress of computation. Likewise, estimating the cost of jobs is simplified. Processors can give estimates of the time that will be spent processing the subtasks. On completion of the program, they can be instructed to supply the precise time that was spent in processing. This facilitates updating of actual running costs and could be used in making performance estimates for new programs. The static scheduling model has a few drawbacks. The model is based on an approximate estimation of processor execution times and interprocessor communication times. The actual execution time of a program may often vary from the estimated execution time, sometimes resulting in a poorly generated schedule. This model also does not consider node and network failures.
1.4.1.2 Dynamic Scheduling
Dynamic scheduling operates on two levels: a local scheduling strategy and a load distribution strategy. The load distribution strategy determines how tasks are placed on remote machines. It uses an information policy to determine the kind of information that needs to be collected from each machine, the frequency at which it needs to be collected, and the frequency at which it needs to be exchanged among different machines. In a traditional dynamic scheduling model, the subtasks of an application are assigned to processors based on whether they can provide an adequate quality of service. The meaning of quality of service is dependent on the application. Quality of service could refer to whether an upper bound can be placed on the time a task needs to wait before it can start its execution, the minimum time in which the task can complete its execution without interruption, or the relative speed of the processor as compared to other processors in the system. If a processor is assigned too many tasks, it may invoke a transfer policy to check whether it needs to transfer tasks to other nodes and, if so, to which ones. The transfer of tasks could be sender-initiated or receiver-initiated. In the latter case, a processor that is lightly loaded will voluntarily advertise to offer its services to heavily loaded nodes.
The main advantage of dynamic scheduling over static scheduling is that the
scheduling system need not be aware of the runtime behavior of the application before
execution. Dynamic scheduling is particularly useful in systems where the goal is to
optimize processor utilization as opposed to minimizing the turnaround times. Dynamic
scheduling is also more efficient and fault tolerant when compared to static scheduling.
1.4.1.3 Hybrid Static-Dynamic Scheduling
Static scheduling algorithms are easy to implement and usually have a low schedule-generation cost. However, since static scheduling is based on estimated execution costs, it may not always produce the best schedules. On the other hand, dynamic scheduling uses runtime information in the scheduling process and generates better schedules, but it suffers from very high running costs and may be prohibitively expensive when trying to schedule very large applications with tens of thousands of subtasks. Since both scheduling techniques have their own advantages, researchers have tried to combine them to create a hybrid scheduling technique. Usually in hybrid scheduling, the initial schedule is obtained using static scheduling and the subtasks are mapped onto the respective processors. However, after execution commences, the processors use runtime information to check whether tasks could be mapped to better processors to yield a better makespan. The running cost of a hybrid scheduling algorithm is greater than that of static scheduling algorithms, but is significantly lower than that of dynamic-only scheduling algorithms.
1.5 NP-Complete Problems
Computational problems can be broadly classified into two categories: tractable problems and intractable problems [3]. Tractable problems are the ones whose worst-case running time, or time complexity, is smaller than O(n^k), where n is the input size of the problem and k is a constant. These problems are also known as "Polynomial Time Problems," since they can be executed in polynomial time. Intractable problems are ones that cannot be executed in polynomial time; they take superpolynomial time to execute.

However, there is a class of problems whose status is unknown to this day. These problems are known as the "NP-complete problems." For these problems, no polynomial-time solution has yet been discovered, nor has anyone been able to prove a superpolynomial lower bound for them [3]. Many computer scientists believe that NP-complete problems are intractable. This is mainly because there has been no success in devising a polynomial-time solution to any of the existing NP-complete problems so far, and if a polynomial-time solution were devised for one NP-complete problem, a polynomial-time solution could mathematically be devised for all NP-complete problems.
Algorithm designers need to understand the basics and importance of NP-complete problems. If designers can prove that a problem is NP-complete, then there is a good chance that the problem is intractable. If a problem is intractable, it is better to design an approximation algorithm than to search for an exact one.

The task scheduling problems that form the focus of this dissertation are well-known NP-complete problems [3]. We devise approximation algorithms, or heuristics, to deal with various cases of the task-scheduling problem, which forms the focus of this research.
1.6 Research Objectives
In this dissertation, we propose new algorithms for scheduling tasks in heterogeneous computing systems. In Chapter 2, we provide a comprehensive literature review of existing work in the area of task scheduling in heterogeneous computing systems. In Chapter 3, we propose a new algorithm called the Heterogeneous Critical Node First (HCNF) algorithm to schedule a parallel application modeled by a Directed Acyclic Graph (DAG) onto a network of heterogeneous processing elements. In Chapter 4, we propose a new low-complexity algorithm called the Heterogeneous Largest Task First (HLTF) algorithm to schedule the independent tasks of a metatask onto a network of heterogeneous processors. In Chapter 5, we propose a new algorithm called the Earliest Finish Time with Dispatch Time (EFTDT) algorithm to schedule a set of independent tasks of a metatask onto a network of heterogeneous processors while also considering dispatch times. In Chapter 6, we provide concluding remarks and make suggestions for future research in this area.
CHAPTER 2
LITERATURE REVIEW
Among the problems related to task scheduling in heterogeneous computing environments, scheduling a parallel application represented by a directed acyclic graph (DAG) to minimize the overall execution time (makespan) and scheduling a parallel application represented by a metatask (a set of independent tasks) to minimize the makespan are the most important and most often researched. This chapter defines the two problems and surveys related research work.
2.1 Scheduling Parallel Applications Represented by Directed Acyclic Graphs
onto Heterogeneous Computing Systems to Minimize the Makespan
Many parallel applications consist of subtasks with precedence constraints and can be modeled by directed acyclic graphs. This section discusses the problem of scheduling a parallel application represented by a DAG onto a network of heterogeneous processors to minimize its makespan and reviews related research work.
2.1.1 Directed Acyclic Graphs
A DAG is represented by G = {V, E, W, C}. V is the set of n nodes {n_1, n_2, n_3, n_4, ...}. E is the set of directed edges of the form (n_i, n_j), which represents an edge directed from node n_i to node n_j. W is the set of node weights of the form w_i, where w_i denotes the weight of node n_i. C is the set of edge weights of the form c_{i,j}, where c_{i,j} denotes the weight of the edge (n_i, n_j). A DAG is a graph without a cycle (a directed path from a node onto itself). The set of nodes in a DAG which have an edge directed towards a node n_i are called its predecessor nodes and are denoted by PRED(n_i). Likewise, the set of nodes which have a directed edge from a node n_i are called its successor nodes and are denoted by SUCC(n_i). Nodes in a DAG that do not have a predecessor are called start nodes and nodes that do not have a successor are called exit nodes. blevel(n_i) is the bottom level of n_i and is the length of the longest path from n_i to any exit node, including the weight of n_i. The length of a path in a DAG is the sum of its node and edge weights. tlevel(n_i) is the top level of n_i and is the length of the longest path from a start node to node n_i, excluding the weight of n_i. The longest path in a DAG is called the critical path. A DAG may have multiple critical paths. A sample DAG is illustrated in Figure 2.1. The node weights are to the right of each node and the edge weights are to the left of each edge. Table 2.1 provides the table of values for the sample DAG.
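To make these definitions concrete, the short Python sketch below (not part of the original dissertation) builds a small illustrative DAG whose node and edge weights are arbitrary rather than taken from Figure 2.1, and computes blevel and tlevel exactly as defined above: path lengths are sums of node and edge weights, blevel includes the node's own weight, and tlevel excludes it.

# A tiny illustrative DAG: node weights, edge weights, successor/predecessor maps.
# Weights are arbitrary values for the example, not those of Figure 2.1.
node_w = {1: 10, 2: 8, 3: 12, 4: 6}
edge_w = {(1, 2): 5, (1, 3): 3, (2, 4): 4, (3, 4): 2}

succ = {n: [] for n in node_w}
pred = {n: [] for n in node_w}
for (a, b) in edge_w:
    succ[a].append(b)
    pred[b].append(a)

def blevel(n):
    # Longest path from n to an exit node, including n's own weight.
    if not succ[n]:
        return node_w[n]
    return node_w[n] + max(edge_w[(n, s)] + blevel(s) for s in succ[n])

def tlevel(n):
    # Longest path from a start node to n, excluding n's own weight.
    if not pred[n]:
        return 0
    return max(tlevel(p) + node_w[p] + edge_w[(p, n)] for p in pred[n])

for n in sorted(node_w):
    print(n, "blevel =", blevel(n), "tlevel =", tlevel(n))

For any node, tlevel(n_i) + blevel(n_i) is the length of the longest path through n_i; nodes for which this sum equals the critical-path length lie on a critical path, a fact the CPOP algorithm of Section 2.1.7 relies on.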
2.1.2 Problem Statement
The objective is to schedule a parallel application represented by a DAG onto a network of heterogeneous processors to minimize its overall execution time. Node weights in a DAG represent average execution times of nodes over all the processors in the target execution system. Edges represent precedence constraints between nodes. An edge (n_i, n_j) indicates that node n_j cannot start execution until n_i completes execution and n_j receives all the required data from it. Edge weights represent the time required to transfer the required data.
Figure 2.1 A sample DAG, G_1

Table 2.1 Table of values for G_1

n_i   PRED(n_i)   SUCC(n_i)     tlevel(n_i)   blevel(n_i)
1     {null}      {2,3,4,5,6}   0             108.01
2     {1}         {8,9}         31            77.01
3     {1}         {7}           25            80
4     {1}         {8,9}         22            81.34
5     {1}         {9}           24            69
6     {1}         {8}           27            63.34
7     {3}         {10}          62.33         42.67
8     {2,4,6}     {10}          66.67         35.67
9     {2,4,5}     {10}          67.67         44.34
10    {7,8,9}     {null}        97.34         14.67
The target execution system consists of a finite number of heterogeneous processors connected by a high-speed network. Communication among processors is assumed to be contentionless. Computation and communication are assumed to take place simultaneously. Node execution is assumed to be non-preemptive, meaning that nodes once scheduled on a processor cannot be removed (preempted) and scheduled on other processors. If a DAG has multiple start nodes, a dummy start node with a zero node weight is added, and zero-weight communication edges are added from the dummy start node to the original start nodes. Likewise, if a DAG has multiple exit nodes, a dummy exit node is added. The makespan of a DAG is the time difference between the commencement of execution of the start node and the completion of execution of the exit node. The heterogeneous DAG scheduling problem is NP-complete [28] and can be formally defined as: schedule the nodes of a DAG representing a parallel application onto a network of heterogeneous processors such that all the data precedence constraints are satisfied and the overall execution time of the DAG is minimized. The following sections survey existing research related to this problem.
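The dummy-node construction described above is mechanical; the minimal sketch below (again not from the dissertation, with the labels dummy_start and dummy_exit and all weights invented for illustration) adds a zero-weight node and zero-weight edges whenever a DAG has more than one start or exit node.

# Add zero-weight dummy start/exit nodes to a DAG given as weight maps.
node_w = {1: 4, 2: 7, 3: 5, 4: 6}            # two start nodes (1, 2) and two exit nodes (3, 4)
edge_w = {(1, 3): 2, (2, 3): 1, (2, 4): 3}

def add_dummies(node_w, edge_w):
    starts = [n for n in node_w if not any(b == n for (_, b) in edge_w)]
    exits = [n for n in node_w if not any(a == n for (a, _) in edge_w)]
    if len(starts) > 1:
        node_w["dummy_start"] = 0                  # zero node weight
        for s in starts:
            edge_w[("dummy_start", s)] = 0         # zero-weight edges
    if len(exits) > 1:
        node_w["dummy_exit"] = 0
        for e in exits:
            edge_w[(e, "dummy_exit")] = 0
    return node_w, edge_w

node_w, edge_w = add_dummies(node_w, edge_w)
print(sorted(edge_w, key=str))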
2.1.3 The Best Imaginary Level Algorithm
The Best Imaginary Level (BIL) algorithm [22] assigns node priorities based on the best imaginary level of each node. At each scheduling step, a free node with the highest priority is selected and mapped onto a processor based on a criterion described below. Table 2.2 defines the terms used in BIL and Figure 2.2 lists the algorithm.
BIL(n_i, p_j) is the best imaginary level of node n_i on processor p_j. It is the length of the longest path in the DAG beginning with n_i, assuming n_i is mapped onto p_j, and is recursively defined as:

$$BIL(n_i, p_j) = w_{i,j} + \max_{n_k \in SUCC(n_i)} \Big[ \min\Big( BIL(n_k, p_j),\ \min_{l \neq j} BIL(n_k, p_l) + c_{i,k} \Big) \Big]$$

The BIL of a node is adjusted to its basic imaginary makespan (BIM) as follows:

$$BIM(n_i, p_j) = BIL(n_i, p_j) + T\_Available[p_j]$$
Table 2.2 Definition of terms used in BIL

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
w_{i,j}                Time required to execute n_i on p_j
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
BIL(n_i, p_j)          w_{i,j} + max_{n_k ∈ SUCC(n_i)} [ min( BIL(n_k, p_j), min_{l ≠ j} BIL(n_k, p_l) + c_{i,k} ) ]
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
BIM(n_i, p_j)          BIL(n_i, p_j) + T_Available[p_j]
k                      Number of free nodes at a scheduling step
BIM*(n_i, p_j)         BIM(n_i, p_j) + w_{i,j} × max(k/m − 1, 0)
If k is the number of free nodes (those nodes whose predecessors have completed execution) at a scheduling step, the priority of a free node is the k-th smallest BIM value. If the k-th smallest BIM value is undefined, the largest finite BIM value becomes its priority. If two or more nodes have the same priority, ties are broken randomly. At each scheduling step, the free node with the highest priority is selected for mapping. If k is greater than the number of processors, node execution times become more important than the communication overhead. On the contrary, if k is less than the number of available processors, node execution times become less important. The BIM value of the selected node is revised to incorporate this factor as follows:

$$BIM^*(n_i, p_j) = BIM(n_i, p_j) + w_{i,j} \times \max(k/m - 1,\ 0)$$

The processor which provides the highest revised BIM value for the node is selected. If more than one processor provides the same revised BIM value, the processor that maximizes the sum of the revised BIM values of all the other nodes is selected. The time complexity of the algorithm is O(n^2 + m log m).
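As an illustration of the formulas above, the sketch below evaluates BIL, BIM and the k-th smallest BIM priority on a three-node chain with two processors. The numbers, the naive recursive evaluation and the fixed processor-availability times are assumptions made for this example; they are not taken from [22] or from the dissertation's test cases.

# Toy BIL/BIM evaluation: a three-node chain 1 -> 2 -> 3 on two processors.
import math

succ = {1: [2], 2: [3], 3: []}
w = {(1, 0): 4, (1, 1): 8, (2, 0): 6, (2, 1): 3, (3, 0): 5, (3, 1): 5}   # w[(node, proc)]
c = {(1, 2): 2, (2, 3): 1}                                               # edge weights
procs = [0, 1]

def bil(n, p):
    # BIL(n,p) = w[n,p] + max over successors k of
    #            min( BIL(k,p), min over l != p of BIL(k,l) + c[n,k] )
    if not succ[n]:
        return w[(n, p)]
    best = 0
    for k in succ[n]:
        same = bil(k, p)
        other = min(bil(k, l) for l in procs if l != p) + c[(n, k)]
        best = max(best, min(same, other))
    return w[(n, p)] + best

t_available = {0: 0.0, 1: 2.0}        # illustrative processor ready times

def bim(n, p):
    return bil(n, p) + t_available[p]

free_nodes = [1]                      # only the start node is free initially
k = len(free_nodes)
for n in free_nodes:
    bims = sorted(bim(n, p) for p in procs)
    # Priority = k-th smallest BIM value; fall back to the largest finite value.
    priority = bims[k - 1] if k <= len(bims) else max(b for b in bims if math.isfinite(b))
    print("node", n, "BIM values", bims, "priority", priority)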
2.1.4 The Generalized Dynamic Level Algorithm
The Generalized Dynamic Level (GDL) algorithm [28] assigns node priorities based on their generalized dynamic levels. A number of factors are incorporated in the calculation of the generalized dynamic level; they are explained next. The definition of terms used in GDL is listed in Table 2.3 and the algorithm is listed in Figure 2.3.
BIL Algorithm
  ReadyTaskList ← Start node
  While ReadyTaskList NOT empty
    k ← |ReadyTaskList|  // Number of free nodes
    For all n_i in ReadyTaskList and p_j in P
      Compute BIM(n_i, p_j)
    End For
    Priority of n_i ← k-th smallest BIM value, or the largest finite BIM value if the k-th smallest value is undefined
    n_t ← node in ReadyTaskList with the highest priority
    For all p_j in P
      Compute BIM*(n_t, p_j)  // Revised BIM
    End For
    p_fav ← processor that provides the highest revised BIM value for n_t
    Map n_t on p_fav
    ReadyTaskList ← ReadyTaskList − n_t + free nodes (if any)
  End While
End BIL

Figure 2.2 The BIL algorithm
SL(n_i) is the static level of a node n_i and is the largest sum of the median execution times of all the nodes from node n_i to an exit node along any path in the DAG. DL(n_i, p_j) = SL(n_i) − EST(n_i, p_j) + Δ(n_i, p_j) is the Dynamic Level (DL) of node n_i on processor p_j, where Δ(n_i, p_j) = w*(n_i) − w_{i,j} (see Table 2.3). It indicates how well the node and the processor are matched for execution. Even though DL(n_i, p_j) indicates how well n_i and p_j are matched, it does not indicate how well the descendents of n_i are matched with p_j. D(n_i) is the descendent of node n_i to which n_i passes the maximum data. F(n_i, D(n_i), p_j) = d(n_i, D(n_i)) + min_{k ≠ j} E(D(n_i), p_k) is defined to indicate how quickly D(n_i) can be completed on a processor other than p_j, if node n_i is executed on processor p_j. The Descendent Consideration (DC) term is defined as: DC(n_i, p_j) = w*(D(n_i)) − min{ E(D(n_i), p_j), F(n_i, D(n_i), p_j) }.
Table 2.3 Definition of terms used in GDL

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
w_{i,j}                Execution time of node n_i on p_j
c_{i,j}                Data transfer time from node n_i to n_j
w*(n_i)                Median execution time of n_i over all the processors
SL(n_i)                Largest sum of the median execution times of all the nodes from node n_i to an exit node along any path in the DAG
Δ(n_i, p_j)            w*(n_i) − w_{i,j}
EST(n_i, p_j)          Earliest start time of n_i on p_j
DL(n_i, p_j)           SL(n_i) − EST(n_i, p_j) + Δ(n_i, p_j)
D(n_i)                 Descendent node of node n_i to which n_i passes the maximum data
d(n_i, D(n_i))         Time required to transfer data from n_i to D(n_i)
E(D(n_i), p_k)         Time required to execute D(n_i) on processor p_k
F(n_i, D(n_i), p_j)    d(n_i, D(n_i)) + min_{k ≠ j} E(D(n_i), p_k)
DC(n_i, p_j)           w*(D(n_i)) − min{ E(D(n_i), p_j), F(n_i, D(n_i), p_j) }
C(n_i)                 DL(n_i, p_pref) − max_{k ≠ pref} DL(n_i, p_k), where p_pref is the processor on which n_i obtains the maximum DL
GDL(n_i, p_j)          DL(n_i, p_j) + DC(n_i, p_j) + C(n_i)
GDL Algorithm
  For all n_i in N
    Compute SL(n_i)
  End For
  ReadyTaskList ← Start node
  While ReadyTaskList is NOT NULL do
    For all n_i in ReadyTaskList and p_j in P
      Compute DL(n_i, p_j)
      Compute DC(n_i, p_j)
      Compute C(n_i)
      GDL(n_i, p_j) ← DL(n_i, p_j) + DC(n_i, p_j) + C(n_i)
    End For
    Select the node-processor pair with the maximum GDL
    Update ReadyTaskList
  End While
End GDL

Figure 2.3 The GDL algorithm
The preferred processor of a node is the processor which maximizes its dynamic level (DL). The cost of not scheduling a node on its preferred processor is defined as C(n_i) = DL(n_i, p_pref) − max_{k ≠ pref} DL(n_i, p_k), where p_pref is the preferred processor. The combination of DL, the Descendent Consideration (DC) term and the cost incurred in not scheduling a node on its preferred processor is used to define the Generalized Dynamic Level (GDL) of a node as: GDL(n_i, p_j) = DL(n_i, p_j) + DC(n_i, p_j) + C(n_i). At each scheduling step, the algorithm selects, among the free nodes, the node and the processor with the maximum GDL. The time complexity is O(n^2 + m log m).
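The sketch below works one free node through the GDL terms defined above (Δ, DL, DC and C) on a toy two-processor instance. To keep it short, the earliest start time is approximated by the processor ready time and the node has a single descendent; these simplifications, and every number used, are assumptions for illustration only.

# Toy GDL computation for one free node "A" whose only descendent is "B".
import statistics

procs = [0, 1]
w = {("A", 0): 4.0, ("A", 1): 10.0, ("B", 0): 6.0, ("B", 1): 3.0}   # execution times
d_AB = 2.0                   # data transfer time from A to its heaviest descendent B
ready = {0: 1.0, 1: 0.0}     # simplified EST of A on each processor

w_star = {n: statistics.median(w[(n, p)] for p in procs) for n in ("A", "B")}
SL_A = w_star["A"] + w_star["B"]                 # static level along the only path A -> B

def DL(p):
    delta = w_star["A"] - w[("A", p)]            # Delta(A, p)
    return SL_A - ready[p] + delta

def DC(p):
    F = d_AB + min(w[("B", k)] for k in procs if k != p)   # finish B elsewhere
    return w_star["B"] - min(w[("B", p)], F)

pref = max(procs, key=DL)                        # preferred processor of A
C_A = DL(pref) - max(DL(k) for k in procs if k != pref)

for p in procs:
    print("processor", p, "GDL =", DL(p) + DC(p) + C_A)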
2.1.5 The Levelized Min Time Algorithm
In the Levelized Min Time (LMT) algorithm [16], the input DAG is divided into k levels using the following rules. The levels are numbered 0 to k−1. All the nodes in a level are independent of each other. Level 0 contains the start nodes and level k−1 contains the exit nodes. For any level j, where 0 < j < k−1, nodes in level j can have incident edges from any of the nodes in levels 0 through j−1. Additionally, there must be at least one node in level j with an edge incident from a node in level j−1. LMT maps the nodes one level at a time, starting from level 0. If the number of nodes at a given level is more than the number of processors in the target system, the smallest nodes (based on their average computation times) are merged until the number of nodes equals the number of processors. Nodes are then sorted in descending order of their average computation times. At each scheduling step, the largest node is mapped onto the processor that provides its minimum finish time. Table 2.4 defines the terms used in LMT and Figure 2.4 lists the algorithm.

Table 2.4 Definition of terms used in LMT

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
k                      Number of levels in the DAG
w_{i,j}                Time required to execute n_i on p_j
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j)          max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EFT(n_i, p_j)          EST(n_i, p_j) + w_{i,j}
LMT Algorithm
  Divide the input DAG into k levels (level 0 to level k−1)
  For levels 0 through k−1 do
    num ← number of nodes in the current level
    If num > m
      Merge the smallest nodes in the current level until num = m
    End If
    ReadyTaskList ← nodes in the current level sorted in descending order of average node weight
    While ReadyTaskList is NOT NULL do
      n_i ← first node in the ReadyTaskList
      For all p_j in P
        Compute EST(n_i, p_j)
        EFT(n_i, p_j) ← EST(n_i, p_j) + w_{i,j}
      End For
      Map node n_i on the processor p_j which provides its least EFT
      Update T_Available[p_j]
      Update ReadyTaskList
    End While
  End For
End LMT

Figure 2.4 The LMT algorithm
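The level-partitioning and merging steps of LMT are illustrated by the sketch below. It assigns each node the level 1 + max(level of its predecessors), which is consistent with the level rules stated above, and merges the two smallest nodes of an over-full level into a single unit; the DAG, the average weights and the processor count are invented for the example.

# Sketch of LMT's level partitioning and the merge step for over-full levels.
pred = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [2], 6: [3, 4, 5]}
avg_w = {1: 5.0, 2: 3.0, 3: 7.0, 4: 2.0, 5: 4.0, 6: 6.0}   # average computation times
m = 2                                                       # number of processors

level = {}
def find_level(n):
    if n not in level:
        level[n] = 0 if not pred[n] else 1 + max(find_level(p) for p in pred[n])
    return level[n]
for n in pred:
    find_level(n)

levels = {}
for n, lv in level.items():
    levels.setdefault(lv, []).append(n)

for lv in sorted(levels):
    nodes = sorted(levels[lv], key=lambda n: avg_w[n], reverse=True)
    # Merge the smallest nodes until the level has at most m schedulable units.
    while len(nodes) > m:
        a, b = nodes.pop(), nodes.pop()        # the two smallest units
        merged = (a, b)                        # treat the pair as one unit
        avg_w[merged] = avg_w[a] + avg_w[b]
        nodes.append(merged)
        nodes.sort(key=lambda n: avg_w[n], reverse=True)
    print("level", lv, "units:", nodes)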
2.1.6 The Heterogeneous Earliest Finish Time Algorithm
The Heterogeneous Earliest Finish Time (HEFT) algorithm [30] assigns node priorities based on the bottom level (blevel) of each node. The blevel of a node is the length of the longest path in the DAG from the node to the exit node. The length of a path in a DAG is the sum of the node and edge weights that constitute the path. At each scheduling step, the node with the highest priority is assigned to a processor that minimizes its finish time. The definition of terms used in HEFT is listed in Table 2.5 and the algorithm is listed in Figure 2.5. As a first step, HEFT traverses the DAG in a top-down fashion and computes the blevels of all the nodes. At each scheduling step, a node with the highest blevel is selected for mapping. Ties are broken randomly.
EST(n_i, p_j) is the earliest start time of a node n_i on a processor p_j and is defined as:

$$EST(n_i, p_j) = \max\Big( T\_Available[p_j],\ \max_{n_m \in PRED(n_i)} \big( EFT(n_m, p_k) + c_{m,i} \big) \Big)$$

It is the maximum of (a) the time at which processor p_j becomes free and (b) the time at which node n_i receives all the required data from its predecessor nodes after the completion of their execution. EFT(n_i, p_j) is the Earliest Finish Time of a node n_i on a processor p_j and is defined as:

$$EFT(n_i, p_j) = EST(n_i, p_j) + w_{i,j}$$

HEFT computes the EFTs of the selected node on all the processors and selects the processor that provides the minimum EFT. The time complexity is O(n^2 m).
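A compact sketch of the HEFT loop just described is given below: blevels are computed on mean execution times, nodes are taken in decreasing blevel order, and each node goes to the processor giving the smallest EFT. The toy DAG and all numbers are illustrative, and the sketch omits the insertion-based idle-slot search used by the published algorithm, so it should be read as a simplified approximation of [30] rather than a reference implementation.

# Simplified HEFT on a toy 4-node DAG and 2 processors.
w = {1: [7, 5], 2: [4, 6], 3: [6, 3], 4: [5, 5]}        # w[node][processor]
c = {(1, 2): 3, (1, 3): 2, (2, 4): 4, (3, 4): 1}        # edge weights
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
pred = {1: [], 2: [1], 3: [1], 4: [2, 3]}
procs = [0, 1]

mean_w = {n: sum(w[n]) / len(procs) for n in w}
def blevel(n):
    if not succ[n]:
        return mean_w[n]
    return mean_w[n] + max(c[(n, s)] + blevel(s) for s in succ[n])

avail = {p: 0.0 for p in procs}                          # T_Available
aft, where = {}, {}                                      # finish time and processor of each node
for n in sorted(w, key=blevel, reverse=True):            # decreasing blevel order
    best = None
    for p in procs:
        # EST = max(processor ready time, arrival of data from all predecessors).
        ready = max([avail[p]] + [aft[q] + (0 if where[q] == p else c[(q, n)])
                                  for q in pred[n]])
        eft = ready + w[n][p]
        if best is None or eft < best[0]:
            best = (eft, p)
    aft[n], where[n] = best
    avail[best[1]] = best[0]
    print("node", n, "-> processor", best[1], "finishes at", best[0])
print("makespan:", max(aft.values()))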
2.1.7 The Critical Path on Processor Algorithm
The critical path is the longest path in a DAG. The length of the critical path gives a lower bound on the overall execution time of the DAG [30]. Minimizing the length of the critical path therefore aids in minimizing the overall execution time of a DAG [30]. The Critical Path on Processor (CPOP) algorithm [30] is a variant of the HEFT algorithm by the same authors. CPOP adopts different mapping strategies for critical-path nodes and non-critical-path nodes.
Table 2.5 Definition of terms used in HEFT

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
w_{i,j}                Time required to execute n_i on p_j
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
priority(n_i)          blevel(n_i)
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j)          max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EFT(n_i, p_j)          EST(n_i, p_j) + w_{i,j}
HEFT Algorithm
  For all n_i in N
    Compute blevel(n_i)
  End For
  ReadyTaskList ← Start node
  While ReadyTaskList is NOT NULL do
    n_i ← node in the ReadyTaskList with the maximum blevel
    For all p_j in P
      Compute EST(n_i, p_j)
      EFT(n_i, p_j) ← EST(n_i, p_j) + w_{i,j}
    End For
    Map node n_i on the processor p_j which provides its least EFT
    Update T_Available[p_j] and ReadyTaskList
  End While
End HEFT

Figure 2.5 The HEFT algorithm
Table 2.6 Definition of terms used in CPOP

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
w_{i,j}                Time required to execute n_i on p_j
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
priority(n_i)          tlevel(n_i) + blevel(n_i)
CP processor           The processor p_j ∈ P which minimizes Σ_{n_i ∈ CP} w_{i,j}, where CP is the critical path
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j)          max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EFT(n_i, p_j)          EST(n_i, p_j) + w_{i,j}
CPOP traverses the DAG in a top-down fashion to compute the tlevels and blevels of all the nodes. It identifies the critical path(s) and marks the critical-path nodes. The priority of each node is the sum of its tlevel and blevel. At each scheduling step, a free task with the highest priority is selected for mapping. Ties (if any) are broken randomly. The CP processor is defined as the processor that minimizes the overall execution time of the critical path, assuming all the critical-path nodes are mapped onto it. If the selected node is a critical-path node, it is mapped onto the CP processor. Otherwise, it is mapped onto the processor that minimizes its EFT (as in the HEFT algorithm). The time complexity is O(n^2 m).
CPOP Algorithm
  For all n_i in N
    Compute tlevel(n_i) and blevel(n_i)
    Identify the critical path(s) and mark the critical-path nodes
    priority(n_i) ← tlevel(n_i) + blevel(n_i)
  End For
  ReadyTaskList ← Start node
  While ReadyTaskList is NOT NULL do
    n_i ← node in the ReadyTaskList with the maximum priority
    If n_i ∈ critical path
      Map n_i on the CP processor
    Else
      For all p_j in P
        Compute EST(n_i, p_j)
        EFT(n_i, p_j) ← EST(n_i, p_j) + w_{i,j}
      End For
      Map node n_i on the processor p_j which provides its least EFT
    End If
    Update T_Available[p_j]
    Update ReadyTaskList
  End While
End CPOP

Figure 2.6 The CPOP algorithm
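The two CPOP-specific steps, marking critical-path nodes by tlevel + blevel and choosing the CP processor that minimizes the summed execution time of those nodes, are sketched below on a toy four-node DAG. Mean execution times are used for the levels, and every number is invented for the illustration.

# Critical-path marking and CP-processor selection in the style of CPOP.
w = {1: [6, 4], 2: [3, 5], 3: [8, 6], 4: [4, 4]}        # w[node][processor]
c = {(1, 2): 2, (1, 3): 1, (2, 4): 3, (3, 4): 2}
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
pred = {1: [], 2: [1], 3: [1], 4: [2, 3]}
procs = [0, 1]
mean_w = {n: sum(w[n]) / len(procs) for n in w}

def blevel(n):
    return mean_w[n] + (max(c[(n, s)] + blevel(s) for s in succ[n]) if succ[n] else 0)

def tlevel(n):
    return max((tlevel(p) + mean_w[p] + c[(p, n)] for p in pred[n]), default=0)

cp_len = max(tlevel(n) + blevel(n) for n in w)           # critical-path length
cp_nodes = [n for n in w if abs(tlevel(n) + blevel(n) - cp_len) < 1e-9]

# CP processor: minimizes the total execution time of the critical-path nodes.
cp_proc = min(procs, key=lambda p: sum(w[n][p] for n in cp_nodes))
print("critical-path nodes:", cp_nodes, "CP processor:", cp_proc)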
2.1.8 The Fast Critical Path Algorithm
There are three steps involved in a typical static DAG scheduling algorithm:
computation of node priorities, node selection, and processor selection. These steps
contribute to the overall time complexity of the algorithm. The Fast Critical Path (FCP)
algorithm [24] tries to reduce the overall time complexity by reducing the complexity of
the node selection and the processor selection steps.
Table 2.7 Definition of terms used in FCP

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
e                      Number of edges in the DAG
w_{i,j}                Time required to execute n_i on p_j
priority(n_i)          blevel(n_i)
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j)          max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EFT(n_i, p_j)          EST(n_i, p_j) + w_{i,j}
Node Selection: FCP reduces the complexity of the node selection process by restricting the size of the ReadyTaskList to m (the number of processors in the target execution system). Additional free nodes (if any) are stored in a FIFO queue. Node priorities are based on their blevels. At each scheduling step, a node with the highest priority is selected for mapping. By restricting the size of the ReadyTaskList to m, the time complexity of the node selection process becomes O(n log m).

Processor Selection: The complexity of the processor selection step is reduced by restricting the choice to just two processors: the first processor to become free and the enabling processor (the processor which is the last to send a data item to the node). The authors [24] prove that the EFT of a node is always minimized by one of these two processors, which reduces the time complexity of the processor selection step to O(n log m + e). Of the two processors, the one which provides the least EFT for the selected node is chosen. The overall time complexity of FCP is O(n log m + e).
FCP Algorithm
  For all n_i in N
    Compute blevel(n_i)
    priority(n_i) ← blevel(n_i)
  End For
  ReadyTaskList ← Start node
  AdditionalTaskList ← NULL  // FIFO queue
  While ReadyTaskList is NOT NULL do
    n_i ← node in the ReadyTaskList with the maximum priority
    p_1 ← first processor in P to become free
    p_2 ← enabling processor of n_i
    Compute EST(n_i, p_1)
    EFT(n_i, p_1) ← EST(n_i, p_1) + w_{i,1}
    Compute EST(n_i, p_2)
    EFT(n_i, p_2) ← EST(n_i, p_2) + w_{i,2}
    Map node n_i on the processor which provides its least EFT
    Update T_Available of that processor
    Update ReadyTaskList
    Update AdditionalTaskList (if applicable)
  End While
End FCP

Figure 2.7 The FCP algorithm
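The restricted processor choice of FCP is illustrated below for a single selected node: only the first processor to become free and the node's enabling processor are examined, and the one with the smaller EFT is chosen. The enabling processor is approximated here as the processor of the predecessor whose data arrives last; that approximation and all numbers are assumptions for the example.

# FCP-style restricted processor choice for one selected node "t".
avail = {0: 6.0, 1: 2.0, 2: 9.0}             # processor ready times
w_t = {0: 5.0, 1: 7.0, 2: 4.0}               # execution times of "t" per processor
# Each predecessor: (finish time, processor it ran on, edge weight to "t").
preds = [(4.0, 0, 3.0), (5.0, 2, 1.0)]

def arrival(p):
    # Data from a predecessor on another processor arrives after its edge delay.
    return max((ft + (0.0 if pp == p else cw) for ft, pp, cw in preds), default=0.0)

first_free = min(avail, key=avail.get)                    # p1 in the text
enabling = max(preds, key=lambda x: x[0] + x[2])[1]       # p2 in the text (approximation)
candidates = {first_free, enabling}

eft = {p: max(avail[p], arrival(p)) + w_t[p] for p in candidates}
chosen = min(eft, key=eft.get)
print("candidates:", sorted(candidates), "EFTs:", eft, "chosen:", chosen)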
2.1.9 The Fast Load Balancing Algorithm
The Fast Load Balancing (FLB) algorithm [24] is a variant of the FCP algorithm. The node selection complexity is reduced by limiting the number of nodes in the ReadyTaskList to m (the number of processors). Additional free nodes, if any, are added to a FIFO list. As discussed in the previous section, the earliest start time of a node is obtained on either the first processor to become free or the node's enabling processor. For each node in the ReadyTaskList, the earliest start times of the node on the first processor to become free and on the node's enabling processor are calculated. Among the free nodes, the node with the minimum earliest start time is selected and mapped onto the corresponding processor. The overall time complexity of FLB is O(n log m + e).
Table 2.8 Definition of terms used in FLB

Term                   Definition
N = {n_1, n_2, ...}    Set of nodes in the DAG, |N| = n
P = {p_1, p_2, ...}    Set of processors, |P| = m
w_{i,j}                Time required to execute n_i on p_j
priority(n_i)          blevel(n_i)
c_{i,j}                Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]       Time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j)          max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EFT(n_i, p_j)          EST(n_i, p_j) + w_{i,j}
35
FLB Algorithm
For all n_i in N
    Compute tlevel(n_i)
    priority(n_i) ← tlevel(n_i)
End For
ReadyTaskList ← Start Node
AdditionalTaskList ← NULL    // FIFO queue
While ReadyTaskList is NOT NULL
    For all n_i in ReadyTaskList
        p_1 ← first processor in P to become free
        p_2 ← enabling processor of n_i
        Compute EST(n_i, p_1)
        Compute EST(n_i, p_2)
    End For
    Select the n_i with the least EST and map it onto the corresponding processor
    Update T_Available[p_j]
    Update ReadyTaskList
    Update AdditionalTaskList (if applicable)
End While
End FLB
Figure 2.8 The FLB algorithm
2.1.10 The Hybrid Remapper Algorithm
Static scheduling algorithms use estimates of node execution times in the scheduling process. Estimates can be obtained by techniques such as code profiling and analytical benchmarking [21]. However, actual node execution times may sometimes vary significantly from the estimated execution times and may result in a bad schedule. To mitigate this problem, the Hybrid Remapper [21] algorithm uses a combination of static mapping and the actual runtime values of node execution times. It tries to fine-tune the schedule obtained by a static scheduling algorithm by making use of runtime values as and when they become available. The inputs to the algorithm are the DAG and the schedule obtained using a list-based static scheduling heuristic. The input DAG is divided into k levels marked 0 through k-1, such that nodes in a level do not have precedence constraints between one another. The start nodes are in level 0 and the exit nodes in level k-1. Node priorities are based on their blevels. Nodes in level 0 are mapped according to the static schedule. For levels 1 through k-1, nodes at a level are considered for remapping as soon as the first node of the previous level starts execution. The node with the highest priority is remapped onto a processor that provides its least partial completion time (pct). In the calculation of partial completion times (see Table 2.9), available runtime values (if any) are recursively used. If runtime values are not available, statically obtained values are used. The algorithm is listed in Figure 2.9. The time complexity is O(n²).
Table 2.9 Definition of terms used in Hybrid Remapper
Term   Definition
N = {n1, n2, n3, n4, n5, n6, ...}   // Set of nodes in the DAG, n = |N|
P = {p1, p2, p3, p4, p5, p6, ...}   // Set of processors, m = |P|
e_{i,j}   Time required to execute n_i on p_j in real time
c_{i,j}   Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors, in real time
priority(n_i) = blevel(n_i)
ips(n_i)   Immediate predecessor set of node n_i
A[p_j]   Time at which processor p_j completes the execution of all the nodes previously assigned to it, in real time
dr(n_i) = max_{n_j in ips(n_i)} ( c_{j,i} + pct(n_j, p_k) ), where p_k is the processor to which n_j is assigned
pct(n_i, p_j) = e_{i,j} + max( A[p_j], dr(n_i) )
Hybrid Remapper Algorithm
Divide the input DAG into levels such that nodes in a level are independent of each other
k ← number of levels
Mark the levels starting with 0 and ending with k-1
// Start nodes are in level 0 and exit nodes are in level k-1
For all n_i in N
    priority(n_i) ← blevel(n_i)
End For
For all n_i in level 0
    Map n_i using the static schedule
End For
For levels 1 through k-1
    For all nodes in the current level
        n_i ← node with the highest priority
        For all p_j in P
            dr(n_i) = max_{n_j in ips(n_i)} ( c_{j,i} + pct(n_j, p_k) )
            pct(n_i, p_j) = e_{i,j} + max( A[p_j], dr(n_i) )
        End For
        Map n_i onto the p_j that provides its least pct
    End For
    Update A[p_j]
End For
End Hybrid Remapper
Figure 2.9 The Hybrid Remapper algorithm
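To make the role of runtime values concrete, the recursive helper below is a small illustrative sketch of the pct computation defined in Table 2.9; the dictionaries standing in for A, e, c, ips and the current assignment are hypothetical inputs, not part of the Hybrid Remapper specification.

    def pct(node, proc, A, e, c, ips, assigned):
        # Data-ready time dr: latest arrival over all immediate predecessors,
        # each charged its own partial completion time plus the transfer cost.
        dr = max((c[(p, node)] + pct(p, assigned[p], A, e, c, ips, assigned)
                  for p in ips[node]), default=0.0)
        # Partial completion time on proc: run time there, once the processor
        # is free and the data has arrived.
        return e[(node, proc)] + max(A[proc], dr)

    # Tiny example: node "c" has predecessors "a" and "b", already mapped.
    A = {"p1": 5.0, "p2": 0.0}
    e = {("a", "p1"): 4.0, ("b", "p2"): 6.0, ("c", "p1"): 3.0}
    c = {("a", "c"): 2.0, ("b", "c"): 1.0}
    ips = {"a": [], "b": [], "c": ["a", "b"]}
    assigned = {"a": "p1", "b": "p2"}
    print(pct("c", "p1", A, e, c, ips, assigned))   # -> 14.0

In the actual algorithm, whichever of these quantities have already been observed at run time replace the static estimates, which is what "recursively used" means in the description above.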
2.1.11 Performance Comparison
The performance of DAG scheduling algorithms depends on a number of factors such as the Communication to Computation Ratio (CCR) of the input DAG (the ratio of the sum of the edge weights to the sum of the node weights), the number of nodes, the processor speed variance, etc. Since the running time of an algorithm becomes significant for large DAGs, it is desirable to have an algorithm with a good performance-complexity trade-off. The most important performance metric used to compare DAG scheduling algorithms is the Schedule Length Ratio (SLR): the ratio of the overall execution time of the input DAG to the sum of the weights of the critical-path nodes on the fastest processor. Table 2.10 summarizes the relative performance of the algorithms discussed in the previous sections.
Table 2.10 Comparison of complexity and schedule length ratio of different algorithms
Algorithm A   Complexity         Schedule Length Ratio, L(A)
BIL           O(n² + p log p)    L(BIL) < L(GDL) by 20%
STDS          O(n²)              L(STDS) < L(BIL) for CCRs within 0.2 and 1
FLB           O(n log p + e)     L(HEFT) < L(FLB) by 63% when processor speed variance is high; otherwise FLB performs equally well
FCP           O(n log p + e)     L(HEFT) < L(FCP) by 32% with high processor speed variance; otherwise identical
HEFT          O(n² m)            HEFT better than GDL and LMT by 8% and 52% respectively
2.2 Scheduling a Set of Independent Tasks onto a Network of Heterogeneous
Processors to Minimize the Overall Execution Time
2.2.1 Problem Statement
Independent tasks are tasks without communication or precedence constraints. A metatask is a finite set of independent tasks. The overall execution time (makespan) of a metatask is the time required to complete the execution of all the tasks in it. The target execution system consists of a finite number of heterogeneous processors connected by a high-speed network. Tasks in a metatask can have different execution times on different processors. Communication among processors is assumed to be contentionless. Computation and communication are assumed to take place simultaneously. Task execution is assumed to be non-preemptive: tasks once scheduled on a processor cannot be removed (preempted) and scheduled on other processors. The objective of the independent task scheduling problem is stated formally as follows: schedule the independent tasks of a metatask onto a network of heterogeneous processors such that the overall execution time of the metatask is minimized. In the following sections, existing research work in this area is surveyed.
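Stated compactly in LaTeX notation, with w_{i,j} denoting the execution time of task t_i on processor p_j (as in the tables that follow) and A an assignment of tasks to processors, the objective is the standard makespan minimization; this is a restatement of the objective above rather than a new model:

    \min_{A:\,T \to P} \; \max_{p_j \in P} \; \sum_{t_i:\,A(t_i)=p_j} w_{i,j}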
2.2.2 The MinMin and MaxMin Algorithms
In the MinMin algorithm [15], the earliest finish time (EFT) of all the tasks over all the processors is calculated. The task with the least EFT is selected and scheduled onto the processor on which the minimum EFT was obtained. The process is repeated until all the tasks in the metatask are scheduled. The time complexity is O(s²m), where s is the number of tasks in the metatask and m is the number of processors in the target system.
Table 2.11 Definition of terms used in MinMin
Term   Definition
T = {t1, t2, t3, t4, t5, t6, ...}   // Metatask, s = |T|
P = {p1, p2, p3, p4, p5, p6, ...}   // Set of processors, m = |P|
w_{i,j}   Time required to execute t_i on p_j
EST(t_i, p_j)   Time at which all the tasks previously assigned to p_j complete execution
EFT(t_i, p_j) = EST(t_i, p_j) + w_{i,j}
MinMin Algorithm
While T is NOT NULL do
    For all t_i in T and p_j in P
        Compute EFT(t_i, p_j)
    End For
    t_min ← task with the least EFT
    p_min ← processor providing the least EFT
    Map t_min on p_min
    T ← T - t_min
End While
End MinMin
Figure 2.11 MinMin Algorithm
The MaxMin algorithm is similar to MinMin; however, instead of selecting the task with the least EFT, the task with the largest EFT is selected. MinMin is detailed in Figure 2.11 and the definition of the terms used in MinMin is listed in Table 2.11.
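For concreteness, a minimal sketch of the MinMin loop is given below; the task and processor structures are plain Python dictionaries and the exec_time table is a hypothetical example input, not data from this dissertation.

    def min_min(exec_time):
        # exec_time[t][p] = time to run task t on processor p
        tasks = set(exec_time)
        procs = {p for row in exec_time.values() for p in row}
        ready = {p: 0.0 for p in procs}          # processor available times
        schedule = []
        while tasks:
            # For every unscheduled task, find its earliest finish time (EFT),
            # then pick the task whose EFT is globally smallest (the "min-min").
            eft, t, p = min(((ready[q] + exec_time[u][q], u, q)
                             for u in tasks for q in procs),
                            key=lambda x: x[0])
            ready[p] = eft                        # processor busy until eft
            schedule.append((t, p, eft))
            tasks.remove(t)
        makespan = max(ready.values())
        return schedule, makespan

    # Example: three tasks, two processors
    times = {"t1": {"p1": 4, "p2": 6}, "t2": {"p1": 3, "p2": 2}, "t3": {"p1": 5, "p2": 5}}
    print(min_min(times))

Each pass scans all remaining tasks against all processors, which is where the O(s²m) bound quoted above comes from.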
2.2.3 The Sufferage Algorithm
Table 2.12 Definition of terms used in Sufferage
Term   Definition
T = {t1, t2, t3, t4, t5, t6, ...}   // Metatask, s = |T|
P = {p1, p2, p3, p4, p5, p6, ...}   // Set of processors, m = |P|
w_{i,j}   Time required to execute t_i on p_j
EST(t_i, p_j)   Time at which all the tasks previously assigned to p_j complete execution
EFT(t_i, p_j) = EST(t_i, p_j) + w_{i,j}
FT_1   Earliest finish time of t_i on any processor p_j
FT_2   Second earliest finish time of t_i on any processor p_j
Sufferage(t_i) = FT_2 - FT_1
The Sufferage algorithm [15] is based on the idea that a better mapping of tasks can be obtained by assigning a processor to the task that would "suffer" the most if it were not assigned to that processor.
Sufferage Algorithm
T_1 ← temporary set of tasks
T_1 ← NULL
While T is NOT NULL do
    For all t_i in T and p_j in P
        Compute EFT(t_i, p_j)
        p_temp ← processor on which t_i has the least EFT
        If a task is already assigned to p_temp then
            t_prev ← task already assigned to p_temp
            If Sufferage(t_i) > Sufferage(t_prev) then
                Remove t_prev from p_temp
                Tentatively assign t_i to p_temp
                T ← T - t_i
                T_1 ← T_1 + t_prev
            Else
                T_1 ← T_1 + t_i
            End If
        Else
            Tentatively assign t_i to p_temp
            T ← T - t_i
        End If
    End For
    T ← T + T_1
    T_1 ← NULL
End While
End Sufferage
Figure 2.12 The Sufferage Algorithm
The sufferage of a task t_i is defined as the difference between the second earliest finish time of t_i and its earliest finish time. Tasks are considered for mapping in an arbitrary order. At each scheduling step, the earliest finish times of a task over all the processors are computed. The processor which provides the minimum earliest finish time is determined. If a task is already scheduled on it, the sufferages of the task under consideration and the previously scheduled task are compared. If the sufferage of the task under consideration is greater, the previously assigned task is removed and the given task is tentatively assigned to the processor. The removed task is reinserted into the metatask. However, if the sufferage of the task already assigned is greater, the given task is reinserted into the metatask and is considered for mapping in the next iteration. At the end of the iteration, the tasks which are tentatively mapped onto the processors are permanently mapped. The steps are repeated until all the tasks in the metatask are mapped. The time complexity is O(s²m). Table 2.12 provides the definition of terms used in Sufferage and Figure 2.12 lists the algorithm.
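The core quantity driving the algorithm is easy to compute in isolation; the helper below is a minimal sketch (with a hypothetical exec_time table, as in the MinMin example above) of how a task's sufferage and favorite processor might be obtained at one scheduling step.

    def sufferage(task, exec_time, ready):
        # Completion time of the task on every processor at this step.
        cts = sorted((ready[p] + exec_time[task][p], p) for p in ready)
        best_ct, best_proc = cts[0]
        second_ct = cts[1][0] if len(cts) > 1 else best_ct
        # Sufferage = how much the task would "suffer" if it lost its
        # favorite processor: second-best minus best completion time.
        return second_ct - best_ct, best_proc

    ready = {"p1": 0.0, "p2": 0.0}
    times = {"t1": {"p1": 4, "p2": 6}}
    print(sufferage("t1", times, ready))   # -> (2.0, 'p1')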
CHAPTER 3
THE HETEROGENEOUS CRITICAL NODE FIRST (HCNF) ALGORITHM
This chapter presents a new task-duplication based static scheduling heuristic called the Heterogeneous Critical Node First (HCNF) algorithm for the DAG scheduling problem discussed in Section 2.1. The chapter is organized as follows. Section 3.1 discusses the key concepts related to the heterogeneous DAG scheduling problem that motivated the development of HCNF. Section 3.2 discusses the algorithm in detail. Section 3.3 provides a running trace of HCNF. Section 3.4 provides the simulation study, and Section 3.5 provides concluding remarks.
3.1 Motivation
The length of the critical path in a DAG provides a lower bound on its overall execution time [30]. Thus, minimizing the execution time of the critical-path nodes helps minimize the overall execution time of a DAG. One way to achieve this is to assign top priority to critical-path nodes at each scheduling step.
A DAG may have one or more free nodes which are ready to be mapped onto the processors at each scheduling step. In heterogeneous computing environments, local optimization can be obtained at each scheduling step by selecting the largest task among the free nodes and mapping it onto the processor that minimizes its finish time.
Nodes have to wait until they receive all the required data from their predecessors before they can start execution. The predecessor node which is the last to send data to a given node is called the favorite predecessor. This process can potentially be expedited by duplicating the execution of favorite predecessors in idle processor slots. Duplicating favorite predecessors can suppress communication times and can lead to earlier start times for the nodes.
We propose a static scheduling algorithm called the Heterogeneous Critical Node
First (HCNF) that incorporates the strategies discussed above in the scheduling process.
At each scheduling step, among the free nodes, HCNF assigns top priority to a critical
path node and schedules it onto a processor that minimizes its finish time. In the absence
of a critical path node, HCNF picks the largest node and assigns it onto a processor that
minimizes its finish time. HCNF also explores the possibility of duplicating favorite
predecessors in idle processor times to obtain earlier start times. The algorithm is
explained next.
3.2 The HCNF Algorithm
HCNF begins by identifying the critical path(s) of the input DAG. Nodes belonging to the critical path(s) are marked as CP nodes. The algorithm starts the mapping process by mapping the start node onto the processor that provides its fastest execution time. If the fastest execution time is obtained on more than one processor, the processor with the least average execution time over all nodes is selected. (The average execution time over all nodes of a processor is the sum of the execution times of all the nodes in the DAG on that processor divided by the number of nodes.) Among the immediate successors of the start node, the CP node is inserted at the beginning of the ReadyTaskList. The remaining nodes are added to the ReadyTaskList in decreasing order of their node weights. At each scheduling step, the first node of the ReadyTaskList is selected for mapping. Table 3.1 defines the terms used in HCNF and Figure 3.1 lists the algorithm.
Table 3.1 HCNF - definition of terms
Term   Definition
N = {n1, n2, n3, n4, n5, n6, ...}   // Set of nodes in the DAG, n = |N|
P = {p1, p2, p3, p4, p5, p6, ...}   // Set of processors, m = |P|
w_{i,j}   Time required to execute n_i on p_j
c_{i,j}   Time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]   Time at which processor p_j completes the execution of all the nodes previously assigned to it
pred(n_i)   Set of immediate predecessors of task n_i
n_en   Favorite predecessor (the node which is the last to send data to a given node)
EST_1(n_i, p_j) = max( T_Available[p_j], max_{n_m in pred(n_i)} ( EFT(n_m, p_k) + c_{m,i} ) )
EST_2(n_i, p_j) = max( max(T_Available[p_j], EST(n_en, p_j)) + w_{en,j}, max_{n_m in pred(n_i), n_m != n_en} ( EFT(n_m, p_k) + c_{m,i} ) )
EST(n_i, p_j) = min( EST_1(n_i, p_j), EST_2(n_i, p_j) )
EFT(n_i, p_j) = w_{i,j} + EST(n_i, p_j)
EST_2(n_i, p_j) is the earliest start time of node n_i on processor p_j assuming that n_en (the favorite predecessor of n_i) would be duplicated on p_j. EST_1(n_i, p_j) is the earliest start time of node n_i on processor p_j without duplicating the favorite predecessor. EFT(n_i, p_j) is the earliest finish time of n_i on p_j. At each scheduling step, for the selected node n_i, EFT(n_i, p_j) over all the processors is computed. fproc(n_i) is the processor on which the least EFT is obtained. If EST_2(n_i, p_j) is used in the computation of the least EFT, n_en is duplicated on fproc(n_i); otherwise, n_en is not duplicated. n_i is mapped onto fproc(n_i). n_i is then removed from the ReadyTaskList and its successors are added to it. The nodes in the ReadyTaskList are realigned as follows. The CP node is inserted at the first position. In the presence of multiple CP nodes, the CP nodes are sorted in descending order of their node weights and are inserted at the beginning of the ReadyTaskList. All the remaining (non-CP) nodes are sorted in decreasing order of their node weights. The first node in the ReadyTaskList is selected for mapping and is scheduled onto a processor that provides its least EFT (as discussed earlier). The process is repeated until all the nodes in the DAG are scheduled.
HCNF takes O(n²) to find the critical path, O(np) to calculate the EFTs, and O(n log n) to sort the tasks in descending order using merge sort. Ignoring the lower-order terms, the overall time complexity is O(n²).
3.3 Running trace of HCNF
The working of HCNF is illustrated with a sample DAG G1 shown in Figure 3.2. The target execution system consists of three processors: p1, p2 and p3. Node execution times are listed in Table 3.2. Node weights in Figure 3.2 represent average execution times. Runtime values for each step of HCNF are shown in Table 3.3. The Gantt chart for the final schedule is shown in Figure 3.4, and the Gantt charts for the individual steps are shown in Figures 3.5 through 3.14. HCNF begins by calculating the critical path of G1 (1 → 2 → 9 → 10) and marking the critical-path nodes.
Step 1 (Figure 3.5): The start node (node 1) is mapped onto processor 3, which provides its least finish time of 9 seconds.
Step 2 (Figure 3.6): Among the successors of node 1, the CP node (node 2) is inserted at the beginning of the ReadyTaskList and the remaining nodes are inserted in descending order of their weights. Node 2 is selected for mapping and its EFTs over all the processors are computed (see Figure 3.3). Both p1 and p3 provide the least finish time (27 seconds). However, since the finish time on p1 is obtained by duplicating node 1, p3 is chosen. Node 2 is removed and the ReadyTaskList is updated to {3,4,6,5}.
Step 3 (Figure 3.7): Node 3 is selected for mapping. The minimum EFT is obtained on p1 by duplicating node 1 on p1. The successor of node 3 (node 7) becomes free as a result of this mapping and the ReadyTaskList is updated to {4,6,5,7}.
Step 4 (Figure 3.8): Node 4 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 1 on p2. The ReadyTaskList is updated to {6,5,7}.
Step 5 (Figure 3.9): Node 6 is selected for mapping. The minimum EFT is obtained on p3. Node 6 is scheduled on p3 and one of its successor nodes (node 8) becomes free as a result of this mapping. The ReadyTaskList is updated to {5,7,8}.
Algorithm HCNF
// Identify the CP nodes of the input DAG
// Map the StartNode onto a processor that provides its fastest execution time
// Among the successors of the StartNode, add the CP node to the ReadyTaskList
// Add the remaining successors of the StartNode to the ReadyTaskList in decreasing order of task size
While ReadyTaskList is NOT NULL do
    n_t ← first node in the ReadyTaskList
    For all p_j in P do
        EST_1(n_t, p_j) = max{ T_available[p_j], max_{k != j} ( EFT(n_en, p_k) + c_{en,t} ) }
        If EST(n_en, p_j) >= T_available[p_j] then
            EST_2(n_t, p_j) = EST(n_en, p_j) + w_{en,j}
        Else
            EST_2(n_t, p_j) = T_available[p_j] + w_{en,j}
        End If
        If EST_1(n_t, p_j) <= EST_2(n_t, p_j) then
            EST(n_t, p_j) = EST_1(n_t, p_j)
        Else
            EST(n_t, p_j) = EST_2(n_t, p_j)
            Tentatively duplicate n_en on processor p_j
        End If
        EFT(n_t, p_j) = EST(n_t, p_j) + w_{t,j}
    End For
    fproc(n_t) ← processor p_j that provides the minimum EFT for n_t
    Map n_t on fproc(n_t) and permanently duplicate any tentatively duplicated n_en node
    Add the successors of n_t to the ReadyTaskList
    ReadyTaskList ← ReadyTaskList - n_t
    Realign the ReadyTaskList such that the CP node is in the first position and the remaining nodes are sorted in decreasing order of their weights
End While
End HCNF
Figure 3.1 The HCNF algorithm
Figure 3.2 Sample DAG (G1)
Table 3.2 Task execution times of G1 on three different processors
n_i   p_1   p_2   p_3   Average Execution Time
1     14    16    9     13
2     13    19    18    16.67
3     11    13    19    14.33
4     13    8     17    14
5     12    13    10    11.66
6     13    16    9     12.67
7     7     15    11    11
8     5     11    14    10
9     18    12    20    16.67
10    21    7     16    14.67
Table 3.3 Runtime values for G1. For each iteration the table lists the ReadyTaskList, the selected node n_i, EST_1(n_i, p_1..p_3), EST_2(n_i, p_1..p_3) with the duplicated favorite predecessor n_en in brackets, EFT(n_i, p_1..p_3), and the selected EFT with fproc(n_i).
Iteration 1: ReadyTaskList = {1}, n_i = 1; EST_1 = 0, 0, 0; EST_2 [n_en] = 0 [n/a], 0 [n/a], 0 [n/a]; EFT = 14, 16, 9; selected EFT = 9 on fproc = p_3
Iteration 2: ReadyTaskList = {2,3,4,6,5}, n_i = 2; EST_1 = 27, 27; EST_2 [n_en] = 14 [1], 16 [1], n/a; EFT = 27, 35, 27; selected EFT = 27 on fproc = p_3
Iteration 3: ReadyTaskList = {3,4,6,5}, n_i = 3; EST_1 = 27, 27; EST_2 [n_en] = 14 [1], 16 [1], n/a; EFT = 25, 29, 46; selected EFT = 25 on fproc = p_1
Iteration 4: ReadyTaskList = {4,6,5,7}, n_i = 4; EST_1 = 25, 18, 27; EST_2 [n_en] = n/a, 16 [1], n/a; EFT = 38, 24, 44; selected EFT = 24 on fproc = p_2
Iteration 5: ReadyTaskList = {6,5,7}, n_i = 6; EST_1 = 25, 24, 27; EST_2 [n_en] = n/a, n/a, n/a; EFT = 38, 40, 36; selected EFT = 36 on fproc = p_3
Iteration 6: ReadyTaskList = {5,7,8}, n_i = 5; EST_1 = 25, 24; EST_2 [n_en] = n/a, n/a, n/a; EFT = 37, 37, 46; selected EFT = 37 on fproc = p_1
Iteration 7: ReadyTaskList = {9,7,8}, n_i = 9; EST_1 = 47, 50, 50; EST_2 [n_en] = 50 [4], 43 [5], 47 [5]; EFT = 65, 55, 67; selected EFT = 55 on fproc = p_2
Iteration 8: ReadyTaskList = {7,8}, n_i = 7; EST_1 = 37, 55, 48; EST_2 [n_en] = n/a, 68 [3], 55 [3]; EFT = 44, 70, 49; selected EFT = 44 on fproc = p_1
Iteration 9: ReadyTaskList = {8}, n_i = 8; EST_1 = 51, 55, 51; EST_2 [n_en] = 57 [6], 71 [6], 53 [4]; EFT = 56, 66, 65; selected EFT = 56 on fproc = p_1
Iteration 10: ReadyTaskList = {10}, n_i = 10; EST_1 = 68, 67, 68; EST_2 [n_en] = 77 [9], 66 [8], 88 [9]; EFT = 89, 74, 86; selected EFT = 74 on fproc = p_2
Figure 3.4 Gantt chart for G1
Step 6 (Figure 3.10): Node 5 is selected for mapping. The minimum EFT is obtained on p1. Node 5 is scheduled on p1 and node 9 becomes free as a result of this mapping. The ReadyTaskList is updated to {9,7,8} (since node 9 is a CP node, it is inserted at the beginning of the list).
Step 7 (Figure 3.11): Node 9 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 5. The ReadyTaskList is updated to {7,8}.
Step 8 (Figure 3.12): Node 7 is selected for mapping. The minimum EFT is obtained on p1. The ReadyTaskList is updated to {8}.
Step 9 (Figure 3.13): Node 8 is selected for mapping. The minimum EFT is obtained on p3. Node 10 becomes free as a result of this mapping and the ReadyTaskList is updated to {10}.
Step 10 (Figure 3.14): Node 10 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 8.
Figure 3.5 HCNF running trace, step 1: Node 1 is scheduled on processor 3
Figure 3.6 HCNF running trace, step 2: Node 2 is scheduled on processor 3
Figure 3.7 HCNF running trace, step 3: Node 3 is scheduled on processor 1 by duplicating node 1
Figure 3.8 HCNF running trace, step 4: Node 4 is scheduled on processor 2 by duplicating node 1
Figure 3.9 HCNF running trace, step 5: Node 6 is scheduled on processor 3
Figure 3.10 HCNF running trace, step 6: Node 5 is scheduled on processor 1
Figure 3.11 HCNF running trace, step 7: Node 9 is scheduled on processor 2 by duplicating node 5
Figure 3.12 HCNF running trace, step 8: Node 7 is scheduled on processor 1
Figure 3.13 HCNF running trace, step 9: Node 8 is scheduled on processor 3
Figure 3.14 HCNF running trace, step 10: Node 10 is scheduled on processor 2 by duplicating node 8
3.4 Simulation Study
The simulation study consists of two parts. In the first part, the performance of HCNF is compared against that of the Heterogeneous Earliest Finish Time (HEFT) [30] algorithm. The experimental test suite [18] includes randomly generated graphs, Gaussian elimination graphs, trace graphs, benchmark graphs and application graphs. In the second part, a parametric random graph generator is developed to generate a diverse range of graphs with specified input parameters. The performance of HCNF is compared against that of the HEFT and Scalable Task Duplication based Scheduling (STDS) [25] algorithms.
3.4.1 Performance Parameters
The three commonly used performance parameters to gauge the performance of
DAG scheduling algorithms are:
Schedule Length Ratio (SLR): the ratio of the overall execution time of a DAG to the sum of the weights of its critical-path nodes on the fastest processor.
Speedup: the ratio of the sequential execution time of the DAG on the fastest processor to the parallel execution time.
Efficiency: the ratio of the speedup to the number of processors in the system.
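In symbols, with makespan denoting the parallel completion time of the schedule, CP the set of critical-path nodes, w_{i,fast} the weight of node n_i on the fastest processor, and m the number of processors, the three metrics can be written as:

    \mathrm{SLR} = \frac{\text{makespan}}{\sum_{n_i \in CP} w_{i,\mathrm{fast}}}, \qquad
    \mathrm{Speedup} = \frac{\sum_{n_i \in N} w_{i,\mathrm{fast}}}{\text{makespan}}, \qquad
    \mathrm{Efficiency} = \frac{\mathrm{Speedup}}{m}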
3.4.2 Randomly Generated Graphs
The performance of HCNF and HEFT was compared using randomly generated
graphs of different sizes and CCRs. Each node in the random graph was allowed to have
up to five children. Node and edge weights were generated randomly, and the edge weights were then iteratively adjusted to obtain a given CCR.
The SLR and speedup of HCNF and HEFT were compared using graphs of different sizes. For each graph size shown in Figures 3.15 and 3.16, readings were averaged over 10 random graphs of the same size with CCRs ranging from 0.5 to 1.5 and out_degree = {1, 2, 5, 100}. The average SLR of HCNF was better than that of HEFT by 12.3% and the average speedup was better by 7.9%.
Figure 3.15 Average SLR vs. number of nodes
Figure 3.16 Average speedup vs. number of nodes
3.4.3 Gaussian Elimination Graphs
The SLR and efficiency of HEFT and HCNF were compared using DAGs representing the Gaussian elimination algorithm. Figure 3.17 gives the SLR for matrix sizes ranging from 5 to 15. HCNF outperformed HEFT by an average of 25.7%. Figure 3.18 gives the efficiency for different numbers of processors, with the matrix size fixed at 50. HCNF outperformed HEFT by an average of 22.6%. The efficiency of HCNF increased with the number of processors because of the increased speedup facilitated by enhanced task duplication (in the presence of a larger number of processors).
Figure 3.17 Average SLR vs. matrix size
3.4.4 Benchmark Graphs
DAG scheduling algorithms are commonly compared using randomly generated graphs. However, to facilitate a fair and unbiased comparison of algorithms from different authors, some researchers [21] have proposed using benchmark graphs. In the following sections we compare the performance of HCNF using the "benchmark graph test suite" [21]. The benchmark test suite consists of: trace graphs, graphs with optimal solutions generated by the branch-and-bound technique, graphs with predetermined optimal solutions, and application graphs.
3.4.4.1 Trace Graphs
These graphs are obtained from the referenced articles listed in Table 3.4. The SLR and speedup of HEFT and HCNF were compared using these graphs. Figures 3.19 and 3.20 show the results. HCNF outperformed HEFT in SLR and speedup by an average of 29.5% and 38.4% respectively.
Figure 3.18 Efficiency vs. number of processors
3.4.4.2 Random Graphs with Optimal Solutions (RGBOS)
DAGs in this set are small, with the largest graph containing 32 nodes. Their optimal solutions can be obtained using the branch-and-bound technique. The set consists of three subsets of graphs with different CCRs (0.1, 1.0, 10.0), and the number of nodes varies from 10 to 32 in increments of 2. Figures 3.21 through 3.26 show the results. HCNF outperformed HEFT in SLR and speedup by 32.5% and 24.6% respectively.
3.4.4.3 Random Graphs with Pre-Determined Optimal Schedules (RGPOS)
The graphs in this set are reverse engineered [23]. A schedule for a set of multiprocessors is generated first, and then the node and edge weights are generated randomly but consistent with the generated schedule. The graphs comprise three sets with CCR values of 0.1, 1.0 and 10.0. Within each set, the number of nodes varies from 50 to 500 in increments of 50. Figures 3.27 through 3.32 show the results. HCNF outperformed HEFT in SLR and speedup by 21.1% and 16.9% respectively.
Table 3.4 Trace graph details
Graph Tag Trace Graph # of Nodes Article Reference
D1 AhmedKwok 13 [22]
D2 Yang1 7 [13]
D3 ColinChretienne 9 [25]
D4 McCreary 9 [12]
D5 Kruatrachue 11 [16]
D6 Yang2 7 [19]
D7 Ranka 11 [22]
D8 Shirazi 11 [23]
D9 WuGajski 18 [25]
D10 AlMaasarani 16 [33]
D11 ALMouhamed 17 [32]
Figure 3.19 Trace graphs - SLR
Figure 3.20 Trace graphs - Speedup
Figure 3.21 RGBOS SLR (CCR = 0.1)
Figure 3.22 RGBOS SLR (CCR = 1.0)
Figure 3.23 RGBOS SLR (CCR = 10.0)
Figure 3.24 RGBOS Speedup (CCR = 0.1)
Figure 3.25 RGBOS Speedup (CCR = 1.0)
Figure 3.26 RGBOS Speedup (CCR = 10.0)
Figure 3.27 RGPOS SLR (CCR = 0.1)
Figure 3.28 RGPOS SLR (CCR = 1.0)
Figure 3.29 RGPOS SLR (CCR = 10.0)
Figure 3.30 RGPOS Speedup (CCR = 0.1)
Figure 3.31 RGPOS Speedup (CCR = 1.0)
Figure 3.32 RGPOS Speedup (CCR = 10.0)
3.4.4.4 Application Graphs
These graphs represent a few numerical parallel application programs. This set contains over 320 graphs in six categories: Cholesky factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms and Mean Value Analysis (MVA). The number of nodes ranges from 100 to 300. Figures 3.33 to 3.44 show the results of the simulation. On average, HCNF outperformed HEFT in SLR and speedup by 27.5% and 22.7% respectively.
Figure 3.33 Fast Fourier Transform SLR vs. CCR
Figure 3.34 Fast Fourier Transform Speedup vs. CCR
Figure 3.35 Cholesky Factorization Speedup vs. CCR
Figure 3.36 Gaussian Elimination Speedup vs. CCR
Figure 3.37 Laplace Transform Speedup vs. CCR
Figure 3.38 LU Decomposition Speedup vs. CCR
Figure 3.39 MVA Speedup vs. CCR
Figure 3.40 Cholesky SLR vs. CCR
Figure 3.41 Gaussian Elimination SLR vs. CCR
Figure 3.42 Laplace Transform SLR vs. CCR
Figure 3.43 LU Decomposition SLR vs. CCR
Figure 3.44 MVA SLR vs. CCR
3.4.5 Performance Comparison using a Parametric Random Graph Generator
The performance of DAG scheduling algorithms varies considerably with the type of input DAG. In order to conduct a fair comparison, we need to evaluate performance using a comprehensive set of randomly generated DAGs exhibiting a wide range of parameters. In our simulation study, a parametric random graph generator was developed to generate diverse DAG types based on the following input parameters.
- n: number of nodes in the DAG.
- CCR (Communication to Computation Ratio): ratio of the sum of the edge weights to the sum of the node weights in a DAG.
- out_degree: maximum number of children a node in the DAG can have.
- α (the shape parameter of a DAG): the height of a DAG is randomly generated from a uniform distribution with mean equal to √n / α, and the width of each level from a uniform distribution with mean equal to α × √n.
- β (range percentage of computation costs on processors): if the average computation cost of a node over all the processors is avg_comp, the computation cost of n_i on any processor p_j is randomly selected from the range avg_comp × (1 - β/2) ≤ w_{i,j} ≤ avg_comp × (1 + β/2).
Input parameters were assigned the following values in our simulation study.
- n = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
- CCR = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0}
- α = {0.5, 1.0, 2.0}
- out_degree = {1, 2, 5, 100}
- β = {0.1, 0.5, 0.75, 1.0}
These combinations yield 8640 different DAG types. For each DAG type, 25 different random graphs were generated with the same parameters but with different edge and node weights. Thus a total of 216,000 random DAGs were used in the study. The number of processors was fixed at 10. The processor speeds were randomly selected based on the β value.
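To make the role of β concrete, the fragment below is a minimal sketch of how per-processor execution costs might be drawn once a node's average cost is known; the function name and the use of Python's random module are illustrative choices, not part of the dissertation's simulator.

    import random

    def draw_costs(avg_comp, num_procs, beta):
        # Each processor's cost for this node lies uniformly in
        # [avg_comp * (1 - beta/2), avg_comp * (1 + beta/2)],
        # so beta = 0 gives homogeneous processors and beta = 1
        # lets costs vary by +/- 50% around the average.
        lo = avg_comp * (1.0 - beta / 2.0)
        hi = avg_comp * (1.0 + beta / 2.0)
        return [random.uniform(lo, hi) for _ in range(num_procs)]

    print(draw_costs(avg_comp=20.0, num_procs=10, beta=0.75))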
Figure 3.45 provides the SLR of the HCNF, HEFT and STDS algorithms for graphs with different numbers of nodes. Each data point is averaged over 864 distinct readings. The average SLR improvement of HCNF over STDS is approximately 6%, and over HEFT approximately 10%. Figure 3.46 gives the average speedup versus the number of nodes. Each data point is averaged over 864 different readings. The average improvement in the speedup of HCNF over STDS is 9%, and over HEFT is 14%. Figure 3.47 provides the average SLR values for CCR values ranging from 0.1 to 1.0 in steps of 0.1. Each data point is averaged over 480 different readings. The average improvement of HCNF over STDS is 11% and over HEFT is 4%. Figure 3.48 provides the average SLR values for CCR values ranging from 1.0 to 5.0 in steps of 0.5. Each data point is averaged over 480 different readings. The average improvement of HCNF over STDS is 7% and over HEFT is 11%. For higher CCRs, the STDS algorithm performs better than the HEFT algorithm since there is more scope for task duplication. Figure 3.49 provides the average speedup values for CCR values ranging from 0.1 to 1.0 in steps of 0.1. Each data point is averaged over 480 different readings. The average improvement of HCNF over STDS is 18% and over HEFT is 9%. Figure 3.50 provides the average speedup values for CCR values ranging from 1.0 to 5.0 in steps of 0.5. Each data point is averaged over 480 different readings. The average improvement of HCNF over STDS is 9% and over HEFT is 5%.
Figure 3.45 Parametric random graphs - SLR vs. number of nodes
Figure 3.46 Parametric random graphs - Speedup vs. number of nodes
Figure 3.47 Parametric random graphs - SLR vs. CCR (0.1 to 0.9)
Figure 3.48 Parametric random graphs - SLR vs. CCR (1.0 to 5.0)
Figure 3.49 Parametric random graphs - Speedup vs. CCR (0.1 to 0.9)
Figure 3.50 Parametric random graphs - Speedup vs. CCR (1.0 to 5.0)
3.5 Conclusion
A new task-duplication based static scheduling algorithm called the Heterogeneous Critical Node First (HCNF) algorithm for scheduling DAGs onto a network of heterogeneous processors was proposed. The performance of HCNF, HEFT and STDS was compared using randomly generated graphs, benchmark graphs and parametric graphs. HCNF clearly outperformed both HEFT and STDS with respect to speedup and SLR. The superior performance of HCNF can be attributed to its low-cost task duplication strategy, which facilitates earlier start times for many nodes that would otherwise have to wait for all the data items to arrive from their favorite predecessors. HCNF could be improved further by exploring the possibility of duplicating the second and third favorite predecessors (if any) to further expedite the start times of nodes. The feasibility of such an approach needs to be investigated.
CHAPTER 4
THE HETEROGENEOUS LARGEST TASK FIRST (HLTF) ALGORITHM
This chapter presents a new low-complexity algorithm called the Heterogeneous Largest Task First (HLTF) algorithm for scheduling the independent tasks of a metatask onto a network of heterogeneous processors to minimize the overall execution time. The problem was formally defined in Section 2.2.1. This chapter is organized as follows. Section 4.1 discusses the motivation behind the development of HLTF. Section 4.2 describes the algorithm in detail. Section 4.3 discusses the theoretical non-equivalence of HLTF and the Sufferage algorithm [23], and Section 4.4 provides the simulation study.
4.1 Motivation
A metatask is a set of independent tasks without any precedence constraints. Scheduling a metatask onto a set of heterogeneous processors to minimize the overall execution time is an NP-complete problem. Among the scheduling algorithms discussed in the literature review, Sufferage has the best performance in terms of minimizing the makespan [23]. The time complexity of Sufferage is O(s²m), where s is the size of the metatask and m is the number of processors.
Metacomputing systems such as clusters and grids need to schedule tens of thousands of tasks on a regular basis. A metatask could contain over 1000 independent tasks in practical scenarios [23]. Figure 4.1 summarizes the running times of Sufferage. Sufferage takes more than 100 seconds to schedule a metatask of 1000 tasks, and the running time increases with the size of the metatask: the algorithm takes more than 3000 seconds to schedule a metatask of 5000 tasks. This can mainly be attributed to the high O(s²m) time complexity of the Sufferage algorithm. The high running times of Sufferage could be a major bottleneck in the scheduling process and could negatively impact the overall performance of a metacomputing system.
To counter this problem, we propose a new low-complexity algorithm called the Heterogeneous Largest Task First (HLTF) algorithm to map a metatask onto a set of heterogeneous processors with the objective of minimizing its makespan. Simulation results in Section 4.4 reveal that in terms of minimizing the makespan, HLTF is on par with Sufferage. However, with respect to running times, HLTF, with a lower time complexity of O(s(log s + m)), significantly outperforms Sufferage.
4.2 The Heterogeneous Largest Task First (HLTF) Algorithm
HLTF adopts a simple approach to reduce the overall time complexity of the scheduling process. We first recap the working of the Sufferage algorithm and then explain the working of HLTF. Table 4.1 provides the definitions of the terms used in the HLTF and Sufferage algorithms.
Figure 4.1 Running times of the Sufferage Algorithm
The Sufferage Algorithm
The algorithm is listed in Figure 4.3. At each scheduling step, the Sufferage algorithm picks an arbitrary task from the metatask set and computes its earliest completion time (ECT), favorite processor (Proc) and sufferage values. If the task's favorite processor has no task previously assigned to it, the current task is tentatively assigned to it. However, if the task's favorite processor has a task already assigned to it, the sufferages of the current task and the task already assigned are compared. If the sufferage of the current task is higher, the previously assigned task is removed and the current task is assigned to the processor. The task that is removed is reinserted into the metatask. The process is repeated until all the tasks of the metatask set are scheduled.
The HLTF Algorithm
The calculation of sufferages at each scheduling step, reinserting tasks into the metatask list, and repeating all the steps each time a task is reinserted lead to the high O(s²m) complexity of the Sufferage algorithm. The HLTF algorithm, listed in Figure 4.2, drastically reduces this time complexity by adopting the following approach. Instead of tentatively mapping tasks to processors, HLTF sorts all tasks in the metatask set in non-increasing order of their sizes before the start of the mapping process. At each scheduling step, HLTF picks the largest remaining task in the list and maps it onto a processor that provides its earliest completion time. This seemingly simple approach leads to a very substantial decrease in running time without compromising performance. Simulation results are reported in Section 4.4. The HLTF algorithm takes O(s log s) to perform the merge sort, O(s·m) to compute the completion times of the tasks on all the processors and O(s·m) to compute the earliest completion time of each task. The overall time complexity is O(s log s + s·m + s), or O(s(log s + m)).
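A minimal sketch of HLTF is given below; task sizes are in MI and processor speeds in MIPS, as in the examples later in this chapter, and the dictionaries are hypothetical inputs taken from the data of Table 4.2.

    def hltf(task_sizes, proc_speeds):
        # Sort tasks largest-first (the "largest task first" rule).
        order = sorted(task_sizes, key=lambda t: task_sizes[t], reverse=True)
        avail = {p: 0.0 for p in proc_speeds}       # processor ready times
        mapping = {}
        for t in order:
            # Completion time of t on each processor; pick the earliest (ECT).
            ct = {p: avail[p] + task_sizes[t] / proc_speeds[p] for p in avail}
            best = min(ct, key=ct.get)
            mapping[t] = best
            avail[best] = ct[best]
        return mapping, max(avail.values())          # schedule and makespan

    # Metatask and processor speeds from Table 4.2
    sizes = {"t1": 157, "t2": 111, "t3": 143, "t4": 128, "t5": 111, "t6": 149}
    speeds = {"p1": 4, "p2": 5, "p3": 6}
    print(hltf(sizes, speeds))

Run on the metatask of Table 4.2, this reproduces the first three HLTF assignments shown there (t1 → p3, t6 → p2, t3 → p1).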
4.3 Theoretical nonequivalence of Sufferage and HLTF algorithms
At each scheduling step, the Sufferage algorithm maps the task with the maximum sufferage to a machine that provides its earliest finish time. The HLTF algorithm, at each scheduling step, maps the largest task among the candidate tasks to a machine that provides its earliest finish time. Intuitively, the Sufferage and HLTF algorithms seem to be equivalent. This is because we tend to assume that the largest task will always have the maximum sufferage, i.e., that for any two tasks t_i and t_j in the metatask set (t_1, t_2, t_3, ..., t_{n-1}, t_n) where t_i > t_j and i < j, Sufferage(t_i) > Sufferage(t_j). However, this is not always the case, as proved in Theorem 1.
Table 4.1 Definition of Terms used in Sufferage and HLTF
Term   Definition
T   Metatask set of size s
M   Set of processors available for scheduling
m   Number of processors
T_available(p_j)   Time at which processor p_j can start execution of a new task
W_{k,j}   Running time of task t_k on processor p_j
CT(t_k, p_j) = T_available(p_j) + W_{k,j}   // Execution completion time of task t_k on processor p_j
ECT(t_k) = min_{j in M} CT(t_k, p_j)   // Earliest completion time of task t_k
Proc(t_k)   The processor on which ECT(t_k) is obtained
Sufferage(t_k) = ECT(t_k) - second best CT(t_k)
HLTF Algorithm
Sort T using merge sort in non-increasing order of task size
While T ≠ ∅ do
    Pick the largest task t_k in T
    For all p_j in M
        Compute CT(t_k, p_j)
    End For
    Compute ECT(t_k)
    T ← T - {t_k}
    Schedule t_k on Proc(t_k)
End While
End HLTF
Figure 4.2 HLTF Algorithm
Sufferage Algorithm
While T ≠ ∅ do
    Pick a task t_k in T in an arbitrary order
    For all p_j in M
        Compute CT(t_k, p_j)
    End For
    Compute ECT(t_k)
    Sufferage(t_k) = ECT(t_k) - second best CT(t_k)
    If Proc(t_k) has a task t_s already assigned to it
        If Sufferage(t_k) > Sufferage(t_s)
            Remove t_s from Proc(t_k) and schedule t_k on Proc(t_k)
            T ← T + {t_s}
            T ← T - {t_k}
        End If
    Else
        Schedule t_k on Proc(t_k)
        T ← T - {t_k}
    End If
End While
End Sufferage
Figure 4.3 The Sufferage algorithm
Theorem 1: For any two tasks t_i and t_j in the metatask set (t_1, t_2, t_3, ..., t_{n-1}, t_n) where t_i > t_j and i < j, Sufferage(t_i) is not always greater than Sufferage(t_j).

Case 1: Let m_x and m_y be the processors on which tasks t_i and t_j obtain their best and next-best ECTs respectively. Let p_x and p_y, where p_x > p_y, be the speeds of processors m_x and m_y in MIPS. Then

    Sufferage(t_i) = ( t_i / p_y + T_Available(p_y) ) - ( t_i / p_x + T_Available(p_x) )
    Sufferage(t_j) = ( t_j / p_y + T_Available(p_y) ) - ( t_j / p_x + T_Available(p_x) )

To prove Sufferage(t_i) > Sufferage(t_j), we require

    ( t_i / p_y + T_Available(p_y) ) - ( t_i / p_x + T_Available(p_x) ) > ( t_j / p_y + T_Available(p_y) ) - ( t_j / p_x + T_Available(p_x) )
    or, t_i / p_y - t_i / p_x > t_j / p_y - t_j / p_x
    or, t_i / p_y - t_j / p_y > t_i / p_x - t_j / p_x
    or, ( t_i - t_j ) / p_y > ( t_i - t_j ) / p_x

which is true, since t_i > t_j gives ( t_i - t_j ) > 0 and p_x > p_y.

Case 2: Let m_x and m_y again be the processors on which tasks t_i and t_j obtain their best and next-best ECTs respectively, but now let p_x and p_y, where p_x < p_y, be the speeds of processors m_x and m_y in MIPS. The Sufferage expressions are identical to Case 1, and the same algebra reduces Sufferage(t_i) > Sufferage(t_j) to

    ( t_i - t_j ) / p_y > ( t_i - t_j ) / p_x

which is NOT true, since t_i > t_j gives ( t_i - t_j ) > 0 while p_x < p_y.

Therefore, for any two tasks t_i and t_j in the metatask set (t_1, t_2, t_3, ..., t_{n-1}, t_n) where t_i > t_j and i < j, Sufferage(t_i) > Sufferage(t_j) is not always true.
As an example, observe in Table 4.2 that in the third iteration, t3 is the largest task in the metatask set, and the HLTF algorithm picks t3 and schedules it onto its favorite processor p1. However, notice that the sufferage of t3 (14.25) is less than the sufferage of t2 (16.92), despite t3 being larger than t2. Also, observe that the favorite processor (fproc1) of all the tasks is processor 1, the second-best processor (fproc2) of all the tasks is processor 3, and the speed of processor 1 (4 MIPS) is less than that of processor 3 (6 MIPS). This scenario illustrates Case 2 of Theorem 1 and provides a practical example of the difference between the Sufferage and HLTF algorithms.
4.4 Simulation and Results
Simulations were conducted on a 440 MHz Sun Ultra 5 machine running the Solaris 8 operating system. We compared the relative performance of HLTF and Sufferage with respect to makespan and running costs. We developed a simulator with the following input parameters.
n: number of tasks in the metatask.
p: number of processors in the distributed system.
std_dev: standard deviation of the task sizes in the metatask.
size_min: minimum task size in MI.
size_max: maximum task size in MI.
m: number of metatasks.
The maximum number of processors used in our simulations was 20.
4.4.1 Comparison of Makespan
The makespans of various metatasks under HLTF and Sufferage were measured using the following input parameters.
n = {50, 100, 200, 300, 400, 500, 750, 1000}
P = {5, 10, 15, 20}
std_dev = {5, 10, 15, 20, 25, 30}
size_min = {10}
size_max = {100}
m = {1}
The results are shown in Figures 4.4 to 4.9. Each data point is an average of readings taken over the four different processor counts. The performance of HLTF was slightly better than that of Sufferage. An important observation is that we did not come across a metatask for which the performance of Sufferage was better than that of HLTF. The average improvement of HLTF over Sufferage was 0.48%.
4.4.2 Comparison of Running Costs
The running times of Sufferage and HLTF were measured using different metatask sizes. The results are shown in Figures 4.10 to 4.12. Each data point is an average of 25 different readings. The running cost of the Sufferage algorithm grows rapidly as the size of the metatask increases. For metatask sizes greater than 1000, HLTF provides a very significant reduction in running cost.
Table 4.2 Example showing the non-equivalence of the Sufferage and HLTF algorithms
MetaTask = {t1, t2, t3, t4, t5, t6}
Task sizes: t1 = 157, t2 = 111, t3 = 143, t4 = 128, t5 = 111, t6 = 149 (MI)
Processor speeds: p1 = 4, p2 = 5, p3 = 6 (MIPS)

Sufferage:
First iteration
  task   eft1    fproc1   eft2    fproc2   sufferage
  t1     26.17   3        31.4    2        5.23
  t2     18.5    3        22.2    2        3.7
  t3     23.83   3        28.6    2        4.77
  t4     21.33   3        25.6    2        4.27
  t5     18.5    3        22.2    2        3.76
  t6     24.83   3        29.8    2        4.97
  Schedule t1 on processor 3; MetaTask = {t2,t3,t4,t5,t6}
  t_avail[1] = 0, t_avail[2] = 0, t_avail[3] = 26.16
Second iteration
  task   eft1    fproc1   eft2    fproc2   sufferage
  t2     22.2    2        27.75   1        5.55
  t3     28.6    2        35.75   1        7.15
  t4     25.6    2        32      1        6.4
  t5     22.2    2        27.75   1        5.55
  t6     29.8    2        37.25   1        7.45
  Schedule t6 on processor 2; MetaTask = {t2,t3,t4,t5}
  t_avail[1] = 0, t_avail[2] = 29.8, t_avail[3] = 26.16
Third iteration
  task   eft1    fproc1   eft2    fproc2   sufferage
  t2     27.75   1        44.67   3        16.92
  t3     35.75   1        50      3        14.25
  t4     32.0    1        47.5    3        15.5
  t5     27.75   1        44.67   3        16.92
  Schedule t2 on processor 1; MetaTask = {t3,t4,t5}
  t_avail[1] = 27.75, t_avail[2] = 29.8, t_avail[3] = 26.16

HLTF:
MetaTask = {t1, t6, t3, t4, t2, t5}   // sorted largest-first using merge sort
First iteration: largest task = t1; schedule t1 on processor p3;
  t_avail[1] = 0, t_avail[2] = 0, t_avail[3] = 26.16; MetaTask = {t6,t3,t4,t2,t5}
Second iteration: largest task = t6; schedule t6 on processor p2;
  t_avail[1] = 0, t_avail[2] = 29.8, t_avail[3] = 26.16; MetaTask = {t3,t4,t2,t5}
Third iteration: largest task = t3; schedule t3 on processor p1;
  t_avail[1] = 35.75, t_avail[2] = 29.8, t_avail[3] = 26.16; MetaTask = {t4,t2,t5}
Figure 4.4 Average Makespan of Metatasks, std_dev=5
Figure 4.5 Average Makespan of Metatasks, std_dev=10
Figure 4.6 Average Makespan of Metatasks, std_dev=15
Figure 4.7 Average Makespan of Metatasks, std_dev=20
Figure 4.8 Average Makespan of Metatasks, std_dev=25
Figure 4.9 Average Makespan of Metatasks, std_dev=30
Figure 4.10 Running Times {n = 50, 100, 200} - HLTF and log10 of Sufferage
Figure 4.11 Running Times {n = 500, 1000, 2000} - HLTF and log10 of Sufferage
Figure 4.12 Running Times {n = 3000, 4000, 5000} - HLTF and log10 of Sufferage
CHAPTER 5
SCHEDULING INDEPENDENT TASKS WITH DISPATCH TIMES
In this chapter we introduce a novel heuristic to schedule the independent tasks of a metatask onto a network of heterogeneous processors to minimize the makespan of the metatask. The chapter is organized as follows. In Section 5.1 we provide the motivation for solving this problem. In Section 5.2 we introduce the Earliest Finish Time with Dispatch Time (EFTDT) algorithm. In Section 5.3 we provide a practical example of the algorithm's working. In Section 5.4 we discuss simulation results.
5.1 Motivation
In metacomputing systems such as the grid, a centralized scheduler may make all scheduling decisions with respect to independent tasks. The scheduler makes a scheduling decision and maps tasks onto processors. In reality, mapping a task onto a processor requires time to dispatch the task from the scheduler to the processor. In previous work [20][21][23] on scheduling the independent tasks of a metatask onto a network of heterogeneous processors, the dispatch times of tasks have not been considered in making scheduling decisions. The Sufferage, MinMin and MaxMin [23] algorithms assume a zero dispatch time in their scheduling model. We believe that in practical scenarios a zero dispatch time is not feasible and may lead to unrealistic schedules. In this chapter we introduce a novel heuristic that schedules the independent tasks of a metatask onto a network of heterogeneous processors while considering the dispatch times of the tasks.
5.2 The Earliest Finish Time with Dispatch Time (EFTDT) Algorithm
In the EFTDT algorithm, the priority of a task is defined as the sum of its mean execution time over all the processors and the standard deviation of its execution time over all the processors. At each scheduling step, EFTDT picks the task with the highest priority and schedules it onto a processor that provides its earliest completion time. Table 5.1 defines the terms used in the algorithm.
Table 5.1 EFTDT Algorithm - Definition of Terms
Term   Definition
T   Metatask set of size s
M   Set of processors available for scheduling
m   Number of processors
mean_k   Mean execution time of task t_k over all the processors
std_k   Standard deviation of the execution times of task t_k over all the processors
T_available(p_j)   Time at which processor p_j can start execution of a new task
W_{k,j}   Running time of task t_k on processor p_j
D_{k,j}   Time required to dispatch task t_k from the scheduler to processor p_j
CT(t_k, p_j) = max{ T_available(p_j), D_{k,j} } + W_{k,j}   // Execution completion time of task t_k on processor p_j
ECT(t_k) = min_{j in M} CT(t_k, p_j)   // Earliest completion time of task t_k
Proc(t_k)   The processor on which ECT(t_k) is obtained
EFTDT Algorithm
For all t_k in T
    priority(t_k) ← mean_k + std_k
End For
While T ≠ ∅ do
    Pick the task t_k in T with the highest priority
    For all p_j in M
        CT(t_k, p_j) ← max{ T_available(p_j), D_{k,j} } + W_{k,j}
    End For
    ECT(t_k) ← min_{j in M} CT(t_k, p_j)
    Compute Proc(t_k)
    Assign t_k to Proc(t_k)
    T_available(Proc(t_k)) ← ECT(t_k)
    T ← T - {t_k}
End While
End EFTDT
Figure 5.1 The EFTDT Algorithm
The completion time of a task on a processor is defined as CT(t_k, p_j) = max{ T_available(p_j), D_{k,j} } + W_{k,j} to account for dispatch times. EFTDT then determines the processor on which the least completion time is obtained and schedules the task onto it. EFTDT takes O(s) to compute the priorities of all the tasks and O(s·m) to calculate the earliest completion times of the tasks. Thus the overall complexity is O(s·m).
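A minimal sketch of the dispatch-aware completion-time rule is shown below; the exec_time and dispatch dictionaries are hypothetical inputs shaped like Tables 5.2 and 5.3, and the use of the population standard deviation for the priority is an assumption of the sketch.

    from statistics import mean, pstdev

    def eftdt(exec_time, dispatch):
        procs = {p for row in exec_time.values() for p in row}
        avail = {p: 0.0 for p in procs}
        # Priority = mean execution time + standard deviation over processors.
        prio = {t: mean(row.values()) + pstdev(row.values())
                for t, row in exec_time.items()}
        schedule = []
        for t in sorted(prio, key=prio.get, reverse=True):
            # Completion time accounts for the dispatch delay D[t][p]: the task
            # cannot start before the processor is free and the task has arrived.
            ct = {p: max(avail[p], dispatch[t][p]) + exec_time[t][p] for p in procs}
            best = min(ct, key=ct.get)
            avail[best] = ct[best]
            schedule.append((t, best, ct[best]))
        return schedule, max(avail.values())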
5.3 Example Run of EFTDT
We now show the working of EFTDT on the sample metatask shown in Table 5.2, with the corresponding dispatch times given in Table 5.3. Task priorities are computed, giving the scheduling order {8, 2, 10, 1, 3, 9, 4, 7, 5, 6}.
Step 1: Schedule task 8 on processor P2
Step 2: Schedule task 2 on processor P3
Table 5.2 A sample metatask
Task P1 P2 P3
1 15 13 15
2 16 20 16
3 17 16 11
4 20 13 10
5 11 11 12
6 14 14 12
7 15 11 15
8 20 17 13
9 19 10 16
10 18 19 18
Table 5.3 Metatask Dispatch Times
Task P1 P2 P3
1 6 6 7
2 9 8 5
3 7 8 9
4 6 9 10
5 8 7 7
6 10 7 6
7 9 7 7
8 6 10 10
9 10 9 5
10 9 8 10
Step 3: Schedule task 10 on processor P1
Step 4: Schedule task 1 on processor P2
Step 5: Schedule task 3 on processor P1
Step 6: Schedule task 9 on processor P3
Step 7: Schedule task 4 on processor P3
Step 8: Schedule task 7 on processor P2
Step 9: Schedule task 5 on processor P2
Figure 5.2 Gantt Chart for the MetaTask
Step 10: Schedule task 6 on processor P3
The Gantt chart for the metatask is provided in Figure 5.2.
5.4 Simulation Study
We developed a simulator with the following input parameters to compare the performance of EFTDT and the FIFO approach.
n: metatask size
size_max: maximum size of a task within a metatask
dis_max: maximum dispatch time of each task
std_dev: standard deviation of the metatask
proc_dev: standard deviation of the processor speeds
num_proc: number of processors used
The input parameters were set to the following values in our simulation study.
n = {1000, 2000, 5000, 7500, 10000}
size_max = {100}
dis_max = {50}
std_dev = {5, 10, 15, 20, 25, 30}
proc_dev = {2, 4, 6}
num_proc = {5, 10, 15, 20}
Table 5.4 Parameter Values
Parameter           Minimum   Maximum   Standard Deviation
Task Size           10        100       5-30
Dispatch Times      10        50        X
Proc Speeds         1         10        2, 4, 6
No. of tasks        1000      10000     X
No. of Processors   5         20        X
Each data point in the graphs that follow is an average of four different readings obtained using different numbers of processors. Figure 5.3 compares the makespan of EFTDT and FIFO for std_dev=5 and proc_dev=2. The average improvement of EFTDT is 28%.
Figure 5.4 provides the comparison for std_dev=10 and proc_dev=2. The average
improvement of EFTDT is 29%. Figure 5.5 provides the comparison for std_dev=15 and
proc_dev=2. The average improvement of EFTDT is 28%. Figure 5.6 provides the
comparison for std_dev=20 and proc_dev=2.The average improvement of EFTDT is
30%. Figure 5.7 provides the comparison for std_dev=25 and proc_dev=2. The average
improvement of EFTDT is 29%. Figure 5.8 provides the comparison for std_dev=30 and
proc_dev=2. The average improvement of EFTDT was 30%. Figure 5.9 provides the
comparison for std_dev=5 and proc_dev=4. The average improvement of EFTDT was
29%. Figure 5.10 provides the comparison for std_dev=10 and proc_dev=4. The average
improvement of EFTDT was 28%. Figure 5.11 provides the comparison for std_dev=15
and proc_dev=4. The average improvement of EFTDT was 31%. Figure 5.12 provides
the comparison for std_dev=20 and proc_dev=4. The average improvement of EFTDT
was 31%. Figure 5.13 provides the comparison for std_dev=25 and proc_dev=4. The
average improvement of EFTDT is 30%. Figure 5.14 provides the comparison for
std_dev=30 and proc_dev=4. The average improvement of EFTDT is 29%. Figure 5.15
provides the comparison for std_dev=5 and proc_dev=6. The average improvement of
EFTDT is 30%. Figure 5.16 provides the comparison for std_dev=10 and proc_dev=6.
The average improvement of EFTDT is 29%. Figure 5.17 provides the comparison for
std_dev=15 and proc_dev=6. The average improvement of EFTDT is 32%. Figure 5.18
provides the comparison for std_dev=20 and proc_dev=6. The average improvement of
EFTDT is 28%. Figure 5.19 provides the comparison for std_dev=25 and
proc_dev=6.The average improvement of EFTDT is 32%. Figure 5.20 provides the
103
comparison for std_dev =30 and proc_dev=6.The average improvement of EFTDT is
30%. From all these average improvements, the overall average improvement of EFTDT
over FIFO is 29%
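For concreteness, the improvement figures quoted above can be computed as the relative reduction in average makespan with respect to FIFO; this definition, and the numbers in the example below, are assumptions made here for illustration, since the exact formula is not restated in this section.

def improvement_over_fifo(fifo_makespan, eftdt_makespan):
    """Assumed definition: percentage reduction in makespan relative to FIFO."""
    return 100.0 * (fifo_makespan - eftdt_makespan) / fifo_makespan

# For instance, a FIFO makespan of 250000 s against an EFTDT makespan of
# 175000 s corresponds to a 30% improvement.
print(improvement_over_fifo(250000.0, 175000.0))  # 30.0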
[Figures 5.3-5.20 plot the average makespan in seconds of FIFO and EFTDT against the number of tasks in the metatask (1000 to 10000), one figure per (std_dev, proc_dev) setting:]
[Figure 5.3 Average Makespan, std_dev=5, proc_dev=2]
[Figure 5.4 Average Makespan, std_dev=10, proc_dev=2]
[Figure 5.5 Average Makespan, std_dev=15, proc_dev=2]
[Figure 5.6 Average Makespan, std_dev=20, proc_dev=2]
[Figure 5.7 Average Makespan, std_dev=25, proc_dev=2]
[Figure 5.8 Average Makespan, std_dev=30, proc_dev=2]
[Figure 5.9 Average Makespan, std_dev=5, proc_dev=4]
[Figure 5.10 Average Makespan, std_dev=10, proc_dev=4]
[Figure 5.11 Average Makespan, std_dev=15, proc_dev=4]
[Figure 5.12 Average Makespan, std_dev=20, proc_dev=4]
[Figure 5.13 Average Makespan, std_dev=25, proc_dev=4]
[Figure 5.14 Average Makespan, std_dev=30, proc_dev=4]
[Figure 5.15 Average Makespan, std_dev=5, proc_dev=6]
[Figure 5.16 Average Makespan, std_dev=10, proc_dev=6]
[Figure 5.17 Average Makespan, std_dev=15, proc_dev=6]
[Figure 5.18 Average Makespan, std_dev=20, proc_dev=6]
[Figure 5.19 Average Makespan, std_dev=25, proc_dev=6]
[Figure 5.20 Average Makespan, std_dev=30, proc_dev=6]
[Figure 5.21 Percentage improvement of EFTDT over FIFO for various std_dev]
[Figure 5.22 Percentage improvement of EFTDT over FIFO for various proc_dev]
[Figure 5.23 Percentage improvement of EFTDT over FIFO for various metatask sizes]
CHAPTER 6
CONCLUSION
In this dissertation, we presented three new algorithms: the Heterogeneous Critical Node
First (HCNF) algorithm, the Heterogeneous Largest Task First (HLTF) algorithm, and the
Earliest Finish Time with Dispatch Time (EFTDT) algorithm. These algorithms were
compared against existing algorithms through extensive simulation, and the simulation
results were presented in the earlier chapters. In this chapter, we briefly summarize
those results and provide concluding remarks.
For the HCNF algorithm, the experimental test suite consisted of application graphs,
trace graphs, and the RGBOS and RGPOS graph sets. HCNF was compared against HEFT
using these graphs, and against both HEFT and STDS using graphs produced by the
parametric random graph generator. We now summarize the experimental
results.
The SLR and speedup of HCNF and HEFT were compared using graphs of
different sizes. The average SLR of HCNF was better than that of HEFT by 12.3% and its
speedup was better by 7.9%. The SLR and efficiency of HEFT and HCNF
were compared using DAGs representing the Gaussian elimination algorithm. HCNF
outperformed HEFT by an average of 25.7% with respect to SLR and by an average of
22.6% with respect to efficiency. The SLR and speedup of
HEFT and HCNF were compared using trace graphs; HCNF outperformed HEFT in SLR
and speedup by an average of 29.5% and 38.4% respectively. The SLR and speedup of
HEFT and HCNF were compared using the RGBOS graphs, whose optimal schedules can be
obtained using a branch-and-bound technique; HCNF outperformed HEFT in SLR and
speedup by 32.5% and 24.6% respectively. The SLR and speedup of HEFT and HCNF
were compared using the RGPOS graphs; HCNF outperformed HEFT in SLR and speedup
by 21.1% and 16.9% respectively. The SLR and speedup of HEFT and HCNF were also
compared using application graphs. These graphs represent a few numerical parallel
application programs. The set contains over 320 graphs in six categories: Cholesky
factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms and
Mean Value Analysis (MVA); the number of nodes ranges from 100 to 300. On
average, HCNF outperformed HEFT in SLR and speedup by 27.5% and 22.7%
respectively. The SLR of the HCNF, HEFT and STDS algorithms was compared using
the parametric random graph generator. The average SLR improvement of HCNF over
STDS is approximately 6%, and over HEFT approximately 10%. The speedup of the HCNF, HEFT and
STDS algorithms was also compared using the parametric random graph generator; the
average improvement in the speedup of HCNF over STDS is 9%, and over HEFT is
14%. The average SLR values for CCR values ranging from 0.1 to 1.0 in steps of 0.1 were
compared using the parametric random graph generator; the average improvement of
HCNF over HEFT is 11% and over STDS is 4%. The average SLR values for CCR
values ranging from 1.0 to 5.0 in steps of 0.5 were compared using the parametric random
graph generator; the average improvement of HCNF over STDS is 7% and over HEFT is
11%. The superior performance of HCNF can be attributed to its low-cost task
duplication strategy, which facilitates earlier start times for many nodes that would otherwise
have to wait for all their data items to arrive from their favorite predecessors. HCNF could be
further improved by exploring the possibility of duplicating the second and third favorite
predecessors (if any) to further expedite the start times of nodes; the feasibility of such an
approach needs to be investigated.
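As an illustration of the kind of decision involved, the sketch below shows a duplication test of the sort this paragraph describes. The function name, parameter names and the simplified cost model are hypothetical, since HCNF's actual duplication rule is specified in the earlier chapters.

def should_duplicate(pred_exec_on_p, pred_data_arrival, proc_ready):
    """Hypothetical duplication test for a node's favorite predecessor.

    pred_exec_on_p    -- time to re-execute (duplicate) the favorite predecessor
                         on candidate processor p
    pred_data_arrival -- time at which the predecessor's output would otherwise
                         arrive at p over the network
    proc_ready        -- time at which processor p becomes free

    Duplicate if re-running the predecessor locally lets the node start earlier
    than waiting for its data to arrive.  (The duplicated predecessor's own
    input data is ignored here for brevity.)"""
    start_without_dup = max(proc_ready, pred_data_arrival)
    start_with_dup = proc_ready + pred_exec_on_p
    return start_with_dup < start_without_dup

# Example: the predecessor's data would arrive at t=40, the processor frees up
# at t=10, and re-running the predecessor locally takes 12 time units; since
# 10 + 12 = 22 < 40, duplication pays off.
print(should_duplicate(12, 40, 10))  # True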
The average makespans of the HLTF and Sufferage algorithms were compared
using metatasks of different sizes and different values of std_dev. The average improvement of HLTF
over Sufferage was 4.13%.
The running times of HLTF and Sufferage were compared using metatasks of
different sizes. For metatask sizes greater than 1000, HLTF showed an improvement of
over 1000%. The superior performance of HLTF in terms of running time can be
attributed to the low-complexity sorting technique used by the algorithm.
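The flavour of the approach can be seen in the short sketch below: tasks are ordered once by size and then placed greedily. The particular size measure (mean execution time across processors), the placement rule and the function name hltf_like_schedule are assumptions of this sketch, since HLTF is defined precisely in an earlier chapter.

def hltf_like_schedule(exec_time, num_proc):
    """Largest-task-first sketch: sort tasks by decreasing size, then assign
    each to the processor on which it would complete earliest."""
    tasks = sorted(range(len(exec_time)),
                   key=lambda t: sum(exec_time[t]) / num_proc,
                   reverse=True)                        # single O(s log s) sort
    proc_free = [0.0] * num_proc                        # when each processor frees up
    assignment = {}
    for t in tasks:
        p = min(range(num_proc), key=lambda i: proc_free[i] + exec_time[t][i])
        proc_free[p] += exec_time[t][p]
        assignment[t] = p
    return assignment, max(proc_free)

# Three tasks on two processors, illustrative numbers only.
assignment, makespan = hltf_like_schedule([[8.0, 9.0], [4.0, 2.0], [6.0, 7.0]], 2)
print(assignment, makespan)

Sorting once and then scanning the processors for each task keeps the cost near O(s log s + s × m), whereas Sufferage re-evaluates its metric over all unscheduled tasks at every assignment step, which is what the running-time gap noted above reflects.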
Experiments were conducted to compare the makespans of EFTDT and FIFO. The
overall average improvement of EFTDT over FIFO is 30%. The superior performance of
EFTDT over FIFO can be attributed to task dispatch and task execution
proceeding in parallel.