ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS COMPUTING ENVIRONMENTS

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classified information.

__________________________
Prashanth C. Sai Ranga

Certificate of Approval:

__________________________
Sanjeev Baskiyar, Chair
Associate Professor
Computer Science and Software Engineering

__________________________
Homer W. Carlisle
Associate Professor
Computer Science and Software Engineering

__________________________
Yu Wang
Assistant Professor
Computer Science and Software Engineering

__________________________
Joe F. Pittman
Interim Dean
Graduate School

ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS COMPUTING ENVIRONMENTS

Prashanth C. Sai Ranga

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 15, 2006

ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS COMPUTING ENVIRONMENTS

Prashanth C. Sai Ranga

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon request of individuals or institutions and at their expense. The author reserves all publication rights.

________________________
Signature of Author

________________________
Date of Graduation

DISSERTATION ABSTRACT

ALGORITHMS FOR TASK SCHEDULING IN HETEROGENEOUS COMPUTING ENVIRONMENTS

Prashanth C. Sai Ranga

Doctor of Philosophy, December 15, 2006
(M.S., University of Texas at Dallas, December 2001)
(B.E., Bangalore University, India, August 1998)

136 Typed pages

Directed by Sanjeev Baskiyar

Current heterogeneous meta-computing systems, such as computational clusters and grids, offer a low-cost alternative to supercomputers. In addition, they are highly scalable and flexible. They consist of a host of diverse computational devices that collaborate via a high-speed network and may execute high-performance applications. Many high-performance applications are an aggregate of modules. Efficient scheduling of such applications on meta-computing systems is critical to meeting deadlines. In this dissertation, we introduce three new algorithms: the Heterogeneous Critical Node First (HCNF) algorithm, the Heterogeneous Largest Task First (HLTF) algorithm and the Earliest Finish Time with Dispatch Time (EFT-DT) algorithm. HCNF is used to schedule parallel applications represented by directed acyclic graphs onto networks of workstations to minimize their finish times. We compared the performance of HCNF with those of the Heterogeneous Earliest Finish Time (HEFT) and Scalable Task Duplication based Scheduling (STDS) algorithms. In terms of Schedule Length Ratio (SLR) and speedup, HCNF outperformed HEFT on average by 13% and 18% respectively, and outperformed STDS on average by 8% and 12% respectively. The HLTF algorithm is used to schedule a set of independent tasks onto a network of heterogeneous processors to minimize finish time. We compared the performance of HLTF with that of the Sufferage algorithm. In terms of makespan, HLTF outperformed Sufferage on average by 4.5%, with a tenth of its run time. The EFT-DT algorithm schedules a set of independent tasks onto a network of heterogeneous processors to minimize finish time when considering dispatch times of tasks.
We compared the performance of EFT-DT with that of a First In First Out (FIFO) schedule. In terms of minimizing makespan, EFT-DT outperformed FIFO on average by 30%.

ACKNOWLEDGMENTS

The author is highly indebted to his advisor, Dr. Sanjeev Baskiyar, for his clear vision, encouragement, persistent guidance and stimulating technical inputs. His patience, understanding and support are deeply appreciated. Thanks to Dr. Homer Carlisle and Dr. Yu Wang for their review of and comments on this research work. The invaluable time they spent serving on my graduate committee is sincerely appreciated. Special thanks to Mr. Victor Beibighauser, Mr. Basil Manly and Mr. Ron Moody of South University, Montgomery, for their concern, understanding and co-operation. Finally, the author would like to thank his parents, sister and brother-in-law for their constant support and encouragement.

Style manual or journal used: IEEE Transactions on Parallel and Distributed Systems
Computer software used: Microsoft Word, Adobe PDF

TABLE OF CONTENTS

LIST OF FIGURES x
LIST OF TABLES xiii
CHAPTER 1 INTRODUCTION 1
1.1 Motivation 1
1.2 Cluster Computing 5
1.3 Grid Computing 7
1.4 Task Scheduling in Heterogeneous Computing Environments 10
1.5 NP-Complete Problems 14
1.6 Research Objectives and Outline 15
CHAPTER 2 LITERATURE REVIEW 16
2.1 Scheduling a Parallel Application Represented by a Directed Acyclic Graph onto a Network of Heterogeneous Processors to Minimize the Make-Span 16
2.1.1 Directed Acyclic Graphs 16
2.1.2 Problem Statement 17
2.1.3 The Best Imaginary Level Algorithm 19
2.1.4 The Generalized Dynamic Level Algorithm 21
2.1.5 The Levelized Min-Time Algorithm 24
2.1.6 The Heterogeneous Earliest Finish Time Algorithm 26
2.1.7 The Critical Path on Processor Algorithm 27
2.1.8 The Fast Critical Path Algorithm 30
2.1.9 The Fast Load Balancing Algorithm 32
2.1.10 The Hybrid Re-mapper Algorithm 34
2.1.11 Performance Comparison 36
2.2 Scheduling a Parallel Application Represented by a Set of Independent Tasks onto a Network of Heterogeneous Processors to Minimize the Make-Span 38
2.2.1 Problem Statement 38
2.2.2 The Min-Min and the Max-Min Algorithms 38
2.2.3 The Sufferage Algorithm 40
CHAPTER 3 THE HETEROGENEOUS CRITICAL NODE FIRST ALGORITHM 43
3.1 Motivation 43
3.2 The HCNF Algorithm 44
3.3 Running Trace 46
3.4 Simulation Study 54
3.4.1 Performance Parameters 54
3.4.2 Randomly Generated Graphs 55
3.4.3 Gaussian Elimination Graphs 56
3.4.4 Benchmark Graphs 57
3.4.5 Parametric Random Graph Generator 73
3.5 Conclusion 79
CHAPTER 4 THE HETEROGENEOUS LARGEST TASK FIRST ALGORITHM 80
4.1 Motivation 80
4.2 The HLTF Algorithm 81
4.3 Theoretical Non-Equivalence of Sufferage and HLTF 83
4.4 Simulation Study 87
4.4.1 Comparison of Make-span 88
4.4.2 Comparison of Running Times 88
CHAPTER 5 SCHEDULING INDEPENDENT TASKS WITH DISPATCH TIMES 95
5.1 Motivation 95
5.2 The EFT-DT Algorithm 96
5.3 Example Run of EFT-DT 97
5.4 Simulation Study 99
CHAPTER 6 CONCLUSION 113
BIBLIOGRAPHY 116

LIST OF FIGURES

1.1 Architecture of Cluster Computing Systems 6
1.2 Grid Architecture 8
2.1 A sample DAG G_1 17
2.2 The BIL algorithm 21
2.3 The GDL algorithm 23
2.4 The LMT algorithm 25
2.5 The HEFT algorithm 27
2.6 The CPOP algorithm 29
2.7 The FCP algorithm 31
2.8 The FLB algorithm 33
2.9 The Hybrid Re-mapper algorithm 35
2.10 The Min-Min algorithm 37
2.11 The Sufferage algorithm 38
3.1 The HCNF algorithm 39
3.2 Sample DAG (G_1) 40
3.4 Gantt chart for G_1 41
3.5 HCNF running trace-step 1 42
3.6 HCNF running trace-step 2 42
3.7 HCNF running trace-step 3 42
3.8 HCNF running trace-step 4 43
3.9 HCNF running trace-step 5 43
3.10 HCNF running trace-step 6 43
3.11 HCNF running trace-step 7 44
3.12 HCNF running trace-step 8 45
3.13 HCNF running trace-step 9 45
3.14 HCNF running trace-step 10 45
3.15 Random graphs - Average SLR vs. number of nodes 46
3.16 Random graphs - Average speedup vs. number of nodes 46
3.17 Random graphs - Average SLR vs. CCR (0.1 to 1) 47
3.18 Random graphs - Average SLR vs. CCR (1 to 5) 48
3.19 Random graphs - Average speedup vs. CCR (0.1 to 1) 48
3.20 Random graphs - Average speedup vs. CCR (1 to 5) 48
3.21 Gaussian Elimination - Average SLR vs. matrix size 49
3.22 Gaussian Elimination - Efficiency vs. no. of processors 50
3.23 Trace Graphs - SLR 51
3.24 Trace Graphs - Speedup 52
3.25 RGBOS SLR (CCR = 0.1) 52
3.26 RGBOS SLR (CCR = 1.0) 53
3.27 RGBOS SLR (CCR = 10.0) 53
3.28 RGBOS Speedup (CCR = 0.1) 54
3.29 RGBOS Speedup (CCR = 1.0) 54
3.30 RGBOS Speedup (CCR = 10.0) 55
3.31 RGPOS SLR (CCR = 0.1) 56
3.32 RGPOS SLR (CCR = 1.0) 56
3.33 RGPOS SLR (CCR = 10.0) 57
3.34 RGPOS Speedup (CCR = 0.1) 57
3.35 RGPOS Speedup (CCR = 1.0) 58
3.36 RGPOS Speedup (CCR = 10.0) 58
3.37 Fast Fourier Transform - SLR vs. CCR 59
3.38 Fast Fourier Transform - Speedup vs. CCR 59
3.39 Cholesky Factorization - Speedup vs. CCR 60
3.40 Gaussian Elimination - Speedup vs. CCR 60
3.41 Laplace Transform - Speedup vs. CCR 61
3.42 LU Decomposition - Speedup vs. CCR 61
3.43 MVA - Speedup vs. CCR 62
3.44 Cholesky - SLR vs. CCR 62
3.45 Gaussian Elimination - SLR vs. CCR 63
3.46 Laplace Transform - SLR vs. CCR 63
3.47 LU Decomposition - SLR vs. CCR 64
3.48 MVA - SLR vs. CCR 64
3.49 Parametric random graphs - SLR vs. number of nodes 67
3.50 Parametric random graphs - Speedup vs. number of nodes 67
3.51 Parametric random graphs - SLR vs. CCR (0.1 to 0.9) 68
3.52 Parametric random graphs - SLR vs. CCR (1.0 to 5.0) 68
3.53 Parametric random graphs - Speedup vs. CCR (0.1 to 0.9) 69
3.54 Parametric random graphs - Speedup vs. CCR (1.0 to 5.0) 69
4.1 Running times of the Sufferage Algorithm 70
4.2 HLTF Algorithm 72
4.3 The Sufferage algorithm 74
4.4 Average Makespan of Meta-tasks, std_dev=5 76
4.5 Average Makespan of Meta-tasks, std_dev=10 78
4.6 Average Makespan of Meta-tasks, std_dev=15 80
4.7 Average Makespan of Meta-tasks, std_dev=20 82
4.8 Average Makespan of Meta-tasks, std_dev=25 84
4.9 Average Makespan of Meta-tasks, std_dev=30 85
4.10 Running Times {n = 50, 100, 200} 87
4.10 Running Times {n = 500, 1000, 2000} 87
4.11 Running Times {n = 3000, 4000, 5000} 90
5.1 The EFT-DT Algorithm 94
5.2 Gantt Chart for the Meta-Task 96
5.3 Average Makespan - std_dev=5, proc_dev=2 98
5.4 Average Makespan - std_dev=10, proc_dev=2 99
5.5 Average Makespan - std_dev=15, proc_dev=2 99
5.6 Average Makespan - std_dev=20, proc_dev=2 100
5.7 Average Makespan - std_dev=25, proc_dev=2 100
5.8 Average Makespan - std_dev=30, proc_dev=2 101
5.9 Average Makespan - std_dev=5, proc_dev=4 101
5.10 Average Makespan - std_dev=10, proc_dev=4 102
5.11 Average Makespan - std_dev=15, proc_dev=4 102
5.12 Average Makespan - std_dev=20, proc_dev=4 103
5.13 Average Makespan - std_dev=25, proc_dev=4 103
5.14 Average Makespan - std_dev=30, proc_dev=4 104
5.15 Average Makespan - std_dev=5, proc_dev=6 104
5.16 Average Makespan - std_dev=10, proc_dev=6 105
5.17 Average Makespan - std_dev=15, proc_dev=6 105
5.18 Average Makespan - std_dev=20, proc_dev=6 106
5.19 Average Makespan - std_dev=25, proc_dev=6 106
5.20 Average Makespan - std_dev=30, proc_dev=6 107

LIST OF TABLES

2.1 Table of values for G_1 18
2.2 Definition of terms used in BIL 20
2.3 Definition of terms used in GDL 22
2.4 Definition of terms used in LMT 24
2.5 Definition of terms used in HEFT 27
2.6 Definition of terms used in CPOP 28
2.7 Definition of terms used in FCP 30
2.8 Definition of terms used in FLB 32
2.9 Definition of terms used in Hybrid Re-mapper 34
2.10 Performance Comparison 38
2.11 Definition of terms used in Min-Min 40
2.12 Definition of terms used in Sufferage 42
3.1 HCNF - definition of terms 55
3.2 Task execution times of G_1 on three different processors 58
3.3 Run-time values for G_1 60
3.4 Trace graph details 64
4.1 Definition of Terms used in Sufferage and HLTF 81
4.2 Theoretical Non-equivalence of the Sufferage and the HLTF Algorithms 83
5.1 EFT-DT Algorithm - Definition of Terms 93
5.2 A sample meta-task 95
5.3 Meta-task Dispatch Times 95

CHAPTER 1
INTRODUCTION

This chapter provides an introduction to our research work and discusses a few relevant topics. Section 1.1 discusses our research motivation. Section 1.2 describes the architecture of cluster computing systems. Section 1.3 describes the architecture of grid computing systems. Section 1.4 provides an overview of task scheduling in heterogeneous computing systems. Section 1.5 provides an introduction to NP-complete problems and Section 1.6 discusses the organization of this dissertation.

1.1 Motivation

Information Technology has revolutionized the way we share and use information. The IT revolution has witnessed a myriad of applications with a wide range of objectives, including small personal-computer-based applications like the calculator program, medium-sized applications like Microsoft Word, large applications like Computer Aided Design software and very large applications like weather forecasting. Some of these programs can run efficiently on a normal personal computer and some may need a more powerful workstation.
However, there are applications like weather forecasting, earthquake analysis, particle simulation and a host of other engineering and scientific applications that require computing capabilities beyond those of personal computers or workstations. They are called "high-performance applications". How do we run these high-performance applications efficiently, given the fact that sequential computers (PCs, workstations) are too slow to handle them? There are three ways to improve efficiency [1]: work harder, work smarter or get help. In this context, working harder refers to increasing the speed of sequential uni-processor computers. In the last two decades, microprocessor speed has on average doubled every 18 months. Today's microprocessor chip is faster than the mainframes of yesteryear, owing to the phenomenal advances in Very Large Scale Integration (VLSI) technology. Even though this trend is expected to continue in the future, microprocessor speed is severely limited by the laws of physics and thermodynamics [2]. There is a very high probability that it will eventually hit a plateau in the near future. Working smarter refers to designing efficient algorithms and programming environments to deal with high-performance applications. By working smarter, we can definitely improve the overall efficiency, but we will not be able to overcome the speed bottleneck of sequential computers. Getting help refers to involving multiple processors to solve the problem. The idea of multiple processors working together simultaneously to run an application is called "parallel processing."

Most such applications consist of thousands of modules or sub-programs that may or may not interact with each other, depending on the nature of the application. In either case, there are usually a number of modules that are independent of one another and could run simultaneously on different processors. The parallel nature of many applications is what makes parallel processing very appealing. In other words, if applications were one large sequential module, parallel processing would not be feasible.

Parallel processing has captivated researchers for a long time. The initial trend in parallel processing was to create tightly coupled multi-processor systems with shared memory, running proprietary software. These systems were generally referred to as "supercomputers". Supercomputers were extremely fast and expensive. In the 1960s, Seymour Cray created the world's first commercial supercomputer, the CDC 6600. Other companies like IBM, Digital and Texas Instruments created their own proprietary versions of supercomputers. The 1970s and 1980s witnessed major companies and research labs across the world vie with one another to create the world's fastest supercomputer. Even though the trend continues to this day, parallel processing has slowly drifted away from supercomputing for a number of reasons. Supercomputers are extremely expensive systems that run on proprietary technology. Since they run on proprietary technology, they offer less flexibility with respect to developing software solutions to execute high-performance applications. Since supercomputers are very expensive to lease, purchase and maintain, it is beyond the reach of many organizations to deploy them. Also, in view of today's technological growth, it is important for systems to be readily scalable. Owing to factors like proprietary hardware and software technologies, most supercomputers are not readily scalable.
To summarize, supercomputers have a very high cost/performance factor, which made them unattractive to a number of organizations. Most organizations (business, academic, military, etc.) were interested in high-performance computing but were seeking systems with a low cost/performance factor, which supercomputers could not offer. In the meantime, PCs and workstations became extremely powerful and significant advances were made in networking technologies. Researchers began to explore the possibility of connecting low-cost PCs with a high-speed network to mimic the functioning of a supercomputer, albeit with a low cost/performance factor. Extensive research has been carried out to create high-performance systems by connecting PCs/workstations with a high-speed network. Most of the research was focused on creating viable parallel programming environments, developing high-speed network protocols and devising effective scheduling algorithms. Initially, the PCs/workstations had uniform hardware characteristics and thus the systems were termed "homogeneous." However, due to rapid advances in PC technology, computers and other hardware items had to be continuously upgraded, and it was no longer the case that all the machines had identical hardware characteristics. This led to the notion of "heterogeneous systems", where individual PCs/workstations in a network could have different hardware characteristics. Researchers today focus on creating a high-performance system with a low cost/performance factor using a heterogeneous Network of Workstations (NOW).

So, what goes into creating a viable high-performance computing system with a low cost/performance ratio out of a NOW, given the fact that we have powerful workstations and very high-speed networks? Firstly, an efficient run-time environment must be provided for high-performance applications. Extensive research has been done in this area and has led to the creation of efficient technologies like the Message Passing Interface (MPI) [2] and the Parallel Virtual Machine (PVM) [2]. Secondly, in order to be able to provide a low cost/performance ratio, these systems must optimize the overall execution time (or turnaround time) of high-performance applications. This requires efficient scheduling of the sub-tasks of high-performance applications onto the individual machines of a NOW. The sub-tasks of a parallel application may either be independent or may have precedence constraints. In either case, the problem of scheduling these sub-tasks to optimize the overall execution time of an application is a well-known NP-complete problem [3]. The focus of our research is to devise efficient scheduling algorithms for scheduling parallel applications, represented by independent tasks as well as tasks with precedence constraints, onto heterogeneous computing systems to minimize the overall execution time. We strongly believe that efficient task scheduling is the most important factor in creating a low-cost high-performance computing system. We now discuss the architectures of two very popular heterogeneous computing systems, the cluster and the grid.

1.2 Cluster Computing

A cluster is a heterogeneous parallel computing system which consists of several stand-alone systems that are interconnected to function as an integrated computing resource. A cluster generally refers to two or more computers interconnected via a local area network. A cluster of computers can appear as a single system to users and applications.
It provides a low-cost alternative to supercomputers with a relatively reasonable performance. Figure 1.1 describes the architecture and the main components of a cluster computing system [2]. The individual nodes of a cluster could be PCs or high-speed workstations connected through a high-speed network. The network interface hardware acts as a communication processor and is responsible for transmitting and receiving packets of data between cluster nodes. The cluster communication software provides a means for fast and reliable data communication among cluster nodes and to the outside world. Clusters often use communication protocols such as "Active Messages" [2] for fast communication among their nodes. These protocols usually bypass the operating system, removing the critical communication overhead normally involved, by providing direct user-level access to the network interface.

Figure 1.1 Architecture of Cluster Computing Systems

The cluster nodes can either work as individual computers or work collectively as an integrated computing resource. The cluster middleware is responsible for offering the illusion of a unified system image (Single System Image) and availability out of a collection of independent but interconnected computers. Parallel programming environments offer portable, efficient, and easy-to-use tools for the development of applications. They include message-passing libraries, debuggers, and profilers. Clusters also run resource management and scheduling software such as LSF (Load Sharing Facility) and CODINE (Computing in Distributed Networked Environments) [2]. The individual nodes of a cluster can have different hardware characteristics, and new nodes can be seamlessly integrated into existing clusters, thus making them easily scalable. Clusters make use of these hardware and software resources to execute high-performance applications and typically provide a very low cost/performance ratio.

1.3 Grid Computing

The massive growth of the Internet in recent years has encouraged many scientists to explore the possibility of harnessing idle CPU clock cycles and other unutilized computational resources spread across the Internet. The idea was to harness idle CPU cycles and other computational resources and provide a unified computational resource to those in need of high-performance computation. This led to the notion of "grid computing". The concept of grid computing is similar to that of electrical grids. In electrical grids, power generation stations in different geographical locations are integrated to provide a unified power resource for consumers to plug into on demand. In the same fashion, computational grids allow users to plug into a virtual unified resource for their computational needs.

1.3.1 Architecture of a Grid Computing System

Grid systems are highly complex and comprise a host of integrated hardware and software features, as illustrated in Figure 1.2. The following sub-sections describe the major components of a grid.

Figure 1.2 Grid Architecture

1.3.1.1 Interface

Grid systems are designed to shield their internal complexities from users. User interfaces can come in many forms and can be application specific. Typically, grid interfaces are similar to web portals. A grid portal provides an interface to launch applications which would use its resources and services. Through this interface, users see the grid as a virtual computing resource.

1.3.1.2 Security

Security is a critical issue in grid computing.
A grid environment should consist of mechanisms to provide security, including authentication, authorization, data encryption, etc. Most grid implementations include an OpenSSL [4] implementation. They also provide a single sign-on mechanism, so that once a user is authenticated, a proxy certificate is created and used while performing actions within the grid.

1.3.1.3 Broker

A grid system typically consists of a diverse range of resources spread across the Internet. When a user desires to launch an application through the portal, the system needs to identify the appropriate resources to use, depending on the application and other parameters provided by the user. This task is accomplished by the grid broker system. The broker makes use of the services provided by the Grid Information Service (GIS), which is also known as the Monitoring and Discovery Service (MDS). It provides information about the available resources within the grid and their status. Upon identifying available resources, the broker needs to choose the most viable resource based on the requirements of the user. Resource brokering is a major research topic in grid computing and forms the focus of what is known as "G-Commerce".

1.3.1.4 Scheduler

Applications requiring the services of a grid could be one large module or could consist of several independent modules with or without data dependencies. Depending on the nature of the application, the scheduler must be able to effectively map the application or its components onto the best available resource. Most grid schedulers use different algorithms to deal with different cases. Grid schedulers have a number of algorithms to choose from, depending on scheduling parameters and user requirements. However, the most common criterion for schedulers is to minimize the turnaround time of an application.

1.3.1.5 Data Management

Scheduling high-performance applications onto grids constantly requires movement of data files from one node to another. The grid environment should provide a reliable and secure means for data exchange. The data management component of the grid system commonly uses the Grid Access to Secondary Storage (GASS) [4] component to move data files across the grid. GASS incorporates GridFTP, a protocol built on top of the standard FTP protocol of the TCP/IP suite. The GridFTP protocol adds a layer of encryption and other security features on top of standard FTP.

1.3.1.6 Job Management

This component includes the core set of services that perform the actual work in a grid environment. It provides services to launch a job on a particular resource, check its status, and retrieve results when it is complete. The component is also responsible for ensuring fault tolerance.

1.4 Overview of Task Scheduling in Heterogeneous Computing Environments

There are a number of reasons why scheduling programs, or the tasks that comprise the programs, is important. For users, it is important that the programs they wish to run are executed as quickly as possible (faster turnaround times). On the other hand, the owners of computing resources would ideally wish to optimize their machine utilization. These two objectives, faster turnaround times and optimal resource utilization, are not always complementary. Owners are not usually willing to let a single user utilize all their resources (especially in grid systems), and users are not usually willing to wait an arbitrarily long time before they are allowed access to particular resources.
Scheduling, from both points of view, is the process by which both the users and the owners achieve a satisfactory quality of service.

1.4.1 Scheduling Strategies

There are different approaches to the selection of processors onto which the sub-tasks of a program would be placed for execution. In the static model, each sub-task is assigned to a processor before the execution of a program commences. In the dynamic scheduling model, sub-tasks are assigned to processors at run-time. In the hybrid scheduling model, a combination of both static and dynamic scheduling strategies is used.

1.4.1.1 Static Scheduling

In the static model, all sub-tasks of a program are assigned once to a processing element. An estimate of the cost of computation can be made a priori. Heuristic models for static task scheduling are discussed in Chapter 2. One of the main benefits of the static model is that it is easier to implement from a scheduling and mapping point of view. Since the mapping of tasks is fixed a priori, it is easy to monitor the progress of computation. Likewise, estimating the cost of jobs is simplified. Processors can give estimates of the time that will be spent processing the sub-tasks. On completion of the program, they can be instructed to supply the precise time that was spent in processing. This facilitates updating of actual running costs and could be used in making performance estimates for new programs. The static scheduling model has a few drawbacks. The model is based on an approximate estimation of processor execution times and inter-processor communication times. The actual execution time of a program may often vary from the estimated execution time and may sometimes result in a poorly generated schedule. This model also does not consider node and network failures.

1.4.1.2 Dynamic Scheduling

Dynamic scheduling operates on two levels: a local scheduling strategy and a load distribution strategy. The load distribution strategy determines how tasks would be placed on remote machines. It uses an information policy to determine the kind of information that needs to be collected from each machine, the frequency at which it needs to be collected, and the frequency at which it needs to be exchanged among different machines. In a traditional dynamic scheduling model, the sub-tasks of an application are assigned to processors based on whether they can provide an adequate quality of service. The meaning of quality of service is dependent on the application. Quality of service could refer to whether an upper bound can be placed on the time a task must wait before it starts execution, the minimum time in which the task can complete its execution without interruption, or the relative speed of the processor as compared to other processors in the system. If a processor is assigned too many tasks, it may invoke a transfer policy to check whether it needs to transfer tasks to other nodes and, if so, to which ones. The transfer of tasks could be sender-initiated or receiver-initiated. In the latter case, a processor that is lightly loaded will voluntarily advertise its services to heavily loaded nodes. The main advantage of dynamic scheduling over static scheduling is that the scheduling system need not be aware of the run-time behavior of the application before execution. Dynamic scheduling is particularly useful in systems where the goal is to optimize processor utilization as opposed to minimizing the turnaround times.
Dynamic scheduling is also more efficient and more fault-tolerant than static scheduling.

1.4.1.3 Hybrid Static-Dynamic Scheduling

Static scheduling algorithms are easy to implement and usually have a low schedule-generating cost. However, since static scheduling is based on estimated execution costs, it may not always produce the best schedules. On the other hand, dynamic scheduling uses run-time information in the scheduling process and generates better schedules, but it suffers from very high running costs and may be prohibitively expensive when trying to schedule very large applications with tens of thousands of sub-tasks. Since both scheduling techniques have their own advantages, researchers have tried to combine them to create a hybrid scheduling technique. Usually in hybrid scheduling, the initial schedule is obtained using static scheduling and the sub-tasks are mapped onto the respective processors. However, after execution commences, the processors use run-time information to check whether tasks could be mapped to better processors to yield a better makespan. The running cost of a hybrid scheduling algorithm is greater than that of static scheduling algorithms, but significantly lower than that of dynamic-only scheduling algorithms.

1.5 NP-complete Problems

Computational problems can be broadly classified into two categories: tractable problems and intractable problems [3]. Tractable problems are the ones whose worst-case running time, or time complexity, is bounded by O(n^k), where n is the input size of the problem and k is a constant. These problems are also known as "polynomial-time problems" since they can be solved in polynomial time. Intractable problems are ones that cannot be solved in polynomial time; they take super-polynomial time to solve. However, there is a class of problems whose status is unknown to this day. These problems are known as the "NP-complete problems". For these problems, no polynomial-time solution has yet been discovered, nor has anyone been able to prove a super-polynomial lower bound for them [3]. Many computer scientists believe that NP-complete problems are intractable. This is mainly because there has been no success in devising a polynomial-time solution to any of the existing NP-complete problems so far, and if a polynomial-time solution is devised for one NP-complete problem, mathematically a polynomial-time solution can be devised for all NP-complete problems.

Algorithm designers need to understand the basics and importance of NP-complete problems. If designers can prove that a problem is NP-complete, then there is a good chance that the problem is intractable. If a problem is intractable, it is better to design an approximation algorithm instead of an exact algorithm. The task scheduling problems that form the focus of this dissertation are well-known NP-complete problems [3]. We devise approximation algorithms, or heuristics, to deal with various cases of the task-scheduling problem, which forms the focus of this research.

1.6 Research Objectives

In this dissertation, we intend to propose new algorithms for scheduling tasks in heterogeneous computing systems. In Chapter 2 we provide a comprehensive literature review of the existing work in the area of task scheduling in heterogeneous computing systems.
In Chapter 3, we propose a new algorithm, the Heterogeneous Critical Node First (HCNF) algorithm, to schedule a parallel application modeled by a Directed Acyclic Graph (DAG) onto a network of heterogeneous processing elements. In Chapter 4, we propose a new low-complexity algorithm, the Heterogeneous Largest Task First (HLTF) algorithm, to schedule the independent tasks of a meta-task onto a network of heterogeneous processors. In Chapter 5, we propose a new algorithm, the Earliest Finish Time with Dispatch Time (EFT-DT) algorithm, to schedule a set of independent tasks of a meta-task onto a network of heterogeneous processors while also considering their dispatch times. In Chapter 6, we provide concluding remarks and make suggestions for future research in this area.

CHAPTER 2
LITERATURE REVIEW

Among the problems related to task scheduling in heterogeneous computing environments, scheduling a parallel application represented by a directed acyclic graph (DAG) to minimize the overall execution time (makespan) and scheduling a parallel application represented by a meta-task (a set of independent tasks) to minimize the makespan are the most important and most often researched ones. This chapter defines the two problems and surveys related research work.

2.1 Scheduling Parallel Applications Represented by Directed Acyclic Graphs onto Heterogeneous Computing Systems to Minimize the Makespan

Many parallel applications consist of sub-tasks with precedence constraints and can be modeled by directed acyclic graphs. This section discusses the problem of scheduling a parallel application represented by a DAG onto a network of heterogeneous processors to minimize its makespan and reviews related research work.

2.1.1 Directed Acyclic Graphs

A DAG is represented by G = {V, E, W, C}. V is the set of n nodes {n_1, n_2, n_3, n_4, ...}. E is the set of directed edges of the form (n_i, n_j), which represents an edge directed from node n_i to node n_j. W is the set of node weights of the form w_i, where w_i denotes the weight of node n_i. C is the set of edge weights of the form c_i,j, where c_i,j denotes the weight of the edge (n_i, n_j). A DAG is a graph without a cycle (a directed path from a node onto itself). The set of nodes in a DAG which have an edge directed towards a node n_i are called its predecessor nodes and are denoted by PRED(n_i). Likewise, the set of nodes which have a directed edge from a node n_i are called its successor nodes and are denoted by SUCC(n_i). Nodes in a DAG that do not have a predecessor are called start nodes and nodes that do not have a successor are called exit nodes. blevel(n_i) is the bottom level of n_i and is the length of the longest path from n_i to any exit node, including the weight of n_i. The length of a path in a DAG is the sum of its node and edge weights. tlevel(n_i) is the top level of n_i and is the length of the longest path from a start node to node n_i, excluding the weight of n_i. The longest path in a DAG is called the critical path. A DAG may have multiple critical paths. A sample DAG is illustrated in Figure 2.1. The node weights are to the right of each node and the edge weights are to the left of each edge. Table 2.1 provides the table of values for the sample DAG.
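As a concrete illustration of these definitions, the following is a minimal sketch of how tlevel and blevel can be computed for a DAG stored as adjacency lists. The function and variable names (topological_order, tlevel_blevel, succ, w, c) are illustrative only and are not part of the dissertation; the example DAG at the end is hypothetical and unrelated to G_1.

```python
def topological_order(succ):
    """Return the nodes of an acyclic graph in topological order (Kahn's algorithm)."""
    indeg = {i: 0 for i in succ}
    for i in succ:
        for j in succ[i]:
            indeg[j] += 1
    ready = [i for i in succ if indeg[i] == 0]    # start nodes
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order

def tlevel_blevel(succ, w, c):
    """tlevel/blevel as defined in Section 2.1.1: the length of a path is the sum of
    its node and edge weights; tlevel excludes the node's own weight, blevel includes it."""
    order = topological_order(succ)
    tl = {i: 0.0 for i in succ}
    for i in order:                                # forward pass over the DAG
        for j in succ[i]:
            tl[j] = max(tl[j], tl[i] + w[i] + c[(i, j)])
    bl = {i: float(w[i]) for i in succ}
    for i in reversed(order):                      # backward pass over the DAG
        if succ[i]:
            bl[i] = w[i] + max(c[(i, j)] + bl[j] for j in succ[i])
    return tl, bl

# Tiny hypothetical 4-node example (not G_1): edges 1->2, 1->3, 2->4, 3->4.
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
w = {1: 10, 2: 20, 3: 5, 4: 10}
c = {(1, 2): 4, (1, 3): 2, (2, 4): 6, (3, 4): 1}
tl, bl = tlevel_blevel(succ, w, c)                 # bl[1] = 50 is the critical-path length
```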
2.1.2 Problem Statement

The objective is to schedule a parallel application represented by a DAG onto a network of heterogeneous processors so as to minimize its overall execution time. Node weights in a DAG represent the average execution times of the nodes over all the processors in the target execution system. Edges represent precedence constraints between nodes. An edge (n_i, n_j) indicates that node n_j cannot start execution until n_i completes execution and n_j receives all the required data from it. Edge weights represent the time required to transfer the required data.

Figure 2.1 A sample DAG, G_1

Table 2.1 Table of values for G_1
n_i | PRED(n_i) | SUCC(n_i) | tlevel(n_i) | blevel(n_i)
1 | {null} | {2, 3, 4, 5, 6} | 0 | 108.01
2 | {1} | {8, 9} | 31 | 77.01
3 | {1} | {7} | 25 | 80
4 | {1} | {8, 9} | 22 | 81.34
5 | {1} | {9} | 24 | 69
6 | {1} | {8} | 27 | 63.34
7 | {3} | {10} | 62.33 | 42.67
8 | {2, 4, 6} | {10} | 66.67 | 35.67
9 | {2, 4, 5} | {10} | 67.67 | 44.34
10 | {7, 8, 9} | {null} | 97.34 | 14.67

The target execution system consists of a finite number of heterogeneous processors connected by a high-speed network. Communication among processors is assumed to be contention-less. Computation and communication are assumed to take place simultaneously. Node execution is assumed to be non-preemptive, meaning that nodes, once scheduled on a processor, cannot be removed (preempted) and scheduled on other processors. If a DAG has multiple start nodes, a dummy start node with a zero node weight is added, and zero-weight communication edges are added from the dummy start node to the multiple start nodes. Likewise, if a DAG has multiple exit nodes, a dummy exit node is added. The make-span of a DAG is the time difference between the commencement of execution of the start node and the completion of execution of the exit node. The heterogeneous DAG scheduling problem is NP-complete [28] and can be formally defined as: schedule the nodes of a DAG representing a parallel application onto a network of heterogeneous processors such that all the data precedence constraints are satisfied and the overall execution time of the DAG is minimized. The following sections survey existing research related to this problem.

2.1.3 The Best Imaginary Level Algorithm

The Best Imaginary Level (BIL) algorithm [22] assigns node priorities based on the best imaginary level of each node. At each scheduling step, a free node with the highest priority is selected and mapped onto a processor based on a criterion described below. Table 2.2 defines the terms used in BIL and Figure 2.2 lists the algorithm. BIL(n_i, p_j) is the best imaginary level of node n_i on processor p_j. It is the length of the longest path in the DAG beginning with n_i, assuming n_i is mapped onto p_j, and is recursively defined as:

BIL(n_i, p_j) = w_i,j + max_{n_k ∈ SUCC(n_i)} [ min( BIL(n_k, p_j), min_{l ≠ j} ( BIL(n_k, p_l) + c_i,k ) ) ]

The BIL of a node is adjusted to its basic imaginary make-span (BIM) as follows:

BIM(n_i, p_j) = BIL(n_i, p_j) + T_Available[p_j]
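The recursion and the adjustment above can be read directly as code. The following is a hedged sketch of the BIL, BIM and revised BIM computations as reconstructed here, not the authors' reference implementation; w[(i, j)] is the execution time of node i on processor j, c[(i, k)] the transfer time along edge (i, k), and t_available[j] corresponds to T_Available[p_j] in Table 2.2 below.

```python
def bil_table(succ, w, c, procs):
    """BIL(n_i, p_j) = w[i, j] + max over successors n_k of
       min( BIL(n_k, p_j), min over l != j of BIL(n_k, p_l) + c[i, k] )."""
    memo = {}
    def bil(i, j):
        if (i, j) not in memo:
            best = 0.0
            for k in succ[i]:
                stay = bil(k, j)                   # successor stays on p_j: no transfer cost
                move = min((bil(k, l) + c[(i, k)] for l in procs if l != j),
                           default=float('inf'))   # successor moves to another processor
                best = max(best, min(stay, move))
            memo[(i, j)] = w[(i, j)] + best
        return memo[(i, j)]
    return {(i, j): bil(i, j) for i in succ for j in procs}

def bim(bil_ij, t_available):
    """BIM(n_i, p_j) = BIL(n_i, p_j) + T_Available[p_j]."""
    return {(i, j): v + t_available[j] for (i, j), v in bil_ij.items()}

def bim_star(bim_ij, w, k, m):
    """Revised BIM used for processor selection: BIM* = BIM + w[i, j] * max(k/m - 1, 0),
    where k is the number of free nodes and m the number of processors."""
    return {(i, j): v + w[(i, j)] * max(k / m - 1, 0) for (i, j), v in bim_ij.items()}
```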
Table 2.2 Definition of terms used in BIL
N = {n_1, n_2, n_3, ...}: set of nodes in the DAG, |N| = n
P = {p_1, p_2, p_3, ...}: set of processors, |P| = m
w_i,j: time required to execute n_i on p_j
c_i,j: time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
BIL(n_i, p_j) = w_i,j + max_{n_k ∈ SUCC(n_i)} [ min( BIL(n_k, p_j), min_{l ≠ j} ( BIL(n_k, p_l) + c_i,k ) ) ]
T_Available[p_j]: time at which processor p_j completes the execution of all the nodes previously assigned to it
BIM(n_i, p_j) = BIL(n_i, p_j) + T_Available[p_j]
k: number of free nodes at a scheduling step
BIM*(n_i, p_j) = BIM(n_i, p_j) + w_i,j * max(k/m - 1, 0)

If k is the number of free nodes (nodes whose predecessors have completed execution) at a scheduling step, the priority of a free node is the k-th smallest BIM value; if the k-th smallest BIM value is undefined, the largest finite BIM value becomes its priority. If two or more nodes have the same priority, ties are broken randomly. At each scheduling step, the free node with the highest priority is selected for mapping. If k is greater than the number of processors, node execution times become more important than the communication overhead. On the contrary, if k is less than the number of available processors, node execution times become less important. The BIM value of the selected node is revised to incorporate this factor as follows:

BIM*(n_i, p_j) = BIM(n_i, p_j) + w_i,j * max(k/m - 1, 0)

The processor which provides the highest revised BIM value for the node is selected. If more than one processor provides the same revised BIM value, the processor that maximizes the sum of the revised BIM values of all the other nodes is selected. The time complexity of the algorithm is O(n^2 + m log m).

BIL Algorithm
  ReadyTaskList ← start node
  While ReadyTaskList is NOT empty
    k ← |ReadyTaskList|  // number of free nodes
    For all n_i in ReadyTaskList and p_j in P
      Compute BIM(n_i, p_j)
    End For
    Priority of n_i ← k-th smallest BIM value, or the largest finite BIM value if the k-th smallest value is undefined
    n_t ← node in ReadyTaskList with the highest priority
    For all p_j in P
      Compute BIM*(n_t, p_j)  // revised BIM
    End For
    p_fav ← processor that provides the highest revised BIM value for n_t
    Map n_t on p_fav
    ReadyTaskList ← ReadyTaskList - n_t + free nodes (if any)
  End While
End BIL
Figure 2.2 The BIL algorithm

2.1.4 The Generalized Dynamic Level Algorithm

The Generalized Dynamic Level (GDL) algorithm [28] assigns node priorities based on their generalized dynamic levels. A number of factors are incorporated in the calculation of the generalized dynamic level and are explained next. The definition of terms used in GDL is listed in Table 2.3 and the algorithm is listed in Figure 2.3. SL(n_i) is the static level of a node n_i and is the largest sum of the median execution times of all the nodes from node n_i to an exit node along any path in the DAG. DL(n_i, p_j) = SL(n_i) - EST(n_i, p_j) + Δ(n_i, p_j), where Δ(n_i, p_j) = w*(n_i) - w_i,j, is the dynamic level (DL) of node n_i on processor p_j. It indicates how well the node and the processor are matched for execution. Even though DL(n_i, p_j) indicates how well n_i and p_j are matched, it does not indicate how well the descendents of n_i are matched with p_j. D(n_i) is the descendent of node n_i to which n_i passes the maximum data. F(n_i, D(n_i), p_j) = d(n_i, D(n_i)) + min_{k ≠ j} E(D(n_i), p_k) is defined to indicate how quickly D(n_i) can be completed on a processor other than p_j if node n_i is executed on processor p_j.
The Descendent Consideration (DC) term is defined as:

DC(n_i, p_j) = w*(D(n_i)) - min{ E(D(n_i), p_j), F(n_i, D(n_i), p_j) }

Table 2.3 Definition of terms used in GDL
N = {n_1, n_2, n_3, ...}: set of nodes in the DAG, |N| = n
P = {p_1, p_2, p_3, ...}: set of processors, |P| = m
w_i,j: execution time of node n_i on p_j
c_i,j: data transfer time from node n_i to n_j
w*(n_i): median execution time of n_i over all the processors
SL(n_i): largest sum of the median execution times of all the nodes from node n_i to an exit node along any path in the DAG
Δ(n_i, p_j) = w*(n_i) - w_i,j
EST(n_i, p_j): earliest start time of n_i on p_j
DL(n_i, p_j) = SL(n_i) - EST(n_i, p_j) + Δ(n_i, p_j)
D(n_i): descendent node of n_i to which n_i passes the maximum data
d(n_i, D(n_i)): time required to transfer data from n_i to D(n_i)
E(D(n_i), p_k): time required to execute D(n_i) on processor p_k
F(n_i, D(n_i), p_j) = d(n_i, D(n_i)) + min_{k ≠ j} E(D(n_i), p_k)
DC(n_i, p_j) = w*(D(n_i)) - min{ E(D(n_i), p_j), F(n_i, D(n_i), p_j) }
C(n_i) = DL(n_i, p_pref) - max_{k ≠ pref} DL(n_i, p_k)  (p_pref is the processor on which n_i obtains the maximum DL)
GDL(n_i, p_j) = DL(n_i, p_j) + DC(n_i, p_j) + C(n_i)

The preferred processor of a node is the processor which maximizes its dynamic level (DL). The cost of not scheduling a node on its preferred processor is defined as C(n_i) = DL(n_i, p_pref) - max_{k ≠ pref} DL(n_i, p_k), where p_pref is the preferred processor. The combination of the DL, the Descendent Consideration (DC) term and the cost incurred in not scheduling a node on its preferred processor is used to define the Generalized Dynamic Level (GDL) of a node as GDL(n_i, p_j) = DL(n_i, p_j) + DC(n_i, p_j) + C(n_i). At each scheduling step, the algorithm selects, among the free nodes, the node and the processor with the maximum GDL. The time complexity is O(n^2 + m log m).

GDL Algorithm
  For all n_i in N
    Compute SL(n_i)
  End For
  ReadyTaskList ← start node
  While ReadyTaskList is NOT NULL do
    For all n_i in ReadyTaskList and p_j in P
      Compute DL(n_i, p_j)
      Compute DC(n_i, p_j)
      Compute C(n_i)
      GDL(n_i, p_j) ← DL(n_i, p_j) + DC(n_i, p_j) + C(n_i)
    End For
    Select the node-processor pair with the maximum GDL
    Update ReadyTaskList
  End While
End GDL
Figure 2.3 The GDL algorithm

2.1.5 The Levelized Min-Time Algorithm

In the Levelized Min-Time (LMT) algorithm [16], the input DAG is divided into k levels, numbered 0 to k-1, using the following rules. All the nodes in a level are independent of each other. Level 0 contains the start nodes and level k-1 contains the exit nodes.
For any level j, where 0 < j < k-1, nodes in level j can have incident edges from any of the nodes in levels 0 thru j-1. Additionally, there must be at least one node in level j with an edge incident from a node in level j-1. LMT maps the nodes one level at a time, starting from level 0. If the number of nodes at a given level is more than the number of processors in the target system, the smallest nodes (based on their average computation times) are merged until the number of nodes equals the number of processors. Nodes are then sorted in descending order of their average computation times. At each scheduling step, the largest node is mapped onto the processor that provides its minimum finish time. Table 2.4 defines the terms used in LMT and Figure 2.4 lists the algorithm.

Table 2.4 Definition of terms used in LMT
N = {n_1, n_2, n_3, ...}: set of nodes in the DAG, n = |N|
P = {p_1, p_2, p_3, ...}: set of processors, m = |P|
k: number of levels in the DAG
w_i,j: time required to execute n_i on p_j
c_i,j: time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]: time at which processor p_j completes the execution of all the nodes previously assigned to it
EST(n_i, p_j) = max( T_Available[p_j], max_{n_m ∈ PRED(n_i)} ( EFT(n_m, p_k) + c_m,i ) )
EFT(n_i, p_j) = EST(n_i, p_j) + w_i,j

LMT Algorithm
  Divide the input DAG into k levels (level 0 to level k-1)
  For levels 0 thru k-1 do
    num ← number of nodes in the current level
    If num > m
      Merge the smallest nodes in the current level until num = m
    End If
    ReadyTaskList ← nodes in the current level sorted in descending order of average node weights
    While ReadyTaskList is NOT NULL do
      n_i ← first node in the ReadyTaskList
      For all p_j in P
        Compute EST(n_i, p_j)
        EFT(n_i, p_j) ← EST(n_i, p_j) + w_i,j
      End For
      Map node n_i on the processor p_j which provides its least EFT
      Update T_Available[p_j]
      Update ReadyTaskList
    End While
  End For
End LMT
Figure 2.4 The LMT algorithm

2.1.6 The Heterogeneous Earliest Finish Time Algorithm

The Heterogeneous Earliest Finish Time (HEFT) algorithm [30] assigns node priorities based on the bottom level (blevel) of each node. The blevel of a node is the length of the longest path in the DAG from the node to the exit node.
29 Table 2.5 Definition of terms used in HEFT Term Definition N = {n 1 , n 2 , n 3 , n 4 , n 5 , n 6 ?.}//Set of nodes in the DAG, n=|N| P = {p 1 ,p 2 , p 3 , p 4 , p 5 , p 6 ?.}//Set of processors, m=|P| w i,j Time required to execute n i on p j c i,j Time required to transfer all the requisite data from n i to n j when they are scheduled on different processors priority(n i ) = blevel(n i ) T_Available[p j ] Time at which processor p j completes the execution of all the nodes previously assigned to it ,( i nEST ) j p Max(][_ jAvailableT , imkmnpredn cpnEFT im ,)( ),(max + ? )) ),( ji pnEFT = ),( , jiji pnESTw + HEFT Algorithm For all n i in N Compute blevel(n i ) End For ReadyTaskList ? Start Node While ReadyTaskList is NOT NULL do n i ? node in the ReadyTaskList with the maximum blevel For all p j in P Compute ,( i nEST ) j p ),( ji pnEFT ? ),( , jiji pnESTw + End For Map node n i on processor p j which provides its least EFT Update T_Available[p j ] and ReadyTaskList End While End HEFT Figure 2.5 The HEFT Algorithm 30 Table 2.6 Definition of terms used in CPOP Term Definition N = {n 1 , n 2 , n 3 , n 4 , n 5 , n 6 ?.}//Set of nodes in the DAG, n=|N| P = {p 1 ,p 2 , p 3 , p 4 , p 5 , p 6 ?.}//Set of processors, m=|P| w i,j Time required to execute n i on p j c i,j Time required to transfer all the requisite data from n i to n j when they are scheduled on different processors priority(n i ) = tlevel(n i )+ blevel(n i ) CP processor p j ? P which minimizes ? ?CPn ji i w , //CP is the critical path T_Available[p j ] Time at which processor p j completes the execution of all the nodes previously assigned to it ,( i nEST ) j p Max(][_ j pAvailableT , imkmnpredn cpnEFT im ,)( ),(max + ? )) ),( ji pnEFT = ),( , jiji pnESTw + CPOP traverses the DAG in a top down fashion to compute the tlevels and blevels of all the nodes. It identifies the critical path/s and marks the critical path nodes. The priority of each node is the sum of its tlevel and blevel . At each scheduling step, a free task with the highest priority is selected for mapping. Ties (if any) are broken randomly. A CP processor is defined as the processor that minimizes the overall execution time of the critical path assuming all the critical path nodes are mapped onto it. If the selected node is a critical path node, it is mapped onto the CP processor. Else, it is 31 mapped onto a processor that minimizes its EFT (like the HEFT algorithm). The time complexity is O(n 2 m). CPOP Algorithm For all n i in N Compute tlevel(n i ) and blevel(n i ) Identify the critical path/s and mark the critical path nodes priority(n i ) ? tlevel(n i )+ blevel(n i ) End For ReadyTaskList ? Start Node While ReadyTaskList is NOT NULL do n i ? node in the ReadyTaskList with the maximum priority If n i ?critical path Map n i on the CP processor Else For all p j in P Compute ,( i nEST ) j p ),( ji pnEFT ? ),( , jiji pnESTw + End For Map node n i on processor p j which provides its least EFT End If Update T_Available[p j ] Update ReadyTaskList End While End CPOP Figure 2.6 The CPOP algorithm 2.1.8 The Fast Critical Path Algorithm There are three steps involved in a typical static DAG scheduling algorithm: computation of node priorities, node selection, and processor selection. These steps contribute to the overall time complexity of the algorithm. The Fast Critical Path (FCP) 32 algorithm [24] tries to reduce the overall time complexity by reducing the complexity of the node selection and the processor selection steps. 
Table 2.7 Definition of terms used in FCP Term Definition N = {n 1 , n 2 , n 3 , n 4 , n 5 , n 6 ?.}//Set of nodes in the DAG, n=|N| P = {p 1 ,p 2 , p 3 , p 4 , p 5 , p 6 ?.}//Set of processors, m=|P| e Number of edges in the DAG w i,j Time required to execute n i on p j priority(n i ) = blevel(n i ) c i,j Time required to transfer all the requisite data from n i to n j when they are scheduled on different processors T_Available[p j ] Time at which processor p j completes the execution of all the nodes previously assigned to it ,( i nEST ) j p Max( ][_ jAvailableT , imkmnpredn cpnEFT im ,)( ),(max + ? )) ),( ji pnEFT = ),( , jiji pnESTw + Node Selection: FCP tries to reduce the complexity of the node selection process by restricting the size of the ReadyTaskList to m (The number of processors in the target execution system). Additional free nodes (if any) are stored in a FIFO queue. Node priorities are based on their blevels. At each scheduling step, a node with the highest priority is selected for mapping. By restricting the size of the ReadyTaskList to m, the time complexity of the node selection process would be O(n log m). 33 Processor Selection: The complexity of the processor selection step is reduced by restricting the choice to just two processors: the first processor that becomes free and the enabling processor (The processor which is the last to send a data item to a node). The authors [24] prove that the EFT of a node is always minimized by one of these two processors. The time complexity of the processor selection step would be reduced to O(nlogm+e). Of the two processors, the one which provides the least EFT for the selected node is chosen. The overall time complexity of FCP is O(nlogm+e). FCP Algorithm For all n i in N Compute tlevel(n i ) priority(n i ) ? blevel(n i ) End For ReadyTaskList ? Start Node AdditionalTaskList ? NULL //FIFO Queue While ReadyTaskList is NOT NULL do n i ? node in the ReadyTaskList with the maximum priority p 1 ? First processor in P to become free P 2 ? Enabling processor of n i Compute EST(n i , p 1 ) EFT(n i , p 1 )? EST(n i , p 1 )+ w i,1 Compute EST(n i , p 2 ) EFT(n i , p 1 )? EST(n i , p 2 )+ w i,2 Map node n i on processor p j which provides its least EFT Update T_Available[p j ] Update ReadyTaskList Update AdditionalTaskList (If applicable) End While End FCP Figure 2.7 The FCP algorithm 34 2.1.9 The Fast Load Balancing Algorithm The Fast Load Balancing (FLB) algorithm [24] is a variant of the FCP algorithm. The node selection complexity is reduced by limiting the number of nodes in the ReadyTaskList to m (number of processors). Additional free nodes, if any, are added to a FIFO list. As was discussed in previous section, the earliest start time for a node can be obtained on either the first processor to become free or a task?s enabling processor. For each node in the ReadyTaskList, the earliest start time of the node on the first processor to become free and the node?s enabling processor is calculated. Among the free nodes, the node with the minimum earliest start time is selected and mapped onto the corresponding processor. The overall time complexity of FLB is O(nlogm+e). 
Table 2.8 Definition of terms used in FLB Term Definition N = {n 1 , n 2 , n 3 , n 4 , n 5 , n 6 ?.}//Set of nodes in the DAG, n=|N| P = {p 1 ,p 2 , p 3 , p 4 , p 5 , p 6 ?.}//Set of processors, m=|P| w i,j Time required to execute n i on p j priority(n i ) = blevel(n i ) c i,j Time required to transfer all the requisite data from n i to n j when they are scheduled on different processors T_Available[p j ] Time at which processor p j completes the execution of all the nodes previously assigned to it ,( i nEST ) j p Max( ][_ jAvailableT , imkmnpredn cpnEFT im ,)( ),(max + ? )) ),( ji pnEFT = ),( , jiji pnESTw + 35 FLB Algorithm For all n i in N Compute tlevel(n i ) priority(n i ) ? tlevel(n i ) End For Readytasklist ? Start Node AdditionalTaskList ? NULL // FIFO queue While ReadyTaskList is NOT NULL For all n i in Readytasklist p 1 ? First processor in P to become free p 2 ? Enabling processor of n i Compute EST(n i , p 1 ) Compute EST(n i , p 2 ) End For Select n i with the least EST and map it onto the corresponding processor. Update T_Available[p j ] Update ReadyTaskList Update AdditionalTaskList (If applicable) End While End FLB Figure 2.8 The FLB algorithm 2.1.10 The Hybrid Re-mapper Algorithm Static scheduling algorithms use estimates of node execution times in the scheduling process. Estimates can be obtained by techniques such as code profiling and analytical benchmarking [21]. However, actual node execution times may sometimes vary largely from the estimated execution times and may result in a bad schedule. To mitigate this problem, the Hybrid Re-mapper [21] algorithm uses a combination of static mapping and the actual run-time values of node execution times. It tries to fine tune the schedule obtained by a static scheduling algorithm by making use of run-time values as 36 and when they are made available. The inputs to the algorithm are the DAG and the schedule obtained using a list based static scheduling heuristic. The input DAG is divided into k levels marked 0 thru k-1, such that nodes in a level do not have precedence constraints between one another. The start nodes are in level 0 and the exit nodes in level k-1. Node priorities are based on their blevels. Nodes in level 0 are mapped according to the static schedule. For levels 1 thru k-1, nodes at a level are considered for re-mapping as soon as the first node of the previous level starts execution. The node with the highest priority is re-mapped onto a processor that provides its least partial completion time (pct). In the calculation of partial completion times (see Table 2.9), available run time values (if any) are recursively used. If run time values are not available, statically obtained values are used. The algorithm is listed in Figure 2.9. The time complexity is O(n 2 ). Table 2.9 Definition of terms used in Hybird Re-mapper Term Definition N = {n 1 , n 2 , n 3 , n 4 , n 5 , n 6 ?.}//Set of nodes in the DAG, |N|=n P = {p 1 ,p 2 , p 3 , p 4 , p 5 , p 6 ?.}//Set of processors, |P|=m e i,j Time required to execute n i on p j in real time c i,j Time required to transfer all the requisite data from n i to n j when they are scheduled on different processors in real time )( inpriority =)( inblevel ips(n i ) Immediate predecessor set of node n i A[p j ] Time at which processor p j completes the execution of all the nodes previously assigned to it in real time dr(n i ) ),(max( ,)( kijinipsn pnpctc ij + ? 
Hybrid Re-Mapper Algorithm
Divide the input DAG into levels such that nodes in a level are independent of each other
k <- number of levels
Mark the levels 0 through k-1   // start nodes are in level 0 and exit nodes in level k-1
For all n_i in N
    priority(n_i) <- blevel(n_i)
End For
For all n_i in level 0
    Map n_i using the static schedule
End For
For levels 1 through k-1
    For all nodes in the current level
        n_i <- node with the highest priority
        For all p_j in P
            dr(n_i) <- max over n_m in ips(n_i) of ( c_m,i + pct(n_m, p_k) )
            pct(n_i, p_j) <- e_i,j + max( A[p_j], dr(n_i) )
        End For
        Map n_i onto the p_j that provides its least pct
        Update A[p_j]
    End For
End For
End Hybrid Re-mapper
Figure 2.9 The Hybrid Re-mapper algorithm

2.1.11 Performance Comparison
The performance of DAG scheduling algorithms depends on a number of factors such as the Communication to Computation Ratio (CCR) of the input DAG (the ratio of the sum of the edge weights to the sum of the node weights), the number of nodes, the processor speed variance, and so on. Because the running time of an algorithm becomes significant for large DAGs, it is desirable to have an algorithm with a good performance-complexity tradeoff. The most important metric used to compare DAG scheduling algorithms is the Schedule Length Ratio (SLR): the ratio of the overall execution time of the input DAG to the sum of the weights of the critical-path nodes on the fastest processor. Table 2.10 summarizes the relative performance of the algorithms discussed in the previous sections.

Table 2.10 Comparison of complexity and schedule length ratio of different algorithms
Algorithm A   Complexity        Schedule Length Ratio, L(A)
BIL           O(n^2 + p log p)  L(BIL) < L(GDL) by 20%
STDS          O(n^2)            L(STDS) < L(BIL) for CCRs between 0.2 and 1
FLB           O(n log p + e)    L(HEFT) < L(FLB) by 63% when processor speed variance is high; otherwise FLB performs equally well
FCP           O(n log p + e)    L(HEFT) < L(FCP) by 32% when processor speed variance is high; otherwise identical
HEFT          O(n^2 m)          HEFT better than GDL and LMT by 8% and 52% respectively

2.2 Scheduling a Set of Independent Tasks onto a Network of Heterogeneous Processors to Minimize the Overall Execution Time
2.2.1 Problem Statement
Independent tasks are tasks without communication or precedence constraints. A meta-task is a finite set of independent tasks, and the overall execution time (makespan) of a meta-task is the time required to complete the execution of all the tasks in it. The target execution system consists of a finite number of heterogeneous processors connected by a high-speed network, so tasks in a meta-task may have different execution times on different processors. Communication among processors is assumed to be contention-free, and computation and communication are assumed to take place simultaneously. Task execution is assumed to be non-preemptive: once scheduled on a processor, a task cannot be removed (preempted) and rescheduled on another processor. The objective of the independent-task scheduling problem is to schedule the independent tasks of a meta-task onto a network of heterogeneous processors such that the overall execution time of the meta-task is minimized. In the following sections, existing research work in this area is surveyed.

2.2.2 The Min-Min and Max-Min Algorithms
In the Min-Min algorithm [15], the earliest finish time (EFT) of every task over all the processors is calculated.
The task with the least EFT is selected and scheduled onto the processor on which that minimum EFT was obtained, and the process is repeated until all the tasks in the meta-task are scheduled. The time complexity is O(s^2 m), where s is the number of tasks in the meta-task and m the number of processors in the target system. The Max-Min algorithm is similar to Min-Min; however, instead of selecting the task with the least EFT, the task with the highest EFT is selected. Min-Min is detailed in Figure 2.11 and the definitions of the terms used are listed in Table 2.11.

Table 2.11 Definition of terms used in Min-Min
T: {t1, t2, t3, ...}, the meta-task; s = |T|
P: {p1, p2, p3, ...}, the set of processors; m = |P|
w_i,j: time required to execute t_i on p_j
EST(t_i, p_j): time at which all the tasks previously assigned to p_j complete execution
EFT(t_i, p_j) = EST(t_i, p_j) + w_i,j

Min-Min Algorithm
While T is NOT NULL do
    For all t_i in T and p_j in P
        Compute EFT(t_i, p_j)
    End For
    t_min <- task with the least EFT
    p_min <- processor providing the least EFT
    Map t_min on p_min
    T <- T - {t_min}
End While
End Min-Min
Figure 2.11 The Min-Min algorithm

2.2.3 The Sufferage Algorithm

Table 2.12 Definition of terms used in Sufferage
T, s, P, m, w_i,j, EST(t_i, p_j) and EFT(t_i, p_j) are as defined in Table 2.11
FT1: earliest finish time of t_i over all processors
FT2: second earliest finish time of t_i over all processors
Sufferage(t_i) = FT2 - FT1

The Sufferage algorithm [15] is based on the idea that a better mapping can be obtained by assigning a processor to the task that would "suffer" the most if it were not assigned to that processor. The sufferage of a task t_i is the difference between its second earliest finish time and its earliest finish time. Tasks are considered for mapping in an arbitrary order. At each scheduling step, the earliest finish times of the task over all the processors are computed and the processor that provides the minimum earliest finish time is determined. If a task is already tentatively assigned to that processor, the sufferages of the task under consideration and of the previously assigned task are compared. If the sufferage of the task under consideration is greater, the previously assigned task is removed, the given task is tentatively assigned to the processor, and the removed task is re-inserted into the meta-task. If, instead, the sufferage of the already-assigned task is greater, the given task is re-inserted into the meta-task and is considered for mapping in the next iteration. At the end of the iteration, the tasks that are tentatively mapped onto the processors are permanently mapped. The steps are repeated until all the tasks in the meta-task are mapped, and the time complexity is O(s^2 m). Table 2.12 provides the definitions of the terms used in Sufferage and Figure 2.12 lists the algorithm.

Sufferage Algorithm
T1 <- temporary set of tasks; T1 <- NULL
While T is NOT NULL do
    For all t_i in T and p_j in P
        Compute EFT(t_i, p_j)
        p_temp <- processor on which t_i has the least EFT
        If a task is already assigned to p_temp then
            t_prev <- task already assigned to p_temp
            If Sufferage(t_i) > Sufferage(t_prev) then
                Remove t_prev from p_temp
                Tentatively assign t_i to p_temp
                T <- T - {t_i}
                T1 <- T1 + {t_prev}
            Else
                T1 <- T1 + {t_i}
            End If
        Else
            Tentatively assign t_i to p_temp
            T <- T - {t_i}
        End If
    End For
    T <- T + T1
    T1 <- NULL
End While
End Sufferage
Figure 2.12 The Sufferage algorithm
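The greedy core of Min-Min, and the point at which Sufferage departs from it, can be summarized in a short Python sketch. The container names and return format below are illustrative, not code from the surveyed papers; Sufferage replaces the task-selection rule in the commented step with the sufferage comparison described above.

def min_min(tasks, procs, w):
    """Min-Min sketch: w[(t, p)] is the execution time of task t on processor p.
    Returns a list of (task, processor, start, finish) tuples in scheduling order."""
    ready = {p: 0.0 for p in procs}      # EST(t, p): when p finishes its current queue
    schedule = []
    unscheduled = set(tasks)
    while unscheduled:
        # Earliest finish time of every remaining task on every processor;
        # Min-Min commits the (task, processor) pair with the smallest EFT.
        # (Sufferage instead keeps the best and second-best EFT per task and
        #  resolves conflicts on a processor by comparing their difference.)
        eft, t, p = min(((ready[q] + w[(t, q)], t, q)
                         for t in unscheduled for q in procs),
                        key=lambda x: x[0])
        schedule.append((t, p, ready[p], eft))
        ready[p] = eft
        unscheduled.remove(t)
    return schedule

Because the inner minimum scans every remaining task on every processor at each step, this sketch exhibits the same O(s^2 m) behaviour noted above for both Min-Min and Sufferage.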
CHAPTER 3
THE HETEROGENEOUS CRITICAL NODE FIRST (HCNF) ALGORITHM

This chapter presents a new task-duplication based static scheduling heuristic, the Heterogeneous Critical Node First (HCNF) algorithm, for the DAG scheduling problem stated in Section 2.1.2. The chapter is organized as follows: Section 3.1 discusses the key concepts of the heterogeneous DAG scheduling problem that motivated the development of HCNF, Section 3.2 discusses the algorithm in detail, Section 3.3 provides a running trace of HCNF, Section 3.4 presents the simulation study, and Section 3.5 provides concluding remarks.

3.1 Motivation
The length of the critical path in a DAG provides a lower bound on its overall execution time [30]. Thus, minimizing the execution time of the critical-path nodes helps minimize the overall execution time of the DAG; one way to achieve this is to assign top priority to critical-path nodes at each scheduling step. A DAG may have one or more free nodes that are ready to be mapped onto processors at each scheduling step. In heterogeneous computing environments, local optimization can be obtained at each scheduling step by selecting the largest task among the free nodes and mapping it onto the processor that minimizes its finish time. Nodes must wait until they receive all the required data from their predecessors before they can start execution; the predecessor node that is the last to send data to a given node is called its favorite predecessor. This wait can potentially be shortened by duplicating the execution of favorite predecessors in idle processor time: duplicating favorite predecessors can suppress communication times and lead to earlier start times for the nodes.

We propose a static scheduling algorithm, the Heterogeneous Critical Node First (HCNF) algorithm, that incorporates these strategies in the scheduling process. At each scheduling step, among the free nodes, HCNF assigns top priority to a critical-path node and schedules it onto the processor that minimizes its finish time; in the absence of a critical-path node, HCNF picks the largest free node and assigns it to the processor that minimizes its finish time. HCNF also explores the possibility of duplicating favorite predecessors in idle processor time to obtain earlier start times. The algorithm is explained next.

3.2 The HCNF Algorithm
HCNF begins by identifying the critical path(s) of the input DAG; nodes belonging to the critical path(s) are marked as CP nodes. The algorithm starts the mapping process by mapping the start node onto the processor that provides its fastest execution time. If the fastest execution time is obtained on more than one processor, the processor with the least average execution time over all nodes is selected. (The average execution time of a processor over all nodes is the sum of the execution times of all the nodes in the DAG on that processor divided by the number of nodes.) Among the immediate successors of the start node, the CP node is inserted at the beginning of the ReadyTaskList, and the remaining nodes are added to the ReadyTaskList in decreasing order of their node weights. At each scheduling step, the first node of the ReadyTaskList is selected for mapping.
Table 3.1 defines the terms used in HCNF and Figure 3.1 lists the algorithm.

Table 3.1 HCNF: definition of terms
N: {n1, n2, n3, ...}, the set of nodes in the DAG; n = |N|
P: {p1, p2, p3, ...}, the set of processors; m = |P|
w_i,j: time required to execute n_i on p_j
c_i,j: time required to transfer all the requisite data from n_i to n_j when they are scheduled on different processors
T_Available[p_j]: time at which processor p_j completes the execution of all the nodes previously assigned to it
pred(n_i): set of immediate predecessors of task n_i
n_en: favorite predecessor (the predecessor that is the last to send data to the given node)
EST1(n_i, p_j) = max( T_Available[p_j], max over n_m in pred(n_i) of ( EFT(n_m, p_k) + c_m,i ) )
EST2(n_i, p_j) = max( max( T_Available[p_j], EST(n_en, p_j) ) + w_en,j , max over n_m in pred(n_i), n_m != n_en, of ( EFT(n_m, p_k) + c_m,i ) )
EST(n_i, p_j) = min( EST1(n_i, p_j), EST2(n_i, p_j) )
EFT(n_i, p_j) = EST(n_i, p_j) + w_i,j

EST2(n_i, p_j) is the earliest start time of node n_i on processor p_j assuming that n_en (the favorite predecessor of n_i) is duplicated on p_j; EST1(n_i, p_j) is the earliest start time of n_i on p_j without duplicating the favorite predecessor; EFT(n_i, p_j) is the earliest finish time of n_i on p_j. At each scheduling step, EFT(n_i, p_j) of the selected node n_i is computed over all processors, and fproc(n_i) is the processor on which the least EFT is obtained. If EST2(n_i, p_j) was used in the computation of the least EFT, n_en is duplicated on fproc(n_i); otherwise n_en is not duplicated. n_i is mapped onto fproc(n_i), removed from the ReadyTaskList, and its successors that become free as a result are added to the list. The nodes in the ReadyTaskList are then realigned as follows: the CP node is inserted at the first position (if there are multiple CP nodes, they are sorted in descending order of their node weights and inserted at the beginning of the list), and all the remaining (non-CP) nodes are sorted in decreasing order of their node weights. The first node in the ReadyTaskList is selected for mapping and scheduled onto the processor that provides its least EFT, as described above, and the process is repeated until all the nodes in the DAG are scheduled. HCNF takes O(n^2) time to find the critical path, O(nm) to calculate the EFTs, and O(n log n) to sort the tasks in descending order using merge sort; ignoring the lower-order terms, the overall time complexity is O(n^2).
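The duplication test at the heart of HCNF can be made concrete with the following Python sketch, which computes EST1, EST2, and the duplication decision for one candidate (node, processor) pair according to the Table 3.1 formulas. It assumes the node has at least one predecessor (the start node is mapped separately), and the EST of the favorite predecessor on the candidate processor is supplied by the caller; all names are illustrative, not the dissertation's code.

def hcnf_start_times(task, proc, preds, fav_pred, est_fav_on_proc,
                     eft, mapped_on, comm, t_available, w):
    """EST1 (no duplication), EST2 (favorite predecessor duplicated) and the
    duplication decision for `task` on `proc`. eft/mapped_on describe nodes that
    are already scheduled, comm[(a, b)] is the edge weight c_a,b, and w[(n, p)]
    is the execution time of n on p."""
    def arrival(n):            # time at which n's data is available on `proc`
        return eft[n] + (0 if mapped_on[n] == proc else comm[(n, task)])

    # EST1: wait for the processor to be free and for every predecessor's data.
    est1 = max([t_available[proc]] + [arrival(n) for n in preds])

    # EST2: duplicate the favorite predecessor on `proc`; only the remaining
    # predecessors' data still has to be communicated.
    dup_finish = max(t_available[proc], est_fav_on_proc) + w[(fav_pred, proc)]
    est2 = max([dup_finish] + [arrival(n) for n in preds if n != fav_pred])

    duplicate = est2 < est1
    est = min(est1, est2)
    return est, est + w[(task, proc)], duplicate   # EST, EFT, duplicate n_en?

Calling this for every processor and keeping the smallest returned EFT reproduces the fproc/duplication choice used in the running trace that follows.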
3.3 Running Trace of HCNF
The working of HCNF is illustrated with the sample DAG G1 shown in Figure 3.2. The target execution system consists of three processors: p1, p2 and p3. Node execution times are listed in Table 3.2; the node weights in Figure 3.2 represent average execution times. Run-time values for each step of HCNF are shown in Table 3.3. The Gantt chart for the final schedule is shown in Figure 3.4, and the Gantt charts for the individual steps are shown in Figures 3.5 through 3.14. HCNF begins by calculating the critical path of G1 (1 -> 2 -> 9 -> 10) and marking the critical-path nodes.

Step 1 (Figure 3.5): The start node (node 1) is mapped onto processor 3, which provides its least finish time of 9 seconds.
Step 2 (Figure 3.6): Among the successors of node 1, the CP node (node 2) is inserted at the beginning of the ReadyTaskList and the remaining nodes are inserted in descending order of their weights. Node 2 is selected for mapping and its EFTs over all the processors are computed (see Figure 3.3). Both p1 and p3 provide the least finish time (27 seconds); however, since the finish time on p1 is obtained by duplicating node 1, p3 is chosen. Node 2 is removed and the ReadyTaskList is updated to {3,4,6,5}.
Step 3 (Figure 3.7): Node 3 is selected for mapping. The minimum EFT is obtained on p1 by duplicating node 1 on p1. The successor of node 3 (node 7) becomes free as a result of this mapping and the ReadyTaskList is updated to {4,6,5,7}.
Step 4 (Figure 3.8): Node 4 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 1 on p2. The ReadyTaskList is updated to {6,5,7}.
Step 5 (Figure 3.9): Node 6 is selected for mapping. The minimum EFT is obtained on p3 and node 6 is scheduled on p3; one of its successors (node 8) becomes free as a result of this mapping. The ReadyTaskList is updated to {5,7,8}.

Algorithm HCNF
// Identify the CP nodes of the input DAG
// Map the start node onto a processor that provides its fastest execution time
// Among the successors of the start node, add the CP node to the ReadyTaskList
// Add the remaining successors of the start node to the ReadyTaskList in decreasing order of task size
While ReadyTaskList is NOT NULL do
    n_t <- first node in the ReadyTaskList
    For all p_j in P do
        EST1(n_t, p_j) <- max{ T_available[p_j], max over n_k in pred(n_t) of ( EFT(n_k, p_m) + c_k,t ) }
        If EST(n_en, p_j) >= T_available[p_j] then
            EST2(n_t, p_j) <- EST(n_en, p_j) + w_en,j
        Else
            EST2(n_t, p_j) <- T_available[p_j] + w_en,j
        End If
        If EST1(n_t, p_j) <= EST2(n_t, p_j) then
            EST(n_t, p_j) <- EST1(n_t, p_j)
        Else
            EST(n_t, p_j) <- EST2(n_t, p_j)
            Tentatively duplicate n_en on processor p_j
        End If
        EFT(n_t, p_j) <- EST(n_t, p_j) + w_t,j
    End For
    fproc(n_t) <- processor p_j that provides the minimum EFT for n_t
    Map n_t on fproc(n_t) and permanently duplicate any tentatively duplicated n_en
    Add the successors of n_t to the ReadyTaskList
    ReadyTaskList <- ReadyTaskList - n_t
    Realign the ReadyTaskList so that the CP node is in the first position and the remaining nodes are sorted in decreasing order of their weights
End While
End HCNF
Figure 3.1 The HCNF algorithm

Figure 3.2 Sample DAG (G1)

Table 3.2 Task execution times of G1 on three different processors
n_i   p1   p2   p3   Average execution time
1     14   16    9   13
2     13   19   18   16.67
3     11   13   19   14.33
4     13    8   17   14
5     12   13   10   11.66
6     13   16    9   12.67
7      7   15   11   11
8      5   11   14   10
9     18   12   20   16.67
10    21    7   16   14.67

Table 3.3 Run-time values for G1. Each iteration lists the ReadyTaskList, the selected node n_i, EST1 on (p1, p2, p3), EST2 and the duplicated node n_en on (p1, p2, p3), EFT on (p1, p2, p3), the least EFT(n_i) and fproc(n_i); a dash marks a value lost in the source.
Iteration 1: ReadyTaskList {1}, n_i = 1; EST1 = 0, 0, 0; EST2/n_en = 0/n/a, 0/n/a, 0/n/a; EFT = 14, 16, 9; EFT(n_i) = 9; fproc = p3
Iteration 2: ReadyTaskList {2,3,4,6,5}, n_i = 2; EST1 = 27, 27, -; EST2/n_en = 14/1, 16/1, n/a; EFT = 27, 35, 27; EFT(n_i) = 27; fproc = p3
Iteration 3: ReadyTaskList {3,4,6,5}, n_i = 3; EST1 = 27, 27, -; EST2/n_en = 14/1, 16/1, n/a; EFT = 25, 29, 46; EFT(n_i) = 25; fproc = p1
Iteration 4: ReadyTaskList {4,6,5,7}, n_i = 4; EST1 = 25, 18, 27; EST2/n_en = n/a, 16/1, n/a; EFT = 38, 24, 44; EFT(n_i) = 24; fproc = p2
Iteration 5: ReadyTaskList {6,5,7}, n_i = 6; EST1 = 25, 24, 27; EST2/n_en = n/a, n/a, n/a; EFT = 38, 40, 36; EFT(n_i) = 36; fproc = p3
Iteration 6: ReadyTaskList {5,7,8}, n_i = 5; EST1 = 25, 24, -; EST2/n_en = n/a, n/a, n/a; EFT = 37, 37, 46; EFT(n_i) = 37; fproc = p1
Iteration 7: ReadyTaskList {9,7,8}, n_i = 9; EST1 = 47, 50, 50; EST2/n_en = 50/4, 43/5, 47/5; EFT = 65, 55, 67; EFT(n_i) = 55; fproc = p2
Iteration 8: ReadyTaskList {7,8}, n_i = 7; EST1 = 37, 55, 48; EST2/n_en = n/a, 68/3, 55/3; EFT = 44, 70, 49; EFT(n_i) = 44; fproc = p1
Iteration 9: ReadyTaskList {8}, n_i = 8; EST1 = 51, 55, 51; EST2/n_en = 57/6, 71/6, 53/4; EFT = 56, 66, 65; EFT(n_i) = 56; fproc = p1
Iteration 10: ReadyTaskList {10}, n_i = 10; EST1 = 68, 67, 68; EST2/n_en = 77/9, 66/8, 88/9; EFT = 89, 74, 86; EFT(n_i) = 74; fproc = p2

Figure 3.4 Gantt chart for G1

Step 6 (Figure 3.10): Node 5 is selected for mapping. The minimum EFT is obtained on p1; node 5 is scheduled on p1 and node 9 becomes free as a result of this mapping.
The ReadyTaskList is updated to {9,7,8}; since node 9 is a CP node, it is inserted at the beginning of the list.
Step 7 (Figure 3.11): Node 9 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 5. The ReadyTaskList is updated to {7,8}.
Step 8 (Figure 3.12): Node 7 is selected for mapping. The minimum EFT is obtained on p1. The ReadyTaskList is updated to {8}.
Step 9 (Figure 3.13): Node 8 is selected for mapping. The minimum EFT is obtained on p3. Node 10 becomes free as a result of this mapping and the ReadyTaskList is updated to {10}.
Step 10 (Figure 3.14): Node 10 is selected for mapping. The minimum EFT is obtained on p2 by duplicating node 8.

Figure 3.5 HCNF running trace, step 1: node 1 is scheduled on processor 3
Figure 3.6 HCNF running trace, step 2: node 2 is scheduled on processor 3
Figure 3.7 HCNF running trace, step 3: node 3 is scheduled on processor 1 by duplicating node 1
Figure 3.8 HCNF running trace, step 4: node 4 is scheduled on processor 2 by duplicating node 1
Figure 3.9 HCNF running trace, step 5: node 6 is scheduled on processor 3
Figure 3.10 HCNF running trace, step 6: node 5 is scheduled on processor 1
Figure 3.11 HCNF running trace, step 7: node 9 is scheduled on processor 2 by duplicating node 5
Figure 3.12 HCNF running trace, step 8: node 7 is scheduled on processor 1
Figure 3.13 HCNF running trace, step 9: node 8 is scheduled on processor 3
Figure 3.14 HCNF running trace, step 10: node 10 is scheduled on processor 2 by duplicating node 8

3.4 Simulation Study
The simulation study consists of two parts. In the first part, the performance of HCNF is compared against that of the Heterogeneous Earliest Finish Time (HEFT) [30] algorithm using an experimental test suite [18] of randomly generated graphs, Gaussian elimination graphs, trace graphs, benchmark graphs and application graphs. In the second part, a parametric random graph generator is developed to generate a diverse range of graphs with specified input parameters, and the performance of HCNF is compared against that of HEFT and the Scalable Task Duplication based Scheduling (STDS) algorithm [25].

3.4.1 Performance Parameters
The three commonly used parameters to gauge the performance of DAG scheduling algorithms are:
Schedule Length Ratio (SLR): the ratio of the overall execution time of a DAG to the sum of the weights of its critical-path nodes on the fastest processor.
Speedup: the ratio of the sequential execution time of the DAG on the fastest processor to the parallel execution time.
Efficiency: the ratio of the speedup to the number of processors in the system.

3.4.2 Randomly Generated Graphs
The performance of HCNF and HEFT was compared using randomly generated graphs of different sizes and CCRs. Each node in a random graph was allowed to have up to five children. Node and edge weights were generated randomly, and the edge weights were then iteratively adjusted to obtain a given CCR. The SLR and speedup of HCNF and HEFT were compared using graphs of different sizes. For each graph size shown in Figures 3.15 and 3.16, readings were averaged over 10 random graphs of the same size with CCRs ranging from 0.5 to 1.5 and out_degree = {1, 2, 5, 100}. The average SLR of HCNF was better than that of HEFT by 12.3% and the speedup was better by 7.9%.

Figure 3.15 Average SLR vs. number of nodes
Figure 3.16 Average speedup vs. number of nodes

3.4.3 Gaussian Elimination Graphs
The SLR and efficiency of HEFT and HCNF were compared using DAGs representing the Gaussian elimination algorithm. Figure 3.17 gives the SLR for matrix sizes ranging from 5 to 15; HCNF outperformed HEFT by an average of 25.7%. Figure 3.18 gives the efficiency for different numbers of processors, with the matrix size fixed at 50; HCNF outperformed HEFT by an average of 22.6%. The efficiency of HCNF increased with the number of processors because of the increased speedup facilitated by enhanced task duplication in the presence of a larger number of processors.

Figure 3.17 Average SLR vs. matrix size

3.4.4 Benchmark Graphs
DAG scheduling algorithms are commonly compared using randomly generated graphs. However, to facilitate a fair and unbiased comparison of algorithms from different authors, some researchers [21] have proposed using benchmark graphs. In the following sections we evaluate the performance of HCNF using the benchmark graph test suite [21], which consists of trace graphs, graphs whose optimal solutions are generated by the branch-and-bound technique, graphs with predetermined optimal solutions, and application graphs.

3.4.4.1 Trace Graphs
These graphs are obtained from the referenced articles listed in Table 3.4. The SLR and speedup of HEFT and HCNF were compared using these graphs; Figures 3.19 and 3.20 show the results. HCNF outperformed HEFT in SLR and speedup by an average of 29.5% and 38.4% respectively.

Figure 3.18 Efficiency vs. number of processors

3.4.4.2 Random Graphs with Optimal Solutions (RGBOS)
The DAGs in this set are small, with at most 32 nodes, so their optimal solutions can be obtained using the branch-and-bound technique. The set consists of three subsets of graphs with different CCRs (0.1, 1.0, 10.0), and the number of nodes varies from 10 to 32 in increments of 2. Figures 3.21 through 3.26 show the results. HCNF outperformed HEFT in SLR and speedup by 32.5% and 24.6% respectively.

3.4.4.3 Random Graphs with Pre-Determined Optimal Schedules (RGPOS)
The graphs in this set are reverse engineered [23]: a schedule for a set of processors is generated first, and the node and edge weights are then generated randomly but consistently with the generated schedule. The graphs comprise three sets with CCR values 0.1, 1.0 and 10.0; within each set, the number of nodes varies from 50 to 500 in increments of 50. Figures 3.27 through 3.32 show the results. HCNF outperformed HEFT in SLR and speedup by 21.1% and 16.9% respectively.
Table 3.4 Trace graph details (graph tag, trace graph, number of nodes, article reference)
D1   Ahmed-Kwok        13   [22]
D2   Yang-1             7   [13]
D3   Colin-Chretienne   9   [25]
D4   McCreary           9   [12]
D5   Kruatrachue       11   [16]
D6   Yang-2             7   [19]
D7   Ranka             11   [22]
D8   Shirazi           11   [23]
D9   Wu-Gajski         18   [25]
D10  Al-Maasarani      16   [33]
D11  Al-Mouhamed       17   [32]

Figure 3.19 Trace graphs: SLR
Figure 3.20 Trace graphs: speedup
Figures 3.21-3.23 RGBOS SLR (CCR = 0.1, 1.0, 10.0)
Figures 3.24-3.26 RGBOS speedup (CCR = 0.1, 1.0, 10.0)
Figures 3.27-3.29 RGPOS SLR (CCR = 0.1, 1.0, 10.0)
Figures 3.30-3.32 RGPOS speedup (CCR = 0.1, 1.0, 10.0)

3.4.4.5 Application Graphs
These graphs represent several numerical parallel application programs. The set contains over 320 graphs in six categories: Cholesky factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms and Mean Value Analysis (MVA). The number of nodes ranges from 100 to 300. Figures 3.33 to 3.44 show the results of the simulation. On average, HCNF outperformed HEFT in SLR and speedup by 27.5% and 22.7% respectively.

Figure 3.33 Fast Fourier Transform: SLR vs. CCR
Figure 3.34 Fast Fourier Transform: speedup vs. CCR
Figure 3.35 Cholesky factorization: speedup vs. CCR
Figure 3.36 Gaussian elimination: speedup vs. CCR
Figure 3.37 Laplace transform: speedup vs. CCR
Figure 3.38 LU decomposition: speedup vs. CCR
Figure 3.39 MVA: speedup vs. CCR
Figure 3.40 Cholesky factorization: SLR vs. CCR
Figure 3.41 Gaussian elimination: SLR vs. CCR
Figure 3.42 Laplace transform: SLR vs. CCR
Figure 3.43 LU decomposition: SLR vs. CCR
Figure 3.44 MVA: SLR vs. CCR

3.4.5 Performance Comparison using a Parametric Random Graph Generator
The performance of DAG scheduling algorithms varies widely with the type of the input DAG. In order to conduct a fair comparison, the performance must be evaluated using a comprehensive set of randomly generated DAGs exhibiting a wide range of parameters. In our simulation study, a parametric random graph generator was developed to generate diverse DAG types based on the following input parameters:
- n: number of nodes in the DAG.
- CCR (Communication to Computation Ratio): ratio of the sum of the edge weights to the sum of the node weights in a DAG.
- out_degree: maximum number of children a node in the DAG can have.
- α (the shape parameter of a DAG): the height of a DAG is randomly generated from a uniform distribution with mean equal to √n / α, and the width of a DAG from a uniform distribution with mean equal to α · √n.
- β (range percentage of computation costs on processors): if the average computation cost of a node over all the processors is avg_comp, the computation cost of n_i on any processor p_j is randomly selected from the range avg_comp × (1 - β/2) to avg_comp × (1 + β/2).

The input parameters were assigned the following values in our simulation study:
n = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
CCR = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0}
α = {0.5, 1.0, 2.0}
out_degree = {1, 2, 5, 100}
β = {0.1, 0.5, 0.75, 1.0}

These combinations yield 8640 different DAG types. For each DAG type, 25 different random graphs were generated with the same parameters but different edge and node weights, giving a total of 216,000 random DAGs. The number of processors was fixed at 10, and the processor speeds were randomly selected based on the β value.
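For concreteness, the following Python sketch shows one possible generator driven by these parameters. The value of avg_comp, the uniform spreads used for the height and the edge weights, and the scaling used to approximate the requested CCR are assumptions of the sketch, not details taken from the dissertation's simulator.

import math
import random

def random_dag(n, ccr, out_degree, alpha, beta, m, avg_comp=20.0, seed=0):
    """Illustrative parametric DAG generator in the spirit of Section 3.4.5.
    Returns (succ, w, c): succ[i] lists the successors of node i, w[(i, p)] gives
    computation costs, c[(i, j)] gives communication costs on the edges."""
    rng = random.Random(seed)

    # Shape: mean height sqrt(n)/alpha; spread nodes over that many levels.
    height = max(1, round(rng.uniform(0.5, 1.5) * math.sqrt(n) / alpha))
    level = sorted(rng.randrange(height) for _ in range(n))   # node i sits at level[i]

    # Edges: each node gets up to out_degree successors on deeper levels.
    succ = {i: [] for i in range(n)}
    for i in range(n):
        deeper = [j for j in range(n) if level[j] > level[i]]
        succ[i] = rng.sample(deeper, min(out_degree, len(deeper)))

    # Computation costs spread by beta around avg_comp for each processor.
    w = {(i, p): rng.uniform(avg_comp * (1 - beta / 2), avg_comp * (1 + beta / 2))
         for i in range(n) for p in range(m)}

    # Communication costs scaled so that sum(edges) / sum(node means) ~= CCR.
    edges = [(i, j) for i in succ for j in succ[i]]
    mean_comp = sum(w.values()) / (n * m)
    c = {e: rng.uniform(0.5, 1.5) * ccr * mean_comp * n / max(1, len(edges))
         for e in edges}
    return succ, w, c

Sweeping the parameter lists above over such a generator and averaging repeated draws is what produces the data points discussed next.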
Figure 3.45 provides the SLR of the HCNF, HEFT and STDS algorithms for graphs of different node counts; each data point is averaged over 864 distinct readings. The average SLR improvement of HCNF over STDS is approximately 6%, and over HEFT approximately 10%. Figure 3.46 gives the average speedup versus the number of nodes, with each data point again averaged over 864 readings; the average improvement in the speedup of HCNF over STDS is 9%, and over HEFT 14%. Figure 3.47 provides the average SLR for CCR values ranging from 0.1 to 1.0 in steps of 0.1, each data point being averaged over 480 readings; the average improvement of HCNF over STDS is 11% and over HEFT is 4%. Figure 3.48 provides the average SLR for CCR values ranging from 1.0 to 5.0 in steps of 0.5, each data point being averaged over 480 readings; the average improvement of HCNF over STDS is 7% and over HEFT is 11%. For higher CCRs the STDS algorithm performs better than the HEFT algorithm, since there is more scope for task duplication. Figure 3.49 provides the average speedup for CCR values ranging from 0.1 to 1.0 in steps of 0.1, averaged over 480 readings per point; the average improvement of HCNF over STDS is 18% and over HEFT is 9%. Figure 3.50 provides the average speedup for CCR values ranging from 1.0 to 5.0 in steps of 0.5, averaged over 480 readings per point; the average improvement of HCNF over STDS is 9% and over HEFT is 5%.

Figure 3.45 Parametric random graphs: SLR vs. number of nodes
Figure 3.46 Parametric random graphs: speedup vs. number of nodes
Figure 3.47 Parametric random graphs: SLR vs. CCR (0.1 to 0.9)
Figure 3.48 Parametric random graphs: SLR vs. CCR (1.0 to 5.0)
Figure 3.49 Parametric random graphs: speedup vs. CCR (0.1 to 0.9)
Figure 3.50 Parametric random graphs: speedup vs. CCR (1.0 to 5.0)

3.5 Conclusion
A new task-duplication based static scheduling algorithm, the Heterogeneous Critical Node First (HCNF) algorithm, was proposed for scheduling DAGs onto a network of heterogeneous processors. The performance of HCNF, HEFT and STDS was compared using randomly generated graphs, benchmark graphs and parametric graphs; HCNF clearly outperformed both HEFT and STDS with respect to speedup and SLR. The superior performance of HCNF can be attributed to its low-cost task-duplication strategy, which facilitates earlier start times for many nodes that would otherwise have to wait for all data items to arrive from their favorite predecessors. HCNF could be improved by exploring the possibility of duplicating the second and third favorite predecessors (if any) to further expedite the start times of nodes; the feasibility of such an approach needs to be investigated.

CHAPTER 4
THE HETEROGENEOUS LARGEST TASK FIRST (HLTF) ALGORITHM

This chapter presents a new low-complexity algorithm, the Heterogeneous Largest Task First (HLTF) algorithm, for scheduling the independent tasks of a meta-task onto a network of heterogeneous processors to minimize the overall execution time. The problem was formally defined in Section 2.2.1. The chapter is organized as follows: Section 4.1 discusses the motivation behind the development of HLTF, Section 4.2 describes the algorithm in detail, Section 4.3 discusses the theoretical non-equivalence of HLTF and the Sufferage algorithm [23], and Section 4.4 provides the simulation study.

4.1 Motivation
A meta-task is a set of independent tasks without any precedence constraints. Scheduling a meta-task onto a set of heterogeneous processors to minimize the overall execution time is an NP-complete problem. Among the scheduling algorithms discussed in the literature review, Sufferage has the best performance in terms of minimizing the makespan [23]; its time complexity is O(s^2 m), where s is the size of the meta-task and m the number of processors. Meta-computing systems such as clusters and grids need to schedule tens of thousands of tasks on a regular basis, and a meta-task can contain over 1000 independent tasks in practical scenarios [23]. Figure 4.1 summarizes the running times of Sufferage.
Sufferage takes more than 100 seconds to schedule a meta-task of 1000 tasks, and the running time increases with the size of the meta-task: for a meta-task of 5000 tasks the algorithm takes more than 3000 seconds. This can be attributed mainly to its high time complexity of O(s^2 m). Such high running times could be a major bottleneck in the scheduling process and could negatively impact the overall performance of a meta-computing system. To counter this problem, we propose a new low-complexity algorithm, the Heterogeneous Largest Task First (HLTF) algorithm, to map a meta-task onto a set of heterogeneous processors with the objective of minimizing its makespan. The simulation results in Section 4.4 show that in terms of minimizing the makespan HLTF is on par with Sufferage, while with respect to running time HLTF, with its lower time complexity of O(s(log s + m)), significantly outperforms Sufferage.

4.2 The Heterogeneous Largest Task First (HLTF) Algorithm
HLTF adopts a simple approach to reduce the overall time complexity of the scheduling process. We first recap the working of the Sufferage algorithm and then explain the working of HLTF. Table 4.1 provides the definitions of the terms used in the HLTF and Sufferage algorithms.

Figure 4.1 Running times of the Sufferage algorithm (seconds vs. meta-task size)

The Sufferage algorithm: The algorithm is listed in Figure 4.3. At each scheduling step, Sufferage picks an arbitrary task from the meta-task set and computes its earliest completion time (ECT), its favorite processor and its sufferage value. If the task's favorite processor has no task previously assigned to it, the current task is tentatively assigned to that processor. If the favorite processor already has a task assigned to it, the sufferages of the current task and of the previously assigned task are compared; if the sufferage of the current task is higher, the previously assigned task is removed, the current task takes its place, and the removed task is re-inserted into the meta-task. The process is repeated until all the tasks of the meta-task set are scheduled.

The HLTF algorithm: The calculation of sufferages at each scheduling step, the re-insertion of tasks into the meta-task list, and the repetition of all the steps each time a task is re-inserted lead to the high O(s^2 m) complexity of the Sufferage algorithm. The HLTF algorithm, listed in Figure 4.2, drastically reduces this complexity with the following approach: instead of tentatively mapping tasks to processors, HLTF sorts all the tasks in the meta-task set in non-increasing order of their sizes before the mapping process begins, and at each scheduling step it picks the largest remaining task and maps it onto the processor that provides its earliest completion time. This seemingly simple approach leads to a very substantial decrease in running time without compromising the schedule quality; simulation results are reported in Section 4.4. HLTF takes O(s log s) to perform the merge sort and O(s · m) to compute the completion times of the tasks over all the processors and to select the earliest completion time for each task. The overall time complexity is therefore O(s log s + s·m), i.e., O(s(log s + m)).
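The whole algorithm fits in a few lines of Python. In this sketch the run time of a task on a processor is modelled as size/speed, as in Theorem 1 and Table 4.2 below; the function and variable names are illustrative, not the dissertation's simulator, which also supports arbitrary per-task, per-processor times.

def hltf(sizes, speeds):
    """HLTF sketch: sizes maps task -> size (MI), speeds maps processor -> speed (MIPS).
    Returns (task, processor, finish time) tuples in scheduling order."""
    t_available = {p: 0.0 for p in speeds}        # when each processor becomes free
    schedule = []
    # Largest task first: one O(s log s) sort replaces Sufferage's repeated rescans.
    for t in sorted(sizes, key=sizes.get, reverse=True):
        # Earliest completion time over all processors for this task: O(m) per task.
        p = min(speeds, key=lambda q: t_available[q] + sizes[t] / speeds[q])
        finish = t_available[p] + sizes[t] / speeds[p]
        schedule.append((t, p, finish))
        t_available[p] = finish
    return schedule

# The six-task, three-processor example of Table 4.2:
tasks = {"t1": 157, "t2": 111, "t3": 143, "t4": 128, "t5": 111, "t6": 149}
procs = {"p1": 4, "p2": 5, "p3": 6}
print(hltf(tasks, procs))

The sort costs O(s log s) and the per-task minimum costs O(m), which together give the O(s(log s + m)) bound quoted above.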
4.3 Theoretical Non-Equivalence of the Sufferage and HLTF Algorithms
At each scheduling step, the Sufferage algorithm maps the task with the maximum sufferage to the machine that provides its earliest finish time, whereas HLTF maps the largest remaining task to the machine that provides its earliest finish time. Intuitively, the two algorithms appear equivalent, because one tends to assume that the largest task always has the maximum sufferage, i.e., that for any two tasks t_i and t_j in the meta-task set (t_1, t_2, ..., t_n) with t_i > t_j and i < j, Sufferage(t_i) > Sufferage(t_j). However, this is not always the case, as proved in Theorem 1.

Table 4.1 Definition of terms used in Sufferage and HLTF
T: meta-task set of size s
M: set of processors available for scheduling; m = number of processors
T_available(p_j): time at which processor p_j can start execution of a new task
W_k,j: running time of task t_k on processor p_j
CT(t_k, p_j) = T_available(p_j) + W_k,j, the completion time of task t_k on processor p_j
ECT(t_k) = min over p_j in M of CT(t_k, p_j), the earliest completion time of task t_k
Proc(t_k): the processor on which ECT(t_k) is obtained
Sufferage(t_k) = second-best CT(t_k) - ECT(t_k)

HLTF Algorithm
Sort T using merge sort in non-increasing order of task size
While T is NOT NULL do
    Pick the largest task t_k in T
    For all p_j in M
        Compute CT(t_k, p_j)
    End For
    Compute ECT(t_k)
    T <- T - {t_k}
    Schedule t_k on Proc(t_k)
End While
End HLTF
Figure 4.2 The HLTF algorithm

Sufferage Algorithm
While T is NOT NULL do
    Pick a task t_k in T in an arbitrary order
    For all p_j in M
        Compute CT(t_k, p_j)
    End For
    Compute ECT(t_k)
    Sufferage(t_k) <- second-best CT(t_k) - ECT(t_k)
    If Proc(t_k) has a task t_s already assigned to it then
        If Sufferage(t_k) > Sufferage(t_s) then
            Remove t_s from Proc(t_k) and schedule t_k on Proc(t_k)
            T <- T + {t_s}
            T <- T - {t_k}
        End If
    Else
        Schedule t_k on Proc(t_k)
        T <- T - {t_k}
    End If
End While
End Sufferage
Figure 4.3 The Sufferage algorithm

Theorem 1: For any two tasks t_i and t_j in the meta-task set (t_1, t_2, ..., t_n) with t_i > t_j and i < j, Sufferage(t_i) is not always greater than Sufferage(t_j).

Case 1: Let m_x be the processor on which both t_i and t_j obtain their earliest completion times and m_y the processor on which they obtain their second-best completion times, and let p_x and p_y, with p_x > p_y, be the speeds of m_x and m_y in MIPS. Then
Sufferage(t_i) = ( t_i/p_y + T_Available(p_y) ) - ( t_i/p_x + T_Available(p_x) )
Sufferage(t_j) = ( t_j/p_y + T_Available(p_y) ) - ( t_j/p_x + T_Available(p_x) )
To prove Sufferage(t_i) > Sufferage(t_j), we need
( t_i/p_y + T_Available(p_y) ) - ( t_i/p_x + T_Available(p_x) ) > ( t_j/p_y + T_Available(p_y) ) - ( t_j/p_x + T_Available(p_x) ),
i.e., t_i/p_y - t_i/p_x > t_j/p_y - t_j/p_x,
i.e., (t_i - t_j)/p_y > (t_i - t_j)/p_x,
which is true, since t_i > t_j implies (t_i - t_j) > 0 and p_x > p_y.

Case 2: Let m_x and m_y be defined as above, but now let p_x < p_y.
Sufferage(t_i) = ( t_i/p_y + T_Available(p_y) ) - ( t_i/p_x + T_Available(p_x) )
Sufferage(t_j) = ( t_j/p_y + T_Available(p_y) ) - ( t_j/p_x + T_Available(p_x) )
To prove Sufferage(t_i) > Sufferage(t_j), we would again need
( t_i/p_y + T_Available(p_y) ) - ( t_i/p_x + T_Available(p_x) ) > ( t_j/p_y + T_Available(p_y) ) - ( t_j/p_x + T_Available(p_x) ),
i.e., t_i/p_y - t_i/p_x > t_j/p_y - t_j/p_x,
i.e., (t_i - t_j)/p_y > (t_i - t_j)/p_x,
which is NOT true, since t_i > t_j implies (t_i - t_j) > 0 while p_x < p_y. Therefore, for any two tasks t_i and t_j in the meta-task set with t_i > t_j and i < j, Sufferage(t_i) > Sufferage(t_j) does not always hold.

As an example, observe in Table 4.2 that in the third iteration T3 is the largest remaining task, and HLTF picks T3 and schedules it onto its favorite processor p1. However, the sufferage of T3 (14.25) is less than the sufferage of T2 (16.92), even though T3 is larger than T2. Also observe that the first favorite processor (fproc1) of all the tasks is processor 1, the second favorite processor (fproc2) of all the tasks is processor 3, and the speed of processor 1 (4 MIPS) is less than that of processor 3 (6 MIPS). This scenario illustrates Case 2 of Theorem 1 and provides a practical example of the difference between the Sufferage and HLTF algorithms.

4.4 Simulation and Results
Simulations were conducted on a 440 MHz Sun Ultra 5 machine running the Solaris 8 operating system. We compared the relative performance of HLTF and Sufferage with respect to makespan and running cost, using a simulator with the following input parameters:
n: number of tasks in the meta-task.
p: number of processors in the distributed system.
std_dev: standard deviation of the task sizes in the meta-task.
size_min: minimum task size in MI.
size_max: maximum task size in MI.
m: number of meta-tasks.
The maximum number of processors used in our simulations was 20.

4.4.1 Comparison of Makespan
The makespans produced by HLTF and Sufferage were measured using the following input parameters:
n = {50, 100, 200, 300, 400, 500, 750, 1000}
p = {5, 10, 15, 20}
std_dev = {5, 10, 15, 20, 25, 30}
size_min = {10}
size_max = {100}
m = {1}
The results are shown in Figures 4.4 through 4.9. Each data point is an average of readings taken over the four different processor counts. The performance of HLTF was slightly better than that of Sufferage, and we did not come across any meta-task for which Sufferage performed better than HLTF. The average improvement of HLTF over Sufferage was 0.48%.

4.4.2 Comparison of Running Costs
The running times of Sufferage and HLTF were measured using different meta-task sizes. The results are shown in Figures 4.10 to 4.12; each data point is an average of 25 readings. The running cost of the Sufferage algorithm grows rapidly as the size of the meta-task increases. For meta-task sizes greater than 1000, HLTF provides a very significant reduction in running cost.
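The non-equivalence can also be checked numerically. The short Python sketch below recomputes the sufferage values at the state reached in the third iteration of Table 4.2 (shown next), modelling run times as size/speed as in Theorem 1; the function and variable names are illustrative.

def sufferage_values(sizes, speeds, t_available):
    """Sufferage of each remaining task: second-best completion time minus best."""
    out = {}
    for t, size in sizes.items():
        cts = sorted(t_available[p] + size / speeds[p] for p in speeds)
        out[t] = round(cts[1] - cts[0], 2)
    return out

# State at the third iteration of Table 4.2: t1 is on p3 and t6 on p2.
remaining = {"t2": 111, "t3": 143, "t4": 128, "t5": 111}
speeds = {"p1": 4, "p2": 5, "p3": 6}
t_avail = {"p1": 0.0, "p2": 149 / 5, "p3": 157 / 6}
print(sufferage_values(remaining, speeds, t_avail))
# t3 is the largest remaining task, yet its sufferage (~14.25) is below t2's (~16.92).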
Table 4.2 Example showing the non-equivalence of the Sufferage and HLTF algorithms
Task sizes: t1 = 157, t2 = 111, t3 = 143, t4 = 128, t5 = 111, t6 = 149 (MI)
Processor speeds: p1 = 4, p2 = 5, p3 = 6 (MIPS)
Sufferage works on the meta-task {t1, t2, t3, t4, t5, t6}; HLTF works on {t1, t6, t3, t4, t2, t5}, the tasks sorted in non-increasing order of size using merge sort.

Sufferage, first iteration (task: eft1, fproc1, eft2, fproc2, sufferage):
T1: 26.17, p3, 31.4, p2, 5.23; T2: 18.5, p3, 22.2, p2, 3.7; T3: 23.83, p3, 28.6, p2, 4.77; T4: 21.33, p3, 25.6, p2, 4.27; T5: 18.5, p3, 22.2, p2, 3.76; T6: 24.83, p3, 29.8, p2, 4.97.
Schedule t1 on processor 3; meta-task = {t2, t3, t4, t5, t6}; t_avail[1] = 0, t_avail[2] = 0, t_avail[3] = 26.16.
HLTF, first iteration: largest task = t1; schedule t1 on processor p3; t_avail[1] = 0, t_avail[2] = 0, t_avail[3] = 26.16; meta-task = {t6, t3, t4, t2, t5}.

Sufferage, second iteration:
T2: 22.2, p2, 27.75, p1, 5.55; T3: 28.6, p2, 35.75, p1, 7.15; T4: 25.6, p2, 32, p1, 6.4; T5: 22.2, p2, 27.75, p1, 5.55; T6: 29.8, p2, 37.25, p1, 7.45.
Schedule t6 on processor 2; meta-task = {t2, t3, t4, t5}; t_avail[1] = 0, t_avail[2] = 29.8, t_avail[3] = 26.16.
HLTF, second iteration: largest task = t6; schedule t6 on processor p2; t_avail[1] = 0, t_avail[2] = 29.8, t_avail[3] = 26.16; meta-task = {t3, t4, t2, t5}.

Sufferage, third iteration:
T2: 27.75, p1, 44.67, p3, 16.92; T3: 35.75, p1, 50, p3, 14.25; T4: 32.0, p1, 47.5, p3, 15.5; T5: 27.75, p1, 44.67, p3, 16.92.
Schedule t2 on processor 1; meta-task = {t3, t4, t5}; t_avail[1] = 27.75, t_avail[2] = 29.8, t_avail[3] = 26.16.
HLTF, third iteration: largest task = t3; schedule t3 on processor p1; t_avail[1] = 35.75, t_avail[2] = 29.8, t_avail[3] = 26.16; meta-task = {t4, t2, t5}.

Figures 4.4-4.9 Average makespan of meta-tasks for Sufferage and HLTF, std_dev = 5, 10, 15, 20, 25, 30
Figures 4.10-4.12 Running times of HLTF and Sufferage (log10) for n = {50, 100, 200}, {500, 1000, 2000}, {3000, 4000, 5000}

CHAPTER 5
SCHEDULING INDEPENDENT TASKS WITH DISPATCH TIMES

In this chapter we introduce a novel heuristic to schedule the independent tasks of a meta-task onto a network of heterogeneous processors so as to minimize the makespan of the meta-task. The chapter is organized as follows. In Section 5.1 we provide the motivation for solving this problem.
In Section 5.2 we introduce the Earliest Finish Time with Dispatch Time (EFT-DT) algorithm, in Section 5.3 we provide a worked example of the algorithm, and in Section 5.4 we discuss the simulation results.

5.1 Motivation
In meta-computing systems such as the grid, a centralized scheduler may make all scheduling decisions for independent tasks: it makes a scheduling decision and maps tasks onto processors. In reality, mapping a task onto a processor requires time to dispatch the task from the scheduler to the processor. Previous work [20][21][23] on scheduling the independent tasks of a meta-task onto a network of heterogeneous processors has not considered the dispatch times of the tasks in the scheduling decisions; the Sufferage, Min-Min and Max-Min [23] algorithms assume zero dispatch time in their scheduling model. We believe that in practical scenarios a zero dispatch time is not feasible and may lead to unrealistic schedules. In this chapter we introduce a novel heuristic that schedules the independent tasks of a meta-task onto a network of heterogeneous processors while taking the dispatch times of the tasks into account.

5.2 The Earliest Finish Time with Dispatch Time (EFT-DT) Algorithm
In the EFT-DT algorithm, the priority of a task is defined as the sum of its mean execution time over all the processors and the standard deviation of its execution times over all the processors. At each scheduling step, EFT-DT picks the task with the highest priority and schedules it onto the processor that provides its earliest completion time.

Table 5.1 EFT-DT algorithm: definition of terms
T: meta-task set of size s
M: set of processors available for scheduling; m = number of processors
mean_k: mean execution time of task t_k over all the processors
std_k: standard deviation of the execution times of task t_k over all the processors
T_available(p_j): time at which processor p_j can start execution of a new task
W_k,j: running time of task t_k on processor p_j
D_k,j: time required to dispatch task t_k from the scheduler to processor p_j
CT(t_k, p_j) = max{ T_available(p_j), D_k,j } + W_k,j, the completion time of task t_k on processor p_j
ECT(t_k) = min over p_j in M of CT(t_k, p_j), the earliest completion time of task t_k
Proc(t_k): the processor on which ECT(t_k) is obtained

EFT-DT Algorithm
For all t_k in T
    priority(t_k) <- mean_k + std_k
End For
While T is NOT NULL do
    Pick the task t_k in T with the highest priority
    For all p_j in M
        CT(t_k, p_j) <- max{ T_available(p_j), D_k,j } + W_k,j
    End For
    ECT(t_k) <- min over p_j in M of CT(t_k, p_j)
    Compute Proc(t_k)
    Assign t_k to Proc(t_k)
    T_available(Proc(t_k)) <- ECT(t_k)
    T <- T - {t_k}
End While
End EFT-DT
Figure 5.1 The EFT-DT algorithm

The completion time of a task on a processor is defined as CT(t_k, p_j) = max{ T_available(p_j), D_k,j } + W_k,j to account for the dispatch times; EFT-DT then determines the processor on which the least completion time is obtained and schedules the task onto it. EFT-DT takes O(s) to compute the priorities of all the tasks and O(s · m) to calculate the earliest completion times of the tasks, so the overall complexity is O(s · m).
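A minimal Python sketch of the selection and placement rules just described is given below. The use of the population standard deviation and of a simple descending sort are assumptions of this sketch, and the names are illustrative, so it need not reproduce the worked example of Section 5.3 exactly.

import statistics

def eft_dt(tasks, procs, w, dispatch):
    """EFT-DT sketch: w[(t, p)] is the run time of task t on processor p and
    dispatch[(t, p)] its dispatch time from the scheduler to p."""
    def priority(t):
        times = [w[(t, p)] for p in procs]
        return statistics.mean(times) + statistics.pstdev(times)

    t_available = {p: 0.0 for p in procs}
    schedule = []
    for t in sorted(tasks, key=priority, reverse=True):
        # The task can start only after it arrives at the processor and the
        # processor is free, so CT = max(T_available, dispatch) + run time.
        p = min(procs, key=lambda q: max(t_available[q], dispatch[(t, q)]) + w[(t, q)])
        finish = max(t_available[p], dispatch[(t, p)]) + w[(t, p)]
        schedule.append((t, p, finish))
        t_available[p] = finish
    return schedule

Because the dispatch of one task can proceed while another task is executing, folding D_k,j into the completion time lets the scheduler overlap dispatch with computation rather than serialize them.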
5.3 Example Run of EFT-DT
We now illustrate the working of EFT-DT with the sample meta-task shown in Table 5.2; the corresponding dispatch times are listed in Table 5.3. The task priorities give the scheduling order {8, 2, 10, 1, 3, 9, 4, 7, 5, 6}.

Table 5.2 A sample meta-task (execution times)
Task   P1   P2   P3
1      15   13   15
2      16   20   16
3      17   16   11
4      20   13   10
5      11   11   12
6      14   14   12
7      15   11   15
8      20   17   13
9      19   10   16
10     18   19   18

Table 5.3 Meta-task dispatch times
Task   P1   P2   P3
1       6    6    7
2       9    8    5
3       7    8    9
4       6    9   10
5       8    7    7
6      10    7    6
7       9    7    7
8       6   10   10
9      10    9    5
10      9    8   10

Step 1: Schedule task 8 on processor P2.
Step 2: Schedule task 2 on processor P3.
Step 3: Schedule task 10 on processor P1.
Step 4: Schedule task 1 on processor P2.
Step 5: Schedule task 3 on processor P1.
Step 6: Schedule task 9 on processor P3.
Step 7: Schedule task 4 on processor P3.
Step 8: Schedule task 7 on processor P2.
Step 9: Schedule task 5 on processor P2.
Step 10: Schedule task 6 on processor P3.
The Gantt chart for the meta-task is provided in Figure 5.2.

Figure 5.2 Gantt chart for the meta-task

5.4 Simulation Study
We developed a simulator with the following input parameters to compare the performance of EFT-DT and the FIFO approach:
n: meta-task size
size_max: maximum size of a task within the meta-task
dis_max: maximum dispatch time of a task
std_dev: standard deviation of the task sizes in the meta-task
proc_dev: standard deviation of the processor speeds
num_proc: number of processors used

The input parameters were set to the following values in our simulation study:
n = {1000, 2000, 5000, 7500, 10000}
size_max = {100}
dis_max = {50}
std_dev = {5, 10, 15, 20, 25, 30}
proc_dev = {2, 4, 6}
num_proc = {5, 10, 15, 20}

Table 5.4 Parameter values
Parameter              Minimum   Maximum   Standard deviation
Task size              10        100       5-30
Dispatch times         10        50        -
Processor speeds       1         10        2, 4, 6
Number of tasks        1000      10000     -
Number of processors   5         20        -
Each data point in the results that follow is an average of four readings obtained with different numbers of processors. Figures 5.3 through 5.20 compare the makespans of EFT-DT and FIFO for every combination of std_dev and proc_dev; the average improvement of EFT-DT over FIFO in each case is summarized below.

Average makespan improvement of EFT-DT over FIFO (Figures 5.3-5.20)
               std_dev=5   10    15    20    25    30
proc_dev = 2   28%   29%   28%   30%   29%   30%   (Figures 5.3-5.8)
proc_dev = 4   29%   28%   31%   31%   30%   29%   (Figures 5.9-5.14)
proc_dev = 6   30%   29%   32%   28%   32%   30%   (Figures 5.15-5.20)

From all these average improvements, the overall average improvement of EFT-DT over FIFO is 29%.

Figures 5.3-5.20 Average makespan of FIFO and EFT-DT vs. number of tasks in the meta-task, for std_dev = 5, 10, 15, 20, 25, 30 and proc_dev = 2, 4, 6
Figures 5.21 through 5.23 plot the percentage improvement of EFT-DT over FIFO against std_dev, proc_dev and the number of tasks in the metatask, respectively.

CHAPTER 6

CONCLUSION

In this dissertation, we presented three new algorithms: the Heterogeneous Critical Node First (HCNF) algorithm, the Heterogeneous Largest Task First (HLTF) algorithm and the Earliest Finish Time with Dispatch Time (EFT-DT) algorithm. These algorithms were compared against existing algorithms through extensive simulation, and the results were presented in the earlier chapters. This chapter summarizes those results briefly and provides concluding remarks.

The experimental test suite for the HCNF algorithm consisted of application graphs, trace graphs, RGBOS and RGPOS graphs, among others. HCNF was first compared against HEFT using these graph sets and later against both HEFT and STDS using a parametric random graph generator. We now summarize the experimental results.

The SLR and speedup of HCNF and HEFT were compared using graphs of different sizes. The average SLR of HCNF was better than that of HEFT by 12.3% and its speedup was better by 7.9%. The SLR and efficiency of HCNF and HEFT were compared using DAGs representing the Gaussian elimination algorithm; HCNF outperformed HEFT by an average of 25.7% with respect to SLR and by 22.6% with respect to efficiency. Using trace graphs, HCNF outperformed HEFT in SLR and speedup by an average of 29.5% and 38.4% respectively. Using RGBOS graphs, whose optimal schedules can be obtained by a branch-and-bound technique, HCNF outperformed HEFT in SLR and speedup by 32.5% and 24.6% respectively. Using RGPOS graphs, HCNF outperformed HEFT in SLR and speedup by 21.1% and 16.9% respectively. The SLR and speedup of HCNF and HEFT were also compared using application graphs, which represent a few numerical parallel application programs. This set contains over 320 graphs in six categories: Cholesky factorization, LU decomposition, Gaussian elimination, FFT, Laplace transforms and Mean Value Analysis (MVA); the number of nodes ranges from 100 to 300. On average, HCNF outperformed HEFT in SLR and speedup by 27.5% and 22.7% respectively.

The SLR of HCNF, HEFT and STDS was compared using the parametric random graph generator. The average SLR improvement of HCNF over STDS is approximately 6%, and over HEFT approximately 10%. The speedup of the three algorithms was also compared using the parametric random graph generator; the average improvement in the speedup of HCNF over STDS is 9%, and over HEFT is 14%.
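The SLR, speedup and efficiency figures quoted in this summary follow the performance parameters defined in Section 3.4.1. For reference, here is a minimal sketch of the two most frequently used metrics, assuming the definitions standard in the heterogeneous DAG-scheduling literature; the data layout and example values are illustrative. CCR, used in the comparisons that follow, is the usual communication-to-computation ratio of a graph.

```python
# Sketch of the two headline metrics, assuming the standard definitions:
#   SLR     = makespan / sum of the minimum computation costs of the nodes on
#             the DAG's critical path
#   speedup = (best single-processor execution time of the whole DAG) / makespan
def slr(makespan, critical_path, cost):
    """cost[v] lists node v's computation cost on each processor."""
    cp_min = sum(min(cost[v]) for v in critical_path)
    return makespan / cp_min

def speedup(makespan, cost):
    num_procs = len(next(iter(cost.values())))
    seq_times = [sum(cost[v][p] for v in cost) for p in range(num_procs)]
    return min(seq_times) / makespan

# Hypothetical three-node DAG on two processors.
cost = {0: [4.0, 6.0], 1: [3.0, 2.0], 2: [5.0, 4.0]}
print(slr(7.0, critical_path=[0, 2], cost=cost))   # 7 / (4 + 4) = 0.875
print(speedup(7.0, cost=cost))                     # 12 / 7 = 1.71
```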
The average SLR values for CCR values ranging from 0.1 to 1.0 in steps of 0.1 were compared using the parametric random graph generator; the average improvement of HCNF over STDS is 4% and over HEFT is 11%. The average SLR values for CCR values ranging from 1.0 to 5.0 in steps of 0.5 were also compared; the average improvement of HCNF over STDS is 7% and over HEFT is 11%. The superior performance of HCNF can be attributed to its low-cost task duplication strategy, which facilitates earlier start times for many nodes that would otherwise have to wait for all data items to arrive from their favorite predecessors. HCNF could be improved by exploring the duplication of the second and third favorite predecessors (if any) to further expedite the start times of nodes; the feasibility of such an approach needs to be investigated.

The average makespan of the HLTF and Sufferage algorithms was compared using different metatask sizes and different values of std_dev. The average improvement of HLTF over Sufferage was 4.13%. The running times of HLTF and Sufferage were compared using metatasks of different sizes; for metatask sizes greater than 1000, HLTF shows an improvement of over 1000%. The superior running-time performance of HLTF can be attributed to the low-complexity sorting technique used by the algorithm.

Experiments were conducted to compare the makespans of EFT-DT and FIFO. The overall average improvement of EFT-DT over FIFO is 30%. The superior performance of EFT-DT over FIFO can be attributed to the overlapping of task dispatch times with task execution times.
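This overlap argument can be made concrete with a small timing sketch. The code below only illustrates why overlapping dispatch with execution shortens a schedule; it is not the EFT-DT algorithm itself, and the two-processor setup and task values are hypothetical.

```python
# Illustrative timing model (not the EFT-DT algorithm): tasks are dispatched
# one at a time over a shared channel, but once dispatched, a task executes on
# its processor while later tasks are still being dispatched.
def makespan_with_overlap(assignments):
    """assignments: list of (dispatch_time, exec_time, processor_id) in dispatch order."""
    dispatcher_free = 0.0          # the dispatch channel is a serial resource
    proc_free = {}                 # earliest idle time of each processor
    finish = 0.0
    for dispatch, exec_time, proc in assignments:
        dispatch_end = dispatcher_free + dispatch
        dispatcher_free = dispatch_end
        start = max(dispatch_end, proc_free.get(proc, 0.0))
        end = start + exec_time
        proc_free[proc] = end
        finish = max(finish, end)
    return finish

tasks = [(2, 10, 0), (2, 8, 1), (2, 12, 0), (2, 6, 1)]   # hypothetical values
print(makespan_with_overlap(tasks))                      # 24.0 with overlap
print(sum(d + e for d, e, _ in tasks))                   # 44 if nothing overlaps
```

EFT-DT, as its name suggests, additionally assigns each task to the processor giving the earliest finish time while accounting for its dispatch time; the sketch only shows why the overlap itself is beneficial.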