Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classified information.

Jun Huang

Certificate of Approval:

Victor P. Nelson, Professor, Electrical and Computer Engineering
Soo-Young Lee (Chair), Professor, Electrical and Computer Engineering
Alvin S. Lim, Associate Professor, Electrical and Computer Engineering
Stephen L. McFarland, Dean, Graduate School

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 16, 2005

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author

Date of Graduation

Dissertation Abstract

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang
Doctor of Philosophy, December 16, 2005
(M.S., University of Science and Technology of China, 2000)
(B.S., University of Science and Technology of China, 1997)
132 Typed Pages
Directed by Soo-Young Lee

As the domain of computing and communication systems grows, heterogeneity among the computers and subnetworks employed for a task also increases. It is important to understand how heterogeneity affects the performance of computing and communication tasks in order to utilize heterogeneous resources optimally. However, previous research in this field either did not take the effects of heterogeneity into account explicitly or did not analyze them thoroughly. In this dissertation, the effects of heterogeneity are analyzed, and heterogeneity-aware approaches are proposed for both computing and communication systems.

In the computing system context, temporal heterogeneity refers to the variation, along the time dimension, of the computing power available for a task on a computer, and spatial heterogeneity represents the variation among computers. The effects of heterogeneity on the performance of a target task have been analyzed in terms of the mean and standard deviation of the parallel execution time. The results reveal that, in minimizing the average parallel execution time of a target task on a spatially heterogeneous computing system, it is not optimal to distribute the target task linearly proportional to the average computing powers available on the computers. Based on the analysis results, an approach to load balancing for minimizing the average parallel execution time of a target task is described. The proposed approach, whose validity has been verified through simulation, considers temporal and spatial heterogeneity in addition to the average computing power of each computer.

In the communication system context, the concept of temporal and spatial heterogeneity in the available communication resources is applicable to various levels of a communication network. The effects of heterogeneity on the performance of individual messages in a heterogeneous communication system are analyzed.
A heterogeneity-aware approach to source rate control is proposed, which utilizes the heterogeneity information on the available bandwidth in a UDP-based protocol to improve throughput, dropping rate, end-to-end delay, and real-time performance. The two main components of the heterogeneity-aware source rate control scheme are a receiver-side feature extractor and a sender-side adaptive rate controller. The feature extractor captures the dynamic features of the bandwidth heterogeneity, and the source rate controller utilizes the extracted features in the rate control. The performance of the proposed source rate control scheme has been analyzed in detail through extensive simulation for single- and multiple-path media streaming, and for multiple HAA and/or TCP flows.

Acknowledgments

First and foremost, I want to express my thanks and gratitude to my advisor, Dr. Soo-Young Lee, for his guidance, encouragement, and support in every step of my doctoral program at Auburn University. His comprehensive knowledge of and insight into the research were of great help to the completion of this dissertation. His understanding and patience are sincerely appreciated, and his teaching on academic research and many other aspects will be an everlasting treasure in my life.

My thanks also go to the members of my dissertation committee, Dr. Victor Nelson, Dr. Alvin Lim, and my outside reader, Dr. Kai Chang. In spite of their very busy schedules, they read the drafts of this dissertation and provided many valuable comments. This dissertation owes a great deal to the time and attention they have given to it.

Last, but not least, I want to thank my family for their understanding and love along the path of my academic pursuits, and many of my friends for their generous help while I was in Auburn. Without their continuous support and encouragement, it would never have been possible to finish this dissertation.

Style manual or journal used: Transactions of the American Mathematical Society (together with the style known as "auphd"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style file auphd.sty.

Table of Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Formulation
    1.1.1 Computing Systems
    1.1.2 Communication Systems
  1.2 Related Work
    1.2.1 Heterogeneous Computing Systems
    1.2.2 Heterogeneous Communication Systems
    1.2.3 Network Measurement
    1.2.4 Source Rate Control
  1.3 Motivation
  1.4 Research Objectives
    1.4.1 Computing Systems
    1.4.2 Communication Systems

2 Effects of Heterogeneity on a Target Computing Task
  2.1 System Model
    2.1.1 Availability
    2.1.2 Heterogeneity
  2.2 Task Model
  2.3 Performance Measures
    2.3.1 Mean and Standard Deviation of Execution Time
    2.3.2 Spatially Homogeneous Environment
    2.3.3 Synchronization
    2.3.4 Stability of Performance
    2.3.5 Temporal and Spatial Heterogeneity
      2.3.5.1 Temporal Heterogeneity
      2.3.5.2 Spatial Heterogeneity
  2.4 Simulation Results and Discussion
    2.4.1 Simulation Setup
    2.4.2 Effects of Heterogeneity on a Target Task
      2.4.2.1 Spatially Homogeneous and Temporally Heterogeneous
      2.4.2.2 Spatially and Temporally Heterogeneous
      2.4.2.3 Synchronization
      2.4.2.4 Granularity of Availability
      2.4.2.5 Communication Overhead
      2.4.2.6 Load Distribution and Stability (Risk Factor)

3 Load Balancing on a Heterogeneous Computing System
  3.1 Two-Step Heuristic Approach
    3.1.1 When $s_i = s_j$ for all $i, j$
    3.1.2 When $s_i \ne s_j$ for some $i, j$
  3.2 Simulation Results and Discussion

4 Effects of Heterogeneity on an Individual Network Traffic
  4.1 Network Model
    4.1.1 Topology
    4.1.2 Performance Measures
  4.2 Effects of Heterogeneity
    4.2.1 Feedback Latency
    4.2.2 Temporally Heterogeneous and Spatially Homogeneous
      4.2.2.1 Serial Path
      4.2.2.2 Parallel Path
    4.2.3 Temporally and Spatially Heterogeneous
      4.2.3.1 Serial Path
      4.2.3.2 Parallel Path
  4.3 Applications
    4.3.1 Path Selection
    4.3.2 Multi-Path Data Transfer

5 Source Rate Control for Heterogeneous Communication Systems
  5.1 Temporal Heterogeneity under Different Time Scales
  5.2 HAA Source Rate Control
    5.2.1 Overview of the HAA Source Rate Control Scheme
    5.2.2 Bottleneck Bandwidth Measurement
    5.2.3 Feature Extractor
    5.2.4 Simple Source Rate Allocation Approaches
    5.2.5 Heterogeneity Aware Approach

6 Performance of the Source Rate Control Scheme
  6.1 Simulation Setup
  6.2 Single Flow
    6.2.1 Adaptation to Long-Term Bandwidth Variation
    6.2.2 Feature Extractor Size and Average Bandwidth Updating Interval
    6.2.3 HAA Compared with Packet Spacing Protocol (PSP)
    6.2.4 HAA versus Mean-Only and Minimum-Bandwidth Schemes
    6.2.5 Effects of Spatial Heterogeneity
    6.2.6 Packet Inter-arrival Time
    6.2.7 Adaptive Penalty
    6.2.8 Path Selection
    6.2.9 Multi-Path Media Streaming
  6.3 Multiple Flows
    6.3.1 Multiple TCP Flows
    6.3.2 Multiple HAA Flows
      6.3.2.1 Overall Performance
      6.3.2.2 Dynamic Performance
    6.3.3 Comparison with Multiple TCP Flows
      6.3.3.1 Aggregated Bandwidth Utilization
      6.3.3.2 Inter-flow Fairness
    6.3.4 TCP Friendliness

7 Conclusion and Future Work
  7.1 Computing Systems
  7.2 Communication Systems
  7.3 Future Work
    7.3.1 Data-Intensive Computing Systems
    7.3.2 Application to Overlay Networks

Bibliography

List of Figures

2.1 Temporal and spatial heterogeneity. $A_i(t)$ is the availability of computing power on $C_i$ for a target task, where $i = 1, \ldots, N$.
2.2 (a) Average execution time and (b) normalized standard deviation of execution time on a single computer where $X = 100$.
2.3 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time on $N$ computers where $X_i = 100$ for $i = 1, \ldots, N$.
2.4 (a) Average execution time and (b) normalized standard deviation of execution time when $SH$ and $TH$ are varied with $N$ fixed at 8, where $X_i = 100$ for $i = 1, \ldots, 8$.
2.5 (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ remains fixed.
2.6 (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ increases.
2.7 Speed-up when (a) $\bar{\sigma}_A^{mean}$ remains fixed and (b) $\bar{\sigma}_A^{mean}$ increases as $N$ increases. $X_i = X/N$ for $i = 1, \ldots, N$. $X = 100$.
2.8 (a) Average execution time and (b) normalized standard deviation of execution time as functions of $N_s$. $X = 1000$, $N = 10$, and $X_i = X/N = 100$.
2.9 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of $SH$ and $N_s$. $X = 1000$, $N = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = 0.28$.
2.10 (a) Average execution time and (b) normalized standard deviation of execution time as functions of the interval. $X = 1000$, $N = 10$, and $X_i = X/N = 100$.
2.11 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of $SH$ and the interval. $X = 1000$, $N = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = 0.28$.
2.12 (a) Average execution time (computation and communication times) and (b) normalized standard deviation of execution time as functions of $SH_{comm}$ and $SH_{comp}$. $N = 20$, $X = 2000$, $N_s = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = \bar{\sigma}_B^{mean} = 0.28$.
2.13 Average execution time: (a) when communication overhead is independent of $N$ and $X = 400$, and (b) when communication overhead increases linearly proportionally to $N$ and $X = 2000$. $N_s = 10$ and $\bar{\sigma}_A^{mean} = \bar{\sigma}_B^{mean} = 0.28$.
2.14 (a) Average execution time and (b) normalized standard deviation of execution time on two computers where $X_1 = \frac{X}{2} - dX$, $X_2 = \frac{X}{2} + dX$, and $X = 100$.
2.15 Risk factor on two computers where $X_1 = \frac{X}{2} - dX$, $X_2 = \frac{X}{2} + dX$, and $X = 200$. $\tau_d = 221$. $\Delta\bar{\sigma}_A = \bar{\sigma}_{A_2} - \bar{\sigma}_{A_1}$.
2.16 (a) Average execution time and (b) normalized standard deviation of execution time on 10 computers. $X = 100$ and $X_i = X_0 + Slope_{X_i}(i-1)$ for $i = 1, \ldots, 10$.
2.17 Risk factor on 10 computers: (a) $\tau_d = 40$ and (b) $\tau_d = 28$. $X = 100$ and $X_i = X_0 + Slope_{X_i}(i-1)$ for $i = 1, \ldots, 10$.
3.1 Equalization scheme: (a) after Step (1), where a target task is partitioned such that the average execution time is the same on all computers, and (b) after Step (2), where the partitioning is adjusted such that $t'_i + \sigma'_{T_i}$ is equalized for all computers.
3.2 (a) Average parallel execution time and (b) percentage standard deviation of $t'_i + \sigma'_{T_i}$ after Step (2) in the load balancing, when the number of computers, $N$, is 10. After Step (1), $t_i = 100$ and $\Delta\sigma_T = \sigma_{T_N} - \sigma_{T_1}$ where $\sigma_{T_i} = \frac{i-1}{N-1}\sigma_{T_{max}}$, for $i = 1, \ldots, 10$.
3.3 Average parallel execution time on two computers (a) with a fixed $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_2} = 50$ ($\sigma_{T_2} = 10$) and (b) with a fixed $\Delta s = 4$, where $\Delta s = s_2/s_1$ ($s_1 = 1$) and $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_2}$ ($\sigma_{T_2} = 0$).
3.4 Percentage reduction in average parallel execution time as a function of the number of computers (a) with a fixed $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_N} = 60$ ($\sigma_{T_N} = 0$) and (b) with a fixed $\Delta s = 5$, where $\Delta s = s_N/s_1$ ($s_1 = 1$) and $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_N}$ ($\sigma_{T_N} = 0$). After Step (1), $t_i = 100$ for all $i$.
4.1 Illustration of (a) a serial path and (b) a parallel path where $N_p = 2$.
4.2 Temporal bandwidth heterogeneity on each channel and spatial bandwidth heterogeneity among channels.
4.3 Effects of $T_{feedback}$ on (a) throughput and (b) end-to-end delay. $T_{interval} = [0.01, 5.00]$ s, $T_{update} = 0.01$ s, $N = 4$, $b_i = 1000$ pps, and $\bar{\sigma}_{B_i} = \bar{\sigma}_B$ for all $i$.
4.4 (a) Throughput and (b) end-to-end delay on a temporally heterogeneous serial path of $N$ links. $S = 10^6$ packets. $b_i = 1000$ pps and $\bar{\sigma}_{B_i} = \bar{\sigma}_B$ for $i = 1, \ldots, N$.
4.5 (a) Transfer time and (b) throughput on a temporally heterogeneous parallel path consisting of $N_p$ paths with one link per path. $S_j = 10^5$ packets and $b_j = 1000$ pps for $j = 1, \ldots, N_p$.
4.6 Throughput and end-to-end delay on a temporally and spatially heterogeneous serial path with $b_i = 1000$ pps for $i = 1, \ldots, N$. (a) & (b): $\bar{\sigma}_{B_i} = 0.1$ for $i = 2, \ldots, N$; (c) & (d): $\bar{\sigma}_{B_i} = 0.52$ for $i = 2, \ldots, N$.
4.7 Transfer time on a temporally and spatially heterogeneous parallel path: (a) each path is a single link, and (b) each path consists of 2 links, where $\Delta\bar{\sigma}_B = \bar{\sigma}_{B_1} - \bar{\sigma}_{B_{N_p}}$. $b_j = 1000$ pps and $S_j = 10^5$ packets for $j = 1, \ldots, N_p$. $\bar{\sigma}_{B_j}$ is varied linearly over the $N_p$ paths for $j = 1, \ldots, N_p$ such that its mean is fixed at 0.26.
4.8 (a) A parallel path where the number of links is the same for all paths, and (b) a parallel path where the number of links varies with path. The number pair in brackets over each link represents $b_i$ and $\bar{\sigma}_{B_i}$ of the link.
4.9 Comparison of the three paths in Figure 4.8-(a) in terms of (a) end-to-end delay and (b) transfer time.
4.10 Comparison of the three paths in Figure 4.8-(b) in terms of (a) end-to-end delay and (b) transfer time.
4.11 Reduction in transfer time on a parallel path ($N_p = 2$). $b_j = 1000$ pps for $j = 1, 2$. $S = S_1 + S_2 = 2 \times 10^6$ packets.
4.12 Reduction in transfer time ((a), (b)) and end-to-end delay ((c), (d)) on a parallel path with $N_p = 2$. $N_j = 2$ for $j = 1, 2$. $S = S_1 + S_2 = S'_1 + S'_2 = 2 \times 10^6$ packets, $\bar{\sigma}_{B_1} = 0.06$, $\bar{\sigma}_{B_2} = 0.46$, and $b_1 + b_2 = 2000$ pps.
5.1 Illustration of long-term and short-term temporal heterogeneity of the available bandwidth sensed by one flow (flow0) on a path.
5.2 The abstract architecture of the HAA source rate control scheme.
5.3 Illustration of the proposed HAA communication model.
5.4 Illustration of packet-pair based bandwidth measurement. The horizontal dimension is time, and from top to bottom along the vertical dimension are the sender, intermediate links, and the receiver. Numbers in small boxes are the packet sequence numbers.
5.5 Feature extractor.
5.6 Initialization, iteration, and extraction code for the feature extractor.
5.7 QoS requirements and valid range of $\beta$.
5.8 Pseudo-code of the proposed SDFI scheme.
5.9 Operation of the SDFI scheme.
6.1 Simulation testbench topology for a single path.
6.2 Comparison of three source rate control schemes on a path with an inter-router link whose available bandwidth (ABBW) varies with time. The penalty factor ($\beta$) is set to 0.7 for HAA.
6.3 Effects of feature extractor size ($L$) and $t_{interval}$. $N = 4$, $\bar{\sigma}_{B_i} = 0.17$ for all 4 links.
6.4 Comparison of HAA ($\beta = 0$) with PSP.
6.5 $N = 4$; each link's bandwidth ($B_i(t)$) follows a uniform distribution with deviation $\bar{\sigma}_{B_i}$. "(min)" in the legend stands for the "minimum-bandwidth" scheme.
6.6 Effects of path length. All $\bar{\sigma}_{B_i}$ are set to 0.17.
6.7 $N = 4$, heterogeneous links; all links have the same mean $ABBW_i = 2$ Mbps. Distributions of $\bar{\sigma}_{B_i}$ are as shown in the legends.
6.8 Histograms of inter-arrival times. 128 bins are used.
6.9 Deviation of inter-arrival time.
6.10 Illustration of the proposed adaptive rate control scheme.
6.11 A parallel path where the number of links varies with path. The number pair in brackets over each link represents $b_i$ and $\bar{\sigma}_{B_i}$ of the link.
6.12 Throughput and end-to-end delay when the penalty factor ($\beta$) is set to 1.0.
6.13 Simulation testbench topology for multi-path media streaming.
6.14 2 paths, each with 4 links; all links have the same $b_i = 2$ Mbps but different deviation $\bar{\sigma}_{B_i}$.
6.15 Application-level measures for the scenario in Figure 6.14.
6.16 Adaptive rate control on a multi-path network.
6.17 Comparison with a TCP implementation for 2 paths, each consisting of 4 links.
6.18 Comparison with a TCP implementation (2 paths, $\beta$ fixed to 1.0).
6.19 Application-level measures (2 paths, $\beta$ fixed to 1.0).
6.20 Topology of the testing network: multiple flows sharing a network path.
6.21 Aggregated bandwidth utilization of multiple TCP flows. $W_{max} = 15$. (Note that some measured utilization curves overlap the estimated maximum ones.)
6.22 Arrival time of multi-flow TCP packets.
6.23 Multiple HAA flows sharing a network path consisting of 4 links; all links have the same $b_i = 2$ Mbps and $\bar{\sigma}_{B_i} = 0.17$.
6.24 Aggregated bandwidth utilization of multiple HAA flows.
6.25 Two On-Off flows with $T_{switch} = 6$.
6.26 Dynamic performance of multiple On-Off flows. $N = 4$, $b_i = 2$ Mbps for all links, and $\beta = 1.0$ for all flows.
6.27 Improvement in aggregated bandwidth utilization.
6.28 Fairness comparison with TCP.
6.29 Multiple HAA and TCP flows.

List of Tables

3.1 Possible reduction in $\mu$.

Chapter 1
Introduction

1.1 Problem Formulation

1.1.1 Computing Systems

A cluster of computers is commonly used for high performance computing these days [1][2]. Such a system is often shared by multiple users, and the computers in the system may not be "identical." One of the most distinctive characteristics of such an environment, compared to a dedicated multiprocessor system, is its heterogeneity [3]. Heterogeneity of such shared systems has been addressed from various viewpoints of high performance computing, but much of the research has been devoted to scheduling a stream of tasks (i.e., from the system's viewpoint) [4][5][6]. A measure of available computing power may be used to quantify the CPU share allocated to a given task, which represents the percentage of CPU cycles (or time slots) per unit time that the task can utilize at a given moment. This measure usually varies with both time and space.
Therefore, the heterogeneity in computing systems can be classified into two types: temporal heterogeneity and spatial heterogeneity. The former refers to the variation of available resources with time on a computer, and the latter represents the variation of available resources among computers.

1.1.2 Communication Systems

A communication network, whether wired or wireless, is normally shared by multiple users who may submit their requests at any time. Heterogeneity in communication systems can also be classified into two types: (1) the communication resources available on a channel for a request (or for traffic, messages) vary with time, i.e., a channel is temporally heterogeneous, and (2) different channels may have different characteristics in terms of the communication resources available to them, i.e., the channels in a network are spatially heterogeneous.

The concept of temporal and spatial heterogeneity in the available communication resources is applicable to various levels of a communication network. For example, temporal heterogeneity may be considered within individual sub-networks and spatial heterogeneity among sub-networks. These two types of heterogeneity can have a significant effect on the Quality of Service (QoS) of an individual request, such as delay and transfer time, and also on network-wide performance measures such as utilization, system throughput, and queue length.

It is much more difficult to quantify the available resources in communication systems than in computing systems. There are various sources of heterogeneity on the Internet, and they can be divided into the following three main categories [7]:

- The topology of the Internet is constantly changing, and the types of link (slow modems, fiber optic links, copper or glass wires, radio, infrared, etc.) span a wide range and are generally unknown to an end-user. Also, multi-path and dynamic routing introduce extra complexity.

- Many protocols coexist in the Internet, and each protocol has been implemented by multiple communities. For example, the widely used Transmission Control Protocol (TCP) has many different implementations, e.g., Tahoe, Reno, NewReno, and Sack. Also, different applications ask for different features.

- Background competing traffic varies constantly in an unpredictable way.

Each individual source of heterogeneity is either not directly observable or cannot be measured accurately. For protocol heterogeneity, a session is only able to know its own protocol, and it is almost impossible to have knowledge of other sessions' protocols. The heterogeneity due to background competing traffic is also hard to model in detail. Since it is either impossible or unnecessary to describe the different components of heterogeneity individually, "available bandwidth" is used in this dissertation as a measure to quantify the "net," or aggregated, effect of all the sources of heterogeneity.

1.2 Related Work

1.2.1 Heterogeneous Computing Systems

One of the important issues is how to utilize a heterogeneous computing system so as to optimize the performance of a target task. This issue involves modeling the heterogeneity of the system [8][9], performance estimation [10][11][12][13], reliability [14], scheduling and load balancing [15], etc.

The concepts of "machine heterogeneity" and "task heterogeneity" are employed in an effort to model task execution time on heterogeneous computing systems [8][9].
Machine heterogeneity refers to the degree to which the execution times vary across machines for a given task, and task heterogeneity refers to the degree to which the execution times vary across tasks on a given machine. The main objective was to simulate different heterogeneous computing environments in evaluating the behavior of task mapping heuristics.

A performance estimation model for heterogeneous parallel applications is proposed in [10]. The authors addressed the heterogeneity among the varying platforms and operating environments used, and the possibility of having multiple tasks of the parallel application on each machine.

In [11], a model that calculates the slowdown imposed on applications in time-shared multi-user clusters is defined. Three kinds of "slowdown" are identified, and the effects of slowdown on application performance are verified with emulated loads.

In [12], the issue of predicting the parallel execution time of non-deterministic bulk-synchronous computations, where the computation time on each computer is modeled as a random variable, is considered. A tighter bound on the average parallel execution time has been derived. Though the randomness of computation time does not originate from heterogeneity of the computing system in their model, the results may be applicable to estimating parallel execution time on a heterogeneous computing system with minor modification.

In [13], the issue of modeling and characterizing the parallel computing performance of a heterogeneous Network Of Workstations (NOW) is considered, and the effects of heterogeneity on efficiency and scalability are investigated. The computing power (speed) "weight" is used to quantify the spatial heterogeneity among workstations, but the temporal heterogeneity of computing power on each workstation is not considered.

In [14], the probability of application failure is defined to quantify the reliability of a heterogeneous computing system. Minimizing execution time usually conflicts with improving reliability, and algorithms capable of trading off execution time for reliability are proposed.

In [15], a "stochastic scheduling" strategy is described, in which performance variability is considered in partitioning a data-parallel application to be executed on a heterogeneous computing platform. The mean and standard deviation of the predicted completion time of a task are used in scheduling, along with a "tuning factor" which specifies how "conservative" the scheduling is to be.

1.2.2 Heterogeneous Communication Systems

Realization of an overlay network involves mapping its topology onto the base (physical) network [16][17][18][19]. This mapping requires finding a "path" in the base network for each "edge" in the overlay network. Here, a path may consist of one or more channels or links. The path is to be selected such that it satisfies the requirements (for example, bandwidth) of the corresponding edge. Also, multiple paths may be employed for an edge when a single path cannot meet its requirements. Heterogeneity must be considered in the mapping of an overlay network and in the scheduling of messages in the network, in order to optimize its performance.

In an effort to minimize communication overhead in high performance computing, multiple communication paths between computing nodes were employed [20]. The main issues are how to choose one path over the others and how to partition a large amount of data (a message) among multiple paths.
The authors considered path "characteristics," such as latency, in path selection and aggregation. However, only spatial heterogeneity in terms of the average characteristics was considered; that is, temporal variation of the characteristics was not taken into account.

The idea of utilizing multiple paths was also employed for video streaming in the "path diversity" framework [21]. Here, the objectives are to minimize packet loss caused by congestion due to time-varying bandwidth and to provide sufficient bandwidth for media streaming, by using more than one path simultaneously from a source to a destination. A quantitative measure of temporal heterogeneity in the bandwidth was not considered in their algorithms.

In order to quantify (temporal) heterogeneity in the channel bandwidth, one may consider a second-order measure of the bandwidth in addition to its mean. Earlier, heterogeneity in the available computing powers of computing nodes on a cluster or Grid was studied in terms of load balancing [22][23], and it was shown that the average parallel execution time of a task can be substantially reduced by taking the standard deviation of the available computing power on each node into account when partitioning the task over multiple nodes.

Most of the previous work on communication bandwidth management utilized either the average or the instantaneous bandwidth. Nevertheless, there have been some research efforts on scheduling and allocation problems under varying channel conditions [24][25][26]. Basically, heuristic approaches are taken to compensate for variations in the channel conditions in order to satisfy certain QoS requirements. Another load-aware routing protocol [27] also considers only the average load at intermediate nodes. However, heterogeneity was not explicitly taken into account. More specifically, the second-order moment of the available channel bandwidth was not quantitatively utilized to optimize network performance.

1.2.3 Network Measurement

Measurement of the bottleneck bandwidth over a network path may be performed on either the sender (source) or the receiver (sink) side. Sender-side measurement avoids the need to modify software on the receiver end, and therefore a new protocol is easier to deploy. However, the measurement delay for a sender-side scheme is about twice that of a receiver-side scheme. Where to perform the measurement must be determined based on the requirements of the specific application.

Several schemes for probing link states based on "packet pairs" were proposed in [28][29][30]. The sender transmits probing packets in the form of back-to-back pairs and estimates the bottleneck service rate from the spacing of the acknowledgements (ACKs). However, this approach relies on several unrealistic assumptions, which may lead to inaccurate measurements. In order to improve the accuracy, an intersection filtering method with a multi-phase variable packet size was used in estimating the bandwidth [31]. The main problem with this scheme is that it is difficult to properly set the bin width and boundaries of the histogram without knowing the shape of the distribution in advance. To overcome this difficulty, a packet-pair scheme combined with bandwidth filtering using a kernel density estimator algorithm was proposed [32][33]. Note that the post-measurement processing/filtering steps in [31][32][33] are performed off-line; therefore, they would not be appropriate for applications where the real-time requirement is critical.
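The packet-pair principle referenced above reduces to a simple computation: when two packets of size P are sent back to back, the bottleneck link spreads them apart, and the observed spacing of the pair (or of the returning ACKs) gives the estimate B ≈ P / Δt. The sketch below is only an illustration of this estimate, with a median over repeated probes standing in, crudely, for the filtering steps discussed above; the packet size, the synthetic gap model, and all names are assumptions, not part of any cited scheme.

    import random
    import statistics

    PACKET_SIZE_BYTES = 1500          # assumed probe packet size

    def bottleneck_estimate(gap_seconds: float) -> float:
        """Packet-pair estimate: bottleneck rate ~ packet size / observed spacing (bits/s)."""
        return (PACKET_SIZE_BYTES * 8) / gap_seconds

    def filtered_estimate(gaps: list) -> float:
        """Median over repeated probes; a crude stand-in for the histogram/kernel
        filtering discussed above (cross traffic tends to widen some gaps)."""
        return statistics.median(bottleneck_estimate(g) for g in gaps)

    if __name__ == "__main__":
        true_rate = 2e6                                # assumed 2 Mbps bottleneck
        ideal_gap = PACKET_SIZE_BYTES * 8 / true_rate
        # Synthetic observations: queuing by cross traffic can only widen the spacing.
        gaps = [ideal_gap * (1.0 + random.expovariate(5.0)) for _ in range(50)]
        print("estimated bottleneck: %.2f Mbps" % (filtered_estimate(gaps) / 1e6))

Because widened gaps bias the raw estimates low, practical schemes filter toward the mode or minimum of the gap distribution; the median here merely keeps the example short.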
1.2.4 Source Rate Control

TCP uses explicit rate/congestion control. All implementations of TCP employ the so-called Additive Increase and Multiplicative Decrease (AIMD) algorithm first proposed by Jacobson in 1986 [34][35]. The main problems of this algorithm are its slow convergence speed and large rate oscillation [36]. Also, its multiplicative decrease scheme, which cuts the sending rate in half in response to each indication of link congestion, may unnecessarily waste the available network bandwidth. For a smoother and more gradual rate adaptation, many TCP-friendly congestion/rate control schemes have been proposed [37]. For example, the Loss-Delay based Adaptation algorithm (LDA+) [38] is a variant of the AIMD congestion control scheme, which adjusts the source rate dynamically based on the current network situation (using an estimate of the bottleneck bandwidth obtained by a packet-pair approach). The bandwidth share of a flow is determined using the TCP model during loss situations, and in the case of no loss it can be increased by a value that does not exceed the increase of the bandwidth share of a TCP connection.

Because AIMD-like control mechanisms mostly focus on fairness among competing TCP traffic flows and usually lead to bursty traffic, they are not preferred by many real-time applications, where one is more interested in per-flow performance such as throughput, end-to-end delay, and inter-packet arrival time (less jitter). A TCP-friendly solution has been proposed to space TCP packets injected into the network evenly over an entire round-trip time, so that data is not sent in a burst [39]. However, this approach often leads to significantly worse throughput than regular TCP because it is susceptible to synchronization losses and it delays congestion signals [40].

A feedback mechanism has been introduced for packet-spacing techniques [41][42], which enables UDP-based applications to perform well while being adaptive and self-regulating. This damped Packet-Spacing Protocol (PSP) transmits data at a near-optimal rate by using a simple feedback mechanism that reports packet loss every round-trip time. However, the only congestion indicator used in PSP is packet loss, which may not be sufficient for optimizing QoS on a network with a high degree of heterogeneity in bandwidth.

1.3 Motivation

In most of the previous work, (i) not all types of heterogeneity were taken into account, (ii) heterogeneity was often considered only implicitly, and (iii) the effects of heterogeneity were not thoroughly analyzed. The heterogeneity in both computing and communication systems needs to be represented quantitatively. It is also necessary to have a clear understanding of how heterogeneity affects the performance of a given task, and to show how the proposed approach can be applied to various application scenarios.

1.4 Research Objectives

The effects of various types of heterogeneity are analyzed to show the possibility of improving task performance by considering heterogeneity information. Heterogeneity-aware approaches (HAA) are proposed to improve the performance of both computing and communication systems.

1.4.1 Computing Systems

The effects of spatial and temporal heterogeneity on the performance of a target task are first analyzed in detail. It is shown that it is not optimal to partition a task over computers linearly proportional to their average computing powers in order to minimize the average parallel execution time.
A heterogeneity-aware approach (HAA) to load balancing, designed to minimize the average parallel execution time of a target task in a spatially heterogeneous computing environment, is then proposed in order to provide a theoretical foundation for developing a practical load balancing scheme. The proposed approach explicitly takes into account the standard deviation of the available computing power in addition to the average computing power available on each computer. It is shown that the average parallel execution time of a target task can be significantly reduced by this approach.

1.4.2 Communication Systems

TCP/AIMD-like source rate control schemes usually lead to bursty traffic and are therefore not preferred by many real-time applications. The use of feedback-based packet-spacing approaches [41][42] improves real-time QoS in some situations. However, "packet loss" by itself may not be sufficient for describing the dynamic features of a rapidly varying bandwidth. A bandwidth measurement technique could be used to provide additional information to the source rate controller (as in [38]), but previous researchers have used only the first-order feature of the bandwidth samples. In order to extract useful features from the measurement, some type of post-measurement technique needs to be used. In a real network, however, a rate control scheme cannot afford a time-consuming filtering algorithm [31][32][33].

In this dissertation, a heterogeneity-aware approach to source rate control is proposed, which is adaptive to the time-varying available (bottleneck) bandwidth. The proposed scheme extracts the second-order moment (standard deviation) as well as the first-order moment (mean) of the available bandwidth and explicitly takes them into account in determining the source rate dynamically. A simple feature extractor implemented at the receiving end measures the features, including the mean and standard deviation, using a packet-pair approach and sends them to the source. For the feature measurement, data packets are used instead of inserting control packets and, therefore, no extra traffic is generated. Applications of the HAA rate control scheme to path selection problems on an overlay network and to a multi-path media streaming framework are also considered.

Chapter 2
Effects of Heterogeneity on a Target Computing Task

2.1 System Model

The high performance computing system to be employed in this study consists of $N$ heterogeneous computers interconnected via a communication network. The speed of the processor, the speed and capacity of memory (including cache), the bandwidth of the I/O system, etc. may not be the same on all computers. Different software, including the OS, may be available on each of the computers.

All these hardware and software components collectively affect the effective computing power available to applications on each computer. It is the computing power available to a target task that eventually determines the execution time of the task. Hence, in this dissertation, "availability" of computing power and communication bandwidth is employed in defining heterogeneity.

2.1.1 Availability

Let the maximum computing power of computer $C_i$ for a target task be denoted by $\alpha_i$ for $i = 1, \ldots, N$, where $N$ is the number of computers. The computing power available for the task at time $t$ may be expressed as $\alpha_i A_i(t)$, where $A_i(t)$ is the "availability" of $C_i$ for the task at time $t$ and $0 \le A_i(t) \le 1$. The mean and standard deviation of $A_i(t)$ are denoted by $a_i$ and $\sigma_{A_i}$, respectively.
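To make the availability notation concrete, the sketch below generates one piecewise-constant trace $A_i(t)$ (the value is redrawn at random interval boundaries, anticipating the setup of Section 2.4.1) and estimates its mean $a_i$ and standard deviation $\sigma_{A_i}$ by time-weighted averaging. The distributions and constants are assumptions chosen only for illustration.

    import random

    def availability_trace(total_time=1000.0, max_interval=20.0, seed=1):
        """Piecewise-constant availability A(t): a list of (duration, availability)
        segments, with A drawn uniformly from [0, 1] on each segment."""
        rng = random.Random(seed)
        t, segments = 0.0, []
        while t < total_time:
            dur = rng.uniform(0.0, max_interval)
            segments.append((dur, rng.uniform(0.0, 1.0)))
            t += dur
        return segments

    def availability_stats(segments):
        """Time-weighted mean a_i and standard deviation sigma_{A_i} of the trace."""
        total = sum(d for d, _ in segments)
        mean = sum(d * a for d, a in segments) / total
        var = sum(d * (a - mean) ** 2 for d, a in segments) / total
        return mean, var ** 0.5

    if __name__ == "__main__":
        seg = availability_trace()
        a, sd = availability_stats(seg)
        print(f"a_i ~= {a:.3f}, sigma_A_i ~= {sd:.3f}")  # ~0.5 and ~0.29 for U[0, 1]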
In the steady state, $a_i$ and $\sigma_{A_i}$ are assumed to be fixed and do not vary with time, while $A_i(t)$ varies with time. In the time-varying state, not only $A_i(t)$ but also $a_i$ and/or $\sigma_{A_i}$ vary with time, i.e., one needs to use the notations $a_i(t)$ and $\sigma_{A_i}(t)$. The time-varying state is not considered in this dissertation.

Availability of communication bandwidth can be defined similarly, with the notations $\beta_i$, $B_i(t)$, $b_i$, and $\sigma_{B_i}$ for the maximum, instantaneous, mean, and standard deviation of bandwidth, respectively. To avoid redundancy, they are not defined separately. One difference is that $B_i(t)$ tends to decrease as $N$ increases, due to contention on the shared media, while $A_i(t)$ is independent of $N$, especially for data-parallel applications.

2.1.2 Heterogeneity

When a set of heterogeneous computers is shared by multiple independent users, the workload on each computer varies with time and with computer. Therefore, the workload, coupled with the heterogeneous hardware and software components, makes the availability (computing power) vary spatially and temporally. Temporal heterogeneity refers to the variation of availability along the time dimension on a computer, while spatial heterogeneity refers to the variation among computers, as illustrated in Figure 2.1. With the notation given in the definition of availability above, a computer $C_i$ exhibits temporal heterogeneity when $A_i(t)$ is a non-uniform function of time. A system consisting of multiple computers shows spatial heterogeneity when $a_i \ne a_j$ and/or $\sigma_{A_i} \ne \sigma_{A_j}$ for some $i, j$ where $i \ne j$. These two types of heterogeneity will be quantified in Section 2.3.5.

Figure 2.1: Temporal and spatial heterogeneity. $A_i(t)$ is the availability of computing power on $C_i$ for a target task, where $i = 1, \ldots, N$.

The temporal heterogeneity is due to the fact that users submit their tasks at any time and the size of a task is variable, i.e., task arrivals and sizes are random. There are two sources of spatial heterogeneity. One is system-related: the hardware and software components are not the same for all computers, and the interconnection network may not be "symmetric." The other is task-related: the task (workload) characteristics (arrival time and size) may vary with computer.

An interconnection network, or networks, is usually shared by the computers. As a result, the spatial heterogeneity of communication bandwidth tends to be lower than that of computing power.

2.2 Task Model

Two task models are employed in this dissertation: the base and augmented task models. In the base task model, a target task is assumed to be linearly partitionable over $N$ computers, i.e., the sum of the computational loads of the subtasks is equal to that of the target task. In order to focus on the effects of heterogeneous computing power on the performance of a target task, communication is not considered in the base model. Many data-parallel applications, such as low-level image processing, (Monte Carlo) simulations, matrix computation, etc., fit this model well. However, in many other applications, as a task is partitioned, communication among subtasks is unavoidable for sharing data and results, collecting local results, etc.

The augmented task model is derived by incorporating "periodic" communication phases into the base task model. That is, execution of a target task consists of a sequence of alternating computation and communication phases.
Hence, a communication phase can be considered as a synchronization point at which all $N$ computers are required to be synchronized for data communication before proceeding to the next (computation) phase. There are many applications which can be represented by this model, e.g., iterative optimization, simulation of a physical or chemical phenomenon, medical image reconstruction, etc. The number of synchronization points is denoted by $N_s$.

2.3 Performance Measures

The performance of a target task on a heterogeneous computing system may be measured in a few different ways. One popular metric is the execution time of the target task, whose minimization is the main objective of high performance computing. In a heterogeneous computing environment, the execution time varies significantly due to temporal and spatial heterogeneity and, therefore, the average execution time may be employed as a performance measure. From the stability point of view, one may want to minimize the variation of the execution time, in which case the standard deviation of the execution time can be used. Another way to quantify stability is to specify the probability that an execution time longer than a certain threshold is observed. In this section, the performance measures to be employed in this dissertation are derived.

Let the size of a target task be denoted by $X$ and the portion of the target task assigned to computer $C_i$ by $X_i$. The execution time of $X_i$ on $C_i$ is referred to as $T_i$, which is a random variable, and its mean and standard deviation are denoted by $t_i$ and $\sigma_{T_i}$, respectively. In this dissertation, the same notation is used for a random variable and its observation, in order to minimize the number of variables.

2.3.1 Mean and Standard Deviation of Execution Time

Referring to the definition of availability, the relationship between $X_i$ and $T_i$ may be expressed as follows:

\[ \int_0^{T_i} \alpha_i A_i(t)\, dt = X_i \]

By taking the expectation on both sides,

\[ t_i = \frac{X_i}{\alpha_i a_i} \qquad (2.1) \]

Assuming that $A_i(t)$ and $A_i(t')$ are uncorrelated, the standard deviation, $\sigma_{X_i}$, of the work completed during $t_i$ can be shown to be $\sigma_{X_i} = \sqrt{t_i}\,\alpha_i \sigma_{A_i}$. Then, the standard deviation of $T_i$ may be given by

\[ \sigma_{T_i} = \sqrt{t_i}\, \frac{\sigma_{A_i}}{a_i} \qquad (2.2) \]

The parallel execution time of $X$ in a run, denoted by $T$, which is also a random variable, is given by $T = \max_i \{ T_i \}$, where the notation $\{\,\}$ refers to a set. The mean ($\mu$) and standard deviation ($\sigma_T$) of the parallel execution time of $X$ are computed as follows:

\[ \mu = E[\,T\,] \qquad \sigma_T = \sqrt{E[\,(T - \mu)^2\,]} \]

where $E[\,]$ is the expectation operator.

2.3.2 Spatially Homogeneous Environment

When all $N$ computers have the same distribution of execution time ($T_0$) and the corresponding cdf (cumulative distribution function) is a monotonically increasing function, $\mu$ may be expressed as follows (note that the subscript $i$ is not used in this subsection, since all computers are identical, so that $T_i = T_0$ and $X_i = X_0$ for all $i$):

\[ \mu = \int N T_0\, F(T_0)^{N-1} f(T_0)\, dT_0 \qquad (2.3) \]

where $F(T_0)$ and $f(T_0)$ are the cdf and pdf (probability density function) of $T_0$, respectively.

It is possible to derive $\mu$ in the following form [12][43], referring to Equations 2.1 and 2.2:

\[ \mu = t + K(N)\,\sigma_{T_0} = t + K(N)\sqrt{t}\,\frac{\sigma_A}{a} = \frac{X_0}{\alpha a} + K(N)\sqrt{\frac{X_0}{\alpha a}}\,\frac{\sigma_A}{a} \qquad (2.4) \]

where $K(N)$ is an increasing function of $N$ and $K(1) = 0$. Notice that the average parallel execution time of a target task consists of two components, one depending on the mean of the availability and the other on the standard deviation of the availability, i.e., on the temporal heterogeneity.
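Equation 2.4 separates $\mu$ into a mean-availability term $t$ and a heterogeneity term $K(N)\sigma_{T_0}$. The following minimal sketch shows how the formula behaves: $K(N)$ is approximated as the expected maximum of $N$ standard normal variables (a modeling assumption; the dissertation only characterizes $K(N)$ as an increasing function with $K(1)=0$ and notes that the execution-time distribution is roughly Gaussian/Gamma), and Equation 2.4 is then evaluated with $t = X_0/(\alpha a)$ from Equation 2.1. All constants are illustrative.

    import random
    import statistics

    def expected_max_gain(N, runs=20_000, seed=1):
        """Monte Carlo estimate of K(N) = E[max of N standard normal variables],
        under the assumption that each T_i is approximately normal."""
        rng = random.Random(seed)
        return statistics.fmean(max(rng.gauss(0.0, 1.0) for _ in range(N))
                                for _ in range(runs))

    def mean_parallel_time(N, X0=100.0, alpha=1.0, a=0.5, sigma_A=0.14):
        """Evaluate Eq. 2.4: mu = t + K(N) * sqrt(t) * sigma_A / a,
        with t = X0 / (alpha * a) from Eq. 2.1."""
        t = X0 / (alpha * a)
        return t + expected_max_gain(N) * (t ** 0.5) * sigma_A / a

    if __name__ == "__main__":
        for N in (1, 2, 8, 32):
            print(f"N={N:3d}  mu ~= {mean_parallel_time(N):.1f}")  # grows with N

With these numbers, $\mu$ equals $t$ for $N = 1$ and grows slowly with $N$, which is exactly the penalty attributed to temporal heterogeneity in the discussion that follows.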
It is to be noted that temporal heterogeneity makes the average parallel execution time increase beyond the average (sequential) execution time on each computer. The increase is larger when the number of computers employed is greater. Also, as will be shown later, a higher spatial heterogeneity leads to a longer average parallel execution time.

2.3.3 Synchronization

When the communication phase in the augmented task model does not include data communication, it can be considered a pure synchronization point. That is, at each synchronization point, all computers need to wait for the "slowest" computer before they proceed to the next computation phase. Suppose that there are $N_s$ synchronization points in a target task, such that the amount of work to be done between two successive synchronization points is $X_0/N_s$. In order to isolate the effects of synchronization, let us assume that no data communication is actually carried out at each synchronization point. Referring to Equation 2.4, the mean parallel execution time in the $j$th phase, $\mu^{(j)}$, can be expressed as

\[ \mu^{(j)} = \frac{X_0}{\alpha a N_s} + K(N)\sqrt{\frac{X_0}{\alpha a N_s}}\,\frac{\sigma_A}{a} \]

for $j = 1, \ldots, N_s$. Then, the average parallel execution time $\mu$ is $N_s\,\mu^{(j)}$. Hence,

\[ \mu = \frac{X_0}{\alpha a} + K(N)\sqrt{\frac{X_0 N_s}{\alpha a}}\,\frac{\sigma_A}{a} \qquad (2.5) \]

It can be seen that $\mu$ increases as $N_s$ increases, even though no synchronization overhead (e.g., communication overhead for synchronization) is taken into account.

2.3.4 Stability of Performance

The variation of $T$ may be quantified by its standard deviation, $\sigma_T$, which is to be minimized when one wants to achieve stable performance. Also, one may be interested in knowing the probability, referred to as the "risk factor," that $T$ is longer than a certain desirable threshold, $\tau_d$. The risk factor, denoted by $RF$, is given by $\mathrm{Prob}[\,T > \tau_d\,]$.

2.3.5 Temporal and Spatial Heterogeneity

In this section, temporal heterogeneity and spatial heterogeneity are quantified for computing power only, since the definitions would be identical for communication bandwidth. One difference is that spatial heterogeneity would generally be lower for communication bandwidth than for computing power, mainly because communication paths are usually shared by computers.

2.3.5.1 Temporal Heterogeneity

Temporal heterogeneity is defined on an individual computer, and it indicates the variability of the computing power available for a task along the time dimension. Therefore, the standard deviation of the availability may be used to quantify temporal heterogeneity on each computer. Noting that the average availability may vary with computer (and also with time) and that $\mu$ depends on the ratio of $\sigma_{A_i}$ to $a_i$ (refer to Equation 2.4), the temporal heterogeneity on $C_i$, denoted by $TH_i$, is defined to be the normalized standard deviation of availability:

\[ TH_i \triangleq \bar{\sigma}_{A_i} = \frac{\sigma_{A_i}}{a_i} \qquad (2.6) \]

The notation $TH$ will be used when $TH_i$ is the same for all $i$ (computers), i.e., when the system is spatially homogeneous, or when the mean of $TH_i$ over all computers is referred to.

2.3.5.2 Spatial Heterogeneity

Spatial heterogeneity is defined for a group of computers to be employed to execute a target task. It represents the variability of computing power among the computers. Let the mean and maximum of $\bar{\sigma}_{A_i}$ over the $C_i$ be denoted by $\bar{\sigma}_A^{mean}$ and $\bar{\sigma}_A^{max}$, respectively, i.e., $\bar{\sigma}_A^{mean} = \frac{1}{N}\sum_{i=1}^{N} \bar{\sigma}_{A_i} = TH^{mean}$ and $\bar{\sigma}_A^{max} = \max_i \{ \bar{\sigma}_{A_i} \}$. The spatial heterogeneity, denoted by $SH$, of a set of computers $\{C_i\}$ is defined as

\[ SH \triangleq \bar{\sigma}_A^{max} - \bar{\sigma}_A^{mean} \qquad (2.7) \]

That is, $SH$ quantifies the variation of temporal heterogeneity among computers.
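The three measures just defined translate directly into code. In the sketch below, $TH_i$ is the normalized standard deviation of availability on computer $C_i$ (Eq. 2.6), $SH$ is the spread between the maximum and the mean of the $TH_i$ over the set of computers (Eq. 2.7), and the risk factor is estimated as the fraction of observed parallel execution times exceeding a threshold $\tau_d$; the sample values are made up for illustration.

    def temporal_heterogeneity(sigma_A_i, a_i):
        """TH_i = sigma_{A_i} / a_i  (Eq. 2.6)."""
        return sigma_A_i / a_i

    def spatial_heterogeneity(th_values):
        """SH = max_i TH_i - mean_i TH_i  (Eq. 2.7)."""
        mean_th = sum(th_values) / len(th_values)
        return max(th_values) - mean_th

    def risk_factor(parallel_times, tau_d):
        """RF = Prob[T > tau_d], estimated from observed parallel execution times."""
        return sum(1 for T in parallel_times if T > tau_d) / len(parallel_times)

    if __name__ == "__main__":
        # Illustrative numbers only: four computers, each with a_i = 0.5.
        th = [temporal_heterogeneity(s, 0.5) for s in (0.10, 0.14, 0.20, 0.26)]
        print("TH:", [round(x, 2) for x in th],
              " SH:", round(spatial_heterogeneity(th), 3))
        print("RF:", risk_factor([212, 230, 198, 241, 225], tau_d=221))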
2.4 Simulation Results and Discussion

2.4.1 Simulation Setup

Availability is assumed to have a uniform distribution (it was observed that other distributions, such as a "truncated" Gaussian distribution, resulted in similar trends). The distribution of execution time on each computer then looks similar to a Gaussian or Gamma distribution, which was also adopted in another study [8].

The average availability $a_i$ (or average computing power $\alpha_i a_i$) may vary with computer $C_i$. However, $a_i$ (or $\alpha_i a_i$) can be effectively normalized in task partitioning, i.e., load balancing for an average availability (or average computing power) that varies with computer can be done easily by assigning to $C_i$ an $X_i$ proportional to $a_i$ (or $\alpha_i a_i$). Hence, $a_i$ is set to 0.5 with $\alpha_i = 1.0$ for all $i$ in the simulation, in order to focus on the effects of heterogeneity in the availability (rather than the average availability or computing power) and to maximize the range of variation of $A_i(t)$ (note that $0 \le a_i \le 1.0$). Then, the maximum $\sigma_{A_i}$ ($\bar{\sigma}_{A_i}$) is $\frac{1}{2\sqrt{3}}$ ($\frac{1}{\sqrt{3}}$). The computing power $\alpha_i$ of $C_i$ is set to 1.0, also for normalization purposes.

The notion of an "interval" is adopted to quantify the time duration over which the availability $A_i(t)$ remains constant. The interval is mostly affected by other tasks (the distributions of their arrival times and sizes) and by the local scheduler on $C_i$, and it is modeled as a random variable. Note that decreasing the interval for a fixed task size is equivalent to increasing the task size with the interval fixed, and vice versa. A larger interval (for a given task size) leads to a higher chance of load imbalance among computers. The interval is generated from a uniform distribution whose range is [0, 20], except when it is explicitly varied. The simulation was repeated 1000 times for each case, and the results were averaged for the graphs in the following sections. (A minimal sketch of this simulation procedure is given after Figure 2.2 below.)

2.4.2 Effects of Heterogeneity on a Target Task

In this section, some of the simulation results are presented to discuss the effects of temporal and spatial heterogeneity on the performance of a target computing task. In the graphs where the variation of the parallel execution time $T$ is analyzed, the standard deviation of $T$ normalized by $\mu$ is used, i.e., $\bar{\sigma}_T \triangleq \sigma_T / \mu$. The results without communication overhead (but with synchronization in some cases) are provided in Figures 2.2 - 2.11, and those with communication overhead are in Figures 2.12 and 2.13.

2.4.2.1 Spatially Homogeneous and Temporally Heterogeneous

In Figure 2.2, $\mu$ and $\bar{\sigma}_T$ are plotted when $N = 1$. As can be seen in the figure, the average sequential execution time is not affected by $\bar{\sigma}_A$, but its variation increases proportionally to $\bar{\sigma}_A$, as shown by Equations 2.1 and 2.2.

In Figure 2.3, the effects of $\bar{\sigma}_A$ on $\mu$ and $\bar{\sigma}_T$ are shown for different values of $N$. As predicted by Equation 2.4, $\mu$ and $\bar{\sigma}_T$ increase almost linearly with $\bar{\sigma}_A$ (which is $TH$) when multiple computers are employed. When more computers are employed (a larger $N$), the effect of $\bar{\sigma}_A$ on $\mu$ is larger, as shown in Figure 2.3-(a). The effect of $\bar{\sigma}_A$ on $\sigma_T$ is also larger. However, since $\mu$ increases faster than $\sigma_T$ does, the effect of $\bar{\sigma}_A$ on $\bar{\sigma}_T$ is smaller for a larger $N$, as shown in Figure 2.3-(b).

Figure 2.2: (a) Average execution time and (b) normalized standard deviation of execution time on a single computer where $X = 100$.
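The simulation procedure of Section 2.4.1 can be reconstructed in a few lines: each computer's availability is redrawn from U[0, 1] at the start of each interval (interval lengths from U[0, 20]), the assigned work $X_i$ is consumed at the rate of the current availability ($\alpha_i = 1$), and the parallel execution time of a run is the maximum of the per-computer times. This is an illustrative reconstruction under the setup stated above, not the simulator used to produce the reported figures.

    import random

    def run_time(X_i, rng, max_interval=20.0):
        """Time to finish X_i units of work when availability is redrawn from
        U[0, 1] at the start of each interval of length U[0, max_interval]."""
        t, remaining = 0.0, X_i
        while remaining > 0.0:
            dur = rng.uniform(0.0, max_interval)
            avail = rng.uniform(0.0, 1.0)
            done = avail * dur
            if done >= remaining and avail > 0.0:
                return t + remaining / avail       # task finishes mid-interval
            remaining -= done
            t += dur
        return t

    def parallel_run(N, X, rng):
        """One run: X is split evenly, and the parallel time is the slowest computer."""
        return max(run_time(X / N, rng) for _ in range(N))

    if __name__ == "__main__":
        rng = random.Random(7)
        runs = [parallel_run(10, 1000.0, rng) for _ in range(1000)]
        mu = sum(runs) / len(runs)
        sigma = (sum((r - mu) ** 2 for r in runs) / len(runs)) ** 0.5
        print(f"mu ~= {mu:.1f}, normalized sigma_T ~= {sigma / mu:.3f}")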
Figure 2.3: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time on $N$ computers where $X_i = 100$ for $i = 1, \ldots, N$.

2.4.2.2 Spatially and Temporally Heterogeneous

In Figure 2.4, the dependency of $\mu$ and $\bar{\sigma}_T$ on $SH$ and $TH$ is shown when $N$ is fixed at 8. In this graph, when $SH$ is zero (i.e., on the $TH$ axis), $\bar{\sigma}_{A_i}$, which is $TH_i$, is the same for all computers. When $SH$ is greater than zero, the distribution of $\bar{\sigma}_{A_i}$ among computers is linear, such that $\bar{\sigma}_{A_i} = \bar{\sigma}_A^{mean} + 2\left(\frac{i-1}{N-1} - 0.5\right)\left(\bar{\sigma}_A^{max} - \bar{\sigma}_A^{mean}\right)$ for $i = 1, \ldots, N$. That is, $TH = \bar{\sigma}_A^{mean}$ in these graphs. As $SH$ increases, $\mu$ increases significantly, especially when $\bar{\sigma}_A^{mean}$ also increases (going diagonally from the origin in the $SH$-$TH$ plane).

Figure 2.4: (a) Average execution time and (b) normalized standard deviation of execution time when $SH$ and $TH$ are varied with $N$ fixed at 8, where $X_i = 100$ for $i = 1, \ldots, 8$.

Now, suppose that the size of a target task, $X$, is fixed independently of $N$ and is distributed uniformly over the $N$ computers. In general, the larger the set of computers employed for the target task, the larger the heterogeneity ($SH$) among the computers becomes. Two cases are considered: (i) $\bar{\sigma}_A^{mean}$ is fixed while $SH$ increases, and (ii) both $SH$ and $\bar{\sigma}_A^{mean}$ increase. In both cases, three different situations are considered in terms of how $\bar{\sigma}_A^{max}$ grows: proportionally to $\sqrt{N}$, to $N$, and to $N^2$. Simulation results for cases (i) and (ii) are provided in Figures 2.5 and 2.6, respectively. As $N$ increases, $\mu$ decreases almost inversely proportionally to $N$ and then starts to increase beyond a certain $N$, especially when $SH$ increases rapidly. In contrast, $\bar{\sigma}_T$ increases monotonically as $N$ increases. It increases sharply after a certain value of $N$ in the case where $SH$ increases proportionally to $N^2$.

Figure 2.5: (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ remains fixed.

Figure 2.6: (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ increases.

Speed-up is shown for cases (i) and (ii) in Figure 2.7-(a) and Figure 2.7-(b), respectively. The reason the three curves meet at $N = 50$ is that the simulation was set up such that the distribution of $\bar{\sigma}_A$ among computers becomes identical in all three situations when $N$ reaches 50. Hence, what is to be observed in these graphs is the "shape" (trend) of each curve. It is clear that spatial
2.4.2.3 Synchronization

The number of synchronization points, N_s, is varied for different σ̄_{A_i} to observe its effects on t̄ and σ̄_T, when SH = 0 in Figure 2.8 and when SH > 0 in Figure 2.9. In Figure 2.8, it is seen that as N_s increases, t̄ increases, since the idle time due to synchronization increases. However, σ̄_T decreases. This can be explained as follows. The increased T_i (or t_i) due to a larger N_s leads to an increased t̄ (refer to Equation 2.2). However, t̄ increases faster than σ_T, causing σ̄_T to decrease. The rate of increase in t̄ is larger for a larger σ̄_A (TH) and a higher SH, as shown in Figure 2.9.

Figure 2.8: (a) Average execution time and (b) normalized standard deviation of execution time as functions of N_s. X = 1000, N = 10, and X_i = X/N = 100.

Figure 2.9: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of SH and N_s. X = 1000, N = 10, X_i = X/N = 100, and σ̄_A^mean = 0.28.

2.4.2.4 Granularity of Availability

When the size of a target task (in terms of its execution time) is much larger than an interval (the duration over which availability remains constant; refer to Section 2.4.1), or equivalently when the interval is much smaller than the target task size, the effects of heterogeneity (temporal or spatial) are reduced, since the probability of load imbalance decreases. The dependency on the interval is shown when SH = 0 in Figure 2.10 and when SH > 0 in Figure 2.11. As the interval increases (i.e., availability varies more slowly with time) for a given target task, the effects of TH and SH become larger, making t̄ and σ̄_T larger. Equivalently, a smaller task would be more sensitive to heterogeneity.

Figure 2.10: (a) Average execution time and (b) normalized standard deviation of execution time as functions of interval. X = 1000, N = 10, and X_i = X/N = 100.

Figure 2.11: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of SH and interval. X = 1000, N = 10, X_i = X/N = 100, and σ̄_A^mean = 0.28.

2.4.2.5 Communication Overhead

Simulation results when the augmented task model is adopted are provided in Figures 2.12 and 2.13. In these graphs, t̄ includes both computation and communication times. In Figure 2.12, SH_comp and SH_comm represent the spatial heterogeneity of availability in computing power and communication bandwidth, respectively. As expected for the simulation model, SH_comp and SH_comm have a similar effect on t̄ and σ̄_T (refer to Section 2.4.2.2). In this simulation, the ratio of computation time to communication overhead is set to 6. That is why t̄ increases more slowly along SH_comm than along SH_comp.
Figure 2.12: (a) Average execution time (computation and communication times) and (b) normalized standard deviation of execution time as functions of SH_comm and SH_comp. N = 20, X = 2000, N_s = 10, X_i = X/N = 100, and σ̄_A^mean = σ̄_B^mean = 0.28.

In Figure 2.13, SH is the same for both the computing power availability and the bandwidth availability. As N increases, t̄ decreases initially and then turns around to increase after a certain value of N. This is mainly because communication overhead becomes dominant as N increases. It is also observed that the trend is more visible when the bandwidth availability (and computing power availability) shows a higher spatial heterogeneity (SH). For example, in Figure 2.13-(a), it is seen that t̄ turns around to increase when SH is relatively high, while it monotonically decreases when SH = 0.

Figure 2.13: Average execution time: (a) when communication overhead is independent of N and X = 400, and (b) when communication overhead increases linearly proportional to N and X = 2000. N_s = 10 and σ̄_A^mean = σ̄_B^mean = 0.28.

2.4.2.6 Load Distribution and Stability (Risk Factor)

So far, only the cases where a target task is uniformly distributed among computers have been discussed. Let's examine the effect of the distribution of the target task with SH varied. In Figure 2.14, t̄ and σ̄_T are plotted as functions of dX with N = 2, where dX = (X_1 − X_2)/2 and Δσ̄_A = σ̄_{A_1} − σ̄_{A_2}. In Figure 2.15, the risk factor defined in Section 2.3.4 is shown as a function of dX. The same set of results is provided for N = 10 in Figures 2.16 and 2.17. In these simulations, σ̄_{A_i} is increased linearly with i among the 10 computers as before (refer to Figure 2.4). X_i is also varied linearly with respect to i, i.e., X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., N. Therefore, a negative slope means that a computer with a smaller σ̄_{A_i} is assigned a larger fraction of X (a larger X_i).

Figure 2.14: (a) Average execution time and (b) normalized standard deviation of execution time on two computers where X_1 = X/2 − dX, X_2 = X/2 + dX, and X = 100.

Figure 2.15: Risk factor on two computers where X_1 = X/2 − dX, X_2 = X/2 + dX, and X = 200. t_d = 221. Δσ̄_A = σ̄_{A_2} − σ̄_{A_1}.

Figure 2.16: (a) Average execution time and (b) normalized standard deviation of execution time on 10 computers. X = 100 and X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., 10.

Figure 2.17: Risk factor on 10 computers with (a) t_d = 40 and (b) t_d = 28. X = 100 and X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., 10.

It may be observed from these results that, by assigning more work (a larger fraction of the target task) to a computer with a smaller variation of availability, (a) a shorter (minimum) t̄ is obtained, (b) σ̄_T is sharply decreased, and (c) the risk factor is quickly reduced. These trends are more explicit for a larger N and a higher SH. It is also seen that the amount of work to be assigned to each computer to minimize t̄, σ̄_T, and RF depends on SH (and the distribution of σ̄_{A_i}). That is, for achieving the minimum execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the mean availabilities of computers.
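The linear load distributions used in Figures 2.16 and 2.17 can be generated as follows (a sketch under the stated form X_i = X_0 + Slope_Xi·(i − 1); the helper name is ours), with X_0 chosen so that the parts sum to X.

def linear_partition(X, N, slope):
    """Xi = X0 + slope*(i-1) with sum(Xi) = X determining X0.
    A negative slope assigns more work to the lower-index computers,
    i.e. to those with smaller sigma_bar_Ai in these simulations."""
    X0 = X / N - slope * (N - 1) / 2.0
    return [X0 + slope * (i - 1) for i in range(1, N + 1)]

parts = linear_partition(100.0, 10, -1.0)
print(parts, sum(parts))    # the parts sum back to X = 100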
Chapter 3
Load Balancing on a Heterogeneous Computing System

The results in Chapter 2 reveal that, for achieving the minimum average parallel execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the average availabilities (or average computing powers) of computers. This observation has motivated the development of an efficient load balancing scheme for heterogeneous cluster and grid computing environments, which is described in this chapter.

3.1 Two-Step Heuristic Approach

Let's denote the computing power (speed) of C_i at time t by S_i(t), which has the mean s_i and the standard deviation σ_{S_i}. Then, noting that S_i(t) = α_i A_i(t), s_i = α_i a_i and σ_{S_i} = α_i σ_{A_i}. When σ_{S_i} = 0 for all i, a target task is to be partitioned proportional to s_i in order to achieve the minimum t̄. However, such partitioning does not lead to the minimum t̄ in general when σ_{S_i} ≠ 0 for some i.

Let's consider cases where σ_{S_i} ≠ 0 for some i. Suppose that X is partitioned such that X_i is linearly proportional to s_i. Then, t_i (the average execution time on C_i) would be the same for all i, but σ_{T_i} (the standard deviation of execution time on C_i) would not be. It is to be pointed out that σ_{T_i} is linearly proportional to σ̄_{A_i} (or TH_i). Noting that t̄ is given by E[max_i{T_i}] rather than max_i{T_i} or max_i{t_i}, it is possible to further reduce t̄ by taking {σ_{T_i}} into account. Therefore, load balancing may be carried out in two steps as follows: (1) X is partitioned over {C_i} such that X_i is proportional to s_i; (2) {X_i} is further adjusted considering {σ_{T_i}} and {s_i}. In the following, this two-step load balancing approach is elaborated for different cases.

3.1.1 When s_i = s_j for all i, j:

Consider the case of N = 2 (two computers). In Step (1) of the two-step load balancing approach, X is divided equally between C_1 and C_2 since s_1 = s_2. That is, after Step (1), X_1 = X_2 = X/2 and t_1 = t_2. Suppose that σ_{T_1} > σ_{T_2} after Step (1). Then, it is possible to reduce t̄ further by transferring a certain amount of work (ΔX ≥ 0) from C_1 to C_2 in Step (2), i.e., X'_1 = X_1 − ΔX and X'_2 = X_2 + ΔX, where X'_i is X_i after Step (2). Then,

t'_1 = t_1 − ΔX/s_1 ≜ t_1 − Δt_1   and   t'_2 = t_2 + ΔX/s_2 ≜ t_2 + Δt_2        (3.1)

where t'_i is t_i after Step (2). Note that Δt_1 = Δt_2 since s_1 = s_2. Also, from Equation 2.2, it can be shown that

σ'_{T_1} = σ_{T_1} √(1 − ΔX/(s_1 t_1))   and   σ'_{T_2} = σ_{T_2} √(1 + ΔX/(s_2 t_2))        (3.2)

where σ'_{T_i} is σ_{T_i} after Step (2).

A heuristic scheme, to be referred to as the equalization scheme, that can be employed in Step (2) equalizes the sum of the mean and standard deviation of execution time between the two computers. That is, it determines ΔX such that

t'_1 + σ'_{T_1} = t'_2 + σ'_{T_2}        (3.3)

where t'_i and σ'_{T_i} are given by Equations 3.1 and 3.2, respectively. The equalization scheme is illustrated in Figure 3.1. The scheme attempts to reduce the probability that the parallel execution time of a target task, T = max_i{T_i}, is large, in order to minimize its average parallel execution time. This heuristic can be generalized for cases where N > 2, i.e., X_i is determined such that t'_i + σ'_{T_i} is the same for all i.
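For N = 2, the equalization condition of Equation 3.3 can be solved numerically. The sketch below (ours, not the dissertation's implementation) applies bisection to Equations 3.1 and 3.2, assuming σ_{T_1} > σ_{T_2} so that ΔX is moved from C_1 to C_2.

def equalization_delta(t1, t2, s1, s2, sig1, sig2, iters=60):
    """Find dX (moved from C1 to C2) such that t'_1 + sig'_T1 = t'_2 + sig'_T2,
    using Equations 3.1 and 3.2.  Assumes sig1 > sig2, so the imbalance is
    positive at dX = 0 and decreases monotonically in dX."""
    def imbalance(dX):
        tp1 = t1 - dX / s1
        tp2 = t2 + dX / s2
        sp1 = sig1 * (1.0 - dX / (s1 * t1)) ** 0.5
        sp2 = sig2 * (1.0 + dX / (s2 * t2)) ** 0.5
        return (tp1 + sp1) - (tp2 + sp2)

    lo, hi = 0.0, s1 * t1              # dX cannot exceed the work X1 = s1*t1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Equal speeds and equal means after Step (1), but C1 has the larger deviation.
print(equalization_delta(t1=100.0, t2=100.0, s1=1.0, s2=1.0, sig1=20.0, sig2=5.0))

For N > 2, the same idea would be applied by choosing {X_i} so that t'_i + σ'_{T_i} is equal across computers, for example by iterating such pairwise transfers.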
Figure 3.1: Equalization scheme: (a) after Step (1), where a target task is partitioned such that the average execution time is the same on all computers, and (b) after Step (2), where the partitioning is adjusted such that t'_i + σ'_{T_i} is equalized over all computers.

3.1.2 When s_i ≠ s_j for some i, j:

Again, consider the case of N = 2, and suppose that s_1 < s_2. After Step (1), X_1 = s_1 X/(s_1 + s_2) and X_2 = s_2 X/(s_1 + s_2). What is to be done in Step (2) to further reduce t̄ depends on {σ_{T_i}} and {s_i}, as discussed below, and can be generalized for cases where N > 2.

(a) σ_{T_1} > σ_{T_2}: In this case, ΔX is to be moved from C_1 to C_2, and the equalization scheme may be employed in determining ΔX. One difference is that Δt_1 > Δt_2. In other words, the same increase in t_2 (by moving ΔX from C_1 to C_2) results in a larger decrease in t_1, leading to a larger reduction in t̄, compared to the case where s_1 = s_2.

(b) σ_{T_1} < σ_{T_2} and s_1 ≪ s_2: It is still possible to reduce t̄ by transferring ΔX from C_1 to C_2, though the equalization scheme may not be used since the condition t'_1 + σ'_{T_1} = t'_2 + σ'_{T_2} cannot be satisfied. The reduction in t̄ would be smaller than that in (a).

(c) σ_{T_1} < σ_{T_2} and s_1 ≈ s_2: In this case, ΔX is to be moved from C_2 to C_1, and the equalization scheme can be used. The reduction in t̄ would be smaller than that in (a) or (b).

It should be clear that transferring ΔX in the direction opposite to that specified above would result in a longer t̄. The above discussion on the reduction in t̄ for the different cases is summarized in Table 3.1.

Table 3.1: Possible reduction in t̄

                    s_1 = s_2              s_1 < s_2
                    σ_{T_1} > σ_{T_2}      σ_{T_1} > σ_{T_2}   σ_{T_1} < σ_{T_2} (s_1 ≪ s_2)   σ_{T_1} < σ_{T_2} (s_1 ≈ s_2)
Direction of ΔX     C_1 → C_2              C_1 → C_2           C_1 → C_2                        C_1 ← C_2
Reduction in t̄     Smallest               Largest             Small                            Small

3.2 Simulation Results and Discussion

Let's first consider a case where the average computing speed is the same for all computers, i.e., s_i = s_j for all i, j. First, X_i = X_j for all i, j in Step (1) of the two-step load balancing approach described in Section 3.1. It is assumed that t_i = 100 and σ_{T_i} = ((i − 1)/(N − 1)) σ_{T_max}, for i = 1, ..., N, after Step (1). In Step (2), X_i is adjusted such that t'_i = t_i + Slope_T (i − (N − 1)/2) for i = 1, ..., N. Slope_T = 0 corresponds to the case where X'_i is proportional to s_i (the average computing speed or power), i.e., Step (1) only. A negative Slope_T indicates that a computer with a larger σ_{T_i} is assigned a larger ΔX_i (refer to Section 3.1) in Step (2), in addition to the X_i allocated in Step (1).

In Figure 3.2, t̄ and the percentage standard deviation of (t'_i + σ'_{T_i}) are shown as functions of Slope_T, which specifies how a target task is distributed over the N computers. The percentage standard deviation of (t'_i + σ'_{T_i}) is a measure which reflects the degree of equalization, and is defined as:

% standard deviation of (t'_i + σ'_{T_i}) = [σ_{(t'+σ'_T)} / μ_{(t'+σ'_T)}] × 100%        (3.4)

where μ_{(t'+σ'_T)} and σ_{(t'+σ'_T)} are the average and standard deviation of (t'_i + σ'_{T_i}) over the N computers:

μ_{(t'+σ'_T)} = (1/N) Σ_{i=1}^{N} (t'_i + σ'_{T_i})        (3.5)

σ_{(t'+σ'_T)} = √[ (1/N) Σ_{i=1}^{N} (t'_i + σ'_{T_i} − μ_{(t'+σ'_T)})² ]        (3.6)

First of all, it is to be noted from Figure 3.2 that it is not optimal to distribute a target task linearly proportional to the average computing power of computers. In Figure 3.2-(a), it is clear that the performance improvement (reduction in t̄) achieved by Step (2) (more precisely, Step (1) + Step (2)) over Step (1) alone is significant, and is larger for a larger Δσ_T (equivalently, a higher spatial heterogeneity). Comparing Figures 3.2-(a) and (b), it can be seen that the equalization scheme employed in Step (2) works well, i.e., the distribution of X minimizing t̄ closely matches that minimizing the variation in t_i + σ_{T_i}.
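Equations 3.4 through 3.6 amount to the small computation below (a sketch with hypothetical inputs), which is the quantity plotted in Figure 3.2-(b).

def pct_std_of_sum(t_prime, sig_prime):
    """Percentage standard deviation of (t'_i + sig'_Ti) over N computers
    (Equations 3.4-3.6); smaller values indicate better equalization."""
    vals = [t + s for t, s in zip(t_prime, sig_prime)]
    n = len(vals)
    mean = sum(vals) / n
    std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    return 100.0 * std / mean

print(pct_std_of_sum([100, 98, 96, 94], [5, 10, 15, 20]))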
Figure 3.2: (a) Average parallel execution time and (b) percentage standard deviation of t'_i + σ'_{T_i} after Step (2) in the load balancing, when the number of computers, N, is 10. After Step (1), t_i = 100 and Δσ_T = σ_{T_N} − σ_{T_1}, where σ_{T_i} = ((i − 1)/(N − 1)) σ_{T_max} for i = 1, ..., 10.

In Figure 3.3, a system of two computers where s_1 < s_2 is considered. In these graphs, t̄ is plotted as a function of ΔX. Again, ΔX = 0 corresponds to the case where X is divided between the two computers proportional to their average computing powers (s_1 and s_2). It can be observed that the reduction in t̄, which can be achieved by Step (2), is larger for a larger Δs or a larger Δσ_T. Let's define the percentage error (ε) of the equalization scheme as (Δt̄_max − Δt̄)/Δt̄_max × 100, where Δt̄_max and Δt̄ are the maximum possible and the achieved (by the heuristic) reductions in t̄, respectively. In this set of results, the average ε was 5.1%.

Figure 3.3: Average parallel execution time on two computers (a) with a fixed Δσ_T = σ_{T_1} − σ_{T_2} = 50 (σ_{T_2} = 10) and (b) with a fixed Δs = 4, where Δs = s_2/s_1 (s_1 = 1) and Δσ_T = σ_{T_1} − σ_{T_2} (σ_{T_2} = 0).

In Figure 3.4, the percentage reduction in t̄, which is achieved by Step (2), is analyzed with the number of computers (N) varied. In this simulation, s_i increases and σ_{T_i} decreases linearly proportional to i, i.e., a faster computer has a smaller variation of execution time. The percentage reduction in t̄ is defined to be (t̄^(1) − t̄^(2))/t̄^(1) × 100, where t̄^(1) and t̄^(2) are the t̄ achieved by Step (1) and Step (2), respectively. It can be seen that the percentage reduction increases as N increases. That is, the performance improvement achieved by Step (2) becomes greater when a larger number of computers is employed.

Figure 3.4: Percentage reduction in average parallel execution time as a function of the number of computers (a) with a fixed Δσ_T = σ_{T_1} − σ_{T_N} = 60 (σ_{T_N} = 0) and (b) with a fixed Δs = 5, where Δs = s_N/s_1 (s_1 = 1) and Δσ_T = σ_{T_1} − σ_{T_N} (σ_{T_N} = 0). After Step (1), t_i = 100 for all i.

In Figure 3.4-(a), Δσ_T is fixed independent of N. However, in reality, Δσ_T would usually increase as N increases. Therefore, in such a case, one may expect a larger reduction in t̄. Note that a larger Δσ_T leads to a greater reduction in t̄, as shown in Figure 3.4-(b).

Chapter 4
Effects of Heterogeneity on an Individual Network Traffic

In this chapter, how heterogeneity in channel bandwidth affects the performance of an individual message is analyzed. Also, application of the analysis results to path selection and multi-path data transfer is considered.

4.1 Network Model

4.1.1 Topology

In this dissertation, a communication network is represented by a graph where a node may be a user, a switch, a router, etc., and an edge is a wired or wireless channel or link. The bandwidth of a channel refers to the bandwidth that is available for, or can be allocated to, an individual message (request) on the channel.
A path is a set of consecutive edges (links) from one node to another. Two types of paths are of interest: the serial path and the parallel path. A serial path consists of N serially connected links from a source node to a destination node through N − 1 intermediate nodes (routers), as illustrated in Figure 4.1-(a). The bandwidth B_i(t) of link i (l_i in the figure), specified in terms of packets-per-second (pps), is assumed to be randomly distributed between B_i^min and B_i^max with the mean b_i and the standard deviation σ_{B_i}, where 1 ≤ i ≤ N. B_i(t) remains constant over a period of time, referred to as an interval, T_interval, as illustrated in Figure 4.2. Also, the normalized standard deviation, σ_{B_i}/b_i, is denoted by σ̄_{B_i}.

Figure 4.1: Illustration of (a) a serial path and (b) a parallel path where N_p = 2.

A parallel path is composed of N_p serial paths which are independent of each other, as illustrated in Figure 4.1-(b). Path j of a parallel path consists of N_j links, where 1 ≤ j ≤ N_p. Link i on path j may be denoted by l_ij along with the associated bandwidth parameters B_ij(t), B_ij^max, B_ij^min, b_ij, and σ_{B_ij}. For a parallel path where all links on each path are identical, b_j, σ_{B_j}, and σ̄_{B_j} will be used to denote the mean, standard deviation, and normalized standard deviation of the bandwidth of all links on path j, respectively.

The transmission rate at the source node follows the bandwidth of the bottleneck link (the link with the minimum bandwidth) in the path. Note that a higher transmission rate would lead to a higher throughput, but at the expense of a longer (queueing) delay. The source node is informed of any change in B_i(t) with a delay of T_feedback, and updates its transmission rate at an interval of T_update. Practically, the value of T_update should not be shorter than the average one-way trip time from the destination node to the source node.

Figure 4.2: Temporal bandwidth heterogeneity on each channel and spatial bandwidth heterogeneity among channels.

Temporal heterogeneity of the bandwidth is defined for each channel or link i and is quantified by σ̄_{B_i}. This indicates the temporal fluctuation of the bandwidth (available for a message) about its mean. Spatial heterogeneity is defined for a group of channels or links and may be quantified by a measure of variation among the members of the set {σ̄_{B_i}}, such as the difference between the maximum and the minimum.

There are other factors in addition to channel bandwidth which can be considered in a network model. However, the main focus of this chapter is on analyzing the effects of temporal and spatial variations in the bandwidth allocated to a message on QoS's. Therefore, those factors are not explicitly considered in the model. Equivalently, the model takes only their combined "net" effect on the allocated bandwidth into account.

4.1.2 Performance Measures

Some communication performance measures used in this dissertation are defined below.

Throughput (THR) is defined to be the size of data that a path (or multi-path) is able to transfer in a unit time, in terms of packets-per-second (pps) when the protocol has a uniform packet size, or bits-per-second (bps). This measure reflects the effective bandwidth of the network path.
This is a long-term measure, which is evaluated at the destination node.

End-to-End Delay (T_ee) is defined to be the time it takes to transmit a packet from the source to the destination. This can be decomposed into four components, i.e., propagation delay (T_p), transmission delay (T_tr), queueing delay (T_q), and other overhead (T_o). T_o represents the overhead including system processing time and packet assembling or disassembling time; T_p is related to the physical property of the network, which is determined by the speed of the electro-magnetic wave on a certain medium (copper, fiber, etc.); T_q is the waiting time that a packet spends in queueing into and dequeuing from the buffers of the routers on the path, and reflects the mismatch of bandwidth between two successive links; and T_tr can be expressed in terms of the transmission rate (r(t)) as follows:

T_tr(t) = 1/r(t)        (4.1)

Note that there is a trade-off between T_tr and T_q: when the source rate r(t) is increased, T_tr decreases since it is inversely proportional to r(t), while T_q tends to increase when r(t) gets faster if there are packets buffered. This dissertation mainly focuses on T_tr and T_q because they are the two main components of T_ee, while the other two components are also included in the simulation model.

Dropping Rate (DR) is defined to be the ratio of the number of dropped packets to the total number of packets sent. When the available bandwidth drops below the source rate, packets are queued. Packet dropping occurs when packets are sent at a rate higher than the current available bandwidth of an outgoing link over a long enough period of time that the buffer overflows. Therefore, a good source rate control scheme should at least follow, and not go above, the average available link bandwidth (a long-term measure) over a long period of time.

Transfer Time (T_trans) is the time it takes to send a given size of data from a source to a destination.

4.2 Effects of Heterogeneity

In this section, the effects of heterogeneity on the QoS's of an individual message are analyzed by computer simulations. The bandwidth of each channel, B_i(t), is assumed to be independently and uniformly distributed between B_i^min and B_i^max. Other distributions, such as a truncated Gaussian distribution, have been considered, but lead to qualitatively similar results. Therefore, only the results for the uniform distribution are provided in this dissertation.

4.2.1 Feedback Latency

In Figure 4.3, where the notation T_interval = [X, Y] indicates that T_interval is uniformly distributed between X and Y, the effects of T_feedback on throughput and end-to-end delay are analyzed for a serial path of 4 links. It is seen that throughput is almost independent of T_feedback, while it decreases as σ̄_B increases (refer to Figure 4.3-(a)). This is because throughput is mainly determined by the average bandwidth of the path, as will be discussed later on. However, T_feedback has a significant effect on the end-to-end delay, as can be seen in Figure 4.3-(b). The longer the feedback latency is, the longer the resulting end-to-end delay. When the feedback latency is longer, the duration of mismatch between the source rate and the bandwidth of the bottleneck link is longer, leading to a longer end-to-end delay per packet due to longer queues along the path.

Figure 4.3: Effects of T_feedback on (a) throughput and (b) end-to-end delay. T_interval = [0.01, 5.00] s, T_update = 0.01 s, N = 4, b_i = 1000 pps, and σ̄_{B_i} = σ̄_B for all i.
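To illustrate the T_tr / T_q trade-off and the dropping behavior described above, here is a toy fluid-queue sketch (ours, with arbitrary parameters; the actual simulations in this chapter use the path model of Section 4.1, not this simplification). The bandwidth is redrawn every step for simplicity, and the mean queueing delay is estimated from the average queue length via Little's law.

import random

def delay_and_drops(rate, b=1000.0, sigma_bar=0.3, buffer_pkts=20,
                    dt=0.001, steps=200_000):
    """Packets arrive at a constant source rate and leave at a time-varying
    bandwidth; returns an estimate of (Ttr + Tq, dropping rate)."""
    half = (3 ** 0.5) * sigma_bar * b
    queue = sent = dropped = queue_sum = 0.0
    for _ in range(steps):
        bw = random.uniform(b - half, b + half)   # bandwidth for this step
        arriving = rate * dt
        sent += arriving
        queue += arriving
        if queue > buffer_pkts:                   # buffer overflow -> drops
            dropped += queue - buffer_pkts
            queue = buffer_pkts
        queue -= min(queue, bw * dt)              # packets served this step
        queue_sum += queue
    t_tr = 1.0 / rate                             # transmission delay, Equation 4.1
    t_q = (queue_sum / steps) / rate              # Little's-law estimate of queueing delay
    return t_tr + t_q, dropped / sent

print(delay_and_drops(rate=900.0))    # below the mean bandwidth
print(delay_and_drops(rate=1100.0))   # above it: longer queueing and more drops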
4.2.2 Temporally Heterogeneous and Spatially Homogeneous

4.2.2.1 Serial Path

For a serial path of a single link, it is not difficult to show that the average transfer time, t̄_trans, and the normalized standard deviation of the transfer time, σ̄_{t_trans}, can be derived as t̄_trans = S/b and σ̄_{t_trans} = σ_{t_trans}/t̄_trans = σ̄_B √b/√S, respectively, where S is the size of the data.

Now, consider a serial path of N links where b_i = b and σ_{B_i} = σ_B for all i. The source rate follows the bandwidth of the bottleneck link along the path in the simulation. Therefore, the effective bandwidth, B_effective, of the serial path can be approximated as that of the bottleneck link. The average effective bandwidth, b_effective, which is equivalent to throughput, can be derived as follows (for the uniform distribution):

b_effective = E[B_effective] = E[min_i{B_i}] = b − ((N − 1)/(N + 1)) (B^max − B^min)/2 = b (1 − ((N − 1)/(N + 1)) √3 σ̄_B)        (4.2)

From Equation 4.2, one can see that the average effective bandwidth decreases as σ̄_B (temporal heterogeneity) increases, and the decrease is larger (by a fraction as large as √3 σ̄_B) for a longer path (a larger N). The more links are involved in a path, the higher the probability that min_i{B_i} will be smaller. This is verified by the simulation results in Figure 4.4-(a), where it is clear that the simulation results show a good match with the theoretical predictions.

Figure 4.4: (a) Throughput and (b) end-to-end delay on a temporally heterogeneous serial path of N links. S = 10^6 packets, b_i = 1000 pps, and σ̄_{B_i} = σ̄_B for i = 1, ..., N.

The end-to-end delay is the sum of the delays on the N links. The delay on an individual link includes the transmission and queueing delays (refer to Section 4.1.2). The queueing delay increases as σ̄_B increases, since the probability and degree of bandwidth mismatch between adjacent links increase, leading to longer queues. Therefore, the end-to-end delay increases as σ̄_B and/or N increase, as shown in Figure 4.4-(b).

4.2.2.2 Parallel Path

In Figure 4.5, the transfer time and throughput achieved on a parallel path are plotted. In this parallel path, each path is composed of a single link. S_j = 10^5 packets for j = 1, ..., N_p, where S_j is the size of the data transferred over path j, i.e., the total size of data is 10^5 N_p packets transferred over N_p paths. It can be seen in Figure 4.5-(a) that the transfer time increases by up to 16% as σ̄_B increases when N_p > 1. The increase in transfer time is larger for a larger N_p. In Figure 4.5-(b), the aggregated throughput over the N_p paths is plotted as a function of N_p. Ideally, it should increase linearly proportional to N_p. However, the throughput is degraded from the ideal one as N_p increases, with a larger degradation for a larger σ̄_B (up to more than 40% when N_p = 19 and σ̄_B = 0.4).

Figure 4.5: (a) Transfer time and (b) throughput on a temporally heterogeneous parallel path consisting of N_p paths with one link per path. S_j = 10^5 packets and b_j = 1000 pps for j = 1, ..., N_p.

Note that the results in Figure 4.5 are for the cases where each path consists of a single link. By referring to Figure 4.4, it should not be difficult to see that the effects of σ̄_B and/or N_p are larger on a parallel path when the number of links per path is greater than one, as verified in the simulation.
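Equation 4.2 is easy to check numerically. The sketch below (ours) draws N independent uniform link bandwidths with mean b and normalized standard deviation σ̄_B and compares the sample mean of their minimum with the closed-form prediction.

import random

def effective_bw_uniform(N, b=1000.0, sigma_bar=0.3, trials=200_000):
    """Monte-Carlo check of Equation 4.2: the mean effective bandwidth of a
    serial path is E[min_i Bi] when each Bi is uniform with mean b and
    normalized standard deviation sigma_bar."""
    half = (3 ** 0.5) * sigma_bar * b            # half-width of the uniform range
    est = sum(min(random.uniform(b - half, b + half) for _ in range(N))
              for _ in range(trials)) / trials
    predicted = b * (1.0 - (N - 1) / (N + 1) * (3 ** 0.5) * sigma_bar)
    return est, predicted

print(effective_bw_uniform(4))    # the two numbers should agree closely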
4.2.3 Temporally and Spatially Heterogeneous

In addition to temporal heterogeneity on each link, different links may exhibit different characteristics, i.e., b_i ≠ b_j and/or σ̄_{B_i} ≠ σ̄_{B_j} for i ≠ j. In this section, only the results for cases where b_i = b for all i, but σ̄_{B_i} ≠ σ̄_{B_j} for i ≠ j, are presented. However, in practice, even when b_i ≠ b_j, it is often the case that the allocated average bandwidth is the same for all links. For instance, it would not be optimal to allocate different average bandwidths on different links along a serial path, since the overall effective bandwidth of the path mainly depends on the bottleneck link, i.e., the link with the lowest (average) bandwidth. Note that a link with a large σ̄_B may become a bottleneck link even though its average bandwidth is not the lowest. Nevertheless, cases where b_i ≠ b_j are considered for the application examples in Section 4.3.

4.2.3.1 Serial Path

In Figure 4.6, b_i = 1000 pps for all i, and σ̄_{B_1} is varied with σ̄_{B_i} fixed for i = 2, ..., N. In Figures 4.6-(a) and (b), σ̄_{B_i} is fixed at 0.1 for i = 2, ..., N, and σ̄_{B_1} is varied from 0 to 0.52. It is clear that throughput is degraded significantly (by close to 20%) due to the spatial heterogeneity in {σ̄_{B_i}} as σ̄_{B_1} increases. The relative degradation in end-to-end delay is larger, as shown in Figure 4.6-(b). As the spatial heterogeneity among links increases, the probability and degree of bandwidth mismatch between adjacent links increase, which causes a longer end-to-end delay for each packet, as discussed in Section 4.2.2.1. When σ̄_{B_i} for i = 2, ..., N is fixed at 0.52, the change in the spatial heterogeneity of {σ̄_{B_i}} is smaller as σ̄_{B_1} increases from 0 to 0.52. Therefore, the relative degradation in throughput and end-to-end delay as σ̄_{B_1} increases is less, as can be seen in Figures 4.6-(c) and (d).

4.2.3.2 Parallel Path

In Figure 4.7, the average bandwidth is the same for all links, but σ̄_{B_j} is distributed linearly among the N_p paths such that its mean among the paths is unchanged. In quantifying spatial heterogeneity, Δσ̄_B = σ̄_{B_1} − σ̄_{B_Np} is adopted. Over each of the N_p paths, the same amount of data (S_j = 10^5 packets for all j) is transmitted, and the transfer time is determined by the "slowest" path. The results show that the transfer time increases substantially as Δσ̄_B gets larger. The increase in transfer time is larger when each path becomes longer. This is because the temporal (effective) bandwidth heterogeneity of each path increases as the path length increases.

Figure 4.6: Throughput and end-to-end delay on a temporally and spatially heterogeneous serial path with b_i = 1000 pps for i = 1, ..., N. (a) & (b): σ̄_{B_i} = 0.1 for i = 2, ..., N; (c) & (d): σ̄_{B_i} = 0.52 for i = 2, ..., N.
Figure 4.7: Transfer time on a temporally and spatially heterogeneous parallel path: (a) each path is a single link, and (b) each path consists of 2 links, where Δσ̄_B = σ̄_{B_1} − σ̄_{B_Np}. b_j = 1000 pps and S_j = 10^5 packets for j = 1, ..., N_p. σ̄_{B_j} is varied linearly over the N_p paths such that its mean is fixed at 0.26.

It is also seen that the transfer time is longer for a larger N_p. As more paths are involved in a parallel path, the probability that the variation of effective bandwidth among the paths is larger increases, which makes the (average) transfer time longer. Note that the transfer time of a message transmitted over multiple paths depends on the "slowest" path, i.e., the path which finishes its transfer last.

4.3 Applications

In this section, two examples are considered where the effects of heterogeneity in channel bandwidth may be taken into account in order to improve (optimize) certain QoS's.

4.3.1 Path Selection

In Figure 4.8-(a), a parallel path is shown where each path consists of 4 links. Note that the average bandwidth of the links on path 3 is higher than those on paths 1 and 2, but σ̄_{B_j} (heterogeneity) is also larger for the links on path 3 than for those on paths 1 and 2. σ̄_{B_j} is constant for all links on path 1, while it is linearly distributed over the links on path 2. Note that the mean of σ̄_{B_i} over all links on each path is the same for both paths 1 and 2.

Figure 4.8: (a) A parallel path where the number of links is the same for all paths, and (b) a parallel path where the number of links varies with path. The number pair in brackets over each link represents b_i and σ̄_{B_i} of the link.

In Figure 4.9, the three paths are compared in terms of end-to-end delay and transfer time. First, it is seen that path 3 performs worst among the three in both QoS's, though its average bandwidth is higher than that of the other paths. This is due to its high temporal heterogeneity in bandwidth. Comparing paths 1 and 2, which have identical b_j, one can see that path 1 performs slightly better than path 2, since the spatial heterogeneity (among links) is higher on path 2 than on path 1. Therefore, one should consider {σ̄_{B_i}} in addition to {b_i} in selecting a path, in order to optimize the QoS's.

Figure 4.8-(b) shows another example of a parallel path where the number of links varies with path; in this case, 2, 3, and 4 links on paths 1, 2, and 3, respectively, are considered. All links have the same average bandwidth, but σ̄_{B_j} is largest on path 1 and smallest on path 3.

Figure 4.9: Comparison of the three paths in Figure 4.8-(a) in terms of (a) end-to-end delay and (b) transfer time.

In Figure 4.10, the same QoS's, end-to-end delay and
The smallest delay and transfer time are achieved on path 3 which is the \longest" (the largest number of links) among the three paths. Path 1 performs worst due to its highest (temporal) heterogeneity though it is the \shortest." Therefore, one should not simply select the shortest path even when the average bandwidth is the same on all links. 105 1060 0.01 0.02 0.03 0.04 0.05 Size of Data End?to?End Delay (s) path1path2path3 105 1060 500 1000 1500 2000 2500 Size of Data t trans (s) path1path2path3 (a) (b) Figure 4.10: Comparison of the three paths in Figure 4.8-(b) in terms of (a) end-to- end delay and (b) transfer time. 54 Implementation Two cases may be considered: (i) the source has the information on the behavior (bi;? Bi) of each link, and (ii) the source does not have this information. In the case (i), an analytic formula or look-up table which relates a given QoS to the set fbi;? Big characterizing a path is to be derived. If a path is temporally heterogeneous but spatially homogeneous, an analytic formula may be obtained for certain distributions as shown in Section 4.2.2.1 for the QoS of throughput which is proportional to the efiective bandwidth of a path. However, for a temporally and spatially heterogeneous path, it is much more challenging. With such an analytic formula or table available, a source refers to it in order to select a path when there are more than one possible path. In case (ii), the source needs to estimate a given QoS by measuring it for each possible path before the selection process. 4.3.2 Multi-Path Data Transfer Consider cases where multiple paths (or a parallel path) are available for trans- ferring a large size of data. That is, the data is partitioned for simultaneous transfers over multiple paths. In order to minimize the overall transfer time, it is necessary to determine how much data is to be transferred over each path (or equivalently trans- mission rate for each path), considering bandwidth heterogeneity on the multiple paths. Let?s deflne percentage reduction in transfer time as ttrans?t0transttrans ?100 where ttrans is the transfer time achieved when the partitioning is done considering only 55 the mean of each link?s bandwidth, and t0trans is the transfer time obtained by con- sidering the normalized standard deviations (heterogeneity) of link bandwidths also. Similarly, percentage reduction in end-to-end delay may be deflned as tee?t0eetee ?100. Recall that S denotes the total size of data and Sj the size of data to be trans- ferred over path j, i.e., S = Pj Sj. In Figure 4.11, cases where 2 paths are employed for data transfer and all links have the same average bandwidth (b = 1000 pps) are considered. Note that one would normally partition S so that S1 = S2 since b1 = b2. However, that does not lead to the optimal performance. In Figure 4.11-(a), both paths have the same number of links (N). It can be seen that a signiflcant reduction of up to 28% in transfer time is achieved by assigning more data to the path with lower temporal heterogeneity, path 1 in this example. Also, the reduction becomes larger when there is a larger difierence (spatial heterogeneity) between ? B1 and ? B2 and when each path is longer (a larger N). In Figure 4.11-(b), cases when each path has difierent number of links (N1 6= N2) are considered. The similar observations can be made in this result. However, the reduction is larger since N1 6= N2 leads to higher spatial heterogeneity (between the two paths). 
The results for cases where b_1 ≠ b_2, i.e., the two paths have different average link bandwidths, are shown in Figure 4.12. S_j is the size of the data to be transferred over path j when only the means of the link bandwidths {b_j, j = 1, 2} are taken into account, i.e., S_j is linearly proportional to b_j. Let S'_j denote the size of the data to be transferred over path j when considering {σ̄_{B_j}, j = 1, 2} in addition to {b_j, j = 1, 2}.

Figure 4.11: Reduction in transfer time on a parallel path (N_p = 2). b_j = 1000 pps for j = 1, 2. S = S_1 + S_2 = 2×10^6 packets. (a) N_1 = N_2 = N; (b) N_1 = 2, N_2 = 4.

In Figures 4.12-(a) and (b), it can be seen that the minimum transfer time (maximum reduction) is achieved by assigning a size of data larger than S_1 to path 1, on which the temporal heterogeneity (σ̄_{B_1}) is lower than on path 2. The larger the difference in temporal heterogeneity between the two paths (i.e., the larger the spatial heterogeneity), the larger the reduction. Also, a larger reduction is possible when the path with lower temporal heterogeneity (path 1) has a higher average bandwidth. In addition, there exists an optimal point at which the reduction is maximized, and the value of (S'_1 − S_1)/S for the optimal point depends on b_1 and b_2.

In Figures 4.12-(c) and (d), the reduction in the end-to-end delay shows a monotonic behavior, i.e., it always increases as (S'_1 − S_1)/S increases. This is the case since the per-packet delay is reduced as a larger fraction of the message is transmitted over the path with lower temporal heterogeneity, path 1 in this case (σ̄_{B_1} < σ̄_{B_2}). Therefore, one is to select a proper value of (S'_1 − S_1)/S depending on which QoS is to be optimized. Note that the heterogeneity among paths is likely to increase as N_p increases, and this would lead to an even larger maximum reduction in the transfer time.

Figure 4.12: Reduction in transfer time ((a), (b)) and end-to-end delay ((c), (d)) on a parallel path with N_p = 2. N_j = 2 for j = 1, 2. S = S_1 + S_2 = S'_1 + S'_2 = 2×10^6 packets, σ̄_{B_1} = 0.06, σ̄_{B_2} = 0.46, and b_1 + b_2 = 2000 pps. (a), (c): b_1 ≤ b_2; (b), (d): b_1 ≥ b_2.

Implementation

The main issue in multi-path data transfer is how to divide a given large message over the multiple paths to be utilized simultaneously for the message, given the effective bandwidth and its variation (b_j, σ̄_{B_j}) for each path j. Consider the QoS of the transfer time of a message. How the message is to be divided to minimize its overall transfer time is equivalent to dividing a computational task over multiple temporally heterogeneous computers to minimize the parallel execution time. In Chapter 3, an effective load balancing strategy was described along with its performance analysis results.
This strategy consists of two steps: in the first step, a task is partitioned proportional to the average computing power available on the computing nodes, and in the second step, the partitioning is adjusted by considering the standard deviations of the available computing powers. A similar scheme can be employed for balancing the communication load over temporally heterogeneous paths.

Chapter 5
Source Rate Control for Heterogeneous Communication Systems

Based on the analysis results in Chapter 4, the issue of controlling the source rate in order to optimize the performance of a communication task is considered, and a heterogeneity-aware approach to source rate control is described along with simulation results in this chapter.

5.1 Temporal Heterogeneity under Different Time Scales

Temporal heterogeneity of the available bandwidth on a communication path depends on the activity patterns of different network applications and protocols, and can be decomposed into two components, long-term and short-term, in the time domain.

Long-term heterogeneity is caused by the initiation and termination of long sessions, such as large file transfers, real-time video and audio streaming, and other bandwidth reservation applications. These types of applications "dominate" the variation of available network resources (bandwidth, buffer, etc.), and exhibit slow-varying characteristics.

Short-term heterogeneity is due to short-duration network traffic, such as the overhead in packet processing, buffering, and forwarding, and other sporadic network activities. Examples include Internet control messages (ICMP, RSVP), user inputs/outputs of a terminal session (telnet), browsing (small) web pages (http), messenger services (msn/yahoo messenger, icq), most of which are based on UDP, and short email messages (smtp, pop3). Dynamic behaviors of some transport layer protocols, like AIMD in TCP, also cause some short-term variation. This type of heterogeneity is relatively unpredictable.

Figure 5.1: Illustration of long-term and short-term temporal heterogeneity of the available bandwidth sensed by one flow (flow0) on a path.

A simplified illustration of long-term and short-term temporal heterogeneity, where multiple flows share a bottleneck link with a capacity of 1 Mbps, is given in Figure 5.1. The available bandwidth sensed by a certain long-duration flow (say flow0) over time is plotted in this figure. The queueing scheme is assumed to work in a fairly weighted round-robin fashion, i.e., the ideal effective bandwidth shared by flow0 is 1 Mbps/n_f when there are n_f co-existing long-duration flows. The "contour" of the available bandwidth (marked by the dotted lines) represents the ideal share for flow0, and can be modelled by 1 Mbps/n_f. Starting from time A, there are 3 co-existing long-duration flows, so the available bandwidth for flow0 is approximately 1 Mbps/3. At time B, one of the long-duration flows is closed, and the available bandwidth increases to about 1 Mbps/2. When one more flow is dropped at time C, all the bandwidth of the link (1 Mbps) is allocated solely to flow0, but after another flow is introduced at time D, the available bandwidth drops back to about 1 Mbps/2. Compared to the short-term temporal heterogeneity (the "spurs" in Figure 5.1), the long-term heterogeneity has a longer time duration and is much easier to track.
For example, when the long-term sessions have a duration significantly longer than the synchronization interval of a rate control scheme, the source rate can be adjusted so that it follows the shape of the available bandwidth contour. However, it is neither possible nor necessary to follow the short-term spurs, because they usually vary much faster than the rate control scheme can handle.

5.2 HAA Source Rate Control

5.2.1 Overview of the HAA Source Rate Control Scheme

The main idea of the HAA source rate control scheme is to have the source rate follow the long-term variation in the available bandwidth, and at the same time to adjust it by means of a "penalty factor" which is derived from the short-term heterogeneity (refer to Section 5.2.5). A finite-size feature extractor is used to collect statistical information on the short-term spurs, which is used by the source rate controller. Figure 5.2 shows the general abstract architecture of the HAA source rate control scheme, which consists of four components: Available Resource Measurement, Feature Extractor, Source Rate Allocator, and Source Rate Controller.

Figure 5.2: The abstract architecture of the HAA source rate control scheme.

The Available Resource Measurement component is responsible for measuring the available resources of interest, which include the available bottleneck bandwidth (ABBW) of a network path and the available buffer size on each router. The proposed scheme only measures ABBW. ABBW(t) refers to the minimum bandwidth of all inter-router links at time t, i.e., ABBW(t) = min_{i=1,...,N}{B_i(t)}, where i is the link id and B_i(t) is the available bandwidth of link i. Later in this dissertation, the notation ABBW_l will be used to represent ABBW measurement samples, where l is the sample index.

The Feature Extractor extracts quantified heterogeneity information (features) from the measurement samples of the available resources. Typical features are the first-order moment (mean), the second-order moment (deviation), and the minimum, maximum, and median of the measurement samples over a certain time duration. Currently, only the mean, deviation, minimum, and maximum are used in the proposed scheme. The Source Rate Allocator is responsible for computing the appropriate source rate for the corresponding communication session based on the extracted features, and the Source Rate Controller is the component which actually controls the source rate.

Figure 5.3: Illustration of the proposed HAA communication model.

The schematic diagram of the proposed HAA source rate control communication model is shown in Figure 5.3. The receiver is responsible for measuring the bottleneck bandwidth and extracting the features from the bandwidth measurement samples. It then periodically (by default, once per round-trip time) sends control packets, which contain the information on the allocated rate, to the sender. Upon receiving each control packet, the sender tries to adapt to the allocated source rate.

5.2.2 Bottleneck Bandwidth Measurement

The bottleneck bandwidth is measured with a packet-pair [28][29][30] based approach in the current implementation. The source sends out data in the form of back-to-back packets, and the measurement of ABBW(t) is performed on the receiver side. An up-link (from receiver to sender) carries control packets, and a down-link (from sender to receiver) transmits data. The latter has a much heavier load and experiences higher heterogeneity than the former.
As a result, the receiver-side measurement can obtain more accurate and timely information on the network load.

Figure 5.4 shows the transmission of packets along a network path. In order to avoid errors due to lost or out-of-order packets, ABBW is measured only for "valid packet pairs" received. A valid packet pair consists of two packets received successively, where the first packet, with an even sequence number 2l, is followed by the other, with the sequence number 2l + 1. The ABBW measured by the l-th valid packet pair (ABBW_l) is

ABBW_l = 1/(T_{2l+1} − T_{2l})        (5.1)

where T_{2l} and T_{2l+1} are the receiving times of the two packets received consecutively. The time difference ΔT_l = T_{2l+1} − T_{2l} is dependent on the available bandwidth of the bottleneck link.

Figure 5.4: Illustration of packet-pair based bandwidth measurement. The horizontal dimension is time, and from top to bottom along the vertical dimension are the sender, the intermediate links, and the receiver. Numbers in small boxes are the packet sequence numbers.

5.2.3 Feature Extractor

The features that may be used in the proposed scheme include the mean (abbw), standard deviation (σ_ABBW), minimum (ABBW^min), and maximum (ABBW^max) of the available bottleneck bandwidth measurement samples (ABBW_l) within an interval, as shown in Figure 5.5. All features are calculated over a sliding window of size L (in terms of the number of packet pairs), which has a longer time duration than the sampling interval. The feature extractor is designed to extract the features for the measurement samples of the last L packet pairs (from ABBW_{l−L+1} to ABBW_l), where L is the size of the feature extractor and l is the packet-pair index (or sample index). The feature interval is defined to be the time interval over which the L packet pairs are spread.

Figure 5.5: Feature extractor.

In order to reduce the computing and storage complexity, two accumulators (A1, A2) are employed for the sums of ABBW_l and ABBW_l², and two other memory variables (ABBW^max and ABBW^min) are used to record the maximum and minimum values over the last L samples. The initialization code is executed immediately after the first valid packet pair is received and the corresponding ABBW_0 is obtained (it resets A1 and A2 to L·ABBW_0 and L·ABBW_0², respectively); the iteration code updates the sums of the last L values of ABBW_l and ABBW_l² after each valid packet pair l is received; and the extraction code is performed at every control interval, which will be described in Section 5.2.5. These codes are shown in Figure 5.6.

/* initialization code */
A1 = L * ABBW_0;  A2 = L * ABBW_0^2

/* iteration code */
A1 = A1 + ABBW_l - ABBW_{l-L}
A2 = A2 + ABBW_l^2 - ABBW_{l-L}^2

/* extraction code */
abbw     = A1 / L
σ_ABBW   = sqrt( A2/(L-1) - (L/(L-1)) * abbw^2 )
ABBW^max = max(ABBW_j, l-L+1 ≤ j ≤ l)
ABBW^min = min(ABBW_j, l-L+1 ≤ j ≤ l)

Figure 5.6: Initialization, iteration, and extraction codes for the feature extractor.
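Putting Equation 5.1 together with the valid-pair rule, a minimal receiver-side sketch might look as follows (ours; loss and reordering are handled simply by skipping incomplete pairs, and the resulting samples would feed the accumulators of Figure 5.6).

def abbw_samples(recv_times):
    """Packet-pair estimate of the available bottleneck bandwidth
    (Equation 5.1): for each valid pair with sequence numbers (2l, 2l+1),
    ABBW_l = 1 / (T_{2l+1} - T_{2l}).  recv_times maps sequence number to
    receive time; pairs with a missing or reordered packet are skipped."""
    if not recv_times:
        return []
    samples = []
    last_pair = max(recv_times) // 2
    for l in range(last_pair + 1):
        t_even = recv_times.get(2 * l)
        t_odd = recv_times.get(2 * l + 1)
        if t_even is not None and t_odd is not None and t_odd > t_even:
            samples.append(1.0 / (t_odd - t_even))   # packets per second
    return samples

# Two pairs with 1 ms gaps -> roughly 1000 pps each.
print(abbw_samples({0: 0.000, 1: 0.001, 2: 0.050, 3: 0.051}))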
5.2.4 Simple Source Rate Allocation Approaches

Three simple source rate control schemes are described for comparison with the proposed scheme (HAA):

Mean-Only Scheme: The source rate may be controlled to follow abbw, the mean of the measured ABBW. This scheme is easy to implement, and the throughput is close to the bottleneck link's capacity. However, it causes a high dropping rate and a long end-to-end delay due to the limited packet buffer size.

Scaling Scheme: A scaling factor, α (0 < α < 1), may be employed to make the system work under a lighter load, thus lowering the dropping rate and end-to-end delay [29][30]. The main problem with this scheme is that it sacrifices throughput unnecessarily when the variation of ABBW is small. For example, when α is chosen to be 0.9, 10 percent of the bandwidth is always wasted, even if the bandwidth variation is very small (or there is no variation). The situation is even worse when the bandwidth variation is large and α is not small enough.

Minimum-Bandwidth Scheme: The source rate may be set to ABBW^min at each synchronization interval. However, this is an overly conservative scheme, especially when the ABBW variation is large. For example, when there is a deep valley (ABBW^min) in a feature interval, as shown in Figure 5.5, a lot of bandwidth would be wasted, since most of the time the available bandwidth is above this value.

5.2.5 Heterogeneity-Aware Approach

Synchronization

In order to alleviate the network load, control packets are sent to the sender at every synchronization interval. The synchronization interval determines how frequently control packets are sent to the sender; therefore, the shorter the synchronization interval is, the faster the sender can adapt the source rate to the variation of the available bandwidth, although more of the up-link bandwidth will be consumed. Usually, it should not be shorter than one round-trip time.

The synchronization interval in the proposed scheme is a function of time, T_sync(t), which is set to be smaller when a larger long-term variation is detected, so that the source rate can adapt to the actual ABBW faster. The effect of T_sync is discussed in the next section.

HAA Source Rate Function

The main idea of the proposed scheme is to derive the desired source rate r_s at every synchronization interval, based on the set, EF_k, of extracted features, including abbw_k, σ̄_{ABBW_k}, ABBW_k^min, and ABBW_k^max, where k is the synchronization interval index. A general form of the HAA source rate function f(ABBW_k, EF_k, t) is shown in Equation 5.2:

r_s(k) = f(ABBW_k, EF_k, t)        (5.2)

An implementation of the source rate function is shown in Equation 5.3, which can detect and utilize both long-term and short-term heterogeneity:

r_s(k) = max[ABBW_k^min, abbw_k (1 − β σ̄_{ABBW_k})]   for σ̄_{ABBW_k} ≤ σ̄_threshold
r_s(k) = ABBW_k                                        for σ̄_{ABBW_k} > σ̄_threshold        (5.3)

where σ̄_threshold is selected such that σ̄_{ABBW_k} exceeds it when there is a clear large-scale edge (rising or falling) in ABBW. A significant transition is most likely due to the release or injection of a long-term competing flow, and the magnitude of the corresponding rise or fall in ABBW is assumed to be significantly larger than the variation due to the short-term heterogeneity. When σ̄_{ABBW_k} is less than σ̄_threshold, r_s(k) is penalized by β σ̄_{ABBW_k}, but is set no lower than ABBW_k^min. β is a penalty factor which controls how much the source rate should be penalized for the heterogeneity. The rationale is that the source rate receives a smaller penalty when σ̄_{ABBW_k} (short-term heterogeneity) is small, and a larger penalty when σ̄_{ABBW_k} is large. When σ̄_{ABBW_k} becomes greater than σ̄_threshold, i.e., there is a clear rising or falling edge, r_s(k) is set to the most recent ABBW_k in order to obtain a more accurate estimation of the long-term variation in the current implementation. Also, in order to get a faster response, T_sync is made much smaller (thus faster) than in the other operation mode. Due to the delay from ABBW to abbw, the source rate is over-estimated on falling transitions and under-estimated on rising transitions. A more sophisticated predictive scheme could be developed to further improve the performance.
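Equation 5.3 can be written directly as a small function (a sketch; the argument names and the example threshold value are ours, and the most recent ABBW sample is used for the second branch, as described above).

def haa_rate(abbw_k, sigma_bar_k, abbw_min_k, abbw_latest_k,
             beta=0.7, sigma_threshold=0.5):
    """Equation 5.3: below the threshold the mean rate is penalized by
    beta * sigma_bar but never pushed under the observed minimum; above the
    threshold (a clear rising or falling edge) the most recent ABBW sample
    is used directly."""
    if sigma_bar_k <= sigma_threshold:
        return max(abbw_min_k, abbw_k * (1.0 - beta * sigma_bar_k))
    return abbw_latest_k

print(haa_rate(abbw_k=1000.0, sigma_bar_k=0.2, abbw_min_k=700.0, abbw_latest_k=950.0))
print(haa_rate(abbw_k=1000.0, sigma_bar_k=0.8, abbw_min_k=400.0, abbw_latest_k=1800.0))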
Adaptive Penalty (β)

The main issue in the proposed HAA approach to source rate control is how to select a proper value of the penalty factor (β). In many cases, the users specify QoS requirements that their applications should meet. A typical specification of QoS requirements is given below:

    Throughput:        thr ≥ THR_min
    End-to-End Delay:  t_ee ≤ T_ee^max
    Dropping Rate:     dr ≤ DR_max        (5.4)

The means (averages) of the throughput (THR), end-to-end delay (T_ee), and dropping rate (DR) are denoted by thr, t_ee, and dr, respectively. THR_min is the minimum throughput requirement, and T_ee^max and DR_max are the upper bounds on the end-to-end delay and dropping rate, respectively. These are all user-specified parameters, and are given during the session admission control stage.

Figure 5.7: QoS requirements and valid range of β.

The proposed scheme determines the valid range of β so that all the QoS requirements can be met by adaptively adjusting β. This is based on the fact that each QoS measure mentioned above is a monotonic function of β, and hence the upper/lower bound of β for each QoS measure can be found via a simple heuristic search:

    Throughput:        β^thr ≤ β^thr_max
    End-to-End Delay:  β^tee ≥ β^tee_min
    Dropping Rate:     β^dr ≥ β^dr_min        (5.5)

where β^thr_max, β^tee_min, and β^dr_min are the bounds of β that satisfy the requirements on throughput, end-to-end delay, and dropping rate, respectively, as shown in Figure 5.7. The valid range of β which satisfies all the QoS requirements is then

max(β^tee_min, β^dr_min) ≤ β_valid ≤ β^thr_max        (5.6)

If max(β^tee_min, β^dr_min) > β^thr_max, no β can be found to meet all the QoS requirements, and the admission control thus rejects the flow.

A "Slowly-Decreasing-Fast-Increasing" (SDFI) scheme is proposed to adaptively determine a near-optimal β. It goes through a 3-stage cycle when a QoS requirement violation is detected. Let Δβ_slow denote the "slowly decreasing step" (0.1 by default), Δβ_fast the "fast increasing step" (0.6 by default) of β, and β_0 the initial value in the cycle. To reduce the oscillation in the adaptation, a margin is introduced for each QoS measure (QoS_margin), and the adjusted QoS measures are then

    THR'_min  = THR_min + QoS^thr_margin
    T_ee'^max = T_ee^max − QoS^tee_margin
    DR'_max   = DR_max − QoS^dr_margin        (5.7)

The margin for each QoS measure depends on the tolerance requirements of the application. It is either provided by the user as a part of the QoS requirements, or set by the SDFI scheme via experiments. The adjusted QoS measures are "tighter" than those given in the QoS specification, and are used during the Slowly Decreasing stage. The complete pseudo-code of the SDFI scheme is shown in Figure 5.8.

while (true) {
    // Fixed stage
    while (t_ee < T_ee^max && dr < DR_max && thr > THR_min);
    // Fast Increase stage
    β = β + Δβ_fast;
    // Slowly Decrease stage
    while (t_ee < T_ee'^max && dr < DR'_max && β ≥ Δβ_slow) {
        β = β − Δβ_slow;
    }
    β = β + Δβ_slow;
    if (thr < THR'_min) exit();   // cannot meet all requirements
}

Figure 5.8: Pseudo-code of the proposed SDFI scheme.
A "Slowly-Decreasing-Fast-Increasing" (SDFI) scheme is proposed to adaptively determine a near-optimal β. It goes through a 3-stage cycle when a QoS requirement violation is detected. Let Δβ_slow denote the "slowly decreasing step" (0.1 by default), Δβ_fast the "fast increasing step" (0.6 by default) of β, and β_0 the initial value in the cycle. To reduce the oscillation in the adaptation, a margin is introduced for each QoS measure (QoS_margin), and then the adjusted QoS measures are

$$ \begin{cases} THR'_{min} = THR_{min} + QoS^{thr}_{margin} \\ Tee'_{max} = Tee_{max} - QoS^{tee}_{margin} \\ DR'_{max} = DR_{max} - QoS^{dr}_{margin} \end{cases} \qquad (5.7) $$

The margin for each QoS measure depends on the tolerance requirements of applications. It is either provided by the user as a part of the QoS requirements, or set by the SDFI scheme via experiments. The adjusted QoS measures are "tighter" than those given in the QoS specification, and are used during the Slowly Decreasing stage. The complete pseudo-code of the SDFI scheme is shown in Figure 5.8.

    while (true) {
        // Fixed stage
        while (tee < Teemax && dr < DRmax && thr > THRmin);
        // Fast Increase stage
        β = β + Δβfast;
        // Slowly Decrease stage
        while (tee < Tee'max && dr < DR'max && β >= Δβslow) {
            β = β - Δβslow;
        }
        β = β + Δβslow;
        if (thr < THR'min)
            exit();   // cannot meet all requirements
    }

Figure 5.8: Pseudo-code of the proposed SDFI scheme.

(1) Fixed stage: the receiver keeps checking all the QoS measures, and β remains unchanged until any of them violates its requirement.

(2) Fast Increasing stage: β is incremented by Δβ_fast.

(3) Slowly Decreasing stage: β is decremented by Δβ_slow in each step until a QoS measure exceeds its adjusted bound; β is then incremented by Δβ_slow, i.e., it returns to the last valid value.

The operation of the SDFI scheme is illustrated in the time domain in Figure 5.9.

Figure 5.9: Operation of the SDFI scheme.

Chapter 6

Performance of the Source Rate Control Scheme

6.1 Simulation Setup

An extensive simulation has been carried out using a widely used network simulator (ns-2 [44]). The performance of a single HAA flow is first studied. The packet size in all the results in this chapter is 1000 bytes. The size of the buffer on each router is set to 20 packets by default, and the queuing scheme is drop-tail round-robin. The default length (L) of the feature extractor is 15 samples. The default synchronization interval is set to one round-trip time, except when a significant transition in ABBW is detected.

Topology

Figure 6.1: Simulation testbench topology for a single path.

The topology of a single path employed in the simulation is shown in Figure 6.1. The link capacity and propagation delay for the local connection links (thick lines) are 5 Mbps and 3 ms, respectively. For each inter-router link i, the propagation delay is 10 ms and the available bandwidth B_i(t) is a random variable whose upper limit is 5 Mbps.

Queue Management

A round-robin-like queue management is assumed for the network routers in the simulation, since it offers several advantages over first-in-first-out (FIFO) packet schedulers for bursty data traffic. In particular, the round-robin queue management automatically enforces a min-max fair allocation of resources [45], and ensures that the "well-behaving" users are protected from the "ill-behaving" ones, which is desirable in public data networks [30][46].

Traffic Generation

Instead of directly simulating the background traffic, the available bandwidth on each link is changed dynamically in the simulation script. This approach enables easy control of the link bandwidth to simulate the heterogeneity under different scenarios, and also saves simulation time. The varying bandwidth is simulated in two dimensions: magnitude and time. The magnitude variation on a link is independent of those on the other links and follows a certain distribution. Several different distributions have been tried, including the uniform distribution and the truncated normal distribution, which all lead to similar results. Therefore, all simulation results provided in this section assume a uniform distribution. The magnitude of the bandwidth on link i follows a uniform distribution with a mean of b_i and a normalized standard deviation of σ_Bi. In the time dimension, each random magnitude of bandwidth remains unchanged over a random interval (T_interval) which follows an exponential distribution with a mean of t_interval.
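As a concrete illustration of this bandwidth model, the sketch below generates one link's piecewise-constant bandwidth trace under the stated assumptions (uniform magnitude with mean b_i and normalized deviation σ_Bi, exponential holding times with mean t_interval). The function name and output format are illustrative; the actual simulation drives ns-2 from a script instead.

```python
# A minimal sketch of the varying-bandwidth model used for traffic generation.
import math
import random


def bandwidth_trace(b_mean, sigma_norm, t_interval, duration, seed=None):
    """Yield (start_time, bandwidth) steps for one link over `duration` seconds."""
    rng = random.Random(seed)
    # For U(a, b): mean = (a + b) / 2 and std = (b - a) / sqrt(12), so the
    # half-width of the distribution is sqrt(3) * sigma_norm * b_mean.
    half_width = math.sqrt(3.0) * sigma_norm * b_mean
    t = 0.0
    while t < duration:
        bw = rng.uniform(b_mean - half_width, b_mean + half_width)
        yield t, max(bw, 0.0)
        t += rng.expovariate(1.0 / t_interval)   # exponential holding time

if __name__ == "__main__":
    for t, bw in bandwidth_trace(b_mean=2e6, sigma_norm=0.17,
                                 t_interval=0.05, duration=0.3, seed=1):
        print(f"t={t:.3f}s  B={bw / 1e6:.2f} Mbps")
```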
6.2 Single Flow

6.2.1 Adaptation to Long-Term Bandwidth Variation

Figure 6.2: Comparison of three source rate control schemes on a path with an inter-router link whose available bandwidth (ABBW) varies with time. The penalty factor (β) is set to 0.7 for HAA. (a) Rising Edge; (b) Falling Edge.

Figure 6.2 shows how the three different source rate control schemes adapt the source rate to the available bottleneck bandwidth (ABBW). ABBW is generated by a "slow" square wave (with a period of 5 seconds and a magnitude of 1 to 3 Mbps) plus a high-frequency variation. The average interval of the high-frequency component is 50 ms and its magnitude is no greater than 500 Kbps. It is shown that all three schemes can follow ABBW on a large time scale, but the minimum-bandwidth scheme is too conservative to utilize the available bandwidth efficiently, and the mean-only scheme is too greedy, as the allocated bandwidth is higher than ABBW for about half of the period, which leads to a large dropping rate. The proposed scheme (HAA) tends to adopt a rate between those of the mean-only and minimum-bandwidth schemes in an effort to optimize performance measures such as throughput and dropping rate.

6.2.2 Feature Extractor Size and Average Bandwidth Updating Interval

Figure 6.3: Effects of feature extractor size (L) and t_interval. N=4, σ_Bi=0.17 for all 4 links. (a) Dropping Rate; (b) End-to-End Delay.

In Figure 6.3, the effects of the feature extractor size (L) and the average bandwidth updating period (t_interval) on the dropping rate and end-to-end delay are analyzed. Comparing β = 0 and β > 0 (HAA), it is seen that the improvement by HAA is more significant when the bandwidth varies more rapidly (t_interval=0.5, dashed lines in the figure). Also, the results indicate that the feature extractor size L should not be too small. The performance (in terms of the dropping rate and end-to-end delay) for L=15 or 30 is better than that for L=5, while the improvement of L=30 over L=15 is not very large; therefore, L=15 is used in most cases of the simulation.
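A minimal sketch of the receiver-side feature extractor discussed above is given below: a sliding window of the last L ABBW samples from which the mean, normalized standard deviation, minimum, and maximum are computed at every synchronization interval. The class name and interface are assumptions for illustration, not the implementation used in the simulation.

```python
# A minimal sketch of the receiver-side feature extractor (window length L).
from collections import deque
from statistics import mean, pstdev


class FeatureExtractor:
    def __init__(self, window_len=15):          # default L = 15 samples
        self.samples = deque(maxlen=window_len)

    def update(self, abbw_sample):
        self.samples.append(abbw_sample)

    def features(self):
        """Return (abbw mean, normalized sigma, ABBW min, ABBW max)."""
        m = mean(self.samples)
        sigma = pstdev(self.samples) / m if m > 0 else 0.0
        return m, sigma, min(self.samples), max(self.samples)


if __name__ == "__main__":
    fx = FeatureExtractor(window_len=15)
    for s in (2.1e6, 1.8e6, 2.4e6, 1.6e6, 2.0e6):
        fx.update(s)
    print(fx.features())
```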
6.2.3 HAA Compared with the Packet Spacing Protocol (PSP)

HAA essentially uses a packet spacing approach to adjust the source rate at the sender. It is compared with the Packet Spacing Protocol (PSP) [41][42], which is another packet-spacing-based rate control scheme. The comparison results are provided in Figure 6.4. The penalty factor β is set to 0 in this simulation, i.e., non-adaptive and with no heterogeneity penalty.

Figure 6.4: Comparison of HAA (β = 0) with PSP. (a) Source Rate; (b) Effective Throughput; (c) Dropping Rate; (d) End-to-End Delay.

It can be seen that the source rate variation of the proposed scheme is much smaller than that of PSP (Figure 6.4-(a)). The effective throughput (Figure 6.4-(b)) of HAA is about 25% better than that of PSP, and the improvement becomes even larger when the heterogeneity (σ_B) increases. The dropping rate is also decreased significantly, especially when σ_B is small, from 0.004 to almost 0. The only measure that is worse for the proposed scheme than for PSP is the end-to-end delay. However, it should be noted that PSP drops more packets, which are not counted toward the end-to-end delay, and HAA achieves a significantly higher throughput.

6.2.4 HAA versus the Mean-Only and Minimum-Bandwidth Schemes

In Figure 6.5, the effect of β on the performance of HAA is analyzed, and HAA is also compared with the two other fixed rate control schemes (mean-only and minimum-bandwidth) under two different bandwidth-varying scenarios: t_interval = 0.05 s (fast) and t_interval = 0.5 s (slow). In each scenario, three sets of results are obtained for three different levels of heterogeneity (σ_Bi = 0.06, 0.17, and 0.29, respectively) with β varied. The results for HAA are shown in the solid curves and the corresponding results for the minimum-bandwidth scheme in the dotted curves. Note that the results for the mean-only scheme correspond to the y-intercepts of the solid curves (β=0).

Figure 6.5: N = 4; each link's bandwidth B_i(t) follows a uniform distribution with deviation σ_Bi. "(min)" in the legend stands for the minimum-bandwidth scheme. (a), (b) end-to-end delay; (c), (d) dropping rate; (e), (f) throughput, for t_interval = 0.05 and 0.5.

It can be seen that there is a trade-off between the throughput and the end-to-end delay/dropping rate. As β increases, the end-to-end delay and dropping rate decrease, but the effective throughput also decreases. It is the adaptive algorithm's task to find a proper value of β (refer to Section 5.2.5). Also, the effect of β becomes more significant for higher heterogeneity (σ_Bi=0.29, open circles in the figures), especially for the dropping rate and throughput. By utilizing the information on heterogeneity, the end-to-end delay (Figure 6.5-(a) and (b)) and the dropping rate (Figure 6.5-(c) and (d)) can be reduced significantly compared to the mean-only scheme (β=0).
The end-to-end delay and dropping rate in the case of a more rapidly varying bandwidth (t_interval=0.05 in Figure 6.5-(a), (c), and (e)) decrease even faster when the penalty factor (β) is increased, compared to the case of a slowly varying bandwidth (t_interval=0.5 in Figure 6.5-(b), (d), and (f)). However, the throughput also drops more for a more rapidly varying bandwidth. This is because, when t_interval is large (slowly varying bandwidth) and the extractor size (L) remains the same, the measured bandwidth deviation (σ_B) is less than the steady-state (theoretical) deviation, which is defined over a very long time duration; therefore, the throughput is degraded less.

As expected, the minimum-bandwidth scheme matches HAA when β is close to 1.5. The reason is that when the bandwidth variation follows a uniform distribution, the minimum value of the distribution is around mean − √3·dev, and √3 ≈ 1.7. Nevertheless, there is still room for performance optimization. For example, when β is set to 1.0 for t_interval=0.05, the performance of HAA is similar to that of the minimum-bandwidth scheme in terms of the end-to-end delay and dropping rate, although its throughput is significantly (at least 10%) higher than that of the minimum-bandwidth scheme.

It can also be seen that all the performance measures vary monotonically with β. Therefore, with the adaptive rate control scheme one could start from a conservative value of β, such as 2.0, which is slightly more conservative than the minimum-bandwidth scheme, and decrease β by a step size of Δβ until any of the requirements is violated (refer to Section 5.2.5).

6.2.5 Effects of Spatial Heterogeneity

The effects of different spatial distributions of the mean bandwidth (b_i) of each link i are studied and the results are shown in Figure 6.6. Four different scenarios are simulated on a path consisting of 1 to 9 links (a sketch of these per-link distributions follows the list).

- Scenario "one bottleneck": b_i is set to 2 Mbps for the middle link and to 5 Mbps for all the other links.
- Scenario "increasing b_i": b_i linearly increases along the path from 2 Mbps to 5 Mbps (set to 2 Mbps if there is only one link).
- Scenario "decreasing b_i": b_i linearly decreases along the path from 5 Mbps to 2 Mbps (set to 2 Mbps if there is only one link).
- Scenario "uniform b_i": b_i is set to 2 Mbps for all links.
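For reference, the four per-link mean-bandwidth distributions listed above can be generated as in the following sketch; the function name is an illustrative assumption, while the 2 Mbps and 5 Mbps levels are those given in the text.

```python
# A minimal sketch of the four spatial distributions of mean link bandwidth
# used for Figure 6.6, for a path of n_links inter-router links.

def mean_bandwidths(scenario, n_links):
    """Return a list of per-link mean bandwidths b_i in bps."""
    if n_links == 1:
        return [2e6]
    if scenario == "one bottleneck":
        b = [5e6] * n_links
        b[n_links // 2] = 2e6                  # single fixed bottleneck in the middle
        return b
    if scenario == "increasing bi":            # 2 Mbps near the source, 5 Mbps near the sink
        return [2e6 + (5e6 - 2e6) * i / (n_links - 1) for i in range(n_links)]
    if scenario == "decreasing bi":
        return list(reversed(mean_bandwidths("increasing bi", n_links)))
    if scenario == "uniform bi":
        return [2e6] * n_links
    raise ValueError(scenario)


if __name__ == "__main__":
    for s in ("one bottleneck", "increasing bi", "decreasing bi", "uniform bi"):
        print(s, [round(b / 1e6, 2) for b in mean_bandwidths(s, 5)])
```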
Figure 6.6: Effects of path length. All σ_Bi are set to 0.17. (a) Throughput; (b) End-to-end delay.

Figure 6.6 shows that the "uniform b_i" scenario has the worst performance in terms of throughput, because it does not have a fixed bottleneck, i.e., every link has the opportunity to be the bottleneck. According to the previous results (Equation 4.2) on the effective bandwidth of a network path consisting of links with identical mean bandwidth, and the fact that the throughput can be no more than the effective bandwidth (b_effective) of the path, the throughput decreases as the path length increases. In the "one bottleneck" scenario, the throughput is almost independent of the path length. This is because the network path has only one bottleneck in a fixed position. All the other links' bandwidths are much larger than the bottleneck bandwidth in this scenario; therefore, the range of the bandwidth variation of the other links (from 3.5 to 6.5 Mbps) does not overlap with that of the bottleneck link (from 1.4 to 2.6 Mbps). The "increasing b_i" and "decreasing b_i" scenarios fall between the two scenarios discussed earlier. It can be seen that the throughput of the "increasing b_i" scenario is lower and decreases more than that of the "decreasing b_i" scenario. This is because, when the links closer to the source node are slower, the faster links closer to the receiver node will be under-utilized. This effect becomes more significant when the path length increases.

Figure 6.7 shows how different spatial distributions of temporal heterogeneity along the path affect the performance of the proposed scheme. Three different scenarios are simulated for a path of four links. Note that the average of σ_Bi along the path is the same in all cases. The first scenario (cross marks) has the smallest spatial heterogeneity in terms of link bandwidth variation (0.18, 0.18, 0.18, 0.18) along the path, and the third scenario (circle marks) has the largest (0.0, 0.12, 0.24, 0.36). It can be seen that the scenario with the largest spatial heterogeneity leads to the worst performance, with the lowest throughput and the largest dropping rate and end-to-end delay.

Figure 6.7: N = 4, heterogeneous links; all links have the same mean ABBW_i=2 Mbps. Distributions of σ_Bi are as shown in the legends. (a), (b) end-to-end delay; (c), (d) dropping rate; (e), (f) throughput, for t_interval = 0.05 and 0.5.

6.2.6 Packet Inter-arrival Time

The packet inter-arrival time (ΔT) is an important performance measure for a real-time media (video or audio) streaming application. A large variation of ΔT usually indicates a notable discontinuity or jitter on the receiver side. To provide a stable stream, ΔT should be as even as possible (assuming a fixed packet size). The histograms of ΔT for TCP and HAA flows are shown in Figure 6.8. It can be seen that ΔT has a much sharper histogram (i.e., a smaller variation of ΔT) and a smaller mean (i.e., a higher average rate) for HAA than for TCP. Also, note that there are some components located at much larger values than the main components (around 40 ms for 4 links and 240 ms for 12 links) in the TCP histograms. This is due to the bursty nature of TCP traffic. Figure 6.9 provides a qualitative comparison of TCP and HAA in terms of the normalized deviation of ΔT (σ_ΔT), and it shows that σ_ΔT of HAA is much smaller and less sensitive to (almost independent of) the path length than that of TCP.
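A small sketch of how the inter-arrival-time statistics behind Figures 6.8 and 6.9 can be computed from receiver-side timestamps is given below; the function name and the synthetic smooth/bursty traces in the demo are illustrative assumptions.

```python
# A minimal sketch of the inter-arrival-time statistics: Delta-T values are
# computed from arrival timestamps, and their normalized standard deviation
# serves as a jitter index.
from statistics import mean, pstdev


def interarrival_stats(arrival_times):
    """Return (mean Delta-T, normalized std of Delta-T) for sorted timestamps (s)."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    m = mean(gaps)
    return m, (pstdev(gaps) / m if m > 0 else 0.0)


if __name__ == "__main__":
    smooth = [0.04 * i for i in range(100)]      # one packet every 40 ms
    bursty = []
    t = 0.0
    for _ in range(50):                          # back-to-back pairs, then idle gaps
        bursty += [t, t + 0.005]
        t += 0.08
    print(interarrival_stats(smooth))            # small normalized deviation
    print(interarrival_stats(bursty))            # large normalized deviation
```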
6.2.7 Adaptive Penalty

Figure 6.10 shows how the proposed adaptive scheme behaves when the upper bound of the end-to-end delay (Tee_max) is set to a specific value (0.07 seconds in this example). It is seen that β decreases from a conservative initial value of 1.9 until the QoS requirement on the end-to-end delay is violated (at around 25 seconds), and β converges to around 1.0, with the end-to-end delay close to 0.07.

Figure 6.8: Histograms of inter-arrival times (128 bins). (a) 4 links, 1 HAA flow; (b) 4 links, 1 TCP flow; (c) 12 links, 1 HAA flow; (d) 12 links, 1 TCP flow.

Figure 6.9: Deviation of inter-arrival time.

Figure 6.10: Illustration of the proposed adaptive rate control scheme.

6.2.8 Path Selection

Because the source rate (r_s), which is calculated on the receiver (refer to Equation 5.3), should be almost the same as the effective throughput of the network path when the dropping rate is very low, the solution to the path selection problem is straightforward. An example network scenario consisting of 3 paths is shown in Figure 6.11. Note that both path 1 and path 2 have only one bottleneck link, while path 3 has two bottleneck links. However, the bottleneck bandwidth variation of path 3 is less than that of paths 1 and 2. The penalty factor (β) is set to 1.0 for all 3 paths so that the dropping rate is very low. The simulation results are shown in Figure 6.12.

Figure 6.11: Parallel paths where the number of links varies with the path. The number pair in brackets over each link represents b_i and σ_Bi of the link.

Figure 6.12: Throughput and end-to-end delay when the penalty factor (β) is set to 1.0. (a) abbw and throughput; (b) end-to-end delay.

It can be seen that even though path 3 has the lowest average bottleneck bandwidth (abbw) and the longest path length, it produces the highest effective throughput. This result reveals that abbw should not be the only measure used in the path selection problem, and that information on the bandwidth variation (σ_ABBW) has to be considered. As can be seen in Figure 6.12, the source rate r_s matches the effective throughput quite well; therefore, it could be used as a good estimate of throughput when selecting the best path.
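Under the assumption stated above, that r_s tracks the effective throughput when the dropping rate is kept low, path selection reduces to ranking the candidate paths by their estimated source rates rather than by abbw alone. The sketch below illustrates this; the per-path feature values and the function name are made-up examples, not the measurements behind Figures 6.11 and 6.12.

```python
# A minimal sketch of the path-selection criterion suggested in Section 6.2.8.

def select_path(paths, beta=1.0):
    """paths: dict name -> (abbw_mean, sigma_abbw, abbw_min).
    Return the name of the path with the largest estimated source rate."""
    def rate(features):
        abbw_mean, sigma, abbw_min = features
        return max(abbw_min, abbw_mean * (1.0 - beta * sigma))
    return max(paths, key=lambda name: rate(paths[name]))


if __name__ == "__main__":
    candidates = {
        "path1": (2.0e6, 0.29, 1.0e6),   # high mean, large variation
        "path2": (2.0e6, 0.25, 1.1e6),
        "path3": (1.9e6, 0.06, 1.7e6),   # slightly lower mean, small variation
    }
    print(select_path(candidates, beta=1.0))   # path3 wins despite the lower mean
```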
6.2.9 Multi-Path Media Streaming

The application of multi-path media streaming with inter-path synchronization is considered. The topology of the multi-sender (mirror) parallel data transmission system used in the simulation is shown in Figure 6.13, which is based on the framework developed in [47]. If all paths are synchronized "perfectly" and all connections are reliable, no media packet is missed or transmitted more than once. Note that, due to heterogeneity, this perfect synchronization can never be achieved in a real network.

In addition to the performance measures discussed in the previous sections (throughput, dropping rate, end-to-end delay), which are defined on the lower network layers, it is necessary to consider performance measures related to the degree of mis-synchronization on the application layer: the missing rate and the duplicate rate. The missing rate is defined to be the ratio of "blank segments" to the whole length of the stream received. These blank segments are caused both by mis-synchronization on the application layer and by packet dropping on the lower network layers. The duplicate rate is defined to be the ratio of duplicated segments (those received more than once) to the whole length of the stream received. Both are application-level measures and are calculated on the receiver side. To simplify the simulation, the segment size is set to be the same as the packet size.

Figure 6.13: Simulation testbench topology for multi-path media streaming.

Simulation Results

Figures 6.14 and 6.15 show the simulation results for media streaming over a network with 2 parallel paths, each with 4 inter-router links. All links have the same average available bandwidth (b=2 Mbps) and the links on path i have the deviation σ_Bi for i = 1, 2. It is seen that a higher source rate is allocated to path 1, which has the lower temporal heterogeneity (σ_B1=0.06), although the average available bandwidth is the same on both paths, as shown in Figure 6.14-(a) and (b).

Figure 6.14: 2 paths, each with 4 links; all links have the same b_i=2 Mbps but different deviations σ_Bi. (a) σ_B1 = 0.06, σ_B2 = 0.17; (b) σ_B1 = 0.06, σ_B2 = 0.29.

Figure 6.15: Application-level measures for the scenario in Figure 6.14. (a) Duplicate Rate; (b) Missing Rate.

From Figure 6.15, it is interesting to see that both the missing and duplicate rates decrease significantly. When the penalty factor (β) increases from 0 to 1.5, the percentage reduction is up to 70% in the duplicate rate and 80% in the missing rate, while the throughput only decreases by 11% for the case of lower spatial heterogeneity (σ_B1 = 0.06, σ_B2 = 0.17) and 18% for the case of higher spatial heterogeneity (σ_B1 = 0.06, σ_B2 = 0.29) (Figure 6.14-(a) and (b)). This implies that the real-time QoS (missing and duplicate rates) can be improved without significantly degrading the effective throughput.

Figure 6.16: Adaptive rate control on a multi-path network. (a) Path 1; (b) Path 2.

Figure 6.16 shows how the proposed adaptive scheme works on a two-path network when the upper bound of the end-to-end delay (Tee_max) is set to 0.07 seconds. Note that β on each path varies independently and converges to a different value (about 0.3 on path 1 and about 1.0 on path 2). That is, the proposed scheme is also able to handle this type of multi-path communication.
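As a concrete illustration of the application-level measures defined at the beginning of this subsection, the sketch below computes the missing and duplicate rates from the segment indices received over all paths. The function name, the assumption that the stream consists of segments 0..stream_len−1, and the demo data are illustrative only.

```python
# A minimal sketch of the receiver-side missing-rate and duplicate-rate
# computation for multi-path streaming.
from collections import Counter


def missing_and_duplicate_rates(received_segments, stream_len):
    counts = Counter(received_segments)
    missing = sum(1 for s in range(stream_len) if counts[s] == 0)
    duplicated = sum(1 for s in counts if counts[s] > 1)
    return missing / stream_len, duplicated / stream_len


if __name__ == "__main__":
    # Segments 3 and 7 were never received; segment 5 arrived on both paths.
    rx = [0, 1, 2, 4, 5, 5, 6, 8, 9]
    print(missing_and_duplicate_rates(rx, stream_len=10))   # (0.2, 0.1)
```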
Performance Comparison with TCP

The proposed HAA scheme is compared with TCP on a 2-path network with each path consisting of 4 links. The results are shown in Figure 6.17. As expected, the aggregated throughput for HAA is much higher than that for TCP. However, the end-to-end delay is also larger than that for TCP because more packets tend to be buffered under a higher transmission rate. By adjusting the penalty factor to 1.0, the end-to-end delay can be reduced significantly while the throughput is still maintained at a much higher value than that of TCP.

Figure 6.17: Comparison with the TCP implementation for 2 paths, each consisting of 4 links. Top row: σ_B1 = 0.06, σ_B2 = 0.06; bottom row: σ_B1 = 0.06, σ_B2 = 0.29.

Figure 6.18 compares HAA and TCP when the penalty factor (β) is fixed to 1.0. Two network paths are employed and the length of each path varies from 2 to 12 links. Note that the throughput achieved by TCP drops dramatically (by more than 75%) as the path length increases up to 12 links, while that of HAA decreases by less than 15%.

Figure 6.18: Comparison with the TCP implementation (2 paths, β fixed to 1.0). Top row: σ_B1 = 0.06, σ_B2 = 0.06; bottom row: σ_B1 = 0.06, σ_B2 = 0.29.

The dependency of the missing and duplicate rates of HAA on the path length is shown in Figure 6.19. Both rates increase slightly when the network paths become longer, but they remain at a reasonable level; the duplicate rate is around 2.3% and the missing rate is a little larger than 1.0% when the path length is 12.

Figure 6.19: Application-level measures (2 paths, β fixed to 1.0). (a) Duplicate Rate; (b) Missing Rate.

6.3 Multiple Flows

All the results discussed above assume that the variation of the available bottleneck bandwidth (ABBW) is independent of the HAA source rate control, i.e., the introduction of HAA flows does not affect other background traffic. However, this may not be the case in real networks. In a real network, each flow is part of the background traffic of the other flows, and the dynamics of its source rate control can change the available bandwidth and behavior of other flows and, in turn, change its own available bandwidth. In this section, multiple flows sharing a network path with varying bandwidth are studied. Figure 6.20 illustrates N_f flows sharing a network path consisting of N links. Each flow f_i is connected by a separate pair of sender (S_i) and receiver (R_i). It is assumed that there is no coordination among the flows, i.e., the source rate of each flow is controlled independently by its sender and receiver pair.
The queue management in this study adopts a basic "Drop-Tail Round-Robin" scheme to ensure that all flows on the path have the same level of priority.

Figure 6.20: Topology of the testing network: multiple flows sharing a network path.

As a comparison reference, the bandwidth utilization of multiple TCP flows is first analyzed. Next, the performance of multiple HAA flows (with continuous and periodic On-Off traffic) in terms of aggregated throughput, average end-to-end delay, dropping rate, and fairness is investigated and compared with that of multiple TCP flows. Finally, TCP friendliness is verified to see whether HAA flows can share the network bandwidth with TCP flows in a friendly manner.

6.3.1 Multiple TCP Flows

The aggregated throughput of multiple TCP flows is first investigated. According to the approximation model of TCP throughput from [48],

$$ TH^{TCP}_{1}(p) = \min\left( \frac{W_{max}}{RTT},\ \frac{1}{RTT\sqrt{\frac{2bp}{3}} + T_0\,\min\left(1,\ 3\sqrt{\frac{3bp}{8}}\right) p\,(1+32p^2)} \right) \qquad (6.1) $$

where W_max is the maximum window size, RTT is the round-trip time, T_0 is the timeout value, b is the number of packets of transmitted data that are acknowledged by one ACK from the receiver, and p is the loss ratio. It is not difficult to see that a single TCP flow's throughput is no more than

$$ TH^{TCP}_{1} \le \frac{W_{max}}{RTT} \qquad (6.2) $$

and when there are N_f TCP flows, the aggregated throughput has an upper bound of

$$ TH^{TCP}_{N} \le \sum_{i=1}^{N_f} \frac{W_{max_i}}{RTT_i} \qquad (6.3) $$

From Equation 6.3, it can be seen that, when the maximum window size and N_f are fixed, a longer path length results in a larger RTT and subsequently decreases the maximum possible throughput. This is verified by the simulation, and the results are shown in Figure 6.21. Both the measured utilization (solid curves) and the estimated maximum utilization (dotted curves) are plotted, which are defined to be

$$ \text{Measured Utilization} = \frac{TH^{TCP}_{N}}{C} \qquad (6.4) $$

$$ \text{Estimated Maximum Utilization} = \min\left(1,\ \frac{N_f \cdot W_{max}}{C \cdot RTT}\right) \qquad (6.5) $$

where C is the capacity of the bottleneck link, set to 2 Mbps in this simulation. It can be seen from Figure 6.21 that the utilization becomes higher as more TCP flows are introduced, especially when the path is short. For example, on a path consisting of fewer than 9 links with no variation in ABBW, four TCP flows can achieve almost 100% utilization in the given scenario (Figure 6.21-(a)). However, as analyzed earlier, the bandwidth utilization drops significantly as the path length (number of links) increases, and it drops even more when there is a larger available bandwidth variation (σ_B).

Figure 6.21: Aggregated bandwidth utilization of multiple TCP flows, W_max=15. (a) σ_Bi=0.00; (b) σ_Bi=0.29. (Note that some measured utilization curves overlap the estimated maximum ones.)
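The window-limited bounds of Equations 6.2 through 6.5 can be evaluated directly, as in the sketch below. The packet size (1000 bytes) and capacity (2 Mbps) follow the simulation setup, while the RTT model in the demo (propagation delay only) and the function names are simplifying assumptions.

```python
# A minimal sketch of the window-limited TCP throughput bound and the
# estimated maximum utilization (Equations 6.2-6.5).

def tcp_throughput_bound(wmax_pkts, rtt_s, packet_size_bits=8000):
    """Window-limited upper bound on a single TCP flow's throughput (bps)."""
    return wmax_pkts * packet_size_bits / rtt_s


def estimated_max_utilization(n_flows, wmax_pkts, rtt_s, capacity_bps,
                              packet_size_bits=8000):
    """Equation 6.5: aggregated utilization bound for n_flows identical flows."""
    return min(1.0, n_flows * tcp_throughput_bound(wmax_pkts, rtt_s,
                                                   packet_size_bits) / capacity_bps)


if __name__ == "__main__":
    # 1000-byte packets, Wmax = 15, bottleneck capacity 2 Mbps; the RTT here is
    # a rough two-way propagation estimate for a path of 10 ms inter-router links.
    for n_links in (4, 8, 12):
        rtt = 2 * (0.003 + n_links * 0.010)
        for n_flows in (1, 2, 4):
            u = estimated_max_utilization(n_flows, 15, rtt, 2e6)
            print(f"{n_links:2d} links, {n_flows} flow(s): <= {u:5.1%}")
```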
To further illustrate how the TCP bandwidth utilization is affected by the number of flows and the path length, trace plots of packet arrival times for multiple (N_f=1, 2, and 4) TCP flows are provided in Figure 6.22. The simulation is conducted on a network path with two different numbers of links (N=4 and N=12). The result shows how the effective aggregated throughput varies as the path length increases. The FlowID (from 1 to N_f) of each packet is plotted as a function of time, and each mark indicates a packet arrival event for the corresponding FlowID. The duration of each transition on FlowID represents the time when no packet is being received, i.e., the idle time of the shared path. It can easily be seen that multiple TCP flows are received by the receiver in a bursty and multiplexed manner. When there is only one TCP flow, the "duty cycle" is about 60% for a short path consisting of 4 links. A larger number of flows increases the duty cycle significantly; for example, four TCP flows can achieve almost full utilization on the 4-link path. However, it can also be clearly seen that the duty cycle (hence the bandwidth utilization) shrinks as the path length increases. For example, when there is only one TCP flow, the duty cycle is less than 20% on a 12-link path.

Figure 6.22: Arrival times of multi-flow TCP packets. (a) 1 TCP flow, 4 links; (b) 1 TCP flow, 12 links; (c) 2 TCP flows, 4 links; (d) 2 TCP flows, 12 links; (e) 4 TCP flows, 4 links; (f) 4 TCP flows, 12 links.

6.3.2 Multiple HAA Flows

6.3.2.1 Overall Performance

Figure 6.23: Multiple HAA flows sharing a network path consisting of 4 links; all links have the same b_i=2 Mbps and σ_Bi=0.17. (a) Aggregated Throughput; (b) End-to-End Delay; (c) Dropping Rate.

Figure 6.23 shows the simulation results of N_f HAA flows sharing a varying available bandwidth on a network path consisting of 4 links. Each flow is sent continuously at the allocated source rate. It can be seen from Figure 6.23 that as the total number of HAA flows (N_f) increases, the aggregated throughput does not change very much (when β is fixed). This result shows that multiple HAA flows can still maintain the overall bandwidth utilization at almost the same level as that for a single HAA flow. It should also be noted that, when more flows are introduced, the dropping rate increases due to the interaction among multiple HAA flows and the limited buffer space available. By adjusting the penalty factor (β) to a larger value, the dropping rate can be reduced significantly. For example, when β is set to 1.5, the dropping rate is reduced to about 1.1%, while the aggregated throughput is degraded by no more than 12%.

Figure 6.24: Aggregated bandwidth utilization of multiple HAA flows. (a) σ_Bi=0.00; (b) σ_Bi=0.29.

Figure 6.24 shows the dependency of the bandwidth utilization of multiple HAA flows on the path length and the number of flows. When there is no variation in ABBW (σ_Bi=0.00), multiple HAA flows can achieve almost 100% utilization, and the utilization is almost independent of the path length. In the case with a larger bandwidth variation (σ_Bi=0.29), the utilization decreases as the path length increases, but the effect of the path length is much less significant than for TCP. The comparison results will be shown in Section 6.3.3.
6.3.2.2 Dynamic Performance

To see whether the HAA source rate control scheme produces any "exclusive flow" which permanently uses up all available bandwidth and refuses the injection of other flows, a scenario with On-Off flow patterns ("scenario On-Off") is designed. Instead of sending out packets continuously ("scenario Continuous"), each flow is toggled with a period of T_switch seconds. The duty cycle is set to (2N_f − 1)/(2N_f), i.e., in each period the flow is turned off for T_off = T_switch/(2N_f) seconds, and turned back on for T_on = T_switch(2N_f − 1)/(2N_f) seconds. The switching times of the flows are evenly spaced by T_switch/N_f so that at any time there is at least one flow active (unless N_f=1). A case with two On-Off flows is illustrated in Figure 6.25.

Figure 6.25: Two On-Off flows with T_switch=6.

When N_f > 1, the ideal aggregated throughput should be close to that in the scenario Continuous, because every time one flow is turned off, the other active (enabled) flows are supposed to take over the bandwidth released by the closed flow. Multiple On-Off flows are considered on a shared path consisting of 4 links in order to study the effects of bandwidth heterogeneity on the aggregated throughput.

Figure 6.26: Dynamic performance of multiple On-Off flows. N=4, b_i=2 Mbps for all links, and β=1.0 for all flows. (a) Aggregated throughput; (b) Fairness among HAA flows.

The aggregated throughputs of 2 and 4 On-Off HAA flows are compared to the throughput of a single continuous flow for the same background traffic setup, and the results are shown in Figure 6.26-(a). It can be seen that the aggregated throughput of the scenario On-Off is close to the throughput of the scenario Continuous, and even slightly better when there is a higher bandwidth heterogeneity (σ_Bi > 0.1). This indicates that the active flows can take over, in a timely manner, the bandwidth released by the inactive flows. The normalized standard deviation of throughput among the HAA flows in the scenario On-Off is plotted as an index of fairness in Figure 6.26-(b). It is seen that the throughput variation among the flows is very small (less than 2 percent), which means that there is no "irresponsible" flow.
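For reference, the staggered On-Off pattern of this subsection can be generated as in the sketch below (illustrative function name). Whether a flow starts its period in the off state or the on state is an assumption here, since the text only specifies the durations and the spacing of the switching times.

```python
# A minimal sketch of the staggered On-Off schedule: each of the Nf flows is
# off for T_switch/(2*Nf) per period, and the flows' switching times are
# offset by T_switch/Nf so that at least one flow is always active (Nf > 1).

def is_flow_on(flow_id, t, n_flows, t_switch):
    """True if flow `flow_id` (0-based) is enabled at simulation time t."""
    t_off = t_switch / (2.0 * n_flows)
    offset = flow_id * t_switch / n_flows          # evenly spaced switching times
    phase = (t - offset) % t_switch
    return phase >= t_off                          # off during the first t_off of each period


if __name__ == "__main__":
    n_flows, t_switch = 2, 6.0
    for t in [x * 0.5 for x in range(13)]:         # 0.0 .. 6.0 s
        states = ["on" if is_flow_on(i, t, n_flows, t_switch) else "off"
                  for i in range(n_flows)]
        print(f"t={t:4.1f}s  flow1={states[0]:>3}  flow2={states[1]:>3}")
```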
6.3.3 Comparison with Multiple TCP Flows

6.3.3.1 Aggregated Bandwidth Utilization

The aggregated bandwidth utilization of multiple HAA flows is compared with that of multiple TCP flows. The improvement in bandwidth utilization is defined to be

$$ \text{Improvement (\%)} = \frac{Ut_{HAA} - Ut_{TCP}}{Ut_{TCP}} \qquad (6.6) $$

where Ut_HAA and Ut_TCP are the bandwidth utilizations of N_f HAA flows and N_f TCP flows, respectively. The penalty factor (β) of the HAA flows is set to 1.0, and the maximum window size (W_max) of TCP is set to 15. The results are shown in Figure 6.27.

Figure 6.27: Improvement in aggregated bandwidth utilization. (a) σ_Bi=0.00; (b) σ_Bi=0.29.

The improvement in utilization becomes larger as the path length increases. Also, the improvement is more significant when the network links have a smaller bandwidth variation (σ_Bi=0.00). However, the improvement becomes smaller as the number of flows increases. For example, on the path of 4 links, the improvement by HAA over TCP is above 40% when there is only 1 flow, but there is almost no improvement when there are 4 flows.

6.3.3.2 Inter-flow Fairness

Figure 6.28: Fairness comparison with TCP. (a) σ_Bi=0.0; (b) σ_Bi=0.29.

Figure 6.28 compares the throughput variation σ_thr_i, an index of inter-flow fairness, among multiple flows in both the homogeneous (σ_Bi=0.0) and heterogeneous (σ_Bi=0.29) cases. It can be seen that in both cases the throughput variation among HAA flows is much smaller than that among TCP flows, i.e., the inter-flow fairness of HAA is better. Also, σ_thr_i is heavily dependent on the path length (or round-trip time) for TCP, but it is almost independent of the path length for HAA.

Figure 6.29: Multiple HAA and TCP flows. (a) 2 flows; (b) 4 flows; (c) 8 flows; (d) Throughput ratio (HAA/TCP).

6.3.4 TCP Friendliness

Figure 6.29 shows the simulation results for multiple HAA flows sharing the available bandwidth with TCP flows on a network path of 4 links. In Figure 6.29-(a), (b), and (c), three sets of aggregated throughput are compared: (1) m HAA flows and m TCP flows, (2) 2m HAA flows, and (3) 2m TCP flows, where m=1, 2, or 4. It can be seen that the throughput achieved by one TCP flow and one HAA flow together is slightly lower than that by two TCP flows when β is large (greater than 0.4), and is quite close to that by two HAA flows. However, when the number of flows is larger (4 or 8 flows), the mixed flows (HAA and TCP) achieve the highest throughput. To prevent HAA flows from aggressively taking too large a share of the available bandwidth, β needs to be properly adjusted. Figure 6.29-(d) shows how to ensure that TCP flows get a "fairer" share by adjusting β. It can be seen that setting β to 1 usually achieves a fairly good bandwidth sharing between TCP and HAA flows, with a throughput ratio between 1.0 and 1.5.

Chapter 7

Conclusion and Future Work

In this dissertation, two types of heterogeneity are defined for heterogeneous computing and communication systems. Temporal heterogeneity (TH) refers to variation in the computing power or communication bandwidth available for a task along the time dimension on an individual computer or channel, while spatial heterogeneity (SH) refers to variation of the available computing power or communication bandwidth among computers.

Effects of TH and SH on computing and communication systems are first analyzed to show the possibility of improving task performance by taking the heterogeneity information into account. In addition to the first-order moment (mean), the second-order moment (deviation) of resource availability is quantified and explicitly utilized to improve the performance of computing and communication tasks in the heterogeneity-aware approaches (HAA).

7.1 Computing Systems

The effects of heterogeneity are first studied in terms of the average parallel execution time and the standard deviation of parallel execution time on heterogeneous computing systems (Chapter 2).
The results revealed that both TH and SH in computing power can have a significant effect on parallel computing performance. To achieve the minimum average parallel execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the average computing powers of the computers. If the computational load is perfectly balanced among computers in terms of the average computing power available, the average parallel execution time increases significantly as one or both types of heterogeneity (TH and SH) increase, while the average sequential execution time on a single computer is not affected by TH. As SH increases, the average parallel execution time increases more when the average TH (among computers) increases than when it remains fixed. As the number of computers employed for a target task increases, the average parallel execution time decreases initially and then may increase, especially when SH increases fast. More frequent synchronization amplifies the effects of heterogeneity and degrades the performance of a target task more for a higher TH and/or SH. Performance degradation due to heterogeneity becomes even larger when the interval is longer (coarser granularity). The risk factor and the variation of parallel execution time are quickly reduced as the fraction of a target task assigned to computers with smaller TH increases.

A two-step heuristic approach to partitioning a target task for minimizing its average parallel execution time is described with a theoretical model (Chapter 3). The proposed approach achieves a shorter average parallel execution time by assigning a fraction of the target task, which is larger than the fraction determined proportionally to the average computing power, to a computer with a smaller TH or a higher average computing power. The reduction achieved by the proposed approach is greater when (i) the variation of the average computing power among computers is larger; (ii) the variation of the standard deviation of computing power among computers is larger; or (iii) the number of computers employed is greater.

7.2 Communication Systems

The effects of TH on QoS measures such as throughput, end-to-end delay, and transfer time on a heterogeneous network are studied in Chapter 4. The effects become even larger when there also exists SH among links or paths. For the two applications, path selection and multi-path data transfer, it has been demonstrated that significant improvements in the end-to-end delay and transfer time are possible by considering the standard deviation as well as the mean of the available bandwidth on each link.

A heterogeneity-aware approach (HAA) to improving the performance of communication tasks by utilizing the information on the variation of available communication resources is proposed (Chapter 5). The proposed HAA source rate control scheme dynamically controls the source rate such that the rate is "penalized" more for a higher level of heterogeneity. The performance of HAA source rate control has been analyzed in detail through simulation, and the results may be summarized as follows:

- HAA can significantly increase the effective throughput and decrease the packet dropping rate, compared to typical network-measurement-based approaches (the mean-only and minimum-bandwidth schemes) which do not utilize the heterogeneity information.
- Compared to TCP, the bandwidth utilization of HAA is much less sensitive to the path length (or round-trip time), and the aggregated throughput is even better than that of TCP when the path length is long. The inter-flow fairness of HAA is shown to be better. Also, HAA has much less variation in the inter-packet arrival time, i.e., it can produce a much more stable stream with less jitter than TCP. This is especially desirable in jitter-sensitive media streaming applications.

- The results from the multi-path media streaming application show that HAA can improve the real-time performance measures (missing and duplicate rates) without degrading the effective throughput substantially.

- It has been verified that the proposed scheme can be properly tuned to achieve fair sharing of bandwidth with other flows (HAA or TCP) on a network path and to produce "responsible" traffic.

The proposed HAA source rate control scheme is very promising in various applications, especially where real-time QoS is concerned. By quantifying TH and SH in communication resources, it can produce a more comprehensive picture of resource variation and distribution, and provide direct control of data transmission in a more predictable manner.

7.3 Future Work

Further improvements can be made to each component of the proposed schemes. For example, 1) more sophisticated resource measurement techniques can be designed to improve the accuracy of measurements; 2) the features (heterogeneity) of other network resources, such as the packet buffer size, can be utilized to further improve the performance; 3) the source rate function can be customized for different applications or services. Also, applications to the following areas deserve investigation.

7.3.1 Data-Intensive Computing Systems

So far, the HAA has been considered for computing and communication individually. However, in many applications, such as data-intensive high-performance computing, both computing and communication are to be considered. It would be worthwhile to investigate the interaction between them in applying the HAA to such tasks.

7.3.2 Application to Overlay Networks

Overlay networks have received a great deal of attention recently, and they can provide another framework for applying HAA without worrying about technical details of the lower layers of the network (physical, data link layers, etc.). Resilient Overlay Network (RON) [49][50], Multicast Backbone (M-Bone) [51], and X-Bone [17][18][19] are three well-known overlay frameworks. Specifically, HAA can be applied to the path selection and dynamic routing problems of an overlay network.

Bibliography

[1] R. Buyya. High Performance Cluster Computing. 1999.

[2] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.

[3] R. F. Freund and H. J. Siegel. Heterogeneous Processing. Computer, pages 13-17, June 1993.

[4] M. Maheswaran and H. Siegel. A Dynamic Matching and Scheduling Algorithm for Heterogeneous Computing Systems. In Proceedings of Heterogeneous Computing, pages 57-69, 1998.

[5] T. Thanalapati and S. Dandamudi. An Efficient Adaptive Scheduling Scheme for Distributed Memory Multicomputers. IEEE Transactions on Parallel and Distributed Systems, pages 758-768, July 2001.

[6] Y. Zhang, A. Sivasubramaniam, J. Moreira, and H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. IEEE Transactions on Parallel and Distributed Systems, pages 967-985, September 2001.

[7] S. Floyd and V. Paxson. Difficulties in Simulating the Internet. IEEE/ACM Transactions on Networking (TON), August 2001.

[8] S. Ali, H. Siegel, M. Maheswaran, D. Hensgen, and S. Ali. Task Execution Time Modeling for Heterogeneous Computing Systems.
In Proceedings of the 9th Heterogeneous Computing Workshop, pages 185-199, May 2000.

[9] R. Armstrong. Investigation of Effect of Different Run-Time Distributions on SmartNet Performance. Master's thesis, Department of Computer Science, Naval Postgraduate School, 1997.

[10] J. Al-Jaroodi, N. Mohamed, H. Jiang, and D. Swanson. Modeling Parallel Applications Performance on Heterogeneous Systems. In Proceedings of IPDPS 2003, Workshop on Advances in Parallel and Distributed Computational Models, Nice, France, April 2003.

[11] S. M. Figueira and F. Berman. A Slowdown Model for Applications Executing on Time-Shared Clusters of Workstations. IEEE Transactions on Parallel and Distributed Systems, pages 653-670, June 2001.

[12] C.-Z. Xu, L. Y. Wang, and N.-T. Fong. Stochastic Prediction of Execution Time for Dynamic Bulk Synchronous Computations. In Proceedings of the International Parallel and Distributed Processing Symposium, San Francisco, April 2001.

[13] X. Zhang and Y. Yan. Modeling and Characterizing Parallel Computing Performance on Heterogeneous Networks of Workstations. In Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, pages 25-34, October 1995.

[14] A. Dogan and F. Ozguner. Trading off Execution Time for Reliability in Scheduling Precedence-Constrained Tasks in Heterogeneous Computing. In Proceedings of the International Parallel and Distributed Processing Symposium, San Francisco, April 2001.

[15] J. Schopf and F. Berman. Stochastic Scheduling. CS Dept. Technical Report CS-99-03, University of California, San Diego, 1999.

[16] D. Anderson, H. Balakrishnan, F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, October 2001.

[17] J. Touch. Dynamic Internet Overlay Deployment and Management Using the X-Bone. Computer Networks, pages 117-135, July 2001.

[18] J. Touch and S. Hotz. The X-Bone. In Proceedings of the Third Global Internet Mini-Conference at Globecom, Sydney, Australia, July 1998.

[19] J. Touch. Those Pesky NATs. IEEE Internet Computing, page 96, 2002.

[20] J. Kim and D. Lilja. Multiple Path Communication. High Performance Cluster Computing: Architectures and Systems, pages 364-382, 1999.

[21] T. Nguyen and A. Zakhor. Path Diversity with Forward Error Correction System for Packet Switched Networks. In Proceedings of INFOCOM 2003, San Francisco, April 2003.

[22] J. Huang and S.-Y. Lee. Effects of Spatial and Temporal Heterogeneity on Performance of a Target Task in Heterogeneous Computing Environments. In Proceedings of the ISCA 15th International Conference on Parallel and Distributed Computing Systems, pages 301-306, September 2002.

[23] S.-Y. Lee and J. Huang. A Theoretical Approach to Load Balancing of a Target Task in Temporally and Spatially Heterogeneous Grid Computing Environment. In Proceedings of the 3rd International Workshop on Grid Computing, pages 70-81, November 2002.

[24] M. Andrews, S. Borst, F. Dominique, P. Jelenkovic, K. Kumaran, K. G. Ramakrishnan, and P. Whiting. Dynamic bandwidth allocation algorithms for high-speed data wireless networks. Bell Labs Technical Journal, July-September 1998.

[25] X. Liu, E. Chong, and N. Shroff. Transmission Scheduling for Efficient Wireless Utilization. In Proceedings of IEEE INFOCOM, Alaska, pages 776-785, April 2001.

[26] T. Nandagopal, T. Kim, X. Gao, and V. Bharghavan. Achieving MAC Layer Fairness in Wireless Packet Networks. In Proceedings of ACM Mobicom, Boston, MA, August 2000.

[27] S. Lee and M. Gerla.
Dynamic Load-Aware Routing in Ad Hoc Networks. In Proceedings of ICC 2001, Helsinki, Finland, June 2001.

[28] J.-C. Bolot. End-to-end packet delay and loss behavior in the Internet. In Proceedings of the ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 289-298, September 1993.

[29] S. Keshav. A Control-Theoretic Approach to Flow Control. In Proceedings of ACM SIGCOMM, September 1991.

[30] S. Keshav. Packet Pair Flow Control. IEEE/ACM Transactions on Networking, February 1995.

[31] R. L. Carter and M. E. Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Performance Evaluation, TR-96-006, Boston University Computer Science Department, March 1996.

[32] K. Lai and M. Baker. Measuring Bandwidth. In Proceedings of IEEE INFOCOM, March 1999.

[33] K. Lai and M. Baker. Measuring Link Bandwidth Using a Deterministic Model of Packet Delay. In Proceedings of ACM SIGCOMM, Stockholm, Sweden, August 2000.

[34] Transmission Control Protocol Specification. RFC 793, September 1981.

[35] V. Jacobson. Congestion Avoidance and Control. In Proceedings of ACM SIGCOMM, pages 314-329, August 1988.

[36] Y. R. Yang, M. S. Kim, X. Zhang, and S. S. Lam. Two Problems of TCP AIMD Congestion Control. Technical Report TR-00-13, Department of Computer Sciences, The University of Texas at Austin, June 2000.

[37] J. Widmer, R. Denda, and M. Mauve. A Survey on TCP-Friendly Congestion Control. IEEE Network Magazine, Special Issue on Control of Best Effort Traffic, pages 28-37, May/June 2001.

[38] D. Sisalem and A. Wolisz. LDA+ TCP-friendly adaptation: A measurement and comparison study. In 10th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2000), pages 1619-1622, June 2000.

[39] L. Zhang, S. Shenker, and D. Clark. Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic. In Proceedings of the ACM SIGCOMM 91 Conference on Communications Architectures and Protocols, pages 133-147, 1991.

[40] A. Aggarwal, S. Savage, and T. Anderson. Understanding the Performance of TCP Pacing. In Proceedings of the IEEE INFOCOM Conference on Computer Communications, pages 1157-1165, 2000.

[41] A. C. Feng and A. C. Kapadia. Packet Spacing: An Enabling Mechanism for Delivering Multimedia Content in Computational Grids. The Journal of Supercomputing, pages 51-66, 2002.

[42] A. Kapadia, A. Feng, and W. Feng. The Effects of Inter-Packet Spacing on the Delivery of Multimedia Content. In Proceedings of the 21st IEEE International Conference on Distributed Computing Systems, 2001.

[43] H. David. Order Statistics. John Wiley and Sons Inc, 1970.

[44] K. Fall and K. Varadhan. The ns Manual (formerly ns Notes and Documentation). Web resource available at http://www.isi.edu/nsnam/ns/ns-documentation.html.

[45] E. Gafni and D. Bertsekas. Dynamic Control of Session Input Rates in Communication Networks. IEEE Trans. on Automatic Control, pages 1009-1016, January 1984.

[46] A. G. Fraser. Designing a Public Data Network. IEEE Communications Magazine, pages 31-35, October 1991.

[47] T. Nguyen and A. Zakhor. Distributed Video Streaming over Internet. In Proceedings of SPIE Multimedia Computing and Networking, pages 186-195, San Jose, California, January 2002.

[48] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: a simple model and its empirical validation. In Proceedings of ACM SIGCOMM, September 1998.
[49] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proceedings of the 18th ACM SOSP, Banff, Canada, October 2001.

[50] D. G. Andersen. Resilient Overlay Networks. Master's thesis, Massachusetts Institute of Technology, May 2001.

[51] M. R. Macedonia and D. P. Brutzman. MBone Provides Audio and Video Across the Internet. IEEE Computer, pages 30-36, April 1994.