Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Except where reference is made to the work of others, the work described in this dissertation is my own or was done in collaboration with my advisory committee. This dissertation does not include proprietary or classified information.

Jun Huang

Certificate of Approval:

Victor P. Nelson, Professor, Electrical and Computer Engineering
Soo-Young Lee (Chair), Professor, Electrical and Computer Engineering
Alvin S. Lim, Associate Professor, Electrical and Computer Engineering
Stephen L. McFarland, Dean, Graduate School

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang

A Dissertation Submitted to the Graduate Faculty of Auburn University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 16, 2005

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang

Permission is granted to Auburn University to make copies of this dissertation at its discretion, upon the request of individuals or institutions and at their expense. The author reserves all publication rights.

Signature of Author

Date of Graduation

Dissertation Abstract

Heterogeneity-Aware Approaches to Optimizing Performance of Computing and Communication Tasks

Jun Huang
Doctor of Philosophy, December 16, 2005
(M.S., University of Science and Technology of China, 2000)
(B.S., University of Science and Technology of China, 1997)
132 Typed Pages
Directed by Soo-Young Lee

As the domain of computing and communication systems grows, heterogeneity among the computers and subnetworks employed for a task also increases. It is important to understand how heterogeneity affects the performance of computing and communication tasks in order to utilize heterogeneous resources optimally. However, previous research in this field either did not take the effects of heterogeneity into account explicitly or did not analyze them thoroughly. In this dissertation, the effects of heterogeneity are analyzed, and heterogeneity-aware approaches are proposed for both computing and communication systems.

In the computing system context, temporal heterogeneity refers to the variation, along the time dimension, of the computing power available for a task on a computer, and spatial heterogeneity represents the variation among computers. The effects of heterogeneity on the performance of a target task have been analyzed in terms of the mean and standard deviation of the parallel execution time. The results reveal that, in minimizing the average parallel execution time of a target task on a spatially heterogeneous computing system, it is not optimal to distribute the target task linearly proportional to the average computing powers available on the computers. Based on the analysis results, an approach to load balancing for minimizing the average parallel execution time of a target task is described. The proposed approach, whose validity has been verified through simulation, considers temporal and spatial heterogeneity in addition to the average computing power of each computer.

In the communication system context, the concept of temporal and spatial heterogeneity in the available communication resources is applicable to various levels of a communication network. The effects of heterogeneity on the performance of individual messages in a heterogeneous communication system are analyzed.
A heterogeneity-aware approach to source rate control is proposed, which utilizes the heterogeneity information on the available bandwidth in a UDP-based protocol to improve throughput, dropping rate, end-to-end delay, and real-time performance. The two main components of the heterogeneity-aware source rate control scheme are a receiver-side feature extractor and a sender-side adaptive rate controller. The feature extractor captures the dynamic features of the bandwidth heterogeneity, and the source rate controller utilizes the extracted features in the rate control. The performance of the proposed source rate control scheme has been analyzed in detail through extensive simulation for single- and multiple-path media streaming, and for multiple HAA and/or TCP flows.

Acknowledgments

First and foremost, I want to express my thanks and gratitude to my advisor, Dr. Soo-Young Lee, for his guidance, encouragement, and support in every step of my doctoral program at Auburn University. His comprehensive knowledge of and insight into the research were of great help to the completion of this dissertation. His understanding and patience are sincerely appreciated, and his teaching on academic research and many other aspects will be an everlasting treasure in my life.

My thanks also go to the members of my dissertation committee, Dr. Victor Nelson, Dr. Alvin Lim, and my outside reader, Dr. Kai Chang. In spite of their very busy schedules, they read the drafts of this dissertation and provided many valuable comments. This dissertation owes a great deal to the time and attention they have given to it.

Last, but not least, I want to thank my family for their understanding and love along the path of my academic pursuits, and many of my friends for their generous help while I was in Auburn. Without their continuous support and encouragement, it would never have been possible to finish this dissertation.

Style manual or journal used: Transactions of the American Mathematical Society (together with the style known as "auphd"). Bibliography follows van Leunen's A Handbook for Scholars.

Computer software used: The document preparation package TeX (specifically LaTeX) together with the departmental style file auphd.sty.

Table of Contents

List of Figures
List of Tables

1 Introduction
  1.1 Problem Formulation
    1.1.1 Computing Systems
    1.1.2 Communication Systems
  1.2 Related Work
    1.2.1 Heterogeneous Computing Systems
    1.2.2 Heterogeneous Communication Systems
    1.2.3 Network Measurement
    1.2.4 Source Rate Control
  1.3 Motivation
  1.4 Research Objectives
    1.4.1 Computing Systems
    1.4.2 Communication Systems

2 Effects of Heterogeneity on a Target Computing Task
  2.1 System Model
    2.1.1 Availability
    2.1.2 Heterogeneity
  2.2 Task Model
  2.3 Performance Measures
    2.3.1 Mean and Standard Deviation of Execution Time
    2.3.2 Spatially Homogeneous Environment
    2.3.3 Synchronization
    2.3.4 Stability of Performance
    2.3.5 Temporal and Spatial Heterogeneity
      2.3.5.1 Temporal Heterogeneity
      2.3.5.2 Spatial Heterogeneity
  2.4 Simulation Results and Discussion
    2.4.1 Simulation Setup
    2.4.2 Effects of Heterogeneity on a Target Task
      2.4.2.1 Spatially Homogeneous and Temporally Heterogeneous
      2.4.2.2 Spatially and Temporally Heterogeneous
      2.4.2.3 Synchronization
      2.4.2.4 Granularity of Availability
      2.4.2.5 Communication Overhead
      2.4.2.6 Load Distribution and Stability (Risk Factor)

3 Load Balancing on a Heterogeneous Computing System
  3.1 Two-Step Heuristic Approach
    3.1.1 When $s_i = s_j$ for all $i, j$
    3.1.2 When $s_i \ne s_j$ for some $i, j$
  3.2 Simulation Results and Discussion

4 Effects of Heterogeneity on an Individual Network Traffic
  4.1 Network Model
    4.1.1 Topology
    4.1.2 Performance Measures
  4.2 Effects of Heterogeneity
    4.2.1 Feedback Latency
    4.2.2 Temporally Heterogeneous and Spatially Homogeneous
      4.2.2.1 Serial Path
      4.2.2.2 Parallel Path
    4.2.3 Temporally and Spatially Heterogeneous
      4.2.3.1 Serial Path
      4.2.3.2 Parallel Path
  4.3 Applications
    4.3.1 Path Selection
    4.3.2 Multi-Path Data Transfer

5 Source Rate Control for Heterogeneous Communication Systems
  5.1 Temporal Heterogeneity under Different Time Scales
  5.2 HAA Source Rate Control
    5.2.1 Overview of the HAA Source Rate Control Scheme
    5.2.2 Bottleneck Bandwidth Measurement
    5.2.3 Feature Extractor
    5.2.4 Simple Source Rate Allocation Approaches
    5.2.5 Heterogeneity Aware Approach

6 Performance of the Source Rate Control Scheme
  6.1 Simulation Setup
  6.2 Single Flow
    6.2.1 Adaptation to Long-Term Bandwidth Variation
    6.2.2 Feature Extractor Size and Average Bandwidth Updating Interval
    6.2.3 HAA Compared with Packet Spacing Protocol (PSP)
    6.2.4 HAA versus Mean-Only and Minimum-Bandwidth Schemes
    6.2.5 Effects of Spatial Heterogeneity
    6.2.6 Packet Inter-arrival Time
    6.2.7 Adaptive Penalty
    6.2.8 Path Selection
    6.2.9 Multi-Path Media Streaming
  6.3 Multiple Flows
    6.3.1 Multiple TCP Flows
    6.3.2 Multiple HAA Flows
      6.3.2.1 Overall Performance
      6.3.2.2 Dynamic Performance
    6.3.3 Comparison with Multiple TCP Flows
      6.3.3.1 Aggregated Bandwidth Utilization
      6.3.3.2 Inter-flow Fairness
    6.3.4 TCP Friendliness

7 Conclusion and Future Work
  7.1 Computing Systems
  7.2 Communication Systems
  7.3 Future Work
    7.3.1 Data-Intensive Computing Systems
    7.3.2 Application to Overlay Networks

Bibliography

List of Figures

2.1 Temporal and spatial heterogeneity. $A_i(t)$ is the availability of computing power on $C_i$ for a target task, where $i = 1, \ldots, N$.
2.2 (a) Average execution time and (b) normalized standard deviation of execution time on a single computer where $X = 100$.
2.3 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time on $N$ computers where $X_i = 100$ for $i = 1, \ldots, N$.
2.4 (a) Average execution time and (b) normalized standard deviation of execution time when $SH$ and $TH$ are varied with $N$ fixed at 8, where $X_i = 100$ for $i = 1, \ldots, 8$.
2.5 (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ remains fixed.
2.6 (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ increases.
2.7 Speed-up when (a) $\bar{\sigma}_A^{mean}$ remains fixed and (b) $\bar{\sigma}_A^{mean}$ increases as $N$ increases. $X_i = X/N$ for $i = 1, \ldots, N$. $X = 100$.
2.8 (a) Average execution time and (b) normalized standard deviation of execution time as functions of $N_s$. $X = 1000$, $N = 10$, and $X_i = X/N = 100$.
2.9 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of $SH$ and $N_s$. $X = 1000$, $N = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = 0.28$.
2.10 (a) Average execution time and (b) normalized standard deviation of execution time as functions of the interval. $X = 1000$, $N = 10$, and $X_i = X/N = 100$.
2.11 (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of $SH$ and the interval. $X = 1000$, $N = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = 0.28$.
2.12 (a) Average execution time (computation and communication times) and (b) normalized standard deviation of execution time as functions of $SH_{comm}$ and $SH_{comp}$. $N = 20$, $X = 2000$, $N_s = 10$, $X_i = X/N = 100$, and $\bar{\sigma}_A^{mean} = \bar{\sigma}_B^{mean} = 0.28$.
2.13 Average execution time: (a) when communication overhead is independent of $N$ and $X = 400$, and (b) when communication overhead increases linearly proportionally to $N$ and $X = 2000$. $N_s = 10$ and $\bar{\sigma}_A^{mean} = \bar{\sigma}_B^{mean} = 0.28$.
2.14 (a) Average execution time and (b) normalized standard deviation of execution time on two computers where $X_1 = \frac{X}{2} - dX$, $X_2 = \frac{X}{2} + dX$, and $X = 100$.
2.15 Risk factor on two computers where $X_1 = \frac{X}{2} - dX$, $X_2 = \frac{X}{2} + dX$, and $X = 200$. $\tau_d = 221$. $\Delta\bar{\sigma}_A = \bar{\sigma}_{A_2} - \bar{\sigma}_{A_1}$.
2.16 (a) Average execution time and (b) normalized standard deviation of execution time on 10 computers. $X = 100$ and $X_i = X_0 + Slope_{X_i}(i-1)$ for $i = 1, \ldots, 10$.
2.17 Risk factor on 10 computers: (a) $\tau_d = 40$ and (b) $\tau_d = 28$. $X = 100$ and $X_i = X_0 + Slope_{X_i}(i-1)$ for $i = 1, \ldots, 10$.
3.1 Equalization scheme: (a) after Step (1), where a target task is partitioned such that the average execution time is the same on all computers, and (b) after Step (2), where the partitioning is adjusted such that $t'_i + \sigma'_{T_i}$ is equalized for all computers.
3.2 (a) Average parallel execution time and (b) percentage standard deviation of $t'_i + \sigma'_{T_i}$ after Step (2) in the load balancing, when the number of computers, $N$, is 10. After Step (1), $t_i = 100$ and $\Delta\sigma_T = \sigma_{T_N} - \sigma_{T_1}$ where $\sigma_{T_i} = \frac{i-1}{N-1}\sigma_{T_{max}}$, for $i = 1, \ldots, 10$.
3.3 Average parallel execution time on two computers (a) with a fixed $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_2} = 50$ ($\sigma_{T_2} = 10$) and (b) with a fixed $\Delta s = 4$, where $\Delta s = s_2/s_1$ ($s_1 = 1$) and $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_2}$ ($\sigma_{T_2} = 0$).
3.4 Percentage reduction in average parallel execution time as a function of the number of computers (a) with a fixed $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_N} = 60$ ($\sigma_{T_N} = 0$) and (b) with a fixed $\Delta s = 5$, where $\Delta s = s_N/s_1$ ($s_1 = 1$) and $\Delta\sigma_T = \sigma_{T_1} - \sigma_{T_N}$ ($\sigma_{T_N} = 0$). After Step (1), $t_i = 100$ for all $i$.
4.1 Illustration of (a) a serial path and (b) a parallel path where $N_p = 2$.
4.2 Temporal bandwidth heterogeneity on each channel and spatial bandwidth heterogeneity among channels.
4.3 Effects of $T_{feedback}$ on (a) throughput and (b) end-to-end delay. $T_{interval} = [0.01, 5.00]$ s, $T_{update} = 0.01$ s, $N = 4$, $b_i = 1000$ pps, and $\bar{\sigma}_{B_i} = \bar{\sigma}_B$ for all $i$.
4.4 (a) Throughput and (b) end-to-end delay on a temporally heterogeneous serial path of $N$ links. $S = 10^6$ packets. $b_i = 1000$ pps and $\bar{\sigma}_{B_i} = \bar{\sigma}_B$ for $i = 1, \ldots, N$.
4.5 (a) Transfer time and (b) throughput on a temporally heterogeneous parallel path consisting of $N_p$ paths with one link per path. $S_j = 10^5$ packets and $b_j = 1000$ pps for $j = 1, \ldots, N_p$.
4.6 Throughput and end-to-end delay on a temporally and spatially heterogeneous serial path with $b_i = 1000$ pps for $i = 1, \ldots, N$. (a) & (b): $\bar{\sigma}_{B_i} = 0.1$ for $i = 2, \ldots, N$; (c) & (d): $\bar{\sigma}_{B_i} = 0.52$ for $i = 2, \ldots, N$.
4.7 Transfer time on a temporally and spatially heterogeneous parallel path: (a) each path is a single link, and (b) each path consists of 2 links, where $\Delta\bar{\sigma}_B = \bar{\sigma}_{B_1} - \bar{\sigma}_{B_{N_p}}$. $b_j = 1000$ pps and $S_j = 10^5$ packets for $j = 1, \ldots, N_p$. $\bar{\sigma}_{B_j}$ is varied linearly over the $N_p$ paths for $j = 1, \ldots, N_p$ such that its mean is fixed at 0.26.
4.8 (a) A parallel path where the number of links is the same for all paths, and (b) a parallel path where the number of links varies with path. The number pair in brackets over each link represents $b_i$ and $\bar{\sigma}_{B_i}$ of the link.
4.9 Comparison of the three paths in Figure 4.8-(a) in terms of (a) end-to-end delay and (b) transfer time.
4.10 Comparison of the three paths in Figure 4.8-(b) in terms of (a) end-to-end delay and (b) transfer time.
4.11 Reduction in transfer time on a parallel path ($N_p = 2$). $b_j = 1000$ pps for $j = 1, 2$. $S = S_1 + S_2 = 2 \times 10^6$ packets.
4.12 Reduction in transfer time ((a), (b)) and end-to-end delay ((c), (d)) on a parallel path with $N_p = 2$. $N_j = 2$ for $j = 1, 2$. $S = S_1 + S_2 = S'_1 + S'_2 = 2 \times 10^6$ packets, $\bar{\sigma}_{B_1} = 0.06$, $\bar{\sigma}_{B_2} = 0.46$, and $b_1 + b_2 = 2000$ pps.
5.1 Illustration of long-term and short-term temporal heterogeneity of the available bandwidth sensed by one flow (flow0) on a path.
5.2 The abstract architecture of the HAA source rate control scheme.
5.3 Illustration of the proposed HAA communication model.
5.4 Illustration of packet-pair based bandwidth measurement. The horizontal dimension is time, and from top to bottom along the vertical dimension are the sender, intermediate links, and the receiver. Numbers in small boxes are the packet sequence numbers.
5.5 Feature extractor.
5.6 Initialization, iteration, and extraction code for the feature extractor.
5.7 QoS requirements and valid range of $\beta$.
5.8 Pseudo-code of the proposed SDFI scheme.
5.9 Operation of the SDFI scheme.
6.1 Simulation testbench topology for a single path.
6.2 Comparison of three source rate control schemes on a path with an inter-router link whose available bandwidth (ABBW) varies with time. The penalty factor ($\beta$) is set to 0.7 for HAA.
6.3 Effects of feature extractor size ($L$) and $t_{interval}$. $N = 4$, $\bar{\sigma}_{B_i} = 0.17$ for all 4 links.
6.4 Comparison of HAA ($\beta = 0$) with PSP.
6.5 $N = 4$; each link's bandwidth ($B_i(t)$) follows a uniform distribution with deviation $\bar{\sigma}_{B_i}$. "(min)" in the legend stands for the "minimum-bandwidth" scheme.
6.6 Effects of path length. All $\bar{\sigma}_{B_i}$ are set to 0.17.
6.7 $N = 4$, heterogeneous links; all links have the same mean $ABBW_i = 2$ Mbps. Distributions of $\bar{\sigma}_{B_i}$ are as shown in the legends.
6.8 Histograms of inter-arrival times. 128 bins are used.
6.9 Deviation of inter-arrival time.
6.10 Illustration of the proposed adaptive rate control scheme.
6.11 A parallel path where the number of links varies with path. The number pair in brackets over each link represents $b_i$ and $\bar{\sigma}_{B_i}$ of the link.
6.12 Throughput and end-to-end delay when the penalty factor ($\beta$) is set to 1.0.
6.13 Simulation testbench topology for multi-path media streaming.
6.14 2 paths, each with 4 links; all links have the same $b_i = 2$ Mbps but different deviation $\bar{\sigma}_{B_i}$.
6.15 Application-level measures for the scenario in Figure 6.14.
6.16 Adaptive rate control on a multi-path network.
6.17 Comparison with a TCP implementation for 2 paths, each consisting of 4 links.
6.18 Comparison with a TCP implementation (2 paths, $\beta$ fixed to 1.0).
6.19 Application-level measures (2 paths, $\beta$ fixed to 1.0).
6.20 Topology of the testing network: multiple flows sharing a network path.
6.21 Aggregated bandwidth utilization of multiple TCP flows. $W_{max} = 15$. (Note that some measured utilization curves overlap the estimated maximum ones.)
6.22 Arrival time of multi-flow TCP packets.
6.23 Multiple HAA flows sharing a network path consisting of 4 links; all links have the same $b_i = 2$ Mbps and $\bar{\sigma}_{B_i} = 0.17$.
6.24 Aggregated bandwidth utilization of multiple HAA flows.
6.25 Two On-Off flows with $T_{switch} = 6$.
6.26 Dynamic performance of multiple On-Off flows. $N = 4$, $b_i = 2$ Mbps for all links, and $\beta = 1.0$ for all flows.
6.27 Improvement in aggregated bandwidth utilization.
6.28 Fairness comparison with TCP.
6.29 Multiple HAA and TCP flows.

List of Tables

3.1 Possible reduction in $\mu$.

Chapter 1
Introduction

1.1 Problem Formulation

1.1.1 Computing Systems

A cluster of computers is commonly used for high performance computing these days [1][2]. Such a system is often shared by multiple users, and the computers in the system may not be "identical." One of the most distinctive characteristics of such an environment, compared to a dedicated multiprocessor system, is its heterogeneity [3]. Heterogeneity of such shared systems has been addressed from various viewpoints of high performance computing, but much of the research has been devoted to scheduling a stream of tasks (i.e., from the system's viewpoint) [4][5][6]. A measure of available computing power may be used to quantify the CPU share allocated to a given task, which represents the percentage of CPU cycles (or time slots) per unit time that the task can utilize at a given moment. This measure usually varies with both time and space.
Therefore, the heterogeneity in computing systems can be classified into two types: temporal heterogeneity and spatial heterogeneity. The former refers to the variation of available resources with time on a computer, and the latter represents the variation of available resources among computers.

1.1.2 Communication Systems

A communication network, whether wired or wireless, is normally shared by multiple users who may submit their requests at any time. Heterogeneity in communication systems can also be classified into two types: (1) the communication resources available on a channel for a request (or for traffic, messages) vary with time, i.e., a channel is temporally heterogeneous, and (2) different channels may have different characteristics in terms of the communication resources available to them, i.e., the channels in a network are spatially heterogeneous.

The concept of temporal and spatial heterogeneity in the available communication resources is applicable to various levels of a communication network. For example, temporal heterogeneity may be considered within individual sub-networks and spatial heterogeneity among sub-networks. These two types of heterogeneity can have a significant effect on the Quality of Service (QoS) of an individual request, such as delay and transfer time, and also on network-wide performance measures such as utilization, system throughput, and queue length.

It is much more difficult to quantify the available resources in communication systems than in computing systems. There are various sources of heterogeneity on the Internet, and they can be divided into the following three main categories [7]:

- The topology of the Internet is constantly changing, and the types of link (slow modems, fiber optic links, copper or glass wires, radio, infrared, etc.) span a wide range and are generally unknown to an end-user. Also, multi-path and dynamic routing introduce extra complexity.

- Many protocols coexist in the Internet, and each protocol has been implemented by multiple communities. For example, the widely used Transmission Control Protocol (TCP) has many different implementations, e.g., Tahoe, Reno, NewReno, and Sack. Also, different applications ask for different features.

- Background competing traffic varies constantly in an unpredictable way.

Each individual source of heterogeneity is either not directly observable or cannot be measured accurately. For protocol heterogeneity, a session is only able to know its own protocol, and it is almost impossible to have knowledge of other sessions' protocols. The heterogeneity due to background competing traffic is also hard to model in detail. Since it is either impossible or unnecessary to describe the different components of heterogeneity individually, "available bandwidth" is used in this dissertation as a measure to quantify the "net," or aggregated, effect of all the sources of heterogeneity.

1.2 Related Work

1.2.1 Heterogeneous Computing Systems

One of the important issues is how to utilize a heterogeneous computing system so as to optimize the performance of a target task. This issue involves modeling the heterogeneity of the system [8][9], performance estimation [10][11][12][13], reliability [14], scheduling and load balancing [15], etc.

The concepts of "machine heterogeneity" and "task heterogeneity" are employed in an effort to model task execution time on heterogeneous computing systems [8][9].
Machine heterogeneity refers to the degree to which the execution times vary across machines for a given task, and task heterogeneity refers to the degree to which the execution times vary across tasks on a given machine. The main objective was to simulate different heterogeneous computing environments in evaluating the behavior of task mapping heuristics.

A performance estimation model for heterogeneous parallel applications is proposed in [10]. The authors addressed the heterogeneity among the varying platforms and operating environments used, and the possibility of having multiple tasks of the parallel application on each machine.

In [11], a model that calculates the slowdown imposed on applications in time-shared multi-user clusters is defined. Three kinds of "slowdown" are identified, and the effects of slowdown on application performance are verified with emulated loads.

In [12], the issue of predicting the parallel execution time of non-deterministic bulk-synchronous computations, where the computation time on each computer is modeled as a random variable, is considered. A tighter bound on the average parallel execution time has been derived. Though the randomness of computation time does not originate from heterogeneity of the computing system in their model, the results may be applicable to estimating parallel execution time on a heterogeneous computing system with minor modification.

In [13], the issue of modeling and characterizing the parallel computing performance of a heterogeneous Network Of Workstations (NOW) is considered, and the effects of heterogeneity on efficiency and scalability are investigated. The computing power (speed) "weight" is used to quantify the spatial heterogeneity among workstations, but the temporal heterogeneity of computing power on each workstation is not considered.

In [14], the probability of application failure is defined to quantify the reliability of a heterogeneous computing system. Minimizing execution time usually conflicts with improving reliability, and algorithms capable of trading off execution time for reliability are proposed.

In [15], a "stochastic scheduling" strategy is described, in which performance variability is considered in partitioning a data-parallel application to be executed on a heterogeneous computing platform. The mean and standard deviation of the predicted completion time of a task are used in scheduling, along with a "tuning factor" which specifies how "conservative" the scheduling is to be.

1.2.2 Heterogeneous Communication Systems

Realization of an overlay network involves mapping its topology onto the base (physical) network [16][17][18][19]. This mapping requires finding a "path" in the base network for each "edge" in the overlay network. Here, a path may consist of one or more channels or links. The path is to be selected such that it satisfies the requirements (for example, bandwidth) of the corresponding edge. Also, multiple paths may be employed for an edge when a single path cannot meet its requirements. Heterogeneity must be considered in the mapping of an overlay network and in the scheduling of messages in the network, in order to optimize its performance.

In an effort to minimize communication overhead in high performance computing, multiple communication paths between computing nodes were employed [20]. The main issues are how to choose one path over the others and how to partition a large amount of data (a message) among multiple paths.
The authors considered path "characteristics," such as latency, in path selection and aggregation. However, only spatial heterogeneity in terms of the average characteristics was considered; that is, temporal variation of the characteristics was not taken into account.

The idea of utilizing multiple paths was also employed for video streaming in the "path diversity" framework [21]. Here, the objectives are to minimize packet loss caused by congestion due to time-varying bandwidth and to provide sufficient bandwidth for media streaming, by using more than one path simultaneously from a source to a destination. A quantitative measure of temporal heterogeneity in the bandwidth was not considered in their algorithms.

In order to quantify (temporal) heterogeneity in the channel bandwidth, one may consider a second-order measure of the bandwidth in addition to its mean. Earlier, heterogeneity in the available computing powers of computing nodes on a cluster or Grid was studied in terms of load balancing [22][23], and it was shown that the average parallel execution time of a task can be substantially reduced by taking the standard deviation of the available computing power on each node into account when partitioning the task over multiple nodes.

Most of the previous work on communication bandwidth management utilized either the average or the instantaneous bandwidth. Nevertheless, there have been some research efforts on scheduling and allocation problems under varying channel conditions [24][25][26]. Basically, heuristic approaches are taken to compensate for variations in the channel conditions in order to satisfy certain QoS requirements. Another load-aware routing protocol [27] also considers only the average load at intermediate nodes. However, heterogeneity was not explicitly taken into account. More specifically, the second-order moment of the available channel bandwidth was not quantitatively utilized to optimize network performance.

1.2.3 Network Measurement

Measurement of the bottleneck bandwidth over a network path may be performed on either the sender (source) or the receiver (sink) side. Sender-side measurement avoids the need to modify software on the receiver end, and therefore a new protocol is easier to deploy. However, the measurement delay for a sender-side scheme is about twice that of a receiver-side scheme. Where to perform the measurement must be determined based on the requirements of the specific application.

Several schemes for probing link states based on "packet pairs" were proposed in [28][29][30]. The sender transmits probing packets in the form of back-to-back pairs and estimates the bottleneck service rate from the spacing of the acknowledgements (ACKs). However, this approach relies on several unrealistic assumptions, which may lead to inaccurate measurements. In order to improve the accuracy, an intersection filtering method with a multi-phase variable packet size was used in estimating the bandwidth [31]. The main problem with this scheme is that it is difficult to properly set the bin width and boundaries of the histogram without knowing the shape of the distribution in advance. To overcome this difficulty, a packet-pair scheme combined with bandwidth filtering using a kernel density estimator algorithm was proposed [32][33]. Note that the post-measurement processing/filtering steps in [31][32][33] are performed off-line; therefore, they would not be appropriate for applications where the real-time requirement is critical.
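The packet-pair principle referenced above reduces to a simple computation: when two packets of size P are sent back to back, the bottleneck link spreads them apart, and the observed spacing of the pair (or of the returning ACKs) gives the estimate B ≈ P / Δt. The sketch below is only an illustration of this estimate, with a median over repeated probes standing in, crudely, for the filtering steps discussed above; the packet size, the synthetic gap model, and all names are assumptions, not part of any cited scheme.

    import random
    import statistics

    PACKET_SIZE_BYTES = 1500          # assumed probe packet size

    def bottleneck_estimate(gap_seconds: float) -> float:
        """Packet-pair estimate: bottleneck rate ~ packet size / observed spacing (bits/s)."""
        return (PACKET_SIZE_BYTES * 8) / gap_seconds

    def filtered_estimate(gaps: list) -> float:
        """Median over repeated probes; a crude stand-in for the histogram/kernel
        filtering discussed above (cross traffic tends to widen some gaps)."""
        return statistics.median(bottleneck_estimate(g) for g in gaps)

    if __name__ == "__main__":
        true_rate = 2e6                                # assumed 2 Mbps bottleneck
        ideal_gap = PACKET_SIZE_BYTES * 8 / true_rate
        # Synthetic observations: queuing by cross traffic can only widen the spacing.
        gaps = [ideal_gap * (1.0 + random.expovariate(5.0)) for _ in range(50)]
        print("estimated bottleneck: %.2f Mbps" % (filtered_estimate(gaps) / 1e6))

Because widened gaps bias the raw estimates low, practical schemes filter toward the mode or minimum of the gap distribution; the median here merely keeps the example short.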
1.2.4 Source Rate Control

TCP uses explicit rate/congestion control. All implementations of TCP employ the so-called Additive Increase and Multiplicative Decrease (AIMD) algorithm first proposed by Jacobson in 1986 [34][35]. The main problems of this algorithm are its slow convergence speed and large rate oscillation [36]. Also, its multiplicative decrease scheme, which cuts the sending rate in half in response to each indication of link congestion, may unnecessarily waste the available network bandwidth. For a smoother and more gradual rate adaptation, many TCP-friendly congestion/rate control schemes have been proposed [37]. For example, the Loss-Delay based Adaptation algorithm (LDA+) [38] is a variant of the AIMD congestion control scheme, which adjusts the source rate dynamically based on the current network situation (using an estimate of the bottleneck bandwidth obtained by a packet-pair approach). The bandwidth share of a flow is determined using the TCP model during loss situations, and in the case of no loss it can be increased by a value that does not exceed the increase of the bandwidth share of a TCP connection.

Because AIMD-like control mechanisms mostly focus on fairness among competing TCP traffic flows and usually lead to bursty traffic, they are not preferred by many real-time applications, where one is more interested in per-flow performance such as throughput, end-to-end delay, and inter-packet arrival time (less jitter). A TCP-friendly solution has been proposed to space TCP packets injected into the network evenly over an entire round-trip time, so that data is not sent in a burst [39]. However, this approach often leads to significantly worse throughput than regular TCP because it is susceptible to synchronization losses and it delays congestion signals [40].

A feedback mechanism has been introduced for packet-spacing techniques [41][42], which enables UDP-based applications to perform well while being adaptive and self-regulating. This damped Packet-Spacing Protocol (PSP) transmits data at a near-optimal rate by using a simple feedback mechanism that reports packet loss every round-trip time. However, the only congestion indicator used in PSP is packet loss, which may not be sufficient for optimizing QoS on a network with a high degree of heterogeneity in bandwidth.

1.3 Motivation

In most of the previous work, (i) not all types of heterogeneity were taken into account, (ii) heterogeneity was often considered only implicitly, and (iii) the effects of heterogeneity were not thoroughly analyzed. The heterogeneity in both computing and communication systems needs to be represented quantitatively. It is also necessary to have a clear understanding of how heterogeneity affects the performance of a given task, and to show how the proposed approach can be applied to various application scenarios.

1.4 Research Objectives

The effects of various types of heterogeneity are analyzed to show the possibility of improving task performance by considering heterogeneity information. Heterogeneity-aware approaches (HAA) are proposed to improve the performance of both computing and communication systems.

1.4.1 Computing Systems

The effects of spatial and temporal heterogeneity on the performance of a target task are first analyzed in detail. It is shown that it is not optimal to partition a task over computers linearly proportional to their average computing powers in order to minimize the average parallel execution time.
A heterogeneity-aware approach (HAA) to load balancing, designed to minimize the average parallel execution time of a target task in a spatially heterogeneous computing environment, is then proposed in order to provide a theoretical foundation for developing a practical load balancing scheme. The proposed approach explicitly takes into account the standard deviation of the available computing power in addition to the average computing power available on each computer. It is shown that the average parallel execution time of a target task can be significantly reduced by this approach.

1.4.2 Communication Systems

TCP/AIMD-like source rate control schemes usually lead to bursty traffic and are therefore not preferred by many real-time applications. The use of feedback-based packet-spacing approaches [41][42] improves real-time QoS in some situations. However, "packet loss" by itself may not be sufficient for describing the dynamic features of a rapidly varying bandwidth. A bandwidth measurement technique could be used to provide additional information to the source rate controller (as in [38]), but previous researchers have used only the first-order feature of the bandwidth samples. In order to extract useful features from the measurement, some type of post-measurement technique needs to be used. In a real network, however, a rate control scheme cannot afford a time-consuming filtering algorithm [31][32][33].

In this dissertation, a heterogeneity-aware approach to source rate control is proposed, which is adaptive to the time-varying available (bottleneck) bandwidth. The proposed scheme extracts the second-order moment (standard deviation) as well as the first-order moment (mean) of the available bandwidth and explicitly takes them into account in determining the source rate dynamically. A simple feature extractor implemented at the receiving end measures the features, including the mean and standard deviation, using a packet-pair approach and sends them to the source. For the feature measurement, data packets are used instead of inserting control packets and, therefore, no extra traffic is generated. Applications of the HAA rate control scheme to path selection problems on an overlay network and to a multi-path media streaming framework are also considered.

Chapter 2
Effects of Heterogeneity on a Target Computing Task

2.1 System Model

The high performance computing system to be employed in this study consists of $N$ heterogeneous computers interconnected via a communication network. The speed of the processor, the speed and capacity of memory (including cache), the bandwidth of the I/O system, etc. may not be the same on all computers. Different software, including the OS, may be available on each of the computers.

All these hardware and software components collectively affect the effective computing power available to applications on each computer. It is the computing power available to a target task that eventually determines the execution time of the task. Hence, in this dissertation, "availability" of computing power and communication bandwidth is employed in defining heterogeneity.

2.1.1 Availability

Let the maximum computing power of computer $C_i$ for a target task be denoted by $\alpha_i$ for $i = 1, \ldots, N$, where $N$ is the number of computers. The computing power available for the task at time $t$ may be expressed as $\alpha_i A_i(t)$, where $A_i(t)$ is the "availability" of $C_i$ for the task at time $t$ and $0 \le A_i(t) \le 1$. The mean and standard deviation of $A_i(t)$ are denoted by $a_i$ and $\sigma_{A_i}$, respectively.
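To make the availability notation concrete, the sketch below generates one piecewise-constant trace $A_i(t)$ (the value is redrawn at random interval boundaries, anticipating the setup of Section 2.4.1) and estimates its mean $a_i$ and standard deviation $\sigma_{A_i}$ by time-weighted averaging. The distributions and constants are assumptions chosen only for illustration.

    import random

    def availability_trace(total_time=1000.0, max_interval=20.0, seed=1):
        """Piecewise-constant availability A(t): a list of (duration, availability)
        segments, with A drawn uniformly from [0, 1] on each segment."""
        rng = random.Random(seed)
        t, segments = 0.0, []
        while t < total_time:
            dur = rng.uniform(0.0, max_interval)
            segments.append((dur, rng.uniform(0.0, 1.0)))
            t += dur
        return segments

    def availability_stats(segments):
        """Time-weighted mean a_i and standard deviation sigma_{A_i} of the trace."""
        total = sum(d for d, _ in segments)
        mean = sum(d * a for d, a in segments) / total
        var = sum(d * (a - mean) ** 2 for d, a in segments) / total
        return mean, var ** 0.5

    if __name__ == "__main__":
        seg = availability_trace()
        a, sd = availability_stats(seg)
        print(f"a_i ~= {a:.3f}, sigma_A_i ~= {sd:.3f}")  # ~0.5 and ~0.29 for U[0, 1]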
In the steady state, $a_i$ and $\sigma_{A_i}$ are assumed to be fixed and do not vary with time, while $A_i(t)$ varies with time. In the time-varying state, not only $A_i(t)$ but also $a_i$ and/or $\sigma_{A_i}$ vary with time, i.e., one needs to use the notations $a_i(t)$ and $\sigma_{A_i}(t)$. The time-varying state is not considered in this dissertation.

Availability of communication bandwidth can be defined similarly, with the notations $\beta_i$, $B_i(t)$, $b_i$, and $\sigma_{B_i}$ for the maximum, instantaneous, mean, and standard deviation of bandwidth, respectively. To avoid redundancy, they are not defined separately. One difference is that $B_i(t)$ tends to decrease as $N$ increases, due to contention on the shared media, while $A_i(t)$ is independent of $N$, especially for data-parallel applications.

2.1.2 Heterogeneity

When a set of heterogeneous computers is shared by multiple independent users, the workload on each computer varies with time and with computer. Therefore, the workload, coupled with the heterogeneous hardware and software components, makes the availability (computing power) vary spatially and temporally. Temporal heterogeneity refers to the variation of availability along the time dimension on a computer, while spatial heterogeneity refers to the variation among computers, as illustrated in Figure 2.1. With the notation given in the definition of availability above, a computer $C_i$ exhibits temporal heterogeneity when $A_i(t)$ is a non-uniform function of time. A system consisting of multiple computers shows spatial heterogeneity when $a_i \ne a_j$ and/or $\sigma_{A_i} \ne \sigma_{A_j}$ for some $i, j$ where $i \ne j$. These two types of heterogeneity will be quantified in Section 2.3.5.

Figure 2.1: Temporal and spatial heterogeneity. $A_i(t)$ is the availability of computing power on $C_i$ for a target task, where $i = 1, \ldots, N$.

The temporal heterogeneity is due to the fact that users submit their tasks at any time and the size of a task is variable, i.e., task arrivals and sizes are random. There are two sources of spatial heterogeneity. One is system-related: the hardware and software components are not the same for all computers, and the interconnection network may not be "symmetric." The other is task-related: the task (workload) characteristics (arrival time and size) may vary with computer.

An interconnection network, or networks, is usually shared by the computers. As a result, the spatial heterogeneity of communication bandwidth tends to be lower than that of computing power.

2.2 Task Model

Two task models are employed in this dissertation: the base and augmented task models. In the base task model, a target task is assumed to be linearly partitionable over $N$ computers, i.e., the sum of the computational loads of the subtasks is equal to that of the target task. In order to focus on the effects of heterogeneous computing power on the performance of a target task, communication is not considered in the base model. Many data-parallel applications, such as low-level image processing, (Monte Carlo) simulations, matrix computation, etc., fit this model well. However, in many other applications, as a task is partitioned, communication among subtasks is unavoidable for sharing data and results, collecting local results, etc.

The augmented task model is derived by incorporating "periodic" communication phases into the base task model. That is, execution of a target task consists of a sequence of alternating computation and communication phases.
Hence, a communication phase can be considered as a synchronization point at which all $N$ computers are required to be synchronized for data communication before proceeding to the next (computation) phase. There are many applications which can be represented by this model, e.g., iterative optimization, simulation of a physical or chemical phenomenon, medical image reconstruction, etc. The number of synchronization points is denoted by $N_s$.

2.3 Performance Measures

The performance of a target task on a heterogeneous computing system may be measured in a few different ways. One popular metric is the execution time of the target task, whose minimization is the main objective of high performance computing. In a heterogeneous computing environment, the execution time varies significantly due to temporal and spatial heterogeneity and, therefore, the average execution time may be employed as a performance measure. From the stability point of view, one may want to minimize the variation of the execution time, in which case the standard deviation of the execution time can be used. Another way to quantify stability is to specify the probability that an execution time longer than a certain threshold is observed. In this section, the performance measures to be employed in this dissertation are derived.

Let the size of a target task be denoted by $X$ and the portion of the target task assigned to computer $C_i$ by $X_i$. The execution time of $X_i$ on $C_i$ is referred to as $T_i$, which is a random variable, and its mean and standard deviation are denoted by $t_i$ and $\sigma_{T_i}$, respectively. In this dissertation, the same notation is used for a random variable and its observation, in order to minimize the number of variables.

2.3.1 Mean and Standard Deviation of Execution Time

Referring to the definition of availability, the relationship between $X_i$ and $T_i$ may be expressed as follows:

\[ \int_0^{T_i} \alpha_i A_i(t)\, dt = X_i \]

By taking the expectation on both sides,

\[ t_i = \frac{X_i}{\alpha_i a_i} \qquad (2.1) \]

Assuming that $A_i(t)$ and $A_i(t')$ are uncorrelated, the standard deviation, $\sigma_{X_i}$, of the work completed during $t_i$ can be shown to be $\sigma_{X_i} = \sqrt{t_i}\,\alpha_i \sigma_{A_i}$. Then, the standard deviation of $T_i$ may be given by

\[ \sigma_{T_i} = \sqrt{t_i}\, \frac{\sigma_{A_i}}{a_i} \qquad (2.2) \]

The parallel execution time of $X$ in a run, denoted by $T$, which is also a random variable, is given by $T = \max_i \{ T_i \}$, where the notation $\{\,\}$ refers to a set. The mean ($\mu$) and standard deviation ($\sigma_T$) of the parallel execution time of $X$ are computed as follows:

\[ \mu = E[\,T\,] \qquad \sigma_T = \sqrt{E[\,(T - \mu)^2\,]} \]

where $E[\,]$ is the expectation operator.

2.3.2 Spatially Homogeneous Environment

When all $N$ computers have the same distribution of execution time ($T_0$) and the corresponding cdf (cumulative distribution function) is a monotonically increasing function, $\mu$ may be expressed as follows (note that the subscript $i$ is not used in this subsection, since all computers are identical, so that $T_i = T_0$ and $X_i = X_0$ for all $i$):

\[ \mu = \int N T_0\, F(T_0)^{N-1} f(T_0)\, dT_0 \qquad (2.3) \]

where $F(T_0)$ and $f(T_0)$ are the cdf and pdf (probability density function) of $T_0$, respectively.

It is possible to derive $\mu$ in the following form [12][43], referring to Equations 2.1 and 2.2:

\[ \mu = t + K(N)\,\sigma_{T_0} = t + K(N)\sqrt{t}\,\frac{\sigma_A}{a} = \frac{X_0}{\alpha a} + K(N)\sqrt{\frac{X_0}{\alpha a}}\,\frac{\sigma_A}{a} \qquad (2.4) \]

where $K(N)$ is an increasing function of $N$ and $K(1) = 0$. Notice that the average parallel execution time of a target task consists of two components, one depending on the mean of the availability and the other on the standard deviation of the availability, i.e., on the temporal heterogeneity.
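Equation 2.4 separates $\mu$ into a mean-availability term $t$ and a heterogeneity term $K(N)\sigma_{T_0}$. The following minimal sketch shows how the formula behaves: $K(N)$ is approximated as the expected maximum of $N$ standard normal variables (a modeling assumption; the dissertation only characterizes $K(N)$ as an increasing function with $K(1)=0$ and notes that the execution-time distribution is roughly Gaussian/Gamma), and Equation 2.4 is then evaluated with $t = X_0/(\alpha a)$ from Equation 2.1. All constants are illustrative.

    import random
    import statistics

    def expected_max_gain(N, runs=20_000, seed=1):
        """Monte Carlo estimate of K(N) = E[max of N standard normal variables],
        under the assumption that each T_i is approximately normal."""
        rng = random.Random(seed)
        return statistics.fmean(max(rng.gauss(0.0, 1.0) for _ in range(N))
                                for _ in range(runs))

    def mean_parallel_time(N, X0=100.0, alpha=1.0, a=0.5, sigma_A=0.14):
        """Evaluate Eq. 2.4: mu = t + K(N) * sqrt(t) * sigma_A / a,
        with t = X0 / (alpha * a) from Eq. 2.1."""
        t = X0 / (alpha * a)
        return t + expected_max_gain(N) * (t ** 0.5) * sigma_A / a

    if __name__ == "__main__":
        for N in (1, 2, 8, 32):
            print(f"N={N:3d}  mu ~= {mean_parallel_time(N):.1f}")  # grows with N

With these numbers, $\mu$ equals $t$ for $N = 1$ and grows slowly with $N$, which is exactly the penalty attributed to temporal heterogeneity in the discussion that follows.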
It is to be noted that temporal heterogeneity makes the average parallel execution time increase beyond the average (sequential) execution time on each computer. The increase is larger when the number of computers employed is greater. Also, as will be shown later, a higher spatial heterogeneity leads to a longer average parallel execution time.

2.3.3 Synchronization

When the communication phase in the augmented task model does not include data communication, it can be considered a pure synchronization point. That is, at each synchronization point, all computers need to wait for the "slowest" computer before they proceed to the next computation phase. Suppose that there are $N_s$ synchronization points in a target task, such that the amount of work to be done between two successive synchronization points is $X_0/N_s$. In order to isolate the effects of synchronization, let us assume that no data communication is actually carried out at each synchronization point. Referring to Equation 2.4, the mean parallel execution time in the $j$th phase, $\mu^{(j)}$, can be expressed as

\[ \mu^{(j)} = \frac{X_0}{\alpha a N_s} + K(N)\sqrt{\frac{X_0}{\alpha a N_s}}\,\frac{\sigma_A}{a} \]

for $j = 1, \ldots, N_s$. Then, the average parallel execution time $\mu$ is $N_s\,\mu^{(j)}$. Hence,

\[ \mu = \frac{X_0}{\alpha a} + K(N)\sqrt{\frac{X_0 N_s}{\alpha a}}\,\frac{\sigma_A}{a} \qquad (2.5) \]

It can be seen that $\mu$ increases as $N_s$ increases, even though no synchronization overhead (e.g., communication overhead for synchronization) is taken into account.

2.3.4 Stability of Performance

The variation of $T$ may be quantified by its standard deviation, $\sigma_T$, which is to be minimized when one wants to achieve stable performance. Also, one may be interested in knowing the probability, referred to as the "risk factor," that $T$ is longer than a certain desirable threshold, $\tau_d$. The risk factor, denoted by $RF$, is given by $\mathrm{Prob}[\,T > \tau_d\,]$.

2.3.5 Temporal and Spatial Heterogeneity

In this section, temporal heterogeneity and spatial heterogeneity are quantified for computing power only, since the definitions would be identical for communication bandwidth. One difference is that spatial heterogeneity would generally be lower for communication bandwidth than for computing power, mainly because communication paths are usually shared by computers.

2.3.5.1 Temporal Heterogeneity

Temporal heterogeneity is defined on an individual computer, and it indicates the variability of the computing power available for a task along the time dimension. Therefore, the standard deviation of the availability may be used to quantify temporal heterogeneity on each computer. Noting that the average availability may vary with computer (and also with time) and that $\mu$ depends on the ratio of $\sigma_{A_i}$ to $a_i$ (refer to Equation 2.4), the temporal heterogeneity on $C_i$, denoted by $TH_i$, is defined to be the normalized standard deviation of availability:

\[ TH_i \triangleq \bar{\sigma}_{A_i} = \frac{\sigma_{A_i}}{a_i} \qquad (2.6) \]

The notation $TH$ will be used when $TH_i$ is the same for all $i$ (computers), i.e., when the system is spatially homogeneous, or when the mean of $TH_i$ over all computers is referred to.

2.3.5.2 Spatial Heterogeneity

Spatial heterogeneity is defined for a group of computers to be employed to execute a target task. It represents the variability of computing power among the computers. Let the mean and maximum of $\bar{\sigma}_{A_i}$ over the $C_i$ be denoted by $\bar{\sigma}_A^{mean}$ and $\bar{\sigma}_A^{max}$, respectively, i.e., $\bar{\sigma}_A^{mean} = \frac{1}{N}\sum_{i=1}^{N} \bar{\sigma}_{A_i} = TH^{mean}$ and $\bar{\sigma}_A^{max} = \max_i \{ \bar{\sigma}_{A_i} \}$. The spatial heterogeneity, denoted by $SH$, of a set of computers $\{C_i\}$ is defined as

\[ SH \triangleq \bar{\sigma}_A^{max} - \bar{\sigma}_A^{mean} \qquad (2.7) \]

That is, $SH$ quantifies the variation of temporal heterogeneity among computers.
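The three measures just defined translate directly into code. In the sketch below, $TH_i$ is the normalized standard deviation of availability on computer $C_i$ (Eq. 2.6), $SH$ is the spread between the maximum and the mean of the $TH_i$ over the set of computers (Eq. 2.7), and the risk factor is estimated as the fraction of observed parallel execution times exceeding a threshold $\tau_d$; the sample values are made up for illustration.

    def temporal_heterogeneity(sigma_A_i, a_i):
        """TH_i = sigma_{A_i} / a_i  (Eq. 2.6)."""
        return sigma_A_i / a_i

    def spatial_heterogeneity(th_values):
        """SH = max_i TH_i - mean_i TH_i  (Eq. 2.7)."""
        mean_th = sum(th_values) / len(th_values)
        return max(th_values) - mean_th

    def risk_factor(parallel_times, tau_d):
        """RF = Prob[T > tau_d], estimated from observed parallel execution times."""
        return sum(1 for T in parallel_times if T > tau_d) / len(parallel_times)

    if __name__ == "__main__":
        # Illustrative numbers only: four computers, each with a_i = 0.5.
        th = [temporal_heterogeneity(s, 0.5) for s in (0.10, 0.14, 0.20, 0.26)]
        print("TH:", [round(x, 2) for x in th],
              " SH:", round(spatial_heterogeneity(th), 3))
        print("RF:", risk_factor([212, 230, 198, 241, 225], tau_d=221))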
2.4 Simulation Results and Discussion

2.4.1 Simulation Setup

Availability is assumed to have a uniform distribution (it was observed that other distributions, such as a "truncated" Gaussian distribution, resulted in similar trends). The distribution of execution time on each computer then looks similar to a Gaussian or Gamma distribution, which was also adopted in another study [8].

The average availability $a_i$ (or average computing power $\alpha_i a_i$) may vary with computer $C_i$. However, $a_i$ (or $\alpha_i a_i$) can be effectively normalized in task partitioning, i.e., load balancing for an average availability (or average computing power) that varies with computer can be done easily by assigning to $C_i$ an $X_i$ proportional to $a_i$ (or $\alpha_i a_i$). Hence, $a_i$ is set to 0.5 with $\alpha_i = 1.0$ for all $i$ in the simulation, in order to focus on the effects of heterogeneity in the availability (rather than the average availability or computing power) and to maximize the range of variation of $A_i(t)$ (note that $0 \le a_i \le 1.0$). Then, the maximum $\sigma_{A_i}$ ($\bar{\sigma}_{A_i}$) is $\frac{1}{2\sqrt{3}}$ ($\frac{1}{\sqrt{3}}$). The computing power $\alpha_i$ of $C_i$ is set to 1.0, also for normalization purposes.

The notion of an "interval" is adopted to quantify the time duration over which the availability $A_i(t)$ remains constant. The interval is mostly affected by other tasks (the distributions of their arrival times and sizes) and by the local scheduler on $C_i$, and it is modeled as a random variable. Note that decreasing the interval for a fixed task size is equivalent to increasing the task size with the interval fixed, and vice versa. A larger interval (for a given task size) leads to a higher chance of load imbalance among computers. The interval is generated from a uniform distribution whose range is [0, 20], except when it is explicitly varied. The simulation was repeated 1000 times for each case, and the results were averaged for the graphs in the following sections. (A minimal sketch of this simulation procedure is given after Figure 2.2 below.)

2.4.2 Effects of Heterogeneity on a Target Task

In this section, some of the simulation results are presented to discuss the effects of temporal and spatial heterogeneity on the performance of a target computing task. In the graphs where the variation of the parallel execution time $T$ is analyzed, the standard deviation of $T$ normalized by $\mu$ is used, i.e., $\bar{\sigma}_T \triangleq \sigma_T / \mu$. The results without communication overhead (but with synchronization in some cases) are provided in Figures 2.2 - 2.11, and those with communication overhead are in Figures 2.12 and 2.13.

2.4.2.1 Spatially Homogeneous and Temporally Heterogeneous

In Figure 2.2, $\mu$ and $\bar{\sigma}_T$ are plotted when $N = 1$. As can be seen in the figure, the average sequential execution time is not affected by $\bar{\sigma}_A$, but its variation increases proportionally to $\bar{\sigma}_A$, as shown by Equations 2.1 and 2.2.

In Figure 2.3, the effects of $\bar{\sigma}_A$ on $\mu$ and $\bar{\sigma}_T$ are shown for different values of $N$. As predicted by Equation 2.4, $\mu$ and $\bar{\sigma}_T$ increase almost linearly with $\bar{\sigma}_A$ (which is $TH$) when multiple computers are employed. When more computers are employed (a larger $N$), the effect of $\bar{\sigma}_A$ on $\mu$ is larger, as shown in Figure 2.3-(a). The effect of $\bar{\sigma}_A$ on $\sigma_T$ is also larger. However, since $\mu$ increases faster than $\sigma_T$ does, the effect of $\bar{\sigma}_A$ on $\bar{\sigma}_T$ is smaller for a larger $N$, as shown in Figure 2.3-(b).

Figure 2.2: (a) Average execution time and (b) normalized standard deviation of execution time on a single computer where $X = 100$.
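The simulation procedure of Section 2.4.1 can be reconstructed in a few lines: each computer's availability is redrawn from U[0, 1] at the start of each interval (interval lengths from U[0, 20]), the assigned work $X_i$ is consumed at the rate of the current availability ($\alpha_i = 1$), and the parallel execution time of a run is the maximum of the per-computer times. This is an illustrative reconstruction under the setup stated above, not the simulator used to produce the reported figures.

    import random

    def run_time(X_i, rng, max_interval=20.0):
        """Time to finish X_i units of work when availability is redrawn from
        U[0, 1] at the start of each interval of length U[0, max_interval]."""
        t, remaining = 0.0, X_i
        while remaining > 0.0:
            dur = rng.uniform(0.0, max_interval)
            avail = rng.uniform(0.0, 1.0)
            done = avail * dur
            if done >= remaining and avail > 0.0:
                return t + remaining / avail       # task finishes mid-interval
            remaining -= done
            t += dur
        return t

    def parallel_run(N, X, rng):
        """One run: X is split evenly, and the parallel time is the slowest computer."""
        return max(run_time(X / N, rng) for _ in range(N))

    if __name__ == "__main__":
        rng = random.Random(7)
        runs = [parallel_run(10, 1000.0, rng) for _ in range(1000)]
        mu = sum(runs) / len(runs)
        sigma = (sum((r - mu) ** 2 for r in runs) / len(runs)) ** 0.5
        print(f"mu ~= {mu:.1f}, normalized sigma_T ~= {sigma / mu:.3f}")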
Figure 2.3: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time on $N$ computers where $X_i = 100$ for $i = 1, \ldots, N$.

2.4.2.2 Spatially and Temporally Heterogeneous

In Figure 2.4, the dependency of $\mu$ and $\bar{\sigma}_T$ on $SH$ and $TH$ is shown when $N$ is fixed at 8. In this graph, when $SH$ is zero (i.e., on the $TH$ axis), $\bar{\sigma}_{A_i}$, which is $TH_i$, is the same for all computers. When $SH$ is greater than zero, the distribution of $\bar{\sigma}_{A_i}$ among computers is linear, such that $\bar{\sigma}_{A_i} = \bar{\sigma}_A^{mean} + 2\left(\frac{i-1}{N-1} - 0.5\right)\left(\bar{\sigma}_A^{max} - \bar{\sigma}_A^{mean}\right)$ for $i = 1, \ldots, N$. That is, $TH = \bar{\sigma}_A^{mean}$ in these graphs. As $SH$ increases, $\mu$ increases significantly, especially when $\bar{\sigma}_A^{mean}$ also increases (going diagonally from the origin in the $SH$-$TH$ plane).

Figure 2.4: (a) Average execution time and (b) normalized standard deviation of execution time when $SH$ and $TH$ are varied with $N$ fixed at 8, where $X_i = 100$ for $i = 1, \ldots, 8$.

Now, suppose that the size of a target task, $X$, is fixed independently of $N$ and is distributed uniformly over the $N$ computers. In general, the larger the set of computers employed for the target task, the larger the heterogeneity ($SH$) among the computers becomes. Two cases are considered: (i) $\bar{\sigma}_A^{mean}$ is fixed while $SH$ increases, and (ii) both $SH$ and $\bar{\sigma}_A^{mean}$ increase. In both cases, three different situations are considered in terms of how $\bar{\sigma}_A^{max}$ grows: proportionally to $\sqrt{N}$, to $N$, and to $N^2$. Simulation results for cases (i) and (ii) are provided in Figures 2.5 and 2.6, respectively. As $N$ increases, $\mu$ decreases almost inversely proportionally to $N$ and then starts to increase beyond a certain $N$, especially when $SH$ increases rapidly. In contrast, $\bar{\sigma}_T$ increases monotonically as $N$ increases. It increases sharply after a certain value of $N$ in the case where $SH$ increases proportionally to $N^2$.

Figure 2.5: (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ remains fixed.

Figure 2.6: (a) Average execution time and (b) normalized standard deviation of execution time. $X_i = X/N$ for $i = 1, \ldots, N$ and $X = 100$. As $N$ increases, $\bar{\sigma}_A^{mean}$ increases.

Speed-up is shown for cases (i) and (ii) in Figure 2.7-(a) and Figure 2.7-(b), respectively. The reason the three curves meet at $N = 50$ is that the simulation was set up such that the distribution of $\bar{\sigma}_A$ among computers becomes identical in all three situations when $N$ reaches 50. Hence, what is to be observed in these graphs is the "shape" (trend) of each curve. It is clear that spatial
2.4.2.3 Synchronization

The number of synchronization points, N_s, is varied for different σ̄_{A_i} to observe its effects on t̄ and σ̄_T, when SH = 0 in Figure 2.8 and when SH > 0 in Figure 2.9. In Figure 2.8, it is seen that as N_s increases, t̄ increases, since the idle time due to synchronization increases. However, σ̄_T decreases. This can be explained as follows. The increased T_i (or t_i) due to a larger N_s leads to an increased t̄ (refer to Equation 2.2). However, t̄ increases faster than σ_T, causing σ̄_T to decrease. The rate of increase in t̄ is larger for a larger σ̄_A (TH) and a higher SH, as shown in Figure 2.9.

Figure 2.8: (a) Average execution time and (b) normalized standard deviation of execution time as functions of N_s. X = 1000, N = 10, and X_i = X/N = 100.

Figure 2.9: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of SH and N_s. X = 1000, N = 10, X_i = X/N = 100, and σ̄_A^mean = 0.28.

2.4.2.4 Granularity of Availability

When the size of a target task (in terms of its execution time) is much larger than an interval (the duration over which availability remains constant; refer to Section 2.4.1), or equivalently when the interval is much smaller than the target task size, the effects of heterogeneity (temporal or spatial) are reduced, since the probability of load imbalance decreases. The dependency on the interval is shown when SH = 0 in Figure 2.10 and when SH > 0 in Figure 2.11. As the interval increases (i.e., availability varies more slowly with time) for a given target task, the effects of TH and SH become larger, making t̄ and σ̄_T larger. Equivalently, a smaller task would be more sensitive to heterogeneity.

Figure 2.10: (a) Average execution time and (b) normalized standard deviation of execution time as functions of interval. X = 1000, N = 10, and X_i = X/N = 100.

Figure 2.11: (a) Average parallel execution time and (b) normalized standard deviation of parallel execution time as functions of SH and interval. X = 1000, N = 10, X_i = X/N = 100, and σ̄_A^mean = 0.28.

2.4.2.5 Communication Overhead

Simulation results when the augmented task model is adopted are provided in Figures 2.12 and 2.13. In these graphs, t̄ includes both computation and communication times. In Figure 2.12, SH_comp and SH_comm represent the spatial heterogeneity of availability in computing power and communication bandwidth, respectively. As expected for the simulation model, SH_comp and SH_comm have a similar effect on t̄ and σ̄_T (refer to Section 2.4.2.2). In this simulation, the ratio of computation time to communication overhead is set to 6. That is why t̄ increases more slowly along SH_comm than along SH_comp.
Figure 2.12: (a) Average execution time (computation and communication times) and (b) normalized standard deviation of execution time as functions of SH_comm and SH_comp. N = 20, X = 2000, N_s = 10, X_i = X/N = 100, and σ̄_A^mean = σ̄_B^mean = 0.28.

In Figure 2.13, SH is the same for both the computing power availability and the bandwidth availability. As N increases, t̄ decreases initially and then turns around to increase after a certain value of N. This is mainly because communication overhead becomes dominant as N increases. It is also observed that the trend is more visible when the bandwidth availability (and computing power availability) shows a higher spatial heterogeneity (SH). For example, in Figure 2.13-(a), it is seen that t̄ turns around to increase when SH is relatively high, while it monotonically decreases when SH = 0.

Figure 2.13: Average execution time: (a) when communication overhead is independent of N and X = 400, and (b) when communication overhead increases linearly proportional to N and X = 2000. N_s = 10 and σ̄_A^mean = σ̄_B^mean = 0.28.

2.4.2.6 Load Distribution and Stability (Risk Factor)

So far, only the cases where a target task is uniformly distributed among computers have been discussed. Let's examine the effect of the distribution of the target task with SH varied. In Figure 2.14, t̄ and σ̄_T are plotted as functions of dX with N = 2, where dX = (X_1 − X_2)/2 and Δσ̄_A = σ̄_{A_1} − σ̄_{A_2}. In Figure 2.15, the risk factor defined in Section 2.3.4 is shown as a function of dX. The same set of results is provided for N = 10 in Figures 2.16 and 2.17. In these simulations, σ̄_{A_i} is increased linearly with i among the 10 computers as before (refer to Figure 2.4). X_i is also varied linearly with respect to i, i.e., X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., N. Therefore, a negative slope means that a computer with a smaller σ̄_{A_i} is assigned a larger fraction of X (a larger X_i).

Figure 2.14: (a) Average execution time and (b) normalized standard deviation of execution time on two computers where X_1 = X/2 − dX, X_2 = X/2 + dX, and X = 100.

Figure 2.15: Risk factor on two computers where X_1 = X/2 − dX, X_2 = X/2 + dX, and X = 200. t_d = 221. Δσ̄_A = σ̄_{A_2} − σ̄_{A_1}.

Figure 2.16: (a) Average execution time and (b) normalized standard deviation of execution time on 10 computers. X = 100 and X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., 10.

Figure 2.17: Risk factor on 10 computers with (a) t_d = 40 and (b) t_d = 28. X = 100 and X_i = X_0 + Slope_Xi (i − 1) for i = 1, ..., 10.

It may be observed from these results that, by assigning more work (a larger fraction of the target task) to a computer with a smaller variation of availability, (a) a shorter (minimum) t̄ is obtained, (b) σ̄_T is sharply decreased, and (c) the risk factor is quickly reduced. These trends are more explicit for a larger N and a higher SH. It is also seen that the amount of work to be assigned to each computer to minimize t̄, σ̄_T, and RF depends on SH (and the distribution of σ̄_{A_i}). That is, for achieving the minimum execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the mean availabilities of computers.
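The linear load distributions used in Figures 2.16 and 2.17 can be generated as follows (a sketch under the stated form X_i = X_0 + Slope_Xi·(i − 1); the helper name is ours), with X_0 chosen so that the parts sum to X.

def linear_partition(X, N, slope):
    """Xi = X0 + slope*(i-1) with sum(Xi) = X determining X0.
    A negative slope assigns more work to the lower-index computers,
    i.e. to those with smaller sigma_bar_Ai in these simulations."""
    X0 = X / N - slope * (N - 1) / 2.0
    return [X0 + slope * (i - 1) for i in range(1, N + 1)]

parts = linear_partition(100.0, 10, -1.0)
print(parts, sum(parts))    # the parts sum back to X = 100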
Chapter 3
Load Balancing on a Heterogeneous Computing System

The results in Chapter 2 reveal that, for achieving the minimum average parallel execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the average availabilities (or average computing powers) of computers. This observation has motivated the development of an efficient load balancing scheme for heterogeneous cluster and grid computing environments, which is described in this chapter.

3.1 Two-Step Heuristic Approach

Let's denote the computing power (speed) of C_i at time t by S_i(t), which has the mean s_i and the standard deviation σ_{S_i}. Then, noting that S_i(t) = α_i A_i(t), s_i = α_i a_i and σ_{S_i} = α_i σ_{A_i}. When σ_{S_i} = 0 for all i, a target task is to be partitioned proportional to s_i in order to achieve the minimum t̄. However, such partitioning does not lead to the minimum t̄ in general when σ_{S_i} ≠ 0 for some i.

Let's consider cases where σ_{S_i} ≠ 0 for some i. Suppose that X is partitioned such that X_i is linearly proportional to s_i. Then, t_i (the average execution time on C_i) would be the same for all i, but σ_{T_i} (the standard deviation of execution time on C_i) would not be. It is to be pointed out that σ_{T_i} is linearly proportional to σ̄_{A_i} (or TH_i). Noting that t̄ is given by E[max_i{T_i}] rather than max_i{T_i} or max_i{t_i}, it is possible to further reduce t̄ by taking {σ_{T_i}} into account. Therefore, load balancing may be carried out in two steps as follows: (1) X is partitioned over {C_i} such that X_i is proportional to s_i; (2) {X_i} is further adjusted considering {σ_{T_i}} and {s_i}. In the following, this two-step load balancing approach is elaborated for different cases.

3.1.1 When s_i = s_j for all i, j:

Consider the case of N = 2 (two computers). In Step (1) of the two-step load balancing approach, X is divided equally between C_1 and C_2 since s_1 = s_2. That is, after Step (1), X_1 = X_2 = X/2 and t_1 = t_2. Suppose that σ_{T_1} > σ_{T_2} after Step (1). Then, it is possible to reduce t̄ further by transferring a certain amount of work (ΔX ≥ 0) from C_1 to C_2 in Step (2), i.e., X'_1 = X_1 − ΔX and X'_2 = X_2 + ΔX, where X'_i is X_i after Step (2). Then,

t'_1 = t_1 − ΔX/s_1 ≜ t_1 − Δt_1   and   t'_2 = t_2 + ΔX/s_2 ≜ t_2 + Δt_2        (3.1)

where t'_i is t_i after Step (2). Note that Δt_1 = Δt_2 since s_1 = s_2. Also, from Equation 2.2, it can be shown that

σ'_{T_1} = σ_{T_1} √(1 − ΔX/(s_1 t_1))   and   σ'_{T_2} = σ_{T_2} √(1 + ΔX/(s_2 t_2))        (3.2)

where σ'_{T_i} is σ_{T_i} after Step (2).

A heuristic scheme, to be referred to as the equalization scheme, that can be employed in Step (2) equalizes the sum of the mean and standard deviation of execution time between the two computers. That is, it determines ΔX such that

t'_1 + σ'_{T_1} = t'_2 + σ'_{T_2}        (3.3)

where t'_i and σ'_{T_i} are given by Equations 3.1 and 3.2, respectively. The equalization scheme is illustrated in Figure 3.1. The scheme attempts to reduce the probability that the parallel execution time of a target task, T = max_i{T_i}, is large, in order to minimize its average parallel execution time. This heuristic can be generalized for cases where N > 2, i.e., X_i is determined such that t'_i + σ'_{T_i} is the same for all i.
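For N = 2, the equalization condition of Equation 3.3 can be solved numerically. The sketch below (ours, not the dissertation's implementation) applies bisection to Equations 3.1 and 3.2, assuming σ_{T_1} > σ_{T_2} so that ΔX is moved from C_1 to C_2.

def equalization_delta(t1, t2, s1, s2, sig1, sig2, iters=60):
    """Find dX (moved from C1 to C2) such that t'_1 + sig'_T1 = t'_2 + sig'_T2,
    using Equations 3.1 and 3.2.  Assumes sig1 > sig2, so the imbalance is
    positive at dX = 0 and decreases monotonically in dX."""
    def imbalance(dX):
        tp1 = t1 - dX / s1
        tp2 = t2 + dX / s2
        sp1 = sig1 * (1.0 - dX / (s1 * t1)) ** 0.5
        sp2 = sig2 * (1.0 + dX / (s2 * t2)) ** 0.5
        return (tp1 + sp1) - (tp2 + sp2)

    lo, hi = 0.0, s1 * t1              # dX cannot exceed the work X1 = s1*t1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Equal speeds and equal means after Step (1), but C1 has the larger deviation.
print(equalization_delta(t1=100.0, t2=100.0, s1=1.0, s2=1.0, sig1=20.0, sig2=5.0))

For N > 2, the same idea would be applied by choosing {X_i} so that t'_i + σ'_{T_i} is equal across computers, for example by iterating such pairwise transfers.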
Figure 3.1: Equalization scheme: (a) after Step (1), where a target task is partitioned such that the average execution time is the same on all computers, and (b) after Step (2), where the partitioning is adjusted such that t'_i + σ'_{T_i} is equalized over all computers.

3.1.2 When s_i ≠ s_j for some i, j:

Again, consider the case of N = 2, and suppose that s_1 < s_2. After Step (1), X_1 = s_1 X/(s_1 + s_2) and X_2 = s_2 X/(s_1 + s_2). What is to be done in Step (2) to further reduce t̄ depends on {σ_{T_i}} and {s_i}, as discussed below, and can be generalized for cases where N > 2.

(a) σ_{T_1} > σ_{T_2}: In this case, ΔX is to be moved from C_1 to C_2, and the equalization scheme may be employed in determining ΔX. One difference is that Δt_1 > Δt_2. In other words, the same increase in t_2 (by moving ΔX from C_1 to C_2) results in a larger decrease in t_1, leading to a larger reduction in t̄, compared to the case where s_1 = s_2.

(b) σ_{T_1} < σ_{T_2} and s_1 ≪ s_2: It is still possible to reduce t̄ by transferring ΔX from C_1 to C_2, though the equalization scheme may not be used since the condition t'_1 + σ'_{T_1} = t'_2 + σ'_{T_2} cannot be satisfied. The reduction in t̄ would be smaller than that in (a).

(c) σ_{T_1} < σ_{T_2} and s_1 ≈ s_2: In this case, ΔX is to be moved from C_2 to C_1, and the equalization scheme can be used. The reduction in t̄ would be smaller than that in (a) or (b).

It should be clear that transferring ΔX in the direction opposite to that specified above would result in a longer t̄. The above discussion on the reduction in t̄ for the different cases is summarized in Table 3.1.

Table 3.1: Possible reduction in t̄

                    s_1 = s_2              s_1 < s_2
                    σ_{T_1} > σ_{T_2}      σ_{T_1} > σ_{T_2}   σ_{T_1} < σ_{T_2} (s_1 ≪ s_2)   σ_{T_1} < σ_{T_2} (s_1 ≈ s_2)
Direction of ΔX     C_1 → C_2              C_1 → C_2           C_1 → C_2                        C_1 ← C_2
Reduction in t̄     Smallest               Largest             Small                            Small

3.2 Simulation Results and Discussion

Let's first consider a case where the average computing speed is the same for all computers, i.e., s_i = s_j for all i, j. First, X_i = X_j for all i, j in Step (1) of the two-step load balancing approach described in Section 3.1. It is assumed that t_i = 100 and σ_{T_i} = ((i − 1)/(N − 1)) σ_{T_max}, for i = 1, ..., N, after Step (1). In Step (2), X_i is adjusted such that t'_i = t_i + Slope_T (i − (N − 1)/2) for i = 1, ..., N. Slope_T = 0 corresponds to the case where X'_i is proportional to s_i (the average computing speed or power), i.e., Step (1) only. A negative Slope_T indicates that a computer with a larger σ_{T_i} is assigned a larger ΔX_i (refer to Section 3.1) in Step (2), in addition to the X_i allocated in Step (1).

In Figure 3.2, t̄ and the percentage standard deviation of (t'_i + σ'_{T_i}) are shown as functions of Slope_T, which specifies how a target task is distributed over the N computers. The percentage standard deviation of (t'_i + σ'_{T_i}) is a measure which reflects the degree of equalization, and is defined as:

% standard deviation of (t'_i + σ'_{T_i}) = [σ_{(t'+σ'_T)} / μ_{(t'+σ'_T)}] × 100%        (3.4)

where μ_{(t'+σ'_T)} and σ_{(t'+σ'_T)} are the average and standard deviation of (t'_i + σ'_{T_i}) over the N computers:

μ_{(t'+σ'_T)} = (1/N) Σ_{i=1}^{N} (t'_i + σ'_{T_i})        (3.5)

σ_{(t'+σ'_T)} = √[ (1/N) Σ_{i=1}^{N} (t'_i + σ'_{T_i} − μ_{(t'+σ'_T)})² ]        (3.6)

First of all, it is to be noted from Figure 3.2 that it is not optimal to distribute a target task linearly proportional to the average computing power of computers. In Figure 3.2-(a), it is clear that the performance improvement (reduction in t̄) achieved by Step (2) (more precisely, Step (1) + Step (2)) over Step (1) alone is significant, and is larger for a larger Δσ_T (equivalently, a higher spatial heterogeneity). Comparing Figures 3.2-(a) and (b), it can be seen that the equalization scheme employed in Step (2) works well, i.e., the distribution of X minimizing t̄ closely matches that minimizing the variation in t_i + σ_{T_i}.
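Equations 3.4 through 3.6 amount to the small computation below (a sketch with hypothetical inputs), which is the quantity plotted in Figure 3.2-(b).

def pct_std_of_sum(t_prime, sig_prime):
    """Percentage standard deviation of (t'_i + sig'_Ti) over N computers
    (Equations 3.4-3.6); smaller values indicate better equalization."""
    vals = [t + s for t, s in zip(t_prime, sig_prime)]
    n = len(vals)
    mean = sum(vals) / n
    std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    return 100.0 * std / mean

print(pct_std_of_sum([100, 98, 96, 94], [5, 10, 15, 20]))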
Figure 3.2: (a) Average parallel execution time and (b) percentage standard deviation of t'_i + σ'_{T_i} after Step (2) in the load balancing, when the number of computers, N, is 10. After Step (1), t_i = 100 and Δσ_T = σ_{T_N} − σ_{T_1}, where σ_{T_i} = ((i − 1)/(N − 1)) σ_{T_max} for i = 1, ..., 10.

In Figure 3.3, a system of two computers where s_1 < s_2 is considered. In these graphs, t̄ is plotted as a function of ΔX. Again, ΔX = 0 corresponds to the case where X is divided between the two computers proportional to their average computing powers (s_1 and s_2). It can be observed that the reduction in t̄, which can be achieved by Step (2), is larger for a larger Δs or a larger Δσ_T. Let's define the percentage error (ε) of the equalization scheme as (Δt̄_max − Δt̄)/Δt̄_max × 100, where Δt̄_max and Δt̄ are the maximum possible and the achieved (by the heuristic) reductions in t̄, respectively. In this set of results, the average ε was 5.1%.

Figure 3.3: Average parallel execution time on two computers (a) with a fixed Δσ_T = σ_{T_1} − σ_{T_2} = 50 (σ_{T_2} = 10) and (b) with a fixed Δs = 4, where Δs = s_2/s_1 (s_1 = 1) and Δσ_T = σ_{T_1} − σ_{T_2} (σ_{T_2} = 0).

In Figure 3.4, the percentage reduction in t̄, which is achieved by Step (2), is analyzed with the number of computers (N) varied. In this simulation, s_i increases and σ_{T_i} decreases linearly proportional to i, i.e., a faster computer has a smaller variation of execution time. The percentage reduction in t̄ is defined to be (t̄^(1) − t̄^(2))/t̄^(1) × 100, where t̄^(1) and t̄^(2) are the t̄ achieved by Step (1) and Step (2), respectively. It can be seen that the percentage reduction increases as N increases. That is, the performance improvement achieved by Step (2) becomes greater when a larger number of computers is employed.

Figure 3.4: Percentage reduction in average parallel execution time as a function of the number of computers (a) with a fixed Δσ_T = σ_{T_1} − σ_{T_N} = 60 (σ_{T_N} = 0) and (b) with a fixed Δs = 5, where Δs = s_N/s_1 (s_1 = 1) and Δσ_T = σ_{T_1} − σ_{T_N} (σ_{T_N} = 0). After Step (1), t_i = 100 for all i.

In Figure 3.4-(a), Δσ_T is fixed independent of N. However, in reality, Δσ_T would usually increase as N increases. Therefore, in such a case, one may expect a larger reduction in t̄. Note that a larger Δσ_T leads to a greater reduction in t̄, as shown in Figure 3.4-(b).

Chapter 4
Effects of Heterogeneity on an Individual Network Traffic

In this chapter, how heterogeneity in channel bandwidth affects the performance of an individual message is analyzed. Also, application of the analysis results to path selection and multi-path data transfer is considered.

4.1 Network Model

4.1.1 Topology

In this dissertation, a communication network is represented by a graph where a node may be a user, a switch, a router, etc., and an edge is a wired or wireless channel or link. The bandwidth of a channel refers to the bandwidth that is available for, or can be allocated to, an individual message (request) on the channel.
A path is a set of consecutive edges (links) from one node to another. Two types of paths are of interest: the serial path and the parallel path. A serial path consists of N serially connected links from a source node to a destination node through N − 1 intermediate nodes (routers), as illustrated in Figure 4.1-(a). The bandwidth B_i(t) of link i (l_i in the figure), specified in terms of packets-per-second (pps), is assumed to be randomly distributed between B_i^min and B_i^max with the mean b_i and the standard deviation σ_{B_i}, where 1 ≤ i ≤ N. B_i(t) remains constant over a period of time, referred to as an interval, T_interval, as illustrated in Figure 4.2. Also, the normalized standard deviation, σ_{B_i}/b_i, is denoted by σ̄_{B_i}.

Figure 4.1: Illustration of (a) a serial path and (b) a parallel path where N_p = 2.

A parallel path is composed of N_p serial paths which are independent of each other, as illustrated in Figure 4.1-(b). Path j of a parallel path consists of N_j links, where 1 ≤ j ≤ N_p. Link i on path j may be denoted by l_ij along with the associated bandwidth parameters B_ij(t), B_ij^max, B_ij^min, b_ij, and σ_{B_ij}. For a parallel path where all links on each path are identical, b_j, σ_{B_j}, and σ̄_{B_j} will be used to denote the mean, standard deviation, and normalized standard deviation of the bandwidth of all links on path j, respectively.

The transmission rate at the source node follows the bandwidth of the bottleneck link (the link with the minimum bandwidth) in the path. Note that a higher transmission rate would lead to a higher throughput, but at the expense of a longer (queueing) delay. The source node is informed of any change in B_i(t) with a delay of T_feedback, and updates its transmission rate at an interval of T_update. Practically, the value of T_update should not be shorter than the average one-way trip time from the destination node to the source node.

Figure 4.2: Temporal bandwidth heterogeneity on each channel and spatial bandwidth heterogeneity among channels.

Temporal heterogeneity of the bandwidth is defined for each channel or link i and is quantified by σ̄_{B_i}. This indicates the temporal fluctuation of the bandwidth (available for a message) about its mean. Spatial heterogeneity is defined for a group of channels or links and may be quantified by a measure of variation among the members of the set {σ̄_{B_i}}, such as the difference between the maximum and the minimum.

There are other factors in addition to channel bandwidth which can be considered in a network model. However, the main focus of this chapter is on analyzing the effects of temporal and spatial variations in the bandwidth allocated to a message on QoS's. Therefore, those factors are not explicitly considered in the model. Equivalently, the model takes only their combined "net" effect on the allocated bandwidth into account.

4.1.2 Performance Measures

Some communication performance measures used in this dissertation are defined below.

Throughput (THR) is defined to be the size of data that a path (or multi-path) is able to transfer in a unit time, in terms of packets-per-second (pps) when the protocol has a uniform packet size, or bits-per-second (bps). This measure reflects the effective bandwidth of the network path.
This is a long-term measure, which is evaluated at the destination node.

End-to-End Delay (T_ee) is defined to be the time it takes to transmit a packet from the source to the destination. This can be decomposed into four components, i.e., propagation delay (T_p), transmission delay (T_tr), queueing delay (T_q), and other overhead (T_o). T_o represents the overhead including system processing time and packet assembling or disassembling time; T_p is related to the physical property of the network, which is determined by the speed of the electro-magnetic wave on a certain medium (copper, fiber, etc.); T_q is the waiting time that a packet spends in queueing into and dequeuing from the buffers of the routers on the path, and reflects the mismatch of bandwidth between two successive links; and T_tr can be expressed in terms of the transmission rate (r(t)) as follows:

T_tr(t) = 1/r(t)        (4.1)

Note that there is a trade-off between T_tr and T_q: when the source rate r(t) is increased, T_tr decreases since it is inversely proportional to r(t), while T_q tends to increase when r(t) gets faster if there are packets buffered. This dissertation mainly focuses on T_tr and T_q because they are the two main components of T_ee, while the other two components are also included in the simulation model.

Dropping Rate (DR) is defined to be the ratio of the number of dropped packets to the total number of packets sent. When the available bandwidth drops below the source rate, packets are queued. Packet dropping occurs when packets are sent at a rate higher than the current available bandwidth of an outgoing link over a long enough period of time that the buffer overflows. Therefore, a good source rate control scheme should at least follow, and not go above, the average available link bandwidth (a long-term measure) over a long period of time.

Transfer Time (T_trans) is the time it takes to send a given size of data from a source to a destination.

4.2 Effects of Heterogeneity

In this section, the effects of heterogeneity on the QoS's of an individual message are analyzed by computer simulations. The bandwidth of each channel, B_i(t), is assumed to be independently and uniformly distributed between B_i^min and B_i^max. Other distributions, such as a truncated Gaussian distribution, have been considered, but lead to qualitatively similar results. Therefore, only the results for the uniform distribution are provided in this dissertation.

4.2.1 Feedback Latency

In Figure 4.3, where the notation T_interval = [X, Y] indicates that T_interval is uniformly distributed between X and Y, the effects of T_feedback on throughput and end-to-end delay are analyzed for a serial path of 4 links. It is seen that throughput is almost independent of T_feedback, while it decreases as σ̄_B increases (refer to Figure 4.3-(a)). This is because throughput is mainly determined by the average bandwidth of the path, as will be discussed later on. However, T_feedback has a significant effect on the end-to-end delay, as can be seen in Figure 4.3-(b). The longer the feedback latency is, the longer the resulting end-to-end delay. When the feedback latency is longer, the duration of mismatch between the source rate and the bandwidth of the bottleneck link is longer, leading to a longer end-to-end delay per packet due to longer queues along the path.

Figure 4.3: Effects of T_feedback on (a) throughput and (b) end-to-end delay. T_interval = [0.01, 5.00] s, T_update = 0.01 s, N = 4, b_i = 1000 pps, and σ̄_{B_i} = σ̄_B for all i.
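To illustrate the T_tr / T_q trade-off and the dropping behavior described above, here is a toy fluid-queue sketch (ours, with arbitrary parameters; the actual simulations in this chapter use the path model of Section 4.1, not this simplification). The bandwidth is redrawn every step for simplicity, and the mean queueing delay is estimated from the average queue length via Little's law.

import random

def delay_and_drops(rate, b=1000.0, sigma_bar=0.3, buffer_pkts=20,
                    dt=0.001, steps=200_000):
    """Packets arrive at a constant source rate and leave at a time-varying
    bandwidth; returns an estimate of (Ttr + Tq, dropping rate)."""
    half = (3 ** 0.5) * sigma_bar * b
    queue = sent = dropped = queue_sum = 0.0
    for _ in range(steps):
        bw = random.uniform(b - half, b + half)   # bandwidth for this step
        arriving = rate * dt
        sent += arriving
        queue += arriving
        if queue > buffer_pkts:                   # buffer overflow -> drops
            dropped += queue - buffer_pkts
            queue = buffer_pkts
        queue -= min(queue, bw * dt)              # packets served this step
        queue_sum += queue
    t_tr = 1.0 / rate                             # transmission delay, Equation 4.1
    t_q = (queue_sum / steps) / rate              # Little's-law estimate of queueing delay
    return t_tr + t_q, dropped / sent

print(delay_and_drops(rate=900.0))    # below the mean bandwidth
print(delay_and_drops(rate=1100.0))   # above it: longer queueing and more drops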
4.2.2 Temporally Heterogeneous and Spatially Homogeneous

4.2.2.1 Serial Path

For a serial path of a single link, it is not difficult to show that the average transfer time, t̄_trans, and the normalized standard deviation of the transfer time, σ̄_{t_trans}, can be derived as t̄_trans = S/b and σ̄_{t_trans} = σ_{t_trans}/t̄_trans = σ̄_B √b/√S, respectively, where S is the size of the data.

Now, consider a serial path of N links where b_i = b and σ_{B_i} = σ_B for all i. The source rate follows the bandwidth of the bottleneck link along the path in the simulation. Therefore, the effective bandwidth, B_effective, of the serial path can be approximated as that of the bottleneck link. The average effective bandwidth, b_effective, which is equivalent to throughput, can be derived as follows (for the uniform distribution):

b_effective = E[B_effective] = E[min_i{B_i}] = b − ((N − 1)/(N + 1)) (B^max − B^min)/2 = b (1 − ((N − 1)/(N + 1)) √3 σ̄_B)        (4.2)

From Equation 4.2, one can see that the average effective bandwidth decreases as σ̄_B (temporal heterogeneity) increases, and the decrease is larger (by a fraction as large as √3 σ̄_B) for a longer path (a larger N). The more links are involved in a path, the higher the probability that min_i{B_i} will be smaller. This is verified by the simulation results in Figure 4.4-(a), where it is clear that the simulation results show a good match with the theoretical predictions.

Figure 4.4: (a) Throughput and (b) end-to-end delay on a temporally heterogeneous serial path of N links. S = 10^6 packets, b_i = 1000 pps, and σ̄_{B_i} = σ̄_B for i = 1, ..., N.

The end-to-end delay is the sum of the delays on the N links. The delay on an individual link includes the transmission and queueing delays (refer to Section 4.1.2). The queueing delay increases as σ̄_B increases, since the probability and degree of bandwidth mismatch between adjacent links increase, leading to longer queues. Therefore, the end-to-end delay increases as σ̄_B and/or N increase, as shown in Figure 4.4-(b).

4.2.2.2 Parallel Path

In Figure 4.5, the transfer time and throughput achieved on a parallel path are plotted. In this parallel path, each path is composed of a single link. S_j = 10^5 packets for j = 1, ..., N_p, where S_j is the size of the data transferred over path j, i.e., the total size of data is 10^5 N_p packets transferred over N_p paths. It can be seen in Figure 4.5-(a) that the transfer time increases by up to 16% as σ̄_B increases when N_p > 1. The increase in transfer time is larger for a larger N_p. In Figure 4.5-(b), the aggregated throughput over the N_p paths is plotted as a function of N_p. Ideally, it should increase linearly proportional to N_p. However, the throughput is degraded from the ideal one as N_p increases, with a larger degradation for a larger σ̄_B (up to more than 40% when N_p = 19 and σ̄_B = 0.4).

Figure 4.5: (a) Transfer time and (b) throughput on a temporally heterogeneous parallel path consisting of N_p paths with one link per path. S_j = 10^5 packets and b_j = 1000 pps for j = 1, ..., N_p.

Note that the results in Figure 4.5 are for the cases where each path consists of a single link. By referring to Figure 4.4, it should not be difficult to see that the effects of σ̄_B and/or N_p are larger on a parallel path when the number of links per path is greater than one, as verified in the simulation.
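Equation 4.2 is easy to check numerically. The sketch below (ours) draws N independent uniform link bandwidths with mean b and normalized standard deviation σ̄_B and compares the sample mean of their minimum with the closed-form prediction.

import random

def effective_bw_uniform(N, b=1000.0, sigma_bar=0.3, trials=200_000):
    """Monte-Carlo check of Equation 4.2: the mean effective bandwidth of a
    serial path is E[min_i Bi] when each Bi is uniform with mean b and
    normalized standard deviation sigma_bar."""
    half = (3 ** 0.5) * sigma_bar * b            # half-width of the uniform range
    est = sum(min(random.uniform(b - half, b + half) for _ in range(N))
              for _ in range(trials)) / trials
    predicted = b * (1.0 - (N - 1) / (N + 1) * (3 ** 0.5) * sigma_bar)
    return est, predicted

print(effective_bw_uniform(4))    # the two numbers should agree closely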
4.2.3 Temporally and Spatially Heterogeneous

In addition to temporal heterogeneity on each link, different links may exhibit different characteristics, i.e., b_i ≠ b_j and/or σ̄_{B_i} ≠ σ̄_{B_j} for i ≠ j. In this section, only the results for cases where b_i = b for all i, but σ̄_{B_i} ≠ σ̄_{B_j} for i ≠ j, are presented. However, in practice, even when b_i ≠ b_j, it is often the case that the allocated average bandwidth is the same for all links. For instance, it would not be optimal to allocate different average bandwidths on different links along a serial path, since the overall effective bandwidth of the path mainly depends on the bottleneck link, i.e., the link with the lowest (average) bandwidth. Note that a link with a large σ̄_B may become a bottleneck link even though its average bandwidth is not the lowest. Nevertheless, cases where b_i ≠ b_j are considered for the application examples in Section 4.3.

4.2.3.1 Serial Path

In Figure 4.6, b_i = 1000 pps for all i, and σ̄_{B_1} is varied with σ̄_{B_i} fixed for i = 2, ..., N. In Figures 4.6-(a) and (b), σ̄_{B_i} is fixed at 0.1 for i = 2, ..., N, and σ̄_{B_1} is varied from 0 to 0.52. It is clear that throughput is degraded significantly (by close to 20%) due to the spatial heterogeneity in {σ̄_{B_i}} as σ̄_{B_1} increases. The relative degradation in end-to-end delay is larger, as shown in Figure 4.6-(b). As the spatial heterogeneity among links increases, the probability and degree of bandwidth mismatch between adjacent links increase, which causes a longer end-to-end delay for each packet, as discussed in Section 4.2.2.1. When σ̄_{B_i} for i = 2, ..., N is fixed at 0.52, the change in the spatial heterogeneity of {σ̄_{B_i}} is smaller as σ̄_{B_1} increases from 0 to 0.52. Therefore, the relative degradation in throughput and end-to-end delay as σ̄_{B_1} increases is less, as can be seen in Figures 4.6-(c) and (d).

4.2.3.2 Parallel Path

In Figure 4.7, the average bandwidth is the same for all links, but σ̄_{B_j} is distributed linearly among the N_p paths such that its mean among the paths is unchanged. In quantifying spatial heterogeneity, Δσ̄_B = σ̄_{B_1} − σ̄_{B_Np} is adopted. Over each of the N_p paths, the same amount of data (S_j = 10^5 packets for all j) is transmitted, and the transfer time is determined by the "slowest" path. The results show that the transfer time increases substantially as Δσ̄_B gets larger. The increase in transfer time is larger when each path becomes longer. This is because the temporal (effective) bandwidth heterogeneity of each path increases as the path length increases.

Figure 4.6: Throughput and end-to-end delay on a temporally and spatially heterogeneous serial path with b_i = 1000 pps for i = 1, ..., N. (a) & (b): σ̄_{B_i} = 0.1 for i = 2, ..., N; (c) & (d): σ̄_{B_i} = 0.52 for i = 2, ..., N.
Figure 4.7: Transfer time on a temporally and spatially heterogeneous parallel path: (a) each path is a single link, and (b) each path consists of 2 links, where Δσ̄_B = σ̄_{B_1} − σ̄_{B_Np}. b_j = 1000 pps and S_j = 10^5 packets for j = 1, ..., N_p. σ̄_{B_j} is varied linearly over the N_p paths such that its mean is fixed at 0.26.

It is also seen that the transfer time is longer for a larger N_p. As more paths are involved in a parallel path, the probability that the variation of effective bandwidth among the paths is larger increases, which makes the (average) transfer time longer. Note that the transfer time of a message transmitted over multiple paths depends on the "slowest" path, i.e., the path which finishes its transfer last.

4.3 Applications

In this section, two examples are considered where the effects of heterogeneity in channel bandwidth may be taken into account in order to improve (optimize) certain QoS's.

4.3.1 Path Selection

In Figure 4.8-(a), a parallel path is shown where each path consists of 4 links. Note that the average bandwidth of the links on path 3 is higher than those on paths 1 and 2, but σ̄_{B_j} (heterogeneity) is also larger for the links on path 3 than for those on paths 1 and 2. σ̄_{B_j} is constant for all links on path 1, while it is linearly distributed over the links on path 2. Note that the mean of σ̄_{B_i} over all links on each path is the same for both paths 1 and 2.

Figure 4.8: (a) A parallel path where the number of links is the same for all paths, and (b) a parallel path where the number of links varies with path. The number pair in brackets over each link represents b_i and σ̄_{B_i} of the link.

In Figure 4.9, the three paths are compared in terms of end-to-end delay and transfer time. First, it is seen that path 3 performs worst among the three in both QoS's, though its average bandwidth is higher than that of the other paths. This is due to its high temporal heterogeneity in bandwidth. Comparing paths 1 and 2, which have identical b_j, one can see that path 1 performs slightly better than path 2, since the spatial heterogeneity (among links) is higher on path 2 than on path 1. Therefore, one should consider {σ̄_{B_i}} in addition to {b_i} in selecting a path, in order to optimize the QoS's.

Figure 4.8-(b) shows another example of a parallel path where the number of links varies with path; in this case, 2, 3, and 4 links on paths 1, 2, and 3, respectively, are considered. All links have the same average bandwidth, but σ̄_{B_j} is largest on path 1 and smallest on path 3.

Figure 4.9: Comparison of the three paths in Figure 4.8-(a) in terms of (a) end-to-end delay and (b) transfer time.

In Figure 4.10, the same QoS's, end-to-end delay and
The smallest delay and transfer time are achieved on path 3 which is the \longest" (the largest number of links) among the three paths. Path 1 performs worst due to its highest (temporal) heterogeneity though it is the \shortest." Therefore, one should not simply select the shortest path even when the average bandwidth is the same on all links. 105 1060 0.01 0.02 0.03 0.04 0.05 Size of Data End?to?End Delay (s) path1path2path3 105 1060 500 1000 1500 2000 2500 Size of Data t trans (s) path1path2path3 (a) (b) Figure 4.10: Comparison of the three paths in Figure 4.8-(b) in terms of (a) end-to- end delay and (b) transfer time. 54 Implementation Two cases may be considered: (i) the source has the information on the behavior (bi;? Bi) of each link, and (ii) the source does not have this information. In the case (i), an analytic formula or look-up table which relates a given QoS to the set fbi;? Big characterizing a path is to be derived. If a path is temporally heterogeneous but spatially homogeneous, an analytic formula may be obtained for certain distributions as shown in Section 4.2.2.1 for the QoS of throughput which is proportional to the efiective bandwidth of a path. However, for a temporally and spatially heterogeneous path, it is much more challenging. With such an analytic formula or table available, a source refers to it in order to select a path when there are more than one possible path. In case (ii), the source needs to estimate a given QoS by measuring it for each possible path before the selection process. 4.3.2 Multi-Path Data Transfer Consider cases where multiple paths (or a parallel path) are available for trans- ferring a large size of data. That is, the data is partitioned for simultaneous transfers over multiple paths. In order to minimize the overall transfer time, it is necessary to determine how much data is to be transferred over each path (or equivalently trans- mission rate for each path), considering bandwidth heterogeneity on the multiple paths. Let?s deflne percentage reduction in transfer time as ttrans?t0transttrans ?100 where ttrans is the transfer time achieved when the partitioning is done considering only 55 the mean of each link?s bandwidth, and t0trans is the transfer time obtained by con- sidering the normalized standard deviations (heterogeneity) of link bandwidths also. Similarly, percentage reduction in end-to-end delay may be deflned as tee?t0eetee ?100. Recall that S denotes the total size of data and Sj the size of data to be trans- ferred over path j, i.e., S = Pj Sj. In Figure 4.11, cases where 2 paths are employed for data transfer and all links have the same average bandwidth (b = 1000 pps) are considered. Note that one would normally partition S so that S1 = S2 since b1 = b2. However, that does not lead to the optimal performance. In Figure 4.11-(a), both paths have the same number of links (N). It can be seen that a signiflcant reduction of up to 28% in transfer time is achieved by assigning more data to the path with lower temporal heterogeneity, path 1 in this example. Also, the reduction becomes larger when there is a larger difierence (spatial heterogeneity) between ? B1 and ? B2 and when each path is longer (a larger N). In Figure 4.11-(b), cases when each path has difierent number of links (N1 6= N2) are considered. The similar observations can be made in this result. However, the reduction is larger since N1 6= N2 leads to higher spatial heterogeneity (between the two paths). 
The results for cases where b_1 ≠ b_2, i.e., the two paths have different average link bandwidths, are shown in Figure 4.12. S_j is the size of the data to be transferred over path j when only the means of the link bandwidths {b_j, j = 1, 2} are taken into account, i.e., S_j is linearly proportional to b_j. Let S'_j denote the size of the data to be transferred over path j when considering {σ̄_{B_j}, j = 1, 2} in addition to {b_j, j = 1, 2}.

Figure 4.11: Reduction in transfer time on a parallel path (N_p = 2). b_j = 1000 pps for j = 1, 2. S = S_1 + S_2 = 2×10^6 packets. (a) N_1 = N_2 = N; (b) N_1 = 2, N_2 = 4.

In Figures 4.12-(a) and (b), it can be seen that the minimum transfer time (maximum reduction) is achieved by assigning a size of data larger than S_1 to path 1, on which the temporal heterogeneity (σ̄_{B_1}) is lower than on path 2. The larger the difference in temporal heterogeneity between the two paths (i.e., the larger the spatial heterogeneity), the larger the reduction. Also, a larger reduction is possible when the path with lower temporal heterogeneity (path 1) has a higher average bandwidth. In addition, there exists an optimal point at which the reduction is maximized, and the value of (S'_1 − S_1)/S for the optimal point depends on b_1 and b_2.

In Figures 4.12-(c) and (d), the reduction in the end-to-end delay shows a monotonic behavior, i.e., it always increases as (S'_1 − S_1)/S increases. This is the case since the per-packet delay is reduced as a larger fraction of the message is transmitted over the path with lower temporal heterogeneity, path 1 in this case (σ̄_{B_1} < σ̄_{B_2}). Therefore, one is to select a proper value of (S'_1 − S_1)/S depending on which QoS is to be optimized. Note that the heterogeneity among paths is likely to increase as N_p increases, and this would lead to an even larger maximum reduction in the transfer time.

Figure 4.12: Reduction in transfer time ((a), (b)) and end-to-end delay ((c), (d)) on a parallel path with N_p = 2. N_j = 2 for j = 1, 2. S = S_1 + S_2 = S'_1 + S'_2 = 2×10^6 packets, σ̄_{B_1} = 0.06, σ̄_{B_2} = 0.46, and b_1 + b_2 = 2000 pps. (a), (c): b_1 ≤ b_2; (b), (d): b_1 ≥ b_2.

Implementation

The main issue in multi-path data transfer is how to divide a given large message over the multiple paths to be utilized simultaneously for the message, given the effective bandwidth and its variation (b_j, σ̄_{B_j}) for each path j. Consider the QoS of the transfer time of a message. How the message is to be divided to minimize its overall transfer time is equivalent to dividing a computational task over multiple temporally heterogeneous computers to minimize the parallel execution time. In Chapter 3, an effective load balancing strategy was described along with its performance analysis results.
This strategy consists of two steps: in the first step, a task is partitioned proportional to the average computing power available on the computing nodes, and in the second step, the partitioning is adjusted by considering the standard deviations of the available computing powers. A similar scheme can be employed for balancing the communication load over temporally heterogeneous paths.

Chapter 5
Source Rate Control for Heterogeneous Communication Systems

Based on the analysis results in Chapter 4, the issue of controlling the source rate in order to optimize the performance of a communication task is considered, and a heterogeneity-aware approach to source rate control is described along with simulation results in this chapter.

5.1 Temporal Heterogeneity under Different Time Scales

Temporal heterogeneity of the available bandwidth on a communication path depends on the activity patterns of different network applications and protocols, and can be decomposed into two components, long-term and short-term, in the time domain.

Long-term heterogeneity is caused by the initiation and termination of long sessions, such as large file transfers, real-time video and audio streaming, and other bandwidth reservation applications. These types of applications "dominate" the variation of available network resources (bandwidth, buffer, etc.), and exhibit slow-varying characteristics.

Short-term heterogeneity is due to short-duration network traffic, such as the overhead in packet processing, buffering, and forwarding, and other sporadic network activities. Examples include Internet control messages (ICMP, RSVP), user inputs/outputs of a terminal session (telnet), browsing (small) web pages (http), messenger services (msn/yahoo messenger, icq), most of which are based on UDP, and short email messages (smtp, pop3). Dynamic behaviors of some transport layer protocols, like AIMD in TCP, also cause some short-term variation. This type of heterogeneity is relatively unpredictable.

Figure 5.1: Illustration of long-term and short-term temporal heterogeneity of the available bandwidth sensed by one flow (flow0) on a path.

A simplified illustration of long-term and short-term temporal heterogeneity, where multiple flows share a bottleneck link with a capacity of 1 Mbps, is given in Figure 5.1. The available bandwidth sensed by a certain long-duration flow (say flow0) over time is plotted in this figure. The queueing scheme is assumed to work in a fairly weighted round-robin fashion, i.e., the ideal effective bandwidth shared by flow0 is 1 Mbps/n_f when there are n_f co-existing long-duration flows. The "contour" of the available bandwidth (marked by the dotted lines) represents the ideal share for flow0, and can be modelled by 1 Mbps/n_f. Starting from time A, there are 3 co-existing long-duration flows, so the available bandwidth for flow0 is approximately 1 Mbps/3. At time B, one of the long-duration flows is closed, and the available bandwidth increases to about 1 Mbps/2. When one more flow is dropped at time C, all the bandwidth of the link (1 Mbps) is allocated solely to flow0, but after another flow is introduced at time D, the available bandwidth drops back to about 1 Mbps/2. Compared to the short-term temporal heterogeneity (the "spurs" in Figure 5.1), the long-term heterogeneity has a longer time duration and is much easier to track.
For example, when the long-term sessions have a duration significantly longer than the synchronization interval of a rate control scheme, the source rate can be adjusted so that it follows the shape of the available bandwidth contour. However, it is neither possible nor necessary to follow the short-term spurs, because they usually vary much faster than the rate control scheme can handle.

5.2 HAA Source Rate Control

5.2.1 Overview of the HAA Source Rate Control Scheme

The main idea of the HAA source rate control scheme is to have the source rate follow the long-term variation in the available bandwidth, and at the same time to adjust it by means of a "penalty factor" which is derived from the short-term heterogeneity (refer to Section 5.2.5). A finite-size feature extractor is used to collect statistical information on the short-term spurs, which is used by the source rate controller. Figure 5.2 shows the general abstract architecture of the HAA source rate control scheme, which consists of four components: Available Resource Measurement, Feature Extractor, Source Rate Allocator, and Source Rate Controller.

Figure 5.2: The abstract architecture of the HAA source rate control scheme.

The Available Resource Measurement component is responsible for measuring the available resources of interest, which include the available bottleneck bandwidth (ABBW) of a network path and the available buffer size on each router. The proposed scheme only measures ABBW. ABBW(t) refers to the minimum bandwidth of all inter-router links at time t, i.e., ABBW(t) = min_{i=1,...,N}{B_i(t)}, where i is the link id and B_i(t) is the available bandwidth of link i. Later in this dissertation, the notation ABBW_l will be used to represent ABBW measurement samples, where l is the sample index.

The Feature Extractor extracts quantified heterogeneity information (features) from the measurement samples of the available resources. Typical features are the first-order moment (mean), the second-order moment (deviation), and the minimum, maximum, and median of the measurement samples over a certain time duration. Currently, only the mean, deviation, minimum, and maximum are used in the proposed scheme. The Source Rate Allocator is responsible for computing the appropriate source rate for the corresponding communication session based on the extracted features, and the Source Rate Controller is the component which actually controls the source rate.

Figure 5.3: Illustration of the proposed HAA communication model.

The schematic diagram of the proposed HAA source rate control communication model is shown in Figure 5.3. The receiver is responsible for measuring the bottleneck bandwidth and extracting the features from the bandwidth measurement samples. It then periodically (by default, once per round-trip time) sends control packets, which contain the information on the allocated rate, to the sender. Upon receiving each control packet, the sender tries to adapt to the allocated source rate.

5.2.2 Bottleneck Bandwidth Measurement

The bottleneck bandwidth is measured with a packet-pair [28][29][30] based approach in the current implementation. The source sends out data in the form of back-to-back packets, and the measurement of ABBW(t) is performed on the receiver side. An up-link (from receiver to sender) carries control packets, and a down-link (from sender to receiver) transmits data. The latter has a much heavier load and experiences higher heterogeneity than the former.
As a result, the receiver-side measurement can obtain more accurate and timely information on the network load.

Figure 5.4 shows the transmission of packets along a network path. In order to avoid errors due to lost or out-of-order packets, ABBW is measured only for "valid packet pairs" received. A valid packet pair consists of two packets received successively, where the first packet, with an even sequence number 2l, is followed by the other, with the sequence number 2l + 1. The ABBW measured by the l-th valid packet pair (ABBW_l) is

ABBW_l = 1/(T_{2l+1} − T_{2l})        (5.1)

where T_{2l} and T_{2l+1} are the receiving times of the two packets received consecutively. The time difference ΔT_l = T_{2l+1} − T_{2l} is dependent on the available bandwidth of the bottleneck link.

Figure 5.4: Illustration of packet-pair based bandwidth measurement. The horizontal dimension is time, and from top to bottom along the vertical dimension are the sender, the intermediate links, and the receiver. Numbers in small boxes are the packet sequence numbers.

5.2.3 Feature Extractor

The features that may be used in the proposed scheme include the mean (abbw), standard deviation (σ_ABBW), minimum (ABBW^min), and maximum (ABBW^max) of the available bottleneck bandwidth measurement samples (ABBW_l) within an interval, as shown in Figure 5.5. All features are calculated over a sliding window of size L (in terms of the number of packet pairs), which has a longer time duration than the sampling interval. The feature extractor is designed to extract the features for the measurement samples of the last L packet pairs (from ABBW_{l−L+1} to ABBW_l), where L is the size of the feature extractor and l is the packet-pair index (or sample index). The feature interval is defined to be the time interval over which the L packet pairs are spread.

Figure 5.5: Feature extractor.

In order to reduce the computing and storage complexity, two accumulators (A1, A2) are employed for the sums of ABBW_l and ABBW_l², and two other memory variables (ABBW^max and ABBW^min) are used to record the maximum and minimum values over the last L samples. The initialization code is executed immediately after the first valid packet pair is received and the corresponding ABBW_0 is obtained (it resets A1 and A2 to L·ABBW_0 and L·ABBW_0², respectively); the iteration code updates the sums of the last L values of ABBW_l and ABBW_l² after each valid packet pair l is received; and the extraction code is performed at every control interval, which will be described in Section 5.2.5. These codes are shown in Figure 5.6.

/* initialization code */
A1 = L * ABBW_0;  A2 = L * ABBW_0^2

/* iteration code */
A1 = A1 + ABBW_l - ABBW_{l-L}
A2 = A2 + ABBW_l^2 - ABBW_{l-L}^2

/* extraction code */
abbw     = A1 / L
σ_ABBW   = sqrt( A2/(L-1) - (L/(L-1)) * abbw^2 )
ABBW^max = max(ABBW_j, l-L+1 ≤ j ≤ l)
ABBW^min = min(ABBW_j, l-L+1 ≤ j ≤ l)

Figure 5.6: Initialization, iteration, and extraction codes for the feature extractor.
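Putting Equation 5.1 together with the valid-pair rule, a minimal receiver-side sketch might look as follows (ours; loss and reordering are handled simply by skipping incomplete pairs, and the resulting samples would feed the accumulators of Figure 5.6).

def abbw_samples(recv_times):
    """Packet-pair estimate of the available bottleneck bandwidth
    (Equation 5.1): for each valid pair with sequence numbers (2l, 2l+1),
    ABBW_l = 1 / (T_{2l+1} - T_{2l}).  recv_times maps sequence number to
    receive time; pairs with a missing or reordered packet are skipped."""
    if not recv_times:
        return []
    samples = []
    last_pair = max(recv_times) // 2
    for l in range(last_pair + 1):
        t_even = recv_times.get(2 * l)
        t_odd = recv_times.get(2 * l + 1)
        if t_even is not None and t_odd is not None and t_odd > t_even:
            samples.append(1.0 / (t_odd - t_even))   # packets per second
    return samples

# Two pairs with 1 ms gaps -> roughly 1000 pps each.
print(abbw_samples({0: 0.000, 1: 0.001, 2: 0.050, 3: 0.051}))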
5.2.4 Simple Source Rate Allocation Approaches

Three simple source rate control schemes are described for comparison with the proposed scheme (HAA):

Mean-Only Scheme: The source rate may be controlled to follow abbw, the mean of the measured ABBW. This scheme is easy to implement, and the throughput is close to the bottleneck link's capacity. However, it causes a high dropping rate and a long end-to-end delay due to the limited packet buffer size.

Scaling Scheme: A scaling factor, α (0 < α < 1), may be employed to make the system work under a lighter load, thus lowering the dropping rate and end-to-end delay [29][30]. The main problem with this scheme is that it sacrifices throughput unnecessarily when the variation of ABBW is small. For example, when α is chosen to be 0.9, 10 percent of the bandwidth is always wasted, even if the bandwidth variation is very small (or there is no variation). The situation is even worse when the bandwidth variation is large and α is not small enough.

Minimum-Bandwidth Scheme: The source rate may be set to ABBW^min at each synchronization interval. However, this is an overly conservative scheme, especially when the ABBW variation is large. For example, when there is a deep valley (ABBW^min) in a feature interval, as shown in Figure 5.5, a lot of bandwidth would be wasted, since most of the time the available bandwidth is above this value.

5.2.5 Heterogeneity-Aware Approach

Synchronization

In order to alleviate the network load, control packets are sent to the sender at every synchronization interval. The synchronization interval determines how frequently control packets are sent to the sender; therefore, the shorter the synchronization interval is, the faster the sender can adapt the source rate to the variation of the available bandwidth, although more of the up-link bandwidth will be consumed. Usually, it should not be shorter than one round-trip time.

The synchronization interval in the proposed scheme is a function of time, T_sync(t), which is set to be smaller when a larger long-term variation is detected, so that the source rate can adapt to the actual ABBW faster. The effect of T_sync is discussed in the next section.

HAA Source Rate Function

The main idea of the proposed scheme is to derive the desired source rate r_s at every synchronization interval, based on the set, EF_k, of extracted features, including abbw_k, σ̄_{ABBW_k}, ABBW_k^min, and ABBW_k^max, where k is the synchronization interval index. A general form of the HAA source rate function f(ABBW_k, EF_k, t) is shown in Equation 5.2:

r_s(k) = f(ABBW_k, EF_k, t)        (5.2)

An implementation of the source rate function is shown in Equation 5.3, which can detect and utilize both long-term and short-term heterogeneity:

r_s(k) = max[ABBW_k^min, abbw_k (1 − β σ̄_{ABBW_k})]   for σ̄_{ABBW_k} ≤ σ̄_threshold
r_s(k) = ABBW_k                                        for σ̄_{ABBW_k} > σ̄_threshold        (5.3)

where σ̄_threshold is selected such that σ̄_{ABBW_k} exceeds it when there is a clear large-scale edge (rising or falling) in ABBW. A significant transition is most likely due to the release or injection of a long-term competing flow, and the magnitude of the corresponding rise or fall in ABBW is assumed to be significantly larger than the variation due to the short-term heterogeneity. When σ̄_{ABBW_k} is less than σ̄_threshold, r_s(k) is penalized by β σ̄_{ABBW_k}, but is set no lower than ABBW_k^min. β is a penalty factor which controls how much the source rate should be penalized for the heterogeneity. The rationale is that the source rate receives a smaller penalty when σ̄_{ABBW_k} (short-term heterogeneity) is small, and a larger penalty when σ̄_{ABBW_k} is large. When σ̄_{ABBW_k} becomes greater than σ̄_threshold, i.e., there is a clear rising or falling edge, r_s(k) is set to the most recent ABBW_k in order to obtain a more accurate estimation of the long-term variation in the current implementation. Also, in order to get a faster response, T_sync is made much smaller (thus faster) than in the other operation mode. Due to the delay from ABBW to abbw, the source rate is over-estimated on falling transitions and under-estimated on rising transitions. A more sophisticated predictive scheme could be developed to further improve the performance.
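Equation 5.3 can be written directly as a small function (a sketch; the argument names and the example threshold value are ours, and the most recent ABBW sample is used for the second branch, as described above).

def haa_rate(abbw_k, sigma_bar_k, abbw_min_k, abbw_latest_k,
             beta=0.7, sigma_threshold=0.5):
    """Equation 5.3: below the threshold the mean rate is penalized by
    beta * sigma_bar but never pushed under the observed minimum; above the
    threshold (a clear rising or falling edge) the most recent ABBW sample
    is used directly."""
    if sigma_bar_k <= sigma_threshold:
        return max(abbw_min_k, abbw_k * (1.0 - beta * sigma_bar_k))
    return abbw_latest_k

print(haa_rate(abbw_k=1000.0, sigma_bar_k=0.2, abbw_min_k=700.0, abbw_latest_k=950.0))
print(haa_rate(abbw_k=1000.0, sigma_bar_k=0.8, abbw_min_k=400.0, abbw_latest_k=1800.0))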
Adaptive Penalty (β)

The main issue in the proposed HAA approach to source rate control is how to select a proper value of the penalty factor (β). In many cases, the users specify QoS requirements that their applications should meet. A typical specification of QoS requirements is given below:

    Throughput:        thr ≥ THR_min
    End-to-End Delay:  t_ee ≤ T_ee^max
    Dropping Rate:     dr ≤ DR_max        (5.4)

The means (averages) of the throughput (THR), end-to-end delay (T_ee), and dropping rate (DR) are denoted by thr, t_ee, and dr, respectively. THR_min is the minimum throughput requirement, and T_ee^max and DR_max are the upper bounds on the end-to-end delay and dropping rate, respectively. These are all user-specified parameters, and are given during the session admission control stage.

Figure 5.7: QoS requirements and valid range of β.

The proposed scheme determines the valid range of β so that all the QoS requirements can be met by adaptively adjusting β. This is based on the fact that each QoS measure mentioned above is a monotonic function of β, and hence the upper/lower bound of β for each QoS measure can be found via a simple heuristic search:

    Throughput:        β^thr ≤ β^thr_max
    End-to-End Delay:  β^tee ≥ β^tee_min
    Dropping Rate:     β^dr ≥ β^dr_min        (5.5)

where β^thr_max, β^tee_min, and β^dr_min are the bounds of β that satisfy the requirements on throughput, end-to-end delay, and dropping rate, respectively, as shown in Figure 5.7. The valid range of β which satisfies all the QoS requirements is then

max(β^tee_min, β^dr_min) ≤ β_valid ≤ β^thr_max        (5.6)

If max(β^tee_min, β^dr_min) > β^thr_max, no β can be found to meet all the QoS requirements, and the admission control thus rejects the flow.

A "Slowly-Decreasing-Fast-Increasing" (SDFI) scheme is proposed to adaptively determine a near-optimal β. It goes through a 3-stage cycle when a QoS requirement violation is detected. Let Δβ_slow denote the "slowly decreasing step" (0.1 by default), Δβ_fast the "fast increasing step" (0.6 by default) of β, and β_0 the initial value in the cycle. To reduce the oscillation in the adaptation, a margin is introduced for each QoS measure (QoS_margin), and the adjusted QoS measures are then

    THR'_min  = THR_min + QoS^thr_margin
    T_ee'^max = T_ee^max − QoS^tee_margin
    DR'_max   = DR_max − QoS^dr_margin        (5.7)

The margin for each QoS measure depends on the tolerance requirements of the application. It is either provided by the user as a part of the QoS requirements, or set by the SDFI scheme via experiments. The adjusted QoS measures are "tighter" than those given in the QoS specification, and are used during the Slowly Decreasing stage. The complete pseudo-code of the SDFI scheme is shown in Figure 5.8.

while (true) {
    // Fixed stage
    while (t_ee < T_ee^max && dr < DR_max && thr > THR_min);
    // Fast Increase stage
    β = β + Δβ_fast;
    // Slowly Decrease stage
    while (t_ee < T_ee'^max && dr < DR'_max && β ≥ Δβ_slow) {
        β = β − Δβ_slow;
    }
    β = β + Δβ_slow;
    if (thr < THR'_min) exit();   // cannot meet all requirements
}

Figure 5.8: Pseudo-code of the proposed SDFI scheme.
A "Slowly-Decreasing-Fast-Increasing" (SDFI) scheme is proposed to adaptively determine a near-optimal β. It goes through a 3-stage cycle when a QoS requirement violation is detected. Let Δβ_slow denote the "slowly decreasing step" (0.1 by default), Δβ_fast the "fast increasing step" (0.6 by default) of β, and β_0 the initial value in the cycle. To reduce the oscillation in the adaptation, a margin is introduced for each QoS measure (QoS_margin), and then the adjusted QoS measures are

$$ \begin{cases} THR'_{min} = THR_{min} + QoS^{thr}_{margin} \\ Tee'_{max} = Tee_{max} - QoS^{tee}_{margin} \\ DR'_{max} = DR_{max} - QoS^{dr}_{margin} \end{cases} \qquad (5.7) $$

The margin for each QoS measure depends on the tolerance requirements of applications. It is either provided by the user as a part of the QoS requirements, or set by the SDFI scheme via experiments. The adjusted QoS measures are "tighter" than those given in the QoS specification, and are used during the Slowly Decreasing stage. The complete pseudo-code of the SDFI scheme is shown in Figure 5.8.

    while (true) {
        // Fixed stage
        while (tee < Teemax && dr < DRmax && thr > THRmin);
        // Fast Increase stage
        β = β + Δβfast;
        // Slowly Decrease stage
        while (tee < Tee'max && dr < DR'max && β >= Δβslow) {
            β = β - Δβslow;
        }
        β = β + Δβslow;
        if (thr < THR'min)
            exit();   // cannot meet all requirements
    }

Figure 5.8: Pseudo-code of the proposed SDFI scheme.

(1) Fixed stage: the receiver keeps checking all the QoS measures, and β remains unchanged until any of them violates its requirement.

(2) Fast Increasing stage: β is incremented by Δβ_fast.

(3) Slowly Decreasing stage: β is decremented by Δβ_slow in each step until a QoS measure exceeds its adjusted bound; β is then incremented by Δβ_slow, i.e., it returns to the last valid value.

The operation of the SDFI scheme is illustrated in the time domain in Figure 5.9.

Figure 5.9: Operation of the SDFI scheme.

Chapter 6

Performance of the Source Rate Control Scheme

6.1 Simulation Setup

An extensive simulation has been carried out using a widely used network simulator (ns-2 [44]). The performance of a single HAA flow is first studied. The packet size in all the results in this chapter is 1000 bytes. The size of the buffer on each router is set to 20 packets by default, and the queuing scheme is drop-tail round-robin. The default length (L) of the feature extractor is 15 samples. The default synchronization interval is set to one round-trip time, except when a significant transition in ABBW is detected.

Topology

Figure 6.1: Simulation testbench topology for a single path.

The topology of a single path employed in the simulation is shown in Figure 6.1. The link capacity and propagation delay for the local connection links (thick lines) are 5 Mbps and 3 ms, respectively. For each inter-router link i, the propagation delay is 10 ms and the available bandwidth B_i(t) is a random variable whose upper limit is 5 Mbps.

Queue Management

A round-robin-like queue management is assumed for the network routers in the simulation, since it offers several advantages over first-in-first-out (FIFO) packet schedulers for bursty data traffic. In particular, the round-robin queue management automatically enforces a min-max fair allocation of resources [45], and ensures that the "well-behaving" users are protected from the "ill-behaving" ones, which is desirable in public data networks [30][46].

Traffic Generation

Instead of directly simulating the background traffic, the available bandwidth on each link is changed dynamically in the simulation script. This approach enables easy control of the link bandwidth to simulate the heterogeneity under different scenarios, and also saves simulation time. The varying bandwidth is simulated in two dimensions: magnitude and time. The magnitude variation on a link is independent of those on the other links and follows a certain distribution. Several different distributions have been tried, including the uniform distribution and the truncated normal distribution, which all lead to similar results. Therefore, all simulation results provided in this section assume a uniform distribution. The magnitude of the bandwidth on link i follows a uniform distribution with a mean of b_i and a normalized standard deviation of σ_Bi. In the time dimension, each random magnitude of bandwidth remains unchanged over a random interval (T_interval) which follows an exponential distribution with a mean of t_interval.
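As a concrete illustration of this bandwidth model, the sketch below generates one link's piecewise-constant bandwidth trace under the stated assumptions (uniform magnitude with mean b_i and normalized deviation σ_Bi, exponential holding times with mean t_interval). The function name and output format are illustrative; the actual simulation drives ns-2 from a script instead.

```python
# A minimal sketch of the varying-bandwidth model used for traffic generation.
import math
import random


def bandwidth_trace(b_mean, sigma_norm, t_interval, duration, seed=None):
    """Yield (start_time, bandwidth) steps for one link over `duration` seconds."""
    rng = random.Random(seed)
    # For U(a, b): mean = (a + b) / 2 and std = (b - a) / sqrt(12), so the
    # half-width of the distribution is sqrt(3) * sigma_norm * b_mean.
    half_width = math.sqrt(3.0) * sigma_norm * b_mean
    t = 0.0
    while t < duration:
        bw = rng.uniform(b_mean - half_width, b_mean + half_width)
        yield t, max(bw, 0.0)
        t += rng.expovariate(1.0 / t_interval)   # exponential holding time

if __name__ == "__main__":
    for t, bw in bandwidth_trace(b_mean=2e6, sigma_norm=0.17,
                                 t_interval=0.05, duration=0.3, seed=1):
        print(f"t={t:.3f}s  B={bw / 1e6:.2f} Mbps")
```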
6.2 Single Flow

6.2.1 Adaptation to Long-Term Bandwidth Variation

Figure 6.2: Comparison of three source rate control schemes on a path with an inter-router link whose available bandwidth (ABBW) varies with time. The penalty factor (β) is set to 0.7 for HAA. (a) Rising Edge; (b) Falling Edge.

Figure 6.2 shows how the three different source rate control schemes adapt the source rate to the available bottleneck bandwidth (ABBW). ABBW is generated by a "slow" square wave (with a period of 5 seconds and a magnitude of 1 to 3 Mbps) plus a high-frequency variation. The average interval of the high-frequency component is 50 ms and its magnitude is no greater than 500 Kbps. It is shown that all three schemes can follow ABBW on a large time scale, but the minimum-bandwidth scheme is too conservative to utilize the available bandwidth efficiently, and the mean-only scheme is too greedy, as the allocated bandwidth is higher than ABBW for about half of the period, which leads to a large dropping rate. The proposed scheme (HAA) tends to adopt a rate between those of the mean-only and minimum-bandwidth schemes in an effort to optimize performance measures such as throughput and dropping rate.

6.2.2 Feature Extractor Size and Average Bandwidth Updating Interval

Figure 6.3: Effects of feature extractor size (L) and t_interval. N=4, σ_Bi=0.17 for all 4 links. (a) Dropping Rate; (b) End-to-End Delay.

In Figure 6.3, the effects of the feature extractor size (L) and the average bandwidth updating period (t_interval) on the dropping rate and end-to-end delay are analyzed. Comparing β = 0 and β > 0 (HAA), it is seen that the improvement by HAA is more significant when the bandwidth varies more rapidly (t_interval=0.5, dashed lines in the figure). Also, the results indicate that the feature extractor size L should not be too small. The performance (in terms of the dropping rate and end-to-end delay) for L=15 or 30 is better than that for L=5, while the improvement of L=30 over L=15 is not very large; therefore, L=15 is used in most cases of the simulation.
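A minimal sketch of the receiver-side feature extractor discussed above is given below: a sliding window of the last L ABBW samples from which the mean, normalized standard deviation, minimum, and maximum are computed at every synchronization interval. The class name and interface are assumptions for illustration, not the implementation used in the simulation.

```python
# A minimal sketch of the receiver-side feature extractor (window length L).
from collections import deque
from statistics import mean, pstdev


class FeatureExtractor:
    def __init__(self, window_len=15):          # default L = 15 samples
        self.samples = deque(maxlen=window_len)

    def update(self, abbw_sample):
        self.samples.append(abbw_sample)

    def features(self):
        """Return (abbw mean, normalized sigma, ABBW min, ABBW max)."""
        m = mean(self.samples)
        sigma = pstdev(self.samples) / m if m > 0 else 0.0
        return m, sigma, min(self.samples), max(self.samples)


if __name__ == "__main__":
    fx = FeatureExtractor(window_len=15)
    for s in (2.1e6, 1.8e6, 2.4e6, 1.6e6, 2.0e6):
        fx.update(s)
    print(fx.features())
```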
6.2.3 HAA Compared with the Packet Spacing Protocol (PSP)

HAA essentially uses a packet spacing approach to adjust the source rate at the sender. It is compared with the Packet Spacing Protocol (PSP) [41][42], which is another packet-spacing-based rate control scheme. The comparison results are provided in Figure 6.4. The penalty factor β is set to 0 in this simulation, i.e., non-adaptive and with no heterogeneity penalty.

Figure 6.4: Comparison of HAA (β = 0) with PSP. (a) Source Rate; (b) Effective Throughput; (c) Dropping Rate; (d) End-to-End Delay.

It can be seen that the source rate variation of the proposed scheme is much smaller than that of PSP (Figure 6.4-(a)). The effective throughput (Figure 6.4-(b)) of HAA is about 25% better than that of PSP, and the improvement becomes even larger when the heterogeneity (σ_B) increases. The dropping rate is also decreased significantly, especially when σ_B is small, from 0.004 to almost 0. The only measure that is worse for the proposed scheme than for PSP is the end-to-end delay. However, it should be noted that PSP drops more packets, which are not counted toward the end-to-end delay, and HAA achieves a significantly higher throughput.

6.2.4 HAA versus the Mean-Only and Minimum-Bandwidth Schemes

In Figure 6.5, the effect of β on the performance of HAA is analyzed, and HAA is also compared with the two other fixed rate control schemes (mean-only and minimum-bandwidth) under two different bandwidth-varying scenarios: t_interval = 0.05 s (fast) and t_interval = 0.5 s (slow). In each scenario, three sets of results are obtained for three different levels of heterogeneity (σ_Bi = 0.06, 0.17, and 0.29, respectively) with β varied. The results for HAA are shown in the solid curves and the corresponding results for the minimum-bandwidth scheme in the dotted curves. Note that the results for the mean-only scheme correspond to the y-intercepts of the solid curves (β=0).

Figure 6.5: N = 4; each link's bandwidth B_i(t) follows a uniform distribution with deviation σ_Bi. "(min)" in the legend stands for the minimum-bandwidth scheme. (a), (b) end-to-end delay; (c), (d) dropping rate; (e), (f) throughput, for t_interval = 0.05 and 0.5.

It can be seen that there is a trade-off between the throughput and the end-to-end delay/dropping rate. As β increases, the end-to-end delay and dropping rate decrease, but the effective throughput also decreases. It is the adaptive algorithm's task to find a proper value of β (refer to Section 5.2.5). Also, the effect of β becomes more significant for higher heterogeneity (σ_Bi=0.29, open circles in the figures), especially for the dropping rate and throughput. By utilizing the information on heterogeneity, the end-to-end delay (Figure 6.5-(a) and (b)) and the dropping rate (Figure 6.5-(c) and (d)) can be reduced significantly compared to the mean-only scheme (β=0).
The end-to-end delay and dropping rate in the case of a more rapidly varying bandwidth (t_interval=0.05 in Figure 6.5-(a), (c), and (e)) decrease even faster when the penalty factor (β) is increased, compared to the case of a slowly varying bandwidth (t_interval=0.5 in Figure 6.5-(b), (d), and (f)). However, the throughput also drops more for a more rapidly varying bandwidth. This is because, when t_interval is large (slowly varying bandwidth) and the extractor size (L) remains the same, the measured bandwidth deviation (σ_B) is less than the steady-state (theoretical) deviation, which is defined over a very long time duration; therefore, the throughput is degraded less.

As expected, the minimum-bandwidth scheme matches HAA when β is close to 1.5. The reason is that when the bandwidth variation follows a uniform distribution, the minimum value of the distribution is around mean − √3·dev, and √3 ≈ 1.7. Nevertheless, there is still room for performance optimization. For example, when β is set to 1.0 for t_interval=0.05, the performance of HAA is similar to that of the minimum-bandwidth scheme in terms of the end-to-end delay and dropping rate, although its throughput is significantly (at least 10%) higher than that of the minimum-bandwidth scheme.

It can also be seen that all the performance measures vary monotonically with β. Therefore, with the adaptive rate control scheme one could start from a conservative value of β, such as 2.0, which is slightly more conservative than the minimum-bandwidth scheme, and decrease β by a step size of Δβ until any of the requirements is violated (refer to Section 5.2.5).

6.2.5 Effects of Spatial Heterogeneity

The effects of different spatial distributions of the mean bandwidth (b_i) of each link i are studied and the results are shown in Figure 6.6. Four different scenarios are simulated on a path consisting of 1 to 9 links (a sketch of these per-link distributions follows the list).

- Scenario "one bottleneck": b_i is set to 2 Mbps for the middle link and to 5 Mbps for all the other links.
- Scenario "increasing b_i": b_i linearly increases along the path from 2 Mbps to 5 Mbps (set to 2 Mbps if there is only one link).
- Scenario "decreasing b_i": b_i linearly decreases along the path from 5 Mbps to 2 Mbps (set to 2 Mbps if there is only one link).
- Scenario "uniform b_i": b_i is set to 2 Mbps for all links.
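For reference, the four per-link mean-bandwidth distributions listed above can be generated as in the following sketch; the function name is an illustrative assumption, while the 2 Mbps and 5 Mbps levels are those given in the text.

```python
# A minimal sketch of the four spatial distributions of mean link bandwidth
# used for Figure 6.6, for a path of n_links inter-router links.

def mean_bandwidths(scenario, n_links):
    """Return a list of per-link mean bandwidths b_i in bps."""
    if n_links == 1:
        return [2e6]
    if scenario == "one bottleneck":
        b = [5e6] * n_links
        b[n_links // 2] = 2e6                  # single fixed bottleneck in the middle
        return b
    if scenario == "increasing bi":            # 2 Mbps near the source, 5 Mbps near the sink
        return [2e6 + (5e6 - 2e6) * i / (n_links - 1) for i in range(n_links)]
    if scenario == "decreasing bi":
        return list(reversed(mean_bandwidths("increasing bi", n_links)))
    if scenario == "uniform bi":
        return [2e6] * n_links
    raise ValueError(scenario)


if __name__ == "__main__":
    for s in ("one bottleneck", "increasing bi", "decreasing bi", "uniform bi"):
        print(s, [round(b / 1e6, 2) for b in mean_bandwidths(s, 5)])
```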
Figure 6.6: Effects of path length. All σ_Bi are set to 0.17. (a) Throughput; (b) End-to-end delay.

Figure 6.6 shows that the "uniform b_i" scenario has the worst performance in terms of throughput, because it does not have a fixed bottleneck, i.e., every link has the opportunity to be the bottleneck. According to the previous results (Equation 4.2) on the effective bandwidth of a network path consisting of links with identical mean bandwidth, and the fact that the throughput can be no more than the effective bandwidth (b_effective) of the path, the throughput decreases as the path length increases. In the "one bottleneck" scenario, the throughput is almost independent of the path length. This is because the network path has only one bottleneck in a fixed position. All the other links' bandwidths are much larger than the bottleneck bandwidth in this scenario; therefore, the range of the bandwidth variation of the other links (from 3.5 to 6.5 Mbps) does not overlap with that of the bottleneck link (from 1.4 to 2.6 Mbps). The "increasing b_i" and "decreasing b_i" scenarios fall between the two scenarios discussed earlier. It can be seen that the throughput of the "increasing b_i" scenario is lower and decreases more than that of the "decreasing b_i" scenario. This is because, when the links closer to the source node are slower, the faster links closer to the receiver node will be under-utilized. This effect becomes more significant when the path length increases.

Figure 6.7 shows how different spatial distributions of temporal heterogeneity along the path affect the performance of the proposed scheme. Three different scenarios are simulated for a path of four links. Note that the average of σ_Bi along the path is the same in all cases. The first scenario (cross marks) has the smallest spatial heterogeneity in terms of link bandwidth variation (0.18, 0.18, 0.18, 0.18) along the path, and the third scenario (circle marks) has the largest (0.0, 0.12, 0.24, 0.36). It can be seen that the scenario with the largest spatial heterogeneity leads to the worst performance, with the lowest throughput and the largest dropping rate and end-to-end delay.

Figure 6.7: N = 4, heterogeneous links; all links have the same mean ABBW_i=2 Mbps. Distributions of σ_Bi are as shown in the legends. (a), (b) end-to-end delay; (c), (d) dropping rate; (e), (f) throughput, for t_interval = 0.05 and 0.5.

6.2.6 Packet Inter-arrival Time

The packet inter-arrival time (ΔT) is an important performance measure for a real-time media (video or audio) streaming application. A large variation of ΔT usually indicates a notable discontinuity or jitter on the receiver side. To provide a stable stream, ΔT should be as even as possible (assuming a fixed packet size). The histograms of ΔT for TCP and HAA flows are shown in Figure 6.8. It can be seen that ΔT has a much sharper histogram (i.e., a smaller variation of ΔT) and a smaller mean (i.e., a higher average rate) for HAA than for TCP. Also, note that there are some components located at much larger values than the main components (around 40 ms for 4 links and 240 ms for 12 links) in the TCP histograms. This is due to the bursty nature of TCP traffic. Figure 6.9 provides a qualitative comparison of TCP and HAA in terms of the normalized deviation of ΔT (σ_ΔT), and it shows that σ_ΔT of HAA is much smaller and less sensitive to (almost independent of) the path length than that of TCP.
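A small sketch of how the inter-arrival-time statistics behind Figures 6.8 and 6.9 can be computed from receiver-side timestamps is given below; the function name and the synthetic smooth/bursty traces in the demo are illustrative assumptions.

```python
# A minimal sketch of the inter-arrival-time statistics: Delta-T values are
# computed from arrival timestamps, and their normalized standard deviation
# serves as a jitter index.
from statistics import mean, pstdev


def interarrival_stats(arrival_times):
    """Return (mean Delta-T, normalized std of Delta-T) for sorted timestamps (s)."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    m = mean(gaps)
    return m, (pstdev(gaps) / m if m > 0 else 0.0)


if __name__ == "__main__":
    smooth = [0.04 * i for i in range(100)]      # one packet every 40 ms
    bursty = []
    t = 0.0
    for _ in range(50):                          # back-to-back pairs, then idle gaps
        bursty += [t, t + 0.005]
        t += 0.08
    print(interarrival_stats(smooth))            # small normalized deviation
    print(interarrival_stats(bursty))            # large normalized deviation
```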
6.2.7 Adaptive Penalty

Figure 6.10 shows how the proposed adaptive scheme behaves when the upper bound of the end-to-end delay (Tee_max) is set to a specific value (0.07 seconds in this example). It is seen that β decreases from a conservative initial value of 1.9 until the QoS requirement on the end-to-end delay is violated (at around 25 seconds), and β converges to around 1.0, with the end-to-end delay close to 0.07.

Figure 6.8: Histograms of inter-arrival times (128 bins). (a) 4 links, 1 HAA flow; (b) 4 links, 1 TCP flow; (c) 12 links, 1 HAA flow; (d) 12 links, 1 TCP flow.

Figure 6.9: Deviation of inter-arrival time.

Figure 6.10: Illustration of the proposed adaptive rate control scheme.

6.2.8 Path Selection

Because the source rate (r_s), which is calculated on the receiver (refer to Equation 5.3), should be almost the same as the effective throughput of the network path when the dropping rate is very low, the solution to the path selection problem is straightforward. An example network scenario consisting of 3 paths is shown in Figure 6.11. Note that both path 1 and path 2 have only one bottleneck link, while path 3 has two bottleneck links. However, the bottleneck bandwidth variation of path 3 is less than that of paths 1 and 2. The penalty factor (β) is set to 1.0 for all 3 paths so that the dropping rate is very low. The simulation results are shown in Figure 6.12.

Figure 6.11: Parallel paths where the number of links varies with the path. The number pair in brackets over each link represents b_i and σ_Bi of the link.

Figure 6.12: Throughput and end-to-end delay when the penalty factor (β) is set to 1.0. (a) abbw and throughput; (b) end-to-end delay.

It can be seen that even though path 3 has the lowest average bottleneck bandwidth (abbw) and the longest path length, it produces the highest effective throughput. This result reveals that abbw should not be the only measure used in the path selection problem, and that information on the bandwidth variation (σ_ABBW) has to be considered. As can be seen in Figure 6.12, the source rate r_s matches the effective throughput quite well; therefore, it could be used as a good estimate of throughput when selecting the best path.
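Under the assumption stated above, that r_s tracks the effective throughput when the dropping rate is kept low, path selection reduces to ranking the candidate paths by their estimated source rates rather than by abbw alone. The sketch below illustrates this; the per-path feature values and the function name are made-up examples, not the measurements behind Figures 6.11 and 6.12.

```python
# A minimal sketch of the path-selection criterion suggested in Section 6.2.8.

def select_path(paths, beta=1.0):
    """paths: dict name -> (abbw_mean, sigma_abbw, abbw_min).
    Return the name of the path with the largest estimated source rate."""
    def rate(features):
        abbw_mean, sigma, abbw_min = features
        return max(abbw_min, abbw_mean * (1.0 - beta * sigma))
    return max(paths, key=lambda name: rate(paths[name]))


if __name__ == "__main__":
    candidates = {
        "path1": (2.0e6, 0.29, 1.0e6),   # high mean, large variation
        "path2": (2.0e6, 0.25, 1.1e6),
        "path3": (1.9e6, 0.06, 1.7e6),   # slightly lower mean, small variation
    }
    print(select_path(candidates, beta=1.0))   # path3 wins despite the lower mean
```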
6.2.9 Multi-Path Media Streaming

The application of multi-path media streaming with inter-path synchronization is considered. The topology of the multi-sender (mirror) parallel data transmission system used in the simulation is shown in Figure 6.13, which is based on the framework developed in [47]. If all paths are synchronized "perfectly" and all connections are reliable, no media packet is missed or transmitted more than once. Note that, due to heterogeneity, this perfect synchronization can never be achieved in a real network.

In addition to the performance measures discussed in the previous sections (throughput, dropping rate, end-to-end delay), which are defined on the lower network layers, it is necessary to consider performance measures related to the degree of mis-synchronization on the application layer: the missing rate and the duplicate rate. The missing rate is defined to be the ratio of "blank segments" to the whole length of the stream received. These blank segments are caused both by mis-synchronization on the application layer and by packet dropping on the lower network layers. The duplicate rate is defined to be the ratio of duplicated segments (those received more than once) to the whole length of the stream received. Both are application-level measures and are calculated on the receiver side. To simplify the simulation, the segment size is set to be the same as the packet size.

Figure 6.13: Simulation testbench topology for multi-path media streaming.

Simulation Results

Figures 6.14 and 6.15 show the simulation results for media streaming over a network with 2 parallel paths, each with 4 inter-router links. All links have the same average available bandwidth (b=2 Mbps) and the links on path i have the deviation σ_Bi for i = 1, 2. It is seen that a higher source rate is allocated to path 1, which has the lower temporal heterogeneity (σ_B1=0.06), although the average available bandwidth is the same on both paths, as shown in Figure 6.14-(a) and (b).

Figure 6.14: 2 paths, each with 4 links; all links have the same b_i=2 Mbps but different deviations σ_Bi. (a) σ_B1 = 0.06, σ_B2 = 0.17; (b) σ_B1 = 0.06, σ_B2 = 0.29.

Figure 6.15: Application-level measures for the scenario in Figure 6.14. (a) Duplicate Rate; (b) Missing Rate.

From Figure 6.15, it is interesting to see that both the missing and duplicate rates decrease significantly. When the penalty factor (β) increases from 0 to 1.5, the percentage reduction is up to 70% in the duplicate rate and 80% in the missing rate, while the throughput only decreases by 11% for the case of lower spatial heterogeneity (σ_B1 = 0.06, σ_B2 = 0.17) and 18% for the case of higher spatial heterogeneity (σ_B1 = 0.06, σ_B2 = 0.29) (Figure 6.14-(a) and (b)). This implies that the real-time QoS (missing and duplicate rates) can be improved without significantly degrading the effective throughput.

Figure 6.16: Adaptive rate control on a multi-path network. (a) Path 1; (b) Path 2.

Figure 6.16 shows how the proposed adaptive scheme works on a two-path network when the upper bound of the end-to-end delay (Tee_max) is set to 0.07 seconds. Note that β on each path varies independently and converges to a different value (about 0.3 on path 1 and about 1.0 on path 2). That is, the proposed scheme is also able to handle this type of multi-path communication.
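As a concrete illustration of the application-level measures defined at the beginning of this subsection, the sketch below computes the missing and duplicate rates from the segment indices received over all paths. The function name, the assumption that the stream consists of segments 0..stream_len−1, and the demo data are illustrative only.

```python
# A minimal sketch of the receiver-side missing-rate and duplicate-rate
# computation for multi-path streaming.
from collections import Counter


def missing_and_duplicate_rates(received_segments, stream_len):
    counts = Counter(received_segments)
    missing = sum(1 for s in range(stream_len) if counts[s] == 0)
    duplicated = sum(1 for s in counts if counts[s] > 1)
    return missing / stream_len, duplicated / stream_len


if __name__ == "__main__":
    # Segments 3 and 7 were never received; segment 5 arrived on both paths.
    rx = [0, 1, 2, 4, 5, 5, 6, 8, 9]
    print(missing_and_duplicate_rates(rx, stream_len=10))   # (0.2, 0.1)
```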
Performance Comparison with TCP

The proposed HAA scheme is compared with TCP on a 2-path network with each path consisting of 4 links. The results are shown in Figure 6.17. As expected, the aggregated throughput for HAA is much higher than that for TCP. However, the end-to-end delay is also larger than that for TCP because more packets tend to be buffered under a higher transmission rate. By adjusting the penalty factor to 1.0, the end-to-end delay can be reduced significantly while the throughput is still maintained at a much higher value than that of TCP.

Figure 6.17: Comparison with the TCP implementation for 2 paths, each consisting of 4 links. Top row: σ_B1 = 0.06, σ_B2 = 0.06; bottom row: σ_B1 = 0.06, σ_B2 = 0.29.

Figure 6.18 compares HAA and TCP when the penalty factor (β) is fixed to 1.0. Two network paths are employed and the length of each path varies from 2 to 12 links. Note that the throughput achieved by TCP drops dramatically (by more than 75%) as the path length increases up to 12 links, while that of HAA decreases by less than 15%.

Figure 6.18: Comparison with the TCP implementation (2 paths, β fixed to 1.0). Top row: σ_B1 = 0.06, σ_B2 = 0.06; bottom row: σ_B1 = 0.06, σ_B2 = 0.29.

The dependency of the missing and duplicate rates of HAA on the path length is shown in Figure 6.19. Both rates increase slightly when the network paths become longer, but they remain at a reasonable level; the duplicate rate is around 2.3% and the missing rate is a little larger than 1.0% when the path length is 12.

Figure 6.19: Application-level measures (2 paths, β fixed to 1.0). (a) Duplicate Rate; (b) Missing Rate.

6.3 Multiple Flows

All the results discussed above assume that the variation of the available bottleneck bandwidth (ABBW) is independent of the HAA source rate control, i.e., the introduction of HAA flows does not affect other background traffic. However, this may not be the case in real networks. In a real network, each flow is part of the background traffic of the other flows, and the dynamics of its source rate control can change the available bandwidth and behavior of other flows and, in turn, change its own available bandwidth. In this section, multiple flows sharing a network path with varying bandwidth are studied. Figure 6.20 illustrates N_f flows sharing a network path consisting of N links. Each flow f_i is connected by a separate pair of sender (S_i) and receiver (R_i). It is assumed that there is no coordination among the flows, i.e., the source rate of each flow is controlled independently by its sender and receiver pair.
The queue management in this study adopts a basic "Drop-Tail Round-Robin" scheme to ensure that all flows on the path have the same level of priority.

Figure 6.20: Topology of the testing network: multiple flows sharing a network path.

As a comparison reference, the bandwidth utilization of multiple TCP flows is first analyzed. Next, the performance of multiple HAA flows (with continuous and periodic On-Off traffic) in terms of aggregated throughput, average end-to-end delay, dropping rate, and fairness is investigated and compared with that of multiple TCP flows. Finally, TCP friendliness is verified to see whether HAA flows can share the network bandwidth with TCP flows in a friendly manner.

6.3.1 Multiple TCP Flows

The aggregated throughput of multiple TCP flows is first investigated. According to the approximation model of TCP throughput from [48],

$$ TH^{TCP}_{1}(p) = \min\left( \frac{W_{max}}{RTT},\ \frac{1}{RTT\sqrt{\frac{2bp}{3}} + T_0\,\min\left(1,\ 3\sqrt{\frac{3bp}{8}}\right) p\,(1+32p^2)} \right) \qquad (6.1) $$

where W_max is the maximum window size, RTT is the round-trip time, T_0 is the timeout value, b is the number of packets of transmitted data that are acknowledged by one ACK from the receiver, and p is the loss ratio. It is not difficult to see that a single TCP flow's throughput is no more than

$$ TH^{TCP}_{1} \le \frac{W_{max}}{RTT} \qquad (6.2) $$

and when there are N_f TCP flows, the aggregated throughput has an upper bound of

$$ TH^{TCP}_{N} \le \sum_{i=1}^{N_f} \frac{W_{max_i}}{RTT_i} \qquad (6.3) $$

From Equation 6.3, it can be seen that, when the maximum window size and N_f are fixed, a longer path length results in a larger RTT and subsequently decreases the maximum possible throughput. This is verified by the simulation, and the results are shown in Figure 6.21. Both the measured utilization (solid curves) and the estimated maximum utilization (dotted curves) are plotted, which are defined to be

$$ \text{Measured Utilization} = \frac{TH^{TCP}_{N}}{C} \qquad (6.4) $$

$$ \text{Estimated Maximum Utilization} = \min\left(1,\ \frac{N_f \cdot W_{max}}{C \cdot RTT}\right) \qquad (6.5) $$

where C is the capacity of the bottleneck link, set to 2 Mbps in this simulation. It can be seen from Figure 6.21 that the utilization becomes higher as more TCP flows are introduced, especially when the path is short. For example, on a path consisting of fewer than 9 links with no variation in ABBW, four TCP flows can achieve almost 100% utilization in the given scenario (Figure 6.21-(a)). However, as analyzed earlier, the bandwidth utilization drops significantly as the path length (number of links) increases, and it drops even more when there is a larger available bandwidth variation (σ_B).

Figure 6.21: Aggregated bandwidth utilization of multiple TCP flows, W_max=15. (a) σ_Bi=0.00; (b) σ_Bi=0.29. (Note that some measured utilization curves overlap the estimated maximum ones.)
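The window-limited bounds of Equations 6.2 through 6.5 can be evaluated directly, as in the sketch below. The packet size (1000 bytes) and capacity (2 Mbps) follow the simulation setup, while the RTT model in the demo (propagation delay only) and the function names are simplifying assumptions.

```python
# A minimal sketch of the window-limited TCP throughput bound and the
# estimated maximum utilization (Equations 6.2-6.5).

def tcp_throughput_bound(wmax_pkts, rtt_s, packet_size_bits=8000):
    """Window-limited upper bound on a single TCP flow's throughput (bps)."""
    return wmax_pkts * packet_size_bits / rtt_s


def estimated_max_utilization(n_flows, wmax_pkts, rtt_s, capacity_bps,
                              packet_size_bits=8000):
    """Equation 6.5: aggregated utilization bound for n_flows identical flows."""
    return min(1.0, n_flows * tcp_throughput_bound(wmax_pkts, rtt_s,
                                                   packet_size_bits) / capacity_bps)


if __name__ == "__main__":
    # 1000-byte packets, Wmax = 15, bottleneck capacity 2 Mbps; the RTT here is
    # a rough two-way propagation estimate for a path of 10 ms inter-router links.
    for n_links in (4, 8, 12):
        rtt = 2 * (0.003 + n_links * 0.010)
        for n_flows in (1, 2, 4):
            u = estimated_max_utilization(n_flows, 15, rtt, 2e6)
            print(f"{n_links:2d} links, {n_flows} flow(s): <= {u:5.1%}")
```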
To further illustrate how the TCP bandwidth utilization is affected by the number of flows and the path length, trace plots of packet arrival times for multiple (N_f=1, 2, and 4) TCP flows are provided in Figure 6.22. The simulation is conducted on a network path with two different numbers of links (N=4 and N=12). The result shows how the effective aggregated throughput varies as the path length increases. The FlowID (from 1 to N_f) of each packet is plotted as a function of time, and each mark indicates a packet arrival event for the corresponding FlowID. The duration of each transition on FlowID represents the time when no packet is being received, i.e., the idle time of the shared path. It can easily be seen that multiple TCP flows are received by the receiver in a bursty and multiplexed manner. When there is only one TCP flow, the "duty cycle" is about 60% for a short path consisting of 4 links. A larger number of flows increases the duty cycle significantly; for example, four TCP flows can achieve almost full utilization on the 4-link path. However, it can also be clearly seen that the duty cycle (hence the bandwidth utilization) shrinks as the path length increases. For example, when there is only one TCP flow, the duty cycle is less than 20% on a 12-link path.

Figure 6.22: Arrival times of multi-flow TCP packets. (a) 1 TCP flow, 4 links; (b) 1 TCP flow, 12 links; (c) 2 TCP flows, 4 links; (d) 2 TCP flows, 12 links; (e) 4 TCP flows, 4 links; (f) 4 TCP flows, 12 links.

6.3.2 Multiple HAA Flows

6.3.2.1 Overall Performance

Figure 6.23: Multiple HAA flows sharing a network path consisting of 4 links; all links have the same b_i=2 Mbps and σ_Bi=0.17. (a) Aggregated Throughput; (b) End-to-End Delay; (c) Dropping Rate.

Figure 6.23 shows the simulation results of N_f HAA flows sharing a varying available bandwidth on a network path consisting of 4 links. Each flow is sent continuously at the allocated source rate. It can be seen from Figure 6.23 that as the total number of HAA flows (N_f) increases, the aggregated throughput does not change very much (when β is fixed). This result shows that multiple HAA flows can still maintain the overall bandwidth utilization at almost the same level as that for a single HAA flow. It should also be noted that, when more flows are introduced, the dropping rate increases due to the interaction among multiple HAA flows and the limited buffer space available. By adjusting the penalty factor (β) to a larger value, the dropping rate can be reduced significantly. For example, when β is set to 1.5, the dropping rate is reduced to about 1.1%, while the aggregated throughput is degraded by no more than 12%.

Figure 6.24: Aggregated bandwidth utilization of multiple HAA flows. (a) σ_Bi=0.00; (b) σ_Bi=0.29.

Figure 6.24 shows the dependency of the bandwidth utilization of multiple HAA flows on the path length and the number of flows. When there is no variation in ABBW (σ_Bi=0.00), multiple HAA flows can achieve almost 100% utilization, and the utilization is almost independent of the path length. In the case with a larger bandwidth variation (σ_Bi=0.29), the utilization decreases as the path length increases, but the effect of the path length is much less significant than for TCP. The comparison results will be shown in Section 6.3.3.
6.3.2.2 Dynamic Performance

To see whether the HAA source rate control scheme produces any "exclusive flow" which permanently uses up all available bandwidth and refuses the injection of other flows, a scenario with On-Off flow patterns ("scenario On-Off") is designed. Instead of sending out packets continuously ("scenario Continuous"), each flow is toggled with a period of T_switch seconds. The duty cycle is set to (2N_f − 1)/(2N_f), i.e., in each period the flow is turned off for T_off = T_switch/(2N_f) seconds, and turned back on for T_on = T_switch(2N_f − 1)/(2N_f) seconds. The switching times of the flows are evenly spaced by T_switch/N_f so that at any time there is at least one flow active (unless N_f=1). A case with two On-Off flows is illustrated in Figure 6.25.

Figure 6.25: Two On-Off flows with T_switch=6.

When N_f > 1, the ideal aggregated throughput should be close to that in the scenario Continuous, because every time one flow is turned off, the other active (enabled) flows are supposed to take over the bandwidth released by the closed flow. Multiple On-Off flows are considered on a shared path consisting of 4 links in order to study the effects of bandwidth heterogeneity on the aggregated throughput.

Figure 6.26: Dynamic performance of multiple On-Off flows. N=4, b_i=2 Mbps for all links, and β=1.0 for all flows. (a) Aggregated throughput; (b) Fairness among HAA flows.

The aggregated throughputs of 2 and 4 On-Off HAA flows are compared to the throughput of a single continuous flow for the same background traffic setup, and the results are shown in Figure 6.26-(a). It can be seen that the aggregated throughput of the scenario On-Off is close to the throughput of the scenario Continuous, and even slightly better when there is a higher bandwidth heterogeneity (σ_Bi > 0.1). This indicates that the active flows can take over, in a timely manner, the bandwidth released by the inactive flows. The normalized standard deviation of throughput among the HAA flows in the scenario On-Off is plotted as an index of fairness in Figure 6.26-(b). It is seen that the throughput variation among the flows is very small (less than 2 percent), which means that there is no "irresponsible" flow.
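For reference, the staggered On-Off pattern of this subsection can be generated as in the sketch below (illustrative function name). Whether a flow starts its period in the off state or the on state is an assumption here, since the text only specifies the durations and the spacing of the switching times.

```python
# A minimal sketch of the staggered On-Off schedule: each of the Nf flows is
# off for T_switch/(2*Nf) per period, and the flows' switching times are
# offset by T_switch/Nf so that at least one flow is always active (Nf > 1).

def is_flow_on(flow_id, t, n_flows, t_switch):
    """True if flow `flow_id` (0-based) is enabled at simulation time t."""
    t_off = t_switch / (2.0 * n_flows)
    offset = flow_id * t_switch / n_flows          # evenly spaced switching times
    phase = (t - offset) % t_switch
    return phase >= t_off                          # off during the first t_off of each period


if __name__ == "__main__":
    n_flows, t_switch = 2, 6.0
    for t in [x * 0.5 for x in range(13)]:         # 0.0 .. 6.0 s
        states = ["on" if is_flow_on(i, t, n_flows, t_switch) else "off"
                  for i in range(n_flows)]
        print(f"t={t:4.1f}s  flow1={states[0]:>3}  flow2={states[1]:>3}")
```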
6.3.3 Comparison with Multiple TCP Flows

6.3.3.1 Aggregated Bandwidth Utilization

The aggregated bandwidth utilization of multiple HAA flows is compared with that of multiple TCP flows. The improvement in bandwidth utilization is defined to be

$$ \text{Improvement (\%)} = \frac{Ut_{HAA} - Ut_{TCP}}{Ut_{TCP}} \qquad (6.6) $$

where Ut_HAA and Ut_TCP are the bandwidth utilizations of N_f HAA flows and N_f TCP flows, respectively. The penalty factor (β) of the HAA flows is set to 1.0, and the maximum window size (W_max) of TCP is set to 15. The results are shown in Figure 6.27.

Figure 6.27: Improvement in aggregated bandwidth utilization. (a) σ_Bi=0.00; (b) σ_Bi=0.29.

The improvement in utilization becomes larger as the path length increases. Also, the improvement is more significant when the network links have a smaller bandwidth variation (σ_Bi=0.00). However, the improvement becomes smaller as the number of flows increases. For example, on the path of 4 links, the improvement by HAA over TCP is above 40% when there is only 1 flow, but there is almost no improvement when there are 4 flows.

6.3.3.2 Inter-flow Fairness

Figure 6.28: Fairness comparison with TCP. (a) σ_Bi=0.0; (b) σ_Bi=0.29.

Figure 6.28 compares the throughput variation σ_thr_i, an index of inter-flow fairness, among multiple flows in both the homogeneous (σ_Bi=0.0) and heterogeneous (σ_Bi=0.29) cases. It can be seen that in both cases the throughput variation among HAA flows is much smaller than that among TCP flows, i.e., the inter-flow fairness of HAA is better. Also, σ_thr_i is heavily dependent on the path length (or round-trip time) for TCP, but it is almost independent of the path length for HAA.

Figure 6.29: Multiple HAA and TCP flows. (a) 2 flows; (b) 4 flows; (c) 8 flows; (d) Throughput ratio (HAA/TCP).

6.3.4 TCP Friendliness

Figure 6.29 shows the simulation results for multiple HAA flows sharing the available bandwidth with TCP flows on a network path of 4 links. In Figure 6.29-(a), (b), and (c), three sets of aggregated throughput are compared: (1) m HAA flows and m TCP flows, (2) 2m HAA flows, and (3) 2m TCP flows, where m=1, 2, or 4. It can be seen that the throughput achieved by one TCP flow and one HAA flow together is slightly lower than that by two TCP flows when β is large (greater than 0.4), and is quite close to that by two HAA flows. However, when the number of flows is larger (4 or 8 flows), the mixed flows (HAA and TCP) achieve the highest throughput. To prevent HAA flows from aggressively taking too large a share of the available bandwidth, β needs to be properly adjusted. Figure 6.29-(d) shows how to ensure that TCP flows get a "fairer" share by adjusting β. It can be seen that setting β to 1 usually achieves a fairly good bandwidth sharing between TCP and HAA flows, with a throughput ratio between 1.0 and 1.5.

Chapter 7

Conclusion and Future Work

In this dissertation, two types of heterogeneity are defined for heterogeneous computing and communication systems. Temporal heterogeneity (TH) refers to variation in the computing power or communication bandwidth available for a task along the time dimension on an individual computer or channel, while spatial heterogeneity (SH) refers to variation of the available computing power or communication bandwidth among computers.

Effects of TH and SH on computing and communication systems are first analyzed to show the possibility of improving task performance by taking the heterogeneity information into account. In addition to the first-order moment (mean), the second-order moment (deviation) of resource availability is quantified and explicitly utilized to improve the performance of computing and communication tasks in the heterogeneity-aware approaches (HAA).

7.1 Computing Systems

The effects of heterogeneity are first studied in terms of the average parallel execution time and the standard deviation of parallel execution time on heterogeneous computing systems (Chapter 2).
The results revealed that both TH and SH in computing power can have a significant effect on parallel computing performance. To achieve the minimum average parallel execution time on a spatially heterogeneous computing system, it is not optimal to distribute a target task linearly proportional to the average computing powers of the computers. If the computational load is perfectly balanced among computers in terms of the average computing power available, the average parallel execution time increases significantly as one or both types of heterogeneity (TH and SH) increase, while the average sequential execution time on a single computer is not affected by TH. As SH increases, the average parallel execution time increases more when the average TH (among computers) increases than when it remains fixed. As the number of computers employed for a target task increases, the average parallel execution time decreases initially and then may increase, especially when SH increases fast. More frequent synchronization amplifies the effects of heterogeneity and degrades the performance of a target task more for a higher TH and/or SH. Performance degradation due to heterogeneity becomes even larger when the interval is longer (coarser granularity). The risk factor and the variation of parallel execution time are quickly reduced as the fraction of a target task assigned to computers with smaller TH increases.

A two-step heuristic approach to partitioning a target task for minimizing its average parallel execution time is described with a theoretical model (Chapter 3). The proposed approach achieves a shorter average parallel execution time by assigning a fraction of the target task, which is larger than the fraction determined proportionally to the average computing power, to a computer with a smaller TH or a higher average computing power. The reduction achieved by the proposed approach is greater when (i) the variation of the average computing power among computers is larger; (ii) the variation of the standard deviation of computing power among computers is larger; or (iii) the number of computers employed is greater.

7.2 Communication Systems

The effects of TH on QoS measures such as throughput, end-to-end delay, and transfer time on a heterogeneous network are studied in Chapter 4. The effects become even larger when there also exists SH among links or paths. For the two applications, path selection and multi-path data transfer, it has been demonstrated that significant improvements in the end-to-end delay and transfer time are possible by considering the standard deviation as well as the mean of the available bandwidth on each link.

A heterogeneity-aware approach (HAA) to improving the performance of communication tasks by utilizing the information on the variation of available communication resources is proposed (Chapter 5). The proposed HAA source rate control scheme dynamically controls the source rate such that the rate is "penalized" more for a higher level of heterogeneity. The performance of HAA source rate control has been analyzed in detail through simulation, and the results may be summarized as follows:

- HAA can significantly increase the effective throughput and decrease the packet dropping rate, compared to typical network-measurement-based approaches (the mean-only and minimum-bandwidth schemes) which do not utilize the heterogeneity information.
- Compared to TCP, the bandwidth utilization of HAA is much less sensitive to the path length (or round-trip time), and the aggregated throughput is even better than that of TCP when the path length is long. The inter-flow fairness of HAA is shown to be better. Also, HAA has much less variation in the inter-packet arrival time, i.e., it can produce a much more stable stream with less jitter than TCP. This is especially desirable in jitter-sensitive media streaming applications.

- The results from the multi-path media streaming application show that HAA can improve the real-time performance measures (missing and duplicate rates) without degrading the effective throughput substantially.

- It has been verified that the proposed scheme can be properly tuned to achieve fair sharing of bandwidth with other flows (HAA or TCP) on a network path and to produce "responsible" traffic.

The proposed HAA source rate control scheme is very promising in various applications, especially where real-time QoS is concerned. By quantifying TH and SH in communication resources, it can produce a more comprehensive picture of resource variation and distribution, and provide direct control of data transmission in a more predictable manner.

7.3 Future Work

Further improvements can be made to each component of the proposed schemes. For example, 1) more sophisticated resource measurement techniques can be designed to improve the accuracy of measurements; 2) the features (heterogeneity) of other network resources, such as the packet buffer size, can be utilized to further improve the performance; 3) the source rate function can be customized for different applications or services. Also, applications to the following areas deserve investigation.

7.3.1 Data-Intensive Computing Systems

So far, the HAA has been considered for computing and communication individually. However, in many applications, such as data-intensive high-performance computing, both computing and communication are to be considered. It would be worthwhile to investigate the interaction between them in applying the HAA to such tasks.

7.3.2 Application to Overlay Networks

Overlay networks have received a great deal of attention recently, and they can provide another framework for applying HAA without worrying about technical details of the lower layers of the network (physical, data link layers, etc.). Resilient Overlay Network (RON) [49][50], Multicast Backbone (M-Bone) [51], and X-Bone [17][18][19] are three well-known overlay frameworks. Specifically, HAA can be applied to the path selection and dynamic routing problems of an overlay network.

Bibliography

[1] R. Buyya. High Performance Cluster Computing. 1999.

[2] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.

[3] R. F. Freund and H. J. Siegel. Heterogeneous Processing. Computer, pages 13-17, June 1993.

[4] M. Maheswaran and H. Siegel. A Dynamic Matching and Scheduling Algorithm for Heterogeneous Computing Systems. In Proceedings of Heterogeneous Computing, pages 57-69, 1998.

[5] T. Thanalapati and S. Dandamudi. An Efficient Adaptive Scheduling Scheme for Distributed Memory Multicomputers. IEEE Transactions on Parallel and Distributed Systems, pages 758-768, July 2001.

[6] Y. Zhang, A. Sivasubramaniam, J. Moreira, and H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. IEEE Transactions on Parallel and Distributed Systems, pages 967-985, September 2001.

[7] S. Floyd and V. Paxson. Difficulties in Simulating the Internet. IEEE/ACM Transactions on Networking (TON), August 2001.

[8] S. Ali, H. Siegel, M. Maheswaran, D. Hensgen, and S. Ali. Task Execution Time Modeling for Heterogeneous Computing Systems.
In Proceedings of the 9th Heterogeneous Computing Workshop, pages 185-199, May 2000.

[9] R. Armstrong. Investigation of Effect of Different Run-Time Distributions on SmartNet Performance. Master's thesis, Department of Computer Science, Naval Postgraduate School, 1997.

[10] J. Al-Jaroodi, N. Mohamed, H. Jiang, and D. Swanson. Modeling Parallel Applications Performance on Heterogeneous Systems. In Proceedings of IPDPS 2003, Workshop on Advances in Parallel and Distributed Computational Models, Nice, France, April 2003.

[11] S. M. Figueira and F. Berman. A Slowdown Model for Applications Executing on Time-Shared Clusters of Workstations. IEEE Transactions on Parallel and Distributed Systems, pages 653-670, June 2001.

[12] C.-Z. Xu, L. Y. Wang, and N.-T. Fong. Stochastic Prediction of Execution Time for Dynamic Bulk Synchronous Computations. In Proceedings of the International Parallel and Distributed Processing Symposium, San Francisco, April 2001.

[13] X. Zhang and Y. Yan. Modeling and Characterizing Parallel Computing Performance on Heterogeneous Networks of Workstations. In Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, pages 25-34, October 1995.

[14] A. Dogan and F. Ozguner. Trading off Execution Time for Reliability in Scheduling Precedence-Constrained Tasks in Heterogeneous Computing. In Proceedings of the International Parallel and Distributed Processing Symposium, San Francisco, April 2001.

[15] J. Schopf and F. Berman. Stochastic Scheduling. CS Dept. Technical Report CS-99-03, University of California, San Diego, 1999.

[16] D. Anderson, H. Balakrishnan, F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, October 2001.

[17] J. Touch. Dynamic Internet Overlay Deployment and Management Using the X-Bone. Computer Networks, pages 117-135, July 2001.

[18] J. Touch and S. Hotz. The X-Bone. In Proceedings of the Third Global Internet Mini-Conference at Globecom, Sydney, Australia, July 1998.

[19] J. Touch. Those Pesky NATs. IEEE Internet Computing, page 96, 2002.

[20] J. Kim and D. Lilja. Multiple Path Communication. High Performance Cluster Computing: Architectures and Systems, pages 364-382, 1999.

[21] T. Nguyen and A. Zakhor. Path Diversity with Forward Error Correction System for Packet Switched Networks. In Proceedings of INFOCOM 2003, San Francisco, April 2003.

[22] J. Huang and S.-Y. Lee. Effects of Spatial and Temporal Heterogeneity on Performance of a Target Task in Heterogeneous Computing Environments. In Proceedings of the ISCA 15th International Conference on Parallel and Distributed Computing Systems, pages 301-306, September 2002.

[23] S.-Y. Lee and J. Huang. A Theoretical Approach to Load Balancing of a Target Task in Temporally and Spatially Heterogeneous Grid Computing Environment. In Proceedings of the 3rd International Workshop on Grid Computing, pages 70-81, November 2002.

[24] M. Andrews, S. Borst, F. Dominique, P. Jelenkovic, K. Kumaran, K. G. Ramakrishnan, and P. Whiting. Dynamic bandwidth allocation algorithms for high-speed data wireless networks. Bell Labs Technical Journal, July-September 1998.

[25] X. Liu, E. Chong, and N. Shroff. Transmission Scheduling for Efficient Wireless Utilization. In Proceedings of IEEE INFOCOM, Alaska, pages 776-785, April 2001.

[26] T. Nandagopal, T. Kim, X. Gao, and V. Bharghavan. Achieving MAC Layer Fairness in Wireless Packet Networks. In Proceedings of ACM Mobicom, Boston, MA, August 2000.

[27] S. Lee and M. Gerla.
Dynamic Load-Aware Routing in Ad Hoc Networks. In Proceedings of ICC 2001, Helsinki, Finland, June 2001.

[28] J.-C. Bolot. End-to-end packet delay and loss behavior in the Internet. In Proceedings of the ACM SIGCOMM Symposium on Communications Architectures and Protocols, pages 289-298, September 1993.

[29] S. Keshav. A Control-Theoretic Approach to Flow Control. In Proceedings of ACM SIGCOMM, September 1991.

[30] S. Keshav. Packet Pair Flow Control. IEEE/ACM Transactions on Networking, February 1995.

[31] R. L. Carter and M. E. Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Performance Evaluation, TR-96-006, Boston University Computer Science Department, March 1996.

[32] K. Lai and M. Baker. Measuring Bandwidth. In Proceedings of IEEE INFOCOM, March 1999.

[33] K. Lai and M. Baker. Measuring Link Bandwidth Using a Deterministic Model of Packet Delay. In Proceedings of ACM SIGCOMM, Stockholm, Sweden, August 2000.

[34] Transmission Control Protocol Specification. RFC 793, September 1981.

[35] V. Jacobson. Congestion Avoidance and Control. In Proceedings of ACM SIGCOMM, pages 314-329, August 1988.

[36] Y. R. Yang, M. S. Kim, X. Zhang, and S. S. Lam. Two Problems of TCP AIMD Congestion Control. Technical Report TR-00-13, Department of Computer Sciences, The University of Texas at Austin, June 2000.

[37] J. Widmer, R. Denda, and M. Mauve. A Survey on TCP-Friendly Congestion Control. IEEE Network Magazine, Special Issue on Control of Best Effort Traffic, pages 28-37, May/June 2001.

[38] D. Sisalem and A. Wolisz. LDA+ TCP-friendly adaptation: A measurement and comparison study. In 10th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2000), pages 1619-1622, June 2000.

[39] L. Zhang, S. Shenker, and D. Clark. Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic. In Proceedings of the ACM SIGCOMM 91 Conference on Communications Architectures and Protocols, pages 133-147, 1991.

[40] A. Aggarwal, S. Savage, and T. Anderson. Understanding the Performance of TCP Pacing. In Proceedings of the IEEE INFOCOM Conference on Computer Communications, pages 1157-1165, 2000.

[41] A. C. Feng and A. C. Kapadia. Packet Spacing: An Enabling Mechanism for Delivering Multimedia Content in Computational Grids. The Journal of Supercomputing, pages 51-66, 2002.

[42] A. Kapadia, A. Feng, and W. Feng. The Effects of Inter-Packet Spacing on the Delivery of Multimedia Content. In Proceedings of the 21st IEEE International Conference on Distributed Computing Systems, 2001.

[43] H. David. Order Statistics. John Wiley and Sons Inc, 1970.

[44] K. Fall and K. Varadhan. The ns Manual (formerly ns Notes and Documentation). Web resource available at http://www.isi.edu/nsnam/ns/ns-documentation.html.

[45] E. Gafni and D. Bertsekas. Dynamic Control of Session Input Rates in Communication Networks. IEEE Trans. on Automatic Control, pages 1009-1016, January 1984.

[46] A. G. Fraser. Designing a Public Data Network. IEEE Communications Magazine, pages 31-35, October 1991.

[47] T. Nguyen and A. Zakhor. Distributed Video Streaming over Internet. In Proceedings of SPIE Multimedia Computing and Networking, pages 186-195, San Jose, California, January 2002.

[48] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: a simple model and its empirical validation. In Proceedings of ACM SIGCOMM, September 1998.
[49] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proceedings of the 18th ACM SOSP, Banff, Canada, October 2001.

[50] D. G. Andersen. Resilient Overlay Networks. Master's thesis, Massachusetts Institute of Technology, May 2001.

[51] M. R. Macedonia and D. P. Brutzman. MBone Provides Audio and Video Across the Internet. IEEE Computer, pages 30-36, April 1994.