Incast-free TCP for Data Center Networks by Santosh B. Kulkarni A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy Auburn, Alabama December 8, 2012 Keywords: Cloud Computing, Data Center, TCP, Incast, Congestion, Timeout Copyright 2012 by Santosh B. Kulkarni Approved by Prathima Agrawal, Ginn Distinguished Professor of Electrical and Computer Engineering Sa?ad Biaz, Associate Professor of Computer Science and Software Engineering Alvin Lim, Associate Professor of Computer Science and Software Engineering Abstract Cloud Computing is a computing paradigm that involves delivering hosted services over the Internet, based on a ?pay-per-use? approach. This new style of computing, promises to revolutionize the IT industry by making computing available over the Internet, in a fashion similar to other utilities like water, electricity, gas and telephony. Growing adoption of Cloud Computing, by both the IT industry and the general public, is driving the service providers into creating new data centers. Data centers are facilities that typically host tens of thousands of servers. These servers communicate with each other over high speed network interconnects. With growing application deployments, data centers utilize a multi-tiered model where several servers work together to service a single client request. As a result, the overall application performance in a data center, largely depends on the efficiency of its underlying communication fabric. There are essentially two high level choices for building communication fabric for data centers. The first option leverages specialized hardware and communication protocols like Infiniband, FibreChannel or Myrinet; the second leverages off-the-shelf commodity products like Ethernet based switches and routers. Cost and compatibility reasons persuade many data centers to consider the second option for their baseline communication fabric. Until a few years ago, Ethernet speeds inside data centers averaged around 100 Mbps. However, evolution of IEEE 802.3 standards led to the development of 1 Gbps and 10 Gbps Ethernet networks. The sudden jump in Ethernet speeds from 100 Mbps to 1 Gbps and 10 Gbps requires proportional scaling for TCP/IP processing, so that the network intensive applications can ultimately benefit from the increased network bandwidth. Although IP is expected to scale well with Ethernet, there are some legitimate questions about TCP. ii TCP is a mature technology that has survived the test of time. However, the unique workloads, speed and scale of modern data centers violate some of the basic assumptions that TCP was originally based upon. As a result, when TCP is utilized in high-bandwidth, low-latency data center environments, we discover new shortcomings in the protocol. One such shortcoming is referred to as the ?Incast? problem. TCP Incast is a catastrophic collapse in TCP?s throughput that occurs in high band- width, low latency network environments when multiple senders communicating with a single receiver, collectively send enough data to surpass the buffering abilities of the receiver?s Eth- ernet switch. The problem arises from a subtle interaction between limited Ethernet switch buffer sizes, TCP?s loss recovery mechanisms and the many-to-one synchronized traffic pat- terns. Unfortunately, such traffic patterns occur frequently in many data center applications and services. 
Hence, a feasible solution that addresses the Incast problem is urgently needed. Our objective in this dissertation, is to address TCP?s Incast problem by providing transport layer solutions that are both practical and backward compatible. We approach this goal in two steps. First, we derive an analytical model of TCP Incast. Such a model is essential to understand the reasons behind TCP?s throughput collapse. The analytical model provides a closed form equation, which can be used to compute throughput at the client for various synchronized workloads. We verify the accuracy of our model against measurements taken from ns-2 simulations. Next, we discuss some solutions that were designed to address TCP Incast at the transport layer. Specifically, we develop transport layer solutions that improve TCP?s performance under Incast traffic, by either proactively detecting network congestion through probabilistic retransmission or by dynamically resizing TCP?s segments in order to avoid incurring timeout penalty. We evaluate the merits of the aforementioned solutions using ns-2 simulations. Results show that each of our suggested techniques out- performs standard TCP under various experimental conditions. iii Acknowledgments There are many people in Auburn who deserve my gratitude for helping me pursue my doctoral dreams. Foremost among them is Professor Prathima Agrawal, who has truly been an outstanding advisor. Without her broad vision and continuous support, this dissertation would never have been possible. I shall forever remain indebted to her for her guidance in my research and my career. I would also like to thank Dr. Sa?ad Biaz and Dr. Alvin Lim for serving as members of my advisory committee. I am especially grateful to Dr. Biaz for encouraging me to give my Ph.D. qualifying exams while I was still enrolled in my M.S. program. Thanks are also in order for Dr. Shiwen Mao, the external reader for my dissertation, for reviewing this document and helping me improve it. I also owe much gratitude to Dr. Daniela Marghitu, Dr. David Umphress and Dr. Kai Chang for shaping my graduate student career for the better. I would also like to acknowledge the efforts of Ms. Shelia Collis, Ms. Michele Wheeles, Ms. Jo Lauraitis and Ms. Penny Christopher in helping me keep my school and immigration paper work in order. My thanks also go out to my colleagues at Broun 405. In particular I would like to thank the group of Pratap Simha, Yogesh Kondareddy, Srivathsan Soundararajan, Nirmal Andrews, Nida Bano, Gopalakrishnan Iyer, Vijay Sheshadri and Dongsheng Chen for keeping me company during numerous ?all nighters? and ?working weekends?. I am also deeply indebted to the families of Dr. Dave Sree, Dr. Prathima Agrawal and Mr. Nagaraj Ejantkar for ensuring that I missed none of the festivals celebrated back home. In addition to these families, I would also like to thank my brother Sanjay Kulkarni, my friends Harish Rao, Vijay Sheshadri, Pratap Simha, Deepika Rao, Harsha Banavara, Rak- shith Venkatesh, Abilash Kittanna, Adarsh Jain, Kanika Grover, Prateek Hejmady, Uday iv Pidikiti, Nitilaksh Hiremath, Amith Jain, Sunil Subramanya, Pallavi Rao, Pavan Chan- drashekar, Rashmi Chandrashekar, Ramaraju Yelavarthy, Shripad Nanjundarao and Shree- kanth Rao for all their support, laughs and companionship. Above all, I would like to express my deepest gratitude to my family for their love, compassion and support in my endeavor. Together they define my existence and it is to them that I lovingly dedicate this work. 
v Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 NIST?s Model of Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Cloud Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Cloud Service Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Cloud Deployment Models . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Benefits of Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Cloud Computing and Data Centers . . . . . . . . . . . . . . . . . . . . . . . 8 1.3.1 TCP in Data Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4 Structure of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2 The Transmission Control Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Reliable Data Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.1 Connection Establishment and Multiplexing . . . . . . . . . . . . . . 21 2.2.2 Re-ordering and Duplicate Elimination . . . . . . . . . . . . . . . . . 23 2.2.3 Retransmission of Lost Data . . . . . . . . . . . . . . . . . . . . . . . 24 2.3 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 Congestion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.1 Slow Start and Congestion Avoidance . . . . . . . . . . . . . . . . . . 28 vi 2.4.2 Fast Retransmit and Fast Recovery . . . . . . . . . . . . . . . . . . . 30 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3 Modeling Incast and its Empirical Validation . . . . . . . . . . . . . . . . . . . . 35 3.1 Modeling Incast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.1.1 Model Using Loss Measure of Cumulative Flow . . . . . . . . . . . . 38 3.2 Validation and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.2.1 Comparing with Single Flow Model . . . . . . . . . . . . . . . . . . . 75 3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 Addressing TCP Incast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.1 Existing Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.1.1 Larger Switch Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.1.2 Increasing SRU Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.1.3 Reducing Timeout Penalty . . . . . . . . . . . . . . . . . . . . . . . . 81 4.1.4 Relying on Explicit Congestion Notification . . . . . . . . . . . . . . 83 4.2 Probabilistic Retransmission . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2.1 Retransmit Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.2.3 Summary . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 90 4.3 Dynamic Segment Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.3.1 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.1 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 vii List of Figures 1.1 NIST?s model of Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Pyramid of service models in Cloud Computing . . . . . . . . . . . . . . . . . . 4 1.3 Data Center Switch Network Architecture . . . . . . . . . . . . . . . . . . . . . 10 1.4 Synchronized reads in cluster storage system . . . . . . . . . . . . . . . . . . . . 13 1.5 TCP goodput collapse for synchronized reads . . . . . . . . . . . . . . . . . . . 14 2.1 Layout of a TCP Segment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 TCP Three-way handshake and initial data exchange . . . . . . . . . . . . . . . 22 2.3 Sliding window mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4 Summary of TCP?s congestion control mechanisms . . . . . . . . . . . . . . . . 32 2.5 Evolution of TCP?s congestion window . . . . . . . . . . . . . . . . . . . . . . . 33 3.1 Evolution of W over time when loss indications are TDs . . . . . . . . . . . . . 41 3.2 Packets sent during a TD period . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3 Scenario for Intermediate Block Transfer Timeouts . . . . . . . . . . . . . . . . 50 3.4 Scenario for Anterior Block Transfer Timeouts . . . . . . . . . . . . . . . . . . . 53 3.5 Evolution of W over time when loss indications are TD and TO . . . . . . . . . 54 viii 3.6 Packet and ACK transmissions preceding a loss indication . . . . . . . . . . . . 57 3.7 Setup for n parallel, synchronized TCP flows sharing a bottleneck . . . . . . . . 62 3.8 Comparing Full Model with Incast simulation results . . . . . . . . . . . . . . . 64 3.9 Comparing TD Only model with Incast simulation results . . . . . . . . . . . . 65 3.10 Impact of IBTT on proposed model . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.11 Impact of ABTT on proposed model . . . . . . . . . . . . . . . . . . . . . . . . 67 3.12 Comparing Split Model with Incast simulation results . . . . . . . . . . . . . . . 70 3.13 Performance of Full Model, Split Model and ns-2 with 16 KB switch buffer . . . 71 3.14 Performance of Full Model, Split Model and ns-2 with 32 KB switch buffer . . . 71 3.15 Performance of Full Model, Split Model and ns-2 with 64 KB switch buffer . . . 72 3.16 Performance of Full Model, Split Model and ns-2 with 128 KB switch buffer . . 72 3.17 Performance of Full Model, Split Model and ns-2 with 64 KB SRU . . . . . . . 73 3.18 Performance of Full Model, Split Model and ns-2 with 128 KB SRU . . . . . . . 73 3.19 Performance of Full Model, Split Model and ns-2 with 256 KB SRU . . . . . . . 74 3.20 Performance of Full Model, Split Model and ns-2 with 512 KB SRU . . . . . . . 74 3.21 Comparing n * (equation from Padhye et al.) with Incast simulation results . . 76 4.1 Effect of the size of switch buffers on TCP Incast . . . . . . . . . . . . . . . . . 
79 4.2 Effect of the size of the SRUs on TCP Incast . . . . . . . . . . . . . . . . . . . 80 ix 4.3 Effect of the RTOmin value on TCP Incast . . . . . . . . . . . . . . . . . . . . . 82 4.4 Effect of Retransmission Probability, p, on TCP Incast . . . . . . . . . . . . . . 87 4.5 Effect of Retransmission Probability, p, on Drop Ratio . . . . . . . . . . . . . . 88 4.6 Comparing Probabilistic Retransmission with Default and Modified TCP . . . . 89 4.7 Timeout frequency for different segment sizes when sender count is 5 . . . . . . 92 4.8 Timeout frequency for different segment sizes when sender count is 10 . . . . . . 92 4.9 Timeout frequency for different segment sizes when sender count is 20 . . . . . . 93 4.10 Timeout frequency for different segment sizes when sender count is 50 . . . . . . 93 4.11 Comparing Dynamic Segment Resize with Default TCP . . . . . . . . . . . . . 96 x List of Tables 2.1 TCP Segment Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1 Simulation parameters with default settings . . . . . . . . . . . . . . . . . . . . 63 xi List of Abbreviations ABTT Anterior Block Transfer Timeout ACK Acknowledgment API Application Programming Interface ATM Asynchronous Transfer Mode CE Congestion Experienced DCTCP Data Center Transmission Control Protocol ECN Explicit Congestion Notification FTP File Transfer Protocol HTTP Hypertext Transfer Protocol IaaS Infrastructure as a Service IBTT Intermediate Block Transfer Timeout i.i.d. independent and identically distributed IP Internet Protocol IT Information Technology MSS Maximum Segment Size MTU Maximum Transfer Unit NASD Network Attached Secure Disk xii NIST National Institute of Standards and Technology NNTP Network News Transfer Protocol PaaS Platform as a Service RTO Retransmission Timeout RTT Round Trip Time SaaS Software as a Service SMTP Simple Mail Transfer Protocol SRU Server Request Unit SSH Secure Shell Protocol TCP Transmission Control Protocol TD Triple Duplicate Acknowledgments TDP Period between two consecutive Triple Duplicate ACKs TO TCP Timeout TOR Top Of Rack UDP User Datagram Protocol xiii Chapter 1 Introduction Speaking at the MIT Centennial in 1961, Dr. John McCarthy [1], a leading scientist who pioneered the concept of timesharing [2], said: ?If computers of the kind I have advocated become the computers of the future, then computing may someday be organized as a public utility just as the telephone system is a public utility... The computer utility could become the basis of a new and important industry.? Fifty years on, the Information Technology (IT) industry is finally on the brink of realizing Dr. McCarthy?s vision for computing utilities. With significant advances in information and communications technology over the last five decades, computing is now on the verge of becoming the fifth utility behind water, electricity, gas and telephony. This computing utility, unlike all other four existing utilities, will provide the basic level of computing service that is considered essential to meet the everyday needs of the general community [3]. To herald this new era of utility computing, a number of computing models have been proposed, of which the latest one is known as Cloud Computing. Cloud Computing is a computing paradigm that involves delivering hosted services over the Internet, based on a ?pay-per-use? approach. It derives its name from the ?cloud? 
symbol that is often used to represent the Internet in networking diagrams and, promises to revolutionize the IT industry by making computing available over the Internet, in a fashion very similar to other utilities [4]. However, Cloud Computing is still an evolving paradigm and as yet, there is no single, widely accepted definition for it. Garnter in [5], defines Cloud Computing as a style of computing where a scalable and elastic IT-related capabilities are provided as a service to external customers using Internet technologies. Forrester in [6], suggests that Cloud Computing refers to a pool of abstracted, highly scalable and managed 1 infrastructure capable of hosting end customer applications and billed by consumption. NIST (National Institute of Standards and Technology) in [7], defines Cloud Computing as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Among the numerous definitions for Cloud Computing, the NIST definition is meant to serve as a means for broad comparisons of Cloud services and deployment strategies. The NIST definition is also intended to provide a baseline for discussions ranging from ?What is Cloud Computing?? to ?How to best use Cloud Computing?? [7]. Hence, we adopt NIST?s definition of Cloud Computing for the remainder of this document. 1.1 NIST?s Model of Cloud Computing In accordance to the definition from NIST, Cloud Computing actually covers more than just computing technology. As shown in a three-dimensional diagram in Figure 1.1 from [8], the model of Cloud Computing is actually composed of five essential characteristics, three service models and four deployment models. In the subsections below, we outline the key characteristics of Cloud Computing along with a brief overview on the service models and the deployment approaches that are associ- ated with it. 1.1.1 Cloud Characteristics According to NIST in [7], the essential characteristics of Cloud environment include: ? On-demand self service that enables consumers to unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with any service provider. 2 Figure 1.1: NIST?s model of Cloud Computing ? Broad network access which ensures that all Cloud functionalities and the resources are available over the network and can be accessed through standard mechanisms like thin or thick client platforms (e.g., mobile phones, tablets, laptops and workstations). ? Resource pooling which allows the computing resources provisioned by the provider to be pooled, in order to serve numerous consumers using a multi-tenant model, where dif- ferent physical and virtual resources are dynamically assigned and reassigned according to the demands of the consumer. ? Rapid elasticity and scaling that not only allows the functionalities and resources to scale rapidly outward and inward in accordance to the demands of the consumer, but also allows those capabilities to be elastically provisioned and released. ? Measured service that facilitates automatic control and optimization of resource allo- cations in addition to providing the capability to monitor, control and report resource usage, for both the providers as well as the consumers. 
3 1.1.2 Cloud Service Models In NIST?s model of Cloud Computing, providers offer their services according to three fundamental models, namely, Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) [7]. Among the three service models, IaaS offers the most basic form of Cloud Computing. The three service models can be represented as a pyramid, as depicted in Figure 1.2, where SaaS is at the top and IaaS is at the bottom. Abstraction among the service models increases as we move towards the top of the pyramid in Figure 1.2, while the element of control among the service models increases as we move towards the bottom of the pyramid. Figure 1.2: Pyramid of service models in Cloud Computing ? Software as a Service ? SaaS refers to software applications that are deployed as a hosted services on the Cloud infrastructure. Consumers typically access these applica- tions from client devices that either use a thin client interface, such as a web browser (e.g., web-based email), or use Application Programming Interfaces (API) defined by the hosted software. Under the SaaS service model, the consumers do not manage or 4 control the underlying Cloud platform and Cloud infrastructure. Their only control is usually limited to user specific application configuration settings. Examples of SaaS include: Gmail, Google Docs, Salesforce.com and Microsoft Office 365. ? Platform as a Service ? PaaS refers to the service where, the providers deliver a com- puting platform using which consumers can build and deploy their own applications on the Cloud. The computing platform delivered typically includes operating systems, programming languages, libraries, services and tools supported by the provider. Under the PaaS service model, consumers do not manage or control the underlying Cloud infrastructure. However, they are typically able to control the deployed applications and configuration settings for the application-hosting environment. Examples of PaaS include: Google App Engine, Microsoft Azure and Amazon Elastic Beanstalk. ? Infrastructure as a Service ? IaaS delivers compute services, typically in the form of a set of virtual machines with associated storage, processing capability, other relevant resources like network connectivity [4]. Under this model, consumers are given the ca- pability to provision computing resources that are made available by service providers. Consumers also have the capacity to deploy and run arbitrary software including oper- ating systems and other applications on the provisioned resources. However, consumers do not manage or control the underlying Cloud infrastructure. Their control is limited to operating systems, storage and applications that are deployed by them. Some exam- ples of IaaS include: Amazon CloudFormation, Rackspace Cloud and Google Compute Engine. 1.1.3 Cloud Deployment Models Cloud deployment approaches represent the way providers deploy Cloud service models in order to make Cloud functionalities available to their consumers. Organizations choose 5 Cloud deployment models based on their specific business, operational and technical re- quirements [4]. As depicted in Figure 1.1, NIST categorizes Clouds deployments as Public, Private, Community or Hybrid [7]. ? Public Clouds ? Under Public deployment model, the Cloud functionalities and re- sources are made available for open use to the general public. Customers access and use hosted Cloud services that are either free or offered on pay-per-use basis. 
Gener- ally, public Cloud service providers like Amazon AWS, Microsoft and Google own and operate the infrastructure and offer access to users only via Internet. ? Private Clouds ? Under Private deployment model, the Cloud infrastructure is provi- sioned for exclusive use by a single organization. In this environment, the organization, a third party or some combination of them is in charge of setting up and maintaining the Cloud resources. Accomplishing this requires a significant level of understanding of the organization?s business environment and existing resources. However, when done right, there is an added advantage in terms of better control of security, more effective regulatory compliance and improved quality of services. ? Community Clouds ? Under Community deployment model, the Cloud infrastructure is shared exclusively between organizations from a specific group or community and have common computing concerns. The Cloud infrastructure may be owned, managed and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises. ? Hybrid Clouds ? Under Hybrid deployment model, the Cloud infrastructure consists of two or more distinct Clouds (Public, Private or Community). These composite Clouds remain unique entities, but under the Hybrid model, they are bound together by standardized or proprietary technologies that enable data and application portability. By utilizing this model, organizations are able to obtain degrees of fault tolerance for their mission-critical processes. 6 1.2 Benefits of Cloud Computing Cloud Computing promises numerous benefits, inherent in the characteristics listed in Subsection 1.1.1. According to [4, 9, 10], some of the key benefits offered by Cloud Computing include: ? Lower cost ? Cloud Computing dramatically lowers the cost of entry for smaller firms trying to benefit from compute-intensive business analytics that were hitherto avail- able only to the largest of corporations. Cloud Computing also represents a huge opportunity to many third-world countries that have so far been left behind in the IT revolution. ? Optimization of capital investment ? Cloud Computing allows companies to optimize their capital investments by reducing the costs of hardware and software purchases. Organizations that have peak requirements can now rent additional hardware on the Cloud instead of having to purchase new equipment. Similarly, instead of purchasing separate software packages for each computer in the organization, Cloud Computing allows IT administrators to host the required software on Cloud, which allows for lower installing and maintenance costs. ? Rapid scaling ? Cloud Computing makes it easier for enterprises to scale their services according to the demands of the customer. Since the computing resources are managed through software, services can be deployed very quickly as and when new requirements arise. ? Self service ? Cloud Computing allows users to obtain, configure and deploy Cloud services without requiring human interaction with any of the service providers. Users typically use a service portal provided by the Cloud platform to configure various resources and services. 7 ? Anywhere, anytime access ? Cloud Computing enables true device and location inde- pendence for its users. Users are no longer bound to a single computer, network or geographic location. Users can access Cloud services using a web browser regardless of their location or what device they are using. ? 
Multi-tenancy ? Cloud Computing typically allows single instances of software appli- cations to serve multiple customers, allowing the service providers to leverage on the economies of scale while also reducing maintenance costs. ? Easier collaboration ? Cloud Computing allows multiple users to easily collaborate, as witnessed by Cloud services like Google Docs and Microsoft Office 365, which enable users across different geographical locations to collaborate on documents, spreadhseets and presentations. ? Utility service ? Cloud Computing follows a utility pricing model, which allows users to pay for only those computing resources that they actually use, rather than paying for a dedicated computing resource which may not be fully used except at certain peak times. ? Disaster recovery ? Cloud Computing through virtualization [11, 12], delivers faster recovery times and multi-site availability at a fraction of the cost of conventional sys- tems, making it attractive for enterprises to deploy comprehensive disaster recovery plans for their entire IT infrastructure. 1.3 Cloud Computing and Data Centers In a survey conducted by Cloud.com in the second quarter of 2011, about 61% of the organizations surveyed were either in early stages of planning or had already acquired an approved strategy for implementing Cloud Computing. Furthermore, about 20% of the surveyed participants already had Cloud implementations in their organizations [13]. While 8 the number of organizations leaning towards Cloud related technologies continues to grow, the general public has already embraced Cloud Computing in form of services like Office 365 [14], Facebook [15], Flikr [16], Yahoo Applications [17], Amazon EC2 [18], Youtube [19] and Gmail [20]. Growing adoption of Cloud Computing, by both the IT industry and the general public, is driving service providers into creating new data centers. Data centers are facilities that host hundreds of thousands of servers which concurrently support a myriad of distinct services and applications [21]. Such facilities, not only let service providers leverage the economies of scale for bulk deployments, but also allow them to dynamically relocate resources among services as workloads change or equipments fail [22, 23]. A data center is generally organized in rows of ?racks? where each rack contains modular assets such as servers or storage ?bricks? [24]. These racks are interconnected through Top-of- Rack (TOR) switch, which in turn connects to an Aggregation switch as depicted in Figure 1.3 from [25]. The Aggregation switch connects to other Aggregation switches and through these switches to other servers or storage bricks in the data center. A Core switch connects to the various Aggregation switches and provides connectivity to the outside world, typically through Layer 3 i.e., Network Layer [26]. It can be argued that most of intra-data center traffic traverses only the Top-of-Rack and the Aggregation switches [25]. As a result, the overall performance of services and applications hosted by the data center, largely depends on the efficiency of its underlying communication fabric. There are essentially two high level choices for building communication fabric for data centers. The first option leverages specialized hardware and communication protocols like Infiniband [27], FibreChannel [28] or Myrinet [29]; the second leverages off-the-shelf com- modity products like Ethernet [30] based switches and routers [31]. 
While the first option is capable of scaling up to thousands of nodes, it is generally more expensive (about $500 - $2000 per port [32]) and not natively compatible with TCP/IP applications. On the other 9 Figure 1.3: Data Center Switch Network Architecture 10 hand, the second option is cheap (less than $30 per port [32]), supports a familiar manage- ment infrastructure and requires no modification to applications, operating system or system hardware, but scales poorly with increasing number of nodes. Cost and compatibility reasons persuade many data centers to consider the second option for their baseline communication fabric [33]. Until a few years ago, Ethernet speeds inside data centers averaged around 100 Mbps. However, evolution of IEEE 802.3 standards led to the development of 1 Gbps and 10 Gbps Ethernet networks. Today, 1 Gbps Ethernet networks are being widely deployed, and 10 Gbps will be commonly deployed as it becomes affordable. The sudden jump in Ethernet speeds from 100 Mbps to 1 Gbps and 10 Gbps requires proportional scaling for TCP/IP processing, so that the network intensive applications can ultimately benefit from the in- creased network bandwidth [34]. Although Internet Protocol (IP) [35] is expected to scale well with the evolving Ethernet, there are some legitimate questions about Transmission Control Protocol (TCP) [36] as noted in [37]. 1.3.1 TCP in Data Centers TCP is a mature technology that has survived the test of time. As a standard it has been successfully adapted to several new environments like, long fat networks [38, 39, 40, 41, 42, 43, 44], Asynchronous Transfer Mode (ATM) [45] networks [46, 47], as well as wireless and cellular networks [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58]. However, the unique workloads, speed and scale of modern data centers violate some of the basic assumptions that TCP was originally based upon. For example, in contemporary operating systems such as Linux, the default value of TCP?s retransmission timer is set to 200ms ? a reasonable value for traditional wide area networks where round trip times (RTT) are typically clocked in milli seconds, but two to three orders of magnitude greater than the average round trip time inside data centers [59]. As a result, when TCP is utilized in high-bandwidth, low-latency data 11 center environments, we discover new shortcomings in the protocol. One such shortcoming is referred to as the ?Incast? problem [60]. TCP Incast TCP Incast is a catastrophic collapse in TCP?s throughput that occurs in high band- width, low latency network environments when multiple senders communicating with a single receiver, collectively send enough data to surpass the buffering abilities of the receiver?s Eth- ernet switch. The problem arises from a subtle interaction between limited Ethernet switch buffer sizes, TCP?s loss recovery mechanisms and traffic patterns that are characteristic of data center applications. Small Ethernet buffers get exhausted by a concurrent flood of traf- fic from many servers, which results in packet loss and one or more TCP timeouts. These timeouts impose a delay of hundreds of milliseconds on a network whose round trip time is measured in tens and hundreds of microseconds [61]. As a result of this, the perceived goodput, which can be defined as the data throughput observed at the receiver?s applica- tion, is orders of magnitude lower than the receiver?s link capacity. For example, consider a cluster-based storage system discussed in [62]. 
In a cluster-based storage system, data is typically stored across many storage servers to improve both reliability and performance. Typically, their networks have high bandwidth (1-10 Gbps) and low latency (round trip times of 10-100 µs) with clients separated from the storage servers by one or more switches. In this environment, data blocks are striped over a number of servers, such that each server stores a fragment of a data block denoted as a Server Request Unit (SRU), as depicted in Figure 1.4 from [62]. A client requesting a data block sends request packets to all the storage servers that contain the SRUs for that particular block; the client requests the next block only after it has received all the SRUs for the current block. Such requests are referred to as synchronized reads in [62].

Figure 1.4: Synchronized reads in cluster storage system

However, when performing synchronized reads across an increasing number of servers, a client may observe a TCP throughput drop of one or two orders of magnitude below its link capacity. Figure 1.5 from [62] illustrates TCP's catastrophic performance drop in a cluster-based storage network environment when using HP ProCurve 2848 as the intermediate switch.

Figure 1.5: TCP goodput collapse for synchronized reads

Simulation traces reveal that TCP's retransmission timeouts are the primary cause behind Incast. When goodput degrades, most servers still send their SRUs quickly, but one or more servers experience timeouts from packet loss causing transmission delays. The servers that finish their transfers receive requests for new SRUs only after the client has completely received its previously requested data block, resulting in underutilized links within the network [63]. Unfortunately, such synchronized read patterns occur frequently in many data center applications and services. For example, in cluster storage when storage nodes respond to requests for data [64, 65, 66, 67], in web search when many workers respond near simultaneously to search queries [68, 69, 70, 71], and in batch processing jobs like MapReduce in which intermediate key-value pairs from many "mappers" are transferred to appropriate "reducers" during the "shuffle" stage [72, 73]. Hence, a feasible solution that addresses the Incast problem is urgently needed.

To the best of our knowledge, the problem of Incast has so far never been addressed convincingly. Except for a few attempts in recent literature ([61, 62, 74, 75]), Incast has largely remained unexplored. Most of the current systems attempt to avoid TCP throughput collapse by limiting the number of servers involved in any block transfer, by increasing the size of the data blocks, by relying on enhancements to underlying Ethernet technology, or by drastically reducing the value of TCP's minimum retransmission timeout using system extensions to support microsecond clock granularity. These solutions, however, are typically specific to one configuration (e.g., a number of servers, data block sizes, Ethernet support, availability of microsecond timers, etc.), and thus are not robust to changes in the data center environment. Our goal in this dissertation, therefore, is to provide practical, backward compatible, transport layer solutions to TCP's Incast problem when operating in high bandwidth, low latency data center network environments.
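Before moving on, it may help to make the synchronized read pattern described above concrete. The sketch below shows, in highly simplified form, how a client could fetch a block that is striped across several servers and only move to the next block once every SRU has arrived. It is an illustration of the barrier-synchronized request pattern, not code from any particular storage system; the server count, SRU size and helper names (request_sru, read_block) are hypothetical.

    # Sketch of barrier-synchronized reads. All names and sizes are illustrative.
    SRU_SIZE = 256 * 1024          # bytes per Server Request Unit (hypothetical)
    NUM_SERVERS = 8                # the block is striped across this many servers

    def request_sru(server_id, block_id):
        """Placeholder for sending a request and reading one SRU back over TCP."""
        return b"\x00" * SRU_SIZE  # stands in for the data a real server would return

    def read_block(block_id):
        # One request goes to every server holding a fragment of the block.
        # A real client issues these requests concurrently; the responses still
        # converge on the client's single link and the shared switch buffer.
        fragments = []
        for server_id in range(NUM_SERVERS):
            fragments.append(request_sru(server_id, block_id))
        # The block is complete only when ALL SRUs have arrived; this barrier is
        # what synchronizes the servers' responses and produces the many-to-one burst.
        return b"".join(fragments)

    for block_id in range(4):      # fetch a few consecutive blocks
        block = read_block(block_id)
        # ... the next block is requested only after this one is complete ...

The barrier at the end of read_block is the key feature: if even one server's SRU is delayed by a retransmission timeout, the whole transfer stalls and the client's link sits idle, which is exactly the goodput collapse illustrated in Figure 1.5.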
1.4 Structure of Dissertation This dissertation is organized as follows: In Chapter 2, we provide an overview of the Transmission Control Protocol, including a brief description of some of its features like reliable delivery, flow control and congestion control. In Chapter 3, we derive a simple analytical model for TCP Incast, followed by its empirical validation. In Chapter 4, we describe techniques to address TCP Incast and evaluate the solutions using simulations. Finally we present our conclusions and directions for future work in Chapter 5. 15 Chapter 2 The Transmission Control Protocol For over three decades, Transmission Control Protocol (TCP) [36] has been the de- facto transport protocol for a countless number of networked applications. According to prior studies, TCP accounts for almost 90% of the byte count in the Internet [76, 77]. TCP?s robustness in a wide variety of networking environments is one of the primary reasons for its large scale deployment. The protocol?s ability to provide adequate performance to diverse applications has only been possible through continuous study, improvements and modifications, making TCP one of the most active areas of research [78]. In this chapter we provide an overview of the TCP, with a brief description of some of its features like reliable delivery and congestion control, that are important from an Incast point of view. 2.1 Overview The Internet is a huge network or networks, each implementing the Internet Protocol (IP). IP is the principal communications protocol for transmitting information packets across network boundaries where sources and destinations are hosts identified by fixed length ad- dresses [35]. The design of IP however, assumes that the underlying network infrastructure is inherently unreliable. As a result, IP only provides best effort delivery, meaning, the service it provides is not entirely trustworthy. User applications however, need reliable, in-order delivery with flow control between two communicating endpoints. One possible approach to follow would be to allow each application to implement its own error detection and recovery mechanism. However, given that the mechanism is needed by many applications, advantages of having a common protocol that provides these functionalities, is immediately apparent. Not only would the availability 16 of such a protocol ease the design and implementation of user programs, it would also allow for efficient multiplexing of datagrams from host to the applications [78]. The TCP was specifically designed to provide such a service. TCP described in [36] is a connection-oriented, end-to-end reliable protocol designed to fit into the layered hierarchy just above the Internet Protocol. The TCP provides for reliable inter-process communication between pairs of processes in host computers attached to distinct but interconnected computer communication networks. TCP makes very few assumptions about the reliability of the communication protocols in the layers below itself. TCP only assumes that it can obtain a simple, potentially unreliable datagram service from the layers below. This implies that the the protocol can conceivably operate over a wide spectrum of communication systems ranging from hard-wired connections to packet-switched or circuit-switched networks. Using TCP, applications on networked hosts can create virtual circuits (or connections) to each other, over which they can exchange streams of data. Every byte on a TCP connection has its own 32-bit sequence number. 
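Because the sequence number space is only 32 bits wide, implementations typically compare sequence numbers using modular (wrap-around) arithmetic rather than plain integer comparison; Section 2.2.2 returns to the consequences of this wrap-around. The small sketch below illustrates the convention; the helper name seq_lt is ours, not part of any TCP specification.

    MOD = 1 << 32                       # 32-bit sequence number space

    def seq_lt(a, b):
        """True if sequence number a comes 'before' b, allowing for wrap-around."""
        # Treat the 32-bit difference as signed: 0xFFFFFFF0 is "before" 0x00000010
        # even though it is numerically larger.
        return ((a - b) & (MOD - 1)) > (MOD >> 1)

    assert seq_lt(0xFFFFFFF0, 0x00000010)   # wrapped case
    assert seq_lt(1000, 2000)               # ordinary case
    assert not seq_lt(2000, 1000)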
TCP guarantees reliable, in-order delivery of these bytes from the sender to the receiver. TCP also has the ability to distinguish data for multiple connections by concurrent applications. The sending and receiving TCP endpoints exchange data in the form of segments. A segment consists of a fixed 20-byte header (plus an optional part) followed by zero or more data bytes [79]. Figure 2.1 shows the layout of a TCP segment. Table 2.1 lists the purpose of each field in a TCP segment. (A short sketch at the end of this section illustrates how these fields are packed into the fixed header.) The size of the segments exchanged between two endpoints is controlled by the TCP. TCP even decides whether to accumulate data from several writes into one segment or to split data from a single write over multiple segments. Two limits restrict the size of the TCP segment over a connection. First, each segment including the TCP header must fit into the 65,535 byte IP payload. Second, each network has a Maximum Transfer Unit (MTU), and each TCP segment must fit in the MTU [79].

Figure 2.1: Layout of a TCP Segment

Table 2.1: TCP Segment Fields

  Source Port (16 bits): Identifies the local end point at the source.
  Destination Port (16 bits): Identifies the local end point at the destination.
  Sequence Number (32 bits): Sequence number of the segment's first data byte in the overall connection byte stream.
  Acknowledgment Number (32 bits): Sequence number of the next byte expected by the receiver.
  Header Length (4 bits): Indicates the segment header length in words.
  URG flag (1 bit): Control flag indicating that the Urgent Pointer field is significant.
  ACK flag (1 bit): Control flag indicating that the Acknowledgment Number field is significant.
  PSH flag (1 bit): Control flag requesting the receiver to deliver the data to the application on arrival.
  RST flag (1 bit): Control flag used to reset the connection.
  SYN flag (1 bit): Control flag used in establishing connections.
  FIN flag (1 bit): Control flag to request normal termination of the TCP connection in the direction of the segment.
  Window (16 bits): Used for flow control; indicates the number of bytes that may be sent starting at the byte acknowledged.
  Checksum (16 bits): Provides bit error detection for the TCP segment.
  Urgent Pointer (16 bits): Indicates the position of the first octet of non-expedited data in the segment.
  Options (zero or more 32-bit words): Designed to provide extra facilities not covered by the regular header.

A segment that is too large to fit into the MTU of a network is broken down into multiple fragments by an intermediate router. All resulting fragments get their own IP header and are assembled back into the original segment at the destination. TCP relies on a sliding window protocol to transfer data between two endpoints. When a sender transmits a segment, it also starts a timer. When the segment arrives at the destination, the receiving TCP entity sends back a segment bearing an acknowledgment number equal to the next sequence number it expects to receive. If the sender's timer goes off before the acknowledgment is received, the sender transmits the segment again [79].

Though the operations of TCP sound simple, there are a number of complex situations that the protocol needs to handle. For example, transmitted segments may arrive out of order at the destination. Segments can also get delayed in the network, in which case the sender times out and retransmits them. If the retransmitted segments take a different path to the destination, the receiver can end up with multiple copies of the same bytes in the stream. Additionally, if the segment is fragmented, part of the fragmented segments may never arrive at the destination. And last but not the least, a segment may occasionally hit a congested link along its path to the destination. TCP must be prepared to deal with these situations in an efficient way. A considerable amount of effort has gone into making TCP robust for all network situations. Some of these techniques used by many TCP implementations will be discussed in the sections below.
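The fixed 20-byte header of Figure 2.1 and Table 2.1 can be described compactly with Python's struct module. The sketch below packs and unpacks such a header in network byte order with options omitted; it is only an illustration of the field layout (the function names and example values are ours), not a working protocol implementation, and in particular it does not compute a real checksum.

    import struct

    # Fixed 20-byte TCP header, network byte order, no options:
    # source port, destination port, sequence number, acknowledgment number,
    # data offset + flags, window, checksum, urgent pointer.
    TCP_HEADER = struct.Struct("!HHIIHHHH")

    def pack_header(src_port, dst_port, seq, ack, flags, window,
                    checksum=0, urgent=0, data_offset=5):
        # Upper 4 bits: header length in 32-bit words; lower bits: control flags.
        offset_flags = (data_offset << 12) | flags
        return TCP_HEADER.pack(src_port, dst_port, seq, ack,
                               offset_flags, window, checksum, urgent)

    def unpack_header(raw):
        src, dst, seq, ack, offset_flags, window, checksum, urgent = \
            TCP_HEADER.unpack(raw[:20])
        data_offset = offset_flags >> 12          # header length in words
        flags = offset_flags & 0x01FF             # SYN, ACK, FIN, ... bits
        return src, dst, seq, ack, data_offset, flags, window

    SYN, ACK_BIT = 0x02, 0x10                     # two of the flag bit positions
    hdr = pack_header(5000, 80, seq=1, ack=0, flags=SYN, window=65535)
    print(unpack_header(hdr))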
2.2 Reliable Data Delivery

In this section, we describe various mechanisms of TCP that are involved in ensuring in-order transfer of stream bytes between source and destination endpoints, as well as multiplexing of network traffic to different application processes.

Transmission in TCP is made reliable via the use of sequence numbers and acknowledgments. Conceptually, each byte of data is assigned a sequence number. The sequence number of the first byte of data in a segment is also the segment sequence number and is transmitted along with the segment in the segment header. Segments also carry an acknowledgment number, which is the sequence number of the next expected data byte of transmissions in the reverse direction. When TCP transmits a segment containing data, it puts a copy on a retransmission queue and starts a timer; when the acknowledgment for that data is received, the segment is deleted from the retransmission queue. If the acknowledgment is not received before the timer runs out, the segment is retransmitted [36].

In addition to sequence numbers and acknowledgments, TCP's solution for delivering data reliably over an unreliable internet communication system involves the following three mechanisms:

- Establishing connection state at communicating endpoints
- Handling data duplication and reordering
- Handling data loss

The first step in ensuring reliable in-order data delivery between two hosts is the setup of connection state at each endpoint [80], as discussed in the subsection below.

2.2.1 Connection Establishment and Multiplexing

In order to provide reliable data delivery, TCP needs to initialize and maintain certain status information for each data stream. The combination of this information along with sockets, sequence numbers and window sizes forms a TCP connection or a virtual circuit. When two processes wish to communicate, their TCP stacks must first establish a connection (initialize the status information on each side). When their communication is complete, the connection is terminated in order to free the resources for other uses [36].

Since connections must be established between processes over the unreliable internet communication system, TCP uses a handshake mechanism with clock-based sequence numbers. The procedure to establish a TCP connection involves exchanging three segments between communicating endpoints, utilizing the synchronize (SYN) control flag in the segment header. This exchange has been termed a three-way handshake [36] and is depicted in Figure 2.2. Unlike other connection establishing protocols, the three-way handshake does not require communicating endpoints to begin transmissions with the same sequence numbers. Furthermore, the three-way handshake can be used to establish a TCP connection even in the absence of a global clock. The mechanism can also prevent old connection initializations and data packets from causing any confusion.

Figure 2.2: TCP Three-way handshake and initial data exchange
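As a small worked example of the exchange in Figure 2.2, the sketch below traces the sequence and acknowledgment numbers carried by the three handshake segments. The initial sequence numbers are arbitrary illustrative values; real stacks derive them from a clock or a randomized generator, as noted above.

    # Each endpoint picks its own initial sequence number (ISN); the SYN itself
    # consumes one sequence number, which is why the ACKs carry ISN + 1.
    client_isn = 1_000          # illustrative values only
    server_isn = 5_000

    # 1. Client -> Server: SYN
    syn = {"flags": "SYN", "seq": client_isn}

    # 2. Server -> Client: SYN+ACK, acknowledging the client's SYN
    syn_ack = {"flags": "SYN+ACK", "seq": server_isn, "ack": syn["seq"] + 1}

    # 3. Client -> Server: ACK, acknowledging the server's SYN; the connection
    #    is now established and data may flow in either direction.
    ack = {"flags": "ACK", "seq": client_isn + 1, "ack": syn_ack["seq"] + 1}

    for segment in (syn, syn_ack, ack):
        print(segment)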
Additionally, the endpoints can exchange parameter and option information such as MSS, during connection establishment [80]. The process which initiates the three-way handshake does so by issuing an active open request. Processes can also issue passive opens and wait for matching active opens from other processes and be informed by the TCP when connections have been established. Two processes which issue active opens to each other at the same time will be correctly connected. This flexibility is critical for the support of distributed computing in which components act asynchronously with respect to each other [36]. TCP provides 16-bit port identifiers to distinguish separate data streams that the pro- tocol might handle. Since port identifiers are selected independently by TCP at each com- municating endpoint, many endpoints in the network can pick the same identifier for a port. To provide for unique addresses for all communicating processes, TCP concatenates the IP address identifying the end point with the port identifier that identifies the process, to create a socket which is unique throughout all networks connected together. A connection is fully specified by the pair of sockets at the ends. At each endpoint, the TCP examines the port identifiers in the received segment and places the segment in the receive buffer of the process associated with that port [80]. A range of port identifiers is reserved for well-known user applications such as HTTP [81], FTP [82], SMTP [83], NNTP [84] and SSH [85, 86, 87, 88, 89, 90]. 2.2.2 Re-ordering and Duplicate Elimination In this subsection we describe TCP?s mechanisms which allow data to be re-ordered at the receiver and duplicate data to be eliminated. Packet reordering refers to the network behavior where the relative order of some seg- ments in the same connection is altered when these segments are transported over the net- work. In other words the receiving order of a stream of segments differs from the sending order. 23 TCP has the ability to recover from data that is damaged, lost, duplicated or delivered out of order by the internet communication system. It achieves this by assigning a sequence number to each transmitted byte and requiring a positive acknowledgment from the receiv- ing endpoint [36]. The receiving endpoint can detect transmission errors by computing a checksum on the received segment and comparing it to the checksum value in the received segment?s header. If the checksum test fails, TCP discards the segment. Otherwise, it checks to see if the received sequence number falls within the acceptable range of sequence numbers defined by the receive window, rwnd. In TCP, the receive window indicates the allowed number of bytes that the sender may transmit before receiving further permission from the receiver. A data byte whose sequence number does not fall within the sequence number range defined by the receive window is discarded by the TCP. Bytes whose sequence numbers fall within the sequence number range specified by rwnd but are not equal to rwnd?s start sequence number are buffered by the TCP. This allows TCP to properly re-order any out of order data. On the other hand, bytes which are received in-order, advance the range boundaries defined by rwnd. Duplicate data in TCP may result from segment duplication by faulty devices, from the finiteness of the sequence space (wrap around), from the presence of segments in the network sent by earlier incarnations of the connection or from retransmissions from the source [80]. 
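The acceptance rules just described (drop bytes already delivered, discard data outside the receive window, buffer out-of-order data inside the window, and advance the window on in-order data) can be sketched in a few lines. The class below is a toy illustration under those assumptions; it ignores checksums, window updates and partially overlapping segments, and the names (ReceiveBuffer, on_segment, rcv_nxt) are ours.

    class ReceiveBuffer:
        """Toy in-order reassembly buffer keyed by byte sequence number."""

        def __init__(self, rcv_nxt, rwnd):
            self.rcv_nxt = rcv_nxt          # next in-order byte expected
            self.rwnd = rwnd                # advertised receive window (bytes)
            self.out_of_order = {}          # seq -> bytes held back for later

        def on_segment(self, seq, data):
            # Old duplicates and data beyond the window are discarded.
            if seq < self.rcv_nxt or seq + len(data) > self.rcv_nxt + self.rwnd:
                return b""
            if seq != self.rcv_nxt:
                self.out_of_order[seq] = data   # out-of-order: buffer, deliver nothing
                return b""
            # In-order data: deliver it plus any buffered data it now connects to.
            delivered = data
            self.rcv_nxt += len(data)
            while self.rcv_nxt in self.out_of_order:
                chunk = self.out_of_order.pop(self.rcv_nxt)
                delivered += chunk
                self.rcv_nxt += len(chunk)
            return delivered

    rb = ReceiveBuffer(rcv_nxt=100, rwnd=1000)
    print(rb.on_segment(110, b"world"))        # arrives early: buffered, nothing delivered
    print(rb.on_segment(100, b"hello12345"))   # fills the gap: both chunks delivered in order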
In order to limit the possibility of duplicate segments from previous instances of the same connection being erroneously accepted, TCP starts the numbering of data bytes with a ?random? value when initiating the connection. 2.2.3 Retransmission of Lost Data In this subsection we describe TCP?s strategy for loss recovery. The strategy employed by TCP mainly relies on positive acknowledgments and timer based retransmissions. 24 Acknowledgments The receipt of each transmitted byte has to be acknowledged by the receiving endpoint. TCP acknowledgments carry the sequence number of the next byte that the destination ex- pects to receive. This strategy is referred to as ?positive acknowledgment? strategy [80]. The acknowledgment mechanism employed by TCP is ?cumulative? meaning, an acknowledgment of sequence X indicates that all bytes up to but not including X have been received by the destination. This mechanism allows for straight forward duplicate detection in presence of retransmission [36]. If a received segment?s sequence number does not match rwnd?s current start sequence number, it elicits an acknowledgment for the start sequence number of rwnd. Such ACKs, called duplicate ACKs, stimulate the sender to retransmit the segment that appears to be missing [80]. It is important to note that an acknowledgment received by the sending endpoint does not guarantee that the data has been delivered to the end user, but only that the TCP at the receiving endpoint has taken the responsibility to do so. Retransmission Queue When TCP transmits a segment containing data, it puts a copy of the segment on a retransmission queue and starts a timer that is initialized to a dynamically computed retransmission timeout (RTO) value; when the acknowledgment for that data is received, the segment is deleted from the retransmission queue. If the acknowledgment is not received before the timer runs out, the segment is retransmitted [36]. Note that segments carrying no data are not transmitted reliably, except for segments carrying the SYN or FIN flag. In addition, a ?fast? retransmission of the segment at the head of the retransmission queue can be triggered by the reception of at least three duplicate ACKs before the expiry of the retransmission timer [80]. In both cases the retransmission is followed by congestion control measures that are discussed in section 2.4 25 Note that some implementations of TCP, organize the data in retransmit queue in seg- ments, as they were originally transmitted, while others do not keep the segment boundaries. In the first case, when the retransmission timer expires, the segment at the head of the queue is retransmitted. In the second case, a new segment can be created from the data at the head of the retransmission queue. The data in the newly created segment can span over multiple previous segments. This results in more efficient use of the network by decreasing the segment header overhead. 2.3 Flow Control Flow control is a technique whose primary purpose is to properly match the transmission rate of the sending end point to that of the receiving end point [91]. TCP uses sliding window mechanism to provide flow control, whereby the receiving end point returns a ?window? in each ACK, indicating a range of acceptable sequence numbers beyond the last segment that was successfully received. The window, called receive window or rwnd, indicates the allowed number of bytes that the sender may transmit before receiving further permission. 
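From the sender's point of view, the advertised window bounds how much unacknowledged data may be outstanding at any instant: the usable window is the advertisement minus the data already in flight. The sketch below is a minimal illustration of this bookkeeping, ignoring congestion control (which Section 2.4 adds) and segment boundaries; the class and variable names are ours.

    class SlidingWindowSender:
        """Toy sender limited purely by the receiver's advertised window."""

        def __init__(self, iss):
            self.snd_una = iss      # oldest unacknowledged byte
            self.snd_nxt = iss      # next byte to be sent
            self.rwnd = 0           # last window advertised by the receiver

        def usable_window(self):
            in_flight = self.snd_nxt - self.snd_una
            return max(self.rwnd - in_flight, 0)

        def on_ack(self, ack, window):
            self.snd_una = max(self.snd_una, ack)   # slide the left edge of the window
            self.rwnd = window                      # adopt the new advertisement

        def send(self, nbytes):
            allowed = min(nbytes, self.usable_window())
            self.snd_nxt += allowed                 # right edge moves as data is sent
            return allowed

    s = SlidingWindowSender(iss=0)
    s.on_ack(ack=0, window=4096)
    print(s.send(10000))      # only 4096 bytes may be sent
    print(s.usable_window())  # window exhausted -> 0; a zero window stalls the sender
    s.on_ack(ack=4096, window=4096)
    print(s.send(10000))      # the ACK slid the window: another 4096 bytes allowed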
Since TCP's rwnd field is limited to 16 bits in length, it provides for a maximum window size of 65,535 bytes. Figure 2.3 illustrates the concept of the sliding window. In this simple example, the sliding window spans over four bytes of the data stream. The sequence numbers within the sender's window represent the bytes sent but as yet not acknowledged. All sequence numbers to the left of the sliding window are bytes that were transmitted and also acknowledged; sequence numbers to the right of the sliding window are bytes that are yet to be transmitted. Moving from left to right, the window "slides" as bytes in the window get acknowledged and new bytes get transmitted.

Figure 2.3: Sliding window mechanism

A receiver can adjust the window size each time it sends the acknowledgments to the sender. The maximum transmission rate is ultimately bound by the receiver's ability to accept and process data. If the receiver is incapable of accepting any new data, it can announce a "zero receive window" in an ACK, which forces the sender TCP to stall its data transmission. A sender which receives a zero window advertisement for rwnd regularly probes the receiver for window updates. This is because the underlying IP protocol only provides a best effort service, due to which an ACK carrying a window update from the receiver can sometimes fail to reach the sender. TCP at the sending endpoint sends the first probe after a retransmit timeout period and sends the subsequent ones at exponentially increasing time periods [92].

TCP at the sending end point also deals with the case where the receiver advertises a window that is smaller than the amount of data already in the network (which corresponded to a previously advertised window value). This case, labeled "shrinking window", causes the sender to wait for the receive window, rwnd, to open up beyond the previously sent limit before sending any new data [93].

2.4 Congestion Control

Congestion control in TCP is concerned with controlling the entry of segments into the network in order to avoid overwhelming the processing or link capabilities of the intermediate nodes. This section describes TCP's four intertwined algorithms that are implemented as part of the protocol's congestion control strategy: slow start, congestion avoidance, fast retransmit, and fast recovery. The following subsections discuss these algorithms in detail.

2.4.1 Slow Start and Congestion Avoidance

All TCP senders use the slow start and congestion avoidance algorithms to control the amount of outstanding data being injected into the network. To implement these algorithms, TCP makes use of two variables, namely, the congestion window (cwnd) and the receiver's advertised window (rwnd). The congestion window is the sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment, while the receiver's window is a receiver-side limit on the amount of outstanding data. The minimum of cwnd and rwnd governs TCP's data transmission.

Another variable, the slow start threshold (ssthresh), is used by TCP to determine the algorithm to employ, slow start or congestion avoidance, in controlling data transmission. Starting data transmission with unknown network conditions requires TCP to slowly probe the network to determine the available capacity, in order to avoid congesting the network with large bursts of data. TCP's slow start algorithm is used for this purpose.
It is used either at the very beginning of a data transfer or after repairing a loss detected by TCP's retransmission timer. In both situations, TCP is unaware of the current state of the network and must probe it for available capacity.

Initially, TCP sets ssthresh to an arbitrarily high value, but reduces it in response to congestion. Setting ssthresh high initially ensures that network conditions, rather than some arbitrary host limit, dictate the sending rate. The slow start algorithm is used when cwnd ≤ ssthresh, while the congestion avoidance algorithm is used when cwnd > ssthresh.

During slow start, TCP increments its cwnd by at most one maximum segment size (MSS) for each ACK received. Slow start ends when cwnd exceeds ssthresh or when congestion is observed.

Slow start is actually not very slow when the network is not congested and the network response time is good. For example, the first successful transmission and acknowledgment of a TCP segment increases cwnd to two segments. After successful transmission and acknowledgment of these two segments, cwnd is doubled to four segments, then eight segments, then sixteen segments and so on, up to the maximum window size (rwnd) advertised by the receiver or until congestion is observed in the network.

During congestion avoidance, cwnd is incremented by roughly one MSS per round-trip time. Congestion avoidance continues until congestion is detected. A common formula used by various TCP implementations to update cwnd during the congestion avoidance phase is given in equation 2.1:

\[ cwnd = cwnd + \frac{MSS \times MSS}{cwnd} \tag{2.1} \]

This adjustment is executed on every incoming ACK that acknowledges new data during the congestion avoidance phase.

When a TCP sender detects segment loss through the retransmission timer and the given segment has not yet been retransmitted, TCP sets the value of ssthresh according to equation 2.2. Furthermore, upon a timeout, TCP sets the value of cwnd to one MSS. Therefore, after retransmitting the dropped segment, the TCP sender uses the slow start algorithm to increase its congestion window (cwnd) from one MSS to the new value of ssthresh, at which point congestion avoidance again takes over [94].

\[ ssthresh = \max\left( \frac{\text{segments in flight}}{2},\; 2 \times MSS \right) \tag{2.2} \]

2.4.2 Fast Retransmit and Fast Recovery

When the destination receives an out-of-order segment, TCP at the receiving endpoint immediately sends back a duplicate ACK to the sender. The duplicate ACK informs the sender that the destination received a segment that was out of order. The acknowledgment number in the duplicate ACK also informs the sender of the byte sequence number that the destination expects.

From the sender's perspective, duplicate ACKs can be caused by a number of network problems. First, they can be caused by dropped segments: in this case, all segments after the dropped segment trigger duplicate ACKs until the loss is repaired. Second, duplicate ACKs can be caused by the re-ordering of data segments by the network. Finally, duplicate ACKs can be caused by replication of ACK or data segments by the network.

TCP's fast retransmit algorithm uses the arrival of three consecutive duplicate ACKs as an indication that a segment has been lost. After receiving three duplicate ACKs, the sender retransmits the missing segment without waiting for the retransmission timer to expire.
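As an illustration of the window adjustments discussed in this subsection (the per-ACK growth rules and the timeout reaction of equations 2.1 and 2.2), the sketch below tracks cwnd and ssthresh in Python. It is a simplified, hypothetical model rather than a faithful TCP implementation; the MSS value is an assumption.

```python
MSS = 1460  # bytes; assumed segment size for this sketch

class CongestionState:
    """Toy model of slow start, congestion avoidance and timeout reaction."""

    def __init__(self):
        self.cwnd = 1 * MSS
        self.ssthresh = float("inf")   # "arbitrarily high" initial value

    def on_new_ack(self):
        if self.cwnd <= self.ssthresh:
            # Slow start: at most one MSS per ACK, i.e. exponential growth per RTT.
            self.cwnd += MSS
        else:
            # Congestion avoidance, equation 2.1:
            # cwnd <- cwnd + MSS*MSS/cwnd, roughly one MSS per RTT.
            self.cwnd += (MSS * MSS) / self.cwnd

    def on_timeout(self, flight_size):
        # Equation 2.2: ssthresh <- max(flight_size/2, 2*MSS), then cwnd <- 1 MSS.
        self.ssthresh = max(flight_size / 2, 2 * MSS)
        self.cwnd = 1 * MSS


state = CongestionState()
for _ in range(8):                    # eight ACKs while still in slow start
    state.on_new_ack()
state.on_timeout(flight_size=state.cwnd)
print(int(state.cwnd), int(state.ssthresh))
```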
After TCP's fast retransmit algorithm sends the missing segment, the protocol's fast recovery algorithm governs the transmission of new data until the sender receives a non-duplicate ACK from the destination. The reason TCP does not perform slow start at this stage is that, in addition to indicating a segment loss, duplicate ACKs also inform the sender that segments are most likely leaving the network. TCP implements the fast retransmit and fast recovery algorithms in the following manner:

• On the first and second duplicate ACKs received by the sender, TCP sends a segment of previously unsent data, provided the receiver's rwnd allows for it. TCP does not change its cwnd to reflect the transmission of these two segments.

• When the third duplicate ACK is received at the sender, TCP sets ssthresh to the value given in equation 2.2.

• When the third duplicate ACK is received, following the reset of ssthresh, TCP sets its cwnd to (ssthresh + 3 × MSS), ensuring that cwnd is artificially inflated by the number of segments that have left the network.

• For each additional duplicate ACK that the sender receives, TCP increments its cwnd by one MSS.

• When the sender finally receives an ACK that acknowledges previously unacknowledged data, TCP sets cwnd to ssthresh. This step is also known as "deflating" the congestion window (cwnd).

A summary of TCP's congestion control mechanisms is depicted in Figure 2.4. An illustration of how TCP's congestion window evolves under the aforementioned congestion control algorithms is shown in Figure 2.5.

Figure 2.4: Summary of TCP's congestion control mechanisms

Figure 2.5: Evolution of TCP's congestion window

In Figure 2.5, TCP begins by setting its slow start threshold, ssthresh, to an arbitrarily high value. It then starts its data transfer using the slow start algorithm to determine the available capacity in the network. During this phase, TCP's congestion window cwnd grows exponentially. In the example, the slow start phase ends when TCP experiences a timeout. Following the timeout, TCP sets its ssthresh to half the number of segments that were in flight before the timeout. The protocol also sets the size of its cwnd to one. Since cwnd is now less than ssthresh, TCP resumes its data transfer with slow start. As before, cwnd grows exponentially as long as cwnd ≤ ssthresh. When cwnd > ssthresh, TCP's slow start phase ends. TCP then continues its data transfer using the congestion avoidance algorithm. During this phase, cwnd grows linearly until TCP receives 3 duplicate ACKs. On receiving 3 duplicate ACKs, TCP ends its congestion avoidance phase and invokes the fast retransmit and fast recovery algorithms. This congestion avoidance, fast retransmit, fast recovery cycle continues until TCP experiences another timeout. Following a timeout, TCP resumes its data transfer with the slow start algorithm as before. The resulting evolution pattern of TCP's congestion window cwnd is often referred to as "TCP's sawtooth behavior".

2.5 Summary

In this chapter, we presented details of the mechanisms that are responsible for TCP's reliable data transfer, flow control and congestion control. Our goal in this chapter is not only to provide the necessary background for the following chapters, but also to help readers working with TCP gain a better understanding of the protocol.

We note that TCP is a highly dynamic protocol, especially when the details of its implementations are considered.
Many developers independently add non-standard modifications and enhancements to standard implementations of TCP. Moreover, due to the complexity of the protocol and some ambiguity in its specification, many developers allow themselves the freedom to deviate from the standard behavior to provide simplicity or inter-operability with other implementations of TCP. Therefore the information contained in this chapter may not apply to every implementation of TCP. 34 Chapter 3 Modeling Incast and its Empirical Validation Transmission Control Protocol is the workhorse for several application layer protocols like HTTP [81], FTP [82], SMTP [83], NNTP [84] and SSH [85, 86, 87, 88, 89, 90]. As a result, a significant amount of today?s Internet traffic is carried by the TCP [76]. Studies have shown that traffic from TCP and UDP [95] make for more than 96% of the packets in the Internet. TCP alone accounts for almost 82% of packets and about 91% of the byte count in the Web [77]. TCP also accounts for the bulk of traffic in data centers. TCP is at the core of several data center applications like distributed filesystems [96, 97], cluster computing [98, 99], parallel databases [100] as well as disaster recovery [101, 102]. However, recent works have shown that under certain many-to-one traffic patterns, data center networks experience Incast: a drastic collapse in throughput due to TCP timeouts triggered by severe losses at Ethernet [30] switches [61, 74, 103]. In typical Incast communication pattern, a receiver issues synchronized data requests to multiple senders. The senders, upon receiving the request, concurrently transmit a large amount of data to the receiver. The data from all senders traverse a bottleneck link in a many-to-one fashion. As the number of concurrent senders increases, the per- ceived application-level throughput at the receiver collapses. The receiver application sees throughput that is orders of magnitude lower than the link capacity [59]. TCP throughput collapse was first observed in early parallel network storage projects such as NASD [104]. It was later documented as part of a larger paper by Nagle et al in [60]. Today, the same Incast communication pattern can be found in many popular data center applications such as cluster based storage systems [60, 64, 105], data analytics [106, 107, 108], Big Data [109], 35 MapReduce [72] as well as Hadoop [73]. Hence a thorough solution that addresses the Incast pathology is urgently needed. To substantially solve TCP Incast at low cost, we first need to understand the reasons behind its throughput collapse. Traditionally, simulation and implementation/measurement have been tools of choice for examining the performance of various aspects of TCP. In this chapter we develop a simple analytic characterization of the steady state throughput of multiple TCP flows, as a function of loss rate and round trip time under many-to-one Incast communication pattern. Although many earlier works have already modeled TCP [110, 111, 112, 113, 114, 115], our modeling is different in two aspects: 1. The application in our model exhibits Incast communication pattern whereas existing models usually assume that the application layer has infinite amount of data to send. 2. Our model describes the overall throughput of the bottleneck link which contains mul- tiple flows, while existing TCP models usually focus on the throughput of a single flow. 
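To make the barrier-synchronized, many-to-one request pattern described above concrete, the toy calculation below shows how a single retransmission timeout at one sender stalls the entire block transfer. All numbers are invented for illustration; only the 200 ms figure mirrors the RTO_min used later in the simulations of Section 3.2.

```python
def block_completion_time(sender_times):
    # The client issues the next block request only after *all* senders finish,
    # so each block completes when the slowest sender does.
    return max(sender_times)

healthy = [0.004, 0.005, 0.005, 0.004]               # seconds per SRU, all senders healthy
with_timeout = [0.004, 0.005, 0.005, 0.004 + 0.200]  # one sender waits out a 200 ms RTO

ratio = block_completion_time(healthy) / block_completion_time(with_timeout)
print(f"goodput falls to about {ratio:.0%} of the timeout-free case")
```

Even with three of the four senders finishing on time, the shared link sits idle for the full timeout period, which is the order-of-magnitude collapse the rest of this chapter models.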
In our TCP Incast model, we summarize that the throughput collapse in many-to-one communication pattern is mainly caused by two kinds of timeouts. ? Anterior Block Transfer Timeout (ABTT): Anterior Block Transfer Timeouts happen when a large number of senders get involved in a many-to-one synchronized data trans- fer. During the transfer of a block, some senders finish transmitting their blocks early due to TCP?s unfairness at small timescales. Such completed flows wait for other senders to finish transmitting their blocks, without consuming any of the available bandwidth. Meanwhile the remaining flows finish transmitting their blocks using ad- ditional bandwidth vacated by the completed flows. This results in larger transmission window for some flows by the end of the block transfer. At the beginning of the next block transfer, all senders inject their whole windows into the network overwhelming the small buffers at the intermediate Ethernet switch. This results in a lot of dropped 36 packets and if any flow loses all the packets in its window, then it will enter a timeout period. ? Intermediate Block Transfer Timeout (IBTT): Unlike Anterior Block Transfer Time- outs, Intermediate Block Transfer Timeouts are not limited to the start of a block transfer. IBTTs are caused when a participating sender fails to receive enough dupli- cate ACKs to trigger Fast Recovery following the loss of transmitted packets during a block transfer. The sender waits for a period of time defined by TCP?s timeout before retransmitting its unacknowledged packets. Following a timeout, the conges- tion window is reduced to one, and only one packet is resent in the first round after the timeout. However, because of the synchronized nature of the Incast traffic, the receiver cannot issue its next request until all the senders have finished transmitting their current blocks. Investigating the causes behind the aforementioned category of timeouts is beneficial in developing an effective solution that will avoid the ill effects of TCP Incast. 3.1 Modeling Incast More than a decade after its publication in [110], the steady state throughput equation of TCP by Padhye et al. remains the most widely used method for calculating the throughput that a TCP sender achieves under certain environmental conditions. While there now is a wealth of other models available (e.g. [111, 112, 113, 114, 115]), many of which are better in some aspect, none of them seem to strike the same balance between precision and ease of use that makes equation from [110] the useful tool that it is. In an effort to enable practical calculation of the throughput in Incast communication pattern, we extend the equation from [110] to multiple synchronized TCP flows across a single bottleneck link. We do this by following the basic approach in [110], but considering 37 a number of synchronized flows using an identical path at the same time instead of a single flow. 3.1.1 Model Using Loss Measure of Cumulative Flow In order to derive an equation for the throughput in Incast communication pattern, we extend the model presented in [110]. We assume that the reader is familiar with this work and therefore will only repeat the preliminary assumptions where needed and shortly repeat necessary definitions. Consider n parallel TCP flows f1,...,fn sharing the same bottleneck link inside a data center network. 
Like in [110], we too model the congestion avoidance phase of these n flows in terms of ?rounds?, assuming furthermore that the flows are synchronized in rounds (i.e. in a round, all flows send packets in their current congestion window before the next round starts for all of them). For each flow f, the round starts with the back-to-back transmission of Wf packets, where Wf is the size of the flow?s current congestion window. Once all packets falling within the congestion window of all n flows have been sent in this back-to- back manner, no other packets are sent until each flow f, receives an ACK for one of its Wf packets already sent. The first ACK reception by all senders marks the end of the current round and the beginning of the next round. In this model, the duration of a round is equal to the round trip time and is assumed to be independent of the window size. Note that another assumption here is that the time needed to send all the packets in a window is smaller than the round trip time. At the beginning of the next round, a group of W?f new packets will be sent by each flow f, where W?f is the new size of the flow?s TCP congestion window. Assume that the receiver acknowledges every packet received with an ACK. Many TCP receiver implementations can be configured to send one ACK for every packet received. If Wf packets are sent by a flow f, in the first round and all are received and acknowledged correctly, then the flow will receive Wf acknowledgments. Since each acknowledgment increases the flow?s congestion window 38 size by parenleftBig 1 Wf parenrightBig , the congestion window size for the flow f, at the beginning of the next round is W?f = Wf + 1. That is, during congestion avoidance and in the absence of loss, the congestion window size of each flow increases linearly in time, with the slope of one packet per round trip time. Loss of packets in TCP can be detected in one of two ways, either by the reception of three ?duplicate ACKs? by the sender or via timeouts. We denote the former event as a ?TD? for triple-duplicate ACK loss indication and ?TO? for timeout loss indication. When the loss indicating event is a TD, the composite flow f reduces its congestion window Wf, by a factor of two. On the other hand if the loss indication is of type TO, the composite flow f waits for a period of time denoted by T0 and then reduces the size of its congestion window Wf to one before retransmitting its unacknowledged segments. As in [110], we too assume that a packet is lost in a round independently of any packets lost in other rounds. On the other hand we assume that packet losses are correlated among the back-to-back transmissions within a round, i.e., if a packet is lost, all remaining packets transmitted until the end of that round, irrespective of which flow they belong to, are also lost. This bursty loss behavior which has been shown to arise from the drop-tail queuing discipline in [116, 117], perfectly matches the queue management policy of the Ethernet switches used in data center networks. Cumulative Flow Now, consider F to be the cumulative flow of n parallel, synchronized TCP flows f1,...,fn, sharing the same bottleneck link in a data center network. Let W be the cumula- tive window size of all n composite flows. Because the composite flows are all synchronized, W is essentially the sum of all n congestion windows. As with the single sender, each round starts with the back-to-back transmission of a total of W packets belonging to n flows. 
If all W packets are sent, received and acknowledged correctly, then the participating n flows together receive W acknowledgments. Since each acknowledgment increases the individual flow f's congestion window by \( \frac{1}{W_f} \), the cumulative congestion window at the beginning of the next round is \( W' = \sum_{i=1}^{n} (W_{f_i} + 1) \), which implies \( W' = W + n \), as there are n parallel flows involved. This means that, when all n composite flows are in the congestion avoidance phase and none of them experience a loss, the cumulative window size of all n flows increases linearly in time, with a slope of n packets per round trip time.

Note that we have assumed packets lost in the same round to be correlated (i.e., if a packet is lost, all remaining packets transmitted until the end of that round, irrespective of which flow they belong to, are also lost). Hence, more than one of the n composite flows could experience a loss event in the same round. However, TCP flows that experience a loss reduce their congestion window only once per round trip time. Since the flows are all synchronized in terms of rounds, the resulting cumulative window W is also modified only once per round trip time. Hence, in the event of correlated losses, recognizing a packet loss in a composite flow f serves as a loss event indicator for the cumulative flow F.

We define a TD-period (TDP) for the cumulative flow F as the period between two consecutive loss event indicators. For the ith TD-period, TDP_i, we define A_i to be the duration of the period. A sample path of the evolution of the cumulative window W is shown in Figure 3.1. Between two TD loss indications, the composite flows are all in congestion avoidance and the cumulative window increases by n packets per round, as discussed above. Immediately after a loss indication occurs, any composite flow f experiencing a loss reduces its congestion window size W_f by a factor of two. This implies that a loss-experiencing cumulative flow F will also reduce its cumulative window W by \( \frac{W_f}{2} \) packets.

In the following subsections, we model the cumulative flow's behavior in the presence of packet losses. We develop a stochastic model of the cumulative flow corresponding to its operating regimes: when loss indications are exclusively TD and when loss indications are both TD and TO. In the process, we ignore certain aspects of TCP's behavior (e.g. slow start) but believe that we have still managed to capture the essential elements of the protocol, as indicated by the generally good fits between model predictions and simulations, as discussed in Section 3.2.

Figure 3.1: Evolution of W over time when loss indications are TDs

Triple Duplicate Loss Indications

In this subsection we assume that loss indications are exclusively of type "triple-duplicate" ACK (TD), and that the composite flow f's window size is not limited by the receiver's advertised flow control window. For any given time t ≥ 0, we define N_t as the total number of packets transmitted by the cumulative flow F in the interval [0, t]. Let \( B_t = \frac{N_t}{t} \) be the cumulative throughput of all n composite flows in that interval. We can then define the long-term steady-state throughput of all n flows as

\[ B = \lim_{t \to \infty} B_t = \lim_{t \to \infty} \frac{N_t}{t} \tag{3.1} \]

Note that B_t is the number of packets sent per unit of time regardless of their eventual fate (i.e., whether they are received or not).
Thus B_t represents the throughput of the cumulative flow F at the shared link.

For our extended equation, we define p_c as the probability of a loss event of the cumulative flow F. A loss event is counted only when one or more composite flows f experience a loss in a round. As discussed in the previous subsection, in a loss event of the cumulative flow F, more than one composite flow f could experience packet loss. In order to estimate the number of composite flows that experience packet loss for each cumulative loss event, we also use information about the real loss probability in our extended equation. With p_r, we denote the probability that a packet (belonging to any composite flow) is lost, given that either it is the flow's first packet in its round or the flow's preceding packet in its round is not lost.

In this subsection, we are interested in establishing a relationship B(n, p_c, p_r) between the throughput of the cumulative flow F and n, the number of parallel synchronized flows involved, p_c, the loss probability of the cumulative flow, as well as p_r, the loss probability in any composite flow f.

For a period TDP_i, let Y_i be the number of packets sent in that period and A_i be the duration of that period. From [110], it can be shown that

\[ B = \frac{E[Y]}{E[A]} \tag{3.2} \]

where E[Y] and E[A] are the expected values of Y and A respectively. Hence, to derive B, the long-term steady-state throughput of the cumulative flow, we must next derive expressions for the means of Y and A. To achieve this, we need to take a closer look at how the evolution of the window size W_f of each composite flow, the time between two loss events of a flow A_f, and the duration of a TD-period of each individual flow f influence the development of the cumulative window size W.

As in [110], we define r_ij to be the duration (round trip time) of the j-th round of TDP_i and X_i to be the number of rounds in TDP_i. Then, the duration of TDP_i can be computed as \( A_i = \sum_{j=1}^{X_i} r_{ij} \). We consider the round trip times r_ij to be random variables that are assumed to be independent of the size of the cumulative window W, and thus independent of the round number j. It follows that

\[ E[A] = E[X]\, E[r] \tag{3.3} \]

Henceforth, we denote by RTT = E[r] the average value of the round trip time.

Since we are now dealing with the cumulative flow, in a single loss event in F more than one composite flow can experience loss. Let j_i be the number of flows, belonging to the cumulative flow, that experience loss at the end of the i-th TD-period. Assuming that loss is identically distributed over all flows, the probability that a composite flow experiences a loss in the i-th TD-period is \( \frac{j_i}{n} \).

The probability that the time between two loss events of a composite flow, A_f, is k TD-periods (k = 1, 2, ...) is equal to the probability that the flow did not lose a packet in k − 1 consecutive TD-periods and loses a packet in the k-th period:

\[ P[\text{loss in the } k\text{-th TDP}] = \frac{j_i}{n} \prod_{l=1}^{k-1} \left( 1 - \frac{j_{(i-l)}}{n} \right) \tag{3.4} \]

If j is the mean number of composite flows experiencing a loss in a round, we have:

\[ P\!\left[ A_f = k\,E[A] \right] = \frac{j}{n} \left( 1 - \frac{j}{n} \right)^{k-1} \tag{3.5} \]

The mean value of A_f, the time between two loss events for a composite flow, is:

\[ E[A_f] = \sum_{k=1}^{\infty} \frac{j}{n} \left( 1 - \frac{j}{n} \right)^{k-1} k\,E[A] = \frac{n\,E[A]}{j} \tag{3.6} \]
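As a quick numerical check of equations 3.5 and 3.6 (illustrative only, not part of the derivation), the geometric series can be summed directly for assumed values of n, j and E[A]:

```python
# Sanity check of E[A_f] = n*E[A]/j (equation 3.6), with assumed example values
# n = 8 flows, j = 2 flows lost per loss event, E[A] = 1 round-trip unit.
n, j, EA = 8, 2.0, 1.0

mean_Af = sum((j / n) * (1 - j / n) ** (k - 1) * k * EA for k in range(1, 10000))
print(mean_Af, n * EA / j)   # both print (approximately) 4.0
```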
From 3.3 and 3.6 we can express the average number of rounds between two loss events of a flow as:

\[ E[X_f] = \frac{n\,E[X]}{j} \tag{3.7} \]

To derive Y, we examine the evolution of the cumulative window W, as shown in Figure 3.2. In each round, the cumulative window W is incremented by n. Let α_i denote the sequence number of the first packet lost in TDP_i (for simplicity, we assume the sequence numbers to begin at 1 for every TD-period). After receiving a triple duplicate acknowledgment for one of the composite flows, the cumulative flow recognizes that a packet has been lost (receiving the ACK for packet α_i). We consider that a TD-period ends when the cumulative flow recognizes a loss event. This usually happens in the round following the actual loss; we call this round the "loss round". The total number of packets sent in the X_i rounds of TDP_i is Y_i = α_i, hence

\[ E[Y] = E[\alpha] \tag{3.8} \]

Figure 3.2: Packets sent during a TD period

The probability that α_i = k is equal to the probability that k − 1 packets are not loss indications and the ACK for the k-th packet triggers fast retransmission in one of the composite flows of the cumulative flow F:

\[ P[\alpha_i = k] = (1 - p_c)^{k-1}\, p_c, \qquad k = 1, 2, \ldots \tag{3.9} \]

And the mean value of α is:

\[ E[\alpha] = \sum_{k=1}^{\infty} (1 - p_c)^{k-1}\, p_c\, k = \frac{1}{p_c} \tag{3.10} \]

For the i-th TD-period, let flows x_e, e = 1, ..., j_i (a subset of the n composite flows) be the j_i flows experiencing loss at the end of the period. The same x_e flows do not experience loss in every TD-period. Instead, the TD-periods in which these x_e flows experience loss are a subset ({i_s}, s = 1, 2, ...) of the TD-periods of the cumulative flow F. For example, in Figure 3.2, only flow f_2 experiences loss in TDP_i. Its next loss could perhaps happen in the period TDP_{i+2}. If \( W^{f_{x_e}}_{i_s} \) are the congestion windows of the flows x_e at the end of the (i_s)-th period, and \( X^{f_{x_e}}_{i_s} \) is the number of rounds from the end of TDP_{i_{s-1}} till the end of TDP_{i_s}, then during these \( X^{f_{x_e}}_{i_s} \) rounds the congestion window of the flows x_e increases by \( X^{f_{x_e}}_{i_s} \) packets. Hence, we have:

\[ W^{f_{x_e}}_{i_s} = \frac{W^{f_{x_e}}_{i_{s-1}}}{2} + X^{f_{x_e}}_{i_s} \tag{3.11} \]

Assuming that \( X^{f_{x_e}}_{i_s} \) and \( W^{f_{x_e}}_{i_s} \) are mutually independent sequences of independent and identically distributed (i.i.d.) random variables, from [115] we have:

\[ E[W_f] = 2\,E[X_f] \tag{3.12} \]

Assuming that at the end of each TD-period the window sizes of the j flows experiencing loss are E[W_f], and the window sizes of the j flows that experienced loss in the previous loss events are \( \frac{E[W_f]}{2} + E[X] \), \( \frac{E[W_f]}{2} + 2E[X] \), \( \frac{E[W_f]}{2} + 3E[X] \) and so on, the mean window size of the cumulative flow is:

\[ E[W] = j\,E[W_f] + \sum_{k=1}^{\frac{n}{j}-1} j \left( \frac{E[W_f]}{2} + k\,E[X] \right) \tag{3.13} \]

From 3.7, 3.12 and 3.13, we have:

\[ E[W] = \frac{n\,E[X]}{2} + \frac{3n^2 E[X]}{2j} \tag{3.14} \]

The number of packets sent in a TD-period by the cumulative flow F is the number of packets sent between its two loss events. For the i-th TD-period this includes the packets sent in the last round of the (i − 1)th TD-period, starting from the α_{i−1}th packet till the end of the window (β_{i−1} packets), and the packets sent in the next X_i rounds up to the α_i th packet.
If flows x_e, e = 1, ..., j_{i−1} experienced loss in the (i − 1)th TD-period and \( W^{f_{x_e}}_{i-1} \) are their respective congestion window sizes at the end of the (i − 1)th TD-period, the window size of the cumulative flow at the beginning of the i-th TD-period is \( W_i = W_{i-1} - \sum_{e=1}^{j_{i-1}} \frac{W^{f_{x_e}}_{i-1}}{2} + (n - j_{i-1}) \) (the j_{i−1} flows that experienced loss reduce their congestion windows by a factor of two, while the remaining (n − j_{i−1}) flows increase their window size by one segment). Additionally, the window size W of the cumulative flow F is increased by n in every round of the i-th TD-period. So the number of packets sent in a TD-period can be expressed as:

\[ Y_i = \beta_{i-1} + \sum_{k=0}^{X_i - 1} \left( W_{i-1} - \sum_{e=1}^{j_{i-1}} \frac{W^{f_{x_e}}_{i-1}}{2} + (n - j_{i-1}) + nk \right) - \beta_i \tag{3.15} \]

where β_i is the number of packets sent in the loss round after the loss event is recognized. Assuming that loss events in the cumulative flow are uniformly distributed over the size of the cumulative window W in a loss round, we have:

\[ E[\beta] = \frac{E[W]}{2} \tag{3.16} \]

From 3.15 and 3.16, we can show that:

\[ E[Y] = \left( E[W] - \frac{j\,E[W_f]}{2} + (n - j) \right) E[X] + \frac{n\,E[X]^2}{2} - \frac{n\,E[X]}{2} \tag{3.17} \]

and including 3.7, 3.8, 3.10 and 3.12:

\[ \frac{1}{p_c} = \frac{3n^2 E[X]^2}{2j} + \frac{n\,E[X]}{2} - j\,E[X] \tag{3.18} \]

Solving equation 3.18 for E[X], we get

\[ E[X] = \frac{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}}{6 n^2 p_c} \tag{3.19} \]

Including 3.14, we have:

\[ E[W] = \frac{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}}{4 p_c j} + \frac{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}}{12 n p_c} \tag{3.20} \]

From 3.2, 3.3, 3.10 and 3.19 we can express B, the long-term steady-state throughput of all n synchronized flows, as:

\[ B = \frac{1}{RTT} \cdot \frac{6 n^2}{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}} \tag{3.21} \]

Equation 3.21 gives us an expression for the throughput of Incast traffic when all the composite flows are in the congestion avoidance phase and receive only loss-indicating events of type TD. In this equation, we can approximate j, the mean number of flows experiencing a loss in a round, with the expression \( \frac{p_r}{p_c} \). Since j can be no more than n, we have \( j = \min\left(n, \frac{p_r}{p_c}\right) \).

Timeout Loss Indications

In this subsection we model the throughput of the cumulative flow for loss indications of type "timeout" (TO). As already mentioned, TCP's throughput collapse in many-to-one synchronized communication is mainly caused by two kinds of timeouts, namely, Intermediate Block Transfer Timeouts (IBTT) and Anterior Block Transfer Timeouts (ABTT).

Figure 3.3 shows a scenario where an IBTT happens in ns-2 [118, 119] simulations. The simulation consists of four senders that transmit a synchronized data block to the same receiver. As with the standard Incast communication pattern, the client makes a request for the next block only when the previous block has been completely received. The advertised window size of the receiver is set to 1000 packets, which is large enough to have no impact on the congestion window evolution at the sender. Figure 3.3 plots the window evolution of three of the four senders involved. The dotted vertical lines running across all three evolution graphs indicate the completion of a block transfer. We can see that at time t ≈ 13.559886s, the client successfully receives block number 21. Following the complete reception of the block, the client makes a request for the transfer of the next block and all senders start transmitting their share of block 22.
During transfer of this block, sender 1 at time t = 13.563974s, experiences a TO. Since the loss indicator is a timeout, sender 1 waits 49 Figure 3.3: Scenario for Intermediate Block Transfer Timeouts 50 for a period of time T0, defined by TCP?s retransmission timer before retransmitting its lost packets. And although the other servers involved in the block transfer complete transmitting their share of the block well before the recovery of sender 1, the client does not make a re- quest for a new block till sender 1 also follows suit. Hence, the shared link is completely idle between 13.568913s ? 13.764448s, which results in throughput collapse. By observing the congestion window evolution of sender 1, we find that although the packets in its congestion window at t ? 13.559886s, were all successfully transmitted, the server received less than 3 duplicate ACKs resulting in a TO. Figure 3.4 illustrates the situation where ABTTs occur. In this ns-2 simulation setup, ten senders transmit synchronized data block to the same receiver. As with the simulations for IBTT, the advertised window size of the receiver is set to 1000 packets, which is again large enough to have no impact on the congestion window evolution at the sender. Figure 3.4 plots the window evolution of three of the ten senders involved. Like before, the dotted vertical lines running across all three evolutionary graphs indicate the completion of a block transfer. Here, we notice that sender 10 experiences a TO very early (t ? 1.855538s) in the transfer of block 9 to the receiver. By the time sender 10 resumes with its transmission (at t ? 2.054982s), all the remaining servers involved, have finished transmitting their share of the data and are waiting for sender 10 to catch up. Like with IBTT, the shared link remains completely idle during this interval (1.875551s ? 2.054982s), which drastically reduces the overall throughput of the Incast traffic. However once sender 10 resumes its transmission, it does not have to compete with any other sender for a portion of the shared bandwidth. This results in a large congestion window for sender 10 at the end of the transfer of block 9. At the beginning of the next block transfer, all senders start off by injecting their whole windows into the network. The small buffers at the intermediate Ethernet switch are easily overwhelmed by large windows of senders like 10 and as a result, a lot of packets get dropped. Unfortunately, few senders like sender 9, lose all the packets in their congestion window resulting in an early TO for block 10. And like sender 10 during transfer of block 9, sender 51 9 too ends up with a large congestion window during the transfer of block 10. The cycle repeats for block 11 too, with sender 8 experiencing an early TO. Through investigating numeroussimulations, we find that IBTT dominates TCP through- put when n is small, while ABTT dominates Incast when n is large. Consider the evolution of the cumulative window W, in the presence of loss indications that include type ?TO?, as shown in Figure 3.5. Timeouts occur when any composite flow f loses packets (or ACKs) and receives less than three duplicate ACKs in response. The loss experiencing flow then waits for a period of time denoted by T0 before retransmitting its non-acknowledged packets. Following a timeout, the congestion window of the flow Wf is reduced to one, and only one packet is thus resent in the first round after the timeout. 
If the composite flow suffers another timeout before successfully retransmitting the packets lost during the first timeout, the timeout period doubles to 2T_0; this doubling is repeated for each unsuccessful retransmission until 64T_0 is reached, after which the timeout period remains constant at 64T_0 [110].

Figure 3.4: Scenario for Anterior Block Transfer Timeouts

Figure 3.5: Evolution of W over time when loss indications are TD and TO

The evolution of the cumulative window W depicted in Figure 3.5 is an approximation of the real Incast traffic pattern observed during timeouts. Because we have assumed all n flows to be synchronized in terms of rounds, when one composite flow experiences a timeout, the remaining flows refrain from transmitting data as well. In reality, when one composite flow experiences a loss, the other (n − 1) flows continue to transmit their remaining share of data (e.g. Figure 3.3 and Figure 3.4).

Slow Start is another aspect of TCP that we have chosen to ignore in our handling of TO type loss indications. Following a timeout, TCP uses a mechanism called "Slow Start" to increase its congestion window. Slow Start operates on the observation that the rate at which new packets should be injected into the network is the rate at which acknowledgments are returned by the other end. Unlike the Congestion Avoidance phase, where the congestion window is increased by one segment per round trip time, Slow Start increases the congestion window by one segment for every ACK received. This provides for an exponential growth of the congestion window after it has been reduced to one following a timeout. The Slow Start phase is usually much shorter than the Congestion Avoidance phase and, for the sake of simplicity, we choose to ignore this phase in our model of Incast. Despite these approximations, we believe that we have still managed to capture the essential aspects of the Incast phenomenon, as indicated by the generally good fit between our model and the simulations, as discussed in Section 3.2.

Let \( Z^{TO}_i \) denote the duration of a sequence of timeouts and \( Z^{TD}_i \) denote the time interval between two consecutive timeout sequences. We define S_i to be

\[ S_i = Z^{TD}_i + Z^{TO}_i \tag{3.22} \]

Let M_i be the number of packets sent during S_i. Then \( \{(S_i, M_i)\}_i \) is an i.i.d. sequence of random variables [110], from which we have

\[ B = \frac{E[M]}{E[S]} \tag{3.23} \]

Let v_i be the number of TD-periods in the interval \( Z^{TD}_i \). For the jth TD-period of interval \( Z^{TD}_i \), we define Y_{ij} to be the number of packets sent in the period, A_{ij} to be the duration of the period, X_{ij} to be the number of rounds in the period, and W_{ij} to be the cumulative window size of the n parallel synchronized TCP flows at the end of the period. From these definitions we have

\[ M_i = \sum_{j=1}^{v_i} Y_{ij} \tag{3.24} \]

and

\[ S_i = \sum_{j=1}^{v_i} A_{ij} + Z^{TO}_i \tag{3.25} \]

Thus,

\[ E[M] = E\!\left[ \sum_{j=1}^{v_i} Y_{ij} \right] \tag{3.26} \]

and

\[ E[S] = E\!\left[ \sum_{j=1}^{v_i} A_{ij} \right] + E\!\left[ Z^{TO}_i \right] \tag{3.27} \]

If we assume \( \{v_i\}_i \) to be an i.i.d. sequence of random variables, independent of Y_{ij} and A_{ij} [110], then we have

\[ E\!\left[ \sum_{j=1}^{v_i} Y_{ij} \right] = E[v]\,E[Y] \tag{3.28} \]

and

\[ E\!\left[ \sum_{j=1}^{v_i} A_{ij} \right] = E[v]\,E[A] \tag{3.29} \]
To derive E[v], observe that during \( Z^{TD}_i \), the time between two consecutive timeout sequences, there are v_i TDPs, of which the first (v_i − 1) end in a TD and the last TDP ends in a TO. It follows that in \( Z^{TD}_i \) there is one TO out of v_i loss indications. Therefore, if we denote by Q the probability that a loss indication ending a TDP is a TO, we have \( Q = \frac{1}{E[v]} \). Consequently,

\[ B = \frac{E[Y]}{E[A] + Q \cdot E[Z^{TO}]} \tag{3.30} \]

Since A_{ij} and Y_{ij} do not depend on timeouts, their means are those derived in 3.3 and 3.10. To compute the throughput of n parallel synchronized TCP connections using 3.30 we must still determine Q and E[Z^{TO}].

Figure 3.6: Packet and ACK transmissions preceding a loss indication

We begin by deriving an expression for Q. Let j be the mean number of composite flows experiencing packet loss at the end of a TDP, as discussed in the previous subsection. For simplicity, we assume that at most one "TO" type loss indication occurs at the end of a TDP. That is, of the j composite flows that lose packets at the end of a TDP, no more than one flow experiences a timeout event. Since a timeout is either of type IBTT or of type ABTT, the probability of a TO type loss indication ending a TDP can be expressed as

\[ Q = \min(1,\; Q_{ibtt} + Q_{abtt}) \tag{3.31} \]

where Q_{ibtt} and Q_{abtt} are the probabilities of ending a TDP with a single timeout indication of type IBTT and ABTT respectively.

Next, we focus on deriving an expression for Q_{ibtt}, the probability of a composite flow experiencing an IBTT at the end of a TDP. Consider the round where a composite flow f loses its packets; we will refer to this round as the "penultimate" round (see Figure 3.6). Let W_f be the size of the flow's congestion window. Thus packets u_1, ..., u_{W_f} are sent in the penultimate round. Packets u_1, ..., u_k are acknowledged and packet u_{k+1} is the first one to be lost. Since we have assumed packet losses within a round to be correlated, if a packet is lost, all packets that follow it till the end of the round are also lost. Thus, all packets following u_{k+1} in the penultimate round are also lost. However, since packets u_1, ..., u_k are ACKed, another k packets, s_1, ..., s_k, are sent in the next round, which we will refer to as the "loss" round. This round of packets may have another loss, say packet s_{m+1}. Again, our assumption of packet loss correlation mandates that packets s_{m+2}, ..., s_k are also lost in the loss round. The m packets successfully sent in the loss round are responded to by ACKs for packet u_k, which are counted as duplicate ACKs. If the number of such ACKs is three or more, then a TD indication occurs; otherwise an IBTT occurs. In both cases, the current period between losses, the TDP, ends.

We denote by A(w, k) the probability that the first k packets are ACKed in a round of w packets, given that there is a sequence of one or more losses in the round. Then,

\[ A(w, k) = \frac{(1 - p_r)^k\, p_r}{1 - (1 - p_r)^w} \tag{3.32} \]

Also, we define C(g, m) to be the probability that m packets are ACKed in sequence in the loss round (where g packets were sent) and the rest of the packets in the round, if any, are lost. Then,

\[ C(g, m) = \begin{cases} (1 - p_r)^m\, p_r, & m < g \\ (1 - p_r)^g, & m = g \end{cases} \tag{3.33} \]

Then \( \hat{Q}_{ibtt}(w) \), the probability that a loss in a congestion window of size w is an IBTT, is given by

\[ \hat{Q}_{ibtt}(w) = \begin{cases} 1, & \text{if } w \le 3 \\ \sum_{k=0}^{2} A(w,k) + \sum_{k=3}^{w} \left( A(w,k) \sum_{m=0}^{2} C(k,m) \right), & \text{otherwise} \end{cases} \tag{3.34} \]
since an IBTT occurs if the number of packets successfully transmitted in the penultimate round, k, is less than three, or if the number of packets successfully transmitted in the loss round, m, is less than three. Also, due to the assumption that packets following s_{k+1} are lost independently of packets following u_{k+1} (since they occur in different rounds), the probability that there is a loss at u_{k+1} in the penultimate round followed by a loss at s_{m+1} in the loss round equals A(w,k)·C(k,m). Therefore Q_{ibtt}, the probability that a composite flow's loss indication is an IBTT, can be expressed as

\[ Q_{ibtt} = \sum_{w=1}^{\infty} \hat{Q}_{ibtt}(w)\, P[W_f = w] = E\!\left[ \hat{Q}_{ibtt} \right] \tag{3.35} \]

We can approximate this as

\[ Q_{ibtt} \approx \hat{Q}_{ibtt}(E[W_f]) \tag{3.36} \]

where E[W_f] is the mean congestion window size of a composite flow, derived from equation 3.12.

To derive an expression for Q_{abtt} we must first consider the number of packets transmitted in a TDP in relation to the size of the block being transferred. Let L be the size of the block that all n senders are trying to transmit to the destination. If E[Y] is the mean number of packets sent during a TD-period (equation 3.10), the average number of TDPs needed to transfer a block of size L can be expressed as

\[ \lambda = \frac{L}{E[Y]} \tag{3.37} \]

If μ is the mean number of ABTTs occurring at the start of a block transfer, and the series λ_i and μ_i can be assumed to be mutually independent sequences of i.i.d. random variables, then the probability of a TDP ending due to a TO indication of type ABTT can be expressed as

\[ Q_{abtt} = \frac{E[\mu]}{E[\lambda]} \tag{3.38} \]

We can substitute the results of equations 3.36 and 3.38 into 3.31 to get an expression for Q, the probability that a loss indication ending a TDP is a TO.

Next, we consider the derivation of E[Z^{TO}], the average duration of a timeout sequence. Since we have assumed that there can be at most one timeout at the end of a TDP, we can approximate E[Z^{TO}] with T_0. By substituting the obtained expressions for Q and E[Z^{TO}] into equation 3.30, we obtain the following expression for B:

\[ B = \frac{E[Y]}{RTT \cdot E[X] + Q \cdot E[Z^{TO}]} \tag{3.39} \]

where,

\[ E[Y] = \frac{1}{p_c} \]
\[ E[X] = \frac{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}}{6 n^2 p_c} \]
\[ j \approx \min\left( n, \frac{p_r}{p_c} \right) \]
\[ Q = \min(1,\; Q_{ibtt} + Q_{abtt}) \]
\[ Q_{ibtt} \approx \hat{Q}_{ibtt}(E[W_f]) \]
\[ E[W_f] = \frac{2n}{j} \cdot \frac{2j^2 p_c - n p_c j + \sqrt{n^2 p_c^2 j^2 - 4 n p_c^2 j^3 + 4 j^4 p_c^2 + 24 n^2 p_c j}}{6 n^2 p_c} \]
\[ \hat{Q}_{ibtt}(w) = \begin{cases} 1, & \text{if } w \le 3 \\ \sum_{k=0}^{2} A(w,k) + \sum_{k=3}^{w} \left( A(w,k) \sum_{m=0}^{2} C(k,m) \right), & \text{otherwise} \end{cases} \]
\[ A(w,k) = \frac{(1 - p_r)^k\, p_r}{1 - (1 - p_r)^w} \]
\[ C(k,m) = \begin{cases} (1 - p_r)^m\, p_r, & m < k \\ (1 - p_r)^k, & m = k \end{cases} \]
\[ Q_{abtt} = \frac{E[\mu]}{E[\lambda]} \]
\[ E[Z^{TO}] = T_0 \]

In Section 3.2, we verify whether equation 3.39 successfully models the behavior of Incast. Henceforth we refer to the model expressed in equation 3.39 as the "Full Model".

3.2 Validation and Analysis

In this section, we validate the performance of our Incast model using the ns-2 simulator. With simulations, we demonstrate that the throughput expression derived in the previous section works relatively well over a broad range of conditions.
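Before turning to the validation itself, the closed-form expressions above can be transcribed into a short numerical sketch. The Python code below evaluates the TD-only throughput of equation 3.21 and the Full Model of equation 3.39. Function names and the example inputs (n, p_c, p_r, RTT, T_0, SRU size in packets, and the measured mean number of ABTTs) are illustrative assumptions, with the block size taken as n × SRU as in the setup of Section 3.2.

```python
import math

def expected_rounds(n, pc, j):
    """E[X], equation 3.19."""
    root = math.sqrt(n**2 * pc**2 * j**2 - 4 * n * pc**2 * j**3
                     + 4 * j**4 * pc**2 + 24 * n**2 * pc * j)
    return (2 * j**2 * pc - n * pc * j + root) / (6 * n**2 * pc)

def A(w, k, pr):
    """Equation 3.32: first k of w packets ACKed, given a loss in the round."""
    return (1 - pr) ** k * pr / (1 - (1 - pr) ** w)

def C(g, m, pr):
    """Equation 3.33: m packets ACKed in sequence in a loss round of g packets."""
    return (1 - pr) ** m * pr if m < g else (1 - pr) ** g

def Q_ibtt_hat(w, pr):
    """Equation 3.34 (Full Model form)."""
    w = max(int(round(w)), 1)
    if w <= 3:
        return 1.0
    head = sum(A(w, k, pr) for k in range(3))
    tail = sum(A(w, k, pr) * sum(C(k, m, pr) for m in range(3))
               for k in range(3, w + 1))
    return head + tail

def incast_throughput(n, pc, pr, rtt, t0, sru_pkts, mean_abtt):
    """Returns (TD-only throughput, Full Model throughput) in packets/s."""
    j = min(n, pr / pc)
    ex = expected_rounds(n, pc, j)
    ey = 1.0 / pc                        # E[Y], equation 3.10
    ewf = 2 * n * ex / j                 # E[W_f], from equations 3.7 and 3.12
    q_ibtt = Q_ibtt_hat(ewf, pr)
    lam = n * sru_pkts / ey              # TDPs per block, block size L = n * SRU
    q_abtt = mean_abtt / lam             # equation 3.38
    q = min(1.0, q_ibtt + q_abtt)        # equation 3.31
    b_td_only = ey / (rtt * ex)          # equation 3.21 (equivalently 6n^2/(RTT*D))
    b_full = ey / (rtt * ex + q * t0)    # equation 3.39, with E[Z_TO] ~= T0
    return b_td_only, b_full

# Example with assumed, made-up inputs:
print(incast_throughput(n=8, pc=0.01, pr=0.02, rtt=0.0002,
                        t0=0.2, sru_pkts=256, mean_abtt=0.5))
```

With these made-up inputs the TD-only estimate is orders of magnitude larger than the Full Model value, which is exactly the gap that the timeout terms are meant to account for.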
Figure 3.7: Setup for n parallel, synchronized TCP flows sharing a bottleneck

For our ns-2 simulations, we used the topology depicted in Figure 3.7, which is commonly used to study a set of parallel, synchronized flows sharing the same bottleneck link. We vary parameters such as the number of flows, the block size and the buffer length to validate our model under different environmental conditions.

Our ns-2 simulation configuration, depicted in Figure 3.7, consists of a cluster based storage system where the storage client and storage servers are all connected to the same switch. In this environment, data blocks are striped over multiple servers, such that each server stores a fragment of the data block, denoted as the Server Request Unit (SRU) in Figure 3.7. A client requesting a data block sends request packets to all storage servers that contain SRUs for that particular block; the client requests the next block only after it has received all the data for the current requested block. That is, if the client requests a data block from n servers, it sends the request for the next block only after receiving (n × SRU) bytes of data in total.

Table 3.1: Simulation parameters with default settings

    Parameter                   Default
    Number of servers           varied
    SRU size (L)                256 KB
    Link capacity (C)           1 Gbps
    Link delay (D)              50 µs
    Switch buffer size (B)      32 KB
    Segment size (S)            1 KB
    TCP implementation          NewReno
    Receive window size         1000 segments
    Duplicate ACK threshold     3
    Slow start                  enabled
    RTOmin                      200 ms

Next, we measure the throughput of n parallel, synchronized TCP flows at the shared bottleneck link while varying the number of storage servers involved in the data transfer. To more accurately model real-world scheduling variance, we also add a random scheduling delay of up to 20 µs between every consecutive data request from the client. Table 3.1 lists the other parameters used in our experiments. Notice that we have enabled slow start in our experiments even though we chose to ignore it in our model. As we demonstrate later in this section, the impact of slow start on Incast is negligible; our model produces a good fit with the simulations despite ignoring it. Each trial in the experiment runs for 40 seconds of simulated time, providing enough data transfer to accurately calculate the throughput of the Incast traffic.

Simultaneously, we gather the traces generated by ns-2 for all the traffic simulated in our experiment. Later, we analyze these traces with a set of analysis programs developed by us. These programs compute the value of p_c by dividing the total number of loss indications in the cumulative flow by the total number of packets sent by all flows, the value of p_r by dividing the total number of loss indications in a composite flow by the total number of packets sent by that flow, and μ, the mean number of ABTTs occurring at the start of a block transfer. Additionally, the programs measure the round trip time and the average duration of a "single" timeout. These values are then averaged over several runs and our model's throughput is computed using equation 3.39.

Figure 3.8: Comparing Full Model with Incast simulation results

Figure 3.8 compares the throughput of ns-2 Incast simulations to the throughput obtained by our model using equation 3.39. From the graph, it can be seen that our model characterizes the general tendency of TCP Incast relatively well, although it underestimates the throughput at the bottleneck link when the number of senders is large.
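The trace post-processing just described can be sketched as follows. The input format (per-flow loss-indication and packet counts, plus the count of cumulative-flow loss events) is a hypothetical stand-in for whatever the ns-2 trace parser actually emits; only the definitions of p_c and p_r follow the text above.

```python
def loss_probabilities(cumulative_loss_events, flow_stats):
    """cumulative_loss_events: rounds in which at least one composite flow saw
    a loss indication (counted once per round, as in Section 3.1).
    flow_stats: list of (loss_indications, packets_sent), one entry per flow."""
    total_packets = sum(p for _, p in flow_stats)
    p_c = cumulative_loss_events / total_packets
    per_flow = [l / p for l, p in flow_stats]
    p_r = sum(per_flow) / len(per_flow)
    return p_c, p_r

# Hypothetical counts from one simulation run (not real trace data):
print(loss_probabilities(cumulative_loss_events=23,
                         flow_stats=[(12, 5200), (9, 4900), (15, 5100), (11, 5000)]))
```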
Figure 3.9: Comparing TD Only model with Incast simulation results Figure 3.9 on the other hand compares the throughput of ns-2 Incast simulations to the throughput obtained by our model using equation 3.21. It is important to note that the expression in equation 3.21 computes the throughput of the Incast traffic when all the composite flows are in congestion avoidance phase and receive only TD type loss indicating events. That is, equation 3.21 computes the throughput of the Incast traffic without taking timeouts into account. Comparing figures 3.8 and 3.9 it is clear that, timeouts ? both ABTT and IBTT ? are essentially the main causes for TCP?s throughput collapse under Incast workloads. To better understand the impact of IBTT and ABTT on Incast traffic, we compute the throughput achieved in our model by considering just one type of timeout at a time. Figures 3.10 and 65 3.11 plot the throughput resulting from our Full Model when only IBTT and ABTT, are considered respectively. Figure 3.10: Impact of IBTT on proposed model From Figure 3.10 it is clear that IBTTs have a greater impact on throughput when the number of senders is small. When the number of senders is between three and eight, our model overestimates the impact of IBTTs when compared to throughput resulting from ns-2 simulations. Also, when the number of senders is greater than eight, we observe that the model?s throughput no longer matches that of the ns-2 simulations. This is mainly because the expression for Qibtt in equation 3.36 does not take into account the timeouts happening at the beginning of a block transfer. Furthermore, since Qibtt in equation 3.36 only relies on the probabilities pc and pr, even small deviations in their measured values, can result in large fluctuations in the model?s throughput. 66 On the other hand, from Figure 3.11, it is clear that ABTTs dominate timeouts when the number of senders is large. From the graph, we observe that the ABTTs have little or no impact on the model?s throughput when the number of senders is less than ten. As the number of senders increase, some of them finish transmitting their SRUs early allowing the remaining senders to transmit their SRUs using the additional bandwidth vacated by the finished peers. This results in large transmission windows for some of the senders at the end of the block transfer. At the beginning of the next block transfer, all senders begin by injecting their entire congestion windows into the network. With some senders injecting larger number of segments, this packet burst at the beginning of a block transfer overwhelms the bottleneck link?s port buffers resulting in packet drops and ABTTs. Figure 3.11: Impact of ABTT on proposed model It is interesting to observe that ABTTs are the result of the start-stop nature of the synchronized block transfers. The senders stop transmission after completing their SRU 67 transfer and start transmitting again only after receiving a new transfer request. This new transfer request results in a packet burst which floods the buffers at the bottleneck link resulting in packet loss and ABTTs. If the senders each had SRUs of infinite size like in [110], there would be no start-stop pattern to Incast?s traffic and hence, no ABTTs. This would have resulted in the senders experiencing only IBTTs in which case, the expression for Qibtt in equation 3.36 would have been sufficient for estimating the probability of a timeout at the end of a TDP. 
Going back to Figure 3.8, we can analyze the performance of our model in two parts: the first, where the number of senders is large and ABTTs have the bigger impact, and the second, where the number of senders is small and IBTTs are dominant.

In the first part, it is clear that our model underestimates the throughput of multiple TCP flows at the bottleneck link. This is because our model overestimates the time spent recovering from an ABTT. If we revisit the expression for Q_abtt in equation 3.38, we find that μ is defined as the mean number of timeouts occurring at the beginning of a block transfer. While our model simply counts the average number of flows experiencing timeouts at the beginning of a block transfer, the traces generated by ns-2 show that most of these timeouts occur simultaneously. With simultaneous timeouts, the participating flows wait for a single T_0 period before recovering, although the trigger event gets counted multiple times. Since we do not take simultaneous timeouts into account while deriving the expression for Q_abtt, the estimated duration between two successive TDs in our model is slightly longer than that of ns-2. This in turn decreases the number of packets estimated per unit time, which is why our model underestimates Incast throughput when the number of senders involved is large.

In the second part of our performance analysis of the Full Model, we find that the model predicts a huge drop in Incast throughput when the number of senders is approximately three. The throughput obtained via ns-2 simulations, on the other hand, appears to have a step around the four-senders mark, followed by a significant drop in performance when the number of senders is around seven. In order to better understand this discrepancy in the results, we analyzed the traces generated by the ns-2 simulations in great detail. From these traces we found that whenever the number of senders is less than or equal to three, IBTTs are caused by only one reason: whole window losses. That is, when a sender experiences a timeout with n ≤ 3, it loses all the packets in its congestion window without receiving a single ACK in return. This type of loss happens when two or more individual flows simultaneously attempt to fill the bottleneck link buffer, resulting in at least one flow losing all its packets. On the other hand, when the number of senders is greater than three, IBTTs in the ns-2 simulations happen for only one reason: lack of enough duplicate ACKs. This type of loss happens when a sender loses packets in both the penultimate and the loss rounds due to severe congestion at the bottleneck link, as discussed earlier in Section 3.1 while deriving the expression for C(g, m) in equation 3.33.

Taking into account the exclusive nature of the IBTT type of timeouts in the ns-2 simulations, we can now compute the following new expression for \( \hat{Q}_{ibtt}(w) \):

\[ \hat{Q}_{ibtt}(w) = \begin{cases} 1, & \text{if } w \le 3 \\ A(w, 0), & \text{if } n \le n^{*} \\ \sum_{k=3}^{w} \left( A(w,k) \sum_{m=0}^{2} C(k,m) \right), & \text{otherwise} \end{cases} \tag{3.40} \]

where n* is the number of senders beyond which IBTTs are entirely caused by insufficient duplicate ACKs. By substituting equation 3.40 into equations 3.35 and 3.36, we end up with a new expression for B, the throughput of the Incast traffic across the bottleneck link. We refer to the model resulting from equation 3.40 as the "Split Model", as opposed to the "Full Model" derived in equation 3.39.
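For completeness, here is a small self-contained sketch of equation 3.40, with A and C as in equations 3.32 and 3.33. The threshold value n_star and the example arguments are assumptions made only for illustration.

```python
def A(w, k, pr):
    # Equation 3.32: first k of w packets ACKed, given a loss in the round.
    return (1 - pr) ** k * pr / (1 - (1 - pr) ** w)

def C(g, m, pr):
    # Equation 3.33: m packets ACKed in sequence in a loss round of g packets.
    return (1 - pr) ** m * pr if m < g else (1 - pr) ** g

def q_ibtt_split(w, n, pr, n_star=3):
    """Equation 3.40: IBTT probability under the Split Model.
    w is the (mean) congestion window size, n the number of senders;
    n_star is the assumed threshold beyond which IBTTs stem only from
    insufficient duplicate ACKs."""
    w = max(int(round(w)), 1)
    if w <= 3:
        return 1.0                 # window too small to ever produce 3 dup ACKs
    if n <= n_star:
        return A(w, 0, pr)         # whole-window loss: nothing ACKed in the round
    return sum(A(w, k, pr) * sum(C(k, m, pr) for m in range(3))
               for k in range(3, w + 1))

print(q_ibtt_split(w=12, n=8, pr=0.02))
```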
Figure 3.12 compares the throughput of ns-2 Incast simulations to the throughput ob- tained by the Split Model described above. From the graph it can be seen that the Split 69 Model is much better at characterizing the overall tendency of TCP Incast. When the num- ber of senders n is less than or equal to three, Split Model only considers whole window losses for IBTTs. Beyond that, as the number of senders increase, Split Model only consid- ers insufficient duplicate ACKs for IBTTs. When combined with timeouts of type ABTT resulting from packet burst at the start of a block transfer, the Split Model appears to model the Incast traffic much better than our earlier Full Model. Figure 3.12: Comparing Split Model with Incast simulation results From the four simulation curves in Figures 3.13-3.16, we can summarize the following features: ? Larger switch buffer improves the throughput at the bottleneck link with different number of senders n. This can be explained by our proposed model. Larger buffer size implies fewer dropped packets i.e., smaller values for probabilities pc and pr. Hence, the expected number of packets Y in a TDP increases. 70 Figure 3.13: Performance of Full Model, Split Model and ns-2 with 16 KB switch buffer Figure 3.14: Performance of Full Model, Split Model and ns-2 with 32 KB switch buffer 71 Figure 3.15: Performance of Full Model, Split Model and ns-2 with 64 KB switch buffer Figure 3.16: Performance of Full Model, Split Model and ns-2 with 128 KB switch buffer 72 Figure 3.17: Performance of Full Model, Split Model and ns-2 with 64 KB SRU Figure 3.18: Performance of Full Model, Split Model and ns-2 with 128 KB SRU 73 Figure 3.19: Performance of Full Model, Split Model and ns-2 with 256 KB SRU Figure 3.20: Performance of Full Model, Split Model and ns-2 with 512 KB SRU 74 ? Larger switch buffer shifts throughput collapse to the right. That is, for larger switch buffers, several parallel, synchronized senders can transmit data without experiencing Incast. This is because larger switch buffers can cache more packets and thereby reduce the probability of packet loss. And since losses lead to Anterior Block Transfer Timeouts as senders increase, large buffers delay the onset of performance loss by reducing the number of ABBTs. Figures 3.17-3.20 plot the performance of our proposed model and simulation results with different sizes of SRU. From these graphs, we can summarize the following features: ? We can see that the throughput increases when the SRUs grow larger. But large SRU size has little impact on the onset of throughput collapse. According to our model, SRU size is irrelevant to the maximum cumulative window size. ? With larger SRU size, the time wasted by a TO period to the time spent by unlucky flows transmitting packets becomes smaller. As a result, the throughput across the bottleneck link increases after Incast. 3.2.1 Comparing with Single Flow Model Due to the simplicity of our model, it is tempting to believe that a similar result could also be obtained by simply multiplying the original equation from [110] with the number of flows n. In Figure 3.21, we compare the simulations results from ns-2 with n?(equationfrom[110]). Here, in order to obtain the curve for the expression n?(equation from[110]), we have substituted the packet loss probability p in [110] with average pr from Section 3.1, the probability that a packet (belonging to any flow) is lost. As expected the expression n ? (equationfrom[110]), works reasonably well for small values of n. 
That is because, for smaller values of n, the resulting TCP timeouts are largely dominated by IBTTs. On the other hand, when n is reasonably large, ABTTs happen more frequently. Unfortunately, the equation in [110] does not take ABTTs into account. Hence, as n becomes large, the predicted throughput of the expression n × (equation from [110]) is orders of magnitude larger than the one obtained through simulations. This discrepancy in performance only reinforces our decision to develop a new analytical model to examine various aspects of Incast.

Figure 3.21: Comparing n × (equation from Padhye et al.) with Incast simulation results

3.3 Summary

In this chapter, we built analytical models to understand the essential causes behind TCP Incast, which is a crucial issue in data center networks. Existing investigations of TCP Incast try to find a workable solution to the problem, even when that solution incurs high costs. For example, some of the prevalent Incast solutions include substituting TCP with UDP, reducing the RTOmin value, increasing the switch buffer size, limiting the number of senders in Incast transfers, and so on. To solve TCP Incast substantially, the fundamental reasons behind it should first be explored. Unfortunately, almost all existing studies of TCP model the protocol considering a single flow, with an application that has an infinite amount of data to transmit. Furthermore, there are practically no prior TCP models that study the protocol's performance under synchronized traffic workloads in high speed, low latency, data center networking environments. Our models fill this void by extending the single flow model in [110] to multiple synchronized flows, where each flow contributes a finite amount of data.

In our work, we find that two types of timeouts, ABTT and IBTT, are together responsible for TCP's throughput collapse in many-to-one synchronized traffic workloads. IBTT, which is caused by one of the last three packets in a round being dropped, has a greater impact on throughput when the number of senders is small. ABTT, which is caused by the start-stop nature of Incast traffic at the beginning of a block transfer, dominates timeouts when the number of senders is large. We validate the performance of our proposed models by comparing them with simulation data. Although our models characterize the overall effect of Incast quite well, we find them to be a little conservative in their estimation of the cumulative throughput. This is because our models overestimate the frequency of ABTTs, resulting in longer delays and lower throughput when compared to simulation data. From our experiments we were also able to demonstrate that larger switch buffers can not only improve the throughput of the Incast traffic but can even delay the onset of throughput collapse. Similarly, we show that larger SRUs can also improve the throughput of Incast traffic. However, we find that the size of the SRU has little impact on the onset of throughput collapse. Finally, all the insights gained in building and validating our proposed models will help us develop more effective solutions that address the problem of TCP Incast, preferably at lower costs.

Chapter 4
Addressing TCP Incast

As discussed in Chapter 1 and Chapter 3, clients performing synchronized reads across an increasing number of servers in high bandwidth, low latency data center environments observe TCP's throughput drop one or two orders of magnitude below their link capacity.
Labeled Incast, this pathological behavior of TCP is endured by a growing number of data center applications and services. Hence, a feasible solution that addresses the Incast problem is urgently needed. In this chapter, we provide a broad overview of existing Incast solutions followed by a detailed description of our proposed techniques that are designed to address the Incast problem at the Transport Layer [26].¹

4.1 Existing Solutions

Since timeouts are the primary reason behind TCP Incast, in this section we briefly discuss existing solutions that either avoid timeouts or reduce their penalty. While all the solutions discussed here are moderately effective in masking Incast, only the two techniques discussed in Subsections 4.1.3 and 4.1.4 manage to accomplish this at the transport layer.

4.1.1 Larger Switch Buffers

This Incast solution, discussed in [61], tries to mitigate the root cause of timeouts, namely packet losses, by increasing the buffer space allocated per port on the Ethernet switches. To evaluate this solution, we vary the size of the switch port buffers in a cluster based storage system where the network links have a delay of 100 μs and TCP's receive window size is 20 segments. The results of this experiment are depicted in Figure 4.1. Figure 4.1 clearly shows that doubling the size of the switch's output port buffer doubles the number of servers that can be supported before the onset of Incast.

Figure 4.1: Effect of the size of switch buffers on TCP Incast

Consequently, given the number of servers, Incast can be avoided with a large enough buffer space. Unfortunately, switches with larger buffers tend to cost more, forcing system designers to choose between over-provisioning and hardware budgets. This suggests that a more cost-effective solution is needed to address TCP Incast.

4.1.2 Increasing SRU Size

This is another Incast countermeasure discussed in [61]. It aims to mask TCP's throughput collapse by utilizing the spare link capacity of the stalled flows to transfer larger SRUs belonging to other flows. To evaluate this solution, we vary the size of the SRUs in the cluster based storage system discussed in Subsection 4.1.1, while limiting the size of the switch port buffer to 32 KB. The results of this experiment are depicted in Figure 4.2. Figure 4.2 illustrates that increasing the size of the SRUs improves the overall throughput at the client. For example, with 7 servers, the throughput for a 1 MB SRU is orders of magnitude greater than the throughput for an SRU of size 256 KB.

Figure 4.2: Effect of the size of the SRUs on TCP Incast

As discussed in Chapter 3, TCP performs well in settings without synchronized reads, which can be modeled by infinite sized SRUs. With large SRUs, the servers take longer to complete transmitting their share of data. This allows the active servers to utilize the spare link capacity made available by the stalled flows during timeouts. In doing so, the servers effectively reduce the idle link time experienced by the client, which in turn improves its overall throughput. Unfortunately, an SRU of size 1 MB is quite impractical; most applications ask for data in small chunks, corresponding to a size range of 1 - 256 KB.

¹All simulations discussed in this chapter use the same topology as depicted in Figure 3.7 of Chapter 3. Furthermore, unless noted explicitly, the simulations use the same parameters and values as listed in Table 3.1 of Chapter 3.
This is because the larger the size of the SRU, the greater the amount of prefetching that the storage system has to commit to. With prefetching, the storage system needs to allocate pinned space in the client kernel memory, increasing the memory pressure at the client [74]. This increased pressure at the client often leads to kernel failures. Hence it is really not advisable to use larger SRUs on cluster based storage systems.

4.1.3 Reducing Timeout Penalty

This technique, proposed in [74], aims to address TCP Incast by reducing the time spent in waiting for a timeout to end. The amount of time a flow waits before retransmitting a lost packet without the duplicate ACK assisted Fast Retransmit mechanism is determined by TCP's RTO value. Estimating TCP's RTO value involves achieving timely response to packet losses while also avoiding the occurrence of premature timeouts. Premature timeouts have the following negative effects:

• They lead to spurious retransmissions which can potentially cause and prolong network congestion.

• They cause TCP to enter Slow Start recovery after reducing its Slow Start Threshold (ssthresh) value by half, even when no packets were lost. In doing this, the protocol underestimates its link capacity, resulting in lower throughput for its users.

TCP therefore has a conservative minimum RTO (RTOmin) value to guard itself against the ill effects of spurious retransmissions [120, 121]. Popular implementations of TCP use an RTOmin value of 200 ms [122]. Although this value is appropriate in wide area networks, it is orders of magnitude greater than the round trip times in data center networks. This large RTOmin value imposes a huge penalty on TCP's throughput, as the transfer times for segments within a data center are significantly smaller than the value of RTOmin. In [74], the authors suggest reducing the value of RTOmin from 200 ms to 200 μs, in order to lessen the penalty of TCP timeouts on synchronized reads. To evaluate this solution, we decrease the value of TCP's RTOmin in the cluster based storage system discussed in Subsection 4.1.1, while limiting the size of the switch port buffer to 32 KB. The results of this experiment are depicted in Figure 4.3. From Figure 4.3, it is clear that reducing TCP's RTOmin value improves the overall throughput at the client, even after taking into account the drop in peak performance when the number of servers is greater than 40.

Figure 4.3: Effect of the RTOmin value on TCP Incast

In general, for any given SRU size, reducing the RTOmin value improves the overall throughput at the client. Unfortunately, setting RTOmin to 200 μs poses the following challenges:

• According to the RTO computing algorithms in [120, 121], reducing RTOmin to 200 μs requires a TCP clock granularity of 100 μs. TCP implementations on most operating systems, including the likes of BSD and Linux, are currently unable to provide this fine grained timer. For example, the BSD implementation of TCP expects the operating system to provide two coarse-grained "heartbeat" software interrupts every 200 ms and 500 ms, which are used to handle internal per-connection timers [123]. Similarly, the TCP implementation on Linux expects a clock granularity of 10 ms from the operating system. Some operating systems can support fine grained timers by either employing specialized external hardware or utilizing high resolution software timers [124]. However, neither of these options is feasible in the context of data centers.
External hardware scales poorly inside a data center, while software timers, which require kernel changes, are not supported by all operating systems.

• Even if sufficiently fine grained TCP timers were supported, reducing the RTOmin value can be harmful, especially in situations where the servers communicate with clients outside the data center. In [125], the authors note that low values for RTOmin increase the occurrence of premature timeouts, as RTOmin can be used for trading "timely response with premature timeouts". Other studies of RTO estimation in similar high-bandwidth, low-latency ATM networks also show that very low RTOmin values result in spurious retransmissions [47], because variations in round-trip times inside wide-area networks clash with the standard RTO estimator's short RTT memory.

In summary, the solution proposed in [74] should be viewed with caution as it increases the risk of premature timeouts.

4.1.4 Relying on Explicit Congestion Notification

Data Center TCP (DCTCP) is a protocol proposed in [75]. It aims to achieve high burst tolerance, low latency and high throughput during synchronized data transfers by requiring Ethernet switches to support Explicit Congestion Notification (ECN). DCTCP relies on a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold. DCTCP uses these ECNs to provide multi-bit feedback to its end hosts. The DCTCP source reacts to such notifications by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. Unfortunately, not all switches support ECN. Without the underlying ECN support, DCTCP faces the same issues and hurdles as standard TCP. Additionally, ECNs are known to be effective in simple configurations only. With more than one switch, ECNs have an adverse effect on data flows [61]. Furthermore, the authors in [75] make no claims about the suitability of DCTCP for wide area networks, as they assume internal data center traffic to be separate from that of the external world.

4.2 Probabilistic Retransmission

In the TCP world, timeouts are indicators of severe network congestion. Although the penalty for detecting congestion through timeouts is quite large in TCP, they are unavoidable in certain scenarios like full window losses and retransmission losses. In this section, we examine a technique that reduces the time taken to detect network congestion when TCP's loss recovery mechanism cannot be triggered by duplicate ACKs. Specifically, we explore the notion of proactively detecting network congestion through probabilistic retransmissions, while using TCP's retransmission timer as a fallback option.

4.2.1 Retransmit Thread

As discussed in Section 4.1.3, TCP has a conservative minimum RTO (RTOmin), whose value is orders of magnitude greater than the round trip times at data centers. To overcome the penalty imposed by a conservative RTOmin on timeouts in synchronized workloads, we propose a congestion recovery technique that relies on probabilistic retransmissions, kernel threads and duplicate ACKs. Most modern operating systems support threads in their kernel space. A kernel thread is the "lightest" unit of kernel scheduling. Our solution to the Incast problem utilizes one such kernel thread to probabilistically retransmit the highest unacknowledged segment in the sender's transmission window.
That is, every time the thread is scheduled for execution, it retransmits, with probability p, the highest unacknowledged segment in the sender's transmission window. Before retransmitting the segment, the thread also "marks" it as being "probabilistically retransmitted". Algorithm 1 captures the necessary details regarding the Retransmit Thread.

Algorithm 1 Retransmit Thread at Sender
  if length(Transmit Window) ≥ 1 then
    if uniform(0,1) ≤ p then
      mark Highest UnACKed Segment
      retransmit marked Segment
    end if
  end if
  yield processor

To "mark" the segment as being "probabilistically retransmitted", the Retransmit Thread uses one of the six reserved bits in the segment's header. Figure 2.1 in Chapter 2 shows the layout of a TCP segment with the reserved bits located next to the Header Length field. Because of its probabilistic nature, the retransmitted segment can arrive at the Ethernet switch (i) before any congestion, (ii) during a congestion or (iii) after a congestion. Case (i) would result in the destination receiving multiple copies of the same segment: the original segment transmitted by TCP, followed by the "marked" segment transmitted by our Retransmission Thread. In this situation, the client ignores the "mark" on the retransmitted segment and responds back with a normal cumulative ACK. In case (ii), the retransmitted segment is dropped by the switch since it arrives at a time when the switch's port buffers are full. Since the "marked" segment never reaches the destination, neither the sender nor the receiver is required to take any action. Under case (iii), if the sender's original segment was dropped at the switch due to congestion, the receiver would be seeing the sequence number on the retransmitted segment for the first time. Since the first copy of the segment is itself "marked", the receiver responds back with a normal cumulative ACK followed by 3 duplicate ACKs. By doing this, not only does the receiver acknowledge the occurrence of congestion at the intermediate switch, but it also helps the sender trigger Fast Retransmit for quicker loss recovery. Algorithm 2 lists the steps involved in handling retransmitted segments at the receiver.

Algorithm 2 Handling Retransmitted Segments at Receiver
  ...normal handling of segment...
  send ACK
  if isduplicate(ReceivedSegment) = false then
    if ismarked(ReceivedSegment) = true then
      for i = 1 to 3 do
        send ACK
      end for
    end if
  end if

When the sender receives 3 duplicate ACKs in a row, it automatically performs loss recovery using the Fast Retransmit mechanism, without waiting for the retransmission timer to expire. Algorithm 3 gives details on handling duplicate ACKs at the sender.

Algorithm 3 Handling ACKs at Sender
  ...normal handling of ACK...
  if dupackcount ≥ 3 then
    suspend retransmission thread
    invoke Fast Retransmit
  end if

Receiving a "marked" segment with an unseen sequence number indicates that (i) there was congestion in the network which accounted for the original copy of the segment, and (ii) the congestion is now cleared, for the "marked" segment would never have made it through otherwise. With congestion in the network now resolved, the receiver would like the sender to start its loss recovery early, without having to wait for a retransmission timer to expire. It initiates this by sending 3 duplicate ACKs back to the sender, which forces the sender to immediately perform a smooth reduction of its flow via Fast Recovery, instead of performing an abrupt reduction through Slow Start following a timeout.
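The sketch below restates Algorithms 1-3 in Python to make the interplay between the three roles explicit. The connection object conn and its methods (send, send_ack, is_duplicate, fast_retransmit, and so on) are illustrative stand-ins for the relevant pieces of a TCP implementation, not the names used in our ns-2 code.

```python
import random

MARK_BIT = 0x1  # illustrative: one of the six reserved bits in the TCP header

def retransmit_thread_step(conn, p):
    """Algorithm 1: one scheduled execution of the Retransmit Thread."""
    if conn.transmit_window_length() >= 1:
        if random.random() <= p:
            seg = conn.highest_unacked_segment()
            seg.flags |= MARK_BIT          # mark as probabilistically retransmitted
            conn.send(seg)
    # a real kernel thread would yield the processor here

def on_segment_received(conn, seg):
    """Algorithm 2: receiver-side handling of (possibly marked) segments."""
    conn.deliver(seg)                      # normal handling of the segment
    conn.send_ack()
    if not conn.is_duplicate(seg) and (seg.flags & MARK_BIT):
        for _ in range(3):                 # manufacture three duplicate ACKs
            conn.send_ack()

def on_ack_received(conn, ack):
    """Algorithm 3: sender-side reaction to duplicate ACKs."""
    conn.process_ack(ack)                  # normal handling of the ACK
    if conn.dupack_count >= 3:
        conn.suspend_retransmit_thread()
        conn.fast_retransmit()
```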
It is also possible that our Retransmission Thread never retransmits the highest unacknowledged segment. In such a case, the sender detects and responds to congestion only when its retransmission timer expires.

4.2.2 Performance Analysis

In order to measure the effectiveness of the suggested technique, we implement Algorithms 1, 2 and 3 in ns-2. To keep the simulations realistic, we model the thread context switch time by including a small delay of 20 μs between each execution of the Retransmission Thread. We also fix the RTOmin value to 200 ms. The rest of the experimental setup is the same as the one described in Section 4.1.3. Figure 4.4 shows that increasing the value of p (the probability of retransmission) improves the throughput at the client by orders of magnitude when the number of senders is greater than 8.

Figure 4.4: Effect of Retransmission Probability, p, on TCP Incast

From Figure 4.4, it is clear that using Retransmission Threads can significantly improve TCP's performance under synchronized workloads. However, the value of the retransmission probability, p, should be chosen with some consideration. If p is set too low, the proposed technique provides no significant benefits over default TCP. On the other hand, if p is set too high, it causes unnecessary retransmissions, contributing further to the congestion at the switch. Figure 4.5 shows the drop ratio, i.e., the number of packets dropped at the switch versus the number of packets received by it, for varying values of p. The graph also includes plots for default TCP with RTO 200 ms as well as modified TCP with RTO 200 μs, for reference. For an optimal p, the probabilistic retransmission technique would yield high TCP throughput with a low drop ratio. From Figures 4.4 and 4.5, it is clear that for our simulation environment, the best value of p is 0.001.

Figure 4.5: Effect of Retransmission Probability, p, on Drop Ratio

Figure 4.6 compares the performance of probabilistic retransmission (p = 0.001) with default TCP (RTO 200 ms) and modified TCP (RTO 200 μs). From Figure 4.6, it is evident that probabilistic retransmission outperforms default TCP under all experimental conditions. The technique also performs better than the modified TCP when the number of senders in the experiment is greater than ten. On the other hand, when the number of senders in the experiment is between five and ten, modified TCP yields slightly better throughput than our proposed solution. This is because very few senders experience severe losses when the sender count in the experiment is less than ten. In addition to that, the value of the retransmission probability, p, is only 0.001. Therefore, it is quite likely that the loss-experiencing senders make several attempts before succeeding at their probabilistic retransmissions. This in turn leaves the switch-client link underutilized for some period, which results in a small dip in the solution's performance when compared to modified TCP.

Figure 4.6: Comparing Probabilistic Retransmission with Default and Modified TCP

However, when the number of loss-experiencing senders is large, it is more likely that at least one of them will quickly succeed in its probabilistic retransmission. With every such success, the switch-client link is kept occupied for that much longer, resulting in a performance that is significantly better than that of the modified TCP. One must also keep in mind that the results discussed above hold only for the chosen value of p, in this case 0.001 (a rough estimate of the retransmission delay implied by this choice is sketched below).
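As a back-of-the-envelope check (ours, not a measurement from the simulations), a loss-experiencing sender needs on average 1/p executions of the Retransmit Thread before a probabilistic retransmission occurs. If the thread runs roughly once every 20 μs, the modeled context switch interval, then p = 0.001 implies an expected extra delay on the order of 20 ms, well below the 200 ms default RTOmin but above the 200 μs RTO of the modified TCP, which is consistent with the small dip observed for five to ten senders.

```python
def expected_probabilistic_retx_delay(p, schedule_interval_s=20e-6):
    """Rough expected delay before a probabilistic retransmission occurs:
    a geometric number of attempts (mean 1/p), one per scheduled execution
    of the Retransmit Thread. Ignores queueing and other processing delays."""
    return (1.0 / p) * schedule_interval_s

# With p = 0.001 and a 20 microsecond scheduling interval, the expected
# wait is about 0.02 s, i.e. roughly 20 ms.
print(expected_probabilistic_retx_delay(0.001))
```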
In Figure 4.4, we saw that higher values of p need fewer senders to achieve throughput saturation. Hence, if the synchronized workload inside a data center involves only a few senders, probabilistic retransmission can still outperform modified TCP if p is set to a higher value.

4.2.3 Summary

Based on our experiments and analysis, it is clear that probabilistic retransmission offers a feasible solution to TCP's Incast problem. In addition to being backwards compatible with existing flavors of TCP, the technique is also able to outperform existing Incast solutions without incurring any of their drawbacks. However, probabilistic retransmission relies heavily on the availability of kernel threads. Also, its performance is governed by the value assigned to p, the retransmission probability. Ideally, the value of p should be automatically computed and tuned, but we take the easier option for now and make it a user configurable variable. As part of our future work, we plan to implement this technique on a Linux based cluster and measure its performance in the real world.

4.3 Dynamic Segment Resizing

As detailed in Chapter 2, when TCP receives an out-of-order segment, it immediately responds back with a duplicate ACK. From the sender's perspective, receiving a duplicate ACK indicates potential loss or reordering of transmitted segments. TCP's Fast Retransmit algorithm uses the arrival of three consecutive duplicate ACKs as an indication that segments have been lost. The algorithm then initiates loss recovery at the sender, without waiting for the retransmission timer to expire. However, when the destination receives fewer than four segments due to severe network congestion, it has no chance of sending three duplicate ACKs, meaning that retransmission timeouts are the only means of loss recovery for a source that has lost all its segments due to network congestion. Timeouts are known to have a negative impact on TCP's performance, since the time needed for the protocol to recover losses through the retransmission timer is much longer than the time needed to recover via the Fast Retransmit algorithm. As discussed in Chapters 1 and 3, timeouts are also known to cause the Incast problem that TCP endures during synchronized data transfers. In our proposed scheme, we aim to address TCP Incast by making loss recovery through Fast Retransmit possible in operating regions where, currently, timeouts are the only option available.

Dynamic Segment Resizing is based on the idea of increasing the upstream flow of ACKs by sending downstream a large number of segments whose size is smaller than the maximum segment size supported by the connection. When a large number of segments are received at the destination, they trigger a large number of ACKs in the backward channel. And the larger the number of ACKs on the backward channel, the larger is the probability of the source recovering lost segments without the aid of a retransmission timer (a small numeric sketch of this effect follows the experimental setup below). In other words, our proposed procedure gives the transmitter a chance to obtain more information about the current state of the network between itself and the receiver. To illustrate our approach by means of an example, we vary the size of TCP's segments in the cluster file system experiment discussed in Section 4.1.3. In this experiment, we also limit the port buffer length on the intermediate switch to 32 KB, set the size of the SRU to 256 KB, cap the receive window size to 32 KB and fix the value of the minimum retransmission timeout, RTOmin, to 200 ms.
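Before turning to the simulation results, a small illustrative calculation (ours, not taken from the experiments above) shows why smaller segments create more chances for duplicate ACKs: the same SRU is split into many more segments, and Fast Retransmit needs at least four of them to reach the receiver after a loss before three duplicate ACKs can be generated.

```python
def segments_per_sru(sru_bytes, segment_bytes):
    """Number of segments needed to carry one SRU at a given segment size."""
    return -(-sru_bytes // segment_bytes)  # ceiling division

# Illustrative only: a 256 KB SRU carried in 1460-byte segments versus
# much smaller ones. More, smaller segments mean more ACK-generating
# arrivals per SRU, and hence a better chance of three duplicate ACKs.
if __name__ == "__main__":
    sru = 256 * 1024
    for mss in (1460, 536, 100):
        print(f"segment size {mss:4d} B -> {segments_per_sru(sru, mss):5d} segments per SRU")
```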
Figures 4.7, 4.8, 4.9 and 4.10 depict the effects of smaller TCP segments on the protocol's retransmission timeouts when the number of senders in the experiment is 5, 10, 20 and 50 respectively.

Figure 4.7: Timeout frequency for different segment sizes when sender count is 5
Figure 4.8: Timeout frequency for different segment sizes when sender count is 10
Figure 4.9: Timeout frequency for different segment sizes when sender count is 20
Figure 4.10: Timeout frequency for different segment sizes when sender count is 50

From these figures, it is clear that smaller sized segments reduce the number of timeouts that TCP experiences during a synchronized transfer. Additionally, smaller sized segments move the peak of the timeout histogram to the right, meaning that with smaller segments, TCP will have to lose a greater number of packets to experience a timeout. The graphs also suggest that with small enough segments, TCP can completely avoid timeouts during synchronized data transfers. Apart from the aforementioned advantages, reducing the size of the segments also gives TCP a finer control over the amount of unacknowledged data that can remain outstanding in the network.

However, transmitting smaller sized segments decreases TCP's line efficiency, which is defined as the ratio of the data size to the size of the (header + data) in a segment. In order to improve TCP's line efficiency when operating with smaller sized segments, we employ a header compression technique that is described in [126]. This compression mechanism reduces the normal 40 byte TCP/IP packet headers down to 3-4 bytes in the average case. It does this by saving the state of TCP connections at both ends of a link, and only sending the differences in the header fields that change. With this header compression technique in place, even a small data segment of size 36 bytes will be able to achieve a line efficiency of 90% for TCP.

In Figures 4.7, 4.8, 4.9 and 4.10, we notice that different cluster configurations have different limits on segment sizes that allow synchronized transfers to take place without incurring any timeout penalty. In order to maximize TCP's line efficiency during synchronized transfers involving smaller segments, it is desirable to have segment sizes that operate closer to these limits. Dynamic Segment Resizing is able to achieve this by relying on a congestion window threshold value called cwnddsr. The solution mandates TCP to begin its synchronized transfer with a predefined segment size of MSSdsr bytes. As TCP starts transmitting user data, its congestion window begins to grow. When TCP's congestion window, cwnd, grows beyond the congestion window threshold, cwnddsr, our solution resizes TCP's segments to (cwnd × MSSdsr)/(cwnddsr/2) bytes. TCP's congestion window, cwnd, is also resized to (cwnddsr/2) segments, so that the amount of outstanding data in bytes is preserved. Following this resize procedure, TCP resumes transmitting user data, albeit with slightly bigger segments. As before, the segments are resized if TCP's cwnd again grows beyond cwnddsr. This resize-transmit-resize cycle continues as long as TCP's segments remain smaller than the maximum segment size of the connection and its congestion window, cwnd, continues to grow beyond the threshold, cwnddsr. The cycle is eventually broken when the size of the resized segments equals the MSS of the connection or when the flow encounters duplicate ACKs which prevent the congestion window from growing beyond cwnddsr.
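A minimal Python sketch of this resize rule is shown below; mss_cur plays the role of the (already resized) MSSdsr, cwnd_dsr and mss_max correspond to cwnddsr and the connection's MSS, and the function illustrates the arithmetic rather than an actual TCP stack change.

```python
def resize_segments(cwnd, mss_cur, cwnd_dsr, mss_max):
    """One step of the resize-transmit-resize cycle of Dynamic Segment Resizing.

    cwnd is counted in segments and mss_cur in bytes; returns the updated
    (cwnd, segment size) pair."""
    if mss_cur >= mss_max or cwnd < cwnd_dsr:
        return cwnd, mss_cur                       # no resize needed yet
    mss_new = (cwnd * mss_cur) // (cwnd_dsr // 2)  # grow the segment size
    mss_new = min(mss_new, mss_max)                # never exceed the MSS
    cwnd_new = (cwnd * mss_cur) // mss_new         # keep outstanding bytes constant
    return cwnd_new, mss_new

# Starting from MSSdsr = 50 bytes and cwnddsr = 50 segments, the first
# resize yields 100-byte segments and a congestion window of 25 segments.
print(resize_segments(50, 50, 50, 1460))  # (25, 100)
```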
Algorithm 4 captures the necessary details regarding the resize procedure.

Algorithm 4 Resize Procedure for Dynamic Segment Resizing
  ...normal handling of cwnd growth...
  if (MSSdsr < MSS) and (cwnd ≥ cwnddsr) then
    MSStemp = (cwnd × MSSdsr) / (cwnddsr / 2)
    if MSStemp > MSS then
      MSStemp = MSS
    end if
    cwnd = (MSSdsr × cwnd) / MSStemp
    MSSdsr = MSStemp
  end if

Dynamic Segment Resizing is a proposed Incast solution that requires some minor changes to the sender's TCP stack. These changes are easy to incorporate and only require a few modifications to existing TCP code. The header compression procedure, on the other hand, needs to be implemented at both of the communicating endpoints.

4.3.1 Performance Analysis

In order to measure the effectiveness of the suggested technique, we implement Algorithm 4 in ns-2. We then measure the performance of the Dynamic Segment Resizing technique in the cluster file system example discussed in Section 4.1.3. For this experiment, we limit the port buffer length on the intermediate switch to 32 KB, set the size of the SRU to 256 KB, fix the value of the minimum retransmission timeout, RTOmin, to 200 ms and cap the receive window size at 1000 segments. We also set the Algorithm 4 specific variables, MSSdsr and cwnddsr, to 50 bytes and 50 segments respectively. The rest of the experimental setup is the same as the one described in Section 4.1.3. Figure 4.11 compares the performance of Dynamic Segment Resizing with default TCP. From Figure 4.11, it is evident that Dynamic Segment Resizing incurs a small penalty in performance when the number of senders in the experiment is less than three. This is because the proposed technique takes some time to converge on the connection's maximum segment size as the most appropriate size for performing synchronized data transfers without incurring any timeout penalty. Default TCP, on the other hand, starts with maximum sized segments and is therefore able to achieve a better line rate than Dynamic Segment Resizing. However, when the number of senders in the cluster file system is greater than three, our proposed solution easily outperforms default TCP.

Figure 4.11: Comparing Dynamic Segment Resize with Default TCP

4.3.2 Summary

From the simulation results discussed in Section 4.3.1, it is clear that Dynamic Segment Resizing offers a practical, transport-layer solution to the Incast problem. The technique only requires some minor modifications on the sender side TCP and is backwards compatible with many existing flavors of the protocol. Unlike the probabilistic retransmission technique discussed in Section 4.2, Dynamic Segment Resizing does not depend on the availability of external resources like kernel threads. However, like p in probabilistic retransmission, the performance of Dynamic Segment Resizing is also dependent on the initial values of MSSdsr and cwnddsr. Ideally, the start values of MSSdsr and cwnddsr would be automatically computed, but for now we make these variables user configurable. As part of our future work, we plan to implement Dynamic Segment Resizing on a Linux cluster and measure its performance in the real world.

Chapter 5
Conclusions and Future Work

In this chapter, we summarize the research discussed in this dissertation and follow it up with directions for future work.

5.1 Summary of Research

In this dissertation, we studied TCP's performance under many-to-one synchronized traffic, when operating in high speed, low latency data center networks.
In particular, we discussed the problem of TCP Incast, which causes the protocol's throughput to drop to almost a tenth of its link's available capacity. We derived an analytical model to investigate Incast and attributed TCP's throughput collapse to its timeouts. We also proposed some transport layer techniques to overcome Incast and evaluated their merits using ns-2 simulations.

In Chapter 1, we discussed Cloud Computing and its different components. We outlined how the growing adoption of Cloud Computing is prompting service providers to spawn more data centers. We also discussed the cost and compatibility reasons that persuade service providers to employ Ethernet as the baseline communication fabric for their data centers. We then introduced the problem of TCP Incast that results from utilizing TCP in an environment where many of its assumptions are violated. In particular, we saw how TCP's throughput collapses catastrophically under many-to-one synchronized traffic, when operating in Ethernet-based, high speed, low latency data center networks.

In Chapter 2, we presented details on the mechanisms that are responsible for TCP's reliable data transfer, flow control and congestion control. Our work in this chapter provided the necessary background for Chapters 3 and 4, where we considered the problem of TCP Incast in greater detail.

In Chapter 3, we presented a simple model for TCP Incast. The model captures the essence of many-to-one synchronized workloads and expresses throughput as a function of packet loss probability. In particular, it takes into account the behavior of multiple TCP flows in the presence of loss induced duplicate acknowledgments and retransmission timeouts. The model yields a simple, closed form formula for calculating the throughput of many-to-one synchronized traffic and attributes TCP's throughput collapse to two types of timeouts, ABTT and IBTT. We validated the model through extensive simulations done using the ns-2 simulator. We found that our model provides a very good match to the observed Incast behavior. The formula resulting from our model can be used for many purposes, such as fast evaluation of Incast behavior and design of Incast free transport protocols.

In Chapter 4, we discussed a few existing Incast solutions and their drawbacks. We then proposed two feasible solutions that addressed TCP Incast at the transport layer. Specifically, we developed solutions that improved TCP's performance under synchronized workloads by either proactively detecting network congestion through probabilistic retransmission or by dynamically resizing TCP's segments in order to avoid incurring timeout penalty. We also implemented these solutions in TCP and tested them extensively using the ns-2 simulator. We found that our proposed solutions are both able to avoid timeouts and overcome the ill effects of throughput collapse during synchronized data transfers in high speed, low latency, data center environments.

5.2 Future Work

There are several lines of research arising from the work presented in this dissertation. Some research lines that should be pursued in the future include:

• Accounting for window limitation in the Incast model - The model presented in Chapter 3 does not consider the impact of window limitation per composite flow. At the beginning of TCP flow establishment, the receiver advertises a maximum buffer size which determines the maximum congestion window size Wfmax.
As a consequence, during a period without loss indications, the window size can grow up to Wfmax, but will not grow beyond this value. Our Incast model must be tweaked to account for this scenario.

• Accounting for flavor specific features - TCP New-Reno and TCP SACK are the most dominant flavors of TCP that are currently deployed in data center networks. In order to accurately model these protocols, we need to modify our Incast model presented in Chapter 3 to accommodate flavor specific features.

• Developing techniques for loss rate estimation - For empirical validation of our Incast model in Chapter 3, we estimated the loss rate probability based on the traces generated by our ns-2 simulator. Since traces are not always available, we need to understand and evaluate various techniques that help us in loss rate estimation.

• Applying Markovian analysis - The Incast model presented in Chapter 3 is very simple and less accurate. Markovian analysis, on the other hand, is known to be detailed and precise. To better analyze the Incast phenomenon, we need to model Incast using Markovian models.

• Timeout type based solution - Neither the existing techniques nor our proposed solutions differentiate between the type of timeouts causing the Incast. It should be possible to design a solution that takes the type of timeouts, ABTT or IBTT, into consideration.

• Auto computing variable values - We currently use statically selected values for our solution specific variables like p, MSSdsr and cwnddsr. More work is needed to investigate means of automatically updating these variables in order to guarantee better Incast performance.

• Implementing in the real world - Since almost all the results presented in this dissertation are based on ns-2 simulations, we need to check if our proposed solutions work well in the real world. Towards this end, we need to implement the techniques of probabilistic retransmission as well as Dynamic Segment Resizing on a Linux based cluster and measure their performance in the real world.

Bibliography

[1] S. L. Garfinkel and H. Abelson, Architects of the Information Society: Thirty-Five Years of the Laboratory for Computer Science at MIT. The MIT Press, Apr 1999.
[2] J. McCarthy, "Reminiscences On The History Of Time Sharing," 1983. [Online]. Available: http://www-formal.stanford.edu/jmc/history/timesharing/timesharing.html
[3] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Gener. Comput. Syst., vol. 25, no. 6, pp. 599-616, Jun. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.future.2008.12.001
[4] Z. Mahmood and R. Hill, Cloud Computing for Enterprise Architectures. Springer Publishing Company, Incorporated, 2011.
[5] D. W. Cearley, "Cloud Computing: Key Initiative Overview," Gartner Report, Gartner, Inc., 2010. [Online]. Available: http://www.gartner.com/it/initiatives/pdf/ KeyInitiativeOverview CloudComputing.pdf
[6] J. Rhoton, Cloud Computing Explained: Implementation Handbook for Enterprises. Recursive Press, Nov 2009.
[7] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," Special Publication 800-145, National Institute of Standards and Technology, Tech. Rep., Sep 2011. [Online]. Available: http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
[8] W. Y. Chang, H. Abu-Amara, and J. F. Sanford, Transforming Enterprise Cloud Services. Springer Publishing Company, Incorporated, Dec 2010.
[9] S. Marston, Z.
Li, S. Bandyopadhyay, J. Zhang, and A. Ghalsasi, ?Cloud computing The business perspective,? Decision Support Systems, vol. 51, no. 1, pp. 176 ? 189, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/ S0167923610002393 [10] M. Miller, Cloud Computing: Web-Based Applications That Change the Way You Work and Collaborate Online. Que, Aug 2008. [11] C. Takemura and L. S. Crawford, The Book of Xen: A Practical Guide for the System Administrator. No Starch Press, Oct 2009. 102 [12] J. Arrasjid, K. Balachandran, D. Conde, G. Lamb, and S. Kaplan, Deploying the VMware Infrastructure. The USENIX Association, Aug 2010. [13] Cloud.com, ?2011 Cloud Computing Outlook,? 2011. [Online]. Available: http: //www.cloudstack.org/cloud-computing-docs/cloud-computing-survey.pdf [14] K. Murray, Microsoft Office 365: Connect and Collaborate Virtually Anywhere, Any- time, 1st ed. Microsoft Press, 2011. [15] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, ?Finding a needle in Haystack: Facebook?s photo storage,? in Proceedings of the 9th USENIX conference on Operating systems design and implementation, ser. OSDI?10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1?8. [Online]. Available: http://dl.acm.org/citation.cfm?id=1924943.1924947 [16] B. F. Cooper, E. Baldeschwieler, R. Fonseca, J. J. Kistler, P. P. S. Narayan, C. Neer- daels, T. Negrin, R. Ramakrishnan, A. Silberstein, U. Srivastava, and R. Stata, ?Build- ing a Cloud for Yahoo!? IEEE Data Eng. Bull., pp. 36?43, 2009. [17] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, ?PNUTS: Yahoo!?s hosted data serving platform,? Proc. VLDB Endow., vol. 1, no. 2, pp. 1277?1288, Aug. 2008. [Online]. Available: http://dx.doi.org/10.1145/1454159.1454167 [18] D. Robinson, Amazon Web Services Made Simple: Learn how Amazon EC2, S3, Sim- pleDB and SQS Web Services enables you to reach business goals faster. London, UK, UK: Emereo Pty Ltd, 2008. [19] J. E. Burgess, ?Youtube,? in Oxford Bibliographies Online. Oxford University Press, October 2011, final version following copy-editing by OUP. [Online]. Available: http://eprints.qut.edu.au/46719/ [20] F. P. Miller, A. F. Vandome, and J. McBrewster, Gmail: Gmail. Webmail, Post Office Protocol, Internet Message Access Protocol, Google, Gmail interface, History of Gmail, Paul Buchheit, Google?s hoaxes, Comparison of webmail providers, Gmail Mobile. Al- pha Press, 2009. [21] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, ?VL2: a scalable and flexible data center network,? in Proceedings of the ACM SIGCOMM 2009 conference on Data communication, ser. SIGCOMM ?09. New York, NY, USA: ACM, 2009, pp. 51?62. [Online]. Available: http://doi.acm.org/10.1145/1592568.1592576 [22] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, ?Above the Clouds: A Berkeley View of Cloud Computing,? EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb 2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html 103 [23] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, ?The cost of a cloud: research problems in data center networks,? SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, pp. 68?73, Dec. 2008. [Online]. Available: http://doi.acm.org/10.1145/1496091.1496103 [24] K. 
Kant, ?Data center evolution: A tutorial on state of the art, issues, and challenges,? Computer Networks, vol. 53, no. 17, pp. 2939 ? 2965, 2009, ?ce:title?Virtualized Data Centers?/ce:title?. [Online]. Available: http: //www.sciencedirect.com/science/article/pii/S1389128609003090 [25] T. Sridhar, ?Cloud Computing: A Primer Part 1: Models and Technologies,? The Internet Protocol Journal, vol. 12, no. 3, pp. 2?19, Sep. 2009. [Online]. Available: http://www.cisco.com/web/about/ac123/ac147/archived issues/ipj 12-3/index.html [26] L. J. Miller, ?The ISO Reference Model of Open Systems Interconnection: A first tutorial,? in Proceedings of the ACM ?81 conference, ser. ACM ?81. New York, NY, USA: ACM, 1981, pp. 283?288. [Online]. Available: http://doi.acm.org/10.1145/800175.809901 [27] T. Shanley, Infiniband. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002. [28] V. Nagasamy, S. Rajan, , and P. R. Panda, ?Fibre channel protocol: Formal specification and verification,? in Sixth Annual Silicon Valley Networking Conference. SysTech Research, apr 1995. [Online]. Available: http://www.csl.sri.com/papers/ svnc95/ [29] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, ?Myrinet: A Gigabit-per-Second Local Area Network,? IEEE Micro, vol. 15, no. 1, pp. 29?36, Feb. 1995. [Online]. Available: http://dx.doi.org/10.1109/40.342015 [30] ?IEEE Standard for Information Technology ? Telecommunications and Information Exchange Between Systems ? Local and Metropolitan Area Networks ? Specific Requirements ? Part 3: Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications,? LAN/MAN Standards Committee, New York, NY, USA, 2008. [Online]. Available: http://standards.ieee.org/about/get/802/802.3.html [31] M. Al-Fares, A. Loukissas, and A. Vahdat, ?A scalable, commodity data center network architecture,? in Proceedings of the ACM SIGCOMM 2008 conference on Data communication, ser. SIGCOMM ?08. New York, NY, USA: ACM, 2008, pp. 63?74. [Online]. Available: http://doi.acm.org/10.1145/1402958.1402967 [32] U. Hoelzle and L. A. Barroso, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 1st ed. Morgan and Claypool Publishers, 2009. 104 [33] J. Hamilton, ?On designing and deploying internet-scale services,? in Proceedings of the 21st conference on Large Installation System Administration Conference, ser. LISA?07. Berkeley, CA, USA: USENIX Association, 2007, pp. 18:1?18:12. [Online]. Available: http://dl.acm.org/citation.cfm?id=1349426.1349444 [34] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong, ?TCP Onloading for Data Center Servers,? Computer, vol. 37, no. 11, pp. 48?58, Nov. 2004. [Online]. Available: http://dx.doi.org/10.1109/MC.2004.223 [35] J. Postel, ?DoD standard Internet Protocol,? RFC 760, Internet Engineering Task Force, Jan. 1980, obsoleted by RFC 791, updated by RFC 777. [Online]. Available: http://www.ietf.org/rfc/rfc760.txt [36] ??, ?Transmission Control Protocol,? RFC 793 (Standard), Internet Engineering Task Force, Sep. 1981, updated by RFCs 1122, 3168, 6093, 6528. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt [37] N. Jani and K. Kant, ?SCTP Performance in Data Center Environments,? in Proceed- ings of the 2005 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS?05), 2005. [38] V. Jacobson, R. Braden, and D. 
Borman, ?TCP Extensions for High Performance,? RFC 1323 (Proposed Standard), Internet Engineering Task Force, May 1992. [Online]. Available: http://www.ietf.org/rfc/rfc1323.txt [39] L. S. Brakmo, S. W. O?Malley, and L. L. Peterson, ?TCP Vegas: new techniques for congestion detection and avoidance,? in Proceedings of the conference on Communications architectures, protocols and applications, ser. SIGCOMM ?94. New York, NY, USA: ACM, 1994, pp. 24?35. [Online]. Available: http://doi.acm.org/10.1145/190314.190317 [40] T. Kelly, ?Scalable TCP: improving performance in highspeed wide area networks,? SIGCOMM Comput. Commun. Rev., vol. 33, no. 2, pp. 83?91, Apr. 2003. [Online]. Available: http://doi.acm.org/10.1145/956981.956989 [41] D. X. Wei, C. Jin, S. H. Low, and S. Hegde, ?FAST TCP: motivation, architecture, algorithms, performance,? IEEE/ACM Trans. Netw., vol. 14, no. 6, pp. 1246?1259, Dec. 2006. [Online]. Available: http://dx.doi.org/10.1109/TNET.2006.886335 [42] S. Ha, I. Rhee, and L. Xu, ?CUBIC: a new TCP-friendly high-speed TCP variant,? SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 64?74, Jul. 2008. [Online]. Available: http://doi.acm.org/10.1145/1400097.1400105 [43] S. Floyd, ?HighSpeed TCP for Large Congestion Windows,? RFC 3649 (Experimental), Internet Engineering Task Force, Dec. 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3649.txt 105 [44] L. Xu, K. Harfoush, and I. Rhee, ?Binary Increase Congestion Control (BIC) for Fast Long-Distance Networks,? in IEEE Infocom. IEEE, 2004. [Online]. Available: http://www.ieee-infocom.org/2004/Papers/52 4.PDF [45] ATM Forum Inc., ATM User Network Interface (UNI) Specification Version 3.1, 1st ed. Prentice Hall, 1995. [46] M. Perloff and K. Reiss, ?Improvements to TCP performance in high-speed ATM networks,? Commun. ACM, vol. 38, no. 2, pp. 91?100, Feb. 1995. [Online]. Available: http://doi.acm.org/10.1145/204826.204849 [47] A. Romanow and S. Floyd, ?Dynamics of TCP traffic over ATM networks,? in Proceedings of the conference on Communications architectures, protocols and applications, ser. SIGCOMM ?94. New York, NY, USA: ACM, 1994, pp. 79?88. [Online]. Available: http://doi.acm.org/10.1145/190314.190322 [48] H. Balakrishnan, S. Seshan, E. Amir, and R. H. Katz, ?Improving TCP/IP performance over wireless networks,? in Proceedings of the 1st annual international conference on Mobile computing and networking, ser. MobiCom ?95. New York, NY, USA: ACM, 1995, pp. 2?11. [Online]. Available: http://doi.acm.org/10.1145/215530. 215544 [49] S. R. Cho, H. Sirisena, and K. Pawlikowski, ?An End-to-end Freeze TCP with Timestamps for Ad Hoc Networks,? in ICC 2005, 40th IEEE International Conference on Communications, B. G. Lee, Ed. Piscataway, NJ, USA: IEEE Communications Society, May 2005, pp. 3576?3582. [Online]. Available: http://dx.doi.org/10.1109/ICC.2005.1495084 [50] S. E. Kim and J. A. Copeland, ?TCP for seamless vertical handoff in hybrid mobile data networks,? in Global Telecommunications Conference, 2003. GLOBECOM ?03. IEEE, vol. 2, 2003, pp. 661?665 Vol.2. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=1258321 [51] K. Brown and S. Singh, ?M-TCP: TCP for mobile cellular networks,? SIGCOMM Comput. Commun. Rev., vol. 27, no. 5, pp. 19?43, Oct. 1997. [Online]. Available: http://doi.acm.org/10.1145/269790.269794 [52] S. H. Lee, H. G. Ahn, J. S. Lim, S. H. Kwak, and S. Kim, ?Performance analysis of snoop TCP with freezing agent over cdma2000 networks,? 
in Proceedings of the 7th CDMA international conference on Mobile communications, ser. CIC?02. Berlin, Heidelberg: Springer-Verlag, 2003, pp. 496?505. [Online]. Available: http://dl.acm.org/citation.cfm?id=1766911.1766973 [53] E. Hossain and N. Parvez, ?Enhancing TCP performance in wide-area cellular wireless networks: transport level approaches,? in Wireless communications systems and networks, M. Guizani, Ed. New York, NY, USA: Plenum Press, 2004, pp. 241?289. [Online]. Available: http://dl.acm.org/citation.cfm?id=1016648.1016658 106 [54] J. Liu and S. Singh, ?ATCP: TCP for mobile ad hoc networks,? Selected Areas in Communications, IEEE Journal on, vol. 19, no. 7, pp. 1300?1315, Aug. 2002. [Online]. Available: http://dx.doi.org/10.1109/49.932698 [55] I. F. Akyildiz, G. Morabito, and S. Palazzo, ?TCP-Peach: a new congestion control scheme for satellite IP networks,? IEEE/ACM Trans. Netw., vol. 9, no. 3, pp. 307?321, Jun. 2001. [Online]. Available: http://dx.doi.org/10.1109/90.929853 [56] C. P. Fu and S. C. Liew, ?TCP Veno: TCP enhancement for transmission over wireless access networks,? IEEE J.Sel. A. Commun., vol. 21, no. 2, pp. 216?228, Sep. 2006. [Online]. Available: http://dx.doi.org/10.1109/JSAC.2002.807336 [57] K. Xu, Y. Tian, and N. Ansari, ?TCP-Jersey for wireless IP communications,? IEEE J.Sel. A. Commun., vol. 22, no. 4, pp. 747?756, Sep. 2006. [Online]. Available: http://dx.doi.org/10.1109/JSAC.2004.825989 [58] E. H.-K. Wu and M.-Z. Chen, ?JTCP: jitter-based TCP for heterogeneous wireless networks,? IEEE J.Sel. A. Commun., vol. 22, no. 4, pp. 757?766, Sep. 2006. [Online]. Available: http://dx.doi.org/10.1109/JSAC.2004.825999 [59] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, ?Understanding TCP incast throughput collapse in datacenter networks,? in Proceedings of the 1st ACM workshop on Research on enterprise networking, ser. WREN ?09. New York, NY, USA: ACM, 2009, pp. 73?82. [Online]. Available: http://doi.acm.org/10.1145/1592681.1592693 [60] D. Nagle, D. Serenyi, and A. Matthews, ?The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage,? in Proceedings of the 2004 ACM/IEEE conference on Supercomputing, ser. SC ?04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 53?. [Online]. Available: http://dx.doi.org/10.1109/SC.2004.57 [61] A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, ?Measurement and analysis of TCP throughput collapse in cluster-based storage systems,? in Proceedings of the 6th USENIX Conference on File and Storage Technologies, ser. FAST?08. Berkeley, CA, USA: USENIX Association, 2008, pp. 12:1?12:14. [Online]. Available: http: //dl.acm.org/citation.cfm?id=1364813.1364825 [62] E. Krevat, V. Vasudevan, A. Phanishayee, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, ?On application-level approaches to avoiding TCP throughput collapse in cluster-based storage systems,? in Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing ?07, ser. PDSW ?07. New York, NY, USA: ACM, 2007, pp. 1?4. [Online]. Available: http://doi.acm.org/10.1145/1374596.1374598 [63] S. Kulkarni and P. Agrawal, ?A Probabilistic Approach to Address TCP Incast in Data Center Networks,? in Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops, ser. ICDCSW ?11. 107 Washington, DC, USA: IEEE Computer Society, 2011, pp. 26?33. [Online]. Available: http://dx.doi.org/10.1109/ICDCSW.2011.41 [64] S. 
Ghemawat, H. Gobioff, and S.-T. Leung, ?The Google file system,? SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29?43, Oct. 2003. [Online]. Available: http://doi.acm.org/10.1145/1165389.945450 [65] R. Y. Wang and T. E. Anderson, ?xFS: A Wide Area Mass Storage File System,? University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 1993. [66] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, ?Dynamo: Amazon?s highly available key-value store,? in Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, ser. SOSP ?07. New York, NY, USA: ACM, 2007, pp. 205?220. [Online]. Available: http://doi.acm.org/10.1145/1294261.1294281 [67] F. Schmuck and R. Haskin, ?GPFS: A Shared-Disk File System for Large Computing Clusters,? in Proceedings of the 1st USENIX Conference on File and Storage Technologies, ser. FAST ?02. Berkeley, CA, USA: USENIX Association, 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=1083323.1083349 [68] S. Chakrabarti, M. van den Berg, and B. Dom, ?Focused crawling: a new approach to topic-specific Web resource discovery,? in Proceedings of the eighth international conference on World Wide Web, ser. WWW ?99. New York, NY, USA: Elsevier North-Holland, Inc., 1999, pp. 1623?1640. [Online]. Available: http://dl.acm.org/citation.cfm?id=313234.313121 [69] J. Cho and H. Garcia-Molina, ?Parallel crawlers,? in Proceedings of the 11th international conference on World Wide Web, ser. WWW ?02. New York, NY, USA: ACM, 2002, pp. 124?135. [Online]. Available: http://doi.acm.org/10.1145/511446. 511464 [70] J. Luo and Z. Shi, ?Eliminate redundancy in parallel search: a multi-agent coordination approach,? in Proceedings of the 9th Pacific Rim international conference on Artificial intelligence, ser. PRICAI?06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 91?100. [Online]. Available: http://dl.acm.org/citation.cfm?id=1757898.1757912 [71] M. D. Dikaiakos, A. Katsifodimos, and G. Pallis, ?Minersoft: Software retrieval in grid and cloud computing infrastructures,? ACM Trans. Internet Technol., vol. 12, no. 1, pp. 2:1?2:34, Jul. 2012. [Online]. Available: http://doi.acm.org/10.1145/2220352.2220354 [72] J. Dean and S. Ghemawat, ?MapReduce: simplified data processing on large clusters,? Commun. ACM, vol. 51, no. 1, pp. 107?113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492 [73] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, ?The Hadoop Distributed File System,? in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST ?10. Washington, DC, 108 USA: IEEE Computer Society, 2010, pp. 1?10. [Online]. Available: http: //dx.doi.org/10.1109/MSST.2010.5496972 [74] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, ?Safe and effective fine-grained TCP retransmissions for datacenter communication,? in Proceedings of the ACM SIGCOMM 2009 conference on Data communication, ser. SIGCOMM ?09. New York, NY, USA: ACM, 2009, pp. 303?314. [Online]. Available: http://doi.acm.org/10.1145/1592568.1592604 [75] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, ?Data center TCP (DCTCP),? in Proceedings of the ACM SIGCOMM 2010 conference, ser. SIGCOMM ?10. New York, NY, USA: ACM, 2010, pp. 63?74. [Online]. Available: http://doi.acm.org/10.1145/1851182.1851192 [76] K. Thompson, G. J. Miller, and R. 
Wilder, ?Wide-area Internet traffic patterns and characteristics,? Netwrk. Mag. of Global Internetwkg., vol. 11, no. 6, pp. 10?23, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1109/65.642356 [77] S. McCreary and k. claffy, ?Trends in wide area IP traffic patterns - A view from Ames Internet Exchange,? in ITC Specialist Seminar, Monterey, CA, Sep 2000. [78] W. Noureddine and F. Tobagi, ?The Transmission Control Protocol.? [Online]. Available: http://citeseer.ist.psu.edu/noureddine02transmission.html [79] A. Tanenbaum, Computer Networks, 4th ed. Prentice Hall Professional Technical Reference, 2002. [80] W. K. Noureddine, ?Improving the performance of tcp applications using network- assisted mechanisms,? Ph.D. dissertation, Stanford University, Stanford, CA, USA, 2002, aAI3048586. [81] T. Berners-Lee, R. Fielding, and H. Frystyk, ?Hypertext Transfer Protocol ? HTTP/1.0,? RFC 1945 (Informational), Internet Engineering Task Force, May 1996. [Online]. Available: http://www.ietf.org/rfc/rfc1945.txt [82] J. Postel and J. Reynolds, ?File Transfer Protocol,? RFC 959 (Standard), Internet Engineering Task Force, Oct. 1985, updated by RFCs 2228, 2640, 2773, 3659, 5797. [Online]. Available: http://www.ietf.org/rfc/rfc959.txt [83] J. Postel, ?Simple Mail Transfer Protocol,? RFC 821 (Standard), Internet Engineering Task Force, Aug. 1982, obsoleted by RFC 2821. [Online]. Available: http://www.ietf.org/rfc/rfc821.txt [84] C. Feather, ?Network News Transfer Protocol (NNTP),? RFC 3977 (Proposed Standard), Internet Engineering Task Force, Oct. 2006, updated by RFC 6048. [Online]. Available: http://www.ietf.org/rfc/rfc3977.txt 109 [85] T. Ylonen and C. Lonvick, ?The Secure Shell (SSH) Protocol Architecture,? RFC 4251 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4251.txt [86] ??, ?The Secure Shell (SSH) Authentication Protocol,? RFC 4252 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4252.txt [87] ??, ?The Secure Shell (SSH) Transport Layer Protocol,? RFC 4253 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4253.txt [88] ??, ?The Secure Shell (SSH) Connection Protocol,? RFC 4254 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4254.txt [89] J. Schlyter and W. Griffin, ?Using DNS to Securely Publish Secure Shell (SSH) Key Fingerprints,? RFC 4255 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4255.txt [90] F. Cusack and M. Forssen, ?Generic Message Exchange Authentication for the Secure Shell Protocol (SSH),? RFC 4256 (Proposed Standard), Internet Engineering Task Force, Jan. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4256.txt [91] J. Kristoff. (2004, Apr.) The Transmission Control Protocol. [Online]. Available: http://condor.depaul.edu/jkristof/technotes/tcp.html [92] R. Braden, ?Requirements for Internet Hosts - Communication Layers,? RFC 1122 (Standard), Internet Engineering Task Force, Oct. 1989, updated by RFCs 1349, 4379, 5884, 6093, 6298, 6633. [Online]. Available: http://www.ietf.org/rfc/rfc1122.txt [93] W. R. Stevens, TCP/IP illustrated (vol. 1): the protocols. Boston, MA, USA: Addison- Wesley Longman Publishing Co., Inc., 1993. [94] M. Allman, V. Paxson, and E. Blanton, ?TCP Congestion Control,? RFC 5681 (Draft Standard), Internet Engineering Task Force, Sep. 
[94] M. Allman, V. Paxson, and E. Blanton, "TCP Congestion Control," RFC 5681 (Draft Standard), Internet Engineering Task Force, Sep. 2009. [Online]. Available: http://www.ietf.org/rfc/rfc5681.txt
[95] J. Postel, "User Datagram Protocol," RFC 768 (Standard), Internet Engineering Task Force, Aug. 1980. [Online]. Available: http://www.ietf.org/rfc/rfc768.txt
[96] R. Chow and Y.-C. Chow, Distributed Operating Systems and Algorithms. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1997.
[97] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts, 6th ed. New York, NY, USA: John Wiley & Sons, Inc., 2001.
[98] G. F. Pfister, In search of clusters (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1998.
[99] R. Buyya, High Performance Cluster Computing: Architectures and Systems. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1999.
[100] D. DeWitt and J. Gray, "Parallel database systems: the future of high performance database systems," Commun. ACM, vol. 35, no. 6, pp. 85–98, Jun. 1992. [Online]. Available: http://doi.acm.org/10.1145/129888.129894
[101] K. Keeton, D. Beyer, E. Brau, A. Merchant, C. Santos, and A. Zhang, "On the road to recovery: restoring data after disasters," in Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, ser. EuroSys '06. New York, NY, USA: ACM, 2006, pp. 235–248. [Online]. Available: http://doi.acm.org/10.1145/1217935.1217958
[102] K. Keeton, C. Santos, D. Beyer, J. Chase, and J. Wilkes, "Designing for disasters," in Proceedings of the 3rd USENIX conference on File and storage technologies, ser. FAST '04. Berkeley, CA, USA: USENIX Association, 2004, pp. 5–5. [Online]. Available: http://dl.acm.org/citation.cfm?id=1973374.1973379
[103] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, and G. A. Gibson, "A (In)Cast of Thousands: Scaling Datacenter TCP to Kiloservers and Gigabits," Technical Report CMU-PDL-09-101, Carnegie Mellon University Parallel Data Lab, Feb. 2009. [Online]. Available: http://www.pdl.cmu.edu/PDL-FTP/Storage/CMU-PDL-09-101.pdf
[104] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka, "A cost-effective, high-bandwidth storage architecture," in Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, ser. ASPLOS-VIII. New York, NY, USA: ACM, 1998, pp. 92–103. [Online]. Available: http://doi.acm.org/10.1145/291069.291029
[105] M. Abd-El-Malek, W. V. Courtright, II, C. Cranor, G. R. Ganger, J. Hendricks, A. J. Klosterman, M. Mesnier, M. Prasad, B. Salmon, R. R. Sambasivan, S. Sinnamohideen, J. D. Strunk, E. Thereska, M. Wachs, and J. J. Wylie, "Ursa minor: versatile cluster-based storage," in Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4, ser. FAST '05. Berkeley, CA, USA: USENIX Association, 2005, pp. 5–5. [Online]. Available: http://dl.acm.org/citation.cfm?id=1251028.1251033
[106] Y. Cheng, C. Qin, and F. Rusu, "GLADE: big data analytics made easy," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '12. New York, NY, USA: ACM, 2012, pp. 697–700. [Online]. Available: http://doi.acm.org/10.1145/2213836.2213936
[107] K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and E. Paulson, "Efficient processing of data warehousing queries in a split execution environment," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '11. New York, NY, USA: ACM, 2011, pp. 1165–1176. [Online]. Available: http://doi.acm.org/10.1145/1989323.1989447
[108] Y. Huai, R. Lee, S. Zhang, C. H. Xia, and X. Zhang, "DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems," in Proceedings of the 2nd ACM Symposium on Cloud Computing, ser. SOCC '11. New York, NY, USA: ACM, 2011, pp. 4:1–4:14. [Online]. Available: http://doi.acm.org/10.1145/2038916.2038920
[109] F. J. Alexander, A. Hoisie, and A. S. Szalay, "Big Data [Guest editorial]," Computing in Science and Engineering, vol. 13, no. 6, pp. 10–13, 2011.
[110] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose, "Modeling TCP throughput: a simple model and its empirical validation," in Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication. New York, NY, USA: ACM, 1998, pp. 303–314. [Online]. Available: http://doi.acm.org/10.1145/285237.285291
[111] N. Parvez, A. Mahanti, and C. Williamson, "An analytic throughput model for TCP NewReno," IEEE/ACM Trans. Netw., vol. 18, no. 2, pp. 448–461, Apr. 2010. [Online]. Available: http://dx.doi.org/10.1109/TNET.2009.2030889
[112] E. Altman, K. Avrachenkov, and C. Barakat, "A stochastic model of TCP/IP with stationary random losses," IEEE/ACM Trans. Netw., vol. 13, no. 2, pp. 356–369, Apr. 2005. [Online]. Available: http://dx.doi.org/10.1109/TNET.2005.845536
[113] M. Goyal, R. Guérin, and R. Rajan, "Predicting TCP Throughput From Non-invasive Network Sampling," in INFOCOM, 2002.
[114] A. Kumar, "Comparative performance analysis of versions of TCP in a local network with a lossy link," IEEE/ACM Trans. Netw., vol. 6, no. 4, pp. 485–498, Aug. 1998. [Online]. Available: http://dx.doi.org/10.1109/90.720921
[115] C. Casetti and M. Meo, "A new approach to model the stationary behavior of TCP connections," in INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, vol. 1, 2000, pp. 367–375.
[116] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Trans. Netw., vol. 1, no. 4, pp. 397–413, Aug. 1993. [Online]. Available: http://dx.doi.org/10.1109/90.251892
[117] K. Fall and S. Floyd, "Simulation-based comparisons of Tahoe, Reno and SACK TCP," SIGCOMM Comput. Commun. Rev., vol. 26, no. 3, pp. 5–21, Jul. 1996. [Online]. Available: http://doi.acm.org/10.1145/235160.235162
[118] S. McCanne, S. Floyd, and K. Fall, "ns2 (network simulator 2)," http://www-nrg.ee.lbl.gov/ns/. [Online]. Available: http://www.isi.edu/nsnam/ns/
[119] K. Varadhan and K. Fall, "The ns Manual (formerly ns Notes and Documentation)," The VINT Project, A Collaboration between researchers at UC Berkeley, LBL, USC/ISI, and Xerox PARC, Nov. 2011. [Online]. Available: http://www.isi.edu/nsnam/ns/doc/ns_doc.pdf
[120] V. Jacobson, "Congestion avoidance and control," SIGCOMM Comput. Commun. Rev., vol. 25, no. 1, pp. 157–187, Jan. 1995. [Online]. Available: http://doi.acm.org/10.1145/205447.205462
[121] V. Paxson and M. Allman, "Computing TCP's Retransmission Timer," RFC 2988 (Proposed Standard), Internet Engineering Task Force, Nov. 2000, obsoleted by RFC 6298. [Online]. Available: http://www.ietf.org/rfc/rfc2988.txt
[122] P. Sarolahti and A. Kuznetsov, "Congestion Control in Linux TCP," in Proceedings of the FREENIX Track: 2002 USENIX Annual Technical Conference. Berkeley, CA, USA: USENIX Association, 2002, pp. 49–62. [Online]. Available: http://dl.acm.org/citation.cfm?id=647056.715932
[123] M. Aron and P. Druschel, "TCP Implementation Enhancements for Improving Webserver Performance," Rice University, Tech. Rep. TR99-335, Jun. 1999. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5152
[124] ——, "Soft timers: efficient microsecond software timer support for network processing," ACM Transactions on Computer Systems, vol. 18, no. 3, pp. 197–228, 2000. [Online]. Available: citeseer.ist.psu.edu/aron99soft.html
[125] M. Allman and V. Paxson, "On estimating end-to-end network path properties," SIGCOMM Comput. Commun. Rev., vol. 31, no. 2 supplement, pp. 124–151, Apr. 2001. [Online]. Available: http://doi.acm.org/10.1145/844193.844203
[126] V. Jacobson, "Compressing TCP/IP Headers for Low-Speed Serial Links," RFC 1144 (Proposed Standard), Internet Engineering Task Force, Feb. 1990. [Online]. Available: http://www.ietf.org/rfc/rfc1144.txt