iTad: An I/O-Aware Thermal Model for Data Centers

by

Tausif Muzaffar

A thesis submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Master of Science

Auburn, Alabama
May 4, 2014

Keywords: Data Centers, Thermal Energy, Heat Model

Copyright 2014 by Tausif Muzaffar

Approved by

Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering
Alvin Lim, Associate Professor of Computer Science and Software Engineering

Abstract

With the ever-growing cooling costs of large-scale data centers, thermal management must be adequately addressed. Thermal models can play a critical role in thermal management, which helps reduce cooling costs in data centers. However, existing thermal models for data centers tend to overlook I/O activities. To address this issue, we developed an I/O-aware thermal model called iTad for data centers. The iTad model captures the thermal characteristics of servers in a data center, offering a much finer granularity than the existing models. In addition to CPU workloads, iTad incorporates the I/O load in order to accurately estimate the thermal footprint of servers running I/O-intensive activities. We validate the accuracy of the iTad model using real-world temperature measurements acquired by an infrared thermometer. Our empirical results show that I/O utilization has a significant impact on the internal temperatures of data servers. We show that thermal management mechanisms can quickly retrieve the thermal information of servers from iTad before making important workload placement decisions in a real-time manner.

Acknowledgments

As my studies have progressed, I have become ever more humbled by the generosity and support of those in my life. First, I would like to thank Dr. Qin for his belief in me, which encouraged me to pursue higher education. Without his support and guidance, none of this would be possible.

Of course, I would also like to thank all the great professors here in Auburn's CSSE department for challenging me and teaching me everything I know about computer science today. A special thanks to Dr. Seals and Dr. Lim, both for their work in teaching me and for serving as committee members for my defense.

I would be doing an injustice not to thank everyone in my lab group for their ongoing support and guidance, especially those I have known since the beginning of my master's program: Xunfei, Sanjay, and Ajit.

Lastly, I would like to thank God and my family for all their support and protection. My hope is that my actions and this thesis are done in a way that would please them.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Reducing Monitoring Cost
  1.2 Reducing Monitoring Time
  1.3 Benefits of Thermal Model
  1.4 Contributions
2 Related Work
  2.1 Energy-Efficient Data Centers
  2.2 Thermal Aware Data Centers
  2.3 Thermal Simulations
  2.4 Thermal Models
3 Methodology
  3.1 Determine Recirculation Factors
  3.2 Determine Hardware Factors
4 Modeling
  4.1 Assumptions and Notation
  4.2 Modeling impacts of heat on temperature
  4.3 Modeling impacts of workload on temperature
5 Experimental Parameters
  5.1 Set up
  5.2 Period
  5.3 Determining a baseline temperature
  5.4 Impact of I/O utilization on temperature
  5.5 Impact of CPU utilization on temperature
  5.6 Shared I/O and CPU Utilization
  5.7 Determining Constants
6 Usage
  6.1 Verification
  6.2 MPI
  6.3 Hadoop
7 Experience
  7.1 Improvements
  7.2 Extension
8 Conclusion
Bibliography

List of Figures

1.1 Temperature of processor when CPU utilization is 100% vs. time [26]
1.2 Thermometer used in testing
3.1 An overview of a data center
3.2 Model overview
3.3 Three factors affect the outlet temperature of a single blade server
3.4 Three factors affect the inlet temperature of a single blade server
4.1 Radiant heat equals convective heat
4.2 Visual representation of workload effects on outlet temperature
5.1 Utilization of components at fixed utilization
5.2 Measured area of temperatures
5.3 Surface temperatures
5.4 Utilization temperature
5.5 Relationship between utilization and outlet temperature
5.6 Values of Z in all experiments
6.1 Verification of model
6.2 Sample MPI usage

List of Tables

4.1 Model notation
5.1 Server specifications
5.2 Temperature zones
5.3 Compilation of all the values gathered

List of Abbreviations

Auburn: Auburn University
iTad: I/O Thermal Aware Datacenter
I/O: Input and Output
CPU: Central Processing Unit
CRAC: Computer Room Air Conditioning

Chapter 1

Introduction

Recent studies show that thermal management is an important issue for data centers due to ever-increasing cooling costs [24]. Cooling costs contribute a significant portion of the operational cost of large-scale data centers; therefore, increasing the size of a data center leads to a huge amount of energy consumed by the center's cooling system. An efficient way to combat the high cost of cooling systems is to develop thermal-aware management techniques that place jobs and data on servers so as to minimize temperatures in data centers.

Thermal management aims at reducing the cooling costs of data centers; thermal management mechanisms largely rely on thermal information to make intelligent job and data placement decisions. Thermal information can be acquired in the following three ways:

1. Temperature sensors measure inlet and outlet temperatures of servers.

2. Computational fluid dynamics simulators (see, for example, Flovent) simulate temperatures of servers in data centers.

3. Thermal models estimate a server's temperature based on the server's workloads.

After looking through these options, we decided to create a CPU- and I/O-aware thermal model called iTad, which stands for I/O Thermal Aware Data center. The reasons we decided to build such a model rather than use the other two options are explained in the following sections.

1.1 Reducing Monitoring Cost

The first approach is to monitor server inlet temperatures by deploying sensors in a number of locations in a data center [22]. This approach faces a dilemma: while high levels of accuracy can be achieved by increasing the number of sensors, doing so leads to an expensive monitoring solution. Conversely, reducing the number of sensors may cause inaccuracies, and an algorithm would need to be developed to extrapolate the heat of individual nodes, taking away the simplicity that makes this route so appealing. For large-scale data centers, this approach is not very practical for two reasons. First, it is prohibitively expensive to deploy hundreds of thousands of sensors to offer accurate temperature measurements. Each server needs at least two sensors, and each sensor may cost up to $100 [7]. Second, the wiring and maintenance cost of such a large number of sensors further increases the operational cost of data centers.
1.2 Reducing Monitoring Time

To reduce the high cost of deploying an excessive number of sensors, data center managers can make use of computational fluid dynamics simulators to simulate and collect inlet temperatures of servers [21]. Although this simulation approach offers accurate thermal information at low cost without employing any sensors, it is time consuming (e.g., several hours) to run each simulation study. Thus, the simulation studies must be conducted offline, meaning that thermal management mechanisms are unable to retrieve thermal information from the simulators at run-time.

1.3 Benefits of Thermal Model

Thermal models are arguably a more promising approach to supporting thermal management mechanisms; they can provide temperature information of servers at run-time without incurring any cost to purchase and maintain sensors. Thermal models offer the following four major benefits for data centers. First, thermal models significantly reduce thermal monitoring costs. Second, unlike thermal simulators, thermal models offer temperature information to thermal management schemes in a real-time manner. For example, our iTad thermal model is able to profile the thermal characteristics of a data center in a matter of seconds. Third, thermal management powered by thermal models helps cut cooling costs and boosts system reliability. Last, thermal models allow data center designers to quickly make intelligent decisions on thermal management in an early design phase.

Figure 1.1: Temperature of processor when CPU utilization is 100% vs. time [26].

Most existing thermal models on the market treat servers as a uniform black box because it is unclear what factors are involved in the heat distribution of a data center [20]. There are a few thermal models (see, for example, [26]) that can derive power consumption and the necessary cooling power from inlet and outlet temperatures of servers. However, these models (1) require many thermometers, (2) require managing thermal information in real time, and (3) are based only on CPU work. CPU workload has long been known to affect thermal load; for example, Figure 1.1 shows that when CPU utilization is increased, there is a large increase in temperature [26]. I/O workload, however, is neglected: I/O-intensive activities in servers are commonly overlooked in these models. One of the goals of this research is to demonstrate that I/O utilization plays an important role in a server's thermal dissipation. We believe I/O-intensive applications running in data centers impose heavy load on servers, making the disks of those servers hot spots.

1.4 Contributions

The major contributions of this study are summarized as follows:

1. We develop the iTad thermal model that provides outlet temperatures of servers in a data center. We show that both CPU and I/O thermal outputs can be extrapolated from the radiation heat and convection heat applied to a server. With iTad in place, thermal management schemes can quickly make workload management decisions at run-time based on I/O and CPU utilizations.

2. We validate the accuracy of the iTad model using a server's real-world temperature measurements obtained by an infrared thermometer (Figure 1.2).

Figure 1.2: Thermometer used in testing

3. Our experimental results suggest that iTad is an accurate model for deriving server outlet temperatures according to I/O and CPU activities.

4. We show that this model can easily be plugged into any data center.
5. We analytically study the relationship between I/O load and server outlet temperatures. Our analysis confirms that I/O-intensive workloads have a significant impact on the temperatures of servers.

Chapter 2

Related Work

2.1 Energy-Efficient Data Centers

Big data has become one of the hottest topics in computing. With this refocus, there has been growing interest in energy-efficient data centers [10] [11], because a recent study shows that 1.2% of all energy consumption in the U.S. is attributed to data centers [17]. To minimize the effect of data centers on national energy consumption, many energy-saving approaches have been proposed. One that relates to this research is the work of Bieswanger et al., who deploy sensors to analyze power consumption instantaneously; since our research deals with real-time thermal management, the two have some overlap [8].

2.2 Thermal Aware Data Centers

Energy-aware data centers have been the classical way of thinking about reducing the effect of data centers on the environment. Another school of thought is that, by managing the thermal outputs of data centers and thus reducing the cooling cost, we can have the same impact as energy-efficient data centers [25].

2.3 Thermal Simulations

Most of the research related to thermal management in data centers uses a commercial simulation package, FloVent, which provides detailed 3D visualization of airflow and temperature throughout the server room [2]. It can produce very accurate heat recirculation results. The downside is that it is very complicated to set up and configure, and each simulation takes a huge amount of time to run. Such software is very useful for machine learning, because of the time needed to implement machine learning techniques, but it is not very effective for split-second decision making. We use iTad to implement a low-cost and less time-consuming management technique.

2.4 Thermal Models

Eibeck et al. [13] developed a model to predict the transient temperature profile of an IBM 5-1/4-in. fixed disk drive by experimentally determining the thermal characteristics of the disk drive. Tan et al. presented a 3D finite element modeling technique to predict the transient temperature under frequent seeking [23]. Gurumurthi et al. investigated the thermal behavior of the hard disk and presented an integrated disk model. Their model calculates the heat generated by the physical components of the disk drive, such as the spindle motor, voice-coil motor, and disk arms [14]. Kim et al. studied the thermal behavior of disks by varying the platter types and number of platters and established a relationship between seek time and disk temperature [16]. However, the impact of disk utilization on disk temperature, and the contribution of disks to the outlet temperature of nodes, have not been investigated, even though the thermal footprint of computing clearly has a breadth of research behind it.

Microsoft Research and Carnegie Mellon University [20] presented a model that predicts the future temperature of servers through machine learning. As this model relies on sensor data, it would be costly for large data centers to buy a large number of sensors. In our research, instead of predicting future temperature, we want a model that calculates the current temperature based on the workload without using sensors. Tang et al. [27] [26] developed an interesting model demonstrating the effect of heat recirculation on the inlet temperature of servers in a data center and, in turn, on the efficiency of the cooling system.
They calculate the inlet temperature of servers based on the temperature of the air supplied by the computer room air conditioning (CRAC) unit and on CPU utilization. Li et al. [19] showed that CPU-intensive applications cause dramatic heat changes in the processor. We believe that data-intensive applications running in data centers will have a similar effect on the disks of storage nodes, which has to be taken into account while calculating the total heat generated by a node. Kozyrakis et al. [12] studied the effect of different applications and observed the power consumption of the nodes. They showed that disks and memory consume a significant amount of power, even compared to the CPU. As power consumption has a direct impact on heat generated, there is a need to investigate the thermal load that I/O-intensive applications place on the nodes in data centers.

Chapter 3

Methodology

3.1 Determine Recirculation Factors

We achieve the aforementioned goal by focusing on the heat recirculation of active data centers. Figure 3.1 depicts a general model for a data center, where each blade server's outlet temperature affects room temperatures. The outlet temperature of a server depends on the inlet air that enters the front of the server's rack. The inlet air temperature is the computer room temperature cooled by an air conditioning system.

Figure 3.1: An overview of a data center.

Figure 3.1 shows that heat recirculation in a data center can be derived as the sum of each server's outlet temperature. To build a model representing the heat recirculation of a data center (see Figure 3.1), we start this study by constructing a thermal model for each individual server. Here we use the assumption that, since recirculation is the sum over single servers, if we can model a single server we only need to apply that model to every server in the data center and add the results together. Essentially, we will be modeling two different things: the first is the server heat transfer, and the second is the value of the inlet temperature before the server heat transfer. Figure 3.2 shows how the initial temperature feeds into our first model, which then feeds into our inlet temperature model.

Figure 3.2: Model overview

In our iTad model, there are three components (see Figure 3.3, where Tout denotes outlet temperature) affecting the outlet temperature of a blade server. These three factors are inlet temperature, CPU utilization, and I/O workload.

Figure 3.3: Three factors affect the outlet temperature of a single blade server.

Our iTad model makes use of these factors to estimate the outlet temperature for server i, thereby enabling thermal management schemes to place workloads so as to control outlet temperatures. The iTad model is orthogonal to existing thermal management schemes; iTad can be seamlessly integrated with any thermal management scheme to either minimize outlet temperatures or minimize heat recirculation in a data center. In this study, we focus on the accuracy of iTad by validating it against real-world temperatures measured by an infrared thermometer.

Figure 3.4: Three factors affect the inlet temperature of a single blade server.

A challenge in the development of iTad is the measurement of the inlet temperatures of servers. More specifically, Figure 3.1 indicates that the air entering the servers is not at the initial room temperature. Rather, the inlet temperature equals the initial temperature reduced by some factor of the air supplied by the air conditioning system.
The inlet temperature of a server is affected by three factors, namely, the computer room temperature, the cooling supply air temperature, and the outlet temperatures of other servers (see Figure 3.4). For this model we decided to capture only the current server's outlet temperature at an instantaneous moment, so it is the only one that affects the inlet temperature. In one of our ongoing studies, we are extending the iTad model to investigate the heat recirculation effect by considering the impact of all nodes' outlet temperatures on inlet temperatures.

3.2 Determine Hardware Factors

After dealing with actual inlet temperatures, we incorporate I/O and CPU workloads into iTad. In this part of the study, we show how the outlet temperature of a server changes based on I/O-intensive activities. The iTad model has to deal with heat transfer, especially convective heat transfer. Convective heat transfer [28] is based on temperature and specific heat, all of which have a linear relationship. A study conducted by Barra and Ellzey demonstrates how a wide range of shapes affect heat transfer [9]. iTad is the first model that attempts to incorporate I/O-intensive workload; therefore, we consider the case where all the components in a data center have the same transfer rate. Nevertheless, we do not imply by any means that all the components have an identical transfer rate. In our future work, we will extend iTad to consider multiple heat transfer rates to further improve its accuracy.

The iTad model helps improve the energy efficiency of data centers because the thermal information offered by iTad assists dynamic thermal management in reducing the energy consumption of cooling systems. We show that thermal management mechanisms can quickly make workload placement decisions based on the thermal information provided by iTad.

Chapter 4

Modeling

4.1 Assumptions and Notation

We have described the plan of our model as well as the basic components necessary for it. In this section, we present the assumptions and the notation used in the model. The assumptions are as follows:

1. The initial temperature is consistent throughout the data center.

2. The airflow is static in all parts of the data center.

3. The supplied temperature strength is linearly proportional to the distance from the vent.

4. Our model captures temperature at an instantaneous moment, so nothing is being circulated in our model.

5. Adjacent nodes will not heat up enough to affect the node in question.

6. PC components are all similar in shape, so the heat transfer is consistent.

7. The entire experiment is based on the premise that, by taking a single node from a cluster and running our experiments on it, we can grasp the important factors in thermal change in computers. With this information we will be able to model a large-scale environment.

After laying out the assumptions, the notation used in the model is described in Table 4.1.
Table 4.1: Model Notation

  i              Index of the server node
  Q              Heat generated (J)
  p              Density of air (kg/m^3)
  f              Flow rate (m^3/s)
  cp             Specific heat (J/(kg·°C))
  Tout           Outlet temperature (°C)
  Tin            Inlet temperature (°C)
  ΔT             Change in temperature (°C)
  hr             Heat transfer coefficient (J/(s·m^2·°C))
  A              Surface area of PC components (m^2)
  Z              Percent of added temperature after workload
  R              Ratio of distance
  k              The amount the outlet temperature affects the inlet temperature
  di             Distance of the server from the AC vent (m)
  d              Height of the room (m)
  TINIT          Initial room temperature (°C)
  Ts             Supplied temperature from the CRAC (°C)
  Tworkload, Tw  Surface temperature at a given workload (°C)
  Tidle          Surface temperature at idle (°C)
  W              Workload supplied (%)
  TMax           Maximum temperature of the components (°C)

4.2 Modeling impacts of heat on temperature

The heat transfer in a data center node can be expressed by Equation 4.1 [20] [26] [24]. There are two kinds of heat transfer in this system: convective heat transfer and radiant heat transfer. We rearranged Equation 4.1 to solve for the outlet temperature. In Equation 4.1, Q_i is the convective heat transfer of server i; that is, as the inlet air passes through the server, the amount of heat it picks up is Q_i.

    Q_i = p f c_p (T_{out_i} - T_{in_i}),
    T_{out_i} = \frac{Q_i}{p f c_p} + T_{in_i}    (4.1)

The heat generated in the chassis is actually the heat being radiated from the components of the server, also known as radiant heat. So, in this case, the convective heat transfer between the inlet and outlet temperatures is equal to the radiant heat transfer of the PC components. Figure 4.1 shows how all the heat that radiates off the components mixes into the air to form the outlet temperature and its convective heat gain.

Figure 4.1: Radiant heat equals convective heat

Equation 4.2 gives the formula for radiant heat transfer [1]. The radiant heat depends on the surface of the object and the heat it generates on that surface.

    Q_i = h_r A \Delta T_i    (4.2)

In Equation 4.3, \Delta T_i is the change in temperature caused by the PC components, which we model as the change in the surface temperature of the server at the specified workload (\Delta T_{workload_i}) plus the difference between the outlet and inlet temperatures of the server in the idle state (T_{out_{idle}} - T_{in_{idle}}).

    \Delta T_i = \Delta T_{workload_i} + (T_{out_{idle}} - T_{in_{idle}})    (4.3)

To simplify what we need to find, we set Equation 4.1 and Equation 4.2 equal to each other. This gives us the variable Z and lets us relate T_{out} to T_{in} and \Delta T_i, as shown in Equation 4.4.

    h_r A \Delta T_i = p f c_p (T_{out_i} - T_{in_i}),
    Z = \frac{h_r A}{p f c_p} = \frac{T_{out_i} - T_{in_i}}{\Delta T_i},
    T_{out_i} = Z \Delta T_i + T_{in_i}    (4.4)

4.3 Modeling impacts of workload on temperature

In [24], the authors define T_{in} as dependent on T_s and on a vector that models the exact strength of T_s at each height. We simplified the model further by declaring T_{in} to be the room temperature reduced by a percentage of the temperature supplied by the CRAC, as shown in Equation 4.5. The amount by which T_{out} affects the inlet temperature is proportional to k, which is outside the scope of this paper. As implemented here, only the current server's outlet temperature affects its own inlet temperature.

    R = \frac{d_i}{d},
    T_{in} = T_{INIT} - R T_s + k T_{out}    (4.5)

The other variable needed to define T_{out} is \Delta T_i (Equation 4.3), which we model as illustrated in Figure 4.2.
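To make these relationships concrete, the short sketch below evaluates Equations 4.4 and 4.5 for a single server. It is only a minimal illustration: the function names and the numeric values are our own assumptions for the example, not measurements or code from this thesis.

```python
# Sketch of Equations 4.4 and 4.5 for a single server.
# All numeric values below are illustrative assumptions, not measurements from this thesis.

def inlet_temperature(t_init, t_s, d_i, d, t_out_prev=0.0, k=0.0):
    """Equation 4.5: T_in = T_INIT - R*T_s + k*T_out, with R = d_i / d."""
    r = d_i / d  # ratio of the server's distance from the vent to the room height
    return t_init - r * t_s + k * t_out_prev

def outlet_temperature(z, delta_t_i, t_in):
    """Equation 4.4: T_out = Z * dT_i + T_in."""
    return z * delta_t_i + t_in

if __name__ == "__main__":
    # Hypothetical room and CRAC values; k is left at 0 as in the instantaneous model.
    t_in = inlet_temperature(t_init=28.0, t_s=10.0, d_i=1.5, d=3.0)
    # Z close to 1 and a 2.5 C component-driven temperature rise, in line with Section 5.7.
    t_out = outlet_temperature(z=1.0, delta_t_i=2.5, t_in=t_in)
    print(f"inlet {t_in:.1f} C -> outlet {t_out:.1f} C")
```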
The theory behind our proposed model is that some components of the server are heated more by I/O-intensive applications, while others are heated more by CPU-intensive applications; based on the percentage of CPU or I/O utilization, each set of components reaches some percentage of its maximum temperature. Equation 4.6 gives \Delta T_{workload_i}, the increase in temperature compared to the idle server.

Figure 4.2: Visual representation of workload effects on outlet temperature

    \Delta T_{MAX_{CPU}} = T_{workload_{MAX_{CPU}}} - T_{idle},
    \Delta T_{MAX_{I/O}} = T_{workload_{MAX_{I/O}}} - T_{idle},
    \Delta T_{workload_i} = W_{CPU} \Delta T_{MAX_{CPU}} + W_{I/O} \Delta T_{MAX_{I/O}}    (4.6)

In the end, all of these equations are the components needed to model a single server node. This is important because, as we discussed before, obtaining each single server's outlet temperature can help model a data center's thermal profile. Before we can do that, we need to verify that these equations are accurate.

Chapter 5

Experimental Parameters

In this section we determine the parameters of the single-server model we created in the modeling section. To do this, we need to show that all the factors described in the model indeed have an effect, and then solve for the constants described previously in the modeling section.

5.1 Set up

Since our models, described in the previous section, capture a single server node, we decided to verify the equations by setting up an experimental machine. The machine we used is an OptiPlex 360, whose specifications are listed in Table 5.1. In this section we define the characteristics of our machine and later use those constants to verify how accurate our models are.

Table 5.1: Server Specifications

  Dimensions        15.65 x 4.59 x 14.25
  RAM               1 GB
  Chipset           Intel G31/ICH7
  DC Power Supply   255 W
  Processor Type    Intel Core 2 Duo
  Memory            800 MHz DDR2 SDRAM

To test our server we used a command called "stress" in Ubuntu, which can spawn multiple CPU workers or I/O workers. This process allows us to estimate how a computer would behave under such a load. To estimate the impact of CPU utilization, we used "stress" to spawn only CPU workers [6]; to estimate the impact of I/O utilization, we used "stress" to spawn only I/O workers. Finally, to find a mixture of I/O and CPU utilization impacts, we spawned a ratio of CPU workers to I/O workers (e.g., 80 CPU workers and 20 I/O workers corresponds to 80% CPU utilization and 20% I/O utilization).

After running our stress tests, we used three different tools to help design the experiment. First, we used the Linux command "iostat", which gives details about server usage. The most important pieces of information from "iostat" were "CPU user%", which displays the percentage of CPU utilization, and "system%", which displays the percentage of I/O utilization [4] [5]. Another tool we used was "HDDTemp", a program that can measure the temperature of the hard drives [3]; we used it mainly as a reassurance that our thermometer was working.

When we say thermometer, we are referring to the HDE Temperature Gun Infrared Thermometer with Laser Sight. This thermometer measures the surface temperature of whatever surface it is pointed at. The thermometer has a reading ratio of 12:1, which means that for every 12 cm of distance the reading covers a 1 cm radius. We used this tool to measure all of the temperatures used for verification.

5.2 Period

Since it takes time for different components to heat up to their maximum temperature, we needed to test how long it would take for each application to reach its hottest point.
For a CPU-intensive application, we can assume that the processor is the most highly active component. So we ran our stress test at 100% CPU utilization and periodically checked the processor's temperature. We plotted the temperatures over time as shown in Figure 5.1(a). Looking at Figure 5.1(a), you can see that the temperature plateaus around 30 minutes, but we waited until the time marked by the blue line to turn off the stress test. For Figure 5.1(b) we followed a similar procedure, but with an I/O-intensive application; instead of monitoring the processor, we monitored the I/O controller. From Figure 5.1(b), it is clear that the I/O components take longer to heat up and, once the test is turned off at the blue line, take longer to cool down. So for all our other tests we ran the workload for one hour before taking temperature readings, to give the CPU and I/O components ample time to heat up, and we waited 1.5 hours between tests to allow the server to cool down.

Figure 5.1: Utilization of components at fixed utilization. (a) Temperature of CPU components vs. time at CPU utilizations of 100%, 75%, 50%, and 25%. (b) Temperature of I/O components vs. time at I/O utilizations of 100%, 75%, 50%, and 25%.

5.3 Determining a baseline temperature

After finding out how long it takes to run an experiment, we were able to run our tests. The first experiment we needed to run was one to figure out the thermal impact of the idle machine. To do this, we decided to take an array of temperatures and extrapolate the information we need. Figure 5.2 shows the inside of our server; the numbers 1-32 are areas where we measured the temperature, 33 is where we measured the hard drive, 34 is where we measured the power supply, 35 is where we measured the inlet temperature, and finally 36 is the outlet temperature. Once we determined what to measure, we measured each grid area with our thermometer with the server in the idle state. The measurements are given in Figure 5.3(a).

Figure 5.2: Measured area of temperatures

In Figure 5.3, we arranged the gathered data to graphically match Figure 5.2. In Figure 5.3, the inlet temperatures are the temperatures in the middle on the far left. All the numbers in the middle rows are the temperatures of grid spots 1-32. The two temperatures on the bottom left are the temperature of the disk drive measured by two different methods, one using HDDTemp and the other using the thermometer. After gathering the temperatures, we calculated the average, which is labeled off to the bottom on the far right. This gives us a baseline value, which is called Tidle in our model. We compared this baseline value with the other values. Another number that we needed for later calculations is the difference between the idle outlet and idle inlet temperatures (T_{out_{idle}} - T_{in_{idle}}), which is 1.8 °C.

5.4 Impact of I/O utilization on temperature

Once we had the idle data as a reference, we started to test an active machine. We started with an I/O-intensive workload; to make the machine I/O intensive, we used the following command:

$ stress --io 3 --vm 7

The ratio of --io workers to the total of --vm and --io workers is 30%, which means that the I/O is set to 30% utilization. Using this pattern we created a figure similar to Figure 5.3.
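A small helper along these lines can automate the choice of worker counts. This is a hypothetical sketch, assuming the worker-ratio convention just described and the standard --io/--vm flags of the stress utility; it is not part of the experimental tooling used here.

```python
# Hypothetical helper that builds the stress invocation used above, assuming the
# convention from this section: io_workers / (io_workers + vm_workers) = target I/O share.
import shlex
import subprocess

def stress_command(io_percent: int, total_workers: int = 10) -> str:
    """Return a stress command line for the requested I/O utilization percentage."""
    io_workers = round(total_workers * io_percent / 100)
    vm_workers = total_workers - io_workers
    parts = ["stress", "--io", str(io_workers)]
    if vm_workers > 0:
        parts += ["--vm", str(vm_workers)]
    return " ".join(parts)

# The 30%, 60%, and 100% I/O runs described in this section.
for pct in (30, 60, 100):
    cmd = stress_command(pct)
    print(f"{pct}% I/O -> {cmd}")
    # subprocess.run(shlex.split(cmd))  # uncomment to actually launch the load
```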
We then allowed the computer to cool down and rechecked the temperatures at 60% utilization, and then once more at 100% utilization. We kept and organized the data for these runs in the same format as the idle measurements above; for simplicity, Figure 5.4 shows how utilization affected the heat. Each bar represents a group of points from Figure 5.2, and Table 5.2 lists each group to help interpret the results.

Figure 5.3: Surface temperatures. (a) Idle temperatures. (b) Max CPU temperatures. (c) Max I/O temperatures.

Table 5.2: Temperature Zones

  Group  Points from Figure 5.2  Reasoning
  1      34                      Power supply
  2      1-12                    Rarely changing
  3      13, 14, 17, 18          CPU controls
  4      15, 16, 19, 20          I/O controls
  5      33                      Hard drive

Figure 5.4: Utilization temperature by group (temperature in °C vs. group number). (a) I/O utilization (100%, 60%, 30%). (b) CPU utilization (100%, 60%, 30%). (c) Mixture of CPU and I/O utilization (20%/80%, 50%/50%, 80%/20%).

5.5 Impact of CPU utilization on temperature

After conducting the I/O tests, it was time to test the impact of CPU utilization. To do this, we used the stress command in the following way:

$ stress --cpu 3 --vm 7

The ratio of --cpu workers to the total of --vm and --cpu workers is 30%, which means that the CPU is set to 30% utilization. Using that pattern we created a figure similar to Figure 5.3. Just as we did for I/O, we checked the temperatures at three different utilization levels, making sure to give the machine time to cool down before each experiment. We found the average temperature, as we did for the idle and I/O cases, and also created a group heat graph, shown in Figure 5.4(b). After plotting the average temperatures, we were able to use curve fitting techniques, which we discuss in Section 5.7.

Figure 5.5: Relationship between utilization and outlet temperature

5.6 Shared I/O and CPU Utilization

Finally, we decided to test a mixture of both I/O and CPU utilization. We did this by calling the command:

$ stress --cpu 50 --io 50

This sets the utilization for each component to 50%. We followed the same procedure as for the I/O-intensive and CPU-intensive cases. After running the experiments for the mixed utilization, we created Figure 5.4(c), which we discuss in Section 5.7.

5.7 Determining Constants

After our experiments, we are left with several impressions about iTad. First of all, there is a clear relation between utilization and temperature, shown in Figures 5.4(a) and 5.4(b). This relationship appears to be linear, as shown by the curve fitting we applied in Figure 5.5, where we plot the change in outlet temperature against percent utilization.

Using the data from the I/O runs, we calculated that the slope of the line is 2.7, which represents the rate at which the temperature increases with utilization, with an R^2 value of 0.981, where R^2 represents the goodness of the fit. For the CPU data, the slope of the line is 3.5, with an R^2 value of 0.961. This indicates that a CPU-intensive application will make the server hotter than an I/O-intensive one, which is largely confirmed by the mixed data: when the CPU share is higher than the I/O share, the server is warmer.
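The curve fitting itself is a plain least-squares line fit. A sketch of the calculation is shown below; the sample points are placeholders rather than the measurements behind Figure 5.5, which are summarized in Table 5.3.

```python
# Least-squares line fit of outlet-temperature rise vs. utilization, as used in Section 5.7.
# The sample points below are placeholders, not the thesis's measured values.
import numpy as np

def fit_line(utilization, delta_t_out):
    """Return (slope, intercept, r_squared) for the fit delta_t_out ~ slope*utilization + b."""
    x = np.asarray(utilization, dtype=float)
    y = np.asarray(delta_t_out, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical example: utilization as a fraction, temperature rise in degrees C.
slope, intercept, r2 = fit_line([0.25, 0.5, 0.75, 1.0], [1.0, 1.7, 2.4, 3.1])
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```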
Once we accept that the relationship is linear, we can start to figure out the values of the constants in the equations we proposed. The proposed Equation 4.4 consolidates all the constants of the experiment into one variable, and with all our data readings we can solve for Z.

Table 5.3: Compilation of all the values gathered (temperatures in °C)

I/O Intensive
  W_io   T_W     T_idle  ΔT_workload  ΔT_idle(out-in)  ΔT_i   T_in    T_out   T_out-T_in  Z
  30%    34.021  33.692  0.315        1.800            2.115  26.700  28.700  2.000       0.946
  60%    34.237  33.692  0.630        1.800            2.430  24.700  26.900  2.200       0.905
  100%   34.742  33.692  1.050        1.800            2.850  26.900  29.900  3.000       1.052

CPU Intensive
  W_cpu  T_W     T_idle  ΔT_workload  ΔT_idle(out-in)  ΔT_i   T_in    T_out   T_out-T_in  Z
  30%    34.039  33.692  0.442        1.800            2.242  26.100  28.300  2.200       0.981
  60%    34.400  33.692  0.884        1.800            2.684  26.900  29.700  2.800       1.043
  100%   35.166  33.692  1.474        1.800            3.274  27.900  31.400  3.500       1.069

I/O and CPU Intensive
  W_cpu, W_io  T_W     T_idle  ΔT_workload  ΔT_idle(out-in)  ΔT_i   T_in    T_out   T_out-T_in  Z
  20%, 80%     34.347  33.692  1.135        1.800            2.935  27.700  29.900  2.200       0.750
  50%, 50%     34.326  33.692  1.262        1.800            3.062  26.600  28.800  2.200       0.718
  80%, 20%     34.639  33.692  1.389        1.800            3.189  26.200  28.800  2.600       0.815

Table 5.3 consolidates all the information we gathered while determining the experimental parameters. The T_W column is the average surface temperature of the machine at the given utilization, while the T_idle column is the average surface temperature with no load; the difference between the two is the observed temperature rise, which shows how much extra heat is generated once the server is pushed to that specific utilization. The ΔT_workload column shows what the extra temperature should be according to iTad. As you can see, the values are closely related; any difference can be accounted for by changes in the airflow of the room or by the server's own fans.

The ΔT_i column is the calculated ΔT_workload plus the difference between the outlet and inlet temperatures of the idle server; this is essentially ΔT from Equation 4.4. And since we were able to actually measure the final inlet and outlet temperatures of the server, we were able to calculate the value of Z as the ratio of the measured T_out - T_in to ΔT_i. As you can see in Figure 5.6, the value of Z has an average just under 1.

Figure 5.6: Values of Z in all experiments

This simply means that, for the current arrangement of hardware, there is roughly a one-to-one relationship between the heat exuded by the server and the outlet temperature. The variation in Z across different utilizations may be an indication that the airflow is changing, but our assumption is that the airflow stays constant, and since the values do not vary by much, this does not contradict iTad. Our results appear accurate enough to say that the equations used to model the outlet temperature can serve as a basis for thermal management based on workload, or at least as a starting point for future expansion.

Chapter 6

Usage

In the previous chapter we took our server and determined all the constants of the machine. In the following sections we discuss how accurate the numbers from that process are. First, we would like to explain a use case of how this model can be applied in a data center. As you can tell from Equation 4.4, the most important value you need to solve for is the Z variable. So when building a data center, you should find Z for every machine; then you can plug it into the equations, as sketched below.
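As a rough illustration of that workflow, the following sketch first calibrates Z from one set of measurements (Equation 4.4) and then predicts an outlet temperature from live CPU and I/O utilizations (Equations 4.3 and 4.6). The function names are ours and the prediction inputs are hypothetical; the calibration numbers are taken from the 100% rows of Table 5.3.

```python
# Sketch of the calibrate-then-predict workflow described above.
# Function names and the prediction inputs are illustrative assumptions.

def calibrate_z(t_in, t_out, delta_t_i):
    """Equation 4.4: Z = (T_out - T_in) / dT_i, from one calibration measurement."""
    return (t_out - t_in) / delta_t_i

def delta_t_workload(w_cpu, w_io, dt_max_cpu, dt_max_io):
    """Equation 4.6: workload-driven temperature rise from CPU and I/O utilization (0..1)."""
    return w_cpu * dt_max_cpu + w_io * dt_max_io

def predict_outlet(z, w_cpu, w_io, t_in, dt_max_cpu, dt_max_io, dt_idle):
    """Equations 4.3 and 4.4 combined: T_out = Z * (dT_workload + dT_idle) + T_in."""
    dt_i = delta_t_workload(w_cpu, w_io, dt_max_cpu, dt_max_io) + dt_idle
    return z * dt_i + t_in

# Calibration step, using the 100% CPU row of Table 5.3.
z = calibrate_z(t_in=27.9, t_out=31.4, delta_t_i=3.274)

# Prediction step for a hypothetical mixed workload (50% CPU, 30% I/O, 27 C inlet air).
t_out = predict_outlet(z, w_cpu=0.5, w_io=0.3, t_in=27.0,
                       dt_max_cpu=1.474, dt_max_io=1.050, dt_idle=1.8)
print(f"Z = {z:.3f}, predicted outlet temperature = {t_out:.1f} C")
```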
After gathering the Z values for all the machines, as we did in the previous chapter, we need to run another task that keeps track of the CPU and I/O utilization of each machine. After setting up such a monitor, you take the values and plug them into Equation 4.6, and the outcomes, fed through Equation 4.4, give the individual outlet temperatures. This is where the model is lacking: there is no heat recirculation model, which will require extended research to help control thermal output.

6.1 Verification

Now that we had developed the model and solved for all the constants of the experiment, the only thing left to do was to actually run the iTad model on a server with random amounts of CPU and I/O utilization. Figure 6.1 shows a server that we ran for 5 hours with varying utilization. The model had a tendency to overestimate the temperature, especially at the beginning of the run.

Figure 6.1: Verification of model

We attribute this problem to a poor recirculation constant and to the fact that the model is not time variant. That being said, the longer the machine ran, the better the results were, because the machine returned to the steady state in which we tested it to find the Z value. The agreement between the model and the actual measurements is not perfect; at some points our model is over 2 degrees off, but the trend stays very close to the measurements, so we have a model that errs on the safe side of the calculations.

6.2 MPI

One of the goals of this work was to make sure this model could work for any kind of data center. So if a data center were using a C version of MPI, the Message Passing Interface, then iTad should work for it. We set up a method called "iTad" inside a toy MPI project. The iTad method returns an outlet temperature, and based on that value the program makes a decision about data movement. The iTad method pulls the utilization from the OS and gets the other values it needs from the system to determine its output, and all of this happens seamlessly without any noticeable change in performance.

Figure 6.2: Sample MPI usage

6.3 Hadoop

We also wrote the model in Java, which allows it to be used by data centers that rely on Java. This version had to use a runtime shell to get CPU and I/O utilization, because the Java virtual machine does not have the direct access to OS information that C does. This version of the model would work for a Java MPI implementation, but in Hadoop each node is not responsible for its own data, so we needed to see whether our model could be used in a Hadoop data center. The work of our group member [18] shows that the scheduler and heartbeat of Hadoop can be updated to take account of CPU and I/O utilization to make thermal decisions, so implementing our model is as straightforward as replacing their thermal method with our iTad method.

Chapter 7

Experience

7.1 Improvements

Our experiment is not beyond criticism, but none of the criticisms is so severe that it undermines the results of this experiment. The first criticism that can be made is that the stress command does not push the computer hard enough; in particular, the I/O workers may seem inaccurate, so the temperatures we read could be off. This criticism raises some real concerns, but the fact that the I/O controller is the hottest spot under I/O-intensive load shows that the I/O subsystem is indeed being pushed. Some people may also see the fact that our machine is an isolated machine, not actually mounted in a rack, as a concern.
This is in fact a real limitation: while other reports [15] show an increase of around 10 degrees between inlet and outlet under CPU utilization, our machine shows a maximum of 1.7 degrees. However, this concern does not invalidate our results, because our computer is a prototype used to monitor trends and to demonstrate our model. The next concern is the method of gathering temperature readings, in which we measure surface temperatures using an IR thermometer. This concern is partly justified because, with the cables and clutter inside a machine, a reading can be off. But each section in Figure 5.3 is actually a sample of points in that block, giving us a fairly good representation of the machine; and since most of the materials stay the same, and we average the hot spots with the cold spots before using any of the data, small temperature reading issues are taken care of. The last concern may be the most valid of all: we introduced a recirculation variable k without doing much research on how that variable should be used. In our experiment we picked a very small value for k, because our setup had a lot of open room, so the outlet temperature dissipated very easily; an improvement would be to aggregate all the servers and layouts and update the k variable in real time.

7.2 Extension

The next step would be to reproduce these results in a full-blown data center, or at least one with an AC unit and multiple servers. For future research we would also like to use more sophisticated hardware for measuring temperatures. Follow-up work should use a flowing-air thermometer for the inlet and outlet temperatures, some kind of heat measurement to avoid extrapolating heat from surface temperatures, and, lastly, mounts for the sensors and a programmatic way to gather these multiple pieces of data without manually scanning them, in order to get more accurate results.

Future research should also look at neighboring nodes and heat recirculation more closely to see how they affect the output in a more exact fashion; Equation 4.5, with a properly modeled k term, is a possible starting point for an enhancement that takes the neighboring nodes into account. Another place for future research is to treat the model as a function of time. This could be useful because some applications run briefly while others run for a long time. Since this model was made to facilitate quick modeling of the thermal pattern of data centers, a model that runs in real time could help in developing even more robust algorithms.

Chapter 8

Conclusion

Growing evidence shows that cooling costs contribute a significant portion of the operational cost of large-scale data centers. Therefore, thermal management techniques aim to reduce the energy consumption and cooling costs of data centers. A thermal management mechanism relies on thermal information to make intelligent decisions. Thermal information can be acquired in three ways: 1) using computational fluid dynamics simulators (e.g., Flovent, a commercial product), 2) deploying temperature sensors to measure inlet and outlet temperatures of servers, and 3) applying a thermal model to estimate temperatures based on specific workloads.

We advocate that thermal models are a cost-effective and practical approach to providing information on server temperatures to thermal management mechanisms. In this study, we develop a thermal model, iTad, that enables thermal management techniques to quickly make management decisions based on intensive I/O activities.
We show that, in light of iTad, both CPU and I/O thermal outputs can be extrapolated from the radiation heat and convection heat applied to a server. The iTad model helps improve the energy efficiency of data centers, because the thermal information offered by iTad assists dynamic thermal management in reducing the energy consumption of cooling systems in data centers. We validate the accuracy of the iTad model using a server's real-world temperature measurements obtained by an infrared thermometer. Our experimental results suggest that I/O-intensive workloads have significant impacts on the temperatures of servers. We demonstrate that thermal management mechanisms can quickly make workload placement decisions based on the thermal information provided by iTad.

Bibliography

[1] Convective and radiant heat transfer equations. http://blowers.chee.arizona.edu/cooking/heat/convection.html.

[2] FloVent. http://www.mentor.com/products/mechanical/products/upload/flovent-data-center.pdf.

[3] hddtemp. http://manpages.ubuntu.com/manpages/natty/man8/hddtemp.8.html.

[4] iostat. http://sourceforge.net/apps/mediawiki/filebench/index.php?title=Main_Page/.

[5] iostat. http://linux.die.net/man/1/iostat.

[6] stress. http://www.unixref.com/manPages/stress.html.

[7] Amazon. http://www.amazon.com/pyle-pma90-anemometer-thermometer-temperature/dp/b009tq6ilq. Amazon, 10 March 2012.

[8] Andreas Bieswanger, Hendrik F. Hamann, and Hans-Dieter Wehle. Energy efficient data center. Technical Report 1, 2012.

[9] Amanda J. Barra and Janet L. Ellzey. Heat recirculation and heat transfer in porous burners. Combustion and Flame, 137(1):230-241, 2004.

[10] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40(12):33-37, December 2007.

[11] David J. Brown and Charles Reams. Toward energy-efficient computing. Communications of the ACM, 53(3):50-58, March 2010.

[12] Dimitris Economou, Suzanne Rivoire, and Christos Kozyrakis. Full-system power analysis and modeling for server environments. In Workshop on Modeling, Benchmarking and Simulation (MoBS), 2006.

[13] P. A. Eibeck and D. J. Cohen. Modeling thermal characteristics of a fixed disk drive. IEEE Transactions on Components, Hybrids, and Manufacturing Technology, 11(4):566-570, December 1988.

[14] Sudhanva Gurumurthi, Anand Sivasubramaniam, and Vivek K. Natarajan. Disk drive roadmap from the thermal perspective: A case for dynamic thermal management. SIGARCH Computer Architecture News, 33(2):38-49, May 2005.

[15] Urs Hoelzle and Luiz André Barroso. The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009.

[16] Youngjae Kim, S. Gurumurthi, and A. Sivasubramaniam. Understanding the performance-temperature interactions in disk I/O of server workloads. In The Twelfth International Symposium on High-Performance Computer Architecture, pages 176-186, February 2006.

[17] Jonathan G. Koomey. Estimating total power consumption by servers in the U.S. and the world. Technical report, Lawrence Berkeley National Laboratory, February 2007.

[18] Sanjay Kulkarni. Cooling Hadoop: Temperature Aware Schedulers in Data Centers. Thesis, Auburn University, 2013.

[19] J. Li, G. P. Peterson, and P. Cheng. Three-dimensional analysis of heat transfer in a micro-heat sink with single phase flow. International Journal of Heat and Mass Transfer, 47:4215-4231, 2004.

[20] Lei Li, Chieh-Jan Mike Liang, Jie Liu, Suman Nath, Andreas Terzis, and Christos Faloutsos.
Thermocast: A cyber-physical forecasting model for data centers. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1370-1378, New York, NY, USA, 2011. ACM.

[21] J. Moore, J. Chase, and P. Ranganathan. ConSil: Low-cost thermal mapping of data centers. In First Workshop on Tackling Computer Systems Problems with Machine Learning (SysML), 2006.

[22] R. Sharma, C. Bash, C. Patel, R. Friedrich, and J. Chase. Balance of power: Dynamic thermal management for internet data centers. IEEE Internet Computing, 9(1), 2005.

[23] C. P. H. Tan, J. P. Yang, J. Q. Mou, and E. H. Ong. Three dimensional finite element model for transient temperature prediction in hard disk drive. In Asia-Pacific Magnetic Recording Conference (APMRC '09), pages 1-2, January 2009.

[24] Q. Tang, S. Gupta, and G. Varsamopoulos. Thermal-aware task scheduling for data centers through minimizing heat recirculation. In 2007 IEEE International Conference on Cluster Computing, pages 129-138, September 2007.

[25] Q. Tang, S. Gupta, and G. Varsamopoulos. Thermal-aware task scheduling for data centers through minimizing heat recirculation. In 2007 IEEE International Conference on Cluster Computing, pages 129-138, September 2007.

[26] Qinghui Tang, S. K. S. Gupta, D. Stanzione, and P. Cayton. Thermal-aware task scheduling to minimize energy usage of blade server based datacenters. In 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing, pages 195-202, 2006.

[27] Qinghui Tang, S. K. S. Gupta, and G. Varsamopoulos. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. IEEE Transactions on Parallel and Distributed Systems, 19(11):1458-1472, November 2008.

[28] W. Wechsatol, S. Lorente, and A. Bejan. Dendritic heat convection on a disc. International Journal of Heat and Mass Transfer, 46(23):4381-4391, 2003.