Energy Efficient Pre-Fetching - Models to Implementation

by Adam C. Manzanares

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 14, 2010

Keywords: Storage Systems, Pre-fetching, Virtual File System

Copyright 2010 by Adam C. Manzanares

Approved by:
Xiao Qin, Chair, Assistant Professor of Computer Science and Software Engineering
David Umphress, Associate Professor of Computer Science and Software Engineering
Wei-Shinn Ku, Assistant Professor of Computer Science and Software Engineering
Yan Gu, Assistant Professor of Computer Science and Software Engineering

Abstract

With the rapid growth in the production and storage of large scale data sets, it is important to investigate methods that drive down the cost of storage systems. We are currently in the midst of an information explosion, and large scale storage centers are increasingly used to store the data being generated. Among the several methods for reducing the cost of large scale storage centers, we investigate a technique that transitions storage disks into lower power states. To achieve this goal, this dissertation introduces a model of disk systems that leverages disk access patterns to prefetch popular sets of data, producing energy saving opportunities. Using our model, we developed a simulator that allows us to quickly change various parameters and investigate the relationships that file access patterns, disk energy parameters, and simulation parameters have with the overall energy efficiency of disk systems. To improve the validity of our simulation results, we leveraged the validated disk simulator DiskSim and added disk power models to it. This allowed us to test our energy efficient strategies with a validated storage system simulator. The last part of this dissertation focuses on implementing a virtual file system for large scale storage systems. We introduce the Energy Efficient Virtual File System, or EEVFS, to manage the data placement and disk states in a cluster storage system. Our modeling and simulation results indicate that large data sizes and knowledge of the disk access pattern are valuable to storage system energy saving techniques. Storage servers supporting media-streaming applications are one key area that would benefit from our strategies. The last chapter of the dissertation introduces the concept of parallel striping groups, which attempt to improve the performance of EEVFS while maintaining energy savings.

Acknowledgments

I would like to thank and extend my gratitude to Dr. Xiao Qin, who served as my advisor during my graduate studies at Auburn University. I met Dr. Qin as an undergraduate at New Mexico Tech, where he guided me in a research project. I went off to the University of Colorado, and when Dr. Qin was hired at Auburn he recruited me into his research group. Dr. Qin has spent many hours guiding and mentoring me, and I am sincerely grateful for the experience he has provided me as my advisor. His hard work and dedication to the academic field will serve as an important example of what a successful academic can achieve.

I would also like to thank Dr. David Umphress, who has served as my secondary advisor; I have enjoyed the feedback and encouragement that he provides. I believe that his experiences and suggestions have been strong influences in my early academic career. I would also like to thank Dr. Yan Gu and Dr.
Wei-Shinn Ku for serving on my dissertation committee and providing insightful feedback when needed. I want to thank the faculty and staff of the CSSE department, as they have guided me through this process. Jo-Ann Lauranitis has helped me greatly with the administrative process, and I am thankful for her quick responses to all of my questions. I would also like to extend my thanks to Dr. Shiwen Mao for serving as the outside reader on short notice for this dissertation.

Our research group has been supportive and helpful throughout all of my years at Auburn University. I would like to thank Xiaojun Ruan, because we have worked together on several projects and he has helped me meet several deadlines. I would also like to thank Shu Yin for providing me feedback and support during the dissertation process. Ziliang Zong and Kiranmai Bellam are two research group members who helped me greatly during the first years of my stay at Auburn. I would also like to thank Zhiyang Ding, Yun Tian, Jiang Xie, and James Majors for their help during the last stages of my dissertation work.

I would like to acknowledge my parents, Jose and Margaret Manzanares, because they have served as the greatest inspiration in my life. Their climb through the journey of life is an amazing feat, and I am thankful for their endless support through any situation I have faced. I also would like to thank my sister Anna, because she has served as inspiration for me to pursue a fulfilling career.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Problem Statement
1.2 Research Scope
1.3 Contributions
1.4 Dissertation Organization
2 Literature Review
2.1 Energy Efficient Disk Systems Related Work
2.1.1 Strengths and Limitations of Related Work
2.1.2 Observations
2.2 DiskSim Related Work
2.3 Virtual File Systems and Striping Related Work
3 Preliminary Prefetching Scheme for Energy Conservation in Parallel Disk Systems
3.1 Motivational Example
3.2 Energy-Efficient Prefetching Strategy
3.3 Simulation Results
3.4 Chapter Summary
4 Prefetching Models and Simulation
4.1 Motivational Example
4.2 PRE-BUD Energy-Efficient Prefetching Strategy
4.2.1 Prefetching Module
4.2.2 Energy-Saving Calculation Module
4.3 Analysis of PRE-BUD
4.3.1 A Full-Power Baseline System
4.3.2 Dynamic Power Management
4.3.3 Derivation of Energy Efficiency for PRE-BUD
4.3.4 Derivation of Response Time for PRE-BUD
4.4 Experimental Results
4.4.1 Experiment Setup
4.4.2 Comparison of PRE-BUD and PDC
4.4.3 Impact of Data Size
4.4.4 Impact of Number of Data Disks
4.4.5 Impact of Hit Rate
4.4.6 Impact of Inter-Arrival Delays
4.4.7 Power State Transitions
4.4.8 Impact of Disk Power Characteristics
4.4.9 Real World Applications
4.4.10 Response Time Analysis
4.5 Chapter Summary
5 DiskSim Power Models
5.1 Introduction
5.2 Simulation Framework
5.3 DiskSim Limitations
5.4 DiskSim Modifications
5.5 Generated Results
5.6 Conclusion
6 Energy Efficient Virtual File System
6.1 Introduction
6.2 Design
6.2.1 System Architecture
6.2.2 Data Placement
6.2.3 Power Management
6.3 Implementation
6.3.1 Process Flow
6.3.2 Prefetching
6.3.3 Application Hints
6.3.4 Distributed Metadata Management
6.4 Evaluation Methodology
6.4.1 Testbed
6.4.2 System and Workload Parameters
6.4.3 Metrics
6.5 Experimental Results
6.5.1 Energy Savings
6.5.2 Power State Transitions
6.5.3 Response Times
6.5.4 Berkeley Web Trace Energy Consumption
6.6 Conclusion & Future Work
7 Parallel Striping Groups for Energy Efficiency
7.1 Parallel Striping Groups
7.2 EEVFS
7.3 Experimental Results
7.3.1 Impact of Data Size
7.3.2 Impact of the Number of Files Prefetched
7.3.3 Impact of the Inter-Arrival Delay
7.3.4 Impact of MU
7.3.5 Striping vs. Non-Striping Comparison
7.3.6 Berkeley Web Trace Results
7.3.7 Chapter Conclusion
8 Conclusions and Future Work
8.1 Main Contributions
8.2 Future Work
Bibliography
Appendices
A DiskSim Source Code Modifications

List of Figures

1.1 EPA Report to Congress on Server and Data Center Efficiency, 2007
3.1 The buffer-disk architecture for parallel disk systems
3.2 Disk State Transitions Energy Aware
3.3 The energy-efficient prefetching strategy or PRE-BUD
3.4 Impact of buffer disk hit rate
3.5 Impact of the Number of Disks
3.6 Impact of data size
4.1 Sample Disk Trace
4.2 Buffer Disk Added to Architecture
4.3 Algorithm PRE-BUD: the prefetching module
4.4 Algorithm PRE-BUD: the energy-saving calculation module
4.5 PDC and PRE-BUD Comparison
4.6 Total Energy Consumption of Disk System while Data Size is varied for four different values of the hit rate: (a) 85%, (b) 90%, (c) 95%, and (d) 100%
4.7 Total Energy Consumption of Disk System while the number of data disks is varied. Data size is fixed at: (a) 1 MB, (b) 5 MB, (c) 10 MB, and (d) 25 MB
4.8 Total Energy Consumption for different hit rate values where the data size is fixed at: (a) 1 MB, (b) 5 MB, (c) 10 MB, and (d) 25 MB
4.9 Total Energy Consumption for different delay values where the hit rate is (a) 85%, (b) 90%, (c) 95%, and (d) 100%
4.10 Total disk state transitions for different data sizes where the hit rate is: (a) 85%, (b) 90%, (c) 95%, and (d) 100%
4.11 Total Energy Consumption for various values of the following disk parameters: (a) power active, (b) power idle, and (c) power standby
4.12 Total Energy Consumed for Real World Traces
5.1 File System Simulator and Disk System Interaction
5.2 Energy Consumption Results of Modified DiskSim
6.1 Architecture of EEVFS. The storage server manages metadata (e.g., data location and file size). Each storage node manages multiple disks, which are separated into two groups: buffer disks and data disks. Client nodes can directly access storage servers through the network interconnect.
6.2 EEVFS Process Flow Chart. Step 1: initialization phase; Step 2: storage server generates file popularity; Step 3: prefetch popular files from data disks to a buffer disk; Step 4: applications provide access hints; Step 5: applications submit file requests; Step 6: storage nodes return data to applications running on compute nodes (i.e., clients)
6.3 Data Size Varied
6.4 MU Varied
6.5 Inter-arrival Delay Varied
6.6 # of Prefetched Files Varied
6.7 Data Size Varied
6.8 MU Varied
6.9 Inter-arrival Delay Varied
6.10 # of Prefetched Files Varied
6.11 Data Size Varied
6.12 MU Varied
6.13 Inter-arrival Delay Varied
6.14 # of Prefetched Files Varied
6.15 Energy Consumption of the tested cluster storage system when the Berkeley Web Trace is considered
7.1 Parallel Striping Groups
7.2 Data Striping Within a Group
7.3 Energy Efficiency vs. Data Size
7.4 Energy Efficiency vs. # of Files Prefetched
7.5 Energy Efficiency vs. Inter-Arrival Delay
7.6 Energy Efficiency vs. MU
7.7 Berkeley Web Trace Results

List of Tables

3.1 Synthetic Trace
3.2 Disk Parameters (IBM 36Z15)
3.3 Non-Energy Aware Results
3.4 Energy Aware Results
3.5 PRE-BUD Approach 1
3.6 PRE-BUD Approach 2
3.7 Trace Repeat Results (J)
3.8 Simulation Parameters
4.1 Notation for the description of the prefetching module
4.2 Notation for the description of the energy-saving calculation module
4.3 Disk Parameters (IBM Ultrastar 36Z15)
4.4 Response Time Analysis
5.1 Key Model Variables Added To DiskSim
5.2 DiskSim Response Time Results
6.1 Configuration of the Testbed Nodes
6.2 System and Workload Parameters
7.1 Configuration of the Testbed Nodes
7.2 Results of Striping vs. No Striping

Chapter 1
Introduction

Due to current trends in computing, we are facing the so-called data explosion. As the use of computers for day-to-day tasks has increased, we also face the side effect of generating large amounts of data. This data must be stored on some medium, and hard disk drives have become the most common storage medium. Large scale storage systems are being developed and installed routinely, and a significant amount of energy must be consumed to operate these storage systems. There are many different methods to conserve energy; we have identified that hard disk drives consume a significant share of the energy in a large scale storage system. To help alleviate the energy burden of hard disk drives, we propose a strategy to manage the power states of the hard disk drives.

This chapter continues by developing the problem statement in Section 1.1. Section 1.2 presents the scope of the research, Section 1.3 summarizes the main contributions of the dissertation, and Section 1.4 outlines the organization of the dissertation.

1.1 Problem Statement

The number of large-scale parallel I/O systems is increasing in today's high-performance data-intensive computing systems due to the storage space required to contain the massive amount of data. Typical examples of data-intensive applications requiring large-scale parallel I/O systems include long running simulations [9], remote sensing applications [33], and biological sequence analysis [12]. As the size of a parallel I/O system grows, the energy consumed by the I/O system often becomes a large part of the total cost of ownership [26][35][37]. Reducing the energy costs of operating these large-scale disk I/O systems thus becomes one of the most important design issues. It is known that disk systems can account for nearly 27% of the total energy consumption in a data center [15]. Even worse, the push for disk I/O systems with larger capacities and speedier response times has driven energy consumption rates upward.

Reducing the energy consumption of computing platforms has become an increasingly hot research field. Green computing has recently been targeted by government agencies; efficiency requirements have been outlined in [8]. Large-scale parallel disks inevitably lead to high energy requirements for data-intensive computing systems due to scaling issues. Data centers typically consume anywhere between 75 W/ft2 and 200 W/ft2, and this may increase to 200-300 W/ft2 in the near future [1][43]. These large-scale computing systems not only have a large economic impact on companies and research institutes, but also produce a negative environmental impact. Data from the US Environmental Protection Agency indicates that generating 1 kWh of electricity in the United States results in an average of 1.55 pounds (lb) of carbon dioxide (CO2) emissions. With large-scale clusters requiring up to 40 TWh of energy per year at a cost of over $4B, it is easy to conclude that energy-efficient clusters can have huge economic and environmental impacts [4].
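Taken together, these figures admit a quick back-of-the-envelope check (the electricity price of roughly $0.10/kWh is our own assumption, chosen only to show consistency with the quoted $4B figure):

    40 TWh = 4 × 10^10 kWh
    4 × 10^10 kWh × 1.55 lb CO2/kWh ≈ 6.2 × 10^10 lb CO2 per year
    4 × 10^10 kWh × $0.10/kWh = $4 × 10^9 per year

Even a modest percentage reduction in storage energy therefore translates into billions of pounds of CO2 and hundreds of millions of dollars annually.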
Figure 1.1 presents future energy use predictions for servers and data centers. It presents several trend lines, each corresponding to a scenario. According to historical trends, energy usage is going to increase at a greater than linear rate. There are also several trend lines showing how the energy efficiency of servers and data centers can be improved. Our goal is to help drive server and data center energy usage down using new techniques, which will bring us closer to the state of the art scenario trend line.

Figure 1.1: EPA Report to Congress on Server and Data Center Efficiency, 2007

1.2 Research Scope

Our research focuses on methods to lower the disk energy consumption in storage systems. The main goal of our research is to place as many disks as possible into the standby state to conserve energy. To place disks into the standby state, they must have large periods of inactivity. To help produce periods of inactivity for a disk, we propose prefetching popular data and moving this data onto a buffer disk. We are attempting to skew the workload toward a subset of disks, which will allow us to place lightly loaded disks into the standby state.

This work is accomplished through the use of models and simulations. We present two models of the energy efficiency of disk systems: we model the disk serving requests, and also the state transition changes and their impact on the performance of the disk system. Using these models we developed our own simulator, which we used to quickly test many parameters of the model. Our simulation framework was augmented by making changes to the DiskSim simulation environment and by developing a file system simulator. Finally, we develop a prototype implementation of a virtual file system that supports our energy efficiency strategies, along with a data layout technique that improves both performance and energy efficiency.

1.3 Contributions

The major contributions of the research presented in this dissertation follow:

1. A motivational example is presented along with the development of a disk model that is used to gather preliminary simulation results.

2. An advanced disk system model is developed along with two algorithms that aid in conserving energy while maintaining acceptable response times.

3. The DiskSim simulation environment is extended to support disk energy models, and a simulated file system is developed to interact with DiskSim.

4. A virtual file system prototype, EEVFS, is developed to put our modeling and simulation results into practice.

5. An energy efficient data layout, striping groups, is presented and implemented.

1.4 Dissertation Organization

This dissertation is organized in the following manner: Chapter 2 briefly reviews related work and contrasts it against the contributions of this dissertation.

Chapter 3 provides the motivational work for the rest of the dissertation. An example scenario is presented in which the use of prefetching and buffer disks is highlighted. A simple mathematical model is also presented, along with simulation results based on the model.

Chapter 4 introduces advanced models of disk requests and the energy states of disks. We also introduce advanced algorithms that take response time into account to prevent major delays. Thorough simulation results are also presented in this chapter.

Chapter 5 details how improvements were made to our simulation framework to improve simulation results. Changes were made to the DiskSim simulator to support disk energy states, and a simple file system simulator is used to collect simulation results.
Chapter 6 introduces the Energy Efficient Virtual File System (EEVFS), a prototype virtual file system that I developed to implement some of the ideas that were developed using models and simulation techniques.

Chapter 7 presents parallel striping groups, which are implemented in EEVFS. These groups are introduced to help maintain performance while still providing energy savings.

Chapter 8 summarizes the main contributions of this dissertation and presents a couple of future research directions based on the ideas contained in the dissertation.

Chapter 2
Literature Review

2.1 Energy Efficient Disk Systems Related Work

2.1.1 Strengths and Limitations of Related Work

Almost all energy efficient strategies rely on DPM techniques [2]. These techniques assume a disk has several power states. Lower power states have lower performance, so the goal is to place a disk in a lower power state when there are large idle times. There are several different approaches to generating larger idle times for individual disks. There are also several approaches to prefetching data, although many techniques have focused on low power disks.

1. Memory cache techniques - Energy efficient prefetching was explored by Papathanasiou and Scott [24]. Their techniques relied on changing prefetching and caching strategies within the Linux kernel. PB-LRU is another energy efficient cache management strategy [41]; it focused on providing more opportunities for underlying disk power strategies to save energy. Flash drives have also been proposed for use as buffers for disk systems [5]. Energy efficient caching and prefetching in the context of mobile distributed systems has been studied [30][42]. These three research papers focus on mobile disk systems, whereas we focus on large scale parallel disk systems. All the previously mentioned techniques are limited by the fact that cache, memory, and flash disk capacities are typically smaller than disk capacities. We propose strategies that use a disk as a cache to prefetch data into. The break-even times of disk drives are usually very high, so prefetch accuracy and data size become critical factors in energy conservation.

2. Multi-speed/low power disks - Many researchers have recognized the fact that large break-even times limit the effectiveness of energy efficient power management strategies. One approach to overcoming large break-even times is to use multi-speed disks [31][39]. Energy efficient techniques have also relied on replacing high performance disks with low energy disks [4]. Mobile computing systems have also been recognized as platforms where disk energy should be conserved [5][17]; these platforms use low power disks with smaller break-even times. The weakness of using multi-speed disks is that there are no commercial multi-speed disks currently available. Low power disk systems are an ideal candidate for energy savings, but they may not always be a feasible alternative. Our strategies work with existing disk arrays and do not require any changes to the hardware.

3. Disk as cache - MAID was the original paper to propose using a subset of disk drives as a cache for a larger disk system [6]. MAID designed mass storage systems with the performance goal of matching tape-drive systems. PDC was proposed to migrate sets of data to different disk locations [26]. The goal is to load the first disk with the most popular data, the second disk with the second most popular data, and to continue this process for the remaining disks.
The main difference between our work and MAID is that our caching policies are significantly different. MAID caches blocks that are stored in LRU order. Our strategy attempts to analyze the request look-ahead window and prefetch any blocks that are capable of reducing the total energy consumption of the disk system. PDC is a migratory strategy and can cause large energy overheads when a large amount of data must be moved within the disk system. PDC also requires the overhead of managing metadata for all of the blocks in the disk system, whereas our strategy only needs metadata for the blocks in the buffer disk.

2.1.2 Observations

Given the previously mentioned limitations of energy efficient research, we propose a novel prefetching strategy. Our research differs from the previous research on the following key points.

1. We develop a mathematical model to analyze the energy efficiency of our prefetching strategy. This mathematical model allows us to produce simulations that offer insights into the key disk parameters that affect energy efficiency.

2. We develop a prefetching strategy that tries to move popular data into a set of buffer disks without affecting the data layout of any of the data disks. We also perform simulations with parallel I/O intensive applications, which previous researchers have avoided.

Our strategies also have the added benefit of not requiring any changes to the overall architecture of an existing disk system. Previous work has focused on redesigning a disk system or replacing existing disks to produce energy savings. Our strategy either adds an extra disk or uses the current disk system to produce energy savings under certain conditions.

2.2 DiskSim Related Work

The DiskSim simulator is a powerful tool for the modeling and simulation of disk systems and is used frequently in storage systems research [3]. Recent research projects based on the DiskSim simulation environment include reducing disk I/O performance sensitivity and conserving energy in disk systems [23][34]. Although DiskSim is a powerful simulation tool, research projects have recognized its lack of power models as a limitation. The Sensitivity-Based Optimization of Disk Architecture introduced accurate power models into the DiskSim environment, but their work was based on DiskSim 2.0 [29]. The other major paper to publish research related to DiskSim and power models is the Dempsey paper, for which we were unable to obtain a copy of the source code [38]. Because of these limitations of previous research projects implementing power models, it was decided to develop our own power models for the DiskSim 4.0 simulation environment.

2.3 Virtual File Systems and Striping Related Work

Cluster file systems are fast becoming a necessity due to the growth of cluster computing for high performance computing and web applications. Lustre is a popular cluster file system that has been designed for high performance [13]. The Parallel Virtual File System, PVFS, is an alternative high performance cluster file system that resides entirely in user space [20]. The two preceding file systems are designed for high performance applications and were designed with no energy efficiency considerations. BlueFS is a distributed file system that was designed with energy efficiency considerations, but it focuses on mobile computing systems [22].

Due to the limitations of previous research, we intend to develop an energy efficient cluster file system, EEVFS, which is presented in Chapter 6.
EEVFS manages disk states and data placement to reduce the energy consumption of the disk system. EEVFS also uses striping groups to help improve the performance of the storage system; the striping concept is outlined in [25]. Striping groups for EEVFS are explained in detail in Chapter 7.

Chapter 3
Preliminary Prefetching Scheme for Energy Conservation in Parallel Disk Systems

The objective of this study is to use the novel parallel disk architecture with buffer disks (see [43] and Figure 3.1) to aid in reducing the power consumption of large-scale parallel disk systems. To fully utilize buffer disks while aggressively placing data disks into low-power modes, we investigate an energy-aware prefetching mechanism (PRE-BUD for short) to dynamically fetch the most popular data into buffer disks. PRE-BUD attempts to prefetch data into the buffer disk with the desired consequence of reducing the total energy consumption of the parallel disk system. This work aims at designing two prefetching strategies using the PRE-BUD mechanism. Specifically, the first approach using PRE-BUD adds an extra disk, which performs as a buffer disk. The second approach uses an existing disk in the system as a buffer disk. The design of these two strategies relies on the fact that in a wide variety of data-intensive applications (e.g., web applications) a small percentage of the data is frequently accessed [18]. The goal of this research is to move this small amount of frequently accessed data from data disks into buffer disks, thereby allowing data disks to switch into low-power modes.

Apart from energy efficiency, this work focuses on improving the reliability of parallel disk systems relative to traditional dynamic power management policies. We conduct experiments to confirm that the reliability of parallel disks can be enhanced through the buffer disk mechanism by reducing the number of state transitions. This research offers the following contributions. First, we create a mathematical model for large-scale parallel disk systems with buffer disks. Second, we develop two energy-efficient prefetching strategies in the context of the buffer disk architecture. Third, we quantitatively compare both of our prefetching approaches against two existing schemes: a dynamic power management technique and a non-energy-aware strategy. Fourth, we implement a simulator based on the mathematical model. Fifth, we use the simulator to quantitatively show that adding an extra buffer disk can substantially improve the reliability and energy savings of parallel disk systems without compromising storage capacity. Finally, we observed from experimental results that using an existing disk as a buffer disk compromises storage space, but avoids the extra energy consumption and cost penalty associated with adding an extra disk (i.e., the buffer disk).

Figure 3.1: The buffer-disk architecture for parallel disk systems (a buffer disk controller with a RAM buffer manages data partitioning, data placement, data movement, power management, prefetching, and an energy-related reliability model across m buffer disks and n data disks serving disk requests)

The rest of the chapter is organized as follows. Section 3.1 presents a motivational example and describes the mathematical model used for the purposes of this research. Section 3.2 describes our prefetching algorithm, PRE-BUD. Section 3.3 presents simulation results and provides a discussion of the results. Section 3.4 concludes the chapter and discusses future research directions.
3.1 Motivational Example

Our motivational example is based on the synthetic disk trace presented in Table 3.1. The requests all have a size of 275 MB, which means each request takes approximately 5 s to complete. This length was chosen so that seek and rotational delays would be negligible. There are N = 4 disks used in this example, where each disk is given a unique letter. Each disk has two different data sections requested multiple times throughout the example. Each disk is modeled after the IBM 36Z15 using the parameters shown in Table 3.2 [4]. The example demonstrates only large sequential reads, which is a case that emphasizes the benefits of our energy efficient strategies; it was also chosen to simplify our example and allow us to demonstrate the potential benefits of our approach. We also assume that all the data can be buffered, which causes a small percentage of the data to be accessed 100% of the time. This assumption is only used for our motivational example; our simulation results vary this parameter to model real-world conditions. We also assume these strategies can be handled off-line, meaning we have prior knowledge of the complete disk request pattern.

Time:   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95
Block: A1  A2  B1  B2  D1  D2  C1  C2  B2  B1  A1  A2  C1  C2  D1  D2  B2  B1  A1  A2

Table 3.1: Synthetic Trace

X = Transfer Rate = 55 MB/s
P_Act = Power Active = 13.5 W
P_Idle = Power Idle = 10.2 W
P_Stdby = Power Standby = 2.5 W
E_AS = Energy Active to Sleep = 13.0 J
E_SA = Energy Sleep to Active = 135 J
T_AS = Time Active to Sleep = 1.5 s
T_SA = Time Sleep to Active = 10.9 s

Table 3.2: Disk Parameters (IBM 36Z15)

For a baseline comparison we use a non-energy-aware approach. This approach puts a disk in the active state if a request for that particular disk is received. If the disk is not serving a request, it remains in the idle state. We also compare our two approaches against an energy aware approach, which puts a disk into the standby state if the achievable energy savings are greater than the energy required to complete the state transition. The two approaches we present are adding a buffer disk to the disk system and using an existing disk as the buffer disk. Following is an explanation of the mathematical model used to produce the results for our motivational example.

T_E(Y) = (T_Act(Y) · P_Act) + (T_Idle(Y) · P_Idle) + (T_Sleep(Y) · P_Stdby) + E_Trans(Y)    (3.1)

where T_E(Y) is the total energy Disk Y consumes serving the trace.

T_Act(Y) = Σ_{i=j}^{k} len(Req[i]) / X    (3.2)

where T_Act(Y) is the total time that Disk Y is active. It is a summation over all the requests involving Disk Y, starting with the first request for Disk Y (j) and ending with the last request for Disk Y (k). For the motivational example all requests are of the same size, so the length of each request is 275 MB. The transfer rate X is fixed at 55 MB/s, so all requests take approximately 5 s to process. When a buffer disk is used, a request for a buffered block only causes the buffer disk to be active.

T_Idle(Y) = Σ_{i=j}^{k} len(Req[i]) / X    (3.3)

where T_Idle(Y) is the total amount of time Disk Y is idle. The non-energy-aware strategy places disks in the idle state when they are not serving a request, so this ends up being a summation over all requests that are not for Disk Y, starting at the first request that is not for Disk Y (j) and ending at the last such request (k).
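Equations (3.1)-(3.3) can be evaluated directly in code. The following minimal Python sketch (our own construction, using the Table 3.2 parameters) reproduces Disk A's row of Table 3.3 under the non-energy-aware policy:

    # IBM 36Z15 parameters from Table 3.2
    P_ACT, P_IDLE, P_STDBY = 13.5, 10.2, 2.5   # watts
    X = 55.0                                   # transfer rate, MB/s

    def request_time(length_mb):
        # Eqs. (3.2)/(3.3): each request contributes length / transfer rate.
        return length_mb / X

    def total_energy(t_act, t_idle, t_sleep, e_trans):
        # Eq. (3.1): total energy of one disk over the trace, in joules.
        return t_act * P_ACT + t_idle * P_IDLE + t_sleep * P_STDBY + e_trans

    print(request_time(275.0))                 # 5.0 s per request
    # Disk A: six 275 MB requests (30 s active), 70 s idle, never sleeps.
    print(total_energy(30.0, 70.0, 0.0, 0.0))  # 1119.0 J, matching Table 3.3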
BE Idle =(E AS +E SA )/P Idle =14.5s (3.4) where BE Idle is the energy break even period. If there is an idle time that is larger than BE Idle energy conservation can be achieved by putting the disk to sleep. IftheidletimeislessthanBE Idle than the energy penalty to transition between the idle state and standby is greater than the energy savings possible by putting the disk to sleep. T Sleep(Y) = The time that DiskY is sleeping (in the standby state). In the non-energy aware strategy a disk is never put into the standby state. The energy e?cient strategy we compare against forces a disk to go to sleep whenever T Idle(Y) >BE Idle . The disk will stay sleeping until the next request is for a block on that disk. It will wake up the disk with enough time to make T SA in time for the request, which is 10.9 s. The approaches using a bu?er disk also use this strategy. The only exception to this rule is when a request for DiskY is contained in the bu?er DiskY is allowed to sleep longer times. E Trans(Y) = m summationdisplay i=1 E AS [i]+ n summationdisplay i=1 E SA [i] (3.5) where E Trans(Y) is the total energy consumed for all state transitions for DiskY. In the non-energy aware approach this will be zero for all of the disks. In the energy aggressive approaches it is the summation over all active/sleep transitions added to the summation over all sleep/active state transitions for DiskY. BD EPre = k summationdisplay i=j (len(Req[i]))/X)?P ACT (3.6) 14 where BD EPre is the total amount of energy consumed by the bu?er disk pre- fetching data. It is a summation over all of the requests that are put into the bu?er disk. This value is zero whenever a bu?er disk is not used. BD ETot = k summationdisplay i=j ((len(Req[i]))/X)?P ACT )+BD EPre (3.7) where BD ETot is the total amount of energy the bu?er disk consumes for the entire trace. It is the summation over all requests that are in the bu?er disk and the addition of the energy consumed in the pre-fetch phase. This is also zero when a bu?er disk is not used. T ES = N summationdisplay i=1 T E[i] +BD ETot (3.8) where T ES is the total energy consumed by all disks used to serve the synthetic trace. This also includes bu?er disk energy if a bu?er disk is used. T Idle(A) 70s T Idle(B) 70s T Idle(C) 80s T Idle(D) 80s T Act(A) 30s T Act(B) 30s T Act(C) 20s T Act(D) 20s E Trans(A) 0J E Trans(B) 0J E Trans(C) 0J E Trans(D) 0J T E(A) 1119J T E(B) 1119J T E(C) 1086J T E(D) 1086J T ES 4410J Table 3.3: Non-Energy Aware Results The results presented in Tables (3.3-3.6) gave us some promising initial results. The two approaches using a bu?er disk provided significant energy savings over the non-energy aware parallel disk storage system. Table 3.5 presents the results using our approach that adds an extra disk to the system. This has the benefit of not impacting the capacity of the large-scale parallel disk system. 
The other main benefit 15 T Idle(A) 0s T Idle(B) 10s T Idle(C) 0s T Idle(D) 0s T Act(A) 30s T Act(B) 30s T Act(C) 20s T Act(D) 20s T Sleep(A) 45.2s T Sleep(B) 33.7s T Sleep(C) 53.7s T Sleep(D) 53.7s E Trans(A) 296J E Trans(B) 309J E Trans(C) 309J E Trans(D) 309J T E(A) 814J T E(B) 900.25J T E(C) 713.25J T E(D) 713.25J T ES 3140.75J Table 3.4: Energy Aware Results T Idle(A) 0s T Idle(B) 0s T Idle(C) 0s T Idle(D) 0s T Act(A) 10s T Act(B) 10s T Act(C) 10s T Act(D) 10s T Sleep(A) 100s T Sleep(B) 100s T Sleep(C) 100s T Sleep(D) 100s E Trans(A) 13J E Trans(B) 13J E Trans(C) 13J E Trans(D) 13J T E(A) 398J T E(B) 398J T E(C) 398J T E(D) 398J BD EPre 540J BD ETot 1350J T ES 3140.75J Table 3.5: PRE-BUD Approach 1 of our first approach is the fact that state transitions are lowered as compared to the energy aware baseline. The problem is that this approach consumes more energy than the energy aware disk management scheme. This is due to the extra energy associated with adding a disk. This extra disk must first pre-fetch all of the data into the bu?er, and then serve all the bu?ered requests. In our strategy we are able to bu?er all of the data causing the bu?er disk to be active the entire trace. The bu?er disk is also active pre-fetching data before the trace begins. Adding an extra disk also costs money and could introduce pricing concerns when this strategy is scaled to accommodate larger 16 T Idle(A) 0s T Idle(B) 0s T Idle(C) 0s T Idle(D) 0s T Act(A) 140s T Act(B) 10s T Act(C) 10s T Act(D) 10s T Sleep(A) 0s T Sleep(B) 100s T Sleep(C) 100s T Sleep(D) 100s E Trans(A) 0J E Trans(B) 13J E Trans(C) 13J E Trans(D) 13J T E(A) 1890J T E(B) 398J T E(C) 398J T E(D) 398J BD EPre 540J BD ETot 1350J T ES 3084J Table 3.6: PRE-BUD Approach 2 Repeats 1 2 3 4 5 Non-energy Aware 4410 8820 13230 17640 22050 Energy Aggresive Approach 3141 6294 9447 12600 15753 PRE-BUD 1 3482 5832 8182 10532 12882 PRE-BUD 2 3084 5184 7284 9384 11484 Repeats 6 7 8 9 10 Non-energy Aware 26460 30870 35280 39690 44100 Energy Aggresive Approach 18906 22059 25212 28365 31518 PRE-BUD 1 15232 17582 19932 22282 24632 PRE-BUD 2 13584 15684 17784 19884 21984 Table 3.7: Trace Repeat Results (J) systems. This is what led us to try a second approach, which uses an existing disk as a bu?er disk. For this experiment it was assumed DiskA could be the bu?er disk. This had the desired e?ect of bringing the energy consumption total below the energy aware strategy. The major negative of using an existing disk as the bu?er disk is that the capacity of your disk storage system is decreased. The potential energy savings gainsarerealized once thesample trace used forour motivational example is repeated. This results in huge energy savings for both of our approaches over the non-energy aware and energy aware approaches. When the trace is repeated 10 times adding abu?er disk to your parallel disk system could potentially 17 save 44% of the energy cost of a non-energy aware approach. Similarly using an existing disk for a bu?er disk saves 50% of the energy cost. When comparing against the energy aware approach our first approach is able to save 22% and the second approach is able to save 30% of the energy. The energy savings of the energy aware approach is already considerable when compared to the non-energy aware strategy. Our approaches are able to produce a significant amount of extra energy savings over the energy aware approaches. 
This is due to the fact that using a bu?er disk allows each disk the ability to be in the standby state for longer periods of time. The standby times a disk experiences between requests becomes cumulative if the requests are contained in the bu?er disk. This is due to the fact that the request can be served from the bu?er disk, so the requested disk does not have to transition to the active state. It is important to note that this is the ideal case for our bu?er disk framework. This represents a situation where we are able to bu?er all of the data requested and all requests are large reads. These assumptions were only used to come up with some upper bound estimates for the potential energy savings. In our simulation results we tried to choose parameters that would more closely model real-world applications and settings. For reliability concerns, we also decided to document the number of state tran- sitions each disk experiences using a disk management scheme. Figure 3.2 represents the state transitions for each disk using the energy aware disk management scheme. The non-energy aware scheme has zero state transitions, since it never tries to put a disk in the standby state. The bu?ered disk approaches only needs one state tran- sition per disk. This state transition is experienced after the disk is put into the standby state after being active delivering data to the bu?er. After this the data transfer is complete and the disk is put into the standby state. 18 Figure 3.2: Disk State Transitions Energy Aware Reliability is assumed to be a factor of state transitions. As the number of state transitions increases the reliability of the disk system goes down. This is due to the demands that changing states places on a hard drive. Figure 3.2 shows the state transitions for each disk using the energy aware approach. This approach does not consider the reliability of the disk as a factor and schedules a state transition when- ever the disk has idle times greater than a threshold. The bu?ered disk approaches improves reliability by moving frequently accessed data into a single point. This can save a sleeping disk from having to be placed in the active state causing a state tran- sition. In the case that all the data needed for an application can be placed in the bu?er disk, all the other disks can sleep the entire life of the application. They only need to be active to move the requested data into the bu?er before the application is executed. 3.2 Energy-E?cient Prefetching Strategy Figure 3.3 outlines the algorithm used to pre-fetch blocks and how the pre- fetched blocks are used using the PRE-BUD strategy. The pre-fetch algorithm uses the frequency that a block is requested as a heuristic. The first step of the algorithm 19 1. request[] ; /* request[] holds all of the requests in the trace */ 2. i=0; 3. while(not all requests have been processed) /* Iterate over all requests */ 4. if(request[i].block has been seen before) /* Requested block has been seen */ 5. list.inc_ref(request[i].block) /* Add one to the count for this block */ 6. i++; 7. elseif(request has not been seen before) /*Requested block has not been seen before*/ 8. list.add(request[i].block) /*Add the request to a list and increment its reference */ 9. i++; 10. 11. buffer[]; /* Buffer to hold requested blocks 12. Buffer Size=N; 13. i=0; 14. list.sort() /* Sort the requests by their reference counts */ 15. while(i 0,b/? G,and?r ? 
4.2.1 Prefetching Module

Notation    Description
R           The look-ahead window of disk requests
A ⊆ R       The subset of the look-ahead whose requests target active disks
G           The set of blocks present in the buffer disk
A+          The set of blocks b such that E_s(b) > 0, b ∉ G, and ∃r ∈ R : block(r) = b
G+          The set of blocks with the highest energy savings in A+ ∪ G

Table 4.1: Notation for the description of the prefetching module

Figure 4.3 outlines the prefetching module in PRE-BUD. PRE-BUD is energy-efficient in nature, because a disk currently in the standby mode will not have to be spun up to serve a request if the requested block is present in the buffer disk (see Step 4). Buffer-disk-resident blocks allow standby data disks to stay in the low-power state for an increased period of time, as long as accessed blocks are present in the buffer disk.

There is a side effect of making the buffer disk perform I/O operations while keeping data disks in standby longer: the buffer disk is likely to become a performance bottleneck. To properly address the bottleneck issue, we design the prefetcher in such a way that the load between the buffer and data disks is balanced; if an active data disk can achieve a shorter response time than the buffer disk, we do not rely on the buffer disk (see Step 2). In addition to load balancing, utilization control is introduced to prevent disk requests from experiencing unacceptably long response times. In light of the utilization control, the prefetching module ensures that the aggregate required I/O bandwidth is lower than the maximum bandwidth provided by the buffer disk (see Line 11.a in Figure 4.3).

To improve the energy efficiency of PRE-BUD, we force PRE-BUD to fetch blocks from data disks into the buffer disk on a demand basis (see Line 5 in Figure 4.3). Thus, block b is prefetched in Step 10 only when the following four conditions are met. First, a request r in the look-ahead is accessing the block, i.e., ∃r ∈ R : block(r) = b. Second, the block is not present in the buffer disk, i.e., b ∉ G. Third, fetching the block and caching it in the buffer disk can improve energy efficiency, i.e., E_s(b) > 0. Lastly, the block resides in an active data disk, i.e., disk(b) ∈ A. Note that the set A+ (see Table 4.1) contains all the blocks that satisfy the above four criteria.

To maximize energy efficiency, we have to identify the data-disk-resident blocks with the highest energy savings potential. This step is implemented by maintaining a set, G+, of the blocks with the highest energy savings in A+ ∪ G. Thus, the blocks in A+ ∩ G+ are the candidate blocks to be prefetched in the prefetching phase. A tie in energy savings between a buffer-disk-resident block and a data-disk-resident block is broken in favor of the buffer-disk-resident block. If two data-disk-resident blocks have the same energy saving, the tie is broken in favor of the block accessed earlier by a request in the look-ahead.

Input: a request r, parallel disk system with m disks
1     if block(r) is present in the buffer disk {
2         if disk(r) is active and T_Disk(r) ≤ T_0(r), where T_Disk(r) and T_0(r) are the response times of r when serviced by disk(r) and by the buffer disk, respectively
3             The request r is serviced by disk(r);
4         else the request r is serviced by the buffer disk; }
5     else {   /* block(r) is not present in the buffer disk */
          /* Initiate the prefetching phase */
6         if disk(r) is in the standby state   /* spin up disk(r) when it is standby */
7             spin up disk(r);
8         Compute the energy savings of references in A ⊆ R, where A is a subset of the lookahead R and ∀r′ ∈ A: disk(r′) is active;
9         Update the energy savings of blocks in the buffer disk;
10        Fetch the blocks in A+ ∩ G+;
11        Evict the blocks in G - G+ with the lowest energy savings as necessary, where G is the set of blocks present in the buffer disk; A+ is the set of blocks such that if block b ∈ A+, then b is referenced by a request in the lookahead, b is not present in the buffer disk, disk(b) is active (i.e., disk(b) ∈ A), and the energy saving E_s(b) of b is larger than 0; G+ is the set of blocks with the highest energy saving in A+ ∪ G,
11.a          such that Σ_{r′ ∈ G′} λ(r′) · t(r′) ≤ B_0   /* Bandwidth constraint must be satisfied */
11.b          and Σ_{r′ ∈ G′} s(r′) ≤ C_0                 /* Capacity constraint must be satisfied */
              (G′ denotes the contents of the buffer disk after the fetch)
          /* The request r is then serviced */
12        if block(r) has not been prefetched
13            The request r is serviced by disk(r);
14        else return block(r);   /* block(r) was recently retrieved; no extra I/O is necessary */ }

Figure 4.3: Algorithm PRE-BUD: the prefetching module
In the case that the buffer disk is full, blocks in G - G+ must be evicted from the buffer disk (see Step 11 in Figure 4.3). This is because G - G+ contains the blocks with the lowest energy savings. We assign zero energy savings to buffer-disk-resident blocks that will not be accessed by any request in the look-ahead. Buffer-disk-resident blocks without any contribution to energy conservation will be among the first to be evicted from the buffer disk if a disk-resident block with high energy savings must be fetched while the buffer disk is full. Blocks that will not be accessed in the look-ahead are evicted in least-recently-used order.

PRE-BUD can conserve more energy by virtue of its on-demand manner, which defers prefetching decisions until the last possible moment when the above criteria are satisfied. Deferring the prefetching phase is beneficial because (1) this phase needs to spin up a corresponding disk if it is in the standby state, and (2) late prefetching leads to a larger look-ahead and thus better energy-aware prefetching decisions.

The prefetching module can be readily integrated with a disk scheduling mechanism, which is employed to independently optimize low-level disk access times in each individual disk. This integration is implemented by batching disk requests and offering each disk an opportunity to reschedule the requests to optimize low-level disk access performance.
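The four prefetch conditions and the eviction rule map naturally onto a few lines of code. The sketch below is our own illustration of Steps 8-11 of Figure 4.3 (names such as lookahead, energy_saving, and capacity are hypothetical, and the bandwidth constraint of Line 11.a is omitted for brevity):

    def select_prefetch_candidates(lookahead, buffer_blocks, active_disks,
                                   energy_saving, capacity):
        # A+: blocks referenced in the look-ahead that are not buffered,
        # reside on an active disk, and have positive predicted saving.
        a_plus = {r.block for r in lookahead
                  if r.block not in buffer_blocks
                  and r.block.disk in active_disks
                  and energy_saving(r.block) > 0}
        # G+: the highest-saving blocks among A+ and the current buffer
        # contents, limited here by a simple block-count capacity.
        ranked = sorted(a_plus | buffer_blocks, key=energy_saving, reverse=True)
        g_plus = set(ranked[:capacity])
        fetch = a_plus & g_plus          # Step 10: blocks to prefetch
        evict = buffer_blocks - g_plus   # Step 11: lowest-saving blocks to evict
        return fetch, evict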
4.2.2 Energy-Saving Calculation Module

We develop an energy-saving prediction model, based on which we implement the energy-saving calculation module invoked in Steps 8 and 9 of the prefetching module (see Figure 4.3). The prediction model, along with the calculation module, is indispensable for the prefetcher, because the energy savings of a block represent the importance and priority of placing the block in the buffer disk to reduce the energy consumption of the disk system. The energy-saving calculation module quantifies the amount of energy conserved by fetching a block from a data disk into the buffer disk. It also calculates the utility of keeping a buffer-disk-resident block rather than evicting it from the buffer disk. Table 4.2 summarizes the notation for the description of the energy-saving calculation module.

Notation        Description
R_j             A set of references accessing blocks in the jth data disk
R_k,j ⊆ R       A set of references accessing the kth block b_k,j in the jth data disk
b_k,j           The kth block in the jth data disk
T_BE            Break-even time: the minimum idle time required to compensate for the cost of entering the standby state
T_ij            Active time period serving the ith request issued to the jth data disk
t_ij            Time spent serving the ith request issued to the jth data disk
δ_ij            Time spent in the idle period prior to the ith request accessing a block in disk j
I_ij            An idle period prior to the ith request accessing a block in the jth data disk
n_j             The total number of requests (in the lookahead) issued to the jth disk
Φ_j             A set of disk access activities for references in R_j
time(b_k,j)     Active time period to serve a request accessing block b_k,j
block(T_ij)     A block accessed during the active period T_ij
T_D             Time to transition from active/idle to standby
T_U             Time to transition from standby to active mode
E_D             Energy overhead of transitioning from active/idle to standby
E_U             Energy overhead of transitioning from standby to active mode
P_A, P_I, P_S   Disk power in the active, idle, and standby modes

Table 4.2: Notation for the description of the energy-saving calculation module

To analyze the circumstances under which prefetching blocks can yield energy savings, we focus on a single referenced block stored in a data disk. Let R_j ⊆ R be the set of references accessing blocks in the jth data disk. Thus, R_j is a subset of the lookahead R and can be defined as

R_j = {r | r ∈ R ∧ disk(r) = jth data disk ∧ block(r) = b_k,j ∧ b_k,j ∉ G}.

Given a set R_k,j ⊆ R of references accessing the kth block b_k,j in the jth data disk, let us derive the energy saving E_S(b_k,j) achieved by fetching b_k,j from the data disk into the buffer disk. R_k,j comprises all the requests referencing a common block b_k,j that is not present in the buffer disk; therefore, R_k,j can be formally expressed as

R_k,j = {r | r ∈ R ∧ block(r) = b_k,j ∧ b_k,j ∉ G}.

Intuitively, the energy savings E_S(b_k,j) can be computed by considering the energy consumption incurred by each disk request in R_k,j. Given a reference list R_j and a block b_k,j, in what follows we identify four cases where a reference in R_j can contribute positive energy savings by virtue of prefetching block b_k,j. First, we introduce the two energy saving principles utilized by PRE-BUD.

Energy Saving Principle 1: Increase the length and number of idle periods larger than the disk break-even time T_BE, which is the minimum disk standby time required to offset the cost of entering the standby state. This principle can be realized by combining two adjacent idle periods to form a single idle period that is larger than T_BE. PRE-BUD fetches, in advance, a block accessed between two adjacent idle periods, thereby possibly forming a larger inactivity time that allows the disk to enter the standby state to conserve energy.

Energy Saving Principle 2: Reduce the number of power-state transitions. The energy efficiency of a disk can be further improved by minimizing the energy cost of spinning disks up and down. Disk vendors can provide high quality disks with low spin-up/down energy overheads; PRE-BUD aims to reduce the number of disk spin-ups and spin-downs while enlarging disk idle times. We implement this principle in PRE-BUD by combining two adjacent standby periods to eliminate unnecessary state transitions between the two standby periods.

Now we investigate the cases which exploit the above energy saving principles to conserve energy in disks. Let Φ_j = {I_1j, T_1j, I_2j, T_2j, ..., I_ij, T_ij, ..., I_nj,j, T_nj,j} be the set of disk accesses for references in R_j, where for an active period T_ij, t_ij is the time spent serving the ith request issued to data disk j; for an idle period I_ij, δ_ij is the time spent in the idle period prior to the ith request accessing a block in the jth data disk; and n_j is the total number of requests issued to the jth disk. We denote block(T_ij) as a block accessed during the active period T_ij.
ij is the time spent in the idle period prior to the ith request accessing a block in the jthdata disk, and n j is the total number of requests issued to the jth disk. We denote block(T ij )as a block accessed during the active period T ij . The following three cases demonstrate scenarios that apply energy saving princi- ple 1 to generate longer idle periods (i.e., longer than T BE ) by prefetching block(T ij ) to combine the ith and (i+1)th idle periods. Let us pay attention to the ith active period T ij and the two periods I ij and I (i+1)j (i.e., the ones adjacent to T ij ). Cases 1-3 share two common conditions - (1) both I ij and I (i+1)j are larger than zero and (2) the summation of t ij , ? ij ,and? (i+1)j is larger than the break-even time T BE . Case 1: Both the ith and (i +1)th idle periods are equal to or smaller than the break-even time T BE .Thus,wehave0T BE . Case 2: The ith idle period is equal to or smaller than the break-even time T BE ; the(i+1)th idleperiodislarger than T BE .Formally,wehave0 T BE ,and ? ij +t ij +? (i+1)j >T BE . Case 3: The ith idle period is larger than T BE ;the(i+1)th idle period is equal to or smaller than T BE . The conditions for case 3 can be expressed as: ? ij >T BE ,0 < ? (i+1)j ? T BE ,and? ij +t ij +? (i+1)j >T BE . Now we calculate, in the above three cases, the energy savings produced by fetching block(T ij )fromthejth data disk to the bu?er disk. The calculation makes use of the following definitions: ? Let P A , P I ,andP S represent the disk power consumption in the active, idle, and standby modes. Let T D and T U be times to transition to the standby and active mode; let E D and E U be energy overhead to transition to standby and active. 38 ? E WOP denotes energy consumption of the periods t ij , ? ij ,and? (i+1)j when PRE-BUD is not applied. ? In case of having block(T ij ) prefetched, E WPF denotes energy consumption of the jth disk in the periods t ij , ? ij ,and? (i+1)j . ? E BUD represents energy consumption of the bu?er disk accessing the prefetched block(Tij). ? For block ( b kj ), active time spent serving a request accessing the block is denoted by time(b k,j ). Energysavings, E S (block(T ij )),contributed byprefetching block(T ij )canbewrit- ten as: E S (block(T ij )) = E WOP ?(E WPF +E BUD ) (4.1) Energy savings, E S (block(T ij ), in case 1: For case 1, I ij and I (i+1)j are equal to or smaller than T BE . This condition implies that the jthdisk is in the idle mode during I ij and I (i+1)j . Energy consumption experienced by the disk in active period T ij is P A ?t ij . Hence, E WOP in case 1 can be expressed as: E WOP = P I ?(? ij +? (i+1)j )+P A ?t ij (4.2) When block(T ij )is prefetched, alarge(i.e., largerthanT BE ) idleperiod can beformed by combining the periods T ij , I ij ,andI(i+1)j. Therefore, E WPF can be computed as theenergyconsumptionofthejthdiskinthestandbymodeduringT ij , I ij ,andI (i+1)j . Taking into account energy overhead of power state transitions, we can calculate E WPF using the equation below: E WPF = P S ?(? ij +t ij +? (i+1)j ?T D ?T U )+E D +E U (4.3) 39 We assume that the bu?er disk and data disks are identical; therefore, energy con- sumption E BUD of the bu?er disk accessing the prefetched block(T ij )is E BUD = P A ?t ij (4.4) E S (block(T ij )) in case 1 can be determined by substituting Eqs. (4.2)-(4.4) into Eq. (4.1). Hence, we have: E S (block(T ij )) = P I ?(? ij +? (i+1)j )?P S ?(? ij +t ij +? 
Energy saving $E_S(block(T_{ij}))$ in case 2: The jth disk in this case is transitioned into standby during $I_{(i+1)j}$, since $I_{(i+1)j}$ is larger than $T_{BE}$. The energy consumption of the disk in $I_{(i+1)j}$ is expressed as the third term on the right-hand side of Eq. (4.6) below. Thus, the energy consumption $E_{WOP}$ of the disk in $T_{ij}$, $I_{ij}$, and $I_{(i+1)j}$ is:

$$E_{WOP} = P_I \cdot \delta_{ij} + P_A \cdot t_{ij} + \bigl(P_S \cdot (\delta_{(i+1)j} - T_D - T_U) + E_D + E_U\bigr) \quad (4.6)$$

We derive $E_S(block(T_{ij}))$ in case 2 by substituting Eqs. (4.6), (4.3), and (4.4) for $E_{WOP}$, $E_{WPF}$, and $E_{BUD}$. Thus, we have

$$E_S(block(T_{ij})) = P_I \cdot \delta_{ij} - P_S \cdot (\delta_{ij} + t_{ij}) \quad (4.7)$$

Energy savings $E_S(block(T_{ij}))$ in case 3: The energy saving in this case is very similar to that in case 2, except that the jth disk is transitioned into standby during $I_{ij}$ rather than $I_{(i+1)j}$. Consequently, the energy saving $E_S(block(T_{ij}))$ in case 3 can be written as:

$$E_S(block(T_{ij})) = P_I \cdot \delta_{(i+1)j} - P_S \cdot (\delta_{(i+1)j} + t_{ij}) \quad (4.8)$$

Case 4: This case shows a scenario that applies Energy Saving Principle 2 to reduce power-state transitions by prefetching $block(T_{ij})$ to combine two adjacent standby periods $I_{ij}$ and $I_{(i+1)j}$.

Energy saving $E_S(block(T_{ij}))$ in case 4: In this case, both $\delta_{ij}$ and $\delta_{(i+1)j}$ are larger than $T_{BE}$, meaning that the jth disk can be standby in both of these time intervals to conserve energy. Formally, we have $\delta_{ij} > T_{BE}$, $\delta_{(i+1)j} > T_{BE}$, and $\delta_{ij} + t_{ij} + \delta_{(i+1)j} > T_{BE}$. Thus, the energy consumption $E_{WOP}$ of the jth disk without a buffer disk is:

$$E_{WOP} = P_A \cdot t_{ij} + \bigl(P_S \cdot (\delta_{ij} - T_D - T_U) + E_D + E_U\bigr) + \bigl(P_S \cdot (\delta_{(i+1)j} - T_D - T_U) + E_D + E_U\bigr) \quad (4.9)$$

where the second and third terms on the right-hand side of Eq. (4.9) are the energy consumed by the disk in standby periods $I_{ij}$ and $I_{(i+1)j}$, respectively. With a buffer disk in place, the energy consumptions $E_{WPF}$ and $E_{BUD}$ in this case are the same as in case 1 (see Eqs. 4.3 and 4.4). Therefore, the energy saving $E_S(block(T_{ij}))$ in this case is derived from $E_{WOP}$ (see Eq. 4.9), $E_{WPF}$, and $E_{BUD}$ as:

$$E_S(block(T_{ij})) = E_D + E_U - P_S \cdot (T_D + T_U + t_{ij}) \quad (4.10)$$

Case 5 below summarizes scenarios where prefetching a block may have a negative impact on energy efficiency.

Case 5: If the summation of $t_{ij}$, $\delta_{ij}$, and $\delta_{(i+1)j}$ is smaller than or equal to $T_{BE}$, i.e., $\delta_{ij} + t_{ij} + \delta_{(i+1)j} \le T_{BE}$, then prefetching block $b_{k,j}$ has a negative impact on energy conservation.

Energy savings $E_S(block(T_{ij}))$ in case 5: Since $\delta_{ij} + t_{ij} + \delta_{(i+1)j} \le T_{BE}$, disk j stays in the idle mode during the periods $T_{ij}$, $I_{ij}$, and $I_{(i+1)j}$. If the block $b_{k,j}$ is prefetched to the buffer disk, the energy consumption $E_{WPF}$ of disk j in the three periods is:

$$E_{WPF} = P_I \cdot (\delta_{ij} + t_{ij} + \delta_{(i+1)j}) \quad (4.11)$$

The values of $E_{WOP}$ and $E_{BUD}$ are the same as those of case 1 (see Eqs. 4.2 and 4.4). Applying $E_{WOP}$, $E_{WPF}$, and $E_{BUD}$ to Eq. (4.1), we estimate the negative energy-saving impact $E_S(block(T_{ij}))$ as:

$$E_S(block(T_{ij})) = -P_I \cdot t_{ij} \quad (4.12)$$
In light of the above five cases, the set $\Omega_{k,j}$ of disk activities for references accessing block $b_{k,j}$ in disk j can be partitioned into the following five disjoint subsets:

$$\Omega_{k,j} = \Omega_{k,j,1} \cup \Omega_{k,j,2} \cup \Omega_{k,j,3} \cup \Omega_{k,j,4} \cup \Omega_{k,j,5} \quad (4.13)$$

where $\Omega_{k,j,1}$, $\Omega_{k,j,2}$, $\Omega_{k,j,3}$, $\Omega_{k,j,4}$, and $\Omega_{k,j,5}$ contain the active time periods that respectively satisfy the conditions of the five energy-saving cases. The five subsets can be defined as:

$\Omega_{k,j,1} = \{T_{ij} \mid block(T_{ij}) = b_{k,j} \wedge 0 < \delta_{ij} \le T_{BE} \wedge 0 < \delta_{(i+1)j} \le T_{BE} \wedge \delta_{ij} + t_{ij} + \delta_{(i+1)j} > T_{BE}\}$ for case 1;

$\Omega_{k,j,2} = \{T_{ij} \mid block(T_{ij}) = b_{k,j} \wedge 0 < \delta_{ij} \le T_{BE} \wedge \delta_{(i+1)j} > T_{BE} \wedge \delta_{ij} + t_{ij} + \delta_{(i+1)j} > T_{BE}\}$ for case 2;

$\Omega_{k,j,3} = \{T_{ij} \mid block(T_{ij}) = b_{k,j} \wedge \delta_{ij} > T_{BE} \wedge 0 < \delta_{(i+1)j} \le T_{BE} \wedge \delta_{ij} + t_{ij} + \delta_{(i+1)j} > T_{BE}\}$ for case 3;

$\Omega_{k,j,4} = \{T_{ij} \mid block(T_{ij}) = b_{k,j} \wedge \delta_{ij} > T_{BE} \wedge \delta_{(i+1)j} > T_{BE} \wedge \delta_{ij} + t_{ij} + \delta_{(i+1)j} > T_{BE}\}$ for case 4; and

$\Omega_{k,j,5} = \{T_{ij} \mid block(T_{ij}) = b_{k,j} \wedge \delta_{ij} + t_{ij} + \delta_{(i+1)j} \le T_{BE}\}$ for case 5.

Now we are positioned to show the derivation of the energy savings $E_S(b_{k,j})$ yielded by fetching block $b_{k,j}$ from disk j to the buffer disk. $E_S(b_{k,j})$ can be derived from Eqs. (4.5), (4.7), (4.8), (4.10), and (4.12), where the last term on the right-hand side below is the energy overhead of fetching $b_{k,j}$ from disk j to the buffer disk:

$$E_S(b_{k,j}) = \sum_{T_{ij} \in \Omega_{k,j,1}} \bigl(P_I(\delta_{ij} + \delta_{(i+1)j}) - P_S(\delta_{ij} + t_{ij} + \delta_{(i+1)j} - T_U - T_D) - E_D - E_U\bigr)$$
$$+ \sum_{T_{ij} \in \Omega_{k,j,2}} \bigl(P_I \delta_{ij} - P_S(\delta_{ij} + t_{ij})\bigr) + \sum_{T_{ij} \in \Omega_{k,j,3}} \bigl(P_I \delta_{(i+1)j} - P_S(\delta_{(i+1)j} + t_{ij})\bigr)$$
$$+ \sum_{T_{ij} \in \Omega_{k,j,4}} \bigl(E_D + E_U - P_S(T_D + T_U + t_{ij})\bigr) - P_I \sum_{T_{ij} \in \Omega_{k,j,5}} t_{ij} - P_A \cdot time(b_{k,j}) \quad (4.14)$$

Given the kth block $b_{k,j}$ residing in disk j, the algorithm used to compute the energy savings of prefetching block $b_{k,j}$ from the data disk to the buffer disk is described in Figure 4.4.

Input: block b_kj, disk j, a set Omega_j of disk access activities
Output: E_S(b_kj)
1   Initialize E_S(b_kj) to 0;
2   for (i = 1 to n_j) {
3     if (delta_ij + t_ij + delta_(i+1)j > T_BE) {   /* Cases 1-4 */
4       if (0 < delta_ij <= T_BE) {
5         if (0 < delta_(i+1)j <= T_BE)              /* Case 1, see Eq. (4.5) */
6           E_S(b_kj) = E_S(b_kj) + P_I*(delta_ij + delta_(i+1)j) - P_S*(delta_ij + t_ij + delta_(i+1)j - T_U - T_D) - E_D - E_U;
7         else E_S(b_kj) = E_S(b_kj) + P_I*delta_ij - P_S*(delta_ij + t_ij);   /* Case 2, see Eq. (4.7) */
8       }
9       else {
10        if (0 < delta_(i+1)j <= T_BE)              /* Case 3, see Eq. (4.8) */
11          E_S(b_kj) = E_S(b_kj) + P_I*delta_(i+1)j - P_S*(delta_(i+1)j + t_ij);
12        else E_S(b_kj) = E_S(b_kj) + E_D + E_U - P_S*(T_D + T_U + t_ij);   /* Case 4, see Eq. (4.10) */
13      }
14    }   /* end Cases 1-4 */
15    else E_S(b_kj) = E_S(b_kj) - P_I*t_ij;         /* Negative energy saving: Case 5, see Eq. (4.12) */
16  }   /* end for */
17  return E_S(b_kj) - P_A*time(b_kj);

Figure 4.4: Algorithm PRE-BUD: the energy-saving calculation module

All of the positive energy-saving cases are handled explicitly in Steps 3 through 14, whereas Step 15 addresses the issue of negative energy savings. The time complexity of the energy-saving calculation module is low: the time complexity of this routine for each block is $O(n_j)$, where $n_j$ is the number of requests in the look-ahead corresponding to the jth disk.

After the block $b_{k,j}$ is fetched to the buffer disk, the set $\Omega_j = \{I_{1j}, T_{1j}, I_{2j}, T_{2j}, \ldots, I_{ij}, T_{ij}, \ldots, I_{n_j j}, T_{n_j j}\}$ of disk access activities for references in $R_j$ must be updated by deleting any $T_{ij} \in \Omega_j$ accessing $b_{k,j}$, i.e., any $T_{ij}$ with $block(T_{ij}) = b_{k,j}$.
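For readers who prefer running code, the following is a minimal C rendering of the module in Figure 4.4, reusing the hypothetical disk_params and activity types sketched earlier; how the idle gap after the last request (delta_(n_j+1)j) is supplied is our assumption, as the figure leaves it implicit.

```c
#include <stddef.h>

/* Energy saving of prefetching block b (Figure 4.4, Eq. 4.14).
 * acts[i].idle_before is delta_ij; idle_after[i] supplies delta_(i+1)j. */
double energy_saving(const disk_params *d, int b,
                     const activity *acts, const double *idle_after,
                     size_t n_j, double time_b /* time(b_kj) */)
{
    double es = 0.0;
    for (size_t i = 0; i < n_j; i++) {
        if (acts[i].block != b)
            continue;                       /* only periods accessing b_kj */
        double d1 = acts[i].idle_before;    /* delta_ij                    */
        double t  = acts[i].t_serve;        /* t_ij                        */
        double d2 = idle_after[i];          /* delta_(i+1)j                */
        if (d1 + t + d2 > d->t_be) {        /* Cases 1-4                   */
            if (d1 <= d->t_be) {
                if (d2 <= d->t_be)          /* Case 1, Eq. (4.5)           */
                    es += d->p_idle * (d1 + d2)
                        - d->p_standby * (d1 + t + d2 - d->t_up - d->t_down)
                        - d->e_down - d->e_up;
                else                        /* Case 2, Eq. (4.7)           */
                    es += d->p_idle * d1 - d->p_standby * (d1 + t);
            } else {
                if (d2 <= d->t_be)          /* Case 3, Eq. (4.8)           */
                    es += d->p_idle * d2 - d->p_standby * (d2 + t);
                else                        /* Case 4, Eq. (4.10)          */
                    es += d->e_down + d->e_up
                        - d->p_standby * (d->t_down + d->t_up + t);
            }
        } else {                            /* Case 5, Eq. (4.12)          */
            es -= d->p_idle * t;
        }
    }
    return es - d->p_active * time_b;       /* fetch overhead, Eq. (4.14)  */
}
```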
4.3 Analysis of PRE-BUD

In this section, we analyze the energy efficiency and performance of PRE-BUD. We start the analysis by showing the energy consumption of a full-power baseline system that never places disks into the standby state. Next, we analyze the energy consumption of a parallel disk system with the dynamic power management (DPM) technique. Last, our analysis focuses on the energy consumption and response time of parallel disk systems with PRE-BUD.

4.3.1 A Full-Power Baseline System

In this section we describe an energy consumption model built to quantitatively calculate the energy consumption of a modeled parallel disk system. We model the power of a parallel disk system with m disks as a vector $P = (P_1, P_2, \ldots, P_m)$. The power $P_i$ of the ith disk is represented by three parameters, i.e., $P_i = (P_{A,i}, P_{I,i}, P_{S,i})$, where $P_{A,i}$, $P_{I,i}$, and $P_{S,i}$ are the power of the ith disk when it is in the active, idle, and standby states, respectively.

Let $e_{j,i}$ be the energy consumed serving the jth request on the ith disk. We denote the energy consumption rate of the disk when it is active by $P_{A,i}$, and the energy consumption $e_{j,i}$ can be written as

$$e_{j,i} = x_{j,i} \cdot P_{A,i} \cdot t_{j,i} = x_{j,i} \cdot P_{A,i} \cdot \left(t_{SK,j,i} + t_{RT,j,i} + \frac{s_j}{B_i}\right) \quad (4.15)$$

where $t_{j,i}$ is the service time of request j on disk i. $t_{j,i}$ is the summation of $t_{SK,j,i}$, $t_{RT,j,i}$, and $s_j/B_i$, which are the seek time and rotational latency of the request and the data transfer time, the last depending on the data size $s_j$ and the transfer rate $B_i$ of the disk. Element $x_{j,i}$ is 1 if request j is served by the ith disk and 0 otherwise. Since each request can be served by only one disk, we have $\sum_{i=1}^m x_{j,i} = 1$.

Given a reference string R, we can compute the energy $E_A$ consumed by serving all requests as

$$E_A(P,R) = \sum_{i=1}^m \sum_{j=1}^n e_{j,i} = \sum_{i=1}^m \sum_{j=1}^n \left(x_{j,i} \cdot P_{A,i} \cdot \left(t_{SK,j,i} + t_{RT,j,i} + \frac{s_j}{B_i}\right)\right) \quad (4.16)$$

We define $f_j$ as the completion time of request $r_j$ in the reference string. Then, we obtain the analytical formula for the energy consumed when disks are idle:

$$E_I(P,R) = \sum_{i=1}^m (P_{I,i} \cdot T_{I,i}) \quad (4.17)$$

where $T_{I,i}$ is the time interval when the ith disk is idle. $T_{I,i}$ can be derived from the total disk I/O processing time and the completion time of the last request served by the disk. Thus, we have

$$T_{I,i} = \max_{j=1}^n (x_{j,i} \cdot f_j) - \sum_{j=1}^n \left(x_{j,i} \cdot \left(t_{SK,j,i} + t_{RT,j,i} + \frac{s_j}{B_i}\right)\right) \quad (4.18)$$

where the first term on the right-hand side of Eq. (4.18) is the summation of I/O processing times and disk idle times, and the second term is the total I/O time. The total energy consumption $E_{NEC}$ of a parallel disk system that never places a disk into the standby state is derived from Eqs. (4.16) and (4.17) as

$$E_{NEC}(P,R) = E_A(P,R) + E_I(P,R) = \sum_{i=1}^m \sum_{j=1}^n e_{j,i} + \sum_{i=1}^m (P_{I,i} \cdot T_{I,i}) \quad (4.19)$$

4.3.2 Dynamic Power Management

Energy in disk systems can be efficiently reduced by employing the dynamic power management (DPM) strategy, which places disks into standby when they are idle. To analyze the energy efficiency of PRE-BUD, it is important to model the energy consumption of a DPM-based parallel disk system. If an idle time of the ith disk is larger than the break-even time $T_{BE,i}$, then energy conservation can be achieved by putting the disk into the standby state. Otherwise, the energy penalty of transitioning between the high-power and low-power states cannot be offset by the energy conserved.

Let $P_{TR,i}$ be the power of state transitions in the ith disk. Let $P_{AS,i}$ and $P_{SA,i}$ denote the additional power introduced by transitions from active to standby and vice versa. $P_{TR,i}$ can be derived from $P_{AS,i}$ and $P_{SA,i}$ as

$$P_{TR,i} = \frac{T_{AS,i} \cdot P_{AS,i} + T_{SA,i} \cdot P_{SA,i}}{T_{AS,i} + T_{SA,i}} \quad (4.20)$$

where the numerator is the energy consumption caused by a pair of transitions and the denominator is the transition time. In light of Eq. (4.20), one can calculate the break-even time $T_{BE,i}$ as

$$T_{BE,i} = \begin{cases} (T_{AS,i} + T_{SA,i}) \cdot \left(1 + \dfrac{P_{TR,i} - P_{A,i}}{P_{A,i} - P_{S,i}}\right) & \text{if } P_{TR,i} > P_{A,i} \\ T_{AS,i} + T_{SA,i} & \text{otherwise} \end{cases} \quad (4.21)$$
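As a compact restatement of Eq. (4.21), consider the C sketch below (parameter names ours): when a pair of transitions draws no more power than active operation, the break-even time collapses to the transition time itself; otherwise it is stretched by the relative transition-power surplus.

```c
/* Break-even time T_BE,i of Eq. (4.21); a sketch with hypothetical names. */
double break_even_time(double t_as, double t_sa, /* transition times (s)    */
                       double p_tr,              /* transition power (4.20) */
                       double p_active, double p_standby)
{
    double t_trans = t_as + t_sa;
    if (p_tr > p_active)
        return t_trans * (1.0 + (p_tr - p_active) / (p_active - p_standby));
    return t_trans;
}
```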
In what follows, we make use of $T_{BE,i}$ to quantify the energy consumption of a parallel disk system when the DPM technique is employed. Suppose the number of idle time intervals in a disk i is $N_i$; a sequence of idle periods in the disk can be expressed as $(t_{I,i,1}, t_{I,i,2}, \ldots, t_{I,i,N_i})$, where $t_{I,i,k}$ represents the length of the kth idle period in the sequence. Let $\tilde{E}_I(P,R)$ be the energy consumed when disks are idle. The expression of $\tilde{E}_I(P,R)$ is given as

$$\tilde{E}_I(P,R) = \sum_{i=1}^m \left(P_{I,i} \cdot \tilde{T}_{I,i}\right) = \sum_{i=1}^m \left(P_{I,i} \cdot \sum_{k=1}^{N_i} (y_{k,i} \cdot t_{I,i,k})\right) \quad (4.22)$$

where $\tilde{T}_{I,i}$ is the summation of the small idle time intervals that are unable to compensate for the cost of transitioning to the standby state. $\tilde{T}_{I,i}$ is derived from a step function $y_{k,i}$, where $y_{k,i}$ is 1 if the idle interval is smaller than or equal to the break-even time and 0 otherwise. Using the step function $y_{k,i}$, we can express $\tilde{T}_{I,i}$ in Eq. (4.22) as $\tilde{T}_{I,i} = \sum_{k=1}^{N_i} (y_{k,i} \cdot t_{I,i,k})$.

The energy consumption of the parallel disk system when the disks are in the standby state can be expressed as

$$E_S(P,R) = \sum_{i=1}^m (P_{S,i} \cdot T_{S,i}) = \sum_{i=1}^m \left(P_{S,i} \cdot \sum_{k=1}^{N_i} (\bar{y}_{k,i} \cdot (t_{I,i,k} - T_{BE,i}))\right) \quad (4.23)$$

where $T_{S,i}$ is the time period when disk i is in the standby state. Similar to $\tilde{T}_{I,i}$, $T_{S,i}$ is derived from a step function $\bar{y}_{k,i}$, where $\bar{y}_{k,i}$ is 1 if the idle interval is larger than $T_{BE,i}$ and 0 otherwise. With the step function $\bar{y}_{k,i}$, we can model $T_{S,i}$ in Eq. (4.23) as $T_{S,i} = \sum_{k=1}^{N_i} (\bar{y}_{k,i} \cdot (t_{I,i,k} - T_{BE,i}))$.

Similarly, below we obtain the formula for the energy consumption of disk power-state transitions:

$$E_{TR}(P,R) = \sum_{i=1}^m (P_{TR,i} \cdot T_{TR,i}) \quad (4.24)$$

where $P_{TR,i}$ is determined by Eq. (4.20). $T_{TR,i}$ is the time interval when disk i is transitioning from one power state to another. $T_{TR,i}$ can be derived from $T_{BE,i}$; hence, we obtain $T_{TR,i} = \sum_{k=1}^{N_i} (\bar{y}_{k,i} \cdot T_{BE,i})$.

The energy consumption $E_{DPM}$ of the parallel disk system with the DPM technique is the summation of the energy incurred by the disks when they are in the active, idle, standby, and transition states. Thus, $E_{DPM}$ can be derived from Eqs. (4.16), (4.22), (4.23), and (4.24) as

$$E_{DPM}(P,R) = E_A(P,R) + \tilde{E}_I(P,R) + E_S(P,R) + E_{TR}(P,R) \quad (4.25)$$
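Per disk, Eqs. (4.22)-(4.24) amount to a single pass over the idle intervals; the sketch below (hypothetical names, C as elsewhere in this chapter) charges short intervals at idle power and splits each long interval into a transition window of length $T_{BE,i}$ and a standby remainder.

```c
#include <stddef.h>

typedef struct { double idle_j, standby_j, transition_j; } dpm_energy;

/* One disk's DPM energy terms over its idle intervals (Eqs. 4.22-4.24). */
dpm_energy dpm_idle_terms(const double *t_idle, size_t n_idle, double t_be,
                          double p_idle, double p_standby, double p_tr)
{
    dpm_energy e = {0.0, 0.0, 0.0};
    for (size_t k = 0; k < n_idle; k++) {
        if (t_idle[k] <= t_be)                     /* y_{k,i} = 1      */
            e.idle_j += p_idle * t_idle[k];
        else {                                     /* y-bar_{k,i} = 1  */
            e.standby_j    += p_standby * (t_idle[k] - t_be);
            e.transition_j += p_tr * t_be;
        }
    }
    return e;
}
```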
4.3.3 Derivation of Energy Efficiency for PRE-BUD

Now we analyze the energy efficiency of the PRE-BUD strategy. We only analyze the energy consumption of a parallel I/O system with PRE-BUD where an extra disk is added to the system as a buffer disk; PRE-BUD using an existing disk can be modeled similarly and can be derived from the models presented in this section.

First of all, we analyze the energy overhead $E_{PF}$ introduced by prefetching the popular data blocks from the data disks to the buffer disk. Let $D = (D_1, D_2, \ldots, D_q)$ be a set of data blocks retrieved by reference string R. We make use of a predicate $\theta_{j,i,k}$, which asserts that request $r_j$ is accessing data block k on disk i, to partition the reference string in such a way that requests accessing the same kth block on disk i are grouped into the one set $R_{k,i}$. Thus, we have

$$R_{k,i} = \{r_j \in R \mid x_{j,i} = 1 \wedge \theta_{j,i,k} = \mathrm{TRUE}\} \quad (4.26)$$

The sizes of all the requests in $R_{k,i}$ are identical. For simplicity, we denote the size of requests in $R_{k,i}$ as $s_{k,i}$. The following property must be satisfied:

$$\forall\, 1 \le j \le n,\ 1 \le k \le q,\ r_j \in R_{k,i}: \quad s_j = s_{k,i} \quad (4.27)$$

In most cases, it is impossible for a buffer disk to cache all the popular data sets. Therefore, we introduce the following step function to distinguish the data blocks prefetched from the data disks to the buffer disk:

$$z_{k,i} = \begin{cases} 1 & \text{if block } k \text{ on disk } i \text{ is prefetched} \\ 0 & \text{otherwise} \end{cases} \quad (4.28)$$

The energy consumption $E_{PF}$ caused by prefetching contains two components: the energy consumption $E_{R,PF}$ of reading frequently accessed data blocks from the data disks, and the energy consumption $E_{W,PF}$ of writing the data blocks to the buffer disk. Thus, $E_{PF}$ is quantified below:

$$E_{PF}(P,D) = E_{R,PF}(P,D) + E_{W,PF}(P,D) = \sum_{i=1}^m \sum_{k=1}^q \left(z_{k,i} \cdot P_{A,i} \cdot \left(t_{SK,k,i} + t_{RT,k,i} + \frac{s_{k,i}}{B_{R,i}}\right)\right) + \sum_{i=1}^m \sum_{k=1}^q \left(z_{k,i} \cdot P_{A,0} \cdot \left(t_{SK,k,0} + t_{RT,k,0} + \frac{s_{k,i}}{B_{W,0}}\right)\right) \quad (4.29)$$

where $P_{A,0}$ is the power of the buffer disk in the active state, $B_{R,i}$ is the read transfer rate of data disk i, and $B_{W,0}$ is the write transfer rate of the buffer disk.

Next, let us derive expressions to calculate the energy consumption $E_0$ of the buffer disk. $E_0$ is the summation of the active, idle, and standby state energy consumption totals of the buffer disk, plus the power-state transition overheads. Thus,

$$E_0 = E_{A,0} + E_{I,0} + E_{S,0} + E_{TR,0} \quad (4.30)$$

where $E_{A,0}$, $E_{I,0}$, and $E_{S,0}$ are the active, idle, and standby state energy consumption totals of the buffer disk, and $E_{TR,0}$ is the energy overhead for power-state transitions. In what follows, we direct our attention to the analytical formulas of $E_{A,0}$, $E_{I,0}$, and $E_{S,0}$.

Given a set D of accessed data blocks, we model the energy $E_{A,0}$ of the buffer disk when it is active as

$$E_{A,0}(D) = \sum_{i=1}^m \sum_{k=1}^q (z_{k,i} \cdot P_{A,0} \cdot T_{A,0}) = \sum_{i=1}^m \sum_{k=1}^q \left(z_{k,i} \cdot P_{A,0} \cdot \sum_{r_j \in R_{k,i}} \left(t_{SK,j,0} + t_{RT,j,0} + \frac{s_{k,i}}{B_{R,0}}\right)\right) \quad (4.31)$$

where $T_{A,0}$ is the time period when the buffer disk is in the active state; $T_{A,0}$ is the accumulated service time of the requests processed by the buffer disk.

Let $IS = (t_{I,1}, t_{I,2}, \ldots, t_{I,N_0})$ be a sequence of idle periods in the buffer disk. Eq. (4.32) quantifies the energy consumption $E_{I,0}$ of the buffer disk when it is sitting idle:

$$E_{I,0}(IS) = P_{I,0} \cdot \tilde{T}_{I,0} = P_{I,0} \cdot \sum_{t_{I,k} \in IS} (y_{k,0} \cdot t_{I,k}) \quad (4.32)$$

where $\tilde{T}_{I,0}$ is the summation of the small idle time intervals that are unable to compensate for the cost of transitioning to the standby state, and $y_{k,0}$ is the step function used in Eq. (4.22).

The energy consumption $E_{S,0}$ in Eq. (4.30) is expressed as

$$E_{S,0}(IS) = P_{S,0} \cdot T_{S,0} = P_{S,0} \cdot \sum_{t_{I,k} \in IS} (\bar{y}_{k,0} \cdot (t_{I,k} - T_{BE,0})) \quad (4.33)$$

where $T_{S,0}$ is the total time when the buffer disk is in the standby mode. $T_{S,0}$ is derived from the break-even time given by Eq. (4.21) and the step function used in Eq. (4.23).

The energy overhead $E_{TR,0}$ for power-state transitions is expressed as follows:

$$E_{TR,0}(IS) = P_{TR,0} \cdot T_{TR,0} = P_{TR,0} \cdot \sum_{t_{I,k} \in IS} (\bar{y}_{k,0} \cdot T_{BE,0}) \quad (4.34)$$

The energy consumption $E_D$ of the data disks with the dynamic power management technique can be determined by applying Eq. (4.25).
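The prefetch overhead of Eq. (4.29) is a straightforward double loop over the placement matrix $z_{k,i}$; a sketch under the chapter's assumption of identical data disks, with names of our choosing:

```c
/* Prefetch energy E_PF (Eq. 4.29): read each selected block from its data
 * disk, then write it to the buffer disk (subscript 0). The positioning
 * time t_pos_* folds the seek and rotational-latency terms together. */
double prefetch_energy(int m, int q, const int *z, const double *size,
                       double p_a_data, double b_read, double t_pos_data,
                       double p_a_buf, double b_write, double t_pos_buf)
{
    double e = 0.0;
    for (int i = 0; i < m; i++)
        for (int k = 0; k < q; k++)
            if (z[i * q + k]) {   /* z_{k,i}: block k of disk i prefetched */
                e += p_a_data * (t_pos_data + size[i * q + k] / b_read);
                e += p_a_buf  * (t_pos_buf  + size[i * q + k] / b_write);
            }
    return e;
}
```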
Now we are in a position to obtain the total energy consumption $E_{PRE\text{-}BUD}$ of the parallel I/O system with an extra buffer disk from Eqs. (4.25), (4.29), and (4.30). Thus,

$$E_{PRE\text{-}BUD} = E_{PF}(P,D) + E_0(D,IS) + E_D(P,R) \quad (4.35)$$

4.3.4 Derivation of Response Time for PRE-BUD

Now we derive a response time approximation for the PRE-BUD architecture. By definition, the response time of a disk request is the interval between its arrival time and its finish time; the response time can be calculated as the sum of a disk request's wait time and its I/O service time.

Let $D_0 = \{d_1, \ldots, d_k, \ldots, d_{l_0}\}$ be the set of data blocks prefetched to a buffer disk. Throughout this section, the subscript 0 is used to represent the buffer disk. Let $\lambda_k$ and $t_k$ represent the access rate and I/O service time of the kth data block in $D_0$. Let $\rho_0$ and $\lambda_0$ be the utilization and aggregate access rate of the buffer disk. Thus, we have

$$\rho_0 = \sum_{d_k \in D_0} (\lambda_k \cdot t_k) \quad \text{and} \quad \lambda_0 = \sum_{d_k \in D_0} \lambda_k \quad (4.36)$$

The mean service time $\bar{S}_0$ and mean-square service time $\bar{S}^2_0$ of disk accesses to the buffer disk are given as

$$\bar{S}_0 = \sum_{d_k \in D_0} \left(\frac{\lambda_k}{\lambda_0} \cdot t_k\right) = \frac{1}{\lambda_0} \cdot \sum_{d_k \in D_0} (\lambda_k \cdot t_k) \quad (4.37)$$

$$\bar{S}^2_0 = \sum_{d_k \in D_0} \left(\frac{\lambda_k}{\lambda_0} \cdot t_k^2\right) = \frac{1}{\lambda_0} \cdot \sum_{d_k \in D_0} (\lambda_k \cdot t_k^2) \quad (4.38)$$

where $\lambda_k/\lambda_0$ is the probability of an access to data block $d_k$ in the buffer disk. We model each disk in a parallel disk system as a single M/G/1 queue, which has exponentially distributed inter-arrival times and an arbitrary distribution for the service times of disk requests. Consequently, we can obtain the mean response time $\bar{T}_0$ of accesses to the buffer disk from Eqs. (4.36), (4.37), and (4.38) as

$$\bar{T}_0 = \bar{S}_0 + \frac{\lambda_0 \cdot \bar{S}^2_0}{2 \cdot (1 - \rho_0)} \quad (4.39)$$

In what follows, let us derive the mean response time $\bar{T}_j$ of accesses to disk j. We denote $D_j$ $(1 \le j \le m)$ as the set of data blocks stored in the jth disk. Let $D_j^{PF} \subseteq D_j$ be the set of data blocks in $D_j$ prefetched to a buffer disk. Similarly, let $D_j' \subseteq D_j$ be the set of data blocks that have not been prefetched. For the jth disk, we have $D_j = D_j^{PF} \cup D_j'$. Let $\rho_j$ and $\lambda_j$ represent the utilization and aggregate access rate of the jth data disk. $\rho_j$ and $\lambda_j$ can be expressed as:

$$\rho_j = \sum_{d_k \in D_j'} (\lambda_k \cdot t_k) \quad \text{and} \quad \lambda_j = \sum_{d_k \in D_j'} \lambda_k \quad (4.40)$$

The mean and mean-square service times (i.e., $\bar{S}_j$ and $\bar{S}^2_j$) of disk accesses to disk j are given as

$$\bar{S}_j = \sum_{d_k \in D_j'} \left(\frac{\lambda_k}{\lambda_j} \cdot t_k\right) = \frac{1}{\lambda_j} \sum_{d_k \in D_j'} (\lambda_k \cdot t_k) \quad (4.41)$$

$$\bar{S}^2_j = \sum_{d_k \in D_j'} \left(\frac{\lambda_k}{\lambda_j} \cdot t_k^2\right) = \frac{1}{\lambda_j} \sum_{d_k \in D_j'} (\lambda_k \cdot t_k^2) \quad (4.42)$$

We can derive the mean response time $\bar{T}_j$ of accesses to data disk j from the above equations as

$$\bar{T}_j = \bar{S}_j + \frac{\lambda_j \cdot \bar{S}^2_j}{2 \cdot (1 - \rho_j)} \quad (4.43)$$

Therefore, the overall mean response time of a parallel disk system with a buffer disk is written as below, where $\lambda = \sum_{j=0}^m \lambda_j$ is the aggregate access rate of the parallel disk system:

$$\bar{T} = \frac{1}{\lambda} \cdot \sum_{j=0}^m \left(\lambda_j \cdot \bar{T}_j\right) \quad (4.44)$$
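Eqs. (4.37)-(4.39) and (4.41)-(4.43) share the same shape, so one helper covers both the buffer disk and the data disks; a C sketch (hypothetical names) of the Pollaczek-Khinchine computation, after which Eq. (4.44) reduces to a rate-weighted average of the per-disk results:

```c
#include <stddef.h>

/* Mean response time of one disk modeled as an M/G/1 queue: the mean
 * service time plus the Pollaczek-Khinchine mean waiting time, per
 * Eqs. (4.39)/(4.43). lambda[k] and t[k] are the access rate and service
 * time of the blocks resident on the disk. */
double mg1_response_time(const double *lambda, const double *t, size_t n)
{
    double rate = 0.0, s1 = 0.0, s2 = 0.0;
    for (size_t k = 0; k < n; k++) {
        rate += lambda[k];               /* aggregate access rate, Eq. (4.36) */
        s1   += lambda[k] * t[k];
        s2   += lambda[k] * t[k] * t[k];
    }
    double rho = s1;                     /* utilization: sum lambda_k * t_k   */
    s1 /= rate;                          /* mean service time, Eq. (4.37)     */
    s2 /= rate;                          /* mean-square service time, (4.38)  */
    return s1 + rate * s2 / (2.0 * (1.0 - rho));
}
```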
4.4 Experimental Results

In this section we present our experimental results for the proposed PRE-BUD energy-efficient prefetching approach for parallel disk systems. First we provide information about our simulation environment and the parameters that were varied in our experiments. Next, we compare PRE-BUD with PDC and DPM, two well-known energy conservation techniques for parallel disks [21]. Then, we study the impacts of various system parameters on the energy efficiency and performance of parallel disks.

4.4.1 Experiment Setup

Extensive experiments were conducted with a disk simulator based on the mathematical models presented in Sections 4.2 and 4.3. Our disk model (see Table 4.3) is based on the IBM Ultrastar 36Z15, which has been widely used in data-intensive environments [18]. Our simulator was implemented in Java, allowing us to quickly and easily change various system parameters. Both synthetic and real-world traces are used to evaluate PRE-BUD.

For comparison purposes, we consider a parallel I/O system (referred to as Non-Energy Aware) where disks operate in a standard mode without employing any energy-saving techniques. In other words, disks are in the busy state while serving requests and in the idle state when not serving a request. Two PRE-BUD configurations are evaluated: the first configuration, PRE-BUD1, adds an extra disk to be used as the buffer disk; the second configuration, PRE-BUD2, designates an existing disk as the buffer disk. Note that the term "hit rate" used throughout this section is defined as the percentage of requests that can be served by the buffer disk. One of the goals of our experiments is to identify the parameters that are crucial to energy-efficient disk storage systems.

Table 4.3: Disk Parameters (IBM Ultrastar 36Z15)

  Parameter             Value      Parameter                Value
  Transfer Rate         55 MB/s    Spin Down Time (T_D)     1.5 s
  Active Power (P_A)    13.5 W     Spin Up Time (T_U)       10.9 s
  Idle Power (P_I)      10.2 W     Spin Down Energy (E_D)   13.0 J
  Standby Power (P_S)   2.5 W      Spin Up Energy (E_U)     135 J

4.4.2 Comparison of PRE-BUD and PDC

Figure 4.5 shows the energy efficiency comparison of our PRE-BUD strategy and the PDC [26] energy-saving technique. PDC attempts to move popular data across the disks, such that the first disk has the most popular data, the second disk has the second most popular set of data, and so forth. We fixed the data size at 275 MB, and the hit rate is 95% for PRE-BUD. Since the data could potentially be anywhere in the disk system, the PDC strategy causes data to be moved within the disk system.

(Figure 4.5: PDC and PRE-BUD Comparison)

Figure 4.5 shows that PRE-BUD is more energy efficient than PDC if PDC has to move a large amount of data within the storage system; PDC may pay a much higher initial energy penalty when a large amount of data must be moved. PRE-BUD has a fixed amount of buffer disk space; in this example, it is fixed at 10% of the total data in the storage system. PRE-BUD can be adaptively tuned to find the particular amount of buffer disk capacity that will yield the largest savings, and it only needs to move blocks that can provide energy savings. In contrast, PDC makes no guarantees about the energy impact of moving data within the storage system, and it does not adapt as quickly as our PRE-BUD strategy to changing workload conditions. The look-ahead window we employ can amortize the expense of moving frequently accessed data into the buffer disk, whereas PDC attempts to move frequently accessed data all at once, which can cause large overheads when the workload of the parallel disk system changes frequently.

4.4.3 Impact of Data Size

The second set of experiments evaluates the impact that the data size of requests has on the energy savings of DPM and PRE-BUD. For this set of experiments we fixed the number of disks at 12. The hit rate of the buffer disk is varied from 85% to 100%.
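A note on the break-even time used below: with the Table 4.3 parameters, a spin-down/spin-up pair takes $T_D + T_U = 12.4$ s and costs $E_D + E_U = 148$ J. Amortizing that transition energy against the idle power (our reading of how the figure is obtained) gives

$$T_{BE} \approx \frac{E_D + E_U}{P_I} = \frac{148 \text{ J}}{10.2 \text{ W}} \approx 14.5 \text{ s},$$

the value quoted in the discussion that follows; the 12.4 s and 148 J figures also reappear in Chapter 5.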
Figure 4.6 reveals that the data size has a huge impact on the energy efficiency of DPM and our PRE-BUD strategy when the hit rate is lower than 100%. If the hit rate is 100% for the buffer disk, the data disks can sleep for long periods of time regardless of the data size.

(Figure 4.6: Total energy consumption of the disk system while the data size is varied, for four different values of the hit rate: (a) 85%, (b) 90%, (c) 95%, and (d) 100%)

The results depicted in Figure 4.6 indicate that our PRE-BUD strategy performs best with data-intensive applications that request large files. Thus, multimedia storage systems would be a perfect candidate for the PRE-BUD energy-saving strategy. The data size has such a large impact on energy savings because of the break-even time $T_{BE}$, which is 14.5 seconds for the chosen disk model. Large data sizes take a longer time to serve, so consecutive buffer hits for a large data size meet the break-even time; conversely, small data sizes produce little or no energy efficiency gains. These experimental results confirm that the data size and the hit rate combine to determine the probability of meeting the break-even time $T_{BE}$, with higher hit rates and larger data sizes being the ideal combination for energy savings. PRE-BUD1 consumes more energy than DPM when the data size is 1 MB or smaller. This is because PRE-BUD1 adds an extra disk to the disk system, and with a small data size, opportunities to put a disk to sleep are rare. This set of experiments leads us to the conclusion that large data sizes are conducive to energy efficiency in PRE-BUD.

4.4.4 Impact of Number of Data Disks

Now we evaluate the impact of varying the ratio of data disks to buffer disks. The number of buffer disks is fixed at 1; the number of data disks is set to 4, 8, and 12. The hit rate is fixed at 95%, and the data size is varied from 1 MB to 25 MB. Not surprisingly, we discover from Figure 4.7 that as we increase the number of data disks per buffer disk, the energy savings become more pronounced for PRE-BUD. This energy efficiency trend is expected because increasing the number of disks makes each individual disk less heavily loaded. The buffer disk simply prefetches blocks that can produce energy savings, and lightly loaded disks are more likely to be switched into the standby mode to conserve energy. PRE-BUD, of course, has to prefetch a smaller amount of data from each disk to achieve this high energy efficiency.

(Figure 4.7: Total energy consumption of the disk system while the number of data disks is varied. The data size is fixed at: (a) 1 MB, (b) 5 MB, (c) 10 MB, and (d) 25 MB)

If the number of data disks is increased, we must be sure that performance is not negatively impacted: when more data disks are added to a parallel disk system, the buffer disk is more likely to become the performance bottleneck. Moreover, Figure 4.7 shows that a large data size makes PRE-BUD more energy efficient. This result is consistent with that plotted in Figure 4.6.

4.4.5 Impact of Hit Rate

In this set of experiments we investigate the impact that the buffer disk hit rate has on the energy efficiency of the parallel disk system. Again, the data size is varied from 1 to 25 MB, and the number of data disks is set to 12. We observe from Figure 4.8 that higher hit rates enable PRE-BUD to save more energy in the parallel disk system. This is expected because with a high hit rate we heavily load the buffer disk while allowing the data disks to be transitioned to the standby state.
A low hit rate means a data disk must frequently be spun up to serve requests, incurring energy penalties; the longer a disk can stay in the standby state, the more energy efficient the parallel disk system will be. Note that hit rates of 100% are not realistically achievable if the disk requests require all disks to be active; a 100% hit rate can only be accomplished if the overall load on the entire disk system is fairly light. It has been documented that some parallel workloads are heavily skewed towards a small percentage of the data, making 80% hit rates feasible. With varying data sizes, we notice that the energy savings become more significant for larger data sizes. Having a larger data size is similar to increasing the hit rate of buffer disks operating on smaller data sizes.

(Figure 4.8: Total energy consumption for different hit rate values where the data size is fixed at: (a) 1 MB, (b) 5 MB, (c) 10 MB, and (d) 25 MB)

4.4.6 Impact of Inter-Arrival Delays

In these experiments, we study the impact that the inter-arrival delays of requests have on the energy savings of PRE-BUD. Figure 4.9 shows the energy consumption totals of the disk system for four different values of the inter-arrival delay. The number of disks was fixed at 12, the data size was fixed at 1 MB, and the hit rate was varied from 85% to 100%. When there is no inter-arrival delay, DPM does not yield any energy savings, because there are no idle windows large enough for disks to spin down. PRE-BUD1 ends up consuming more energy than DPM in this case, because PRE-BUD1 adds the overhead of an extra disk and the energy required to prefetch the data; PRE-BUD2 is the most energy efficient, since there is no need to add an extra disk. If the inter-arrival delay is 100 ms, we have a similar situation, except that PRE-BUD1 is now able to produce a small amount of energy savings. When the inter-arrival delay becomes 500 ms, DPM begins to produce energy savings; however, such savings pale in comparison to PRE-BUD. When the delay is increased to 1 s, the results look similar to those for a 500 ms delay. Although DPM in this case can achieve more energy savings, PRE-BUD1 and PRE-BUD2 significantly outperform DPM in terms of energy efficiency. These results fit our intuition about the behavior of the PRE-BUD approach: DPM needs large idle times between consecutive requests to achieve energy savings, depending heavily on the break-even time of a particular hard drive, whereas PRE-BUD proactively provides the data disks with larger idle windows by redirecting requests to the buffer disk.

(Figure 4.9: Total energy consumption for different delay values where the hit rate is: (a) 85%, (b) 90%, (c) 95%, and (d) 100%)

4.4.7 Power State Transitions

In this section of our study, we investigate the relationship between the number of power-state transitions and energy efficiency. Figure 4.10 depicts the number of power-state transitions triggered by DPM, PRE-BUD1, and PRE-BUD2 when the data size and hit rate are varied.

(Figure 4.10: Total disk state transitions for different data sizes where the hit rate is: (a) 85%, (b) 90%, (c) 95%, and (d) 100%)

The number of state transitions caused by DPM is zero when the data size is smaller than or equal to 25 MB. There are no power-state transitions for small data sizes because no idle time periods of the data disks are long enough for DPM to justify transitioning to the standby state.
When the data size is larger than 25 MB, the number of power-state transitions rises quickly with increasing data sizes. When DPM does trigger transitions, it is able to improve the energy efficiency of the disk system. Interestingly, the number of transitions for PRE-BUD slowly increases at first and then starts dropping once the data size grows beyond 125 MB. The transition count increases while the data size is small because many small idle periods in the data disks are merged by PRE-BUD, creating new opportunities for the data disks to sleep; since those small idle intervals tend to be spread out, the data disks experience many power-state transitions. The buffer disk reduces the number of transitions for the data disks when the data size is large, because a buffer disk generates larger and fewer idle time periods in the data disks, and a few very large idle periods lead to a small number of transitions. One of the problems with DPM is that it transitions a disk many times, which may decrease the reliability of the disk. Unlike DPM, PRE-BUD can improve the reliability of the disk system by lowering the number of transitions when the data sizes of requests are very large. As such, PRE-BUD is conducive to improving both energy efficiency and reliability for data-intensive applications with large data requests.

4.4.8 Impact of Disk Power Characteristics

To examine the effect that manipulating disk power characteristics has on PRE-BUD, we varied the active power, idle power, and standby power in three separate experiments. The number of data disks is fixed at 4 and the data size at 25 MB. Figure 4.11(a) shows that increasing the active power of a disk results in a continuous increase of energy consumption across the four different strategies. The results plotted in Figure 4.11(a) indicate that PRE-BUD is more energy efficient for parallel disks with low active power. For example, if the active power is 9.5 W, PRE-BUD2 saves 15.1% of the total energy consumption over DPM; if the active power is increased to 17.5 W, PRE-BUD2 improves energy efficiency over DPM by only 13.0%. Figure 4.11(b) shows the impact that varying the idle power parameter of a disk has on the energy efficiency of PRE-BUD. Compared with active power, idle power has a greater impact on the energy savings achieved by PRE-BUD: if the idle power is very low, PRE-BUD2 has a negative impact, while if the idle power is increased to 14.2 W, PRE-BUD2 can save 25% energy over DPM. Figure 4.11(c) shows that standby power also has a significant impact on PRE-BUD: the energy savings start at 16.3% and drop to 11.7% with increasing standby power.

(Figure 4.11: Total energy consumption for various values of the following disk parameters: (a) active power, (b) idle power, and (c) standby power)

The results illustrated in Figure 4.11 indicate that parallel disks with low active power, high idle power, and low standby power produce the best energy-saving benefit. This is because PRE-BUD allows disks to be spun down into standby during times they would be idle under DPM; the greater the discrepancy between idle and standby power, the more beneficial PRE-BUD becomes. Lowering the active power also makes PRE-BUD more energy efficient, because the amount of energy consumed prefetching and serving requests is reduced. Throughout our experiments we found that the main factor limiting the energy-saving potential of PRE-BUD is the large break-even time of disks.
A large break-even time reduces the opportunities for DPM to conserve energy when there is a large number of idle periods smaller than the break-even time. PRE-BUD alleviates this problem of DPM by combining idle periods to form large idle windows. Unfortunately, PRE-BUD inevitably reaches a critical point where energy savings are no longer possible. To further improve the energy efficiency of PRE-BUD, we have to rely on disks that are able to quickly transition among power states, one of the dominating factors in energy savings for disks.

4.4.9 Real-World Applications

To validate our results based on synthetic traces, we evaluated eight real-world application traces. The applications are parallel in nature; thus, all of the applications used eight disks, with the exception of the Titan and HTTP applications, which used seven disks. Note that the results plotted in Figure 4.12 generally represent the worst case for PRE-BUD. Figure 4.12 shows that PRE-BUD1 consumes more energy than DPM for most applications, except for the Cholesky and LU Decomposition applications. When applications are very I/O-intensive, adding an extra disk leaves no opportunity to conserve energy.

(Figure 4.12: Total Energy Consumed for Real World Traces)

Figure 4.12 also shows that PRE-BUD2 noticeably improves energy efficiency over DPM for most applications. The results confirm that PRE-BUD can generally produce energy savings under both low and high disk workloads, even though the savings are relatively small for high workloads. A surprising exception is the Titan application, for which DPM is more energy efficient than PRE-BUD. In the Titan trace, there is one large gap between all of the consecutive requests, allowing DPM an opportunity to put all of the disks into the standby state for a long period of time. PRE-BUD, on the other hand, keeps the buffer disk active all the time to minimize the negative impact on performance. In this special case, the active buffer disk makes PRE-BUD less energy efficient than DPM. The energy efficiency of PRE-BUD can be further improved by aggressively transitioning the buffer disk to the standby state when it is sitting idle.

Table 4.4: Response Time Analysis

  PRE-BUD Response Time Degradation      5 Disks   10 Disks   15 Disks   20 Disks
  10% of Data Accessed in 90% of Trace   6 ms      16 ms      26 ms      36 ms
  20% of Data Accessed in 80% of Trace   6 ms      16 ms      26 ms      36 ms
  30% of Data Accessed in 70% of Trace   6 ms      16 ms      26 ms      36 ms
  40% of Data Accessed in 60% of Trace   32 ms     47 ms      62 ms      79 ms

4.4.10 Response Time Analysis

In Table 4.4 we present the response time analysis results for the PRE-BUD strategy. We used four different traces, each with a designated set of popular data that varied in size and in its overall percentage of the entire trace. We also varied the number of data disks from which each buffer disk is responsible for prefetching data. From the table we see that the first three traces have similar response time results for each number of data disks used in the experiments. This tells us that our PRE-BUD strategy is capable of balancing the load and producing energy savings with a minimal impact on the response time of the parallel disk system. For the last trace, in which 40% of the data is accessed in 60% of the trace, the response time degradation is significantly higher compared to the other traces. This result is expected because the workload does not have an easily identifiable subset of data that can be prefetched to produce energy savings.
PRE-BUD relies on the fact that some parallel application I/O operations are heavily skewed towards a small subset of data. From all of the results presented in Table 4.4, we see that the PRE-BUD strategy produces relatively small response time degradations. This means our strategy will work for applications that can tolerate response time degradations, but it is not suitable for real-time applications.

4.5 Chapter Summary

The use of large-scale parallel I/O systems continues to rise as the demand for information systems with large capacities grows. Parallel disk I/O systems combine smaller disks to achieve large capacities. A challenging problem is that large-scale disk systems can be extremely energy inefficient, and their energy consumption rates are rising as disks become faster and disk systems are scaled up. The goal of this study is to improve the energy efficiency of a parallel I/O system using a buffer disk to which frequently accessed data are prefetched.

In this chapter, we developed an energy-efficient prefetching algorithm (PRE-BUD) for parallel I/O systems with buffer disks. The two buffer disk configurations considered in our study are (1) adding an extra buffer disk to accommodate prefetched data and (2) utilizing an existing disk as the buffer disk. Prefetching data blocks into the buffer disk provides ample opportunities to increase the idle periods of data disks, thereby facilitating long disk standby times. Although the first buffer disk configuration may consume more energy due to the overhead introduced by an extra disk, it does not compromise the capacity of the disk system. The second buffer disk configuration lowers the capacity of the parallel disk system, but it is more cost-effective and energy-efficient than the first one. Compared with existing energy-saving strategies for parallel I/O systems, PRE-BUD exhibits the following appealing features: (1) it is conducive to achieving substantial energy savings for both large and small read requests; (2) it can positively impact the reliability of parallel disk systems by virtue of reducing the number of power-state transitions; (3) it prefetches data into a buffer disk without affecting the data layout of any data disk; (4) it does not require any changes to the overall architecture of an existing parallel I/O system; and (5) it does not involve complicated metadata management for large-scale parallel I/O systems.

There are three possible future research directions for extending PRE-BUD. First, we will improve the scalability of PRE-BUD by adding more than one buffer disk to the parallel I/O system. This can be implemented with a buffer disk controller that manages multiple buffer disks, each responsible for a set of data disks. In this work we investigated the relationship between buffer disks and data disks; to improve the parallelism of PRE-BUD, we need to investigate the relationship between a buffer disk controller and the buffer disks. The number of buffer disks will have to be increased as the scale of the disk system increases. Second, PRE-BUD will be integrated with dynamic speed control, or DRPM [9], for parallel disks. Last but not least, we will quantitatively study the reliability impacts of PRE-BUD on parallel I/O systems.

Chapter 5

DiskSim Power Models

5.1 Introduction

The previous chapter explored some powerful simulation models, which we implemented in a simulator that we developed ourselves.
It is important to validate the simulator that one has developed, and this can be a lengthy process, so we decided to leverage DiskSim to continue our research. DiskSim is a validated disk system simulator developed by the Parallel Data Lab of Carnegie Mellon University [3]. If we make any changes, we only need to validate those changes, and we need not worry about the disk simulation details, because they have been worked out by the DiskSim project. DiskSim also collects a large amount of information about disk usage that our own simulator did not, and this information can be leveraged to further improve the quality of the research.

5.2 Simulation Framework

Our system has two main components: the DiskSim simulator, which is responsible for simulating the operation of all the disks and the movement of data blocks in the system, and a block-to-file translator, which is responsible for mapping data blocks to files in our storage system. It is important for us to investigate data movement at the file system level for several key reasons. The first reason is that a disk must see a large idle period before it can be placed into the standby state to conserve energy; data blocks are typically small, so we would have to prefetch a large number of data blocks that could potentially be changed by writes. Secondly, based on the observations made in the previous chapters, we have decided that our work is best suited for web applications and the cluster storage systems that support large volumes of data. Data blocks are typically managed within a single computer, and we need a higher-level mechanism to manage data across cluster storage nodes.

The interaction of our file-to-block-level translator and a modified version of DiskSim is outlined in Figure 5.1; the modifications made to DiskSim are outlined in Section 5.4.

(Figure 5.1: File System Simulator and Disk System Interaction)

The process begins with a file-system-level trace, generated using our trace generator or taken from a real-world application trace. We then use our file-to-block-level translator to generate a block-level trace for DiskSim. The translator keeps track of all the files in the storage system and the node on which each is located, and it simulates the movement of files across nodes. Once the block-level trace is produced, it is used by DiskSim to collect power and response time statistics. The movement of files between disks and nodes in the storage system is important to support our energy-efficient prefetching strategies, which rely on data movement.
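As a concrete illustration of the translator's job, the sketch below turns one file-level access into a block-level trace record of the kind DiskSim consumes; the record layout and all names here are ours, not DiskSim's native trace format.

```c
#include <stdio.h>

/* Where a file lives: a contiguous block extent on one node's disk.
 * Purely illustrative types and fields. */
typedef struct { int node; int disk; long first_block; long n_blocks; } extent;

/* Emit one block-level request per file access:
 * arrival time, disk number, starting block, size in blocks, read/write. */
void emit_block_request(FILE *out, double arrival_ms,
                        const extent *e, int is_write)
{
    fprintf(out, "%.3f %d %ld %ld %d\n", arrival_ms, e->disk,
            e->first_block, e->n_blocks, is_write);
}
```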
5.3 DiskSim Limitations

DiskSim is a powerful tool for simulating the operation of disks in large-scale storage systems, but it has one fundamental limitation for our research purposes: it contains no energy models for disk systems. Two prior research papers have implemented power models within DiskSim [29][38]. We were fortunate that the authors of the SODA paper provided us with the source code of power models developed for an older version (version 2.0) of DiskSim. This brought us closer to our goal of implementing power models in DiskSim, but we had to adapt the energy models in SODA to a newer version of DiskSim (version 4.0), and we also had to simulate disk state transitions, which are critical to our disk energy-saving strategies.

5.4 DiskSim Modifications

Table 5.1: Key Model Variables Added to DiskSim

  Parameter               Description
  dm_power_active         Power consumption in the active state
  dm_power_idle           Power consumption in the idle state
  dm_power_standby        Power consumption in the standby state
  dm_spin_down_time       Time to transition from active/idle to standby
  dm_spin_up_time         Time to transition from standby to active/idle
  dm_spin_down_penalty    Energy required to transition from active/idle to standby
  dm_spin_up_penalty      Energy required to transition from standby to active/idle

The first area where we had to make changes was the disk model code of DiskSim; the major changes are included in Appendix A. The existing disk models did not support our energy efficiency models, so we added seven key parameters to the disk model to support our power requirements. The parameters are summarized in Table 5.1. The first parameter, dm_power_active, is how much power the disk consumes while it is busy working on a request. The second, dm_power_idle, is how much power the disk consumes when the platters are spinning but the disk is not serving a request. The next parameter, dm_power_standby, is how much power the disk consumes once it has been transitioned into the standby state and the platters have stopped. The fourth and fifth parameters, dm_spin_down_time and dm_spin_up_time, are the amounts of time it takes to spin down and spin up the disk, respectively; spinning down takes the disk from the active/idle state to the standby state, and spinning up takes it from the standby state back to the active/idle state. The last two parameters, dm_spin_down_penalty and dm_spin_up_penalty, are the energy costs of spinning the disk down and up, respectively.

DiskSim is a discrete-event simulator, so modifications had to be made to support disk state transitions. The first modification was a timer that checks whether the disks have been idle for a certain period of time. This was implemented by adding a TIMER_EXPIRED event to the event queue of DiskSim; this code is outlined in Appendix A within the disksim_power.c code that I wrote. The timer runs every 500 ms and calls the disk_update_idle function, which examines all of the disks currently being simulated and checks how long each has been idle. If a disk has been idle longer than an idle threshold value and is currently in the active/idle state, it is transitioned into the standby state. To make the state transition complete, I implemented another event type in DiskSim, named SPIN_DOWN_COMP, which is executed after the spin-down time penalty has elapsed; once this event is pulled off of the DiskSim event queue, the disk is placed into the standby state.

To wake up a disk, I had to modify code in the disksim_diskctlr.c file, which manages the disk and controls it when a read or write request reaches the disk. If a request comes in for a disk that is currently in the standby state, the wakeup_disk_sleep function in disksim_power.c is called. This function calculates how much time it will take to wake up the disk using the spin-up penalty and schedules a wake-up complete event, SPIN_UP_COMP, similar to the SPIN_DOWN_COMP handling described in the previous paragraph.
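The description above maps onto event-driven code along roughly the following lines. This is a simplified sketch rather than the actual disksim_power.c: the event names, the 500 ms timer period, and the 5-second idle threshold come from the text, while the scheduling helper and disk structure are stand-ins for DiskSim internals.

```c
#define TIMER_PERIOD_MS   500.0   /* idle-check timer fires every 500 ms    */
#define IDLE_THRESHOLD_MS 5000.0  /* 5 s idle threshold used in Section 5.5 */

enum power_state { ACTIVE_IDLE, SPINNING_DOWN, STANDBY, SPINNING_UP };

typedef struct {
    enum power_state state;
    double last_io_ms;            /* completion time of the last request    */
    double spin_down_ms;          /* dm_spin_down_time                      */
} sim_disk;

/* Stand-ins for DiskSim's clock and event queue (hypothetical). */
extern double simtime_ms;
extern void schedule_event(int type, double when_ms, sim_disk *d);
enum { TIMER_EXPIRED = 1, SPIN_DOWN_COMP, SPIN_UP_COMP };

/* Handler for TIMER_EXPIRED: start spinning down any long-idle disk,
 * then re-arm the timer. */
void disk_update_idle(sim_disk *disks, int ndisks)
{
    for (int i = 0; i < ndisks; i++) {
        sim_disk *d = &disks[i];
        if (d->state == ACTIVE_IDLE &&
            simtime_ms - d->last_io_ms >= IDLE_THRESHOLD_MS) {
            d->state = SPINNING_DOWN;
            schedule_event(SPIN_DOWN_COMP, simtime_ms + d->spin_down_ms, d);
        }
    }
    schedule_event(TIMER_EXPIRED, simtime_ms + TIMER_PERIOD_MS, NULL);
}

/* Handler for SPIN_DOWN_COMP: the platters have stopped. */
void spin_down_complete(sim_disk *d)
{
    d->state = STANDBY;
}
```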
There are also two special cases to check: first, the disk may currently be spinning up, and second, it may currently be spinning down. In the first case, we add the difference between the spin-up complete event and the current time as a delay and continue through the rest of the code. The second case requires letting the spin-down complete and then scheduling a spin-up complete event, adding the delays for both the spin-down and the spin-up to complete. All of this code is detailed in the disksim_power.c file in the appendix.

The last area that had to be modified was the accounting of the energy consumed in the various states the disk can be in. The disk is active as it is seeking, rotating, and transferring data; the code for this is contained in disksim_disk.c, and these times are passed to the activeEnergyStat function in disksim_power.c. The idle time must also be accounted for: it is gathered before a seek is found in disksim_diskctlr.c and passed to the idleEnergyStat function in disksim_power.c. The last place to check the idle time is when the simulation has finished, where we take the difference between the end of the simulation and the last I/O request for each disk in disksim_disk.c and calculate the idle energy using idleEnergyStat. Finally, we must also take care to measure the amount of time that the disk has spent in the standby state, which is tracked by the wakeup-disk functions in disksim_power.c. After the trace is finished, the total standby time is multiplied by the standby power to produce a total energy consumption figure in joules; similar calculations are made for the idle and active times, and these numbers are combined to produce the overall energy output of the disk system.
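This bookkeeping reduces to a few accumulators per disk; the following hypothetical sketch names its two entry points after the activeEnergyStat and idleEnergyStat functions mentioned above, with everything else assumed.

```c
/* Per-disk time-in-state accumulators; a sketch of the accounting above. */
typedef struct {
    double active_s, idle_s, standby_s; /* seconds spent in each state     */
    double transition_j;                /* spin-up/spin-down penalties (J) */
} energy_stats;

void activeEnergyStat(energy_stats *s, double seconds) { s->active_s += seconds; }
void idleEnergyStat(energy_stats *s, double seconds)   { s->idle_s   += seconds; }

/* End of trace: combine state times with the dm_power_* parameters. */
double total_energy(const energy_stats *s, double p_active,
                    double p_idle, double p_standby)
{
    return s->active_s * p_active + s->idle_s * p_idle
         + s->standby_s * p_standby + s->transition_j;
}
```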
5.5 Generated Results

Figure 5.2 presents the energy results we collected when running our modified version of DiskSim, and Table 5.2 presents the response times we collected.

(Figure 5.2: Energy Consumption Results of Modified DiskSim; total energy in joules for the Atlas10K, IAD200, IAD1000, and IAD5000 traces under the energy-aware and non-energy-aware approaches)

Table 5.2: DiskSim Response Time Results

  Trace      Energy Aware (ms)   Non-Energy Aware (ms)
  Atlas10k   3.97                3.97
  IAD200     2.65                2.65
  IAD1000    2.66                2.66
  IAD5000    4461                2.59

The first trace that we used was the Atlas10K trace, which is used for validating the DiskSim simulation environment. We made our changes to the disk model for the Atlas10K trace and ran the trace with the modified and unmodified versions of the code. The energy consumption and response time results were exactly the same for the energy-aware and non-energy-aware approaches. This result was expected: we noticed from the simulation results that the disk was never transitioned into the standby state, because the inter-arrival delay of requests in the trace is very low. Since the disk was never transitioned to the standby state, this validated that our changes have no negative impact on DiskSim when no energy-saving opportunities are available. The disk transitions to the standby state only after the idle threshold is reached, which is 5 seconds in our experiments, and the IAD200 and IAD1000 experiments produced no state transitions. IAD200 and IAD1000 are synthetically generated traces with the inter-arrival delay set at 200 and 1000 ms, respectively. These traces produced results similar to those for the Atlas10K trace, so we were able to validate the operation of our modified version of DiskSim in the case where energy savings are not possible.

The last results were produced using an inter-arrival delay of up to 5000 ms, which produced 147 state transitions. Unfortunately, the traces we were working with are at the block level, and the data size of requests is relatively small, from 1 to 50 blocks. This caused the disks to be transitioned with a high probability that a request would arrive while a disk was still transitioning to the standby state, which caused an enormous jump in the response time of the energy-aware approach. Using the disk parameters from Table 3.2, the time to spin down and then immediately spin up is 12.4 seconds, with an energy penalty of 148 joules. Since the disks spent relatively little time in the standby state, the energy consumption of the energy-aware approach is higher than that of the non-energy-aware approach.

5.6 Conclusion

This chapter introduced an improved simulation framework that can be used to simulate energy-aware disk systems. The framework introduced in this chapter is valuable because DiskSim allows one to test a large number of disks quickly; along with our source modifications, this allows us to rapidly prototype large-scale energy-aware storage systems. The disks must be spun down to conserve energy, and the disk model we used has a high state-transition penalty in terms of both energy and time. If the disks are not managed carefully, energy-aware approaches can quickly degrade response times and even consume more energy than non-energy-aware approaches. DiskSim focuses on block-level requests, which are generally small, and through our experimental results we have concluded that large data sizes are conducive to energy-saving approaches that place a disk in the standby state.

To remedy these deficiencies, we decided to move away from modeling and simulation and to produce a prototype energy-aware virtual file system. This eliminates any errors introduced by disk power models and by disk simulators that have to be modified to support them. Our goal was to develop a system capable of producing energy savings, and it was time to move towards a real-world implementation. A virtual file system allows us to work with larger data sizes, place data across multiple storage nodes, and implement energy-efficient strategies across all of the storage nodes.

Chapter 6

Energy Efficient Virtual File System

6.1 Introduction

Large-scale cluster storage systems are becoming ubiquitous because of the large amounts of data required by search engines, multimedia websites, and data-intensive high-performance computing [10][9]. These large-scale cluster storage systems are typically extremely inefficient in terms of energy consumption. With data centers quickly growing in scale, it is important to develop energy-efficient tools to keep the cost of operating cluster storage systems down. Improving the energy efficiency of cluster storage systems is important because storage can account for 27% of the total cost of operating a data center [16]. The demands for increased performance and storage capacity in large-scale storage systems exacerbate the high energy consumption problems associated with cluster storage systems.
A handful of novel techniques developed to conserve energy in storage systems include dynamic power management schemes [6][19], power-aware cache management strategies [40], power-aware prefetching schemes [31], software-directed power management techniques [32], redundancy techniques [27], and multi-speed settings [11][14]. These energy-saving techniques can significantly enhance the energy efficiency of disk drives under workload conditions where the idle periods between groups of disk accesses are substantial.

One fundamental drawback of cluster storage systems is that large data sets are partitioned and distributed across multiple storage nodes in a cluster, making it difficult for the storage nodes to coordinate and handle parallel data management in an energy-efficient way. A promising approach to improving the energy efficiency of cluster storage systems is to implement energy-saving techniques in file systems. In the absence of an energy-efficient cluster file system, energy conservation is commonly achieved in individual storage nodes in a non-collaborative manner. Relying on low-power storage components to save energy in clusters not only limits I/O performance, but also loses opportunities to conserve energy by considering file accesses across multiple storage nodes.

The goal of our research is to develop an energy-efficient virtual file system, called EEVFS, for large computing clusters. EEVFS is a cluster file system that processes file accesses across multiple storage nodes in a computing cluster in an energy-efficient manner. The salient features of EEVFS lie in its high energy efficiency, fast I/O processing, and scalability potential; in other words, EEVFS can provide significant energy savings for cluster storage systems while achieving high I/O performance.

EEVFS achieves high energy efficiency through the BUD disk architecture [21][28]. In the BUD architecture, each storage node contains m buffer disks and n data disks. We choose to use log disks as the buffer disks in each storage node, because data can be written onto log disks sequentially, improving the performance of the buffer disk. In most cases, the number of buffer disks m is smaller than the number of data disks n.

To fully utilize the buffer disks, we have investigated an energy-aware prefetching strategy (see [21] for details on the prefetching algorithm, called PRE-BUD) that dynamically fetches the most popular data into buffer disks, thereby allowing data disks to stay in the standby mode for long periods of time to conserve energy. We evaluated the impact of the PRE-BUD algorithm on the overall energy efficiency of the parallel disks within a storage node. The research on PRE-BUD led us to discover that file access patterns, data size, inter-arrival delays, and disk drive energy parameters combine to produce opportunities to transition hard drives into lower-power states. There are energy, performance, and reliability penalties associated with transitioning disks into the various power states; therefore, it is imperative to investigate techniques that are able to offset these penalties. It is desirable to minimize the number of state transitions to strike a balance between the energy efficiency, performance, and reliability of parallel disks.

In our previous studies on PRE-BUD, we conducted extensive simulations to estimate the performance and energy efficiency of our prefetching schemes. The simulation results show that PRE-BUD is conducive to conserving energy in parallel disks.
These findings motivate us to build the EEVFS file system, in which an energy-efficient prefetching mechanism is implemented for cluster storage systems. EEVFS keeps track of the file locations and disk states of all the storage nodes in the file system. The system architecture of EEVFS contains two different components - storage servers and storage nodes. The EEVFS architecture details are further outlined in Section 6.2.1.

Apart from high energy efficiency, EEVFS has high scalability. This scalability is possible because EEVFS coordinates a large number of storage nodes, each of which manages an array of disk drives (see Fig. 6.1). The EEVFS file system runs on storage nodes connected over a switching fabric. EEVFS is responsible for balancing the I/O load across storage nodes. The I/O load of individual disks within a storage node is balanced by the storage node component of EEVFS.

The rest of the chapter is organized as follows: We discuss in Section 6.2 the design issues of EEVFS. Section 6.3 discusses the implementation decisions of EEVFS. Before discussing the performance evaluation, we present in Section 6.4 a testbed, metrics, and important parameters. Then, Section 6.5 shows experimental results. Finally, Section 6.6 concludes the chapter and presents our future research directions.

6.2 Design

We outline in this section the design issues of the Energy-Efficient Virtual File System (EEVFS). In this study, we paid particular attention to the implementation of energy-efficient prefetching with buffer disks in EEVFS.

6.2.1 System Architecture

Like PVFS, EEVFS was designed to improve the performance of cost-effective cluster storage systems. In addition to achieving high performance, reducing energy consumption in cluster storage systems is a primary design goal of EEVFS. Fig. 6.1 illustrates the architecture of EEVFS, where nodes are divided into three main groups - compute nodes (clients), storage nodes, and a storage server. The storage server is responsible for handling incoming file requests for data reads or writes from the compute nodes. The storage server needs to determine the storage node that contains the data requested by a client. When the number of storage nodes scales up, the storage server might become a performance bottleneck; we address this issue by simplifying the functionality of the storage server. Thus, the storage server only has to manage metadata such as data location and file size. To achieve high scalability of cluster storage systems, we allow each storage node to manage (1) multiple data disks (see Section 6.2.2 below) and (2) metadata for the multiple local disks (see Section 6.3.4).

6.2.2 Data Placement

If the storage server is given prior knowledge about the popularity and access patterns of the data blocks, the server distributes the data blocks to storage nodes in a round-robin fashion based on file popularity. After the storage server has distributed the data across the storage nodes, it splits the file access patterns based on the data distribution. The server then forwards the corresponding access patterns to each storage node.

A storage node manages the states of multiple hard drives and also performs load balancing based on the file popularities determined by the storage server. As data is placed on each storage node by the storage server, the storage node places the data on its N disks in round-robin order. Since the first data placement request contains the most popular data, the second request contains the second most popular data, and so forth, the storage node load balances the data placement requests across its local data disks. The storage node receives file access pattern information from the storage server about each file that is stored within the storage node. A sketch of this placement scheme appears below.
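The following C sketch illustrates this two-level round-robin placement, assuming files are presented in descending order of popularity. The constants and function names are illustrative stand-ins, not EEVFS's actual interfaces.

    #include <stdio.h>

    #define NUM_NODES 5   /* storage nodes in the testbed */
    #define NUM_DISKS 2   /* data disks per storage node  */

    /* Round-robin placement sketch: file i (files indexed in descending
     * popularity) goes to storage node i mod NUM_NODES, and within that
     * node to the next data disk in round-robin order. Illustrative only. */
    static void place_file(int file_rank, int *node, int *disk) {
        *node = file_rank % NUM_NODES;
        *disk = (file_rank / NUM_NODES) % NUM_DISKS;
    }

    int main(void) {
        for (int rank = 0; rank < 8; rank++) {
            int node, disk;
            place_file(rank, &node, &disk);
            printf("file %d -> node %d, disk %d\n", rank, node, disk);
        }
        return 0;
    }

Because both levels advance in lock-step with popularity rank, each node, and each disk within a node, receives an even mix of popular and unpopular files.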
6.2.3 Power Management

The storage node uses the file access pattern to predict periods when each of its data disks will be idle for long stretches of time. If there are any idle periods larger than a threshold value, the storage node will transition a data disk into the standby state. The storage node uses an energy prediction model that takes into account the number of files to prefetch and the file access pattern. If there are consecutive requests for data in the predicted prefetch area, then the storage node marks points in time when the data disks should be transitioned to the standby state to conserve energy.

Within each storage node the disks are separated into two groups, namely, buffer disks and data disks. The buffer disks are responsible for holding copies of popular data from the data disks. Our goal is to keep the buffer disk active and keep the data disks as lightly loaded as possible, thereby allowing EEVFS to transition data disks into the standby state to produce energy savings. The buffer disk used in the current incarnation of EEVFS relies on the local file system to manage the buffers residing on the buffer disk. The buffer disk, of course, must constantly be available to the Linux operating system running on the storage node. It is worth noting that placing the buffer disk into the standby state is not feasible under heavy loads, because power state transitions in the buffer disk can adversely affect the performance of the storage node. If the buffer disk has any available space, the free space should be used as a write buffer area for the other data disks contained in the storage node. The idle-threshold check at the heart of this scheme is sketched below.
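As a concrete illustration of the threshold test described above, the following minimal C sketch checks a data disk's idle window against a threshold and transitions it to standby. The structure and function names are hypothetical; the real storage node also consults the energy prediction model before committing to a transition.

    #include <stdio.h>

    #define IDLE_THRESHOLD 5.0  /* seconds; the value used in our experiments */

    enum disk_state { DISK_ACTIVE, DISK_IDLE, DISK_STANDBY };

    struct data_disk {
        enum disk_state state;
        double last_access;  /* time of the last request served (s) */
    };

    /* spin_down() is a hypothetical stand-in for the real transition call. */
    static void spin_down(struct data_disk *d) { d->state = DISK_STANDBY; }

    /* Idle-threshold sketch: called periodically with the current time; a
     * disk whose idle window exceeds the threshold is sent to standby. */
    static void check_idle(struct data_disk *d, double now) {
        if (d->state == DISK_IDLE && now - d->last_access > IDLE_THRESHOLD)
            spin_down(d);
    }

    int main(void) {
        struct data_disk d = { DISK_IDLE, 0.0 };
        check_idle(&d, 6.0);  /* idle for 6 s > 5 s threshold -> standby */
        printf("state: %s\n", d.state == DISK_STANDBY ? "standby" : "active/idle");
        return 0;
    }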
6.3 Implementation

We implemented a prototype of EEVFS on a cluster storage system. The implementation uses an append-only log of requests to keep track of file access patterns, which assists the storage server in determining the need to prefetch popular files or data blocks from data disks to buffer disks. This section discusses several important implementation issues.

6.3.1 Process Flow

The process flow of EEVFS is presented in Fig. 6.2. The first step of the process is the initialization phase, which consists of the storage server connecting to all of the storage nodes in the system. The server creates a separate thread for each storage node and then establishes a TCP/IP connection to each storage node. In the second step the storage server gets popularity information from a log of file access patterns. The prototype implementation uses a trace to replay file access patterns and bases the file popularity on information gathered from traces. In step 3 files are created on the storage nodes and the server informs the storage nodes whether they should perform prefetching, which is explained further in Section 6.3.2. The storage server attempts to load balance the files among the storage nodes based on the popularity information gained from step 2. The most popular data is placed on storage node 1, the second most popular data is placed on storage node 2, and so on. The storage node also tries to load balance among the attached data disks. This is achieved because the first file creation request a storage node sees is guaranteed to contain a file that is more popular than the file contained in the second file creation request. The first file a storage node creates is then placed on the first storage disk, the second file it creates is placed on the second storage disk, and so on. In step 4 the server passes application hints to the storage nodes, which is elaborated on in Section 6.3.3. In step 5 the client requests information from a file and sends a request to the storage server node. The client cannot access any of the content on a storage node without first going through the storage server node. The storage server node knows the storage node location of a file, but does not know which data disk the file is located on or whether the file has been prefetched. The storage server passes information about the client to the storage node that contains the file, and the storage node then establishes a connection with the client and passes the data to the client, as outlined in step 6.

6.3.2 Prefetching

Prefetching is an important part of the EEVFS architecture because it gives the data disks an opportunity to see large idle windows. If a buffer disk can serve a disk request, then the corresponding data disk sees its idle window increase, as opposed to the data disk serving the request and resetting the idle window timer that each disk keeps. Our current version of prefetching is based on file access patterns, and we derive a popularity value based on the number of accesses over a given period of time. This information is passed to the storage nodes, and if they are instructed to prefetch they will place a copy of the popular data onto the buffer disk. The current version of EEVFS uses the hard drive that also runs the operating system as the buffer disk, which allows us to use an existing disk as the buffer disk. If this is not possible due to space limitations it may be possible to add an extra disk to be used as the buffer disk, but our experiments and previous simulation results indicate that many data disks are needed to amortize the energy cost of adding an extra disk.
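The core of this mechanism - rank files by trace-derived popularity, then copy the top entries to the buffer disk - can be sketched in a few lines of C. This is a minimal illustration; copy_to_buffer_disk() and the file_info structure are hypothetical stand-ins for the prototype's actual copy routine and metadata.

    #include <stdio.h>
    #include <stdlib.h>

    struct file_info {
        char name[64];
        int  access_count;  /* popularity derived from the access trace */
    };

    /* Sort files by descending access count so the most popular come first. */
    static int by_popularity(const void *a, const void *b) {
        const struct file_info *fa = a, *fb = b;
        return fb->access_count - fa->access_count;
    }

    /* Hypothetical stand-in for the actual copy onto the buffer disk. */
    static void copy_to_buffer_disk(const struct file_info *f) {
        printf("prefetching %s (%d accesses)\n", f->name, f->access_count);
    }

    /* Prefetch sketch: rank files by popularity and copy the top n_prefetch
     * of them into the buffer disk. */
    static void prefetch(struct file_info *files, int n_files, int n_prefetch) {
        qsort(files, n_files, sizeof files[0], by_popularity);
        for (int i = 0; i < n_prefetch && i < n_files; i++)
            copy_to_buffer_disk(&files[i]);
    }

    int main(void) {
        struct file_info files[] = {
            { "a.dat", 12 }, { "b.dat", 97 }, { "c.dat", 40 },
        };
        prefetch(files, 3, 2);  /* prefetch the two most popular files */
        return 0;
    }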
6.3.3 Application Hints

Assuming that the programmer of an application using EEVFS, or the creator of files in the EEVFS system, can pass information about the application that is going to be run on EEVFS, we can further improve the energy efficiency of EEVFS. The application hints are used to predict idle windows, increasing the energy efficiency of EEVFS while adding minimal delays to response time. The application hints can be extremely useful because they allow us to predict whether there are any opportunities to save energy; if there are none, then EEVFS will not place disks into the standby state. This allows EEVFS to operate in a more conservative manner, as opposed to not knowing application hints and relying solely on the idle window timers. EEVFS can operate without the application hints, but there may be situations where a request comes in immediately after the idle window threshold is reached, causing a negative impact on energy savings and response time.

6.3.4 Distributed Metadata Management

To alleviate the metadata management burden of the storage server, we distribute metadata across the storage nodes in a cluster. The storage server simply manages metadata that indicates which storage nodes contain the files that can handle requests submitted from clients. Each storage node maintains metadata that can locate files on the local disks in the node to respond to requests forwarded from the storage server. The goal of the distributed metadata management is to balance the metadata management load among the storage nodes. The storage server is unaware of the individual disks in each storage node, primarily acting as a load balancer and access point for all of the storage nodes. The storage server does not need to know any information about the exact disk location of the data within each storage node. The implementation of our metadata management subsystem can be further improved in our future studies. For example, Miller et al. developed a scalable metadata management strategy for large-scale file systems [36]. We plan to integrate their dynamic metadata management scheme into our energy-efficient cluster storage system.
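The division of labor described above amounts to a two-level lookup: the server resolves a file name to a storage node, and the node resolves the name to a local disk. The following C sketch illustrates the idea; the structures are illustrative, not EEVFS's actual metadata format.

    #include <stdio.h>
    #include <string.h>

    /* Two-level metadata sketch: the server table maps files to storage
     * nodes; each node keeps its own table mapping files to local disks. */
    struct map_entry { char file[64]; int where; };

    static int lookup(const struct map_entry *t, int n, const char *f) {
        for (int i = 0; i < n; i++)
            if (strcmp(t[i].file, f) == 0)
                return t[i].where;
        return -1;
    }

    int main(void) {
        struct map_entry server_md[] = { { "movie.mp4", 2 } };  /* file -> node */
        struct map_entry node2_md[]  = { { "movie.mp4", 1 } };  /* file -> disk */

        /* The server resolves only the node; the node resolves the disk. */
        int node = lookup(server_md, 1, "movie.mp4");
        int disk = lookup(node2_md, 1, "movie.mp4");
        printf("movie.mp4 -> node %d, disk %d\n", node, disk);
        return 0;
    }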
6.4 Evaluation Methodology

6.4.1 Testbed

We built a cluster storage system serving as a testbed to evaluate the energy efficiency and performance of the energy-efficient prefetching mechanism implemented in EEVFS. This cluster storage system was configured as follows at the time we conducted the experiments (see Table 6.1 for details on the configuration of the testbed).

    Parameter                     Storage Server Node    Storage Node
    CPU Type and Clock Speed      Celeron 2.2 GHz        Celeron 2.2 GHz
    Memory (MB)                   2000                   2000
    Network Interconnect (Mb/s)   1000                   1000
    Disk Type                     SATA                   SATA
    Disk Capacity                 160 Gbytes             480 Gbytes
    Disk Bandwidth                126 MB/s               126 MB/s

    Table 6.1: Configuration of the Testbed Nodes.

In our cluster storage system, there is one server node and five storage nodes. The server node has a 2.2 GHz Celeron processor, 2 Gbytes of RAM, a 1 Gbit/s Intel EtherExpress Pro Fast-Ethernet network card, and a 160 GB SATA disk. Each storage node has a 2.2 GHz Celeron processor, 2 Gbytes of RAM, a 1 Gbit/s Fast-Ethernet network card, and three 160 GB SATA disks. One disk is used as the operating system host and as the buffer disk, while the other two disks are used as data disks. All the nodes were running Linux 2.4.20 rather than Linux 2.6, because we experienced disk transition inconsistencies running recent Linux 2.6 kernels.

6.4.2 System and Workload Parameters

For the collection of our experimental results we focused on varying five key system and workload parameters (see Table 6.2) that noticeably affect both the energy efficiency and the performance of the EEVFS system. These five parameters are: (1) average data size, (2) file access popularity (i.e., the MU value), (3) inter-arrival delay (i.e., arrival rate), (4) number of files to prefetch, and (5) disk idle threshold. In what follows, we describe the parameters summarized in Table 6.2.

• Data Size. We conducted extensive experiments using both real-world traces (see Section 6.5.4) and synthetic file traces (see Sections 6.5.1 - 6.5.3). For synthetic file traces, the mean data size of files is varied from 1 MB to 50 MB. When it comes to real-world traces, data sizes are obtained from the traces.

• File Popularity Rate - The MU Value. The second parameter that we chose to vary is the MU value of the Poisson distribution of file requests that are fed into the storage server. This value was varied from 1 to 1000, with 1 skewing the file access patterns towards a small number of files and 1000 spreading out the distribution of files accessed.

• Arrival Rate. For synthetic traces, we used the inter-arrival delay to represent the arrival rate of file requests submitted from applications to the cluster storage system. We used four different synthetic workload scenarios by varying the inter-arrival delay of the file requests. We added 0 to 1000 ms of inter-arrival delay between requests to represent heavier to lighter loads, respectively. Note that we also set a default inter-arrival delay of 700 ms to keep our queue and our response times from growing too large for the energy-aware and non-energy-aware comparisons.

• Number of Files to Prefetch. The fourth parameter that we varied is the number of files to prefetch, which we varied from 10 to 100. The total number of files in our test file system is 1000. EEVFS with the prefetching flag set is represented as PF in the figures, and NPF represents EEVFS without prefetching.

• Disk Idle Threshold. If a disk sits idle for a certain period of time (i.e., the disk idle threshold), the disk is switched to the standby mode to conserve energy.

    Parameter                              Values
    Data Size (MB)                         1, 10, 25, 50
    File Popularity Rate - The MU Value    1, 10, 100, 1000
    Inter-arrival Delay (ms)               0, 350, 700, 1000
    Number of Files to Prefetch            10, 40, 70, 100
    Disk Idle Threshold (sec)              5

    Table 6.2: System and Workload Parameters.

6.4.3 Metrics

To quantify the energy efficiency improvement and the performance impact of our prefetching scheme, we used the following three metrics in the experimental evaluation.

• Energy Savings (see Section 6.5.1). We compared the energy consumption of the cluster storage system with the energy-efficient prefetching mechanism against that of the same system without the prefetching mechanism.

• Number of Power State Transitions (see Section 6.5.2). The total number of power state transitions closely reflects the overhead incurred by switching the power state of the disks between the active and standby modes. We evaluated the impact of the workload parameters on the overhead introduced by power state transitions.

• Response Time (see Section 6.5.3). Our energy-efficient prefetching mechanism aims to improve the energy efficiency of cluster storage systems while minimizing performance penalties. We measured the performance penalties caused by the prefetching mechanism in terms of the increase in response time.

6.5 Experimental Results

6.5.1 Energy Savings

From Figure 6.3 we discover that EEVFS with prefetching significantly improves the energy efficiency of the disk system. Power measurements were collected from the individual storage nodes running the experiments and combined for our results. Looking at Figure 6.3, which varies the data size used for the experiment, we see that larger data sizes produce larger energy efficiency gains. If the data size is 1 MB we produce an 11% energy efficiency gain, and when the data size is 50 MB the energy savings produced is 15%. The other interesting thing to note about the data size experiments is that the overall energy consumption of EEVFS with both PF and NPF significantly increases when the data size is 50 MB. This happens because our default inter-arrival delay of 700 ms is too low; the queue on the storage nodes becomes quite large and the test runs longer than the original trace time, causing the overall energy consumption to increase. Even though the test ran longer for both the PF and NPF cases, the energy efficiency gain produced by EEVFS with PF was still the largest for 50 MB.
For the data size experiments MU was fixed at 1000, the number of files to prefetch was 70, and the inter-arrival delay was set at 700 ms.

Figure 6.4 shows the impact of the popularity rate (i.e., the MU value) on the energy efficiency of the cluster storage system. From this figure we see that larger MU values produce smaller energy efficiency gains. This is caused by the fact that many files are requested in the trace, and the probability that the data required by the trace will have been prefetched is smaller with larger values of MU. Using our default prefetch size of 70 files we are able to produce the same amount of energy savings whenever MU is 100 or smaller. The reason for this similarity in energy consumption is that when MU is 100 or smaller EEVFS is able to prefetch all of the required data, puts the disks into the standby state at the beginning of the trace execution, and is able to keep the disks in the standby state for the entirety of the trace. For the MU experiments the data size was fixed at 10 MB, the number of files to prefetch was 70, and the inter-arrival delay was set at 700 ms.

Figure 6.5 reveals the impact of the inter-arrival delay on energy efficiency. The results plotted in Figure 6.5 indicate that we are able to produce larger energy efficiency gains when the inter-arrival delay is increased. Intuitively this makes sense, because large inter-arrival delays produce lighter workloads. Light workloads generally produce more opportunities for the data disks to be placed into the standby state. The interesting thing to note is that the overall energy efficiency levels off around the 700 ms inter-arrival delay value. When the inter-arrival delay is increased to 1000 ms we actually see a small decrease in energy efficiency. This is likely caused by the fact that we put a disk to sleep as a particular request enters the storage node; if the requests are spaced further apart, it takes slightly longer for the disks to transition to the standby state. For the inter-arrival experiments the data size was fixed at 10 MB, the number of files to prefetch was 70, and MU was set to 1000.

Figure 6.6 evaluates the effect of the number of files prefetched into a buffer disk. In this experiment, the data size was fixed at 10 MB, the inter-arrival delay was 700 ms, and MU was set to 1000. The results show that as the number of prefetched files is increased, EEVFS produces larger energy savings. When the number of files to prefetch is 10 (i.e., 1% of the total files in a storage node), our prefetching strategy can only improve energy efficiency by 3%. This result is expected because larger amounts of prefetched data increase the chance that EEVFS is able to serve a request from the buffer disk. Once the number of files to prefetch is increased to 40 and above, the prefetching mechanism provides significant energy savings, due to the fact that a vast majority of requests can be served by the buffer disk.

6.5.2 Power State Transitions

Figure 6.7 displays the total number of state transitions for the data size experiment. For the data size experiments we notice that the number of state transitions decreases as the data size is increased. This result confirms that EEVFS can place the data disks into the standby state fewer times and for longer periods of time. This is intuitive because increasing the data size causes each request to be served for longer, and consecutive hits in the buffer disk produce large idle windows for the data disks.
Looking at Figure 6.7 it is interesting to note that no state transitions would produce no energy savings, so there is a balance between energy efficiency and the number of state transitions. We aim to minimize the number of disk spin-up operations while still being able to place the disks in the standby state for optimal energy savings. Increasing the inter-arrival delay produces a pattern similar to the data size experiments: the number of state transitions decreases as the inter-arrival delay is increased. Similarly, as the energy savings increase the number of state transitions decreases, because larger inter-arrival delays produce lighter loads for the data disks in the storage node.

Figures 6.8 and 6.10 show that the numbers of state transitions produced by varying MU and the number of files prefetched are very similar, because both produce a situation where the data disks are transitioned to the standby state for the entire trace. This occurs for small values of MU and for larger values of the number of files to prefetch. The interesting thing to note is that when the number of prefetched files is 10, this case produces the largest number of state transitions of all the tests, 447. This same case also produced the smallest energy savings, with only a 3% increase in energy efficiency. This small amount of energy savings may not be worth the stress put on the hard drives by the large number of state changes. The idle threshold can be increased to prevent disks from transitioning frequently while producing only a small amount of energy savings.

6.5.3 Response Times

Now we analyze the performance penalties caused by the prefetching mechanism. Figures 6.11 - 6.14 illustrate the increase in response time due to prefetching. The results collected concerning MU and the number of files to prefetch represent the two special cases indicated in the state transition results explanation. When the disks are able to stay in the standby state the entire time, there is virtually no response time penalty. This is because the response time penalties are generally a product of state transitions. If the number of state transitions rises, it also causes response time degradation. This is mainly due to the spin-up operations, which average around 2 seconds for the disks used in our experiments.

Figures 6.11 and 6.13 show the effect of data size and inter-arrival delay on response time. From these two figures we can deduce that there is a linear relationship between the response time of the cluster storage system with prefetching and without prefetching. This is promising because it shows that there is a tolerable response time penalty for producing energy efficiency gains.

The response times for the data size of 50 MB were omitted because they were much larger than the other values, owing to the large amount of queuing that took place on the storage server node. As the data size is increased we see smaller response time penalties. For the data size of 1 MB we have a 121% increase in response time, from 120 ms to 265 ms, but for the largest data size we produced only a 4% increase in response time. We believe that the number of state transitions is closely related to the response time penalty, and it is interesting to compare the results in Figures 6.11 - 6.14 with those in Figures 6.7 - 6.10.
The inter-arrival delay response time pattern closely follows that of the data size, which is similar to the pattern in the results of the previous sections. As the inter-arrival delay is increased, the response time decreases for both the prefetching and non-prefetching versions of EEVFS. The response time degradation is 31% for the smallest inter-arrival delay value and 16% for the largest. There is a response time anomaly when the inter-arrival delay is 700 ms: the response time degradation is 37% at this point, the largest for the inter-arrival delay experiments. This performance degradation could be caused by the fact that the storage nodes attempt to predict idle window periods that are as large as possible, but these predictions are not guaranteed to be optimal. This might have produced a situation where the wake-up transitions were skewed towards a disk that takes a longer time to transition from the standby to the active/idle state.

6.5.4 Berkeley Web Trace Energy Consumption

The final figure we produced, Figure 6.15, presents results for a trace taken from the Berkeley file-system trace collection project [7]. The particular trace we used was a section of the web trace collection. For this experiment we set the data size to 10 MB and kept the number of prefetched files at 70. The file access patterns were taken directly from the web trace collection, but we modified the data size and the inter-arrival delay of requests to prevent a large amount of queuing on the storage server. Based on the results in Figure 6.15, we were able to produce a 17% energy efficiency improvement when prefetching was enabled in EEVFS. This is near the maximum that we expect our current testbed to produce using EEVFS. After investigating the Berkeley web trace, we discovered that we were able to place all of the data disks in the standby state for the entirety of the trace. The web trace appeared to be skewed towards a smaller subset of data, but we were unable to find out how many files were contained in their file system.

6.6 Conclusion & Future Work

In this chapter we introduced EEVFS - an energy-efficient virtual file system. Based on our experimental results we conclude that EEVFS can boost the energy efficiency of storage systems by more than 17%. We believe this number will increase as more disks are added to each EEVFS storage node. Although we were unable to test this theory using our existing testbed, we tested it using models and simulation. We evaluated energy efficiency and performance as functions of the data size of files, the popularity rate (i.e., the MU value), the inter-arrival delay, and the number of files to prefetch. The metrics used in each experiment were energy consumption, the number of power state transitions, and response time. Our experimental results confirm that EEVFS is conducive to saving energy with a tolerable impact on the response time of disk requests.

For future work we intend to develop EEVFS into production-grade software. We have investigated two approaches to improving EEVFS. The first approach involves extending PVFS to handle our energy management strategies. The second is to extend the source code of EEVFS to make it robust. We also plan to investigate striping techniques within EEVFS that can help improve the performance of EEVFS while still maintaining energy savings.
EEVFS is a distributed file system, and we intend to investigate the performance of EEVFS in a large-scale distributed environment.

Figure 6.1: Architecture of EEVFS. The storage server manages metadata (e.g., data location and file size). Each storage node manages multiple disks, which are separated into two groups: buffer disks and data disks. Client nodes can directly access storage servers through the network interconnect.

Figure 6.2: EEVFS Process Flow Chart. Step 1: initialization phase; Step 2: storage server generates file popularity; Step 3: prefetch popular files from data disks to a buffer disk; Step 4: applications provide access hints; Step 5: applications submit file requests; Step 6: storage nodes return data to applications running on compute nodes (i.e., clients).

Figure 6.3: Data Size Varied (energy in Joules, PF vs. NPF).
Figure 6.4: MU Varied (energy in Joules, PF vs. NPF).
Figure 6.5: Inter-arrival Delay Varied (energy in Joules, PF vs. NPF).
Figure 6.6: # of Prefetched Files Varied (energy in Joules, PF vs. NPF).
Figure 6.7: Data Size Varied (total state transitions).
Figure 6.8: MU Varied (total state transitions).
Figure 6.9: Inter-arrival Delay Varied (total state transitions).
Figure 6.10: # of Prefetched Files Varied (total state transitions).
Figure 6.11: Data Size Varied (response time in seconds, PF vs. NPF).
Figure 6.12: MU Varied (response time in seconds, PF vs. NPF).
Figure 6.13: Inter-arrival Delay Varied (response time in seconds, PF vs. NPF).
Figure 6.14: # of Prefetched Files Varied (response time in seconds, PF vs. NPF).
Figure 6.15: Energy consumption of the tested cluster storage system when the Berkeley Web Trace is considered (PF vs. NPF).

Chapter 7
Parallel Striping Groups for Energy Efficiency

This chapter leverages the work completed in Chapter 6, particularly the Energy-Efficient Virtual File System (EEVFS). To improve the performance of EEVFS we introduce in this chapter the novel idea of parallel striping groups. We break storage nodes into groups and stripe files across the groups. This allows us to control the aggregated bandwidth of each group of storage nodes, which can be matched to the bottleneck of the system. Each group of disks also has buffer disks, which achieve energy savings, while the striping groups help improve the performance of the storage system.

7.1 Parallel Striping Groups

Our first implementation of the EEVFS prototype placed a single file on a single disk and did not support striping.
Striping data across multiple disks is an important idea that has helped to improve the performance of parallel disk systems. Striping allows us to improve the performance of EEVFS while maintaining comparable energy efficiency. It is important to balance the performance and the energy efficiency of the file system, because one usually comes at the expense of the other. Figure 6.1 provides the high-level architecture of the EEVFS system; details can be found in Chapter 6.

Figure 7.1 demonstrates the first level of our parallel striping architecture. Disks are grouped into striping groups, so that a file is striped across the data disks of one striping group. We have chosen to implement these striping groups for two main reasons. First, striping the data across multiple disks improves the performance of the disk system. Second, we split the disks into groups to tune the performance of each striping group to the network performance of the particular installation. One could stripe a file across all of the data disks in the disk system, but one may then encounter a bottleneck in the disk system interconnect. The striping groups are intended to be matched to a particular disk system's interconnect, which allows multiple striping groups to respond to requests at the same time.

Figure 7.1: Parallel Striping Groups.

Inside each storage node we have multiple disks, which are broken into two different groups, data disks and buffer disks, as shown in Figure 7.2. Data disks are responsible for storing the files and their associated data. They always hold a copy of a file, and this copy only disappears if the user deletes the file. When a file is created on a node, EEVFS places data on the data disks in a round-robin fashion. This is intended to balance the load across the data disks. The second type of disk on an EEVFS node is the buffer disk. The buffer disk is used to store a copy of popular data from the data disks to help improve the energy efficiency of the disk system. If a buffer disk has a copy of a particular file and a request is made for the file, the buffer disk serves the request. The corresponding data disk never sees the request, allowing the data disk to sit idle for a longer period of time. If a data disk is idle for a certain period of time, the idle threshold, then the data disk is transitioned into the sleep state. One has to select the idle threshold carefully: if it is too low, then the disks will spin down too frequently, causing performance degradation and increased energy consumption. If the idle threshold is too high, then the data disks will never transition to the sleep state and EEVFS will be unable to produce any energy savings.

Figure 7.2: Data Striping Within a Group.
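The mapping from stripe units to disks within a group can be sketched concisely. The following C fragment is a minimal illustration under assumed constants (two nodes per group, two data disks per node, loosely mirroring Figure 7.2); the layout function is one plausible round-robin placement, not EEVFS's exact placement code.

    #include <stdio.h>

    #define NODES_PER_GROUP 2
    #define DISKS_PER_NODE  2

    /* Striping sketch: stripe unit k of a file is mapped round-robin
     * across the data disks of the file's striping group. */
    static void map_stripe(int k, int *node, int *disk) {
        int disks_in_group = NODES_PER_GROUP * DISKS_PER_NODE;
        int slot = k % disks_in_group;
        *node = slot % NODES_PER_GROUP;  /* alternate nodes first ...     */
        *disk = slot / NODES_PER_GROUP;  /* ... then advance to next disk */
    }

    int main(void) {
        for (int k = 0; k < 8; k++) {
            int node, disk;
            map_stripe(k, &node, &disk);
            printf("stripe %d -> node %d, disk %d\n", k, node, disk);
        }
        return 0;
    }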
7.2 EEVFS

To support parallel striping groups and our energy efficiency strategies we have created the Energy-Efficient Virtual File System (EEVFS). We believe a virtual file system implementation is important for several reasons. In the course of our research we realized that it is not practical to implement our strategies at the disk block level, because it is hard to track the popularity of blocks when they can hold different data depending on how they are managed by higher levels of the operating system. This led us to investigate file systems, and we quickly realized that it is not feasible to develop a new file system from scratch that works with the Linux kernel. After investigating virtual file systems, such as the Parallel Virtual File System (PVFS), we decided that a prototype virtual file system would be the most effective way to implement our energy efficiency strategies. A virtual file system has several properties that led us to choose this method of implementation. First, the virtual file system code is independent of the kernel, which simplifies some aspects of our implementation and also makes the code more portable. Another major benefit of a virtual file system is that it is independent of the underlying file system used on the Linux system. A user can choose the Linux file system type to use for the disk system, which allows more flexibility for implementers of our strategy.

The EEVFS system is split into two groups of nodes, server nodes and storage nodes. The server node is in charge of presenting a user with a single point of access to the data on all of the EEVFS storage nodes. The server node handles all operations associated with the data on the storage nodes. It is the single metadata manager for the files in the virtual file system, and in its current form it represents a single point of failure. We have recognized this shortcoming and plan to distribute the metadata across several nodes in future generations of EEVFS. Since we currently use a trace-driven method to test the performance and energy efficiency of parallel striping groups in EEVFS, the server node also plays the trace and sends requests to the EEVFS server process.

The server node is in charge of placing data among all of the storage nodes that are connected to the server. The current incarnation of the EEVFS server attempts to derive a popularity value for files based on information collected from traces. After the popularity values are created, the EEVFS server begins sending out file creation requests to the storage nodes. The server starts a thread for each storage node and places a file creation request into n threads, where n is the stripe size being used in EEVFS; a sketch of this fan-out appears below. EEVFS attempts to load balance the files by placing the most popular file on the first striping group, the second most popular data on the second striping group, and so on.
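A minimal sketch of this fan-out is shown below, assuming POSIX threads; in the prototype each thread holds a TCP/IP connection to its storage node, which is replaced here by a simple print statement. The names are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define STRIPE_SIZE 2  /* storage nodes per striping group */

    struct create_req { int node_id; const char *file; };

    /* Per-node worker sketch: in EEVFS each storage node has its own
     * server thread; here the "request" is just printed. */
    static void *send_create(void *arg) {
        struct create_req *req = arg;
        printf("node %d: create stripe of %s\n", req->node_id, req->file);
        return NULL;
    }

    int main(void) {
        pthread_t tid[STRIPE_SIZE];
        struct create_req reqs[STRIPE_SIZE];

        /* Fan a single file-creation request out to n = STRIPE_SIZE threads. */
        for (int i = 0; i < STRIPE_SIZE; i++) {
            reqs[i].node_id = i;
            reqs[i].file = "popular.dat";
            pthread_create(&tid[i], NULL, send_create, &reqs[i]);
        }
        for (int i = 0; i < STRIPE_SIZE; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }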
The storage nodes in EEVFS are in charge of managing the data disks attached to them. The storage node manages the energy states of the disks and attempts to place idle disks into the standby state to conserve energy. EEVFS storage nodes maintain an idle threshold value that is compared against the idle time of each disk; if the idle time becomes larger than the idle threshold, then the storage node places the disk in the standby state. Placing a disk in the standby state is time consuming and causes extra energy to be consumed once the disk must be woken up. The idle threshold must be chosen carefully, because an aggressive idle threshold will transition disks too frequently, causing a significant performance degradation and also causing the disk system to consume more energy.

The storage node also attempts to load balance file creation requests among the disks contained in the storage node. Since the EEVFS server sends out file creation requests in order of popularity, we can guarantee that the first file creation request seen on an EEVFS storage node is at least as popular as the second file creation request. The storage node then places the first file creation request on the first disk, the second file creation request on the second disk, and so on. This allows the storage node to load balance the files among its data disks.

All of the storage nodes also contain a buffer disk, which is used to hold copies of popular data spread among the storage disks of a particular storage node. The buffer disk in our particular setup is the same disk that runs the operating system, which happens to have space available. If space is not available it may be worthwhile to add an extra disk to be used as the buffer disk, but the energy cost must be offset by the number of data disks that can be placed into the standby state when they are not being used. Our testbed allows us to use only 3 data disks in each storage node, so the buffer disk must also run the operating system for our implementation to be practical. If the server node indicates that the storage node should prefetch popular data, then the storage node also places data on the buffer disk. This is done to provide the data disks with the illusion of longer idle periods: if a data request can be served by a buffer disk, then the data disk remains idle, whereas if the buffer disk were not present, the data disk's idle counter would be reset. Longer idle periods provide larger energy savings. The storage node is in charge of data placement and management of the disks, and shields the storage server from these details. The storage server only needs to know which nodes contain which data; it is not concerned with the states of the disks in the disk system or with which disk holds a file within a storage node.

7.3 Experimental Results

    Parameter                     Storage Server Node    Storage Node
    CPU Type and Clock Speed      Celeron 2.2 GHz        Celeron 2.2 GHz
    Memory (MB)                   2000                   2000
    Network Interconnect (Mb/s)   1000                   1000
    Disk Type                     SATA                   SATA
    Disk Capacity                 160 Gbytes             480 Gbytes
    Disk Bandwidth                126 MB/s               126 MB/s

    Table 7.1: Configuration of the Testbed Nodes.

The experimental results were collected on 6 PCs whose performance characteristics are outlined in Table 7.1. The PCs were all connected to the same Gigabit switch, with one node serving as the EEVFS storage server and the other 5 nodes as EEVFS storage nodes. The experimental results focus on varying four key parameters: the striping size of the data, the inter-arrival delay of the requests, the number of files to prefetch, and the MU value of the Poisson distribution used to generate our synthetic traces. In addition, we have included a comparison against EEVFS without striping groups. Response time graphs are also included to compare the effect that prefetching has on the performance of the disk system. Finally, we present results for our system when we use a real-world trace based on the Berkeley web trace collection.

Figure 7.3: Energy Efficiency vs. Data Size.

7.3.1 Impact of Data Size

The results in Figure 7.3 allow us to make some very interesting observations.
Using our previous modeling and simulation techniques, we deduced that as the data size increases our prefetching techniques produce larger energy efficiency gains. This is caused by the fact that if the data size is larger, any particular disk has to spend extra time reading or writing the file. If it takes longer to process requests for a file, it follows that keeping a copy of the data on the buffer disk allows the data disk to sleep for longer periods of time. This intuitive explanation applies to the 1 MB, 10 MB, and 50 MB data size results, which produce large energy efficiency gains as the data size is increased. The interesting case occurs when we used 25 MB for the data size. In this case our prefetching strategy actually causes the system to consume more energy than not prefetching data. At first we thought this was an anomaly, and we re-ran the experiment to see if the results would change. We were surprised to find that the 25 MB experiment consistently showed our prefetching approach expending more energy than the non-prefetching approach. We took a close look at our output and realized that the experiment was actually running considerably longer for the 25 MB case. This was due to the increased transition periods caused by parameters in EEVFS.

The idle threshold of 5 seconds was too aggressive for this particular set of experimental parameters and was causing requests to be delayed too often by spin-up operations. This hurts our energy consumption total twice over: the longer the experiment runs, the larger the total energy consumption, and spin-up operations themselves consume a significant amount of energy. The 5-second threshold worked well for the rest of our experiments, so we kept this value. If the threshold is increased we could see smaller energy efficiency gains, but it is always the safer choice to increase the idle threshold, even though you may not see any energy efficiency gains.

7.3.2 Impact of the Number of Files Prefetched

The second parameter that we varied is the number of files to prefetch; these results are presented in Figure 7.4. Again, based on our previous models and simulations, we expected the energy efficiency to increase as the number of files prefetched was increased. If you increase the number of files that are prefetched, you also increase the probability that the data disks will be placed into the standby state. This is because the buffer disk is more likely to hold the file for a given data request if it holds a larger number of files. Our experiments confirmed our hypothesis: as the number of prefetched files was increased, the energy efficiency of our disk system increased.

Figure 7.4: Energy Efficiency vs. # of Files Prefetched.

The 50-file prefetch size combined with our idle threshold of 5 s caused the prefetching approach to consume slightly more energy. Once the prefetch size was increased beyond this number we saw a fairly consistent amount of energy consumed. This was because our buffer disk was able to cache a significant portion of the files accessed in the trace, causing the data disks to sleep for nearly the entire trace. There are slight variations among the larger prefetch sizes, but these are caused by variation in the traces that were generated.
7.3.3 Impact of the Inter-Arrival Delay

Figure 7.5 presents the results collected while varying the inter-arrival delay of requests from 0 ms to 1000 ms. This test was performed to demonstrate how our system behaves as the workload changes, with 0 ms representing a heavy workload and 1000 ms representing a light workload. Again, using results collected from models and simulation, we predicted that as the workload lightened the energy efficiency of our system would increase. This is because a light load produces more opportunities for transitioning to the standby state, and also produces safer opportunities for doing so. For the purposes of our experiments, a safe transition is one that allows the disk to stay in the standby state long after it is placed there, as opposed to a transition to the standby state that is immediately followed by a data request, which forces the disk to be transitioned right back to the active state. As predicted, the energy efficiency of the disk system increases as the inter-arrival delay is increased. When the inter-arrival delay is 0 ms, the energy efficiency of the disk system actually decreases. The non-energy-aware approach has no opportunities to transition the disks into the standby state, which keeps the disks busy for the entire trace. In the prefetching case the data disks have short idle windows, and our 5-second idle window proves to be too small for the 0 ms inter-arrival delay.

Figure 7.5: Energy Efficiency vs. Inter-Arrival Delay.

7.3.4 Impact of MU

Figure 7.6 varies the value of MU, a parameter of the Poisson distribution used to generate the synthetic traces. Specifically, MU is used to determine which files are accessed, with a low value of MU skewing many of the requests in the trace towards a small group of files. If a small value of MU is used, one expects the prefetching scheme to produce significant energy efficiency gains, because most of the trace will be serviced from a buffer disk. Our experimental results confirm this trend, with the MU value of 5 producing the largest energy efficiency gain and the MU value of 5000 producing the smallest.

Figure 7.6: Energy Efficiency vs. MU.

7.3.5 Striping vs. Non-Striping Comparison

    Parameter                 Striping     No Striping
    Energy Consumption (J)    2,088,113    2,100,243
    Response Time (s)         2.78         13.87

    Table 7.2: Results of Striping vs. No Striping.

Table 7.2 shows that striping can significantly improve the response time of a request and also the energy efficiency of the disk system. For these experiments the file size was set at 250 MB, the inter-arrival delay was set at 700 ms, and the number of files to prefetch was set to 70 out of 5000 files. As expected, the striping groups significantly outperformed the non-striping approach. Using striping we are able to break a file into 5 separate 50 MB chunks of data, as opposed to trying to read 250 MB from a single node. In our experiments we had one storage server which would read the data simultaneously from the storage nodes. This could be improved by allowing multiple readers to connect to multiple writers to further improve bandwidth utilization. The results are expected, because striping has been used extensively to improve the performance of disk systems; we have added buffer disks to improve the energy efficiency and to balance the performance and energy efficiency of the storage system.
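Before turning to the real-world trace, it is worth illustrating how the MU parameter shapes the synthetic traces used in Sections 7.3.1 - 7.3.4. The following C sketch is one plausible generator, assuming file indices are drawn directly from a Poisson distribution with mean MU; the exact mapping is not spelled out here, so this is an illustration rather than the generator we actually used.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Knuth's Poisson sampler; adequate for modest MU values like the
     * ones sketched here (it is not suitable for very large MU). */
    static int poisson(double mu) {
        double limit = exp(-mu), p = 1.0;
        int k = 0;
        do {
            k++;
            p *= (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* u in (0,1) */
        } while (p > limit);
        return k - 1;
    }

    int main(void) {
        srand(42);
        /* A small MU concentrates requests on a small set of low-numbered
         * files; a large MU spreads the accesses out. */
        for (int i = 0; i < 10; i++)
            printf("request file %d\n", poisson(10.0));
        return 0;
    }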
7.3.6 Berkeley Web Trace Results

To improve the validity of our experimental results we conducted experiments based on the Berkeley web trace; the results are presented in Figure 7.7. We extracted all of the read operations from the trace and realized that the workload was too light, so that our energy-efficient strategy was guaranteed to produce the largest energy efficiency gains. To make the workload heavier we took the file access pattern from the Berkeley web trace and created two file accesses every second. This increased the workload of the trace to help demonstrate that our strategies are able to produce energy savings under heavier workloads. The first time we ran the trace we realized that our prefetching strategy was causing a decrease in energy efficiency. There were two main factors causing this problem: our prefetch size was set to 20 files, and the break-even time of 5 seconds was too aggressive. We decided to increase the prefetch size to 50 files and to change the break-even time to 10 seconds, which is a more conservative value. Once we re-ran the test we produced Figure 7.7, which shows that our strategy is able to produce a 4.4% energy efficiency increase.

Figure 7.7: Berkeley Web Trace Results.

7.3.7 Chapter Conclusion

This chapter leveraged EEVFS and implemented the idea of parallel striping groups for energy efficiency. This novel idea helps improve the energy efficiency of the storage system while maintaining performance. Parallel striping groups were designed with the intent of matching the performance of the nearest bottleneck, focusing on the efficient use of resources. In addition, the parallel striping groups leverage the concept of buffer disks, which has been studied extensively in this dissertation. This requires one to coordinate the placement and striping of files across many storage nodes, which is aided by the groundwork laid with EEVFS. Parallel striping groups are an extension to EEVFS that helps improve its performance while maintaining energy savings. In this chapter we tested EEVFS with parallel striping groups against striping with no energy efficiency strategy, and we were able to produce energy savings with a minimal impact on the response times of the storage system.

For future work we need to extend EEVFS to be a production-grade system. We currently have a limited number of storage nodes, and parallel striping groups need to be implemented in a large-scale storage system to fully test EEVFS. This work also needs to be extended by allowing multiple EEVFS storage nodes to connect to multiple clients to move data as quickly as possible. The EEVFS storage server node is currently a bottleneck of the system, and we need to update the architecture so that the storage server node can connect clients directly to storage nodes.

Chapter 8
Conclusions and Future Work

In this dissertation, we have investigated techniques to conserve energy in large-scale storage systems. This work was first developed using mathematical models, which were then used to build simulators to test many parameters. Using our simulation results we determined that prefetching data can help conserve a significant amount of energy in disk systems. Based on this conclusion we investigated a means to implement our strategies in a real-world system. Some of our algorithms had to be modified for the implementation phase of the work. Our implementation required a means of tracking file access patterns, so we developed the Energy Efficient Virtual File System.
After this work was completed we used our virtual file system to collect extensive experimental results. In the course of developing EEVFS we also developed a novel striping strategy that is conducive to increased energy efficiency and performance. This chapter is organized as follows: Section 8.1 highlights the main contributions of the research presented in this dissertation, and Section 8.2 concentrates on future research directions based on the work contained in this dissertation.

8.1 Main Contributions

This dissertation introduced four main contributions that aim to achieve energy-efficient storage systems while producing minimal response time degradation. The first contribution is the introduction of a model and simulation technique that allowed us to investigate how data size, file access patterns, and disk parameters impact techniques to conserve energy. To conserve energy, disks must be placed in the standby state, which can be costly both in terms of time and energy, and this penalty must be offset by a long period of sleep time.

After we conducted extensive simulations using the models that we developed, we began investigating a method to improve the validity of our results. Modeling and simulation of disk drives is a complex task that can be aided by the use of an existing validated disk simulator, DiskSim. DiskSim is a useful piece of software, but it lacked the power models and state transitions that were required for our energy-saving techniques. To remedy this deficiency I wrote extensions to the DiskSim simulator to support our energy-efficient strategies. This helped improve our simulation results, and we then decided to work on an implementation of our strategies.

The third main contribution is the Energy Efficient Virtual File System (EEVFS), which manages the location of files and the states of multiple disks connected by a network interconnect. Based on the modeling and simulation that we conducted, we concluded that large file systems and predictable access patterns are conducive to energy savings. Prime examples that could benefit from our work include web servers that are primarily used for reads and whose access patterns can be determined. A service like YouTube fits this profile, and many other web servers have the properties that promote the use of our energy-saving principles.

The fourth main contribution of this dissertation is the concept of parallel striping groups for energy efficiency. This work leverages the backbone of EEVFS, but adds striping to help improve the performance of EEVFS. Striping groups can be matched to performance bottlenecks, which makes efficient use of resources, and the buffer disks used also help to lower energy consumption through the use of data placement.

8.2 Future Work

All of the techniques mentioned in this dissertation leverage prefetching to improve the energy efficiency of disk systems. The current work focuses on static prefetching of data, and all of these techniques could be improved with dynamic prefetching. This is a difficult problem, because prefetching is a costly operation that can negatively impact the energy efficiency and the response time of storage systems.

Another area for improvement is the fact that our current implementation of EEVFS is a prototype, and work must be completed to bring the prototype to production grade.
There are two main methods to achieve this goal: further develop our prototype, or leverage an existing virtual file system such as PVFS (the Parallel Virtual File System). I believe using a well-developed virtual file system is the only method available to a small research group, so I have begun investigating PVFS. During the preliminary investigation of PVFS it was realized that the open-source nature of PVFS would allow us to make changes to support the ideas presented in this dissertation.

Bibliography

[1] Moore B., Taking the data center power and cooling challenge, Energy User News, 2002.

[2] Luca Benini, Alessandro Bogliolo, and Giovanni De Micheli, A survey of design techniques for system-level dynamic power management, (2002), 231-248.

[3] John S. Bucy, Jiri Schindler, Steven W. Schlosser, Gregory R. Ganger, and Contributors, The DiskSim simulation environment version 4.0 reference manual, Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-101, 2008.

[4] Enrique V. Carrera, Eduardo Pinheiro, and Ricardo Bianchini, Conserving disk energy in network servers, Proceedings of the 17th International Conference on Supercomputing, 2003, pp. 86-97.

[5] Feng Chen, Song Jiang, Weisong Shi, and Weikuan Yu, FlexFetch: A history-aware scheme for I/O energy saving in mobile computing, ICPP '07: Proceedings of the 2007 International Conference on Parallel Processing (Washington, DC, USA), IEEE Computer Society, 2007, p. 10.

[6] Dennis Colarelli and Dirk Grunwald, Massive arrays of idle disks for storage archives, Supercomputing '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (Los Alamitos, CA, USA), IEEE Computer Society Press, 2002, pp. 1-11.

[7] Roselli D. and Anderson T. E., Characteristics of file system workloads, University of California Berkeley Technical Report UCB/CSD-98-1029, 1998.

[8] Jones E., EPA announces new computer efficiency requirements, EPA Announcement, 2006.

[9] Hyeonsang Eom and Jeffrey K. Hollingsworth, Speed vs. accuracy in simulation for I/O-intensive applications, IPDPS, IEEE Computer Society Press, 2000, pp. 315-322.

[10] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google file system, SIGOPS Oper. Syst. Rev. 37 (2003), no. 5, 29-43.

[11] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke, DRPM: dynamic speed control for power management in server class disks, Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003, pp. 169-179.

[12] John Hawkins and Mikael Boden, The applicability of recurrent neural networks for biological sequence analysis, IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005), 243-253.

[13] R. Hedges, B. Loewe, T. McLarty, and C. Morrone, Parallel file system testing for the lunatic fringe: the care and feeding of restless I/O power users, April 2005, pp. 3-17.

[14] Inki Hong and Miodrag Potkonjak, Power optimization in disk-based real-time application specific systems, Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (Washington, DC, USA), IEEE Computer Society, 1996, pp. 634-637.

[15] Maximum Institution Inc., Power, heat, and sledgehammer, White Paper, 2002.

[16] Young-Jin Kim, Kwon-Taek Kwon, and Jihong Kim, Energy-efficient disk replacement and file placement techniques for mobile systems with hard disks, SAC '07: Proceedings of the 2007 ACM Symposium on Applied Computing (New York, NY, USA), ACM, 2007, pp. 693-698.

[17] Thomas T. Kwan, Robert E. McGrath, and Daniel A.
[18] Kester Li, Roger Kumpf, Paul Horton, and Thomas Anderson, A quantitative analysis of disk drive power management in portable computers, In Proceedings of the 1994 Winter USENIX Conference, 1994, pp. 279–291.

[19] W. B. Ligon and R. B. Ross, Implementation and performance of a parallel file system for high performance distributed applications, Proc. 5th IEEE Int'l Symp. High Performance Distributed Computing (Washington, DC, USA), IEEE Computer Society, 1996, p. 471.

[20] A. Manzanares, Xiaojun Ruan, Shu Yin, M. Nijim, Wei Luo, and Xiao Qin, Energy-aware prefetching for parallel disk systems: Algorithms, models, and evaluation, Proc. 8th IEEE Int'l Symp. Network Comp. and App., July 2009, pp. 90–97.

[21] Edmund B. Nightingale and Jason Flinn, Energy-efficiency and storage flexibility in the Blue File System, Proc. 6th USENIX Symp. Operating Sys. Design & Implementation (Berkeley, CA, USA), USENIX Association, 2004, pp. 25–25.

[22] George Panagiotakis, Michail D. Flouris, and Angelos Bilas, Reducing disk I/O performance sensitivity for large numbers of sequential streams, Distributed Computing Systems, International Conference on 0 (2009), 22–31.

[23] Athanasios E. Papathanasiou and Michael L. Scott, Energy efficient prefetching and caching, ATEC '04: Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA), USENIX Association, 2004, pp. 22–22.

[24] David A. Patterson, Garth Gibson, and Randy H. Katz, A case for redundant arrays of inexpensive disks (RAID), SIGMOD '88: Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), ACM, 1988, pp. 109–116.

[25] Eduardo Pinheiro and Ricardo Bianchini, Energy conservation techniques for disk array-based servers, ICS '04: Proceedings of the 18th Annual International Conference on Supercomputing (New York, NY, USA), ACM, 2004, pp. 68–78.

[26] Eduardo Pinheiro, Ricardo Bianchini, and Cezary Dubnicki, Exploiting redundancy to conserve energy in storage systems, SIGMETRICS Perform. Eval. Rev. 34 (2006), no. 1, 15–26.

[27] Xiaojun Ruan, A. Manzanares, Shu Yin, Ziliang Zong, and Xiao Qin, Performance evaluation of energy-efficient parallel I/O systems with write buffer disks, Proc. 38th Int'l Conf. Parallel Processing, Sept. 2009, pp. 164–171.

[28] Sriram Sankar, Yan Zhang, Sudhanva Gurumurthi, and Mircea R. Stan, Sensitivity-based optimization of disk architecture, IEEE Trans. Comput. 58 (2009), no. 1, 69–81.

[29] H. Shen, Mohan Kumar, S. K. Das, and Z. Wang, Energy-efficient caching and prefetching with data consistency in mobile distributed systems, Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, April 2004, p. 67.

[30] Seung Woo Son and Mahmut T. Kandemir, Energy-aware data prefetching for multi-speed disks, Conf. Computing Frontiers, 2006, pp. 105–114.

[31] S. W. Son, M. Kandemir, and A. Choudhary, Software-directed disk power management for scientific applications, Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, April 2005, pp. 4b–4b.

[32] D. B. Trizna, Microwave and HF multi-frequency radars for dual-use coastal remote sensing applications, OCEANS, 2005. Proceedings of MTS/IEEE, 2005, pp. 532–537 Vol. 1.

[33] Jun Wang, Xiaoyu Yao, and Huijun Zhu, Exploiting in-memory and on-disk redundancy to conserve energy in storage systems, IEEE Trans. Comput. 57 (2008), no. 6, 733–747.
[34] Jun Wang, Huijun Zhu, and Dong Li, eRAID: Conserving energy in conventional disk-based RAID systems, Computers, IEEE Transactions on 57 (2008), no. 3, 359–374.

[35] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller, Dynamic metadata management for petabyte-scale file systems, Proc. ACM/IEEE Int'l Conf. Supercomputing (Los Alamitos, CA, USA), IEEE Computer Society, 2004, p. 4.

[36] Tao Xie, SEA: A striping-based energy-aware strategy for data placement in RAID-structured storage systems, Computers, IEEE Transactions on 57 (2008), no. 6, 748–761.

[37] John Zedlewski, Sumeet Sobti, Nitin Garg, Fengzhou Zheng, Arvind Krishnamurthy, and Randolph Wang, Modeling hard-disk power consumption, FAST '03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA), USENIX Association, 2003, pp. 217–230.

[38] Qingbo Zhu, Zhifeng Chen, Lin Tan, Yuanyuan Zhou, Kimberly Keeton, and John Wilkes, Hibernator: helping disk arrays sleep through the winter, SIGOPS Oper. Syst. Rev. 39 (2005), no. 5, 177–190.

[39] Qingbo Zhu, Francis M. David, Christo F. Devaraj, Zhenmin Li, Yuanyuan Zhou, and Pei Cao, Reducing energy consumption of disk storage using power-aware cache management, HPCA '04: Proceedings of the 10th International Symposium on High Performance Computer Architecture (Washington, DC, USA), IEEE Computer Society, 2004, p. 118.

[40] Qingbo Zhu, Asim Shankar, and Yuanyuan Zhou, PB-LRU: a self-tuning power aware storage cache replacement algorithm for conserving disk energy, ICS '04: Proceedings of the 18th Annual International Conference on Supercomputing (New York, NY, USA), ACM, 2004, pp. 79–88.

[41] Xiaotong Zhuang and Santosh Pande, Power-efficient prefetching via bit-differential offset assignment on embedded processors, LCTES '04: Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (New York, NY, USA), ACM, 2004, pp. 67–77.

[42] Z. Zong, M. Briggs, N. O'Connor, and X. Qin, An energy-efficient framework for large-scale parallel storage systems, Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, March 2007, pp. 1–7.

Appendices

Appendix A
DiskSim Source Code Modifications

// This code was written by Adam Manzanares, Auburn University,
// based on code from Sriram Sankar.

/* The original listing named five system headers whose names were lost
   in extraction; stdio.h and stdlib.h are required for printf() and
   calloc(). */
#include <stdio.h>
#include <stdlib.h>

#include "disksim_global.h"
#include "disksim_iosim.h"
#include "disksim_power.h"
#include "disksim_stat.h"
#include "disksim_disk.h"
#include "inst.h"

/* Zero out each disk's energy statistics and mark every disk idle. */
void init_disk_power_model()
{
    int i;
    disk *currdisk;
    int numdisks = disksim->diskinfo->numdisks;
    for (i = 0; i < numdisks; i++) {
        currdisk = disksim->diskinfo->disks[i];
        currdisk->stat.activeEnergy = 0.0;
        currdisk->stat.activeTime = 0.0;
        currdisk->stat.idleEnergy = 0.0;
        currdisk->stat.idleTime = 0.0;
        currdisk->stat.sleepTime = 0.0;
        currdisk->stat.sleepEnergy = 0.0;
        currdisk->stat.transEnergy = 0.0;
        currdisk->idle_time = 0.0;
        currdisk->last_access = 0.0;
        currdisk->stat.transTime = 0.0;
        currdisk->status = DISK_IDLE;
    }
}

/* Accumulate active energy; timeVal is in ms, power in watts. */
void activeEnergyStat(disk *currdisk, double timeVal)
{
    currdisk->stat.activeEnergy += (timeVal / 1000.0) * (currdisk->model->dm_power_active);
    currdisk->stat.activeTime += (timeVal / 1000.0);
}

/* Mark beginning and end of each idle period. */
void idleEnergyStat(disk *currdisk, double timeVal, double idleStartTime)
{
    currdisk->stat.idleTime += (timeVal / 1000.0);
    currdisk->stat.idleStartTime = idleStartTime;
}

/* Idle timer check: add another checkpoint and update all disks'
   idle time. Currently checked every 500 ms. */
void timer_expired(event *curr)
{
    if (disksim->stop_dpm_check == FALSE) {
        timer_event *timecheck;
        timecheck = calloc(1, sizeof(timer_event));
        timecheck->time = curr->time + 500;   /* re-arm the 500 ms timer */
        timecheck->type = TIMEREXPIRED;
        addtointq((event *) timecheck);
        disk_update_idle(curr->time);
    }
}

/* Walk every disk; if an idle disk has exceeded the DPM threshold,
   schedule a spin-down event and charge the transition penalty. */
void disk_update_idle(double time)
{
    int i;
    disk *currdisk;
    int numdisks = disksim->diskinfo->numdisks;
    for (i = 0; i < numdisks; i++) {
        currdisk = disksim->diskinfo->disks[i];
        currdisk->idle_time = time - currdisk->last_access;
        switch (currdisk->status) {
        case DISK_IDLE: {
            if (currdisk->idle_time > 5000) {  /* idle DPM threshold */
                event *spin_down;
                spin_down = calloc(1, sizeof(event));
                spin_down->time = time + currdisk->model->dm_spin_down_time;
                spin_down->type = SPINDOWN_COMP;
                spin_down->temp = i;
                addtointq(spin_down);
                currdisk->status = DISK_SPIN_DOWN;
                currdisk->spin_down_complete = spin_down->time;
                currdisk->stat.transEnergy += currdisk->model->dm_spin_down_penalty;
                currdisk->stat.transTime += currdisk->model->dm_spin_down_time;
                currdisk->stat.spin_down_transitions++;
            }
            break;
        }
        case DISK_SPIN_DOWN:
            /* Spinning down, do nothing. */
            break;
        case DISK_SPIN_UP:
            /* Spinning up, do nothing. */
            break;
        case DISK_STANDBY:
            /* Do nothing. */
            break;
        default:
            printf("Unknown disk state reached in disk_update_idle\n");
            break;
        }
    }
}

/* Wake a disk from standby: schedule a spin-up completion event,
   account for sleep time, and charge the spin-up penalty. Returns
   the spin-up latency. */
double wakeup_disk_sleep(double time, disk *currdisk)
{
    event *spin_up;
    int disk_id = 0;
    spin_up = calloc(1, sizeof(event));
    spin_up->time = time + currdisk->model->dm_spin_up_time;
    spin_up->type = SPINUP_COMP;
    while (currdisk != disksim->diskinfo->disks[disk_id]) {
        disk_id++;   /* recover this disk's index */
    }
    spin_up->temp = disk_id;
    addtointq(spin_up);
    currdisk->status = DISK_SPIN_UP;
    currdisk->spin_up_complete = spin_up->time;
    currdisk->stat.sleepTime += time - currdisk->spin_down_complete;
    currdisk->stat.transEnergy += currdisk->model->dm_spin_up_penalty;
    currdisk->stat.spin_up_transitions++;
    currdisk->stat.transTime += currdisk->model->dm_spin_up_time;
    return currdisk->model->dm_spin_up_time;
}

/* Wake a disk that is still spinning down: the spin-up can only begin
   once the spin-down completes. Returns the delay until the disk is up. */
double wakeup_disk_spndwn(double time, disk *currdisk)
{
    event *spin_up;
    int disk_id = 0;
    spin_up = calloc(1, sizeof(event));
    spin_up->time = currdisk->spin_down_complete + currdisk->model->dm_spin_up_time;
    spin_up->type = SPINUP_COMP;
    while (currdisk != disksim->diskinfo->disks[disk_id]) {
        disk_id++;
    }
    spin_up->temp = disk_id;
    addtointq(spin_up);
    currdisk->status = DISK_SPIN_UP;
    currdisk->spin_up_complete = spin_up->time;
    currdisk->stat.transEnergy += currdisk->model->dm_spin_up_penalty;
    currdisk->stat.spin_up_transitions++;
    currdisk->stat.transTime += currdisk->model->dm_spin_up_time;
    return spin_up->time - time;
}

/* Delay remaining until an in-progress spin-up completes. */
double spin_up_delay(double time, disk *currdisk)
{
    return currdisk->spin_up_complete - time;
}
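One refinement suggested by the code above: disk_update_idle spins a disk down after a fixed 5000 ms idle threshold, but the threshold could instead be derived from the disk's power parameters as the break-even time at which standby savings offset the transition penalty. The sketch below is an illustrative calculation, not part of the DiskSim modifications; the idle and standby power parameters are assumed inputs (only dm_power_active appears as a model field in the code above).

/* Illustrative break-even threshold, returned in milliseconds to match
   the timer above. The power parameters are hypothetical inputs and do
   not correspond to fields in the code above. */
double breakeven_idle_threshold_ms(double power_idle_watts,
                                   double power_standby_watts,
                                   double spin_down_penalty_joules,
                                   double spin_up_penalty_joules)
{
    double watts_saved = power_idle_watts - power_standby_watts;
    if (watts_saved <= 0.0)
        return -1.0;  /* standby never pays off for this disk */
    double penalty = spin_down_penalty_joules + spin_up_penalty_joules;
    return (penalty / watts_saved) * 1000.0;  /* seconds to ms */
}

For example, with 4.0 W idle power, 0.5 W standby power, and a combined 30 J transition penalty (all illustrative numbers), the break-even time is 30 / 3.5, or roughly 8.6 s, which suggests the fixed 5000 ms threshold would spin such a disk down too eagerly.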