Improving Reliability of Energy-Efficient Storage System

by Shu Yin

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
May 7, 2012

Keywords: Reliability, Energy Efficient, Parallel Storage System, Modeling

Copyright 2012 by Shu Yin

Approved by
Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
Alvin Lim, Associate Professor of Computer Science and Software Engineering
Sanjeev Baskiyar, Associate Professor of Computer Science and Software Engineering
George T. Flowers, Dean of Graduate School

Abstract

With the rapid growth in the production and storage of large-scale data sets, it is important to investigate methods to drive down the cost of storage systems. Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have negative impacts on storage systems. Existing reliability models are inadequate to estimate the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, we first propose a mathematical model - called MINT - to evaluate the reliability of a parallel disk system where energy-saving mechanisms are implemented. In this dissertation, MINT is focused on modeling the reliability impacts of two well-known energy-saving techniques - the Popular Data Concentration technique (PDC) and the Massive Array of Idle Disks (MAID). Unlike MAID and PDC, which store a complete file on the same disk, a Redundant Array of Inexpensive Disks (RAID) stripes a file into several parts and stores them on different disks to achieve higher parallelism and, hence, higher I/O performance. However, RAID faces greater challenges in energy efficiency and reliability. In order to evaluate the reliability of power-aware RAID, we then develop a Weibull-based model, MREED. In this dissertation, we use MREED to model the reliability impacts of a well-known energy-efficient storage mechanism, the Power-Aware RAID (PARAID). Third, we focus on the validation of the two models, MINT and MREED. It is challenging to validate the accuracy of reliability models, since we are unable to observe energy-efficient systems for decades due to the time and experimental costs involved. We introduce a validated storage system simulator, DiskSim, to determine whether our models and DiskSim agree with one another. In our validation process, we use a file access trace from a real-world file system for comparison. The last part of this dissertation focuses on improving energy-efficient parallel storage systems. We propose a strategy, Disk Swapping, to improve disk reliability by alternating disks storing frequently accessed data with disks holding less frequently accessed data. In this part, we focus on studying the reliability improvement of PDC and MAID. Finally, we further improve disk reliability by introducing a multiple-disk-swapping strategy.

Acknowledgments

I am sincerely and heartily grateful to my advisor, Dr. Xiao Qin, for the support and guidance he showed me throughout my graduate studies at Auburn University. Dr. Qin has spent many hours guiding and mentoring me, and I am sincerely grateful for the experience he has provided me as my advisor. His hard work and dedication to the academic field will serve as an important example of what a successful academic can achieve. I would also like to thank Dr.
David Umphress because he has served as my sec- ondary advisor and I have enjoyed the feedback and encouragement that he provides. I believe that his experiences, suggestions and passions on teaching have proved to be strong influences in my early academic career. I would also like to thank Dr. Sanjeev Baskiyar and Dr. Alvin Lin for serving on my dissertation committee and providing insightful feedback when needed. I want to thank the faculty and staff of the CSSE department as they have guided me through this process. Jo-Ann Lauranitis has helped me greatly with the administrative process and I am thankful for her quick responses to all of my questions. Our research group has been supportive and helpful all of the years I have been at Auburn University. I would like to thank Adam Manzanares because we have worked together on few projects and he has helped me meet several deadlines. Besides, his encouragements and suggestions helped me a lot for building vision of my career and the world outlook . I would also like to thank Xiaojun Ruan, Zhiyang Ding for providing me feedback and support during the dissertation process. Ziliang Zong and Kiranmai Bellam are two research group members who have helped me greatly during the first years of my stay at Auburn.I would also like to thank Jianguo Lu, iv Yun Tian, James Majors, Jiong Xie and Ji Zhang for the help during the last stages of my dissertation work. I would like to acknowledge my parents Guilin Yin and Luyun Zhu because they have served as the greatest inspiration in my life. Their climb through the journey of life is an amazing feat and I am thankful for their endless support through any situation that I have faced. I also would like to thank my best friends who offered greatly encouragements when I was depression and helped me to realize who I am and what I want. To my beloved. v Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Hard Disk Drive Storage Systems . . . . . . . . . . . . . . . . . . . . 6 2.2 Parallel Storage Systems . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Energy-Efficient Parallel Disk Systems. . . . . . . . . . . . . . . . . . 10 2.4 Reliability Impacts of Power Management on Disks. . . . . . . . . . . 11 2.5 Reliability Models of Disk Systems. . . . . . . . . . . . . . . . . . . . 11 2.6 Validation of Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.7 Reliability Improvements . . . . . . . . . . . . . . . . . . . . . . . . . 13 3 MINT: A Reliability Modeling Framework for Energy-Efficient Parallel Disk Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
15 3.2 The MINT Reliability Model . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Impacts of Utilization on Disk Annual Failure Rate . . . . . . 17 vi 3.2.3 Impacts of Temperature on Disk Annual Failure Rate . . . . . 21 3.2.4 Power-State Transition Frequency . . . . . . . . . . . . . . . . 22 3.2.5 Single Disk Reliability Model . . . . . . . . . . . . . . . . . . 24 3.3 Reliability Models for MAID and PDC . . . . . . . . . . . . . . . . . 26 3.3.1 MAID- Massive Array of Idle Disks . . . . . . . . . . . . . . . 26 3.3.2 PDC- Popular Disk Concentration . . . . . . . . . . . . . . . 32 3.3.3 Results Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4 MREED: Reliability Analysis of An Energy-Aware RAID System . . . . 43 4.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 The MREED Modeling Framework . . . . . . . . . . . . . . . . . . . 45 4.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.2 Weibull Distribution Analysis . . . . . . . . . . . . . . . . . . 49 4.3 Reliability Model for PARAID . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.2 Modeling Utilization of Disks in PARAID . . . . . . . . . . . 52 4.4 Reliability Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.2 Disk Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4.3 Annual Failure Rate . . . . . . . . . . . . . . . . . . . . . . . 58 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5 Models Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1.1 The Validation Techniques . . . . . . . . . . . . . . . . . . . . 62 5.1.2 Berkeley Web Trace Replay . . . . . . . . . . . . . . . . . . . 64 5.1.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 65 5.2 Validation of MREED . . . . . . . . . . . . . . . . . . . . . . . . . . 68 vii 5.2.1 The Validation Techniques . . . . . . . . . . . . . . . . . . . . 68 5.2.2 DiskSim Simulation . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.3 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . 71 5.2.4 UMass WebSearch Trace . . . . . . . . . . . . . . . . . . . . . 72 5.2.5 Validation Results . . . . . . . . . . . . . . . . . . . . . . . . 73 6 Improving Reliability of Energy-Efficient Parallel Storage Systems . . . . 76 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 6.2 Improving Reliability of MAID via Disk Swapping . . . . . . . . . . . 78 6.2.1 Improving Reliability of Cache Disks in MAID . . . . . . . . . 78 6.2.2 Swapping Disks Multiple Times . . . . . . . . . . . . . . . . . 83 6.3 Experimental Results and Evaluation . . . . . . . . . . . . . . . . . . 84 6.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3.2 Disk Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.3.3 The Single-Disk-Swapping Strategy . . . . . . . . . . . . . . . 85 6.3.4 The Multiple-Disk-Swapping Strategy . . . . . . . . . . . . . . 89 6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
92 7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.1 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.1.1 The MINT model for parallel storage systems . . . . . . . . . 94 7.1.2 The MREED model for RAID systems . . . . . . . . . . . . . 94 7.1.3 Reliability improvement of parallel storage systems . . . . . . 95 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.2.1 Future Directions for the Short Term . . . . . . . . . . . . . . 96 7.2.2 Future Directions for the Long Term . . . . . . . . . . . . . . 97 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 viii List of Figures 2.1 A Simplified Taxonomy of Storage Systems Research . . . . . . . . . . . 6 3.1 The Framework of the MINT Reliability Model . . . . . . . . . . . . . . 18 3.2 Utilization Impacts on AFR (by Google) . . . . . . . . . . . . . . . . . . 19 3.3 3-Year-Old HDD Utilization Impacts on AFR . . . . . . . . . . . . . . . 20 3.4 Average Drive Temperature Impacts on AFR (by Google) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Temperature-Factor Function of 3-Year-Old HDDs . . . . . . . . . . . . 23 3.6 Impacts of Transition Frequency on Frequency adder of 3-Year-Old HDDs 24 3.7 3-Year-Old HDD Combined Factors Impacts on AFR (Single Disk Reliability Model) . . . . . . . . . . . . . . . . . . . . . . . 26 3.8 MAID System Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.9 PDC System Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.10 Utilization Comparison of the PDC and MAID Access Rate(up to 500/month) Impacts on Utilization . . . . . . . . . . 36 3.11 AFR Comparison of the PDC and MAID Access Rate Impacts on AFR(Temperature=35nullC) . . . . . . . . . . . . 37 ix 3.12 Utilization Comparison of the PDC and MAID Access Rate(up to 1000/month) Impacts on Utilization . . . . . . . . . 38 3.13 Utilization Comparison of the PDC and MAID Access Rate(up to 1000/month) Impacts on AFR(Temperature=35nullC) . 39 3.14 Utilization Comparison of the PDC and MAID Access Rate(up to 1000/month) Impacts on AFR(Temperature=40nullC) . 39 3.15 AFR Comparison of the PDC and MAID Temperature Impacts on AFR (Access Rate= 200/month) . . . . . . . . 40 3.16 AFR Comparison of the PDC and MAID Temperature Impacts on AFR (Access Rate= 450/month) . . . . . . . . 41 4.1 Overview of the MREED reliability modeling methodology . . . . . . . . 47 4.2 Framework of PARAID: skewed striping of replicated blocks in soft state, creating 3 RAID gears over 4 disks . . . . . . . . . . . . . . . . . . . . . 51 4.3 Disks Utilization Comparison Between PARAID-0 And RAID-0 at A Low Access Rate(20 times per hour) . . . . . . . . . . . . . . . . . . . . . . . 58 4.4 Disks Utilization Comparison Between PARAID-0 And RAID-0 at A Low Access Rate(80 times per hour) . . . . . . . . . . . . . . . . . . . . . . . 59 4.5 AFR Comparison Between PARAID-0 And RAID-0 at A Low Access Rate(20 times per hour) . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.6 AFR Comparison Between PARAID-0 And RAID-0 at A High Access Rate(80 times per hour) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 x 5.1 The file access rate distribution of the one-month Berkeley web trace. Access Rate ranges from 1 to 4.5null104 No./month . . . . . . . . . . . . . 65 5.2 Impacts of file access rate on disk utilization. Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . 66 5.3 Impacts of file access rate on disk utilization (PDC). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . 67 5.4 Impacts of file access rate on disk utilization (MAID1). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . 67 5.5 Impacts of file access rate on disk utilization (MAID2). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . 68 5.6 Impacts of file access rate on AFR (PDC). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.7 Impacts of file access rate on AFR (MAID1). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.8 Impacts of file access rate on AFR (MAID2). Access rate varies from 10 to 64null104 No./month . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.9 File to Block Level Converter Outline . . . . . . . . . . . . . . . . . . . 72 5.10 Diagram of the Storage System Corresponding to the DiskSim Raid-0 . . 73 5.11 Utilization Comparison Between MREED and DiskSim Simulator . . . . 74 5.12 Gear Shiftings Comparison Between MREED and DiskSim Simulator . . 75 6.1 Disk Swapping in MAID: The two cache disks on the left-hand side are swapped with the two data disks on the right-hand side . . . . . . . . . 80 xi 6.2 Logic Diagram of Disk Swapping . . . . . . . . . . . . . . . . . . . . . . 81 6.3 Utilization Comparison of the MAID Access Rate Impacts on AFR (No Swapping) . . . . . . . . . . . . . . . 86 6.4 Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold=2null105) . . . . . . . . . . . . 87 6.5 Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold=5null105) . . . . . . . . . . . . 88 6.6 Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold=8null105) . . . . . . . . . . . . 88 6.7 Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple Threshold=2null105) . . . . . . . 90 6.8 Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple Threshold=2.5null105) . . . . . . 90 6.9 Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple Threshold=4null105) . . . . . . . 91 xii List of Tables 3.1 The characteristics of the simulated parallel disk system used to evaluate the reliability of PDC, MAID-1, and MAID-2. . . . . . . . . . . . . . . . 35 4.1 Temperature Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2 List of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3 Experiment Parameter Setup . . . . . . . . . . . . . . . . . . . . . . . . 57 5.1 File Access Rates of the One-Month Web Trace . . . . . . . . . . . . . . 64 6.1 The characteristics of the simulated parallel disk system used to evaluate the reliability of MAID-1, and MAID-2. . . . . . . . . . . . . . . . . . . . . . . 85 xiii Chapter 1 Introduction Due to current trends in computing we are facing the so called data explosion. As the use of computers to help day-to-day tasks has increased, we also face a side effect of generating large amounts of data. This data must be stored on some sort of medium and currently hard disk drives have become the most common storage medium. Large scale storage systems are being developed and installed routinely and there is a significant amount of energy that must be consumed to operate these storage systems. 
Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have negative impacts on storage systems. Existing reliability models are inadequate to estimate the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, we propose mathematical models to evaluate the reliability of parallel disk systems where energy-saving mechanisms are implemented. Furthermore, we propose a strategy to improve the reliability of energy-efficient parallel disk systems.

This chapter continues by developing the problem statement in Section 1.1. Section 1.2 presents the scope of the research. Section 1.3 summarizes the main contributions of the dissertation. Finally, Section 1.4 outlines the organization of the dissertation.

1.1 Problem Statement

The number of large-scale parallel I/O systems is increasing in today's high-performance data-intensive computing systems due to the storage space required to contain the massive amount of data. Typical examples of data-intensive applications requiring large-scale parallel I/O systems include long-running simulations [27], remote sensing applications [83], and biological sequence analysis [39]. As the size of a parallel I/O system grows, the energy consumed by the I/O system often becomes a large part of the total cost of ownership [62][91][100]. Reducing the energy costs of operating these large-scale disk I/O systems often becomes one of the most important design issues. It is known that disk systems can account for nearly 27% of the total energy consumption in a data center [37]. Even worse, the push for disk I/O systems to have larger capacities and speedier response times has driven energy consumption rates upward.

Existing energy conservation techniques can yield significant energy savings in disks. While several energy conservation schemes, like cache-based energy-saving approaches, normally have marginal impact on disk reliability, many energy-saving schemes (e.g., dynamic power management and workload skew techniques) inevitably have noticeable adverse impacts on storage systems [12][90]. Examples include dynamic power management (DPM) techniques, which save energy through frequent disk spin-downs and spin-ups that can in turn shorten disk lifetime [22][34][46], redundancy techniques [60][102][82][89], workload skew [54][38][98], and multi-speed settings [32][76]. Unlike DPM, workload-skew techniques such as MAID [19] and PDC [58] move popular data sets to a subset of disks acting as workhorses, which are kept busy so that other disks can be turned into the standby mode to save energy. Compared with disks storing cold data, disks archiving hot data inherently have a higher risk of breaking down.

It is often challenging to improve both the reliability and the energy efficiency of storage systems, because little attention has been paid to evaluating the reliability impacts of power management strategies on storage systems. Many excellent reliability models have been proposed for disk systems (see, for example, [17] and [80]). However, existing disk reliability models are inadequate for evaluating the reliability of disk systems equipped with energy-saving mechanisms. For example, Shah and Elerath conducted a series of reliability analyses using field failure data of several drive models from various disk drive manufacturers [72].
Hughes and Murray investigated SATA disk drive reliability factors that bear on storage system performance [35]. They not only studied SATA drive operating failure rates, but also proposed approaches to improving the reliability of storage systems comprised of multiple SATA disks [35]. Reliability models that do not consider energy-saving mechanisms are quite inaccurate when it comes to estimating the reliability of energy-efficient disk systems. Our goal is to quantitatively investigate the reliability of parallel disk systems employing a variety of energy conservation schemes using a novel mathematical model.

1.2 Research Scope

Our research focuses on models to evaluate the reliability of energy-efficient parallel storage systems. We start the modeling process by capturing the behaviors of parallel disk systems coupled with power management optimization policies. We first make use of data access patterns as input parameters, which are used to estimate each disk's utilization and power-state transition frequency. Then, we derive each disk's reliability in terms of annual failure rate from the disk's utilization, operating temperature, and power-state transition frequency. These parameters are key reliability-affecting factors in addition to disk age. Finally, we calculate the reliability of the parallel disk system in accordance with the annual failure rate of each disk in the system.

This work is accomplished through the use of models and simulations. We present two models to help us model the reliability of two different types of energy-efficient disk systems: ordinary disk arrays, and RAID systems equipped with data-striping techniques. We model the utilization of disks serving requests as well as the power-state transitions and their impact on the reliability of the disk system. Using these models, we developed our own simulator, which we used to evaluate the reliability of disk systems quickly. Our models are validated by making changes to the DiskSim simulation environment. Finally, we develop a prototype implementation of a virtual file system that supports our reliability models for energy-efficient disk systems, and we also develop a prototype technique that improves the reliability of parallel storage systems equipped with energy-saving strategies.

1.3 Contributions

The major contributions of the research presented in this dissertation are as follows:

1. A generic mathematical approach, MINT, to modeling the reliability of energy-efficient parallel disks coupled with power management optimization policies;
2. Two reliability models for the two well-known energy-saving schemes - the Popular Data Concentration scheme (PDC) and the Massive Array of Idle Disks (MAID);
3. Intriguing impacts of PDC and MAID on the reliability of parallel disk systems;
4. A reliability model, MREED, which introduces Weibull analysis, proposed for energy-aware data-striping parallel storage systems;
5. Validation of the access-rate-utilization model of MREED;
6. Evaluation of the reliability of power-aware RAID-0 and RAID-5 (PARAID-0, PARAID-5);
7. A prototype technique, disk swapping, which is developed and implemented.

1.4 Organization

This dissertation is organized in the following manner: Chapter 2 reviews related work and contrasts it against the contributions of this dissertation. Chapter 3 introduces the MINT model for the evaluation of disk arrays equipped with energy-saving techniques. In particular, we evaluate two well-known energy-efficient mechanisms, PDC and MAID.
Thorough simulation results are also presented in this chapter. Chapter 4 details the MREED model for the evaluation of energy-aware data-striping parallel storage systems. The reliability of PARAID-0 is evaluated. Chapter 5 introduces methods for the validation of our reliability models. Chapter 6 presents Disk Swapping, a prototype technique that I developed to improve the reliability of parallel storage systems. Chapter 7 summarizes the main contributions of this dissertation and presents a couple of future research directions based on the ideas contained in the dissertation.

Chapter 2
Literature Review

This chapter briefly presents previous approaches found in the literature that are most relevant to our research from two aspects: energy-efficient storage systems, and reliability impacts on disks. Fig. 2.1 shows a simplified taxonomy of storage systems research.

[Figure 2.1: A Simplified Taxonomy of Storage Systems Research]

2.1 Hard Disk Drive Storage Systems

Introduced by IBM in 1956, a hard disk drive (HDD) is a device for storing and retrieving digital information. Hard disk drives have been a dominant device for secondary storage of data in general-purpose computers since the early 1960s. Hard drives have maintained this position because advances in their recording capacity, cost, reliability, and speed have kept pace with the requirements for secondary storage [51].

The capacity of hard drives has grown exponentially over time. When hard drives became available for personal computers (PCs), they offered only five megabytes of capacity. During the mid-1990s, the typical hard disk drive for a PC had a capacity of about one gigabyte [1]. In 2007, Hitachi introduced the world's first one-terabyte hard disk drive [5]. As of January 2012, desktop hard disk drives typically had a capacity of 500 to 2000 gigabytes, while the largest-capacity drives were four terabytes [8].

The latency of a disk access can be broken down into three main aspects: seek, rotational, and transfer latencies. Seek latency refers to the time it takes to position the read/write head over the proper track. The seek process involves a mechanical movement that may require an acceleration in the beginning and a deceleration and repositioning in the end. As a result, although disk seek times have been reduced, short seek times have not kept up with the rates of improvement of silicon processors. While processing rates have improved by more than an order of magnitude, average seek times have shrunk to only half of their values of a decade ago [11].

Rotational latency is the delay for the rotation of the disk to bring the required disk sector under the read/write mechanism. This characteristic mainly relies on the rotational speed of a disk, measured in revolutions per minute (RPM). Due to electronic, mechanical, and manufacturing constraints, it is hard to shorten this latency by increasing the rotational speed of disks. The RPM of disks has tripled in the past decades; the fastest hard disk drive was produced by Seagate in 2000 and spins at 15,000 RPM [3]. A study shows that it is unlikely that there will be a disk rotational speed increase in the near future [29].
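As a concrete illustration (not drawn from the dissertation itself), the average rotational latency can be estimated as half of one revolution period, $T_{rot} = \frac{1}{2} \cdot \frac{60}{\mathrm{RPM}}$ seconds. For the 15,000 RPM drive mentioned above this gives $0.5 \times 60/15000 = 2$ ms, while a typical 7,200 RPM desktop drive incurs roughly $4.2$ ms per access on average.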
The third type of delay is transfer time, which is the time for the target sectors to pass under a read/write head. Disk transfer times are determined by the rotational speed and the storage density (in bytes per square inch). Disk areal densities continue to increase at 50 to 55% per year, leading to dramatic increases in sustained transfer rates, averaging 40% per year [30]. Disk performance has been steadily improving, with more pronounced gains for large transfers. The maximum sustained bandwidth (MB/s) is roughly proportional to the linear density. The compound annual growth rate (CAGR) of bandwidth stayed around 20% from 1996 to early 2002; recently, it is more likely to fall within the range of 10 to 15%. Currently, a high-performance disk drive has a maximum sustained bandwidth of approximately 171 MB/s [47].

2.2 Parallel Storage Systems

Scientific computation is often beyond the reach of a single-disk storage system, because it requires significant computational power and involves large amounts of data. Advances in communications technology allow effectively unbounded processing power and storage capacity to be brought to bear on problems much larger than those that can be handled by a single machine.

RAID is an example of an advanced storage technique, first introduced by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987 [56]. The different schemes or architectures are named by the term RAID followed by a number (e.g., RAID-0, RAID-1). Each scheme provides a different balance between two key goals: increasing data reliability and increasing read/write performance. There are three main RAID levels; many more variations have been proposed.

• RAID-0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage without fault tolerance. Hence simple stripe sets are normally referred to as RAID 0. Any drive failure destroys the array, and the likelihood of failure increases with more drives in the array [69][41].

• RAID 1 (mirroring without parity or striping) writes data identically to multiple drives, thereby producing a "mirrored set"; at least two drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only two. Of course, it might be possible to use such a limited level-1 RAID to effectively mask the limitation [45][74][31].

• RAID 5 (block-level striping with distributed parity) distributes parity along with the data and requires all drives but one to be present to operate; data in the array is not lost even in the case of a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity, such that the failed drive can be rebuilt by the end user. However, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced and the associated data reconstructed [52][14]. RAID 5 requires at least three disks.

The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for large-scale cluster computing systems and focuses on high-performance access to large data sets.
It consists of a server process and a client library, both of which are written entirely in user-level code [33][77][96].

Lustre is another parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau of Linux and cluster. Because Lustre has high-performance capabilities and open licensing, it is often deployed in supercomputers [57][28][15]. At the present time, fifteen of the top 30 supercomputers in the world have Lustre file systems installed, including the world's fastest TOP500 supercomputer [9], the K computer [7].

Ceph is a free-software distributed file system initially created by Sage Weil [86]. Ceph's main goals are to be POSIX-compatible and completely distributed without a single point of failure. The data is seamlessly replicated, making Ceph fault-tolerant [43].

PanFS is a parallel distributed file system developed by Panasas, Inc. It creates a single pool of storage under a global namespace that provides the ability to support multiple applications and workflows in a single storage system, with optimal performance for complex technical applications. PanFS eliminates the need for multiple islands of storage [18][53][88].

2.3 Energy-Efficient Parallel Disk Systems.

Hard disk drives (HDDs) are made up of various electrical, electronic, and mechanical components [97]. An array of techniques has been developed to save energy in single HDDs. Energy dissipation in disk drives can be reduced at the I/O level (e.g., dynamic power management [23][46] and multi-speed disks [34]), the operating system level (e.g., power-aware caching/prefetching [102][76]), and the application level (e.g., software DMP [75] and cooperative I/O [87]).

Existing energy-saving techniques for parallel disk systems often rely on one of two basic ideas - power management and workload skew. Power management schemes conserve energy by turning disks into standby after a period of idle time. Although multi-speed disks are not widely adopted in storage systems, power management has been successfully extended to address energy-saving issues in multi-speed disks [34][32][42]. The basic idea of workload skew is to concentrate I/O workloads from a large number of parallel disks onto a small subset of disks, allowing other disks to be placed in the standby mode [58][19][66][59].

2.4 Reliability Impacts of Power Management on Disks.

Recent studies show that both power management and workload skew schemes inherently impose adverse reliability impacts on disk systems [12][90]. For example, power management schemes are likely to result in a huge number of disk spin-downs and spin-ups that can significantly reduce the lifespan of hard disks. Workload skew techniques dynamically migrate frequently accessed data to a subset of disks [65][49], which inherently have a higher risk of breaking down than the other disks that are usually kept in standby. Disks storing popular data tend to have high failure rates due to extremely unbalanced workload. Thus, the popular-data disks have a strong likelihood of becoming a reliability bottleneck. The design of our MINT, presented in this dissertation, is orthogonal to the aforementioned energy-saving studies (see Section 3.2), because MINT is focused on the reliability impacts of the power management and workload skew schemes in parallel disks.

2.5 Reliability Models of Disk Systems.

A malfunction of any component in a hard disk drive could lead to a failure of the disk.
Reliability - one of the key characteristics of disks - can be measured in terms of mean time between failures (MTBF). Disk manufacturers usually investigate the MTBFs of disks either by laboratory testing or by mathematical modeling. Although disk drive manufacturers claim that the MTBF of most disks is more than 1 million hours [71], users have experienced a much lower MTBF from their field data [25]. More importantly, it is challenging to measure MTBF because of a wide range of contributing factors, including disk age, utilization, temperature, and power-state transition frequency [36][24][63].

A handful of reliability models have been successfully developed for storage systems. For example, Pâris et al. investigated an approach to computing both the average failure rate and the mean time to failure in distributed storage systems [55]; Elerath and Pecht proposed a flexible model for estimating the reliability of RAID storage [26]; and Xin et al. developed a model to study disk infant mortality [93]. Unlike these reliability models tailored for conventional parallel and distributed disk systems, our MINT model proposed in Chapter 3 pays special attention to the reliability of parallel disk systems coupled with energy-saving mechanisms.

2.6 Validation of Models.

Model validation means substantiation that a computerized model, within its domain of applicability, possesses a satisfactory range of accuracy consistent with the intended application of the model [70]. Major ways to validate models include historical methods, extreme condition tests, and comparison to other models [67][13][48]. For example, R.E. Brown et al. validated their distribution system reliability models by adjusting default component reliability parameters so that predicted results match historical results [16]. In extreme condition tests, the model structure and outputs should be plausible for any extreme and unlikely combination of levels of factors in the system. We developed a trace-driven simulation model using the Berkeley Web Trace [2] as a reference model to compare with our MINT model for validation purposes. The major reason that we used a Web trace is that our research pays more attention to read-intensive I/O activities, and Web workloads impose a higher read load than write load [64][79][44].

2.7 Reliability Improvements

Storage clusters consisting of thousands of disk drives are widely employed because of their large capacity and high I/O throughput. However, the reliability of large storage clusters is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee high data reliability for large-scale storage cluster systems, because disk rebuild time lengthens as disk capacity grows [95]. Researchers have developed various methods to improve the reliability of storage clusters. For example, Xie et al. developed a novel data reconstruction strategy, called multi-level caching-based reconstruction optimization (MICRO), which can be applied to RAID-structured mobile storage systems; MICRO can noticeably shorten reconstruction times and user response times while saving energy [92]. Xin et al. presented the fast recovery mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time [94]. Zhang et al. proposed a fast and efficient reverse lookup scheme named Group-based Shifted Declustering (G-SD), a layout that is able to locate the whole content of a failed node [101].
Chapter 3
MINT: A Reliability Modeling Framework for Energy-Efficient Parallel Disk Systems

Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have negative impacts on storage systems. Existing reliability models are inadequate to estimate the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, we propose a mathematical model - called MINT - to evaluate the reliability of a parallel disk system where energy-saving mechanisms are implemented. In this chapter, we focus on modeling the reliability impacts of two well-known energy-saving techniques - the Popular Data Concentration technique (PDC) and the Massive Array of Idle Disks (MAID). We started this research by investigating how PDC and MAID affect the utilization and power-state transition frequency of each disk in a parallel disk system. We then model the annual failure rate of each disk as a function of the disk's utilization, power-state transition frequency, and operating temperature, because these parameters are key reliability-affecting factors in addition to disk age. Next, the reliability of a parallel disk system can be derived from the annual failure rate of each disk in the parallel disk system. Finally, we used MINT to study the reliability of a parallel disk system equipped with the PDC and MAID techniques. Experimental results show that PDC is more reliable than MAID when the disk workload is low. In contrast, the reliability of MAID is higher than that of PDC under relatively high I/O load.

3.1 Motivations

Parallel disk systems, providing high-performance data-processing capacity, are of great value to large-scale parallel computers [4]. A parallel disk system comprised of an array of independent disks can be built from low-cost commodity hardware components. In the past few decades, parallel disk systems have become increasingly popular for data-intensive applications running on massively parallel computing platforms [81].

Existing energy conservation techniques can yield significant energy savings in disks. While several energy conservation schemes, like cache-based energy-saving approaches, normally have marginal impact on disk reliability, many energy-saving schemes (e.g., dynamic power management and workload skew techniques) inevitably have noticeable adverse impacts on storage systems [12][90]. Examples include dynamic power management (DPM) techniques, which save energy through frequent disk spin-downs and spin-ups that can in turn shorten disk lifetime [22][34][46], redundancy techniques [60][102][82][89], workload skew [54][38][98], and multi-speed settings [32][76]. Unlike DPM, workload-skew techniques such as MAID [19] and PDC [58] move popular data sets to a subset of disks acting as workhorses, which are kept busy so that other disks can be turned into the standby mode to save energy. Compared with disks storing cold data, disks archiving hot data inherently have a higher risk of breaking down.

Unfortunately, it is often difficult for storage researchers to improve the reliability of energy-efficient disk systems. One of the main reasons lies in a challenge that every disk energy-saving study faces today: how to evaluate the reliability impacts of power management strategies on disk systems.
Although the reliability of disk systems can be estimated by simulating the behaviors of energy-saving algorithms, there is a lack of fast and accurate methodologies to evaluate the reliability of modern storage systems with high energy efficiency. To address this problem, we developed a mathematical reliability model called MINT to estimate the reliability of a parallel disk system that employs a variety of reliability-affecting energy conservation techniques [99].

In this chapter, we first study the reliability of a parallel disk system equipped with a well-known energy-saving scheme, the MAID [19] technique. I/O load-skewing techniques like MAID inherently affect the reliability of parallel disks for two reasons: first, disks storing popular data tend to have higher I/O utilization than disks storing cold data; second, disks with higher utilization are likely to have a higher risk of breaking down. To address the adverse impact of load-skewing techniques on disk reliability, a disk swapping strategy was proposed to improve disk reliability in MAID by switching the roles of data disks and cache disks. We evaluate the impacts of the disk swapping scheme on the reliability of MAID-based parallel disk systems.

In this chapter, our contributions are as follows:

1. We studied a model for the Massive Array of Idle Disks (MAID) based on the Mathematical Reliability Models for Energy-Efficient Parallel Disk Systems (MINT) framework [99];
2. We built single-disk-swapping and multiple-disk-swapping mechanisms to improve the reliability of various load-skewing techniques;
3. We studied the impacts of the disk swapping schemes on the reliability of MAID.

The remainder of this chapter is organized as follows. Section 3.2 outlines the design and implementation of the MINT reliability modeling framework, which relies on disk utilization, temperature, and power-state transition frequency. Section 3.3 presents reliability models for the MAID and PDC schemes along with the preliminary results.

3.2 The MINT Reliability Model

3.2.1 Framework

Fig. 3.1 depicts the framework of the MINT reliability model for parallel disk systems with energy conservation schemes. MINT is composed of a single-disk reliability model, a system-level reliability model, and three reliability-affecting factors - temperature, disk state transition frequency (hereinafter referred to as frequency), and utilization. Many energy-saving schemes (e.g., PDC [58] and MAID [19]) inherently affect reliability-related factors like disk utilization and transition frequency. Given an energy optimization mechanism, MINT first translates data access patterns into the two reliability-affecting factors - frequency and utilization. The single-disk reliability model can derive an individual disk's annual failure rate from utilization, power-state transition frequency, age, and temperature. Each disk's reliability is used as input to the system-level reliability model that estimates the annual failure rate of parallel disk systems.

For simplicity and without loss of generality, we consider four reliability-related factors in MINT. This assumption does not necessarily indicate that disk utilization, age, temperature, and power-state transitions are the only parameters affecting disk reliability. Other factors having impacts on reliability include handling, humidity, voltage variation, vintage, duty cycle, and altitude [25]. If a new factor has to be integrated into MINT, we can extend the single-disk reliability model described in Section 3.2.5.
Since the infant mortality phenomenon is outside the scope of this study, we pay attention to disks that are more than one year old.

3.2.2 Impacts of Utilization on Disk Annual Failure Rate

Disk utilization can be characterized as the fraction of active time of a disk drive out of its total powered-on time [61]. In our single-disk reliability model, the impact of disk utilization on reliability provides a good baseline characterization of the disk annual failure rate (AFR).

[Figure 3.1: The Framework of the MINT Reliability Model]

[Figure 3.2: Utilization Impacts on AFR (by Google)]

Using field failure data collected by Google, Pinheiro et al. studied the impact of utilization on AFR across different disk age groups. Pinheiro et al. categorized disk utilization into three levels - low, medium, and high. Fig. 3.2 shows the AFRs of disks whose ages are 3 months, 6 months, 1 year, 2 years, 3 years, 4 years, and 5 years under the three utilization levels. Since the single-disk reliability model needs a baseline AFR derived from a numerical value of utilization, we make use of a polynomial curve-fitting technique to model the baseline value of a single disk's AFR as a function of utilization. Thus, the baseline value (i.e., BaseValue in Eq. 3.3) of AFR for a disk can be calculated from the disk's utilization. For example, Fig. 3.3 shows the AFR value of a 3-year-old disk as a function of its utilization.

[Figure 3.3: 3-Year-Old HDD Utilization Impacts on AFR]

The curve plotted in Fig. 3.3 can be modeled as a utilization-reliability function described by Eq. 3.1 below:

R(u) = 4.167 \times 10^{-7}u^4 - 7.5 \times 10^{-5}u^3 + 5.968 \times 10^{-3}u^2 - 2.575 \times 10^{-1}u + 9.3, for all u \in [0, 100]   (3.1)

where R(u) represents the AFR value as a function of a certain disk utilization u. With Eq. 3.1 in place, one can readily derive the annual failure rate of a disk if its age and utilization are given. For example, for a 3-year-old disk with 50% utilization (i.e., u = 50%), we can obtain the AFR value of this disk as R(u) = 4.8%. Fig. 3.3 suggests that, unlike the conclusions drawn in a previous study (see [78]), a low disk utilization does not necessarily lead to a low AFR. For instance, given a 3-year-old disk, the AFR value under 30% utilization is even higher than the AFR under 80% utilization.

3.2.3 Impacts of Temperature on Disk Annual Failure Rate

Temperature is often considered the most important environmental factor affecting disk reliability. Field failure data of disks in a Google data center (see Fig. 3.4) shows that in most cases when temperatures are higher than 35°C, increasing temperatures lead to an increase in disk annual failure rates. On the other hand, Fig. 3.4 indicates that in the low and middle temperature ranges, the failure rates decrease when temperature increases [61].

Growing evidence shows that disk reliability should reflect disk drives operating under environmental conditions like temperature [25]. Since temperature (e.g., measured 1/2"
from the case) apparently affects disk reliability, temperature can be considered as a multiplier (hereinafter referred to as the temperature factor) to baseline failure rates into which environmental factors are integrated [25]. Given a temperature, one must decide the corresponding temperature factor (see TemperatureFactor in Eq. 3.3) that can be multiplied with the base failure rates. Using Google's field failure data plotted in Fig. 3.4, we attempted to calculate temperature factors under various temperature ranges for disks of different ages. More specifically, Fig. 3.4 shows the annual failure rates of disks whose ages range from 3 months to 4 years. For disk drives whose ages fall in each age range, we model the temperature factor as a function of drive temperature. Thus, six temperature-factor functions must be derived.

Now we explain how to determine a temperature factor for each temperature under each age range. Let us choose 25°C as the base temperature value, because the room temperatures of data centers are in many cases set to 25°C by cooling systems. Thus, the temperature factor is 1 when the temperature is set to the base temperature of 25°C. Letting T denote the average temperature, we define the temperature factor for temperature T as T/25 if T is larger than 25°C. When T exceeds 45°C, the temperature factor becomes a constant (i.e., 1.8 = 45/25). Due to space limits, we only show how temperature affects the temperature factor of a 3-year-old disk in Fig. 3.4. Note that the temperature-factor functions for disks in other age ranges can be modeled in a similar way.

[Figure 3.4: Average Drive Temperature Impacts on AFR (by Google)]

Fig. 3.5 shows the temperature-factor function derived from Fig. 3.4 for 3-year-old disks. We can observe from Fig. 3.4 that AFRs increase to 215% of the base value when the temperature is between 40 and 45°C.

[Figure 3.5: Temperature-Factor Function of 3-Year-Old HDDs]

3.2.4 Power-State Transition Frequency

To conserve energy in single disks, power management policies turn idle disks from the active state into standby. The disk power-state transition frequency (or frequency for short) is often measured as the number of power-state transitions (i.e., from active to standby or vice versa) per month. The reliability of an individual disk is affected by power-state transitions; therefore, the increase in failure rate as a function of power-state transition frequency has to be added to a baseline failure rate (see Eq. 3.3 in Section 3.2.5). We define the increase in AFR due to power-state transitions as the power-state transition frequency adder (frequency adder for short). The frequency adder is modeled by combining the disk spindle start/stop failure rate adders described by IDEMA [78] and the PRESS model [90]. Again, we focus on 3-year-old disk drives. Fig. 3.6 demonstrates frequency adder values as a function of power-state transition frequency. Fig. 3.6 shows that a high frequency leads to a high frequency adder to be added to the base AFR value. We used a quadratic curve-fitting technique to model the frequency adder function (see Eq. (3.2)) plotted in Fig. 3.6.

R(f) = 1.51 \times 10^{-6}f^2 - 1.09 \times 10^{-5}f + 1.39 \times 10^{-2}, f \in [0, 100]   (3.2)

where f is a power-state transition frequency and R(f) represents an adder to the base AFR value.
For example, suppose the transition frequency is 300 per month; then the base AFR value needs to be increased by 1.33%.

[Figure 3.6: Impacts of Transition Frequency on the Frequency Adder of 3-Year-Old HDDs]

3.2.5 Single Disk Reliability Model

Single-disk reliability cannot be accurately described by a single-valued parameter, because disk drive reliability is affected by multiple factors (see Sections 3.2.2, 3.2.3, and 3.2.4). Though recent studies attempted to consider multiple reliability factors (see, for example, PRESS [90]), few prior studies investigated the details of combining the multiple reliability factors. We model the single-disk reliability in terms of annual failure rate (AFR) in the following three steps. We first compute a baseline AFR as a function of disk utilization. We then use the temperature factor as a multiplier to the baseline AFR. Finally, we add a power-state transition frequency adder to the baseline value of AFR. Hence, the failure rate R of an individual disk can be expressed as:

R = \alpha \cdot BaseValue \cdot TemperatureFactor + \beta \cdot FrequencyAdder   (3.3)

where BaseValue is the baseline failure rate derived from disk utilization (see Section 3.2.2), TemperatureFactor is the temperature factor (or temperature multiplier; see Section 3.2.3), FrequencyAdder is the power-state transition frequency adder to the base AFR (see Section 3.2.4), and \alpha and \beta are two coefficients of reliability R. If reliability R is more sensitive to frequency than to utilization and temperature, then \beta must be greater than \alpha. Otherwise, \beta is smaller than \alpha. In either case, \alpha and \beta can be set in accordance with R's sensitivities to utilization, temperature, and frequency. In our experiments, we assume that all three reliability-related factors are equally important (i.e., \alpha = \beta = 1). Ideally, extensive field tests would allow us to analyze and tune the two coefficients. Although \alpha and \beta are not fully evaluated by field testing, the reliability results are valid for the following two reasons: first, we have used the same values of \alpha and \beta to evaluate the impacts of the two energy-saving schemes on disk reliability (see Section 3.3.1); second, the failure-rate trend of a disk when \alpha and \beta are set to 1 is very similar to that of the same disk when the values of \alpha and \beta do not equal 1.

With Eq. 3.3 in place, we can analyze a disk's reliability in terms of annual failure rate (AFR). Fig. 3.7 shows the AFR of a three-year-old disk when its utilization is in the range between 20% and 80%. We observe from Fig. 3.7 that increasing the temperature from 35°C to 40°C gives rise to a significant increase in AFR. Unlike temperature, a power-state transition frequency in the range of a few hundred per month has marginal impact on AFR. It is expected that when the transition frequency is extremely high, AFR becomes more sensitive to frequency than to temperature.
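As an illustration of how the three factors combine, the following minimal Python sketch evaluates Eqs. 3.1-3.3 for a 3-year-old drive. It assumes equal weights (\alpha = \beta = 1), assumes a temperature factor of 1 below the 25°C baseline, and uses function names that are illustrative rather than taken from the dissertation.

# Minimal sketch of the MINT single-disk AFR model (Eqs. 3.1-3.3) for a
# 3-year-old drive; the coefficients come from the fitted curves above,
# while the function names and example inputs are illustrative assumptions.

def base_afr(u):
    """Baseline AFR (%) from utilization u in [0, 100] (Eq. 3.1)."""
    return (4.167e-7 * u**4 - 7.5e-5 * u**3 + 5.968e-3 * u**2
            - 2.575e-1 * u + 9.3)

def temperature_factor(t_celsius):
    """Multiplier relative to the 25 C baseline (Section 3.2.3);
    below 25 C we assume the base factor of 1, and the factor is
    capped at 45/25 = 1.8 above 45 C."""
    if t_celsius <= 25.0:
        return 1.0
    return min(t_celsius, 45.0) / 25.0

def frequency_adder(f):
    """AFR adder (%) from f power-state transitions per month (Eq. 3.2)."""
    return 1.51e-6 * f**2 - 1.09e-5 * f + 1.39e-2

def single_disk_afr(u, t_celsius, f, alpha=1.0, beta=1.0):
    """Combined AFR (%) of one disk (Eq. 3.3)."""
    return alpha * base_afr(u) * temperature_factor(t_celsius) + beta * frequency_adder(f)

if __name__ == "__main__":
    # Example: 50% utilization, 35 C, 250 power-state transitions per month.
    print(round(single_disk_afr(50.0, 35.0, 250.0), 2))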
[Figure 3.7: 3-Year-Old HDD Combined Factors Impacts on AFR (Single Disk Reliability Model)]

3.3 Reliability Models for MAID and PDC

3.3.1 MAID - Massive Array of Idle Disks

Background

The MAID (Massive Array of Idle Disks) technique - developed by Colarelli and Grunwald - aims to reduce the energy consumption of large disk arrays while maintaining acceptable I/O performance [19]. MAID relies on data temporal locality to place replicas of active files on a subset of cache disks, thereby allowing other disks to spin down. Fig. 3.8 shows that MAID maintains two types of disks - cache disks and data disks. Popular files are copied from data disks into cache disks, where the LRU policy is implemented to manage data replacement in the cache disks. Replaced data is discarded by a cache disk if the data is clean; dirty data has to be written back to the corresponding data disk. To prevent cache disks from being overly loaded, MAID avoids copying data to cache disks that have reached their maximum bandwidth.

[Figure 3.8: MAID System Structure]

Three components integrated in the MAID model include: (1) a power management policy that switches idle disks into the standby mode if the disks sit idle for a certain period of time; (2) a data placement module that either linearly places successive blocks on a disk drive or uniformly distributes data blocks across multiple drives; and (3) a cache disk controller that determines the number of disks serving as cache disks [19].

Modeling Utilization of Disks in MAID

Recall that the annual failure rate of each disk can be calculated using disk age, utilization, operating temperature, and power-state transition frequency. To model the reliability of a disk array equipped with MAID, we first have to address the issue of modeling the disk utilization used to calculate base annual failure rates. In this subsection, we develop a utilization model capturing the behaviors of a MAID-based disk array. The utilization model takes file access patterns as input and calculates the utilization of each disk in the disk array.

Disk utilization is computed as the fraction of active time of a disk drive out of its total powered-on time. Now we describe a generic way of modeling the utilization of a disk drive. Let us consider a sequence of I/O accesses with N I/O phases. We denote T_i as the length or duration of the ith I/O phase. Without loss of generality, we assume that the file access pattern in an I/O phase remains unchanged. The file access pattern, however, may vary across phases. The relative length or weight of the ith phase is expressed as W_i = T_i/T, where T = \sum_{i=1}^{N} T_i is the total length of all the I/O phases. Supposing the utilization of a disk in the ith phase is \rho_i, we can write the overall utilization \rho of the disk as the weighted sum of the utilization in all the I/O phases. Thus, we have

\rho = \sum_{i=1}^{N} (W_i \cdot \rho_i) = \sum_{i=1}^{N} \left( \frac{T_i}{T} \cdot \rho_i \right)   (3.4)

Let F_i = \{f_{i1}, f_{i2}, \ldots, f_{in_i}\} be the set of n_i files residing on the disk in the ith phase. The utilization \rho_i (see Eq. 3.4) of the disk in phase i is contributed by I/O accesses to each file in set F_i. Thus, \rho_i in Eq.
Thus, ρ_i in Eq. 3.4 can be written as:

ρ_i = \sum_{j=1}^{n_i} (λ_{ij} · s_{ij})   (3.5)

where λ_{ij} is the access rate of file f_{ij} in F_i and s_{ij} is the mean service time of file f_{ij}. Note that I/O accesses to each file in set F_i are modeled as a Poisson process; the file access rate and service time in each phase i are given a priori.

We assume that there are n hard drives and k phases. In the lth phase, let f_{ijl} be the jth file on the ith disk, where i ∈ {1, 2, ..., n}, j ∈ {1, 2, ..., m_i}, l ∈ {1, 2, ..., k}. We have:

F_{1l} = {f_{11l}, f_{12l}, ..., f_{1 m_1 l}}
...
F_{nl} = {f_{n1l}, f_{n2l}, ..., f_{n m_n l}}   (3.6)

where m_i is the number of files on the ith disk and F_{il} is the set of all files on that disk. Since frequently accessed files are duplicated to cache disks, we model below the updated file placement after copying the frequently accessed files:

F'_{1l} = {f'_{11l}, f'_{12l}, ..., f'_{1 m'_1 l}}
...
F'_{nl} = {f'_{n1l}, f'_{n2l}, ..., f'_{n m'_n l}}   (3.7)

where m'_i is the number of files on the ith disk, f'_{ijl} is the jth file in the lth phase, and F'_{il} is the set of files on the same disk after the files are copied. We can calculate the utilization contributed by the jth file in the lth phase on the ith disk as ρ_{ijl} = λ_{ijl} · t. We assume that ρ_{i1l} ≥ ρ_{i2l} ≥ ... ≥ ρ_{i m_i l}, meaning that files are placed in descending order of utilization. After the frequently accessed files are copied to the cache disks, we denote the updated utilization contributed by the files, including the copied ones, as ρ'_{i1l} ≥ ρ'_{i2l} ≥ ... ≥ ρ'_{i m'_i l}. It is intuitive that the utilization of disk i should be smaller than 1. When a disk reaches its maximum utilization, the disk also reaches its maximum bandwidth, denoted B_i. For both cache and data disks, we express the utilization of the ith disk in phase l as:

ρ'_{il} = (I/O time + copying time)/T = \sum_{j=1}^{m'_i} ρ'_{ijl} + (copying time)/T   (3.8)

where T is the time interval of the lth I/O phase. The first and second terms on the right-hand side of Eq. 3.8 are the utilizations caused by accessing files and by duplicating files from data disks to cache disks, respectively.

Since files on cache disks are duplicated from data disks, frequently accessed files must be read from data disks and written to cache disks. As such, we must consider the disk utilization incurred by this data duplication process. To quantify the utilization overhead caused by data replicas, we define a set FM^out_{il} of files copied from the ith data disk to cache disks in phase l. Similarly, we define a set FM^in_{il} of files copied to the ith cache disk from data disks in phase l. With respect to the ith data disk, the utilization ρ'_{il-data} in phase l is the sum of the utilization caused by accessing files on the data disk and by reading files to be duplicated to cache disks. Thus, ρ'_{il-data} can be written as:

ρ'_{il-data} = \sum_{j=1}^{m'_i} ρ'_{ijl} + \sum_{j ∈ FM^out_{il}} t_{ijl} / T   (3.9)

where the first and second terms on the right-hand side of Eq. 3.9 are the utilizations of accessing files and of reading files from the data disk to make replicas on cache disks, respectively. When it comes to the ith cache disk, the utilization ρ'_{il-cache} in phase l is the sum of the utilization contributed by accessed files and by file replicas written to the cache disk.
Thus, ρ'_{il-cache} can be written as:

ρ'_{il-cache} = \sum_{j=1}^{m'_i} ρ'_{ijl} + \sum_{j ∈ FM^in_{il}} t_{ijl} / T   (3.10)

where the first and second terms on the right-hand side of Eq. 3.10 are the utilizations of accessing files and of writing files to the cache disk to make replicas, respectively.

Modeling Power-State Transition Frequency for MAID

Eq. 3.3 in Section 3.2.5 shows that the power-state transition frequency adder is an important factor in modeling the disk annual failure rate. The number of power-state transitions largely depends on I/O workload conditions in addition to the behavior of MAID. In this subsection, we derive the number of power-state transitions from file access patterns. We define T_BE as the disk break-even time - the minimum idle time required to compensate for the cost of entering the disk standby mode (T_BE values are usually between 10 and 15 seconds). Given the file access patterns of the ith phase for a disk, we need to calculate the number φ_i of idle periods that are longer than the break-even time T_BE. The number of power-state transitions during phase i is 2φ_i, because there is a spin-down at the beginning of each long idle period and a spin-up at its end. For an access pattern with N I/O phases, the total number of power-state transitions φ can be expressed as φ = 2 · \sum_{i=1}^{N} φ_i.

We model a workload condition in which I/O burstiness can be leveraged by the dynamic power management policy to turn idle disks into the standby mode to save energy. To model I/O burstiness, we assume the first I/O requests of files within an access phase arrive within a short period of time, during which disks are too busy to be switched into standby. After the period of high I/O load, there is an increasing number of opportunities to place disks into the standby mode. This workload model allows MAID to achieve high energy efficiency at the cost of disk reliability, because it leads to a large number of power-state transitions. To conduct a stress test on the reliability of MAID, we assume that the first requests of files on a disk arrive at the same time. For the first few time units, the workload is so high that no data disks can be turned into standby. As the I/O load decreases, some data disks may be switched to standby when idle time intervals are longer than T_BE. In this workload model, MAID achieves the best energy efficiency with the worst reliability in terms of the number of power-state transitions.
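The transition-frequency model above can be sketched as follows: for one disk and one phase, count the idle gaps longer than the break-even time and double the count, since each such gap triggers one spin-down and one spin-up. The timestamps, per-request service time, and T_BE value in the example are illustrative assumptions.

```python
# Minimal sketch of the MAID transition-frequency model: phi_i counts idle
# gaps longer than the break-even time T_BE; each gap contributes one
# spin-down and one spin-up, so transitions = 2 * phi_i.

def transitions_in_phase(request_times, service_time, t_be):
    """Count power-state transitions in one I/O phase.

    request_times: sorted arrival times (seconds) of requests on this disk.
    service_time:  assumed per-request service time (seconds).
    t_be:          break-even time (seconds), typically 10-15 s.
    """
    phi = 0
    for prev, curr in zip(request_times, request_times[1:]):
        idle = curr - (prev + service_time)  # gap between end of one request and the next arrival
        if idle > t_be:
            phi += 1
    return 2 * phi

# Example with hypothetical timestamps (a burst followed by sparse accesses):
times = [0, 1, 2, 3, 60, 200, 600]
print(transitions_in_phase(times, service_time=0.5, t_be=12.0))
```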
3.3.2 PDC - Popular Disk Concentration

Background

The PDC (Popular Data Concentration) technique proposed by Pinheiro and Bianchini migrates frequently accessed data to a subset of disks in a disk array [58]. Fig. 3.9 illustrates the basic idea behind PDC: the most popular files are stored on the far left disk, while the least popular files are stored on the far right disk. PDC can rely on file popularity and migration to conserve energy in disk arrays, because many network servers exhibit I/O loads with highly skewed data access patterns. Migrating popular files to a subset of disks skews the disk I/O load towards this subset, offering the other disks more opportunities to be switched to standby to conserve energy. To avoid performance degradation of the disks storing popular data, PDC migrates data onto a disk only until its load approaches the maximum bandwidth.

The main difference between MAID and PDC is that MAID makes data replicas on cache disks, whereas PDC lays data out across the disk array without generating any replicas. If one of the cache disks in MAID fails, the files residing on the failed cache disk can still be found on the corresponding data disks. In contrast, any failed disk in PDC inevitably leads to data loss. Although PDC tends to have lower reliability than MAID, PDC does not need to trade disk capacity for improved energy efficiency and I/O performance.

Figure 3.9: PDC System Structure

Modeling Utilization of Disks in PDC

Since frequently accessed files are periodically migrated to a subset of disks in the disk array, we have to take into account the disk utilization incurred by file migrations. Hence, the ith disk's utilization ρ'_{il} during phase l is computed as the sum of the utilization contributed by accessing files residing on disk i and the utilization introduced by migrating files to/from disk i. Thus, we can express utilization ρ'_{il} as:

ρ'_{il} = \sum_{j=1}^{m'_i} ρ'_{ijl} + (migration time)/T   (3.11)

where T is the time interval of I/O phase l. The first and second terms on the right-hand side of Eq. 3.11 are the utilizations caused by accessing files and by migrating files, respectively. To quantify the utilization introduced by the file migration process (see the second term on the right-hand side of Eq. 3.11), we define two sets of files for the ith disk in the lth I/O phase. The first set FM^out_{il} contains all the files migrated from disk i to other disks during the lth phase. Similarly, the second set FM^in_{il} consists of the files migrated from other disks to disk i in phase l. Now we can formally express the utilization of disk i in phase l using the two file sets FM^out_{il} and FM^in_{il}. Thus,

ρ'_{il} = \sum_{j=1}^{m'_i} ρ'_{ijl} + \sum_{j ∈ FM^out_{il} ∪ FM^in_{il}} t_{ijl} / T_l   (3.12)

where the second term on the right-hand side of Eq. 3.12 is the utilization incurred by (1) migrating files in set FM^out_{il} from disk i to other disks and (2) migrating files in set FM^in_{il} from other disks to disk i during phase l.

Modeling Power-State Transition Frequency for PDC

We use the same approach described in Section 3.3.1 to model the power-state transition frequency for PDC. Unlike MAID, PDC allows each disk to receive migrated data from other disks. Under PDC, the disks storing the most popular files are most likely to be kept in the active mode.

3.3.3 Results Evaluation

Experimental Setup

We developed a simulator to validate the reliability models for MAID and PDC. It would be unfair to compare the reliability of MAID and PDC using the same number of disks, since MAID trades extra cache disks for high energy efficiency. To make a fair comparison, we considered two system configurations for MAID. The first configuration, referred to as MAID-1, employs existing disks in a parallel disk system as cache disks to store frequently accessed data; this configuration improves the energy efficiency of the parallel disk system at the cost of capacity. In contrast, the second configuration, called MAID-2, needs extra disks to be added to the disk system to serve as cache disks. Our experiments started by evaluating the reliability of PDC as well as MAID-1 and MAID-2.
Then, we studied the reliability impacts of the proposed disk-swapping strategies on both PDC and MAID. We simulated PDC, MAID-1, and MAID-2, along with the disk-swapping strategies, in the parallel disk systems described in Table 3.1. For the MAID-1 configuration, there are 5 cache disks and 15 data disks. In the disk system for the MAID-2 configuration, there are 5 cache disks and 20 data disks. As for PDC, we fixed the number of disks to 20. Thus, we studied MAID-1 and PDC using a parallel disk system with 20 disks; we used a similar disk system with 25 disks in total to investigate MAID-2. We varied the file access rate in the range between 0 and 10^6 times per month. The average file size considered in our experiments is 300 KB. The base operating temperature is set to 35°C. In this study, we focused on read-only workloads. Nevertheless, the MINT model can readily be extended to capture the characteristics of read/write workloads.

Table 3.1: The characteristics of the simulated parallel disk systems used to evaluate the reliability of PDC, MAID-1, and MAID-2.

Energy-Efficiency Scheme | Number of Disks | File Access Rate (No. per month) | File Size (KB)
PDC | 20 data (20 in total) | 0-10^6 | 300
MAID-1 | 15 data + 5 cache (20 in total) | 0-10^6 | 300
MAID-2 | 20 data + 5 cache (25 in total) | 0-10^6 | 300

Figure 3.10: Utilization comparison of PDC and MAID - impacts of access rate (up to 500/month) on utilization.

Preliminary Results

In terms of utilization, Fig. 3.10 shows that when the file access rate increases, the utilization of both PDC and MAID increases as well. However, rather than increasing as smoothly as that of MAID, which reaches nearly 50%, the utilization of PDC increases sharply, hitting nearly 90%. The main reason is that PDC is kept busy migrating data into and out of disks according to the popularity of the data. When the file access rate increases, more files migrate upward to the more popular disks while others migrate downward to the less popular disks, so the PDC system spends additional utilization on this internal data migration on top of serving the requests themselves. On the other hand, after the popular data is copied to the cache disks, the data disks in MAID no longer need to handle those requests; the increase of the MAID curve is mainly driven by the utilization of the cache disks. Going one step further, Fig. 3.11 shows the annual failure rates of MAID-1, MAID-2, and PDC.

Figure 3.11: AFR comparison of PDC and MAID - impacts of access rate on AFR (temperature = 35°C).

We observe from Fig. 3.11 that the AFR value of PDC keeps increasing from 5.6% to 8.3% when the file access rate is larger than 150. We attribute this trend to the high disk utilization caused by data migrations. More interestingly, when the file access rate is lower than 150, the AFR of PDC slightly decreases from 5.9% to 5.6% as the access rate increases from 5 to 150. This result can be explained by the nature of the utilization function, which is concave rather than linear. The concave nature of the utilization function is consistent with the empirical results reported in [61].
When the file access rate reaches 150, the disk utilization is approximately 50%, which is the turning point of the utilization function. Unlike PDC, MAID's AFR continues to decrease from 6.3% to 5.8% as the file access rate increases. This declining trend can be explained by two reasons. First, increasing the file access rate reduces the number of power-state transitions. Second, the disk utilization stays close to 40%, which lies in the declining part of the curve.

Figure 3.12: Utilization comparison of PDC and MAID - impacts of access rate (up to 1000/month) on utilization.

When the access rate is extended to 1000 per month, as shown in Fig. 3.12, the utilization of PDC gets close to 90%, while those of MAID-1 and MAID-2 keep rising. The reason that the utilization of MAID-1 grows faster than that of MAID-2 is that, under the weighted-sum method, the fewer disks a system has, the more weight each disk carries. As the systems' utilization changes, the AFR changes accordingly. One important observation from Fig. 3.13 and Fig. 3.14 is that when the access rate is higher than 700 times per month, the AFR of MAID-1 becomes higher than that of MAID-2. The reason is that the utilization of MAID-1 keeps rising above 60% when the access rate exceeds 700 times per month, as observed in Fig. 3.12; and according to Fig. 3.3, the AFR stops rising once utilization goes higher than 60%. Hence, we can predict that after the access rate hits 900 per month, the AFR of MAID-2 is also expected to stop rising.

Figure 3.13: AFR comparison of PDC and MAID - impacts of access rate (up to 1000/month) on AFR (temperature = 35°C).

Figure 3.14: AFR comparison of PDC and MAID - impacts of access rate (up to 1000/month) on AFR (temperature = 40°C).

Figure 3.15: AFR comparison of PDC and MAID - impacts of temperature on AFR (access rate = 200/month).

When we fix the access rate at 200 times per month and vary the temperature from 25°C to 45°C, as shown in Fig. 3.15, it is easy to see that as the temperature rises, the AFR of all three systems goes down in the range of 25°C to 30°C and goes up in the range of 30°C to 45°C. This follows the trend derived from the Google data [61]. Further, we notice that the AFR of PDC is lower than that of MAID, and that as the temperature rises, the AFR of MAID grows faster than that of PDC. On the contrary, when the access rate is fixed at 450 times per month, as shown in Fig. 3.16, the AFR of PDC grows higher and faster than that of MAID. The two main reasons for these opposite results are utilization and frequency. When the access rate is 200 times per month, even though the utilization of PDC is higher than that of MAID, it still stays in the descending part of the utilization curve. From Fig. 3.3, it is clear that higher utilization leads to lower AFR in the descending part of the curve.
Figure 3.16: AFR comparison of PDC and MAID - impacts of temperature on AFR (access rate = 450/month).

When the access rate is 450 times per month, the utilization of PDC approaches 90% because of the data migration, which is far higher than that of MAID, as shown in Fig. 3.10. At this point PDC lies in the ascending part of the utilization curve, while MAID is close to the bottom of the curve. Also, acting as an adder, the transition frequency makes the AFR of PDC grow even faster.

3.4 Summary

In recognition that existing disk reliability models cannot be used to evaluate the reliability of energy-efficient disk systems, we propose a new model called MINT to evaluate the reliability of a disk array equipped with reliability-affecting energy conservation techniques. We first model the impacts of disk utilization and power-state transition frequency on the reliability of each disk in a disk array. We then derive the reliability of an individual disk from its utilization, age, temperature, and power-state transition frequency. Finally, we use MINT to study the reliability of disk arrays coupled with the MAID (Massive Array of Idle Disks) and PDC (Popular Data Concentration) techniques.

Chapter 4
MREED: Reliability Analysis of An Energy-Aware RAID System

We develop a mathematical model, MREED, to quantitatively evaluate the failure rate of energy-efficient parallel storage systems. The Power-Aware Redundant Array of Inexpensive Disks (PARAID) aims to reduce the energy use of commodity server-class disks without specialized hardware. PARAID uses a skewed striping pattern to adapt to the system load by changing the number of powered-on disks. By spinning down disks during light workloads, PARAID can reduce power consumption while still meeting performance demands. We show that MREED can be used to estimate the failure rate of a five-disk PARAID-0 system. We validate the accuracy of MREED using the DiskSim simulator. Our approach shows that MREED can rely on file access patterns to estimate system utilization correctly. Furthermore, even though PARAID may achieve reasonable reliability, our model shows that PARAID's reliability is affected by data locality.

4.1 Motivations

Existing reliability models for conventional parallel and distributed disk systems do not consider energy-saving issues or data-striping mechanisms. In this chapter, we first study the reliability of a parallel disk system equipped with the PARAID [85] technique by employing the Mathematical Reliability model for Energy-Efficient RAID systems, called MREED. As a mathematical model, MREED has the advantage of presenting the reliability trend of energy-aware storage systems. However, it is challenging to validate the MREED model. To address the correctness issue of MREED, we validate the access-rate-utilization model in MREED, which converts file access rates to the utilization of the storage system. Finally, we study the impacts of the I/O load skewing technique - gear shifting - on the reliability of PARAID, a well-known energy-aware data-striping storage system.

Existing energy conservation techniques can yield significant energy savings in disks.
While several energy conservation schemes, such as cache-based energy-saving approaches, normally have marginal impact on disk reliability, many energy-saving schemes (e.g., dynamic power management and workload skewing techniques) inevitably have noticeable adverse impacts on storage systems [12][90]. Examples include dynamic power management (DPM) techniques, which save energy through frequent disk spin-downs and spin-ups that in turn can shorten disk lifetime [22][34][46]; redundancy techniques [60][102][82][89]; workload skewing [54][38][98]; and multi-speed settings [32][76]. We pay attention to the reliability issue of RAID systems. Existing energy conservation techniques cannot be applied to RAID systems for the following reasons:

- Conventional RAIDs balance I/O load across all disks in the array to maximize disk parallelism and performance, meaning that all disks are spinning even under a light load; no opportunity is offered to spin down any of the disks;

- Server-class disks are not designed for frequent power cycles, which significantly reduce life expectancy;

- Server systems cannot rely on caching and dynamic power management because the servers are too busy to have long idle times.

In this chapter, our contributions are summarized as follows:

1. We propose a reliability model, MREED, for Power-Aware RAID (i.e., an energy-aware data-striping parallel storage system);

2. We introduce Weibull distribution analysis to MREED. Using the utilization of a storage system as an input, we can estimate and forecast the annual failure rate (a.k.a. AFR) of the system;

3. We validate the access-rate-utilization model of MREED;

4. We study the impacts of the gear-shifting scheme on the reliability of PARAID.

We study the impacts of the I/O load skewing technique especially on PARAID-0, which is an energy-aware RAID-0 system. Experimental results show that gear shifting affects the reliability of parallel disks for two reasons: first, disks working at all gears tend to have higher I/O utilization than disks that work only at high gears; second, disks with high utilization are likely to have a high risk of breaking down.

The remainder of this chapter is organized as follows. Section 4.2 presents an overview of the MREED model. In Section 4.3, we apply the MREED model to quantitatively estimate the reliability of PARAID. Section 4.4 presents experimental results and performance evaluation. Finally, Section 4.5 concludes the chapter with discussions.

4.2 The MREED Modeling Framework

4.2.1 Overview

MREED is a framework developed to model the reliability of parallel disk systems employing energy conservation techniques. In the MREED framework, we evaluate the reliability impacts of a specific energy-saving technique - Power-Aware RAID. One critical module in MREED models the impact of energy-efficient schemes on the utilization and power-state transition frequency of each disk in a parallel disk system. Another important module in MREED calculates the annual failure rate of each disk as a function of the disk's utilization and power-state transition frequency. Given the annual failure rate of each disk in the parallel disk system, MREED is able to derive the reliability of an energy-efficient parallel disk system. As such, we use MREED to study the reliability of a parallel disk system equipped with the PARAID technique. Fig. 4.1 outlines the MREED reliability modeling framework.
Figure 4.1: Overview of the MREED reliability modeling methodology (an energy-conservation RAID technique such as PARAID and its access patterns determine temperature, utilization, and power-transition frequency; together with active hours per year, Weibull distribution analysis yields each disk's probability of failure and annual failure rate, which feed a system-level reliability model for the energy-efficient RAID system).

MREED is composed of a Weibull-based disk reliability model, a system-level reliability model, and three reliability-affecting factors: temperature, power-state transition frequency (hereinafter referred to as transition frequency or frequency), and utilization. Many energy-saving schemes inherently affect reliability-related factors like disk utilization and transition frequency. Given an energy-optimization mechanism (e.g., PARAID [85]), MREED first converts data access patterns into the two reliability-affecting factors - frequency and utilization. The Weibull-based disk reliability model derives an individual disk's probability of failure from its utilization and power-on hours per year, because these parameters are key reliability-affecting factors. Each disk's reliability is used as input to the system-level reliability model that evaluates the annual failure rate of parallel disk systems.

For simplicity and without losing generality, we consider in MREED three reliability-related factors, namely disk utilization, temperature, and power-state transitions. This assumption does not indicate by any means that only three parameters affect disk reliability. Other factors with impacts on reliability include handling, humidity, voltage variation, vintage, duty cycle, and altitude [25]. If a new factor has to be taken into account, one can extend the single-disk reliability model by integrating the new factor with the other reliability-affecting factors in MREED. Since the infant mortality phenomenon is outside the scope of this study, we pay attention to disks that are no less than one year old.

Single-disk reliability cannot be accurately described by a single-valued parameter because disk drive reliability is affected by multiple factors. Three major factors affect disk reliability:

1. Disk utilization, which can be characterized as the fraction of active time of a disk drive out of its total powered-on time. The baseline value of AFR for a disk (i.e., R_BaseValue in Eq. 4.1), which is derived from the Weibull distribution analysis, can be calculated from the disk's utilization. The details are discussed in Subsection 4.2.2;

2. Temperature, which acts as a multiplier to the base failure rate in the MREED model. The temperature factors shown in Table 4.1 were reported by the Seagate Storage Group in Longmont, Colorado [20]. From Table 4.1, we observe that as the temperature rises, the derating factor and the MTBF decrease noticeably. In our research, we use the derating factor (DF) as the temperature factor of AFR (i.e., α in Eq. 4.1). For example, at 30°C, the DF value is 0.78, which indicates that the AFR at this temperature is 22% higher than the AFR at 25°C.
Table 4.1: Temperature Factor

Temperature (°C) | Acceleration Factor | Derating Factor | Adjusted MTBF
25 | 1.0000 | 1.00 | 232,140
26 | 1.0507 | 0.95 | 220,553
30 | 1.2763 | 0.78 | 181,069
34 | 1.5425 | 0.65 | 150,891
38 | 1.8552 | 0.54 | 125,356
42 | 2.2208 | 0.45 | 104,463
46 | 2.6465 | 0.38 | 88,123

The main reason that we only use partial data from the report (25°C to 46°C) is that we believe cooling systems will prevent the temperature from staying above 46°C for long.

3. Power-state transition frequency, which is measured as the number of power-state transitions (i.e., from active to standby or vice versa) per month. The reliability of an individual disk is affected by power-state transitions and, therefore, the increase in failure rate as a function of power-state transition frequency has to be added to the baseline failure rate (see Eq. 4.1 in the next subsection).

Hence, the failure rate R of an individual disk can be expressed as:

R = R_BaseValue · α + γ · R_FrequencyAdder   (4.1)

where R_BaseValue is the baseline failure rate derived from disk utilization, α is the temperature factor, γ is a coefficient of reliability R, and R_FrequencyAdder is the power-state transition frequency adder to the baseline failure rate, which can be calculated by Eq. 4.2 [99]:

R(f) = 1.51×10^{-6} f^2 - 1.09×10^{-5} f + 1.39×10^{-2},  f ∈ [0, 500]   (4.2)

where f is the power-state transition frequency and R(f) represents an adder to the base AFR value. For example, suppose the transition frequency is 300 per month; the base AFR value then needs to be increased by 1.33%.

4.2.2 Weibull Distribution Analysis

Weibull distribution analysis is a leading method for fitting life data. The primary advantage of Weibull analysis is its ability to provide accurate failure analysis and failure forecasts with extremely small samples [10]. It is now widely used in reliability engineering and failure analysis, covering mechanical, electronic, materials, and human failures [21]. The Weibull reliability function describes the probability of survival as a function of time and is given by Eq. 4.3:

R(t) = \int_t^∞ (β x^{β-1} / η^β) exp[-(x/η)^β] dx = exp[-(t/η)^β]   (4.3)

where β is the shape parameter or slope parameter (0 < β < ∞), and η is the scale parameter or characteristic life (0 < η < ∞). Given a disk drive's total power-on hours per year and the utilization calculated by the utilization model (see Eq. 4.5), we can calculate its total active hours during one year by Eq. 4.4:

T_active = T_power_on · ρ   (4.4)

where ρ is the disk utilization. With the active hours as an input, along with β and η, we can use Eq. 4.3 to estimate the disk's annual failure rate and MTBF (which serve as R_BaseValue in Eq. 4.1).
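As a brief illustration of the Weibull-based base failure rate of Section 4.2.2 (Eqs. 4.3 and 4.4), the sketch below converts power-on hours and utilization into active hours and then into a one-year failure probability. The parameter values follow Section 4.4.1 (β = 0.55, η = 8,410,332); reading the resulting failure probability directly as the base AFR is our own simplifying assumption.

```python
import math

# Sketch of the Weibull-based base failure rate (Eqs. 4.3-4.4):
# active hours = power-on hours * utilization, and the probability of failing
# within that active time is 1 - R(t) = 1 - exp[-(t/eta)^beta].
# Interpreting this one-year failure probability as the base AFR is a
# simplifying reading of the model; beta and eta follow Section 4.4.1.

BETA = 0.55          # shape (slope) parameter
ETA = 8_410_332.0    # scale parameter (characteristic life, in hours)

def weibull_survival(t_hours: float) -> float:
    """R(t) = exp[-(t/eta)^beta] -- probability of surviving t active hours."""
    return math.exp(-((t_hours / ETA) ** BETA))

def base_afr(power_on_hours: float, utilization: float) -> float:
    """Annual failure probability (%) from one year of activity (Eq. 4.4)."""
    t_active = power_on_hours * utilization       # Eq. 4.4
    return (1.0 - weibull_survival(t_active)) * 100.0

# Example: a disk powered on all year (8760 h) at 40% utilization.
print(base_afr(8760.0, 0.40))
```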
4.3 Reliability Model for PARAID

4.3.1 Background

Unlike traditional disk array systems, RAID balances the load across all disks in the array to maximize disk parallelism and performance [56]. In a RAID system, all disks are spinning even under a light load. Instead of spinning down inactive disks under a light load, as MAID [19] or PDC [58] do, PARAID exploits unused storage to replicate and stripe data blocks in a skewed fashion, so that disks can be organized into hierarchical, overlapping sets of RAIDs. Each set contains a different number of disks and can serve all requests via either its data blocks or replicated blocks. PARAID introduces a skewed striping pattern that allows RAID devices to use just enough disks to meet the system load. Each set is analogous to a gear in an automobile, as each offers a different aggregate disk bandwidth.

PARAID varies the number of powered-on disks via gear shifting among sets of disks to reduce power consumption [85]. The authors confirmed that a PARAID system can save up to 34% energy compared to a conventional 5-disk RAID system. However, such an energy-efficient technique may have adverse impacts on the reliability of the storage system. First, the system has to spend extra disk utilization on copying data from disks that are about to be spun down, which leads to a higher risk of system failures. Second, after shifting down a gear, fewer disks provide the same amount of service as before the shift, which pushes the powered-on disks into a higher utilization range and thus makes the system even less reliable. Third, due to data striping, each single disk in a PARAID system holds only part of each file; PARAID may face unrecoverable data loss if the number of failed disks exceeds the system's failure tolerance. The reliability issue therefore matters even more for PARAID than for conventional disk array systems. Fig. 4.2 shows a PARAID system consisting of four disks.

Figure 4.2: Framework of PARAID: skewed striping of replicated blocks in soft state, creating 3 RAID gears over 4 disks [85]

Fig. 4.2 shows that each disk in PARAID has two separate states - the soft state and the RAID state. When operating in gear 3, with all four disks powered on, PARAID works in the manner of a conventional RAID system, offering maximized disk parallelism and performance. As the I/O load decreases, PARAID down-shifts into gear 2 by spinning down the fourth disk. Before the down-shift, the blocks stored in the RAID state on disk 4 are copied to disks 1-3 one by one. In this case, disk 1 holds the 1st and 4th blocks of disk 4, disk 2 keeps the 2nd and 5th blocks of disk 4, and disk 3 stores the 3rd and 6th blocks of disk 4. If the load keeps decreasing, PARAID further down-shifts into gear 1 by powering down the third disk.

4.3.2 Modeling Utilization of Disks in PARAID

Recall that the annual failure rate of each disk can be calculated from its utilization and operating temperature as well as its power-state transition frequency. To model the reliability of a disk array equipped with PARAID, we first have to address the issue of modeling the disk utilization used to calculate base annual failure rates (R_BaseValue in Eq. 4.1, shown in Section 4.2). In this subsection, we develop a utilization model capturing the behavior of a RAID-based disk array. The utilization model takes file access patterns as an input and calculates the utilization of each disk in the disk array. Disk utilization is computed as the fraction of active time of a disk drive out of its total powered-on time.

Now we describe a generic way of modeling the utilization of a disk drive. Let us consider a sequence of I/O accesses with L I/O phases. We denote T_l as the length or duration of the lth I/O phase. Without loss of generality, we assume that the file access pattern in an I/O phase remains unchanged. The file access pattern, however, may vary across phases. The relative length or weight of the lth phase is expressed as W_l = T_l/T, where T = \sum_{l=1}^{L} T_l is the total length of all the I/O phases. Suppose the utilization of a disk in the lth phase is ρ_l; we can write the overall utilization ρ
of the disk as the weighted sum of the utilization in all the I/O phases. Thus, we have

ρ = \sum_{l=1}^{L} (W_l · ρ_l) = \sum_{l=1}^{L} (T_l/T · ρ_l)   (4.5)

Since a PARAID system requires at least two disks to achieve the minimum I/O parallelism, a PARAID system consisting of N disks has (N − 1) gears to shift. Assume that at gear G_{N−1}, in which all N disks of the system are kept spinning in order to offer the maximum parallelism, each single disk stores M blocks. When disk N is spun down, all of its M blocks are separated into N − 1 sets such that the remaining N − 1 disks collectively make replicas of the M blocks on disk N. Thus, we have:

F^out_{G(N−1)(N−2)} = M   (4.6)

and

if F^out_{G(N−1)(N−2)} mod (N−1) = 0:
    F^in_{G(N−1)(N−2)} = F^out_{G(N−1)(N−2)} / (N−1)
otherwise:
    F^in_{G(N−1)(N−2)} = ⌊F^out_{G(N−1)(N−2)} / (N−1)⌋ + 1   for disk 1 through disk D
    F^in_{G(N−1)(N−2)} = ⌊F^out_{G(N−1)(N−2)} / (N−1)⌋       for the remaining disks   (4.7)

where D = F^out_{G(N−1)(N−2)} mod (N−1). Here F^out_{G(N−1)(N−2)} represents the replicas of the blocks moved out of disk N when PARAID shifts down from gear G_{N−1} to G_{N−2} due to the decreasing workload, and F^in_{G(N−1)(N−2)} represents the set of replicated blocks moved into each of the remaining N − 1 disks. If M can be exactly divided by N − 1, each disk handles M/(N − 1) blocks. Otherwise, the first M mod (N − 1) disks each handle one extra block, while each of the remaining disks handles ⌊M/(N − 1)⌋ blocks. Similarly, when PARAID shifts down from gear G_{N−2} to G_{N−3}, we have:

F^out_{G(N−2)(N−3)} = M + F^in_{G(N−1)(N−2)}   (4.8)

and

if F^out_{G(N−2)(N−3)} mod (N−2) = 0:
    F^in_{G(N−2)(N−3)} = F^out_{G(N−2)(N−3)} / (N−2)
otherwise:
    F^in_{G(N−2)(N−3)} = ⌊F^out_{G(N−2)(N−3)} / (N−2)⌋ + 1 = ⌊(M + F^in_{G(N−1)(N−2)}) / (N−2)⌋ + 1   for disk 1 through disk D
    F^in_{G(N−2)(N−3)} = ⌊F^out_{G(N−2)(N−3)} / (N−2)⌋ = ⌊(M + F^in_{G(N−1)(N−2)}) / (N−2)⌋           for the remaining disks   (4.9)

Note that the disk to be powered off needs to duplicate, in addition to its own M blocks, the blocks that were copied to it during the previous down-shift; the remaining N − 2 disks accordingly receive more replicated blocks. In general, when PARAID shifts down from gear G_j to G_{j−1}, where j ∈ {3, ..., N−2}, the number of blocks that the disk to be powered off must copy out is

F^out_{G(j)(j−1)} = M + F^in_{G(N−1)(N−2)} + F^in_{G(N−2)(N−3)} + F^in_{G(N−3)(N−4)} + ... + F^in_{G(j+1)(j)}   (4.10)

while the number of blocks that must be written to each of the remaining disks is expressed as:

if F^out_{G(j)(j−1)} mod j = 0:
    F^in_{G(j)(j−1)} = F^out_{G(j)(j−1)} / j
otherwise:
    F^in_{G(j)(j−1)} = ⌊F^out_{G(j)(j−1)} / j⌋ + 1   for disk 1 through disk D
    F^in_{G(j)(j−1)} = ⌊F^out_{G(j)(j−1)} / j⌋       for the remaining disks   (4.11)

where j represents the current gear number, (j − 1) is the gear number that the PARAID system is about to shift to, D = F^out_{G(j)(j−1)} mod j, and ⌊F^out_{G(j)(j−1)}/j⌋ denotes the integral part of F^out_{G(j)(j−1)}/j.
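The block-redistribution recurrence above (Eqs. 4.6-4.11) can be illustrated with a small sketch: at each down-shift the departing disk copies out its own M blocks plus every replica it received in earlier shifts, and the remaining disks split those copies as evenly as possible. The function below is an illustrative reading of that recurrence, not PARAID's actual implementation.

```python
# Illustrative sketch of the down-shift block counts in Eqs. 4.6-4.11,
# assuming (as in the text) that a departing disk must copy out its own M
# blocks plus every replica it received in earlier down-shifts, and that the
# receiving disks split the copies as evenly as possible.

def downshift_block_counts(n_disks: int, m_blocks: int):
    """Return, for each down-shift, (blocks copied out, max blocks written per disk)."""
    results = []
    carried_in = 0                                 # replicas accumulated by the next departing disk
    for remaining in range(n_disks - 1, 1, -1):    # disks left after each down-shift
        f_out = m_blocks + carried_in              # Eq. 4.6 / 4.8 / 4.10
        base, extra = divmod(f_out, remaining)     # Eq. 4.7 / 4.9 / 4.11
        f_in_max = base + (1 if extra else 0)      # the first `extra` disks take one more block
        results.append((f_out, f_in_max))
        carried_in = f_in_max                      # worst case carried into the next shift
    return results

# Example: a 5-disk PARAID with 100 blocks per disk at the top gear.
for shift, (out_blocks, in_blocks) in enumerate(downshift_block_counts(5, 100), start=1):
    print(f"down-shift {shift}: copy out {out_blocks}, up to {in_blocks} per remaining disk")
```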
We assume that every file has the same number of blocks, each of the same size. Hence, the I/O time for accessing each single block is the same. Now we can formally express the utilization of disk i in phase l as follows. For the disk to be powered off, we have:

ρ_power-off = (T_I/O + T_read) / T   (4.12)

while for the rest of the disks, we have:

ρ_power-on = (T_I/O + T_write) / T   (4.13)

To improve readability, Table 4.2 lists the notation used in our model.

Table 4.2: List of Notations

Parameter | Description
R | Total reliability
R_BaseValue | Base failure rate derived from utilization
R_freq(f) | Failure-rate adder for power-transition frequency f
α | Temperature factor
γ | Coefficient of R
β | Shape parameter
η | Scale parameter
T_active | Active time
T_power_on | Power-on time
ρ | Disk utilization
W_l | Relative weight of the lth I/O phase
F^out | Blocks copied out
F^in | Blocks copied in
N | Number of disks
M | Number of blocks
T_I/O | Service time for I/O requests
T_read | Service time for reading duplicated blocks
T_write | Service time for writing duplicated blocks

4.4 Reliability Evaluation

4.4.1 Experimental Setup

We developed a simulator in which the PARAID-0 system (a.k.a. Power-Aware RAID Level 0) is implemented. Table 4.3 shows the configuration parameters for PARAID-0. We evaluate the reliability of a five-disk PARAID-0 system, in which the highest gear is 4. In order to preserve the RAID-0 configuration, two disks are kept active at the lowest gear, gear 1. The file access rate is generated by a Poisson distribution. The operating temperature is set to 38°C. Furthermore, we use the properties of a Seagate hard disk drive in our simulator; the properties are also shown in Table 4.3. Since Seagate's disk properties are used in our experimental setup, we set β = 0.55 and η = 8,410,332 in the Weibull analysis model, and 0.54 as α, the temperature factor in Eq. 4.1 [20].

Table 4.3: Experiment Parameter Setup

Disk Type | Seagate ST3146855FC
Capacity | 146 GB
Cache Size | 16 MB
Buffer-to-Host Transfer Rate | 4 Gb/s (max)
Total Number of Disks | 5
File Size | 100 MB
Number of Files | 1000
Synthetic Trace | Poisson distribution
Time Period | 24 hours
Interval Time (Time Phase) | 1 hour
Power-On Hours Per Year | 8760

4.4.2 Disk Utilization

We first investigate the impacts of the file access rate (λ in the Poisson distribution) on the utilization of PARAID-0. The utilization thresholds that trigger gear shifting are set to 60% for gearing up and 30% for gearing down. PARAID-0 is assumed to start at the top gear, with all five disks working. Fig. 4.3 plots the utilization comparison of PARAID-0 and RAID-0 over 24 hours. The average access rate is set to 20 per hour (λ = 20), which is relatively low. We observe from Fig. 4.3 that, as time goes by, the utilization of RAID-0 stays stable around 22%, while that of PARAID-0 increases twice and then stays stable around 36%. Those two increases are caused by the gear-down shifts and hence by the decreasing number of active disks. Even though the utilization of PARAID-0 is 60% higher than that of RAID-0 at hour 24, the energy consumption of PARAID-0 is 40% lower than that of RAID-0, since only three disks are active by then. Fig. 4.4 shows the utilization comparison of PARAID-0 and RAID-0 when the average access rate is set to 80 per hour (λ = 80), which is three times higher than that in Fig. 4.3. From the figure we notice that the utilization difference between PARAID-0 and RAID-0 is very small.
The major reason is that when the access rate is sufficiently high, the utilization of PARAID-0 stays high (around 90%, as shown in Fig. 4.4) and, therefore, the gear-shifting mechanism is not triggered. Hence, under a high-access-rate pattern, PARAID-0 behaves similarly to a regular RAID-0 system.

Figure 4.3: Disk utilization comparison between PARAID-0 and RAID-0 at a low access rate (20 times per hour).

4.4.3 Annual Failure Rate

Fig. 4.5 illustrates the annual failure rates (AFR) of PARAID-0 and RAID-0 based on their utilization, which is derived from Fig. 4.3. The results plotted in Fig. 4.5 show that the AFR value of RAID-0 keeps increasing from 4.5% to 5.46% as the hours elapse, while the AFR of PARAID-0 increases by 4% at hour 2 and surges by another 8% at hour 3. We attribute this trend to the decreasing number of active disks due to gear-down shifts. Since the utilization of PARAID-0 is the same as that of RAID-0 at a high access rate, the AFRs of the two systems are accordingly similar to each other. However, if power transitions are taken into account, the AFR of PARAID-0 differs from that of RAID-0 even if their access rates are the same.

Figure 4.4: Disk utilization comparison between PARAID-0 and RAID-0 at a high access rate (80 times per hour).

Fig. 4.6 shows the AFR comparison between RAID-0 and PARAID-0 started from different gears within 24 hours. From the figure we observe that, when the access rate increases sharply while PARAID-0 is not at the top gear, the AFR of the system suffers from the number of power transitions. A storage system at a lower gear has relatively poor reliability, mainly because more disks need to be spun up to meet the demands of the requests, and hence more power transitions are counted.

4.5 Summary

This chapter presents a reliability model called MREED to quantitatively study the reliability of energy-efficient parallel disk systems equipped with the PARAID technique. Note that PARAID is a newly developed energy-saving scheme for RAID systems. It aims to skew I/O load towards a few disks so that other disks can be transitioned to low power states to conserve energy.

Figure 4.5: AFR comparison between PARAID-0 and RAID-0 at a low access rate (20 times per hour).

I/O load skewing techniques like PARAID inherently affect the reliability of RAID disks, because disks that keep working at low gears tend to have high failure rates, not to mention the risk of failure caused by data duplication during gear shifting. Furthermore, once the number of failed disks exceeds the system's tolerance, data in the system are lost without any chance of being recovered. To address the model validation issue for MREED, we modified the DiskSim simulator, a widely used storage system simulator, to validate the access-rate-utilization sub-model of MREED by comparing the utilization of a 5-disk PARAID system driven by a real-world disk I/O trace with the utilization calculated from the MREED model using the same trace.
Future directions of this research are as follows. First, we will extend the MREED model to investigate the reliability of other PARAID levels (e.g., level 5), which introduce parity data to tolerate one disk failure. Second, we will investigate the fundamental trade-off between reliability and energy efficiency in the context of energy-efficient RAID systems; a trade-off curve will be used as a unified framework to judge whether or not it is wise to trade reliability for high energy efficiency. Last, we will evaluate and compare an array of energy-saving techniques with respect to specific application domains.

Figure 4.6: AFR comparison between PARAID-0 and RAID-0 at a high access rate (80 times per hour).

Chapter 5
Models Validation

5.1 Model Validation

5.1.1 The Validation Techniques

It is reasonable to use MINT to compare the reliability performance of different energy-efficient storage systems, because the reliability models of the MAID and PDC storage systems use the same experimental data. It is challenging, however, to validate the accuracy of the MINT modeling framework, since we are unable to watch MAID and PDC running for a couple of decades. One way to address this problem is to maintain and monitor a large number of MAID and PDC systems for a short period of time (e.g., 5 to 10 years). If one could watch the MAID and PDC systems over their entire service life, failure-rate data could be collected to validate the reliability models. Even if we could test MAID and PDC with 100 disks for five years, the sample size would still be considered small. To address this validation problem, we verify MINT using a combination of the following two validation techniques [68], which are practical approaches to the verification and validation of models.

- Event Validity: Events of occurrence in the model are compared to those of the real storage system to determine if they are similar. For example, in our validation process, we compared the file access rates in a real-world file system.

- Historical Data Validation: We first used part of the historical file access data (i.e., file I/O traces) to build our models. Then, we relied on the remaining data to test the models.

Recall that MINT consists of two major components - the utilization model (see Sections 3.3.1 and 3.3.2) and the failure-rate model. The utilization model estimates disk utilization of the MAID and PDC systems based on I/O access rates. The failure-rate model relies on real-world failure data (see [61]) to predict the failure rate of a disk from its utilization. To validate MINT, we have to validate the utilization model and the failure-rate model. Since the failure rates in this study are projections based on the failure-rate model derived from Google's empirical analysis (see [61]), we pay attention to the validation of the utilization model. We performed the following six steps repeatedly to validate the utilization model described in Sections 3.3.1 and 3.3.2.

- Step 1: We made use of the real-world I/O trace (i.e., the Berkeley web trace) to derive file access rates.

- Step 2: The file access rates are applied to our utilization model to estimate disk utilizations of the MAID and PDC storage systems.
- Step 3: We implemented a trace replay tool, which captures the rapid evolution of web server workloads.

- Step 4: We developed simple MAID and PDC systems that handle the I/O requests created by the trace replay tool.

- Step 5: The utilizations of the disks in the MAID and PDC storage systems are measured.

- Step 6: We compare the measured disk utilizations from the two real storage systems (see Step 5) with the disk utilizations derived from our models (see Step 2).

5.1.2 Berkeley Web Trace Replay

The Berkeley Web Trace [2] used in the model validation procedure was collected from a web server for an online library project from January 22nd to February 23rd, 1997. The Berkeley Web Trace data represents the intensive I/O activity of a real-world system for which MAID and PDC can conserve energy. Because I/O access rates in this study are measured in terms of the number of I/Os per month (No./month), we decided to replay a one-month trace containing 33 trace files and 25,205,132 I/O requests. Among all the requests, 24,481,520 are file accesses requesting 302,519 web files. The trace replay period is 1,631,753 seconds, or 453.3 hours.

Table 5.1: File Access Rates of the One-Month Web Trace

File Access Rate Interval (No./Month) | Number of Files
0 - 10 | 185,383
10 - 10^2 | 112,203
10^2 - 10^3 | 4,539
10^3 - 10^4 | 244
10^4 - 10^5 | 113
10^5 - 10^6 | 33
10^6 - 10^7 | 4

Before applying file access rates to the utilization models presented in Sections 3.3.1 and 3.3.2, we performed an analysis of the file access rates of the web traces. The goal of this analysis is to determine the access rate of each web file accessed over the one-month period. Table 5.1 summarizes the distribution of file access rates of the 302,519 web files recorded in the 33 traces. Table 5.1 indicates that a vast majority (i.e., more than 61%) of the web files were accessed fewer than ten times within a month. However, a few web files were accessed more than 1000 times over the one-month period. The analysis result shows that the highest file access rate is 3,180,697 No./month.

Figure 5.1: The file access rate distribution of the one-month Berkeley web trace (access rate ranges from 1 to 4.5×10^4 No./month).

Fig. 5.1 shows the file access distribution pattern using a bar chart. The distribution pattern suggests that as the access rate increases, the number of files with that access rate decreases dramatically.
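A short sketch of this access-rate analysis is shown below: it counts accesses per file over the replay window, converts the counts to per-month rates, and bins the rates into the decade intervals of Table 5.1. The record format (one timestamp/file-identifier pair per access) is an assumption for illustration, not the Berkeley trace's actual layout.

```python
# Illustrative sketch of the Step-1 analysis: derive per-file access rates from
# a one-month trace and bin them into the decade intervals of Table 5.1.
# The record format (a (timestamp, file_id) pair per access) is a hypothetical
# simplification of the Berkeley web trace.

from collections import Counter

def access_rate_histogram(records, months=1.0):
    """records: iterable of (timestamp, file_id); returns decade-bin counts."""
    per_file = Counter(file_id for _, file_id in records)
    bins = Counter()
    for count in per_file.values():
        rate = count / months                      # accesses per month
        decade = 0
        while rate >= 10 ** (decade + 1):          # find the decade interval
            decade += 1
        bins[decade] += 1
    return dict(sorted(bins.items()))

# Tiny synthetic example: three files with very different popularities.
trace = [(t, "a") for t in range(5)] + [(t, "b") for t in range(150)] + [(0, "c")]
print(access_rate_histogram(trace))   # e.g. {0: 2, 2: 1}
```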
5.1.3 Experimental Results

Since the Utilization-AFR model, which translates system utilization into reliability, employs the same data as the validated Google report, we only show the validation of the Access Rate-Utilization model in this subsection. Fig. 5.2 shows the utilization comparison between the MINT model and the Berkeley Web Trace-driven simulation. In order to make a clearer comparison between the MINT model and the trace-driven simulation, we plot the utilization comparisons of PDC, MAID-1, and MAID-2 separately (as shown in Fig. 5.3, Fig. 5.4, and Fig. 5.5). From the figures, we observe that the curves produced by the MINT model show a trend similar to that of the simulation. Furthermore, the difference between the model and the simulation is around 10%.

Figure 5.2: Impacts of file access rate on disk utilization (access rate varies from 10 to 64×10^4 No./month).

After validating the Access Rate-Utilization sub-model, we further present the comparison of Access Rate-AFR results between the MINT model and the simulation. We could build a Utilization-AFR sub-model of our own and insert it into the MINT model; however, due to the recent lack of maintenance data, validating such a sub-model is difficult. Instead, we use the validated data published by Google [61] in this part. Once we obtain more up-to-date data in the future, this sub-model can be revised.

Figure 5.3: Impacts of file access rate on disk utilization (PDC); access rate varies from 10 to 64×10^4 No./month.

Figure 5.4: Impacts of file access rate on disk utilization (MAID-1); access rate varies from 10 to 64×10^4 No./month.

Figure 5.5: Impacts of file access rate on disk utilization (MAID-2); access rate varies from 10 to 64×10^4 No./month.

Fig. 5.6, Fig. 5.7, and Fig. 5.8 show the impacts of file access rate on AFR. Even though the trends of the Access Rate-Utilization sub-model appear similar between the model and the simulation (as shown in Fig. 5.3, Fig. 5.4, and Fig. 5.5), there are noticeable differences between them when we examine the AFR. Such differences are mainly due to the bathtub-shaped curve shown in Fig. 3.3.

Figure 5.6: Impacts of file access rate on AFR (PDC); access rate varies from 10 to 64×10^4 No./month.

Figure 5.7: Impacts of file access rate on AFR (MAID-1); access rate varies from 10 to 64×10^4 No./month.

Figure 5.8: Impacts of file access rate on AFR (MAID-2); access rate varies from 10 to 64×10^4 No./month.

5.2 Validation of MREED

5.2.1 The Validation Techniques

It is challenging to validate the accuracy of the MREED modeling framework, since we are unable to monitor PARAID running for a couple of decades. One way to address this problem is to maintain and analyze a large number of PARAID systems for a short period of time (e.g., 5 to 10 years). If one could track the systems over their entire service life, failure-rate data could be collected to validate the reliability models. Even if we could test PARAID with 100 disks for five years, the sample size would be small from a validation perspective.
To address this validation issue, we verify MREED using the Event Validity validation technique [68], which is a practical approach to the verification and validation of reliability models. Events of occurrence in our MREED model are compared to those of the widely used storage system simulator, DiskSim, to determine whether our model and DiskSim agree with one another. In our validation process, we compare against a file access trace from a real-world file system.

Recall that MREED consists of two major components - a utilization model and a failure-rate model. The utilization model estimates disk utilization of the PARAID system based on I/O access rates. The failure-rate model relies on Weibull distribution analysis, whose parameters were derived from a hard disk drive manufacturer's report (see [20]), to predict the probability of disk failure from its utilization. To validate MREED, we have to validate the utilization model and the failure-rate model. Since the failure rates in this study are projections based on the failure-rate model derived from Seagate's empirical analysis (see [20]), we pay attention to the validation of the utilization model.

5.2.2 DiskSim Simulation

The DiskSim simulator, a powerful tool for the modeling and simulation of disk systems, is widely used for storage systems research [40]. Recent research projects using the DiskSim simulation environment include reducing disk I/O performance sensitivity and conserving energy in disk systems [84]. Although DiskSim is a powerful simulation tool, it lacks power models. The Sensitivity-Based Optimization of Disk Architecture introduced accurate power models into DiskSim, but this work was based on DiskSim 2.0 [73]. Another recent study on DiskSim and power models is the Dempsey project [103]. We are grateful to the author of the EEPF paper [50], who provided us with the source code of power models developed for a newer version (i.e., version 4.0) of DiskSim. This makes it possible for us to implement utilization and power-transition models in DiskSim.

5.2.3 Simulation Framework

In order to complete our validation work via DiskSim, we integrate the following two major components in the system.

- DiskSim Simulator: It is in charge of simulating the operation of all disks and the management of data blocks in the system.

- File to Block Translator: It is responsible for mapping files residing in the storage system into block-level data.

As shown in Fig. 5.9, files are mapped into blocks before being used as inputs to the DiskSim simulator. The file-to-block converter is critical, because data blocks are typically managed within a single node, and a higher-level mechanism is needed to manage data across different nodes in RAID systems.

Figure 5.9: File to Block Level Converter Outline (an input trace at the file level passes through a file-to-block mapper and is fed to DiskSim as block-level accesses).

In the DiskSim simulator, we use the same disk model (a Seagate ST3146855LW hard disk drive), whose I/O throughput is significantly higher than that of consumer-level products. In order to avoid I/O transfer throughput bottlenecks, we modify the disk architecture in DiskSim so that each single disk has its own bus and controller (see Fig. 5.10).

Figure 5.10: Diagram of the storage system corresponding to the DiskSim RAID-0 configuration (a driver connected through a system bus to five controllers, each with its own bus and disk).
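The file-to-block translation of Section 5.2.3 can be pictured with the sketch below, which expands a file-level access into fixed-size block requests striped round-robin across the disks in a RAID-0 fashion. The block size, striping rule, and output format are illustrative assumptions, not the converter's actual implementation.

```python
# Illustrative sketch of a file-to-block mapper (Section 5.2.3): a file-level
# access becomes a sequence of fixed-size block requests striped round-robin
# across the disks, roughly in the spirit of RAID-0. Block size, file layout,
# and the output tuple format are hypothetical simplifications.

BLOCK_SIZE = 4096        # bytes per block (assumed)
NUM_DISKS = 5            # matches the five-disk setup used in our experiments

def file_to_block_requests(file_size: int, start_block: int = 0):
    """Yield (disk, block_on_disk) pairs for every block of one file access."""
    n_blocks = -(-file_size // BLOCK_SIZE)          # ceiling division
    for i in range(n_blocks):
        logical = start_block + i                   # logical block index of the file
        disk = logical % NUM_DISKS                  # round-robin striping
        block_on_disk = logical // NUM_DISKS
        yield (disk, block_on_disk)

# Example: a 20 KB file access expands into five striped block requests.
print(list(file_to_block_requests(20_000)))
```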
5.2.4 UMass WebSearch Trace

The UMass WebSearch Trace [6] is used in the model validation process. This trace is obtained from the University of Massachusetts-Amherst (UMass) website. The trace used in our experiments is WebSearch3.trace, which contains 4,261,709 read requests. The trace replay period is 298,715,395 milliseconds, or about 83 hours.

Figure 5.10: Diagram of the Storage System Corresponding to the DiskSim RAID-0 (Driver 0 connects through Bus 0 to five controllers, CTLR 0-4; each controller drives its own bus, Bus 1-5, attached to a single disk, DISK 0-4).

5.2.5 Validation Results

The Utilization-AFR model translates system utilization into reliability. Because this model employs the same Weibull analysis with the same shape and scale parameters (see Section 4.2), we only show the validation of the utilization and power-transition models in this subsection.

To make a clearer comparison between the MREED model and the trace-driven DiskSim simulations, we separate the comparison into utilization (see Fig. 5.11) and power transitions (see Fig. 5.12). We observe that the results obtained from the MREED model are similar to those of the simulation. Furthermore, the discrepancy between the model and the simulation is below 10%.

Figure 5.11: Utilization Comparison Between MREED and the DiskSim Simulator.

After validating the Access Rate-Utilization sub-model, we further present the comparison of Access Rate-Power Transition results between the MREED model and the simulation (as shown in Fig. 5.12). The figure shows that, as time elapsed, the gear shifted accordingly as the file access pattern changed. Fig. 5.12 illustrates that our model performs well in estimating gear-shift events.

Figure 5.12: Gear Shifting Comparison Between MREED and the DiskSim Simulator.
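The "below 10%" figure quoted above is a simple discrepancy measure between the two utilization curves. The short sketch below shows one way such a metric could be computed from matched MREED and DiskSim utilization samples; the variable names and the sample values are placeholders, not the actual experimental series.

```python
def mean_discrepancy(model_vals, sim_vals):
    """Average absolute difference between model and simulation utilization,
    in percentage points (both series sampled at the same timestamps)."""
    assert len(model_vals) == len(sim_vals) and model_vals
    return sum(abs(m - s) for m, s in zip(model_vals, sim_vals)) / len(model_vals)

# Placeholder utilization samples (%) at matching timestamps
mreed_util   = [22.0, 41.5, 58.0, 73.0, 66.5]
disksim_util = [25.0, 44.0, 55.5, 70.0, 64.0]
print(f"mean discrepancy: {mean_discrepancy(mreed_util, disksim_util):.1f} points")
```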
Chapter 6

Improving Reliability of Energy-Efficient Parallel Storage Systems

The Massive Array of Idle Disks (MAID) technique is an effective energy-saving scheme for parallel disk systems. The goal of MAID is to skew the I/O load towards a few disks so that the other disks can be transitioned to low power states to conserve energy. I/O load skewing techniques like MAID inherently affect the reliability of parallel disks, because disks storing popular data tend to have higher failure rates than disks storing cold data. To achieve good tradeoffs between energy efficiency and disk reliability, we first present a reliability model to quantitatively study the reliability of energy-efficient parallel disk systems equipped with MAID schemes. Then, we propose a novel strategy, disk swapping, to improve disk reliability by alternating disks storing hot data with disks holding cold data. Finally, we further improve disk reliability by introducing a multiple-disk-swapping strategy. We demonstrate that our disk-swapping strategies not only increase the lifetime of cache disks in MAID-based parallel disk systems, but also further reduce the failure rate of the entire system when multiple disk swapping is introduced.

6.1 Introduction

Parallel disk systems, providing high-performance data-processing capacity, are of great value to large-scale parallel computers [4]. A parallel disk system comprised of an array of independent disks can be built from low-cost commodity hardware components. In the past few decades, parallel disk systems have become increasingly popular for data-intensive applications running on massively parallel computing platforms [81].

Existing energy conservation techniques can yield significant energy savings in disks. While several energy conservation schemes, such as cache-based energy-saving approaches, normally have a marginal impact on disk reliability, many energy-saving schemes (e.g., dynamic power management and workload-skew techniques) inevitably have noticeable adverse impacts on storage systems [12][90]. For example, dynamic power management (DPM) techniques save energy by using frequent disk spin-downs and spin-ups, which in turn can shorten disk lifetime [22][34][46]. Other representative energy-saving approaches include redundancy techniques [60][102][82][89], workload skew [54][38][98], and multi-speed settings [32][76]. Unlike DPM, workload-skew techniques such as MAID [19] and PDC [58] move popular data sets to a subset of the disk array acting as workhorses, which are kept busy so that the other disks can be switched into standby mode to save energy. Compared with disks storing cold data, disks archiving hot data inherently have a higher risk of breaking down.

Unfortunately, it is often difficult for storage researchers to improve the reliability of energy-efficient disk systems. One of the main reasons lies in a challenge that every disk energy-saving research project faces today: how to evaluate the reliability impacts of power management strategies on disk systems. Although the reliability of disk systems can be estimated by simulating the behaviors of energy-saving algorithms, there is a lack of fast and accurate methodologies for evaluating the reliability of modern storage systems with high energy efficiency. To address this problem, we developed a mathematical reliability model called MINT to estimate the reliability of a parallel disk system that employs a variety of reliability-affecting energy conservation techniques [99].

In this chapter, we first study the reliability of a parallel disk system equipped with a well-known energy-saving scheme, the MAID [19] technique. I/O load skewing techniques like MAID inherently affect the reliability of parallel disks for two reasons: First, disks storing popular data tend to have higher I/O utilization than disks storing cold data. Second, disks with higher utilization are likely to have a higher risk of breaking down. To address the adverse impact of load-skewing techniques on disk reliability, a disk swapping strategy is proposed to improve disk reliability in MAID by switching the roles of data disks and cache disks. We evaluate the impacts of the disk swapping scheme on the reliability of MAID-based parallel disk systems. We summarize our contributions as follows:

1. We developed a model for the Massive Array of Idle Disks (MAID) based on the Mathematical Reliability Models for Energy-efficient Parallel Disk Systems (MINT) [99];
2. We built single-disk-swapping and multiple-disk-swapping mechanisms to improve the reliability of various load-skewing techniques;
3. We studied the impacts of the disk swapping schemes on the reliability of MAID.

The remainder of this chapter is organized as follows. Section 6.2 studies the single-disk-swapping and multiple-disk-swapping strategies on MAID. Section 6.3 presents experimental results and performance evaluation. Finally, Section 6.4 concludes the chapter with discussions.

6.2 Improving Reliability of MAID via Disk Swapping

6.2.1 Improving Reliability of Cache Disks in MAID

Cache disks in MAID are more likely to fail than data disks for two reasons. First, cache disks are always kept active to maintain short I/O response times. Second, the utilization of cache disks is expected to be much higher than that of data disks.
From the standpoint of data loss, the reliability of MAID relies on the failure rate of data disks rather than that of cache disks. However, cache disks tend to be a single point of failure in MAID: if the cache disks fail, MAID stops conserving energy. In addition, frequently replacing failed cache disks can increase hardware and management costs in MAID. To address this single point of failure and make MAID cost-effective, we designed a disk swapping strategy for enhancing the reliability of cache disks in MAID.

Fig. 6.1 shows the basic idea of the disk swapping mechanism, according to which disks rotate to perform the cache-disk functionality. In other words, the roles of cache disks and data disks are periodically switched so that all the disks in MAID have an equal chance to perform the role of caching popular data. For example, the two cache disks on the left-hand side in Fig. 6.1 are swapped with the two data disks on the right-hand side after a certain period of time (see Section 6.3.3 for the circumstances under which disks should be swapped). For simplicity, without loss of generality, we assume that all the data disks in MAID are initially identical in terms of reliability. This assumption is reasonable because, when a MAID system is built, all the new disks of the same model come from the same vendor. Initially, the two cache disks in Fig. 6.1 can be swapped with any data disk. After the initial phase of disk swapping, the cache disks switch their role of storing replica data with the data disks that have the lowest annual failure rate. In doing so, we ensure that the cache disks are the most reliable ones among all the disks in MAID after each disk swapping process. It is worth noting that the goal of disk swapping is not to increase the mean time to data loss, but to boost the mean time to cache-disk failure by balancing failure rates across all disks in MAID.

Figure 6.1: Disk Swapping in MAID: the two cache disks on the left-hand side (managed by the cache manager) are swapped with the two data disks on the right-hand side.
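The following minimal sketch illustrates the swap-partner selection rule just described: pick the data disk with the lowest estimated annual failure rate that also has enough free capacity to hold the cache disk's replicas. The Disk structure, the field names, and the example values are hypothetical; they are not drawn from the actual MAID implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Disk:
    disk_id: int
    afr: float        # estimated annual failure rate (%)
    free_gb: float    # free capacity
    is_cache: bool

def pick_swap_partner(cache_disk_load_gb: float,
                      data_disks: List[Disk]) -> Optional[Disk]:
    """Return the lowest-AFR data disk that can hold the cache disk's
    replicas, or None if no data disk qualifies (no swap is performed)."""
    candidates = [d for d in data_disks
                  if not d.is_cache and d.free_gb >= cache_disk_load_gb]
    return min(candidates, key=lambda d: d.afr, default=None)

# Example: three data disks; the lowest-AFR disk with enough free space wins.
data = [Disk(1, 6.8, 120.0, False),
        Disk(2, 5.9, 40.0, False),   # lowest AFR but not enough free space
        Disk(3, 6.1, 200.0, False)]
print(pick_swap_partner(cache_disk_load_gb=80.0, data_disks=data))
```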
Fig. 6.2 is the logic diagram of the single-disk-swapping mechanism, which shows the swapping procedure in more detail. When the access rate reaches a threshold, which is set beforehand, a data disk's capacity is checked. If the data disk has enough free space to hold all the replicas held by a cache disk, it is paired with that cache disk for the later swap. Otherwise, the capacities of other data disks are checked until a disk that meets the requirement is found. If no disk meets the requirement, the disk swap is not executed. This step is performed first to prevent the original data on the data disk from being deleted by mistake. In our research, we assume that the data disk's capacity is large enough to hold all the cache data while keeping its original data. The capacity of the cache disk is then examined when it is paired with a data disk. If the cache disk has enough free space to hold all the data held by the data disk, the data disk duplicates all the cache data from the cache disk while keeping all of its original data; the cache disk then copies the data from the data disk and keeps all of its own replicas. On the other hand, if the cache disk does not have enough free space to hold all the data from the data disk, all the replicas it holds are deleted after they have been duplicated to the destination, releasing space for the data copied from the data disk. At this step, whether or not the cache disk has enough available capacity, the data must be transferred from the cache disk first to prevent the original data from being deleted by mistake or lost.

Figure 6.2: Logic Diagram of Disk Swapping.

Algorithm 1 outlined below is the single-disk-swapping algorithm that switches the roles of cache disks and data disks to improve the reliability of cache disks. The algorithm is called single-disk-swapping because the disk swap occurs only once in MAID.

Algorithm 1 The Single-Disk-Swapping Algorithm
1: Input: the access rate of the system;
2: if the access rate reaches the threshold then
3:   Check the available capacity of the data disk;
4:   if the available capacity of the data disk is enough then
5:     Check the available capacity of the cache disk;
6:     if the available capacity of the cache disk is enough then
7:       The data disk keeps all original data and duplicates the cache data from the cache disk; the cache disk keeps all replicas and copies the data from the data disk;
8:     else
9:       if the available capacity of the cache disk is NOT enough then
10:        The data disk keeps all original data and duplicates the cache data from the cache disk; the cache disk deletes all replicas and copies the data from the data disk;
11:      end if
12:    end if
13:  else
14:    if the available capacity of the data disk is NOT enough then
15:      while there is a data disk that has enough available capacity do
16:        Check the available capacity of the cache disk;
17:      end while
18:    end if
19:  end if
20: else
21:   Do not swap;
22: end if
23: Disk swap ends;

Disk swapping is very beneficial to MAID for two reasons. First, disk swapping further improves the energy efficiency of MAID, because any failed cache disk can prevent MAID from effectively saving energy. Second, disk swapping reduces the maintenance cost of MAID by making cache disks less likely to fail.
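To make the control flow of Algorithm 1 easier to follow, here is a compact executable sketch of the same capacity checks and data movements. The Disk class, the capacity fields, and the example values are simplified placeholders assumed for illustration; the real system moves data through the cache manager rather than updating counters.

```python
from dataclasses import dataclass

@dataclass
class Disk:
    name: str
    capacity_gb: float
    original_gb: float   # space taken by original data
    replica_gb: float    # space taken by cached replicas

    def free_gb(self) -> float:
        return self.capacity_gb - self.original_gb - self.replica_gb

def single_disk_swap(access_rate: float, threshold: float,
                     cache: Disk, data_disks: list) -> bool:
    """Sketch of Algorithm 1: once the access rate reaches the threshold, pair
    the cache disk with a data disk that can hold its replicas, copy data in
    the safe order (cache-disk contents first), and switch roles.
    Returns True if a swap was performed."""
    if access_rate < threshold:
        return False                                   # do not swap
    # Find a data disk with enough free space for the cache disk's replicas.
    partner = next((d for d in data_disks if d.free_gb() >= cache.replica_gb), None)
    if partner is None:
        return False                                   # no swap possible
    # The data disk keeps its original data and duplicates the replicas first,
    # so nothing on the data disk is deleted by mistake.
    partner.replica_gb += cache.replica_gb
    if cache.free_gb() >= partner.original_gb:
        # Cache disk keeps its replicas and additionally copies the data.
        cache.original_gb += partner.original_gb
    else:
        # Cache disk drops its (already duplicated) replicas to make room.
        cache.replica_gb = 0.0
        cache.original_gb += partner.original_gb
    # The caller now treats `partner` as the cache disk and `cache` as a data disk.
    return True

cache = Disk("cache-0", 300.0, 0.0, 180.0)
data = [Disk("data-0", 300.0, 260.0, 0.0), Disk("data-1", 300.0, 90.0, 0.0)]
print(single_disk_swap(access_rate=6e5, threshold=5e5, cache=cache, data_disks=data))
```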
6.2.2 Swapping Disks Multiple Times

Now we consider the case where disk swapping is invoked multiple times in MAID. As described in Section 6.2.1, the single-disk-swapping mechanism improves the reliability of the MAID system by giving all disks an equal chance to perform the role of cache disks, which carry high I/O workload and high utilization. The single-disk-swapping algorithm has a major limitation: disks are swapped only once throughout their lifetimes. That means single disk swapping only affects reliability for a very short period of time. After each disk swap, the utilization of the disks with low AFRs is likely to be kept at a high level, which in turn leads to an increasing AFR for the entire disk system. In order to improve the reliability of the MAID system over a long time period (e.g., 1,000,000 hours, or over 100 years [71]), we address the issue of swapping disks multiple times (see the multiple-disk-swapping scheme shown in Algorithm 2).

In the multiple-disk-swapping algorithm, the number of disk swaps per month is an important parameter affecting both the reliability and the performance of MAID. This parameter can either be set manually as a constraint or be configured dynamically according to changing workload conditions. In the static approach, the disk-swapping mechanism is triggered after MAID has been operating for a certain number of days, regardless of the I/O workload. For example, if the frequency is set to three times per month, disks are swapped once every ten days. In the dynamic approach, the disk-swapping function is invoked once workload conditions (i.e., the access rate) reach the configured value, regardless of the time interval between two swaps. For instance, if the access-rate threshold is set to 2×10^5 No./month, the disks are swapped every time the access rate reaches 2×10^5 No./month. The dynamic multiple-disk-swapping scheme ensures that disk swaps occur only when necessary.

Algorithm 2 The Algorithm for Multiple Disk Swapping
1: while the number of disk swaps is no more than the given frequency do
2:   Run Algorithm 1
3: end while
4: Disk swap ends;
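The sketch below illustrates the two triggering policies described above for repeated swaps: a static, calendar-driven trigger and a dynamic, access-rate-driven trigger. The function names and the `do_swap` callback are placeholders tied to the earlier single-swap sketch; they are not part of the actual implementation.

```python
def run_multiple_swaps_static(days_elapsed: int, swaps_per_month: int, do_swap) -> int:
    """Static policy: swap every (30 / swaps_per_month) days, regardless of load.
    Returns the number of swaps performed so far."""
    interval_days = 30 // swaps_per_month      # e.g., 3 swaps/month -> every 10 days
    swaps = days_elapsed // interval_days
    for _ in range(swaps):
        do_swap()
    return swaps

def run_multiple_swaps_dynamic(access_counts, threshold: float, do_swap) -> int:
    """Dynamic policy: swap each time the cumulative access count since the
    last swap reaches the threshold (e.g., 2e5 accesses per month)."""
    swaps, since_last = 0, 0.0
    for count in access_counts:                # per-interval access counts
        since_last += count
        if since_last >= threshold:
            do_swap()
            swaps += 1
            since_last = 0.0
    return swaps

# Example: monthly access counts; with a 2e5 threshold, swaps fire as load accumulates.
print(run_multiple_swaps_dynamic([1.2e5, 0.9e5, 1.5e5, 0.6e5], 2e5, do_swap=lambda: None))
```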
6.3 Experimental Results and Evaluation

6.3.1 Experimental Setup

We developed a simulator to validate the reliability model for MAID. It might be unfair to compare the reliability of MAID with that of a non-energy-efficient parallel disk system, since MAID trades extra cache disks for high energy efficiency. To make fair comparisons, we considered a MAID system with two configurations. The first configuration, referred to as MAID-1, employs existing disks in a parallel disk system as cache disks to store frequently accessed data. Thus, the first configuration of MAID improves the energy efficiency of the parallel disk system at the cost of capacity. In contrast, the second configuration, called MAID-2, needs extra disks to be added to the disk system to serve as cache disks.

Our experiments started by evaluating the reliability of the original MAID system without disk swapping. Then, we studied the reliability impacts of the single-disk-swapping strategy on MAID. Finally, we assessed the reliability impacts of the multiple-disk-swapping scheme. We simulated MAID-1 and MAID-2 coupled with the disk-swapping strategies in the two parallel disk systems described in Table 6.1. For the MAID-1 configuration, there are 5 cache disks and 15 data disks. For the MAID-2 configuration, there are 5 cache disks and 20 data disks. As for the case of PDC, we fixed the number of disks to 20. Thus, we studied MAID-1 and PDC using a parallel disk system with 20 disks; we used a similar disk system with 25 disks in total to investigate MAID-2.

Table 6.1: The characteristics of the simulated parallel disk systems used to evaluate the reliability of MAID-1 and MAID-2.
Energy-Efficiency Scheme | Number of Disks                  | File Access Rate (No. per month) | File Size (KB)
NONE*                    | 20 data (20 in total)            | 0 to 10^6                        | 300
MAID-1                   | 15 data + 5 cache (20 in total)  | 0 to 10^6                        | 300
MAID-2                   | 20 data + 5 cache (25 in total)  | 0 to 10^6                        | 300
*Original disk system without any energy-efficiency scheme.

We varied the file access rate in the range between 0 and 10^6 times per month. The average file size considered in our experiments is 300 KB. The base operating temperature is set to 35°C. In this study, we focused on read-only workloads. Nevertheless, the MINT model should be readily extensible to capture the characteristics of read/write workloads.

6.3.2 Disk Utilization

Fig. 6.3 shows that when the average file access rate increases, the utilizations of MAID-1 and MAID-2 increase accordingly. Compared with the utilization of MAID-2, the utilization of MAID-1 is more sensitive to the file access rate. Under low I/O load, the utilizations of MAID-1 and MAID-2 are very close to each other. When the I/O load becomes relatively high, the utilization of MAID-1 is slightly higher than that of MAID-2. This is mainly because the capacity of MAID-2 is larger than that of MAID-1.

Figure 6.3: Utilization Comparison of the MAID Access Rate Impacts on AFR (No Swapping).

6.3.3 The Single-Disk-Swapping Strategy

A key issue of the disk-swapping strategies is to determine the circumstances under which disks should be swapped in order to improve disk system reliability. One straightforward way to address this issue is to periodically initiate the disk-swapping process. For example, we can swap disks in MAID once every month. Periodically swapping disks, however, might not always enhance the reliability of parallel disk systems. For instance, swapping disks under very light workloads cannot substantially improve disk system reliability. In some extreme cases, swapping disks under light workloads may even worsen disk reliability due to the overhead of swapping. As such, our disk-swapping strategies do not swap disks periodically. Rather, the disk-swapping process is initiated when the average I/O access rate exceeds a threshold. In our experiments, we evaluated the impact of this access-rate threshold on the reliability of a parallel disk system. More specifically, the threshold is set to 2×10^5, 5×10^5, and 8×10^5 times/month, respectively. These three values are representative thresholds because when the access rate hits 5×10^5, the disk utilization lies in the range between 80% and 90% [61], which in turn ensures that AFR increases with increasing utilization (see Fig. 3.7).

Figure 6.4: Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold = 2×10^5).

Figs. 6.4, 6.5, and 6.6 reveal the annual failure rates (AFR) of MAID-1 and MAID-2 with and without the proposed disk-swapping strategy. The results plotted in Figs. 6.4, 6.5, and 6.6 show that, for both MAID-1 and MAID-2, the disk-swapping process reduces the reliability of data disks in the disk system. We attribute the reliability degradation to the following reasons. MAID-1 and MAID-2 only store replicas of popular data, so the reliability of the entire disk system is not affected by failures of cache disks. The disk-swapping processes increase the average utilization of data disks, thereby increasing the AFR values of data disks. Nevertheless, the disk-swapping strategy has its own unique advantage: disk swapping is intended to reduce hardware maintenance cost by increasing the lifetime of cache disks.
In other words, disk swapping is capable of extending the Mean Time To Failure (MTTF) [61] of the cache disks.

Figure 6.5: Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold = 5×10^5).

Figure 6.6: Utilization Comparison of the MAID Access Rate Impacts on AFR (Threshold = 8×10^5).

We observed from Figs. 6.4, 6.5, and 6.6 that, for the MAID-based disk system with the disk-swapping strategy, a small threshold leads to a low AFR. Compared with the other two thresholds, the 2×10^5 threshold shown in Fig. 6.4 results in the lowest AFR. The reason is that when the access rate is 2×10^5 No./month, the disk utilization is around 35% [61], which lies in the monotonically decreasing region of the curve shown in Fig. 3.7. Thus, disk swapping reduces AFR for a while, until the disk utilization reaches 60%.

6.3.4 The Multiple-Disk-Swapping Strategy

Section 6.3.3 shows that the single-disk-swapping strategy can improve the reliability of the MAID system. However, single disk swapping has minimal reliability impact over a long period of time. For example, Fig. 6.4 indicates that, after swapping cache and data disks, the failure rate of the disk system continues to go up as the access rate keeps increasing. We observed that after the first disk swap, without any subsequent disk swaps, the failure rate of disk-swapping-enabled MAID becomes close to that of non-disk-swapping MAID. Thus, disk swapping must be conducted repeatedly whenever the failure rate of MAID increases.

To evaluate the multiple-disk-swapping scheme, we configured the access-rate threshold to 2×10^5, 2.5×10^5, and 4×10^5 No./month. For example, if the threshold is set to 2×10^5, swaps are triggered at cumulative access rates of 2, 4, 6, and 8 (×10^5), so the total access rate at the last swap can be as high as 8×10^5, which is one of the thresholds chosen for the single-disk-swapping strategy. Figs. 6.7, 6.8, and 6.9 reveal the annual failure rates (AFR) of MAID-1 and MAID-2 with both a single disk swap and multiple disk swaps. The results show that the multiple-disk-swapping process further reduces the failure rate of data disks in the MAID system.

Figure 6.7: Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple, Threshold = 2×10^5).

Figure 6.8: Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple, Threshold = 2.5×10^5).

Figure 6.9: Utilization Comparison of the MAID Access Rate Impacts on AFR (Multiple, Threshold = 4×10^5).

Comparing these with the AFR values plotted in Figs. 6.4, 6.5, and 6.6, we noticed that the failure rate of MAID with multiple disk swaps is lower than that of the same system with a single disk swap at an access rate of 10×10^5.
As the access rate increases, the reliability improvement achieved by the multiple-disk-swapping scheme becomes more pronounced. The major reason behind the improvement is that swapping disks multiple times keeps balancing the I/O workload of each disk in the MAID system in the long run. After each disk swap, if the failure rate of MAID increases to a certain point (see, for example, Fig. 6.3), a subsequent disk swap is initiated. Figs. 6.7, 6.8, and 6.9 demonstrate that the failure rate of the multi-swapping MAID system changes periodically. For example, Fig. 6.7 shows that immediately after each disk swapping process, the failure rate of MAID increases by 5% due to the overhead caused by copying data between cache disks and data disks. Then, the failure rate stays stable for a while until the next disk swap occurs. We observe that at the second disk swap, the cumulative access rate is 4×10^5, which is the same as the first swapping threshold shown in Fig. 6.9. The fourth disk-swapping point in Fig. 6.7 is the same as the single-disk-swapping threshold shown in Fig. 6.6. Comparing Fig. 6.9 and Fig. 6.6, we conclude that when the access rate reaches 10×10^5, the failure rate of the multiple-disk-swapping scheme is lower than that of the single-disk-swapping scheme. This reliability improvement is made possible by multiple disk swaps, because cache disks and data disks are switched after the failure rates of the cache disks become higher than those of the data disks. Repeatedly swapping cache and data disks can well balance the failure rates of all the disks in the MAID system.

6.4 Summary

This chapter presents a reliability model to quantitatively study the reliability of energy-efficient parallel disk systems equipped with the Massive Array of Idle Disks (MAID) technique. Note that MAID is a well-known, effective energy-saving scheme for parallel disk systems. It aims to skew the I/O load towards a few disks so that other disks can be transitioned to low power states to conserve energy. I/O load skewing techniques like MAID inherently affect the reliability of parallel disks, because disks storing popular data tend to have higher failure rates than disks storing cold data. To address the reliability issue in MAID, we developed a single-disk-swapping strategy to improve disk reliability by alternating disks storing hot data with disks holding cold data. Additionally, we introduced a multiple-disk-swapping scheme to further improve the reliability of MAID. We then quantitatively evaluated the impacts of the disk-swapping strategies on the reliability of MAID-based disk systems. We demonstrated that the disk-swapping strategies not only increase the lifetime of cache disks in MAID-based parallel disk systems, but also improve their reliability over a long period of time by balancing the workload of cache disks and data disks, and hence balancing their utilization.

Future directions of this research are as follows. First, we will extend the MINT model to investigate mixed read/write workloads. Second, we will investigate the fundamental trade-off between reliability and energy efficiency in the context of energy-efficient disk arrays. A trade-off curve will be used as a unified framework to judge whether or not it is worth trading reliability for high energy efficiency. Last, we will study the most appropriate conditions under which disk-swapping processes should be initiated.
Chapter 7

Conclusion and Future Work

7.1 Main Contributions

7.1.1 The MINT model for parallel storage systems

In recognition that existing disk reliability models cannot be used to evaluate the reliability of energy-efficient disk systems, we propose a new model called MINT to evaluate the reliability of a disk array equipped with reliability-affecting energy conservation techniques. We first model the impacts of disk utilization and power-state transition frequency on the reliability of each disk in a disk array. We then derive the reliability of an individual disk from its utilization, age, temperature, and power-state transition frequency. Finally, we use MINT to study the reliability of disk arrays coupled with the MAID (Massive Array of Idle Disks) technique and the PDC (Popular Disk Concentration) technique.

7.1.2 The MREED model for RAID systems

We present a reliability model called MREED to quantitatively study the reliability of energy-efficient parallel disk systems equipped with the PARAID technique. Note that PARAID is a newly developed energy-saving scheme for RAID systems. It aims to skew the I/O load towards a few disks so that other disks can be transitioned to low power states to conserve energy. I/O load skewing techniques like PARAID inherently affect the reliability of RAID disks, because disks that keep working in low gears tend to have higher failure rates, not to mention the risk of failure caused by data duplication during gear shifting. Furthermore, once the number of failed disks exceeds the system's tolerance, data in the system are lost without any chance of being recovered. To address the model validation issue for MREED, we modified the DiskSim simulator, a widely used storage system simulator, to validate the access-rate-utilization sub-model of MREED by comparing the utilization of a 5-disk PARAID system driven by a real-world disk I/O trace with the utilization calculated from the MREED model using the same trace.

7.1.3 Reliability improvement of parallel storage systems

This dissertation presents a reliability model to quantitatively study the reliability of energy-efficient parallel disk systems equipped with the Massive Array of Idle Disks (MAID) technique. Note that MAID is a well-known, effective energy-saving scheme for parallel disk systems. It aims to skew the I/O load towards a few disks so that other disks can be transitioned to low power states to conserve energy. I/O load skewing techniques like MAID inherently affect the reliability of parallel disks, because disks storing popular data tend to have higher failure rates than disks storing cold data. To address the reliability issue in MAID, we develop a single-disk-swapping strategy to improve disk reliability by alternating disks storing hot data with disks holding cold data. Additionally, we introduce a multiple-disk-swapping scheme to further improve the reliability of MAID. We then quantitatively evaluate the impacts of the disk-swapping strategies on the reliability of MAID-based disk systems. We demonstrate that the disk-swapping strategies not only increase the lifetime of cache disks in MAID-based parallel disk systems, but also improve their reliability over a long period of time by balancing the workload of cache disks and data disks, and hence balancing their utilization.
7.2 Future Work

7.2.1 Future Directions for the Short Term

Our short-term interest will concentrate on the following two directions, which are extensions of my past and current research on reliability analytical models for parallel storage systems.

• Fault Tolerance Analysis for RAID Storage Systems

Although the MINT model presented in this dissertation is adequate to quantify the reliability of energy-efficient disk arrays, MINT is insufficient to analyze energy-aware RAID systems. We plan to investigate a more sophisticated model that can capture data access patterns and striped data placement. To reduce power, a conventional RAID system cannot simply rely on caching and powering off disks during idle periods because of its disk parallelism: all disks are spinning even under a light load. By varying the number of powered-on disks via gear shifting or by switching among sets of disks (e.g., the Power-Aware Redundant Array of Inexpensive Disks), the energy consumption of a RAID system can be reduced. However, after changing the number of active disks in the system, the RAID level changes accordingly, which affects the reliability of the system. As a further extension of this dissertation, we plan to investigate the behavior of RAID levels in terms of gear shifting and striped data movement, along with input data access patterns.

• Predictive Reliability Models for Storage Systems

The reliability evaluation of a disk system indicates the present liability of the system. We argue that if one can predict the reliability of a storage system, the system's maintenance expenses can be reduced, because disks will be replaced on time. The risk that disks fail before being replaced can be diminished, and the cost of purchasing new disks can be decreased. The goal of this future research is to build predictive reliability models that forecast the reliability of storage systems based on data access patterns and provide disk maintenance suggestions. Furthermore, such a strategy can be integrated with load-balancing schemes to enforce two policies. First, disks reaching the end of their lifetimes will be assigned lighter workloads. Second, data on disks that are likely to fail will be backed up in time.

7.2.2 Future Directions for the Long Term

We plan to pursue the following three long-term research goals.

• Energy-Aware Storage Systems in Data Centers

Distributed file systems are becoming the de facto method of data storage for the new generation of data centers (e.g., web applications run by companies like Google, Amazon, and Yahoo!). There are several reasons that distributed storage mechanisms are preferred over traditional relational database systems, including scalability, availability, and performance. However, the energy consumption issue needs to be addressed carefully in data centers. For example, a 360-TFLOPS supercomputer (e.g., IBM Blue Gene/L) with traditional processors needs about 2,329.60 kW to operate. This power requirement is approximately equal to the combined consumption of 22,000 US households. In addition, the high-temperature heat dissipation caused by large-scale clusters requires cooling equipment (e.g., air conditioners) to control temperatures in supercomputer and data centers. The trends in power/cooling delivery and cost highlight the need for support in data centers for power and thermal management.
In the long term, we plan to explore schemes that utilize platform power management (e.g., processor frequency scaling, prefetching, caching, data management, load balancing, etc.) for data centers.

• Reliability-Aware Parallel Virtual File System (PVFS) in High-Performance Computing

PVFS, a popular network clustering file system, brings state-of-the-art parallel I/O concepts to production parallel systems. It is designed to scale to petabytes of storage and provide access rates of hundreds of GB/s. While working on a PVFS-related research project, we realized that energy saving may not be a central issue for high-performance computing (HPC) systems. One of the major reasons is that energy-efficiency schemes usually negatively affect the main goal of an HPC system, which is to maximize system performance. However, fault tolerance plays an important role in HPC systems, because any minor defect may lead to data loss across the entire system. Hence, we plan to develop fault-tolerant mechanisms for PVFS in order to enhance availability.

• Information Assurance and Security in Cloud Storage Systems

Providing confidentiality, integrity, authenticity, privacy, and availability of information is essential for normal operation in cloud computing. Hence, information assurance and security is a critical issue. As the last long-term research direction, we will place emphasis on schemes for authorization and authentication in cloud storage systems.

Bibliography

[1] 1996 disk trend report - rigid disk drives, figure 2 - unit shipment summary. http://www.disktrend.com.
[2] Berkeley web trace. http://tracehost.cs.berkeley.edu/web/, 1998.
[3] Seagate unveils hefty, fast cheetah drives, March 2001.
[4] The distributed-parallel storage system (DPSS) home pages. http://www-didc.lbl.gov/DPSS/, June 2004.
[5] Hitachi introduces 1-terabyte hard drive. http://www.pcworld.com/article/128400/hitachi_introduces_1terabyte_hard_drive.html, January 2007.
[6] UMass trace repository. http://traces.cs.umass.edu/index.php/Storage/Storage, December 2009.
[7] Japan's K computer tops 10 petaflop/s to stay atop TOP500 list, November 2011.
[8] Seagate is the first manufacturer to break the capacity ceiling with a new 4TB GoFlex Desk drive. http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=goflex-desk-4tb-capacity-seagate-pr&vgnextoid=e07c2d857df32310VgnVCM1000001a48090aRCRD, September 2011.
[9] Top 500 supercomputer sites, March 2012.
[10] Robert B. Abernethy. The New Weibull Handbook, 5th edition. Barringer & Associates, Inc., Humble, TX, USA, 2010.
[11] Khalil S. Amiri. Scalable and manageable storage systems. Carnegie Mellon University, December 2000.
[12] K. Bellam, A. Manzanares, X. Ruan, X. Qin, and Y.-M. Yang. Improving reliability and energy efficiency of disk systems via utilization control. In Proc. IEEE Symp. Computers and Comm., 2008.
[13] Soren Bergmann and Steffen Strassburger. Challenges for the automatic generation of simulation models for production systems. In 2010 Summer Simulation Multiconference, SummerSim '10, pages 545-549, San Diego, CA, USA, 2010. Society for Computer Simulation International.
[14] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures. In Computer Architecture, 1994. Proceedings of the 21st Annual International Symposium on, pages 245-254, Apr. 1994.
[15] Francieli Zanon Boito, Rodrigo Virote Kassick, and Philippe O. A. Navaux.
The impact of applications' I/O strategies on the performance of the Lustre parallel file system. Int. J. High Perform. Syst. Archit., 3:122-136, May 2011.
[16] R.E. Brown and J.R. Ochoa. Distribution system reliability: default data and model validation. In IEEE Transactions on Power Systems, pages 704-709, March 1998.
[17] W.A. Burkhard and J. Menon. Disk array storage system reliability. In Proc. 23rd Int'l Symp. Fault-Tolerant Comp., pages 432-441, 1993.
[18] Philip H. Carns, Bradley W. Settlemyer, and Walter B. Ligon, III. Using server-to-server communication in parallel file systems to simplify consistency and improve performance. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 6:1-6:8, Piscataway, NJ, USA, 2008. IEEE Press.
[19] D. Colarelli and D. Grunwald. Massive arrays of idle disks for storage archives. In Proc. ACM/IEEE Conf. Supercomputing, pages 1-11, 2002.
[20] Gerry Cole. Estimating drive reliability in desktop computers and consumer electronics systems. Seagate Personal Storage Group, 2000.
[21] Bryan Dodson. Weibull Analysis. ASQC Quality Press, Milwaukee, WI, USA, 1994.
[22] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the power-hungry disk. In Proc. USENIX Winter 1994 Technical Conf., pages 23-23, 1994.
[23] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the power-hungry disk. In Proc. USENIX Winter 1994 Technical Conf., pages 23-23, 1994.
[24] B. Eckart, Xin Chen, Xubin He, and S.L. Scott. Failure prediction models for proactive fault tolerance within storage systems. In Modeling, Analysis and Simulation of Computers and Telecommunication Systems, 2008. MASCOTS 2008. IEEE International Symposium on, pages 1-8, Sept. 2008.
[25] J.G. Elerath. Specifying reliability in the disk drive industry: No more MTBF's. pages 194-199, 2000.
[26] J.G. Elerath and M. Pecht. Enhanced reliability modeling of RAID storage systems. In Proc. IEEE/IFIP Int'l Conf. Dependable Sys. and Networks, 2007.
[27] Hyeonsang Eom and Jeffrey K. Hollingsworth. Speed vs. accuracy in simulation for I/O-intensive applications. In IPDPS, pages 315-322. IEEE Computer Society Press, 2000.
[28] Blake G. Fitch, Aleksandr Rayshubskiy, Michael C. Pitman, T. J. Christopher Ward, and Robert S. Germain. Using the active storage fabrics model to address petascale storage challenges. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, PDSW '09, pages 47-54, New York, NY, USA, 2009. ACM.
[29] Richard Freitas, Joseph Slember, Wayne Sawdon, and Lawrence Chiu. GPFS scans 10 billion files in 43 minutes. San Jose, CA, USA, 2011. IBM Advanced Storage Laboratory.
[30] E. Grochowski and R.F. Hoyt. Future trends in hard disk drives. Magnetics, IEEE Transactions on, 32(3):1850-1854, May 1996.
[31] Jorge Guerra, Himabindu Pucha, Joseph Glider, Wendy Belluomini, and Raju Rangaswami. Cost effective storage using extent based dynamic tiering. In Proceedings of the 9th USENIX conference on File and storage technologies, FAST '11, pages 20-20, Berkeley, CA, USA, 2011. USENIX Association.
[32] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke. DRPM: dynamic speed control for power management in server class disks. In Proc. Int'l Symp. Computer Architecture, pages 169-179, June 2003.
[33] Ibrahim F. Haddad. PVFS: A parallel virtual file system for Linux clusters. Linux J., 2000, November 2000.
[34] D.P. Helmbold, D.E. Long, T.L. Sconyers, and B. Sherrod. Adaptive disk spin-down for mobile computers. Mob. Netw. Appl., 5(4):285-297, 2000.
[35] G.F.
Hughes and J.F. Murray. Reliability and security of RAID storage systems and D2D archives using SATA disk drives. ACM Trans. Storage, 1(1):95-107, Dec. 2004.
[36] G.F. Hughes, J.F. Murray, K. Kreutz-Delgado, and C. Elkan. Improved disk-drive failure warnings. Reliability, IEEE Transactions on, 51(3):350-357, Sep. 2002.
[37] Maximum Institution Inc. 2002.
[38] S. Jin and A. Bestavros. GISMO: A generator of internet streaming media objects and workloads. ACM SIGMETRICS Performance Evaluation Review, November 2001.
[39] John Hawkins and Mikael Bodén. The applicability of recurrent neural networks for biological sequence analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2:243-253, 2005.
[40] John S. Bucy, Jiri Schindler, Steven W. Schlosser, and Gregory R. Ganger. The DiskSim simulation environment version 4.0 reference manual. 2008.
[41] Mahmut Kandemir, Seung Woo Son, and Mustafa Karakoy. Improving disk reuse for reducing power consumption. In Proceedings of the 2007 international symposium on Low power electronics and design, ISLPED '07, pages 129-134, New York, NY, USA, 2007. ACM.
[42] P. Krishnan, P. M. Long, and J. S. Vitter. Adaptive disk spindown via optimal rent-to-buy in probabilistic environments. Technical report, Durham, NC, USA, 1995.
[43] Samuel Lang, Philip Carns, Robert Latham, Robert Ross, Kevin Harms, and William Allcock. I/O performance challenges at leadership scale. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 40:1-40:12, New York, NY, USA, 2009. ACM.
[44] Chunhua Li, Ke Zhou, and Dan Feng. Capturing the object behavior for storage system evaluation. Int. J. High Perform. Comput. Netw., 6:226-233, December 2010.
[45] Dong Li and Jun Wang. EERAID: energy efficient redundant and inexpensive disk array. In Proceedings of the 11th workshop on ACM SIGOPS European workshop, EW 11, New York, NY, USA, 2004. ACM.
[46] K. Li, R. Kumpf, P. Horton, and T. Anderson. A quantitative analysis of disk drive power management in portable computers. In Proc. USENIX Winter Technical Conf., pages 22-22, 1994.
[47] Seagate Technology LLC. Product manual: Cheetah 15K.6 SAS. Scotts Valley, CA, USA, September 2008. Seagate Technology LLC.
[48] Tim Lynam, John Drewry, Will Higham, and Carl Mitchell. Adaptive modelling for adaptive water quality management in the Great Barrier Reef region, Australia. Environ. Model. Softw., 25:1291-1301, November 2010.
[49] Adam Manzanares, Xiaojun Ruan, Shu Yin, and Mais Nijim. Energy-aware prefetching for parallel disk systems: Algorithms, models, and evaluation. IEEE Int'l Symp. on Network Computing and Applications, 2009.
[50] Adam C. Manzanares. Energy efficient pre-fetching: models to implementation. Auburn University, April 2010.
[51] C. Mee and Eric Daniel. Magnetic Storage Handbook. McGraw-Hill, Inc., New York, NY, USA, 2nd edition, 1996.
[52] J. Menon. A performance comparison of RAID-5 and log-structured arrays. In High Performance Distributed Computing, 1995. Proceedings of the Fourth IEEE International Symposium on, pages 167-178, Aug. 1995.
[53] David Nagle, Denis Serenyi, and Abbie Matthews. The Panasas ActiveScale storage cluster: Delivering scalable high bandwidth storage. In Proceedings of the 2004 ACM/IEEE conference on Supercomputing, SC '04, page 53, Washington, DC, USA, 2004. IEEE Computer Society.
[54] Athanasios E. Papathanasiou and Michael L. Scott. Power-efficient server-class performance from arrays of laptop disks. 2004.
[55] J.-F.
Pâris, T.J. Schwarz, and D.D.E. Long. Evaluating the reliability of storage systems. In Proc. IEEE Int'l Symp. Reliable and Distr. Sys., 2006.
[56] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In SIGMOD '88: Proceedings of the 1988 ACM SIGMOD international conference on Management of data, pages 109-116, New York, NY, USA, 1988. ACM.
[57] Juan Piernas, Jarek Nieplocha, and Evan J. Felix. Evaluation of active storage strategies for the Lustre parallel file system. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, pages 28:1-28:10, New York, NY, USA, 2007. ACM.
[58] E. Pinheiro and R. Bianchini. Energy conservation techniques for disk array-based servers. In Proc. 18th Int'l Conf. Supercomputing, 2004.
[59] E. Pinheiro, R. Bianchini, E. Carrera, and T. Heath. Load balancing and unbalancing for power and performance in cluster-based systems. Proc. Workshop Compilers and Operating Sys. for Low Power, September 2001.
[60] E. Pinheiro, R. Bianchini, and C. Dubnicki. Exploiting redundancy to conserve energy in storage systems. In Proc. Joint Int'l Conf. Measurement and Modeling of Computer Systems, 2006.
[61] E. Pinheiro, W.-D. Weber, and L.A. Barroso. Failure trends in a large disk drive population. In Proc. USENIX Conf. File and Storage Tech., February 2007.
[62] Eduardo Pinheiro, Ricardo Bianchini, and Cezary Dubnicki. Exploiting redundancy to conserve energy in storage systems. SIGMETRICS Perform. Eval. Rev., 34(1):15-26, 2006.
[63] A. Polze, P. Tröger, and F. Salfner. Timely virtual machine migration for pro-active fault tolerance. In Object/Component/Service-Oriented Real-Time Distributed Computing Workshops (ISORCW), 2011 14th IEEE International Symposium on, pages 234-243, March 2011.
[64] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In ATEC '00: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 4-4, Berkeley, CA, USA, 2000. USENIX Association.
[65] X. J. Ruan, A. Manzanares, K. Bellam, Z. L. Zong, and X. Qin. DARAW: A new write buffer to improve parallel I/O energy-efficiency. In Proc. ACM Symp. Applied Computing, 2009.
[66] X.-J. Ruan, A. Manzanares, S. Yin, Z.-L. Zong, and X. Qin. Performance evaluation of energy-efficient parallel I/O systems with write buffer disks. In Proc. 38th Int'l Conf. Parallel Processing, Sept. 2009.
[67] Robert G. Sargent. Verification and validation of simulation models. In WSC '05: Proceedings of the 37th conference on Winter simulation, pages 130-143. Winter Simulation Conference, 2005.
[68] Robert G. Sargent. Verification and validation of simulation models. In Proceedings of the 37th conference on Winter simulation, WSC '05, pages 130-143. Winter Simulation Conference, 2005.
[69] K. Bernhard Schiefer and Gary Valentin. DB2 universal database performance tuning. IEEE Data Eng. Bull., 22(2):12-19, 1999.
[70] S. Schlesinger, R.E. Crosbie, R.E. Gagné, and G.S. Innis. Terminology for model credibility. In Simulation 32, pages 103-104, 1979.
[71] B. Schroeder and G.A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In Proc. USENIX Conf. File and Storage Tech., page 1, 2007.
[72] S. Shah and J.G. Elerath. Reliability analysis of disk drive failure mechanisms. In Proc. Annual Reliability and Maintainability Symp., pages 226-231, 2005.
[73] H. Shen, Mohan Kumar, S.K. Das, and Z. Wang.
Energy-efficient caching and prefetching with data consistency in mobile distributed systems. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 67, April 2004.
[74] Sean M. Snyder, Shimin Chen, Panos K. Chrysanthis, and Alexandros Labrinidis. QMD: exploiting flash for energy efficient disk arrays. In Proceedings of the Seventh International Workshop on Data Management on New Hardware, DaMoN '11, pages 41-49, New York, NY, USA, 2011. ACM.
[75] S. W. Son, M. Kandemir, and A. Choudhary. Software-directed disk power management for scientific applications. In Proc. IEEE Int'l Parallel and Distr. Processing Symp., 2005.
[76] S.W. Son and M. Kandemir. Energy-aware data prefetching for multi-speed disks. In Proc. Int'l Conf. Comp. Frontiers, 2006.
[77] Huaiming Song, Yanlong Yin, Xian-He Sun, Rajeev Thakur, and Samuel Lang. Server-side I/O coordination for parallel file systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 17:1-17:11, New York, NY, USA, 2011. ACM.
[78] IDEMA Standards. Specification of hard disk drive reliability. Document Number R2-98.
[79] Jan Stender, Björn Kolbeck, Felix Hupfeld, Eugenio Cesario, Erich Focht, Matthias Hess, Jesús Malo, and Jonathan Martí. Striping without sacrifices: maintaining POSIX semantics in a parallel file system. In First USENIX Workshop on Large-Scale Computing, pages 6:1-6:8, Berkeley, CA, USA, 2008. USENIX Association.
[80] A. Thomasian and M. Blaum. Mirrored disk organization reliability analysis. IEEE Trans. Computers, 55(12):1640-1644, 2006.
[81] P.J. Varman and R.M. Verma. Tight bounds for prefetching and buffer management algorithms for parallel I/O systems. IEEE Trans. Parallel Distrib. Syst., 10(12):1262-1275, 1999.
[82] J. Wang, H.-J. Zhu, and D. Li. eRAID: Conserving energy in conventional disk-based RAID systems. IEEE Trans. Computers, 57(3):359-374, 2008.
[83] Jun Wang, Xiaoyu Yao, and Huijun Zhu. Exploiting in-memory and on-disk redundancy to conserve energy in storage systems. IEEE Trans. Comput., 57:733-747, June 2008.
[84] Jun Wang, Huijun Zhu, and Dong Li. eRAID: Conserving energy in conventional disk-based RAID systems. IEEE Transactions on Computers, 57(3):359-374, 2008.
[85] Charles Weddle, Mathew Oldham, Jin Qian, An-I Andy Wang, Peter Reiher, and Geoff Kuenning. PARAID: a gear-shifting power-aware RAID. In FAST '07: Proceedings of the 5th USENIX conference on File and Storage Technologies, pages 30-30, Berkeley, CA, USA, 2007. USENIX Association.
[86] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: a scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, OSDI '06, pages 307-320, Berkeley, CA, USA, 2006. USENIX Association.
[87] Andreas Weissel, Björn Beutel, and Frank Bellosa. Cooperative I/O: a novel I/O semantics for energy-aware applications. In Proc. the 5th Symp. Operating Systems Design and Implementation, pages 117-129, New York, NY, USA, 2002. ACM.
[88] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. Scalable performance of the Panasas parallel file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST '08, pages 2:1-2:17, Berkeley, CA, USA, 2008. USENIX Association.
[89] T. Xie.
SEA: A striping-based energy-aware strategy for data placement in RAID-structured storage systems. IEEE Trans. Computers, 57(6):748-761, June 2008.
[90] T. Xie and Y. Sun. Sacrificing reliability for energy saving: Is it worthwhile for disk arrays? In Proc. IEEE Symp. Parallel and Distr. Processing, pages 1-12, April 2008.
[91] Tao Xie. SEA: A striping-based energy-aware strategy for data placement in RAID-structured storage systems. IEEE Transactions on Computers, 57:748-761, 2008.
[92] Tao Xie and Hui Wang. MICRO: A multilevel caching-based reconstruction optimization for mobile storage systems. Computers, IEEE Transactions on, 57(10):1386-1398, Oct. 2008.
[93] Q. Xin, J.E. Thomas, S.J. Schwarz, and E.L. Miller. Disk infant mortality in large storage systems. In Proc. IEEE Int'l Symp. Modeling, Analysis, and Simulation of Computer and Telecomm. Sys., 2005.
[94] Qin Xin, E.L. Miller, and S.J.T.J.E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In High Performance Distributed Computing, 2004. Proceedings. 13th IEEE International Symposium on, pages 172-181, June 2004.
[95] Qin Xin, E.L. Miller, T. Schwarz, D.D.E. Long, S.A. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In Mass Storage Systems and Technologies, 2003 (MSST 2003). Proceedings. 20th IEEE/11th NASA Goddard Conference on, pages 146-156, April 2003.
[96] Ying Xu and Brett D. Fleisch. NFS-CC: tuning NFS for concurrent read sharing. Int. J. High Perform. Comput. Netw., 1:203-213, December 2004.
[97] J. Yang and F.-B. Sun. A comprehensive review of hard-disk drive reliability. In Proc. Annual Reliability and Maintainability Symp., 1999.
[98] Q. Yang and Y.-M. Hu. DCD - Disk Caching Disk: A new approach for boosting I/O performance. In Proc. Int'l Symp. Computer Architecture, pages 169-169, May 1996.
[99] S. Yin, X. Ruan, A. Manzanares, and X. Qin. How reliable are parallel disk systems when energy-saving schemes are involved? In Proc. IEEE International Conference on Cluster Computing (CLUSTER), 2009.
[100] John Zedlewski, Sumeet Sobti, Nitin Garg, Fengzhou Zheng, Arvind Krishnamurthy, and Randolph Wang. Modeling hard-disk power consumption. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 217-230, Berkeley, CA, USA, 2003. USENIX Association.
[101] Junyao Zhang, Pengju Shang, and Jun Wang. A scalable reverse lookup scheme using group-based shifted declustering layout. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 604-615, May 2011.
[102] Q.-B. Zhu, F.M. David, C.F. Devaraj, Z.-M. Li, Y.-Y. Zhou, and P. Cao. Reducing energy consumption of disk storage using power-aware cache management. In Proc. Int'l Symp. High Performance Comp. Arch., page 118, Washington, DC, USA, 2004.
[103] Qingbo Zhu, Zhifeng Chen, Lin Tan, Yuanyuan Zhou, Kimberly Keeton, and John Wilkes. Hibernator: helping disk arrays sleep through the winter. SIGOPS Oper. Syst. Rev., 39:177-190, October 2005.