Informed Prefetching in Distributed Multi-Level Storage Systems

by

Maen Mahmoud Al Assaf

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
December 12, 2011

Keywords: informed prefetching, pipelining, parallel storage systems, multi-level storage system, distributed storage systems

Copyright 2011 by Maen Mahmoud Al Assaf

Approved by
Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
David Umphress, Associate Professor of Computer Science and Software Engineering
Wei-Shinn Ku, Assistant Professor of Computer Science and Software Engineering

Abstract

In this dissertation, we present pipelined prefetching mechanisms that use application-disclosed access patterns to prefetch hinted blocks in multi-level storage systems. The fundamental concept in our approach is to split an informed prefetching process into a set of independent prefetching steps among multiple storage levels (e.g., main memory, solid state disks, and hard disk drives). In the first part of this study, we show that a prefetching pipeline across multiple storage levels is a viable and effective technique for allocating file buffers at the multiple-level storage devices. Our approaches (a.k.a., iPipe and IPO) extend previous ideas of informed prefetching in two ways: (1) our approach reduces applications' I/O stalls by keeping hinted data in caches residing in the main memory, solid state disks, and hard drives; (2) we propose a pipelined prefetching scheme in which multiple informed prefetching mechanisms semi-dependently work to fetch blocks from low-level (slow) to high-level (fast) storage devices. Our iPipe and IPO strategies, integrated with the pipelining mechanism, significantly reduce overall I/O access time in multiple-level storage systems. Next, we propose a third prefetching scheme called IPODS that aims to maximize the benefit of informed prefetches as well as to hide network latencies in distributed storage systems. Finally, we develop a simulator to evaluate the performance of the proposed informed prefetching schemes in the context of multiple-level storage systems, and we implement a prototype to validate the accuracy of the simulator. Our results show that iPipe improves system performance by 56% in most informed prefetching cases, while IPO and IPODS improve system performance by 56% and 6%, respectively, in informed prefetching critical cases across a wide range of real-world I/O traces.

Acknowledgments

I owe my gratitude to all the people who have made this work possible and who supported me during my stay at Auburn University. I would like to express my deepest gratitude to my advisor, Dr. Xiao Qin. I was indeed fortunate to be his student. He educated me and gave me support, experience, encouragement, and guidance. He taught me good research methodologies, teamwork, and writing skills. He also did his best to help me when I faced obstacles. Without his help and support, this work would not have been possible. I really hope to be as good an advisor as Dr. Qin for my future students.

I am grateful and thankful to Dr. David Umphress, who reviewed my proposal and dissertation. He alerted me to many important points concerning good writing skills and research. I am pleased to be his student. I would like to express my sincere thanks to Dr. Wei-Shinn Ku for reviewing my proposal and dissertation. Dr. Ku indeed supported me in my research and guided me on the right track.
I am really grateful and thankful to him. I would also like to acknowledge Dr. Guofu Niu from the Department of Electrical and Computer Engineering at Auburn University for reviewing my dissertation and supporting my research. I thank him very much. This work was discussed with my colleagues in Dr. Xiao Qin's research group. I would like to mention in particular Xiaojun Ruan, Shu Yin, Yun Tian, Zhiyang Ding, James Majors, Jiong Xie, Yixian Yang, Ji Zhang, Joshua Lewis, and Jianguo Lu. I really thank them very much.

I am indebted to the University of Jordan, which sponsored me during my stay in the United States. In particular, I would like to thank my colleagues in the King Abdullah II School of IT for their support and advice.

Most importantly, I thank my family, who encouraged me to continue my Ph.D. education. Without their patience and love, this work would not have been possible. I dedicate this dissertation to them.

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Problem Statement
    1.1.1 Informed Prefetching for Multiple-Level Storage Systems
    1.1.2 Prefetching in Multi-Level Storage Systems with Limited I/O Bandwidth
    1.1.3 Informed Prefetching for Distributed Multi-level Storage Systems
  1.2 Motivations
    1.2.1 Motivation 1: the growing needs of multi-level storage systems
    1.2.2 Motivation 2: the I/O access hints offered by applications
    1.2.3 Motivation 3: multiple prefetching mechanisms perform in parallel
  1.3 Contributions
  1.4 Dissertation Organization
2 Literature Review & Current Work
  2.1 Storage systems current work
    2.1.1 Multi-level storage systems
    2.1.2 Solid State Disks and Hard Drives
    2.1.3 Parallel Storage Systems
    2.1.4 Distributed Storage Systems
  2.2 Prefetching
    2.2.1 Two Types of Prefetching
    2.2.2 Informed Prefetching
    2.2.3 Predictive Prefetching
    2.2.4 Prefetching in multi-level storage systems
    2.2.5 Prefetching in Distributed and Parallel Storage Systems
  2.3 Summary
3 Assumptions and Parameters Validations
  3.1 System Assumptions
    3.1.1 Demand Misses
    3.1.2 Informed Caching
    3.1.3 Bandwidth Limitations
    3.1.4 I/O Bandwidth Sharing
    3.1.5 Life Time of Solid State Disk (SSD)
  3.2 Assumptions on Prefetching and Pipelining
    3.2.1 Initial Data Allocation
    3.2.2 A Pipeline for Data Transfers
    3.2.3 Writes and Data Consistency
    3.2.4 Pipelining Depth
    3.2.5 Block Size
    3.2.6 LASR Traces
  3.3 Validations
    3.3.1 Overview
    3.3.2 System Setup
    3.3.3 Proof of the Concept
    3.3.4 Block Size
    3.3.5 Model T_cpu + T_hit + T_driver
    3.3.6 System Parameters Validation
    3.3.7 Limited Parallel I/O Bandwidth
  3.4 Summary
4 iPipe: A Pipelined and Informed Prefetching for Multi-Level Storage Systems
  4.1 Overview
  4.2 Motivations and Objectives
  4.3 Design Issues in iPipe
    4.3.1 Architecture of iPipe
    4.3.2 Assumptions
  4.4 The iPipe Algorithm
    4.4.1 Stalls and Disk Read Latencies
    4.4.2 Prefetching Horizon
    4.4.3 The Pstart and Pdepth Algorithms
    4.4.4 Stalls, Elapsed Time, Prefetching Horizon, and Prefetching Benefit
    4.4.5 The iPipe Algorithm
  4.5 Performance Evaluation
    4.5.1 System Setup
    4.5.2 Preliminary Results
    4.5.3 Validated Performance Evaluation
  4.6 Summary
5 IPO: Informed Prefetching Optimization in Multi-level Storage Systems
  5.1 Overview
  5.2 Motivations and Objectives
  5.3 Design Issues in IPO
    5.3.1 Architecture of IPO
    5.3.2 Assumptions
  5.4 The IPO Algorithm
    5.4.1 Definitions
    5.4.2 The Pstart and Pnext Algorithms
    5.4.3 The IPO Algorithm
  5.5 Performance Evaluation
    5.5.1 Simulation Environment
    5.5.2 Elapsed Time Improvement
    5.5.3 Bandwidth Utilization
    5.5.4 Increasing the MaxBW Value
  5.6 Summary
6 IPODS: Informed Prefetching in Distributed Multi-level Storage Systems
  6.1 Overview
  6.2 Motivations and Objectives
  6.3 IPODS Design Issues
    6.3.1 The IPODS Architecture
    6.3.2 Assumptions
  6.4 The IPODS Algorithm
    6.4.1 Definitions
    6.4.2 The IPODS Algorithm
  6.5 Performance Evaluation
    6.5.1 Simulation Environment
    6.5.2 Improving Elapsed Time
  6.6 Summary
7 Prototype Development
    7.0.1 Objectives
    7.0.2 System Setup
    7.0.3 Design Issues of the Prototypes
    7.0.4 Validation Process
  7.1 Prototypes
    7.1.1 The iPipe Prototype
    7.1.2 The IPO Prototype
    7.1.3 The IPODS Prototype
  7.2 Summary
8 Conclusions & Future Work
  8.1 Main Contributions
    8.1.1 iPipe: A Pipelined and Informed Prefetching
    8.1.2 IPO: Informed Prefetching Optimization
    8.1.3 IPODS: Pipelined Prefetching in Distributed/Parallel Storage Systems
    8.1.4 Prototypes for iPipe, IPO, and IPODS
  8.2 Future Work
    8.2.1 Data Migration
    8.2.2 The Cost-Benefit Model
    8.2.3 Write Performance
    8.2.4 Most Recently Used Policy
    8.2.5 Extending Storage Hierarchy
    8.2.6 Caching and Benchmarking
    8.2.7 Various Solid State Disks
    8.2.8 Block Sizes
    8.2.9 Fast Networks for Distributed Storage Systems
Bibliography

List of Figures

2.1 Multi-level storage system that consists of different storage devices with various speed performance.
3.1 Read latency of HDDs and SSDs. When the block size is 10 MB, the SSD has better read performance than the HDD.
3.2 Read latency of accessing a remote HDD and SSD through the LAN network connection. In this distributed system setting, the SSD has better read performance than the HDD when the block size is 200 MB.
3.3 T_cpu + T_hit + T_driver values range from 0.037 to 0.104 seconds. We will consider the smallest value.
3.4 10 MB: Estimated T_hdd_ss is 0.122 seconds.
3.5 10 MB: Estimated T_hdd_cache is 0.12 seconds.
3.6 10 MB: Estimated T_ss_cache is 0.052 seconds.
3.7 200 MB: Estimated T_hdd_ss is 4.5 seconds.
3.8 200 MB: Estimated T_hdd_cache is 2.3 seconds.
3.9 200 MB: Estimated T_ss_cache is 1.5 seconds.
3.10 Limited parallel I/O bandwidth. Concurrent read requests are noticeably affected by the limited parallel I/O bandwidth.
4.1 iPipe system hardware architecture, consisting of an array of multi-level disks.
4.2 High-level design of the iPipe software architecture for a multi-level storage system. The multi-level storage system consists of two levels - SSDs and HDDs.
4.3 Detailed design of the iPipe software architecture for a three-level storage system. An application provides hints to both TIP and iPipe. The system performance parameters are passed to iPipe to calculate the pipelining starting block and depth. iPipe keeps fetching hinted data blocks to the highest level. Since the storage system's bandwidth is high enough, iPipe is able to fetch most of the hinted blocks. TIP uses a cost-benefit model to determine the number of prefetch buffers. Hinted data blocks are fetched from the storage system to the buffer cache.
4.4 Average stall when using iPipe and a fixed number of buffers for pipelined prefetching. T_hdd_cache = 5, T_hdd_ss = 8, T_ss_cache = 4, Xcache = 3. The first stall is 5 time units. Before the first hinted block is fetched from the HDD into the SSD, the application stalls for T_stall_hdd(Xcache) = T_hdd_cache - 3(T_cpu + T_hit + T_driver) = 2 time units every 3 accesses. When hinted blocks are retrieved in the SSD, the application stalls for T_stall_ss(Xcache) = T_ss_cache - 3(T_cpu + T_hit + T_driver) = 1 time unit every 3 accesses.
4.5 Average stall when using a fixed number of buffers for pipelined prefetching. T_hdd_cache = 5, Xcache = 3. The first stall is for 5 time units, because all data blocks are read from the HDD. The application stalls for T_stall_hdd(Xcache) = T_hdd_cache - 3(T_cpu + T_hit + T_driver) = 2 time units every 3 accesses.
4.6 Total elapsed time when the number of prefetching buffers is set from 1 to 11. iPipe reduces the elapsed time.
4.7 Total elapsed time when the number of prefetching buffers is set from 13 to 26. iPipe reduces the elapsed time.
4.8 Total informed prefetching read latency. iPipe reduces the read latency.
4.9 Total stall time when the number of prefetching buffers is set from 1 to 9. iPipe reduces the stall time.
4.10 Total stall time when the number of prefetching buffers is set from 11 to 19. iPipe reduces the stall time.
4.11 Total elapsed time when the number of prefetching buffers is set from 1 to 5. iPipe reduces the elapsed time in both Nova-V64 and Intel X25-E SSDs. Elapsed time is less when the Intel X25-E SSD is tested because it is faster than the Nova-V64 SSD.
4.12 Total elapsed time when the number of prefetching buffers is set from 7 to 11. iPipe reduces the elapsed time in both Nova-V64 and Intel X25-E SSDs. Elapsed time is less when the Intel X25-E SSD is tested because it is faster than the Nova-V64 SSD.
4.13 Total elapsed time when the number of prefetching buffers is set from 13 to 26. iPipe reduces the elapsed time in both Nova-V64 and Intel X25-E SSDs. Elapsed time is less when the Intel X25-E SSD is tested because it is faster than the Nova-V64 SSD.
4.14 Block size = 10 MB. Total elapsed time when the number of buffers is set from 1 to 9. iPipe reduces the elapsed time.
4.15 Block size = 10 MB. Total elapsed time when the number of buffers is set from 11 to 25. iPipe reduces the elapsed time.
4.16 Block size = 10 MB. Total elapsed time when the number of buffers is set from 35 to 63. iPipe reduces the elapsed time.
4.17 Block size = 200 MB. Total elapsed time when the number of buffers is set from 1 to 9. iPipe reduces the elapsed time.
4.18 Block size = 200 MB. Total elapsed time when the number of buffers is set from 11 to 25. iPipe reduces the elapsed time.
4.19 Block size = 200 MB. Total elapsed time when the number of buffers is set from 35 to 63. iPipe reduces the elapsed time.
5.1 High-level design of the IPO software architecture. The multi-level storage system consists of an array of two levels of SSDs and HDDs with limited bandwidth and scalability.
5.2 Detailed design of the IPO software architecture for a three-level storage system. The application provides hints on future I/O accesses. IPO determines the appropriate hinted blocks to be fetched. IPO keeps prefetching a particular number (depending on available bandwidth) of hinted blocks to the uppermost level. The upper-level prefetcher (i.e., TIP) uses the cost/benefit model to determine the number of prefetching buffers.
5.3 Average stalls when using IPO and a fixed number of buffers for parallel prefetching in the buffer cache. The maximum number (MaxBW) of read requests is 5. Informed prefetching buffers = 2, and the remaining 3 slots of the bandwidth are used for pipelined prefetching. T_hdd_cache = 5, T_hdd_ss = 8, T_ss_cache = 4, and Xcache = 2. The first accesses stall for 5 time units. Before IPO fetches hinted blocks from the HDD to the SSD, the application stalls for T_stall_hdd(Xcache) = T_hdd_cache - 3(T_cpu + T_hit + T_driver) = 2 time units every 3 accesses. IPO continues to fetch 3 hinted blocks each time from the HDD. When a prefetched block is consumed from the SSD, a new pipelined prefetching request is initiated by IPO. When IPO is employed, stall time becomes 40 and elapsed time is 76 time units.
5.4 Continuation of Figure 5.3.
5.5 Average stalls when using a fixed number of buffers for parallel prefetching in the buffer cache. T_hdd_cache = 5, and Xcache = 2. The first accesses stall for 5 time units. All data is read from the HDD. The application stalls for T_stall_hdd(Xcache) = T_hdd_cache - 3(T_cpu + T_hit + T_driver) = 2 time units every 3 accesses. In the non-IPO case, stall time is 45 and elapsed time is 81.
5.6 Continuation of Figure 5.5.
5.7 IPO reduces application elapsed time. 10 MB block size. Total elapsed time when using 1 to 15 prefetching buffers. MaxBW = 15.
5.8 IPO reduces application elapsed time. 200 MB block size. Total elapsed time when using 1 to 7 prefetching buffers. MaxBW = 15.
5.9 IPO reduces application elapsed time. 200 MB block size. Total elapsed time when using 9 to 15 prefetching buffers. MaxBW = 15.
5.10 Bandwidth utilization when Xcache is varied from 1 to 15 in both the IPO and non-IPO cases. IPO fully utilizes the bandwidth.
5.11 IPO reduces the elapsed time. Total elapsed time when the Xcache value is increased from 1 to 63. MaxBW is set to 154.
6.1 The architecture of a distributed parallel storage system. Several distributed clients and storage nodes are connected by a network. (T. Madhyastha, G. Gibson, and C. Faloutsos: Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, Oregon, 1999.)
6.2 [71] Another architecture of a distributed parallel storage system. Several distributed clients and storage nodes are connected by a network. (L. Cabrera and D. D. E. Long: Swift: Using Distributed Disk Striping to Provide High I/O Data Rates, University of California at Santa Cruz, Santa Cruz, CA, 1991.)
6.3 Distributed/parallel multi-level storage system: the system shows several distributed storage nodes and clients connected by a network. Each I/O storage node consists of a two-level storage device containing both an SSD and an HDD. (T. Madhyastha, G. Gibson, and C. Faloutsos: Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, Oregon, 1999.)
6.4 Total elapsed time when the number of prefetching buffers is varied from 1 to 15. MaxBW = 15. IPODS reduces the elapsed time.

List of Tables

3.1 I/O bandwidth measured using the RAMspeed benchmark.
4.1 The number of informed prefetching requests issued to HDDs when the LASR1 and LASR2 traces are evaluated.
4.2 Service time is reduced when one extra buffer is added for prefetching.
4.3 The prefetching horizon P(T_cpu) equals a distance of 26 data blocks when iPipe is not used. P(T_cpu) drops to a distance of 20 data blocks when iPipe is used.
4.4 Total stall time when the number of prefetching buffers is set from 21 to 26. iPipe reduces the stall time.
4.5 Data block size = 10 MB. The position of the first data block to be prefetched.
4.6 Data block size = 10 MB. The depth of the pipelined prefetching when using different Xcache values. A small depth is needed when few Xcache buffers are used. The maximum depth = 91. iPipe needs to assign the maximum depth for pipelined prefetching starting from Xcache = 35, because the reading stall from the SSD at that point = 0.
4.7 Data block size = 200 MB. The position of the first data block to be prefetched.
4.8 Data block size = 200 MB. The depth of the pipelined prefetching when using different Xcache values. A small depth is needed when few Xcache buffers are used. The maximum depth = 163. iPipe needs to assign the maximum depth for pipelined prefetching starting from Xcache = 45, because the reading stall from the SSD at that point = 0.
7.1 Total elapsed time measured in seconds when the iPipe prototype is tested. The number of prefetching buffers is set to a range between 1 and 63. The block size is 10 MB.
7.2 Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe prototype. The number of prefetching buffers is set to a range between 1 and 63. The block size is 10 MB.
7.3 Total elapsed time measured in seconds when the iPipe simulator is tested. The number of prefetching buffers is set to a range between 1 and 63. The block size is 10 MB.
7.4 Comparison between iPipe's simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 10 MB.
7.5 Total elapsed time measured in seconds when the iPipe prototype is tested. The number of prefetching buffers is set to a range between 1 and 63. The block size is 200 MB.
7.6 Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe prototype. The number of prefetching buffers is set to a range between 1 and 63. The block size is 200 MB.
7.7 Total elapsed time measured in seconds when the iPipe simulator is tested. The number of prefetching buffers is set to a range between 1 and 63. The block size is 200 MB.
7.8 Comparison between iPipe's simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 200 MB.
7.9 Total elapsed time measured in seconds when the IPO prototype is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 10 MB.
7.10 Total elapsed time measured in seconds when the LASR traces are replayed by the IPO prototype. The number of prefetching buffers is set to a range between 1 and 15. The block size is 10 MB.
7.11 Total elapsed time measured in seconds when the IPO simulator is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 10 MB.
7.12 Comparison between IPO's simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the IPO simulator and prototype. The block size is 10 MB.
7.13 Total elapsed time measured in seconds when the IPO prototype is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.14 Total elapsed time measured in seconds when the LASR traces are replayed by the IPO prototype. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.15 Total elapsed time measured in seconds when the IPO simulator is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.16 Comparison between IPO's simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the IPO simulator and prototype. The block size is 200 MB.
7.17 Total elapsed time measured in seconds when the IPODS prototype is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.18 Total elapsed time measured in seconds when the LASR traces are replayed by the IPODS prototype. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.19 Total elapsed time measured in seconds when the LASR traces are replayed by the IPODS simulator. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.
7.20 Total elapsed time measured in seconds when the IPODS simulator is tested. The number of prefetching buffers is set to a range between 1 and 15. The block size is 200 MB.

Chapter 1
Introduction

To solve the I/O bottleneck problem (see [81]) in large-scale computing systems, researchers have proposed a wide range of prefetching techniques for preloading data from disks into the main memory prior to the data accesses. Existing prefetching techniques can be categorized into two camps - predictive prefetching and informed prefetching. Predictive prefetching schemes predict future I/O access patterns based on historical I/O accesses of applications [2], whereas informed prefetching techniques make preloading decisions based on applications' future access hints [1]. In this dissertation study, we focus on informed prefetching schemes and investigate the performance impact of informed prefetching on multiple-level storage systems. Well-known predictive prefetching solutions are summarized in Section 2.2.

The chapter is organized as follows. The first section describes the problem statement of this dissertation study. In the second section, we illustrate the important motivations for our newly proposed informed prefetching approaches. Next, the third section outlines the main contributions of this study. Finally, the last section specifies the organization of the dissertation.

1.1 Problem Statement

As the performance gap between processors and I/O subsystems increases rapidly, disk performance becomes a serious bottleneck for large-scale computing systems supporting data-intensive applications [32]. Recent studies show that informed prefetching can bridge the performance gap between the CPU and I/O; for example, an informed prefetching algorithm called TIP proposed by Patterson et al. aims to improve the performance of I/O-intensive applications by applying cost-benefit analysis to allocate buffers for both prefetching and caching [1]. TIP's cost-benefit analysis is possible because TIP estimates the impact of alternative buffer allocations on application execution time. This dissertation research is inspired by the TIP approach [1]. The following are the three main challenges addressed in our study:

1. informed prefetching for multiple-level storage systems,
2. prefetching in multi-level storage systems with limited I/O bandwidth, and
3. informed prefetching for distributed multi-level storage systems.

1.1.1 Informed Prefetching for Multiple-Level Storage Systems

When disk arrays are employed, the TIP algorithm improves the quality of parallel prefetching by accurately eliminating I/O stalls. In Patterson's approach, the informed caching and prefetching algorithm assigns a portion of the buffer space to demand caching (LRU) and the rest of the buffer to store hinted blocks prefetched from disks [1]. This buffer allocation process is guided by a cost-benefit model. When data access time increases due to high disk load, I/O stall time goes up and creates an increasing benefit of assigning additional buffers for hinted blocks.
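To make the cost-benefit trade-off described above concrete, the following is a minimal, hypothetical sketch (not TIP's actual estimators) of the decision it implies: a buffer is taken from the LRU cache for prefetching only when the estimated reduction in stall time outweighs the estimated cost of the lost cache hits. The function names and the parameters t_disk, t_consume, and lru_hit_rate_drop are assumptions introduced here purely for illustration.

# Hypothetical sketch of a cost-benefit buffer-allocation decision; not TIP's formulas.
def stall_per_access(prefetch_depth: int, t_disk: float, t_consume: float) -> float:
    """Estimated stall per access with `prefetch_depth` prefetch buffers in flight:
    the disk needs t_disk per block, the application consumes one block every
    t_consume, and any remaining latency is exposed as stall."""
    return max(t_disk - prefetch_depth * t_consume, 0.0) / max(prefetch_depth, 1)

def benefit_of_extra_prefetch_buffer(depth: int, t_disk: float, t_consume: float) -> float:
    # Reduction in per-access stall if one more buffer is devoted to prefetching.
    return (stall_per_access(depth, t_disk, t_consume)
            - stall_per_access(depth + 1, t_disk, t_consume))

def cost_of_shrinking_lru(lru_hit_rate_drop: float, t_disk: float) -> float:
    # Expected extra demand misses per access from giving up one LRU buffer,
    # each of which costs roughly one disk access.
    return lru_hit_rate_drop * t_disk

if __name__ == "__main__":
    depth, t_disk, t_consume = 3, 5.0, 1.0            # illustrative time units
    gain = benefit_of_extra_prefetch_buffer(depth, t_disk, t_consume)
    loss = cost_of_shrinking_lru(lru_hit_rate_drop=0.01, t_disk=t_disk)
    print("take buffer from LRU for prefetching" if gain > loss else "keep buffer in LRU cache")

In TIP, analogous estimates are derived from measured system parameters and drive the dynamic division of the buffer cache between the LRU pool and the prefetch pool [1]; the sketch above only illustrates the shape of that comparison.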
When parallel disk subsystems are extended into multiple-level storage systems [52] [23], a hierarchy of multiple storage devices increases data access latency if the data reside in a lower level of the system. To shorten long data transfer latency, popular data or data to be accessed in the future may be stored in the upper level of the storage system.

Traditional informed prefetching schemes can hide the latency of accessing storage systems by invoking disk I/O parallelism and fetching data based on application-disclosed hints. We will show that building an informed prefetching pipeline can significantly improve the I/O performance of multi-level storage systems. We will illustrate how to use application hints to initiate prefetching among multiple storage levels like main memory, solid state disks, and hard disk drives. The centerpiece of our approach is a pipeline in which we split the informed prefetching process into a set of independent prefetching steps among the multiple storage levels. In particular, we will demonstrate how to integrate this pipeline with informed prefetching and caching to manage file buffers at various storage levels.

1.1.2 Prefetching in Multi-Level Storage Systems with Limited I/O Bandwidth

I/O-intensive applications can disclose hints about their future I/O accesses, and these hints can be used to guide prefetching mechanisms in making accurate prefetching decisions. Existing informed prefetching algorithms rely on the assumption that parallel disks offer enough I/O bandwidth for prefetching without encountering I/O congestion. Under such an assumption, an informed prefetching mechanism can prefetch a large number of data blocks in parallel. Unfortunately, our preliminary results show that real-world storage systems may not have unlimited I/O bandwidth; this observation is especially true for small-scale storage systems. We address this issue by developing an informed prefetching algorithm (see Chapter 5) for a multiple-level storage system whose disk devices offer limited I/O bandwidth. Our informed prefetching solution is practical, because it does not rely on the assumption that storage systems provide unlimited I/O bandwidth.

1.1.3 Informed Prefetching for Distributed Multi-level Storage Systems

To further extend our prefetching approaches (see Chapters 4 and 5), we developed an informed prefetching algorithm (see Chapter 6) tailored for distributed multi-level storage systems, each of which consists of a group of multi-level storage servers. In a distributed storage system, large disk access latency due to network delays can be hidden by informed prefetching. We will demonstrate that a pipeline mechanism can be used to efficiently prefetch data blocks from a low-level storage device to an upper-level storage device before moving the data blocks to the clients.

1.2 Motivations

The following key factors motivated us to investigate pipelined informed prefetching:

1. the growing needs of multi-level storage systems,
2. the I/O access hints offered by applications, and
3. the possibility of multiple prefetching mechanisms working in parallel.

1.2.1 Motivation 1: the growing needs of multi-level storage systems

Multi-level storage systems have been widely employed in data centers supporting service-based applications such as multimedia streaming and scientific computing. For example, popular data are fetched and cached in an upper-level server while massive amounts of unpopular data are placed in lower-level storage servers.
Overall performance of I/O-intensive applications can be improved by increasing the I/O performance of multi-level storage systems. Existing studies (see, for example, [19] and [9]) suggest that new prefetching and caching techniques are needed to boost the I/O performance of multi-level storage systems. Since prefetching must be performed at each storage level to hide I/O latencies, prefetching mechanisms at multiple levels need to coordinate in order to achieve high prefetching efficiency.

1.2.2 Motivation 2: the I/O access hints offered by applications

The second factor motivating our informed prefetching for multi-level storage systems is that largely predictable I/O access patterns can be disclosed by applications as hints. I/O access hints are used by prefetching mechanisms to invoke asynchronous I/O accesses to fetch data to upper-level storage like the main memory and solid state disks. Informed prefetching in multi-level storage systems is challenging for two main reasons. First, hints must be processed at different storage levels in different manners. For example, an upper-level prefetching mechanism takes hints and makes conservative prefetching decisions for upper-level storage with small capacity; a lower-level prefetching mechanism is aggressive in order to effectively use large lower-level storage as a staging area. Second, at each storage level, an informed prefetching manager must balance cache space against prefetching space. Informed prefetching mechanisms at multiple levels should coordinate to manage caches across multiple levels of storage devices. Prefetching and caching issues in multi-level storage systems have been investigated in existing studies [19] [9]. However, to the best of our knowledge, no research has incorporated informed prefetching mechanisms into a multi-level storage system.

1.2.3 Motivation 3: multiple prefetching mechanisms perform in parallel

Multiple informed prefetching mechanisms can independently fetch blocks from lower-level to upper-level storage devices. These prefetching mechanisms can work in parallel, because multiple prefetching operations can be simultaneously processed by the upper-level and lower-level prefetching mechanisms. Such storage parallelism makes it possible to implement a prefetching pipeline, where the informed prefetching process is separated into a set of distinct prefetching steps among multiple storage levels. When data blocks have to be cached in all of the storage levels, significant I/O delays are incurred by prefetching data from the lowest level to the highest level of storage. We show that a pipelined prefetching approach can increase prefetching throughput, which is defined as the number of blocks prefetched from the lowest storage level to the highest storage level. We focus on read-intensive applications, because the read performance of applications is poor when the data to be accessed reside in the lower-level storage subsystems. Write performance in multi-level storage systems is not as critical as read performance, since applications can write data to upper-level storage devices (e.g., a write-behind buffer) before moving data from the upper level to the lower level of storage.

1.3 Contributions

The following list summarizes the major research contributions made in this dissertation study:

- To reduce I/O delays in multi-level storage systems, we propose new informed prefetching approaches that coordinate multiple prefetching mechanisms in the form of a pipeline.
The three informed prefetching algorithms developed in this study are called iPipe (see Chapter 4), IPO (see Chapter 5), and IPODS (see Chapter 6). We show that, with our prefetching pipeline in place, multiple prefetching operations can be processed in parallel by both upper-level and lower-level prefetching mechanisms.

- We apply a novel cost-benefit model to estimate the value of prefetching or caching a block at a specific storage level in a multi-level storage system. The cost-benefit model is used by prefetching mechanisms deployed in multiple storage levels to improve buffer usage at all storage levels. We describe how we employed the model in our prefetching algorithms in Chapters 4-6.

- We developed a simulated multiple-level storage system in which the three prefetching algorithms are implemented. Simulation results show that our prefetching mechanism, powered by a data-fetching pipeline, reduces applications' stall and execution times. Simulation results also indicate that our approaches can reduce the prefetching distance (a.k.a., the prefetching horizon).

- We implemented a prototype to validate our simulator for multi-level storage systems. Like the simulator, the prototype contains the implementation of all three proposed informed prefetching algorithms. We conducted an experiment using the prototype, showing that our prefetching solutions can reduce disk read latency in a distributed multi-level storage system, an environment in which data are stored in different nodes connected by a network. Prefetching time increases because of network and server latencies, and pipelining can only reduce the disk read latency, which is one part of the total prefetching time.

1.4 Dissertation Organization

The rest of the dissertation is organized as follows:

Chapter 2 describes the cutting-edge research in multi-level storage systems as well as informed and predictive prefetching techniques.

Chapter 3 explains the assumptions made in this dissertation, shows the proof of concept, and validates the system parameters used throughout this dissertation.

Chapter 4 outlines the system architecture, algorithm design, and performance evaluation for our iPipe solution, which assumes sufficient storage system bandwidth.

Chapter 5 outlines our IPO solution, which performs pipelining in a storage system with limited bandwidth. The chapter discusses the system architecture, algorithm design, and performance evaluation.

Chapter 6 outlines the IPODS solution, which implements IPO in distributed multi-level storage systems.

Chapter 7 discusses our solutions' prototyping results.

Finally, Chapter 8 provides a summary of this dissertation study with a list of directions for future research.

Chapter 2
Literature Review & Current Work

Previous researchers have suggested that application-disclosed hints can be used by prefetching mechanisms to dramatically improve I/O performance. To the best of our knowledge, however, ours is the first study to focus on informed prefetching in multi-level storage systems, the first to consider how to apply the cost-benefit model in a multi-level prefetching system, the first to construct a pipeline to coordinate multiple prefetching mechanisms in a multi-level storage system, and the first to offer a systematic performance evaluation of informed prefetching in multi-level hybrid storage systems. In this chapter, we discuss closely related work in both storage systems in general and prefetching in particular.
2.1 Storage systems current work

We begin this section by describing multiple-level storage systems (see Section 2.1.1). Then, we compare solid state disks with traditional hard disk drives (see Section 2.1.2). Next, we introduce new techniques in parallel storage systems (see Section 2.1.3). Finally, we discuss related work in distributed storage systems (see Section 2.1.4).

2.1.1 Multi-level storage systems

A multi-level storage system consists of a hierarchy of heterogeneous storage devices that differ in their hardware, speed, size, and other specifications [87]. Multi-level storage systems provide cost-effective solutions for large-scale data centers without significantly affecting I/O response times. The I/O performance of a multi-level storage system depends on the data placement of the system. Ideally, a high-level storage device should store two types of data: (1) popular data that are frequently accessed and (2) data that are likely to be accessed in the not-too-distant future.

Figure 2.1: Multi-level storage system that consists of different storage devices with various speed performance.

Typical storage devices in a modern multi-level storage system include main memory, solid state disks, hard disks, and magnetic tape subsystems (see, for example, [23]). Figure 2.1 shows a 4-level storage system, which is a straightforward extension of a traditional 2-level storage system with main memory and hard drives.

Multi-level storage systems and caches share similar advantages. A multi-level cache system contains a hierarchy of multiple cache levels [33]. Previous studies show that there is a maximum benefit to having a single-level cache [21] [22]. Making a tradeoff between cache latency and hit rate is a challenge, because large caches inevitably have high hit rates but long latency. To achieve a good tradeoff between hit rate and latency, researchers proposed multiple-level caches, where small, fast caches (i.e., upper-level caches) are backed up by large, slow caches (lower-level caches). Similar to multi-level caches, multi-level storage systems [18] [19] [20] first check their upper-level storage devices. If data items are not in the upper-level storage, the next storage level is checked. This process is repeated until the required information is retrieved.

In a multi-level storage system, data can be moved from one particular level to another by data migration processes. The decision to migrate data blocks from one level to another is based on the data's metadata (e.g., popularity and past access patterns). Data migration based on popularity or predictions is a good example of the important metadata used by data migration modules [38], [34], [36]. When a multi-level storage system migrates data to upper levels based on popularity or predicted access patterns, the migrated data can be accessed by clients quickly thanks to reduced access delays [37].

2.1.2 Solid State Disks and Hard Drives

Compared to traditional hard drives (HDDs), solid state disks (SSDs) show better data read performance. Tuma provides comparisons between a wide range of SSDs and HDDs; detailed comparisons can be found in [85]. Tuma's results show that SSDs have better random read performance than HDDs. For example, reading 4-KB blocks from a Seagate HDD achieves a throughput of 0.70 MB/s, whereas reading the same data blocks from an M-Systems Model FDD 3.5" Flash SSD can obtain a throughput of 4.3 MB/s.
Rizvi and Chung compare SLC and MLC flash SSDs with HDDs; the findings show that SSDs offer better I/O throughput than HDDs [86]. Therefore, solid state disks are becoming increasingly popular in the support of large-scale multimedia systems requiring high I/O bandwidth. Researchers can obtain the specifications of various HDD and SSD products from vendors (e.g., [94] [93] [92] [90]). The product data sheets clearly indicate that SSDs outperform HDDs in terms of reads. For example, the WD Caviar Green hard drive has approximately a maximum read bandwidth of 190 MB/s, assuming that the drive is connected with SATA 1.5 Gb/s (1.5 Gb/s is approximately equal to 190 MB/s). On the other hand, the Nova Series V64 solid-state drive can achieve a high read bandwidth of 215 MB/s, which is higher than that of the WD drive. Different SSD products vary in their I/O performance. For example, Intel's X25-E Extreme SATA SSD - with a read bandwidth of 250 MB/s - is better than the Nova Series V64 SSD. Many SSD and HDD products distinguish themselves from each other in terms of I/O performance, size, cost, and other aspects. In general, SSDs have good I/O performance and a high price, whereas HDDs offer high storage capacity at low cost. As such, J. No built a hybrid file system using both SSDs and HDDs, thereby taking advantage of the two different types of storage devices [56].

2.1.3 Parallel Storage Systems

Parallel storage systems are commonly deployed in supercomputers [1] [23] [63]. A parallel storage system consists of redundant storage devices offering high I/O performance and bandwidth [51] [60]. Disk arrays are important components in parallel storage systems.

Scalable parallel storage systems (see, for example, [23] [63] [49]) provide I/O parallelism through the striping technique, in which each large data block is striped across an entire disk array [64]. When a particular data block is requested, the request is directed to the entire disk array, where multiple disks controlled by a disk controller can coordinate and respond to the request in parallel (a short sketch of this block-to-disk mapping appears at the end of this subsection). Generally speaking, it is very expensive to build large-scale parallel storage systems; therefore, researchers have proposed various ways to reduce the cost of scalable parallel storage systems (see, for example, [57]). Cluster storage systems - a new high-performance computing storage resource - are one of a few effective ways to develop cost-effective parallel storage systems, because cluster storage systems use entirely off-the-shelf components from data storage vendors.

Parallel storage systems can be very reliable, thanks to the fault tolerance features offered by data redundancy [59] [58]. In the event that a disk or a storage device fails, users are not aware of such a failure. Thus, all requests issued to a fault-tolerant storage system can be processed, even in the presence of disk failures. Data stored on failed disks can be recovered or reconstructed from data stored on other disks in the system.

A recent study shows that informed prefetching mechanisms can rely on disk parallelism to quickly prefetch data in parallel [1]. Applications with parallel I/O access patterns benefit greatly from parallel storage systems that make parallel informed prefetching possible [50] [53] [54]. In this dissertation study, we focus on the issue of informed prefetching in the context of storage systems in general and multiple-level storage systems in particular.
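The following is a minimal sketch of the round-robin striping layout described above. It is an illustration only; the stripe-unit size and disk count (STRIPE_UNIT, NUM_DISKS) are assumed values rather than parameters of any system cited in this section.

# Hypothetical round-robin striping sketch; assumed parameters, not a real system's layout.
STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (assumption)
NUM_DISKS = 4             # disks in the array (assumption)

def locate(offset: int):
    """Map a logical byte offset to (disk index, offset within that disk)."""
    unit = offset // STRIPE_UNIT          # which stripe unit the byte falls in
    stripe = unit // NUM_DISKS            # which full stripe (row) across the array
    disk = unit % NUM_DISKS               # round-robin disk assignment
    disk_offset = stripe * STRIPE_UNIT + (offset % STRIPE_UNIT)
    return disk, disk_offset

if __name__ == "__main__":
    # A large read touches every disk, which is why the whole array can serve it in parallel.
    touched = {locate(o)[0] for o in range(0, 4 * STRIPE_UNIT, STRIPE_UNIT)}
    print(sorted(touched))   # -> [0, 1, 2, 3]

Because consecutive stripe units fall on different disks, a single large read is served by all disks at once; this is the property that makes parallel (and parallel informed) prefetching effective.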
2.1.4 Distributed Storage Systems

A distributed storage system consists of multiple storage servers or sites that communicate through a network. Data files stored in distributed storage systems may be striped or replicated among the storage servers [72] [83] [68] [82]. Clients access data in a distributed system through the computer networks connecting all the storage servers in the system. Modern distributed systems achieve high I/O performance, bandwidth, and scalability [67] [69] [66]. A previous study addressed mismatches among applications, storage devices, and interconnection-medium requirements in distributed storage systems [71]. Tierney et al. designed an architecture for a distributed storage system that uses a high-speed ATM network to improve system performance [84]. I/O workload characteristics in the context of distributed systems were investigated by researchers [70], because understanding the I/O workloads of applications in distributed systems is the first step toward improving the applications' I/O performance.

Numerous distributed storage systems exhibit high reliability through data replication mechanisms, where each data item in a storage server has one or two backup copies in other servers. Because multiple copies of the data exist, if a particular storage server fails, the data on it can still be accessed from the backup servers [61] [62] [80].

Distributed file systems are file systems that allow file accesses from multiple storage servers or nodes through a computer network. Good examples of these systems include the Google file system (GFS), the Andrew file system (AFS), the Hadoop distributed file system (HDFS), Ceph, Frangipani, Deceit, ITC, Coda, wide area AEF, and Sprite, to name just a few [73] [74] [76] [77] [78] [79] [39] [40] [42] [45]. Access transparency - an important feature of modern distributed file systems - allows users and applications to store and retrieve files over the network as if the files were on local disks [75].

Recently, the Hadoop distributed file system (HDFS) emerged as an efficient distributed file system supporting the Hadoop software framework for data-intensive distributed applications [43] [44] [45] [46] [47] [48]. Hadoop is an open-source implementation of Google's MapReduce programming model designed to deal with thousands of computing nodes and petabytes of data. HDFS in the Hadoop framework is a portable and scalable file system widely deployed in numerous large-scale cluster computing systems. Hadoop, with the support of HDFS, relies on location awareness to run Hadoop applications on the computing nodes in which the processed data are located. To improve reliability, HDFS maintains multiple copies of each file and keeps the different copies on different computing nodes. Like MapReduce applications, Hadoop programs comprise two types of operations: 1) a Map operation that outputs key-value pairs used for generating a set of intermediate key-value pairs and 2) a Reduce operation that merges all the intermediate values associated with the same intermediate key (a minimal sketch of this two-phase model appears at the end of this subsection).

Scalability is a critical issue in the development of large-scale distributed file systems. Scalability issues must be addressed along with consistency, availability, and synchronization. For example, Srinivas and Janakiram built a generic scalability model for distributed file systems. Their model characterizes scalability as a factor related to workload and faultload [41]. This model was used to compare several well-known distributed file systems in terms of availability, synchronization, and consistency.
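As noted above, the following is a minimal, self-contained sketch of the two MapReduce operations, using word count as the customary example. It illustrates the programming model only and does not use Hadoop's actual API; all function names here are invented for the illustration.

# Hypothetical sketch of the Map and Reduce operations described in Section 2.1.4.
from collections import defaultdict
from typing import Iterable, Tuple

def map_op(document: str) -> Iterable[Tuple[str, int]]:
    # Map: emit an intermediate key-value pair for each word occurrence.
    for word in document.split():
        yield word, 1

def reduce_op(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Reduce: merge all intermediate values that share the same key.
    return key, sum(values)

def run(documents):
    intermediate = defaultdict(list)
    for doc in documents:                       # map phase (parallel across nodes in Hadoop)
        for key, value in map_op(doc):
            intermediate[key].append(value)     # shuffle: group values by intermediate key
    return dict(reduce_op(k, vs) for k, vs in intermediate.items())

if __name__ == "__main__":
    print(run(["informed prefetching", "prefetching pipeline"]))
    # -> {'informed': 1, 'prefetching': 2, 'pipeline': 1}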
2.2 Prefetching

In this section, we focus on various issues related to prefetching. We start by classifying prefetching approaches into two groups. Then, we outline the state-of-the-art research in informed prefetching. Next, research progress on predictive prefetching is reported. Finally, we summarize a few studies on prefetching mechanisms in multiple-level storage systems and distributed systems.

2.2.1 Two Types of Prefetching

Due to the large performance gap between CPUs and disks, I/O becomes a significant performance bottleneck in large-scale computing systems [81]. To hide I/O latencies, prefetching can preload data from disks into the main memory prior to data accesses. Prefetching techniques can be classified into two categories: informed and predictive approaches. Informed prefetching relies on applications to provide future access hints [1]; predictive prefetching involves algorithms that predict future access patterns of applications based on their historical access patterns [2]. In the next two subsections, we review previous studies on informed prefetching and follow with an overview of existing predictive prefetching strategies.

2.2.2 Informed Prefetching

A study conducted by Patterson et al. [1] [12] [11] [16] [26] inspired us to concentrate on informed prefetching issues. Patterson et al. suggested that informed prefetching algorithms - invoking storage parallelism - can take advantage of application-disclosed I/O access hints to eliminate I/O stalls through aggressive prefetching [26][1][12]. The performance benefits of informed prefetching were confirmed by other studies undertaken by Huizinga et al. [10] and Chen et al. [30].

When it comes to disk arrays, parallel informed prefetching aims to leverage parallel I/O to improve prefetching performance. Parallel informed prefetching is made possible because data blocks are striped across an array of disks [25][51]. Parallel informed prefetching eliminates I/O stalls by placing hinted data in the cache before the data are requested by applications. To improve cache usage for both prefetching and caching, Patterson et al. proposed a cost-benefit model, which assists their prefetching mechanism in balancing the cache/buffer space shared between the LRU (least-recently-used) cache and the prefetching buffer [1]. The model weighs the benefit of using more buffers for prefetching against the cost of ejecting an LRU block or a prefetched block.

An informed prefetching mechanism makes prefetching decisions based on access hints and access patterns that are disclosed in advance. Researchers have investigated various ways of collecting information to offer accurate access hints for informed prefetching mechanisms. Without appropriate hints, the prefetching mechanisms are unable to make accurate decisions.
The informed prefetching algorithms presented in this dissertation are orthogonal to the above research in the sense that integrating our solutions with these hint-collecting techniques can significantly improve the overall performance of informed prefetching mechanisms for multiple-level storage systems.

The prefetch horizon is an important concept in informed prefetching. A prefetch horizon is an upper bound on prefetch depth; evidence shows that there is no performance benefit from prefetching further ahead than the prefetch horizon. Much attention has been paid to the impact of the prefetch horizon on prefetching performance (see, for example, [1][12]). In this dissertation study, we also evaluate the performance impact of the prefetch horizon on our informed prefetching mechanisms in parallel and distributed storage systems.

2.2.3 Predictive Prefetching

Predictive prefetching (a.k.a. automatic prefetching) relies on past I/O accesses to predict future accesses and to prefetch the predicted data [2] [31]. Griffioen and Appleton developed a model for encoding past access patterns and future access probabilities [2]. In Griffioen and Appleton's model, directed weighted graphs are employed to estimate access probabilities [2]; please refer to [2] for details on this predictive prefetching model.

The Markov predictor is another model used by existing predictive prefetching algorithms [5] [88] [4] [6] [55]. It uses prediction by partial match to find recurring sequences of I/O events. The Markov predictor has been widely applied in web prefetching, because web accesses on the Internet have a hypertextual nature [7] [8] [88].

Recently, we developed an automatic prefetching and caching system called APACS [31]. Three unique techniques integrated into APACS are (1) dynamic cache partitioning, (2) prefetch pipelining, and (3) prefetch buffer management. APACS dynamically partitions the buffer cache memory used for prefetched and cached blocks by automatically changing buffer/cache sizes in accordance with global I/O performance.

The aforementioned studies focused on predictive prefetching, whereas our solutions perform informed prefetching. Since informed-prefetching and predictive-prefetching algorithms supplement one another, our solutions can seamlessly work together with a variety of predictive prefetching algorithms.

2.2.4 Prefetching in Multi-Level Storage Systems

Prefetching in multi-level storage systems is technically challenging due to the following dilemma: aggressive prefetching is required to efficiently reduce I/O latency [3], whereas overaggressive prefetching may waste I/O bandwidth by transferring useless data from lower-level to upper-level storage devices. Previous studies showed that a single-level cache offers I/O performance improvements [21][22]. Empirical evidence indicates that a multi-level storage system can extend the benefits of a single-level cache by augmenting storage systems with multi-level caches [19][18][20]. Recently, Nijim proposed a multi-layer prefetching algorithm that can speculatively prefetch data from tapes to hard drives and preload data from hard drives to solid state disks [23] [24]. The experimental results show that one can leverage prefetching techniques to enhance the I/O performance of multiple-level storage systems.

A recent study conducted by Zhang et al. suggests that prefetching algorithms designed for single-level storage systems are inadequate for multi-level systems [9].
Rather than proposing a new prefetching algorithm for multi-level storage systems, Zhang et al. developed a hierarchy-aware optimization scheme called PFC [9]. Their PFC coordinator is applicable to any existing prefetching algorithm, because PFC's aim is to coordinate prefetching aggressiveness across multiple levels of caches. Kraiss and Weikum developed an integrated approach to migrating data in multiple-level storage systems [34]. In particular, Kraiss and Weikum focused on prefetching data between the tertiary, secondary, and primary storage, thereby masking the high latency of the tertiary storage [34]. In their study, replacement policies and the interaction of prefetching/caching policies were also considered. It is worth noting that Kraiss and Weikum's prefetching mechanism relies on Markov-chain predictions to make prefetching decisions [34].

Our solutions differ from the existing prefetching and caching techniques for multi-level storage systems in the sense that they manage multi-level caches based on application-disclosed access patterns.

2.2.5 Prefetching in Distributed and Parallel Storage Systems

Apart from multiple-level storage systems, parallel and distributed storage systems can benefit from prefetching schemes. For example, Patterson et al. proposed and implemented an informed prefetching model in a distributed storage system where data are allocated on remote nodes [1]. Unlike multi-level storage systems, distributed storage systems face the challenge of reducing the latency of accessing remote storage nodes through networks. To address this challenge, Rochberg and Gibson extended a network file system by integrating an informed prefetching mechanism [17]. Rochberg and Gibson's approach hides disk latency while exposing storage parallelism. Their experimental results show that informed prefetching over the network reduces applications' execution times by anywhere from 17% to 69% [17].

The performance of parallel storage systems can be improved through collective input/output (I/O). A collective application programming interface (API) is necessary to obtain good I/O performance. Collective APIs, however, require applications to disclose their entire access patterns to fully reorder I/O requests. To solve this problem, Gibson and Faloutsos investigated an approach to optimizing collective I/O access patterns through informed prefetching [65]. Gibson and Faloutsos's idea was to make use of informed prefetching to exploit any amount of available memory to overlap I/O with computation [65]. After comparing their approach to disk-directed I/O, Gibson and Faloutsos showed that under certain conditions, a per-processor prefetch depth equal to the number of drives is able to guarantee sequential disk accesses for collectively accessed files in a parallel storage system. Gibson and Faloutsos also conducted experiments indicating that a prefetch horizon of one to two times the number of disks per processor is good enough to match disk-directed I/O's performance for sequentially allocated files.

Distributed and parallel storage systems are widely adopted to support high-performance Web servers; therefore, in the context of the Web, it is critical to improve the I/O performance of distributed and parallel systems. Nanopoulos et al. investigated interesting predictive prefetching issues in the Web, where a predictive Web prefetching mechanism aims at deducing forthcoming page accesses of a client based on its past accesses [88].
The prefetching algorithms developed by Nanopoulos and his team are referred to as Markov predictors. Nanopoulos et al. studied factors affecting the performance of their Web prefetching algorithms, which are based on data mining schemes [88]. The experimental results show that the new data-mining-based prefetching algorithm can provide performance improvements over previously proposed Web prefetching algorithms.

Our informed prefetching strategies are quite different from the aforementioned approaches, because ours leverage a prefetching pipeline to efficiently migrate hinted data to upper-level storage devices, thereby significantly reducing data access times.

2.3 Summary

In this chapter, we discussed related work in both storage systems and prefetching techniques. First, we described recent studies on multiple-level storage systems, solid state disks, and parallel and distributed storage systems. Then, we summarized research progress on both informed and predictive prefetching schemes.

In particular, we conducted a literature review to investigate novel prefetching algorithms and mechanisms designed for multiple-level storage systems and distributed/parallel storage systems. We also described the major differences between our proposed informed prefetching approaches and existing prefetching solutions reported in the literature.

Chapter 3
Assumptions and Parameter Validation

In this chapter, we first discuss the assumptions used in this dissertation research. Sections 3.1-3.2 explain and justify research assumptions related to caching, initial data allocation, bandwidth limitations, and I/O writes. Then, we describe in Section 3.3 the system parameters validated by empirical studies. These validated system parameters will be used in the subsequent chapters.

3.1 System Assumptions

In this section, we describe our assumptions on data storage systems as well as software environments.

3.1.1 Demand Misses

In this study, we intend to efficiently manage I/O buffers to balance caching against prefetching. A caching algorithm manages all demand misses. For example, when a demand miss of an application occurs, the application stalls until the miss is satisfied, for a complete disk read latency period; during this time the application is not able to consume the prefetched data or to invoke a new prefetching request. When a demand miss takes place, the caching buffer size should be increased at the cost of a reduced prefetching buffer size. In doing so, storage systems can boost the I/O bandwidth available to quickly respond to demand-miss requests. In addition, we assume that only one demand-miss request can take place in a given time period.

Our pipelining mechanism is synchronized with the informed prefetching scheme. Thus, when a new informed prefetching request takes place, a new pipelining request is initiated. This synchronization means that the pipelining mechanism also stalls with demand misses.

3.1.2 Informed Caching

Caching prefetches is an effective technique for improving the performance of I/O-intensive applications [14] [15]. In addition to the prefetching process, informed caching removes the need to fetch data again, which in turn reduces the application's stall time. The most-recently-used (MRU) policy can be used in an informed caching mechanism, thereby improving the I/O performance of the system when there are many repeated access patterns.
In our dissertation study, we do not investigate informed caching for the following two reasons:

• In the worst case, informed caching does not yield any performance improvement. Our preliminary results show that several benchmarks are unable to benefit from informed caching.

• Informed caching is a technique for optimizing the informed prefetching process. Any performance gain obtained by employing informed caching does not necessarily represent the advantages of our solutions.

Rather than informed caching, the pipelining mechanism is at the heart of our system design. The pipelining aims to optimize informed prefetching by reducing the overhead of prefetching. With the pipelining mechanism in place, storage systems need less caching space. Unfortunately, it is difficult, if not impossible, to predict whether a particular hinted data block will be cached. In one of our future studies, we will integrate an informed caching algorithm with our pipelining algorithms in order to determine whether a hinted access needs to be pipelined. Please refer to Chapter 8 for details on this future research direction.

3.1.3 Bandwidth Limitations

We assume that the maximum I/O bandwidth offered by a parallel storage system equals the maximum number of concurrent read requests that may take place in the storage system without causing any congestion. Thus, the maximum I/O bandwidth depends on the scalability of the storage system. A similar assumption can be found in the literature (see, for example, [1]). In the first part of our research (see Chapter 4), we assume that parallel storage systems have enough I/O bandwidth to avoid congestion. This assumption, of course, is relaxed in the solution described in Chapter 5, which proposes a pipelined prefetching scheme for parallel storage systems with limited bandwidth.

3.1.4 I/O Bandwidth Sharing

I/O bandwidth can be shared among solid state disks (SSDs) and hard disk drives (HDDs). In other words, a multiple-level storage system can access its SSDs and HDDs in parallel. For example, consider a multi-level disk array consisting of five disks: while two concurrent read requests are being served, three data blocks can simultaneously be written from the HDDs to the SSDs. We conducted empirical experiments to validate the correctness of this assumption, showing that fetching data from SSDs to main memory and moving data from HDDs to SSDs can be carried out in parallel. This observation indicates that it is feasible for us to implement a pipelined prefetching mechanism.

We focus only on reads in this dissertation study; our goal is to improve the I/O performance of read-intensive applications. Write-intensive applications cannot benefit much from our solution, because writes introduce informed prefetching latency and the prefetching pipeline can be delayed by demand misses. Other studies show that writes are less critical than reads because applications handle writes asynchronously - applications do not wait until new data blocks are written. Writes are governed by the write-behind policy, and log-structured file systems can significantly improve I/O performance for write-intensive applications. Please refer to [27] [28] [29] for details on the write-behind policy and log-structured file systems. In the event that a storage system encounters a bandwidth shortage (because of pipelining, write-behind traffic, power availability, or other reasons), informed prefetching requests may not be honored in a timely manner.
This problem can be solved by shrinking the pipelining depth, meaning that the pipelined prefetching mechanism does not aggressively load data blocks from the lower levels of the storage hierarchy.

3.1.5 Lifetime of Solid State Disks (SSDs)

Because the lifetime of an SSD's storage cells is limited, the pipelining process may have a negative impact on the lifetime of the SSDs. The purpose of our research is to build a generic solution for informed and pipelined prefetching that optimizes prefetching performance in multi-level storage systems. Although our solutions inevitably affect SSD lifetime, other types of storage devices (e.g., HDDs and tapes) used in a multi-level storage system are not affected by our solution at all. Many techniques have been proposed to improve the reliability of SSDs; quantifying the impact of our pipelined prefetching on SSD reliability is outside the scope of our research. In the future, we plan to implement a scheme that distributes the pipelining writes among the SSD's cells. This future solution will provide load balancing and improve the SSD's lifetime.

3.2 Assumptions on Prefetching and Pipelining

In this section, we present the assumptions underlying our pipelined prefetching algorithms.

3.2.1 Initial Data Allocation

Consider a simple multi-level storage system with a two-level disk array composed of an SSD (top) and an HDD (bottom). In general, data can initially be distributed between the two levels. In our simulation studies, we assume that all the blocks to be prefetched are initially located in the HDDs (the bottom level). This assumption is reasonable for the following five reasons:

• Normally, HDDs have a larger storage capacity than SSDs; therefore, the probability of finding data in the HDDs is greater than that of finding it in the SSDs.

• Research on prefetching in multi-level storage systems (see previous studies conducted by Nijim and Zhang et al.) assumes that the bottom level contains less important data [23] [9]. Consequently, prefetching techniques move data likely to be accessed in the future to the upper-level storage devices, which increases the probability of finding hinted data blocks in the HDDs.

• The worst-case scenario is that all data are initially allocated in the lowest level (e.g., the HDDs in our study). If some data were initially allocated in the uppermost level (e.g., the SSDs) of the system, the number of pipelined prefetching requests would be reduced.

• Having data partially or totally allocated in SSDs does not properly show the performance improvement offered by our solutions. For this reason, we consider a conservative case - the worst-case scenario - in which all the data are initially allocated to the lowest level.

• If a hinted data block is already available in an SSD, the corresponding pipelined prefetching request is discarded.

3.2.2 A Pipeline for Data Transfers

We build a pipeline to transfer data among multiple storage levels in parallel. For example, we design two prefetching modules: the first one fetches data from HDDs and stores the data in SSDs; the second one prefetches data from SSDs to I/O buffers in main memory. These two prefetching modules can run in parallel to form a prefetching pipeline. In our design, the lowest-level storage devices always retain the original copies of prefetched data.
Our motivations behind this design are:

• Keeping original copies in lower-level storage devices saves storage space in higher-level storage devices, which are more expensive than their lower-level counterparts.

• If we instead migrated entire data blocks from HDDs to SSDs (see Chapter 8 for details on data migrations), we would need to keep moving the data back and forth among the storage levels, which consumes I/O bandwidth.

• Keeping the original copies at the lowest level saves I/O bandwidth, which in turn improves the overall I/O performance of multi-level storage systems.

3.2.3 Writes and Data Consistency

Writes may create an updated version of a data block that has been prefetched or is currently being prefetched. This leads to inconsistency between the new version stored in an upper-level storage device and the original copy in the lower-level storage. To counter this problem, we adopt the write-back policy: our pipelined prefetching mechanism keeps track of writes, and when data blocks are evicted from upper levels, updated data are forced to be written down to the lowest level. In doing so, we can guarantee that data copies in upper levels are consistent with the copies stored in lower levels. If the writing and updating process causes any I/O latency for informed prefetching, the prefetching pipeline has to wait until the updating process is done, because the prefetching process must be synchronized with the writing and updating process. Recall that our focus is read-intensive applications, in which writes are rare; applications with few writes do not experience any significant I/O delays introduced by our mechanism.

3.2.4 Pipelining Depth

Our pipelined prefetching algorithms reserve space (called the pipelining depth) in the upper-level storage devices for the concurrent prefetching processes at multiple levels. This space depends on (1) the number of concurrent prefetching requests that our schemes can handle and (2) the available I/O bandwidth. As we will discuss later, the pipelining depth is small, and the storage consumed by hinted blocks is insignificant compared with the entire storage system's capacity. In Chapters 4 and 5, we provide a mathematical model that calculates the pipelining depth, which indicates the buffer space reserved for pipelined prefetching.

3.2.5 Block Size

Using the SSDs and HDDs installed in the servers in our laboratory, we observed that an SSD does not provide better performance than an HDD for small data blocks; the SSDs are faster than the HDDs when the block size is at least 10 MB. In our distributed system implementation, our empirical results show that in order to benefit from high-speed SSDs, the data block size must be at least 200 MB. Many file systems pack data into large blocks to improve I/O performance. For example, the data block size in HDFS (Hadoop Distributed File System) is 64 MB [43] [44]. In reference [45], the HDFS block size is increased to improve system performance. The Hadoop archive (HAR) tool, which packs several small files into a large one [89], can be used to create large data blocks out of small ones.

3.2.6 LASR Traces

We developed trace-driven simulators to evaluate the performance of our prefetching mechanisms under I/O-intensive workloads. The traces used in our experiments represent applications that have small CPU processing times between two subsequent I/O read requests. All data blocks in the tested traces have identical block sizes.
More specifically, we use the machine01 trace (called LASR1) [97] and the machine06 trace (called LASR2) [98] in our simulation studies. We assume that all the requested data blocks have the same block size. For example, the block size is set to 10 MB in the single-node systems and 200 MB in the distributed system tested in our studies. We also assume that the time between two sequential I/O read requests (i.e., the arrival rate or processing time) is the smallest possible value, denoted later as (Tcpu + Thit + Tdriver). This setting allows us to evaluate I/O-intensive cases. The simulation results of the two traces share the same trend but differ in elapsed time because the two traces contain different numbers of I/O read requests. In our future work (see Chapter 8), we will collect traces from real-world benchmarks and use the new traces to further evaluate our solutions. The major obstacle is that we need to implement these benchmarks so that they represent a wide range of real-world I/O-intensive applications.

3.3 Validation

3.3.1 Overview

In this section, we provide the proof of concept that lays a solid foundation for the proposed solutions described in the subsequent chapters. Next, we validate the system parameters of our simulators using data collected from real-world storage systems. Finally, we discuss the I/O bandwidth limitation issue of parallel storage systems.

We need to find the block size for which SSD read performance is better than that of HDD. We also validate the SSD and HDD read latencies (i.e., Tss_cache and Thdd_cache) for a single data block, as well as the Thdd_ss latency, which is the time spent reading a data block from an HDD and writing the block to an SSD. Next, we validate the SSD and HDD read latencies of accessing a remote storage node through a network (i.e., Tss_network_cache and Thdd_network_cache, respectively). Finally, we validate the time needed for the application to consume a single prefetched data block (i.e., (Tcpu + Thit + Tdriver)).

The following list summarizes the validated system parameters:

• Data block size.
• Tcpu: CPU time to process a data block.
• Thit: time to read a single data block from an I/O buffer.
• Tdriver: time to allocate a single buffer in the buffer cache [31].
• Thdd_ss: time to fetch a single data block from HDD and to store the block in SSD.
• Thdd_cache: time to fetch a single data block from HDD and to store the block in the buffer cache.
• Tss_cache: time to fetch a single data block from SSD and to store the block in the buffer cache.
• Thdd_network_cache: time to fetch a single data block over a network from a remote node's HDD.
• Tss_network_cache: time to fetch a single data block over a network from a remote node's SSD.

3.3.2 System Setup

The following storage and network devices were tested in our laboratory:

• Memory: Samsung 3 GB RAM main memory.
• HDD: Western Digital 500 GB SATA, 16 MB cache, WD5000AAKS.
• SSD: Intel 2 Gb/s SATA SSD 80G sV 1A.
• Network switch: Dell PowerConnect 2824 [96].

3.3.3 Proof of Concept

As discussed in our research motivations, multi-level storage systems consist of several levels that vary in disk read performance (e.g., upper levels are faster than lower levels). Our proposed solutions aim to optimize informed prefetching in a parallel multi-level storage system by prefetching multiple blocks in a pipelined manner to reduce disk read latency. Pipelined prefetching aims to reduce applications' stalls and elapsed times.
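For later reference, the system parameters enumerated in Section 3.3.1 can be grouped into a single structure that a trace-driven simulator might use. This is a minimal sketch of our own (the struct and field names are not from the dissertation), pre-filled with the 10 MB local-system values validated later in this chapter; the 200 MB values can be substituted in the same way.

    // Validated simulation parameters for one block-size configuration.
    // The numeric values below are the 10 MB local-system estimates reported in
    // Sections 3.3.5 and 3.3.6 of this chapter.
    struct SimParams {
        double blockSizeMB;   // data block size
        double Tcpu;          // CPU time to process a data block (s)
        double Thit;          // time to read a block from an I/O buffer (s)
        double Tdriver;       // time to allocate a buffer in the buffer cache (s)
        double Thdd_ss;       // HDD -> SSD transfer time per block (s)
        double Thdd_cache;    // HDD -> buffer cache transfer time per block (s)
        double Tss_cache;     // SSD -> buffer cache transfer time per block (s)

        // Single step time to which the entire prefetching system is synchronized.
        double stepTime() const { return Tcpu + Thit + Tdriver; }
    };

    // 10 MB local-system configuration; Tcpu and Tdriver are treated as negligible
    // for the worst (most I/O-intensive) case, as argued in Section 3.3.5.
    static const SimParams kLocal10MB = {
        10.0, 0.0, 0.00192, 0.0, 0.122, 0.12, 0.052
    };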
Hiding disk read latencies is the main approach used in our prefetching designs. Multiple prefetching modules are deployed at different storage levels. For example, one module handles the prefetching process between HDDs and SSDs; another module takes care of the prefetching process between SSDs and main memory. These two prefetching modules can fetch hinted blocks in parallel to form a two-stage prefetching pipeline. The goal of the pipelining approach is to hide long read latencies by moving data from HDDs to SSDs in the background. With our pipeline in place, many hinted blocks reside in SSDs rather than in HDDs, and fetching hinted data from SSDs is, of course, faster than fetching the data from HDDs. When it comes to a distributed multi-level storage system, the long-read-latency problem is even worse, because read latencies include network and server latencies. Our pipelining mechanism can hide these long read latencies by fetching hinted blocks from remote storage nodes to local SSDs; the network and server latencies themselves are overheads that we cannot change (their optimization is beyond the scope of this research). Hiding disk read latencies improves the I/O throughput of both local multi-level storage systems and distributed multi-level storage systems.

Our preliminary results show that SSDs have better performance than HDDs when the block size exceeds a particular threshold. Figures 3.1 and 3.2 validate this statement, showing block sizes for which the read latency of the SSD is lower than that of the HDD. We performed this experiment using read requests on real-world storage devices in our laboratory; the results show that the performance difference between HDDs and SSDs begins at 10 MB per data block in a local storage system and at 200 MB per block in a distributed system.

Figure 3.1: Read latency of HDDs and SSDs. When the block size is 10 MB, the SSD has better read performance than the HDD.

Figure 3.2: Read latency of accessing a remote HDD and SSD through a LAN connection. In this distributed-system setting, the SSD has better read performance than the HDD when the block size is 200 MB.

3.3.4 Block Size

Our preliminary results based on our storage devices indicate that an SSD is guaranteed to exhibit better performance than an HDD when the data block size is 10 MB for a local system and 200 MB for a distributed system. Figures 3.1 and 3.2 validate this argument. We compare the highest read latency of the SSD with the lowest read latency of the HDD, observing that for small data blocks (e.g., 1 MB), the SSD is not superior to the HDD. When the block size reaches 5 MB, the SSD begins to outperform the HDD; we choose 10 MB as the next point, where the SSD shows noticeably better performance than the HDD. For the case of distributed systems, we observe that an SSD starts to exhibit better performance than an HDD when the block size is 200 MB. Because of these preliminary results, in the subsequent chapters we consider a block size of 10 MB for local systems (see our iPipe and IPO schemes) and 200 MB for the distributed system (see the IPODS scheme).

3.3.5 Modeling Tcpu + Thit + Tdriver

We denote by (Tcpu + Thit + Tdriver) the time that an application takes to consume a prefetched data block from the I/O buffers. This is the single step time to which the entire prefetching system is synchronized; it is also the time interval between two sequential I/O read requests. Our experiments show that this single step time has three components - Tcpu, Thit, and Tdriver - each of which grows with the data block size.
When measuring (Tcpu + Thit + Tdriver), the values of the three components vary, so it is difficult to assign a constant value to each of these three factors. The TIP model considers (Tcpu + Thit + Tdriver) to be a constant value; we take a similar approach by modeling this single-step time as a constant, and we consider the worst-case scenario. Note that when (Tcpu + Thit + Tdriver) is small, the application consumes the prefetching buffers faster.

Estimating Tcpu + Thit + Tdriver when the block size is 1 KB. In our first pipelined informed prefetching scheme, iPipe, we assume that the data block size equals 1 KB. In other prefetching research (e.g., TIP), the data block size is 8 KB; we decided to use a small data block to make a fair comparison with the existing solution. Let us now estimate Tcpu + Thit + Tdriver when the data size is 1 KB. In this small-data-block case, Tcpu equals 4 nanoseconds, Thit equals 0.15 microseconds, and Tdriver equals 0.06 microseconds. The Tcpu value was derived from a trivial application that measures the time needed to perform simple operations (i.e., a loop over each byte without doing any computation) on 1 KB of data; we assume that the CPU always performs this pass over the 1 KB of data. The Thit value is derived from the 1 KB main memory read latency specification (see [90] for details). Tdriver is derived from a trivial application that measures the time needed to allocate 2000 unique 8 KB data blocks (a similar application appears in the TIP research). In the worst case, Tcpu + Thit + Tdriver is therefore at least 0.2 microseconds; we use this worst-case value in our subsequent simulation studies.

Estimating Tcpu + Thit + Tdriver when the block size is 200 MB. In order to validate the value of Tcpu + Thit + Tdriver when the block size is 200 MB, we need to validate each factor in this expression individually. As mentioned earlier, 200 MB is a large block size, and Tcpu in particular varies with the application's data processing time. For the value of Tcpu + Thit + Tdriver, we first determine the best and worst cases; in our simulations, we may then randomly choose a value between the best-case and worst-case values.

Validate Thit: We used the RamSpeed cache and memory benchmarking tool [99] to validate the Thit value when the block size is set to 200 MB. Table 3.1 shows the I/O bandwidth measured when we run the RamSpeed benchmark on our testbed. The benchmark shows that the I/O bandwidth achieved for a 200 MB block size is at least 5304.49 MB/s. We take the lowest possible approximation of the Thit value to make the applications more I/O-intensive than CPU-intensive. We also observe that the I/O bandwidths for block sizes larger than 4096 KB share a similar trend. Using this measurement, we estimate that the Thit value for a 200 MB data block is approximately 0.037 seconds.

Table 3.1: I/O bandwidth measured using the RamSpeed benchmark.
    Block size (KB)      INTEGER & READING bandwidth (MB/s)
    1                    10586.42
    2                    10624.48
    4                    10626.43
    8                    10627.82
    16                   10618.71
    32                   10386.05
    64                    9010.96
    128                   9016.90
    256                   9020.86
    512                   9021.48
    1024                  9012.64
    2048                  8110.29
    4096                  5870.99
    8192                  5193.41
    16384                 5221.85
    32768                 5222.84
    65536                 5359.69
    131072                5272.53
    262144                5354.08
    524288                5342.72
    1048576               5304.49

Validate Tcpu: In order to calculate Tcpu, we implemented a trivial application that measures the time spent performing simple operations (i.e., a loop over each byte without any computation) on a 200 MB data block residing in main memory. We measured that the application takes at least 0.067 seconds to complete the loop. This Tcpu value, of course, can increase when massive numerical computations are involved, and it is zero if no computations take place at all. The measured 0.067 seconds is already noticeable when compared with the measured I/O latencies; a Tcpu of this size would make the application more CPU-bound than I/O-bound. Because informed prefetching aims to improve the I/O performance of I/O-intensive applications, we use small Tcpu values to keep our benchmarks I/O-intensive, and we also consider the worst case, in which Tcpu is set to zero.

Validate Tdriver: In the 1 KB case, we used dynamic memory allocation to measure the time needed to allocate 1 KB in memory. We tried using the same method for a 200 MB block size, but we were unable to find a reasonable value for Tdriver - all the measured results were too large to represent any real-world case. Consequently, we chose to use static memory allocation for 200 MB blocks; we implemented a trivial application that statically allocates 200 MB blocks in main memory. Our measurements show that Tdriver in the static-allocation case is negligible.

Putting it all together - validate Tcpu + Thit + Tdriver: As mentioned previously, the worst-case value of Tcpu + Thit + Tdriver is very small. The purpose of this validation process is to find the lowest reasonable approximation of this value, which makes our benchmarks more I/O-bound and enables us to investigate the performance improvement offered by our prefetching solutions. Applying the (Tcpu + Thit + Tdriver) value in the simulation studies described in the subsequent chapters provides reasonable and practical estimates of CPU and I/O processing overheads; in reality, of course, this value is application dependent and may be larger. Figure 3.3 shows the range of the (Tcpu + Thit + Tdriver) value. In our experiments, we use the minimum value of 0.037 seconds, indicating that Thit is the most critical factor and must be recorded each time. As in the TIP research, where the value of Tcpu + Thit + Tdriver is treated as constant, we assume that this value is a constant in our simulation studies.
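As a quick cross-check of the Thit estimates used in this section and the next, Thit can be derived directly from the measured RamSpeed bandwidth in Table 3.1. The short C++ sketch below (the helper name is ours) reproduces the 200 MB figure and the 10 MB figure, assuming the 10 MB estimate is based on the 16384 KB bandwidth row, which is the closest measurement in the table.

    #include <cstdio>

    // Thit is approximated as (block size) / (measured memory read bandwidth).
    static double thitSeconds(double blockMB, double bandwidthMBps) {
        return blockMB / bandwidthMBps;
    }

    int main() {
        // 200 MB blocks: the lowest large-block bandwidth in Table 3.1 is 5304.49 MB/s.
        std::printf("Thit(200 MB) ~ %.4f s\n", thitSeconds(200.0, 5304.49));  // ~0.0377 s, i.e., the 0.037 s used in the text
        // 10 MB blocks: 16384 KB row of Table 3.1 (5221.85 MB/s).
        std::printf("Thit(10 MB)  ~ %.5f s\n", thitSeconds(10.0, 5221.85));   // ~0.00192 s
        return 0;
    }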
Estimating Tcpu + Thit + Tdriver when the block size is 10 MB. When the block size is large, the worst-case value of Tcpu + Thit + Tdriver is essentially the minimum value of Thit, which dominates performance for I/O-intensive applications. Table 3.1 allows us to derive Thit when the block size is 10 MB: Thit for the 10 MB case approximately equals 0.00192 seconds, so the value of Tcpu + Thit + Tdriver for a 10 MB block size is around 0.00192 seconds.

Figure 3.3: (Tcpu + Thit + Tdriver) values range from 0.037 to 0.104 seconds; we consider the smallest value.

3.3.6 System Parameter Validation

Our goal is to validate the important system parameters used in the simulation studies presented in the subsequent chapters. These critical system parameters include the I/O access latencies Thdd_ss, Thdd_cache, and Tss_cache for block sizes of 10 MB and 200 MB, respectively. The other parameters validated in this part of the study are Thdd_network_cache and Tss_network_cache for the 200 MB block size. Figures 3.4, 3.5, and 3.6 show that Thdd_ss approximately equals 0.122 seconds, Thdd_cache is about 0.12 seconds, and Tss_cache is in the neighbourhood of 0.052 seconds when the block size is 10 MB. Figures 3.7, 3.8, and 3.9 show that Thdd_ss is around 4.5 seconds, Thdd_cache is estimated to be 2.3 seconds, and Tss_cache is measured to be 1.5 seconds when the data block size is set to 200 MB. Please note that in Chapters 4 and 5, we will evaluate two of our prefetching solutions - iPipe and IPO - using the 200 MB block-size cases.

Figure 3.4: 10 MB: estimated Thdd_ss is 0.122 seconds.
Figure 3.5: 10 MB: estimated Thdd_cache is 0.12 seconds.
Figure 3.6: 10 MB: estimated Tss_cache is 0.052 seconds.
Figure 3.7: 200 MB: estimated Thdd_ss is 4.5 seconds.
Figure 3.8: 200 MB: estimated Thdd_cache is 2.3 seconds.
Figure 3.9: 200 MB: estimated Tss_cache is 1.5 seconds.

To validate Thdd_network_cache and Tss_network_cache, we calculated average read latencies using a networked storage system in our laboratory at Auburn. In our testbed, we use SCP (secure copy) to read data from a remote node's HDD and SSD. The read latencies of accessing a 200 MB block from a remote HDD (Thdd_network_cache) and from a remote SSD (Tss_network_cache) are estimated as 4.43 and 4.158 seconds, respectively. For distributed storage systems (see Chapter 6), we will use these validated system parameters in addition to the parameter Thdd_ss, which equals approximately 4.5 seconds. For our iPipe solution (see Chapter 4), we also test 1 KB block-size cases and provide preliminary experimental results; for those cases, we derive the I/O bandwidths of the HDDs and SSDs from the specification data released by the vendors.

3.3.7 Limited Parallel I/O Bandwidth

One may assume that parallel storage systems have enough I/O parallelism and bandwidth (see, for example, [1]). This assumption implies that there is no I/O congestion during data prefetching when a parallel storage system scales up. When the parallel storage system is not scalable due to limited parallel I/O bandwidth, the above assumption may become invalid; in other words, limited I/O bandwidth makes the parallel storage system liable to I/O congestion (e.g., I/O queuing). This problem is more pronounced when the number of concurrent read requests exceeds a threshold. In designing the iPipe scheme, we rely on the above assumption, which is relaxed in the development of our IPO and IPODS prefetching mechanisms.
Figure 3.10 shows that concurrent read requests to the same disk are noticeably affected by the limited parallel I/O bandwidth.

Figure 3.10: Limited parallel I/O bandwidth. Concurrent read requests are noticeably affected by the limited parallel I/O bandwidth.

3.4 Summary

We started this chapter by justifying a number of important research assumptions related to caching, initial data allocation, bandwidth limitations, and I/O writes. Before presenting our pipelined prefetching solutions in the subsequent chapters, we validated the critical system parameters used in our simulation studies. The validation process described in this chapter ensures that the parameters used in our simulations represent real-world storage systems, because these parameters were estimated using the storage systems in our laboratory at Auburn.

Chapter 4
iPipe: A Pipelined and Informed Prefetching Scheme for Multi-Level Storage Systems

4.1 Overview

The informed prefetching algorithm TIP (see [1] for details) reduces applications' stalls and elapsed times by making use of applications' hints about future I/O accesses. In this chapter, we extend the informed prefetching technique by creating a pipeline in which we split the informed prefetching process into a set of independent prefetching steps among the multiple storage levels, and we show how multiple prefetching mechanisms can coordinate to prefetch hinted blocks in parallel. We call this pipeline-powered informed prefetching technique iPipe.

Because iPipe is the first step toward building a prefetching pipeline, we assume that parallel storage systems have scalable I/O bandwidth and parallelism. This assumption implies that there is no I/O congestion, and it enables a prefetching mechanism to issue a large number of concurrent read requests without introducing long I/O queues.

In the first section of this chapter, we explain the motivations and objectives of iPipe. Then, we describe the design and implementation issues of the iPipe prefetching mechanism. Finally, the performance of iPipe is quantitatively evaluated.

4.2 Motivations and Objectives

The iPipe prefetching technique is motivated by the following three observations:

• There is a growing need to apply multi-level storage systems in data centers. Multi-level storage systems consist of multiple levels of storage devices, where the uppermost level - the most expensive device in the storage system - has the best I/O performance.

• The informed prefetching technique can utilize I/O access hints offered by applications to boost I/O performance. However, informed prefetching has not yet been incorporated into multi-level storage systems.

• Multiple levels of storage devices can be accessed in parallel. Such I/O parallelism among multiple storage levels allows hinted blocks to be fetched simultaneously from the multiple storage levels to reduce prefetching times.

The overall objectives of the iPipe scheme are summarized as follows:

• To minimize the time spent prefetching data from storage systems by moving hinted blocks to the upper levels of multi-level storage systems.

• To reduce application stalls and execution times.

• To decrease the prefetching horizon (i.e., the prefetching distance that leads to zero stalls).

• To balance prefetching and caching and to distribute I/O buffers between the prefetching and caching mechanisms.
4.3 Design Issues in iPipe

A previous study [1] shows that long disk latencies and low I/O bandwidth increase applications' stalls, while aggressively prefetching data blocks may pollute cache buffers. To address this problem, in this section we describe the design of a pipelined prefetching mechanism - iPipe - for hinted blocks in multi-level storage systems. Compared with existing prefetching schemes, iPipe introduces a set of salient features: it supports application-disclosed I/O access hints, coordinates multiple informed prefetching mechanisms, incorporates the cost-benefit model in multi-level storage systems, and supports a prefetching pipeline across multiple storage devices. Before presenting the iPipe implementation details, we first give a high-level overview of iPipe's hardware and software architectures.

4.3.1 Architecture of iPipe

Hardware Architecture of iPipe

The system consists of an application (user) that can provide hints about its future accesses. The hierarchy from top to bottom consists of a prefetching buffer cache and an array of two levels of storage devices - solid state drives (SSDs) and hard disk drives (HDDs) - as shown in Figure 4.1. As we descend the hierarchy, the disk read latency increases.

Figure 4.1: iPipe hardware architecture, consisting of an array of multi-level disks.

Software Architecture of iPipe

Figure 4.2 depicts a high-level view of the iPipe software architecture. The software architecture contains modules that (1) implement the informed prefetching algorithm (see, for example, [1]), (2) control the buffer cache, and (3) perform pipelined prefetching. In a multi-level storage system, an informed prefetching request is first issued to a solid state disk (SSD); if the hinted block is not residing in the SSD, the block is fetched from a hard disk drive (HDD). The software architecture also contains a module that manages the SSDs and HDDs. The iPipe mechanism takes applications' future access hints as input and performs pipelined prefetching across the multiple levels of storage.

Figure 4.2: High-level design of the iPipe software architecture for a multi-level storage system. The multi-level storage system consists of two levels - SSDs and HDDs.

Figure 4.3: Detailed design of the iPipe software architecture for a three-level storage system.

Figure 4.3 shows that an application provides hints to both TIP and iPipe. The system performance parameters are passed to iPipe to calculate the pipelining starting block and depth. iPipe keeps fetching hinted data blocks to the highest level; since the storage system's bandwidth is high enough, iPipe is able to fetch most of the hinted blocks. TIP uses a cost-benefit model to determine the number of prefetch buffers, and hinted data blocks are fetched from the storage system to the buffer cache.

4.3.2 Assumptions

In Chapter 3, we discussed our research assumptions in detail. In this subsection, we highlight the most important assumptions related to iPipe.
We assume that prefetching I/O requests can be processed in parallel. Further, the storage devices have sufficient I/O bandwidth and enough I/O parallelism, meaning that there is no I/O congestion. Similar assumptions can be found in the TIP study [1]. We also assume that storage systems can scale up well to accommodate a large number of I/O requests being fetched in parallel [23]. Therefore, if a list of hinted data blocks is requested at the same time, the blocks can be fetched without incurring excessive I/O congestion.

As a conservative assumption, hinted data blocks are initially allocated to HDDs, thanks to the large capacity of the HDDs. It is noteworthy that the I/O performance of multi-level storage systems can be improved further if hinted blocks are initially placed in SSDs rather than HDDs. In a multiple-level storage system, a small portion of SSD space is reserved for retaining copies of the prefetched data. Instead of migrating data from HDDs to SSDs, iPipe keeps the original copies at the HDD level while fetching duplicate copies into the SSDs. In addition, iPipe adjusts the SSD space reserved for prefetching.

4.4 The iPipe Algorithm

In this section, we first present a mathematical model for iPipe. Then, we describe the concept of the prefetching horizon. Next, the iPipe algorithm is proposed. Finally, we demonstrate the usage of iPipe through a concrete example.

4.4.1 Stalls and Disk Read Latencies

When an application starts, the informed prefetching module (i.e., TIP [1]) in iPipe assigns a number of buffers for prefetching hinted blocks based on the cost-benefit model (see [1] for details). Let Xcache be the number of prefetching buffers in the buffer cache; TIP initiates Xcache prefetching requests. Assuming the storage system provides sufficient I/O bandwidth and parallelism, multiple hinted data blocks can be fetched in parallel without incurring I/O congestion.

Let Tdisk denote the disk read latency (i.e., the time spent fetching a block from a disk to the buffer cache). The application initially stalls for Tdisk until the hinted blocks arrive at the buffer cache. After the hinted blocks are ready in the buffer, the application begins consuming them, and every time a hinted block is consumed, a new prefetching request is issued. If too few buffers are used for prefetching to cover the Tdisk period, the application stalls every Xcache accesses for the time given by Equation 4.1:

    Tstall(Xcache) = (Tdisk - Xcache (Tcpu + Thit + Tdriver)) / Xcache    (4.1)

where Tstall(Xcache) is the average stall time of prefetching [1], Tcpu is the computational time, Thit is the time to read a data block from the buffer cache, and Tdriver is the time to allocate a buffer in the buffer cache. We denote Tcpu + Thit + Tdriver as a single time unit spent consuming one buffer in the buffer cache; this single time unit is also the time between two subsequent I/O read requests.

Patterson et al. give an example [1] in which a system has three buffers for prefetching, the disk read latency is five time units, and the time the application needs to consume a single buffer is one time unit. In this example, a stall of two time units arises every three accesses.
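A small helper makes the arithmetic of Equation 4.1 and of this example explicit; the function name is ours.

    // Average stall per access (Equation 4.1): any part of Tdisk not overlapped by
    // consuming the Xcache prefetched buffers is paid as stall, spread over Xcache accesses.
    double stallPerAccess(double Tdisk, int Xcache, double stepTime /* Tcpu + Thit + Tdriver */) {
        double stall = Tdisk - Xcache * stepTime;
        return (stall > 0.0) ? stall / Xcache : 0.0;
    }

    // Patterson et al.'s example: Tdisk = 5, Xcache = 3, stepTime = 1
    // => stallPerAccess = (5 - 3) / 3 = 2/3 of a time unit per access,
    //    i.e., 2 time units of stall every 3 accesses.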
In a multi-level storage system, we consider two types of Tdisk: Thdd_cache and Tss_cache. Thdd_cache is the time spent fetching a block from HDD; Tss_cache is the time spent loading a block from SSD. We define Thdd_ss as the time spent copying a block from an HDD to an SSD. Our empirical results show that Tss_cache is smaller than Thdd_cache, indicating that loading data from SSD is much faster than loading it from HDD.

4.4.2 Prefetching Horizon

The prefetching horizon is the distance at which informed prefetching leads an application to experience zero stall time [1]. Equation 4.2 shows how the prefetching horizon is calculated:

    P(Tcpu) = Tdisk / (Tcpu + Thit + Tdriver)    (4.2)

where P(Tcpu) is the prefetching horizon [1]. In the previous example, P(Tcpu) equals five time units, because Tdisk equals five and Tcpu + Thit + Tdriver equals one. Consequently, if the prefetching mechanism assigns five buffers for prefetching, the stall time drops to zero.

4.4.3 The Pstart and Pdepth Algorithms

Two important modules in the iPipe architecture are (1) a module that determines the starting block to be prefetched and (2) a module that computes the prefetching depth. In what follows, we describe the algorithms implemented in these two modules.

When the first informed prefetching request is initiated, iPipe begins fetching hinted blocks from HDD to SSD. iPipe relies on the following algorithm, called Pstart, to decide which hinted block is the first to be pipelined.

The Pstart Algorithm:

    xcachecounter = 0
    accesstime = -(Tcpu + Thit + Tdriver)
    for blockcounter = 1 to Thdd_ss / (Tcpu + Thit + Tdriver) do
        xcachecounter ++
        accesstime += (Tcpu + Thit + Tdriver)
        if accesstime >= Thdd_ss then
            Pstart = blockcounter
            return Pstart
        end if
        if xcachecounter = Xcache then
            if Tstall_hdd(Xcache) > 0 then
                accesstime += Tstall_hdd(Xcache)
            end if
            if accesstime >= Thdd_ss then
                Pstart = blockcounter + 1
                return Pstart
            end if
            xcachecounter = 0
        end if
    end for

Algorithm 1. Returns the position of the future hinted access from which iPipe starts pipelining. It depends on the Thdd_ss value and on the stalls Tstall_hdd(Xcache) that take place between the HDD and the buffer cache at the beginning of the informed prefetching process. The algorithm calculates the position of the future hinted access that will be accessed far enough in the future for it to be read from the HDD to the SSD by then (see Equation 4.3).

The Pstart algorithm is important because loading hinted blocks too early can pollute I/O buffers, whereas fetching hinted blocks too late increases application stall times. Given a future access pattern, Pstart decides which hinted block in the list should be the first one fetched in advance from HDD to SSD; Equation 4.3 guides this choice. While the iPipe mechanism is utilizing the TIP technique [1] to prefetch hinted blocks from SSDs to main memory, iPipe manages to fetch hinted blocks from HDDs to SSDs. A C++ sketch of the Pstart computation is given below.
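Under the stated assumptions, Algorithm 1 can be rendered directly in C++ as follows. This is an illustrative sketch (the function and variable names are ours, and the fallback after the loop is our own assumption; Algorithm 1 leaves that case implicit). With the example values used later in Section 4.4.4 (Thdd_ss = 8, Xcache = 3, unit time = 1, Tstall_hdd = 2), it returns a starting position of 7.

    // Sketch of Algorithm 1 (Pstart): position of the first hinted access whose
    // HDD -> SSD transfer, started now, completes before the access is issued.
    int pStart(double Thdd_ss, int Xcache, double stepTime, double Tstall_hdd) {
        int xcacheCounter = 0;
        double accessTime = -stepTime;                    // as in Algorithm 1
        int limit = static_cast<int>(Thdd_ss / stepTime);
        for (int blockCounter = 1; blockCounter <= limit; ++blockCounter) {
            ++xcacheCounter;
            accessTime += stepTime;                       // one buffer consumed
            if (accessTime >= Thdd_ss)
                return blockCounter;
            if (xcacheCounter == Xcache) {                // all buffers consumed: stall
                if (Tstall_hdd > 0.0)
                    accessTime += Tstall_hdd;
                if (accessTime >= Thdd_ss)
                    return blockCounter + 1;
                xcacheCounter = 0;
            }
        }
        return limit + 1;   // no stalls in the window: first coverable access is just past the loop bound (our assumption)
    }

    // Example from Section 4.4.4: pStart(8.0, 3, 1.0, 2.0) == 7.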
Prefetching hinted blocks with two parallel prefetchers in a pipelined manner increases the chances of finding hinted blocks in SSDs rather than in HDDs. The application begins consuming the arriving data, and new prefetching requests may take place before iPipe brings hinted blocks from HDDs to SSDs. These new informed prefetching requests may also cause stalls until the requests are honored, especially when Xcache is smaller than the prefetching horizon. If Tstall_hdd(Xcache) is the average stall time of fetching blocks from HDDs, iPipe starts loading hinted blocks from HDDs within or after time Thdd_ss minus the accumulated stall time:

    iPipe_start = Thdd_ss / (Tcpu + Thit + Tdriver) - (floor(Thdd_ss / ((Tcpu + Thit + Tdriver) Xcache)) Tstall_hdd(Xcache)) / (Tcpu + Thit + Tdriver)    (4.3)

where iPipe_start is the position from which iPipe starts pipelining.

Algorithm 2: Prefetching Pipelining Depth

    totaltime = Thdd_ss + Tconsume_ss
    xcachecounter = 0
    accesstime = 0
    for blockcounter = 1 to totaltime / (Tcpu + Thit + Tdriver) do
        xcachecounter ++
        accesstime += (Tcpu + Thit + Tdriver)
        if accesstime >= totaltime then
            Pdepth = Pstart + blockcounter - 1
            return Pdepth
        end if
        if xcachecounter = Xcache then
            if Tstall_ss(Xcache) > 0 then
                accesstime += Tstall_ss(Xcache)
            end if
            if accesstime >= totaltime then
                Pdepth = Pstart + blockcounter - 1
                return Pdepth
            end if
            xcachecounter = 0
        end if
    end for

Algorithm 2. Returns the depth of iPipe's pipeline needed to keep the pipelining process working smoothly. Starting from Pstart, iPipe initially pipelines a number of hinted future accesses equal to the prefetching horizon of the HDD-to-SSD read latency plus additional blocks to cover the time to consume a single SSD block (see Equation 4.5). Tconsume_ss is the time needed to consume a block in the SSD, and Tstall_ss(Xcache) represents the application's stalls when data is prefetched from the SSD to the buffer cache; these stalls reduce the pipelining depth.

The Pdepth algorithm chooses the pipelined prefetching depth, which defines the range of hinted blocks to be fetched from HDDs to SSDs; its goal is to prefetch as many hinted blocks as possible without polluting the buffer space in the SSDs. After running the Pstart and Pdepth algorithms, the iPipe mechanism has a full list of the most appropriate blocks to prefetch from HDDs to SSDs. The number of hinted blocks to be fetched largely depends on two factors: (1) the prefetching horizon of the HDD-to-SSD read latency, and (2) the time spent consuming a single SSD block in the buffer cache (see Equation 4.5). We assume that Tconsume_ss equals the latency of an SSD read into the buffer cache (Tss_cache). Hinted blocks residing in the SSDs may still cause stalls until the blocks become available in the main-memory buffer cache; this occurs when Xcache is smaller than the prefetching horizon for reading data from SSDs into main-memory buffers. A corresponding C++ sketch of the Pdepth computation follows.
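As with Pstart, Algorithm 2 translates directly into C++. This sketch (names are ours) returns the last position to pipeline; with the example values of Section 4.4.4 (Pstart = 7, Thdd_ss = 8, Tconsume_ss = 4, Xcache = 3, unit time = 1, Tstall_ss = 1), it returns 15, i.e., blocks 7 through 15 are pipelined.

    // Sketch of Algorithm 2 (Pdepth): last hinted position to pipeline so that the
    // HDD -> SSD stage stays ahead of consumption by Thdd_ss + Tconsume_ss.
    int pDepth(int Pstart, double Thdd_ss, double Tconsume_ss,
               int Xcache, double stepTime, double Tstall_ss) {
        const double totalTime = Thdd_ss + Tconsume_ss;
        int xcacheCounter = 0;
        double accessTime = 0.0;
        int limit = static_cast<int>(totalTime / stepTime);
        for (int blockCounter = 1; blockCounter <= limit; ++blockCounter) {
            ++xcacheCounter;
            accessTime += stepTime;                       // one buffer consumed
            if (accessTime >= totalTime)
                return Pstart + blockCounter - 1;
            if (xcacheCounter == Xcache) {                // stall while refilling buffers
                if (Tstall_ss > 0.0)
                    accessTime += Tstall_ss;
                if (accessTime >= totalTime)
                    return Pstart + blockCounter - 1;
                xcacheCounter = 0;
            }
        }
        return Pstart + limit - 1;   // no early exit: cover the full window (our assumption)
    }

    // Example from Section 4.4.4: pDepth(7, 8.0, 4.0, 3, 1.0, 1.0) == 15.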
Tstall_ss(Xcache) is the average stall time of prefetching from the SSD, and these stalls reduce the pipelined prefetching depth. Equation 4.4 gives the maximum number of concurrent prefetching requests that may take place when using iPipe, and Equation 4.5 gives the maximum pipelined prefetching depth of iPipe; this maximum occurs when TIP assigns a number of buffers Xcache equal to the prefetching horizon of reading data from HDD to the buffer cache, so that no stalls take place during the pipelined prefetching process. Our experimental results show that the pipelined prefetching depth is a small value, indicating that iPipe does not need to reserve large buffers in the SSDs for pipelined prefetching requests.

    iPipe_maxreads = (Thdd_cache + Thdd_ss + Tconsume_ss) / (Tcpu + Thit + Tdriver)    (4.4)

where iPipe_maxreads is iPipe's maximum number of concurrent read requests, Thdd_cache is the latency of reading a single data block from HDD to the cache, Thdd_ss is the latency of reading a single data block from HDD to SSD, and Tconsume_ss is the time that TIP takes to consume a single data block from the SSD.

    iPipe_depth = (Thdd_ss + Tconsume_ss) / (Tcpu + Thit + Tdriver)    (4.5)

where iPipe_depth is iPipe's pipelining depth, Thdd_ss is the latency of reading a single data block from HDD to SSD, and Tconsume_ss is the time that TIP takes to consume a single data block from the SSD.

4.4.4 Stalls, Elapsed Time, Prefetching Horizon, and Prefetching Benefit

In the example shown earlier, we assume that Thdd_ss, Tconsume_ss, and Thdd_cache equal eight, four, and five time units, respectively. iPipe initially issues pipelined prefetching requests for the hinted data blocks at positions seven through fifteen (inclusive). Given sufficient I/O bandwidth and parallelism, the informed prefetching requests and the pipelined prefetching requests complete (i.e., arrive at the buffer caches in main memory and in the SSDs, respectively) within Thdd_cache and Thdd_ss, respectively. The application stalls at the beginning for Thdd_cache, which is five time units.

The application then begins to consume the prefetched blocks in the buffer cache residing in main memory. Every time a buffer is consumed, an informed prefetching request is issued for the next hinted block. When a hinted block has been preloaded from HDD to SSD, iPipe can quickly fetch the hinted block from the SSD into main memory without making a trip to the HDD. After a hinted block is consumed from the SSD in Tconsume_ss time, iPipe triggers the next pipelined prefetching request.

The disk access time Tdisk, from the perspective of informed prefetching, becomes Tss_cache, which is less than Thdd_cache: hinted blocks are fetched from HDDs to SSDs before being accessed by the prefetching module that fetches blocks from SSDs into main memory. Prefetching hinted blocks from HDDs and keeping them in SSDs thus reduces the number of application stalls (see Equation 4.1). Figure 4.4 builds on the previous example and shows the average stall time when a fixed number of buffers is used with pipelined prefetching.
Figure 4.4 builds on the previous example and shows the average stall time when a fixed number of buffers is used with pipelined prefetching. In contrast, Figure 4.5 reveals the average stall when using a fixed number of buffers in the buffer cache for informed prefetching alone (the regular case in which no pipelining is used). When iPipe is employed, the stall time and elapsed time are 9 and 39 time units, respectively. For the same storage system without iPipe, the stall time and elapsed time are 16 and 46 time units, respectively. In this example, iPipe reduces the application's elapsed time by 15.2%.

The TIP study [1] offers a cost-benefit model for calculating the benefit (e.g., the reduction in I/O service time) of assigning one additional buffer for prefetching. The benefit is that the stall time is reduced from Tstall(Xcache) to Tstall(Xcache+1). The benefit calculation makes use of the prefetching horizon, which changes with the Tdisk value. In the case that Xcache is smaller than the prefetching horizon P(Tcpu), the benefit of adding an extra buffer is calculated by Equation 4.6. It takes Thdd_cache time to fetch the first hinted block from an HDD to a buffer cache in the main memory. When subsequent hinted blocks are fetched from HDDs to SSDs in a pipelining manner, the prefetching time drops from Thdd_cache to Tss_cache. Pipelined prefetching reduces the prefetching horizon and the stalls. As a result, iPipe increases the benefit of using an additional buffer for informed prefetching.

Figure 4.4: Average stall when using iPipe and a fixed number of buffers for pipelined prefetching. Thdd_cache = 5, Thdd_ss = 8, Tss_cache = 4, Xcache = 3. The first stall is 5 time units. Before the first hinted block is fetched from HDD into SSD, the application stalls for Tstall_hdd(Xcache) = Thdd_cache - 3(Tcpu + Thit + Tdriver) = 2 time units every 3 accesses. When hinted blocks are retrieved in the SSD, the application stalls for Tstall_ss(Xcache) = Tss_cache - 3(Tcpu + Thit + Tdriver) = 1 time unit every 3 accesses.

Figure 4.5: Average stall when using a fixed number of buffers for informed prefetching without pipelining. Thdd_cache = 5, Xcache = 3. The first stall is 5 time units, because all data blocks are read from the HDD. The application stalls for Tstall_hdd(Xcache) = Thdd_cache - 3(Tcpu + Thit + Tdriver) = 2 time units every 3 accesses.
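The caption arithmetic follows a simple per-cycle pattern; the following is our hedged generalization (the text itself only states the Xcache = 3 instances):

$$T_{stall\_hdd}(X_{cache}) = \max\bigl(0,\; T_{hdd\_cache} - X_{cache}\,(T_{cpu}+T_{hit}+T_{driver})\bigr)$$
$$T_{stall\_ss}(X_{cache}) = \max\bigl(0,\; T_{ss\_cache} - X_{cache}\,(T_{cpu}+T_{hit}+T_{driver})\bigr)$$

With the example values, this gives 5 - 3(1) = 2 and 4 - 3(1) = 1 time units of stall every Xcache = 3 accesses.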
4.4.5 The iPipe Algorithm

The pseudo code of the iPipe algorithm is presented below. First, the iPipe algorithm invokes the Pstart and Pdepth algorithms (see the earlier sections of this chapter) to determine the most appropriate hinted blocks to be fetched from HDDs to SSDs. Second, iPipe initiates prefetching requests to HDDs. When a hinted block is consumed from SSD (i.e., the hinted block is fetched from SSD to main memory), iPipe starts fetching subsequent blocks from HDDs to SSDs. Due to limited I/O bandwidth, hinted blocks may not be fetched to the main memory before the blocks are consumed; in this case, iPipe shrinks its pipelined prefetching depth by 1 each time (i.e., reducing the number of hinted blocks fetched from HDDs to SSDs). Before prefetching any hinted block directly from HDD, the iPipe algorithm checks whether the hinted block has been fetched to SSD or is being fetched to SSD. If the block is residing in SSD, the pipelined prefetching request will be discarded.

The iPipe Algorithm
Pstart = call prefetching start block
Pdepth = call prefetching depth
while informed prefetching do
    if bandwidth shortage then
        shrink the pipelining depth by 1
    end if
    if pipelined data block is altered then
        discard the pipelined data block
    end if
    if this is the first prefetching then
        for block = Pstart to block = Pdepth do
            pipeline block to SSD
        end for
    else if SSD buffer is consumed then
        if the following data block is not already in the SSD then
            pipeline the following data block to SSD
        end if
    end if
end while

Algorithm 3: the iPipe algorithm. It calls the prefetching start and depth algorithms to calculate the Pstart and Pdepth values. Then, it initiates a pipelining request from the HDD to the SSD for the data blocks from Pstart to Pdepth. Every time a pipelined block is consumed from the SSD by the informed prefetcher, iPipe pipelines the next one to the SSD. A bandwidth shortage means that TIP is not able to receive its prefetching requests on time (from any level); this may be caused by the pipelining process, writes, power problems, or many other reasons. In that case, iPipe shrinks its pipelining depth by 1 each time until the shortage is relieved. The algorithm checks whether a data block that has been pipelined, or is in the pipelining process, is changed (written) before it is requested by TIP; in this case, the pipelined copy is discarded to maintain consistency. The algorithm also checks whether the data is already in the SSD; in this case, its pipelining request is discarded.
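A compact C++ sketch of this control loop. The Storage model below is a hypothetical stand-in for the simulator's HDD/SSD layer, and the stubbed policies (bandwidth shortage, dirty-block detection) would be driven by the trace in a real run.

#include <cstdio>
#include <set>
#include <vector>

struct Storage {
    bool bandwidthShortage() const { return false; }         // stub: no congestion
    bool blockDirty(long /*block*/) const { return false; }   // stub: read-only data
    void copyHddToSsd(long block) { std::printf("pipeline block %ld to SSD\n", block); }
};

void runIPipe(Storage& st, const std::vector<long>& hints, long pStart, long pDepth) {
    std::set<long> inSsd;   // hinted blocks pipelined (or in flight) to the SSD
    // Initial burst: pipeline hinted accesses pStart .. pDepth to the SSD.
    for (long i = pStart; i <= pDepth && i < (long)hints.size(); ++i) {
        st.copyHddToSsd(hints[i]);
        inSsd.insert(hints[i]);
    }
    long next = pDepth + 1;
    for (long i = 0; i < (long)hints.size(); ++i) {            // TIP consumes hints in order
        if (st.bandwidthShortage() && pDepth > 0) --pDepth;    // shrink the depth by 1
        long block = hints[i];
        if (st.blockDirty(block)) inSsd.erase(block);          // discard stale pipelined copy
        bool servedFromSsd = inSsd.erase(block) > 0;
        // Each time a pipelined block is consumed from the SSD, pipeline the next
        // hinted block unless it is already cached there.
        if (servedFromSsd && next < (long)hints.size()) {
            if (!inSsd.count(hints[next])) {
                st.copyHddToSsd(hints[next]);
                inSsd.insert(hints[next]);
            }
            ++next;
        }
    }
}

int main() {
    Storage st;
    std::vector<long> hints = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    runIPipe(st, hints, /*pStart=*/3, /*pDepth=*/6);
    return 0;
}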
4.5 Performance Evaluation

We implement the iPipe algorithm in a trace-driven simulator written in C++. The metrics evaluated in the simulated two-level (i.e., SSD and HDD) storage system (see Figure 4.1) include elapsed time, prefetching stall time, prefetching horizon, as well as the benefit of adding an additional buffer for informed prefetching. These metrics are evaluated by varying the number of prefetching buffers in the buffer cache. We compare an iPipe-enabled prefetching mechanism against the same system without deploying iPipe. Data blocks are initially placed in HDDs. The iPipe mechanism coordinates two prefetching modules: the first one fetches hinted blocks from HDDs to SSDs; the second one uses TIP [1] to further fetch hinted blocks from SSDs to main memory.

4.5.1 System Setup

In our simulation studies, we use two LASR traces: machine01 (LASR1) and machine06 (LASR2) [97][98], which consist of 11686 and 51206 I/O read system calls, respectively. Without loss of generality, we assume that each I/O read system call requests an entire data block. As a conservative assumption, the average I/O arrival interval is Tcpu + Thit + Tdriver, representing the worst case; if the interval is larger than Tcpu + Thit + Tdriver, iPipe can achieve even better I/O improvement.

We choose to focus on a two-level storage system containing a set of SSDs and HDDs. Our preliminary results prove that reading a 1 KB block from SSD is faster than from HDD. We configure the simulated two-level storage system using the devices' specifications provided by the disk vendors. To fetch blocks efficiently, we use a log file system to store hinted blocks sequentially; in this way, a list of hinted blocks can be retrieved from the log file system without incurring any disk rotational delay. It is practical to apply the log file system to support the iPipe mechanism, because the target I/O-intensive applications of iPipe are read-intensive applications where most data blocks are read only. It is natural and straightforward to place all the read-only blocks sequentially using the log file system. All the system parameters used in our simulator are validated by the testbed in our laboratory at Auburn; please refer to Chapter 3 for details on the validation process.

4.5.2 Preliminary Results

The simulated two-level storage system is connected to a host computer with a 2 GHz Core 2 Duo processor. The simulated host computer uses Kingston 800 MHz RAM [90] as buffer caches. The data bus is 64-bit; the buffer caches offer I/O throughput at a rate of 6400 MB/s. The two-level storage system is comprised of Intel's X25-E Extreme SATA solid-state drive [91], which has a read throughput of 250 MB/s, and a WD Caviar Green HDD offering a read throughput of 190 MB/s [94]. The validation of the above parameters can be found in Chapter 3.

The system parameters Tcpu, Thit, and Tdriver are set to 4 nanoseconds, 0.15 ms, and 0.06 microseconds, respectively. The Tcpu value is derived from a trivial application that measures the time needed to perform simple computations on 1 KB of data. Thit is derived from the memory read latency specification for 1 KB of data [90]. Tdriver, also derived from a trivial application, is the time needed to allocate 2000 unique 8 KB data blocks in the main memory. Combining these values, it takes Tcpu + Thit + Tdriver time (i.e., roughly 0.2 ms) for the application to consume a buffer in the main memory.

Data is initially striped in the HDD layer. The read latencies of 1 KB data blocks from SSDs and HDDs are 4 and 5.2 microseconds, respectively. The write latency of 1 KB blocks to SSDs is 5.8 microseconds. When reading blocks from HDDs to SSDs, reading from HDDs and writing to SSDs are interleaved; thus, the total latency Thdd_ss of fetching a 1 KB block from HDD and caching the block in SSD equals 5.8 microseconds. There is no write latency to the buffer cache in main memory (see, for example, [1]).
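An illustrative encoding of this configuration (field names are ours; the values are the ones quoted above for 1 KB blocks, expressed in seconds):

// Illustrative encoding of the simulated system parameters from Section 4.5.2.
struct SimConfig {
    double t_cpu      = 4e-9;     // 4 ns per-block computation
    double t_hit      = 0.15e-3;  // 0.15 ms buffer-cache hit
    double t_driver   = 0.06e-6;  // 0.06 us driver overhead
    double t_ssd_read = 4e-6;     // read 1 KB from SSD
    double t_hdd_read = 5.2e-6;   // read 1 KB from HDD
    double t_hdd_ss   = 5.8e-6;   // HDD read and SSD write interleaved
};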
Table 4.1: The number of informed prefetching requests issued to HDDs when the LASR1 and LASR2 traces are evaluated.

Xcache                 1      3      5      7      9      11     13
LASR1 with iPipe       2      6      9      11     13     15     17
LASR1 without iPipe    11676  11676  11676  11676  11676  11676  11676
LASR2 with iPipe       2      6      9      11     13     15     17
LASR2 without iPipe    51206  51206  51206  51206  51206  51206  51206

Xcache                 15     17     19     21     23     26
LASR1 with iPipe       19     21     23     25     27     30
LASR1 without iPipe    11676  11676  11676  11676  11676  11676
LASR2 with iPipe       19     21     23     25     27     30
LASR2 without iPipe    51206  51206  51206  51206  51206  51206

Because hinted blocks are fetched from both SSDs and HDDs concurrently, Equation 4.4 calculates the maximum number of prefetching requests issued to HDDs. Equation 4.4 indicates that the largest possible number of prefetching requests for both SSDs and HDDs at a given time is 75 data blocks. To guarantee serving this maximum number of requests from the HDD to the buffer cache and the SSD, we assume there is sufficient I/O bandwidth and parallelism [1]. Equation 4.2 determines that the prefetching horizon of reading data from HDDs to the buffer cache is 26 if all data blocks are prefetched from HDD without employing iPipe.

Using the above system information, we evaluate the performance of iPipe by varying the number of buffers for informed prefetching from 1 to 26. Our simulation results show the impact of the number of buffers on the total elapsed time (i.e., the application's execution time) and read latency. Figures 4.6 and 4.7 show the total elapsed time of the two traces. The experimental results reveal that iPipe reduces elapsed time by up to 23% compared with the non-iPipe case. Figure 4.8 shows the average informed prefetching read latency of the two evaluated I/O traces. iPipe significantly improves the I/O performance of the two-level storage system, because only a few hinted blocks are fetched directly from HDDs whereas most hinted blocks are prefetched and cached in SSDs.

Table 4.2: Service time is reduced when one extra buffer is added for prefetching.

Xcache                 1        3        5        7        9       11       13
LASR1 with iPipe       21041    3895.92  1559.44  835.51   519.56  354.83   256.4
LASR1 without iPipe    28044.8  5063.6   2027.63  1086.04  675.58  461.57   333.19
LASR2 with iPipe       102292.3 17064.3  6826.5   3656.7   2276.5  1551.14  1124.65
LASR2 without iPipe    133165   22208.3  8871.8   4760.3   2959    2017.9   1462.04

Xcache                 15       17       19       21       23       26
LASR1 with iPipe       195.004  152.605  122.999  0.2      0.2      0
LASR1 without iPipe    253.79   198.21   160.201  131.202  110.403  0
LASR2 with iPipe       852.92   669.52   539.015  0.2      0.2      0
LASR2 without iPipe    1108.84  871      700.98   576.36   482.355  0

Figure 4.6: Total elapsed time when the number of prefetching buffers is set from 1 to 11. iPipe reduces the elapsed time.

Figure 4.7: Total elapsed time when the number of prefetching buffers is set from 13 to 26. iPipe reduces the elapsed time.

Figure 4.8: Total informed prefetching read latency. iPipe reduces the read latency.

Table 4.3: Prefetching horizon P(Tcpu). P(Tcpu) equals a distance of 26 data blocks when iPipe is not used, and drops to a distance of 20 data blocks when iPipe is used.

           P(Tcpu)
iPipe      20
non-iPipe  26

Table 4.1 shows the number of data blocks prefetched from HDDs at the beginning of the simulation. When iPipe is deployed, we observe that very few hinted blocks are directly fetched from HDDs because most hinted blocks are prefetched and cached in SSDs for fast access. The prefetching horizon in our simulations is initially 26 data blocks, but drops to 20 data blocks. In the non-iPipe case, all informed prefetching requests must bring data directly from HDDs; the prefetching horizon always equals 26 data blocks.

When using iPipe, the first few informed prefetching requests are read directly from HDDs due to a cold start (i.e., hinted blocks have not yet been preloaded in SSDs). The rest of the hinted blocks are retrieved from SSDs thanks to the pipelined prefetching mechanism that preloads the blocks from HDDs to SSDs. Since Tss_cache is smaller than Thdd_cache, the prefetching horizon P(Tcpu) drops. Our simulation studies (see Table 4.3) show that when reading from the HDDs and the SSDs, P(Tcpu) equals 26 and 20 data blocks, respectively. Table 4.1 shows the number of hinted blocks that are prefetched directly from the HDDs.

iPipe reduces the informed prefetching disk read latency, which in turn leads to fewer application stalls during the prefetching processes. iPipe also reduces the benefit of adding an additional buffer for prefetching. Figures 4.9 and 4.10 show that the total stall time measured in microseconds is approximately the same for both traces. In all cases, the stalls decrease as Xcache increases. However, iPipe shows fewer stalls for a given number of Xcache buffers. If the value of Xcache is larger than 20 data blocks, the stall time becomes zero because the number of informed prefetching requests exceeds P(Tcpu) (see Table 4.4). As the disk read latency, prefetching horizon, and stalls decrease, the reduction in service time (benefit) from using additional buffers for prefetching also decreases.
Table 4.2 illustrates the decrease in service time for each Xcache value, that is, the benefit of adding one additional buffer for prefetching. In all cases, the prefetching benefit decreases as Xcache increases. iPipe shows less need for more buffers; moreover, when Xcache is larger than 20, the reduction in service time becomes zero. In other words, there is no benefit from using more buffers than P(Tcpu). This conclusion is consistent with that reported in the TIP study [1].

Figure 4.9: Total stall time when the number of prefetching buffers is set from 1 to 9. iPipe reduces the stall time.

Figure 4.10: Total stall time when the number of prefetching buffers is set from 11 to 19. iPipe reduces the stall time.

Table 4.4: Total stall time when the number of prefetching buffers is set from 21 to 26. iPipe reduces the stall time.

Xcache                 21    23       26
LASR1 with iPipe       1     0.6      0
LASR1 without iPipe    556   304.802  0
LASR2 with iPipe       1     0.6      0
LASR2 without iPipe    2438  1335.57  0

Figure 4.11: Total elapsed time when the number of prefetching buffers is set from 1 to 5. iPipe reduces the elapsed time with both the Nova V64 and Intel X25-E SSDs. The elapsed time is lower when the Intel X25-E SSD is tested because it is faster than the Nova V64 SSD.

Figure 4.12: Total elapsed time when the number of prefetching buffers is set from 7 to 11. iPipe reduces the elapsed time with both the Nova V64 and Intel X25-E SSDs. The elapsed time is lower when the Intel X25-E SSD is tested because it is faster than the Nova V64 SSD.

Figure 4.13: Total elapsed time when the number of prefetching buffers is set from 13 to 26. iPipe reduces the elapsed time with both the Nova V64 and Intel X25-E SSDs. The elapsed time is lower when the Intel X25-E SSD is tested because it is faster than the Nova V64 SSD.

The previous results indicate that iPipe's performance depends on both the read and write performance of SSDs. Read performance reduces the disk read latency for informed prefetching, while write performance affects the efficiency of the pipelined prefetching process. Because of this, performance varies when different SSDs with different read performance are used. We simulate iPipe using various SSDs and compare the performance in terms of stalls and elapsed times. Figures 4.11, 4.12, and 4.13 compare both LASR traces when using the Nova Series V64 solid-state drive [92] and the Intel X25-E Extreme SATA solid-state drive. The Nova V64 SSD has a read throughput of 215 MB/s and a write throughput of 130 MB/s. The latency of reading a 1 KB data block from it is 4.6 microseconds; the latency of writing a 1 KB block to the SSD equals 7.6 microseconds. When reading from the HDD to the SSD, we assume that reading from the HDD and writing to the SSD are interleaved, making the total latency of reading a 1 KB block from the HDD and having it available in the SSD (Thdd_ss) equal to 7.6 microseconds.

4.5.3 Validated Performance Evaluation

We configure our simulator using the validated system parameters reported in Chapter 3. For the LASR1 and LASR2 traces, we assume that each exclusive access system call record represents a single 10 MB data block stored in the simulated storage system. In addition to 10 MB blocks, 200 MB data blocks are also tested; in doing so, we can evaluate the impact of block size on iPipe performance. We assume that the sequence of the I/O read system calls represents parallel I/O prefetching requests for the data. We also assume that each buffer can accommodate one data block.
When the block size equals 10 MB and 200 MB, (Tcpu + Thit + Tdriver) is at least 0.00192 and 0.037 seconds, respectively (see also Chapter 3). Data is initially striped in the HDD layer of the two-level storage system. Reading a 10 MB data block from the SSD and the HDD takes 0.052 and 0.12 seconds, respectively. The total time spent reading a 10 MB block from the HDD and placing it in the SSD (Thdd_ss) equals 0.122 seconds. For a 200 MB block, it takes 1.5 and 2.3 seconds to read the block from the SSD and the HDD, respectively. The total time spent reading a 200 MB block from the HDD and having it available in the SSD (Thdd_ss) is 4.5 seconds. The time spent accessing the buffer cache in the main memory is negligible (see [1] for a similar assumption).

Since the two prefetching modules in iPipe work concurrently at two storage levels, the maximum number of read I/O requests issued to the HDD is calculated by Equation 4.4. According to this equation, the largest possible number of concurrent read requests from both prefetching modules in iPipe is 154 in the case of 10 MB data blocks and 225 in the case of 200 MB blocks. This is because 200 MB blocks increase I/O latencies. Our results show that the system parameters are the main factor affecting iPipe's behaviour and requirements. To guarantee sustaining this maximum number of read requests, we assume enough I/O bandwidth and parallelism (i.e., no I/O congestion); a similar assumption is justified in [1]. According to Equation 4.2, the prefetching horizon of reading data from HDD to the buffer cache equals 63 for both 10 MB and 200 MB data blocks; therefore, we vary the number of buffers from 1 to 63 for both the 10 MB and the 200 MB cases.

The simulation results show that the total elapsed time (a.k.a. application execution time) when using fixed numbers (Xcache) of prefetching buffers is reduced by up to 56% in the 10 MB case and 34% in the 200 MB case. Our results show that the system with iPipe exhibits better performance compared to the non-iPipe system, especially when the block size is 10 MB. Figures 4.14, 4.15, and 4.16 show the total elapsed time when the LASR traces are tested with the 10 MB block size; Figures 4.17, 4.18, and 4.19 show the same test for the 200 MB block size. As the number of Xcache buffers increases, the performance gap between the iPipe and non-iPipe cases diminishes because the prefetching module (i.e., TIP) in the upper level becomes more efficient and able to reduce the application stalls. iPipe shows the most benefit when the upper-level prefetching module is using a small Xcache (e.g., 1), which is considered the most critical situation for informed prefetching. iPipe allows the upper-level prefetcher to find only a few of its reads in HDDs and the rest in SSDs.

Figure 4.14: Block size = 10 MB. Total elapsed time when the number of buffers is set from 1 to 9. iPipe reduces the elapsed time.

Figure 4.15: Block size = 10 MB. Total elapsed time when the number of buffers is set from 11 to 25. iPipe reduces the elapsed time.

Figure 4.16: Block size = 10 MB. Total elapsed time when the number of buffers is set from 35 to 63. iPipe reduces the elapsed time.

Figure 4.17: Block size = 200 MB. Total elapsed time when the number of buffers is set from 1 to 9. iPipe reduces the elapsed time.

Figure 4.18: Block size = 200 MB. Total elapsed time when the number of buffers is set from 11 to 25. iPipe reduces the elapsed time.

Figure 4.19: Block size = 200 MB. Total elapsed time when the number of buffers is set from 35 to 63. iPipe reduces the elapsed time.
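As a cross-check of the reconstructed Equations 4.4 and 4.5, the request counts quoted above and the maximum depths reported in Tables 4.6 and 4.8 below follow from the validated latencies when results are rounded up to whole blocks (the rounding is our assumption):

$$\left\lceil \frac{0.12 + 0.122 + 0.052}{0.00192} \right\rceil = 154, \qquad \left\lceil \frac{2.3 + 4.5 + 1.5}{0.037} \right\rceil = 225$$

$$\left\lceil \frac{0.122 + 0.052}{0.00192} \right\rceil = 91, \qquad \left\lceil \frac{4.5 + 1.5}{0.037} \right\rceil = 163$$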
Table 4.5: Data block size = 10 MB. The position of the first data block to be prefetched (Pstart) for different Xcache values.

Xcache   1   3   5   7   9   11  13  15
Pstart   3   7   11  15  19  23  27  31

Xcache   17  19  25  35  45  55  63
Pstart   35  39  51  64  65  65  65

Table 4.6: Data block size = 10 MB. The depth of the pipelined prefetching when using different Xcache values. A small depth is needed when few Xcache buffers are used. The maximum depth is 91. iPipe needs to assign the maximum depth for pipelined prefetching starting from Xcache = 35, because the read stalls from the SSD at that point equal 0.

Xcache   1   3   5   7   9   11  13  15
Depth    4   12  20  28  36  43  49  55

Xcache   17  19  25  35  45  55  63
Depth    61  66  85  91  91  91  91

It is clear that iPipe offers a larger performance improvement in the 10 MB case than in the 200 MB case. Small blocks benefit more from iPipe, because the SSD's I/O performance is notably better than the HDD's when the block size is small. The time needed to read a single data block from the HDD to the SSD becomes very large in the 200 MB case.

iPipe can utilize multiple prefetching modules at different storage levels to load hinted blocks without needing a large pipelining depth, especially when few prefetching buffers Xcache are assigned (e.g., 1). If Xcache is closer to the prefetching horizon, iPipe needs a larger depth in the SSD for its pipelined prefetching process. If that depth is not available, iPipe should only be used when the Xcache value is small (the case where TIP shows very poor performance). Tables 4.5 and 4.6 show the first data block to be pipelined and the pipeline depth when using different Xcache values for a block size of 10 MB. Tables 4.7 and 4.8 show the same performance trend for 200 MB data blocks. iPipe assumes enough bandwidth, implying that storage systems are sufficiently scaled and have redundant storage resources. Because of this assumption, the pipelined prefetching depth does not need to be large. The maximum pipelined prefetching space is 910 MB in the 10 MB block-size case and 32600 MB (32.6 GB) in the 200 MB block-size case.

Table 4.7: Data block size = 200 MB. The position of the first data block to be prefetched (Pstart) for different Xcache values.

Xcache   1   3   5   7   9   11  13  15
Pstart   4   10  16  22  28  34  40  46

Xcache   17  19  25  35  45  55  63
Pstart   52  58  76  106 123 123 123

Table 4.8: Data block size = 200 MB. The depth of the pipelined prefetching when using different Xcache values. A small depth is needed when few Xcache buffers are used. The maximum depth is 163. iPipe needs to assign the maximum depth for pipelined prefetching starting from Xcache = 45, because the read stalls from the SSD at that point equal 0.

Xcache   1   3   5   7   9   11  13  15
Depth    4   12  20  28  36  44  52  60

Xcache   17  19  25  35  45  55  63
Depth    68  76  100 140 163 163 163

4.6 Summary

Informed prefetching reduces application I/O stalls and elapsed times. Disk read I/O latencies affect the performance of iPipe; for example, when the I/O latencies increase, applications encounter more stalls. The uppermost level of a multi-level storage system shows better performance than the lower levels. iPipe runs multiple informed prefetching modules in parallel in a pipeline manner; the prefetchers in lower-level storage devices help to reduce the I/O latencies seen by the prefetchers in upper levels. Consequently, iPipe reduces stall times, application elapsed times, and the prefetching horizon. iPipe also improves the benefit of using more buffers for prefetching. One weakness of iPipe as proposed in this chapter is the assumption that storage systems have sufficient I/O bandwidth and scalability.
This assumption will be relaxed in the prefetching solutions described in the next two chapters.

Chapter 5
IPO: Informed Prefetching Optimization in Multi-level Storage Systems

5.1 Overview

In the previous chapter, we discussed the importance of informed prefetching for reducing an application's stalls and elapsed time. We also pointed out the motivation for a pipelined prefetching mechanism aiming to reduce the overhead of prefetching. The study presented in the previous chapter assumes that multi-level storage systems have sufficient I/O bandwidth; however, our empirical experiments indicate that parallel storage systems may experience I/O congestion (see Figure 3.10 in Chapter 3). Evidence shows that there is a maximum number of read requests that can be concurrently processed in a parallel storage system. In the event that an upper-level prefetching mechanism does not fully utilize the available I/O bandwidth, the unused I/O bandwidth can be allocated to lower-level prefetching mechanisms to bring hinted blocks from lower-level to upper-level storage in a pipeline manner. This observation suggests that the pipelining process largely depends on the I/O bandwidth available to lower-level prefetchers.

The remaining parts of this chapter are organized as follows. First, we highlight the motivation and objective of a pipelined informed prefetching scheme called IPO. Second, we describe the design issues of the IPO mechanism. Third, we present the core algorithm in IPO. Last, the performance of IPO is systematically evaluated.

5.2 Motivations and Objectives

The motivations behind the design of IPO are very similar to those of iPipe (see Chapter 4.2 for details on the motivations of the iPipe study). In addition to the motivations listed in Chapter 4.2, the limited I/O bandwidth of parallel storage systems motivates us to propose the IPO solution. The I/O bandwidth of a storage system, of course, is limited. This reality means that when the I/O load is high, the storage system can exhibit I/O congestion (e.g., long I/O queuing times). In particular, I/O congestion takes place if the storage resources are limited and not scalable. In a small-scale parallel storage system, only a few I/O requests can be processed in parallel. This limitation motivates us to extend the iPipe design into IPO by considering the impact of limited I/O bandwidth on our pipelined prefetching mechanisms. The objective of this part of the study is to address the issue of limited I/O bandwidth and investigate the negative impacts of low I/O bandwidth on pipelined informed prefetching.

5.3 Design Issues in IPO

Our preliminary results (see Chapter 3) show that the bandwidth of a storage system is limited by the available resources. Our overall goal in this part of the study is to optimize informed prefetching in a multi-level storage system with limited I/O bandwidth. The iPipe study (see the previous chapter) offers initial evidence that a prefetching pipeline is a good approach to utilizing multiple prefetchers to preload hinted blocks in parallel across multiple storage levels. The prefetching pipeline substantially reduces the time spent loading hinted blocks from the uppermost storage level. Before the design phase of IPO, we recognized that the number of pipelined prefetching requests that may be handled concurrently is limited, depending on the available I/O bandwidth and the size of the prefetching buffers managed by the upper-level prefetcher (e.g., TIP is the upper-level prefetcher in our studies).
Recall that the multi-level storage systems considered in this dissertation study consist of two levels: the uppermost one uses solid state drives (SSDs); the lowest one uses hard disk drives (HDDs). The IPO mechanism obtains hints of future I/O accesses from applications and aggressively prefetches hinted blocks from HDDs up to SSDs in a pipeline manner.

The most challenging design issue is how to efficiently utilize the limited available I/O bandwidth for the prefetching module that preloads hinted blocks from HDDs to SSDs. A second design challenge is how to allow the two prefetching modules to share the I/O bandwidth of the SSDs, which are accessed by both the upper-level and lower-level prefetchers.

In our design, we address the above challenges as follows. Recall that the lower-level and upper-level prefetchers handle informed prefetching requests in parallel. The lower-level prefetcher triggers an informed prefetching request when the following conditions are met. First, the upper-level SSDs have sufficient available I/O bandwidth. The lower-level prefetcher aims to improve the utilization of SSDs by preloading data from HDDs to SSDs when the I/O load of the SSDs is moderate or low; this design policy is important because a very aggressive lower-level prefetcher can slow down its upper-level prefetching counterpart. Second, fetching hinted blocks from HDDs to SSDs must not saturate the I/O bandwidth of the SSDs and HDDs. We believe that the main concern is the SSDs' bandwidth, because the SSDs are accessed by both prefetching modules; compared with HDDs, SSDs are more likely to have their I/O bandwidth saturated.

5.3.1 Architecture of IPO

Before presenting the implementation details of IPO, let us first outline a high-level overview of IPO's hardware and software architectures.

Hardware Architecture of IPO

IPO shares iPipe's hardware architecture, as shown in Figure 4.1. As we discussed in Chapter 4, an application accessing a storage system provides hints of future I/O accesses to IPO, which manages two prefetching modules: the upper module (e.g., TIP) prefetches hinted blocks from SSDs to buffers in main memory; the lower module prefetches hinted blocks from HDDs to SSDs. Both prefetching modules work concurrently in a pipelined manner. The parallel storage systems investigated in our studies contain prefetching buffers as well as an array of two-level storage devices (SSDs and HDDs). Recall that the storage systems have limited I/O bandwidth and scalability.

Software Architecture of IPO

Figures 5.1 and 5.2 show IPO's software architecture. One of the core components in the IPO mechanism is the upper-level informed prefetching module (i.e., TIP), which receives hints from data-intensive applications, controls buffers in the main memory, and performs informed prefetching from SSDs. When the IPO system receives an informed prefetching request, IPO uses metadata to decide whether the hinted block resides in an SSD before accessing a low-level HDD. If the hinted block is not available in one of the SSDs, the block must be fetched directly from an HDD. In case the hinted block has been fetched by the low-level prefetcher from the HDD to the SSD, the hinted block will be quickly loaded from the SSD into the main-memory buffer.

At the heart of the IPO mechanism is a lower-level informed prefetching module that fetches hinted blocks from HDDs to SSDs. When the lower-level prefetcher triggers an informed prefetching request, the prefetcher checks whether the hinted block has been cached in one of the SSDs.
If the hinted block is not available in the SSDs, the block is fetched from the HDD and placed in an SSD. In addition, the IPO system gains control of both SSDs and HDDs in order to perform pipelined prefetching based on the hints offered by the applications.

Figure 5.1: High-level design of the IPO software architecture. The multi-level storage system consists of an array of two levels of SSDs and HDDs with limited bandwidth and scalability.

Figure 5.2: Detailed design of the IPO software architecture for a three-level storage system. The application provides hints on future I/O accesses. IPO determines the appropriate hinted blocks to be fetched and keeps prefetching a particular number of hinted blocks (depending on the available bandwidth) to the uppermost level. The upper-level prefetcher (i.e., TIP) uses the cost/benefit model to determine the number of prefetching buffers.

5.3.2 Assumptions

In Chapter 3, we discussed our research assumptions in detail; in this section, we simply highlight the most important assumptions related to IPO. We assume that data blocks are initially placed in HDDs. Our motivation behind this assumption is that HDDs have much larger capacity than SSDs; in addition, HDDs are much cheaper than SSDs. Other prefetching studies in the realm of multi-level storage systems make a similar assumption that data blocks are initially stored in the lower level of the storage hierarchy [23].

IPO reserves a small portion of SSD space, determined by Equation 5.1, to retain copies of prefetched data. Instead of migrating data blocks from HDDs to SSDs, the prefetching process only moves copies of the original data. This strategy works very well for read-intensive applications, because read-only data cached in upper-level storage can be quickly discarded without being moved back to a lower-level storage device. Replicas across multiple levels allow the storage system to conserve bandwidth when cached blocks are evicted from I/O buffers in SSDs.

5.4 The IPO Algorithm

This section presents an algorithm that guides us in implementing the IPO mechanism for multi-level storage systems. We also give examples to demonstrate that IPO improves the I/O performance of storage systems by reducing application stalls and elapsed time.

5.4.1 Definitions

IPO handles the informed prefetching process between SSDs and HDDs. When an application starts its execution, the TIP module assigns a number of buffers for prefetching (Xcache) based on the cost-benefit model. This means that the informed prefetching module continues to issue a number of concurrent read requests equal to Xcache. The concurrent reads utilize either a part or all of the parallel storage system's bandwidth. Based on the non-utilized portion of the I/O bandwidth, IPO assigns its pipelined prefetching depth. Let MaxBW be the maximum number of read requests that may take place concurrently in the parallel storage system. In the event that TIP is assigning Xcache buffers for prefetching, the difference between MaxBW and Xcache is the pipelined prefetching depth of IPO. The pipelined prefetching depth is determined by Equation 5.1; this number represents the maximum space that needs to be reserved for pipelined prefetching in the uppermost level (i.e., SSDs). As we will see in the performance evaluation section, the pipelined prefetching depth does not consume much of the SSD storage space.
$$P_{depth} = MaxBW - X_{cache} \qquad (5.1)$$

where Pdepth is IPO's pipelined prefetching depth, MaxBW is the maximum bandwidth (expressed as the maximum number of concurrent read requests), and Xcache is the number of prefetching buffers.

The TIP module continues prefetching Xcache hinted data blocks concurrently. At the same time, IPO keeps fetching Pdepth hinted data blocks from HDDs in a pipeline with TIP. IPO consists of two algorithms that determine the first hinted block to be fetched from HDDs and the number of subsequent blocks to be fetched. Initially, most of TIP's prefetching requests are served from the lowest level (i.e., HDDs) until the hinted data blocks arrive in the uppermost one (i.e., SSDs). At that point, TIP fetches hinted blocks from either of the two storage levels. As more I/O bandwidth becomes available for IPO to fetch hinted blocks in a pipeline manner, more prefetching requests are handled in the uppermost level. Every time a data block is prefetched by TIP, it is consumed from the SSD's pipelined prefetching buffer and a new pipelined prefetching request is initiated. As more I/O bandwidth becomes available for pipelining, application stalls and elapsed time are reduced (see Equation 4.1).

Recall that Tcpu + Thit + Tdriver represents the system's single time unit needed for an application to consume a prefetched data block. In addition, our multi-level storage system consists of an HDD and an SSD. Thdd_cache is the disk read latency from HDD to the buffer cache in the main memory, Thdd_ss is the disk read latency from HDD to SSD, and Tss_cache is the disk read latency from SSD to the buffer cache. Tss_cache should be less than Thdd_cache.

5.4.2 The Pstart and Pnext Algorithms

IPO pipelines the prefetching processes. When informed prefetching initiates its first prefetching requests, IPO begins fetching requests from HDDs to SSDs. The first block to be fetched from HDD by IPO is calculated by the Pstart algorithm. The pipelined prefetching depth spans a number of data blocks calculated by Equation 5.1. When a data block is requested by TIP, the block is consumed from the pipelining buffer in the SSD after the block has been fully fetched from SSD to the main memory. At this time, IPO issues a new pipelined prefetching request for another hinted data block; the position of that access's data block is calculated by Algorithm 5. Tconsume_ss represents the time needed to consume a single pipelined prefetching block from SSD (i.e., Tconsume_ss = Tss_cache).

The Pstart Algorithm: Determine the first block fetched from HDD
xcachecounter = 0
accesstime = - (Tcpu + Thit + Tdriver)
for blockcounter = 1 to Thdd_ss / (Tcpu + Thit + Tdriver) do
    xcachecounter++
    accesstime += (Tcpu + Thit + Tdriver)
    if accesstime >= Thdd_ss then
        Pstart = blockcounter
        return Pstart
    end if
    if xcachecounter = Xcache then
        if Tstall_ss(Xcache) > 0 then
            accesstime += Tstall_ss(Xcache)
        end if
        if accesstime >= Thdd_ss then
            Pstart = blockcounter + 1
            return Pstart
        end if
        xcachecounter = 0
    end if
end for

Algorithm 4 returns the hinted block position at which IPO begins pipelining. The first hinted block to be fetched from HDD depends on the Thdd_ss and Tstall_ss(Xcache) values that may occur during the beginning of informed prefetching, assuming that all data is already in the SSD. The algorithm calculates the position of the hinted data block that will be accessed after enough time has passed to read it from HDD to SSD.

The Pnext Algorithm:
totaltime = Thdd_ss + Tconsume_ss
xcachecounter = 0
accesstime = - (Tcpu + Thit + Tdriver)
for blockcounter = 1 to totaltime / (Tcpu + Thit + Tdriver) do
    xcachecounter++
    accesstime += (Tcpu + Thit + Tdriver)
    if accesstime >= totaltime then
        Pnext = Pstart + blockcounter - 1
        return Pnext
    end if
    if xcachecounter = Xcache then
        if Tstall_ss(Xcache) > 0 then
            accesstime += Tstall_ss(Xcache)
        end if
        if accesstime >= totaltime then
            Pnext = Pstart + blockcounter
            return Pnext
        end if
        xcachecounter = 0
    end if
end for

Algorithm 5 returns the next hinted data block that IPO should fetch from HDD to SSD when a previously pipelined data block is consumed from a buffer in the SSD. Tconsume_ss is the time to consume a block in the SSD. The application stalls for at least Tstall_ss(Xcache); these stalls reduce the prefetching depth.
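A C++ sketch of Algorithms 4 and 5 under the chapter's timing model; the shared walker, struct, and function names are ours, and the loop bound carries a small slack rather than the exact totaltime/unit bound of the pseudocode.

#include <cmath>

struct IpoTiming {
    double unit;          // Tcpu + Thit + Tdriver
    double t_hdd_ss;      // HDD -> SSD read latency
    double t_consume_ss;  // time to consume one SSD-cached block (= Tss_cache)
    double t_stall_ss;    // Tstall_ss(Xcache)
};

// Shared walker: number of hinted accesses that elapse before `target` time passes,
// charging Tstall_ss every Xcache accesses (both algorithms start one interval early).
static int accessesUntil(const IpoTiming& t, int xcache, double target) {
    double accessTime = -t.unit;
    int counter = 0;
    const int bound = static_cast<int>(std::ceil(target / t.unit)) + xcache;
    for (int block = 1; block <= bound; ++block) {
        ++counter;
        accessTime += t.unit;
        if (accessTime >= target) return block;
        if (counter == xcache) {
            if (t.t_stall_ss > 0) accessTime += t.t_stall_ss;
            if (accessTime >= target) return block + 1;
            counter = 0;
        }
    }
    return bound;
}

// Algorithm 4: position of the first hinted block IPO fetches from HDD.
int ipoStart(const IpoTiming& t, int xcache) {
    return accessesUntil(t, xcache, t.t_hdd_ss);
}

// Algorithm 5: next hinted block to fetch once a pipelined block is consumed.
int ipoNext(const IpoTiming& t, int xcache, int pStart) {
    return pStart + accessesUntil(t, xcache, t.t_hdd_ss + t.t_consume_ss) - 1;
}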
Equations 5.2 and 5.1 are used in the above two algorithms to determine (1) the first block to be fetched from HDD to SSD and (2) the number of subsequent blocks to be fetched from HDD to SSD.

$$IPO_{start} = \frac{T_{hdd\_ss}}{T_{cpu}+T_{hit}+T_{driver}} - \left(\frac{T_{hdd\_ss}}{(T_{cpu}+T_{hit}+T_{driver})\,X_{cache}}\right)\cdot\frac{T_{stall\_ss}(X_{cache})}{T_{cpu}+T_{hit}+T_{driver}} \qquad (5.2)$$

where IPO_start is the first block to be fetched by IPO from HDD.

Figures 5.3 and 5.4 illustrate an example of IPO working with the TIP module. MaxBW equals 5 and Xcache equals 2, so IPO uses the remaining 3 slots for pipelined prefetching. Figures 5.5 and 5.6 show the non-IPO case. In the IPO case, the stalls and elapsed time of the application are reduced.

5.4.3 The IPO Algorithm

Algorithm 6: IPO
Pstart = call pipelining start block
MaxBW = maximum number of concurrent read requests
Pdepth = Pstart + (MaxBW - Xcache) - 1
while informed prefetching do
    if bandwidth shortage then
        shrink the pipelining depth by 1
    end if
    if pipelined data block is altered then
        discard the pipelined data block
    end if
    if this is the first prefetching then
        for block = Pstart to block = Pdepth do
            pipeline block to SSD
        end for
    else if SSD buffer is consumed then
        Pnext = call (Next to Prefetch) for the consumed block
        if the Pnext data block is not already in the SSD then
            pipeline Pnext to the SSD
        end if
    end if
end while

Figure 5.3: Average stalls when using IPO and a fixed number of buffers for parallel prefetching in the buffer cache. The maximum number of read requests (MaxBW) is 5. The number of informed prefetching buffers is 2, and the remaining 3 slots of the bandwidth are used for pipelined prefetching. Thdd_cache = 5, Thdd_ss = 8, Tss_cache = 4, and Xcache = 2. The first accesses stall for 5 time units. Before IPO fetches hinted blocks from HDD to SSD, the application stalls for Tstall_hdd(Xcache) = Thdd_cache - 3(Tcpu + Thit + Tdriver) = 2 time units every 3 accesses. IPO continues to fetch 3 hinted blocks at a time from HDD. When a prefetched block is consumed from SSD, a new pipelined prefetching request is initiated by IPO. When IPO is employed, the stall time becomes 40 and the elapsed time is 76 time units.

Figure 5.4: Continuation of Figure 5.3.

Figure 5.5: Average stalls when using a fixed number of buffers for parallel prefetching in the buffer cache. Thdd_cache = 5 and Xcache = 2. The first accesses stall for 5 time units. All data is read from HDD. The application stalls for Tstall_hdd(Xcache) = Thdd_cache - 3(Tcpu + Thit + Tdriver) = 2 time units every 3 accesses. In the non-IPO case, the stall time is 45 and the elapsed time is 81 time units.

Figure 5.6: Continuation of Figure 5.5.
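Equation 5.1 itself is a single subtraction; the tiny C++ helper below makes the reservation explicit, using the example above (MaxBW = 5, Xcache = 2) and the MaxBW = 15 setting used in the evaluation later in this chapter. The function name and the printed breakdown are ours.

#include <cstdio>

// Equation 5.1: the pipelined prefetching depth IPO may use is the I/O
// bandwidth left over by TIP, expressed in concurrent block requests.
int ipoDepth(int maxBW, int xcache) { return maxBW - xcache; }

int main() {
    std::printf("Figure 5.3 example: depth = %d\n", ipoDepth(5, 2));   // 3 slots
    int depth = ipoDepth(15, 1);                                       // 14 requests
    std::printf("MaxBW = 15, Xcache = 1: depth = %d\n", depth);
    std::printf("reserved SSD space = %d MB (10 MB blocks)\n", depth * 10);    // 140 MB
    std::printf("reserved SSD space = %d MB (200 MB blocks)\n", depth * 200);  // 2800 MB
    return 0;
}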
The IPO algorithm calls the Pstart algorithm to determine the first hinted block to be fetched from HDD. Next, IPO calculates the pipelined prefetching depth (i.e., Pdepth), which is constrained by the available I/O bandwidth. Then, IPO initiates prefetching requests to fetch hinted blocks from HDD to SSD. Every time a cached block is consumed from SSD by the TIP module, IPO starts prefetching the next hinted block from HDD to SSD. A bandwidth shortage means TIP is unable to complete its prefetching requests on time from any level; this may be caused by the pipelined prefetching process and limited I/O bandwidth. In the event of a bandwidth shortage, IPO reduces its pipelined prefetching depth by one until the shortage is eliminated. The IPO algorithm checks whether a data block to be fetched has been modified before the block is requested by TIP; in this case, the prefetched copy is discarded to preserve consistency. If the data is already cached in the SSD, the pipelined prefetching request is discarded by IPO.

5.5 Performance Evaluation

This section evaluates IPO's performance through extensive simulations. After presenting the simulation environment, we discuss the impact of IPO on application elapsed times.

5.5.1 Simulation Environment

IPO shares the same simulation environment as iPipe in terms of architecture, traces, assumptions, and parameters, all of which were validated in Chapter 3. Since both TIP and IPO work concurrently, the maximum number of read requests issued to HDDs is MaxBW. According to Equation 4.2, the prefetching horizon of reading data from the HDD to the buffer cache equals 63 whether 10 MB or 200 MB data blocks are being used. In this simulation, we record IPO's performance when the MaxBW value is set to 15. Note that this value is also 15 in the TIP study [1]. Assuming that each single read request needs one disk in order not to face any congestion, MaxBW in this case equals 15.

The IPO pipelined prefetching depth does not consume much SSD space. If TIP is assigning Xcache = 1, the pipelined prefetching depth becomes 14. In this case, IPO needs 140 MB in the 10 MB block-size case and 2.8 GB in the 200 MB block-size case. The buffers are striped across the entire disk array of 15 disks.

5.5.2 Elapsed Time Improvement

The simulation results show that when TIP is using very few prefetching buffers (e.g., Xcache = 1), IPO reduces the total elapsed time by approximately 56% when using 10 MB data blocks and 34% when using 200 MB blocks. Figure 5.7 shows the total elapsed time when MaxBW equals 15 and 10 MB data blocks are used. Figures 5.8 and 5.9 show the total elapsed time when MaxBW equals 15 and 200 MB data blocks are used. As the number of Xcache buffers increases, the difference between the IPO and non-IPO cases decreases; this is because TIP's prefetching process becomes more efficient at hiding the application's stalls. At the same time, IPO's pipelined prefetching process shows the most performance improvement in TIP's most critical situation (i.e., when TIP uses very few Xcache buffers, e.g., 1). Also, due to pipelining limitations, IPO does not exhibit a significant performance improvement when Xcache is larger than 9, especially as MaxBW becomes smaller; however, the 10 MB block-size case shows a better performance improvement because the SSD performs better than the HDD when the block size is 10 MB.

Figure 5.7: IPO reduces application elapsed time. 10 MB block size. Total elapsed time when using 1 to 15 prefetching buffers. MaxBW = 15.
It is clear that the 200 MB block-size case shows significantly larger elapsed time values, because both the SSD and the HDD have longer read latencies when accessing 200 MB blocks. Another reason is that the SSD's advantage over the HDD is smaller when accessing large (e.g., 200 MB) data blocks.

Performance improvement offered by IPO becomes more pronounced when the I/O bandwidth of the parallel storage system increases. This is a result of IPO being able to issue more pipelined prefetching requests to HDDs and bring more data to the uppermost level.

Figure 5.8: IPO reduces application elapsed time. 200 MB block size. Total elapsed time when using 1 to 7 prefetching buffers. MaxBW = 15.

Figure 5.9: IPO reduces application elapsed time. 200 MB block size. Total elapsed time when using 9 to 15 prefetching buffers. MaxBW = 15.

Figure 5.10: Bandwidth utilization when Xcache is varied from 1 to 15 in both the IPO and non-IPO cases. IPO fully utilizes the bandwidth.

5.5.3 Bandwidth Utilization

Recall that IPO utilizes unused I/O bandwidth in parallel storage systems. When the TIP module is using few buffers for prefetching, the rest of the I/O bandwidth is used by IPO. Again, in this experiment, the block size is set to 10 MB and 200 MB, respectively. Figure 5.10 shows the bandwidth utilization when using different values of Xcache between 1 and 15 for both the IPO and non-IPO cases.

5.5.4 Increasing the MaxBW Value

We observed that IPO's performance improvement is not as significant as that of iPipe because of the limited I/O bandwidth. When the available I/O bandwidth MaxBW increases, IPO tends to perform like iPipe, because the storage system's I/O bandwidth becomes high enough. When the block size is 10 MB, IPO improves system performance similarly to iPipe when MaxBW equals 154 (see Equation 4.4).

Figure 5.11: IPO reduces the elapsed time. Total elapsed time when the Xcache value is increased from 1 to 63. MaxBW is set to 154.

Figure 5.11 shows the application elapsed time when the Xcache value is increased from 1 to 63 and MaxBW is 154. The LASR1 trace is used in this simulation study. IPO shows results similar to those of iPipe using LASR1 (see also Figures 4.14, 4.15, and 4.16).

5.6 Summary

IPO is an extension of the iPipe mechanism presented in the previous chapter. IPO differs from iPipe in that it fully supports multi-level storage systems with limited I/O bandwidth. IPO decides the number of concurrent prefetching requests issued to HDDs in accordance with the available I/O bandwidth offered by both SSDs and HDDs, and ensures that the pipelined prefetching scheme does not overload the SSDs. On average, the performance improvements provided by IPO are not as significant as those offered by iPipe due to the limited I/O bandwidth. IPO reduces the application elapsed time by approximately 56% when Xcache is set to 1; as the Xcache value increases, IPO's performance improvement decreases. Experimental results show that increasing the MaxBW value helps to make IPO's performance improvement more noticeable. The results reported in this chapter indicate that our proposed pipelined prefetching solutions are very beneficial to parallel storage systems offering sufficient I/O bandwidth.
Chapter 6
IPODS: Informed Prefetching in Distributed Multi-level Storage Systems

6.1 Overview

In the previous two chapters, we proposed informed prefetching algorithms (i.e., iPipe and IPO) that aim to hide I/O latency by invoking I/O parallelism and prefetching hinted accesses in multi-level storage systems. I/O access hints are provided by applications so that hinted blocks can be prefetched prior to their accesses. In this chapter, we extend the iPipe and IPO schemes to meet the needs of distributed multi-level storage systems, where massive amounts of data are allocated to a set of storage servers built with multi-level storage devices.

A distributed storage system consists of several storage servers among which data are striped and stored [72][83][68][82]. Clients can store and retrieve data to and from the storage servers through networks. Distributed storage systems aim to offer high I/O performance, high bandwidth, and scalability [67][69][66]. Apart from the storage devices in distributed systems, network and server latencies raise major I/O performance issues, which can be addressed by prefetching techniques.

Several architectures of distributed storage systems were proposed in previous studies. For example, Figures 6.1 and 6.2 show two existing architectures designed in prior research projects [65] and [71]. In both architectures, there are several distributed storage nodes and multiple clients. Data blocks are striped among the storage nodes, all of which are connected by a network.

Figure 6.1: The architecture of a distributed parallel storage system. Several distributed clients and storage nodes are connected by a network. Source: T. Madhyastha, G. Gibson, and C. Faloutsos, Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, Oregon, 1999.

In this chapter, we focus on a way of extending the IPO mechanism to the realm of distributed multi-level storage systems. Note that the terms servers, nodes, and sites are used interchangeably throughout this chapter. In a distributed storage system, each storage server consists of multi-level storage devices (see the previous two chapters for details on multi-level storage systems) and data blocks are striped among multiple storage servers. We decide to extend the IPO scheme rather than its earlier iPipe version, because IPO does not rely on the unreasonable assumption that storage systems offer unlimited I/O bandwidth. In this part of the study, we consider a limited number of storage nodes with limited aggregate I/O bandwidth. The goal of this study is to reduce disk read latency, which is a part of the total prefetching time. Since each storage server consists of multi-level storage devices, the IPO mechanism in each server enables TIP to locate prefetching requests in the uppermost storage level of the server. IPODS helps to reduce application stalls and elapsed time in distributed and parallel storage systems.

6.2 Motivations and Objectives

The motivation of the IPODS study is three-fold. First, little attention has been paid to informed prefetching in distributed/parallel multi-level storage systems; please refer to the related work chapter for a summary of research projects related to distributed and parallel storage systems.
Second, distributed and parallel storage systems have high scalability, because massive amounts of data can be striped and distributed across a large number of storage nodes and servers. Third, each storage node and server in a distributed and parallel system can be built with multi-level storage devices, where hot data blocks are stored and cached in upper-level storage for quick access. Prefetching techniques can be employed to preload hinted data blocks into the upper-level storage to shorten disk read latencies, because upper-level storage shows good I/O performance in terms of read latency.

Figure 6.2: Another architecture of a distributed parallel storage system. Several distributed clients and storage nodes are connected by a network [71]. Source: Luis-Felipe Cabrera and Darrell D. E. Long, Swift: Using Distributed Disk Striping to Provide High I/O Data Rates, University of California at Santa Cruz, Santa Cruz, CA, 1991.

This part of the study has the following research objectives:

- To incorporate informed prefetching in distributed/parallel multi-level storage systems.
- To reduce prefetching overhead by leveraging multiple prefetchers working in parallel.
- To evaluate the performance of the proposed IPODS scheme by measuring applications' stalls and elapsed time.

6.3 IPODS Design Issues

When a client accesses data blocks from a multi-level storage server via network connections, the IPODS mechanism brings hinted blocks to the upper-level storage. Although IPODS is unable to reduce network latency, which is outside the scope of this dissertation, IPODS is capable of shortening long I/O read latencies.

The first design issue is to enable multiple prefetchers to collaborate in a pipelining manner. Similar to iPipe and IPO, IPODS has several prefetching modules residing in each storage server. The number of prefetchers in a storage server depends on the number of storage levels: an N-level storage server requires N-1 prefetchers. For example, one prefetcher fetches data blocks from tape to hard drives, whereas another prefetcher loads data from hard drives to solid state disks (SSDs). Unlike iPipe and IPO, IPODS also incorporates a prefetcher that brings data from remote storage servers to a client's local I/O buffers.

A second design issue in IPODS is a coordination mechanism between a client's local I/O buffer and the buffers in the SSDs (i.e., the upper-level devices) of a remote storage server. The prefetching module that fetches data from remote storage servers to local buffers contains both client and server portions. The client portion of the prefetching module takes hints from applications at the client; it then checks whether the hinted blocks have been cached in the local buffer. If the hinted blocks are not available in the local buffer, a request is sent to the server portion to have the hinted blocks fetched from the remote server. The prefetching module handling the client and server portions works in parallel with the other prefetching modules residing in remote storage servers.

6.3.1 The IPODS Architecture

Before presenting the IPODS implementation details, let us describe the hardware and software architectures of the IPODS prefetching mechanism.

Hardware Architecture of IPODS

Figure 6.3 illustrates the hardware architecture of IPODS. The system consists of client nodes and distributed storage nodes. Each client node maintains a local buffer cache; applications running on client nodes retrieve data from remote storage nodes.
Applications disclose their future access hints to client nodes, which prefetch data from remote storage nodes to local buffers. Each storage node built with N-level storage devices contains N-1 prefetching modules: there is one prefetcher between the ith and (i+1)th levels. In this dissertation study, we configure each storage node as a two-level storage device whose upper level is a solid state disk and whose lower level is a hard drive. Large data sets are striped and stored across multiple storage nodes, and data blocks are transferred between clients and I/O storage nodes through networks. Unlike iPipe, IPODS is designed for distributed/parallel storage systems with limited I/O bandwidth.

Figure 6.3: Distributed/parallel multi-level storage system. The system shows several distributed storage nodes and clients connected by a network. Each I/O storage node consists of a two-level storage device containing both an SSD and an HDD. Source: T. Madhyastha, G. Gibson, and C. Faloutsos, Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, Oregon, 1999.

Software Architecture of IPODS

Figures 5.1 and 5.2 represent the software architecture of IPO; IPODS and IPO share a similar software architecture. IPODS relies on the TIP module [1] to process hints disclosed by applications. IPODS is in charge of controlling buffers in all the storage levels while performing the informed prefetching process. When IPODS receives a prefetching request, it checks the local buffer first before searching the SSDs and HDDs in remote storage nodes. IPODS incorporates IPO to control the distributed storage nodes and to perform the pipelined prefetching process based on the application hints.

6.3.2 Assumptions

We assume that data blocks are initially placed in HDDs in remote storage nodes. Like IPO, IPODS reserves a small portion of the SSD for the pipelined prefetching process; this portion, defined as the prefetching depth, is given by Equation 5.1. The maximum I/O bandwidth represents the maximum number of concurrent read requests that may take place in a distributed system without any I/O congestion or delays. We assume that the maximum I/O bandwidth scales up with the number of nodes [65]. In addition to disk I/O bandwidth, network bandwidth affects the performance of distributed/parallel storage systems. In this dissertation study, we focus only on the disk I/O issues; network I/O problems are outside the scope of this research.

6.4 The IPODS Algorithm

6.4.1 Definitions

IPODS incorporates IPO to deal with the informed prefetching process in each remote storage node. When an application starts its execution on a client node, the TIP module [1] assigns Xcache buffers for prefetching based on the cost-benefit model. In other words, TIP keeps issuing Xcache concurrent prefetching requests to the remote storage nodes. Concurrently fetching a number of blocks utilizes either a part or all of the distributed/parallel storage system's I/O bandwidth. Recall that IPO assigns its pipelined prefetching depth based on the non-utilized portion of the disk bandwidth. MaxBW is the maximum number of data blocks that may be requested concurrently in the distributed/parallel storage system; the value of MaxBW depends on the number of nodes and the available aggregate I/O bandwidth. IPO's pipelined prefetching depth is determined by Equation 5.1.
The pipelined prefetching depth represents the maximum space that must be reserved in the SSD to cache hinted data blocks fetched from the HDDs.

Since a hinted block must be fetched from a remote storage node to a local buffer, the time spent fetching the block includes network transfer time and server latency as well as the disk I/O access time. Equation 6.1 calculates the time spent fetching a hinted block from a remote storage node:

RemoteRead = TDisk + TNetwork + TServer    (6.1)

where RemoteRead is the time spent fetching a hinted block from a remote node [1], TDisk is the disk I/O access time, TNetwork is the network transfer time, and TServer is the server processing time.

Recall that (Tcpu + Thit + Tdriver) represents the time needed for an application to consume a single block in the buffer cache. In addition, a distributed multi-level storage system consists of several storage nodes, each of which has an HDD and an SSD. Thdd-network-cache is the time spent retrieving a single data block over the network from a remote storage node's HDD. Tss-network-cache is the time needed to fetch a single data block over the network from a remote node's SSD. Thdd-ss is the disk read latency of moving a block from the HDD to the SSD of a remote node. Tss-network-cache is less than Thdd-network-cache, because the performance of the SSD is higher than that of the HDD. If the network bandwidth is very low, the performance gap between Tss-network-cache and Thdd-network-cache diminishes, because the network rather than the hard drives becomes the serious performance bottleneck. In our research, we assume that the overhead introduced by servers can be ignored (i.e., TServer = 0), because storage servers simply fetch data rather than processing it.

The IPO pipelined prefetching process enables TIP to find more hinted blocks in the SSDs. IPO consists of two algorithms that determine the first appropriate hinted block to be fetched and the total number of prefetched blocks. The pipelined prefetching process reduces the time spent fetching data from remote storage nodes. IPODS extends IPO to meet the needs of distributed/parallel multi-level storage systems and reduces the elapsed times of applications running on client nodes.

6.4.2 The IPODS Algorithm

IPODS employs the IPO scheme to perform pipelined prefetching in distributed multi-level storage systems. We implement the Pstart and Pnext algorithms in IPODS, because these two algorithms are the core of the IPO scheme. IPODS differs from IPO in the sense that IPODS has to manage the local buffers in the client nodes. In addition, IPODS must consider network transfer time as well as storage node latencies as parts of the prefetching time. In a single storage node system, moving a data block from the HDD or SSD to the buffer cache is faster than moving the same data block from the HDD to the SSD (Thdd-ss). In a distributed system, however, Thdd-ss is expected to be less than Thdd-network-cache and Tss-network-cache: because the SSD and HDD reside in the same node, moving data blocks from the HDD to the SSD within a node is faster than moving data from a remote storage node to a client through the network. This results in more application and TIP stalls, which consequently opens more room than in a single-node system for IPODS to prefetch data blocks in a pipelined manner.
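As a small illustration of the timing model, Equation 6.1 with the TServer = 0 assumption can be stated directly in code; the function and parameter names below are ours, and the snippet is a sketch rather than part of the IPODS implementation.

// Equation 6.1 under the dissertation's assumption TServer = 0: the cost of
// pulling a hinted block from a remote node is its disk access time plus the
// network transfer time.
double remoteReadSeconds(double tDisk, double tNetwork, double tServer = 0.0) {
    return tDisk + tNetwork + tServer;
}

// The SSD and HDD paths share the same tNetwork term, so on a slow network the
// gap between Tss-network-cache and Thdd-network-cache shrinks:
//   remoteReadSeconds(tSsd, tNet) - remoteReadSeconds(tHdd, tNet) == tSsd - tHdd,
// a fixed difference that becomes small relative to a large tNet.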
This extra room exists because IPODS can guarantee that the prefetched data will arrive in the SSDs and the local buffer before being requested by TIP.

6.5 Performance Evaluation

In what follows, we evaluate the performance of IPODS through simulation studies. Let us first illustrate the simulation environment.

6.5.1 Simulation Environment

The simulation environment for IPODS is similar to that for IPO, because IPODS is an extension of IPO that meets the needs of a distributed/parallel storage system containing several storage nodes. We built a simulator using the C++ programming language to simulate a distributed storage system in which IPODS is implemented. Again, we use the elapsed time as the key performance evaluation metric. For comparison purposes, we also simulate the same distributed system without IPODS. We evaluate the impact of buffer size on system performance.

The simulated distributed system consists of a client node offering local buffers and a set of remote storage nodes built with two-level storage devices (i.e., an SSD level and an HDD level). The client and all the storage nodes are connected by a network. Figure 6.3 shows the architecture of the simulated distributed storage system. Data is initially stored in the HDDs, whereas prefetched blocks are stored in buffers in the SSDs of the storage nodes and in the local buffers of the client. The TIP module manages the buffer cache, and IPODS controls the buffers in all the storage nodes. Again, we use the LASR traces (i.e., machine01 or LASR1 and machine06 or LASR2), which consist of 11686 and 51206 I/O read requests, respectively [97] [98].

According to our validation results presented in Chapter 3, the data block size must be 200 MB or larger to ensure that SSDs offer better I/O performance than HDDs. For data blocks that are smaller than 200 MB, there is no significant difference between reading data from a remote SSD and reading data from a remote HDD, since the network rather than the HDDs becomes the performance bottleneck.

In the simulated distributed storage system, we configure the system parameters using the data validated in Chapter 3. For LASR1 and LASR2, we assume that each exclusive I/O read system call represents a single request for a 200 MB data block. The trace's sequence of I/O read requests represents the future access hints. In addition, each cache buffer in both the client and the storage servers is large enough to accommodate a large data block. Based on the validation in Chapter 3, the system parameters used in our simulation study are: the block size equals 200 MB, and (Tcpu + Thit + Tdriver) equals 0.02 seconds. Data is initially located in the HDDs. The average latency of reading a 200 MB data block from a remote SSD and from a remote HDD to the buffer cache over a LAN is 4.158 and 4.43 seconds, respectively. The time spent reading a 200 MB block from the HDD to the SSD (Thdd-ss) equals 4.5 seconds. There is no write latency to the buffer cache; a similar assumption can be found in [1]. Since both TIP and IPODS work concurrently, the maximum number of read requests to the HDDs is set to MaxBW. Based on Equation 4.2, the prefetching horizon for reading data from the HDDs to the buffer cache is 222 data blocks. Similar to the IPO simulation study, we simulate IPODS by setting MaxBW to 15 concurrent read requests. We choose 15 because 15 disks were tested in the performance evaluation of the TIP study [1]. In the IPODS simulator, each storage node's disks can individually handle a single read request without I/O congestion.
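For reference, the configuration described in this subsection can be collected into a small set of constants. The following is only a sketch under our own naming, not an excerpt from the actual simulator.

// Simulation parameters from Section 6.5.1 (values validated in Chapter 3).
struct IpodsSimConfig {
    double blockSizeMB        = 200.0;   // size of each hinted data block
    double tConsume           = 0.02;    // Tcpu + Thit + Tdriver: time to consume one cached block (s)
    double tSsdNetworkCache   = 4.158;   // remote SSD -> client buffer cache over LAN (s)
    double tHddNetworkCache   = 4.43;    // remote HDD -> client buffer cache over LAN (s)
    double tHddToSsd          = 4.5;     // HDD -> SSD inside a storage node (s)
    int    maxBW              = 15;      // maximum concurrent read requests (MaxBW)
    int    prefetchHorizonHdd = 222;     // prefetching horizon from Equation 4.2
};

The large gap between tConsume and the two network read latencies is what produces the application stalls that the pipelined prefetchers try to hide.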
6.5.2 Improving Elapsed Time

The simulation results show that IPODS reduces the total elapsed time by about 6% when the TIP module uses very few buffers for prefetching (e.g., Xcache is set to 1, 2, or 3). Figure 6.4 shows the total elapsed time for varying numbers of prefetching buffers when MaxBW equals 15. As Xcache increases, the performance difference between the IPODS and non-IPODS cases decreases, because TIP becomes more efficient at reducing the application's stalls.

Figure 6.4: Total elapsed time when the number of prefetching buffers is varied from 1 to 15. MaxBW = 15. IPODS reduces the elapsed time.

Importantly, the IPODS pipelined prefetching process shows its best performance improvement when TIP uses few Xcache buffers (e.g., Xcache = 1). The experimental results suggest that IPODS can enhance the I/O performance of distributed and parallel storage systems in which the TIP module has limited or no cache buffers to utilize. On the other hand, when Xcache is greater than or equal to 9, especially when MaxBW is limited, IPODS cannot exhibit any performance improvement.

6.6 Summary

In this chapter, we describe the IPODS prefetching scheme, an extension of the earlier IPO scheme. IPODS is designed to meet the needs of distributed/parallel multi-level storage systems, where networks introduce significant overhead to the entire prefetching process. Like iPipe and IPO, IPODS reduces I/O access time by hiding long I/O times through prefetching. Our results show that IPODS can judiciously reduce the elapsed times of applications by approximately 6% when the cache buffer is small (e.g., Xcache = 1). When Xcache increases, the improvement offered by IPODS starts diminishing.

Chapter 7

Prototype Development

In this chapter, we describe our prototypes for the informed prefetching schemes proposed in the previous three chapters. Because we are unable to test our design in a large-scale distributed system, we develop our prototypes using a computing cluster in the storage systems research laboratory at Auburn University. The laboratory is located in Room 2104 of the Shelby Building. We replay real-world traces to evaluate the performance of the prototypes. Then, we compare experimental results obtained from the prototypes with those obtained from the simulators to validate the correctness of our simulators. We replay only a portion (i.e., 10%) of the two traces used to drive our simulators, because it takes a couple of days to replay an entire trace. The results generated by our prototypes can be used to evaluate the performance of our proposed pipelined prefetching schemes in a real-world storage system.

7.0.1 Objectives

Developing the prototypes helps us achieve the following three objectives:

• To build real-world storage systems that implement our pipelined prefetching solutions.

• To evaluate the performance of our prefetching schemes in a real-world storage system.

• To validate the correctness of our simulators by comparing simulation results with the prototyping results.

7.0.2 System Setup

We summarize the testbed (including hardware and software) used to develop our prototypes as follows.

• Memory: Samsung 3 GB RAM main memory.

• HDD: Western Digital 500 GB SATA, 16 MB cache, WD5000AAKS.

• SSD: Intel 2 Gb/s SATA SSD 80G sV 1A.

• LAN network switch: Dell PowerConnect 2824 [96].

• Scripting languages: C shell, C++.

• Input parameter: a set of informed prefetching requests.
• Performance metric: elapsed time measured in seconds.

7.0.3 Design Issues of the Prototypes

In the process of developing our prototypes, we need to determine the first prefetched blocks and the total number of blocks to be prefetched in a pipelined manner. Thdd-ss, the time spent moving a block from the hard drive to the solid state disk, is an important system parameter used to determine the first block and the number of blocks to be prefetched. To guarantee the correctness of Thdd-ss, we validate this parameter on a real-world testbed in our laboratory by recording the highest observed value of Thdd-ss (see Chapter 3 for validation details). Because the prototypes are developed on a small-scale storage system using our testbed, we set this parameter as a constant. Specifically, Thdd-ss is set to 0.122 seconds and 4.5 seconds when the block size is 10 MB and 200 MB, respectively.

The iPipe scheme differs from IPO in that iPipe uses Tstall-hdd(Xcache) to determine the first block to be prefetched and Tstall-ss(Xcache) to determine the prefetching depth (i.e., the total number of blocks to be prefetched). On the other hand, IPO always uses Tstall-ss(Xcache) to determine both the first block and the total number of blocks to be fetched in a pipelined manner. In our simulation studies, these values are constants; however, we are unable to predict them in the prototypes. Therefore, we use the best case (i.e., the lowest values) of Thdd-cache, which is 0.07 seconds for a 10 MB data block and 1.6 seconds for a 200 MB block, as shown in Figures 3.5 and 3.8. We also set Tss-cache to 0.044 seconds for a 10 MB data block and 0.96 seconds for a 200 MB block, as shown in Figures 3.6 and 3.9. These values are used to determine the minimal stall values that help our algorithms decide the proper prefetching depth and avoid incorrect results. For the TIP module in the prototypes, we set Tcpu + Thit + Tdriver to 0.00192 seconds when the block size is 10 MB and to 0.037 seconds when the block size is 200 MB. These settings are validated in Chapter 3. When it comes to Tcpu, we consider the worst case in which our prototype performs no computation, meaning that applications are very I/O intensive and consume prefetched data very quickly. Note that this assumption is conservative in the sense that our pipelined prefetching schemes can achieve better I/O performance when Tcpu is large and applications are CPU intensive.

7.0.4 Validation Process

In order to validate our simulation results, we check whether the simulation results and the prototyping results show a similar trend with little variation. Recall that our prototypes implement the pipelined prefetching schemes proposed in the previous three chapters. The prototypes replay 1000 I/O read requests from the two real-world I/O traces. Although we replay only 1000 requests in the prototypes, the prototypes can handle hundreds of thousands of requests; we limit ourselves to 1000 requests from the two traces because it takes a couple of days to fully replay an entire trace. We conduct one experiment to show that the performance trend of replaying 1000 requests is the same as that of replaying the entire I/O traces; the disk access times for pipelined prefetching and non-pipelined prefetching are bounded by a particular range. We set Tcpu + Thit + Tdriver in the prototypes to a constant, which represents the worst case (i.e., the smallest value).
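The prototype parameters given above can be collected in one place. The following is a sketch under our own naming for the two block sizes used (10 MB and 200 MB), not code from the prototypes themselves.

// Validated prototype parameters from Section 7.0.3 (Chapter 3 measurements).
// Index 0 holds the 10 MB-block values, index 1 the 200 MB-block values.
struct PrototypeParams {
    double blockSizeMB;   // data block size
    double tHddToSsd;     // Thdd-ss: HDD -> SSD, worst (highest) recorded value (s)
    double tHddCache;     // Thdd-cache: HDD -> buffer cache, best (lowest) value (s)
    double tSsdCache;     // Tss-cache: SSD -> buffer cache, best (lowest) value (s)
    double tConsume;      // Tcpu + Thit + Tdriver (s)
};

constexpr PrototypeParams kPrototypeParams[2] = {
    {10.0,  0.122, 0.07, 0.044, 0.00192},
    {200.0, 4.5,   1.6,  0.96,  0.037},
};

Using the worst (highest) recorded value for Thdd-ss and the best (lowest) values for the cache-read latencies yields the minimal stall values that the algorithms use to choose a safe prefetching depth, as described above.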
Fixing Tcpu + Thit + Tdriver to this worst-case constant enables us to observe the system performance trend (with and without our solutions) and to estimate the expected elapsed time when an entire I/O trace is replayed. Equation 7.1 calculates the expected elapsed time for a real-world application whose I/O access pattern is represented by a trace replayed in our prototypes:

ProtElapsed = (Trace#ofReads / 1000) × PrototypingElapsedTime    (7.1)

where ProtElapsed is the expected prototyping elapsed time of a given trace.

7.1 Prototypes

In this section, we discuss the prototypes developed for the iPipe, IPO, and IPODS prefetching schemes. We also show results obtained from the prototypes and compare the prototyping results with the earlier simulation results to validate the correctness of our simulators.

7.1.1 The iPipe Prototype

In this subsection, we present a prototype for the iPipe pipelined prefetching solution. When using iPipe, the first few hinted blocks are fetched from the HDDs (i.e., the lowest storage level) and the rest of the hinted blocks are fetched from the SSDs (i.e., the uppermost storage level) if those hinted blocks are cached in the SSDs. We measure the prototyping results in terms of elapsed time, with and without iPipe, while handling 1000 read requests. As in the iPipe simulator, the TIP module uses anywhere from 1 to 63 buffers and the block size is either 10 MB or 200 MB. Then, we collect prototyping results by replaying a portion of the LASR1 or LASR2 traces. Finally, we compare the variation ratio between the iPipe simulation results and the iPipe prototyping results.

Tables 7.1 and 7.5 show the prototyping results in terms of elapsed time in seconds for the 10 MB and 200 MB cases, respectively. The prototyping results show that iPipe improves the system performance by approximately 51% when the block size is 10 MB and by 25% when the block size is 200 MB. The performance improvement offered by iPipe is more pronounced when the TIP module uses few prefetching buffers. As the number of prefetching buffers increases, the performance difference between iPipe and its non-iPipe counterpart decreases. When the block size is 10 MB, the prototyping results are close to the simulation results; for example, the difference between the simulation and prototyping results is only about 9%. The iPipe simulator is more accurate when the block size is large (e.g., 200 MB); for example, the difference between the simulation and prototyping results is as low as 5%. The simulator is more accurate for a large block size because it always uses the worst-case input parameters.

In the iPipe prototype, we implemented an application whose access pattern is the same as the LASR traces machine01 (LASR1) and machine06 (LASR2), which consist of 11686 and 51206 I/O read requests, respectively. Tables 7.2 and 7.6 show the prototyping results in terms of elapsed time measured in seconds when this application runs in the iPipe prototype. Both the iPipe simulator and the prototype share the same performance trend in terms of improvement over the non-iPipe counterpart. Tables 7.3 and 7.7 show the iPipe simulation results presented in the iPipe chapter. Tables 7.4 and 7.8 compare the improvement ratios of the simulation results and the prototyping results. When we compare the simulation and prototyping results of each case for different numbers of prefetching buffers, we notice that they are close to each other.
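For reference, Equation 7.1 is simple enough to state directly in code; the helper below is only a sketch (the function name is ours). The full-trace figures in Tables 7.2 and 7.6 appear consistent with applying this scaling to the 1000-request measurements of Tables 7.1 and 7.5.

// Equation 7.1: scale the elapsed time measured while replaying 1000 requests
// up to the full trace length (e.g., 11686 reads for LASR1, 51206 for LASR2).
double expectedTraceElapsed(long traceNumReads, double prototypeElapsedSeconds) {
    return (static_cast<double>(traceNumReads) / 1000.0) * prototypeElapsedSeconds;
}
// e.g., expectedTraceElapsed(11686, 47.88) ≈ 560 s, in line with the
// 559.6 s LASR1 with-iPipe entry of Table 7.2 at Xcache = 1.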
The simulation results are generally higher than the prototyping results by about 3-18%, because the iPipe simulator uses the highest (worst-case) recorded disk read latency. In the 10 MB case, however, the prototyping results are higher than the simulation results by about 3-18% when Xcache is larger than or equal to 11. We observe that, during the prototyping experiments, several read requests experience higher disk I/O times than the corresponding requests in the simulator; we refer to these read requests with high I/O access times as extreme disk requests. The variation ratio remains low because we use the highest validated disk read latency in our iPipe simulator.

To clarify the effect of these extreme disk requests recorded during our prototyping experiments, let us consider the following motivational example. Suppose that Tcpu + Thit + Tdriver equals 1 time unit, Thdd-cache in the simulation equals 5 time units, and Thdd-cache in the prototype does not exceed 4 time units but occasionally records extreme values of 6 time units. According to Equation 4.2, the prefetching horizon in the iPipe simulator equals 5. We compare the simulation and prototyping stalls when Xcache equals 1 and 4, which we consider small and large values, respectively. In addition, we assume that the application issues 8 I/O read requests in both the simulator and the prototype. In the simulation case, the requests' read latencies are (5, 5, 5, 5, 5, 5, 5, 5), while in the prototyping case we assume the requests experience disk read latencies of (6, 4, 4, 4, 6, 4, 4, 4). When Xcache is set to 1, the application stalls for 5, 4, and 3 time units when Thdd-cache equals 6, 5, and 4, respectively (see Equation 4.1). When Xcache is 4, the application stalls for 2, 1, and 0 time units when Thdd-cache equals 6, 5, and 4, respectively, and a stall is incurred only once per group of Xcache consecutive requests. Applying these stall values to the given stream of I/O read requests, we observe the following:

• In the simulation case, when Xcache is set to 1, the application stalls for 32 time units.

• In the prototype case, when Xcache is set to 1, the application stalls for 28 time units.

• In the simulation case, when Xcache is set to 4, the application stalls for 2 time units.

• In the prototype case, when Xcache is set to 4, the application stalls for 4 time units.

This motivational example indicates that the iPipe prototype produces more stalls than its simulator as the value of Xcache increases.

Table 7.1: Total elapsed time measured in seconds when the iPipe prototype is tested. The number of prefetching buffers is set to a range between 1 and 63. The block size is 10 MB.

Xcache          1        3        5        7        9
With iPipe      47.88    16.2851  10.0297  7.40379  5.81344
Without iPipe   99.5     34.9642  22.1658  15.9048  12.9741

Xcache          11       13       15       17       19
With iPipe      4.90562  4.24443  3.79232  3.39027  3.08462
Without iPipe   11.338   10.0755  9.79321  7.99748  7.35371

Xcache          25       35       45       55       63
With iPipe      2.54144  2.44104  2.34504  2.13651  2.07306
Without iPipe   6.79856  5.4483   5.05102  3.34087  3.16796

Table 7.2: Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe prototype. The number of prefetching buffers is set to a range between 1 and 63. The block size is 10 MB.
Xcache 1 3 5 7 9 LASR1 With iPipe 559.6273482 190.3076786 117.2070742 86.52068994 67.93585984 LASR1 Without iPipe 1161.996241 408.5916412 259.0295388 185.8634928 151.6153326 LASR2 With iPipe 2452.188772 833.8948306 513.5808182 379.1184707 297.6830086 LASR2 Without iPipe 5091.663489 1790.376825 1135.021955 814.4211888 664.3517646 Xcache 11 13 15 17 19 LASR1 With iPipe 57.32707532 49.60040898 44.31705152 39.61869522 36.04686932 LASR1 Without iPipe 132.495868 117.742293 114.4434521 93.45855128 85.93545506 LASR2 With iPipe 251.1971777 217.3402826 194.1895379 173.6021656 157.9510517 LASR2 Without iPipe 580.573628 515.926053 501.4711113 409.5189609 376.5540743 Xcache 25 35 45 55 63 LASR1 With iPipe 29.69926784 28.52599344 27.40413744 24.96725586 24.22577916 LASR1 Without iPipe 79.44797216 63.6688338 59.02621972 39.04140682 37.02078056 LASR2 With iPipe 130.1369766 124.9958942 120.0801182 109.4021311 106.1531104 LASR2 Without iPipe 348.1270634 278.9856498 258.6425301 171.0725892 162.2185598 111 Table 7.3: Total elapsed time measured in seconds when the iPipe simulator is tested. The number of prefetching bu ers is set to a range between 1 to 63. The block size is 10 MB. Xcache 1 3 5 7 9 LASR1 With iPipe 607.612 202.537 121.522 86.8018 67.5125 LASR1 Without iPipe 1402.32 467.44 280.464 200.331 155.813 LASR2 With iPipe 2662.65 887.551 532.53 380.379 295.85 LASR2 Without iPipe 6144.72 2048.24 1228.94 877.817 682.747 Xcache 11 13 15 17 19 LASR1 With iPipe 55.2375 46.7394 40.5075 35.7419 31.9796 LASR1 Without iPipe 127.484 107.871 93.488 82.4894 73.8063 LASR2 With iPipe 242.059 204.819 177.51 156.627 140.14 LASR2 Without iPipe 558.611 472.671 409.648 361.454 323.406 Xcache 25 35 45 55 63 LASR1 With iPipe 24.3045 22.4358 22.4365 22.4369 22.4371 LASR1 Without iPipe 56.0928 40.0663 31.1627 25.4967 22.4371 LASR2 With iPipe 106.506 98.3142 98.3149 98.3153 98.3155 LASR2 Without iPipe 245.789 175.563 136.549 111.722 98.3155 Table 7.4: Comparison between iPipe?s simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 10 MB. Xcache 1 3 5 7 9 LASR1 Simulation 0.566709453 0.566710166 0.566710879 0.566708098 0.566708169 LASR1 prototyping 0.51839143 0.534235018 0.54751464 0.53449336 0.551919594 LASR2 Simulation 0.566676757 0.566676268 0.566675346 0.566676198 0.566676968 LASR2 prototyping 0.51839143 0.534235018 0.54751464 0.53449336 0.551919594 Xcache 11 13 15 17 19 LASR1 Simulation 0.566710332 0.566710237 0.566709096 0.566709177 0.566709075 LASR1 prototyping 0.567329335 0.578737532 0.61276027 0.576082716 0.580535539 LASR2 Simulation 0.56667699 0.566677456 0.566676757 0.566675151 0.566674706 LASR2 prototyping 0.567329335 0.578737532 0.61276027 0.576082716 0.580535539 Xcache 25 35 45 55 63 LASR1 Simulation 0.566709096 0.440033145 0.280020666 0.120007687 0 LASR1 prototyping 0.626179662 0.551962998 0.535729417 0.360492925 0.345616738 LASR2 Simulation 0.566677109 0.440006152 0.280002783 0.120000537 0 LASR2 prototyping 0.626179662 0.551962998 0.535729417 0.360492925 0.345616738 112 Table 7.5: Total elapsed time measured in seconds when the iPipe prototype is tested. The number of prefetching bu ers is set to a range between 1 to 63. The block size is 200 MB. 
Xcache 1 3 5 7 9 With iPipe 1360.27 455.467 285.398 200.495 158.36 Without iPipe 1832.43 653.29 371.157 273.445 246.693 Xcache 11 13 15 17 19 With iPipe 130.818 112.293 97.3989 86.2939 80.076 Without iPipe 174.627 148.697 133.107 127.165 108.789 Xcache 25 35 45 55 63 With iPipe 60.3609 44.974 38.793 37.6105 37 Without iPipe 86.2886 63.5934 49.8966 40.6426 37.1776 Table 7.6: Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe prototype. The number of prefetching bu ers is set to a range between 1 to 63. The block size is 200 MB. Xcache 1 3 5 7 9 LASR1 With iPipe 15896.11522 5322.587362 3335.161028 2342.98457 1850.59496 LASR1 Without iPipe 21413.77698 7634.34694 4337.340702 3195.47827 2882.854398 LASR2 With iPipe 69653.98562 23322.6432 14614.08999 10266.54697 8108.98216 LASR2 Without iPipe 93831.41058 33452.36774 19005.46534 14002.02467 12632.16176 Xcache 11 13 15 17 19 LASR1 With iPipe 1528.739148 1312.255998 1138.203545 1008.430515 935.768136 LASR1 Without iPipe 2040.691122 1737.673142 1555.488402 1486.05019 1271.308254 LASR2 With iPipe 6698.666508 5750.075358 4987.408073 4418.765443 4100.371656 LASR2 Without iPipe 8941.950162 7614.178582 6815.877042 6511.61099 5570.649534 Xcache 25 35 45 55 63 LASR1 With iPipe 705.3774774 525.566164 453.334998 439.516303 432.382 LASR1 Without iPipe 1008.36858 743.1524724 583.0916676 474.9494236 434.4574336 LASR2 With iPipe 3090.840245 2302.938644 1986.434358 1925.883263 1894.622 LASR2 Without iPipe 4418.494052 3256.36364 2555.0053 2081.144976 1903.716186 113 Table 7.7: Total elapsed time measured in seconds when the iPipe simulator is tested. The number of prefetching bu ers is set to a range between 1 to 63. The block size is 200 MB. Xcache 1 3 5 7 9 LASR1 With iPipe 17531.8 5843.93 3506.36 2504.54 1947.98 LASR1 Without iPipe 26877.8 8959.27 5375.56 3839.69 2986.42 LASR2 With iPipe 76811.8 25603.9 15362.4 10973.1 8534.64 LASR2 Without iPipe 117774 39257.9 23554.8 16824.8 13086 Xcache 11 13 15 17 19 LASR1 With iPipe 1593.8 1348.65 1168.79 1031.28 922.726 LASR1 Without iPipe 2443.44 2247.31 1791.85 1581.05 1414.62 LASR2 With iPipe 6982.89 5908.6 5120.79 4518.34 4042.73 LASR2 Without iPipe 10706.7 9059.52 7851.59 6927.87 6198.62 Xcache 25 35 45 55 63 LASR1 With iPipe 701.272 500.909 432.431 432.399 432.382 LASR1 Without iPipe 1075.11 767.937 597.284 488.687 432.382 LASR2 With iPipe 3072.47 2194.62 1894.67 1894.64 1894.62 LASR2 Without iPipe 4710.95 3364.97 2617.2 2141.34 1894.62 Table 7.8: Comparison between iPipe?s simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 200 MB. 
Xcache 1 3 5 7 9 LASR1 Simulation 0.347721912 0.347722527 0.347721912 0.347723384 0.347720682 LASR1 prototyping 0.257668779 0.30281039 0.231058555 0.266781254 0.358068531 LASR2 Simulation 0.34780342 0.347802608 0.347801722 0.347802054 0.34780376 LASR2 prototyping 0.257668779 0.30281039 0.231058555 0.266781254 0.358068531 Xcache 11 13 15 17 19 LASR1 Simulation 0.347722882 0.347697728 0.347718838 0.347724613 0.34772165 LASR1 prototyping 0.250871858 0.244820003 0.268266132 0.321402115 0.263932934 LASR2 Simulation 0.347801844 0.347802091 0.347802165 0.347802427 0.347801607 LASR2 prototyping 0.250871858 0.244820003 0.268266132 0.321402115 0.263932934 Xcache 25 35 45 55 63 LASR1 Simulation 0.347720698 0.347721232 0.27600438 0.11518211 0 LASR1 prototyping 0.30047654 0.292788245 0.222532197 0.074603987 0.00477707 LASR2 Simulation 0.34780246 0.347803992 0.276069846 0.115208234 0 LASR2 prototyping 0.30047654 0.292788245 0.222532197 0.074603987 0.00477707 114 7.1.2 The IPO Prototype In this subsection, we describe our performance evaluation using an IPO prototype. Like the iPipe prototype, the IPO prototype simply replays 1000 read requests from the two LASR traces. We measure the application?s elapsed time using the IPO prototype. Then, we compare the results obtained from the IPO prototype with the simulation results. Tables 7.9 and 7.13 show the lapsed time measured in the IPO prototype when block size is set to 10 MB and 200 MB, respectively. The prototyping results con rm that the IPO scheme improves the performance of multi-level storage systems by 51% when block size is 10 MB and by 25% when the block size is 200 MB. We also observe that when the number of prefetching bu ers increases, the performance gap between IPO and its non- IPO counterpart is diminishing. Given the same multi-level storage system, applications issuing small requests (e.g., 10MB) gains more bene ts from IPO than applications issuing large requests (e.g., 200MB). Small blocks enjoy more bene ts from IPO than large blocks, because SSDs show better performance than HDDs in this case. We implement an application in the IPO prototype to resemble real-world applications whose access patterns are the same as the LASR traces. Tables 7.10 and 7.14 show elapsed time measured from the IPO prototype when the application is tested. The results con rm that both the IPO simulator and its prototype share the same performance trend in term of performance improvement over the non-IPO case. Table 7.11 reveals the IPO simulation results when the block size is 10 MB; Table 7.15 shows the results when the block size is as large as 200 MB. Tables 7.12 and 7.16 illustrate the comparison between the simulation results and the prototyping results. We observe that both the simulation and prototyping results share very similar elapsed times. Like the iPipe prototype, the IPO prototype exhibits smaller elapsed time than the IPO simulator when block size is small (e.g., 10MB) and Xcache is less than 11. The opposite is true when Xcache is larger than or equal to 11. The variation between the simulation and the prototyping results is as small as 3% in all the IPO cases. 115 Table 7.9: Total elapsed time measured in seconds when the IPO prototype is tested. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 10 MB. 
Xcache 1 3 5 7 With IPO 47.9343 20.2705 17.9612 14.5456 Without IPO 99.4349 34.9642 22.1658 15.9048 Xcache 9 11 13 15 With IPO 12.5878 11.2401 9.99 9.79321 Without IPO 12.9741 11.338 10.0755 9.79321 Table 7.10: Total elapsed time measured in seconds when the LASR traces are replayed by the IPO prototype. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 10 MB. Xcache 1 3 5 7 LASR1 With IPO 560.1602298 236.881063 209.8945832 169.9798816 LASR1 Without IPO 1161.996241 408.5916412 259.0295388 185.8634928 LASR2 With IPO 2454.523766 1037.971223 919.7212072 744.8219936 LASR2 Without IPO 5091.663489 1790.376825 1135.021955 814.4211888 Xcache 9 11 13 15 LASR1 With IPO 147.1010308 131.3518086 116.74314 114.4434521 LASR1 Without IPO 151.6153326 132.495868 117.742293 114.4434521 LASR2 With IPO 644.5708868 575.5605606 511.54794 501.4711113 LASR2 Without IPO 664.3517646 580.573628 515.926053 501.4711113 Table 7.11: Total elapsed time measured in seconds when the IPO simulator is tested. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 10 MB. Xcache 1 3 5 7 LASR1 With IPO 607.876 202.792 201.128 172.036 LASR1 Without IPO 1402.32 467.516 280.552 200.392 LASR2 With IPO 2662.92 887.79 880.872 753.498 LASR2 Without IPO 6144.72 2048.28 1229.03 877.908 Xcache 9 11 13 15 LASR1 With IPO 155.768 127.448 107.878 93.5731 LASR1 Without IPO 155.87 127.547 107.878 93.5731 LASR2 With IPO 682.69 558.602 472.678 409.672 LASR2 Without IPO 682.792 558.701 472.678 409.672 116 Table 7.12: Comparison between IPO?s simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 10 MB. Xcache 1 3 5 7 LASR1 Simulation 0.566521193 0.566235166 0.283099033 0.141502655 LASR1 prototyping 0.517932838 0.420249856 0.189688619 0.085458478 LASR2 Simulation 0.566632816 0.566568047 0.283278683 0.141711888 LASR2 prototyping 0.517932838 0.420249856 0.189688619 0.085458478 Xcache 9 11 13 15 LASR1 Simulation 0.000654391 0.000776184 0 0 LASR1 prototyping 0.029774705 0.00863468 0.008485931 0 LASR2 Simulation 0.000149387 0.000177197 0 0 LASR2 prototyping 0.029774705 0.00863468 0.008485931 0 Table 7.13: Total elapsed time measured in seconds when the IPO prototype is tested. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. Xcache 1 3 5 7 With IPO 1361.64 548.053 355.994 266.736 Without IPO 1832.43 653.29 371.157 273.445 Xcache 9 11 13 15 With IPO 239.318 174.533 148.564 133.107 Without IPO 246.693 174.627 148.697 133.107 Table 7.14: Total elapsed time measured in seconds when the LASR traces are replayed by the IPO prototype. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. Xcache 1 3 5 7 LASR1 With IPO 15912.12504 6404.547358 4160.145884 3117.076896 LASR1 Without IPO 21413.77698 7634.34694 4337.340702 3195.47827 LASR2 With IPO 69724.13784 28063.60192 18229.02876 13658.48362 LASR2 Without IPO 93831.41058 33452.36774 19005.46534 14002.02467 Xcache 9 11 13 15 LASR1 With IPO 2796.670148 2039.592638 1736.118904 1555.488402 LASR1 Without IPO 2882.854398 2040.691122 1737.673142 1555.488402 LASR2 With IPO 12254.51751 8937.136798 7607.368184 6815.877042 LASR2 Without IPO 12632.16176 8941.950162 7614.178582 6815.877042 117 Table 7.15: Total elapsed time measured in seconds when the IPO simulator is tested. The number of prefetching bu ers is set to a range between 1 to 15. 
The block size is 200 MB. Xcache 1 3 5 7 LASR1 With IPO 17533 7181.56 4842.92 3650.45 LASR1 Without IPO 26877.8 8960.76 5377.32 3840.85 LASR2 With IPO 76813 31457.1 21215.7 15988.5 LASR2 Without IPO 117774 39258.7 23556.5 16826.6 Xcache 9 11 13 15 LASR1 With IPO 2900.08 2408.85 2059.59 1793.48 LASR1 Without IPO 2987.52 2444.64 2067.66 1793.48 LASR2 With IPO 12705.6 10552.5 9018.85 7852.05 LASR2 Without IPO 13086.9 10708.4 9059.66 7852.05 Table 7.16: Comparison between IPO?s simulation results and the prototyping results. Total elapsed time measured in seconds when the LASR traces are replayed by the iPipe simulator and prototype. The block size is 200 MB. Xcache 1 3 5 7 LASR1 Simulation 0.347677265 0.198554587 0.09938036 0.04957236 LASR1 prototyping 0.256921138 0.161087725 0.040853332 0.024535098 LASR2 Simulation 0.347793231 0.198722831 0.099369601 0.049808042 LASR2 prototyping 0.256921138 0.161087725 0.040853332 0.024535098 Xcache 9 11 13 15 LASR1 Simulation 0.029268423 0.014640192 0.003902963 0 LASR1 prototyping 0.029895457 0.00053829 0.000894436 0 LASR2 Simulation 0.029136006 0.014558664 0.004504584 0 LASR2 prototyping 0.029895457 0.00053829 0.000894436 0 118 7.1.3 The IPODS Prototype Now we are in a position to present prototyping results for the IPODS scheme. We develop a prototype of a multi-level distributed storage system where the IPODS prefetching scheme is implemented. The client and storage nodes in the prototype are connected by the fast Ethernet. Like the iPipe and IPO prototypes, the IPODS prototype allow us to measure elapsed time of an application issuing 1000 read requests to the multi-level storage system. In the IPODS prototype, we set the block size to 200 MB and the maximum concurrent reads to 15. Thus, the TIP module in the IPODS prototype can choose the number of informed prefetching bu ers anywhere between 1 and 15. We compare the IPODS prototyping results with the simulation results. Table 7.17 reveals the prototyping results in terms of elapsed time in seconds. As men- tioned earlier, the IPODS prototype processes 1000 read requests issued by the application that replay the two LASR traces. The IPODS prototype con rms that IPODS can reduced the application?s elapsed time by approximately 6% when the number of prefetching bu ers is set to a small value (e.g., 1-3). With the increasing number of prefetching bu ers, the performance gaps is narrowing. The application implemented in the IPODS prototype and simulator used the two LASR traces machine01 (LASR1) or machine06 (LASR2), which contain 11686 and 51206 read requests, respectively. Table 7.18 shows the prototyping results in terms of elapsed time in seconds when the application is executed in the prototype. Table 7.18 indicates that the IPODS prototype and simulator share the same performance trend in term of improving system performance over the non-IPODS case. Table 7.19 plots the the IPODS simulation results presented in the previous IPODS chapter. Table 7.20 shows the comparison between the simulation and prototyping results. We observe that both the simulator and prototype agree with each other when the number of prefetching bu ers is set from 1 to 9. When there are 11 to 15 bu ers, both the simulator and prototype show 0% improvement over the non-IPODS case due to the pipeline limitation. 119 Table 7.17: Total elapsed time measured in seconds when the IPODS prototype is tested. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. 
Xcache 1 3 5 7 With IPODS 4158 1401.48 888.555 650.74 Without IPODS 4430 1520.81 916.11 667.74 Xcache 9 11 13 15 With IPODS 518.703 428.184 364.925 320.925 Without IPODS 526.37 429.777 365.925 320.925 Table 7.18: Total elapsed time measured in seconds when the LASR traces are replayed by the IPODS prototype. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. Xcache 1 3 5 7 LASR1 With IPODS 48590.388 16377.69528 10383.65373 7604.54764 LASR1 Without IPODS 51768.98 17772.18566 10705.66146 7803.20964 LASR2 With IPODS 212914.548 71764.18488 45499.34733 33321.79244 LASR2 Without IPODS 226842.58 77874.59686 46910.32866 34192.29444 Xcache 9 11 13 15 LASR1 With IPODS 6061.563258 5003.758224 4264.51355 3750.32955 LASR1 Without IPODS 6151.15982 5022.374022 4276.19955 3750.32955 LASR2 With IPODS 26560.70582 21925.5899 18686.34955 16433.28555 LASR2 Without IPODS 26953.30222 22007.16106 18737.55555 16433.28555 We use SCP (secure copy) to read data from the remote HDDs and SSDs; we take the average of the disk access time measured during the parameter validation process. For this reason, the IPODS simulation and prototyping results are very close when Xcache is set to 1. When Xcache increases, the IPODS prototype?s elapsed time is slightly higher (about 1%-8%) than those of the IPODS simulator. 7.2 Summary The goal of this part of study is three-fold. First, we built real-world storage systems that implements our pipelined prefetching solutions. Second, we evaluated the performance of our informed prefetching solutions in a real-world system. Last, we validated the correctness of our simulators presented in the previous chapters by comparing simulation results with the prototyping results. 120 Table 7.19: Total elapsed time measured in seconds when the LASR traces are replayed by the IPODS simulator. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. Xcache 1 3 5 7 LASR1 With IPODS 48590.9 16200 9933.42 7246.72 LASR1 Without IPODS 51769 17259.2 10357.2 7397.95 LASR2 With IPODS 212915 70973.4 43515 31742.4 LASR2 Without IPODS 226843 75615.6 45371.9 32409.7 Xcache 9 11 13 15 LASR1 With IPODS 5750.29 4704.81 3982.53 3454.88 LASR1 Without IPODS 5754.39 4708.83 3982.53 3454.88 LASR2 With IPODS 25202.5 20621.7 17449.7 15123.9 LASR2 Without IPODS 25206.6 20625.7 17449.7 15123.9 Table 7.20: Total elapsed time measured in seconds when the IPODS simulator is tested. The number of prefetching bu ers is set to a range between 1 to 15. The block size is 200 MB. Xcache 1 3 5 7 LASR1 Simulation 0.061390021 0.061370168 0.040916464 0.02044215 LASR1 prototyping 0.061399549 0.078464765 0.030078266 0.025459011 LASR2 Simulation 0.061399294 0.061392094 0.040926212 0.020589515 LASR2 prototyping 0.061399549 0.078464765 0.030078266 0.025459011 Xcache 9 11 13 15 LASR1 Simulation 0.0007125 0.000853715 0 0 LASR1 prototyping 0.0145658 0.003706573 0.0027328 0 LASR2 Simulation 0.000162656 0.000193933 0 0 LASR2 prototyping 0.0145658 0.003706573 0.0027328 0 121 The three prototypes replay real-world traces to evaluation the performance of the iPipe, IPO, and IPODS prefetching schemes. After comparing the simulators with the prototypes, we conclude that both our prototyping and simulations results share similar performance trends. Speci cally, for iPipe and IPO, the simulation and prototyping results vary by 3- 18%. For IPODS, the simulation and prototyping results vary by only 1-8%. 
The prototyping results confirm that the simulators presented in the previous chapters can be used to evaluate the performance of our proposed pipelined prefetching solutions.

Chapter 8

Conclusions & Future Work

This chapter first concludes this dissertation by summarizing our contributions to the field of informed prefetching. Then, we outline future research directions.

8.1 Main Contributions

Transparent informed prefetching, or TIP, aims to utilize storage bandwidth by prefetching data blocks into the cache before they are accessed by applications. Traditional informed prefetching schemes can hide the latency of accessing storage systems by exploiting disk I/O parallelism and fetching data based on application-disclosed hints. Disk access time affects informed prefetching, because a long disk access time increases application stalls. When disk speed is slow, informed prefetching needs more buffers for prefetching at the expense of demand caching (LRU). A multi-level (a.k.a. hybrid) storage system consists of multiple types of storage devices that differ in their specifications (hardware, speed, size, and the like) [87]; the upper levels exhibit better speed. Multi-level storage systems offer cost-effective solutions for large-scale data centers.

In this dissertation, we investigate approaches to incorporating the TIP module in multi-level storage systems to leverage informed prefetching mechanisms and boost the I/O performance of storage systems. The goal of this research is to demonstrate that building an informed prefetching pipeline can significantly improve the I/O performance of multi-level storage systems. To achieve this goal, we proposed three solutions (iPipe, IPO, and IPODS) that perform informed prefetching in a pipelined manner. We illustrate a way of using application hints to initiate prefetching among multiple storage levels such as main memory, solid state disks, and hard disk drives. The centerpiece of the iPipe, IPO, and IPODS schemes is a pipeline in which we split the informed prefetching process into a set of independent prefetching steps among the multiple storage levels. We showed how to integrate this pipeline with informed prefetching and caching to manage I/O buffers at the various storage levels. In the iPipe scheme, we assume that multi-level storage systems have sufficient I/O bandwidth and enough I/O parallelism, meaning that there is no I/O congestion in the storage systems. This assumption was relaxed when we extended iPipe into the IPO scheme. Finally, we developed the IPODS scheme to meet the needs of distributed and parallel storage systems. The next three subsections summarize our main contributions.

8.1.1 iPipe: Pipelined and Informed Prefetching

In the first phase of this dissertation research, we developed iPipe, a new informed prefetching technique powered by a pipeline. We started this research by assuming that multi-level storage systems have scalable I/O bandwidth and parallelism, meaning that there is no I/O congestion. This assumption of scalable I/O enables a prefetching mechanism to issue a large number of concurrent read requests without introducing long I/O queues. iPipe aims to reduce application I/O stalls and elapsed times by preloading hinted data blocks to the uppermost level of a multi-level storage system, which has better I/O performance than the lower levels. iPipe contains multiple informed prefetching modules running concurrently in a pipelined manner.
The number of prefetchers in a storage server depends on the number of storage levels: an N-level storage server requires N-1 prefetchers. The prefetchers in the lower-level storage devices help boost the I/O performance of the prefetchers in the upper levels. Consequently, iPipe reduces stall times, application elapsed times, and the prefetching horizon. iPipe also improves the benefit of using more buffers for prefetching.

We developed an iPipe simulator to evaluate the performance of our pipelined prefetching scheme. The accuracy of the iPipe simulator was validated by a prototype running on a real-world multi-level storage system. The simulation results show that iPipe reduces the total elapsed time by up to 56% when the block size is 10 MB and by 34% when the block size is 200 MB. The experimental results also show that, as the number of buffers increases, the performance gap between the iPipe and non-iPipe cases diminishes, because the prefetching module in the upper level becomes more efficient at reducing the application stalls.

8.1.2 IPO: Informed Prefetching Optimization

In the iPipe study, we assumed that multi-level storage systems have sufficient I/O bandwidth. Unfortunately, our empirical experiments indicate that parallel storage systems may experience I/O congestion; evidence shows that there is a maximum number of concurrent read requests that can be processed in multi-level storage systems. In the second phase of our dissertation study, we show that an upper-level prefetching mechanism may not fully utilize the I/O bandwidth, suggesting that the unused I/O bandwidth can be allocated to lower-level prefetching mechanisms to bring hinted blocks from lower-level to upper-level storage in a pipelined fashion. We conducted experiments to illustrate that the pipelined prefetching process largely depends on the I/O bandwidth available to the lower-level prefetchers.

In the IPO scheme, the lower-level prefetcher triggers an informed prefetching request when the following conditions are met. First, the upper-level SSDs have sufficient available I/O bandwidth. Second, fetching hinted blocks from the HDDs to the SSDs will not saturate the I/O bandwidth of the SSDs and HDDs. We believe that the main concern is the SSDs' bandwidth, because the SSDs are accessed by two prefetching modules; compared with HDDs, SSDs are more likely to have their I/O bandwidth saturated. Our IPO design ensures that a lower-level prefetcher will never slow down its upper-level prefetching counterpart.

The experimental results show that the performance improvements offered by IPO are not as significant as those offered by iPipe, due to the limited I/O bandwidth. Nevertheless, IPO reduces the elapsed time by approximately 56% when Xcache is set to 1. As the Xcache value increases, IPO's performance improvement decreases. This part of the study indicates that our proposed IPO is very beneficial to parallel storage systems offering sufficient I/O bandwidth.

8.1.3 IPODS: Pipelined Prefetching in Distributed/Parallel Storage Systems

In the third phase of the dissertation study, we extended the IPO scheme to meet the needs of distributed/parallel multi-level storage systems, where massive amounts of data are allocated to a set of storage servers built with multi-level storage devices. Similar to iPipe and IPO, IPODS has several prefetching modules residing in each storage server. Unlike iPipe and IPO, IPODS also incorporates a prefetcher that brings data from remote storage servers to a client's local I/O buffers, as sketched below.
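The per-node prefetcher structure summarized in these subsections (N-1 prefetchers between adjacent levels, plus, for IPODS, one more prefetcher from the remote server's uppermost level to the client's local buffer) might be set up along the following lines; the sketch uses our own hypothetical types and is not taken from our implementations.

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical descriptor for a prefetcher that moves hinted blocks from
// storage level i (slower) to level i+1 (faster) inside one storage node.
struct LevelPrefetcher {
    std::string from, to;
    LevelPrefetcher(std::string f, std::string t) : from(std::move(f)), to(std::move(t)) {}
};

// For an N-level node (e.g., {"tape", "hdd", "ssd"}), build the N-1 prefetchers
// that form the pipeline; IPODS adds one more prefetcher that pulls blocks from
// the node's uppermost level into the client's local buffer over the network.
std::vector<LevelPrefetcher> buildPipeline(const std::vector<std::string>& levels, bool ipodsRemote) {
    std::vector<LevelPrefetcher> pipeline;
    for (std::size_t i = 0; i + 1 < levels.size(); ++i)
        pipeline.emplace_back(levels[i], levels[i + 1]);             // one prefetcher per adjacent pair
    if (ipodsRemote)
        pipeline.emplace_back(levels.back(), "client local buffer"); // remote -> client (IPODS only)
    return pipeline;
}

With the two-level nodes used in this dissertation (HDD and SSD), this yields one HDD-to-SSD prefetcher per node, plus the network-facing prefetcher in the IPODS case.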
IPODS has a coordination mechanism between a client's local I/O buffer and the buffers in the SSDs (i.e., the upper-level devices) of a remote storage server. IPODS checks whether hinted blocks have already been cached in the local buffer; if they are not available there, a request is sent to the remote storage server to have the hinted blocks fetched from the server. We developed the IPODS simulator and prototype to evaluate the performance of IPODS deployed in distributed and parallel storage systems. Our results show that IPODS can judiciously reduce the elapsed times of applications by approximately 6%. Importantly, IPODS shows its best performance improvement when there are only a limited number of prefetching buffers available in the upper-level storage devices.

8.1.4 Prototypes for iPipe, IPO, and IPODS

We implemented three prototypes for the above informed prefetching schemes. We developed our prototypes using a computing cluster in our research laboratory at Auburn. The prototypes replay real-world traces to evaluate the performance of the three pipelined prefetching mechanisms, and they help us validate the correctness of the simulators in which iPipe, IPO, and IPODS are implemented. The validation was performed by comparing experimental results obtained from the prototypes with those obtained from the simulators.

After comparing the simulators with the prototypes, we confirmed that our prototyping and simulation results share similar performance trends. For example, for iPipe and IPO, the simulation and prototyping results vary by 3-18%. When it comes to IPODS, the simulation and prototyping results vary by only 1-8%. The prototyping results confirm that the three simulators can be used to evaluate the performance of the iPipe, IPO, and IPODS solutions.

8.2 Future Work

There are several important issues that we were not able to address in this dissertation. These issues open many future research opportunities. In this section, we discuss our future research directions.

8.2.1 Data Migration

Our pipelined prefetching schemes can hide the latency of accessing storage systems by loading hinted blocks into upper-level storage devices. Our solutions exploit disk I/O parallelism and fetch data based on application-disclosed hints. When data blocks are fetched to upper-level storage devices, the lower-level storage devices keep the original copies of the data blocks. As the first future research direction, we will investigate data migration schemes. Instead of sending copies of the data to the uppermost level, our prefetching mechanisms could migrate the data blocks themselves; obviously, the space limitation of the uppermost level is an obstacle that we may face. Data migration requires moving data back and forth among the different storage levels. This activity consumes storage system I/O bandwidth, because a prefetching scheme must move the prefetched data blocks back to a lower storage level when the blocks are evicted from the I/O buffers. We will show that the data migration scheme is very beneficial to applications with mixed workloads of both reads and writes.

8.2.2 The Cost-Benefit Model

Our pipelined prefetching schemes rely on the cost-benefit model, which helps divide the compound cache space between the demand-miss cache (LRU) and prefetching. This division may change frequently under dynamic workload conditions. Our pipelining mechanisms can adjust their pipelined prefetching depth according to the available I/O bandwidth.
In this dissertation research, we fix the number of prefetching buffers. This strategy works well as long as workload conditions do not change dramatically. In the future, we will extend our pipelined prefetching schemes by dynamically adjusting the number of prefetching buffers under fast-changing I/O loads. To allow our prefetching modules to change the number of prefetching buffers, we must rely on a good cost-benefit model that can estimate the cost and benefit of increasing or decreasing the prefetching buffer size. We plan to integrate the cost-benefit model into the multiple prefetchers deployed in multi-level storage systems. Because each prefetching module individually handles a pair of consecutive storage levels, each prefetcher must independently control its prefetching buffer size.

8.2.3 Write Performance

We focus only on the read performance of multi-level storage systems. Our proposed iPipe, IPO, and IPODS are designed to improve the performance of read-intensive applications supported by multi-level storage systems. When it comes to write-intensive applications, iPipe, IPO, and IPODS may not be able to boost I/O performance. This motivates us to investigate, in the future, a pipelined prefetching module that can efficiently handle write requests. We will also develop a simulator to evaluate the performance of multi-level storage systems supporting applications with mixed workloads of both writes and reads.

8.2.4 Most Recently Used Policy

The TIP module employed in our solutions uses the most recently used (MRU) policy for informed caching. As discussed in Chapter 3, we did not implement informed caching in our solutions, mainly because we cannot predict whether a particular hinted access will be cached when TIP performs prefetching. In a future project, we will integrate the MRU policy into our pipelined prefetching mechanism for multi-level storage systems. Such an integration will enable us to determine whether a particular hinted access will be cached at its prefetching phase.

8.2.5 Extending Storage Hierarchy

In this dissertation research, we evaluated the performance of iPipe, IPO, and IPODS only on two-level storage systems that contain solid state disks and hard drives. In the future, we plan to further evaluate our three pipelined prefetching schemes on three-level storage systems, where tapes are used as the lowest storage level. Tapes have huge storage capacity but lower speed compared to HDDs and SSDs; the IBM System Storage TS2240 Tape Drive Express is a good example [95], with a capacity of 1.6 TB and a speed of approximately 120 MB/s.

To incorporate tapes, we need to extend our schemes by adding a prefetcher that fetches hinted blocks from tapes to hard disks and solid state disks. Fetching hinted blocks from tapes is different from fetching blocks from hard drives, because tapes have very slow seek times; however, once the tape is positioned, the tape drive can stream data quickly. This motivates us to group hinted blocks together so that hinted blocks stored sequentially on the tape can be fetched at the same time to reduce long seek times.

8.2.6 Caching and Benchmarking

Caching prevents our prefetching schemes from loading hinted block data again. Both caching and pipelined prefetching improve the performance of informed prefetching. In our solutions, we assumed the worst case, in which informed caching was not implemented.
In our future work, we will seamlessly integrate informed caching with pipelined prefetching mechanisms to further improve the I/O performance of multi-level storage systems.

We will evaluate the performance of the new design using real-world I/O-intensive benchmarks. The benchmarks to be considered in our future research include XDataSlice, Sphinx, Agrep, and Gnuld. In addition, we will use the IOzone file system benchmark, which measures the performance of file system operations [100]. Other benchmarks to be considered are Linux commands such as grep and wc [31].

8.2.7 Various Solid State Disks

In our iPipe study, we conducted extensive experiments using various solid state disks. However, the system parameters of these solid state disks were collected from device vendors' websites [90] [91] [92] [93] [94] [95]. In the future, we will have to validate these parameters in the context of multi-level storage systems. Solid state disks provided by different vendors have different capacities and speeds [91] [92] [93]. In our future work, we will test our solutions using a wide range of SSD products and will purchase additional types of solid state disks to perform such experiments.

8.2.8 Block Sizes

Our tests show that the SSD's performance relative to the HDD is affected by the data block size. For example, reading a data block from the SSD is faster than reading it from the HDD by about 56% when the block size is 10 MB and by about 34% when it is 200 MB. Also, the SSDs in a distributed storage system show better performance than the HDDs by about 6% when the block size is 200 MB. In our future work, we will investigate the possibility of varying block sizes in a multi-level storage system. Future multi-level storage systems should dynamically choose the most appropriate block size for each storage level under different I/O workloads.

8.2.9 Fast Networks for Distributed Storage Systems

Our prototypes show that IPODS provides a 6% performance improvement when implemented in a two-level distributed storage system in our laboratory. We observed that the network is an I/O performance bottleneck. Madhyastha et al. used a fast network to implement their solutions [65]. In our future work, we plan to further improve the performance of IPODS using a fast network, such as a Fibre Channel network.

Bibliography

[1] R. H. Patterson, G. A. Gibson, D. Stodolsky, and J. Zelenka: Informed prefetching and caching, In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 79-95, CO, USA, 1995.

[2] J. Griffioen and R. Appleton: Reducing file system latency using a predictive approach, In Proceedings of the 1994 USENIX Annual Technical Conference, pages 197-207, Berkeley, CA, USA, 1994.

[3] Chuanpeng Li, Kai Shen, Athanasios E. Papathanasiou: Competitive prefetching for concurrent sequential I/O, Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, March 21-23, 2007, Lisbon, Portugal, doi:10.1145/1272996.1273017.

[4] Vellanki, V., Chervenak, A. L., A Cost-Benefit Scheme for High Performance Predictive Prefetching, Proceedings of the ACM/IEEE SC99 Conference, 1999.

[5] Chen, Y., Byna, S., Sun, X., Data Access History Cache and Associated Data Prefetching Mechanisms, Proceedings of the ACM/IEEE Conference on Supercomputing, Reno, NV, Nov. 2007, pp. 1-12.
[6] Wang, J.Y.Q., Ong, J.S., Coady, Y., Feeley, M.J.: Using Idle Workstations to Implement Predictive Prefetching, Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, PA, Aug. 2000, pp. 87-94.

[7] Domenech, J., Sahuquillo, J., Gil, J.A., Pont, A.: The Impact of the Web Prefetching Architecture on the Limits of Reducing Users' Perceived Latency, IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, China, Dec. 2006, pp. 740-744.

[8] Jeon, J., Lee, G., Cho, H., Ahn, B.: A Prefetching Web Caching Method Using Adaptive Search Patterns, 2003 IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, Aug. 2003, vol. 1, pp. 37-40.

[9] Z. Zhang, K. Lee, X. Ma, Y. Zhou: PFC: Transparent Optimization of Existing Prefetching Strategies for Multi-Level Storage Systems, In Proceedings of the 28th International Conference on Distributed Computing Systems, pages 740-751, Beijing, China, 2008.

[10] D. M. Huizinga and S. Desai: Implementation of informed prefetching and caching in Linux, In Proceedings of the International Conference on Information Technology, pages 443-448, Las Vegas, NV, USA, 2000.

[11] R. Hugo Patterson, Garth A. Gibson, M. Satyanarayanan: A status report on research in transparent informed prefetching, ACM SIGOPS Operating Systems Review, v.27 n.2, pages 21-34, 1993.

[12] A. Tomkins, R. Hugo Patterson and G. Gibson: Informed multi-process prefetching and caching, In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 100-114, Seattle, WA, USA, 1997.

[13] Fay Chang, Garth A. Gibson: Automatic I/O hint generation through speculative execution, Proceedings of the third Symposium on Operating Systems Design and Implementation, p.1-14, February 1999, New Orleans, Louisiana, United States.

[14] Pei Cao, Edward W. Felten, Anna R. Karlin, Kai Li: A study of integrated prefetching and caching strategies, Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and Modeling of Computer Systems, p.188-197, May 15-19, 1995, Ottawa, Ontario, Canada [doi>10.1145/223587.223608].

[15] Susanne Albers, Markus Büttner: Integrated prefetching and caching in single and parallel disk systems, Proceedings of the fifteenth annual ACM Symposium on Parallel Algorithms and Architectures, June 07-09, 2003, San Diego, California, USA [doi>10.1145/777412.777431].

[16] R.H. Patterson, G.A. Gibson, M. Satyanarayanan: Using Transparent Informed Prefetching (TIP) to Reduce File Read Latency, In Proceedings of the Conference on Mass Storage Systems and Technologies, pages 329-342, Greenbelt, MD, 1992.

[17] D. Rochberg, G.A. Gibson: Prefetching over a network: early experience with CTIP, ACM SIGMETRICS Performance Evaluation Review, v.25 n.3, pages 29-36, 1997.

[18] F. Jason: Multi-level memory prefetching for media and stream processing, In IEEE International Conference on Multimedia and Expo, pages 101-104, St. Louis, MO, USA, 2002.

[19] Song J. and X. Zhang: ULC: A file block placement and replacement protocol to effectively exploit hierarchical locality in multilevel buffer caches, In Proceedings of the 24th International Conference on Distributed Computing Systems, pages 168-177, St. Louis, MO, USA, 2004.

[20] J. Kang and W. Sung: A multi-level block priority based instruction caching scheme for multimedia processors, In Proceedings of the 24th International Conference on Distributed Computer Systems, pages 125-132, Antwerp, Belgium, 2001.
[21] S. Przybylski, M. Horowitz, and J. Hennessy: Performance tradeoffs in cache design, In 15th Annual International Symposium on Computer Architecture, pages 290-298, Honolulu, HI, USA, 1988.

[22] S. Przybylski, M. Horowitz, and J. Hennessy: Characteristics of performance-optimal multi-level cache hierarchies, In 16th Annual International Symposium on Computer Architecture, pages 114-121, Stanford University, CA, USA, 1989.

[23] M. Nijim: Modelling Speculative Prefetching for Hybrid Storage Systems, In 2010 IEEE Fifth International Conference on Networking, Architecture and Storage (NAS), pages 143-151, Macau, 2010.

[24] M. Nijim, Z. Zong, X. Qin, Y. Nijim: Multi-Layer Prefetching for Hybrid Storage Systems: Algorithms, Models, and Evaluations, In 2010 39th International Conference on Parallel Processing Workshops, DOI 10.1109/ICPPW.2010.18.

[25] T. Kimbrel, P. Cao, E. Felten, A. Karlin, K. Li: Integrated Parallel Prefetching and Caching, Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 262-263, PA, USA, 1996.

[26] R. Hugo Patterson, G. Gibson: Exposing I/O concurrency with informed prefetching, Proceedings of the third International Conference on Parallel and Distributed Information Systems, pages 7-16, Austin, TX, USA, 1994.

[27] Jon A. Solworth, Cyril U. Orji: Write-Only Disk Caches, Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, p.123-132, May 23-26, 1990, Atlantic City, New Jersey, United States [doi>10.1145/93597.98722].

[28] David Kotz, Carla Schlatter Ellis: Caching and writeback policies in parallel file systems, Journal of Parallel and Distributed Computing, v.17 n.1-2, p.140-145, Jan./Feb. 1993 [doi>10.1006/jpdc.1993.1012].

[29] Mendel Rosenblum: The design and implementation of a log-structured file system, University of California at Berkeley, Berkeley, CA, 1992.

[30] Y. Chen, S. Byna, X. Sun, R. Thakur, W. Gropp: Exploring Parallel I/O Concurrency with Speculative Prefetching, Proceedings of the 2008 37th International Conference on Parallel Processing, pages 422-429, Portland, OR, USA, 2008.

[31] J. Lewis, M. I. Alghamdi, M. A. Assaf, X.-J. Ruan, Z.-Y. Ding, and X. Qin: An Automatic Prefetching and Caching System, In Proceedings of the 29th International Performance Computing and Communications Conference (IPCCC), 2010.

[32] Y. Chen, S. Byna, X. Sun, R. Thakur, W. Gropp: Hiding I/O Latency with Pre-execution Prefetching for Parallel Applications, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-10, Austin, TX, USA, 2008.

[33] Gabriel Rivera, Chau-Wen Tseng: Locality optimizations for multi-level caches, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), p.2-es, November 14-19, 1999, Portland, Oregon, United States [doi>10.1145/331532.331534].
[34] Achim Kraiss, Gerhard Weikum: Integrated document caching and prefetching in storage hierarchies based on Markov-chain predictions, The VLDB Journal: The International Journal on Very Large Data Bases, v.7 n.3, p.141-162, August 1998 [doi>10.1007/s007780050060].

[35] Surendra Byna, Yong Chen, Xian-He Sun, Rajeev Thakur, William Gropp: Parallel I/O prefetching using MPI file caching and I/O signatures, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, November 15-21, 2008, Austin, Texas.

[36] Xiaonan Zhao, Zhanhuai Li, and Leijie Zeng: A hierarchical storage strategy based on block-level data valuation, In Fourth International Conference on Networked Computing and Advanced Information Management, pages 36-41, Gyeongju, South Korea, 2008.

[37] Song Jiang, Kei Davis, and Xiaodong Zhang: Coordinated multilevel buffer cache management with consistent access locality quantification, In IEEE Transactions on Computers, volume 56, pages 95-108, Baltimore, MD, USA, 2007.

[38] Srinivas Kashyap, Samir Khuller, Yung-Chun Wan, and Leana Golubchik: Fast reconfiguration of data placement in parallel disks, Proceedings of the eighth Workshop on Algorithm Engineering and Experiments and the Third Workshop on Analytic Algorithmics and Combinatorics, 2006.

[39] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, John K. Ousterhout: Measurements of a distributed file system, Proceedings of the thirteenth ACM Symposium on Operating Systems Principles, p.198-212, October 13-16, 1991, Pacific Grove, California, United States [doi>10.1145/121132.121164].

[40] Mirjana Spasojevic, M. Satyanarayanan: An empirical study of a wide-area distributed file system, ACM Transactions on Computer Systems (TOCS), v.14 n.2, p.200-222, May 1996 [doi>10.1145/227695.227698].

[41] A. Vijay Srinivas, D. Janakiram: A model for characterizing the scalability of distributed systems, ACM SIGOPS Operating Systems Review, Volume 39 Issue 3, July 2005.

[42] M. Satyanarayanan: Scalable, Secure and Highly Available Distributed File Access, Computer, vol. 23, no. 5, pp. 9-21, May 1990.

[43] Dhruba Borthakur: The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007, doi: http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf

[44] Dhruba Borthakur: HDFS Architecture, The Apache Software Foundation, 2008, doi: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.pdf

[45] Shafer, J., Rixner, S., Cox, A.L.: The Hadoop distributed filesystem: Balancing portability and performance, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 122-133, White Plains, NY, March 2010. DOI: 10.1109/ISPASS.2010.5452045.

[46] Diana Moise, Gabriel Antoniu, Luc Bougé: Improving the Hadoop map/reduce framework to support concurrent appends through the BlobSeer BLOB management system, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), pp. 834-840, Chicago, IL, June 2010. DOI: 10.1145/1851476.1851596.

[47] Jeffrey Dean, Sanjay Ghemawat: MapReduce: a flexible data processing tool, Communications of the ACM, v.53 n.1, January 2010 [doi>10.1145/1629175.1629198].

[48] Jeffrey Dean, Sanjay Ghemawat: MapReduce: simplified data processing on large clusters, Communications of the ACM, v.51 n.1, January 2008 [doi>10.1145/1327452.1327492].

[49] G. Hughes, J. Murray: Reliability and security of RAID storage systems and D2D archives using SATA disk drives, ACM Transactions on Storage, Vol. 1, Issue 1, pages 95-107, 2005.
[50] Q. Zou, Y. Zhu, D. Feng: A Study of Self-similarity in Parallel I/O Workloads, MSST '10: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1-6, 2010, DOI: 10.1109/MSST.2010.5496978.

[51] Ganger, G.R., Worthington, B.L., Hou, R.Y., Patt, Y.N.: Disk arrays: high-performance, high-reliability storage subsystems, Computer, ISSN 0018-9162, volume 27, pages 30-36, Ann Arbor, MI, USA, 1994, doi: 10.1109/2.268882.

[52] Alexander Thomasian: Multi-level RAID for very large disk arrays, ACM SIGMETRICS Performance Evaluation Review, v.33 n.4, March 2006 [doi>10.1145/1138085.1138091].

[53] Maria E. Gomez, Vicente Santonja: Analysis of Self-Similarity in I/O Workload Using Structural Modeling, Proceedings of the 7th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, p.234, March 24-28, 1999.

[54] Nancy Tran, Daniel A. Reed: ARIMA time series modeling and forecasting for adaptive I/O prefetching, Proceedings of the 15th International Conference on Supercomputing, p.473-485, June 2001, Sorrento, Italy [doi>10.1145/377792.377905].

[55] James Oly, Daniel A. Reed: Markov model prediction of I/O requests for scientific applications, Proceedings of the 16th International Conference on Supercomputing, June 22-26, 2002, New York, New York, USA [doi>10.1145/514191.514214].

[56] J. No: A Design for Hybrid File System, European Conference for the Applied Mathematics and Informatics, Applied Mathematics and Informatics, ISBN: 978-960-474-260-8, pages 143-148, Vouliagmeni, Athens, Greece, 2010.

[57] G. Gibson et al.: A Cost-Effective, High-Bandwidth Storage Architecture, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, 1998.

[58] Mishra, S.K., Mohapatra, P.: Performance Study of RAID-5 Disk Arrays with Data and Parity Cache, Proceedings of the 1996 International Conference on Parallel Processing, pages 222-229, Ithaca, NY, 1996.

[59] S.H. Baek et al.: Reliability and performance of hierarchical RAID with multiple controllers, Proceedings of the twentieth annual ACM Symposium on Principles of Distributed Computing, pages 246-254, 2001.

[60] W. Meador: Disk Array Systems, Proceedings of COMPCON Spring '89, Thirty-Fourth IEEE Computer Society International Conference, pages 143-146, San Francisco, CA, USA, 1989.

[61] Y. Wu, A. G. Dimakis, and K. Ramchandran: Deterministic regenerating codes for distributed storage, presented at the Allerton Conference on Control, Computing, and Communication, Urbana-Champaign, IL, Sep. 2007.

[62] Dimakis, A.G., Godfrey, P.B., Yunnan Wu, Wainwright, M.J., Ramchandran, K.: Network Coding for Distributed Storage Systems, IEEE Transactions on Information Theory, Vol. 56, No. 9, September 2010.

[63] R. Hou, J. Menon, and Y. Patt: Balancing I/O Response Time and Disk Rebuild Time in a RAID5 Disk Array, Proceedings of the Hawaii International Conference on System Sciences, pages 70-79, 1993.

[64] A. Drapeau, R. Katz: Striped tape arrays, Proceedings of the 12th IEEE Symposium on Mass Storage Systems, 1993.

[65] T. Madhyastha, G. Gibson, C. Faloutsos: Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), Portland, Oregon, 1999.

[66] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data, ACM Transactions on Computer Systems (TOCS), v.26 n.2, pages 1-26, 2008.
[67] Brian Tierney, Jason Lee, Ling Tony Chen, Hanan Herzog, Gary Hoo, Guojun Jin, William E. Johnston: Distributed parallel data storage systems: a scalable approach to high speed image servers, Proceedings of the second ACM International Conference on Multimedia, pages 399-405, San Francisco, CA, 1994.

[68] Edward K. Lee, Chandramohan A. Thekkath: Petal: distributed virtual disks, Proceedings of the seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, Massachusetts, 1996.

[69] Moyer, S.A., Sunderam, V.S.: PIOUS: a scalable parallel I/O system for distributed computing environments, Proceedings of the Scalable High-Performance Computing Conference, pages 71-78, Knoxville, TN, 1994.

[70] D. Feng, Q. Zou, H. Jiang, et al.: A novel model for synthesizing parallel I/O workloads in scientific applications, Proceedings of the IEEE International Conference on Cluster Computing (Cluster'08), Tsukuba, Japan, September 2008.

[71] Luis Cabrera, Darrell D.E. Long: SWIFT: Using Distributed Disk Striping to Provide High I/O Data Rates, University of California at Santa Cruz, Santa Cruz, CA, 1991.

[72] D. D.E. Long, Bruce R. Montague, Luis Cabrera: SWIFT/RAID: A Distributed RAID System, University of California at Santa Cruz, Santa Cruz, CA, 1994.

[73] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google file system, Proceedings of the nineteenth ACM Symposium on Operating Systems Principles, October 19-22, 2003, Bolton Landing, NY, USA [doi>10.1145/945445.945450].

[74] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn: Ceph: a scalable, high-performance distributed file system, Proceedings of the 7th Symposium on Operating Systems Design and Implementation, November 06-08, 2006, Seattle, Washington.

[75] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, Michael J. West: Scale and performance in a distributed file system, ACM Transactions on Computer Systems (TOCS), v.6 n.1, p.51-81, Feb. 1988 [doi>10.1145/35037.35059].

[76] Chandramohan A. Thekkath, Timothy Mann, Edward K. Lee: Frangipani: a scalable distributed file system, Proceedings of the sixteenth ACM Symposium on Operating Systems Principles, p.224-237, October 05-08, 1997, Saint Malo, France [doi>10.1145/268998.266694].

[77] Siegel, A., Birman, K., Marzullo, K.: Deceit: A Flexible Distributed File System, Proceedings of the Workshop on the Management of Replicated Data, p.15-17, 8-9 Nov 1990, Houston, TX, USA.

[78] M. Satyanarayanan, John H. Howard, David A. Nichols, Robert N. Sidebotham, Alfred Z. Spector, Michael J. West: The ITC distributed file system: principles and design, Proceedings of the tenth ACM Symposium on Operating Systems Principles, p.35-50, December 1985, Orcas Island, Washington, United States [doi>10.1145/323647.323633].

[79] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere: Coda: A highly available file system for a distributed workstation environment, IEEE Transactions on Computers, 39(4):447-459, 1990.
[80] Narayan, S., Chandy, J.A.: Parity Redundancy in a Clustered Storage System, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), pages 17-24, 24 Sept. 2007.

[81] C.K. Yang, T. Mitra and T. Chiueh: A Decoupled Architecture for Application-Specific File Prefetching, Freenix Track of USENIX 2002 Annual Conference, 2002.

[82] R. W. Watson, R. A. Coyne: The parallel I/O architecture of the high-performance storage system (HPSS), Proceedings of the 14th IEEE Symposium on Mass Storage Systems, page 27, 1995.

[83] John H. Hartman, John K. Ousterhout: The Zebra striped network file system, ACM Transactions on Computer Systems (TOCS), v.13 n.3, pages 274-310, 1995.

[84] Brian L. Tierney, William E. Johnston, Hanan Herzog, Gary Hoo, Guojun Jin, Jason Lee, Ling Tony Chen, Doron Rotem: Using high speed networks to enable distributed parallel image server systems, Proceedings of the 1994 Conference on Supercomputing, pages 610-619, Washington, D.C., 1994.

[85] W. Tuma: Comparison of drive technologies for high transaction databases, Storage Developer Conference, Santa Clara, 2008, doi: www.storage-developer.org

[86] S. Rizvi, T. Chung: Flash SSD vs HDD: High Performance Oriented Modern Embedded and Multimedia Storage Systems, Proceedings of the 2010 2nd International Conference on Computer Engineering and Technology, pages V7-297 - V7-299, Chengdu, 2010.

[87] T. Kaneko: Optimal Task Switching Policy for a Multilevel Storage System, IBM Journal of Research and Development, vol.18, no.4, pp.310-315, July 1974.

[88] Nanopoulos, A., Katsaros, D., Manolopoulos, Y.: A Data Mining Algorithm for Generalized Web Prefetching, IEEE Transactions on Knowledge and Data Engineering, Sep. 2003, vol. 15, no. 5.

[89] Hadoop Archive Guide, doi: http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html

[90] Kingston ValueRAM, doi: http://www.valueram.com/desktop/memory.asp

[91] Intel X25-E Extreme SATA Solid-State Drive, doi: http://www.intel.com/design/flash/nand/extreme/index.htm

[92] Corsair Nova Series V64 Solid-State Hard Drive, doi: http://www.corsair.com/solid-state-drives/nova-series.html

[93] Corsair Force Series F115 Solid-State Hard Drive, doi: http://www.corsair.com/force-series-f115-solid-state-hard-drive.html

[94] WD Caviar Green, doi: http://www.wdc.com/en/products/Products.asp?DriveID=559

[95] IBM System Storage TS2240 Tape Drive Express Model, doi: http://www-03.ibm.com/systems/storage/tape/ts2240/

[96] DELL PowerConnect 2824 Switch, doi: http://www.dell.com/us/business/p/powerconnect-2824/pd

[97] LASR trace machine01, doi: http://iotta.snia.org/traces/list/Subtrace?parent=LASR+Traces

[98] LASR trace machine06, doi: http://iotta.snia.org/traces/list/Subtrace?parent=LASR+Traces

[99] RamSpeed Cache and memory benchmarking tool, doi: http://alasir.com/software/ramspeed/

[100] IOzone Filesystem Benchmark, doi: http://www.iozone.org/