Enhancing Flash Lifetime in Secondary Storage

by

Chengjun Wang

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 6, 2011

Keywords: Flash, Lifetime, Storage Systems

Copyright 2011 by Chengjun Wang

Approved by
Sanjeev Baskiyar, Chair, Associate Professor of Computer Science and Software Engineering
Hari Narayanan, Professor of Computer Science and Software Engineering
Cheryl Seals, Associate Professor of Computer Science and Software Engineering

Abstract

This research addresses the limited usable life of NAND flash-based storage systems. Unlike a magnetic disk drive, NAND flash suffers from a limited number of write cycles, ranging from 10K to 100K depending on the specific type of flash. As flash memory densities increase and cell sizes shrink, a further decrease in write endurance is expected. Although substantial research has been conducted to solve or mitigate the problem by wear leveling, write endurance remains a concern for write-intensive applications.

We propose to use a DRAM cache to filter write traffic to flash memory. Intuition tells us that a DRAM cache can filter writes to flash by coalescing and merging overwrites. However, the effectiveness of such a mechanism is not obvious, considering that a large file system cache already exists which also merges overwrites. Since DRAM is volatile, to preserve the integrity of data upon power failure we propose to use a supercapacitor backup, which can provide short-duration power during which the DRAM data can be retired to flash memory. The use of supercapacitors is superior to the traditional use of battery-backed DRAM because batteries, unlike supercapacitors, suffer from a limited number of charge/discharge cycles as well as slow charge times.

We studied the effectiveness of a DRAM cache in reducing write traffic to flash and the resulting changes in response time and throughput. We investigated the use of a DRAM cache under two settings: a) when flash is used as a disk cache within a magnetic disk controller, and b) when flash is used as a full secondary storage. Under the first setting, we used two levels of disk cache: a DRAM disk cache and a flash disk cache. Under the second setting, we used a single DRAM cache to filter the traffic to the full flash secondary storage. For both settings, we compared two policies to retire data from DRAM to flash memory: early vs. lazy retirement. In the early retirement policy, flash is updated at the same time DRAM is updated. In the lazy retirement policy, flash is updated only upon data eviction from the DRAM cache. Conventionally, the early update policy has been used to avoid data loss upon power failure. With early update, write traffic to flash is not reduced via the DRAM cache. In contrast, the lazy update policy substantially reduces write traffic, thereby extending the flash lifetime. Our simulation results show that, using a medium-sized DRAM cache, flash lifetime doubles with the lazy update policy compared to the early update policy. Moreover, miss ratio and average response time decrease as well. With little effort, our technique can be extended to improve the usable life of other emerging non-volatile memory systems, such as PCM and MRAM.

Acknowledgments

It is a pleasure to thank those who made this dissertation possible. I would never have been able to finish my dissertation without the guidance of my committee members, help from friends, and support from my family and my wife. I would like to express my deepest gratitude to my advisor,
Dr. Sanjeev Baskiyar, for his excellent guidance, caring, and patience, and for providing me with an excellent atmosphere for doing research during my study at Auburn University. His perpetual energy and enthusiasm for research have motivated all his advisees, including me. In addition, he was always accessible and willing to help his students with their research. As a result, research life became smooth and rewarding for me.

Dr. Hari Narayanan, Dr. Cheryl Seals, and Dr. Victor P. Nelson deserve special thanks as my advisory committee members for guiding my research and lending me encouragement. I attended Dr. Narayanan's HCI course and was moved by his rigor and passion for research. I asked Dr. Seals some questions when I took my Ph.D. qualifying exams; she was (and still is) very kind and willing to help her students. All of my classmates had the same impression. I would like to thank Dr. Nelson for his willingness to serve as an outside reader to assure quality and validity. He spent much time reading and correcting my dissertation, and his advice made it much better.

I would like to thank Dr. Prathima Agrawal for granting me the Vodafone Fellowship for two years, which is a special honor and an encouragement for me. I am grateful to Dr. David Umphress. He has employed me as a Graduate Teaching Assistant since Fall 2008, during which time I have worked with Dr. Sanjeev Baskiyar, Dr. Kai Chang, Dr. Saad Biaz, Dr. Alvin Lim, Dr. Richard Chapman, and Dr. Daniela Marghitu. They have all contributed to my understanding of teaching.

I would like to extend my thanks and appreciation to Ms. Jacqueline Hundley, who served as my mentor for the summer course COMP1200 (MATLAB for Engineers) that I taught at Auburn University. It was her help that made my teaching easier and successful.

I would like to thank the faculty and staff of the CSSE department, who have guided me through this process. Special thanks to Ms. Barbara McCormack, Ms. JoAnn Lauraitis, Ms. Michelle Brown, and Ms. Kelly Price; they are all quick to respond. I would like to thank Dr. Xiao Qin for offering a course on storage systems, which helped me greatly in my research on storage systems. Furthermore, he sets an example for me of how productive a researcher can be. I would like to thank my master's advisor Professor Yantai Shu and Professor Jianqing Pang from my time at the Chinese Academy of Sciences; they inspired me to do research.

I would like to thank my fellow students Shaoen Wu, Qing Yang, Cong Liu, and Xiaojun Ruan. Shaoen was my office mate, from whom I learned LaTeX, Cygwin, Microsoft Visio, and other tools. Qing was my roommate during my first year at Auburn; we spent that first year together. Cong was my classmate, with whom I took many courses at Auburn and spent time discussing research. Xiaojun provided me with recently published papers related to my research, and we had a good time talking about DiskSim 4.0, the well-regarded storage simulator.

I would like to thank Professor Emeritus Dr. Donald Street and his wife Dr. Mary Street. They kept inviting my family to their home, which gave me opportunities to learn more about American culture. I have learned a lot from them. I would like to thank my colleagues Jianlin Nie, Tianshu Zheng, and Xin Yang. We co-founded our company in China in 1992, and I still received financial support after I left the company. It would not have been possible for me to pursue my Ph.D. in the US without them.

I would like to thank my wife, Jinhua Li.
She has been encouraging me and standing by me through the good times and bad.

I would regret my doctoral years at Auburn University if I had not joined the Auburn Chinese Bible Study Group and Lakeview Baptist Church. It was not only a turning point in my life, but also a wonderful experience. I cherish the prayers and support we shared, and the friendships with my Christian brothers and sisters. Special thanks to Dr. Kai Chang, Prof. Tin-man Lau, Mr. Paul Froede, Dr. Rob Martin, and Associate Pastor Grady Smith. They have helped me grow spiritually and have been a constant source of encouragement during my graduate study.

Above all, thanks be to God for giving me wisdom and guidance throughout my life. You were with me through all the tests in the past several years. You have made my life more bountiful. May your name be honored, may your kingdom come, may your will be done on earth as it is in heaven.

This dissertation is dedicated to my beloved wife Jinhua, my lovely daughter Helen, and all my family.

Table of Contents

Abstract  ii
Acknowledgments  iv
List of Figures  xi
List of Tables  xv
1 Introduction  1
1.1 Problem Statement  2
1.1.1 Flash Memory  2
1.1.2 Flash Trends  3
1.1.3 Limitations of Existing Solutions  5
1.2 Scope of the Research  8
1.3 Contributions  10
1.4 Dissertation Organization  10
2 Literature Review  11
2.1 Flash Basics  11
2.2 Flash Disk Cache  15
2.3 System Memory Model  16
2.4 Solid State Drives and Reliability  17
2.5 Reducing Write Traffic  19
3 System Architecture  20
4 Comparison of Traffic Mitigation Techniques  23
4.1 Battery-backed DRAM as a write cache  23
4.2 SLC as a write cache for MLC  24
4.3 PCM as a write cache  24
4.4 HDDs as a write cache  25
4.5 Summary  26
5 Methodology  28
5.1 Simulator  28
5.2 System Setup  28
5.3 Workloads and Traces  30
5.4 Performance Metrics  31
5.5 DiskSim 4.0 modifications  34
5.5.1 Trace formats: SRT1.6 and SPC  34
5.5.2 Secondary flash cache  36
5.5.3 Smart controller and MS SSD add-on  36
5.6 Validation  36
6 Traffic Savings for Flash as a Victim Disk Cache  39
6.1 Early Update vs. Lazy Update  39
6.2 Lazy Update Policy for Reads  42
6.3 Lazy Update Policy for Writes  42
6.4 The Benefits of Lazy Update Policy  42
6.5 Simulation Parameters  45
6.6 Performance Metrics  48
6.7 Experimental Results  48
6.7.1 Write Traffic Savings  48
6.7.2 DRAM Size Need Not Be Very Large to Be Effective  55
6.7.3 Flash Size Has Little Effect on Write Traffic Savings  55
6.7.4 Miss Ratio Is Improved Slightly  55
6.7.5 Response Time Is Improved Slightly  55
6.7.6 Summary  55
7 Traffic Savings for Flash as a Major Storage Medium  75
7.1 Early Update for Reads  75
7.2 Early Update for Writes  76
7.3 Lazy Update for Reads  76
7.4 Lazy Update for Writes  79
7.5 Simulation Parameters  80
7.6 Performance Metrics  82
7.7 Experimental Results  82
7.7.1 Write Traffic Savings  82
7.7.2 DRAM Size Need Not Be Large to Be Effective  85
7.7.3 Response Time  85
7.7.4 Summary  85
8 Fault Tolerance Issue  93
8.1 What are Supercapacitors?  93
8.2 Why Supercapacitors, not Batteries?  95
8.3 How to Calculate the Capacitance of Supercapacitors  95
9 Conclusions and Future Work  98
9.1 Main Contributions  98
9.2 Future Work  99
9.3 Conclusions  100
Bibliography  101
Appendices  107
A DiskSim 4.0 Flowcharts, Modifications, and Post Processing Scripts  108
A.1 Overview  108
A.2 System Diagram  108
A.3 DiskSim 4.0 flowcharts  110
A.3.1 Main loop  110
A.3.2 Message routing  110
A.3.3 HDD reads with dumb controller (type:1)  110
A.3.4 HDD writes with dumb controller (type:1)  110
A.3.5 HDD reads with smart controller (type:3)  116
A.3.6 HDD writes with smart controller (type:3)  116
A.3.7 SSD reads with dumb controller (type:1)  116
A.3.8 SSD writes with dumb controller (type:1)  121
A.3.9 SSD reads with smart controller (type:3)  121
A.3.10 SSD writes with smart controller (type:3)  121
A.4 Post Processing Scripts  121
A.4.1 g_hplajw.sh source code  125
A.4.2 hplajw.sh source code  128
A.4.3 process.sh source code  129
A.4.4 miss_ratio.p source code  130
A.4.5 res_time.p source code  130

List of Figures

1.1 Lifetime of Flash  4
1.2 Memory taxonomy  6
1.3 Non-volatile memory roadmap  7
1.4 Memory hierarchy of computers  9
2.1 Flash cell structure  12
2.2 Flash programmed state  12
2.3 Flash erased state  13
2.4 SLC vs. MLC  14
3.1 System architecture of Hybrid Hard Drives  21
3.2 System architecture of Solid State Drives  22
5.1 System architecture of modified DiskSim 4.0  37
6.1 Early update for reads  40
6.2 Early update for writes  41
6.3 Lazy update for reads  43
6.4 Lazy update for writes  44
6.5 Relative traffic for OpenMail  49
6.6 Relative traffic for OpenMail  50
6.7 Relative traffic for Synthetic workload  51
6.8 Relative traffic for Synthetic workload  52
6.9 Relative traffic for Websearch3  53
6.10 Relative traffic for Websearch3  54
6.11 Miss Ratio for OpenMail  56
6.12 Miss Ratio for OpenMail  57
6.13 Miss Ratio for Synthetic workload  58
6.14 Miss Ratio for Synthetic workload  59
6.15 Miss Ratio for Websearch3  60
6.16 Miss Ratio for Websearch3  61
6.17 Response Time for OpenMail  62
6.18 Response Time for OpenMail  63
6.19 Response Time for Synthetic workload  64
6.20 Response Time for Synthetic workload  65
6.21 Response Time for Websearch3  66
6.22 Response Time for Websearch3  67
6.23 Improvement of lazy update policy over early update policy for OpenMail  68
6.24 Improvement of lazy update policy over early update policy for OpenMail  69
6.25 Improvement of lazy update policy over early update policy for UMTR (Websearch3)  70
6.26 Improvement of lazy update policy over early update policy for UMTR (Websearch3)  71
6.27 Improvement of lazy update policy over early update policy for synthetic workload  72
6.28 Improvement of lazy update policy over early update policy for synthetic workload  73
7.1 Early update for reads  76
7.2 Early update for writes  77
7.3 Lazy update for reads  78
7.4 Lazy update for writes  79
7.5 Relative traffic for OpenMail  83
7.6 Relative traffic for Synthetic workloads  84
7.7 Relative traffic for Websearch3  86
7.8 Response time for OpenMail  87
7.9 Response time for Synthetic workloads  88
7.10 Response time for Websearch3  89
7.11 Improvement of lazy update policy over early update policy for OpenMail  90
7.12 Improvement of lazy update policy over early update policy for UMTR (Websearch3)  91
7.13 Improvement of lazy update policy over early update policy for synthetic workload  92
8.1 Flash as major store with a supercapacitor backup power  94
8.2 Flash as disk cache with a supercapacitor backup power  94
8.3 Supercapacitor and rechargeable batteries  96
A.1 System architecture of DiskSim 4.0  109
A.2 The main loop of DiskSim 4.0  111
A.3 Run simulation  112
A.4 io-internal-event  113
A.5 Message routing  114
A.6 Flowchart for HDD reads (controller type:1)  115
A.7 Flowchart for HDD writes (controller type:1)  117
A.8 Flowchart for HDD reads (controller type:3)  118
A.9 Flowchart for HDD writes (controller type:3)  119
A.10 Flowchart for SSD reads (controller type:1)  120
A.11 Flowchart for SSD writes (controller type:1)  122
A.12 Flowchart for SSD reads (controller type:3)  123
A.13 Flowchart for SSD writes (controller type:3)  124
A.14 Miss Ratio for Hplajw workload  126
A.15 Response Time for Hplajw workload  127

List of Tables

4.1 Comparison of traffic mitigation techniques  27
5.1 System environment  29
5.2 Workload characteristics  30
5.3 Validation of modified DiskSim 4.0  38
6.1 Disk parameters (QUANTUM QM39100TD-SW)  45

Chapter 1
Introduction

The performance gap between the processor and the storage of computers is widening, with approximately 60% and 10% annual improvement in processors and hard disk drives (HDDs) respectively [1]. The trend is becoming more marked with the advent of multi-socket, multi-core processor architectures. I/O performance, especially I/O operations per second (IOPS), has not caught up with the corresponding improvements in processor performance. Processor utilization stays low due to the need to wait for data to be fed from the storage system [2]. Therefore, storage performance is becoming the bottleneck of a computer system.

Fortunately, there are emerging memory technologies that try to bridge the growing performance gap, such as flash memory, Phase Change RAM (PCM), and Magnetic RAM (MRAM). Notable among them is flash memory, which has been the most widely used nonvolatile memory. There are two types of flash: NOR flash and NAND flash. NOR flash is byte addressable while NAND flash is page addressable. In this dissertation, we are concerned only with NAND flash, which is referred to as flash for short hereafter. Not only has flash been widely used in portable devices as storage media, but flash-based Solid State Drives (SSDs) are also being installed in data centers. SSDs can provide as much as 3,300 write IOPS and 35,000 read IOPS while consuming 2.5 watts of power, whereas even the best HDDs (15K RPM drives) can only offer 300-400 IOPS while consuming 15-20 watts of power [2]. In comparison, the processor can offer 1,000,000 IOPS. Performance in terms of IOPS is critical for enterprise applications serving a large number of users, like web servers, email servers, cloud computing, and cloud storage. The common method used to close the gap is to deploy multiple HDDs working in parallel to support peak workloads. In order to meet the IOPS requirement, more HDDs are added, which results in environments with underutilized storage (well below 50% of their useful storage capacity). The extra storage in turn incurs power and cooling waste.
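To put those figures in perspective, a back-of-envelope comparison using the illustrative IOPS numbers quoted above (not measurements of any particular drive) shows how many high-end HDDs must be ganged together to match a single SSD on random writes:

\[
\frac{3{,}300\ \text{write IOPS (SSD)}}{\sim 350\ \text{IOPS (15K RPM HDD)}} \approx 10\ \text{drives}.
\]

Each of those drives is bought for its spindle count rather than its capacity, which is one reason such configurations end up badly underutilized.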
However, flash can only be written a limited number of times, ranging from 10K to 100K depending on the type of flash used. Traditional solutions to the limited lifetime of flash have focused on the algorithms used within the Flash Translation Layer (FTL), which spread writes evenly across the medium. This is referred to as wear leveling, in which no single cell fails ahead of the others. Wear leveling, however, does nothing more than use up as much of the flash's potential lifetime as possible. Although it increases the lifetime to some extent, endurance remains a concern for write-intensive applications.

The remainder of the chapter is organized as follows: Section 1.1 presents the problem statement. In Section 1.2, we describe the scope of the research. In Section 1.3, we briefly present our contributions. Finally, Section 1.4 gives the organization of this dissertation.

1.1 Problem Statement

1.1.1 Flash Memory

As mentioned above, one of flash's drawbacks is write endurance. The endurance issue stems from cell degradation caused by each burst of high voltage (10V) across the cell. This problem shows no imminent sign of vanishing, for Multi-Level Cell (MLC) write endurance is worse than that of Single-Level Cell (SLC) flash. For example, the write cycles of 2X MLC drop to 10,000 from the 100,000 of SLC. Furthermore, write endurance becomes worse as cells become smaller. To mitigate such drawbacks, a Flash Translation Layer (FTL) has been proposed to manage how the flash resources are used [3]. The FTL keeps mapping tables between logical and physical address spaces. When an update to a file is issued, the FTL writes the file to a blank page, marks the old page as invalid, and updates the mapping tables accordingly. A technique called wear leveling attempts to use all of the memory cells evenly.

In spite of all these efforts, endurance is still a concern for flash-based storage systems. The flash lifetime versus I/Os per second is shown in Figure 1.1 [4]. The lifetime drops as the number of I/Os increases. As Kim et al. [4] put it, "Although MTTFs for HDDs tend to be of the order of several decades, recent analysis has established that other factors (such as replacement with next, faster generation) implies a much shorter actual lifetime and hence we assume a nominal lifetime of 5 years in the enterprise." As we can see from the figure, when the number of I/Os exceeds approximately 50 IOPS, the lifetime of flash is less than 5 years. For example, the OpenMail workloads we used have 98 I/Os per second, which corresponds to a lifetime of less than 2 years. Therefore, endurance is one of the factors that would hinder further application of flash-based storage systems to heavily write-intensive workload environments, such as some embedded or enterprise systems, where the storage system needs to work 24/7 with heavy write traffic. Thus, enhancing flash endurance is needed.
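A rough calculation illustrates why sustained write traffic erodes lifetime so quickly. The numbers below are assumed purely for the sake of the example (a small 1 GB flash disk cache rated at 10K program/erase cycles absorbing about 100 write IOPS of 4 KB each) and are not taken from [4]:

\[
\begin{aligned}
\text{total writable bytes} &\approx 10{,}000 \times 10^{9}\ \text{B} = 10^{13}\ \text{B},\\
\text{write traffic per day} &\approx 100 \times 4\,\text{KB} \times 86{,}400\ \text{s} \approx 3.5 \times 10^{10}\ \text{B/day},\\
\text{lifetime} &\approx \frac{10^{13}}{3.5 \times 10^{10}} \approx 300\ \text{days},
\end{aligned}
\]

well under the nominal five-year target, even before write amplification is taken into account.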
1.1.2 Flash Trends

According to [5], flash trends can be summarized as: bigger, faster, and cheaper, but not better.

Bigger: Flash component densities are doubling at a rate greater than Moore's Law. With a new type of 3D stacking technique, called TSOP stacking, the density can be doubled again. In addition, more dies can be stacked in some types of BGA packages. Currently, MLC technology makes it possible for SSDs to have capacities of 512 GB to 1 TB in 2.5-inch form factors.

Faster: A single operation is needed to program an entire page or erase an entire block. Depending on component density, page size in 2009 varies from 2 KB and 4 KB to 8 KB, while block size can range from 16 KB to 512 KB. The operation time is largely independent of page size and block size. Therefore, as page size and block size become bigger, the effective program/erase speed becomes faster.

[Figure 1.1: Lifetime of Flash]

Cheaper: As densities go up, price goes down. For example, 2X MLC is roughly half the cost of the equivalent-density SLC component. In 2004, 8 GB SLC-based SSDs sold for about $4,000, while the 40 GB Intel X25-V sold for $129.99 in January 2010: five times the capacity at about 1/30 the price in about six years.

Not better: Better means more reliable in terms of endurance and data retention. Data retention is defined as the length of time a charge remains on the floating gate after the last program operation. Cells wear out with program/erase cycles, which makes it more difficult to keep the electrons in place. Therefore, there is an inverse relationship between endurance and data retention: the more program/erase cycles, the shorter the data retention. As NAND manufacturers struggle for lower cost per bit, they keep sacrificing endurance and data retention. The best-known trade-off is MLC vs. SLC, in which a 2:1 or 3:1 cost benefit is obtained through a 10:1 reduction in rated endurance.

A classification of memory technologies is given in Figure 1.2 [6]. Flash has been classified as "mature nonvolatile memory". Most importantly, flash is not likely to disappear in the near future, as can be seen in Figure 1.3 [7]. Although there are emerging memories, such as PCM (PCRAM) and MRAM, they have not reached the maturity level needed to replace flash. Flash will coexist with the emerging memories for 5-10 years. Therefore, extending the flash lifetime needs to be explored.

[Figure 1.2: Memory taxonomy]

1.1.3 Limitations of Existing Solutions

There are two basic methods to solve or mitigate the short flash lifetime: 1) use cells efficiently, and 2) reduce the write traffic to flash. Most research uses the first method, e.g., wear leveling. Since flash lifetime is directly related to program/erase cycles, reducing write traffic to flash will extend the flash lifetime. However, little research has employed the second method. We did find Soundararajan et al.'s paper [8] using the second method; however, they use a disk-based write cache instead of a DRAM cache to save write traffic.

RAM buffers have been proposed to improve flash performance. Park et al. [9] reduce writes by evicting clean data first. Jo et al. [10] propose a flash-aware buffer management scheme to decrease the number of erase operations by choosing victims based upon page utilization. Kim et al. [11] use three key techniques, block-level LRU, page padding, and LRU compensation, to improve random write performance. Gupta et al. [12] enhance random write performance by selectively caching page-level address mappings. However, the above studies, except Soundararajan's work, did not concentrate on the lifetime issue, although Kim et al. mentioned erase counts in their papers.

Next, a common problem of using DRAM is the fault tolerance issue inherent in volatile memory.
Although Kim et al. [11] suggested using a small battery or capacitor to delay shutdown until the RAM content is backed up to flash, both batteries and capacitors have their limitations. Batteries have short lifetimes, requiring maintenance and replacement during the lifetime of the device, while capacitors of a reasonable size do not store enough energy to last long enough for the data to be transferred to flash. Finally, flash as a disk cache was not studied in their research.

[Figure 1.3: Non-volatile memory roadmap]

1.2 Scope of the Research

In this dissertation, we focus on extending the flash lifetime, where flash is used at the device level as shown in Figure 1.4. We propose to use flash as a victim device of a DRAM disk cache, with a supercapacitor backup power, to prolong its lifetime. Unlike traditional solutions, our method uses the second approach listed above: decreasing the traffic to flash. Intuition tells us that a DRAM disk cache can reduce writes by coalescing and merging multiple writes into a single write. However, on second thought, one might doubt that it would be effective, given that there is a larger file system cache in main memory, above the device level, which already merges writes. For example, a typical DRAM disk cache in an HDD is 16 MB/32 MB, whereas the file system cache can be 512 MB, or all free memory, for a PC with 4 GB of main memory. Thus, it is far from clear whether such a small DRAM disk cache at the device level could indeed reduce traffic to persistent storage. Moreover, researchers have been reluctant to use DRAM in the disk cache for this purpose because [8] DRAM is volatile memory and thus could cause loss of data upon power failure. Yet, whether DRAM is suitable for reducing traffic depends on the following questions:

- How much write traffic would be reduced by using a DRAM cache?
- What is the minimal size of DRAM cache needed to be effective as a traffic saver?
- How would flash size impact the traffic savings?
- What sort of update policy should be used?

In this dissertation, we attempt to answer these questions.

[Figure 1.4: Memory hierarchy of computers]

1.3 Contributions

A storage system is an indispensable part of a computer system. The success of this dissertation will widen the horizon for using flash in storage systems, where the endurance issue hinders its wider application. It will narrow the widening performance gap between the processor and the storage system, which has plagued computer systems for several decades. It also prepares storage systems for the transition to new kinds of memory technologies. It will have an impact on almost all areas that use computers as a tool, since the storage system is a fundamental part of a computer system. More detailed contributions are presented in Section 9.1.

1.4 Dissertation Organization

This dissertation is organized as follows. Chapter 2 presents the literature review. Chapter 3 shows the system architecture. In Chapter 4, we compare several traffic mitigation techniques. In Chapter 5, our research methodology is discussed. Chapter 6 presents our experimental results when flash acts as a victim disk cache. In Chapter 7, we show our experimental results when flash is used as a primary store. In Chapter 8, the fault tolerance issue is discussed. In Chapter 9, we conclude this dissertation and discuss future work.
Chapter 2
Literature Review

In this chapter, we briefly present previous literature related to flash fundamentals and research on flash endurance that is most relevant to our research. The remainder of the chapter is organized as follows: Section 2.1 presents the fundamentals of flash physics. In Section 2.2, we show work related to flash acting as a disk cache. In Section 2.3, we discuss work related to flash being used as part of system memory. In Section 2.4, we show work related to SSDs and reliability. Finally, Section 2.5 covers work related to using write traffic savings to extend the lifetime of flash.

2.1 Flash Basics

The fundamentals of flash physics help us understand the properties of flash. The flash cell structure is shown in Figure 2.1 [5]. Each cell is made up of one such transistor. In an SLC device, each transistor holds one bit of data, while an MLC transistor holds multiple bits. Data is written to the cell by electron tunneling: a high enough voltage is applied to the gate, creating a powerful electric field such that electrons tunnel through the oxide and into the floating gate, as shown in Figure 2.2. The electrons remain in the floating gate after the voltage is removed. Apply the voltage across the channel instead of the gate, reversing the bias, and the electrons go in the other direction, as shown in Figure 2.3. We thus have two states, 0 and 1, and the state is unchanged even if the cell has no power, making it ideal for a storage device.

There are two types of flash memory: NOR and NAND [13]. They were invented to replace EPROMs and EEPROMs. Both types can be read and written. Like their predecessors, a non-blank flash cell must be erased before further writing. A single cell cannot be erased; only a block of cells can be erased together.

[Figure 2.1: Flash cell structure] [Figure 2.2: Flash programmed state] [Figure 2.3: Flash erased state]

NOR flash is byte accessible, meaning that an individual byte can be read or written. This characteristic is suitable for holding code. Thus, NOR flash was quickly adopted as a replacement for EPROMs and EEPROMs. Unlike NOR flash, NAND flash is not byte-accessible but page-accessible, which means that only pages of data can be streamed in or out of the NAND flash. This change makes NAND flash denser than NOR flash; the standard NAND cell is 2.5 times smaller. Therefore, NAND flash is much cheaper than NOR flash and is more suitable for mass storage. In this dissertation, we are concerned with NAND flash; the words flash and NAND flash will be used interchangeably.

Flash has been developed into two forms: single-level cell (SLC) and multi-level cell (MLC) [14], as shown in Figure 2.4. In SLC technology, each cell represents one bit, while in MLC each cell can represent more than one bit. 2-bit MLC has been commercially used; 3-bit and 4-bit MLC are on the manufacturers' roadmaps. SLC needs one reference voltage (V1) to distinguish 0 from 1. In a 2-bit MLC, three threshold voltages (V1, V2, and V3) are needed. MLCs with more bits tend to be more error prone, because the more bits per cell, the smaller the voltage margins.

[Figure 2.4: SLC vs. MLC]
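The reference-voltage count generalizes directly; this is the standard MLC sensing argument rather than a property of any particular device. A cell storing k bits must distinguish 2^k charge states and therefore needs

\[
2^{k} - 1 \ \text{reference voltages} \qquad (k=1: 1,\quad k=2: 3,\quad k=3: 7),
\]

so each added bit roughly halves the voltage margin between adjacent states, which is why error rates and endurance worsen for higher-order MLC.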
There are three basic operations: reading, erasing, and programming [15]. The read operation is performed by applying to the cell a gate voltage that lies between the threshold values of the erased and programmed cells and sensing the current flowing through the device. The write operation involves three main steps: read, erase, and program. Erasing sets all of the bits to 1. Programming only occurs on pages that are erased, since the program operation can only change a bit from 1 to 0. The erase operation requires a high-voltage pulse to be applied to the source while the control gate is grounded and the drain is floating. Programming is an iterative process: the controller applies voltage to the gate (or the channel), allows some electrons to tunnel, and checks the threshold voltage of the cell. When the threshold voltage has reached a predetermined value, the data is stored.

NAND flash consists of blocks, which are composed of pages. A block is the smallest erasable unit of storage. Erasing a block is an expensive operation, taking about 1.5-2 ms in an SLC device and about 3 ms in an MLC device. Each block consists of a set of addressable pages, where SLC usually has 64 pages per block and MLC has 128 pages per block. A page is the smallest programmable unit. Each SLC page is usually 2 KB in size, and a 2X MLC page is 4 KB in size.

2.2 Flash Disk Cache

One usage model of flash is to act as a disk cache. Hsieh et al. [16] proposed using flash as a disk cache to save energy and improve performance. They did not study the relationship between the flash disk cache and the upper-layer DRAM disk cache; they focused on a lookup mechanism, comparing several lookup mechanisms in terms of energy efficiency and product lifetime. However, they assumed 1M (10^6) endurance cycles to estimate the product lifetime. Current high-density chips have fewer endurance cycles (10^5), and MLC NAND flash has fewer still, typically being rated at about 5K-10K cycles. Therefore, the lifetime estimated by Hsieh et al. is no longer accurate with regard to current technology. In addition, they assumed that all reads and writes are inserted into the flash disk cache. However, some data, such as streaming data, are unlikely to be reused frequently; allowing such data to go through the cache wastes the lifetime of flash.

Bisson et al. [17] presented four techniques to improve system performance using the characteristics of hybrid disks. They made use of the NVCache provided in hybrid disks to lengthen the spin-down time, dividing the NVCache into two parts: one for writes and one for reads. In other work, Bisson et al. [18] [19] used flash as a backup write buffer, in which the data stored in flash is used only in case of power failure; flash acts as mirror storage for a DRAM write buffer. The limitation of this method is that endurance suffers when the flash is small, so it is not suitable for heavy write workloads.

Microsoft ReadyDrive [20] makes use of hybrid hard disk drives. Microsoft ReadyDrive is used to control the pinned set. The flash is mainly used to store boot- or program-launch-related data. It is also used as a non-volatile write buffer while the hard disk drive is in sleep mode. However, the details of how ReadyDrive controls flash remain a commercial secret.

Kim et al. [21] proposed using a PCM and NAND flash hybrid architecture for embedded storage systems. They used PCM to overcome the random, small write issue of flash. Their findings are beneficial for our research on general mass storage systems with PCM and flash.
Joo et al. [22] take advantage of pinned-set flash in a hybrid drive to hold application code. They demonstrate how to decrease application launch times with only a small part of the applications stored in the pinned set of flash memory. By pinning 5% and 10% of the application launch sequences, launch times are improved by 15% and 24% respectively.

2.3 System Memory Model

Another usage model of flash is to present flash as part of the system memory. Microsoft ReadyBoost belongs to this category. It can use nonvolatile flash memory devices, such as universal serial bus (USB) flash drives, to improve performance. The flash memory device serves as an additional memory cache. Windows ReadyBoost relies on the intelligent memory management of Windows SuperFetch. Intel TurboMemory [23] can be used for both ReadyDrive and ReadyBoost. Kgil et al. [24] [25] [26] also proposed a flash cache at the main memory level. They too split the flash cache into a write region and a read region. But such splitting would decrease the lifetime of flash memory because small portions of flash are overused, whereas a large unified flash would mitigate the flash write endurance issue by wear-leveling across a large range.

2.4 Solid State Drives and Reliability

The third usage model of flash is to use flash as the main storage medium in solid state drives. A solid state drive (SSD) is a storage device that uses solid-state memory. An SSD is commonly composed of flash memory along with DRAM [27]. Several techniques, such as wear leveling, error detection and correction, and block management algorithms, have been employed to extend the life of an SSD at the drive or system level.

Wear Leveling: This is a technique for extending the service life of some kinds of erasable computer storage media, such as EEPROM and flash [28]. Wear leveling attempts to work around endurance limitations by arranging data so that erasures and re-writes are distributed evenly across the medium. Therefore, it prevents some portions of the flash cells from being overused, so that no single cell fails ahead of the others. Wear leveling is implemented in the Flash Translation Layer (FTL). It can also be implemented in software by special-purpose file systems such as JFFS2 [29], YAFFS [30], and ZFS [31].

FTLs fall into three categories based on the mapping granularity [32]:

- Page-level FTL
- Block-level FTL
- Hybrid FTL with a mixture of page and block mapping

In a page-level FTL, the page is the mapping unit. The logical page number is used to map requests from upper layers, such as the file system, to any page of the flash. The advantage of this scheme is its high utilization efficiency. However, the size of the mapping table is large. For example, 32 MB of space is needed to store the page-level mapping table for a 16 GB flash memory.
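The 32 MB figure can be checked with a quick calculation, assuming (as is typical, although the exact entry width depends on the FTL) a 4-byte physical page number per mapping entry:

\[
\frac{16\ \text{GB}}{2\ \text{KB/page}} = \frac{2^{34}}{2^{11}} = 2^{23} \approx 8\,\text{M pages}, \qquad 8\,\text{M entries} \times 4\ \text{B} = 32\ \text{MB}.
\]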
On the other hand, a block-level FTL uses the logical block number to translate requests between flash and the file system. Block-level FTL is better than page-level FTL in terms of mapping table size but worse in terms of utilization efficiency. The size of the mapping table is reduced by a factor of block size/page size (e.g., 128 KB / 2 KB = 64) compared to page-level FTL. Hybrid FTL has been proposed to take advantage of both page-level and block-level FTL. In a hybrid FTL, flash blocks are divided into two groups: data blocks and log blocks. Data blocks use block-level mapping while log blocks use page-level mapping.

According to Chang et al. [33] [34] [35], there are two kinds of wear leveling techniques: dynamic wear leveling (DWL) and static wear leveling (SWL) [36] [33]. In DWL, only updated data (hot data) is involved in wear leveling, while unchanged data remains in place. In contrast, SWL proactively moves static data so that all data plays a role in wear leveling. According to the authors, SWL achieves much better write endurance than DWL. Beyond these, there is much research [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] dealing with write endurance through wear leveling. Although the detailed techniques differ, the goal remains the same: use the memory cells evenly.

Error Detection and Correction: Like DRAM, flash is not error-free memory. Even worse, flash wears out with writes and exhibits more errors as the number of program/erase cycles increases. Therefore, error detection and correction should be applied to flash memory to guarantee correct data. An error-correcting code (ECC) or parity is used to detect and correct errors. There are trade-offs between the number of bits corrected, cost, and complexity.

Storage Management Algorithms: To enhance endurance, the SSD controller manages bad blocks and spare blocks. Bad blocks are recorded in a table, and spare blocks are used to replace them. In general, an SSD is equipped with spare blocks amounting to 1 to 2% of its capacity; more (e.g., 50%) can be provided if more reliability is needed.

2.5 Reducing Write Traffic

Usually, a non-volatile write cache or log region is employed in front of the flash medium to catch the write traffic so that less write traffic reaches the flash medium. Some works [21] [47] use emerging memory technologies (e.g., PCM) as the non-volatile log region. Unlike flash, PCM supports in-place updating; in addition, PCM is faster than flash. Due to the front-end PCM log region, traffic to the flash is decreased. Soundararajan et al. [8] even use disk-based write caches to extend SSD lifetimes. Their design is based on the following observations. First, there are many overwrites of a small set of popular blocks at the block device level. Second, thanks to the file system cache, there are not many immediate reads following a write. Finally, HDDs are excellent at sequential writes and reads. Therefore, HDDs fit a log-structured write cache perfectly. Qureshi et al. [48] have proposed using a DRAM buffer to lazily write to PCM in order to improve write performance as well as the lifetime of PCM. However, their target is at the system level rather than at the device level. Additionally, they did not deal with the power failure issue. Therefore, our research is complementary to theirs.

Using a DRAM cache with supercapacitor backup power to save write traffic does not require much extra effort, since DRAM is an indispensable part of almost all flash controllers. A slightly larger DRAM might be needed to serve as a traffic saver.

Chapter 3
System Architecture

There are three major usage models of flash memory:

1. System memory model
2. Disk caches
3. Storage devices (SSDs)

Since our focus is on secondary storage, the system memory model is beyond the scope of this dissertation. Therefore, we deal with two usage models: disk caches and storage devices. For both categories, we propose to use flash as a victim device of the DRAM cache.

The first category is shown in Figure 3.1. There are two levels of disk cache: the primary DRAM cache and the secondary flash cache. The flash cache is only updated upon data being evicted from the DRAM cache.
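The sketch below illustrates the intended write path under this victim-cache arrangement with the lazy update policy. It is an illustrative outline only, not our actual DiskSim modification; all of the types and helper functions (cache_line_t, dram_lookup, dram_alloc, flash_cache_insert, and so on) are hypothetical names invented for this example.

    /* Illustrative sketch of the two-level disk cache with lazy update.
     * All types and helpers here are hypothetical, not DiskSim code.   */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t lba_t;

    typedef struct {
        lba_t lba;     /* logical block address of the cached data          */
        bool  dirty;   /* modified since it was last written downstream     */
    } cache_line_t;

    /* Assumed helpers, provided elsewhere in a real implementation. */
    extern cache_line_t *dram_lookup(lba_t lba);               /* NULL on miss            */
    extern cache_line_t *dram_alloc(lba_t lba,                 /* allocate a DRAM line,   */
                                    cache_line_t *evicted,     /* filling *evicted        */
                                    bool *had_victim);         /* if a line was evicted   */
    extern void flash_cache_insert(const cache_line_t *line);  /* secondary (flash) cache */
    extern void flash_cache_invalidate(lba_t lba);

    /* Lazy update: a host write touches only the DRAM cache; flash sees the
     * data only when the line is later evicted.  Under the early update
     * policy, flash_cache_insert() would instead be called on every write. */
    void host_write(lba_t lba)
    {
        cache_line_t *line = dram_lookup(lba);

        if (line == NULL) {
            bool had_victim = false;
            cache_line_t evicted;

            line = dram_alloc(lba, &evicted, &had_victim);
            if (had_victim)
                flash_cache_insert(&evicted);   /* the only write flash receives */
        } else {
            flash_cache_invalidate(lba);        /* any flash copy is now stale   */
        }
        line->dirty = true;                     /* overwrites coalesce in DRAM   */
    }

The same structure applies when flash is the primary store (the second category described next): the eviction target is then the flash medium itself rather than a flash disk cache.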
The second category is shown in Figure 3.2. There is only one level of disk cache. Like the first category, flash is updated only when the data is evicted from the DRAM cache. The main purpose of doing this is to reduce write traffic to flash, thereby extending the lifetime of flash.

There is one main issue here, though: fault tolerance. The data in the DRAM cache will be lost in the event of a power failure or system crash. This issue must be solved before the approach can be considered a practical choice. Unlike main memory, secondary storage is expected by operating systems to be data-safe. We provide our solution to this issue in Chapter 8.

[Figure 3.1: System architecture of Hybrid Hard Drives]
[Figure 3.2: System architecture of Solid State Drives]

Chapter 4
Comparison of Traffic Mitigation Techniques

The idea of traffic mitigation is to have a non-volatile cache in front of the flash medium. Apart from our solution (DRAM with a supercapacitor backup power), such a cache can be:

- Battery-backed DRAM
- SLC as a write cache for MLC
- PCM as a write cache
- HDDs

Since a thorough comparative analysis of all the options is beyond the scope of this dissertation, we briefly describe a few other designs and compare them qualitatively with our solution. The rest of the chapter is organized as follows: Section 4.1 compares our solution with battery-backed DRAM. In Section 4.2, we discuss SLC as a write cache. In Section 4.3, we discuss PCM. Section 4.4 covers HDDs as a write cache. Finally, in Section 4.5, we present a summary of the traffic mitigation techniques.

4.1 Battery-backed DRAM as a write cache

The difference between battery-backed DRAM and our solution lies only in the way the backup power is supplied. As calculated in Chapter 8, a period of 10 seconds is long enough for the contents of DRAM to be transferred into flash. Therefore, we do not need a long-lasting (hours) backup power supply; in the presence of flash, a supercapacitor backup power is a perfect fit. It does not suffer from many of the drawbacks of batteries, such as regular maintenance or replacement, a limited number of charge/discharge cycles, slow charging, and degradation with shallow discharge. More discussion is presented in Chapter 8.

4.2 SLC as a write cache for MLC

SLC can serve as a write cache for MLC because each SLC block has more write cycles. However, SLC is expensive; to be feasible, the SLC must be small. The problem is that SLC also suffers from limited write endurance, although its write endurance is 10 times better than that of MLC. Therefore, SLC endurance should be taken into account as a design constraint along with MLC endurance. Since SLC has 10 times the endurance of MLC, to reach the same lifetime the SLC can be a tenth as large as the MLC. According to Soundararajan et al. [8], if SLC receives twice as many writes as MLC (where MLC receives 50% write savings), the SLC should be a fifth as large as the MLC. Hence, the larger the backing MLC SSD, the larger the SLC cache: for an 80 GB MLC SSD, 16 GB of SLC is needed. It is likely that 16 GB of SLC will remain too expensive for an 80 GB MLC SSD to afford. In contrast, in our solution the size of the primary store has little impact on the size of the cache; the size of the cache alone determines the write savings, no matter how large the primary store is.

4.3 PCM as a write cache

Phase-change memory (also known as PCM, PRAM, PCRAM, etc.) is a type of non-volatile computer memory.
According to the ITRS [7], its development is underway and it is at the border of qualification/pre-production. PCM is one of a number of new memory technologies competing for the non-volatile role with the almost universal flash memory. PCM is a potential choice as a write cache for SLC/MLC due to the following characteristics [49]:

- Nonvolatile
- Fast read speeds: access times comparable to DRAM
- Fast write speeds: significant write speed improvement over NOR and NAND, with no erase needed

However, the write speed of PCM is far less than that of DRAM. Additionally, PCM still suffers from a write endurance limit (about 10^8 cycles), although this is a large improvement over SLC/MLC; it would still be a concern for write-intensive applications. It is not yet clear how PCM will fare in terms of price and production readiness; only time will tell. However, our study does not exclude the use of PCM.

4.4 HDDs as a write cache

HDDs have been proposed as a write cache for flash to take advantage of the fact that a SATA disk drive can deliver over 80 MB/s of sequential write bandwidth. Two observed characteristics support the use of HDDs as a write cache. First, at the block level, there are many overwrites of a small set of popular blocks. Second, reads do not immediately follow writes, which gives the write cache a grace period to flush its contents into flash before reads arrive. There are two competing imperatives: on one hand, data should stay longer in the write cache in order to catch more overwrites; on the other hand, data should be flushed into flash early to avoid an expensive read penalty from the HDDs. Therefore, many flush triggers must be designed to achieve the goal of more overwrites with less read penalty, which makes the scheme quite complicated. In contrast, data in our solution can stay as long as the capacity of the DRAM cache allows, since there is no read penalty.

Flash-based SSDs have many advantages over HDDs in terms of:

- Acoustic levels: SSDs have no moving parts and make no sound, unlike HDDs.
- Mechanical reliability: SSDs contain no moving parts, thereby virtually eliminating mechanical breakdowns.
- Susceptibility to environmental factors: compared to traditional HDDs, SSDs are typically less susceptible to physical shock. Additionally, they have wider temperature ranges.
- Weight and size: flash memory and its circuit board material are very light compared to HDDs.
- Power consumption: high-performance flash-based SSDs generally require 1/2 to 1/3 the power of HDDs.
- Magnetic susceptibility: SSDs are unaffected by magnets or magnetic surges, which can alter data on HDD media.

By introducing HDDs into SSDs, the device as a whole loses its advantages in the above respects. Therefore, using HDDs as a write cache for SSDs may hinder their wider application. Our experimental results show no less write savings using a DRAM cache than using an HDD cache, although the comparison may not be precise since we use different traces from Soundararajan et al. [8].

4.5 Summary

The comparison between our research and several others is presented in Table 4.1. As we can see from the table, a log structure has been employed in [8] [21] [47]. As mentioned in [8], a log structure is good for writes but incurs a read penalty.
Table 4.1: Comparison of traffic mitigation techniques

Kim et al. [21]: media used to reduce traffic: PCM data region; traffic handled: metadata; update speed: slow (PCM); role of flash: main storage; usage: embedded devices; workload characteristics: block device; read penalty due to log-structured writes: yes; data loss upon power failure: no; comparative added cost: high; mature technology: no (PCM).

Sun et al. [47]: media: PCM as a log region of flash; traffic handled: metadata; update speed: slow (PCM); role of flash: main storage; usage: secondary storage; workload characteristics: block device; read penalty: yes; data loss upon power failure: no; added cost: high; mature technology: no (PCM).

Qureshi et al. [48]: media: DRAM; traffic handled: all data; update speed: fast (DRAM); role of flash: N/A; usage: main memory; workload characteristics: main memory; read penalty: no; data loss upon power failure: yes; added cost: low; mature technology: no (PCM).

Soundararajan et al. [8]: media: disk-based write caches; traffic handled: all data; update speed: slow (HDD); role of flash: main storage; usage: secondary storage; workload characteristics: block device; read penalty: yes; data loss upon power failure: no; added cost: high; mature technology: yes (HDD).

Ours: media: DRAM (supercapacitor); traffic handled: all data; update speed: fast (DRAM); role of flash: disk cache/main storage; usage: secondary storage; workload characteristics: block device; read penalty: no; data loss upon power failure: no; added cost: low; mature technology: yes (DRAM).

Chapter 5
Methodology

Extensive experiments were conducted with a disk simulator. Our simulator is based on DiskSim 4.0 and the Microsoft SSD add-on, both of which are implemented in C. In this chapter, we describe the methodology used in this dissertation. The rest of the chapter is organized as follows: Section 5.1 presents the DiskSim 4.0 simulator. In Section 5.2, we discuss the system setup. In Section 5.3, we describe the workloads we used. In Section 5.4, the metrics are presented. Section 5.5 shows the modifications we made to DiskSim 4.0. Finally, Section 5.6 presents validation of the modified DiskSim 4.0.

5.1 Simulator

We evaluated the proposed architecture using DiskSim 4.0 [50] with a flash extension. DiskSim is a widely used disk drive simulator in academia and industry alike. DiskSim is an event-driven simulator; it emulates a hierarchy of storage components such as buses and controllers as well as disks. In 2008, Microsoft Research implemented an SSD module [51] on top of DiskSim. Based on that, we made changes to the code to reflect our architecture.

5.2 System Setup

The system environment in which we run our simulator is presented in Table 5.1.

Table 5.1: System environment

Hardware:
  CPU: Intel(R) Core(TM)2 Quad Q6600, 2.4 GHz
  L2 cache: 4096 KB
  Memory: 2 GB
  Storage: Intel SATA X25-M 80 GB SSD
Software:
  Ubuntu 10.10 (Linux 2.6.35-22 generic), gcc/g++ v4.4.5, bison 2.4.1, flex 2.5.35, GNU bash 4.1.5, perl v5.10.1, python 2.6.6, gnuplot 4.4, dia 0.97.1, pdfTeX 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian), pdfcrop 1.2, latex2rtf 2.1.0, latex2rtf-gui 0.5.3, texmaker 2.0, emacs 23.1.1 (GTK+ version 2.20.0)

Table 5.2: Workload characteristics
(Dev = device number; TIOR = total I/O requests; IOPS = I/Os per second for the trace; ARS = average request size in KB, read/write/overall)

Trace        Dev   TIOR      Reads             IOPS    ARS (KB)
OpenMail     080   51295     36373 (70%)       98.29   5.0/10.0/6.5
             096   363789    119400 (32%)              7.8/7.3/7.5
             097   366223    117530 (32%)              7.9/7.5/7.6
             098   421270    219553 (52%)              3.6/5.6/4.6
             099   424887    227744 (53%)              3.5/5.6/4.5
             100   295536    152372 (51%)              3.5/5.4/4.4
UMTR Web3    0     1439434   1439434 (100%)    5.27    15.2/-/15.2
             1     1410370   1409113 (99%)             15/26/15
             2     1410493   1410492 (99%)             15/8.0/15.5
             3     489       487 (99%)                 15/8.0/15
             4     486       486 (100%)                15.1/-/15.1
             5     434       434 (100%)                16.3/-/16.3
Synthetic    0     10000     6600 (66%)        40.05   6.4/6.2/6.3
workloads

5.3 Workloads and Traces

Many studies of storage performance analysis use traces of real file system requests to produce more realistic results. The traces we used are from HP [52] [53] [54]: OpenMail. In addition, we used disk traces from the University of Massachusetts Trace Repository (UMTR) [55] to test the impact of different update policies on the disk behavior of enterprise-level applications like web servers, database servers, and web search.
We also generated synthetic workloads that represent typical access distributions and approximate real disk usage. The characteristics of the workloads used in this dissertation are shown in Table 5.2. Within each trace we selected the first device (which is representative of the workload characteristics of its class) to report simulation results.

OpenMail [52]: a one-hour trace from five servers running HP's OpenMail, collected during the servers' busy periods in 1999.

UMTR [55]: there are two kinds of traces: OLTP application I/O and search engine I/O. The former includes two I/O traces (Financial1.spc and Financial2.spc) from OLTP applications running at two large financial institutions. The latter includes three traces (Websearch1.spc, Websearch2.spc, and Websearch3.spc) from a popular search engine. These traces are made available courtesy of Ken Bates from HP, Bruce McNutt from IBM, and the Storage Performance Council.

Synthetic workloads [50]: the synthetic workloads were generated using the DiskSim 4.0 workload generator, which can be configured to produce a wide range of synthetic workloads. In addition, probability-distribution parameters can be set to uniform, normal, exponential, Poisson, or twovalue.

5.4 Performance Metrics

Response time and throughput are the two most common metrics of I/O performance. Response time is measured from the moment a request is issued until the request is served. Throughput measures the capability of a system as the number of I/Os completed per second. The difference between response time and throughput is whether we measure a single task (response time) or many tasks (throughput). According to Hsu and Smith [1], throughput is hard to quantify in trace-driven simulation because the workload is fixed by the trace. However, they maintain that throughput can be estimated as the reciprocal of the average service time, which tends to be an optimistic estimate of the maximum throughput.

Additionally, we examine the miss ratio of the read cache and the write cache. The miss ratio is the fraction of I/Os that must access the physical devices (HDDs). Finally, flash endurance is measured by the write traffic (in bytes) sent to the flash. To compare the two policies (early update and lazy update), we introduce the relative traffic τ, defined as

\[ \tau = \frac{W_{\mathrm{flash}}}{W} \qquad (5.1) \]

where
- W_flash is the write traffic to flash, and
- W is the write traffic to the device.

The ideal flash lifetime, which does not take write amplification into consideration (write amplification is discussed later), can be expressed as

\[ L = \frac{E \, C}{w} \qquad (5.2) \]

where
- L is the flash lifetime in days,
- E is the rated maximum number of flash write cycles (10K-100K, depending on the type of flash),
- C is the capacity of the device, and
- w is the write traffic per day.

As we can see from (5.2), flash lifetime depends on the rated write cycles E and the capacity of the device C. Since we compare the early and lazy update policies using the same kind and size of flash, we would like to eliminate these two factors from the equation. To do so, we introduce the relative lifetime λ:

\[ \lambda = \frac{L_1}{L_2} = \frac{E\,C/w_1}{E\,C/w_2} = \frac{w_2}{w_1} = \frac{W_2}{W_1} = \frac{\tau_2\,W}{\tau_1\,W} = \frac{\tau_2}{\tau_1} \qquad (5.3) \]

where
- L_1 is the flash lifetime under the early update policy,
- L_2 is the flash lifetime under the lazy update policy,
- t is the time in days the flash is used,
- w_1 and w_2 are the write traffic to flash per day under the early and lazy update policies,
- W_1 and W_2 are the total write traffic to flash under the early and lazy update policies,
- W is the total traffic to the device, and
- τ_1 and τ_2 are the relative traffic under the early and lazy update policies.
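To make (5.1)-(5.3) concrete, the small C fragment below turns write-traffic counters into relative traffic and relative lifetime; the structure and variable names are ours for illustration and are not taken from DiskSim.

#include <stdio.h>

/* Write-traffic totals accumulated over a simulation run (bytes).
 * The names are illustrative; DiskSim keeps its statistics differently. */
struct traffic {
    double to_flash;   /* W_flash: write traffic that reached the flash */
    double to_device;  /* W: write traffic issued to the device         */
};

/* Relative traffic, Eq. (5.1): tau = W_flash / W. */
static double relative_traffic(struct traffic t)
{
    return t.to_flash / t.to_device;
}

/* Relative lifetime, Eq. (5.3): lambda = L_early / L_lazy = tau_lazy / tau_early.
 * Rated write cycles and device capacity cancel, so they are not needed here. */
static double relative_lifetime(double tau_early, double tau_lazy)
{
    return tau_lazy / tau_early;
}

int main(void)
{
    /* Hypothetical run in which lazy update halves the flash write traffic. */
    struct traffic early = { 1000.0e6, 1000.0e6 };
    struct traffic lazy  = {  500.0e6, 1000.0e6 };

    double lambda = relative_lifetime(relative_traffic(early),
                                      relative_traffic(lazy));
    printf("lambda = %.2f (lazy update extends lifetime by %.1fx)\n",
           lambda, 1.0 / lambda);
    return 0;
}

With these hypothetical totals the program prints lambda = 0.50, i.e., halving the write traffic to flash doubles the ideal lifetime, which is the intuition behind the comparison that follows.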
From (5.3), we see that the relative lifetime λ can be expressed using the relative traffic τ. The advantage of comparing relative lifetime through relative traffic is that the rated write cycles E and the capacity of the device C drop out of the equation.

The flash lifetime achieved in practice is smaller than the ideal flash lifetime because of a factor called write amplification. Write amplification refers to the phenomenon that n bytes of write traffic from a file system are translated into nm (m > 1) bytes of write traffic to flash. Several factors [56] contribute to write amplification. First, wear leveling causes write amplification, because cold/hot data swapping consumes extra write cycles. Second, garbage collection contributes to write amplification: to reclaim invalid pages, the valid pages within the blocks being reclaimed must be relocated, which also consumes extra write cycles. Write amplification varies across FTLs; simpler FTLs generally exhibit larger write amplification, while more sophisticated FTLs have smaller write amplification. Many efforts have been made to minimize write amplification and thereby extend flash lifetime. According to Soundararajan et al. [8], flash lifetime can be an order of magnitude worse than the optimum even for advanced FTLs, such as the one in the Intel X25-M MLC SSD. Because of write amplification, the mapping between reduced write traffic and increased lifetime is not straightforward. However, cutting the write traffic in half will at least double the lifetime, because write amplification is reduced as well [8].

5.5 DiskSim 4.0 Modifications

In the course of this research, we made several modifications to DiskSim 4.0. First, we added two trace formats, SRT 1.6 and SPC, together with a trace filter that controls which requests are fed into the simulator. Second, we added a secondary flash cache on top of the primary disk cache. Last, we fixed bugs related to the incompatibility between the Type 3 smart controller and the Microsoft SSD add-on.

5.5.1 Trace Formats: SRT 1.6 and SPC

DiskSim 4.0 supports many trace formats, including SRT 1.4, but it does not support HP SRT 1.6 or SPC. Many traces, such as HP OpenMail and Cello, use SRT 1.6. SPC is a trace format designed by the Storage Performance Council, and the University of Massachusetts traces use the SPC format. Our implementation of SRT 1.6 is based on HP's WinTrace-to-SRT converter; its source code is in disksim_srt.c. Our implementation of SPC complies with SPC specification 1.01; its source code is in disksim_spc.c. Both are included in our disksim4.0-ms-ssd-t2-c3 version, which we plan to make publicly available.

In addition to the two trace formats, a trace filter was added. With the filter, you can select:

Device number
Read only
Write only
Set dev=0
Set maximum bcount

The filter is configured via the Global parameters in the .parv file:

disksim_global Global {
    Init Seed = 42,
    Real Seed = 42,
    # uncomment the following line if you just want requests for dev=80
    # Device to Trace = 80,
    # uncomment the following line if you only trace reads
    # Trace Read Only = 1,
    # uncomment the following line if you only trace writes
    # Trace Write Only = 1,
    # uncomment the following line if you want to set dev=0
    # Map Devices to devno-0 = 1,
    # uncomment the following line if you only want bcount <= 512
    # Max Bcount = 512
    Stat definition file = statdefs
}
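As an illustration of what the SPC support has to handle, the sketch below parses one line of an SPC-format trace. It assumes the common five-field ASCII layout (ASU, LBA, size in bytes, opcode, timestamp) used by the publicly available UMTR traces; it is not the code in disksim_spc.c, and the record layout should be treated as an assumption rather than a specification.

#include <stdio.h>

/* One request from an SPC-format ASCII trace line such as
 * "0,303567,3072,w,0.026214". Field order is assumed from the
 * public UMTR/SPC traces; disksim_spc.c has its own structures. */
struct spc_record {
    int    asu;        /* application storage unit (treated as a device number) */
    long   lba;        /* logical block address                                  */
    int    size;       /* request size in bytes                                  */
    char   opcode;     /* 'r'/'R' for read, 'w'/'W' for write                    */
    double timestamp;  /* seconds since the start of the trace                   */
};

/* Parse one trace line; returns 1 on success, 0 on a malformed line. */
static int spc_parse_line(const char *line, struct spc_record *rec)
{
    return sscanf(line, "%d,%ld,%d,%c,%lf",
                  &rec->asu, &rec->lba, &rec->size,
                  &rec->opcode, &rec->timestamp) == 5;
}

/* Example of the kind of filtering described above: keep only writes. */
static int keep_writes_only(const struct spc_record *rec)
{
    return rec->opcode == 'w' || rec->opcode == 'W';
}

int main(void)
{
    struct spc_record rec;
    const char *line = "0,303567,3072,w,0.026214";

    if (spc_parse_line(line, &rec) && keep_writes_only(&rec))
        printf("dev=%d lba=%ld size=%d t=%.6f\n",
               rec.asu, rec.lba, rec.size, rec.timestamp);
    return 0;
}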
5.5.2 Secondary Flash Cache

DiskSim 4.0 does not support the secondary flash cache that our research requires. The original system architecture of DiskSim 4.0 is shown in Figure A.1. We therefore implemented a secondary flash cache layer in DiskSim 4.0, as shown in Figure 5.1. Since we concentrate on traffic savings, we implemented only a simplified model of a flash device with a constant access time, which can be adjusted through parameter configuration.

5.5.3 Smart Controller and MS SSD Add-on

Microsoft Research implemented an SSD module on top of DiskSim 4.0. Unlike the disk implementation, the SSD implementation does not have a read or write cache. One way to obtain one is to use the smart controller (type: 3) shown in Figure 5.1. However, the SSD add-on is not compatible with the smart controller; the problem is rooted in the slightly different ways the dumb controller and the smart controller handle data transfers. We fixed these bugs and added smart-controller support to the SSD add-on. Our version, disksim4.0-ms-ssd-t2-c3, contains the update.

5.6 Validation

DiskSim 4.0 comes with a set of validation tests (runvalid) that exercise the simulator against a series of validation data sets. The validation data were captured by a logic analyzer attached to the SCSI bus, and validation is achieved by comparing measured and simulated response time distributions. Our modified version of DiskSim 4.0 produces the same values as the original DiskSim 4.0 for all validation tests, as shown in Table 5.3.

[Figure 5.1: System architecture of the modified DiskSim 4.0 — device driver, system bus, controller (cache), I/O bus, disk with an optional flash cache, and SSD.]

Test                              DiskSim 4.0           Modified DiskSim 4.0
QUANTUM QM39100TD-SW              rms(a) = 0.377952     rms = 0.377952
SEAGATE ST32171W                  rms = 0.347570        rms = 0.347570
SEAGATE ST34501N                  rms = 0.317972        rms = 0.317972
SEAGATE ST39102LW                 rms = 0.106906        rms = 0.106906
IBM DNES-309170W                  rms = 0.135884        rms = 0.135884
QUANTUM TORNADO                   rms = 0.267721        rms = 0.267721
HP C2247 validate                 rms = 0.089931        rms = 0.089931
HP C3323 validate                 rms = 0.305653        rms = 0.305653
HP C2490 validate                 rms = 0.253762        rms = 0.253762
Open synthetic workload           10.937386 ms(b)       10.937386 ms
Closed synthetic workload         87.819135 ms          87.819135 ms
Mixed synthetic workload          22.086628 ms          22.086628 ms
RAID 5 at device driver           22.861326 ms          22.861326 ms
Disk arrays at device driver      34.272035 ms          34.272035 ms
Memory cache at controller        24.651367 ms          24.651367 ms
Cache device/controller           28.939379 ms          28.939379 ms
Simpledisk                        13.711596 ms          13.711596 ms
3 different disks                 10.937386 ms          10.937386 ms
Separate controllers              10.937386 ms          10.937386 ms
HP srt trace input                48.786646 ms          48.786646 ms
ASCII input                       13.766948 ms          13.766948 ms
Syssim system integration         8.894719 ms           8.894719 ms

a Root mean square error of the differences  b Average response time in milliseconds

Table 5.3: Validation of the modified DiskSim 4.0

Chapter 6 Traffic Savings for Flash as a Victim Disk Cache

In this chapter, we focus on flash used as a victim disk cache, whose architecture was shown in Figure 3.1. There are two levels of disk cache: a primary DRAM disk cache and a secondary flash disk cache. We concentrate on the relationship between the two levels: when is the right time to update the flash disk cache? Our study shows that, compared with the early update policy, using flash as a victim cache (which corresponds to the lazy update policy) substantially extends flash lifetime, while performance in terms of response time and miss ratio improves as well. The rest of the chapter is organized as follows: Section 6.1 introduces the concepts of the early and lazy update policies.
In Section 6.2, we discuss the lazy update policy for reads. In Section 6.3, we discuss the lazy update policy for writes. In Section 6.4, the benefits of the lazy update policy are presented. Section 6.5 shows the simulation parameters. Section 6.6 describes the performance metrics used in this chapter. Finally, Section 6.7 presents the experimental results.

6.1 Early Update vs. Lazy Update

Early update means that the flash is updated as soon as the DRAM cache is updated. Lazy update refers to the policy in which the flash is updated only when data is evicted from the DRAM cache; in other words, the flash acts as a victim device of the DRAM cache. Traditionally, the early update policy has been employed; for example, Bisson et al. [18] and Kgil et al. [24] [25] [26] use early update. The flow of the early update policy is shown in Figure 6.1 and Figure 6.2.

[Figure 6.1: Early update for reads.]
[Figure 6.2: Early update for writes.]

6.2 Lazy Update Policy for Reads

When a read request arrives, the DRAM cache is checked first, as shown in Figure 6.3. If the data is found in the DRAM cache, the request is served from DRAM. If not, the flash is checked; if the flash holds an entry for the data, the request is satisfied from flash and the data is also copied into the DRAM cache. If the data is in neither cache, it is fetched from the HDDs and cached in DRAM, but the flash is not updated until that data is evicted from DRAM.

6.3 Lazy Update Policy for Writes

When a write request arrives, the DRAM cache is checked first, as shown in Figure 6.4. If the data is found in the DRAM cache, the request is merged. If not, an entry is allocated in the DRAM cache; if there is no space, the least used clean entry is evicted, and the evicted data is copied into flash. If all data in the DRAM cache are dirty, the write request must wait for dirty data to be retired to the HDDs.

6.4 The Benefits of the Lazy Update Policy

The lazy update policy benefits from the fact that block devices receive many overwrites of a small set of popular blocks. One cause of overwrites is that many file systems enforce a 30-second rule [54], which flushes buffered writes to disk every 30 seconds for the sake of data integrity. Soundararajan et al. [8] found that, on average, 54% of all overwrites occur within the first 30 seconds, which confirms the role the 30-second rule plays. Some file systems, such as ext2, flush metadata even more frequently (e.g., every 5 seconds) [57]. With the lazy update policy, overwrites are coalesced in DRAM, thereby reducing the write traffic to flash significantly.

[Figure 6.3: Lazy update for reads.]
[Figure 6.4: Lazy update for writes.]

Parameter            Value
Form factor          3.5"
Capacity             9100 MB
Seek time/track      7.5/0.8 ms
Cache/buffer         1024 KB
Data transfer rate   19 MB/s internal, 40 MB/s external
Bytes/sector         512

Table 6.1: Disk parameters (QUANTUM QM39100TD-SW)

6.5 Simulation Parameters

In our simulation, we chose the QUANTUM QM39100TD-SW SCSI hard disk drive, one of the ten disk drives validated for DiskSim 4.0.
Its basic parameters are shown in Table 6.1. We have tried other brands of hard disk drives, which have little impact on our ndings since we concentrated our research on lifetime of ash and disk cache. Therefore, in this dissertation, we use QUANTUM QM39100TD-SW to report our experimental results. The parameter le (victim.parv) we used in running DiskSim is listed below in the two- column text. In addition, we changed the following parameters via command line parameter interface: DRAM cache size Flash cache size Devno 45 disksim global Global f Init Seed = 42, Real Seed = 42, Device to Trace = 1, Map Devices to devno-0 = 1, With Flash Cache = 1, Flash Update Policy = 1, Flash Cache Size = 1, Flash Page Size = 8, Flash Cost Per Page = 0.0, Stat de nition le = statdefs g disksim stats Stats f iodriver stats = disksim iodriver stats f Print driver size stats = 1, Print driver locality stats = 0, Print driver blocking stats = 0, Print driver interference stats = 0, Print driver queue stats = 1, Print driver crit stats = 0, Print driver idle stats = 1, Print driver intarr stats = 1, Print driver streak stats = 1, Print driver stamp stats = 1, Print driver per-device stats = 1 g, bus stats = disksim bus stats f Print bus idle stats = 1, Print bus arbwait stats = 1 g, ctlr stats = disksim ctlr stats f Print controller cache stats = 1, Print controller size stats = 1, Print controller locality stats = 1, Print controller blocking stats = 1, Print controller interference stats = 1, Print controller queue stats = 1, Print controller crit stats = 1, Print controller idle stats = 1, Print controller intarr stats = 1, Print controller streak stats = 1, Print controller stamp stats = 1, Print controller per-device stats = 1 g, device stats = disksim device stats f Print device queue stats = 0, Print device crit stats = 0, Print device idle stats = 0, Print device intarr stats = 0, Print device size stats = 0, Print device seek stats = 1, Print device latency stats = 1, Print device xfer stats = 1, Print device acctime stats = 1, Print device interfere stats = 0, Print device bu er stats = 1 g, process ow stats = disksim pf stats f Print per-process stats = 1, Print per-CPU stats = 1, Print all interrupt stats = 1, Print sleep stats = 1 g g # end of stats block disksim iodriver DRIVER0 f type = 1, Constant access time = 0.0, Scheduler = disksim ioqueue f Scheduling policy = 3, Cylinder mapping strategy = 1, Write initiation delay = 0.0, Read initiation delay = 0.0, Sequential stream scheme = 0, Maximum concat size = 128, Overlapping request scheme = 0, Sequential stream di maximum = 0, Scheduling timeout scheme = 0, Timeout time/weight = 6, Timeout scheduling = 4, Scheduling priority scheme = 0, Priority scheduling = 4 g, # end of Scheduler Use queueing in subsystem = 1 g # end of DRV0 spec disksim bus BUS0 f type = 1, Arbitration type = 1, Arbitration time = 0.0, Read block transfer time = 0.0, Write block transfer time = 0.0, Print stats = 0 46 g # end of BUS0 spec disksim bus BUS1 f type = 1, Arbitration type = 1, Arbitration time = 0.0, Read block transfer time = 0.0512, Write block transfer time = 0.0512, Print stats = 1 g # end of BUS1 spec disksim ctlr CTLR0 f type = 1, Scale for delays = 0.0, Bulk sector transfer time = 0.0, Maximum queue length = 0, Print stats = 1 g # end of CTLR0 spec source atlas III.diskspecs # component instantiation instantiate [ statfoo ] as Stats instantiate [ bus0 ] as BUS0 instantiate [ bus1 ] as BUS1 instantiate [ disk0 .. 
disk2 ] as QUANTUM QM39100TD-SW instantiate [ driver0 ] as DRIVER0 instantiate [ ctlr0 ] as CTLR0 # system topology topology disksim iodriver driver0 [ disksim bus bus0 [ disksim ctlr ctlr0 [ disksim bus bus1 [ disksim disk disk0 [], disksim disk disk1 [], disksim disk disk2 [] ] ] ] ] disksim logorg org0 f Addressing mode = Parts, Distribution scheme = Asis, Redundancy scheme = Noredun, devices = [ disk0 ], Stripe unit = 2056008, Synch writes for safety = 0, Number of copies = 2, Copy choice on read = 6, RMW vs. reconstruct = 0.5, Parity stripe unit = 64, Parity rotation type = 1, Time stamp interval = 0.000000, Time stamp start time = 60000.000000, Time stamp stop time = 10000000000.000000, Time stamp le name = stamps g # end of logorg org0 spec disksim pf Proc f Number of processors = 1, Process-Flow Time Scale = 1.0 g # end of process ow spec disksim synthio Synthio f Number of I/O requests to generate = 20000, Maximum time of trace generated = 10000.0, System call/return with each request = 0, Think time from call to request = 0.0, Think time from request to return = 0.0, Generators = [ disksim synthgen f # generator 0 Storage capacity per device = 17938986, devices = [ disk0 ], Blocking factor = 8, Probability of sequential access = 0.7, Probability of local access = 0.7, Probability of read access = 0.5, Probability of time-critical request = 0.2, Probability of time-limited request = 0.3, Time-limited think times = [ normal, 30.0, 100.0 ], General inter-arrival times = [ exponential, 0.0, 25.0 ], Sequential inter-arrival times = [ normal, 0.0, 0.0 ], Local inter-arrival times = [ exponential, 0.0, 0.0 ], 47 Local distances = [ normal, 0.0, 40000.0 ], Sizes = [ exponential, 0.0, 8.0 ] g ] # end of generator list g # end of synthetic workload spec 6.6 Performance Metrics As discussed in Chapter 5, we use relative tra c, miss ratio, and response time as performance metrics. 6.7 Experimental Results We ran OpenMail, Synthetic, and Websearch3 workload traces against early update policy and lazy update policy, during which relative tra c, miss ratio, and average response time were observed. We especially watched the impact of DRAM size and ash size on relative tra c savings. The simulation results are shown in Figure 6.5 through Figure 6.28. Several observations can be made from these gures. 6.7.1 Write Tra c Savings OpenMail (Figure 6.5 and Figure 6.6): relative tra c with lazy update policy is 50% of that with early update policy when DRAM size reaches 25MB (0.33 for lazy update policy vs. 0.68 for early update policy). Synthetic workload (Figure 6.7 and Figure 6.8): relative tra c with lazy update policy is 50% of that with early update policy when DRAM size reaches 60MB (roughly 0.5 for lazy update policy vs. 1 for early update policy). Websearch3 (Figure 6.9 and Figure 6.10): relative tra c does not see improvement with Websearch3 because Websearch3 is a read-only workload. For read-only workloads, there is no overwrite that can be coalesced using lazy update policy. 
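Plugging the OpenMail numbers above into Equation (5.3) makes the lifetime gain concrete:

\[ \lambda = \frac{\tau_2}{\tau_1} = \frac{0.33}{0.68} \approx 0.49, \qquad \frac{L_2}{L_1} = \frac{1}{\lambda} \approx 2.1 \]

That is, the early-update lifetime is about half of the lazy-update lifetime; equivalently, the lazy update policy roughly doubles the ideal flash lifetime for this workload, before even accounting for the reduced write amplification discussed in Chapter 5.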
[Figure 6.5: Relative traffic for OpenMail vs. DRAM cache size (flash cache size = 31 MB), early vs. lazy update.]
[Figure 6.6: Relative traffic for OpenMail vs. flash cache size (DRAM cache size = 6 MB), early vs. lazy update.]
[Figure 6.7: Relative traffic for the synthetic workload vs. DRAM cache size (flash cache size = 62 MB), early vs. lazy update.]
[Figure 6.8: Relative traffic for the synthetic workload vs. flash cache size (DRAM cache size = 6 MB), early vs. lazy update.]
[Figure 6.9: Relative traffic for Websearch3 vs. DRAM cache size (flash cache size = 74 MB), early vs. lazy update.]
[Figure 6.10: Relative traffic for Websearch3 vs. flash cache size (DRAM cache size = 14 MB), early vs. lazy update.]

6.7.2 DRAM Size Need Not Be Very Large to Be Effective

The DRAM size required to be effective ranges from 15 MB to 60 MB depending on the workload, as shown in Figure 6.5 and Figure 6.7.

6.7.3 Flash Size Has Little Effect on Write Traffic Savings

As can be seen from Figure 6.6 and Figure 6.8, flash size has little impact on write traffic savings compared to varying the DRAM size; the savings depend mainly on the DRAM size rather than the flash size. This is a valuable characteristic, since a moderately sized DRAM cache (< 60 MB) can extend flash lifetime significantly no matter how large the flash is.

6.7.4 Miss Ratio Is Improved Slightly

From Figure 6.11, Figure 6.12, Figure 6.13, and Figure 6.14, we observe that the miss ratio with the lazy update policy is slightly better than with the early update policy. The miss ratio for Websearch3, however, shows no improvement, as Figure 6.15 and Figure 6.16 show.

6.7.5 Response Time Is Improved Slightly

From Figure 6.17, Figure 6.18, Figure 6.19, and Figure 6.20, we observe that the response time with the lazy update policy is slightly better than with early update. As with the miss ratio, response time shows no improvement for the Websearch3 workload, as shown in Figure 6.21 and Figure 6.22.

6.7.6 Summary

The improvement of the lazy update policy over the early update policy for the OpenMail, Websearch3, and synthetic workloads is presented in Figure 6.23 through Figure 6.28, respectively.
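One plausible reading of the improvement axis in Figures 6.23 through 6.28 — stated here as an assumption, since the exact formula is not spelled out — is the relative change of each metric M (relative traffic, miss ratio, or average response time):

\[ \mathrm{Improvement}(\%) = \frac{M_{\mathrm{early}} - M_{\mathrm{lazy}}}{M_{\mathrm{early}}} \times 100 \]

so that a lower (better) value under the lazy update policy yields a positive improvement.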
[Figure 6.11: Miss ratio for OpenMail vs. DRAM cache size (flash cache size = 31 MB).]
[Figure 6.12: Miss ratio for OpenMail vs. flash cache size (DRAM cache size = 6 MB).]
[Figure 6.13: Miss ratio for the synthetic workload vs. DRAM cache size (flash cache size = 62 MB).]
[Figure 6.14: Miss ratio for the synthetic workload vs. flash cache size (DRAM cache size = 6 MB).]
[Figure 6.15: Miss ratio for Websearch3 vs. DRAM cache size (flash cache size = 74 MB).]
[Figure 6.16: Miss ratio for Websearch3 vs. flash cache size (DRAM cache size = 14 MB).]
[Figure 6.17: Response time for OpenMail vs. DRAM cache size (flash cache size = 31 MB).]
[Figure 6.18: Response time for OpenMail vs. flash cache size (DRAM cache size = 6 MB).]
[Figure 6.19: Response time for the synthetic workload vs. DRAM cache size (flash cache size = 62 MB).]
[Figure 6.20: Response time for the synthetic workload vs. flash cache size (DRAM cache size = 6 MB).]
[Figure 6.21: Response time for Websearch3 vs. DRAM cache size (flash cache size = 74 MB).]
[Figure 6.22: Response time for Websearch3 vs. flash cache size (DRAM cache size = 14 MB).]
[Figure 6.23: Improvement of the lazy update policy over the early update policy for OpenMail vs. DRAM cache size (flash cache size = 31 MB): relative traffic, miss ratio, response time.]
[Figure 6.24: Improvement of the lazy update policy over the early update policy for OpenMail vs. flash cache size (DRAM cache size = 6 MB): relative traffic, miss ratio, response time.]
[Figure 6.25: Improvement of the lazy update policy over the early update policy for UMTR (Websearch3) vs. DRAM cache size (flash cache size = 74 MB).]
[Figure 6.26: Improvement of the lazy update policy over the early update policy for UMTR (Websearch3) vs. flash cache size (DRAM cache size = 14 MB).]
[Figure 6.27: Improvement of the lazy update policy over the early update policy for the synthetic workload vs. DRAM cache size (flash cache size = 62 MB).]
[Figure 6.28: Improvement of the lazy update policy over the early update policy for the synthetic workload vs. flash cache size (DRAM cache size = 6 MB).]

Although we see little difference in miss ratio and average response time with the lazy update policy, the write traffic is significantly reduced (by 5-97%, depending on the write content of the traffic).

Chapter 7 Traffic Savings for Flash as a Major Storage Medium

In this chapter, we focus on flash used as primary storage rather than as a victim disk cache; the architecture was shown in Figure 3.2. Unlike in the disk-cache setting, read requests do not generate write traffic to flash. Moreover, since flash is the major storage medium, we assume the flash is much larger than a disk cache. We again concentrate on the impact of the update policies on flash lifetime. Our study shows that, compared with the early update policy, using flash as a victim device of the DRAM cache (which corresponds to the lazy update policy) substantially extends flash lifetime, and response time improves as well, consistent with the findings in Chapter 6.

The rest of the chapter is organized as follows: Section 7.1 introduces early update for reads. In Section 7.2, we discuss early update for writes. In Section 7.3, we discuss lazy update for reads. In Section 7.4, we present lazy update for writes. Section 7.5 shows the simulation parameters. Section 7.6 describes the performance metrics used in this chapter. Finally, Section 7.7 presents the experimental results.

7.1 Early Update for Reads

The flowchart of early update for reads is shown in Figure 7.1. When a read request arrives, the DRAM cache is checked first. If the data is found, the request is satisfied. Otherwise, the data is read from flash, and the DRAM cache is updated upon receiving it.

[Figure 7.1: Early update for reads.]

7.2 Early Update for Writes

The flowchart of early update for writes is shown in Figure 7.2. When a write request arrives, the DRAM cache is searched first. If the data is found, it is merged; otherwise, the DRAM cache is updated. Next, the flash is updated and the write counter is incremented.

7.3 Lazy Update for Reads

The flowchart of lazy update for reads is shown in Figure 7.3. When a read request arrives, the DRAM cache is searched. If the data is found, it is returned. If not, the data is read from flash and the DRAM cache is updated with it. If a dirty entry is evicted from the DRAM cache, the flash is updated and the write counter is incremented.

[Figure 7.2: Early update for writes.]
[Figure 7.3: Lazy update for reads.]
[Figure 7.4: Lazy update for writes.]

7.4 Lazy Update for Writes

The flowchart of lazy update for writes is shown in Figure 7.4. When a write request arrives, the DRAM cache is searched. If the data is found, it is merged; otherwise, the DRAM cache is updated. If dirty data is evicted from the DRAM cache, the flash is updated and the write counter is incremented.
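The read and write paths of Figures 7.3 and 7.4 can be condensed into the short, self-contained sketch below. It is an illustration of the policy only, not the DiskSim code; the tiny direct-mapped cache and all names (lookup, install, evict, and so on) are ours, and only bookkeeping is modeled.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the lazy update policy with flash as the main store and a
 * DRAM cache in front of it (Figures 7.3 and 7.4). No real data moves;
 * we only count accesses to flash. */

#define DRAM_SLOTS 4                 /* tiny direct-mapped DRAM cache */

struct slot { long lba; bool valid, dirty; };

static struct slot dram[DRAM_SLOTS];
static long flash_writes;            /* the "write counter" of the flowcharts */
static long flash_reads;

/* Evict whatever occupies the slot; only dirty data is written back to flash. */
static void evict(struct slot *s)
{
    if (s->valid && s->dirty)
        flash_writes++;              /* lazy update: flash touched only on eviction */
    s->valid = s->dirty = false;
}

static struct slot *lookup(long lba)
{
    struct slot *s = &dram[lba % DRAM_SLOTS];
    return (s->valid && s->lba == lba) ? s : NULL;
}

static struct slot *install(long lba)
{
    struct slot *s = &dram[lba % DRAM_SLOTS];
    evict(s);                        /* may push a dirty victim to flash */
    s->lba = lba;
    s->valid = true;
    return s;
}

/* Figure 7.3: read path. */
static void handle_read(long lba)
{
    if (lookup(lba))
        return;                      /* served from DRAM */
    flash_reads++;                   /* miss: read from flash ...            */
    install(lba)->dirty = false;     /* ... and cache it clean in DRAM       */
}

/* Figure 7.4: write path. */
static void handle_write(long lba)
{
    struct slot *s = lookup(lba);
    if (!s)
        s = install(lba);            /* allocate, possibly evicting a dirty victim */
    s->dirty = true;                 /* hit or miss: the write is absorbed in DRAM */
}

int main(void)
{
    /* Repeated overwrites of one block cost at most a single flash write,
     * and only once the block is finally evicted. */
    for (int i = 0; i < 1000; i++)
        handle_write(7);
    handle_write(7 + DRAM_SLOTS);    /* maps to the same slot: forces the eviction */
    printf("flash writes = %ld, flash reads = %ld\n", flash_writes, flash_reads);
    return 0;
}

Running the sketch reports a single flash write for a thousand overwrites of the same block, which is exactly the coalescing effect that the lazy update policy exploits.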
7.5 Simulation Parameters

The relevant part of the parameter file (ssd-victim.parv) we used in running DiskSim is listed below.

ssdmodel_ssd SSD {
    # vp - this is a percentage of total pages in the ssd
    Reserve pages percentage = 15,
    # vp - min percentage of free blocks needed. if the free
    # blocks drop below this, cleaning kicks in
    Minimum free blocks percentage = 5,
    # vp - a simple read-modify-erase-write policy = 1 (no longer supported)
    # vp - osr write policy = 2
    Write policy = 2,
    # vp - random = 1 (not supp), greedy = 2, wear-aware = 3
    Cleaning policy = 2,
    # vp - number of planes in each flash package (element)
    Planes per package = 8,
    # vp - number of flash blocks in each plane
    Blocks per plane = 2048,
    # vp - how the blocks within an element are mapped on a plane
    # simple concatenation = 1, plane-pair stripping = 2 (not tested),
    # full stripping = 3
    Plane block mapping = 3,
    # vp - copy-back enabled (1) or not (0)
    Copy back = 1,
    # how many parallel units are there?
    # entire elem = 1, two dies = 2, four plane-pairs = 4
    Number of parallel units = 1,
    # vp - we use diff allocation logic: chip/plane
    # each gang = 0, each elem = 1, each plane = 2
    Allocation pool logic = 1,
    # elements are grouped into a gang
    Elements per gang = 1,
    # shared bus (1) or shared control (2) gang
    Gang share = 1,
    # when do we want to do the cleaning?
    Cleaning in background = 0,
    Command overhead = 0.00,
    Bus transaction latency = 0.0,
    # Assuming PCI-E, with 8 lanes with 8b/10b encoding.
    # This gives 2.0 Gbps per lane and with 8 lanes we get about
    # 2.0 GBps. So, bulk sector transfer time is about 0.238 us.
    # Use the "Read block transfer time" and "Write block transfer time"
    # from disksim bus above.
    Bulk sector transfer time = 0,
    Flash chip elements = 8,
    Page size = 8,
    Pages per block = 64,
    # vp - changing the no of blocks from 16184 to 16384
    Blocks per element = 16384,
    Element stride pages = 1,
    Never disconnect = 1,
    Print stats = 1,
    Max queue length = 20,
    Scheduler = disksim_ioqueue {
        Scheduling policy = 1,
        Cylinder mapping strategy = 0,
        Write initiation delay = 0,
        Read initiation delay = 0.0,
        Sequential stream scheme = 0,
        Maximum concat size = 0,
        Overlapping request scheme = 0,
        Sequential stream diff maximum = 0,
        Scheduling timeout scheme = 0,
        Timeout time/weight = 0,
        Timeout scheduling = 0,
        Scheduling priority scheme = 0,
        Priority scheduling = 1
    },
    Timing model = 1,
    # vp changing the Chip xfer latency from per sector to per byte
    Chip xfer latency = 0.000025,
    Page read latency = 0.025,
    Page write latency = 0.200,
    Block erase latency = 1.5
} # end of SSD spec

In addition, we changed the following parameters via the command-line parameter interface:

DRAM cache size
Devno

7.6 Performance Metrics

As in Chapter 6, we use relative traffic and response time as performance metrics. We do not use miss ratio in this chapter, since flash is not used as a disk cache.

7.7 Experimental Results

We ran the OpenMail, Synthetic, and Websearch3 workload traces under both the early and lazy update policies, observing relative traffic and average response time. We paid particular attention to the impact of DRAM size on relative traffic savings. The simulation results are shown in Figure 7.5 through Figure 7.13, from which several observations can be made.

7.7.1 Write Traffic Savings

OpenMail (Figure 7.5): relative traffic with the lazy update policy is roughly 33% of that with the early update policy once the DRAM size reaches 10 MB (0.33 for the lazy update policy vs. 1 for the early update policy).
Synthetic workload (Figure 7.6): relative traffic with the lazy update policy is roughly 40% of that with the early update policy once the DRAM size reaches 10 MB (roughly 0.4 for the lazy update policy vs. 1 for the early update policy).

[Figure 7.5: Relative traffic for OpenMail vs. DRAM cache size, early vs. lazy update.]
[Figure 7.6: Relative traffic for the synthetic workload vs. DRAM cache size, early vs. lazy update.]

Websearch3 (Figure 7.7): relative traffic shows no improvement for Websearch3, because Websearch3 is a read-only workload; with no writes, there are no overwrites for the lazy update policy to coalesce.

7.7.2 DRAM Size Need Not Be Large to Be Effective

A DRAM cache larger than 10 MB is sufficient to reduce the write traffic significantly, as shown in Figure 7.5 and Figure 7.6.

7.7.3 Response Time

OpenMail (Figure 7.8): the response time with the lazy update policy is roughly 5.6% of that with the early update policy (roughly 0.2 ms for lazy update vs. roughly 3.6 ms for early update).

Synthetic workload (Figure 7.9): the response time with the lazy update policy is roughly 63.6% of that with the early update policy (roughly 0.14 ms for lazy update vs. roughly 0.22 ms for early update).

Websearch3 (Figure 7.10): the response time for Websearch3 shows no improvement.

7.7.4 Summary

The improvement of the lazy update policy over the early update policy for the OpenMail, Websearch3, and synthetic workloads is presented in Figure 7.11, Figure 7.12, and Figure 7.13, respectively. Although there is no difference for read-only workloads, the lazy update policy reduces write traffic significantly for read/write workloads, and average response time also shows marked improvement.

[Figure 7.7: Relative traffic for Websearch3 vs. DRAM cache size, early vs. lazy update.]
[Figure 7.8: Response time for OpenMail vs. DRAM cache size, early vs. lazy update.]
[Figure 7.9: Response time for the synthetic workload vs. DRAM cache size, early vs. lazy update.]
[Figure 7.10: Response time for Websearch3 vs. DRAM cache size, early vs. lazy update.]
[Figure 7.11: Improvement of the lazy update policy over the early update policy for OpenMail (relative traffic and response time vs. DRAM cache size).]
[Figure 7.12: Improvement of the lazy update policy over the early update policy for UMTR (Websearch3).]
[Figure 7.13: Improvement of the lazy update policy over the early update policy for the synthetic workload.]

Chapter 8 Fault Tolerance Issue

As mentioned in Chapter 3, the DRAM cache raises a fault tolerance issue: DRAM is volatile memory and loses its data when power is lost. Battery-backed DRAM exists.
However, batteries have many issues, such as limited lifetime, maintenance requirements, and long recharge times. In this dissertation, we propose using supercapacitors as backup power for the controller and the DRAM, as shown in Figure 8.1 (flash as major store) and Figure 8.2 (flash as disk cache). Supercapacitor backup power has been used by Seagate [58] in their SSDs and by Sun Oracle [59] in their storage systems. Although the use of supercapacitors as a backup power source is not new, we have not found it being used to address the flash lifetime issue.

The rest of the chapter is organized as follows: Section 8.1 introduces supercapacitors. In Section 8.2, we discuss the reasons for using supercapacitors instead of batteries. Finally, Section 8.3 presents how to calculate the needed capacitance.

8.1 What Are Supercapacitors?

A supercapacitor (also known as an ultracapacitor) is an electrochemical capacitor that offers very high capacitance in a small package. Capacitance is measured in farads: ordinary capacitors are rated in microfarads (1 μF = 10^-6 F), nanofarads (1 nF = 10^-9 F), and picofarads (1 pF = 10^-12 F), whereas supercapacitors come in whole farads.

[Figure 8.1: Flash as major store with a supercapacitor backup power (device interface, controller, DRAM, flash array).]
[Figure 8.2: Flash as disk cache with a supercapacitor backup power (device interface, controller, DRAM, flash array, disks).]

8.2 Why Supercapacitors, Not Batteries?

Unlike an electrochemical battery, a supercapacitor suffers very little wear from charge/discharge cycling, and age does not affect it much. In normal use, a supercapacitor deteriorates to about 80 percent of its capacity after 10 years, which is long enough for most applications, whereas a battery needs several replacements over the lifetime of a product. Additionally, supercapacitors do not need the full-charge detection circuit that rechargeable batteries require: they take as much energy as needed and stop accepting charge when full, so there is no danger of overcharging. The supercapacitor offers high power density, although its energy density is far below that of a battery, as depicted in Figure 8.3. Today, supercapacitors can store only 5%-10% as much energy as a modern lithium-ion battery of the same size. What supercapacitors lack in energy capacity, they make up for in the ability to charge and discharge rapidly: they can be charged in seconds rather than minutes or hours. Supercapacitors are already used extensively; millions of them provide backup power for the memory in microcomputers and cell phones, and, as mentioned above, they have been used in enterprise SSDs.

8.3 How to Calculate the Capacitance of the Supercapacitor

The required value of the supercapacitor can be estimated [56] by equating the energy needed during the hold-up period to the energy released by the supercapacitor as its voltage drops from V_wv to V_min.

The energy E_x needed during the hold-up period t is

\[ E_x = I\,\frac{V_{wv} + V_{min}}{2}\,t \qquad (8.1) \]

The energy decrease E_y as the voltage drops from V_wv to V_min is

\[ E_y = \frac{C\,V_{wv}^2}{2} - \frac{C\,V_{min}^2}{2} \qquad (8.2) \]

Since E_x = E_y, the minimum capacitance that guarantees hold-up down to V_min is

\[ C = \frac{I\,(V_{wv} + V_{min})\,t}{V_{wv}^2 - V_{min}^2} \qquad (8.3) \]

[Figure 8.3: Supercapacitors and rechargeable batteries.]

When the main power fails for any reason, the supercapacitor backup must provide temporary power long enough for the controller to transfer the dirty data into flash. Suppose we use a 64 MB DRAM cache, which is large enough to achieve effective write traffic savings.
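As a quick numerical cross-check of (8.3), ahead of the drive-specific estimate in the next paragraph, the short program below plugs in the same supply current and hold-up time used there. The minimum voltage V_min = 4.5 V is our assumption (it is the value that reproduces the C of about 0.12 F quoted below), not a figure stated in the original.

#include <stdio.h>

/* Minimum supercapacitor capacitance per Eq. (8.3):
 *   C = I * (Vwv + Vmin) * t / (Vwv^2 - Vmin^2)
 * which algebraically simplifies to I * t / (Vwv - Vmin). */
static double min_capacitance(double I, double Vwv, double Vmin, double t)
{
    return I * (Vwv + Vmin) * t / (Vwv * Vwv - Vmin * Vmin);
}

int main(void)
{
    double I    = 0.150 / 5.0;  /* 150 mW active power at 5 V -> 0.03 A      */
    double Vwv  = 5.0;          /* working voltage of the drive               */
    double Vmin = 4.5;          /* assumed lowest usable voltage (see lead-in)*/
    double t    = 2.0;          /* hold-up time in seconds (conservative)     */

    printf("C_min = %.2f F\n", min_capacitance(I, Vwv, Vmin, t));  /* ~0.12 F */
    return 0;
}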
We use Intel X-25M SSD drives [60] to estimate the transfer time. The drive needs 5V (+/-5%) power and the active power is 150mW (current is 0.15/5=0.03A). The sustained sequential write speed is 70MB for 80 GB SSD drives. Therefore, in less than one second, 64MB will be moved into ash. We use 2 seconds to calculate the minimum capac- itance value. Applying (8.3), we get C=0.12F. According to Maxwell [61], supercapacitor prices will be approaching $0.01 per farad in production volumes of millions. Apart from 96 supercapacitors, a power control circuit should be added. However, the total cost would not be high. 97 Chapter 9 Conclusions and Future Work In this dissertation, we have concentrated our research on extending the ash lifetime. We have investigated a new way to extend the lifetime of ash with DRAM cache. For data integrity, we propose to use a supercapacitor for backup power instead of batteries so that DRAM cache can be used without loss of data integrity. This chapter concludes the dissertation by summarizing the contributions and describing future directions. The reminder of the chapter is organized as follows: Section 9.1 highlights the main con- tributions of the dissertation. In Section 9.2, we concentrate on some future directions, which are extensions of our past and current research on cache. Finally, Section 9.3 summarizes the results and their implications. 9.1 Main Contributions This dissertation introduced the following main contributions that aim to extend the lifetime of ash: Extend the lifetime of ash in a new way: DRAM cache has been used to improve performance in terms of response time. Due to its volatile nature, researchers are reluctant to use it to save write tra c. Soundararajan et al. [8] put it this way, "RAM can make for a fast and e ective write cache, however the overriding problem with RAM is that it is not persistent (absent some power-continuity arrangements)." The potential of DRAM cache being a write tra c saver has been overlooked. We conducted extensive simulation based on real workloads and synthetic workloads on how e ective a DRAM cache can be a write tra c saver and how large the DRAM 98 cache should be to be e ective. Our results show that with a medium-sized DRAM cache, the lifetime of the backing ash can be doubled. Therefore, DRAM cache is a e ective write tra c saver. Solve the data integrity: To be e ective and fast is not enough to justify that DRAM cache is a correct candidate unless the data can be guaranteed safe out of power outage. An Achilles? heel is that the data will be lost in case of power failure. A common way to solve this issue is to have a battery backup power. However, batteries have limited charge/discharge cycles and need regular maintenance or replacement, which limits their use in many environments. Instead, we proposed to use a supercapacitor backup power. Supercapacitors are perfect at supplying short period of power in this scenario. Supercapacitors are not new. However, using it to solve the lifetime of ash, to our best knowledge, has not been done. Enhance DiskSim simulator: DiskSim 4.0 is a well-regarded disk simulator. How- ever, it lacks support of two levels of disk cache (DRAM primary disk cache and secondary ash disk cache). Microsoft SSD add-on does not support a read/write cache. Besides, it does not work well with Type 3 Smart Controller, which supports read/write cache at controller level. Moreover, trace formats (SRT 1.6 and SPC) are not included. 
We added those missing components to DiskSim and successfully passed all validation tests. 9.2 Future Work Our work in this dissertation is at device level. However, we realize that the concept of supercapacitor backup power can be applied to computer main memory to enhance data integrity of computer systems as well. A conventional practice to alleviate data loss due to unexpected power failure is to enforce a 30-second ush rule, as Unix operating systems do. The negative side of the rule is 99 that disk fragmentation is increased, which will decrease its performance. Some important computer systems are even equipped with Uninterruptible Power Supply (UPS) to protect its data. With a supercapacitor backup power, backing ash, and controller, the content of main memory can be backed up into ash on power loss. The data can be recovered on power resumption. In this context, 30-second ush rule is no longer needed, which would reduce disk fragmentation and improve data integrity as well as performance. 9.3 Conclusions We have demonstrated through simulation that a medium-sized DRAM cache can save up to 50% write tra c to ash, which is translated into at least doubling the lifetime of ash. Meanwhile, performance in terms of response time and miss ratio sees improvement as well. Furthermore, our ndings can be applied to computer main memory to enhance data integrity of computer systems. 100 Bibliography [1] W. Hsu and A. J. Smith, \The performance impact of I/O optimizations and disk improvements," IBM J. Res. Dev., vol. 48, no. 2, pp. 255{289, 2004. [2] SolarisTM ZFSTM enables hybrid storage pools{Shatters economic and perfor- mance barriers. [Online]. Available: http://download.intel.com/design/ ash/nand/ SolarisZFS SolutionBrief.pdf [3] \Understanding the ash translation layer (FTL) speci cation," Intel Corporation, Tech. Rep., 1998. [4] Y. Kim, A. Gupta, and B. Urgaonkar, \MixedStore: An enterprise-scale storage system combining Solid-state and Hard Disk Drives ," The Pennsylvania State University, Tech. Rep., 2008. [5] \NAND evolution and its e ects on Solid State Drive (SSD) useable life," Western Digital, White paper WP-001-01R, 2009. [6] J. Hutchby and M. Garner. Assessment of the potential & maturity of selected emerging research memory technologies. Workshop & ERD/ERM Working Group Meeting (April 6-7, 2010). [Online]. Available: http://www.itrs.net/Links/2010ITRS/ 2010Update/ToPost/ERD ERM 2010FINALReportMemoryAssessment ITRS.pdf [7] (2010) Process integration, devices & structures. The International Technology Roadmap for Semiconductors. [Online]. Available: http://www.itrs.net/Links/ 2010ITRS/Home2010.htm [8] G. Soundararajan, V. Prabhakaran, M. Balakrishnan, and T. Wobber, \Extending SSD lifetimes with disk-based write caches," in Proceedings of the 8th USENIX conference on File and storage technologies, ser. FAST?10. Berkeley, CA, USA: USENIX Association, 2010, pp. 8{8. [Online]. Available: http://portal.acm.org/citation.cfm? id=1855511.1855519 [9] S.-y. Park, D. Jung, J.-u. Kang, J.-s. Kim, and J. Lee, \CFLRU: a replacement algorithm for ash memory," in Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, ser. CASES ?06. New York, NY, USA: ACM, 2006, pp. 234{241. [Online]. Available: http://doi.acm.org/10.1145/1176760.1176789 101 [10] H. Jo, J.-U. Kang, S.-Y. Park, J.-S. Kim, and J. Lee, \FAB: ash-aware bu er man- agement policy for portable media players," Consumer Electronics, IEEE Transactions on, vol. 52, no. 2, pp. 
485 { 493, May 2006. [11] H. Kim and S. Ahn, \BPLRU: a bu er management scheme for improving random writes in ash storage," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, ser. FAST?08. Berkeley, CA, USA: USENIX Association, 2008, pp. 16:1{ 16:14. [Online]. Available: http://portal.acm.org/citation.cfm?id=1364813.1364829 [12] A. Gupta, Y. Kim, and B. Urgaonkar, \DFTL: a ash translation layer employing demand-based selective caching of page-level address mappings," in ASPLOS ?09: Pro- ceeding of the 14th international conference on Architectural support for programming languages and operating systems. New York, NY, USA: ACM, 2009, pp. 229{240. [13] R. Bez, E. Camerlenghi, A. Modelli, and A. Visconti, \Introduction to ash memory," Proceedings of the IEEE, vol. 91, no. 4, pp. 489{502, April 2003. [14] Flash memory. [Online]. Available: http://en.wikipedia.org/wiki/Flash memory [15] K. Takeuchi, S. Satoh, T. Tanaka, K. Imamiya, and K. Sakui, \A negative Vth cell architecture for highly scalable, excellently noise-immune, and highly reliable NAND ash memories," Solid-State Circuits, IEEE Journal of, vol. 34, no. 5, pp. 675 {684, May 1999. [16] J.-W. Hsieh, T.-W. Kuo, P.-L. Wu, and Y.-C. Huang, \Energy-e cient and performance-enhanced disks using ash-memory cache," in Proceedings of the 2007 international symposium on Low power electronics and design, ser. ISLPED ?07. New York, NY, USA: ACM, 2007, pp. 334{339. [Online]. Available: http://doi.acm.org/10.1145/1283780.1283851 [17] T. Bisson, S. A. Brandt, and D. D. E. Long, \A hybrid disk-aware spin-down algorithm with I/O subsystem support." in IPCCC?07, 2007, pp. 236{245. [18] T. Bisson and S. A. Brandt, \Reducing Hybrid Disk write latency with ash-backed I/O requests," in Proceedings of the 2007 15th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. Washington, DC, USA: IEEE Computer Society, 2007, pp. 402{409. [Online]. Available: http://portal.acm.org/citation.cfm?id=1474555.1475519 [19] ||, \Flushing policies for NVCache enabled hard disks," in Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies. Washington, DC, USA: IEEE Computer Society, 2007, pp. 299{304. [Online]. Available: http://portal.acm.org/citation.cfm?id=1306871.1306922 [20] R. Panabaker. (2006, May) Hybrid hard disk and Ready-Drive technology: Improving performance and power for Windows Vista mobile PCs. [Online]. Available: http://www.microsoft.com/whdc/winhec/pres06.mspx 102 [21] J. K. Kim, H. G. Lee, S. Choi, and K. I. Bahng, \A PRAM and NAND ash hybrid architecture for high-performance embedded storage subsystems," in Proceedings of the 8th ACM international conference on Embedded software, ser. EMSOFT ?08. New York, NY, USA: ACM, 2008, pp. 31{40. [Online]. Available: http://doi.acm.org/10.1145/1450058.1450064 [22] Y. Joo, Y. Cho, K. Lee, and N. Chang, \Improving application launch times with hybrid disks," in CODES+ISSS ?09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis. New York, NY, USA: ACM, 2009, pp. 373{382. [23] J. Matthews, S. Trika, D. Hensgen, R. Coulson, and K. Grimsrud, \Intel R Turbo Memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems," Trans. Storage, vol. 4, no. 2, pp. 1{24, 2008. [24] T. Kgil and T. 
Mudge, \Flashcache: a NAND ash memory le cache for low power web servers," in Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, ser. CASES ?06. New York, NY, USA: ACM, 2006, pp. 103{112. [Online]. Available: http://doi.acm.org/10.1145/1176760.1176774 [25] T. Kgil, D. Roberts, and T. Mudge, \Improving NAND ash based disk caches," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ser. ISCA ?08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 327{338. [Online]. Available: http://dx.doi.org/10.1109/ISCA.2008.32 [26] D. Roberts, T. Kgil, and T. Mudge, \Integrating NAND ash devices onto servers," Commun. ACM, vol. 52, pp. 98{103, April 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498791 [27] Solid-state drive. [Online]. Available: http://en.wikipedia.org/wiki/Solid-state drive [28] Wear leveling. [Online]. Available: http://en.wikipedia.org/wiki/Wear levelling [29] Journalling ash le system version 2. [Online]. Available: http://en.wikipedia.org/ wiki/JFFS2 [30] Ya s (yet another ash le system). [Online]. Available: http://en.wikipedia.org/wiki/ YAFFS [31] ZFS. [Online]. Available: http://en.wikipedia.org/wiki/Zfs [32] A. Gupta, Y. Kim, and B. Urgaonkar, \DFTL: a ash translation layer employing demand-based selective caching of page-level address mappings," in Proceeding of the 14th international conference on Architectural support for programming languages and operating systems, ser. ASPLOS ?09. New York, NY, USA: ACM, 2009, pp. 229{240. [Online]. Available: http://doi.acm.org/10.1145/1508244.1508271 103 [33] Y.-H. Chang, J.-W. Hsieh, and T.-W. Kuo, \Improving ash wear-leveling by proactively moving static data," IEEE Trans. Comput., vol. 59, pp. 53{65, January 2010. [Online]. Available: http://dx.doi.org/10.1109/TC.2009.134 [34] ||, \Endurance enhancement of ash-memory storage systems: an e cient static wear leveling design," in Proceedings of the 44th annual Design Automation Conference, ser. DAC ?07. New York, NY, USA: ACM, 2007, pp. 212{217. [Online]. Available: http://doi.acm.org/10.1145/1278480.1278533 [35] L.-P. Chang, \On e cient wear leveling for large-scale ash-memory storage systems," in Proceedings of the 2007 ACM symposium on Applied computing, ser. SAC ?07. New York, NY, USA: ACM, 2007, pp. 1126{1130. [Online]. Available: http://doi.acm.org/10.1145/1244002.1244248 [36] A. Ban, \Wear leveling of static areas in ash memory," U.S. Patent 6 732 221, 2004. [37] A. Kawaguchi, S. Nishioka, and H. Motoda, \A ash-memory based le system," in Proceedings of the USENIX 1995 Technical Conference Proceedings, ser. TCON?95. Berkeley, CA, USA: USENIX Association, 1995, pp. 13{13. [Online]. Available: http://portal.acm.org/citation.cfm?id=1267411.1267424 [38] H.-j. Kim and S.-g. Lee, \A new ash memory management for ash storage system," in 23rd International Computer Software and Applications Conference, ser. COMPSAC ?99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 284{. [Online]. Available: http://portal.acm.org/citation.cfm?id=645981.674620 [39] K. M. J. Lofgren, R. D. Norman, G. B. Thelin, and A. Gupta, \Wear leveling techniques for ash EEPROM systems," U.S. Patent 6 850 443, February, 2005. [Online]. Available: http://www.freepatentsonline.com/6850443.html [40] E. Jou and J. H. Jeppesen, III, \Flash memory wear leveling system providing immediate direct access to microprocessor," U.S. Patent 5 568 423, 1996. [Online]. 
Available: http://www.freepatentsonline.com/5568423.html [41] D. Jung, Y.-H. Chae, H. Jo, J.-S. Kim, and J. Lee, \A group-based wear-leveling algorithm for large-capacity ash memory storage systems," in Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, ser. CASES ?07. New York, NY, USA: ACM, 2007, pp. 160{164. [Online]. Available: http://doi.acm.org/10.1145/1289881.1289911 [42] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song, \A log bu er-based ash translation layer using fully-associative sector translation," ACM Trans. Embed. Comput. Syst., vol. 6, July 2007. [Online]. Available: http://doi.acm.org/10.1145/1275986.1275990 [43] J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee, \A superblock-based ash translation layer for NAND ash memory," in Proceedings of the 6th ACM & IEEE International conference on Embedded software, ser. EMSOFT ?06. New York, NY, USA: ACM, 2006, pp. 161{170. [Online]. Available: http://doi.acm.org/10.1145/1176887.1176911 104 [44] L.-P. Chang and T.-W. Kuo, \An adaptive striping architecture for ash memory storage systems of embedded systems," in Proceedings of the Eighth IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS?02), ser. RTAS ?02. Washington, DC, USA: IEEE Computer Society, 2002, pp. 187{. [Online]. Available: http://portal.acm.org/citation.cfm?id=827265.828513 [45] Y.-S. Chu, J.-W. Hsieh, Y.-H. Chang, and T.-W. Kuo, \A set-based mapping strategy for ash-memory reliability enhancement," in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ?09. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2009, pp. 405{410. [Online]. Available: http://portal.acm.org/citation.cfm?id=1874620.1874717 [46] Y.-L. Tsai, J.-W. Hsieh, and T.-W. Kuo, \Con gurable NAND ash translation layer," in Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing -Vol 1 (SUTC?06) - Volume 01. Washington, DC, USA: IEEE Computer Society, 2006, pp. 118{127. [Online]. Available: http://portal.acm.org/citation.cfm?id=1136649.1137086 [47] G. Sun, Y. Joo, Y. Chen, D. Niu, Y. Xie, Y. Chen, and H. Li, \A hybrid solid-state stor- age architecture for the performance, energy consumption, and lifetime improvement," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, 9-14 2010, pp. 1 {12. [48] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, \Scalable high performance main memory system using phase-change memory technology," in Proceedings of the 36th annual international symposium on Computer architecture, ser. ISCA ?09. New York, NY, USA: ACM, 2009, pp. 24{33. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555760 [49] \The basics of phase change memory (PCM) technology," Numonyx White Paper. [Online]. Available: www.numonyx.com/Documents/WhitePapers/PCM Basics WP. pdf [50] J. S. Bucy, J. Schindler, S. Schlosser, G. R. Ganger, and Contributors, The DiskSim simulation environment version 4.0 reference manual, Carnegie Mellon University, Pitts- burgh, PA, 2008. [51] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy, \Design tradeo s for SSD performance," in USENIX 2008 Annual Technical Conference on Annual Technical Conference. Berkeley, CA, USA: USENIX Association, 2008, pp. 57{70. [Online]. Available: http://portal.acm.org/citation.cfm?id=1404014.1404019 [52] HP open source software. [Online]. 
Available: http://tesla.hpl.hp.com/opensource/
[53] Block I/O traces from SNIA. [Online]. Available: http://iotta.snia.org/traces/list/BlockIO
[54] C. Ruemmler and J. Wilkes, "Unix disk access patterns," in USENIX Winter '93, 1993, pp. 405-420.
[55] University of Massachusetts trace repository. [Online]. Available: http://traces.cs.umass.edu/index.php/Storage/Storage
[56] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka, "Write amplification analysis in flash-based solid state drives," in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, ser. SYSTOR '09. New York, NY, USA: ACM, 2009, pp. 10:1-10:9. [Online]. Available: http://doi.acm.org/10.1145/1534530.1534544
[57] V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Analysis and evolution of journaling file systems," in Proceedings of the USENIX Annual Technical Conference, ser. ATEC '05. Berkeley, CA, USA: USENIX Association, 2005, pp. 8-8. [Online]. Available: http://portal.acm.org/citation.cfm?id=1247360.1247368
[58] Pulsar. Seagate. [Online]. Available: http://www.seagate.com/staticfiles/support/disc/manuals/ssd/100596473a.pdf
[59] Sun Storage F5100 flash array. Sun Oracle. [Online]. Available: http://www.oracle.com/us/products/servers-storage/storage/disk-storage/043970.pdf
[60] Intel X18-M/X25-M SATA Solid State Drive, 34 nm product line. Intel Corporation. [Online]. Available: http://download.intel.com/design/flash/nand/mainstream/322296.pdf
[61] A. Schneuwly, G. Sartorelli, J. Auer, and B. Maher. Ultracapacitor applications in the power electronic world. Maxwell Technologies. [Online]. Available: http://www.maxwell.com/ultracapacitors/white-papers/power_electronic_applications.pdf

Appendices

Appendix A
DiskSim 4.0 Flowcharts, Modifications, and Post Processing Scripts

A.1 Overview

DiskSim 4.0 is a well-regarded tool for simulating storage systems. We used DiskSim 4.0 along with the Microsoft SSD add-on in our storage research, and in the course of that work we made several modifications to DiskSim 4.0. We added two trace formats (SRT 1.6 and SPC) to DiskSim and fixed bugs related to the smart controller (type:3) in the Microsoft SSD add-on. Moreover, we wrote Bash scripts to automatically plot figures using gnuplot.

Block I/O traces (e.g., OpenMail and Cello) that can be downloaded from SNIA (http://www.snia.org) and HP Labs (http://tesla.hpl.hp.com/opensource) use SRT 1.6, whereas DiskSim 4.0 supports only SRT 1.4. SPC is another important trace format that DiskSim 4.0 does not support; we have a set of workloads in SPC format that can be downloaded from http://traces.cs.umass.edu/index.php/Storage/Storage. We implemented both trace formats in DiskSim 4.0. In addition, we added five parameters to DiskSim for filtering traces, e.g., replaying only the read I/O requests.
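To make the SPC trace support concrete, the following is a minimal, self-contained C sketch that parses one record of a UMass SPC-format trace into a simple request structure. It assumes the commonly documented comma-separated record layout (ASU, LBA, request size in bytes, opcode, timestamp in seconds); the structure, the function names, and the example line are ours for illustration only and are not the code that was added to DiskSim.

#include <stdio.h>
#include <ctype.h>

/* A minimal request record; the field names are illustrative only. */
typedef struct {
    int    asu;        /* application-specific unit (logical volume) */
    long   lba;        /* logical block address                      */
    int    size;       /* request size in bytes                      */
    char   op;         /* 'R' for read, 'W' for write                */
    double time;       /* arrival time in seconds                    */
} spc_request_t;

/* Parse one SPC trace line of the form "0,20941264,8192,W,0.551706".
 * Returns 1 on success and 0 on a malformed line. */
static int parse_spc_line(const char *line, spc_request_t *req)
{
    char op;
    if (sscanf(line, "%d,%ld,%d,%c,%lf",
               &req->asu, &req->lba, &req->size, &op, &req->time) != 5)
        return 0;
    req->op = (char)toupper((unsigned char)op);
    return (req->op == 'R' || req->op == 'W');
}

int main(void)
{
    /* Hypothetical example record, shown only to exercise the parser. */
    const char *example = "0,20941264,8192,W,0.551706";
    spc_request_t r;
    if (parse_spc_line(example, &r))
        printf("ASU %d, LBA %ld, %d bytes, %c at %.6f s\n",
               r.asu, r.lba, r.size, r.op, r.time);
    return 0;
}

A real trace reader would, of course, loop over every line of the trace file and hand each parsed record to the simulator's request-generation path; the sketch only shows the record layout that such a reader has to handle.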
The Microsoft SSD add-on does not include a write or read cache in its implementation, which is a problem for those who need one. One way to solve this problem is to use the smart controller (type:3). However, the SSD implementation is not compatible with the type 3 smart controller. We have an update that resolves this issue.

In order to make these modifications, we read thoroughly through the DiskSim 4.0 source code and stepped through it. Based on this, we drew flowcharts of reads and writes for both HDD and SSD, which helped us understand how DiskSim 4.0 works internally. With these flowcharts, it is much easier to comprehend the DiskSim 4.0 source code. We would like to share these flowcharts to help others who are reading the DiskSim source code; so far, we have not found such material published elsewhere.

Last but not least, the task of running DiskSim 4.0 and collecting data can be tiresome: we typically need to vary a series of parameters to see their impact, which produces many output files from which the desired results must be extracted to produce figures. With our Bash scripts, we obtain the figures by running a single Bash shell program.

A.2 System Diagram

DiskSim is a disk drive simulator widely used in academia and industry alike. The system diagram of DiskSim 4.0 is shown in Figure A.1. DiskSim is an event-driven simulator. It emulates a hierarchy of storage components such as device drivers, buses, and controllers, as well as disks. With the Microsoft SSD extension, it can also simulate flash-based solid state drives. Furthermore, it is designed to be easily integrated into a full system simulator such as SimOS.

Figure A.1: System architecture of DiskSim 4.0 (device driver, system bus, controller with cache, I/O bus, disk with cache, and SSD)

A.3 DiskSim 4.0 Flowcharts

A.3.1 Main loop

DiskSim is written mainly in C with a small portion of Perl and Python code. The main loop of DiskSim, in disksim_main.c, is shown in Figure A.2. After initialization, it runs the simulation (void disksim_run_simulation()) until all I/O requests are processed or it runs out of time. The function disksim_run_simulation(), shown in Figure A.3, gets the next event and processes it according to its type: I/O event, PF event, timeout, or checkpoint. The detailed io_internal_event handling is shown in Figure A.4; it dispatches events based on the event type: driver event, bus event, controller event, or device event.

A.3.2 Message routing

As mentioned earlier, DiskSim is an event-driven simulator. Events are put into queues to be handled. There are two kinds of queues, the internal queue and the extra queue, shown in Figure A.5. The extra queue is a pool of event structures that is in charge of memory allocation: an event is obtained via getfromextraq() and must be returned via addtoextraq() upon completion. The internal queue, on the other hand, is the list of active messages (allocated from the extra queue) passing among the device driver, bus, controller, and devices. An event is added to the internal queue via addtointq() and removed via getfromintq(). For example, an I/O request is put into the internal queue by the workload generator and is then processed along the path: device driver, system bus, controller, I/O bus, and device. (A simplified code sketch of this event loop and queue mechanism appears at the end of Section A.3.3.)

A.3.3 HDD reads with dumb controller (type:1)

The flowchart for HDD reads is shown in Figure A.6. It is based on a trace of a DiskSim run with the parameter file atlas_III.parv; different parameter settings may yield slightly different flowcharts. In this category, the dumb controller passes all messages upward and downward without processing them. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (disk). The disk issues IIA(disconnect) to disconnect the I/O bus, which is completed upon receiving IIC(disconnect). When the data is ready, the disk issues IIA(reconnect) to reconnect the I/O bus, which is finished upon receiving IIC(reconnect). Then, data is transferred via DDTC, which is done upon receiving DDTC. Finally, the disk issues IIA(completion) to complete the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).
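To make the event loop of Section A.3.1 and the queue mechanism of Section A.3.2 concrete, the sketch below is a minimal, self-contained C program in the same spirit: a free pool of event structures (playing the role of the extra queue and getfromextraq()/addtoextraq()) feeds a time-ordered list of active events (playing the role of the internal queue and addtointq()/getfromintq()), which the main loop drains and dispatches by component type. The types, names, and timing here are our own simplification, not DiskSim's actual code.

#include <stdio.h>
#include <stdlib.h>

/* Event kinds, mirroring the dispatch in io_internal_event (bus hops elided). */
enum ev_type { DRIVER_EVENT, BUS_EVENT, CONTROLLER_EVENT, DEVICE_EVENT };

typedef struct event {
    double        time;   /* simulated time at which the event fires */
    enum ev_type  type;   /* which component should handle it        */
    struct event *next;
} event_t;

static event_t *extraq = NULL;  /* free pool of event structures (extra queue)          */
static event_t *intq   = NULL;  /* time-ordered list of active events (internal queue)  */

/* Like getfromextraq(): reuse a pooled event or allocate a fresh one. */
static event_t *get_from_extraq(void)
{
    if (extraq != NULL) {
        event_t *e = extraq;
        extraq = e->next;
        return e;
    }
    return malloc(sizeof(event_t));
}

/* Like addtoextraq(): return a finished event to the free pool. */
static void add_to_extraq(event_t *e)
{
    e->next = extraq;
    extraq = e;
}

/* Like addtointq(): insert an event into the active list in time order. */
static void add_to_intq(event_t *e)
{
    event_t **p = &intq;
    while (*p != NULL && (*p)->time <= e->time)
        p = &(*p)->next;
    e->next = *p;
    *p = e;
}

/* Like getfromintq(): remove and return the earliest pending event. */
static event_t *get_from_intq(void)
{
    event_t *e = intq;
    if (e != NULL)
        intq = e->next;
    return e;
}

/* Schedule a follow-up event one time unit later. */
static void schedule(double now, enum ev_type type)
{
    event_t *e = get_from_extraq();
    e->time = now + 1.0;
    e->type = type;
    add_to_intq(e);
}

/* Main loop in the spirit of disksim_run_simulation(): drain and dispatch. */
int main(void)
{
    /* Seed one I/O request, as the workload generator would. */
    event_t *req = get_from_extraq();
    req->time = 0.0;
    req->type = DRIVER_EVENT;
    add_to_intq(req);

    event_t *e;
    while ((e = get_from_intq()) != NULL) {
        switch (e->type) {
        case DRIVER_EVENT:
            printf("%.1f: driver forwards request to controller\n", e->time);
            schedule(e->time, CONTROLLER_EVENT);
            break;
        case CONTROLLER_EVENT:
            printf("%.1f: controller forwards request to device\n", e->time);
            schedule(e->time, DEVICE_EVENT);
            break;
        case DEVICE_EVENT:
            printf("%.1f: device completes the request\n", e->time);
            break;
        default:
            break;
        }
        add_to_extraq(e);   /* every event is recycled through the free pool */
    }
    return 0;
}

Compiling and running this prints the driver-to-controller-to-device hand-off for a single request, which is the skeleton that the flowcharts in the following subsections elaborate with the full message exchanges.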
A.3.4 HDD writes with dumb controller (type:1)

The flowchart for HDD writes is shown in Figure A.7. It is based on a trace of a DiskSim run with the parameter file atlas_III.parv. As with reads, the dumb controller passes all messages upward and downward without processing them. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (disk). The disk issues IIA(reconnect) to reconnect the I/O bus, which is completed upon receiving IIC(reconnect). Then, it issues DDTC to start the data transfer, which is done upon receiving DDTC. Finally, the disk sends IIA(completion) to complete the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

Figure A.2: The main loop of DiskSim 4.0
Figure A.3: Run simulation
Figure A.4: io-internal-event
Figure A.5: Message routing (internal queue: addtointq()/getfromintq(); extra queue: getfromextraq()/addtoextraq())
Figure A.6: Flowchart for HDD reads (controller type:1). Legend: IAA = IO_ACCESS_ARRIVE, IIA = IO_INTERRUPT_ARRIVE, IIC = IO_INTERRUPT_COMPLETE, DDTC = DEVICE_DATA_TRANSFER_COMPLETE

A.3.5 HDD reads with smart controller (type:3)

The flowchart for HDD reads is shown in Figure A.8. It is based on a trace of a DiskSim run with the parameter file atlas_III3.parv, which configures the smart controller (type:3). In this category, the smart controller processes all messages passing upward or downward. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (disk). The disk issues IIA(disconnect) to disconnect the I/O bus, which is completed upon receiving IIC(disconnect); unlike the dumb controller, the smart controller processes the IIA and returns the IIC itself. When the data is ready, the disk issues IIA(reconnect) to reconnect the I/O bus, which is finished upon receiving IIC(reconnect). Then, data is transferred via DDTC, which is done upon receiving DDTC. The disk issues IIA(completion) to complete the cycle. Data is transmitted between the device driver and the controller via CDTCH. Finally, the controller issues an IIA(completion) to wrap up the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

A.3.6 HDD writes with smart controller (type:3)

The flowchart for HDD writes is shown in Figure A.9. It is based on a trace of a DiskSim run with the parameter file atlas_III3.parv, which configures the smart controller (type:3). As with reads, the smart controller processes all messages passing upward or downward. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller. A CDTCH/CDTCH exchange starts the data transfer between the device driver and the controller. Then, the IAA reaches the device (disk). Next, the disk issues IIA(reconnect) to reconnect the I/O bus, which is completed upon receiving IIC(reconnect).
Unlike the dumb controller, the smart controller processes the IIA and returns the IIC itself. Then, data is transferred via DDTC, which is done upon receiving DDTC. Next, the disk issues IIA(completion) to complete the cycle, which is done upon receiving IIC(completion). DBS and DBSD are internal events in the disk. Finally, the controller issues an IIA(completion) to wrap up the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

A.3.7 SSD reads with dumb controller (type:1)

The flowchart for SSD reads is shown in Figure A.10. It is based on a trace of a DiskSim run with the parameter file ssd-postmark.parv. In this category, the dumb controller passes all messages upward and downward without processing them. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (SSD). When the data is ready, the SSD issues IIA(reconnect) to reconnect the I/O bus, which is finished upon receiving IIC(reconnect). Then, data is transferred via DDTC, which is done upon receiving DDTC. Finally, the SSD issues IIA(completion) to complete the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

Figure A.7: Flowchart for HDD writes (controller type:1). Legend: DBS = DEVICE_BUFFER_SEEKDONE, DBSD = DEVICE_BUFFER_SECTOR_DONE
Figure A.8: Flowchart for HDD reads (controller type:3). Legend: CDTCH/CDTCD = controller data transfer complete on the host/device side
Figure A.9: Flowchart for HDD writes (controller type:3)
Figure A.10: Flowchart for SSD reads (controller type:1). Legend: DOC = DEVICE_OVERHEAD_COMPLETE, DAC = DEVICE_ACCESS_COMPLETE

A.3.8 SSD writes with dumb controller (type:1)

The flowchart for SSD writes is shown in Figure A.11. It is based on a trace of a DiskSim run with the parameter file ssd-postmark.parv. As with reads, the dumb controller passes all messages upward and downward without processing them. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (SSD). After DOC, which is an internal event, the SSD issues IIA(reconnect) to reconnect the I/O bus, which is completed upon receiving IIC(reconnect). Then, it issues DDTC to start the data transfer, which is done upon receiving DDTC. Next, DAC is handled internally in the SSD. Finally, the SSD sends IIA(completion) to complete the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).
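Before moving on to the smart-controller SSD flows, the fragment below schematically contrasts the two controller types as described in Sections A.3.3 through A.3.8: the dumb controller (type:1) merely relays the device's interrupt toward the driver, whereas the smart controller (type:3) terminates it, acknowledges the device with an IIC, and raises its own IIA toward the driver. The functions and message strings are purely illustrative and are not taken from the DiskSim sources.

#include <stdio.h>

/* Dumb controller (type:1): relays the device's interrupt to the driver unchanged. */
static const char *dumb_controller_upward(const char *from_device)
{
    return from_device;                     /* no processing on the way up */
}

/* Smart controller (type:3): terminates the device's IIA(completion) by answering
 * with IIC(completion), then raises its own IIA(completion) toward the driver. */
static const char *smart_controller_upward(const char *from_device,
                                            const char **reply_to_device)
{
    (void)from_device;
    *reply_to_device = "IIC(completion)";   /* acknowledgement back to the device */
    return "IIA(completion)";               /* fresh interrupt toward the driver  */
}

int main(void)
{
    const char *reply = NULL;
    printf("dumb:  driver receives %s\n", dumb_controller_upward("IIA(completion)"));
    printf("smart: driver receives %s\n",
           smart_controller_upward("IIA(completion)", &reply));
    printf("smart: device receives %s\n", reply);
    return 0;
}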
A.3.9 SSD reads with smart controller (type:3)

The flowchart for SSD reads is shown in Figure A.12. It is based on a trace of a DiskSim run with the parameter file ssd-postmark3.parv, which configures the smart controller (type:3). In this category, the smart controller processes all messages passing upward or downward. A cycle starts when the device driver gets an IAA message. Next, the IAA is passed down to the controller and then reaches the device (SSD). After two internal events (DOC and DAC), the SSD issues DDTC to start the data transfer, which is done upon receiving DDTC. The SSD issues IIA(completion) to complete the cycle. Then, data is transmitted between the device driver and the controller via CDTCH. Finally, the controller issues an IIA(completion) to wrap up the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

A.3.10 SSD writes with smart controller (type:3)

The flowchart for SSD writes is shown in Figure A.13. It is based on a trace of a DiskSim run with the parameter file ssd-postmark3.parv, which configures the smart controller (type:3). As with reads, the smart controller processes all messages passing upward or downward. A cycle starts when the device driver gets an IAA message. The IAA is passed down to the controller. Next, data is transmitted between the device driver and the controller via CDTCH. Then, the IAA reaches the device (SSD). Next, the SSD issues IIA(reconnect) to reconnect the I/O bus, which is completed upon receiving IIC(reconnect); unlike the dumb controller, the smart controller processes the IIA and returns the IIC itself. Then, data is transferred via DDTC, which is done upon receiving DDTC. The SSD issues IIA(completion) to complete the cycle, which is done upon receiving IIC(completion). DOC and DAC are internal events in the SSD. Finally, the controller issues an IIA(completion) to wrap up the cycle. The device driver gets the IIA(completion) and returns an IIC(completion).

Figure A.11: Flowchart for SSD writes (controller type:1)
Figure A.12: Flowchart for SSD reads (controller type:3)
Figure A.13: Flowchart for SSD writes (controller type:3)

A.4 Post Processing Scripts

We wrote Bash scripts to automate runs of the modified DiskSim 4.0. The scripts have the following features:

1. Run DiskSim 4.0 with different parameters
2. Produce multiple output files
3. Extract the corresponding outcomes into files
4. Plot figures based on the above outcomes

The scripts consist of:

g_hplajw.sh  - main shell script
hplajw.sh    - sub shell script
process.sh   - output-file processing shell script
miss_ratio.p - gnuplot script for plotting the Miss Ratio figure
res_time.p   - gnuplot script for plotting the Response Time figure

To work with these scripts, copy all of the script files into DiskSim 4.0/valid and run:

sh g_hplajw.sh

Figures for miss ratio and response time will then be plotted, as in Figure A.14 and Figure A.15, respectively.

Figure A.14: Miss ratio for the hplajw workload (miss ratio vs. cache size in KB)
Figure A.15: Response time for the hplajw workload (response time in ms vs. cache size in KB)

A.4.1 g_hplajw.sh source code

#!/bin/bash
#---------------------------------------------------------------
# C.J. Wang
# This script calls DiskSim to produce data and plots the figures via gnuplot.
# This script assumes that it is under DiskSim4.0/valid.
#---------------------------------------------------------------
# variables that need to change based on the traces
#
FileBase=hplajw
SubShell=hplajw.sh
#---------------------------------------------------------------
echo "Producing $FileBase.txt (it takes up to 30 seconds)"
echo
sh ./$SubShell > ./$FileBase.txt
echo $FileBase.txt is done
echo

# gnuplot
echo "Producing figures (ftrfc, fmiss, frestime) ..."
echo
Param="call './miss_ratio.p' $FileBase"
gnuplot -e "$Param"
Param="call './res_time.p' $FileBase"
gnuplot -e "$Param"
echo figures are done.
echo

A.4.2 hplajw.sh source code

#!/bin/bash
#---------------------------------------------------------------
# C.J. Wang
#---------------------------------------------------------------
# variables that need to change based on traces
#
BASE_OUTFILE=hplajw
PARFILE=hplajw.parv;
TRACEFILE=ajw.1week.srt
TRACEFORMAT=hpl;
CacheSizeRange='10 100 1000 2000 3000 4000 5000'
#---------------------------------------------------------------
export OUTFILE
export PARFILE
export TRACEFILE
export TRACEFORMAT
export com_1
export par_1
export val_1
export com_2
export par_2
export val_2
export CacheSize

printf "#Cache Size(KB), MissRatio, Response Time(ms)\n"

for CacheSize in $CacheSizeRange
do
    com_1="disk*"
    par_1="Number of buffer segments"
    val_1=$CacheSize
    com_2="disk*"
    par_2="Maximum number of write segments"
    val_2=$val_1
    OUTFILE=$BASE_OUTFILE-$CacheSize.outv;
    sh ./process.sh
done

A.4.3 process.sh source code

#!/bin/bash
#---------------------------------------------------------------
# C.J. Wang
# This script extracts Miss Ratio and Response Time from $OUTFILE
# and prints them to stdout.
#---------------------------------------------------------------
PREFIX=../src

if ! [ -f $OUTFILE ]; then
    $PREFIX/disksim ${PARFILE} ${OUTFILE} ${TRACEFORMAT} \
        ${TRACEFILE-0} ${SYNTH-0} \
        "${com_1}" "${par_1}" ${val_1} \
        "${com_2}" "${par_2}" ${val_2}
fi

MissRatio=`grep "Buffer miss ratio" ${OUTFILE} \
    | head -1 | cut -d: -f2 | cut -f2`
ResTime=`grep "Overall I/O System Response time average" ${OUTFILE} \
    | cut -d: -f2`
CacheSizeInKB=`echo "scale=1; $CacheSize / 2.0" | bc`
printf "$CacheSizeInKB\t\t\t$MissRatio\t\t\t$ResTime\n"

A.4.4 miss_ratio.p source code

#---------------------------------------------------------------
# C.J. Wang
# gnuplot script file for plotting data in file
# gnuplot 4.4 patch level 0
#---------------------------------------------------------------
set autoscale        # scale axes automatically
unset log            # remove any log-scaling
unset label          # remove any previous labels
set xtic auto        # set xtics automatically
set ytic auto        # set ytics automatically
set title "Miss ratio"
set xlabel "Cache size (KB)"
set ylabel "Miss ratio"
set key right center
set macros
FileBase="`echo $0`"
FileTxt = sprintf("%s.txt", FileBase)
OutFile = sprintf("%s-miss-ratio.pdf", FileBase)
plot FileTxt using 1:2 title 'Miss Ratio' with linespoints
# output to .pdf
set size 1, 1
set terminal pdf enhanced color dashed lw 4 size 6, 6
set output OutFile
replot

A.4.5 res_time.p source code

#---------------------------------------------------------------
# C.J. Wang
# gnuplot script file for plotting data in file
# gnuplot 4.4 patch level 0
#---------------------------------------------------------------
set autoscale        # scale axes automatically
unset log            # remove any log-scaling
unset label          # remove any previous labels
set xtic auto        # set xtics automatically
set ytic auto        # set ytics automatically
set title "Response Time"
set xlabel "Cache size (KB)"
set ylabel "Response Time (ms)"
set key right center
set macros
FileBase="`echo $0`"
FileTxt = sprintf("%s.txt", FileBase)
OutFile = sprintf("%s-res-time.pdf", FileBase)
plot FileTxt using 1:3 title 'Response Time' with linespoints
# output to .pdf
set size 1.0, 1.0
set terminal pdf enhanced color dashed lw 4 size 6, 6
set output OutFile
replot