An Active and Hybrid Storage System for Data-intensive Applications

by

Zhiyang Ding

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 6, 2011

Keywords: Active Storage, Parallel and Distributed Computing, Storage Systems

Copyright 2011 by Zhiyang Ding

Approved by

Xiao Qin, Chair, Associate Professor of Computer Science and Software Engineering
Kai-Hsiung Chang, Chair and Professor of Computer Science and Software Engineering
David A. Umphress, Associate Professor of Computer Science and Software Engineering

Abstract

Since large-scale and data-intensive applications have been widely deployed, there is a growing demand for high-performance storage systems to support data-intensive applications. Compared with traditional storage systems, next-generation systems will embrace dedicated processors to reduce the computational load of host machines and will use hybrid combinations of different storage devices. We present a new architecture for an active storage system, which leverages the computational power of a dedicated processor, and show how it utilizes a multi-core processor and offloads computation from the host machine. We then address the challenge of making the active storage node cooperate with the other nodes in a cluster environment by designing a pipeline-parallel processing pattern, and we report the effectiveness of this mechanism. To evaluate the design, an open-source bioinformatics application is extended based on the pipeline-parallel mechanism. We also explore the hybrid configuration of storage devices within the active storage node. The advent of the flash-memory-based solid state disk has played a critical role in revolutionizing the storage world. However, instead of simply replacing the traditional magnetic hard disk with the solid state disk, researchers believe that finding a complementary approach that incorporates both of them is more challenging and attractive. Thus, we propose a hybrid combination of different types of disk drives for our active storage system. A simulator is designed and implemented to verify the new configuration. In summary, this dissertation explores the idea of active storage, an emerging storage configuration, in terms of the architecture and design, the parallel processing capability, the cooperation with other machines in a cluster computing environment, and a new disk configuration, namely the hybrid combination of different types of disk drives.

ii

Acknowledgments

I definitely could not have survived the Ph.D. process without the guidance, friendship, and support of many people. First and foremost, I will always be grateful to my advisors, Dr. Xiao Qin and Dr. Kai-Hsiung Chang, for all the invaluable guidance they have given me during my years at Auburn University. Dr. Qin, in particular, poured countless hours into my development ever since I started pursuing the Ph.D. degree. Under his supervision, I learned how to do research in the storage systems arena from scratch. Without them, it would be impossible for me to finish this dissertation. I gratefully thank my dissertation committee member, Dr. David A. Umphress. Dr. Umphress not only provided valuable academic advice for this dissertation, but also helped and guided me throughout my entire life at Auburn. I can still recall all that he has done for me when I first came to the US. I cannot appreciate it enough.
I also gratefully thank Professor Wei Wang who is the Chair and Alumni Professor of Graphic Design program for serving as the university reader. The many hours at the office were made both bearable and enjoyable by all the members of the research group. I would like to thank all my colleagues: Xiaojun Ruan, Shu Yin, James Majors, Jiong Xie, Yun Tian, Yixian Yang, Jianguo Lu and my former colleague, Adam Manzanares. As the neighbor in the office and research partner to me, Xiaojun, in particular, help me to solve a lot of research problems. Do appreciate for all his help. In addition, I would like to thank all my friends in Auburn, including Lei Chen, Bo Dai, Wei Wang, Tianxia Li, Tianzi Guo, Bin Xu, Jiawei Zhang, Wei Yuan, Chen Chen, Wei Huang, Chun Guo, Ying Zhu, Sihe Zhang, Rui Xu, Qiang Gu, Jingshan Wang, Lingzhao iii Kong, Jingyuan Xiong and etc. I will always miss the time we spent together and value our friendship. Last but not the least, my deepest gratitude goes to my family, especially my grandpa, Qi Ding who passed away in 2009, my grandma Lizhi Deng, and my parents Yuming Ding and Yan Bai for their years selfless support and unconditional love. iv Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 The Next Generation Data Storage is Active/Smart . . . . . . . . . . . . . . 4 2.1.1 Smart Disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.2 Active Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 The MapReduce Programming Model . . . . . . . . . . . . . . . . . . 6 2.2.2 Parallel Processing for Scientific Applications . . . . . . . . . . . . . 8 2.3 Solid State Disk and Hybrid Disk Combination . . . . . . . . . . . . . . . . 8 2.3.1 Hybrid Disk Combination . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Extended FTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 McSD: Multicore-Enabled Smart Storage for Clusters . . . . . . . . . . . . . . . 11 3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Smart Storage and MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.1 From Smart Disks to Smart Storage . . . . . . . . . . . . . . . . . . . 14 3.2.2 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 Design of the McSD Prototype . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 A Testbed for McSD . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 v 3.3.3 A Programming Framework . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.4 Putting It All Together - A Cluster with McSD . . . . . . . . . . . . 22 3.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.4.1 System Workflow and Configuration . . . . . . . . . . . . . . . . . . 22 3.4.2 Implementation of smartFAM . . 
. . . . . . . . . . . . . . . . . . . . 24 3.4.3 Partitioning and Merging . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4.4 Incorporating the Partitioning Module into Phoenix . . . . . . . . . . 28 3.5 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5.1 Experimental Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5.2 Single-Application Performance . . . . . . . . . . . . . . . . . . . . . 32 3.5.3 Multiple-Application Performance . . . . . . . . . . . . . . . . . . . . 33 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4 Using Active Storage to Improve the Bioinformatics Application Performance: A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.1.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2.1 Active Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2.2 Parallel Bioinformatic Applications . . . . . . . . . . . . . . . . . . . 41 4.3 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3.1 Active Storage for Clusters . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.2 Parallel Pipelined System . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4 Modeling and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5.1 Evaluation Environment . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5.2 Individual Node Evaluation . . . . . . . . . . . . . . . . . . . . . . . 51 vi 4.5.3 System Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 53 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5 HcDD: The Hybrid Combination of Disk Drives in Active Storage Systems . . . 61 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.1 SSD and Hybrid Storage . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.2 Internal Parallelism Processing on SSD . . . . . . . . . . . . . . . . . 64 5.2.3 Data Duplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 The Design of HcDD ? a Hybrid Combination of Disk Devices . . . . . . . . 67 5.3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.3.2 Hybrid Combination of Storage Drives . . . . . . . . . . . . . . . . . 69 5.3.3 Intra-parallelism buffer on SSD . . . . . . . . . . . . . . . . . . . . . 71 5.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4.1 The Evaluation Environment . . . . . . . . . . . . . . . . . . . . . . 75 5.4.2 Internal Parallelism Supported Buffer for SSD . . . . . . . . . . . . . 76 5.4.3 Deduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.4.4 System Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 80 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
86 6.1 McSD: Multicore-Enabled Smart Storage for Clusters . . . . . . . . . . . . . 86 6.2 Using Active Storage to Improve the Bioinformatics Application Performance: A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3 HcDD: Hybrid Combination of Disk Drives in Active Storage . . . . . . . . . 88 6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.1 Memory Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4.2 Wimpy System Board . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 vii Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 viii List of Figures 3.1 The work flow of MapReduce. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 McSD - The prototype of multicore-enabled smart storage. Each smart storage node in the prototype contains memory, a SATA disk drive, and a multicore- processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3 A testbed for the McSD prototype. A host computing node and an McSD storage node are connected via a fast Ethernet switch. The host node can access the disk drives in McSD through the networked file system or NFS. . . . . . . . . . . . . 20 3.4 The programming framework for a host computing node supported by an McSD smart storage node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 The implementation of smartFAM - an invocation mechanism that enables a host computing node to trigger data-intensive processing modules in an McSD storage node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Workflow of the extended Phoenix model with partitioning and merging . . . . 27 3.7 The workflow diagram of integrity checking. . . . . . . . . . . . . . . . . . . . . 28 3.8 Performance Speedup. It depicts speedups of partition-enabled Phoenix vs. orig- inal Phoenix and the sequential approach on both duo-core and quad-core ma- chines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.9 Single Application Performance. The data size is scaling from 500MB to 1.25GB. 30 ix 3.10 Optional caption for list of figures . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.11 Speedups of Matrix-multiplicy and String-match. Compared with the traditional smart storage (SD) running sequentially, our McSD improves the overall perfor- mance by 1.5x. When data size is increasing, McSD improves the performance of the non-partitioning approaches (the DuoC-SD and Host-only) by 2x. . . . . 34 4.1 A cluster involves a collection of computing nodes and active storage nodes. . . 42 4.2 pp-mpiBlast Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Pipeline Tasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 The workflow diagram of sequence-integrity checking. . . . . . . . . . . . . . . . 46 4.5 Workflow of the extended Phoenix model - intra-application pipeline . . . . . . 47 4.6 Time Consumption Trends Comparison: SSD vs. HDD. . . . . . . . . . . . . . 52 4.7 System Evaluation Results I: Execution time comparison between pp-mpiBlast system and native system (running on a computer cluster of 12 nodes). The pp-mpiBlast system contains a twelve nodes computing cluster and one ASN. 
Results are generated under 3 different partition sizes: 250 MB, 500 MB, and 1.25 GB, which are presented by three sub-figures, respectively from left to right. 54 4.8 System Evaluation Results II: Execution time comparison between pp-mpiBlast system and native system (running on a computer cluster of 13 nodes). The pp-mpiBlast system contains a twelve nodes computing cluster and one ASN. Results are generated under 3 different partition sizes: 250 MB, 500 MB, and 1.25 GB, which are presented by three sub-figures, respectively from left to right. 55 x 4.9 Speedup Trends: As input data size grows larger, the performance speedups of using pp-mpiBlast increase. Sub-figure on left is the comparison result between pp-mpiBlast and the 12-node testbed. And the right one is the result compared with the 13-node testbed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.10 Time Consumption Curves Comparison: Different Partition Size. . . . . . . . . 58 5.1 The System Configuration of HcDD ? a hybrid active storage system. . . . . . . 68 5.2 The System Architecture of HcDD. There are three modules: a controller, a deduplication engine, and some storage devices. . . . . . . . . . . . . . . . . . . 69 5.3 The Workflow of Hybrid Active Storage System. There are four steps: write to the buffer disk, calculate a hash value, compare the value against a value store, and write to the SSD or map to the table. . . . . . . . . . . . . . . . . . . . . . 71 5.4 The Workflow of Hybrid Active Storage System. . . . . . . . . . . . . . . . . . . 72 5.5 The Workflow of On-board Buffer. . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.6 Performance Comparison of Iozone and Postmark . . . . . . . . . . . . . . . . . 77 5.7 Performance Comparison of Random Write. . . . . . . . . . . . . . . . . . . . . 77 5.8 Performance Comparison of traceKernel. . . . . . . . . . . . . . . . . . . . . . . 78 5.9 Performance Comparison of tracePhoenix.. . . . . . . . . . . . . . . . . . . . . . 79 5.10 The number of removed requests and the deduplication ratio. . . . . . . . . . . 80 5.11 Processing Time Comparison between HcDD and traditional hybrid storage. . . 81 5.12 Average Response Time Comparison between HcDD and traditional hybrid storage. 82 xi 5.13 Response Time Comparison between HcDD and SSD. . . . . . . . . . . . . . . . 83 5.14 Average Response Time Comparison between HcDD and SSD. . . . . . . . . . . 84 6.1 The Work Flow of PASS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 xii List of Tables 2.1 NAND Flash Memory Lifetime (P/E Cycles) . . . . . . . . . . . . . . . . . . . 10 3.1 The Configuration of the 5-Node Cluster . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Running time (in seconds) of performing the Word-Count and String-Match benchmarks w/ and w/o partition function under different input data size (in GBytes) on single node. The testbed machine contains 2 GBytes main memory. 48 4.2 The Test Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3 Time cost (in seconds) of performing mpiformatdb program under different input data size (in MB) on an ASN. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.1 NAND Flash Memory Lifetime (P/E Cycles) . . . . . . . . . . . . . . . . . . . 64 5.2 Num of Read/Write Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
76 xiii

Chapter 1  Introduction

Since large-scale and data-intensive applications have been widely deployed, there is a growing demand for high-performance storage systems to support data-intensive applications. Compared with traditional storage systems, next-generation data storage will embrace computing capacity to reduce the computational load of host processors in computing nodes. Existing hard disks have limited processing power, which can only be used to handle on-disk scheduling and physical resource management. With the advance of processor and memory technologies, future smart storage systems are promising devices to perform complex on-disk operations [97].

Active disks (a.k.a. smart disks), in which processors are embedded, leverage the computational power available in commodity disk drives to deal with application-level processing [87]. Recently, smart disks coupled with embedded processors have been proposed to address the needs of on-drive data-intensive workloads. Active storage brings three key advantages. First, the amount of data moved back and forth between computing nodes and storage nodes in clusters can be significantly reduced, since large datasets can be locally processed by storage nodes before being forwarded to computing nodes. Second, data-intensive applications run faster, because active storage nodes accelerate data processing operations. If computing nodes and active storage nodes coordinate efficiently, both computing nodes and storage nodes can perform data processing in parallel. Third, network performance in clusters can be improved due to the reduced amount of data moved into and out of storage nodes.

In the past, single-core processors were integrated into smart disks to improve the I/O performance of data-intensive applications by manipulating data directly on the disks [38][48][69][97]. However, an architecture of energy-efficient processors on a single chip does not necessarily guarantee high performance. Thus, improving the utilization of advanced multi-core processors has been a thorny subject in computer systems research. We design the architecture, programming model, and communication interface of an active/smart storage node. To fully utilize multi-core processors in our active storage, we incorporate a MapReduce programming model with the active storage (see Chapter 3).

There are two main challenges in applying active storage to support data-intensive applications on clusters. The first challenge is to partition a parallel application into computation-intensive and data-intensive tasks. If such a partition is successfully created, computing nodes will handle computation-intensive tasks whereas active storage nodes will run data-intensive tasks. The second challenge lies in the coordination between computing nodes and active storage nodes. When it comes to applications where computation-intensive tasks are independent of data-intensive tasks, computing nodes and active storage nodes are non-blocking to each other, meaning that computing and storage nodes can easily operate in parallel. However, if computing nodes have to wait for storage nodes to catch up, the blocked computing nodes could slow down data-intensive applications. In Chapter 4, to solve the blocking problem incurred by synchronized computing and storage nodes, we developed a pipelining mechanism that exploits parallelism among data processing transactions in a sequential transaction stream.
We report the effectiveness of the pipelining mechanism, which leverages active storage to maximize the throughput of data-intensive applications on a high-performance cluster. To demonstrate the effectiveness of the pipelining mechanism, we implemented a pipelined application called pp-mpiBLAST, which extends an open-source parallel BLAST tool, mpiBLAST [34]. We also develop an analytic model to study the scalability of pp-mpiBLAST on large-scale clusters.

The third part of this dissertation focuses on the configuration of storage devices within the active storage node. Recently, the use of NAND-flash-based solid state devices (hereinafter referred to as SSDs) has extended into the primary system storage of enterprise servers and data centers. However, in those areas, users still hesitate to perform a large-scale deployment of SSDs because of their poor reliability and limited lifespan. As the cost of NAND flash has declined with increased density, the number of erase cycles a flash cell can tolerate, on which the number of write operations depends (because of the erase-before-write characteristic), has suffered. Meanwhile, server applications such as OLTP (Online Transaction Processing) [44] normally demand a high-performance and highly reliable underlying storage system. Thus, instead of simply replacing magnetic hard disk drives with SSDs, researchers [49][68][51][80][33] believe that finding an appropriate and complementary approach to balance performance, reliability, and cost is more challenging and attractive. In Chapter 5, we propose a hybrid combination of disk arrays designed for the active storage system, which uses hard disk drives as a write buffer to cache write requests and de-duplicate the redundant requests. This chapter also introduces an enhanced version of the on-SSD buffer, which supports an internal-parallelism processing algorithm. Together, the goal is to minimize the writes sent to the SSD without significantly impacting performance; by doing so, the system reduces the number of erase cycles and thus extends the SSD's lifetime.

In summary, this dissertation is organized into the following chapters: Chapter 3 proposes and designs the new active storage architecture and its programming interface, and explores its parallel processing potential; Chapter 4 presents a pipeline-parallel processing pattern that coordinates the active node with other machines in a cluster computing environment; the exploration of the hybrid combination of different types of disk drives is described in Chapter 5; and, finally, Chapter 6 summarizes this dissertation.

Chapter 2  Related Work

2.1 The Next Generation Data Storage is Active/Smart

Since large-scale and data-intensive applications have been widely deployed, there is a growing demand for high-performance storage systems to support data-intensive applications. Compared with the CPU and memory, which are developing at a rapid pace, the development of storage devices lags behind. Five primary factors have catalyzed the evolution of storage architectures: I/O-bound workloads, improved disk drive attachment technologies, increased on-drive transistor density, emergence of new interconnects, and the cost of storage systems [38]. Next-generation data storage may embrace computing capacity to offload computation from host processors and improve the overall I/O performance.
The active or smart disk, empowered by a dedicated on-board processor, has been introduced as a promising attempt at a new storage model.

2.1.1 Smart Disk

Active/smart disks have been implemented in various forms. For example, in an active disk model proposed by Uysal et al., on-disk application processing becomes possible because of large on-disk memory [97]. Another active disk model designed by Mustafa et al. largely relies on stream-based programming, in which host-resident code interacts with disk-resident code using streams and a code-partitioning scheme [70]. Memik et al. developed a smart disk architecture, where the operation bundling concept was introduced to further optimize database query executions [69]. Chiu et al. investigated fully distributed processor-embedded smart disks [22]. Pushing computation workloads to memory is another offloading choice [59].

The term smart disk may be used interchangeably with other terms such as intelligent disk (IDISK) [55], SmartSTOR [48], processor-embedded disks [22], and semantically smart disks [92]. Note that IDISK uses an on-disk integrated processor-in-memory to exploit emerging VLSI technologies [55]; SmartSTOR [48] consists of a processing unit coupled to one or more disks [48]; and semantically smart disks [92] contain various high-level functionalities. The interface of object-based storage devices has been standardized recently [86]. Some storage device manufacturers, such as Seagate [89], are also developing object-based storage devices.

2.1.2 Active Storage

From the perspective of cluster computing, active storage can be implemented at the storage node level [71][32][91]. Current practice for data-intensive applications on high-performance clusters often results in high I/O communication overhead between computing nodes and storage nodes in the cluster. When massive amounts of data must be transferred back and forth between parallel computing nodes and storage systems, the performance of cluster computing applications can suffer from network bandwidth saturation. One efficient approach to reducing the network traffic caused by moving data between computing nodes and storage systems is to incorporate computing capacities into storage systems, thereby offloading some data-intensive computing tasks from clusters to their storage nodes.

ASF [35] and FAWN [7] are two new examples of active storage implementations at the storage node level. Fitch et al. developed the Active Storage Fabrics (ASF) model to address petascale data-intensive challenges [35]. ASF aims at transparently accelerating host workloads by close integration at the middleware data/storage boundary or directly by data-intensive applications [35]. FAWN, developed by Andersen et al., couples embedded CPUs to small amounts of local flash storage [7]. Andersen et al. used FAWN as a building block to construct a cluster in which computation and I/O capabilities are balanced to improve the energy efficiency of the cluster running data-intensive parallel applications. Active storage has also been explored in terms of handling unstructured data [?] and working in a Lustre file system [79].

2.2 Parallel Programming

The IT industry has improved the cost-performance of sequential computing by about 100 billion times over the past 60 years [78]. The approach, which increased transistor count and power dissipation, worked well until it hit the limit on the power a chip is able to dissipate.
After crashing into the power wall, the industry decided to replace single power-inefficient processors with multi-core processors. Although exponential growth in processor transistor counts still follows Moore's prediction, a high transistor density in multi-core processors does not always guarantee great practical performance on applications, due to the lack of parallelism. There are cases where a roughly 45% increase in processor transistors has translated into only a roughly 10-20% increase in processing power [90]. Therefore, an open issue addressed in this study is how to enable data-intensive applications to exploit parallelism both in the single-node environment and in the cluster computing arena. We believe that well-designed parallel programming APIs and models are critical to benefiting from multiple machines or multi-core processors.

2.2.1 The MapReduce Programming Model

MapReduce is a parallel programming model developed by Google for simplified data processing in data-intensive applications running on large clusters [26]. Map and Reduce, the two primitives in MapReduce, are borrowed from functional programming. When the Map function is called, the user's input data is partitioned into M pieces, which are processed by copies of the program running on the nodes of a cluster. One of the program copies becomes a master program managing the entire execution. In each MapReduce program, the Map function first takes input data specified by the user and outputs a list of intermediate key/value pairs (key, value). Then, the Reduce function takes all intermediate values associated with the same key and produces a list of result key/value pairs. The Reduce function typically performs some kind of merging operation. Finally, the output pairs are sorted by their key value.

Like the Google file system, MapReduce is not open source software. Google only describes the MapReduce idea without implementation details. Thus, various open source implementations of MapReduce are available for different computing platforms like clusters [9], multi-core systems [82], multiprocessor systems [82], and graphics processing units (GPUs) [46].

Hadoop [9]. Hadoop is a Java software framework implemented by Apache to support data-intensive distributed applications. Inspired by Google's MapReduce and the Google file system, the Hadoop project created its own versions of MapReduce and the Hadoop Distributed File System. Hadoop applications can be deployed easily by configuring a few variables (paths and nodes); Hadoop defines one master node that manages all systems and jobs, and other worker nodes. Hadoop is so far the most complete quasi-open-source version of MapReduce in the cluster arena.

Mars [46]. MapReduce has been implemented on NVIDIA GPUs using CUDA in the Mars project. Mars takes advantage of GPUs by using their capability for massive threading. In the Mars implementation, a large number of physically collocated mappers and reducers run in multiple threads. The functions in Mars are classified into two categories: runtime system functions and user-defined functions. Our McSD is different from Mars in the sense that McSD automatically coordinates data-processing activities between host CPUs and multi-core processors embedded in smart disks.

Phoenix [82]. Phoenix is an implementation of Google's MapReduce for shared-memory multi-core and multi-processor systems.
There are two categories of functions in Phoenix: the first group is provided by the Phoenix runtime environment; the second one is defined by programmers. Ranger et al. concluded that Phoenix is a promising MapReduce implementation for scalable performance on multi-core and multi-processor systems. Different from other MapReduce implementations on clusters, Phoenix is independent of any parallelizing compiler. In our study, we evaluated the suitability of the Phoenix implementation for multicore-embedded smart disks.

2.2.2 Parallel Processing for Scientific Applications

Parallel computing can improve the performance of scientific applications such as sequence database search tools in the bioinformatics field. For example, given a database and query sequences (e.g., DNA or amino-acid sequences), the search tools search for similarities between the query sequences and known sequences in the database. The tools enable scientists to quickly identify the function of newly discovered DNA sequences or to accurately identify species sharing a common ancestor [64]. Among many sequence search tools, BLAST is one of the most popular tools used on a daily basis by bioinformatics researchers. mpiBLAST [34] is a promising, open-source, parallel implementation of the BLAST toolkit [5]. Like other bioinformatics applications, mpiBLAST has to pre-process (e.g., format) databases before searching for similarities between query sequences and known sequences in the preprocessed databases. Thus, in Chapter 4 we implement a pipelined application, pp-mpiBlast, to demonstrate a way of employing active storage to improve the performance of data-intensive bioinformatics applications. Our pp-mpiBlast incorporates a data-processing pipeline with mpiBlast on a high-performance cluster.

2.3 Solid State Disk and Hybrid Disk Combination

The NAND-flash-memory-based SSD, which used to be confined to mobile devices and laptops, plays a critical role in revolutionizing the storage system [21][52][54][56][67][81]. "Tape is dead, disk is tape, flash is disk" [39]. SSDs are built entirely on semiconductor chips without any moving parts, which results in high random access performance and lower power consumption. Moreover, the cost of commodity NAND flash, often cited as the primary barrier to SSD deployment [73], has dropped significantly, increasing the possibility of dynastic changes in the storage arena. However, reliability and write performance are still two major concerns standing against large-scale deployment of SSDs. Hybrid combinations of storage devices and enhanced Flash Translation Layers (FTLs) have been proposed to address these issues.

2.3.1 Hybrid Disk Combination

The San Diego Supercomputer Center (SDSC) has built a large flash-based cluster called Gordon, which adopts 256 TB of flash memory [15]. This academic project is backed by $20 million in funding from the National Science Foundation. Different from the academic case, however, enterprise users still hesitate to perform a large-scale deployment of SSDs because of their relatively higher cost, poor reliability, and limited lifespan. In terms of reliability and lifespan concerns, the challenge is that each block on flash-based storage media tolerates a limited number of erasures and has to be erased before it is rewritten. Erase operations reduce performance, reliability, and lifetime. Table 2.1 shows the lifespan of single-level cell and multi-level cell NAND memory in terms of P/E (Program/Erase) cycles.
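As a rough back-of-envelope illustration (our own simplification, assuming ideal wear leveling and a hypothetical write-amplification factor WA, rather than the exact model of any vendor calculator), the write endurance of a drive can be approximated as

    lifetime ≈ (capacity × P/E cycles) / (application write rate × WA).

For example, a 200 GB MLC drive rated at 5,000 P/E cycles that absorbs 50 MB/s of application writes with WA ≈ 1.5 lasts about (200 GB × 5,000) / (75 MB/s) ≈ 1.3 × 10^7 seconds, or roughly 150 days, which is in line with the vendor estimate cited next.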
Based on the SSD lifetime calculator provided by the Virident website [40], the lifetime of a 200 GB MLC-based SSD could be only 160 days if the application write rate is 50 MB/s. Thus, HDDs are still regarded as indispensable in the storage hierarchy because of their merits of low cost, huge capacity, above-average reliability, and fast sequential access speed. Instead of simply replacing HDDs with SSDs, researchers [49][68][51][80][33][95][102] believe that finding an appropriate and complementary approach to balance performance, reliability, and cost is more challenging and attractive. So far there are many hybrid combinations of storage devices. Chang presented a hybrid approach to large SSDs, which combines MLC flash and SLC flash [17]. Yoon et al. propose a high-performance Flash/FRAM hybrid SSD architecture, in which metadata used by the flash translation layer (FTL) is maintained in a small FRAM that targets small random writes [101]. Soundararajan proposed a hybrid disk architecture which uses the magnetic hard disk as the write buffer for the SSD [93].

Table 2.1: NAND Flash Memory Lifetime (P/E Cycles)

Generation    SLC        MLC      eMLC
5X            100,000    10,000   N/A
3X            100,000    5,000    35,000
2X            100,000    2,500    N/A

2.3.2 Extended FTL

Unlike magnetic hard disk drives, SSDs have a Flash Translation Layer (FTL), which is implemented to emulate a hard disk drive by exposing an array of logical block addresses (LBAs) to the host. The FTL spreads the erasure workload evenly across the flash-based storage. A bad FTL algorithm not only reduces the SSD's performance but also wears out the SSD's storage units rapidly. Chen and Zhang proposed CAFTL, a content-aware flash translation layer enhancing the lifespan of flash-memory-based SSDs [20]. CAFTL removes unnecessary duplicate writes to reduce write traffic to flash memory. Gupta and Urgaonkar proposed the Demand-based Flash Translation Layer (DFTL), which selectively caches page-level address mappings [41]. A journal remapping algorithm, JFTL, was presented by Choi and Park [23]. JFTL writes all data to a new region in an out-of-place update process by using an address mapping method [23].

Most current research attempts to design new Flash Translation Layer algorithms to improve reliability and enhance performance by utilizing the built-in on-board cache. Hence, many research projects have used this cache as a write buffer. Kang applied non-volatile RAM (NVRAM) as a write buffer for SSDs to improve overall performance [53]. Kim and Ahn proposed a buffer management scheme called BPLRU for improving random writes in flash storage; BPLRU buffers writes to improve the performance of random writes [57]. Soundararajan and Prabhakaran presented Griffin, a hybrid storage device, to buffer large sequential writes in hard drives [94]. Park and Jung also presented write buffer-aware address mapping for flash memory devices [77].

Chapter 3  McSD: Multicore-Enabled Smart Storage for Clusters

3.1 Motivation

Since large-scale and data-intensive applications have been widely deployed, there is a growing demand for high-performance storage systems to support data-intensive applications. Compared with traditional storage systems, next-generation data storage will embrace computing capacity to reduce the computational load of host processors in computing nodes. Existing hard disks have limited processing power, which can only be used to handle on-disk scheduling and physical resource management.
With the advance of processor and memory technologies, future active storage systems are promising devices to perform complex on-disk operations [97].

Smart disks (a.k.a. active disks), in which processors are embedded, leverage the computational power available in commodity disk drives to deal with application-level processing [28][30][87][66][50]. Recently, active disks coupled with embedded processors have been proposed to address the needs of on-drive data-intensive workloads. In the past, single-core processors were integrated into active disks to improve the I/O performance of data-intensive applications by manipulating data directly on the disks [38][69][48][97]. Smart disks avoid moving data back and forth between storage devices and host processors. Since there is no smart disk product on the market, we decided to investigate smart storage nodes rather than smart disks in this study. Target areas for smart/active disks include content-aware services [24][45], unstructured data management [84][98], offloaded virus protection [25][75], and so on.

For the past two decades, hardware designers have used the rapidly increasing transistor speed made possible by silicon technology advances to double performance every 18 months [11]. However, the power consumption issue has been raised as a side effect [14][43][42][58]. Unfortunately, approaches based on increasing transistor count and clock rate recently crashed into the power wall: increasing the transistor count hits the power limit that a chip is able to dissipate. The industry decided to replace single power-inefficient processors with multi-core processors. There is an increasing need to integrate multi-core processors with devices and peripherals. Thanks to escalating manufacturing technology, it is feasible to embed multi-core processors into storage nodes for high-performance computing. This demand and trend motivate us to develop a programming framework and a prototype for multicore-enabled smart storage (hereinafter referred to as McSD) for clusters in general and MapReduce clusters in particular.

An architecture of energy-efficient processors on a single chip does not necessarily guarantee high performance. Thus, improving the utilization of advanced multi-core processors has been a thorny subject in computer systems research [60]. Phoenix, an implementation of the MapReduce model, automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance in multicore processor systems [82]. Phoenix includes a programming API and an efficient runtime system. Phoenix allows programmers to write functional-style code that improves the utilization of multicore processors through automatic parallelization and scheduling. To fully utilize the multi-core processors in McSD, we incorporated Phoenix into multicore-embedded smart disks. Note that MapReduce is Google's programming model for scalable parallel data processing [26]. In addition to Phoenix, there exists a wide range of MapReduce implementations tailored for various computing platforms [9][46][83].

The difference between our McSD approach and conventional smart disks is two-fold. First, the focus of McSD is smart storage nodes rather than smart disks. Second, the goal of McSD is to take the performance benefits of multi-core processors, not single-core processors, embedded in storage nodes.

Six New Features. When architecting an McSD smart storage system in a high-performance cluster, the following six features will be implemented:
• A two-layer cluster computing architecture contains host computing nodes and smart storage nodes.

• Improved I/O performance is achieved by combining the processing capabilities of both computing and storage nodes.

• The McSD storage system allows programmers to write MapReduce-like code that can automatically offload data-intensive computation to smart storage.

• Smart storage nodes can communicate with their host computing nodes via a storage interface.

• A programming framework for McSDs allows smart storage nodes to take full advantage of the multi-core processors in the storage nodes.

• The APIs and runtime environment in our McSD programming framework automatically handle computation offloading, data partitioning, and load balancing.

We will show how to use the McSD programming framework to implement a few real-world applications like word count, string matching, and matrix multiplication.

Main Contributions. In summary, the four major contributions of this study are:

• A prototype of next-generation multicore-enabled smart data storage.

• A programming framework, which includes MapReduce-like programming APIs and a runtime environment for multicore-based smart storage in the context of clusters.

• Development of three benchmark applications to test McSD for clusters.

• Single-application experimental results and multiple-application performance evaluation.

In the following Section 3.2, we review the background information and previous related research that motivate and inspire this study. In Section 3.3, we describe the design issues of the prototype of multicore-enabled smart storage. Section 3.4 presents implementation details of the McSD runtime environment and McSD programming APIs. Experimental results and performance evaluation are discussed in Section 3.5. Finally, Section 3.6 concludes the chapter with future research directions.

3.2 Smart Storage and MapReduce

An increasing number of large-scale data-intensive applications impose performance demands on storage systems, which are the performance bottleneck of various computing platforms. One way to boost I/O performance is to embed processors into hard disk drives, thereby offloading data-intensive computation from host CPUs to hard disks. While the processing capacity in today's disk drives is only used to manage on-disk scheduling and resource management, future disks can be equipped with dedicated processors to perform complicated data-intensive operations. We call disk drives in which processors are incorporated smart disks or active disks.

3.2.1 From Smart Disks to Smart Storage

Five primary factors have catalyzed the evolution of storage architectures: I/O-bound workloads, improved disk drive attachment technologies, increased on-drive transistor density, emergence of new interconnects, and the cost of storage systems [38]. Existing smart disk prototypes consist of, from a hardware perspective, an embedded processor, a disk controller, on-disk memory, local disk space, and a network interface controller (NIC). From a software perspective, a smart disk is comprised of an embedded operating system, a database engine, programming APIs, and the like. Unlike stand-alone PCs, smart disks do not contain I/O components such as a keyboard and display. Smart drives may directly connect to their host processors through NICs.
In our prototype, smart disks or smart storage nodes are connected to host CPUs using a file-alteration-monitor mechanism, which allows a smart storage node to communicate with its host computing node without relying on a keyboard and display unit.

Single-core embedded smart disks have been implemented in various forms. For example, in an active disk model proposed by Uysal et al., on-disk application processing becomes possible because of large on-disk memory [97]. Another active disk model designed by Mustafa et al. largely relies on stream-based programming, in which host-resident code interacts with disk-resident code using streams and a code-partitioning scheme [70]. Memik et al. developed a smart disk architecture, where the operation bundling concept was introduced to further optimize database query executions [69]. Chiu et al. investigated fully distributed processor-embedded smart disks [22]. Pushing computation workloads to memory is another offloading choice [59].

The term smart disk may be used interchangeably with other terms such as intelligent disk (IDISK) [55], SmartSTOR [48], processor-embedded disks [22], and semantically smart disks [92]. Note that IDISK uses an on-disk integrated processor-in-memory to exploit emerging VLSI technologies [55]; SmartSTOR [48] consists of a processing unit coupled to one or more disks [48]; and semantically smart disks [92] contain various high-level functionalities. The interface of object-based storage devices has been standardized recently [86]. Some storage device manufacturers, such as Seagate [89], are also developing object-based storage devices.

Due to a combination of reasons, no off-the-shelf smart/active disk product is available on the market. We aimed at implementing an active/smart storage system for clusters [6]; therefore, we decided to focus on active storage nodes rather than active/smart disks in this section.

Figure 3.1: The workflow of MapReduce.

3.2.2 Parallel Programming

The IT industry has improved the cost-performance of sequential computing by about 100 billion times over the past 60 years [78]. A high transistor density in multi-core processors does not always guarantee great practical performance on applications, due to the lack of parallelism. There are cases where a roughly 45% increase in processor transistors has translated into only a roughly 10-20% increase in processing power [90]. Therefore, an open issue addressed in this study is how to enable data-intensive applications to exploit parallelism in smart disks coupled with embedded multi-core processors. We believe that well-designed parallel programming APIs must be implemented for future smart disks that can benefit from multi-core processors.

MapReduce is a programming model developed by Google for simplified data processing in data-intensive applications running on large clusters [26]. Map and Reduce, the two primitives in MapReduce, are borrowed from functional programming. When the Map function is called, the user's input data is partitioned into M pieces, which are processed by copies of the program running on the nodes of a cluster. One of the program copies becomes a master program managing the entire execution. In each MapReduce program, the Map function first takes input data specified by the user and outputs a list of intermediate key/value pairs (key, value).
Then, the Reduce function takes all intermediate values associated with the same key and produces a list of result key/value pairs. The Reduce function typically performs some kind of merging operation. Finally, the output pairs are sorted by their key value. Fig. 3.1 illustrates the workflow of the MapReduce model. The main benefit of MapReduce lies in its simplicity. Programmers only provide a simple description of an algorithm that focuses on functionality and leaves the actual parallelization and concurrency management to a MapReduce runtime system.

Like the Google file system, MapReduce is not open source software. Google only describes the MapReduce idea without implementation details. Thus, various open source implementations of MapReduce are available for different computing platforms like clusters [9], multi-core systems, multiprocessor systems [82], and graphics processing units (GPUs) [46].

Hadoop [9] is a Java software framework implemented by Apache to support data-intensive distributed applications. Inspired by Google's MapReduce and the Google file system, the Hadoop project created its own versions of MapReduce and the Hadoop Distributed File System. Hadoop applications can be deployed easily by configuring a few variables (paths and nodes); Hadoop defines one master node that manages all systems and jobs, and other worker nodes. Hadoop is so far the most complete quasi-open-source version of MapReduce in the cluster arena.

Phoenix [82] is an implementation of Google's MapReduce for shared-memory multi-core and multi-processor systems. There are two categories of functions in Phoenix: the first group is provided by the Phoenix runtime environment; the second one is defined by programmers. Ranger et al. concluded that Phoenix is a promising MapReduce implementation for scalable performance on multi-core and multi-processor systems. Different from other MapReduce implementations on clusters, Phoenix is independent of any parallelizing compiler. In our study, we deploy an extended Phoenix implementation on our multicore-embedded active/smart disks.

Figure 3.2: McSD - the prototype of multicore-enabled smart storage. Each smart storage node in the prototype contains memory, a SATA disk drive, and a multicore processor.

3.3 Design Issues

A growing number of data-intensive applications, coupled with advances in processors, indicate that it is efficient, profitable, and feasible to offload data-intensive computations from CPUs to hard disks [87]. Our preliminary results show that we can improve the performance of cluster computing applications by offloading computations from computing nodes to storage nodes.

To improve the performance of large data-intensive applications, we designed McSD, a prototype of multicore-enabled smart data storage. Different from the existing smart-disk solutions, McSD not only addresses the performance needs of data-intensive applications using multi-core processors, but also focuses on smart storage nodes rather than smart disks.

Fig. 3.2 depicts the McSD prototype, where each smart storage node contains a multicore processor, memory, and a SATA disk drive. In what follows, let us address the following design issues.

• How to build a testbed where an McSD smart storage node is connected to a host computing node?

• How to evaluate the performance of McSD in the testbed and a cluster computing environment?
• How to fully utilize a multi-core processor available in McSD?

• What is the programming framework for McSD?

• How to pass input parameters from a host computing node to its McSD, a smart storage node?

• How to return results from the McSD storage node to the host computing node?

3.3.1 Design of the McSD Prototype

In our McSD prototype, we integrate a multi-core processor, a disk controller, main memory, a local disk drive, and a network interface controller (NIC) into a smart storage node. The storage interface of existing smart-disk prototypes (see Section 3.2.1 for details on existing smart disks) is not well designed, because the existing prototypes simply represent a case where host CPUs and embedded processors are coordinated through the network interfaces or NICs in smart disks. To fully utilize the storage interface in smart data storage, we designed a communication mechanism similar to the file alteration monitor. In our McSD prototype, a host computing node communicates with a disk drive in McSD via its storage interface rather than the NIC. In doing so, we made the smart disk prototype cost-effective, since no NIC is needed. Without using NICs, the McSD prototype can adequately represent all the important features of our proposed smart data storage. The design details are described in the following two subsections.

3.3.2 A Testbed for McSD

Recall that although a few smart disk prototypes have been developed, there are no off-the-shelf commodity smart disks. Instead of simulating a smart storage system, we built a testbed for the McSD prototype.

Figure 3.3: A testbed for the McSD prototype. A host computing node and an McSD storage node are connected via a fast Ethernet switch. The host node can access the disk drives in McSD through the networked file system, or NFS.

Fig. 3.3 briefly outlines the McSD testbed, where two computers are connected through fast Ethernet. The first computer in the testbed plays the role of the host computing node, whereas the second one serves as the McSD smart storage node. The host computing node can access the disks in the McSD node through the networked file system, or NFS, which allows a client computer to access files on a remote server over a network interconnect. In our testbed the host computing node is the client computer; the McSD node is configured as an NFS server. We chose NFS as an efficient means of connecting the host computing node and the smart storage node, because data transfers between the host and smart storage nodes are handled by NFS.

We run three real-world applications as benchmarks on this testbed to evaluate the performance of the McSD prototype. The benchmarks considered in our experiments (see Section 3.5) include word count, string matching, and matrix multiplication.

Figure 3.4: The programming framework for a host computing node supported by an McSD smart storage node.

3.3.3 A Programming Framework

Fig. 3.4 shows a programming framework for a host computing node supported by an McSD smart storage node. The framework generates an optimized operation plan for data-intensive programs running on the McSD testbed (see Section 3.3.2 for the description of the testbed), where there is a host computing node and an McSD smart storage node.
The framework automatically assigns general-purpose operations to the host computing node and offloads data-intensive operations to the McSD storage node, in which Phoenix (see Section 2.2.1 for the description of Phoenix) handles parallel data processing. Although applying Phoenix in the McSD node cannot increase performance for all applications running in our testbed, Phoenix can substantially boost the performance of data-intensive applications (see Section 3.5). Because this programming framework provides relatively flexible autonomy, data processing modules (e.g., word-count, sort, and other primitive operations) can be readily added into an McSD smart storage node.

To seamlessly integrate Phoenix into an McSD smart system, we addressed the issue of limited embedded memory in McSD by implementing new functions like data partitioning, which splits input data files whose memory footprints exceed the memory capacity of the McSD smart data storage. The implementation details of the new functions in the McSD programming framework can be found in Section 3.4.

3.3.4 Putting It All Together - A Cluster with McSD

We can build both stand-alone computing systems and active storage farms using McSD.

Stand-Alone Computing Systems. A set of McSD nodes can be connected together to serve as a stand-alone data-intensive computing system, where each McSD node runs data-intensive applications whose data are locally stored in each storage node. The cluster of McSD nodes can connect to back-end storage servers supporting data services such as data deduplication, data backup, and recovery.

Smart Storage Farm. McSD can be used to build a smart storage farm that is connected to a front-end high-performance computing cluster. In this case, data-intensive applications run on the front-end cluster and data processing services are supported by the McSD nodes of the smart storage system. The smart McSD storage nodes can improve the front-end cluster's I/O bandwidth by greatly reducing data movement between the front-end cluster and the McSD storage system.

3.4 Implementation Details

3.4.1 System Workflow and Configuration

Unlike previous network-attached smart storage, the McSD smart storage uses the SATA interface to transfer data. We implemented the McSD prototype using a host computing node and a multicore-enabled storage node (see Sections 3.3.1 and 3.3.2 for the design issues of the prototype). In the prototype, the multicore storage node has no keyboard, mouse, or display unit. It is worth noting that storage nodes in other existing smart-disk prototypes have keyboard and mouse activities. Compared with the other earlier prototypes, our McSD prototype better resembles next-generation multicore-enabled smart storage systems for clusters. Smart storage nodes connected to cluster computing nodes only need to process on-node data-intensive operations. In other words, smart storage nodes only provide some primitive functions termed data-intensive processing modules (or processing modules for short) in the McSD prototype.

Fig. 3.3 shows the hardware configuration of the McSD prototype, where a host computing node is connected to an McSD storage node through the SATA bus interface. One of the most important implementation issues is to allow a host computing node to offload data-intensive computations to McSD. There are two general approaches to implementing computation offloading.
First, each offloaded data-intensive operation or module is delivered from a host node to an McSD storage node (hereinafter referred to as McSD or the McSD node) when the operation or module needs to be processed by McSD. Second, all data-intensive operations and modules are preloaded and stored in the McSD storage node. Although the first approach can handle the dynamic case where data-intensive operations/modules are not predictable, its downside lies in the high communication overhead between host computing nodes and McSD storage nodes. The second approach reduces the communication overhead caused by moving data-intensive operations/modules, because the operations/modules reside in McSD prior to the execution of the data-intensive programs.

In the process of implementing the McSD prototype, we took the second approach - preloading data-intensive modules. We believe that the preloading approach is practical for a vast variety of real-world applications, where data-intensive processing modules can be determined before the programs are executed in a host computing node accompanied by an McSD smart storage node. In our preloading approach, the program running on the computing node has to invoke the processing modules preloaded to the McSD node. An invocation mechanism, called the smart file-alteration monitor (smartFAM), was implemented to enable the host node to readily trigger the processing modules in the McSD node. The implementation issues of smartFAM are addressed in the next subsection.

Figure 3.5: The implementation of smartFAM - an invocation mechanism that enables a host computing node to trigger data-intensive processing modules in an McSD storage node.

3.4.2 Implementation of smartFAM

Fig. 3.5 illustrates the implementation of smartFAM - an invocation mechanism that enables a host computing node to trigger data-intensive processing modules in an McSD storage node. smartFAM mainly contains two components: (1) the inotify program - a Linux kernel subsystem that provides file-system event notification; and (2) a daemon program that invokes on-node data-intensive operations or modules.

To make our McSD prototype closely resemble future multicore-enabled smart storage, we connected the host node with the McSD smart-storage node using the Linux network file system (NFS). In the NFS configuration, the host node plays the client role whereas the McSD node acts as a file server. A log-file folder, created in NFS at the server side (i.e., the McSD smart-storage node), can be accessed by the host node via NFS. Each data-intensive processing module/operation has a log file in the log-file folder. Thus, when a new data-intensive module is preloaded to the McSD node, a corresponding log file is created. The log file of each data-intensive module is an efficient channel for the host node to communicate with the smart-storage node (McSD node). For example, suppose that a data-intensive module in the McSD node has input parameters. The host node can pass the input parameters to the data-intensive module residing in the McSD node through the corresponding log file: the host writes the input parameters to the log file, which is monitored and read by the data-intensive module.
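To make this mechanism concrete, the following C sketch shows one way the McSD-side daemon can couple inotify to the preloaded modules. It is a minimal illustration under stated assumptions, not the actual smartFAM code: the log directory /export/mcsd/logs, the module directory /opt/mcsd/modules, and the run_module() helper are hypothetical names, and error handling is reduced to the essentials.

    /* Minimal sketch of a smartFAM-style daemon on the McSD node.
     * Assumption: each preloaded module X has a log file X.log in the
     * NFS-exported folder, and a matching executable /opt/mcsd/modules/X. */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    #define EVENT_BUF_LEN (16 * (sizeof(struct inotify_event) + NAME_MAX + 1))

    static void run_module(const char *logname)
    {
        /* Map "wordcount.log" to the preloaded module binary and pass the
         * log file (which holds the input parameters) as its argument. */
        char logpath[512], cmd[1024];
        snprintf(logpath, sizeof(logpath), "/export/mcsd/logs/%s", logname);
        snprintf(cmd, sizeof(cmd), "/opt/mcsd/modules/%.*s %s",
                 (int)(strlen(logname) - 4), logname, logpath);
        system(cmd);        /* the module appends its results to the log file */
    }

    int main(void)
    {
        int fd = inotify_init();
        if (fd < 0) { perror("inotify_init"); return 1; }
        /* Watch the NFS-exported log folder; the host writes parameters here. */
        if (inotify_add_watch(fd, "/export/mcsd/logs", IN_CLOSE_WRITE) < 0) {
            perror("inotify_add_watch"); return 1;
        }
        char buf[EVENT_BUF_LEN];
        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            for (char *p = buf; len > 0 && p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->len > 0 && strstr(ev->name, ".log"))
                    run_module(ev->name);   /* host touched a module's log file */
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
    }

On the host side, invoking a module then reduces to writing its parameters to the corresponding log file over NFS and waiting - again via inotify - for the module to append its results.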
Below we address two questions related to the usage of log files in McSD:
(1) How to pass input parameters from a host node to an McSD storage node?
(2) How to return results from an McSD storage node to a host?

Passing input parameters from a host node to an McSD smart-storage node. When an application running on the host node offloads data-intensive computations to the McSD node, the following five steps are performed so that the host node can invoke a data-intensive module in the smart-storage node via the module's log file (see Fig. 3.5):

Step 1: The application on the host node writes the module's input parameters to its log file in McSD. Note that NFS handles communications between the host and McSD via log files.
Step 2: The inotify program in the McSD node monitors all the log files. When the data-intensive module's log file in McSD is changed by the host, inotify informs the Daemon program in smartFAM of McSD.
Step 3: The Daemon program opens the module's log file to retrieve the input parameters passed from the host. Note that this step is not required if no input parameter needs to be transmitted from the host to the McSD node.
Step 4: The data-intensive module is invoked by the Daemon program; the input parameters are passed from the Daemon to the module.
Step 5: Go back to Step 1 if more data-intensive modules in the McSD node are invoked by the application on the host.

Returning results from an McSD smart-storage node to a host node. Results produced by a data-intensive module in the McSD node must be returned to the module's caller - the calling application that invokes the module from the host node. To achieve this goal, smartFAM takes the following four steps (see Fig. 3.5):

Step 1: Results produced by the module in the McSD node are written to the module's log file.
Step 2: The inotify program in the host node monitors the log file, checking whether or not the results have been generated by McSD. After the module's log file is modified by McSD (i.e., the results are available in the log file), this inotify program informs the Daemon program in the host node.
Step 3: The Daemon program in the host notifies the calling application that the results from the McSD node are available for further processing.
Step 4: The host node accesses the module's log file and obtains the results from the McSD node. Note that this step can be bypassed if no result needs to be returned from McSD to the host.

3.4.3 Partitioning and Merging

A second implementation issue that has not been investigated in the existing smart-storage prototypes is how to process data sets that are too large to fit in on-node memory. In one of our experiments, we observed that the Phoenix runtime system does not support any application whose required data size exceeds approximately 60% of a computing node's memory size. This is not a critical issue for Phoenix itself, because Phoenix is a MapReduce framework for shared-memory multi-core or multi-processor systems, whose memory is commonly larger than that residing in smart storage nodes. On-node memory space in smart storage nodes is typically small compared with front-end high-performance computing nodes.
Thus, before we attempted to apply Phoenix in McSD smart disks, we had to address this out-of-core issue - the data required for computations in McSD can be too large to fit in McSD memory.

Figure 3.6: Workflow of the extended Phoenix model with partitioning and merging.

Our solution to the aforementioned out-of-core issue is to partition a large data set into a number of small fragments that fit into on-node memory before calling a MapReduce procedure. Once a large data set is partitioned, the small fragments can be repeatedly processed by the MapReduce procedure in McSD. Intermediate results obtained in each iteration can be merged to produce a final result. Our partitioning solution has two distinct benefits:
• Supporting huge datasets whose size may exceed the memory capacity of an McSD storage node.
• Boosting performance of data-intensive applications (e.g., word count) by improving the memory usage of McSD (see Fig. 3.9 in Section 3.5).

Figure 3.7: The workflow diagram of integrity checking.

Because both input data sets and emitted intermediate data are located in memory during the MapReduce stage, the memory footprint is at least twice the input data size. The partitioning solution, of course, is only applicable to data-intensive applications whose input data can be partitioned. In our experiments, we evaluated the impact of fragment size on the performance of applications. Evidence (see Fig. 3.9 in Section 3.5) shows that data partitioning can improve the performance of certain data-intensive applications.

3.4.4 Incorporating the Partitioning Module into Phoenix

Fig. 3.6 depicts the workflow of the modified version of Phoenix; it can be considered a two-stage MapReduce process. The Partition function is provided by the runtime system, while the Merge function needs to be programmed by the user to support different applications. Take the example of a Word Count command: wordcount [data-file] [partition-size]. The fragment size of every new partition is determined by (1) the [partition-size] value provided by the programmer/system and (2) the extra displacement from the integrity-check function, which makes sure the new partition ends correctly.

Figure 3.8: Performance Speedup. It depicts speedups of partition-enabled Phoenix vs. original Phoenix and the sequential approach on both duo-core and quad-core machines.

If there is no [partition-size] parameter, the program runs in the native way. Otherwise, the [partition-size] value can be manually filled in by the programmer or automatically determined by the runtime system (e.g., Phoenix). In order to achieve better performance, empirical data or details of the operator may be required for the automatic approach. The integrity-check function automatically returns the extra displacement by scanning from the tentative boundary given by [partition-size] until the first space, newline, or other symbol defined by the programmer. The reason we added the integrity-check procedure to the Partition function is the consistency issue of partitioned data files: the content of the source data file could otherwise be broken into shatters (e.g., a word could be cut and placed into two split files unintentionally). Fig. 3.7 describes the integrity-check procedure, and a minimal sketch of the boundary adjustment is given below.
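The following C fragment sketches the integrity check just described, under the assumption that fragments are carved out of an in-memory (e.g., memory-mapped) input buffer; the function name and delimiter set are illustrative, not the exact McSD code.

    #include <stddef.h>

    static int is_delim(char c)
    {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /* Advance a tentative cut point until the current token ends, so that a
     * word is never split across two fragments. Returns the adjusted length
     * of the fragment that starts at `offset`. */
    size_t adjust_partition(const char *data, size_t total_len,
                            size_t offset, size_t partition_size)
    {
        size_t end = offset + partition_size;
        if (end >= total_len)
            return total_len - offset;     /* last fragment: take the rest */
        while (end < total_len && !is_delim(data[end]))
            end++;                         /* the "extra displacement"      */
        return end - offset;
    }

A caller then walks the input with offset += adjust_partition(...), hands each fragment to the MapReduce procedure, and merges the per-fragment outputs at the end, exactly as in Fig. 3.6.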
Figure 3.9: Single Application Performance. The data size scales from 500MB to 1.25GB. (a) Growth curve of WC on Duo and Quad: elapsed time versus input data size on duo-core and quad-core machines for Word Count. (b) Growth curve of SM on Duo and Quad: elapsed time versus input data size on duo-core and quad-core machines for String Match.

3.5 Evaluations

3.5.1 Experimental Testbed

We performed our experiments on a 5-node cluster, whose configuration is outlined in Table 3.1. There are three types of nodes in the cluster: one host computing node, one smart storage node, and three general-purpose computing nodes. The operating system running on the cluster is the 64-bit version of Ubuntu 9.04. The nodes in the cluster are connected by Ethernet adapters, Ethernet cables, and one 1Gbit switch. All the general-purpose computing nodes share disk space on the host node through the Network File System (NFS), while the host node mounts one shared folder on the McSD node. The processing modules, the extended Phoenix system, and smartFAM have been set up on both the host and SD nodes. To emulate routine work, we run the Sandia Micro Benchmark (SMB) among all the nodes except the McSD smart-storage node. We chose MPICH2-1.0.7 as the message passing interface (MPI) on the cluster. All benchmarks are compiled with gcc 4.4.1. We briefly describe the benchmarks running on our testbed below.

Table 3.1: The Configuration of the 5-Node Cluster
             Host                     SD                      Nodes (x3)
    CPU      Intel Core2 Quad Q9400   Intel Core2 Duo E4400   Intel Celeron 450
    Memory   2GB
    OS       Ubuntu 9.04 Jaunty Jackalope, 64-bit
    Kernel   2.6.28-15-generic
    Network  1000Mbps

• Word Count (WC): It counts the frequency of occurrence of each word in a set of files. The Map tasks process different sections of the input files and return intermediate (key, value) pairs that consist of a word and a value of 1. The Reduce tasks then add up the values for each distinct word. Finally, the words are sorted and printed in decreasing order of frequency (a framework-agnostic sketch of this map logic follows the list).
• String Match (SM): Each Map task searches one line of the "encrypt" file to check whether a target string from a "keys" file appears in the line. Neither the sort nor the reduce stage is required.
• Matrix Multiplication (MM): Matrix multiplication is widely used to analyze the relationship between two documents. Each Map task computes the multiplication for a set of rows of the output matrix. It outputs the row ID and column ID as the key and the corresponding result as the value. The reduce task is just the identity function.
• Sandia Micro Benchmark (SMB): It is developed by Sandia National Laboratory to evaluate and test high-performance networks and protocols. We use it in our experiment to emulate routine work.
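As a concrete illustration of the Word Count map step described above, the C fragment below shows the per-chunk logic in a framework-agnostic form; emit_intermediate() is a hypothetical stand-in for the runtime's emit call (Phoenix's real API differs), so the signatures here are illustrative only.

    #include <ctype.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Placeholder for the runtime's intermediate-pair emission; a real
     * runtime would insert into its key/value store instead of printing. */
    static void emit_intermediate(const char *word, size_t len, int value)
    {
        printf("%.*s\t%d\n", (int)len, word, value);
    }

    /* Map: scan one chunk of the input and emit (word, 1) for every token. */
    void wordcount_map(const char *chunk, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            while (i < len && !isalpha((unsigned char)chunk[i]))
                i++;                        /* skip delimiters   */
            size_t start = i;
            while (i < len && isalpha((unsigned char)chunk[i]))
                i++;                        /* consume one word  */
            if (i > start)
                emit_intermediate(chunk + start, i - start, 1);
        }
    }

The reduce side simply sums the values grouped under each word, which is why the partition-and-merge wrapper only needs to add per-fragment counts together.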
3.5.2 Single-Application Performance

Fig. 3.9 shows the speedup achieved by the partition-enabled programming model relative to the no-partition version and the sequential implementation, respectively. For the single-application benchmarks, we observed that traditional Phoenix cannot support Word Count or String Match for data sizes larger than 1.5GB because of memory overflow. From Fig. 3.9, when the data size is in a reasonable interval (say, less than half of the memory size), the traditional parallel approach provides almost the same performance. However, for Word Count, when the data size is huge compared with the memory size, the elapsed time of the partition-enabled approach is only 1/6 of the traditional one. Compared with the sequential approach, both benchmarks achieve a 2X speedup, which demonstrates full utilization of the duo-core processor.

Fig. 3.9(a) and Fig. 3.9(b) plot the execution time versus the size of the input data file on the two SD platforms. Since the performance curves exhibit near-linear growth, our methodology scales well with input size. We can summarize that: (1) for data-size-sensitive applications such as Word Count, the Partition procedure can not only support data sizes that do not fit in physical memory but also improve performance; (2) for applications that are not data-intensive, the Partition model only extends the range of supported data sizes - all of these observations assume that the applications are partitionable; and (3) last but not least, the partition-enabled approach fully utilizes the multicore processor in almost all cases in this test.

3.5.3 Multiple-Application Performance

When multiple applications are running concurrently following the McSD framework, the system should exhibit two basic properties: (1) the overall system throughput can be increased, and (2) the overall performance of the application set can be improved. To evaluate our McSD execution framework, we created two multiple-application benchmarks, each of which contains a computation-intensive function and a data-intensive one. To explore how well our system meets the performance expectations, we report two pairs of application benchmarks: Matrix Multiplication/Word Count (MM/WC) and Matrix Multiplication/String Match (MM/SM). The first pair is very data-intensive, or memory-consuming, since the memory footprint of Word Count is around three times the input data size. The memory footprint of String Match, on the other hand, is around two times the input data size. Thus, the two pairs represent two levels of data intensiveness.

Figure 3.10: Speedups of Matrix Multiplication and Word Count for (a) Host Node Only, (b) Traditional SD, and (c) McSD without Partition. Trad SD - traditional smart storage (SD) with a single-core processor embedded. DuoC SD-nopar - duo-core-processor-embedded smart storage operating in parallel without the partitioning function. In the Host-only scenario, the benchmarks run on the multicore host node only. The last one, Host-part, is partitioning-enabled on the host node.
Compared with the traditional smart storage (running sequentially), our McSD improves the overall performance by 2X. As the data size increases, the elapsed time of the non-partitioned approaches (DuoC-SD and Host-only) can be 16 to 18 times longer than that of the McSD approach.

Figure 3.11: Speedups of Matrix Multiplication and String Match for (a) Host Node Only, (b) Traditional SD, and (c) McSD without Partition. Compared with the traditional smart storage (SD) running sequentially, our McSD improves the overall performance by 1.5X. As the data size increases, McSD outperforms the non-partitioning approaches (DuoC-SD and Host-only) by 2X.

For each pair of applications, we set up four scenarios to execute the programs: (1) the benchmarks run in the traditional single-core SD mode (a combination of host and single-core SD node), (2) the benchmarks run in the duo-core embedded SD mode without the Partition function, (3) the programs run on the host node only, and (4) the programs follow the McSD execution framework, in which the host machine handles the computation-intensive part and the SD machine processes the on-node data-intensive function. Each of the solutions performs three tests: parallel processing without partition, parallel processing with partition, and the sequential solution.

Fig. 3.10 and Fig. 3.11 illustrate the performance improvement of the optimized approach - the parallel-enabled one with a 600MB partition - against the other scenarios. Fig. 3.10 and Fig. 3.11 show speedups for the MM/WC and MM/SM pairs, respectively. We define the performance speedup as the ratio of the elapsed time without the optimization technique to that with the McSD technique. From both figures, we observe a common point: compared with the traditional SD (equipped with a single-core processor), the McSD (with an embedded duo-core processor) improves the overall performance by 2X on average for both pairs of applications, which illustrates the utilization of the duo-core processor. The difference between the two sets of figures is also obvious. For MM/WC, the elapsed times of the non-partitioned parallel approaches (host node only and McSD without Partition) increase nonlinearly; when the data size exceeds a threshold, the speedups reach 6.8X and 17.4X on average, whereas McSD makes only a slight improvement when the data sizes are 500MB and 750MB (below the threshold). In contrast, the speedups for MM/SM, which represents less data-intensive applications, are both around 2X. As we can see, our methodology gives better speedups compared with the traditional SD (2X on average) and with parallel processing without Partition (up to 17X). As SD is widely considered one of the heterogeneous computing platforms, frameworks like ours will be needed to manage such systems and improve their performance.

3.6 Summary

Processor-enabled smart storage can improve the I/O performance of data-intensive applications by processing data directly on storage nodes, because smart nodes avoid moving data back and forth between storage and host computing nodes. Thanks to advances in manufacturing technology, it is possible to integrate multi-core processors into smart storage nodes.
In this study, we implemented a prototype called McSD for multicore-enabled smart storage that can improve performance of data-intensive applications by o?oading data processing to multicore processors employed in storage nodes of computing clusters. Our McSD system differs from conventional smart/active storage in two ways. First, McSD is a smart storage nodes rather than a smart disk. Second, McSD can leverage multi-core processors in storage nodes to improve performance of data-intensive applications running on clusters. McSD alongwith itsprogramming framework enables programmerstowrite MapReduce- like code that can automatically o?oad data-intensive computation to multicore processors residing in smart storage nodes. The McSD programming framework allows smart stor- age nodes to take full advantages from embedded multi-core processors. The APIs and a runtime environment in this programming framework automatically handles computation o?oad, data partitioning, and load balancing. The McSD prototype was implemented in a testbed?a 5-node cluster containing both host computing nodes and McSD smart-storage nodes. Our experimental results were taken by running three real-world applications on the testbed. The tested data-intensive applications include Word Count, String Matching, and Matrix Multiplication. Our multicore-enabled smart storage system - McSD - significantly reduces the execution time of the three applications. Overall, we conclude that McSD is a promising approach to improving I/O performance of data-intensive applications. Our prototype for multcore-enabled smart storage was built in a MapReduce cluster. The performance of the benchmark applications largely depends on the testbed. Therefore, we will upgrade our testbed (e.g., replace Ethernet with Infiniband) to evaluate the impact of 36 fast network interconnects on McSD. Perhaps the most exciting future work lies in exploring (1) the extensibility of data-processing modules and operations (i.e. data-intensive appli- cations and database operations) that are preloaded into McSD smart-disk nodes, (2) the parallelisms among multiple McSD smart disks, and (3) a mechanism in McSD to support fault tolerance and improve reliability. 37 Chapter 4 Using Active Storage to Improve the Bioinformatics Application Performance: A Case Study 4.1 Motivation Processing massive amounts of data has resulted in a mushrooming of data-intensive applications like bioinformatics data processing. Evidence shows that the collective amount of genomic information doubles every 12 months [63]. Most bioinformatics applications have to deal with the I/O bottleneck issue. Processing huge datasets in a high-performance cluster normally requires copying data from storage nodes to computing nodes, thereby leading to a large number of I/O operations. Active storage is an effective technique to improve applications? end-to-end performance by o?oading data processing from computing nodes to storage nodes. Active storage brings three key advantages. First, the amount of data moved back and forth between computing nodes and storage nodes in clusters can be significantly reduced, since large datasets can be locally processed by storage nodes before being forwarded to computing nodes. Second, data-intensive applications run faster, because active storage nodes accelerate data processing operations. If computing nodes and active storage nodes efficiently coordinate, both computing nodes and storage nodes can perform data processing in parallel. 
Third, network performance in clusters can be improved due to reduced amounts of data moved into and out of storage nodes. 38 4.1.1 Challenges There are two main challenges in applying active storage to support data-intensive appli- cations on clusters. The first challenge is to partition a parallel application into computation- intensive and data-intensive tasks. If such a partition is successfully created, computing nodes will handle computation-intensive tasks whereas active storage nodes will run data- intensive tasks. The second challenge lies in the coordination between computing nodes and active stor- agenodes. When it comes to applications where computation-intensive tasks are independent of data-intensive tasks, computing nodes and active storage nodes are non-blocking to each other, meaning that computing and storage nodes can easily operate in parallel. However, if computing nodes have to wait for storage nodes to catch up, the blocked computing nodes could slow down data-intensive applications. 4.1.2 Contributions In this chapter, we address the partitioning and synchronization issues in the context of active storage supporting bioinformatic applications. To solve the blocking problem incurred by synchronized computing and storage nodes, we developed a pipelining mechanism that exploits parallelism among data processing transactions in a sequential transaction stream. We report the effectiveness of the pipelining mechanism that leverage active storage to maximize throughput of data-intensive applications on a high-performance cluster. To demonstrate the effectiveness of the pipelining mechanism designed for active storage, we implemented a pipelined application called pp-mpiBLAST, which extends mpiBLAST, which is an open-source parallel BLAST tool. pp-mpiBLAST deals with a sequential data processing transactions, each of which contains a filtering/formatting task and a mpiBLAST task. The mechanism overlaps data filtering/formatting in active storage with parallel BLAST computations in computing nodes, thereby allowing clusters and their active storage nodes to perform data processing in parallel. The pp-mpiBLAST application relies on active 39 storage to filter unnecessary data and to format databases, which are then forwarded to the cluster running mpiBLAST. We develop an analytic model to study the scalability of pp-mpiBLAST on large-scale clusters. This model allows us to study the performance of pp-mpiBLAST on a cluster using active storage. This performance model is ideal for application developers who have limited computing resources to test the scalability of their parallel applications using active storage. Furthermore, programmers can use the analytic model to explore the design space related to the number of computing nodes, active storage speed, data processing capacity, and input data size. The model shows the behavior of pp-mpiBLAST under different configurations of the cluster coupled with active storage. Measurements made from a working implementation and a modeling study suggest that this method not only improves mpiBLAST?s overall performance by up to 75%, but also achieves high scalability on clusters coupled with active storage. In the Section 4.2, we review the background information and previous related research. In Section 4.3, we describe the design and implementation details of the active storage node. The analytical model is presented in Section 4.4. Experiment results and performance evaluation are discussed in Section 4.5. 
Finally, Section 4.6 provides conclusions and future research directions.

4.2 Background

4.2.1 Active Storage

Current practice for data-intensive applications on high-performance clusters often results in high I/O communication overhead between the computing nodes and storage nodes in the cluster. When massive amounts of data must be transferred back and forth between parallel computing nodes and storage systems, the performance of cluster computing applications can suffer from network bandwidth saturation. One efficient approach to reducing the network traffic caused by moving data between computing nodes and storage systems is to incorporate computing capacity into the storage systems, thereby offloading some data-intensive computing tasks from clusters to their storage nodes. This inspired researchers to make storage more "active", or "smart".

Active storage can be implemented at the storage-node level [71][32][91]. ASF [35] and FAWN [7] are two recent examples of active storage implementations at the storage-node level. Fitch et al. developed the Active Storage Fabrics (ASF) model to address petascale data-intensive challenges [7]. ASF aims at transparently accelerating host workloads by close integration at the middleware data/storage boundary or directly by data-intensive applications [7]. FAWN - developed by Andersen et al. - couples embedded CPUs to small amounts of local flash storage [7]. Andersen et al. used FAWN as a building block to construct a cluster in which computation and I/O capabilities are balanced to improve the energy efficiency of the cluster running data-intensive parallel applications. Active storage has also been explored for handling unstructured data [?] and for working within a Lustre file system [79]. In this study, we focus on active storage like ASF and FAWN. We implemented the pipelining technique in an active storage testbed that is similar to ASF and FAWN.

4.2.2 Parallel Bioinformatic Applications

Parallel computing can improve the performance of bioinformatic applications like sequence database search tools. Given a database and query sequences (e.g., DNA or amino-acid sequences), the search tools look for similarities between the query sequences and known sequences in the database. The tools enable scientists to quickly identify the function of newly discovered DNA sequences or to accurately identify species of a common ancestor [64]. Among many sequence search tools, BLAST is one of the most popular tools used on a daily basis by bioinformatics researchers. mpiBlast [34] is a promising, open-source, parallel implementation of the BLAST toolkit [5]. Like other bioinformatic applications, mpiBlast has to pre-process (e.g., format) databases before searching for similarities between query sequences and known sequences in the preprocessed databases.

Figure 4.1: A cluster involving a collection of computing nodes and active storage nodes.

We implemented a pipelined application - pp-mpiBlast - to demonstrate a way of employing active storage to improve the performance of data-intensive bioinformatic applications. pp-mpiBlast incorporates a data-processing pipeline with mpiBlast on a high-performance cluster.

4.3 Design and Implementation

In this section, we describe the system from the top down: an overview of the system, a hybrid mix of storage devices, and the parallel pipelined processing.
Hereinafter, the active storage node is referred to as ASN for short.

4.3.1 Active Storage for Clusters

A typical high-performance cluster consists of computing nodes and storage nodes. Data-intensive applications on clusters can cause heavy I/O traffic between the computing and storage nodes in the cluster. To achieve high performance for data-intensive applications, we aim to reduce the network traffic caused by moving massive amounts of data to and from storage systems in clusters. This goal can be accomplished by offloading data-intensive computations from computing nodes to active storage nodes. Fig. 4.1 illustrates a cluster that involves a collection of computing nodes coupled with active storage nodes. Storage nodes become active if they can handle application-level data processing offloaded from computing nodes.

The storage devices can be further divided into two categories: mass storage and buffer storage. Benefiting from its non-volatile memory, the solid state disk (SSD) is a new option to fill the latency gap - around five orders of magnitude - between main memory and spinning disks [47]. Thus, in our system, SSDs are used as buffer disk drives: large-scale data is moved from mass storage to the buffer drives before processing. The experimental results in Section 4.5 show that SSDs not only speed up I/O but also provide better scalability. The advantage of using a hybrid mix of solid state disks and magnetic hard disks is mutual complementarity: the fast but expensive device cooperates with the large-capacity, cost-efficient one.

In this chapter, we use a commodity computer as the computing end. The "channel" formed by the computing nodes of the cluster and the ASN is treated as a pipeline, or assembly line. In other words, by applying active storage, applications containing multiple stages can be extended to a parallel pipelined implementation. Exploiting this parallelism improves the performance of data-intensive applications. As a case study, we extend mpiBlast, a well-known parallel BLAST application, to a parallel pipelined implementation called pp-mpiBlast.

Figure 4.2: pp-mpiBlast Workflow.

4.3.2 Parallel Pipelined System

The native mpiBlast application can be viewed as two steps: format the raw database file (corresponding to the query request) and run the parallel BLAST functions to do the comparison. Thus, the pre-cook (i.e., format) phase and the parallel computation phase are handled by the ASN and the computing nodes, respectively.

How to utilize the active storage device has always been another critical issue. In general, all cases fall into two scenarios. In the first, the tasks are independent (i.e., computing nodes and active storage nodes are non-blocking to each other), meaning that computing and storage nodes can easily operate in parallel. However, if computing nodes have to wait for storage nodes to catch up, the blocked computing nodes can slow down data-intensive applications. For instance, in mpiBlast, the parallel comparison step requires the formatted database file from the previous step. The assembly-line pattern is one of the solutions for the second scenario. As a case study, we extend mpiBlast to a parallel pipelined implementation (hereinafter referred to as pp-mpiBlast); a simplified driver loop illustrating the overlap is sketched below.
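The following C sketch illustrates the two-stage overlap implied by Fig. 4.2 and Fig. 4.3. The shell commands format_on_asn.sh and run_mpiblast.sh are hypothetical placeholders for "run mpiformatdb on the ASN" and "run mpiBlast on the computing nodes"; the real pp-mpiBlast code is organized differently, so this is only a minimal sketch of the scheduling idea.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Launch a shell command asynchronously and return its pid. */
    static pid_t spawn(const char *cmd)
    {
        pid_t pid = fork();
        if (pid == 0) {
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        return pid;
    }

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 4;   /* number of partitions */
        char fmt[256], blast[256];

        /* Partition 0 has no overlap partner: format it first. */
        snprintf(fmt, sizeof(fmt), "./format_on_asn.sh %d", 0);
        if (spawn(fmt) > 0) wait(NULL);

        for (int i = 0; i < n; i++) {
            pid_t next_fmt = -1;
            if (i + 1 < n) {                      /* ASN formats partition i+1 ... */
                snprintf(fmt, sizeof(fmt), "./format_on_asn.sh %d", i + 1);
                next_fmt = spawn(fmt);
            }
            snprintf(blast, sizeof(blast), "./run_mpiblast.sh %d", i);
            pid_t search = spawn(blast);          /* ... while the cluster searches i */
            waitpid(search, NULL, 0);
            if (next_fmt > 0) waitpid(next_fmt, NULL, 0);
        }
        /* Sub-outputs of partitions 0..n-1 are merged afterwards (Fig. 4.2). */
        return 0;
    }

With this structure, the formatting of partition i+1 on the ASN runs concurrently with the BLAST search over partition i, which is exactly the overlap captured by the analytic model in Section 4.4.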
The pp-mpiBlast system consists of two tasks: 1) raw database formatting, and 2) genome or protein sequence comparison.

Figure 4.3: Pipeline Tasks Scheduling (format db 1..n on the ASN overlapped with mpiBlast 1..n on the computing nodes).

Subdividing the pipeline patterns further, there are inter- and intra-application pipeline processing. pp-mpiBlast is intra-application parallel processing, which means that, as the name "intra-" suggests, one native sequential transaction is partitioned into multiple parallel pipelined transactions. System performance is improved by fully exploiting this parallelism. The workflow of pp-mpiBlast is depicted in Fig. 4.2.

Intra-application Pipeline Processing

As mentioned in the previous section, in order to extend a sequential transaction into multiple pipelined parallel transactions, both partition and merge functions are introduced in the pp-mpiBlast system. Fig. 4.3 illustrates the two-task, two-stage pipeline processing workflow. The pipeline pattern not only improves performance by exploiting parallelism, but also solves the out-of-core processing issue, in which the required amount of data is too large to fit in the ASN's main memory. In pp-mpiBlast, the partition function is implemented within the mpiformatdb function running on the ASN, and the merge function is a separate one running on the front node of the cluster.

When partitioning the source data, an assistant function - the integrity check - automatically returns the extra displacement by scanning for the newline or the symbol defined by the programmer. The reason we added the integrity-check procedure to the partition function is the consistency issue of partitioned data files: the content of the source data file could otherwise be broken into shatters (e.g., a sequence could be cut and placed into two split fragments unintentionally). Fig. 4.4 describes the integrity-check workflow.

Figure 4.4: The workflow diagram of sequence-integrity checking.

Pipeline parallelism is an important processing pattern, and we are interested in providing models and guidance for tuning scalability and performance using this pattern. In Section 4.4, we develop a mathematical model for this analysis.

Preliminary Result

To demonstrate the feasibility of the partition-based intra-application pipeline design, a preliminary test was performed in a single-node environment with 2GB of memory. We extended the Word-Count and String-Match benchmarks of the Phoenix system [82], a shared-memory implementation of MapReduce, to intra-application pipeline editions using the partition and merge runtime functions we developed under the Phoenix system. Fig. 4.5 depicts the workflow of the extended approach. After the input data is partitioned into fragments of 600 MB, each fragment is processed sequentially using the native word-count or string-match application, and the generated sub-results are merged at the end.

Figure 4.5: Workflow of the extended Phoenix model - intra-application pipeline.

The control test is that we run the native applications to process the input data without the partition and merge functions.
Table 4.1 shows the results. We observe that for data-intensive, and especially memory-intensive, applications, partitioning can significantly reduce the running time. For example, for the word-count application, an average 2.4X reduction in running time is achieved. In contrast, partitioning does not benefit the speed of the string-match application, but it does allow large-scale data-intensive applications to run on limited-memory machines. For example, without the help of the partition function, the native system does not support string-match when the input data size is more than two times the local memory size. Thus, the preliminary results show that the partition-enabled design can (1) improve data-intensive applications' performance, (2) adapt data-intensive applications to limited-memory machines, or both.

Table 4.1: Running time (in seconds) of the Word-Count and String-Match benchmarks with and without the partition function under different input data sizes (in GBytes) on a single node. The testbed machine contains 2 GBytes of main memory.
                     WordCount (s)          StringMatch (s)
                     1 GB      1.25 GB      1 GB      1.25 GB
    w/ partition     40.50     50.91        17.76     20.61
    w/o partition    85.71     139.54       17.62     21.00

4.4 Modeling and Analysis

In this section we develop an analytic model to study the performance and scalability of the parallel pipelined BLAST on large-scale cluster computing platforms. The analytic model has three key advantages:
• Performance Evaluation: The model allows us to study the performance (i.e., speedup and throughput) of the parallel pipelined BLAST system with active storage.
• Scalability Analysis: The performance model is suitable for bioinformatics application developers who have limited local computing resources with which to test the scalability of their parallel applications using active storage.
• Design-Space Exploration: Application programmers can use the model to explore the design space related to the number of computing nodes, active storage speed, data processing capacity, and input data size. The model shows the behavior of the parallel pipelined BLAST under different configurations of a high-performance cluster coupled with active storage.

Section 4.3 describes the parallel pipelined implementation of the basic local alignment search tool, or BLAST. Our pipelined BLAST has two data processing stages: (1) formatting data and (2) retrieving and processing formatted data. The first stage of the pipeline pre-processes the input data set before passing it on to the second stage, which runs mpiBLAST - a parallel implementation of BLAST.

Response time, speedup, and throughput are three critical performance measures for the pipelined BLAST. Denoting by $T_1$ and $T_2$ the execution times associated with the first and second stages of the pipeline, the response time $T_{response}$ for processing each input data set is the sum of $T_1$ and $T_2$. Thus, we have

$$T_{response} = T_1 + T_2. \qquad (4.1)$$

The throughput of the two-stage pipelined system is inversely proportional to the maximum of the two execution times $T_1$ and $T_2$:

$$Throughput = \frac{1}{\max(T_1, T_2)}. \qquad (4.2)$$

The speedup of the pipelined BLAST is

$$Speedup = \frac{T_{unpipelined}}{T_{pipelined}}, \qquad (4.3)$$

where $T_{unpipelined}$ is the data processing time of the unpipelined BLAST and $T_{pipelined}$ is the processing time of the pipelined BLAST.
If $n$ is the number of input data sets to be processed by a cluster with active storage, then the processing time of the unpipelined BLAST is the product of $n$ and $T_{response}$ (see Eq. 4.1), leading to

$$T_{unpipelined} = n \cdot T_{response} = n \cdot (T_1 + T_2). \qquad (4.4)$$

The processing time of the pipelined BLAST is

$$T_{pipelined} = T_1 + (n-1) \cdot \max(T_1, T_2) + T_2 = T_{response} + (n-1) \cdot \max(T_1, T_2), \qquad (4.5)$$

where $T_1$ is the processing time of stage 1 for the first data set, $T_2$ is the execution time of stage 2 for the $n$th data set, and $(n-1) \cdot \max(T_1, T_2)$ is the time spent on the remaining $n-1$ data sets while the two stages are carried out in parallel. Applying Eqs. 4.4 and 4.5 to Eq. 4.3, we obtain the speedup as

$$Speedup = \frac{n}{1 + (n-1) \cdot \frac{\max(T_1, T_2)}{T_{response}}}. \qquad (4.6)$$

Now we are positioned to model the execution times $T_1$ and $T_2$ of the two stages in the pipeline. The processing time of the first stage is the sum of (1) the data input/output times and (2) the filtering/formatting time $T_{1,comp}$. The input time is proportional to the unformatted input data size $s_u$ and inversely proportional to the disk read bandwidth $b_i$. Similarly, the output time is proportional to the formatted output data size $s_f$ and inversely proportional to the disk write bandwidth $b_o$. This leads to

$$T_1(s_u, s_f, b_i, b_o) = \frac{s_u}{b_i} + T_{1,comp}(s_u) + \frac{s_f}{b_o}. \qquad (4.7)$$

The execution time $T_2$ of stage two is the sum of (1) the input time of the formatted data and (2) the processing time $T_{2,comp}$. The input time depends on the data size $s_f$, the disk input bandwidth $b_i$, and the number $m$ of computing nodes in the cluster. Assuming that the formatted data of size $s_f$ is uniformly distributed among the $m$ computing nodes, we can express the input time as $s_f/(m \cdot b_i)$. Thus, the execution time $T_2$ of the second stage is

$$T_2(s_f, b_i, m) = \frac{s_f}{m \cdot b_i} + T_{2,comp}(s_f, m), \qquad (4.8)$$

where $T_{2,comp}(s_f, m)$ is affected by the formatted data size $s_f$ and the cluster size (i.e., the number of computing nodes $m$).
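As a quick, purely illustrative check of Eqs. 4.4-4.6 (the numbers below are hypothetical, not measurements from our testbed), suppose the formatting stage takes $T_1 = 100$ s, the mpiBLAST stage takes $T_2 = 300$ s, and $n = 8$ data sets are processed. Then

$$T_{unpipelined} = 8 \cdot (100 + 300) = 3200\ \mathrm{s}, \qquad T_{pipelined} = 400 + 7 \cdot 300 = 2500\ \mathrm{s},$$

$$Speedup = \frac{3200}{2500} = \frac{8}{1 + 7 \cdot \frac{300}{400}} = 1.28.$$

As $n$ grows, the speedup approaches $T_{response}/\max(T_1, T_2) = 400/300 \approx 1.33$, which shows why balancing the two stages (e.g., shrinking $T_1$ with the SSD buffer) matters for the pipeline.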
4.5 Evaluations

4.5.1 Evaluation Environment

We implemented pp-mpiBlast on a 14-node cluster (one node works as the ASN), whose configuration is outlined in Table 4.2. The nodes in the cluster are connected by Ethernet adapters, Ethernet cables, and one 1Gbit switch. We chose an Intel X25-M 80GB solid state disk, and a SATA RAID tower with four WD5000AAKS disks as a RAID 0 array. MPICH2-1.0.7 is chosen as the message passing interface (MPI) in the cluster. pp-mpiBlast is extended from mpiBlast-1.5.0, in which NCBI BLAST 2.2.24 is the comparison tool. All applications are compiled with gcc 4.4.1.

Table 4.2: The Test Platform
             Computing Nodes     ASN
    CPU      Intel Xeon X3430    Intel Q9400
    Memory   2GB
    OS       Ubuntu 9.04 Jaunty Jackalope, 64-bit
    Kernel   2.6.28-15-generic
    Network  1000Mbps

4.5.2 Individual Node Evaluation

We run the mpiformatdb program under different storage schemes: with the SSD as the buffer disk and without a buffer disk. Since data pre-fetching is out of the scope of this chapter, when we test the SSD cases we modify the program to move the data from the mass storage devices to the SSD and then trigger the format function. This means the time consumption of the SSD case contains both the data-transfer and data-format phases. Table 4.3 shows the results. As observed from the table, the SSD case (the second row) always performs better than the HDD, since it benefits from the large number of random reads and writes incurred when the function reorders the sequences in descending order of entry length. Balancing disk capacity, storage reliability, I/O speed, and random read/write speed, the hybrid mix of mass storage and buffer disk is a promising choice.

Fig. 4.6 compares the time-consumption trends of using the SSD and the HDD.

Figure 4.6: Time Consumption Trends Comparison: SSD vs. HDD (mpiformatdb, 12 nodes; SSD and HDD modeling data; input data size vs. time consumption).

We can observe that the scalability of using the HDD for the mpiformatdb function is poor compared with the SSD. Thus, in the following experiments, we use the SSD as the buffer disk.

Table 4.3: Time cost (in seconds) of the mpiformatdb program under different input data sizes on an ASN.
                          Running time (seconds)
    Devices   500 MB   750 MB   1 GB    1.25 GB   1.5 GB   1.75 GB
    HDD       108.4    369.5    639.1   945.6     1385.0   1845.1
    SSD       101.9    164.4    225.9   291.1     369.5    441.1

4.5.3 System Performance Evaluation

Figures 4.7 and 4.8 show the system performance evaluation. The pp-mpiBlast testbed, which is configured with 12 computing nodes and 1 ASN and follows the pipelined processing pattern, is compared with two control experiments (native systems with different numbers of nodes): the native mpiBlast running on 1) a 12-node cluster (equal to the number of computing nodes in the pp-mpiBlast testbed), and 2) a 13-node cluster (equal to the total number of nodes in the pp-mpiBlast testbed). The reason for choosing two competitors is to present the performance improvement comprehensively. In Fig. 4.7 and Fig. 4.8, the testbed with native mpiBlast contains 12 and 13 nodes, respectively. In each figure, the pp-mpiBlast results are generated under three different partition sizes - 250 MB, 500 MB, and 750 MB - presented by three sub-figures from left to right. As observed from the figures, the time-consumption comparison shows that pp-mpiBlast beats both control testbeds, reducing the execution time by 50% on average (i.e., a 2X speedup). Fig. 4.9 further shows that the improvement grows as the input data size increases within a certain range, which is relative to the main memory size.

Figure 4.7: System Evaluation Results I: Execution time comparison between the pp-mpiBlast system and the native system (running on a computer cluster of 12 nodes). The pp-mpiBlast system contains a twelve-node computing cluster and one ASN. Results are generated under three different partition sizes - 250 MB, 500 MB, and 750 MB - presented by three sub-figures from left to right.

Figure 4.8: System Evaluation Results II: Execution time comparison between the pp-mpiBlast system and the native system (running on a computer cluster of 13 nodes). The pp-mpiBlast system contains a twelve-node computing cluster and one ASN.
Results are generated under three different partition sizes - 250 MB, 500 MB, and 750 MB - presented by three sub-figures from left to right.

We also compare the performance across different partition sizes. Fig. 4.10 shows that a 500 MB partition provides better performance in general. Based on the data, the reasons can be summarized as follows: 1) "the smaller the better" is not true, because small partitions always generate more overhead; and 2) "the larger the better" does not hold either. This means a partition-size threshold exists for optimal performance. Quantifying this threshold will be investigated in our future work.

Based on the test results, we can see that using the intra-application pipelined parallel processing model to extend mpiBlast improves performance and scalability. However, as mentioned in the previous sections, the approach is not general; it requires that the target application can be decomposed into stages, as in streaming and RMS applications. Also, applications parallelized using the pipeline model are very sensitive to load balancing. How to balance the heterogeneity between the ASN and the computing nodes, in order to avoid pipeline bubbles or reduce their side effects, will be our next topic.

Figure 4.9: Speedup Trends: as the input data size grows larger, the performance speedup of pp-mpiBlast increases. The left sub-figure compares pp-mpiBlast with the 12-node testbed; the right one compares it with the 13-node testbed.

Figure 4.10: Time Consumption Curves Comparison: Different Partition Sizes (250MB, 500MB, 750MB; database size vs. time).

4.6 Summary

We have presented a pipelining technique that allows computing nodes and active storage nodes in a cluster to perform CPU-intensive and data-intensive computing in parallel. The technique exploits parallelism among data processing transactions in a sequential transaction stream, where each transaction is handled sequentially by computing and storage nodes in the cluster. One central trait of the pipelining technique is that it makes use of active storage to maximize the throughput of data-intensive applications (e.g., bioinformatic applications) on high-performance clusters.

We have shown that synchronization between computing and storage nodes can hinder active storage nodes from accelerating data processing operations. Our technique solves the blocking problem that prevents active storage and computing nodes from processing data in parallel. The implementation of pp-mpiBLAST demonstrates that active storage is an effective approach to improving data-intensive applications' overall performance by offloading data processing to storage nodes in clusters. We implemented the pipelining technique in a bioinformatic application called pp-mpiBLAST, which incorporates mpiBLAST - an open-source parallel BLAST tool.
The pipeline in pp-mpiBLAST processes a sequential transaction stream, in which each transac- tion contains a filtering/formatting task and a mpiBLAST task. The pipelining technique overlaps data filtering/formatting in active storage with parallel BLAST computations in computing nodes. Measurements made from a working implementation and analytic model have been encouraging. The pipeline and active storage not only reduce mpiBLAST?s overall execution time by up to 50%, but they also achieved high scalability on large-scale clusters. There are three directions to pursue future work. First, we intend to explore the extensi- bility of data-processing modules and operations that are preloaded into active storage. The modules and operations arelikely to beapplication domain specific. Second, we will study the parallelisms among multiple active storage nodes. Third, we intend to investigate whether 59 similar partitioning and synchronization schemes can be applied to both read-intensive and write-intensive applications. 60 Chapter 5 HcDD: The Hybrid Combination of Disk Drives in Active Storage Systems 5.1 Motivation Previous two chapters propose the architecture of a new active storage system and the pipeline parallelism processing pattern for applying the active storage system in a computing cluster. The new architecture helps in achieving the performance improvement by o?oading the computation to the active storage, especially when read requests areperformed on storage nodes. In this chapter, hybrid storage devices in an active storage system are introduced to boost write performance and to extend disk reliability. Recently, the use of NAND-flash-based Solid State Devices (hereinafter referred to as SSDs) has evolved from specialized applications in mobile devices [18] and laptops to primary system storages in general-purpose computers and even data centers [72]. Evidence shows that Tape is dead; disk is tape; flash is disk [39]. Different from HDDs (magnetic hard disk drives) which rely on mechanical moving parts, SSDs are completely built on semiconductor chips resulting in high random access performance and lower power consumption. And the cost of commodity NAND flash ? often cited as the primary barrier to SSD deployment [73] ? has dropped significantly, increasing the possibility for dynastic changes in the storage arena. Although, today, flash disks have not dominated the whole storage market, SSDs begin to be deployed in enterprise servers and data centers. But, in data centers, engineerings still hesitate somehow to perform a large-scale deployment of SSDs due to their poor reliability and the limited lifespan issues [8]. As thecost of NAND flash hasdeclined with increased density, the number of erase cycles that a flash cell can tolerate, on which the number of write operations depends (because of the erase-before-write characteristic), has suffered. Meanwhile, server applications, such 61 as OLTP (Online Transaction Processing) [44], normally demand a high-performance and highly reliable storage system. Some researchers believe that the stressful workload and fewer available erase cycles may further reduce lifetimes of SSDs, in some cases, even to less than one year [93]. From another perspective, when building terascale or petascale servers and data centers using only SSDs rather than HDDs, the cost ? even though with the decreasing price of SSDs ? is often beyond the acceptable budgets in most cases. 
Thus, HDDs are still regarded as indispensable components in the storage hierarchy because of their merits of low cost, huge capacity, above-average reliability, and fast sequential access speed. Instead of simply replacing HDDs with SSDs, researchers [49][68][51][80][33] believe that finding a complementary approach balancing the performance, reliability and cost is more attractive and challenging. In this chapter, we propose a hybrid combination of disk arrays for active storage sys- tems. The hybrid-disk design involves an enhanced duo-buffer design, which consists (1) a HDD-write buffer to server and de-duplicate write requests and an on-SSD buffer to pro- vide parallel write processing. Read requests are all served by SSDs directly. But writes to an active storage are first transferred to its HDD buffer and perform de-duplication before migrating to the SSD. The de-duplication computation and I/O operations from the buffer disk to the major storage disk are o?oaded to the dedicated processor in the active storage. This chapter also introduces an enhanced version of SSD?s buffer, which supports internal parallelism processing. Together, the goal of this part of study is to minimize the number of writes sent to SSDs without significantly affecting the performance; by doing so, it reduces the number of erase cycles and thus extends SSDs lifetime. We have made the following contributions in this chapter: ? We design and simulate a hybrid combination of storage devices for active storage systems. ? A HDD in our design is first assigned as a write buffer to SSDs, thereby supporting deduplication service. 62 ? We design and simulate the internal-parallelism supporting cache on SSDs. ? A system simulator is built to evaluate the hybrid drives combination in active storage systems. The rest of this chapter is organized as follows. Section 5.2 briefly introduces the background and related work. Then, the design and the implementation details are described in Section 5.3. It is followed by experimental results and the performance evaluation in Section 5.4. Last but no the least, Section 5.5 is the conclusion of this chapter. 5.2 Background 5.2.1 SSD and Hybrid Storage NAND-flash-memory based SSDs, which used to be evolved in mobile devices and lap- tops, play a critical role in revolutionizing storage systems [21][52][54][56][67][81]. SSDs are completely built on semiconductor chips without any moving parts, giving rise to high ran- dom access performance and lower power consumption. And the cost of commodity NAND flash ? often cited as the primary barrier to SSD deployment [73] ? has dropped significantly, increasing the possibility for dynastic changes in the storage arena. In order to improve the storage performance, the San Diego Supercomputer Center (SDSC) has built a large flash-based cluster called Gordon, which adopts 256TB of flash memory [15]. This research project is backed by a $20 million grant from the U.S. National Science Foundation. Different from the research case, however, enterprise data centers still are unable to adopt a large-scale deployment of SSDs due to high cost, poor reliability and limited lifespan of SSDs. In the perspective of the limited lifespan, the challenge is each block on flash-based storage media has limited erase times and each block has to be erased before being written. Thus erase-before-write operations not only degrade SSD performance, but also shorten SSD lifespan. 
Table 5.1 shows the lifespan of single-level cell and multiple-level cell NAND memory in terms of P/E (Program/Erase) cycles. Based on the SSD lifetime calculator provided by the Virident website [40], the lifetime of a 200GB MLC-based SSD could be as short as 160 days if writes are performed on the SSD at a sustained rate of 50MB/s (roughly, the total write endurance, capacity times P/E cycles, divided by the sustained write rate). Thus, traditional magnetic hard disks are still regarded as indispensable in the storage hierarchy because of their merits of low cost, huge capacity, above-average reliability, and fast sequential access speed. Instead of simply replacing HDDs with SSDs, researchers [49][68][51][80][33][95][102] believe that it is attractive and challenging to investigate a cogent and complementary approach that balances performance, reliability, and cost.

Table 5.1: NAND Flash Memory Lifetime (P/E Cycles)
Generation   SLC       MLC      eMLC
5X           100,000   10,000   N/A
3X           100,000   5,000    35,000
2X           100,000   2,500    N/A

5.2.2 Internal Parallelism Processing on SSD

The inter-disk parallelism technique has been well explored for decades; data striping is the basic idea behind it. However, intra-disk parallelism has only recently begun to be considered, even for hard drives [88]. Since (1) there are no mechanical movements in SSDs and (2) an SSD has one or multiple identical elements that can work in parallel [4], we have a better chance of exploiting internal parallelism in SSDs than in hard drives. Park points out that intra-SSD parallelism is possible at the die, package, and plane levels; furthermore, parallelism-aware request processing is an effective way to enhance intra-SSD parallelism [100]. Chen and Zhang analyzed the essential roles of exploiting the internal parallelism of SSDs in high-speed data processing [19]. Unlike hard drives, SSDs have a Flash Translation Layer (or FTL) implemented to emulate a hard disk drive by exposing an array of logical block addresses (LBAs) to the host. The FTL spreads erase workloads evenly across the flash-based storage. An ill-designed FTL algorithm not only reduces SSD performance but also wears out SSD storage units rapidly. Chen and Zhang proposed CAFTL, a content-aware flash translation layer that enhances the lifespan of flash-memory-based SSDs [20]. CAFTL removes unnecessary duplicate writes to reduce write traffic to flash memory. Gupta and Urgaonkar proposed the Demand-based Flash Translation Layer (DFTL), which selectively caches page-level address mappings [41]. A journal remapping algorithm, JFTL, was presented by Choi and Park [23]. JFTL writes all data to a new region in an out-of-place update process by using an address remapping method [23]. Reliability and performance are two major research challenges of solid state drives. Most current research attempts to design new Flash Translation Layer algorithms that improve reliability and enhance performance by utilizing the built-on-board cache. Hence, many research projects have focused on using the cache as a write buffer. Kang applied Non-Volatile RAM (NVRAM) as a write buffer for SSDs to improve overall performance [53]. Kim and Ahn proposed a buffer management scheme called BPLRU for improving random writes in flash storage; BPLRU buffers writes to improve the performance of random writes [57]. Soundararajan and Prabhakaran designed Griffin, a hybrid storage device that buffers large sequential writes on hard drives [94]. Park and Jung also studied write-buffer-aware address mapping for flash memory devices [77].
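To make the role of a page-mapping FTL concrete, the following is a minimal, hypothetical sketch in Python (it is not CAFTL, DFTL, or JFTL, and every class and variable name is illustrative rather than taken from those designs). It shows how out-of-place updates let the device avoid a block erase on every overwrite, deferring the expensive erase to garbage collection.

# Toy page-mapping FTL (sketch only; real FTLs are far more sophisticated).
class ToyFTL:
    PAGES_PER_BLOCK = 4

    def __init__(self, num_blocks=8):
        self.num_blocks = num_blocks
        self.flash = {}                      # (block, page) -> data
        self.mapping = {}                    # logical page number -> (block, page)
        self.invalid = set()                 # physical pages holding stale data
        self.free = [(b, p) for b in range(num_blocks)
                     for p in range(self.PAGES_PER_BLOCK)]
        self.erase_count = 0

    def write(self, lpn, data):
        # Out-of-place update: the old copy is only marked invalid, not erased.
        if not self.free:
            self._garbage_collect()
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])
        ppn = self.free.pop(0)
        self.flash[ppn] = data
        self.mapping[lpn] = ppn

    def read(self, lpn):
        ppn = self.mapping.get(lpn)
        return None if ppn is None else self.flash[ppn]

    def _garbage_collect(self):
        # Pick the block with the most stale pages, save its live data, erase the
        # block (the wear-inducing operation), then rewrite the live data.
        # Assumes at least some stale pages exist when the free list runs out.
        victim = max(range(self.num_blocks),
                     key=lambda b: sum(1 for (blk, _) in self.invalid if blk == b))
        live = [(lpn, self.flash[ppn]) for lpn, ppn in self.mapping.items()
                if ppn[0] == victim]
        for p in range(self.PAGES_PER_BLOCK):
            ppn = (victim, p)
            self.flash.pop(ppn, None)
            self.invalid.discard(ppn)
            self.free.append(ppn)
        self.erase_count += 1
        for lpn, data in live:
            del self.mapping[lpn]            # the old copy no longer exists
            self.write(lpn, data)

ftl = ToyFTL()
for i in range(100):                         # repeatedly rewrite 8 logical pages
    ftl.write(i % 8, b"x")
print(ftl.erase_count)                       # far fewer erases than 100 writes

In this toy model, repeatedly overwriting a small working set triggers far fewer block erases than the number of writes, which is exactly the property that a well-designed FTL, and later the deduplication service in HcDD, tries to preserve.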
5.2.3 Data Deduplication

Data deduplication is not a new problem; it is a specialized data compression technique for eliminating coarse-grained redundant data, thereby improving storage utilization in backup/archival systems [12][31][76][10][74][29][99]. Deduplication techniques reduce the required storage capacity: duplicate data are deleted, leaving only one copy of the data to be stored, along with references to that unique copy. Different applications and data types naturally have different levels of data redundancy. Data-backup applications generally benefit the most from de-duplication because of the repeated full backups of an existing file system. To show how common duplicate data are, researchers from Ohio State University studied 15 disks installed on 5 machines, on which three kinds of file systems (i.e., Ext2, Ext3, and NTFS) can be found [20]. Across four different environments, the results show that the duplication rate ranges from 7.9% to 85.9% over the 15 disks. These results show that a well-designed approach to removing duplicate data can substantially extend the available storage space. When we consider the characteristics of NAND flash memory, such as the lack of in-place overwrites and the limited erase cycles, deduplication can further extend the lifespan of the flash memory. Traditionally, deduplication can be performed either "in-line" (i.e., as data is flowing) or "post-process" (i.e., after data have been written) [27]. Each approach cuts both ways. The "post-process" scheme stores new data on the storage media first; a process initiated when disks are idle then analyzes and deletes duplicated data. There is no need to wait for the deduplication computation before storing the data. For SSDs, however, the potential drawbacks include wasted storage space and unnecessary write operations for duplicate data. As described earlier, duplicated data waste expensive memory space and may also reduce the lifespan of SSDs. On the other hand, the "in-line" scheme handles the deduplication calculations in real time on the target device as the data enters it. If the deduplication engine spots a block that has already been stored on the media, the engine simply maps to the existing block without storing the new one. The benefit is that less storage is required because data is never duplicated. On the negative side, however, the "in-line" scheme frequently suffers from computation overhead, which reduces the data throughput of the device. In this chapter, we propose the use of hybrid disk systems to provide a deduplication service in active storage. The deduplication service in this study is also hybrid: a combination of both the "in-line" and "post-process" styles. The design details can be found in Section 5.3. The benefits of our design are: (1) from the perspective of the host machine, data is stored to the HDD buffer in the active storage system directly, without waiting for the deduplication service to complete, (2) the deduplication computation is offloaded to a dedicated processor in the active storage system, and (3) after the deduplication procedure, the unique (i.e., deduplicated) data are written to the SSD in the active storage system.

5.3 The Design of HcDD: a Hybrid Combination of Disk Devices
5.3.1 System Architecture

In order to (1) extend the lifespan and improve the reliability of SSDs, (2) save storage space, and (3) boost the write performance of SSDs, we design two techniques (i.e., two buffers): a hybrid combination of disk drives and an enhanced on-SSD buffer with internal parallelism. As mentioned in previous sections, the benefits of the proposed schemes are:
• The elimination of duplicate writes and redundant data to SSDs via a hybrid deduplication mechanism combining both the "in-line" and "post-process" approaches.
• The offloading of the deduplication computation from host machines.
• Fast internal data transfers through parallel processing via an enhanced on-board buffer in the SSDs.
In this section, we describe the design of a simulated hybrid disk system (hereinafter referred to as HcDD), which stands for the hybrid combination of disk drives in an active storage system. The reasons we conducted this hybrid-disk research using simulation are: (1) modifying both the disk controller and the FTL module of SSDs is impossible without support from hardware vendors, and (2) it is important to evaluate the performance of an emerging architecture before its implementation. Below are descriptions of the modules that make up an active storage system.
• Disk Drive. The disk drive is a major component in any storage system, where data are permanently stored.
• Controller. The controller is a processing unit (i.e., a data management unit) for the disks in a storage system. The controller communicates between disks and host machines, manages disk drives, and distributes data among disks.
• Deduplication Engine. It is in charge of computing fingerprints for incoming requests, looking the requests up in a fingerprint table, and deciding whether requests should be written to SSDs.

Figure 5.1: The System Configuration of HcDD, a hybrid active storage system.

Figure 5.1 depicts the architecture of an HcDD in an active storage system. As mentioned in previous chapters (see Chapter 3.3.1), the active node can be connected with a host machine or a cluster by either network connections or I/O connections. The active storage system has its own computation facilities (i.e., a dedicated processor), memory, and storage facilities, including a disk controller, a deduplication engine, and some disk drives. Each HcDD contains a hybrid combination of two kinds of storage devices: SSDs serving as the main data disks and HDDs serving as the write buffers. Further, we enhance the on-SSD buffer to support internal parallel processing.

Figure 5.2: The System Architecture of HcDD. There are three modules: a controller, a deduplication engine, and some storage devices.
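As a rough, hypothetical illustration of how these modules cooperate (a sketch only; the actual HcDD is a trace-driven simulator built on DiskSim, described in Section 5.4, and every name below is illustrative rather than part of that implementation), the controller can be thought of as routing reads to the SSD and buffering writes on the HDD, while the deduplication engine later migrates only unique pages to the SSD.

import hashlib

class HcDDSketch:
    # Toy model of the HcDD data path: reads go to the SSD; writes land on the
    # HDD buffer first and are deduplicated in the background before the SSD.
    def __init__(self):
        self.ssd = {}              # physical SSD pages: address -> data
        self.hdd_buffer = []       # (logical address, data) awaiting dedup
        self.mapping = {}          # logical address -> SSD address
        self.fingerprints = {}     # SHA-1 digest -> SSD address
        self.next_ssd_addr = 0

    def submit(self, op, addr, data=None):
        if op == "read":                       # served by the SSD directly
            return self.ssd.get(self.mapping.get(addr))
        self.hdd_buffer.append((addr, data))   # write: buffered, host returns at once

    def drain_buffer(self):
        # Post-process-style dedup, offloaded to the active storage's processor.
        for addr, data in self.hdd_buffer:
            digest = hashlib.sha1(data).digest()
            if digest in self.fingerprints:            # duplicate: remap only
                self.mapping[addr] = self.fingerprints[digest]
            else:                                       # unique: one write to flash
                ssd_addr = self.next_ssd_addr
                self.next_ssd_addr += 1
                self.ssd[ssd_addr] = data
                self.fingerprints[digest] = ssd_addr
                self.mapping[addr] = ssd_addr
        self.hdd_buffer.clear()

node = HcDDSketch()
node.submit("write", 0, b"page A")
node.submit("write", 1, b"page A")            # duplicate content
node.drain_buffer()
print(len(node.ssd))                          # 1: only the unique page reached the SSD
print(node.submit("read", 1))                 # b'page A'

The usage lines at the bottom show the intended effect: two writes with identical content result in a single page on the SSD, and a subsequent read is still served correctly from the SSD.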
5.3.2 Hybrid Combination of Storage Drives

Figure 5.2 shows the configuration of the simulated hybrid combination of storage drives. The HcDD system contains three types of modules: a controller, a deduplication engine, and storage devices. The controller module is in charge of managing and distributing incoming I/O requests. The deduplication engine compares the fingerprint of each write request with the existing fingerprints in a fingerprint table. To simulate the storage system, we integrated DiskSim into our system. DiskSim, developed by Ganger et al. [37][36], is a well-known disk simulator that has been validated against several disk drives using published disk specifications and I/O workload traces. Since DiskSim only supports storage-level simulation, its developers also provide programming interfaces to integrate DiskSim with any system-level simulator. The proposed hybrid storage model integrates both cost-effective HDDs and high-speed SSDs as a hybrid storage component in the active storage system. The controller handles data distribution, which avoids undesirable changes to existing file systems and applications. The controller directs write requests to a buffer disk (i.e., an HDD) and sends read requests to the SSDs. Thus, for read operations, the overhead caused by the controller can be ignored; there is no performance interference, since the controller simply redirects requests without any computation overhead. Once a write request is issued by a host machine, the data is written to the buffer disk of the active storage node. As soon as the data is buffered on the HDD, it is processed by the dedicated deduplication engine, which calculates a hash value and looks it up in the hash table before sending the data to the next step. The deduplicated data are then written to the SSDs; duplicates are mapped to existing copies and removed from the buffer disk. We consider the deduplication process hybrid in nature for the following reasons. From the perspective of the host machine, all requests are handled "in-line," meaning that the host incurs no calculation workload and thus no waiting time. Meanwhile, within the active node, the deduplication is handled more like the "post-process" pattern: new data are stored on the storage media first, and the deduplication engine then analyzes the data for redundancy as soon as possible. The workflow of the deduplication engine is introduced in Figure 5.3. When a write request from the input workload trace is received at the buffer disk, the deduplication processing can be described in four steps:
1. Before data are written to the buffer disk, the incoming request triggers the deduplication engine in the active storage node.
2. For each updated page in the buffer, a hash value (i.e., a hash fingerprint) is computed by the dedicated processor of the active storage node.
3. Each hash value is then looked up in a hash table, which maintains the fingerprints of the data already stored in the SSD.
4. (a) If the hash value is fresh (i.e., it is not found in the table), the write is performed as a regular operation in the SSD. (b) Otherwise, if the fingerprint is found, the mapping tables are updated by mapping the duplicate request to the physical location of the resident data, and the write operation to flash is canceled.
The goal of this step is to minimize the number of writes sent to the SSD without significantly impacting performance; in doing so, HcDD reduces the number of erase cycles, which is the performance bottleneck of NAND flash. In addition to improved performance, extending the SSDs' lifetime is a second benefit gained from the deduplication service.

Figure 5.3: The Workflow of the Hybrid Active Storage System. There are four steps: write to the buffer disk, calculate a hash value, compare the value against a value store, and write to the SSD or map to the table.

5.3.3 Intra-parallelism Buffer on SSD

An SSD is built from arrays of NAND flash memory, a semiconductor storage medium [4][13]. Agrawal et al. [4] describe that an SSD has one or multiple identical elements (i.e., packages), all of which can work in parallel. The challenge is which part of the device should be in charge of distributing data across these elements and processing requests in parallel. Unlike traditional hard drives, flash-based storage has a Flash Translation Layer (FTL), which maps Logical Block Numbers (LBNs) to Physical Block Numbers (PBNs). A remapping algorithm is also designed into the FTL for wear leveling. Thus, a block number at the file-system level is not the block number at the flash-memory level; depending on the remapping algorithm, the same LBN may map to different PBNs at different times. Hence, the buffer must be placed below the FTL to keep data consistent. Figure 5.4 presents our new architecture inside the flash memory. We chose Synchronous Dynamic Random Access Memory (SDRAM) as the on-board buffer for its high performance. The on-board buffer is the lower-level buffer in our "duo-buffer" design. Since the performance of random writes, especially re-writes, is the Achilles' heel of flash memory, the buffer is designed to hold only write requests.

Figure 5.4: The Workflow of the Hybrid Active Storage System (on-SSD SDRAM cache lists #0-#7 holding buffered write requests for flash packages #0-#7).

Figure 5.5 presents the software structure of an enhanced SSD with an SDRAM buffer. In the buffer, the number of lists is the same as the number of packages, and each list contains data for its corresponding package. Since the buffer is built under the FTL, the granularity of the buffer is an 8 KB page; the page size can be tuned. Recall that all requests buffered in SDRAM are writes. Once the buffer is full, our algorithm assigns the same number of pages to each package in parallel, since all packages are able to work independently.

Figure 5.5: The Workflow of the On-board Buffer (file system, flash translation layer, SDRAM buffer, and flash memory).
Below we present an algorithm that enhances parallelism by buffering writes. Even though there are nested loops, the number of parallel pages and the number of packages are both small, fixed values, so the time complexity is still approximately O(n), where n is the number of packages.

Algorithm 1: Enhancing Package-Level Parallelism. r denotes one request, L_i denotes the list that buffers requests for the ith package, and P_n denotes the nth package.
if r_current is a write request then
    find its corresponding package P_i
    for each r_j in L_i do
        if r_j = r_current then
            remove r_j
        end if
    end for
    add r_current to the tail of L_i
    if the buffer is full then
        for m = 1 to (number of parallel pages) do
            for n = 1 to (number of packages) do
                issue the head of L_n to P_n
            end for
        end for
    end if
end if

5.4 Evaluations

The HcDD storage system is implemented and evaluated using a comprehensive trace-driven simulator. In this section, we present the experimental environment and the results of the HcDD simulation studies under a variety of configurations.

5.4.1 The Evaluation Environment

HcDD is implemented as a simulator that emulates the behavior of a hybrid active storage system. There are two types of disk drive modules in the simulated system. The HDD module, a 15,000 RPM Seagate Cheetah 15.5K SAS hard disk drive, is provided by DiskSim 4.0 [36], which emulates a hierarchy of storage components including buses, controllers, and disks. The enhanced SSD module supporting internal-parallelism processing is implemented on top of the Microsoft Research SSD extension [85] for the DiskSim simulation environment [36]. Although the Microsoft extension implements the major components of an FTL (i.e., indirect mapping, garbage collection, and wear-leveling policies), it does not have an on-board buffer, which is an essential facility in newly released commodity SSDs. Thus, we designed and implemented an enhanced on-board buffer. In this section, we mainly compare four different storage configurations: (1) HcDD, which consists of a deduplication engine and a hybrid combination of a RAID 0 array of two enhanced SSDs and a RAID 0 array of two HDDs, (2) Hybrid Storage, a traditional hybrid storage scheme with one RAID 0 array of two SSDs and one write-buffer RAID 0 array of two HDDs (referred to as Trad.Hybrid in the following figures), (3) a RAID 0 array of two SSDs (hereinafter referred to as SSDs), and (4) a RAID 0 array of two HDDs (hereinafter referred to as HDDs). We evaluate the HcDD design by running simulations on three real-world application traces (see Table 5.2): traceKernel, tracePhoenix, and Financial2 [61]. "traceKernel" was collected while the target machine was compiling the Linux kernel. "tracePhoenix" was collected while the target machine was running a MapReduce application, WordCount, provided by the Phoenix MapReduce System [83]. "Financial" is from OLTP (Online Transaction Processing) applications running at a large financial institution, provided by the Laboratory for Advanced System Software at the University of Massachusetts Amherst. "traceKernel" and "tracePhoenix" were collected on an HP ProLiant ML110 G6 workstation with an Intel Xeon X3430 processor, 2GB of main memory, and a 500GB 7,200 RPM Seagate Barracuda hard disk drive. The operating system is Ubuntu 10.10 with the Ext3 file system. The logical addresses of all traces were scaled down uniformly so that each request's address maps to a physical address within the scope of the SSD configuration (32GB in this study).
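As a hedged illustration of this address scaling (a sketch only; the 1 TB source address span and the 512-byte sector size below are assumed values, and the dissertation does not give the actual preprocessing script), each logical block address can be mapped proportionally into the 32 GB simulated device.

ORIGINAL_SPAN = 1 << 40        # assumed address span of the raw trace (1 TB)
TARGET_SPAN = 32 << 30         # 32 GB simulated SSD
SECTOR = 512                   # assumed bytes per logical block

def scale_lba(lba: int) -> int:
    # Shrink a trace LBA uniformly so it falls inside the simulated SSD.
    byte_offset = lba * SECTOR
    scaled_offset = byte_offset * TARGET_SPAN // ORIGINAL_SPAN
    return scaled_offset // SECTOR

# Example: a request near the top of the 1 TB range lands near the top of 32 GB.
print(scale_lba((1 << 40) // SECTOR - 1))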
Table 5.2 presents the statistics of the three traces.

Table 5.2: Number of Read/Write Requests
Trace          Read Requests   Write Requests
traceKernel    84,847          70,243
tracePhoenix   822,909         155,413
Financial      2,252,549       480,127

5.4.2 Internal-Parallelism-Supported Buffer for SSD

An SSD has one or multiple identical elements (i.e., packages), all of which can work in parallel. In this section, we compare our internal-parallelism SSD algorithm against the classic LRU cache management algorithm. The performance impacts of parallelism level and buffer size are presented and discussed below. In the first test, two general-purpose traces provided with the Microsoft Research SSD extension [85], iozone and postmark, are evaluated. The size of the write buffer scales from 1 MB to 64 MB. From Figure 5.6, we observe that when the buffer size is small (1 MB or 2 MB), the internal-parallelism scheme surpasses the LRU one. However, as the buffer size increases, the average response time of the internal-parallelism scheme becomes either similar to or worse than that of its competitor. Thus, in Section 5.4.4, we choose a 1 MB buffer for the HcDD simulator. In the second test, we use the random write trace. We do not provide a comparison for reads, because internal parallelism does not affect read operations in SSDs. There are two observations from Figure 5.7: (1) when the random write requests are small (e.g., 250 KB), the performance of the internal-parallelism scheme and LRU are similar, and (2) when the request size is 5 MB, the response time is reduced by the internal-parallelism scheme compared with LRU.

Figure 5.6: Performance Comparison of iozone and postmark.
Figure 5.7: Performance Comparison of Random Writes.

Figure 5.8: Performance Comparison of traceKernel.

Besides the comparison between the internal-parallelism algorithm and the LRU scheme, we also evaluate the performance impact of interleaving levels (a.k.a. parallelism levels). Running two traces collected from real-world applications, traceKernel and tracePhoenix, we scale the parallelism level in the buffer from 8 to 64 and the size of the write buffer from 1 MB to 64 MB. The results show that (1) the internal-parallelism algorithm outperforms LRU in most cases, and (2) parallelism level 8 is slightly better than the other three levels. The optimal parallelism level depends on how many packages (elements) are on a board, the bus bandwidth, data access patterns, and the like. Overall, the internal-parallelism algorithm improves performance in most cases compared with traditional LRU, especially for random writes (with no impact on read-only workloads). As the on-board buffer becomes an essential component of the SSD, our solution provides an approach to fully utilizing the buffer and boosting the performance of the SSD.

Figure 5.9: Performance Comparison of tracePhoenix.

5.4.3 Deduplication

We simulate the deduplication process within the HcDD storage system. The overhead of deduplication is derived from measurements reported in previous research [20]. Chen et al.
ran the deduplication function on the SimpleScalar-ARM simulator [65] to extract the total number of cycles needed to execute the function. The result for running the SHA-1 hashing function on a 4KB page is 47,548 cycles. Thus, we generate the latency by dividing the number of cycles by the processor frequency. In the HcDD simulator, we assign a low-end 1GHz dedicated processor to the active storage node (about 47.5 microseconds per 4KB page at that clock rate). The overhead could be reduced further, since higher-frequency processors are common in active storage nodes. HcDD provides the service of removing duplicated writes to SSDs. Let us denote num_total and num_deduplicated as the total number of requested writes and the number of deduplicated writes actually written to the SSD, respectively. We have

DeduplicationRate = (num_total - num_deduplicated) / num_total.    (5.1)

Figure 5.10: The number of removed requests and the deduplication ratio.

Figure 5.10 shows the deduplication ratios and the numbers of "removed" requests. We observe that the deduplication rates of traceKernel, tracePhoenix, and Financial are 29.15%, 46.93%, and 28.9%, respectively. The numbers of write requests removed before reaching the flash memory are 20,731, 72,933, and 138,748, respectively (for example, for tracePhoenix, 72,933 removed writes out of the 155,413 total writes in Table 5.2 gives 72,933 / 155,413, which is approximately 46.93%). The resulting decrease in Program/Erase cycles can extend the flash memory's lifespan.

5.4.4 System Performance Evaluation

In this section, we evaluate the overall system performance of HcDD. The goal is to compare HcDD with three traditional storage architectures: Trad.Hybrid, SSDs, and HDDs.
Figure 5.11: Processing Time Comparison between HcDD and traditional hybrid storage.

Response time (including average response time) and the number of write requests written to the SSD are the two major metrics in this set of experiments, measuring speed and reliability, respectively. Before testing the system, we have two expectations: (1) in terms of response time, the performance of HcDD should be better than those of the Trad.Hybrid and HDDs schemes but close to that of SSDs, and (2) in terms of reliability (measured in this study by the number of requests written to the flash memory), HcDD should be better than SSDs and Trad.Hybrid. HcDD is not a "speed-oriented" design; it targets enhancing the reliability of SSDs while still providing short response times. When compared with Trad.Hybrid (see Figure 5.11 and Figure 5.12), HcDD reduces both total response time (by 12% on average) and average response time (by 13% on average). And, as mentioned in the previous section, 29.15%, 46.93%, and 28.9% of requests, respectively, are removed from being issued to the SSD. Even though the deduplication process in HcDD introduces computation overhead, the performance of HcDD is still better because fewer write requests are forwarded to the SSD and data transfers occur in parallel inside the enhanced SSDs.

Figure 5.12: Average Response Time Comparison between HcDD and traditional hybrid storage.

When comparing HcDD with the SSDs scheme, we observe intriguing results. For traceKernel and tracePhoenix (see Figure 5.13), the response time of HcDD is 54 seconds and 123 seconds longer, respectively, than that of the SSDs approach. But for the Financial trace, HcDD outperforms the SSDs scheme by 50 seconds.
This improvement depends on the number of deduplicated writes, because the poor random-write performance of flash memory can largely degrade the overall performance of data-intensive applications. We do not show the total response time of the HDDs scheme, because HDDs' performance is much worse than that of HcDD (HcDD is almost 3 times faster). Figure 5.14 presents the same trend in terms of the average response times of HcDD and the two competitors. In summary, based on the evaluation results, we can see that HcDD fulfills the previous expectations. It not only responds to I/O requests at a very competitive speed (better than the traditional hybrid disk), but it also removes duplicate writes, which account for roughly 29% to 47% of the write requests in our traces, before they reach the SSD. This saves storage space and extends the lifespan. Thus, HcDD is a promising hybrid storage model for data-intensive server applications, one that balances performance, saves storage space, and extends the lifespan of the SSD.

Figure 5.13: Response Time Comparison between HcDD and SSD (differences of 54, 123, and 50 seconds).

Figure 5.14: Average Response Time Comparison between HcDD and SSD.

5.5 Summary

The use of NAND-flash-based Solid State Devices (hereinafter referred to as SSDs) has expanded into primary storage systems in enterprise servers and data centers. SSDs are built entirely on semiconductor chips without any moving parts, resulting in high random-access performance and lower power consumption. However, many engineers still hesitate to perform a large-scale deployment of SSDs due to their limited reliability and lifespans. In this chapter, we propose a hybrid combination of disk arrays designed for active storage systems, using hard disk drives as a write buffer to cache write requests and de-duplicate redundant write requests to the SSD.
Read requests are all served from the SSDs directly. Unlike reads, writes to the active storage are first processed on the HDDs and de-duplicated before migrating to the SSDs. The de-duplication computation and I/O operations are offloaded to a dedicated processor in the active storage system. This chapter also introduces an enhanced FTL algorithm for the SSD's buffer, which supports internal parallel processing. The proposed "duo-buffer" architecture minimizes the number of writes issued to SSDs without significantly impacting I/O performance, thereby extending the lifespan of the flash memory. The new architecture also utilizes the parallel on-board packages to boost the write performance of SSDs. In order to evaluate the proposed architecture, we design and implement a trace-driven storage system simulator for the hybrid active storage system. We compare our internal-parallelism algorithm with the classic LRU scheme in terms of single-SSD drive performance. The results show that our proposed algorithm outperforms LRU in most cases, when the interleaving level is 8 and the buffer size is 1 MB or 2 MB. Then, we compare the overall performance of HcDD with the other three storage configurations, including the traditional hybrid storage scheme, the SSDs disk array, and the HDDs disk array. The evaluation results are generated using several traces collected from daily usage and from applications in academic and financial settings. In terms of system response time, HcDD beats the traditional hybrid disk array in all three tests and even performs slightly better than the SSD disk array in the data-intensive trace test. Based on the evaluation, we observe that HcDD is a promising configuration for data-intensive applications because it (1) utilizes the parallel processing capability of the on-disk buffer and (2) balances system I/O performance against SSD lifespan.

Chapter 6
Conclusion and Future Work

In this dissertation, we have designed an active storage architecture, studied a pipeline-parallel processing pattern to apply the active storage node in a cluster system, and explored a new disk configuration of the active storage system, the hybrid combination of disk devices. This chapter concludes the dissertation by summarizing the contributions and describing future directions.

6.1 McSD: Multicore-Enabled Smart Storage for Clusters

Processor-enabled smart storage can improve the I/O performance of data-intensive applications by processing data directly on storage nodes, because smart nodes avoid moving data back and forth between storage and host computing nodes. Thanks to advancing manufacturing technology, it is possible to integrate multi-core processors into smart storage nodes. In this study, we implemented a prototype called McSD for multicore-enabled smart storage that can improve the performance of data-intensive applications by offloading data processing to multicore processors employed in the storage nodes of computing clusters. Our McSD system differs from conventional smart/active storage in two ways. First, McSD is a smart storage node rather than a smart disk. Second, McSD can leverage multi-core processors in storage nodes to improve the performance of data-intensive applications running on clusters. The four major contributions of this study are:
• A prototype of next-generation multicore-enabled smart data storage.
• A programming framework, which includes MapReduce-like programming APIs and a runtime environment for multicore-based smart storage in the context of clusters.
• Development of three benchmark applications to test McSD for clusters.
• Single-application experimental results and a multiple-application performance evaluation.
McSD, along with its programming framework, enables programmers to write MapReduce-like code that automatically offloads data-intensive computation to multicore processors residing in smart storage nodes. The McSD programming framework allows smart storage nodes to take full advantage of their embedded multi-core processors. The APIs and runtime environment in this programming framework automatically handle computation offloading, data partitioning, and load balancing. The McSD prototype was implemented in a testbed, a 5-node cluster containing both host computing nodes and McSD smart-storage nodes. Our experimental results were taken by running three real-world applications on the testbed. The tested data-intensive applications include Word Count, String Matching, and Matrix Multiplication. Our multicore-enabled smart storage system, McSD, significantly reduces the execution time of all three applications. Overall, we conclude that McSD is a promising approach to improving the I/O performance of data-intensive applications.

6.2 Using Active Storage to Improve Bioinformatics Application Performance: A Case Study

We have presented a pipelining technique that allows computing nodes and active storage nodes in a cluster to perform CPU-intensive and data-intensive computing in parallel. The technique exploits parallelism among data-processing transactions in a sequential transaction stream, where each transaction is sequentially handled by computing and storage nodes in the cluster. One central trait of the pipelining technique is that it makes use of active storage to maximize the throughput of data-intensive applications (e.g., bioinformatics applications) on high-performance clusters. We have shown that synchronization between computing and storage nodes can hinder active storage nodes from accelerating data-processing operations. Our technique solves the blocking problem that prevents active storage and computing nodes from processing data in parallel. The implementation of pp-mpiBLAST demonstrates that active storage is an effective approach to improving data-intensive applications' overall performance by offloading data processing to storage nodes in clusters. We implemented the pipelining technique in a bioinformatics application called pp-mpiBLAST, which incorporates mpiBLAST, an open-source parallel BLAST tool. The pipeline in pp-mpiBLAST processes a sequential transaction stream, in which each transaction contains a filtering/formatting task and an mpiBLAST task. The pipelining technique overlaps data filtering/formatting in active storage with parallel BLAST computations in computing nodes. Measurements made from a working implementation and an analytic model have been encouraging. The pipeline and active storage not only reduce mpiBLAST's overall execution time by up to 50%, but they also achieve high scalability on large-scale clusters.

6.3 HcDD: Hybrid Combination of Disk Drives in Active Storage

The use of NAND-flash-based Solid State Devices (hereinafter referred to as SSDs) has expanded into primary system storage in enterprise servers and data centers.
SSDs are built entirely on semiconductor chips without any moving parts, which results in high random-access performance and lower power consumption. Meanwhile, however, users still hesitate to perform a large-scale deployment of SSDs because of their limited reliability and lifespan. In this dissertation, we propose a hybrid combination of disk arrays designed for active storage systems, which uses hard disk drives as a write buffer to cache write requests and de-duplicate redundant write requests to the SSD. Read requests are all served from the SSDs directly, while writes to the active storage are first directed to the HDDs and de-duplicated before migrating to the SSDs. The de-duplication computation and I/O operations are offloaded to the dedicated processor in the active storage. We also introduce an enhanced FTL algorithm for the SSD's buffer, which supports internal parallel processing. Together, the proposed architecture minimizes the number of writes sent to the SSD without significantly impacting performance, which extends the lifespan of the flash memory, and it utilizes the parallel on-board packages to boost the write performance of the SSD. In order to evaluate the proposed architecture, we design and implement a trace-driven storage system simulator, the hybrid combination of disk drives (HcDD), for the active storage node. First, we compare our internal-parallelism algorithm with the classic LRU scheme in terms of single-SSD drive performance. The results show that our proposed algorithm outperforms the classic LRU algorithm in most cases, when the interleaving level is 8 and the buffer size is 1 MB or 2 MB. Then, we compare the overall performance of HcDD with the other storage configurations, including the traditional hybrid storage scheme, the SSDs disk array, and the HDDs disk array. The evaluation results are generated using several traces collected from daily usage and from applications in academic and financial settings. In terms of system response time, HcDD beats the traditional hybrid disk array in all three tests and even performs slightly better than the SSD disk array in the data-intensive trace test. Based on the evaluation, we observe that HcDD is a promising configuration for data-intensive applications because it (1) utilizes the parallel processing capability of the on-disk buffer and (2) balances system I/O performance against SSD lifespan.

6.4 Future Work

In the course of designing and implementing active and hybrid storage systems, we have identified two interesting techniques that may further improve the system in terms of performance and energy efficiency. This section overviews those two techniques: memory compression and parallel processing using wimpy nodes.

6.4.1 Memory Compression

Compression has been used in many settings to increase the effective size of a storage device or to increase effective bandwidth. Other researchers have proposed integrating compression into the memory hierarchy. It has been shown to be a feasible approach to improving the I/O performance of data-intensive applications [96], [16]; the basic idea is to reserve some memory and use this region in compressed form instead of holding uncompressed pages. In general, the in-memory-compression approach creates a RAM-based block device that acts as a swap region: pages that would be swapped to disk are instead compressed and stored in memory itself.
Compressing pages and keeping them in RAM virtually increases its capacity, allowing more data to fit in a given, limited amount of main memory. Because of the increase in available memory capacity, accesses to the local disk, i.e., the swap partition, become fewer. Thus, to a certain degree, memory compression offsets the performance loss incurred by the data compression cost; in-memory compression offers more available memory space at an affordable price. Existing studies of memory compression focus only on general-purpose usage, such as compcache [1]. Tuduce and Gross [96] mention that the potential benefits of memory compression depend on three issues: (a) the size of the compressed area, (b) an application's compression ratio, and (c) an application's access pattern. In order to benefit wimpy machines, we plan to tailor the existing project to (1) the hardware's characteristics and (2) the data access pattern.

6.4.2 Wimpy System Board

For decades, the notion of "performance" has been synonymous with "speed." High-performance supercomputers and clusters consume egregious amounts of electrical power and produce so much heat that extravagant cooling facilities must be constructed to ensure proper operation [2]. Recently, along with the reconsideration of power consumption, "green" or energy-efficient computing has become widespread not only in personal devices and appliances but also in the supercomputer and cluster computing arena. For example, the Green500 list [2], a complement to the Top500 supercomputers [3], was inaugurated in 2008. Its unveiling ushered in a new era in which supercomputers can be compared by performance per watt. FAWN [7] is an exploration of building well-matched cluster systems by fully utilizing low-power, efficient embedded CPUs with flash storage. It is claimed to have the potential of achieving high performance while being fundamentally more energy-efficient than conventional architectures for serving massive-scale I/O and data-intensive workloads. Andersen et al. proposed the principles of the FAWN architecture and the design and implementation of FAWN-KV (a key-value storage system). FAWN has a few limitations, since the published results focus only on key-value workloads, where near-perfect performance scale-up compensates for the poor performance of individual wimpy nodes [62]. In our work, inspired by FAWN, we want to apply wimpy system boards to general-purpose unstructured data processing for real-world data-intensive applications. Inspired by this research on wimpy nodes, the energy-efficient parallel active storage system (PASS) empowers the traditional storage node in the cluster with parallel wimpy data-processing peripherals. Thus, the storage system is capable of processing some data-intensive tasks close to where the data is stored in an energy-efficient way. The proposed architecture of PASS is depicted in Fig. 6.1. The hardware testbed contains four types of machines: a client PC, a front node of the cluster, several computation machines, and a storage system. Note that the storage node is made up of an end node connected to the disk array and a computation module built from wimpy system boards. We consider the I/O path of the storage device as a programmable computational substrate, which supports general-purpose computing at certain points on the path. In other words, the computation module is considered part of the I/O path.
For example, in Fig. 6.1, the red line indicates a programmable I/O path. There are two levels of parallel computation: (1) from the cluster perspective, tasks are processed in a pipelined fashion, and (2) within the computation module of the storage system, data are processed in parallel under the MPI standard.

Figure 6.1: The Work Flow of PASS (client, LAN/WAN, interconnection network, front node, computation nodes, and a storage system whose end node is augmented with flash-equipped wimpy nodes).

We will develop a benchmark suite in order to evaluate the PASS system. It will include cluster-level benchmarks (e.g., computation and I/O benchmarks) and real data-intensive applications (e.g., BioParallel and mpiBLAST). Besides speed, the system will be evaluated in terms of new metrics, such as performance per watt and energy efficiency for improved reliability.

6.5 Conclusion

This dissertation has presented an active and hybrid storage system designed to offload data-intensive applications to where the data is stored. The experimental results reported in the dissertation have shown that the proposed techniques can deliver promising performance improvements by having the active storage system cooperate with host machines. We applied our scheme to a real-world bioinformatics application as a case study. In addition, the proposed system utilizes a hybrid combination of storage devices in order to achieve a balance between reliability and performance.

Bibliography

[1] Compcache project, 2010. http://code.google.com/p/compcache/.
[2] Green500 list, 2010. http://www.green500.org/.
[3] Top500 supercomputer sites, 2010. http://www.top500.org.
[4] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. Design tradeoffs for ssd performance. In ATC '08: USENIX 2008 Annual Technical Conference, pages 57-70, Berkeley, CA, USA, 2008. USENIX Association.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, October 1990.
[6] Khalil Amiri, David Petrou, Gregory R. Ganger, and Garth A. Gibson. Dynamic function placement for data-intensive cluster computing. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC '00, pages 25-25, Berkeley, CA, USA, 2000. USENIX Association.
[7] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. Fawn: a fast array of wimpy nodes. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 1-14, New York, NY, USA, 2009. ACM.
[8] D.G. Andersen and S. Swanson. Rethinking flash in the data center. Micro, IEEE, 30(4):52-54, July-Aug. 2010.
[9] Apache. Apache hadoop, 2006. http://lucene.apache.org/hadoop/.
[10] A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using dedupalog. In Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on, pages 952-963, 2009.
[11] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56-67, 2009.
[12] D. Bhagwat, K. Eshghi, D.D.E. Long, and M. Lillibridge.
Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium on, pages 1-9, Sept. 2009.
[13] Simona Boboila and Peter Desnoyers. Write endurance in flash drives: measurements and analysis. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST '10, pages 9-9, Berkeley, CA, USA, 2010. USENIX Association.
[14] Enrique V. Carrera, Eduardo Pinheiro, and Ricardo Bianchini. Conserving disk energy in network servers. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 86-97, New York, NY, USA, 2003. ACM.
[15] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications. SIGPLAN Not., 44:217-228, March 2009.
[16] R. Cervera, T. Cortes, and Y. Becerra. Improving application performance through swap compression. In ATEC '99: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 46-46, Berkeley, CA, USA, 1999. USENIX Association.
[17] Li-Pin Chang. Hybrid solid-state disks: combining heterogeneous nand flash in large ssds. In Proceedings of the 2008 Asia and South Pacific Design Automation Conference, ASP-DAC '08, pages 428-433, Los Alamitos, CA, USA, 2008. IEEE Computer Society Press.
[18] Feng Chen, Song Jiang, and Xiaodong Zhang. Smartsaver: Turning flash drive into a disk energy saver for mobile computers. In Low Power Electronics and Design, 2006. ISLPED '06. Proceedings of the 2006 International Symposium on, pages 412-417, Oct. 2006.
[19] Feng Chen, Rubao Lee, and Xiaodong Zhang. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In Proceedings of the 17th International Symposium on High Performance Computer Architecture, ISCA '11. IEEE Computer Society, 2011.
[20] Feng Chen, Tian Luo, and Xiaodong Zhang. Caftl: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST '11, pages 77-90. USENIX Association, 2011.
[21] Shimin Chen. Flashlogging: Exploiting flash devices for synchronous logging performance. In Proceedings of SIGMOD '09, June 2009.
[22] Steve C. Chiu, Wei-keng Liao, Alok N. Choudhary, and Mahmut T. Kandemir. Processor-embedded distributed smart disks for i/o-intensive workloads: architectures, performance models and evaluation. J. Parallel Distrib. Comput., 65(4):532-551, 2005.
[23] Hyun Jin Choi, Seung-Ho Lim, and Kyu Ho Park. Jftl: A flash translation layer based on a journal remapping for flash memory. Trans. Storage, 4:14:1-14:22, February 2009.
[24] Ritendra Datta, Jia Li, and James Z. Wang. Content-based image retrieval: approaches and trends of the new age. In Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR '05, pages 253-262, New York, NY, USA, 2005. ACM.
[25] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with network-attached secure disks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 272-284. ACM Press, 1997.
[26] Jeffrey Dean and Sanjay Ghemawat.
Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107?113, 2008. [27] S. DEBNATH, B. SENGUPTA and J. LI. Chunkstash: speeding up inline storage deduplication using flash memory. In In Proceedings of USENIX?10, June 2010. [28] David J. DeWitt and Paula B. Hawthorn. A performance evaluation of data base machine architectures (invited paper). In Proceedings of the seventh international conference on Very Large Data Bases - Volume 7, VLDB ?1981, pages 199?214. VLDB Endowment, 1981. [29] J. Dinerstein, S. Dinerstein, P.K. Egbert, and S.W. Clyde. Learning-based fusion for data deduplication. In Machine Learning and Applications, 2008. ICMLA ?08. Seventh International Conference on, pages 66 ?71, dec. 2008. [30] David H. C. Du. Intelligent storage for information retrieval. Next Generation Web Services Practices, International Conference on, 0:214?220, 2005. [31] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios. Duplicate record detection: A survey. Knowledge and Data Engineering, IEEE Transactions on, 19(1):1 ?16, jan. 2007. [32] E.J. FELIX, K. FOX, K. REGIMBAL, and J. NIEPLOCHA. Active storage processing in a parallel file system. In In Proc. of the 6th LCI International Conference on Linux Clusters: The HPC Revolution, 2006. [33] Chen Feng, David Koufaty, and Xiaodong Zhang. Hystor: Making the best use of solid state drives in high performance storage systems. In In Proceedings of International Conference on Supercomputing, ICS 2011, ICS ?11, Tuscon, Aizona, 2011. ACM. [34] Wu Feng and etc. mpiblast: Open-source parallel blast, 2010. http://www.mpiblast.org/. 96 [35] Blake G. Fitch, Aleksandr Rayshubskiy, Michael C. Pitman, T. J. Christopher Ward, and Robert S. Germain. Using the active storage fabrics model to address petas- cale storage challenges. In PDSW ?09: Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 47?54, New York, NY, USA, 2009. ACM. [36] Greg Ganger. The disksim simulation environment, 2011. [37] Gregory Robert Ganger. System-oriented evaluation of i/o subsystem performance. Technical report, 1995. [38] Garth A. Gibson and Rodney Van Meter. Network attached storage architecture. Commun. ACM, 43(11):37?45, 2000. [39] Jim Gray. Tape is dead, disk is tape, flash is disk. ram locality is king, 2007. [40] Jim Gray. Flash memory pe cycles, 2011. [41] Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. Dftl: a flash translation layer employing demand-based selective caching of page-level address mappings. In Pro- ceeding of the 14th international conference on Architectural support for programming languages and operating systems, ASPLOS ?09, pages 229?240, New York, NY, USA, 2009. ACM. [42] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, and H. Franke. Drpm: dynamic speed control for power management in server class disks. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pages 169 ? 179, june 2003. [43] Sudhanva Gurumurthi. Should disks be speed demons or brainiacs? SIGOPS Oper. Syst. Rev., 41:33?36, January 2007. [44] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. Oltp through the looking glass, and what we found there. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ?08, pages 981?992, New York, NY, USA, 2008. ACM. [45] James Hays and Alexei A. Efros. Scene completion using millions of photographs. Commun. ACM, 51:87?94, October 2008. [46] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. 
Mars: a mapreduce framework on graphics processors. In PACT ?08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 260?269, New York, NY, USA, 2008. ACM. [47] Jiahua He, Jeffrey Bennett, and Allan Snavely. Dash-io: an empirical study of flash- based io for hpc. In TG ?10: Proceedings of the 2010 TeraGrid Conference, pages 1?8, New York, NY, USA, 2010. ACM. 97 [48] Windsor W. Hsu, Honesty C. Young, and Alan Jay Smith. Projecting the performance of decision support workloads on systems with smart storage (smartstor). In ICPADS ?00: Proceedings of the Seventh International Conference on Parallel and Distributed Systems, page 417, Washington, DC, USA, 2000. IEEE Computer Society. [49] An i A. Wang, Peter Reiher, and Gerald J. Popek. Conquest: better performance through a disk/persistent-ram hybrid file system. In In Proceedings of the 2002 USENIX Annual Technical Conference, pages 15?28, 2002. [50] Sami Iren and Steve Schlosser. Database storage management with object-based stor- age devices. In Proceedings of the 1st international workshop on Data management on new hardware, DaMoN ?05, New York, NY, USA, 2005. ACM. [51] Yongsoo Joo, Youngjin Cho, Kyungsoo Lee, and Naehyuck Chang. Improving appli- cation launch times with hybrid disks. In Proceedings of the 7th IEEE/ACM interna- tional conference on Hardware/software codesign and system synthesis, CODES+ISSS ?09, pages 373?382, New York, NY, USA, 2009. ACM. [52] William K. Josephson, Lars A. Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. In Proceedings of the 8th USENIX conference on File and storage technologies, FAST?10, Berkeley, CA, USA, 2010. USENIX Association. [53] Sooyong Kang, Sungmin Park, Hoyoung Jung, Hyoki Shim, and Jaehyuk Cha. Perfor- mance trade-offs in using nvram write buffer for flash memory-based storage devices. IEEE Trans. Comput., 58:744?758, June 2009. [54] S. KAWAGUCHI, A. NISHIOKA and H. MOTODA. A flash-memory based file system. In In Proceedings of USENIX Winter, pages 155?164, Jan 1995. [55] Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A case for intelligent disks (idisks). SIGMOD Rec., 27(3):42?52, 1998. [56] H. KIM and S. AHN. Bplru: A buffer management scheme for improving random writes in flash storage. In In Proceedings of FAST?08, Feb 2008. [57] Hyojun Kim and Seongjun Ahn. Bplru: a buffer management scheme for improving random writes in flash storage. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST?08, pages 16:1?16:14, Berkeley, CA, USA, 2008. USENIX Association. [58] Youngjae Kim, S. Gurumurthi, and A. Sivasubramaniam. Understanding the performance-temperature interactions in disk i/o of server workloads. In High- Performance Computer Architecture, 2006. The Twelfth International Symposium on, pages 176 ?186, feb. 2006. [59] Peter M. Kogge, Jay B. Brockman, Thomas Sterling, and Guang Gao. Processing in memory: Chips to petaflops. In In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA ?97, page pages, 1997. 98 [60] R. Kumar, D.M. Tullsen, P. Ranganathan, N.P. Jouppi, and K.I. Farkas. Single-isa heterogeneous multi-core architectures for multithreaded workload performance. In Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, pages 64 ? 75, june 2004. [61] University of Massachusetts Amherst Lab for Advanced System Software. Umass trace repository, 2009. [62] Willis Lang, Jignesh Patel, and Srinath Shankar. 
Wimpy node clusters: What about non-wimpy workloads. In DAMON ?10:Proceedings of the Sixth International Workshop on Data Management on New Hardware. ACM, 2010. [63] H Lin, X Ma, W Feng, and N Samatova. Coordinating computation and i/o in mas- sively parallel sequence search. Parallel and Distributed Systems, IEEE Transactions on, PP(99):1 ?1, 2010. [64] Heshan Lin, Xiaosong Ma, Praveen Chandramohan, Al Geist, and Nagiza Samatova. Efficient data access for parallel blast. In IPDPS ?05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS?05) - Papers, page 72.2, Washington, DC, USA, 2005. IEEE Computer Society. [65] SimpleScalar LLC. Simplescalar 4.0 simulator, 2010. [66] Xiaonan Ma and A.L.N. Reddy. Mvss: an active storage architecture. Parallel and Distributed Systems, IEEE Transactions on, 14(10):993 ? 1005, oct. 2003. [67] T. MAKATOS, Y. KIONATOS, M. MARAZAKIS, M. D. FLOURIS, and A. BILAS. Using transparent compression to improve ssd-based i/o caches. In In Proceedings of EuroSys?10, April 2010. [68] Bo Mao, Hong Jiang, Dan Feng, Suzhen Wu, Jianxi Chen, Lingfang Zeng, and Lei Tian. Hpda: A hybrid parity-based disk array for enhanced performance and reliability. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1 ?12, 2010. [69] Gokhan Memik, Alok Choudhary, and Mahmut T. Kandemir. Design and evaluation of smart disk architecture for dss commercial workloads. In ICPP ?00: Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing, page 335, Washington, DC, USA, 2000. IEEE Computer Society. [70] Anurag Acharya Mustafa, Mustafa Uysal, and Joel Saltz. Active disks: Programming model, algorithms and evaluation. In In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 81?91, 1998. [71] David Nagle, DenisSerenyi, and Abbie Matthews. The panasasactivescale storageclus- ter: Delivering scalable high bandwidth storage. In Proceedings of the 2004 ACM/IEEE conference on Supercomputing, SC ?04, pages 53?, Washington, DC, USA, 2004. IEEE Computer Society. 99 [72] D. NARAYANAN, E. THERESKA, A. DONNELLY, S. ELNIKETY, and A. ROW- STRON. Migrating enterprise storage to ssds:analysis of tradeffs. In In Proceedings of EuroSys?09, March 2009. [73] Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, and Antony Rowstron. Migrating server storage to ssds: Analysis of tradeoffs. In In EuroSys, 2009. [74] Howard B. Newcombe and James M. Kennedy. Record linkage: making maximum use of the discriminating power of identifying information. Commun. ACM, 5:563?566, November 1962. [75] Jon Oberheide, Evan Cooke, and Farnam Jahanian. Cloudav: N-version antivirus in the network cloud. In Proceedings of the 17th conference on Security symposium, pages 91?106, Berkeley, CA, USA, 2008. USENIX Association. [76] Nohhyun Park and D.J. Lilja. Characterizing datasets for data deduplication in backup applications. In Workload Characterization (IISWC), 2010 IEEE International Sym- posium on, pages 1 ?10, 2010. [77] Sungmin Park, Hoyoung Jung, Hyoki Shim, Sooyong Kang, and Jaehyuk Cha. Write buffer-aware address mapping for nand flash memory devices. In Modeling, Analy- sis and Simulation of Computers and Telecommunication Systems, 2008. MASCOTS 2008. IEEE International Symposium on, pages 1 ?2, 2008. [78] David A. Patterson and John L. Hennessy. 
Computer Organization and Design, Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. [79] Juan Piernas, Jarek Nieplocha, and Evan J. Felix. Evaluation of active storage strate- gies for the lustre parallel file system. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC ?07, pages 28:1?28:10, New York, NY, USA, 2007. ACM. [80] L. Prada, J.D. Garcia, J. Carretero, and F. Garcia. Saving power in flash and disk hybrid storage system. In Modeling, Analysis Simulation of Computer and Telecom- munication Systems, 2009. MASCOTS ?09. IEEE International Symposium on, pages 1 ?3, 2009. [81] T. PRITCHETT and M. THOTTETHODI. Sievestore: a highly-selective, ensemble- level disk cache for cost-performance. In In Proceedings of ISCA?10, June 2010. [82] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA ?07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13?24, Washington, DC, USA, 2007. IEEE Computer Society. 100 [83] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA ?07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13?24, Washington, DC, USA, 2007. IEEE Computer Society. [84] IBM Research. Ibm unstructured information management architecture. [85] Microsoft Reserach. Ssd extension for disksim simulation environment, 2009. [86] E. RIEDEL. Object based storage (osd) architecture and systems, 2006. [87] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. Computer, 34(6):68?74, 2001. [88] Sriram Sankar, Sudhanva Gurumurthi, and Mircea R. Stan. Intra-disk parallelism: An idea whose time has come. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ?08, pages 303?314, Washington, DC, USA, 2008. IEEE Computer Society. [89] Seagate. The advantages of object-based storage secure, scalable, dynamic storage devices., 2005. [90] Anand Lal Shimpi. Anandtech: Intel?s 90nm pentium m 755: Dothan investigated, Dec 2007. http://www.anandtech.com/cpuchipsets/. [91] Muthian Sivathanu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Evolv- ing rpc for active storage. In Proceedings of the 10th international conference on Archi- tectural support for programming languages and operating systems, ASPLOS-X, pages 264?276, New York, NY, USA, 2002. ACM. [92] Muthian Sivathanu, Vijayan Prabhakaran, Florentina I. Popovici, Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Semantically-smart disk systems. In FAST ?03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 73?88, Berkeley, CA, USA, 2003. USENIX Association. [93] Gokul Soundararajan, Vijayan Prabhakaran, Mahesh Balakrishnan, and Ted Wobber. Extending ssd lifetimes with disk-based write caches. In Proceedings of the 8th USENIX conference on File and storage technologies, FAST?10, pages 8?8, Berkeley, CA, USA, 2010. USENIX Association. [94] Gokul Soundararajan, Vijayan Prabhakaran, Mahesh Balakrishnan, and Ted Wobber. Extending ssd lifetimes with disk-based write caches. 
In Proceedings of the 8th USENIX conference on File and storage technologies, FAST?10, pages 8?8, Berkeley, CA, USA, 2010. USENIX Association. [95] Guangyu Sun, Yongsoo Joo, Yibo Chen, Dimin Niu, Yuan Xie, Yiran Chen, and Hai Li. A hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1 ?12, Jan 2010. 101 [96] Irina Chihaia Tuduce and Thomas Gross. Adaptive main memory compression. In ATEC ?05: Proceedings of the annual conference on USENIX Annual Technical Con- ference, pages 29?29, Berkeley, CA, USA, 2005. USENIX Association. [97] Mustafa Uysal, Anurag Acharya, and Joel Saltz. Evaluation of active disks for large decision support databases. Technical report, Santa Barbara, CA, USA, 1999. [98] C. WHITE. Consolidating, accessing, and analyzing unstructured data. 2005. [99] William E. Winkler. The state of record linkage and current research problems. Tech- nical report, Statistical Research Division, U.S. Census Bureau, 1999. [100] Seon yeong Park, Euiseong Seo, Ji-Yong Shin, Seungryoul Maeng, and Joonwon Lee. Exploiting internal parallelism of flash-based ssds. Computer Architecture Letters, 9(1):9 ?12, 2010. [101] Jin Hyuk Yoon, Eyec Hyun Nam, Yoon Jae Scong, H. Kim, B.S. Kim, Sang Lyul Min, and Yookun Cho. Chameleon: A high performance flash/fram hybrid solid state disk architecture. Computer Architecture Letters, 7(1):17 ?20, jan. 2008. [102] Jin Hyuk Yoon, Eyee Hyun Nam, Yoon Jae Seong, Hongseok Kim, Bryan Kim, Sang Lyul Min, and Yookun Cho. Chameleon: A high performance flash/fram hy- brid solid state disk architecture. IEEE Computer Architecture Letters, 7:17?20, 2008. 102