Improve I/O performance and Energy Efficiency in Hadoop Systems
Type of Degree: Dissertation
MapReduce is one of the most popular distributed computing platforms for large-scale, data-intensive applications. It has been applied to many divide-and-conquer problems, such as search engines, data mining, and data indexing. Hadoop, developed by Yahoo, is an open-source Java implementation of the MapReduce model. In this dissertation, we focus on approaches to improving the performance and energy efficiency of Hadoop clusters. We begin by analyzing the performance problems of the native Hadoop system. We observe that Hadoop's performance depends heavily on system settings such as block sizes, disk types, and data locations. Low observed network bandwidth in a shared cluster raises serious performance issues in Hadoop. To address this problem, we propose a key-aware data placement strategy called KAT for the Hadoop Distributed File System (HDFS) on clusters. KAT is motivated by our observation that a performance bottleneck in Hadoop clusters lies in the shuffling stage, where a large amount of data is transferred among data nodes. The amount of transferred data depends heavily on the locations and balance of intermediate data with the same keys. Before a Hadoop application reaches the shuffling stage, KAT pre-calculates the intermediate data key for each data entry and allocates data according to that key. With KAT in place, data sharing the same key are not scattered across the cluster, which alleviates the network bottleneck imposed by data transfers. We evaluate the performance of KAT on an 8-node Hadoop cluster. Experimental results show that KAT reduces the execution times of Grep and Wordcount by up to 21% and 6.8%, respectively. To evaluate the impact of the network interconnect on KAT, we apply traffic shaping to emulate real-world workloads in which multiple applications share the network resources of a Hadoop cluster.
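The key-aware placement idea described above can be illustrated with a minimal sketch. It is not the KAT implementation: the function names, the CRC32-based key-to-node mapping, and the simplification that the reducer for a key runs on the node chosen by that same mapping are all assumptions made for illustration. The sketch counts how many records would have to cross the network during shuffling under a key-aware layout versus a key-oblivious (round-robin) layout.

```python
import zlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Stable key-to-node mapping (CRC32 is an illustrative assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def key_aware_placement(records, extract_key, num_nodes):
    """Place each record on the node that owns its pre-computed key."""
    placement = defaultdict(list)
    for record in records:
        key = extract_key(record)  # intermediate key, computed before the job runs
        placement[node_for_key(key, num_nodes)].append(record)
    return dict(placement)


def round_robin_placement(records, num_nodes):
    """Key-oblivious baseline: spread records evenly, ignoring keys."""
    placement = defaultdict(list)
    for i, record in enumerate(records):
        placement[i % num_nodes].append(record)
    return dict(placement)


def shuffled_records(placement, extract_key, num_nodes):
    """Count records stored on a node other than the one owning their key."""
    return sum(
        1
        for node, recs in placement.items()
        for record in recs
        if node_for_key(extract_key(record), num_nodes) != node
    )


if __name__ == "__main__":
    # Hypothetical log lines; the first token plays the role of the key.
    logs = ["error disk full", "info job started", "error timeout",
            "warn slow link", "info job done", "error disk full"]
    first_token = lambda line: line.split()[0]
    aware = key_aware_placement(logs, first_token, 4)
    naive = round_robin_placement(logs, 4)
    print("key-aware shuffle count:", shuffled_records(aware, first_token, 4))
    print("round-robin shuffle count:", shuffled_records(naive, first_token, 4))
```

Under these assumptions the key-aware layout never moves a record during shuffling, while the round-robin count depends on how the keys happen to hash. A real placement strategy would also have to balance load across nodes, which this sketch ignores.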
Our empirical results suggest that when the observed network bandwidth drops to 10 Mbps, KAT shortens the execution times of Grep and Wordcount by up to 89%. To make Hadoop clusters economically and environmentally friendly, we design a new replica architecture that reduces the energy consumption of HDFS. The core idea of our design is to reduce the power consumed by extra data replicas, which our energy-efficient HDFS achieves in two steps. First, all disks within a data node are separated into two categories: primary copies are stored on primary disks, and replica copies are stored on backup disks. Second, disks holding primary copies are kept in the active mode in most cases, whereas backup disks are placed into the sleep mode. We implement the energy-efficient HDFS, which manages the power states of all disks in Hadoop clusters. Our approach conserves energy at the cost of performance, because power-state transitions introduce delays. We propose a prediction module to hide the overheads introduced by power-state transitions in backup disks.
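The two-step disk scheme described above can be sketched as a small model. This is an illustration only, not the dissertation's HDFS code: the class names, the least-loaded disk choice, and the `wake_ahead` hook (standing in for whatever the prediction module would trigger) are hypothetical. The model counts how many sleep-to-active transitions a request has to wait for.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


class PowerState(Enum):
    ACTIVE = "active"
    SLEEP = "sleep"


@dataclass
class Disk:
    name: str
    role: Role
    state: PowerState = PowerState.ACTIVE
    blocks: set = field(default_factory=set)


@dataclass
class DataNode:
    disks: list
    wakeups: int = 0  # on-demand sleep-to-active transitions (visible delay)

    def __post_init__(self):
        # Step 2 of the scheme: primary disks stay active, backup disks sleep.
        for disk in self.disks:
            disk.state = (PowerState.ACTIVE if disk.role is Role.PRIMARY
                          else PowerState.SLEEP)

    def _disks(self, role):
        return [d for d in self.disks if d.role is role]

    def store(self, block_id, role):
        """Step 1: primary copies go to primary disks, replicas to backup disks."""
        disk = min(self._disks(role), key=lambda d: len(d.blocks))
        if disk.state is PowerState.SLEEP:
            disk.state = PowerState.ACTIVE  # the request waits for the spin-up
            self.wakeups += 1
        disk.blocks.add(block_id)
        return disk

    def wake_ahead(self):
        """Hypothetical hook: wake backup disks before a predicted access."""
        for disk in self._disks(Role.BACKUP):
            disk.state = PowerState.ACTIVE

    def sleep_backups(self):
        """Return idle backup disks to the sleep mode."""
        for disk in self._disks(Role.BACKUP):
            disk.state = PowerState.SLEEP


if __name__ == "__main__":
    node = DataNode([Disk("d0", Role.PRIMARY), Disk("d1", Role.BACKUP)])
    node.store("blk_1", Role.PRIMARY)          # no transition needed
    node.store("blk_1_replica", Role.BACKUP)   # pays one on-demand wake-up
    node.sleep_backups()
    node.wake_ahead()                          # transition hidden ahead of time
    node.store("blk_2_replica", Role.BACKUP)
    print("on-demand wake-ups:", node.wakeups)
```

In this model, a correct prediction moves the spin-up out of the request path, which is the sense in which a prediction module can hide transition overhead. How accesses are actually predicted is outside the scope of the sketch.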