Improving Performance of Hadoop Clusters

Xie, Jiong

Metadata Field	Value	Language
dc.contributor.advisor	Xiao, Qin
dc.contributor.author	Xie, Jiong	eng
dc.date.accessioned	2012-01-18T14:35:31Z
dc.date.available	2012-01-18T14:35:31Z
dc.date.issued	2012-01-18
dc.identifier.uri	http://hdl.handle.net/10415/2962
dc.description.abstract	The MapReduce model has become an important parallel processing model for large-scale data-intensive applications like data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely applied to support cluster computing jobs requiring low response time. The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most map tasks can quickly access their local data. Network delays due to data movement at run time have been ignored in recent Hadoop research. Unfortunately, both the homogeneity and data-locality assumptions in Hadoop are optimistic at best and unachievable at worst, potentially introducing performance problems in virtualized data centers. We show in this dissertation that ignoring the data-locality issue in heterogeneous cluster computing environments can noticeably reduce the performance of Hadoop. If network delays are not considered, the performance of Hadoop clusters is significantly degraded. In this dissertation, we address the problem of how to place data across nodes in such a way that each node has a balanced data-processing load. Apart from the data placement issue, we also design a prefetching and predictive scheduling mechanism to help Hadoop load data from local or remote disks into main memory. To avoid network congestion, we propose a preshuffling algorithm to preprocess intermediate data between the map and reduce stages, thereby increasing the throughput of Hadoop clusters. Given a data-intensive application running on a Hadoop cluster, our data placement, prefetching, and preshuffling schemes adaptively balance the tasks and the amount of data to achieve improved data-processing performance.
Experimental results on real data-intensive applications show that our design can noticeably improve the performance of Hadoop clusters. In summary, this dissertation describes three practical approaches to improving the performance of Hadoop clusters, and explores the idea of integrating prefetching and preshuffling in the native Hadoop system.	en_US
dc.rights	EMBARGO_GLOBAL	en_US
dc.subject	Computer Science	en_US
dc.title	Improving Performance of Hadoop Clusters	en_US
dc.type	dissertation	en_US
dc.embargo.length	MONTHS_WITHHELD:6	en_US
dc.embargo.status	EMBARGOED	en_US
dc.embargo.enddate	2012-07-18	en_US

Files in this item

Name: dissertation.pdf.txt
Size: 183.7Kb

Name: dissertation.pdf
Size: 1.830Mb
