Efficient Movement and Task Management in MapReduce for Fast Analytics of Big Data

Wang, Yandong

Date

2014-05-12

Author

Wang, Yandong

Type of Degree

thesis

Department

Computer Science

Abstract

The MapReduce programming model has achieved great success over the past decade. With recognized merits such as superior scalability and strong fault tolerance, MapReduce has thrived as a primary processing engine adopted by leading enterprises for analyzing gigantic datasets and driving cloud services. Recently, as companies pursue high quality of service and serve requests from many customers, demand has grown for enhancing MapReduce frameworks so that they can extract the best performance from underlying systems and support multi-tenancy. However, many challenges exist in optimizing contemporary MapReduce frameworks to deliver fast job completion, fairness among many users, high cluster utilization, and platform-aware adaptability. This dissertation focuses on pushing forward the evolution of contemporary MapReduce frameworks. We have comprehensively analyzed several major MapReduce systems on different platforms, identified their limitations, and explored various optimization techniques to enhance their performance. In particular, this dissertation aims to address three major challenges that prevent MapReduce frameworks from achieving optimal performance: (1) designing high-performance I/O services that accelerate intermediate data movement in MapReduce frameworks; (2) enhancing task management to provide quality of service, in terms of efficiency and fairness, in multi-tenant MapReduce clusters; and (3) improving the adaptability of MapReduce frameworks to platforms featuring high-performance characteristics. To address these challenges, we have introduced a novel Network Levitated Merge algorithm, along with a Hadoop Acceleration framework that provides a high-performance I/O layer for MapReduce. Built on top of these techniques, MapReduce can efficiently move a deluge of intermediate data among a large number of nodes and performs better than it otherwise would. In addition, to provide satisfactory quality of service, this dissertation has designed a lightweight, work-conserving Preemptive ReduceTask and a Fast Completion Scheduler that enhance task management so that MapReduce can deliver both fast job execution and fairness in multi-tenant clusters. Our evaluation with a diverse collection of workloads demonstrates that our solutions outperform state-of-the-art MapReduce schedulers. Furthermore, to meet the demand for using MapReduce frameworks to process gigantic simulation results from scientific applications, this dissertation has thoroughly characterized MapReduce frameworks on High-Performance Computing (HPC) systems. Based on our findings, we have concluded that existing MapReduce frameworks cannot fully exploit the advantages of high-performance computing facilities. Accordingly, we have introduced several enhancement techniques to improve the adaptability of MapReduce to these platforms. Our performance evaluation corroborates the effectiveness of these techniques. Through the systematic experiments, evaluation, and analysis described in this dissertation, we have demonstrated the efficacy of our innovations while revealing further room for optimization and opportunities for future research on MapReduce frameworks.

URI

http://hdl.handle.net/10415/4153