Scalable Collective Communication and Data Transfer for High-Performance Computation and Data Analytics

Xu, Cong

View/Open

Cong Xu Dissertation.pdf (4.718Mb)

Date

2015-04-27

Author

Xu, Cong

Type of Degree

Dissertation

Department

Computer Science

Metadata

Show full item record

Abstract

Large-scale computer clusters are leveraged to solve various crucial problems in both High Performance Computing and Cloud Computing fields. Message Passing Interface (MPI) and MapReduce are two prevalent tools to tap the power of parallel data processing and computing resources in HPC and commercial machines. Scientific applications use collective communication operations in MPI for global synchronization and data exchanges. Meanwhile, MapReduce is a popular programming model that provides a simple and scalable parallel data processing framework for large-scale off-the-shelf clusters. Hadoop is an open source implementation of MapReduce and YARN is the next-generation of Hadoop’s compute platform. To achieve efficient data processing over large-scale clusters, it is crucial to improve MPI collective communication and optimize the MapReduce framework. However, the existing MPI libraries and MapReduce framework face critical issues in terms of collective communication and data movement. Specifically, the MPI AlltoallV operation relies on a linear algorithm for exchanging small messages, it cannot obtain the advantage of shared memory on hierarchical multicore system. Meanwhile, Hadoop employs Java-based network transport stack on top of the Java Virtual Machine (JVM) for its MapTask to ReduceTask all to all data shuffling and merging purposes. Detailed examination reveals that JVM imposes a significant amount of overhead to data processing. In addition, scientific datasets are stored on HPC backend storage servers, these datasets can be analyzed by Yarn MapReduce programs on compute nodes. However, the storage servers and computation powers are separated in the HPC environment, the datasets are too costly to transfer due to their sheer size. This dissertation has addressed above issues in the existing MPI libraries and MapReduce framework. Accordingly, the MPI collective communication and Hadoop data movement have been optimized. For MPI AlltoallV collective operation, we design and implement a new Scalable LOgarithmic AlltoallV algorithm, named SLOAV, for MPI AlltoallV collective operations. SLOAV aims to achieve global exchange of small messages of different sizes in logarithmic manner. Furthermore, we design a hierarchical AlltoallV algorithm based on SLOAV by taking advantage of shared memory in multicore systems, which is referred to as SLOAVx. Also this dissertation has optimized Hadoop with JVM Bypass Shuffling (JBS) plugin library for fast data movement, overcoming the existing limitations, and removing the overhead and limitations imposed by JVM. In the third study, we exploit the analytics shipping model for fast analysis of large-scale scientific datasets on HPC backend storage servers. Through an efficient integration of MapReduce and the popular Lustre storage system, we have developed a Virtualized Analytics Shipping (VAS) framework that can ship MapReduce programs to Lustre storage servers.

URI

http://hdl.handle.net/10415/4508