Boosting Cloud Computing Frameworks with High Performance Computing Features
Type of Degree: Master's Thesis
Computer Science and Software Engineering
High Performance Computing (HPC) often employs a set of networked and/or distributed computers working together to solve extremely large problems. Since the early 1990s, HPC has made use of technologies such as the Message Passing Interface (MPI) middleware and InfiniBand networking to provide developers with standard means of creating solutions that leverage the power of a cluster of computers. However, MPI, InfiniBand, and other HPC-related technologies have gradually been losing their monopoly on HPC applications as modern cloud computing frameworks, such as Apache Spark, rise in popularity and begin to be used in fields traditionally dominated by HPC. Higher-speed and data-center Ethernet have also challenged InfiniBand in certain scalable applications where it was once the usual choice. One of the primary factors advancing the popularity of these new cloud computing frameworks is that they are much more user-friendly and do not immediately require expertise in parallel and/or distributed computing to develop powerful programs. However, when every bit of performance matters, technologies such as MPI and InfiniBand still have an edge because they provide many powerful features that modern cloud computing frameworks, such as Apache Spark, do not offer at present. I propose to augment Apache Spark with features currently available only in traditional HPC technologies, while maintaining the ease of use that makes Spark so attractive to its current user base. There has already been some related work in this area, namely adding support for high-speed network interconnects. In this thesis, I present an extension to Spark that provides easy task-parallel processing through the addition of the novel Multiply Distributed Resilient Distributed Dataset (MDRDD). The existing Spark implementation allows easy data-parallel processing via Resilient Distributed Datasets (RDDs).
I also propose several future enhancements to further close the gap between the performance of HPC and the ease of use of Apache Spark. The outcome of this study is that Spark is enhanced to perform user-friendly task-parallel operations in addition to the data-parallel operations it already offers. Several examples are presented to demonstrate how this functionality works, and several benchmarks are run to demonstrate that the task-parallel functionality is efficient and can be used for real-world problems. The extension to Spark implemented in this study already provides enough functionality for developers to begin taking advantage of simple task-parallelism via map commands. Once more RDD operations, such as reduce, are implemented in a similar fashion, it would be worthwhile to submit the changes to Apache for inclusion in a future version of Spark.
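The distinction between data-parallel and task-parallel processing drawn in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not Spark code, nor the MDRDD interface from the thesis, which this record does not show. The function names `data_parallel` and `task_parallel` are hypothetical and chosen purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def data_parallel(values, workers=4):
    """Apply ONE function to every element of a dataset in parallel.

    This mirrors the general pattern of a map over a distributed dataset;
    it is an analogy only, not Spark's implementation.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(square, values))


def task_parallel(tasks, dataset, workers=4):
    """Run SEVERAL different, independent functions concurrently.

    `tasks` maps a label to a function; each function receives the whole
    dataset. Hypothetical helper, assumed here for illustration.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn, dataset) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(data_parallel(data))
    print(task_parallel({"sum": sum, "max": max, "count": len}, data))
```

In the data-parallel case the work is split by element; in the task-parallel case it is split by operation, which is the kind of workload the abstract says the extension targets.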