Toward a Transparent, Checkpointable Fault-Tolerant Message Passing Interface for HPC Systems
Date
2019-12-09
Type of Degree
PhD Dissertation
Department
Computer Science and Software Engineering
Abstract
With each successive generation of large-scale high-performance computing (HPC) systems, faults and associated failures are becoming more frequent. Long-running applications on such systems require efficient fault-tolerance support. The Message Passing Interface (MPI) has been the de facto standard for HPC message-passing middleware since its first release in 1994. However, the MPI Standard itself does not provide any mechanism to continue running MPI after a failure. Thus, there is a growing effort in the MPI community to incorporate fault-tolerance constructs into MPI. When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job, which incurs unnecessary overhead. The purpose of this dissertation is to design a fault-tolerant MPI for Bulk Synchronous Parallel (BSP) applications with the goals of efficient failure recovery and easy adoption in large-scale production applications. This dissertation describes a new fault-tolerant model for MPI called “MPI Stages”. We discuss the design, applicability, and performance of the MPI Stages recovery model. Additionally, we provide the minimal MPI semantics that applications and libraries need in order to use MPI Stages. To demonstrate this new model, we introduce “ExaMPI”, a modern C++ implementation of a subset of MPI-3.x functionality. In addition, we analyze applications that use the MPI programming model to understand the most commonly used features and characteristics of MPI in next-generation exascale systems.