This Is Auburn
Electronic Theses and Dissertations

Toward a Transparent, Checkpointable Fault-Tolerant Message Passing Interface for HPC Systems


Metadata Field | Value | Language
dc.contributor.advisor | Cross, James |
dc.contributor.author | Sultana, Nawrin |
dc.date.accessioned | 2019-12-09T17:39:22Z |
dc.date.available | 2019-12-09T17:39:22Z |
dc.date.issued | 2019-12-09 |
dc.identifier.uri | http://hdl.handle.net/10415/7052 |
dc.description.abstract | With each successive generation of large-scale high-performance computing (HPC) systems, faults and associated failures are becoming more frequent. Long-running applications on such systems require efficient fault-tolerance support. The Message Passing Interface (MPI) has been the de facto standard for HPC message-passing middleware since its first release in 1994. However, the MPI Standard itself does not provide any mechanism to continue running MPI after a failure. Thus, there is a growing effort in the MPI community to incorporate fault-tolerance constructs into MPI. When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job, which incurs unnecessary overhead. The purpose of this dissertation is to design a fault-tolerant MPI for Bulk Synchronous Parallel (BSP) applications with the goals of efficient failure recovery and easy adoption in large-scale production applications. This dissertation describes a new fault-tolerant model for MPI called “MPI Stages”. We discuss the design, applicability, and performance of the MPI Stages recovery model. Additionally, we provide the minimal MPI semantics for applications and libraries to use MPI Stages. To demonstrate this new model, we introduce “ExaMPI”, a modern C++ implementation of a subset of MPI-3.x functionality. In addition, we analyze applications that use the MPI programming model to understand the most commonly used features and characteristics of MPI in next-generation exascale systems. | en_US
dc.subject | Computer Science and Software Engineering | en_US
dc.title | Toward a Transparent, Checkpointable Fault-Tolerant Message Passing Interface for HPC Systems | en_US
dc.type | PhD Dissertation | en_US
dc.embargo.status | NOT_EMBARGOED | en_US
dc.contributor.committee | Skjellum, Anthony |
dc.contributor.committee | Umphress, David |
dc.contributor.committee | Qin, Xiao |
dc.contributor.committee | Bangalore, Purushotham |
dc.creator.orcid | https://orcid.org/0000-0003-1597-3369 | en_US
