This Is Auburn: Electronic Theses and Dissertations

Taming the Scientific Big Data with Flexible Organizations for Exascale Computing

Date

2012-07-31

Author

Tian, Yuan

Type of Degree

dissertation

Department

Computer Science

Abstract

Over the last five years, supercomputers have evolved at an unprecedented rate as High Performance Computing (HPC) continues to progress towards exascale computing in 2018. These systems enable scientists to simulate scientific processes with high fidelity at large scale and, consequently, often produce complex data that are also growing exponentially in size. However, growth within the computing infrastructure is significantly imbalanced: dramatically increasing computing power is accompanied by a slowly improving storage system. Such discordant progress among computing power, storage, and data has led to a severe I/O bottleneck that hinders the advance of scientific computing. While intensive research on next-generation storage is under way, a revolutionary upgrade to current back-end storage systems is not foreseeable in the near future. As a result, applications have become more reliant on I/O software, in the hope of alleviating the performance bottleneck by driving the storage system at full speed. Efficient I/O for scientific big data is crucial for a successful transition of HPC to exascale. However, providing high-performance I/O at the software layer is nontrivial. The large volume and high complexity of scientific data, together with the mismatch between its organization and the underlying storage system, pose grand challenges for I/O software design. This dissertation investigates the characteristics of scientific data and the storage system as a whole, and explores opportunities to improve I/O performance for petascale computing and prepare it for exascale. To this end, a set of flexible data organization and management techniques is introduced to address the I/O challenges from five directions, namely system-wide data concurrency, in-node data organization, complex I/O patterns, time dimension analytics, and asynchronous compression.
For these purposes, four key techniques are designed to exploit the capability of the back-end storage system for processing and storing scientific big data with fast and scalable I/O performance. It has been shown that these techniques can benefit real-world scientific applications with enhanced I/O performance and scalability for end-to-end data flow. They also contribute to the broader effort towards scalable data management as high performance computing progresses into exascale.