Efficient Storage Design and Query Scheduling for Improving Big Data Retrieval and Analytics
Type of Degree: Dissertation
With the perpetually increasing generation of and demand for digital data, humanity has entered the Big Data era. To efficiently manage, retrieve, and exploit the gigantic amount of data continuously generated by individuals and organizations across society, substantial effort has been invested in developing high-performance, scalable, and fault-tolerant data storage systems and analytics frameworks. Recently, flash-based solid-state disks and byte-addressable non-volatile memories have been introduced into the computer storage hierarchy as substitutes for traditional hard drives and DRAM, owing to their faster data access, higher density, and greater energy efficiency. Along with this trend, how to systematically integrate such cutting-edge memory technologies for fast data retrieval has become a pressing issue. In addition, from the users' point of view, some mission-critical scientific applications suffer from inefficient I/O schemes and are thus unable to fully utilize the underlying parallel storage systems, which makes the development of more efficient I/O methods appealing. Moreover, MapReduce has emerged as a powerful big data processing engine that supports large-scale, complex analytics applications, most of which are written in declarative query languages such as Hive and Pig Latin. Fast and fair big data analytics therefore requires efficient coordination between the Hive compiler and the Hadoop runtime. This dissertation investigates these research challenges and contributes efficient storage designs, I/O methods, and query scheduling techniques for improving big data retrieval and analytics. I first aim at addressing the I/O bottleneck in large-scale computers and data centers.
Accordingly, in my first study, by leveraging the advanced features of cutting-edge non-volatile memories, I devise a Phase Change Memory (PCM)-based hybrid storage architecture that provides efficient buffer management and novel wear-leveling techniques, thereby substantially improving data retrieval performance while addressing PCM's limited write endurance. In the second study, we adopt a mission-critical scientific application, GEOS-5, as a case study to profile and analyze the communication and I/O issues that prevent applications from fully utilizing the underlying parallel storage systems. Through detailed architectural and experimental characterization, we observe that the current legacy I/O scheme incurs significant network communication overhead and is unable to fully parallelize data access, degrading the application's I/O performance and scalability. To address these inefficiencies, we redesign its I/O framework with a set of parallel I/O techniques to achieve high scalability and performance. In the third study, I identify and profile important performance and fairness issues in current MapReduce-based data warehousing systems. In particular, I propose a prediction-based query scheduling framework that bridges the semantic gap between the MapReduce runtime and the query compiler, enabling efficient query scheduling for fast and fair big data analytics.
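To make the wear-leveling idea from the first study concrete, the sketch below shows a generic counter-based remapper: it tracks per-block write counts and, when the wear gap between a hot block and the coldest block exceeds a threshold, swaps their logical-to-physical mappings. This is a minimal illustrative sketch of the general technique, not the dissertation's actual algorithm; the class name, threshold parameter, and swap policy are all assumptions made for illustration.

```python
class WearLeveler:
    """Illustrative counter-based wear leveling for a PCM-like device.

    Hypothetical sketch: NOT the dissertation's technique. Maintains a
    logical-to-physical block map and swaps a hot block's mapping onto
    the least-worn physical block when the wear gap exceeds a threshold.
    """

    def __init__(self, num_blocks, swap_threshold=100):
        self.map = list(range(num_blocks))   # logical -> physical block
        self.writes = [0] * num_blocks       # per-physical write counts
        self.threshold = swap_threshold

    def write(self, logical):
        """Record a write to a logical block; return the physical block used."""
        phys = self.map[logical]
        self.writes[phys] += 1
        coldest = min(range(len(self.writes)), key=self.writes.__getitem__)
        # If this block has worn far ahead of the coldest one, swap the
        # two logical blocks' physical mappings to spread out future wear.
        if self.writes[phys] - self.writes[coldest] >= self.threshold:
            other_logical = self.map.index(coldest)
            self.map[logical] = coldest
            self.map[other_logical] = phys
        return self.map[logical]
```

Repeated writes to one hot logical block are then rotated across physical blocks, bounding the wear difference between any two blocks by the threshold; a real design would also persist the mapping table and handle data migration on each swap.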