Feature Enhancement and Performance Evaluation of BioPig Analytics

Shi, Lizhen

Date

2016-01-27

Author

Shi, Lizhen

Type of Degree

Master's Thesis

Department

Computer Science

Abstract

Next-generation sequencing produces huge collections of strings to be analyzed. These massive datasets challenge traditional analytics tools and increasingly require novel solutions adapted to big data platforms. The MapReduce software framework presents a viable solution for large-scale sequence analysis in terms of efficiency and scalability. Hadoop, an open-source implementation of the MapReduce framework, is designed to run applications on large-scale clusters built on commodity hardware. The Hadoop Distributed File System (HDFS) and Hadoop MapReduce are two important components of the Hadoop framework. HDFS provides scalable, fault-tolerant, distributed data storage, while MapReduce is the core concept of the Hadoop framework and provides a scale-out data processing solution across hundreds or thousands of nodes in a Hadoop cluster. Since Hadoop version 0.23, MapReduce has changed significantly; the new version is called MapReduce 2 (also known as YARN). YARN is the resource manager of the Hadoop cluster and can enhance the power of cluster computation. Hadoop is widely used in many domains, including bioinformatics. BioPig, the current version of which is built on Hadoop 1, is a Hadoop-based toolkit for large-scale sequence analysis. In this thesis, I present the YARN-based BioPig toolkit, an upgrade and continued development of the current version of BioPig. The upgrade brings two benefits: job throughput and cluster utilization are improved, and other computational frameworks can run on the Hadoop cluster simultaneously. k-mer counting is a preliminary step for subsequent sequence analyses in bioinformatics and plays a central role in BioPig. Unlike typical application workloads, k-mer counting generates a large volume of intermediate data, which makes general parameter-tuning guidelines inapplicable.
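To illustrate why k-mer counting produces so much intermediate data, the following is a minimal single-process sketch of the map and reduce steps in Python. It is not BioPig's implementation; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) and the sample reads are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every window of length k in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts emitted for each distinct k-mer."""
    counts: Counter = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(reads: Iterable[str], k: int) -> Tuple[Counter, int]:
    """Return the k-mer counts and the number of intermediate pairs emitted."""
    intermediate = [pair for read in reads for pair in map_kmers(read, k)]
    return reduce_counts(intermediate), len(intermediate)


# Hypothetical sample reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts, emitted = count_kmers(reads, k=3)
```

Each read of length n emits n - k + 1 pairs, and each pair carries k characters, so the map output can be several times larger than the input. In this toy input, 12 input characters yield 8 intermediate pairs holding 24 characters of keys.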
To optimize BioPig performance on a YARN cluster, I tuned Hadoop parameters according to the distinct characteristics of the k-mer counting workload from five perspectives: data compression, HDFS block size, map-side spills, JVM garbage collection, and reducer start time. The evaluation shows that this tuning achieves a significant performance gain over the baseline configuration: the overall job execution time is reduced by about 50%. Through feature enhancement and performance evaluation, this thesis provides a valuable reference for similar applications that generate large volumes of intermediate data. Besides migrating the current BioPig to YARN and tuning parameters, I also developed a new module, PigSimilarity, to extend the application domain of the BioPig toolkit.
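As a rough sketch of how the five tuning areas map onto Hadoop 2 configuration, the Python snippet below collects one or two standard Hadoop property names per area and renders them as generic `-D` options. The property names are standard Hadoop 2 settings, but every value is a placeholder assumption, not a setting reported in the thesis, and the helper `to_generic_options` is hypothetical.

```python
from typing import Dict, List

# Hadoop 2 property names for the five tuning areas.
# All values are illustrative placeholders, not the thesis's settings.
TUNING: Dict[str, str] = {
    # Data compression of intermediate map output
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
    # HDFS block size in bytes (256 MiB here)
    "dfs.blocksize": "268435456",
    # Map-side spills: sort buffer size (MB) and spill threshold
    "mapreduce.task.io.sort.mb": "512",
    "mapreduce.map.sort.spill.percent": "0.90",
    # JVM heap and garbage collector for map tasks
    "mapreduce.map.java.opts": "-Xmx2048m -XX:+UseParallelGC",
    # Reducer start time: fraction of maps completed before reducers launch
    "mapreduce.job.reduce.slowstart.completedmaps": "0.80",
}


def to_generic_options(props: Dict[str, str]) -> List[str]:
    """Render properties as Hadoop generic options: -D name=value."""
    args: List[str] = []
    for name, value in props.items():
        args += ["-D", f"{name}={value}"]
    return args


args = to_generic_options(TUNING)
```

Suitable values depend on the cluster's memory, the input size, and the chosen k, so they would need to be determined by measurement.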

URI

http://hdl.handle.net/10415/5018