Feature Enhancement and Performance Evaluation of BioPig Analytics

Shi, Lizhen

Metadata Field	Value	Language
dc.contributor.advisor	Yu, Weikuan	en_US
dc.contributor.author	Shi, Lizhen	en_US
dc.date.accessioned	2016-01-27T16:01:26Z
dc.date.available	2016-01-27T16:01:26Z
dc.date.issued	2016-01-27
dc.identifier.uri	http://hdl.handle.net/10415/5018
dc.description.abstract	Next-generation sequencing produces huge collections of strings to be analyzed. These massive datasets challenge traditional analytics tools and increasingly require novel solutions adapted to big data platforms. The MapReduce software framework presents a viable solution to large-scale sequence analysis in terms of efficiency and scalability. Hadoop, an open-source implementation of the MapReduce framework, is designed to run applications on large-scale clusters built on commodity hardware. The Hadoop Distributed File System (HDFS) and Hadoop MapReduce are two important components of the Hadoop framework. HDFS provides scalable, fault-tolerant, distributed data storage, while MapReduce is the core concept of the Hadoop framework and provides a scale-out data processing solution across hundreds or thousands of nodes in a Hadoop cluster. Since Hadoop version 0.23, MapReduce has changed significantly; we call the new version MapReduce 2 (aka YARN). YARN is the resource manager of the Hadoop cluster and can enhance the power of cluster computation. Hadoop is widely used in many domains, including bioinformatics. BioPig, the current version of which is built on Hadoop 1, is a Hadoop-based toolkit for large-scale sequence analysis. In this thesis, I present the YARN-based BioPig toolkit, which is an upgrade and continued development of the current version of BioPig. The upgrade brings two benefits: job throughput and cluster utilization are improved, and other computational frameworks can run on the Hadoop cluster simultaneously. k-mer counting is a preliminary step for subsequent sequence analyses in bioinformatics and plays a central role in BioPig. Unlike typical application workloads, k-mer counting generates a large volume of intermediate data, which makes general parameter tuning guidelines inapplicable.
To optimize BioPig performance on a YARN cluster, I tuned Hadoop parameters according to the distinct characteristics of the k-mer counting workload from five perspectives: data compression, HDFS block size, map-side spills, JVM garbage collection, and reducer start time. The evaluation reveals that this tuning practice achieves a significant performance gain compared to the baseline configuration: the overall job execution time is reduced by about 50%. Through feature enhancement and performance evaluation, this thesis provides a valuable reference for other similar applications that generate a large volume of intermediate data. Besides migrating the current BioPig to YARN and tuning parameters, I also developed a new module, PigSimilarity, to extend the application domain of the BioPig toolkit.	en_US
dc.subject	Computer Science	en_US
dc.title	Feature Enhancement and Performance Evaluation of BioPig Analytics	en_US
dc.type	Master's Thesis	en_US
dc.embargo.status	NOT_EMBARGOED	en_US
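
The abstract notes that k-mer counting generates a large volume of intermediate data. The sketch below is an illustration only, not the BioPig implementation: it mimics a map step that emits one (k-mer, 1) record per window of a read and a reduce step that sums them. The function names `map_kmers` and `reduce_counts` and the sample reads are assumptions made for this example. A read of length L emits L - k + 1 intermediate records, which is why the intermediate output can be several times the size of the input.

```python
from collections import Counter


def map_kmers(read, k):
    """Map step: emit one (k-mer, 1) pair for every window of length k."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each distinct k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


# Hypothetical input: two short reads and k = 3.
reads = ["ACGTAC", "CGTACG"]
k = 3

# Every read contributes len(read) - k + 1 intermediate records.
intermediate = [pair for read in reads for pair in map_kmers(read, k)]
counts = reduce_counts(intermediate)
```

Here the two 6-base reads produce eight intermediate records that collapse to four distinct k-mers, a small-scale picture of the shuffle volume the tuning effort described in the abstract has to manage.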

Files in this item

Name:: Lizhen_Shi(lzs0047).pdf
Size:: 730.8Kb
