Analyzing the Effects of Sequencer Discrepancies on Next-Generation Genome Assembly Tools
Date
2016-08-04
Type of Degree
Master's Thesis
Department
Computer Science and Software Engineering
Abstract
The advent of Next-Generation Sequencing (NGS) techniques in the early 21st century massively increased genetic sequencing throughput while dramatically reducing associated costs. This in turn lowered barriers to entry sufficiently to permit vastly expanded research interests. To handle the resulting explosion of sequencing data, new techniques for assembling genomes, transcriptomes, and proteomes were required. In the last 15 years, numerous tools for each of these assembly categories have arisen, each claiming superiority over the others. In particular, de novo genome assembly has spawned more than 75 tools utilizing different assembly pipelines, error-correcting methods, and novel data structures. Previous works have shown that no one tool can lay claim to general supremacy; some are, by design or happenstance, better suited to certain data types (e.g. human, plant, or bacterial genomes). What these works have not done is show how variations in sequenced libraries affect assembly or explain why these effects occur. The goal of this work, therefore, is to analyze these effects. Execution of this goal is split into two primary parts: an in-depth architectural analysis of several popular de novo genome assemblers, including expected behavioral changes across sequencer variations, and evaluations of these tools using data sets permuted over a range of coverage depths, read lengths, and read types. The focus of this work is to assess the flexibility of several popular de novo genome assemblers (which can be grouped as utilizing either de Bruijn graphs or a hybridized approach for their assembly) with respect to sequencer variations over a single genome. The results of the evaluations revealed a startlingly high sensitivity to variation in the de Bruijn-based assemblers, even with libraries that would, at first glance, appear far better suited to assembly.
Though error detection and correction methodologies worked exceptionally well for both de Bruijn assemblers, the maximum contig length and other important metrics degraded rapidly as library coverage increased. As expected, the hybrid de Bruijn/string graph approach was not as vulnerable to these same variations, but it had its own shortcomings. Its minimum coverage threshold for reasonable assembly was higher than that of the pure de Bruijn approaches; additionally, its incidence of misassembled contigs was much higher. The analysis performed in this work provides useful and practical insights into the behaviors of genome assemblers, which can both ease assembly tuning and expedite the process of choosing appropriate data sets for future research.
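
Coverage depth is one of the library variables the abstract refers to. The sketch below is a minimal illustration and is not taken from the thesis. It uses the standard average-coverage relation C = N * L / G (number of reads times read length, divided by genome size). The function names and the example numbers are hypothetical.

```python
import math


def coverage_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage C = N * L / G for a library of fixed-length reads."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_for_coverage(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose average coverage reaches the target."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    genome_size = 5_000_000
    print(coverage_depth(1_000_000, 100, genome_size))
    print(reads_for_coverage(30, 150, genome_size))
```

For paired-end libraries, each mate counts as a separate read in this calculation, so a library of N pairs contributes 2N reads.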