This Is Auburn Electronic Theses and Dissertations

Novel Machine Learning Algorithms for Analyzing Large-scale Genomic and Genetic Data

Date

2020-12-01

Author

Wang, Ye

Type of Degree

PhD Dissertation

Department

Computer Science and Software Engineering

Restriction Status

EMBARGOED

Restriction Type

Full

Date Available

2022-05-31

Abstract

With the advancement of next-generation sequencing technology, numerous disease/phenotypic associations with the human microbiome and human genome have been uncovered. In this dissertation, we explore these association patterns using machine learning methods. We first design a deep learning method, MDeep, for microbiome-based prediction that considers both taxon abundance and the phylogenetic tree. MDeep models the taxonomic ranks with convolutional layers and captures the phylogenetic correlation at each taxonomic rank via the convolution operation. Our simulations and real data analyses demonstrate that MDeep outperforms competing methods in both regression and binary classification. To explore disease/phenotypic associations with the human genome, we propose two machine learning frameworks. The first framework, WEVar, is a supervised learning framework that integrates pre-computed scores from representative existing scoring methods; it benefits from each individual method by automatically learning the relative contribution of each and produces an ensemble score for the final prediction. Using simulation and real data studies, we show that both context-free WEVar and context-dependent WEVar outperform the individual scoring methods on state-of-the-art benchmark datasets. Furthermore, we find that WEVar can prioritize experimentally validated non-coding variants in a linkage disequilibrium (LD) block. The second framework, DeepMFIVar, is a deep multimodal learning framework for the functional interpretation of genetic variants. DeepMFIVar learns a predictive model linking DNA sequence context and clinical information to quantitative epigenetic signals. The mutation effects of 210 million genetic variants are computed as the difference between the predicted epigenetic signals for the reference and alternative alleles.
The application to DNA methylation and histone modification demonstrates that DeepMFIVar can accurately predict locus-specific epigenetic signals using DNA sequence and clinical information, and it is also capable of prioritizing variants for downstream experiments.
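
The variant-scoring idea described in the abstract, taking the difference between a model's predicted signal for the reference and alternative alleles, can be illustrated with a minimal sketch. This is not the DeepMFIVar model: `toy_predict` is a hypothetical linear stand-in for a trained predictor, clinical information is omitted, the weights are random, and the sign convention (alternative minus reference) is an assumption, since the abstract does not specify it.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = [BASES.index(b) for b in seq]
    mat = np.zeros((len(seq), 4))
    mat[np.arange(len(seq)), idx] = 1.0
    return mat

def toy_predict(seq, weights):
    """Hypothetical stand-in for a trained model: a linear score
    over the one-hot encoded sequence (not the dissertation's network)."""
    return float(np.sum(one_hot(seq) * weights))

def variant_effect(ref_seq, pos, alt_base, weights):
    """Score a single-nucleotide variant as the predicted signal for the
    alternative allele minus that for the reference (assumed sign convention)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return toy_predict(alt_seq, weights) - toy_predict(ref_seq, weights)

# Illustrative inputs only: a made-up sequence and random weights.
rng = np.random.default_rng(0)
ref = "ACGTACGTAC"
weights = rng.normal(size=(len(ref), 4))

effect = variant_effect(ref, pos=4, alt_base="G", weights=weights)
print(effect)
```

With a linear stand-in the effect reduces to the weight difference at the mutated position; a deep model would instead capture how the surrounding sequence context changes the prediction.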