Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization

Li, Minghao

Files

ucalgary_2020_li_minghao.pdf (62.67 MB)

Date

2020-04-03

Authors

Li, Minghao

Abstract

Certain methods for genomic analysis do not take full advantage of the available biological context to compensate for limited sample sizes. Another noted issue is the gap in bioinformatics software performance between tool authors and end-users. This project explores two cases where genomic applications can be augmented: power estimation for studies of low-prevalence conditions, and haplotype reconstruction. Power is a key statistic for predicting the success of genomic sequencing projects. Existing power estimation frameworks are designed for common conditions; low-prevalence conditions are not amenable to them and consequently appear underpowered. SimPEL is a tool for simulation-based power estimation for sequencing studies of low-prevalence conditions. It addresses an unmet need in the field and augments power estimation by incorporating previously unused genomic aspects of low-prevalence conditions. Elements of a low-prevalence condition study are input into SimPEL, and a simulated cohort is used to calculate the likelihood of identifying the true causal gene(s). SimPEL demonstrates competitive performance on single causal gene conditions and viable performance in instances of heterogeneity. PoolHapX is a haplotype reconstruction tool capable of reconstructing haplotypes and their corresponding frequencies from mixed populations. Its extensive parameter set is laborious to optimize by hand when applied to unfamiliar use cases. Pattern recognition algorithms allow this parameter tuning process to be delegated to machine learning. SLiM, an evolutionary simulation framework, provides the biological basis for the haplotype reconstruction task. Stochastic simulation of PoolHapX parameter values within a defined space generates the large-scale datasets required for supervised learning. Mapping genomic sequencing features to an optimal PoolHapX parameter set is a multi-task learning problem. A novel two-model scaffold has been designed to address this. A gradient boosted decision tree model, mapping PoolHapX parameter sets to a quantitative performance metric, is nested as the cost function of a multi-head feedforward neural network, which in turn takes as input a set of summary statistics from aligned genomic data and outputs PoolHapX parameters. Hyperparameter tuning is performed with Bayesian optimization techniques. This workflow and framework can be extended in parallel to any future PoolHapX extension.
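As an illustration of the simulation-based power estimation idea described in the abstract, the following is a minimal sketch and not SimPEL's actual model. It assumes a toy setting in which each gene carries a background qualifying variant with a fixed probability per case, the causal gene (index 0) additionally carries a causal variant in a fixed fraction of cases, and a simulated cohort counts as a success only when the causal gene is hit in strictly more cases than any other gene. All function names, parameters, and default values are hypothetical.

```python
import random


def simulate_cohort(n_cases, n_genes, background_rate, causal_fraction, rng):
    """Return, per gene, the number of cases carrying a qualifying variant.

    Gene 0 is the (hypothetical) causal gene. This is a toy model, not SimPEL's.
    """
    counts = [0] * n_genes
    for _ in range(n_cases):
        for gene in range(n_genes):
            # Background variants occur independently in every gene.
            hit = rng.random() < background_rate
            # The causal gene also carries the causal variant in some cases.
            if gene == 0 and rng.random() < causal_fraction:
                hit = True
            if hit:
                counts[gene] += 1
    return counts


def estimate_power(n_cases, n_genes=200, background_rate=0.05,
                   causal_fraction=0.8, n_sims=500, seed=0):
    """Estimate power as the fraction of simulated cohorts in which the
    causal gene strictly outranks every other gene by case count."""
    rng = random.Random(seed)  # seeded for reproducibility
    successes = 0
    for _ in range(n_sims):
        counts = simulate_cohort(n_cases, n_genes, background_rate,
                                 causal_fraction, rng)
        if counts[0] > max(counts[1:]):
            successes += 1
    return successes / n_sims


if __name__ == "__main__":
    # Power as a function of cohort size under the toy assumptions above.
    for n in (3, 5, 10, 20):
        print(n, estimate_power(n))
```

Under these assumptions, the estimate is a simple Monte Carlo proportion, so its precision depends on `n_sims`; a real study design would replace the toy variant model with gene-specific rates and a more careful ranking rule.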

Keywords

Machine Learning, Deep Learning, Simulation, Parameter Optimization, Power Estimation

Citation

Li, M. (2020). Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.

URI

http://hdl.handle.net/1880/111782

Collections

Open Theses and Dissertations
