Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization

dc.contributor.advisor: Long, Quan
dc.contributor.advisor: Yang, Guang
dc.contributor.author: Li, Minghao
dc.contributor.committeemember: De Koning, A. P. Jason
dc.contributor.committeemember: Rancourt, Derrick E.
dc.date: 2020-06
dc.date.accessioned: 2020-04-07T15:02:00Z
dc.date.available: 2020-04-07T15:02:00Z
dc.date.issued: 2020-04-03
dc.description.abstract: Certain methods for genomic analysis do not take advantage of the full extent of available biological context to address limited sample sizes. Another noted issue is the gulf in bioinformatics software performance between tool authors and end-users. This project explores two cases where genomic applications can be augmented: power estimation for studies of low-prevalence conditions, and haplotype reconstruction. Power is a key statistic for predicting the success of genomic sequencing projects. Low-prevalence conditions are not amenable to existing power estimation frameworks, which are designed for common conditions, and consequently appear underpowered. SimPEL is a tool for simulation-based power estimation for sequencing studies of low-prevalence conditions. It addresses an unmet need in the field and augments power estimation by incorporating genomic aspects of low-prevalence conditions that existing frameworks leave unused. Elements of a low-prevalence condition study are input into SimPEL, and a simulated cohort is used to calculate the likelihood of identifying the true causal gene(s). SimPEL demonstrates competitive performance on single causal gene conditions and viable performance in instances of heterogeneity. PoolHapX is a haplotype reconstruction tool capable of reconstructing haplotypes and their corresponding frequencies from mixed populations. Its extensive parameter set is laborious to optimize by hand when applied to unknown use cases. Pattern recognition algorithms allow this parameter tuning process to be delegated to machine learning. SLiM, an evolutionary simulation framework, provides the biological basis for the haplotype reconstruction task. Stochastic simulation of PoolHapX parameter values within a defined space generates the large-scale datasets required for supervised learning. Mapping genomic sequencing features to an optimal PoolHapX parameter set is a multi-task learning problem. A novel two-model scaffold has been designed to address this. A gradient boosted decision tree model, mapping PoolHapX parameter sets to a quantitative performance metric, is nested as the cost function of a multi-head feedforward neural network, which in turn takes an input set of summary statistics from aligned genomic data and outputs PoolHapX parameters. Hyperparameter tuning is performed with Bayesian optimization techniques. This workflow and framework can be extended in parallel to any future PoolHapX extension. [en_US]
dc.identifier.citation: Li, M. (2020). Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. [en_US]
dc.identifier.doi: http://dx.doi.org/10.11575/PRISM/37669
dc.identifier.uri: http://hdl.handle.net/1880/111782
dc.language.iso: eng [en_US]
dc.publisher.faculty: Cumming School of Medicine [en_US]
dc.publisher.institution: University of Calgary [en]
dc.rights: University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Deep Learning [en_US]
dc.subject: Simulation [en_US]
dc.subject: Parameter Optimization [en_US]
dc.subject: Power Estimation [en_US]
dc.subject.classification: Bioinformatics [en_US]
dc.title: Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization [en_US]
dc.type: master thesis [en_US]
thesis.degree.discipline: Medicine – Biochemistry and Molecular Biology [en_US]
thesis.degree.grantor: University of Calgary [en_US]
thesis.degree.name: Master of Science (MSc) [en_US]
ucalgary.item.requestcopy: true [en_US]
Files
Original bundle
Name: ucalgary_2020_li_minghao.pdf
Size: 62.67 MB
Format: Adobe Portable Document Format
Description: ucalgary_2020_li_minghao.pdf
License bundle
Name: license.txt
Size: 2.62 KB
Format: Item-specific license agreed upon to submission