Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization
dc.contributor.advisor | Long, Quan | |
dc.contributor.advisor | Yang, Guang | |
dc.contributor.author | Li, Minghao | |
dc.contributor.committeemember | De Koning, A. P. Jason | |
dc.contributor.committeemember | Rancourt, Derrick E. | |
dc.date | 2020-06 | |
dc.date.accessioned | 2020-04-07T15:02:00Z | |
dc.date.available | 2020-04-07T15:02:00Z | |
dc.date.issued | 2020-04-03 | |
dc.description.abstract | Certain methods for genomic analyses do not take advantage of the full extent of available biological context to address limited sample sizes. Another noted issue is the gulf in bioinformatics software performance between tool authors and end-users. This project explores two cases where genomic applications can be augmented: power estimation for low-prevalence condition studies and haplotype reconstruction. Power is a key statistic for predicting the success of genomic sequencing projects. Studies of low-prevalence conditions are not amenable to existing power estimation frameworks, which were designed for common conditions, and consequently appear underpowered. SimPEL is a tool for simulation-based power estimation for sequencing studies of low-prevalence conditions. It addresses an unmet need in the field and augments power estimation through the inclusion of unused genomic aspects of low-prevalence conditions. Elements of low-prevalence condition studies are input into SimPEL and a simulated cohort is used to calculate the likelihood of identifying the true causal gene(s). SimPEL demonstrates competitive performance on single causal gene conditions and viable performance in instances of heterogeneity. PoolHapX is a haplotype reconstruction tool capable of reconstructing haplotypes and their corresponding frequencies from mixed populations. Its extensive parameter set is laborious to optimize by hand when applied to unknown use-cases. Pattern recognition algorithms allow this parameter tuning process to be delegated to machine learning. SLiM, an evolutionary simulation framework, provides the biological basis for the haplotype reconstruction task. Stochastic simulation of PoolHapX parameter values within a defined space generates the large-scale datasets required for supervised learning. Mapping genomic sequencing features to an optimal PoolHapX parameter set is a multi-task learning problem. 
A novel two-model scaffold has been designed to address this. A gradient boosted decision tree model, mapping PoolHapX parameter sets to a quantitative performance metric, is nested as the cost function of a multi-head feedforward neural network, which in turn takes an input set of summary statistics from aligned genomic data and outputs PoolHapX parameters. Hyperparameter tuning is enabled by Bayesian optimization techniques. This workflow and framework can be extended in parallel to any future PoolHapX extension. | en_US |
dc.identifier.citation | Li, M. (2020). Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. | en_US |
dc.identifier.doi | http://dx.doi.org/10.11575/PRISM/37669 | |
dc.identifier.uri | http://hdl.handle.net/1880/111782 | |
dc.language.iso | eng | en_US |
dc.publisher.faculty | Cumming School of Medicine | en_US |
dc.publisher.institution | University of Calgary | en |
dc.rights | University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Deep Learning | en_US |
dc.subject | Simulation | en_US |
dc.subject | Parameter Optimization | en_US |
dc.subject | Power Estimation | en_US |
dc.subject.classification | Bioinformatics | en_US |
dc.title | Augmenting Genomic Applications Through Simulation and Machine Learning-based Parameter Optimization | en_US |
dc.type | master thesis | en_US |
thesis.degree.discipline | Medicine – Biochemistry and Molecular Biology | en_US |
thesis.degree.grantor | University of Calgary | en_US |
thesis.degree.name | Master of Science (MSc) | en_US |
ucalgary.item.requestcopy | true | en_US |
Files
Original bundle
- Name:
- ucalgary_2020_li_minghao.pdf
- Size:
- 62.67 MB
- Format:
- Adobe Portable Document Format
- Description:
- ucalgary_2020_li_minghao.pdf
License bundle
- Name:
- license.txt
- Size:
- 2.62 KB
- Format:
- Item-specific license agreed to upon submission