Developing Statistical Models For Multi-Omics Data Integration And Data Mining To Reveal Genetic Basis Underlying Diseases
dc.contributor.advisor | Long, Quan | |
dc.contributor.advisor | Yan, Jun | |
dc.contributor.author | Li, Qing | |
dc.contributor.committeemember | Yang, Guang | |
dc.contributor.committeemember | Bousman, Chad | |
dc.date | 2023-11 | |
dc.date.accessioned | 2023-10-10T18:36:07Z | |
dc.date.available | 2023-10-10T18:36:07Z | |
dc.date.issued | 2023-10-03 | |
dc.description.abstract | To help uncover the genetic basis of complex diseases, many researchers are generating multi-scale -omics data (such as genomes, transcriptomes, and proteomes) for joint analysis. However, although deep sequencing can yield massive molecular information from a single individual, the sample size (number of individuals) of a particular study is usually small. Many researchers therefore organize large consortia to aggregate data into larger biobanks that researchers worldwide can reuse. In parallel with these efforts to increase sample size, in this thesis I developed advanced models that integrate domain knowledge with modern machine learning (ML) techniques to further biological discovery from high-dimensional data of moderate sample size. The core innovation of my thesis is to improve feature selection in statistical learning by leveraging prior biological knowledge. Centered on the general theme of knowledge-directed feature selection, my thesis contributes four novel developments. In my first project, I developed the Interaction-integrated Linear Mixed Model (ILMM), which integrates three-dimensional (3D) genomic interaction information to pre-select genetic regions for the linear mixed model. This tool avoids the astronomical number of combinations usually encountered when searching for interactions genome-wide. We showed that ILMM is more powerful than established models and discovered a distal regulation mechanism underlying autism. In my second project, I developed eXplainable Autoencoder for Critical genes (XA4C), which carries out gene selection from a unique angle: a gene’s ability to interpret the hidden dimensions learned by an autoencoder from gene expression data. This work coined the term “critical gene”, which is demonstrated to be more disease-relevant than conventional categories such as differentially expressed genes and hub genes in expression analysis.
In my third project, building on a state-of-the-art massive machine learning model that integrates 5,313 human epigenetic and transcriptomic tracks of functional-omics data, I developed a transfer learning framework to re-task the general comprehension model towards breast cancer. This framework enables effective feature selection for improved downstream analyses, such as association mapping, as we demonstrated using breast cancer GWAS data. In my fourth project, which focuses on in-depth data analysis rather than tool building, I integrated expression and protein data in a coherent fine-mapping framework to select candidate proteins that play an important role in disease pathogenesis, discovering 176 proteins for six cancers. These discoveries are valuable for understanding cancers and for drug development. In summary, the work in this thesis delivers ML tools that integrate prior knowledge into feature selection to further biological discovery and provides additional insights into the genes and proteins underlying complex diseases. | |
dc.identifier.citation | Li, Q. (2023). Developing statistical models for multi-omics data integration and data mining to reveal genetic basis underlying diseases (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. | |
dc.identifier.uri | https://hdl.handle.net/1880/117345 | |
dc.identifier.uri | https://doi.org/10.11575/PRISM/42188 | |
dc.language.iso | en | |
dc.publisher.faculty | Graduate Studies | |
dc.publisher.institution | University of Calgary | |
dc.rights | University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. | |
dc.subject | Multi-omics | |
dc.subject | Machine learning | |
dc.subject | Statistical models | |
dc.subject | Data mining | |
dc.subject | Genetic basis of diseases | |
dc.subject.classification | Education--Sciences | |
dc.title | Developing Statistical Models For Multi-Omics Data Integration And Data Mining To Reveal Genetic Basis Underlying Diseases | |
dc.type | doctoral thesis | |
thesis.degree.discipline | Medicine – Biochemistry and Molecular Biology | |
thesis.degree.grantor | University of Calgary | |
thesis.degree.name | Doctor of Philosophy (PhD) | |
ucalgary.thesis.accesssetbystudent | I require a thesis withhold – I need to delay the release of my thesis due to a patent application, and other reasons outlined in the link above. I have/will need to submit a thesis withhold application. |