Developing Statistical Models For Multi-Omics Data Integration And Data Mining To Reveal Genetic Basis Underlying Diseases

Date
2023-10-03
Abstract
To help uncover the genetic basis of complex diseases, many researchers are generating multi-scale omics data (such as genomes, transcriptomes, and proteomes) for joint analysis. However, although sequencing is deep, so that the molecular information from a single individual can be massive, the sample size (number of individuals) in any particular study is usually small. For this reason, many researchers organize large consortia that aggregate data into larger biobanks for researchers worldwide to reuse. In parallel with these efforts to increase sample size, in this thesis I developed models that integrate domain knowledge with modern machine learning (ML) techniques to further biological discovery from high-dimensional data of moderate sample size. The core innovation of my thesis is to improve feature selection in statistical learning by leveraging prior biological knowledge. Centered on the general theme of knowledge-directed feature selection, my thesis contributes four novel developments.
In my first project, I developed the Interaction-integrated Linear Mixed Model (ILMM), which uses three-dimensional (3D) genomic interaction information to pre-select genetic regions for the linear mixed model. This tool avoids the astronomical number of combinations usually encountered when searching for interactions genome-wide. We showed that ILMM is more powerful than established models and discovered a distal regulation mechanism underlying autism.
In my second project, I developed eXplainable Autoencoder for Critical genes (XA4C), which approaches gene selection from a distinct angle: a gene’s ability to interpret the hidden dimensions learned by an autoencoder from gene expression data. This work coined the term “critical gene”; critical genes are demonstrated to be more disease-relevant than genes identified by conventional criteria in expression analysis, such as differentially expressed genes and hub genes.
In my third project, building on a state-of-the-art large-scale machine learning model that integrates 5,313 human epigenetic and transcriptomic tracks of functional omics data, I developed a transfer learning framework to re-task this general comprehension model towards breast cancer. This framework enables effective feature selection for improved downstream analyses, such as association mapping, as we demonstrated using breast cancer GWAS data.
In my fourth project, which focuses on in-depth data analysis rather than tool building, I integrated expression and protein data in a coherent fine-mapping framework to select candidate proteins that play an important role in disease pathogenesis, discovering 176 proteins for six cancers. These discoveries are valuable for understanding cancers and for drug development.
In summary, the work in this thesis delivers ML tools that integrate prior knowledge into feature selection to further biological discovery, and provides additional insights into the genes and proteins underlying complex diseases.
Keywords
Multi-omics, Machine learning, Statistical models, Data mining, Genetic basis of diseases
Citation
Li, Q. (2023). Developing statistical models for multi-omics data integration and data mining to reveal genetic basis underlying diseases (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.