Developing Statistical Models For Multi-Omics Data Integration And Data Mining To Reveal Genetic Basis Underlying Diseases

Li, Qing

dc.contributor.advisor	Long, Quan
dc.contributor.advisor	Yan, Jun
dc.contributor.author	Li, Qing
dc.contributor.committeemember	Yang, Guang
dc.contributor.committeemember	Bousman, Chad
dc.date	2023-11
dc.date.accessioned	2023-10-10T18:36:07Z
dc.date.available	2023-10-10T18:36:07Z
dc.date.issued	2023-10-03
dc.description.abstract	To help uncover the genetic basis of complex diseases, many researchers are generating multi-scale -omics data (such as genomes, transcriptomes, and proteomes) for joint analysis. However, although sequencing is deep, so that the molecular information from a single individual can be massive, the sample size (number of individuals) in any particular study is usually small. As a result, many researchers organize large consortia to aggregate data into larger biobanks that researchers worldwide can reuse. In parallel with these efforts to increase sample size, in this thesis I developed advanced models that integrate domain knowledge with modern machine learning (ML) techniques to further biological discovery from high-dimensional data of moderate sample size. The core innovation of my thesis is to improve feature selection in statistical learning by leveraging prior biological knowledge. Centered on the general theme of knowledge-directed feature selection, my thesis contributes four novel developments. In my first project, I developed the Interaction-integrated Linear Mixed Model (ILMM), which integrates three-dimensional (3D) genomic interaction information to pre-select genetic regions for the linear mixed model. This tool avoids the astronomical number of combinations usually encountered when searching for interactions genome-wide. We showed that ILMM is more powerful than established models and discovered a distal regulation mechanism underlying autism. In my second project, I developed the eXplainable Autoencoder for Critical genes (XA4C), which carries out gene selection from a unique angle: a gene’s ability to interpret the hidden dimensions learned by an autoencoder from gene expression data. This work coined the term “critical gene”, which is demonstrated to be more disease-relevant than conventional categories in expression analysis, such as differentially expressed genes and hub genes.
In my third project, building on a state-of-the-art large-scale machine learning model that integrates 5,313 human epigenetic and transcriptomic tracks of functional-omics data, I developed a transfer learning framework to re-task this general model towards breast cancer. The framework enables effective feature selection for improved downstream analysis, such as association mapping, as we demonstrated using breast cancer GWAS data. In my fourth project, which focuses on in-depth data analysis rather than tool building, I integrated expression and protein data in a coherent fine-mapping framework to select candidate proteins that play an important role in disease pathogenesis, discovering 176 proteins across six cancers. These discoveries are valuable for understanding cancers and for drug development. In summary, the work in this thesis delivers ML tools that integrate prior knowledge into feature selection to further biological discovery, and provides additional insights into the genes and proteins underlying complex diseases.
dc.identifier.citation	Li, Q. (2023). Developing statistical models for multi-omics data integration and data mining to reveal genetic basis underlying diseases (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.
dc.identifier.uri	https://hdl.handle.net/1880/117345
dc.identifier.uri	https://doi.org/10.11575/PRISM/42188
dc.language.iso	en
dc.publisher.faculty	Graduate Studies
dc.publisher.institution	University of Calgary
dc.rights	University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
dc.subject	Multi-omics
dc.subject	Machine learning
dc.subject	Statistical models
dc.subject	Data mining
dc.subject	Genetic basis of diseases
dc.subject.classification	Education--Sciences
dc.title	Developing Statistical Models For Multi-Omics Data Integration And Data Mining To Reveal Genetic Basis Underlying Diseases
dc.type	doctoral thesis
thesis.degree.discipline	Medicine – Biochemistry and Molecular Biology
thesis.degree.grantor	University of Calgary
thesis.degree.name	Doctor of Philosophy (PhD)
ucalgary.thesis.accesssetbystudent	I require a thesis withhold – I need to delay the release of my thesis due to a patent application, and other reasons outlined in the link above. I have/will need to submit a thesis withhold application.

Files

Original bundle

Name: ucalgary_2023_li_qing.pdf
Size: 5.62 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.64 KB
Format: Item-specific license agreed upon to submission
Collections

Restricted Theses and Dissertations