Unveiling Variability in Rare Disease-Gene Association using Bioinformatics and AI
Abstract
Problem: Rare diseases affect millions of people worldwide, and patients often face prolonged uncertainty because of delayed diagnoses and limited therapeutic options. The complexity of genetic interactions, particularly the role of genetic modifiers, makes these conditions harder to understand and treat. Current methodologies are limited in their ability to identify genetic modifiers efficiently and accurately, which highlights the need for new computational tools and approaches.
Method: This research applies advances in computational biology and machine learning to take a data-centric approach to studying rare diseases. The study is structured around three primary aims: (1) automating the extraction of gene-phenotype associations from the OMIM database using natural language processing (NLP) techniques; (2) developing a user-friendly whole-genome sequencing (WGS) pipeline tailored to the detection of genetic modifiers; and (3) implementing a deep learning model to predict genetic modifiers from a candidate variant list generated by WGS data processing.
Solution: The first aim involved creating the Gene-Phenotype Association Discovery (GPAD) tool to automate text mining of the OMIM database, enabling tracking and trend analysis of rare disease-gene association discoveries. The second aim focused on developing Model Organism Modifier (MOM), an open-source workflow designed to streamline WGS data processing and variant identification, making advanced genomic analysis accessible to researchers with limited bioinformatics expertise. The third aim introduced Modifier Spy (ModSpy), a deep learning model designed to predict genetic modifiers by analyzing candidate genes from WGS data and detecting patterns indicative of modifier variants.
Result: The GPAD tool automated the extraction of gene-phenotype associations with ~96% accuracy, revealing trends in methodological approaches and highlighting the importance of model organisms in rare disease studies. The MOM pipeline simplified WGS data processing, making advanced genomic analysis more widely accessible. ModSpy predicted genetic modifiers with high accuracy (~98%), showing the potential of deep learning to uncover the genetic interactions that influence phenotypic expression.
Conclusion: This dissertation advances the field of rare disease genomics by integrating bioinformatics and artificial intelligence (AI) to improve the identification of genetic modifiers. The tools and methodologies developed in this research provide new avenues for accelerating rare disease studies, improving diagnostics, and supporting the development of targeted therapies. This work offers insights into the genetic basis of phenotypic variability and lays groundwork for future research on rare genetic disorders.