Browsing by Author "Zhang, Qingrun"
Now showing 1 - 11 of 11
Item Open Access
Bayesian Variable Selection Model with Semicontinuous Response (2022-01-14)
Babatunde, Samuel; Chekouo, Thierry; Sajobi, Tolulope; Zhang, Qingrun; Deardon, Robert; Bezdek, Karoly
We propose a novel Bayesian variable selection approach that identifies a set of features associated with a semicontinuous response. We use a two-part model in which one part is a logit model that estimates the probability of a zero response, while the other is a log-normal model for responses greater than zero (positive values). A Stochastic Search Variable Selection (SSVS) procedure is used to randomly sample the indicator variables for variable selection, which in turn searches the space of feature subsets and identifies the most promising features in the model. For the logistic model, a data augmentation approach is used to sample from the posterior density. We impose a spike-and-slab prior on the regression effects, where the unselected covariates take on a prior mass at zero while the selected covariates follow a normal distribution (including the intercept and clinical covariates). Since the joint posterior density has no closed form, we employ Markov Chain Monte Carlo (MCMC) techniques to sample from the posterior distribution. Simulation studies are used to assess the performance of the proposed method. We computed the average area under the receiver operating characteristic curve (AUC) to assess variable selection and compared it with competing methods. We also assessed the convergence of our MCMC algorithm by computing the potential scale reduction factor and correlations between the marginal posterior probabilities. Finally, we apply our method to coronary artery disease (CAD) data, where the aim is to select important genes associated with the CAD index.
These data consist of clinical covariates and gene expressions.

Item Open Access
Characterization of Stability of Non-Negative Matrix Factorization Models: An Application to Single-Cell Data (2023-08-21)
Liu, Alexander EJ; Zhang, Qingrun; Wu, Jingjing; Xu, Yuan; Zhang, Qingrun
Non-negative matrix factorization (NMF) is a powerful machine learning technique used in mathematics, computer science, and data science. This technique has applications in a wide range of fields including recommender systems, image processing, signal processing, machine learning and genetics. Recently, NMF has gained popularity in the analysis of single-cell gene expression data to identify cell types and gene expression patterns. In this thesis, we have studied NMF, its rank estimation, classification, and stability using both simulated data and real single-cell gene expression data. We have designed two simulated data sets with desired features and tested two seeding methods, eight NMF algorithms and five rank estimation criteria. Additionally, a real single-cell gene expression data set has been used to further characterize the NMF algorithms. We have also investigated the stability of NMF, first with respect to sample size and then with respect to initialization. The detailed conditions revealed by this thesis may have practical impact in guiding the appropriate use of NMF in analyzing single-cell gene expression data.

Item Open Access
Distributionally robust binary classifier under Wasserstein distance (2024-09-08)
Huang, Qian; Wu, Jingjing; Zhang, Qingrun; Liao, Wenyuan; Swishchuk, Anatoliy
The robustification of statistical models has been a popular topic for decades. Statistical robustification and robust optimization are the two main approaches in the literature, where the former stabilizes the model output by removing the outlier points while the latter pays more attention to the outlier points in order to make conservative decisions.
This thesis develops a novel robust optimization perspective to robustify a class of binary classifiers. Our model considers the worst-case distribution within a pre-determined uncertainty ball centered at the given benchmark distribution, with the radius measured in the Wasserstein distance. We derive a tractable formulation for the general problem. When focusing on the support vector machine (SVM), the general problem boils down to an easy-to-solve second-order cone programming problem. The robustified SVM is then applied to synthetic data with and without contamination, and our simulation studies show that our robustified SVM model can outperform the classical SVM and the extreme empirical loss SVM models under many circumstances.

Item Open Access
Explainable Autoencoder Deciphering Key Pathways Underlying Cancer Expression Patterns (2021-09)
Yu, Yang; Liao, Wenyuan; Zhang, Qingrun; Thierry Chekouo, Tekougang; Xuewen, Lu
Modern machine learning methods have been extensively utilized in gene expression data analysis. In particular, autoencoders (AE) have been employed in processing noisy and heterogeneous RNA-Seq data. However, AEs usually lead to “black-box” hidden variables that are difficult to interpret, hindering downstream experimental validations and clinical translation. To bridge the gap between complicated models and the biological interpretations, we developed a tool, XAE4Exp (eXplainable AutoEncoder for Expression data), which integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI). It quantitatively evaluates the contributions of each gene to the hidden structure learned by an AE, substantially improving the expandability of AE outcomes. By applying XAE4Exp to The Cancer Genome Atlas (TCGA) breast cancer gene expression data, we revealed intriguing pathways underlying breast cancer, including cell damage management, cell cycle, and immune-system-related pathways.
This tool will enable researchers and practitioners to analyze high-dimensional expression data intuitively, paving the way towards broader uses of deep learning.

Item Open Access
Frechet Localization of Commutative Algebras (2024-08-26)
Ahmed, Saleh; Bitoun, Thomas; Hamilton, Ryan; Bitoun, Thomas; Nguyen, Dang Khoa; Zhang, Qingrun
Let k be a complete non-Archimedean field with non-trivial norm. In this thesis, we aim to lay the groundwork for studying Frechet completions (completions with respect to all submultiplicative semi-norms) of localizations of normed commutative k-algebras by investigating the possible semi-norms on these localizations. We focus on a particular family of semi-norms known as the weighted Gauss semi-norms. Notably, every semi-norm on the localization that extends the base norm is bounded above by a weighted Gauss semi-norm that also extends the base norm and realizes its associated weight. Consequently, the study of Frechet completions on localizations can be effectively pursued by focusing on this dominating class of weighted Gauss semi-norms, which is the primary goal of this work.

Item Open Access
Frequentist, Bayesian and Resampling Estimation of Extremes Based on the Generalized Extreme Value Distribution (2024-09-04)
Xue, Yutong; Chen, Gemai; Shen, Hua; Lu, Xuewen; Zhang, Qingrun
Extreme events occur in science, engineering, finance and many related fields. The generalized extreme value (GEV) distribution is often used to model extreme events. In this thesis, we study the estimation of GEV-related parameters and events using three different approaches. The maximum likelihood approach is a frequentist approach, which has a fully developed theory for both estimation and inference, subject to the existence of maximum likelihood estimators and of the expected and/or observed information matrix.
The Bayesian approach starts with the likelihood function, chooses appropriate prior distributions for the GEV distribution parameters, and works with the posterior distribution of the parameters for estimation and inference. The resampling approach may or may not use the likelihood function to estimate the GEV parameters, and inference is based on the variations generated from resampling the observed data directly or indirectly and repeating the estimation procedure. All three approaches are well known in the literature. The main contribution of this thesis is that, to the best of our knowledge, the three approaches are studied and compared under the same setup for the first time. Based on extensive comparisons and the criteria used, we recommend to practitioners the parametric resampling approach based on empirical distribution function (EDF) estimation, with percentile confidence intervals. The use of the maximum likelihood, Bayesian, and resampling approaches is illustrated through a case study.

Item Open Access
Minimum Profile Hellinger Distance Estimation for Semiparametric Simple Linear Regression Model (2021-01-06)
Li, Jiang; Wu, Jingjing; Li, Haocheng; Wu, Jingjing; Li, Haocheng; Lu, Xuewen; Zhang, Qingrun
The simple linear regression model is essential for analyzing the relation between a response variable and a covariate, and its importance for the statistical analysis of data is well documented. This thesis focuses on the semiparametric simple linear regression model where the distribution of the error term is assumed symmetric but otherwise completely unspecified. Under this model, we constructed a robust estimator of the regression coefficient parameters using the minimum Hellinger distance technique. Minimum Hellinger Distance Estimation (MHDE) was first introduced by Beran (1977) for fully parametric models and has been shown to have good efficiency and robustness properties.
In the past decade, the MHDE has been extended to semiparametric models. Furthermore, Wu and Karunamuni (2015) introduced the Minimum Profile Hellinger Distance Estimation (MPHDE) for semiparametric models, which retains the good efficiency and robustness properties of MHDE in parametric models. In this thesis, we constructed an MPHDE for the semiparametric simple linear regression model. We established in theory the consistency of the proposed MPHDE. Finite-sample performance of the proposed estimator was examined via simulation studies and real data applications. Our numerical results showed that the proposed MPHDE has good efficiency and simultaneously is very robust against outlying observations.

Item Open Access
Novel Spatio-Temporal Models with Applications in Wind Forecasting (2024-08-20)
Jia, Tianxia; Sezer, Deniz; Wood, David; Lu, Xuewen; Zhang, Qingrun; Pietroniro, Alain; Braun, John
This research asserts the benefits of incorporating atmospheric regimes from large-scale reanalysis datasets and accounting for regime- and region-specific prevailing winds in covariance models for accurate short-term wind forecasts at multiple weather stations. Extending from classic time series models, regime-switching autoregressive and vector autoregressive models, alongside their mixture counterparts, are utilized first to model short-term wind speed up to 6 hours ahead at 23 weather stations across Alberta. The results underscore the advantages of simultaneous modelling of multiple locations and the integration of atmospheric information for short-term wind forecasting. Expanding our scope, we employ spatio-temporal covariance models to model wind speed at 131 weather stations in Alberta. Specifically, the Gneiting class is adopted for capturing the fully symmetrical features of the empirical correlation. To address the underfitting concerns of this model, theoretical foundations are laid for relaxing constraints on the interaction parameter via a discrete spatial grid.
Moreover, to account for both prevailing wind speed and direction, we propose a novel form of Lagrangian covariance function and prove its validity under any finite-dimensional Euclidean space. Furthermore, we propose a regime-switching covariance model to enable the prevailing winds in the Lagrangian covariance function to vary by regime. This model is essentially a p-th order Markov chain Gaussian field with the Markov property held in the time domain. We investigate its limiting behaviour as well as its convergence rate and present a parameter estimation method. The superior performance of the proposed models is observed for both observed and unobserved weather stations, highlighting their utility for future wind farm site planning. The thesis concludes by exploring options for allowing regime-specific prevailing winds to vary by region, resulting in region-specific prevailing winds under each regime. This approach is motivated by the spatially varying benefits of modelling prevailing winds. New methods are proposed for estimating and incorporating regime-dependent prevailing winds into regime-switching covariance models, resulting in improved predictive performance for forecasting hourly wind speed at 142 weather stations in Alberta.

Item Open Access
Novel stabilized models to characterize gene-gene interactions by utilizing transcriptome data (2022-09-28)
Kossinna, Thalagala Kossinnage Pathum Subhashana; Long, Quan; Zhang, Qingrun; Arnold, Paul Daniel; De Leon, Alexander
Machine learning models employed in genetics often grapple with issues related to the "curse of dimensionality". Furthermore, due to the inherently noisy nature of most -omics data, most methods suffer from the problem of "stability": i.e., even slight perturbations of the original data may result in wholly different outcomes. This becomes particularly true when dealing with interactions, as the number of potential interactions is usually astronomical.
In this thesis, we present two novel methods, 1) Stabilized COre gene and Pathway Election (SCOPE) and 2) Interaction Bridged Association Study (IBAS), which take two different approaches to analyzing biological interactions. SCOPE employs a stabilized form of the LASSO that is better able to handle highly correlated expression data and a co-expression network analysis that identifies "core" genes that may be of interest as well as the underlying biological pathways or mechanisms by which they interact. Stabilizing these results across six cancers of The Cancer Genome Atlas uncovered hallmark cancer pathways as well as a novel potential therapeutic target of kidney cancer, CD63. IBAS utilizes a "data-bridge" composed of dimensionality-reduced pathway-level interactions of the transcriptome to identify genes associated with a phenotype of interest using the Sequence Kernel Association Test (SKAT), in a disentangled form of the Transcriptome Wide Association Study. Application to the Wellcome Trust Case Control Consortium reveals novel gene candidates, with literature reviews highlighting their potential for further study. In conclusion, we have developed two novel methodologies for analyzing complex interaction patterns in -omics data using stabilized machine learning methods, paving the way to further understand the biological interactions underlying complex disease.

Item Open Access
Parallelization of Bayesian Phylogenetics to Greatly Improve Run Times (2024-03-24)
Yang, David; Zhang, Qingrun; Gordon, Paul; Liao, Wenyuan; van der Meer, Franciscus Johannes
Phylogenetic analyses are invaluable to understanding the transmission of viruses, especially during disease outbreaks. In particular, Bayesian phylogenetics has great potential in modeling viral transmission due to the numerous phylogenetic models that can be incorporated. Currently, the availability of user-friendly software and accessibility of sequence data make phylogenetic analyses easy to perform.
However, to date, Bayesian phylogenetic analyses are still limited by long computational run times, which are especially unfavorable during ongoing and evolving disease outbreaks that demand real-time phylogeny results. Current optimization methods for Bayesian phylogenetic analysis mainly focus on iteration-level parallelization and mostly overlook the potential of larger-scale parallelization approaches. In this thesis, we provide an in-depth overview of topics including phylogenetic analysis, relevant biological information, and phylogenetic analysis optimization methods. We also propose a novel parallelized Markov Chain Monte Carlo method that greatly improves Bayesian phylogenetic run times, and we integrate the approach into a data pipeline to allow for the direct analysis of viral samples. We demonstrated the validity of our methods by performing phylogenetic analyses on two sets of HIV simulation data and one set of real-world SARS-CoV-2 data. Our results suggest that the parallelization of MCMC in Bayesian phylogenetic analyses reduces run times 29-fold without causing significant deviations in parameter estimates and predicted phylogenetic trees.

Item Open Access
Utilizing statistical methods to discover genetic variants underlying disease traits using multi-omics data (2024-08-27)
Kahanda Liyanage, Rushani Nilakshika Kumari Perera; Zhang, Qingrun; Ji, Yunqi Jacob; Wang, Haixu
Identifying genetic variants statistically associated with specific diseases is the focus of Genome-Wide Association Studies (GWAS). Advancements in omics technologies have enabled the use of multi-omics data to bridge the gap between genotypes and their resulting phenotypes. Recently, various models have been proposed to utilize omics data for estimating polygenic terms. For example, the Image-Mediated Association Study (IMAS) leverages brain imaging data to conduct association mapping in legacy GWAS cohorts.
Meanwhile, the Expression-Directed Linear Mixed Model (EDLMM) incorporates expression data to identify low-effect genetic variants, demonstrating superior performance in terms of power and real data analysis outcomes. However, most current association studies focus on a single biological unit. In our work, we developed an Image Expression Directed Linear Mixed Model (IEDLMM), which estimates the polygenic term in a linear mixed model using informative weights learned from training genetically predictive models: a linear mixed model for brain images and a Bayesian Sparse Linear Mixed Model for gene expressions. Through simulations, we have shown that IEDLMM exhibits higher power than current methods while keeping the type-I error rates under control. By leveraging the UK Biobank image-derived phenotypes (IDPs) and GTEx gene expression data, the IEDLMM identified 15 unique genes related to brain disorders across four datasets, which are validated through DisGeNET functional annotations, demonstrating the efficacy of IEDLMM compared to existing methods. The creation of IEDLMM paves the way for additional exploration in the integration of multiple omics data within a single framework. This method not only improves the credibility of the results but also furthers our knowledge in the field, laying a foundation for future research efforts.