Browsing by Author "Deardon, Rob"
- Item. Open Access. A Bayesian Variable Selection Model for Semi-Continuous Response Using Gaussian Process (2023-09-06). Lipman, Danika; Chekouo, Thierry; Deardon, Rob; Wu, Jingjing; Lu, Xuewen; Safo, Sandra. To my knowledge, there is no statistical method that can perform Bayesian variable selection in a setting where there is a semi-continuous response with a non-linear relationship to predictor variables. I have developed a two-part model to accommodate a semi-continuous response that uses Gaussian processes to capture the non-linear relationship between input variables and outcomes. Bayesian variable selection is induced in both parts of the model through the construction of the kernel matrices. I have employed the Nyström approximation for kernel matrices to reduce the computational complexity that occurs when working with kernel matrices and large sample sizes. I perform simulation studies and determine that my method is competitive in prediction and variable selection with methods such as elastic net and with other methods that capture non-linearity, such as random forests and gradient boosted trees. In addition, I apply my method to a coronary artery disease (CAD) dataset from the Duke Database for Cardiovascular Disease (DDCD) to determine key gene expression features associated with the CAD index, a measure of CAD severity.
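The Nyström step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the RBF kernel, the uniform random choice of landmark points, the eigenvalue cutoff `tol`, and the function names `rbf_kernel` and `nystrom_approximation` are all assumptions made for illustration. The idea is to approximate the full n x n kernel matrix from an n x m block built on m landmark points.

```python
import numpy as np

def rbf_kernel(X, Y, length_scale=1.0):
    """Squared-exponential kernel between the rows of X and the rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def nystrom_approximation(X, m, length_scale=1.0, seed=0, tol=1e-10):
    """Approximate K(X, X) by K_nm K_mm^+ K_nm^T using m random landmarks.

    Landmarks are sampled uniformly without replacement (an assumption);
    the pseudo-inverse drops eigenvalues below tol times the largest one.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    K_nm = rbf_kernel(X, X[idx], length_scale)   # n x m block
    K_mm = K_nm[idx]                             # m x m landmark block
    w, V = np.linalg.eigh(K_mm)
    keep = w > tol * w.max()
    B = K_nm @ (V[:, keep] / np.sqrt(w[keep]))   # n x r factor, r <= m
    return B @ B.T                               # low-rank, positive semi-definite

# Simulated inputs (hypothetical data, not the CAD dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
K_full = rbf_kernel(X, X)
K_approx = nystrom_approximation(X, m=50)
rel_err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"relative Frobenius error with 50 of 200 landmarks: {rel_err:.4f}")
```

Working with the n x r factor `B` rather than the full matrix is what reduces the cost of downstream linear algebra when n is large.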
- Item. Open Access. Antimicrobial resistance: Prevalence, genetics and associations with antimicrobial use in food-producing animals (2020-07-27). Borin Nobrega, Diego; Barkema, Herman W.; De Buck, Jeroen M.; Deardon, Rob; Dufour, Simon; Saini, Vineet. Antimicrobial use (AMU) in livestock has come under growing criticism. There is increasing pressure to optimize AMU in food-producing animals, which will likely entail restrictions and voluntary reductions of their use, as well as implementation of protocols promoting antimicrobial stewardship. In this thesis, 1) methods were compared for obtaining AMU data on dairy farms, 2) factors associated with the prevalence of antimicrobial resistance (AMR) in non-aureus staphylococci (NAS) isolated from intramammary infections were studied, 3) treatment strategies for non-severe clinical mastitis (CM) in dairy cattle were contrasted, and 4) effects of restricted antimicrobial use in food-producing animals on the prevalence of AMR genes (ARGs) were evaluated. Chapter 2 confirmed that treatment records accurately quantified AMU in well-managed dairy herds. Yet, their widespread adoption into AMU surveillance cannot be recommended, due to an underestimation of AMU in herds with elevated bulk tank somatic cell count. In regard to AMR, Chapter 3 demonstrated that resistance against tetracycline, penicillin and erythromycin in NAS was common in Canadian dairy herds. In Chapter 4, factors associated with AMR were further explored. An association between AMR in NAS and AMU was present when penicillins, 3rd-generation cephalosporins or macrolides were administered systemically, whereas intramammary use of antimicrobials was not associated with AMR.
As antimicrobials classified as critically important antimicrobials (CIAs) for humans were associated with AMR, in Chapter 5 a systematic review was done to assess whether CIAs and non-CIAs had comparable efficacy to treat non-severe bovine CM caused by the most prevalent bacteria causing mastitis worldwide. No protocol including the use of CIAs had superior bacteriological cure rates of non-severe CM than protocols relying on non-CIAs. Therefore, no adverse effects in terms of animal health should be expected by ceasing use of CIAs for treating non-severe CM in dairy herds. A second systematic review showed that restricted AMU in food animals was associated with a lower presence of ARGs in bacteria isolated from animals and humans. Reducing use of CIAs to treat non-severe CM in typical dairy herds may reduce load of ARGs without significant impacts on animal health and welfare.
- Item. Open Access. Application of Epidemiology and Biostatistics to Malaria Diagnosis in Returning Travellers (2019-07-04). Cheaveau, James; Pillai, Dylan; Deardon, Rob; Gregson, Dan. Today, malaria elimination is back on the agenda but for this to be feasible, there must be a coordinated global effort utilizing all available tools. Portable, sensitive diagnostics with a low limit of detection are required to detect the malaria reservoir, and novel antimalarials are required to combat the threat of artemisinin resistance. Returning travellers are a good population in which to investigate malaria physiology and diagnostics because there is a good supply of study participants and an abundance of easily available data. In Canada, a combination of microscopy and rapid diagnostic tests is used to diagnose malaria, but these lack sensitivity and require repeated testing to rule out the condition. A prospective diagnostic trial of the illumigene Malaria loop-mediated isothermal amplification (LAMP) assay, manufactured by Meridian Bioscience, was conducted in symptomatic returning travellers between June 2017 and January 2018. After discrepant resolution with RT-PCR, LAMP had a sensitivity of 100% (95% CI: 95.8-100) and a specificity of 100% (95% CI: 98.7-100). In symptomatic returning travellers, LAMP has the potential to replace traditional malaria diagnostics, allowing for malaria to be ruled out in a timely manner. It is unclear if uncomplicated malaria causes deranged liver enzymes, which has implications for antimalarial drug development. A retrospective cohort study was conducted in returning travellers (n=4548) who underwent a malaria test and had liver enzymes measured within 31 days, from 2010 to 2017.
After adjusting for gender, age, and use of hepatotoxic medications, returning travellers testing positive for malaria had higher odds of having an abnormal TB [(OR: 12.64, 95% CI: 6.32 – 25.29), p<0.001] but not ALP [(OR: 0.32, 95% CI: 0.09 – 1.10), p=0.072], ALT [(OR: 1.01, 95% CI: 0.54 – 1.89), p=0.978] or AST [(OR: 1.26, 95% CI: 0.22 – 7.37), p=0.794], compared to those who tested negative. This is most likely to be due to haemolysis, which normalizes following treatment. LAMP can be used in the diagnosis of malaria in returning travellers, and it may have a role in malaria elimination. Uncomplicated malaria does not appear to cause raised aminotransferases in returning travellers, and consideration must be given to this in antimalarial drug development.
- Item. Open Access. Bi-level Variable Selection and Dimension-reduction Methods in Complex Lifetime Data Analytics (2019-12). Cai, Kaida; Lu, Xuewen; Shen, Hua; Tekougang, Thierry Chekouo; Deardon, Rob; Long, Quan; Jin, Zhezhen. For high-dimensional data, the number of covariates can be large and diverge with the sample size. In many scientific applications, such as biological studies, the predictors or covariates are naturally grouped. In this thesis, we consider bi-level variable selection and dimension-reduction methods in complex lifetime data analytics under various survival models, and study their theoretical properties and finite sample performance under different scenarios. Specifically, in Chapter 2, we focus on the Andersen-Gill regression model for the analysis of recurrent event data with group covariates when the number of covariates is fixed. In order to study the effects of the covariates on the occurrence of recurrent events, a bi-level penalized group selection method is introduced to address the group selection problem. A general group-bridge penalty function with varying weights is invoked to achieve the goal. It is shown that the performance of the bi-level selection depends on the weights. In order to select covariates more efficiently, especially for identifying the important covariates in important groups, adaptive weights are required. The asymptotic oracle properties of the proposed method are investigated in the case of a fixed number of covariates. Three methods of tuning parameter selection are proposed. Our simulation studies show that the proposed method performs well in selecting important groups and important individual covariates in these groups simultaneously, and outperforms other popular group selection methods and the traditional unpenalized Wald testing method. In Chapter 3, we extend the proposed method for the recurrent event model to the case of a diverging number of covariates.
We demonstrate that the proposed method has selection consistency and the penalized estimators have asymptotic normality in the case of a diverging number of covariates. Simulation studies show that the proposed method performs well and the results are consistent with the theoretical properties. We illustrate the method using a real life data set from medicine. In Chapter 4, by imitating the group variable selection procedure with bi-level penalty, we propose a new variable selection method for the analysis of multivariate failure time data, with an adaptive bi-level variable selection penalty function. In the regression setting, we treat the coefficients corresponding to the same prediction variable as a natural group, then consider variable selection at the group level and individual level simultaneously. The proposed adaptive bi-level variable selection method can select a prediction variable at two different levels: the first level is the group level, where the predictor is important to all failure types; the second level is the individual level, where the predictor is only important to some failure types. An algorithm based on cyclic coordinate descent (CCD) is proposed to carry out the proposed method. Based on the simulation results, our method outperforms the classical penalty methods, especially in terms of removing unimportant variables for all different failure types. We obtain the asymptotic oracle properties of the proposed variable selection method in the case of a diverging number of covariates. We construct a generalized cross validation (GCV) method for the tuning parameter selection and assess model performance based on model errors. We also illustrate the proposed method using a real life data set. Sufficient dimension reduction (SDR) is a powerful tool for dimension reduction in regression and classification problems, which replaces the original covariates with the minimal set of their linear combinations.
In Chapter 5, we propose a novel penalty function, called adaptive group composite Lasso (AGCL), for the group sparse sufficient dimension reduction problem. By incorporating this new penalty with the sufficient dimension reduction method, we propose an adaptive group composite Lasso penalized dimension reduction method to simultaneously achieve sufficient dimension reduction and group variable selection in the case of a diverging number of covariates. We investigate the asymptotic properties of the penalized sufficient dimension reduction estimators when the number of covariates diverges with the sample size. We show that the proposed method can select important groups and individual variables simultaneously. We compare the proposed method with other sparse sufficient dimension reduction methods using simulation studies. The results show that the proposed method outperforms the other methods in terms of removing unimportant covariates, especially in removing the unimportant groups. A real data example is used for illustration.
- Item. Open Access. Bias and Bias-Correction for Individual-Level Models of Infectious Disease (2020-01-30). Jafari, Behnaz; Deardon, Rob; Chekouo, Thierry T.; Kopciuk, Karen Arlene. Accurate infectious disease models can help scientists understand how an ongoing disease epidemic spreads and help forecast the course of epidemics more effectively (e.g. O'Neill, 2010; Jewell et al., 2009; Deardon et al., 2010). The main purpose of infectious disease modeling is to capture the main risk factors that affect the spread of a disease and make a prediction based on these factors. In real life, we do not generally have homogeneous and homogeneously mixing populations, and various factors affect the spread of a disease (e.g. geographical, social, domestic, and employment networks, genetic factors). Using individual-level models (ILMs) (Deardon et al., 2010) can help researchers to incorporate population heterogeneity. In these models inferences are made within a Bayesian Markov chain Monte Carlo (MCMC) framework (e.g. Gamerman and Lopes, 2006), obtaining posterior estimates of model parameters. However, parameter estimation and bias of estimates go hand in hand. The issue of bias of parameter estimates, and methods for bias correction, have been widely studied in the context of many of the most established and commonly used statistical models, and associated methods of parameter estimation. However, these methods are not directly applicable to individual-level infectious disease data. The focus of this thesis is to investigate circumstances in which ILM parameter estimates may be biased in some simple disease system scenarios. Further, we aim to find bias-corrected estimates of ILM parameters using simulation and compare them with the posterior estimates of the model parameter. We also discuss the factors that affect performance of these estimators.
- Item. Open Access. Big data and machine learning tools to understand mastitis epidemiology and other topics (2021-11). Naqvi, Syed Ali; Barkema, Herman W.; Deardon, Rob; Williamson, Tyler; Dufour, Simon. Increased availability of technologies to collect and store individual health data is leading to a growing interest in applying Big Data analytical methodologies to better understand health and disease in both humans and dairy cattle. Data collected through routine observations such as doctor or veterinary visits, milking equipment, or remote sensors can be successfully incorporated to monitor and manage individual and public health, and support operational decision-making on dairy farms. These sources of data also provide an invaluable resource in conducting epidemiological and health research, provided they are appropriately handled during the statistical analysis. In this thesis, 1) data from bacteriological sampling were combined with regularly collected dairy herd improvement (DHI) data to describe udder health in primiparous dairy cattle across Canada; 2) a systematic review and meta-analysis was conducted to synthesize all available research on the effectiveness of pre-calving therapies to improve udder health in primiparous dairy cattle; 3) a model was developed for the detection of clinical mastitis (CM) based on routinely collected data from automated milking systems (AMS); 4) a simulation study assessed the impact of unmeasured heterogeneity in secondary data collected from multiple dairy farms on the performance of a model trained to detect CM onset; 5) the immune fingerprints of children presenting with symptoms of appendicitis were compared by combining emergency department admissions data with results from a multiplex cytokine assay; and 6) dietary risk factors for immunological flare-ups in patients with Crohn’s disease were explored by combining patient-reported dietary records with results of a multiplex cytokine assay.
Chapter 2 demonstrated that the udder health in Canadian primiparous dairy cows was an issue that needed attention, and chapter 3 demonstrated that pre-calving treatments of different types can be effective at improving udder health in early lactation. Both chapters highlighted the need for routinely collected data to be combined with targeted data collection (monitoring of non-milking dairy cows, culture-based treatment selection) to facilitate targeted management for different parts of a dairy herd. In chapter 4, a deep recurrent neural network (RNN) model was used to detect the onset of CM using regularly collected data from AMS, and chapter 5 demonstrated that predictive performance of deep RNNs is robust to the unmeasured heterogeneity in data collected from multiple farms. Chapter 6 describes how immune response differs between children with abdominal pain symptomatic of appendicitis and provides evidence that data from a multiplex immunoassay conducted on admission may be used to effectively predict disease outcomes. In chapter 7, a similar multiplex immunoassay is used to explore associations between inflammation and diet using food records from patients with Crohn’s disease and demonstrates some of the statistical challenges encountered when working with multiple outcomes and large numbers of explanatory variables.
- Item. Open Access. Climbing the mountain: experimental design for the efficient optimization of stem cell bioprocessing (2017-12-04). Toms, Derek; Deardon, Rob; Ungrin, Mark. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” – R.A. Fisher. While this idea is relevant across research scales, its importance becomes critical when dealing with the inherently large, complex and expensive process of preparing material for cell-based therapies (CBTs). Effective and economically viable CBTs will depend on the establishment of optimized protocols for the production of the necessary cell types. Our ability to do this will depend in turn on the capacity to efficiently search through a multi-dimensional problem space of possible protocols in a timely and cost-effective manner. In this review we discuss approaches to, and illustrate examples of, the application of statistical design of experiments to stem cell bioprocess optimization.
- Item. Open Access. Covariate Balancing Using Statistical Learning Methods in the Presence of Missingness in Confounders (2019-09-20). Mason, Levi James; Shen, Hua; Chekouo, Thierry T.; Deardon, Rob. In observational studies researchers do not have control over treatment assignment. A consequence of such studies is that an imbalance in observed covariates between the treatment and control groups possibly exists. This imbalance can arise due to the fact that treatment assignment is frequently influenced by observed covariates (Austin, 2011a). As a result, directly comparing the outcomes between these two groups could lead to a biased estimation of the treatment effect (d’Agostino, 1998). The propensity score, defined as the probability of treatment assignment conditional on observed covariates, can be used in matching, stratification, and weighting to balance the observed covariates between the treatment and control groups in order to more accurately estimate the treatment effect (Rosenbaum and Rubin, 1983). This study looked at using statistical learning techniques to estimate the propensity score. The techniques included in this study were: logistic regression, classification and regression trees, pruned classification and regression trees, bagging classification and regression trees, boosted classification and regression trees, and random forests. These estimated propensity scores were then used in linearized propensity score matching, stratification, and inverse probability of treatment weighting using stabilized weights to estimate the treatment effect. Comparisons among these methods were made in a simulation study setting. Both a binary and a continuous outcome were analyzed. In addition, a simulation was performed to assess the use of multiple imputation using predictive mean matching when a confounder had data missing at random.
Based on the results from the simulation studies, it was demonstrated that the most accurate treatment effect estimates came from inverse probability of treatment weighting using stabilized weights where the propensity scores were estimated by logistic regression, random forests, or bagging classification and regression trees. These results were then applied to a retrospective cohort data set with a missing confounder to determine the treatment effect of adjuvant radiation in individuals with breast cancer.
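As a rough illustration of the weighting approach described in the abstract above, the following sketch estimates a treatment effect with stabilized inverse probability of treatment weights, using a logistic-regression propensity score. It is a simplified stand-in, not the thesis's simulation design: the data-generating process, the true effect of 2.0, and the helper names `fit_logistic_propensity` and `stabilized_weights` are hypothetical.

```python
import numpy as np

def fit_logistic_propensity(X, t, n_iter=25):
    """Estimate P(T = 1 | X) by logistic regression fitted with Newton's method."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (t - p)
        hess = Xd.T @ (Xd * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def stabilized_weights(t, ps):
    """Stabilized weights: P(T=1)/ps for treated, P(T=0)/(1-ps) for controls."""
    p_treat = t.mean()
    return np.where(t == 1, p_treat / ps, (1.0 - p_treat) / (1.0 - ps))

def weighted_mean_difference(y, t, w):
    """Difference of weighted outcome means between treated and control units."""
    mu1 = np.sum(w * t * y) / np.sum(w * t)
    mu0 = np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
    return mu1 - mu0

# Simulated observational data (hypothetical): x[:, 0] confounds treatment and outcome
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
true_ps = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
t = rng.binomial(1, true_ps)
y = 2.0 * t + 1.5 * x[:, 0] + rng.normal(size=n)

ps_hat = fit_logistic_propensity(x, t)
w = stabilized_weights(t, ps_hat)
naive = y[t == 1].mean() - y[t == 0].mean()
iptw = weighted_mean_difference(y, t, w)
print(f"naive difference: {naive:.2f}, stabilized IPTW estimate: {iptw:.2f}")
```

In this normalized difference of means the stabilizing constants cancel, so the point estimate matches the unstabilized one; stabilization mainly keeps the weights near 1 on average, which matters when the weights feed a weighted outcome model or a variance estimate.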
- Item. Open Access. Data Subset-Based Methods of Inference for Spatial Individual Level Epidemic Models (2023-08). Nyein, Thet Htet Chan; Deardon, Rob; Shen, Hua; Kopciuk, Karen A. Mathematical models are essential to understand infectious disease dynamics, making it possible to control the spread of those diseases and to prepare public health measures. Since time and space are important factors affecting the transmission of infectious diseases, spatial individual-level models (ILMs) with both temporal and spatial information are developed. Typically, Markov Chain Monte Carlo (MCMC) methods are utilized for inference in ILMs. Nonetheless, this approach can be computationally intensive for complex or large models, because it requires repeated likelihood calculations. This thesis explores various spatial and temporal subset methods to conduct statistical inference for spatial epidemic models, aiming to provide appropriate parameter estimates with minimum computational resources. In this thesis, we utilize the spatial ILM with the Euclidean distance between susceptible individuals and infectious individuals as a kernel function.
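To make the role of the distance kernel in the abstract above concrete, here is a minimal sketch of a one-step infection probability in a spatial ILM. The power-law kernel d^(-beta), the form 1 - exp(-alpha * sum of kernels), and the parameter values are assumptions chosen as a common textbook-style specification; the abstract only states that Euclidean distance enters through a kernel function.

```python
import numpy as np

def infection_probabilities(coords, infectious, alpha, beta):
    """P(individual i is infected in the next time step), for every individual.

    Assumed form: P_i = 1 - exp(-alpha * sum_j d_ij^(-beta)), where the sum
    runs over currently infectious individuals j and d_ij is the Euclidean
    distance. Already-infectious individuals are returned with probability 0.
    """
    coords = np.asarray(coords, dtype=float)
    infectious = np.asarray(infectious, dtype=bool)
    diff = coords[:, None, :] - coords[None, infectious, :]
    dist = np.sqrt((diff**2).sum(axis=2))         # n x (number infectious)
    with np.errstate(divide="ignore"):
        kernel = dist ** (-beta)
    kernel[~np.isfinite(kernel)] = 0.0            # ignore zero distances (self-pairs)
    prob = 1.0 - np.exp(-alpha * kernel.sum(axis=1))
    prob[infectious] = 0.0
    return prob

# Three individuals on a line; only the first is infectious (hypothetical layout)
coords = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
infectious = [True, False, False]
prob = infection_probabilities(coords, infectious, alpha=0.5, beta=2.0)
print(prob)
```

Every likelihood evaluation in MCMC repeats this calculation over all susceptible-infectious pairs and time points, which is the cost that subset-based inference aims to reduce.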
- Item. Open Access. Estimation and Group Selection in Partially Linear Survival Models (2018-01-17). Afzal, Arfan; Lu, Xuewen; Ambagaspitiya, Rohana; Shen, Hua; Deardon, Rob; Zhao, Yichun. In survival analysis, different regression models are available to estimate the effects of covariates on the censored survival outcome. The proportional hazards (PH) model has been the most popular model among them because of its simplicity and desirable theoretical properties. However, the PH model assumes that the hazard ratio is constant over observed time. When this assumption is not met or we are interested in the risk difference, the additive hazards (AH) model is a useful alternative. On the other hand, assuming a linear structure of covariate effects on survival in these models may be too strict. As a remedy, partially linear survival models are becoming increasingly popular, as they combine the flexibility of nonparametric modeling with the parsimony and easy interpretability of parametric modeling. Nonetheless, building these models becomes a challenging problem when predictors or covariates are high-dimensional and grouped. Consequently, it becomes crucial to select important groups and important individual variables within groups by the so-called bi-level variable selection method, in order to reduce the dimension of the data and build a sensible and useful semiparametric model for applications, because methods for individual variable selection may perform inefficiently in such cases by ignoring the information present in the grouping structure. To fill gaps in estimation and group selection in partially linear survival models with high-dimensional data, in this thesis, we propose new methods for estimation and group selection in two partially linear survival models, namely, the partially linear AH model and the partially linear PH model.
In the first part of this thesis, we consider estimation in a partially linear AH model with left-truncated and right-censored data when the dimension of covariates is fixed and the risk function has a partially linear structure. We construct a pseudo-score function to estimate the coefficients of the linear covariates and the B-spline basis functions. The proposed estimators are asymptotically normal under the assumption that the true nonlinear functions are B-spline functions whose knot locations and the number of knots are held fixed. In the second and third parts, we study group variable selection in the partially linear AH model and the partially linear PH model with right censored data. In such regression models with a grouping structure among the explanatory variables, variable selection at the group and within group individual variable level is important to improve model accuracy and interpretability. Motivated by the hierarchical grouped variable selection in the linear PH model and the linear AH model, we propose a hierarchical bi-level variable selection approach for high-dimensional covariates in the linear part of the partially linear AH model and the partially linear PH model, respectively. The proposed methods are capable of conducting simultaneous group selection and individual variable selection within groups in the presence of nonparametric risk functions of low-dimensional covariates. For group selection in the partially linear AH model, the rates of convergence and selection consistency of the proposed estimators are established using martingale and empirical process theory; after reducing the dimension of the covariates, we suggest the use of the method in the first part for inference in the partially linear AH model. For group selection in the partially linear PH model, similar theoretical results of the proposed estimators are obtained, and the oracle properties such as asymptotic normality of the estimators are discussed. 
Finally, computational algorithms and programs are developed for utilizing the proposed methods. Simulation studies indicate good finite sample performance of the methods. For each model, real data examples are provided to illustrate the application of the methods.
- Item. Open Access. Forecasting of Wind Energy Generation in Alberta (2018-09-14). Luo, Yilan; Sezer, A. Deniz; Wood, David H.; Deardon, Rob. In this paper, our goal is to build a model for the future wind power generation of Alberta, as Alberta’s wind power capacity is growing, and new wind farms are expected to be built in the near future. An important feature of the wind power data is spatial and temporal correlation. To capture this, we model the wind power generation in Alberta as a spatio-temporal process. We apply the method of Gaussian random fields to analyze the wind power time series of 20 wind farms in Alberta. Following the work of Gneiting et al. [11], we build several spatio-temporal covariance function estimates with increasing complexity: separable, non-separable symmetric, and non-symmetric. We compare the performance of the models using simple kriging. We also use kriging to demonstrate the performance of the models in forecasting future wind generation for both an existing wind farm and a new farm in Alberta. Finally, we formulate the mean and variance of the aggregate wind power generation in Alberta.
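The simplest of the model classes named in the abstract above, a separable covariance combined with simple kriging, can be sketched as follows. The exponential spatial and temporal components, the range parameters, the known mean, and the toy observations are all assumptions for illustration; they are not the covariance families fitted in the thesis.

```python
import numpy as np

def separable_cov(h, u, sigma2=1.0, range_s=50.0, range_t=3.0):
    """Separable covariance: C(h, u) = sigma2 * exp(-h/range_s) * exp(-|u|/range_t)."""
    return sigma2 * np.exp(-h / range_s) * np.exp(-np.abs(u) / range_t)

def simple_kriging(sites, times, y, site0, time0, mean=0.0, **cov_kw):
    """Simple kriging prediction and variance at (site0, time0) with a known mean."""
    h = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
    u = times[:, None] - times[None, :]
    Sigma = separable_cov(h, u, **cov_kw)          # covariance among observations
    c0 = separable_cov(np.linalg.norm(sites - site0, axis=1), times - time0, **cov_kw)
    weights = np.linalg.solve(Sigma, c0)           # kriging weights
    pred = mean + weights @ (y - mean)
    var = separable_cov(0.0, 0.0, **cov_kw) - weights @ c0
    return pred, var

# Toy data: three farms (coordinates in km) observed at two time points
sites = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 40.0]] * 2)
times = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = np.array([0.4, 0.1, -0.3, 0.6, 0.2, -0.1])   # mean-centred output (made up)

pred_obs, var_obs = simple_kriging(sites, times, y, sites[3], times[3])
pred_new, var_new = simple_kriging(sites, times, y, np.array([10.0, 10.0]), 2.0)
print(f"forecast at a new farm, one step ahead: {pred_new:.3f} (variance {var_new:.3f})")
```

Predicting at an observed location and time reproduces the observation with zero kriging variance, while the forecast at the unobserved farm carries positive variance that grows with distance in space and time.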
- Item. Open Access. Geographically Dependent Individual-level Models for Infectious Disease Transmission (2022-06). Mahsin, MD; Deardon, Rob; Brown, Patrick; Kopciuk, Karen; Shen, Hua; Brown, Grant. Infectious disease models can be of great use for understanding the underlying mechanisms that influence the spread of diseases and predicting future disease progression. Modeling has been increasingly used to evaluate the potential impact of different control measures and to guide public health policy decisions. In recent years, there has been rapid progress in developing spatio-temporal modeling of infectious diseases, and an example of such recent developments is the discrete time individual-level models (ILMs). These models are well developed and provide a common framework for modeling many disease systems; however, they assume the probability of disease transmission between two individuals depends only on their spatial separation and not on their spatial locations. In cases where spatial location itself is important for understanding the spread of emerging infectious diseases and identifying their causes, it would be beneficial to incorporate the effect of spatial location in the model. In this study, we thus generalize the ILMs to a new class of geographically-dependent ILMs (GD-ILMs), to allow for the evaluation of the effect of spatially varying risk factors (e.g., education, social deprivation, environmental), as well as unobserved spatial structure, upon the transmission of infectious disease. Specifically, we consider a conditional autoregressive (CAR) model to capture the effects of unobserved spatially structured latent covariates or measurement error. This results in flexible infectious disease models that can be used for formulating etiological hypotheses and identifying geographical regions of unusually high risk to formulate preventive action.
The reliability of these models is investigated on a combination of simulated epidemic data and Alberta seasonal influenza outbreak data (2009). This new class of models is fitted to data within a Bayesian statistical framework using Markov chain Monte Carlo (MCMC) methods. We also developed the continuous-time GD-ILMs, allowing infection times and infectious periods to be treated as latent variables that are estimated using data-augmented Markov Chain Monte Carlo (MCMC) techniques within a Bayesian framework. This approach results in a flexible infectious disease modeling framework for formulating etiological hypotheses and identifying unusually high-risk geographical regions to develop preventive action. We evaluate the performance of these proposed models on a combination of simulated epidemic data and seasonal influenza data in Alberta in 2009. Finally, we proposed a special case of the GD-ILMs, termed small-area restricted GD-ILMs, for infectious disease modelling. The reliability of these models is investigated through simulation studies based on disease spread through the Canadian city of Calgary, Alberta.
- Item. Open Access. Group Selection in Semiparametric and Nonparametric Accelerated Failure Time Models (2017). Huang, Longlong; Lu, Xuewen; Kopciuk, Karen; Deardon, Rob; Sajobi, Tolulope; Yan, Ying; Hu, Joan. In survival analysis, a number of regression models can be used to estimate the effects of covariates on the censored survival outcome. When covariates can be naturally grouped, group selection is important in these models. Motivated by the group bridge approach for variable selection in a multiple linear regression model, we consider group selection in a semiparametric accelerated failure time (AFT) model using Stute's weighted least squares and a group bridge penalty. This method is able to simultaneously carry out feature selection at both the group and within-group individual variable levels and enjoys the powerful oracle group selection property. Although the group bridge penalized approach can effectively remove unimportant groups, it cannot effectively remove unimportant variables within the important groups. To overcome this limitation, the adaptive group bridge method is proposed. We show that the adaptive group bridge method obtains the oracle property. Simulation studies indicate that the group bridge and adaptive group bridge approaches for the AFT model can correctly identify important groups and variables even with high censoring rates. A real data analysis is provided to illustrate the application of the proposed methods. We further study a nonparametric accelerated failure time additive regression (NP-AFT-AR) model whose covariates have nonparametric effects on the survival time. The proposed model is more flexible than the linear model and can be fitted to high-dimensional censored data when some components are unknown non-linear functions. B-splines are used to approximate the nonparametric components.
A group bridge penalized variable selection approach based on the inverse probability-of-censoring weighted least squares is developed to select nonparametric components. The proposed approach is able to distinguish the nonzero components from the zero components and estimate the nonzero components simultaneously. Computational algorithms and theoretical properties of the proposed method are established. Simulation studies indicate that the proposed method has satisfactory performance even with relatively high censoring rates. Two real data analyses are used to illustrate the application of the proposed method to survival data analysis.
- ItemOpen AccessThe impact of transposable elements on genomes of parasitic nematodes(2020-07-06) Dunemann, Sonja Maria; Wasmuth, James D.; Lynch, Tarah; Deardon, Rob; Yeaman, SamParasitic nematodes infect many animal and plant species. In humans, they cause significant levels of morbidity and death in low- and middle-income countries. In livestock and crop plants, they have a significant, negative economic impact globally. To better understand the biology of these organisms, we must understand their genetic makeup. Up to 36% of parasitic nematode genomes consist of transposable elements (TEs). TEs are mobile, repetitive elements, and act as major players in evolution due to their ability to transfer horizontally between different taxa, move within genomes, and impact the phenotype of their host. However, TEs are often sidelined in standard genomic and transcriptomic analyses. Hence, we know little about their impact on and interaction with nematodes. Here I show that TEs can be horizontally transferred between parasitic nematodes and their hosts, and that TEs are actively mobilizing themselves throughout the life-cycle of two parasitic species. I compare different phylogenetic methods to test for horizontal transfer of TEs between taxa using AviRTE, a LINE element that has been shown previously to have horizontally transferred between birds and parasitic nematodes. I find that phylogenetic trees of TEs based on coding regions differ from trees based on full-length sequences. I identify another TE, RTE1\_Sar, that was horizontally transferred between parasitic nematodes and the common shrew. To rule out contamination of the shrew genome, I develop a pipeline called ConTest that tests for contamination by comparison of TE flanking sequences to two sequence databases. To better understand potential consequences of TE insertions, I investigate the underlying mechanisms of TE expression. 
I find that TE expression is specific to developmental stage, and that genes and TEs potentially form chimeric transcripts. This thesis shows that transposons might have transferred more often between parasites and hosts than previously thought, and provides a pipeline to test whether a genomic sequence results from contamination. Furthermore, this work lays an early foundation to study TE impact on parasite genomes by showing that the majority of TE expression arises from read-through transcription, but that younger LINE elements are active and continue to shape genome evolution.
- ItemOpen AccessMethods for detecting seasonal influenza epidemics using a school absenteeism surveillance system(2019-09-05) Ward, Madeline A; Stanley, Anu; Deeth, Lorna E; Deardon, Rob; Feng, Zeny; Trotz-Williams, Lise A Background: School absenteeism data have been collected daily by the public health unit in Wellington-Dufferin-Guelph, Ontario since 2008. To date, a threshold-based approach has been implemented to raise alerts for community-wide and within-school illness outbreaks. We investigate several statistical modelling approaches to using school absenteeism for influenza surveillance at the regional level, and compare their performances using two metrics. Methods: Daily absenteeism percentages from elementary and secondary schools, and report dates for influenza cases, were obtained from Wellington-Dufferin-Guelph Public Health. Several absenteeism data aggregations were explored, including using the average across all schools or only using schools of one type. A 10% absence threshold, exponentially weighted moving average model, logistic regression with and without seasonality terms, day of week indicators, and random intercepts for school year, and generalized estimating equations were used as epidemic detection methods for seasonal influenza. In the regression models, absenteeism data with various lags were used as predictor variables, and missing values in the datasets used for parameter estimation were handled either by deletion or linear interpolation. The epidemic detection methods were compared using a false alarm rate (FAR) as well as a metric for alarm timeliness. Results: All model-based epidemic detection methods were found to decrease the FAR when compared to the 10% absence threshold. Regression models outperformed the exponentially weighted moving average model, and including seasonality terms and a random intercept for school year generally resulted in fewer false alarms. 
The best-performing model, a seasonal logistic regression model with random intercept for school year and a day of week indicator, where parameters were estimated using absenteeism data that had missing values linearly interpolated, produced a FAR of 0.299, compared to the pre-existing threshold method, which at best gave a FAR of 0.827. Conclusions: School absenteeism can be a useful tool for alerting public health to upcoming influenza epidemics in Wellington-Dufferin-Guelph. Logistic regression with seasonality terms and a random intercept for school year was effective at maximizing true alarms while minimizing false alarms on historical data from this region.
- ItemEmbargoOn Some New Variable Selection Methods for Multivariate Survival Data(2023-08) Mahmoudi, Fatemeh; Lu, Xuewen; Wu, Jingjing; Deardon, Rob; Wang, Liqun; Ambagaspitiya, Rohana; Lu, XuewenThis dissertation proposes variable selection methods for reducing dimensionality in complex lifetime data for survival analysis. With the advent of big data, survival analysis often involves a large number of covariates, necessitating their identification. High-dimensional data, especially with increasing sample size, presents challenges in terms of variable selection. The dissertation focuses on simultaneous estimation and variable selection methods under various censored data types and survival models, examining their theoretical properties and performance in finite samples. The analysis of complex lifetime data encounters challenges stemming from different sources, including various types of censoring, diverse models, and multiple outcomes. Traditional survival analysis primarily deals with univariate survival data, focusing on a single event of interest. However, real-world applications frequently involve multiple event types with distinct underlying causes and risk factors. This research investigates three types of multiple events data: competing risks, semi-competing risks, and multivariate failure time data. For competing risks data, Chapter 2 considers interval-censored models. A penalized variable selection method is proposed, utilizing the LASSO, Adaptive LASSO, and broken adaptive ridge regression. The proposed method effectively selects important variables based on results of simulation studies. It is also successfully applied to a real-life HIV study dataset. In the context of semi-competing risks data, Chapter 3 explores an illness-death model with shared frailty. Parametric and semiparametric models are employed to examine the effects of covariates and conduct variable selection. 
The proposed method demonstrates good performance through simulation studies and analysis of colon cancer data. For multivariate failure time data, Chapter 4 introduces the sparse group broken adaptive ridge (SGBAR) penalty. This penalty facilitates variable selection at both the individual and group levels and is applied to interval-censored data. Extensive simulation studies confirm the good performance of the method, and the method is further validated using real-life data from the Aerobics Center Longitudinal Study (ACLS). In summary, this dissertation proposes new variable selection methods for complex lifetime data. It addresses challenges associated with competing risks, semi-competing risks, and multivariate failure time data. The proposed methods are supported by theoretical analysis, simulation studies, and real-life applications.
- ItemOpen AccessOn the Effect of Ignoring Within-Unit Infectious Disease Dynamics When Modelling Spatial Transmission(2019-09-18) Ferdous, Tahsin; Deardon, Rob; Shen, Hua; Ngamkham, ThuntidaIndividual-level models (ILMs) are a class of models that can be used to analyze infectious epidemic data to assist in the understanding of the spatio-temporal dynamics of infectious diseases in discrete time (Deardon et al., 2010). ILMs are generally fitted to epidemic data through Markov chain Monte Carlo (MCMC) methods in a Bayesian statistical framework. Here, we test the effect of ignoring within-unit (e.g., city) infectious disease dynamics when we model spatial transmission. We do this by generating our epidemic data sets from a true model which considers within-unit dynamics. It is often hard to get individual-level data in reality. Also, the R package EpiILM used in this thesis for model fitting does not allow for within-unit dynamics. For these reasons, we cannot easily fit our generating model to data. We fitted two ILMs (one model with a covariate representing city size, and the other model without covariates), in which within-unit dynamics are not explicitly accounted for. We have found from our analysis that the model with the covariate may be a slightly better model to describe the spatio-temporal dynamics of the epidemic. However, although the model with the covariate is better in describing the epidemic process, the dynamics are still not perfectly captured by this model. Our results show the dangers inherent in ignoring within-unit dynamics when modelling spatial disease transmission.
- ItemEmbargoTrends in financial markets: uncovering the distribution of intensity and duration(2024-01) Rumana, Afrin Sadia; Wu, Jingjing; Lu, Xuewen; Deardon, Rob; Ambagaspitiya, RohanaIn financial investment, market trends are ubiquitous. Put simply, trending markets are characterized by changes in price that are persistent in time. In this research, we are interested in understanding the global properties of trending markets ex-post, as there is a shortage of research in this direction. The primary goal of our study is to provide a reliable approach for categorizing financial market trends by defining their strength and persistence. However, the noisy characteristics of financial data and the hidden character of a true market trend make this endeavor nontrivial. Towards this end, we use resampling techniques and establish empirical labeling algorithms in parallel with Hidden Markov models and Bayesian smoothing filtering to estimate the underlying structure and dynamics of market trends. From our results, we can comment on the market trend intensity and duration across various financial markets and asset classes. Here, we focus on labeling trends, as opposed to identifying them in real-time, as this can provide valuable diagnostic information ex-post about how the macroeconomic conditions of the market influence the dynamics and characteristics of trends.
- ItemOpen AccessUsing prior-data conflict to tune Bayesian regularized regression models(2023-09-22) Biziaev, Timofei; Chekouo Tekougang, Thierry; Kopciuk, Karen A.; Deardon, Rob; Evans, MichaelIn high-dimensional regression models, variable selection becomes challenging from a computational and theoretical perspective. Bayesian regularized regression via shrinkage priors, such as the Laplace or spike-and-slab prior, is an effective method for variable selection in p > n scenarios, provided the shrinkage priors are configured adequately. We propose configuring shrinkage priors using checks for prior-data conflict: tests that assess whether there is disagreement in parameter information provided by the prior and data. We apply our proposed method to the Bayesian LASSO and spike-and-slab shrinkage priors and assess variable selection performance of our prior configurations against competing models through a linear and logistic high-dimensional simulation study. Additionally, we apply our method to proteomic data collected from patients admitted to the Albany Medical Center in Albany, NY in April of 2020 with COVID-like respiratory issues. Simulation results suggest our proposed configurations may outperform competing models when the true regression effects are small.