Browsing by Author "Lee, Joon"
Now showing 1 - 11 of 11
Item Open Access
A Systematic Machine Learning-Based Investigation of Bloodstream Infection Biomarkers to Predict Clinical Outcome (2024-04-16)
Gilliland, Rory Lewis; MacDonald, M. Ethan; Lewis, Ian A.; Dingle, Tanis C.; Lee, Joon; Messier, Geoffrey

Bloodstream infections (BSI) represent a major burden on modern medicine, annually causing millions of cases worldwide with high mortality rates. Concerted efforts have been made in recent decades to improve BSI diagnostics to treat these dangerous infections more rapidly and precisely. However, these efforts have been hindered by an incomplete understanding of what factors make certain BSIs more severe than others. To address this shortcoming, this thesis applied statistical, machine learning, and epidemiological analyses to systematically investigate patient- and microbe-related traits as biomarkers of BSI clinical outcome. The analyses were facilitated by the Calgary BSI Cohort, a collection of over 35,000 BSI episodes detailing microbial genomic, proteomic, and metabolomic profiles linked to extensive patient medical records. Unsurprisingly, the results demonstrated that patient-related traits (e.g., age and comorbidity) are tightly linked to BSI clinical outcome. Patient mortality, hospital stay duration, and healthcare cost could all be predicted using patient features with areas under the receiver operating characteristic curve exceeding 0.80. Several microbe-related traits such as species classification and virulence factors were also found to be associated — albeit less strongly — with BSI patient mortality risk. Interestingly though, when patient- and microbe-related traits were combined, their predictive performance for BSI patient mortality did not surpass that of patient traits alone. Follow-up analyses revealed a compelling possible explanation: many "predictive" microbial traits may simply report the underlying characteristics of the patients that tend to be infected by the pathogens carrying those traits.
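The area under the receiver operating characteristic curve quoted above can be computed with the rank-based (Mann-Whitney) formulation. A minimal, illustrative Python sketch (not code from the thesis):

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen positive case receives a higher
    score than a randomly chosen negative case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: positives are mostly, but not always, ranked above negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(labels, scores))  # 8 of the 9 positive-negative pairs are ordered correctly
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values exceeding 0.80 indicate strong discrimination.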
Taken together, the results suggest that patient-related traits are critically important as markers of BSI clinical outcome. Prompt development of formalized, patient-factor-based BSI risk stratification tools seems warranted to assist physicians in precisely identifying high-risk infections early in the clinical trajectory. In contrast, while microbial characteristics are invaluable for directing clinical therapy of BSIs, they provide little unique predictive information for BSI clinical outcome, making them unsuitable as biomarkers in the context of BSI risk stratification. Future research investigating the diagnostic relevance of the microbe should take great care to adequately correct for confounding patient dynamics.

Item Open Access
Acute Ischemic Stroke Analysis Using Deep Learning-based Image-to-image Translation (2023-08)
Gutierrez Munoz, Jose Alejandro; Forkert, Nils Daniel; Pike, G. Bruce; Lee, Joon; LeVan, Pierre; Almekhlafi, Mohammed

Acute ischemic stroke occurs due to the sudden occlusion of a cerebral artery, leading to a disruption in metabolic homeostasis and cell damage. Accurate diagnosis and informed treatment decision-making rely on clinical assessments accompanied by medical imaging. Deep learning methods offer the potential to enhance this decision-making by enabling complex pattern recognition. However, they often rely on large amounts of data, which poses a challenge in stroke centers due to the diverse range of imaging modalities employed. Moreover, specialized processing is often required for the meaningful interpretation of valuable imaging methods like perfusion imaging. Recent advancements in deep learning have made it easier to process and analyze perfusion data by predicting the follow-up tissue outcome. However, these models rely on manual binary lesion segmentations as prediction targets, which may hinder interpretability and limit the amount of available data.
To address these limitations, the work described in this thesis utilizes a set of deep learning techniques known as image-to-image translation networks in two distinct ways. First, a method was developed to simulate computed tomography datasets based on magnetic resonance imaging scans and vice versa. The results showed that the proposed approach produces realistic outputs, effectively changing the modality while preserving stroke lesions and brain morphology in follow-up scans. This increases the availability of single-modality data and provides an alternative imaging option for follow-up stroke evaluation. Second, a method was developed to predict stroke tissue outcomes from perfusion scans without relying on manual lesion segmentations, predicting the follow-up image instead. The results show that the proposed method is able to capture the effects of different treatments, highlighting its potential as a tool for treatment guidance or efficacy evaluation. In conclusion, the application of image-to-image generative modelling proves valuable for enhancing acute ischemic stroke analysis and care.

Item Open Access
Applications of Data Science to Electronic Health Data in Health Services Research (2022-09-06)
Lee, Seungwon; Quan, Hude; Lee, Joon; Naugler, Chris Terrance; Shaheen, Abdel-Aziz; Samuel, Susan Matthew; Kaul, Padma

The application of data science to medical big data is essential for achieving precision medicine and building a learning health system. Many electronic health databases contain big data in medicine; largely, these databases are divided into administrative data, electronic medical records (EMR) data, and other types such as clinical registries. These databases were designed for different purposes and have informed the health system and stakeholders. Bringing these datasets together for data-driven research is an essential step.
This manuscript-based thesis focuses on applying data science to electronic health data. The first part explores the Allscripts Sunrise Clinical Manager (SCM) EMR data for research purposes, including its advantages and challenges, and then establishes a process for linking this database with other databases to build a disease cohort. The second part presents a systematic scoping review of how data science has been applied to similarly linked data to define conditions and comorbidities; capturing comorbidities and outcomes is fundamental for studying treatment effects and tailoring medical decisions. The third and last part narrows the focus to non-alcoholic fatty liver disease and applies data science methodologies to answer specific health services research questions in that disease context. The completion of this work demonstrates the successful application of data science to electronic health data for health services research. Specifically, the first part paves the way for routinely using SCM EMR data for research in Alberta; organizational procedures for data storage and transfer are also mapped out. These activities may not be of direct scientific value but are crucial for building infrastructure capable of supporting scientific work. The second part surveys current data science approaches to identifying comorbidities and outcomes, shedding light on potential directions for ongoing and future research. The third part combines data analytics with existing health services research methods (i.e., epidemiology) and demonstrates that data tools can be developed to reduce the burden on care providers and the health system.
Multidisciplinary collaboration and input from diverse perspectives are vital for achieving precision medicine.

Item Open Access
Creating a Frailty Case Definition for Primary Care EMR Using Machine Learning (2021-05-04)
Aponte-Hao, Zhi Yun (Sylvia); Williamson, Tyler; Lee, Joon; McBrien, Kerry; Ronksley, Paul

Background: Frailty is a geriatric syndrome characterized by increased vulnerability and increased risk of adverse events. The Clinical Frailty Scale (CFS) is a judgement-based scale used to identify frailty in senior populations (over the age of 65). Primary care electronic medical records (EMRs) contain routinely collected medical data and can be used for frailty screening. There is currently no method to detect frailty automatically in primary care EMRs that aligns with the CFS definition.

Purpose: To create a machine learning-based algorithm for the identification of frailty in routinely collected primary care EMR data.

Methods: Primary care physicians within the Canadian Primary Care Sentinel Surveillance Network retrospectively identified frailty in 5,466 senior patients from their own practices using the CFS, and the corresponding patient EMR data were extracted and processed as features. The patient data were split 70-30, with 70% forming the training set and 30% the hold-out set used for final testing. A collection of machine learning algorithms was trained on the training dataset, including regularized logistic regression models, support vector machines, random forests, k-nearest neighbours, classification and regression trees, feedforward neural networks, Naïve Bayes, and XGBoost. A balanced training dataset was also created by oversampling. Sensitivity analyses were performed using two alternative dichotomization cut-offs for frailty.
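Metrics such as sensitivity, specificity, and the predictive values used in evaluations like the one below all derive from the four cells of the confusion matrix. A minimal stdlib Python sketch (illustrative only, not thesis code):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV from paired binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical frail (1) vs non-frail (0) labels and model predictions:
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on the prevalence of the positive class in the evaluation set.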
Final model performance was assessed using the hold-out dataset and reported using the area under the ROC curve, accuracy, F1-score, sensitivity, specificity, and positive and negative predictive values.

Results: 18.4% of patients were classified as frail based on a CFS score of 5 or above. Of the 8 models developed, an XGBoost model had the best classification performance, with a sensitivity of 78.14% and a specificity of 74.41%. Neither the balanced training dataset nor the sensitivity analyses using two alternative cut-offs improved performance.

Conclusion: Supervised machine learning was able to distinguish between frail and non-frail patients with good performance. Future work may develop a protocol for standardized assignment of the CFS, use all available structured and unstructured data, and supplement these with additional geriatric-specific data.

Item Embargo
Developing Novel Supervised Learning Model Evaluation Metrics Based on Case Difficulty (2024-01-05)
Kwon, Hyunjin; Lee, Joon; Josephson, Colin Bruce; Greenberg, Matthew

Performance evaluation is an essential step in the development of machine learning models. The performance of a model informs the direction of its development and provides diverse knowledge to researchers. The most common ways to assess a model's performance are based on counting the numbers of correct and incorrect predictions the model makes. However, this conventional approach to evaluation is limited in that it does not consider the differences in prediction difficulty between individual cases. Although metrics for quantifying the prediction difficulty of individual cases exist, their usefulness is hindered by the fact that they cannot be applied universally across all types of data; that is, each metric requires that specific data conditions be met for its use, which can be a significant obstacle when dealing with real-world data characterized by diversity and complexity.
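To make the general idea of difficulty-aware evaluation concrete, one possible formulation (an illustrative sketch only, not the formulation developed in the thesis) weights correct predictions by each case's difficulty and weights errors by the case's easiness, so that getting hard cases right is rewarded and missing easy cases is penalized:

```python
def difficulty_weighted_score(y_true, y_pred, difficulty):
    """Correct predictions earn the case's difficulty d (hard cases are
    worth more); errors are weighted by 1 - d (easy mistakes cost more).
    Returns a value in [0, 1]; plain accuracy is recovered when every
    case has d = 0.5."""
    credit = sum(d for t, p, d in zip(y_true, y_pred, difficulty) if t == p)
    penalty = sum(1 - d for t, p, d in zip(y_true, y_pred, difficulty) if t != p)
    return credit / (credit + penalty)

# Both models below are 75% accurate, but they err on different cases.
y_true     = [1, 1, 0, 0]
difficulty = [0.9, 0.1, 0.9, 0.1]   # hypothetical per-case difficulty in [0, 1]
misses_easy = [1, 1, 0, 1]          # wrong only on an easy case (d = 0.1)
misses_hard = [0, 1, 0, 0]          # wrong only on a hard case (d = 0.9)
print(difficulty_weighted_score(y_true, misses_easy, difficulty))  # lower: the easy miss is costly
print(difficulty_weighted_score(y_true, misses_hard, difficulty))  # higher: the hard miss is forgiven
```

The two models tie on conventional accuracy yet receive different difficulty-aware scores, which is exactly the distinction such metrics are meant to surface.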
Therefore, this thesis proposes new metrics for calculating case difficulty that perform well across diverse datasets. These new case difficulty metrics were developed using neural networks based on varying definitions of prediction difficulty. In addition, the metrics were validated using various datasets and compared with existing metrics from the literature. New performance evaluation metrics incorporating case difficulty to reward correct predictions of high-difficulty cases and penalize incorrect predictions of low-difficulty cases were also developed. A comparison of these case difficulty-based performance metrics with conventional performance metrics revealed that the novel evaluation metrics could provide a more detailed explanation and deeper understanding of model performance. We anticipate that researchers will be able to calculate case difficulty in diverse datasets under various data conditions with our proposed metrics and use these values to enhance their studies. Moreover, our newly developed evaluation metrics considering case difficulty could serve as an additional source of insight for the evaluation of classification model performance.

Item Open Access
Efficacy of wearable devices for describing fatigue-related movement patterns in running and neurological disease (2024-01-09)
Dimmick, Hannah Lee; Ferber, Reed; Culos-Reed, Nicole; Lee, Joon

Wearable technology allows research to take place in more applied settings and generates more data than ever before, giving researchers the opportunity to collect enough data to construct individualized models in a variety of settings. These techniques can help tie subjective feelings of fatigue to objective physiological and biomechanical observations, driving better understanding of this psychosomatic connection.
In Chapter 3, good-to-excellent reliability was shown for a variety of statistical features derived from the acceleration waveform of a low-back IMU worn while running in both non-fatigued and fatigued conditions. However, this analysis was group-based, did not include reliability metrics for individuals or quantify within-subject variability, and was performed on a treadmill, limiting generalizability. Due to these limitations, Chapter 4 aimed to classify group and individual changes in biomechanics with fatigue in both laboratory and overground environments, finding that classification accuracy was lower for the group-based models (57.0-61.5%) than for the individualized models (68.2-68.9%), and that variable importance rankings differed between models and participants. We concluded that using an individualized approach to measure fatigue-related biomechanics in running could lead to a better understanding of how these changes may affect performance or injury. Furthermore, we hypothesized that this approach could similarly be used to investigate fatigue in other sports science and clinical applications, such as neurological disease. Thus, in Chapter 5, we reviewed the evidence for the relationship between gait and fatigue in neurological disease and found no obvious transdiagnostic relationships between gait/mobility and fatigue; instead, these relationships appeared to be condition- and subject-specific. Based on these conclusions, Chapter 6 investigated the association between activity/sleep metrics and fatigue/symptom severity in myasthenia gravis, employing methods similar to those of Chapter 4. Analysis of the individual models showed that there are often individual-level associations between movement and fatigue/symptom severity, highlighting the importance of within-individual analysis for identifying outcomes relevant to the patient.
Overall, these investigations demonstrate how individualized approaches, made possible by wearable devices and big data methods, may be superior to group-based analyses, with implications for injury prevention, performance enhancement, and improved patient care.

Item Open Access
Evaluating the coding accuracy of type 2 diabetes mellitus among patients with non-alcoholic fatty liver disease (2024-02-16)
Lee, Seungwon; Shaheen, Abdel A.; Campbell, David J. T.; Naugler, Christopher; Jiang, Jason; Walker, Robin L.; Quan, Hude; Lee, Joon

Background: Non-alcoholic fatty liver disease (NAFLD) describes a spectrum of chronic fat accumulation in the liver that can lead to fibrosis and cirrhosis. Diabetes has been identified as a major comorbidity contributing to NAFLD progression. Health systems around the world use administrative data to conduct population-based prevalence studies. To that end, we sought to assess the accuracy of diabetes International Classification of Diseases (ICD) coding in administrative databases among a cohort of confirmed NAFLD patients in Calgary, Alberta, Canada.

Methods: The Calgary NAFLD Pathway Database was linked to the following databases: Physician Claims, Discharge Abstract Database, National Ambulatory Care Reporting System, Pharmaceutical Information Network, Laboratory, and Electronic Medical Records. Hemoglobin A1c and diabetes medication details were used to classify patients into four groups: diabetes absent, prediabetes, diabetes meeting glycemic targets, and diabetes not meeting glycemic targets. The performance of ICD codes in each group was compared against this reference standard. Within each group, the numbers of true positives, false positives, false negatives, and true negatives were calculated. Descriptive statistics and bivariate analyses were conducted on identified covariates, including demographics and the types of physicians encountered.
Results: A total of 12,012 NAFLD patients were registered in the Calgary NAFLD Pathway Database, and 100% were successfully linked to the administrative databases. Overall, diabetes coding showed a sensitivity of 0.81 and a positive predictive value of 0.87. False negative rates in the absent and not-meeting-glycemic-targets groups were 4.5% and 6.4%, respectively, whereas the meeting-glycemic-targets group had a 42.2% coding error rate. Most encounters were associated with visits to primary care and outpatient services.

Conclusion: Diabetes ICD coding in administrative databases can accurately detect true diabetic cases. However, patients with diabetes who meet glycemic control targets are less likely to be coded in administrative databases. A detailed understanding of the clinical context will require additional data linkage from primary care settings.

Item Open Access
Leveraging artificial intelligence to monitor unhealthy food and brand marketing to children on digital media (The Lancet, 2020-06)
Olstad, Dana Lee; Lee, Joon

Item Open Access
Phenotype-based prediction of incident cardiovascular hospitalization and inpatient care costs in patients referred for cardiovascular magnetic resonance imaging: Applications of traditional statistical modelling and machine learning (2021-09-24)
Lei, Lucy Y.; White, James A.; Fine, Nowell M.; Lee, Joon; Quan, Hude; Josephson, Colin B.

Background: Cardiovascular disease has an estimated lifetime prevalence of 48% in adults and imposes the highest economic burden on health care systems among noncommunicable diseases. These costs are largely related to chronic disease management, clinical procedures, and hospitalization, particularly for major adverse cardiovascular events (MACE). Importantly, health expenditures incurred by cardiovascular care are expected to increase substantially as the global population ages and life expectancies continue to rise.
To improve health system efficiency and resource allocation in preparation for future cardiovascular care needs, it is necessary to improve baseline patient characterization and offer more accurate personalized risk predictions, in order to plan for opportunities to improve cardiovascular health while controlling costs.

Aims: The aim of this thesis was to develop and validate models for the prediction of MACE and one-year cumulative inpatient care costs in a large cohort of patients referred for cardiovascular magnetic resonance imaging.

Methods: Patients were recruited from the Cardiovascular Imaging Registry of Calgary, a prospective clinical outcomes registry that provides automated linkage of data abstracted from electronic health records, cardiovascular magnetic resonance imaging reports, and patient-reported health questionnaires. These data were used for predictive modelling with both traditional statistical methodologies and machine learning approaches.

Results: Random survival forest and Cox proportional hazards models were developed for time-to-event prediction of hospitalization for MACE. Both models achieved time-dependent AUCs of 0.83 in holdout validation. Patients with predicted risk in the upper tertile experienced 29- and 21-fold (p < 0.001) increased risk of MACE, respectively. A two-part hurdle model was developed for cost regression to predict one-year cumulative inpatient expenditures following cardiovascular magnetic resonance imaging. When binning the cost predictions into zero-, low-, and high-cost brackets, the model achieved 0.73 precision, 0.76 recall, and an F1-score of 0.74.
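The two-part hurdle structure described above can be illustrated with stand-in components (the `p_nonzero` and `cost_if_nonzero` functions below are hypothetical, not models fitted to the registry data): one part estimates the probability of incurring any inpatient cost, the other predicts the magnitude given a nonzero cost, and the expected cost is their product:

```python
def hurdle_predict(x, p_nonzero, cost_if_nonzero):
    """Two-part ('hurdle') prediction: expected cost is the probability of
    clearing the hurdle (cost > 0) times the predicted cost given that
    the hurdle is cleared."""
    return p_nonzero(x) * cost_if_nonzero(x)

# Hypothetical stand-in component models (illustrative, not fitted to data):
p_nonzero = lambda x: min(1.0, 0.1 + 0.05 * x["num_comorbidities"])
cost_if_nonzero = lambda x: 5000 + 2000 * x["num_comorbidities"]

patient = {"num_comorbidities": 4}
print(hurdle_predict(patient, p_nonzero, cost_if_nonzero))  # ~0.3 * 13000
```

Separating the "any cost at all" decision from the "how much" regression is what lets hurdle models handle the large spike of zero-cost patients that a single regression would fit poorly.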
The best-performing machine learning classification model combined predictions from random forest and artificial neural network algorithms to achieve 0.76 precision, 0.82 recall, and an F1-score of 0.79.

Conclusions: The results of this thesis demonstrate the prognostic capacity of multi-domain health data and its utility in developing patient-specific risk models for adverse cardiovascular events and cumulative inpatient care costs. Additionally, while machine learning methodologies offer advantages in handling large health care datasets, the interpretability of traditional statistical models remains valuable for delineating relationships between health-related variables and outcomes.

Item Open Access
Sessile serrated lesions in focus: Examining temporal trends, patient risk factors, and the role of the endoscopist in lesion detection (2023-09-22)
Mazurek, Matthew; Brenner, Darren Michael Riehl; Heitman, Steven James; Hilsden, Robert Jay; Lee, Joon; Ferraz, Jose Geraldo P.

Serrated polyps of the colorectum have become increasingly recognized as an important clinical entity, as these precursor lesions are hypothesized to be responsible for up to 25% of sporadic colorectal cancers. Much confusion surrounds these polyps, particularly their classification and associated malignant risk, owing to varied nomenclature, evolving pathological criteria, and ongoing research in prognostication. A specific subtype, the sessile serrated lesion (SSL), is of particular interest, as it is the most prevalent premalignant subtype and is over-represented in cases of interval cancer. Accurate identification and risk assessment remain a challenge owing to variable detection of clinically relevant serrated lesions by endoscopists, high inter-observer variability in diagnosis by pathologists, and an incomplete understanding of the risk of future neoplasia.
In this thesis, we analyze over 75,000 screening colonoscopies performed over a five-year period at a dedicated, large-volume, high-efficiency screening centre to identify trends in the endoscopic detection of SSLs. The intent of this work is to better understand the temporal factors influencing SSL detection prevalence, the patient risk factors associated with these lesions, and how detection relates to procedural and endoscopist factors. The analysis considers traditional statistical methods as well as novel machine learning algorithms. We demonstrated a positive temporal trend in SSL detection over the study period and identified several patient, procedural, and endoscopist factors associated with SSL detection. Machine learning models improved on the predictive capabilities of traditional statistical models, yet a significant proportion of the variability in risk remained unexplained, underscoring the complexity of accurately predicting SSLs. Endoscopic detection of SSLs correlates strongly with other detection metrics, notably the adenoma detection rate, implying a shared underlying skillset for identifying these distinct polyp types. This connection highlights opportunities to enhance detection through benchmarking and established quality improvement strategies.

Item Embargo
Using Domain Adaptation and Inductive Transfer Learning to Improve Patient Outcome Prediction in the Intensive Care Unit (2023-12-07)
Mutnuri, Maruthi Kumar; Lee, Joon; Stelfox, Thomas; Forkert, Nils Daniel; MacDonald, Matthew Ethan; Parhar, Ken Kuljit

Predicting patient outcomes in the intensive care unit (ICU) can enable appropriate allocation of resources, minimize costs, and support better patient care. Machine learning and deep learning models can predict patient outcomes with a high degree of accuracy, but training those models is both data- and resource-intensive.
Deep learning models trained on small datasets tend to overfit and generalize poorly; transfer learning (TL) helps in such situations by leveraging the knowledge captured in pre-trained models. Transfer learning is a machine learning technique in which a model pre-trained on a source task is adapted to a different but related target task; the source task is trained with a large dataset, whereas a small dataset is sufficient for training the target task. Notably, TL is widely used in medical image analysis and natural language processing but remains uncommon in electronic health record (EHR) analysis. Within the TL literature, domain adaptation (DA) is most common, whereas inductive transfer learning (ITL) is rare. This study explores both DA and ITL using EHR data.

To investigate the effectiveness of these TL models, we compared them with baseline logistic regression (LR), lasso regression, and fully connected neural network (FCNN) models in the prediction of 30-day mortality, acute kidney injury (AKI), hospital length of stay (H_LOS), and ICU length of stay (ICU_LOS). We used two cohorts: (1) eCritical, multicenter ICU data linked with administrative data from 15 medical-surgical ICUs in Alberta, Canada, between March 2013 and December 2019, comprising 55,689 unique admission records from 48,672 unique patients; and (2) MIMIC-III, a single-center, publicly available ICU dataset from Boston, USA, covering 2001 to 2012. The first admission of adult patients with ICU stays longer than 24 hours was included in this retrospective study, using features common to both cohorts. Random subsets of the training data, ranging from 1% to 75%, as well as the full dataset, were used to compare the performance of DA and ITL against FCNN, LR, and lasso. Overall, ITL outperformed the baseline FCNN, LR, and lasso models in 55 of the 56 comparisons (7 data subsets, 4 outcomes, and 2 baseline models) as measured by balanced accuracy (BA) and mean squared error (MSE).
However, DA models outperformed the baseline models in 45 of the 56 comparisons. ITL performed better than DA, both in the number of comparisons in which it beat the baseline models and in the margin by which it did so. Moreover, at the 1% data subset, the TL models outperformed the baseline models in 11 of the 16 cases (8 of 8 for ITL and 3 of 8 for DA); this is significant because TL models are most useful in data-scarce scenarios. When using EHR data, the similarity of the data distributions in the source and target domains was crucial, as evidenced by ITL performing much better than DA, largely because of the domain mismatch between the two cohorts with respect to the AKI, H_LOS, and ICU_LOS outcomes. As the pre-trained models will be made publicly available, further research can be conducted with additional outcomes and different cohorts to make these models more robust through incremental or cumulative transfer learning. These pre-trained models can then be used to predict patient outcomes in the ICU.
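The warm-start idea behind inductive transfer learning (pre-train on a large source task, then continue training the same weights on a small related target task) can be sketched in plain Python with a toy logistic regression on synthetic data. This is an illustrative sketch of the general technique, not the thesis's models, features, or cohorts:

```python
import math
import random

def train_logreg(data, w=None, epochs=200, lr=0.1):
    """Logistic regression via stochastic gradient descent; pass w to
    warm-start from pre-trained weights (the essence of inductive TL)."""
    dim = len(data[0][0])
    w = list(w) if w is not None else [0.0] * (dim + 1)  # weights + bias
    for _ in range(epochs):
        for x, y in data:
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))       # sigmoid
            g = p - y                         # gradient of log-loss w.r.t. z
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def predict(w, x):
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))

# A large synthetic "source" task and a tiny, related "target" task.
random.seed(0)
source = [([x1, x2], int(x1 + x2 > 1.0))
          for x1, x2 in ((random.random(), random.random()) for _ in range(500))]
target = [([x1, x2], int(x1 + x2 > 1.2))   # related task with a shifted boundary
          for x1, x2 in ((random.random(), random.random()) for _ in range(10))]

w_src = train_logreg(source)                     # pre-train on the source task
w_tl = train_logreg(target, w=w_src, epochs=50)  # fine-tune on the small target task
print(predict(w_tl, [0.9, 0.9]) > 0.5)           # well above both decision boundaries
```

Fine-tuning from `w_src` lets the target model start from a decision boundary already close to the right one, which is why the thesis finds TL most useful at the smallest (1%) data subsets.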