Browsing by Author "Lee, Seungwon"
Item Open Access
Applications of Data Science to Electronic Health Data in Health Services Research (2022-09-06)
Lee, Seungwon; Quan, Hude; Lee, Joon; Naugler, Chris Terrance; Shaheen, Abdel-Aziz; Samuel, Susan Matthew; Kaul, Padma
The application of data science to medical big data is essential for achieving precision medicine and building a learning health system. Many electronic health databases contain big data in medicine. Broadly, these databases are divided into administrative data, electronic medical records (EMR) data, and other types such as clinical registries. They were designed for different purposes and have informed the health system and its stakeholders. Bringing these datasets together for data-driven research is an essential step. This manuscript-based thesis focuses on applying data science to electronic health data. The first part explores the Allscripts Sunrise Clinical Manager (SCM) EMR data for research purposes, including its advantages and challenges. The work then establishes a process for linking this database with other databases to build a disease cohort. The second part presents a systematic scoping review that explores how data science has been applied to similarly linked data to define conditions and comorbidities. Capturing comorbidities and outcomes is fundamental for studying treatment effects and tailoring medical decisions. The third and last part narrows the focus to non-alcoholic fatty liver disease and applies data science methodologies to answer health services research questions specific to that disease context. The completion of this work demonstrates the successful application of data science to electronic health data for health services research. Specifically, the first part paves the way for routinely using SCM EMR data for research in Alberta. Organizational procedures on data storage and transfer are also mapped out.
These activities may not be of direct scientific value but are crucial for building infrastructure capable of supporting scientific work. The second part documents how data science is currently applied to identify comorbidities and outcomes, and sheds light on potential directions for ongoing and future research. The third part successfully combines data analytics with existing health services research methods (i.e., epidemiology) and demonstrates that data tools can be developed to reduce the burden on care providers and the health system. Multidisciplinary collaboration and input from diverse perspectives are vital for achieving precision medicine.

Item Open Access
Cerebrovascular disease case identification in inpatient electronic medical record data using natural language processing (2023-09-02)
Pan, Jie; Zhang, Zilong; Peters, Steven R.; Vatanpour, Shabnam; Walker, Robin L.; Lee, Seungwon; Martin, Elliot A.; Quan, Hude
Abstract
Background: Abstracting cerebrovascular disease (CeVD) from inpatient electronic medical records (EMRs) through natural language processing (NLP) is pivotal for automated disease surveillance and improving patient outcomes. Existing methods rely on coders’ abstraction, which suffers from time delays and under-coding. This study sought to develop an NLP-based method to detect CeVD using EMR clinical notes.
Methods: CeVD status was confirmed through a chart review of randomly selected hospitalized patients who were 18 years or older and discharged from 3 hospitals in Calgary, Alberta, Canada, between January 1 and June 30, 2015. These patients’ chart data were linked to administrative discharge abstract database (DAD) and Sunrise™ Clinical Manager (SCM) EMR database records by Personal Health Number (a unique lifetime identifier) and admission date.
We trained multiple NLP predictive models by combining two clinical concept extraction methods and two supervised machine learning (ML) methods: random forest and XGBoost. Using chart review as the reference standard, we compared the models' performance with that of the commonly applied International Classification of Diseases (ICD-10-CA) codes on the metrics of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
Results: Of the study sample (n = 3036), the prevalence of CeVD was 11.8% (n = 360); the median patient age was 63; and females accounted for 50.3% (n = 1528) based on chart data. Among 49 extracted clinical documents from the EMR, four document types were identified as the most influential text sources for identifying CeVD (“nursing transfer report,” “discharge summary,” “nursing notes,” and “inpatient consultation”). The best performing NLP model was XGBoost, combining the Unified Medical Language System concepts extracted by cTAKES (e.g., the top-ranked concepts “Cerebrovascular accident” and “Transient ischemic attack”) and the term frequency-inverse document frequency vectorizer. Compared with ICD codes, the model achieved higher validity overall (ICD codes vs. NLP model): sensitivity (25.0% vs 70.0%), specificity (99.3% vs 99.1%), PPV (82.6% vs 87.8%), and NPV (90.8% vs 97.1%).
Conclusion: The NLP algorithm developed in this study performed better than the ICD code algorithm in detecting CeVD. The NLP models could result in an automated EMR tool for identifying CeVD cases and be applied in future work such as surveillance and longitudinal studies.

Item Open Access
Evaluating the coding accuracy of type 2 diabetes mellitus among patients with non-alcoholic fatty liver disease (2024-02-16)
Lee, Seungwon; Shaheen, Abdel A.; Campbell, David J. T.; Naugler, Christopher; Jiang, Jason; Walker, Robin L.; Quan, Hude; Lee, Joon
Abstract
Background: Non-alcoholic fatty liver disease (NAFLD) describes a spectrum of chronic fattening of the liver that can lead to fibrosis and cirrhosis. Diabetes has been identified as a major comorbidity that contributes to NAFLD progression. Health systems around the world make use of administrative data to conduct population-based prevalence studies. To that end, we sought to assess the accuracy of diabetes International Classification of Diseases (ICD) coding in administrative databases among a cohort of confirmed NAFLD patients in Calgary, Alberta, Canada.
Methods: The Calgary NAFLD Pathway Database was linked to the following databases: Physician Claims, Discharge Abstract Database, National Ambulatory Care Reporting System, Pharmaceutical Information Network database, Laboratory, and Electronic Medical Records. Hemoglobin A1c and diabetes medication details were used to classify patients into four diabetes groups: absent, prediabetes, meeting glycemic targets, and not meeting glycemic targets. The performance of ICD codes among these groups was compared to this standard. Within each group, the total numbers of true positives, false positives, false negatives, and true negatives were calculated. Descriptive statistics and bivariate analysis were conducted on identified covariates, including demographics and the types of physicians patients interacted with.
Results: A total of 12,012 NAFLD patients were registered through the Calgary NAFLD Pathway Database and 100% were successfully linked to the administrative databases. Overall, diabetes coding showed a sensitivity of 0.81 and a positive predictive value of 0.87. False negative rates in the absent and not meeting glycemic control groups were 4.5% and 6.4%, respectively, whereas the meeting glycemic control group had a 42.2% coding error. Visits to primary and outpatient services were associated with most encounters.
Conclusion: Diabetes ICD coding in administrative databases can accurately detect true diabetic cases. However, patients with diabetes who meet glycemic control targets are less likely to be coded in administrative databases. A detailed understanding of the clinical context will require additional data linkage from primary care settings.
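
The validity measures reported in the abstracts above (sensitivity, specificity, PPV, and NPV) are all derived from the four counts obtained by comparing a coding algorithm against a reference standard: true positives, false positives, false negatives, and true negatives. The following is a minimal sketch of that arithmetic, not the authors' code. The names `ConfusionCounts` and `validity_metrics` are hypothetical, and the counts in the example are illustrative values chosen for easy arithmetic, not data from any of the studies listed here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing an algorithm with a reference standard (hypothetical type)."""
    tp: int  # algorithm positive, reference positive
    fp: int  # algorithm positive, reference negative
    fn: int  # algorithm negative, reference positive
    tn: int  # algorithm negative, reference negative


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def validity_metrics(c: ConfusionCounts) -> Dict[str, Optional[float]]:
    """Compute the four standard validity measures from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # share of true cases detected
        "specificity": _ratio(c.tn, c.tn + c.fp),  # share of non-cases correctly excluded
        "ppv": _ratio(c.tp, c.tp + c.fp),          # share of flagged cases that are true
        "npv": _ratio(c.tn, c.tn + c.fn),          # share of unflagged cases that are true negatives
    }


# Illustrative counts only (assumption): 1000 patients, 100 true cases.
example = ConfusionCounts(tp=80, fp=20, fn=20, tn=880)
metrics = validity_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on how the algorithm behaves within the true-case and non-case groups, whereas PPV and NPV also shift with prevalence, which is one reason all four measures are typically reported together.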