Development and validation of a rheumatoid arthritis case definition: a machine learning approach using data from primary care electronic medical records

dc.contributor.author: Pham, Anh N. Q.
dc.contributor.author: Barber, Claire E. H.
dc.contributor.author: Drummond, Neil
dc.contributor.author: Jasper, Lisa
dc.contributor.author: Klein, Doug
dc.contributor.author: Lindeman, Cliff
dc.contributor.author: Widdifield, Jessica
dc.contributor.author: Williamson, Tyler
dc.contributor.author: Jones, C. A.
dc.date.accessioned: 2024-12-01T01:04:44Z
dc.date.available: 2024-12-01T01:04:44Z
dc.date.issued: 2024-11-27
dc.date.updated: 2024-12-01T01:04:44Z
dc.description.abstract: Background: Rheumatoid arthritis (RA) is a chronic inflammatory disease that is primarily diagnosed and managed by rheumatologists; however, it is often primary care providers who first encounter RA-related symptoms. This study developed and validated a case definition for RA using national surveillance data in primary care settings. Methods: This cross-sectional validation study used structured electronic medical record (EMR) data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Based on the reference set generated by EMR reviews by five experts, three machine learning steps were applied: a ‘bag-of-words’ approach to feature generation; feature reduction using a feature importance measure coupled with recursive feature elimination and clustering; and classification using tree-based methods (Decision Tree, Random Forest, and Extreme Gradient Boosting [XGBoost]). The three tree-based algorithms were compared to identify the procedure that generated the optimal evaluation metrics. Nested cross-validation was used so that models could be evaluated, compared, and tuned simultaneously. Results: Of 1.3 million patients from seven Canadian provinces, 5,600 people aged 19+ were randomly selected. The optimal algorithm for selecting RA cases was generated by the XGBoost classification method. Based on feature importance scores for features in the XGBoost output, a human-readable case definition was created, in which RA cases are identified when there are at least 2 occurrences of the text “rheumatoid” in any billing, encounter diagnosis, or health condition table of the patient chart. The final case definition had a sensitivity of 81.6% (95% CI, 75.6–86.4), specificity of 98.0% (95% CI, 97.4–98.5), positive predictive value of 76.3% (95% CI, 70.1–81.5), and negative predictive value of 98.6% (95% CI, 98.0–98.6). Conclusion: A case definition for RA using primary care EMR data was developed based on the XGBoost algorithm. With high validity metrics, this case definition is expected to be a reliable tool for future epidemiological research and surveillance investigating the management of RA in the CPCSSN dataset.
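The human-readable case definition in the abstract (at least 2 occurrences of the text "rheumatoid" across the billing, encounter diagnosis, and health condition tables) can be illustrated with a minimal Python sketch. The table keys, the chart layout as a dictionary of free-text lists, and the function names are hypothetical and chosen only for illustration; the actual CPCSSN schema and the authors' implementation are not described in this record. Counting case-insensitive substring matches is likewise an assumption about what an "occurrence" means. The second function shows how the four reported validity metrics follow from comparing predicted labels with a reference set.

```python
from typing import Dict, Iterable, List

# Hypothetical table names; the real CPCSSN schema is not given in this record.
TABLES = ("billing", "encounter_diagnosis", "health_condition")


def count_rheumatoid(chart: Dict[str, List[str]]) -> int:
    """Count case-insensitive occurrences of 'rheumatoid' across the three tables."""
    return sum(
        text.lower().count("rheumatoid")
        for table in TABLES
        for text in chart.get(table, [])
    )


def is_ra_case(chart: Dict[str, List[str]], threshold: int = 2) -> bool:
    """Apply the rule from the abstract: at least `threshold` occurrences."""
    return count_rheumatoid(chart) >= threshold


def validity_metrics(predicted: Iterable[bool], reference: Iterable[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV from paired labels.

    Assumes every cell of the confusion matrix denominator is non-zero.
    """
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and r:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Illustrative charts (invented text, not real patient data).
chart_a = {
    "billing": ["Rheumatoid arthritis"],
    "health_condition": ["rheumatoid arthritis, seropositive"],
}
chart_b = {
    "billing": ["osteoarthritis"],
    "encounter_diagnosis": ["rheumatoid factor test"],
}
```

With these invented charts, `is_ra_case(chart_a)` is true (two occurrences across two tables) and `is_ra_case(chart_b)` is false (a single occurrence), which mirrors how the threshold of 2 filters out one-off mentions.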
dc.identifier.citation: BMC Medical Informatics and Decision Making. 2024 Nov 27;24(1):360
dc.identifier.uri: https://doi.org/10.1186/s12911-024-02776-w
dc.identifier.uri: https://hdl.handle.net/1880/120139
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dc.title: Development and validation of a rheumatoid arthritis case definition: a machine learning approach using data from primary care electronic medical records
dc.type: Journal Article
Files
Original bundle: 12911_2024_Article_2776.pdf (1.93 MB, Adobe Portable Document Format)