Text analysis framework for identifying mutations among non-small cell lung cancer patients from laboratory data

dc.contributor.author Yusuf, Amman
dc.contributor.author Boyne, Devon J.
dc.contributor.author O’Sullivan, Dylan E.
dc.contributor.author Brenner, Darren R.
dc.contributor.author Cheung, Winson Y.
dc.contributor.author Mirza, Imran
dc.contributor.author Jarada, Tamer N.
dc.date.accessioned 2024-03-17T01:06:12Z
dc.date.available 2024-03-17T01:06:12Z
dc.date.issued 2024-03-11
dc.date.updated 2024-03-17T01:06:11Z
dc.description.abstract Background: Laboratory data can provide great value to support research aimed at reducing the incidence, prolonging survival and enhancing outcomes of cancer. Data is characterized by the information it carries and the format it holds. Data captured in Alberta’s biomarker laboratory repository is free text, cluttered and rogue. This format limits its utility and prohibits broader adoption and research development. Text analysis for information extraction from unstructured data can change this and lead to more complete analyses. Previous work on extracting relevant information from free-text, unstructured data employed Natural Language Processing (NLP), Machine Learning (ML), rule-based Information Extraction (IE) methods, or a hybrid of these.
Methods: In our study, text analysis was performed on Alberta Precision Laboratories data, which consisted of 95,854 entries from the Southern Alberta Dataset (SAD) and 6944 entries from the Northern Alberta Dataset (NAD). The data covers all of Alberta and is completely population-based. Our proposed framework is built around rule-based IE methods. It combines lexical and syntax analyses to achieve deterministic extraction of data from biomarker laboratory data (i.e., Epidermal Growth Factor Receptor (EGFR) test results). Lexical analysis comprises data cleaning and pre-processing, conversion of Rich Text Format text into readable plain text, and normalization and tokenization of the text. The framework then passes the text to the syntax analysis stage, which applies the rule-based method of extracting relevant data. Rule-based patterns of the test result are identified, and a Context Free Grammar then generates the rules of information extraction. Finally, the results are linked with the Alberta Cancer Registry to support real-world cancer research studies.
Results: Of the 5512 entries in the SAD and 5017 entries in the NAD that were filtered for EGFR, the framework yielded 5129 and 3388 extracted EGFR test results, respectively. An accuracy of 97.5% was achieved on a random sample of 362 tests.
Conclusions: We presented a text analysis framework to extract specific information from unstructured clinical data. Our proposed framework has shown that it can successfully extract relevant information from EGFR test results.
dc.identifier.citation BMC Medical Research Methodology. 2024 Mar 11;24(1):63
dc.identifier.uri https://doi.org/10.1186/s12874-024-02192-8
dc.identifier.uri https://hdl.handle.net/1880/118291
dc.language.rfc3066 en
dc.rights.holder The Author(s)
dc.title Text analysis framework for identifying mutations among non-small cell lung cancer patients from laboratory data
dc.type Journal Article
Files
Original bundle
Name: 12874_2024_Article_2192.pdf
Size: 1.79 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.25 KB
Format: Item-specific license agreed upon to submission