Improved Basecalling and Base Modification Detection Through Signal-level Analysis of Nanopore Direct RNA Data

Date
2023-09-14
Abstract
Genome sequencing technologies have emerged as essential tools for addressing the challenges presented by the natural biological complexity of organisms. Unlike traditionally used next-generation sequencing (NGS) methods, which yield short reads, third-generation sequencing (TGS) methods can sequence transcripts and complete genomes in single contiguous sequencing reads, providing innovative means to address practical topics surrounding viral transmission, evolution, and pathogenesis. TGS alleviates the computational challenges of assembling consensus genomes or reconstructing transcripts from fragmented reads, as is required with NGS libraries. Despite these advantages, TGS is an emerging technology and faces many technical challenges. High error rates make it difficult to distinguish machine errors from low-frequency mutations in the genome. Some of the best-known and most pervasive diseases in society are caused by viruses with ribonucleic acid (RNA) genomes, including but not limited to influenza viruses and coronaviruses. Progress towards a comprehensive understanding of RNA viruses has been hindered by their unique biology and high diversity, along with rapid replication and high mutation rates, which lead to important evolutionary signals residing in individual viral copies. Part of the high basecalling error rate in TGS can be attributed to unmodeled signal, e.g. basecallers call only the four canonical nucleobases (A, C, G, T/U) even though methylation and other nucleobase modifications also contribute to the signal. Accurately identifying the locations of such nucleobase modifications (i.e. modeling them at the signal level) would naturally lead to better nucleobase calling and provide insights into RNA virus biology. The few extant TGS tools in this area rely on deep-learning AI methods for reasons of computational tractability, and are demonstrably biased.
In contrast to such opaque methods, this work deploys new, efficient implementations of theoretically optimal (“dynamic programming”) methods for the segmentation, alignment, clustering, and consensus of Oxford Nanopore Technologies (ONT) TGS raw signal. Combined with follow-on statistical analyses of signal deviations within those results, these methods define a minimally biased, statistically grounded procedure for detecting unmodeled signal (i.e. putative nucleobase modifications or mutations), as demonstrated on multiple publicly available raw ONT direct RNA sequencing viral datasets.
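For readers unfamiliar with dynamic programming on signals, the alignment step named above can be illustrated with classic dynamic time warping (DTW). This is only a minimal sketch of the general technique: the abstract does not say which algorithm or cost function the thesis uses, so the function name `dtw_distance`, the absolute-difference cost, and the toy current values are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping between two numeric sequences.

    Hypothetical illustration only: absolute difference is an assumed cost,
    not necessarily the one used in the thesis.
    Runs in O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, advance in b only,
            # or advance in both sequences together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy "signals" (made-up values): the second repeats one level, as a pore
# dwelling longer on the same bases would.
reference = [80.0, 95.0, 110.0]
stretched = [80.0, 95.0, 95.0, 110.0]
print(dtw_distance(reference, stretched))
```

Because DTW allows one sample to match several, a signal that differs only in dwell time aligns at zero cost, while a genuine shift in current level accumulates cost. Deviations of that second kind are what a follow-on statistical analysis would examine.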
Keywords
Nanopore, Basecalling, SARS-CoV-2, RNA virus
Citation
Wang, S. (2023). Improved basecalling and base modification detection through signal-level analysis of nanopore direct RNA data (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.