Data Structures, Algorithms and Applications for Big Data Analytics: Single, Multiple and All Repeated Patterns Detection in Discrete Sequences

Date
2017
Abstract
This thesis focuses on the detection of single, multiple and all repeated patterns in sequences. Many algorithms exist for single pattern detection; they take the pattern to be detected as an input argument and return the position(s) where it occurs. However, to the best of my knowledge, there is nothing in the literature on all repeated patterns detection, i.e., the detection of every pattern that occurs at least twice in one or more sequences. This is a very important problem in science because the outcome can be used in various practical applications, e.g., forecasting in weather analysis or finance by detecting patterns that exhibit periodicity. The main difficulty in detecting all repeated patterns is that existing data structures in computer science do not scale well for this purpose because of their space and time complexity: to analyze sequences of Megabytes, the space required to construct the data structure and execute the algorithm can be of Terabyte magnitude. To overcome these problems, my research has focused on the simultaneous optimization of space and time complexity by introducing a new data structure (LERP-RSA), for which the mathematical foundation that guarantees its correctness and validity has also been built and proved. A unique, innovative algorithm (ARPaD), which takes advantage of the exceptional characteristics of this data structure and allows big data mining with space and time optimization, has also been created. Additionally, algorithms for single (SPaD) and multiple (MPaD) pattern detection have been created, based on the LERP-RSA, which outperform any other known algorithm for pattern detection in terms of efficiency and minimal resource usage. The combination of the innovative data structure and algorithm permits the analysis of any sequence of enormous size, greater than a trillion characters, in realistic time using conventional hardware.
Moreover, several methodologies and applications have been developed to provide solutions for many important problems in diverse scientific and commercial fields such as Finance, Event and Time Series, Bioinformatics, Marketing, Business, Clickstream Analysis, Data Stream Analysis, Image Analysis, Network Security and Mathematics.
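To make the problem statement concrete, the following minimal sketch illustrates what "all repeated patterns detection" asks for: every substring that occurs at least twice, together with its positions. It is a brute-force illustration only, not the LERP-RSA data structure or the ARPaD algorithm from the thesis; the function name `all_repeated_patterns` and the `max_len` parameter are hypothetical names chosen for this example. Because it stores every substring, its memory use grows roughly quadratically with the sequence length, which is the kind of scaling problem the abstract describes.

```python
from collections import defaultdict


def all_repeated_patterns(sequence, max_len=None):
    """Brute-force illustration (NOT LERP-RSA/ARPaD).

    Returns a dict mapping every substring that occurs at least twice
    in `sequence` to the sorted list of its start positions.
    `max_len` (hypothetical parameter) caps the pattern length examined.
    """
    n = len(sequence)
    if max_len is None:
        max_len = n

    # Record the start position of every substring up to max_len characters.
    positions = defaultdict(list)
    for start in range(n):
        longest = min(max_len, n - start)
        for length in range(1, longest + 1):
            positions[sequence[start:start + length]].append(start)

    # Keep only the patterns that repeat, i.e. occur at least twice.
    return {p: pos for p, pos in positions.items() if len(pos) >= 2}


result = all_repeated_patterns("abcabcab")
for pattern in sorted(result, key=lambda p: (len(p), p)):
    print(pattern, result[pattern])
```

Single pattern detection is the special case of looking up one key, for example `result.get("abc")`, whereas all repeated patterns detection must produce the whole dictionary without being told which patterns to look for.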
Keywords
Computer Science
Citation
Xylogiannopoulos, K. (2017). Data Structures, Algorithms and Applications for Big Data Analytics: Single, Multiple and All Repeated Patterns Detection in Discrete Sequences (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/25522