Leveraging Feature Exploitation to Automate Practical Machine Learning with Text, Image and Tabular Data

Date
2021-02-18
Abstract
The amount of data generated in tabular, text, and image form is growing rapidly. Machine Learning (ML) is a powerful paradigm for supporting knowledge discovery, turning generated data into knowledge that is useful for decision-making. Methods that identify the important features in different applications are therefore essential. To this end, this dissertation investigates four distinct problems in which ML is applied to prediction tasks on text, image, and tabular data. The first two problems, which center on tabular data, share the overarching goal of improving Health-Related Quality of Life (HRQoL) in the treatment and care of prostate cancer patients. Specifically, I first propose a cluster-based method that exploits the features most important for the desired output. In the second problem, my objective is to identify the minimal set of important features that can predict 1-year follow-up HRQoL while adding interpretability to the proposed model. Using information on 5093 patients with 1500 measures, the results support the use of the proposed ML technique as an essential tool for identifying predictive features and interpreting the findings. The third study uses Natural Language Processing (NLP) to propose a test case failure prediction approach for manual testing that can serve as a specification-based heuristic for test selection, prioritization, and reduction. I show that a simple linear regression model using the extracted NLP-based feature together with a typical history-based feature can accurately predict test case failures in new releases. A comparison of several proposed approaches on 41 releases of Mozilla Firefox across three projects shows that the NLP-based feature can improve the prediction models. The last study focuses on image analysis for velocity picking in seismic data. Velocity analysis is a time-consuming task that is mostly performed manually. I develop a novel data-driven ensembling strategy for combining geophysical models with a Convolutional Neural Network (CNN), which uses spatiotemporally varying image data for training and prediction. I perform extensive experiments on nine field datasets and demonstrate better performance than the current state-of-the-art method.
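As a rough illustration of the third study's idea (a linear regression over one NLP-based feature and one history-based feature), the following minimal sketch fits ordinary least squares in plain Python. The feature names (`nlp_similarity`, `past_failure_rate`), the toy data, and the helper functions are all hypothetical assumptions made for this example. They are not taken from the dissertation and do not reproduce its features, data, or results.

```python
# Illustrative sketch only. Feature names and data are hypothetical
# assumptions, not the dissertation's actual features or measurements.


def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    # Design matrix with a leading 1.0 column for the intercept.
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    # The last column now holds [intercept, weight_1, ..., weight_k].
    return [A[i][n] for i in range(n)]


def predict(weights, row):
    """Linear score for one test case; higher means more likely to fail."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Each row: (nlp_similarity, past_failure_rate) for one manual test case.
# Target: 1 if the test case failed in the release, 0 otherwise (toy values).
rows = [(0.9, 0.8), (0.8, 0.6), (0.2, 0.1), (0.1, 0.3), (0.7, 0.2), (0.3, 0.7)]
targets = [1, 1, 0, 0, 1, 0]

weights = fit_linear(rows, targets)
high_risk = predict(weights, (0.9, 0.8))
low_risk = predict(weights, (0.1, 0.1))
print(high_risk > low_risk)
```

Ranking test cases by such a score is one way a prediction model could feed test selection or prioritization. How the dissertation's approach thresholds or ranks the predictions is not stated in the abstract.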
Citation
Sharifi, F. (2021). Leveraging Feature Exploitation to Automate Practical Machine Learning with Text, Image and Tabular Data (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.