Performance Modeling for Spark Applications

Date
2020-12-16
Abstract
The Apache Spark cluster computing platform is increasingly used to execute big data applications. This research explores two distinct but related problems pertaining to the performance of Spark applications. The first problem focuses on predicting the execution time of a Spark application. Predicting the execution time of an application before submitting it to the cluster allows a user to quickly estimate the amount of resources required to achieve a desired execution time target. Most existing prediction techniques require extensive historical executions of an application, making them time consuming, resource intensive, and hard to deploy in real-world settings. I address this problem by proposing a quick and lightweight analytic execution time prediction technique that requires only two reference executions of a given application. I show that the proposed technique provides accurate predictions and outperforms the baselines I consider.

The second problem considers Spark environments where applications encounter interference, i.e., contention for system resources shared with other applications. Spark operators often co-locate multiple applications on cluster nodes to improve resource utilization. However, this can lead to interference that adversely impacts an application's execution time. Several studies have proposed models to detect and diagnose such interference, but these models require extensive historical training data to be effective and can take considerable time to offer predictions. I conduct a systematic study to devise a machine learning (ML) based technique that diagnoses interference quickly and accurately without requiring extensive training data. Specifically, I explore techniques that allow the model to generalize to scenarios not captured in the training data, e.g., unseen applications and input data sizes, and to quickly offer online predictions without waiting for an interfered application to complete.
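To illustrate the flavor of analytic prediction from only two reference executions, the Python sketch below fits a simple two-parameter scaling model, a fixed serial cost plus a parallelizable cost that shrinks with the number of executors, from two reference runs and extrapolates to a target resource allocation. The model form, function names, and numbers are assumptions for illustration only, not the exact technique developed in the thesis.

# Hypothetical illustration: fit t(r) = a + b / r from two reference executions,
# where r is the executor count and t is the observed execution time in seconds.
# The model form and names are assumptions, not the thesis's exact technique.
def fit_two_point_model(r1, t1, r2, t2):
    """Solve for (a, b) given two reference executions (r1, t1) and (r2, t2)."""
    b = (t1 - t2) / (1.0 / r1 - 1.0 / r2)  # parallelizable cost
    a = t1 - b / r1                        # fixed (serial) cost
    return a, b

def predict_time(a, b, r):
    """Predict execution time at a target executor count r."""
    return a + b / r

# Example: reference runs at 2 and 4 executors, predict for 16 executors.
a, b = fit_two_point_model(2, 100.0, 4, 60.0)
print(round(predict_time(a, b, 16), 1))  # -> 30.0

Such a fit can be computed in microseconds once the two reference runs exist, which is what makes a lightweight, analytic approach attractive compared with techniques that need a large history of prior executions.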
Keywords
Big Data Platforms, Apache Spark, Performance Modeling, Machine Learning, Execution Time Prediction, Interference Detection
Citation
Shah, S. (2020). Performance Modeling for Spark Applications (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.