Performance Modeling for Spark Applications
dc.contributor.advisor | Krishnamurthy, Diwakar | |
dc.contributor.advisor | Amannejad, Yasaman | |
dc.contributor.author | Shah, Sarah | |
dc.contributor.committeemember | Shor, Roman J. | |
dc.contributor.committeemember | Uddin, Gias | |
dc.date | 2021-02 | |
dc.date.accessioned | 2021-01-07T19:55:55Z | |
dc.date.available | 2021-01-07T19:55:55Z | |
dc.date.issued | 2020-12-16 | |
dc.description.abstract | The Apache Spark cluster computing platform is being increasingly used to execute big data applications. This research explores two distinct but related problems pertaining to the performance of Spark applications. The first problem focuses on predicting the execution time of a Spark application. Predicting the execution time of an application before submitting it to the cluster will allow a user to quickly estimate the right amount of resources required to achieve a desired execution time target. Most existing prediction techniques require extensive historical executions of an application, thus being time consuming and requiring extensive cluster resources, making it hard for them to be deployed in real-world settings. I address this problem by proposing a quick and lightweight analytic execution time prediction technique that only requires two reference executions of any given application to offer predictions. I show that the proposed technique provides accurate predictions and outperforms other baselines I consider. The second problem considers Spark environments where applications encounter interference, i.e., contention for system resources shared with other applications. Spark operators often co-locate multiple applications on cluster nodes to improve resource utilization. However, this can lead to interference thereby adversely impacting an application's execution time. Several studies have proposed models to detect and diagnose such interference. However, these models require extensive historical training data to be effective and can take considerable time to offer predictions. I conduct a systematic study to devise a machine learning (ML) based technique that can diagnose interference quickly and accurately without requiring extensive training data. Specifically, I explore techniques that would allow the model to generalize to scenarios not captured in the training data, e.g., unseen applications and input data sizes, and to quickly offer online predictions without waiting for an interfered application to complete. | en_US |
dc.identifier.citation | Shah, S. (2020). Performance Modeling for Spark Applications (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. | en_US |
dc.identifier.doi | http://dx.doi.org/10.11575/PRISM/38534 | |
dc.identifier.uri | http://hdl.handle.net/1880/112942 | |
dc.language.iso | eng | en_US |
dc.publisher.faculty | Schulich School of Engineering | en_US |
dc.publisher.institution | University of Calgary | en |
dc.rights | University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. | en_US |
dc.subject | Big Data Platforms | en_US |
dc.subject | Apache Spark | en_US |
dc.subject | Performance Modeling | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Execution Time Prediction | en_US |
dc.subject | Interference Detection | en_US |
dc.subject.classification | Engineering--Electronics and Electrical | en_US |
dc.title | Performance Modeling for Spark Applications | en_US |
dc.type | master thesis | en_US |
thesis.degree.discipline | Engineering – Electrical & Computer | en_US |
thesis.degree.grantor | University of Calgary | en_US |
thesis.degree.name | Master of Science (MSc) | en_US |
ucalgary.item.requestcopy | true | en_US |
Files
Original bundle
- Name: ucalgary_2020_shah_sarah.pdf
- Size: 1.41 MB
- Format: Adobe Portable Document Format
- Description: Thesis Document
License bundle
- Name: license.txt
- Size: 2.62 KB
- Format: Item-specific license agreed upon to submission