Performance Modeling for Spark Applications

dc.contributor.advisor: Krishnamurthy, Diwakar
dc.contributor.advisor: Amannejad, Yasaman
dc.contributor.author: Shah, Sarah
dc.contributor.committeemember: Shor, Roman J.
dc.contributor.committeemember: Uddin, Gias
dc.date: 2021-02
dc.date.accessioned: 2021-01-07T19:55:55Z
dc.date.available: 2021-01-07T19:55:55Z
dc.date.issued: 2020-12-16
dc.description.abstract: The Apache Spark cluster computing platform is increasingly used to execute big data applications. This research explores two distinct but related problems pertaining to the performance of Spark applications. The first problem focuses on predicting the execution time of a Spark application. Predicting the execution time of an application before submitting it to the cluster allows a user to quickly estimate the amount of resources required to achieve a desired execution time target. Most existing prediction techniques require extensive historical executions of an application, which is time-consuming and demands substantial cluster resources, making them hard to deploy in real-world settings. I address this problem by proposing a quick and lightweight analytic execution time prediction technique that requires only two reference executions of any given application to offer predictions. I show that the proposed technique provides accurate predictions and outperforms the other baselines I consider. The second problem considers Spark environments where applications encounter interference, i.e., contention for system resources shared with other applications. Spark operators often co-locate multiple applications on cluster nodes to improve resource utilization. However, this can lead to interference, thereby adversely impacting an application's execution time. Several studies have proposed models to detect and diagnose such interference. However, these models require extensive historical training data to be effective and can take considerable time to offer predictions. I conduct a systematic study to devise a machine learning (ML) based technique that can diagnose interference quickly and accurately without requiring extensive training data. Specifically, I explore techniques that allow the model to generalize to scenarios not captured in the training data, e.g., unseen applications and input data sizes, and to quickly offer online predictions without waiting for an interfered application to complete. [en_US]
dc.identifier.citation: Shah, S. (2020). Performance Modeling for Spark Applications (Master's thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. [en_US]
dc.identifier.doi: http://dx.doi.org/10.11575/PRISM/38534
dc.identifier.uri: http://hdl.handle.net/1880/112942
dc.language.iso: eng [en_US]
dc.publisher.faculty: Schulich School of Engineering [en_US]
dc.publisher.institution: University of Calgary [en]
dc.rights: University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. [en_US]
dc.subject: Big Data Platforms [en_US]
dc.subject: Apache Spark [en_US]
dc.subject: Performance Modeling [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Execution Time Prediction [en_US]
dc.subject: Interference Detection [en_US]
dc.subject.classification: Engineering--Electronics and Electrical [en_US]
dc.title: Performance Modeling for Spark Applications [en_US]
dc.type: master thesis [en_US]
thesis.degree.discipline: Engineering – Electrical & Computer [en_US]
thesis.degree.grantor: University of Calgary [en_US]
thesis.degree.name: Master of Science (MSc) [en_US]
ucalgary.item.requestcopy: true [en_US]
Files
Original bundle
Name: ucalgary_2020_shah_sarah.pdf
Size: 1.41 MB
Format: Adobe Portable Document Format
Description: Thesis Document
License bundle
Name: license.txt
Size: 2.62 KB
Format: Item-specific license agreed upon to submission
Description: