A generic execution management framework for long running jobs in grid environments

Date
2012
Abstract
Over the last decade, the grid has emerged as a paradigm of distributed and collaborative computing focusing on the sharing of computational and storage resources spanning geographical and organizational domains. Greater access to high-end computational facilities gives researchers from a broad spectrum of domains an inexpensive option for carrying out sophisticated computational experiments. However, the inherent dynamics and heterogeneity of grid environments make the execution of resource- and compute-intensive applications a challenging task. Increasing fault tolerance by checkpointing and migrating jobs between resources requires significant expertise and intervention from users. Automating such tasks allows users to focus more on the scientific results and less on the technical details.
This thesis addresses the issues associated with managing the execution of long-running applications in grid environments. It presents a generic framework for automating the execution of such applications. The framework is driven by a set of information models that capture knowledge about the resources and the applications. Crucial to the functioning of the framework is information on two application characteristics: configurability and memory usage behaviour. Separate models are presented to encode knowledge of each of these characteristics. Using a common representation of knowledge abstracts away the heterogeneity of both the resources and the applications, so the framework functions without being tailored to any specific application.
Two important issues that need to be considered in managing job execution are the amount of memory required by the job and the wait time the job may experience on a specific resource. The framework presented in this thesis is equipped with mechanisms to address both of these issues. It is able to estimate the wait time for jobs with different resource requirements.
A learning system has been designed as part of the framework to characterize the memory usage behaviour of application instances. The system facilitates execution management operations by providing accurate estimates of a job's memory usage.
Description
Bibliography: p. 188-202
Citation
Elahi, T. (2012). A generic execution management framework for long running jobs in grid environments (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/4747