ETHZ.7

VM-MAD

Long Title: Virtual Machines Management and Advanced Deployment
Leading Organization: ETH Zürich
Participating Organizations: Universität Zürich; SWITCH - Teleinformatikdienste für Lehre und Forschung
Domain: Grid
Status: finished
Start Date: 15.02.2011
End Date: 15.07.2012
Project Leader: P. Kunszt
Deputy Project Leader: Ch. Panse

User communities running complex applications with many dependencies can benefit from simple mechanisms, based on Virtual Machine (VM) technology, to deploy complex scientific applications on heterogeneous hardware and software resources. University IT departments and research IT supporters can also profit from minimized reinstallation times, perfect encapsulation and greatly increased manageability.

Results

  • Public Resource: Repository of the code produced by the VM-MAD project (public)
  • Online Documentation: Documentation with installation instructions, module descriptions and available commands (public)
  • Project Wiki: Internal project wiki (protected, open by request to project partners and universities)
  • Orchestrator: 'Passive' component that monitors the state of the LRMS (compute cluster) and adds or removes compute nodes based on a set of policies defined by the system administrator (implemented as a set of Python classes)
  • VM Policies: Example of a local policy (FGCZ), defined in a Python class (start, stop, running jobs)
  • Batch System Interface: Sun Grid Engine monitoring module and a module that reads accounting files from the LRMS to simulate cloud/grid behavior with real-world data under different system configurations and parameters
  • Provider Interface: Interface using Apache Libcloud, which supports Amazon EC2, Rackspace, GoGrid and others; it can also use SMSCG through the GC3Pie framework (a minimal usage sketch follows this list)
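
To make the Provider Interface entry above more concrete, the following is a minimal sketch of starting and stopping an Amazon EC2 worker node through Apache Libcloud; the function names, credentials and image/size identifiers are placeholders and are not taken from the VM-MAD code base.

    # Minimal provider-interface sketch based on Apache Libcloud.
    # Credentials, region, image and size IDs are placeholders, not VM-MAD values.
    from libcloud.compute.providers import get_driver
    from libcloud.compute.types import Provider

    def connect_ec2(access_key, secret_key, region="us-east-1"):
        """Return a Libcloud driver connected to Amazon EC2."""
        driver_cls = get_driver(Provider.EC2)
        return driver_cls(access_key, secret_key, region=region)

    def start_worker(driver, name, image_id, size_id):
        """Start one compute node from a prepared virtual appliance image."""
        image = driver.list_images(ex_image_ids=[image_id])[0]
        size = [s for s in driver.list_sizes() if s.id == size_id][0]
        return driver.create_node(name=name, image=image, size=size)

    def stop_worker(driver, node):
        """Terminate a compute node that is no longer needed."""
        driver.destroy_node(node)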

The project delivered a solution and a set of procedures and best practices allowing a local resource provider to cloud-burst towards a public/private cloud provider and towards the SMSCG infrastructure.
The final result is a mechanism to seamlessly and dynamically expand a computational batch cluster in response to peak loads and/or urgent situations where the immediate availability of computational resources is of critical importance.
Know-how and experience in configuring and operating a cloud-bursting computational cluster, as well as in monitoring and controlling the cloud-bursting features, have been acquired. A set of technical documents and procedural guidelines has been published.
The project focused on the FGCZ use case, and a set of site-specific virtual appliances has been created. These appliances have been used to cloud-burst the FGCZ computational cluster towards both the Amazon EC2 and SMSCG infrastructures. The approach, the software components and the methodology have been made publicly available so that other resource providers can benefit.
A cloud-burst simulator has also been developed as part of the software stack that dynamically controls the cloud-bursting capability of a site. Such a simulator can be used by a local provider to test and verify the potential advantages of a cloud-bursting policy with its own real usage data.
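
As an illustration of how such a simulator can work, the sketch below replays historical job records against a simple queue-wait threshold; the record format and the policy are assumptions made for this example, not the actual VM-MAD accounting parser or policy classes.

    # Hypothetical cloud-burst simulation: replay accounting records against a
    # simple policy to estimate how many extra nodes would have been started.
    def simulate(jobs, max_wait, nodes_per_burst):
        """jobs: iterable of (submit_time, start_time, end_time) in seconds."""
        started_nodes = 0
        for submit, start, end in jobs:
            wait = start - submit
            # If a job waited longer than the policy tolerates, assume the
            # orchestrator would have burst additional nodes into the cloud.
            if wait > max_wait:
                started_nodes += nodes_per_burst
        return started_nodes

    if __name__ == "__main__":
        # Toy accounting data: the third job waited two hours in the queue.
        history = [(0, 60, 600), (100, 200, 900), (200, 7400, 9000)]
        print(simulate(history, max_wait=3600, nodes_per_burst=2))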


Goals

Virtualization offers many benefits to cluster administrators and end users: it helps to manage increasingly complex software stacks and reduces the effort needed to migrate software to the latest environments. These advantages are especially pronounced for high-throughput applications that do not need parallel processing, which make up the vast majority of scientific codes.
Once the cluster environment can be virtualized, the same can be done in a Grid context and extended to commercial cloud environments. In order to reuse virtual machines locally and across multiple Grid sites (and commercial clouds), a repository of Virtual Machines (VM-Repo) must be established, as well as a mechanism to select the right VMs and to submit them to the individual Grid sites. Finally, the VMs must be made available to the end user through a dedicated, dynamic batch queue.
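
As a rough illustration of the VM-Repo idea, the sketch below shows how listing and selecting images by name and version could look; the metadata fields and methods are invented for this example and do not reflect the actual VM-MAD repository API.

    # Hypothetical VM image repository sketch; fields and lookup logic are
    # illustrative assumptions, not the VM-MAD repository implementation.
    class VMRepository:
        def __init__(self):
            self._images = []  # metadata records for the available VM images

        def upload(self, name, version, location):
            self._images.append({"name": name, "version": version,
                                 "location": location})

        def list(self):
            return list(self._images)

        def select(self, name, version=None):
            """Return a matching image, optionally pinned to a version."""
            candidates = [i for i in self._images if i["name"] == name]
            if version is not None:
                candidates = [i for i in candidates if i["version"] == version]
            if not candidates:
                raise LookupError("no matching VM image for %s" % name)
            return candidates[-1]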

Benefits

Benefits to the scientific end-user community:

  • access to a much larger number of heterogeneous resources with no extra effort
  • easy migration between infrastructure updates
  • abstraction of infrastructure dependencies
  • gaining more control on installation and dependency management of scientific codes
  • easy sharing of codes among peers
  • easy versioning (running multiple versions simultaneously)

Benefits to the IT supporters and IT infrastructure providers:

  • perfect encapsulation of codes
  • easy dependency and migration management
  • possible access to external resources in case of system overload

Steps

In summary, we will

  • build a repository of preconfigured VMs
  • develop and validate VM templates
  • integrate with the SMSCG Grid, especially the Grid schedulers
  • integrate with local schedulers to launch the VMs and to execute jobs in them
  • integrate with existing accounting and monitoring mechanisms
  • re-use existing technology as much as possible

The following components will be established:

  • Orchestrator module (responsible for connecting all the other modules and instructing them according to its functional workflows; see the sketch after this list)
  • VM Provider (central component; needs to be interfaced with all the other components; manages and controls the execution of VMs; also combines both local computing resources and remote cloud resources, according to allocation policies)
  • VM Repository (stores VM Images; an API to list, select, upload, stage, delete, and generally manage the VM Images is foreseen; each local resource provider should have its own repository to store the VMs that are used locally)
  • LRMS interface (embodies the interface with the Local Resource Management System and the necessary configuration for exposing the VMs as an integral part of the LRMS)
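
The sketch below illustrates how these components could be wired together, with the Orchestrator polling the LRMS interface and asking a site-defined policy whether the VM Provider should start or stop nodes; all class and method names are invented for the example and are not the actual VM-MAD interfaces.

    # Hypothetical orchestrator loop; names and thresholds are illustrative
    # assumptions, not the VM-MAD implementation.
    import time

    class ThresholdPolicy:
        """Start a node when too many jobs are queued, stop one when idle."""
        def __init__(self, max_queued=10):
            self.max_queued = max_queued

        def decide(self, queued_jobs, running_nodes):
            if queued_jobs > self.max_queued:
                return "start"
            if queued_jobs == 0 and running_nodes > 0:
                return "stop"
            return "hold"

    def orchestrate(lrms, provider, policy, interval=60):
        """Poll the batch system and grow or shrink the node pool."""
        nodes = []
        while True:
            queued = lrms.count_queued_jobs()           # LRMS interface (e.g. SGE)
            action = policy.decide(queued, len(nodes))  # site-defined policy
            if action == "start":
                nodes.append(provider.start_worker())   # VM Provider (cloud/SMSCG)
            elif action == "stop":
                provider.stop_worker(nodes.pop())
            time.sleep(interval)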
