Selectome (phase 1)

Long Title: Selectome: positive Darwinian selection on the Grid - Phase 1
Université de Lausanne
Swiss Institute of Bioinformatics (unfunded partner)
Domain: Grid
Status: finished
Start Date: 01.07.2010
End Date: 31.12.2011
Project Leader: H. Stockinger
Deputy Project Leaders: N. Salamin, M. Robinson-Rechavi
Website: http://selectome.unil.ch

Selectome has designed, developed and deployed a scientific bioinformatics application on the sustainable Grid infrastructure of SMSCG, as an essential ingredient for a functional Grid environment in Switzerland. This project also provides the necessary expertise to support the targeted life science user community.

See also Selectome Phase 2


The project provides a software tool called gcodeml that uses a computational Grid to run the CPU-intensive calculations needed to study the evolution of species. It enables the Selectome team at UNIL to create future versions of Selectome using the computational Grid resources of Swiss scientific and academic partners. In this way, life scientists will have an up-to-date Selectome database with an easy-to-use web interface to biological knowledge.
Grid/Selectome has selected GC3's gc3pie framework as the underlying software on which to build a fault-tolerant submission system for codeml jobs.

Component | Description
Software package gcodeml | The software is part of the gc3pie distribution and available in the SVN repository of the gc3pie project. Additionally, a copy of the gcodeml code can be found in the private SVN of SIB/Vital-IT. It is available on request: selectome@unil.ch
Selectome: a Database of Positive Selection | Selectome web site: the main entry point

UNIL, SIB/Vital-IT and UZH/GC3 will extend the existing software based on gc3pie from UZH to run the Selectome workflow in a full production environment during Phase 2.


The overall scientific goals are to:

  • provide up-to-date data (based on CPU intensive calculations) for the Selectome database and
  • design, develop and deploy a high-throughput Grid version of Codeml.

Positive selection analysis (i.e. use of the PAML/Codeml package) is applied by several thousand scientists in the life sciences (PAML has more than 2700 citations). There is therefore a large potential user community for Selectome within and outside Switzerland. The SMSCG project is seen as a major driver for calculating new data sets and providing a better service to this user community.

The envisaged scientific application is based on the concept of Darwinian selection, which is the force that drives evolutionary diversification and functional changes for living beings. The group of Prof. Marc Robinson-Rechavi has developed and operates a database of such Darwinian selection, which is called Selectome and is freely available to scientists of the life science community world-wide.
The actual data behind the database is pre-calculated using phylogenetic approaches and in particular with the reference software called Codeml (PAML package).
Although the database and service are already online, existing data must be updated regularly and many more data sets remain to be pre-calculated. The group does not have sufficient CPU resources to process all the data. Because the problem is embarrassingly parallel, it is of major interest to run the computation on the SMSCG computing Grid. In addition to being CPU intensive, the application generates considerable amounts of data (on the order of several hundred GB) that need to be properly managed.


The technical programme is divided into three technical work packages which are accompanied by a management work package:

  1. Grid-enabling of Selectome and Codeml: command line-based, fault-tolerant gridification of Codeml as the computational back-end to Selectome.
  2. Simulation and production runs of Selectome/Codeml: production runs on small data sets, such as fungi, and on large data sets.
  3. Life science application and user support: in addition to supporting the phylogenetic applications Selectome/Codeml, other life science applications (molecular modelling, sequence search and analysis, proteomics etc.) and users will be supported. Other applications will be considered in cooperation with the SMSCG-II project.