SUPPLEMENTARY MATERIAL
A fully Bayesian model to cluster gene expression profiles
C. Vogl1*, F. Sanchez-Cabo2*, G. Stocker2, S. Hubbard3, O. Wolkenhauer4 and Z. Trajanoski2
 
1 Institute of Animal Breeding and Genetics, Veterinaermedizinische Universitaet Wien, 1210 Vienna, Austria
2 Institute for Genomics and Bioinformatics, Graz University of Technology, 8010 Graz, Austria
3 Faculty of Life Sciences, University of Manchester, M60 1QD Manchester, UK
4 Institute for Informatic, University of Rostock, 18051 Rostock, Germany
* These authors contributed equally
 

ABSTRACT

 Cell cycle, organ development, and cellular differentiation involve regular cascades of changes in gene expression. With cDNA or oligonucleotide chips, these changes can be monitored simultaneously for most genes in a genome. After proper normalization of the data, genes are often grouped into co-expressed classes (clusters) to identify subgroups of genes that share common regulatory elements, a common function, or a common cellular origin. We propose a fully probabilistic Bayesian model to cluster gene expression profiles. The number of classes does not need to be specified in advance; instead, it is adjusted dynamically using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm. In addition, the imputation of missing values is integrated into the model. Simulated data were used to assess the performance of the algorithm. Specificity was very high while sensitivity was around 50%, outperforming the traditional k-means algorithm. Clusters obtained from data sets with and without missing values were highly similar. The method is especially useful for determining which genes are likely to be involved in the same biological process as a given gene, or for identifying genes that exhibit a pre-determined profile relevant to the process under study.
 
RESULTS

 
Rates of false positives and false negatives for all simulated data sets and percentage of genes shared between the clusters with and without missing values.
Rates of false positives and false negatives for the k-means clusters corresponding to the simulated clusters.
Simulated gene expression profiles in clusters 1-4 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 5-8 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 9-12 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 13-16 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 17-20 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 21-24 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 25-28 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 29-32 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 33-36 compared to the RJMCMC-based clustering.
Simulated gene expression profiles in clusters 37-40 compared to the RJMCMC-based clustering.
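
The false positive and false negative rates listed above compare each simulated cluster with the cluster recovered by the algorithm. The following Python sketch shows one way such per-cluster rates, and the percentage of genes shared between two clusters, could be computed. It is an illustration only, not the authors' C++ or R code: the function names, the gene identifiers, and the choice to measure sharing as intersection over union are all assumptions.

```python
def cluster_error_rates(true_members, found_members, all_genes):
    """Compare one simulated (true) cluster with one recovered cluster.

    All three arguments are collections of gene identifiers.
    The names and the return layout are hypothetical.
    """
    true_set = set(true_members)
    found_set = set(found_members)
    universe = set(all_genes)

    tp = len(true_set & found_set)               # correctly assigned genes
    fp = len(found_set - true_set)               # assigned but not in the true cluster
    fn = len(true_set - found_set)               # in the true cluster but missed
    tn = len(universe - true_set - found_set)    # correctly left out

    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
        "false_negative_rate": fn / (fn + tp) if fn + tp else nan,
    }


def percent_shared(cluster_a, cluster_b):
    """Percentage of genes common to both clusters.

    Assumption: sharing is measured relative to the union of the two
    clusters; the source does not state which denominator was used.
    """
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else float("nan")


# Toy example with ten hypothetical genes.
genes = [f"g{i}" for i in range(10)]
rates = cluster_error_rates(["g0", "g1", "g2", "g3"], ["g0", "g1", "g4"], genes)
shared = percent_shared(["g0", "g1", "g2"], ["g0", "g1", "g4"])
```

In the toy example, two of the four true members are recovered and one gene is wrongly included, so the sensitivity is 0.5 and the specificity is 5/6.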
 
SOFTWARE

  The C++ program and the R code for the post-Bayesian analysis can be found here.
 
FURTHER INFORMATION

 
Fatima Sanchez-Cabo: F.Sanchez-Cabo@postgrad.umist.ac.uk
  f.sanchezcabo@tugraz.at
Claus Vogl: claus.vogl@vu-wien.ac.at
  i122server.vu-wien.ac.at/~vogl/index.html