PUMA Project


Microarray analysis allows the simultaneous evaluation of thousands of gene expression levels. In recent years both cDNA microarrays and oligonucleotide-based arrays have become established technologies for genomic analysis across biology. Once the biological processes are complete, the raw data generated from both the oligonucleotide arrays and the cDNA arrays consists of an image. This proposal concerns itself with the processing of the arrays once the image has been generated. We will use probabilistic models to extract information from the data, beginning with the image analysis, followed by normalisation and finally data mining. We will use Bayesian methods to integrate and propagate errors inherent in each stage of the probabilistic modelling process, eventually providing the biologist with levels of confidence for inferences made from the data.

The project is sponsored by BBSRC (Project Ref BBS/B/0076X) and is a collaboration with Dr Marta Milo, Wellcome Trust Fellow at the University of Sheffield; Dr Xuejun Liu of Nanjing University of Aeronautics & Astronautics (former PhD student); Dr Guido Sanguinetti of the University of Sheffield (former post-doc); and Dr Antti Honkela of Helsinki University of Technology (visitor and collaborator).

Personnel from ML@SITraN


The following software has been made available either wholly or partly as a result of work on this project:


M. Rattray, X. Liu, G. Sanguinetti, M. Milo and N. D. Lawrence. (2006) “Propagating uncertainty in microarray data analysis” in Briefings in Bioinformatics 7 (1), pp 37–47 [Errata][PDF][Pubmed][Google Scholar Search]


Microarray technology is associated with many sources of experimental uncertainty. In this review we discuss a number of approaches for dealing with this uncertainty in the processing of data from microarray experiments. We focus here on the analysis of high-density oligonucleotide arrays, such as the popular Affymetrix GeneChip® array, which contain multiple probes for each target. This set of probes can be used to determine an estimate for the target concentration and can also be used to determine the experimental uncertainty associated with this measurement. This measurement uncertainty can then be propagated through the downstream analysis using probabilistic methods. We give examples showing how these credibility intervals can be used to help identify differential expression, to combine information from replicated experiments and to improve the performance of principal component analysis.
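As a rough illustration of combining information from replicated experiments, the sketch below pools replicate estimates by inverse-variance weighting. It assumes independent Gaussian measurement errors with known variances, which is a simplification and not the model used in the review; the function names and the example numbers are hypothetical.

```python
import math


def combine_replicates(means, variances):
    """Pool replicate estimates of one gene's log-expression.

    Assumes each replicate is an independent Gaussian estimate with known
    variance (an illustrative simplification, not the review's model).
    Returns the precision-weighted mean and its variance.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need equally sized, non-empty inputs")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    pooled_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return pooled_mean, 1.0 / total_precision


def credibility_interval(mean, variance, z=1.96):
    """Approximate 95% interval under the same Gaussian assumption."""
    half_width = z * math.sqrt(variance)
    return mean - half_width, mean + half_width


# Hypothetical log2 expression estimates for one gene on three arrays.
pooled_mean, pooled_var = combine_replicates([7.9, 8.4, 8.1], [0.04, 0.25, 0.09])
low, high = credibility_interval(pooled_mean, pooled_var)
print(f"pooled estimate {pooled_mean:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Replicates with small variance dominate the pooled estimate, and the pooled variance is never larger than that of the most precise replicate.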

The following conference publications are associated with this project.

G. Sanguinetti, M. Rattray and N. D. Lawrence. (2006) “Identifying submodules of cellular regulatory networks” in International Conference on Computational Methods in Systems Biology, Springer-Verlag. [DOI][Google Scholar Search]


Recent high throughput techniques in molecular biology have brought about the possibility of directly identifying the architecture of regulatory networks on a genome-wide scale. However, the computational task of estimating fine-grained models on a genome-wide scale is daunting. Therefore, it is of great importance to be able to reliably identify submodules of the network that can be effectively modelled as independent subunits. In this paper we present a procedure to obtain submodules of a cellular network by using information from gene-expression measurements. We integrate network architecture data with genome-wide gene expression measurements in order to determine which regulatory relations are actually confirmed by the expression data. We then use this information to obtain non-trivial submodules of the regulatory network using two distinct algorithms, a naive exhaustive algorithm and a spectral algorithm based on the eigendecomposition of an affinity matrix. We test our method on two yeast biological data sets, using regulatory information obtained from chromatin immunoprecipitation.

N. D. Lawrence, G. Sanguinetti and M. Rattray. (2007) “Modelling transcriptional regulation using Gaussian processes” in B. Schölkopf, J. C. Platt and T. Hofmann (eds) NIPS, MIT Press, Cambridge, MA, pp 785–792. [Errata][Software][Gzipped Postscript][PDF][Google Scholar Search]


Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian Process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
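The abstract does not state the model equations. For illustration only, the sketch below assumes a simple linear activation model, dx/dt = B + S f(t) - D x(t), where x is a target gene's mRNA level, f the transcription factor activity, B the baseline expression, S the sensitivity and D the mRNA decay rate. It integrates that equation with an Euler scheme for a fixed, hypothetical f(t); it does not perform the Gaussian process inference described in the paper, and all names and parameter values are assumptions.

```python
import math


def simulate_mrna(tf_activity, baseline, sensitivity, decay, x0, t_end, dt=0.01):
    """Euler integration of dx/dt = baseline + sensitivity * f(t) - decay * x.

    The equation is an assumed linear activation model; tf_activity is any
    callable f(t). Returns lists of time points and mRNA levels.
    """
    n_steps = int(round(t_end / dt))
    times, levels = [0.0], [x0]
    x = x0
    for step in range(n_steps):
        t = step * dt
        x += dt * (baseline + sensitivity * tf_activity(t) - decay * x)
        times.append((step + 1) * dt)
        levels.append(x)
    return times, levels


def pulse(t):
    """Hypothetical transcription factor activity: a smooth pulse at t = 4."""
    return math.exp(-0.5 * (t - 4.0) ** 2)


# Start at the no-activity steady state, baseline / decay = 0.25.
times, levels = simulate_mrna(
    pulse, baseline=0.2, sensitivity=1.5, decay=0.8, x0=0.25, t_end=12.0
)
peak = max(levels)
print(f"peak mRNA level {peak:.3f} at t = {times[levels.index(peak)]:.2f}")
```

Under this assumed model, a gene with a larger decay rate tracks the transcription factor more closely, while a smaller decay rate smooths and delays the response. That is why the sensitivities and decay rates matter when inferring f from expression data.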

R. D. Pearson, X. Liu, G. Sanguinetti, M. Milo, N. D. Lawrence and M. Rattray. (2009) “Puma: a Bioconductor package for propagating uncertainty in microarray analysis” in BMC Bioinformatics 10 (211) [Pubmed][DOI][Google Scholar Search]


Background: Most analyses of microarray data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analyses it has been shown that we can improve results of differential expression detection, principal component analysis and clustering. Previously, implementations of these uncertainty propagation methods have only been available as separate packages, written in different languages. Previous implementations have also suffered from being very costly to compute, and in the case of differential expression detection, have been limited in the experimental designs to which they can be applied.

Results: puma is a Bioconductor package incorporating a suite of analysis methods for use on Affymetrix GeneChip data. puma extends the differential expression detection methods of previous work from the 2-class case to the multi-factorial case. puma can be used to automatically create design and contrast matrices for typical experimental designs, which can be used both within the package itself but also in other Bioconductor packages. The implementation of differential expression detection methods has been parallelised leading to significant decreases in processing time on a range of computer architectures. puma incorporates the first R implementation of an uncertainty propagation version of principal component analysis, and an implementation of a clustering method based on uncertainty propagation. All of these techniques are brought together in a single, easy-to-use package with clear, task-based documentation.

Conclusions: For the first time, the puma package makes a suite of uncertainty propagation methods available to a general audience. These methods can be used to improve results from more traditional analyses of microarray data. puma also offers improvements in terms of scope and speed of execution over previously available methods. puma is recommended for anyone working with the Affymetrix GeneChip platform for gene expression analysis and can also be applied more generally.

P. Gao, A. Honkela, M. Rattray and N. D. Lawrence. (2008) “Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities” in Bioinformatics 24, pp i70–i75 [Software][PDF][DOI][Google Scholar Search]


Motivation: Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalisation of these latent chemical species through Gaussian process priors.

Results: We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarse-grained discretization of continuous-time functions, which would lead to a large number of additional parameters to be estimated. We develop efficient exact and approximate inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference.

Availability: The software and data for recreating all the experiments in this paper are available in MATLAB from http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/gpsim

Contact: Neil Lawrence

G. Sanguinetti, N. D. Lawrence and M. Rattray. (2006) “Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities” in Bioinformatics 22 (22), pp 2275–2281 [Errata][Software][PDF][Pubmed][DOI][Google Scholar Search]


Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. Recent experimental high-throughput techniques such as Chromatin Immunoprecipitation provide important information about the architecture of the regulatory networks in the cell. However, it is very difficult to measure the concentration levels of transcription factor proteins and determine their regulatory effect on gene transcription. It is therefore an important computational challenge to infer these quantities using gene expression data and network architecture data.

Results: We develop a probabilistic state space model that allows genome-wide inference of both transcription factor protein concentrations and their effect on the transcription rates of each target gene from microarray data. We use variational inference techniques to learn the model parameters and perform posterior inference of protein concentrations and regulatory strengths. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates, as well as providing a tool to detect which binding events lead to significant regulation. We demonstrate our model on artificial data and on two yeast data sets in which the network structure has previously been obtained using Chromatin Immunoprecipitation data. Predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell.

Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.

X. Liu, M. Milo, N. D. Lawrence and M. Rattray. (2006) “Probe-level measurement error improves accuracy in detecting differential gene expression” in Bioinformatics 22 (17), pp 2107–2113 [Errata][PDF][Pubmed][DOI][Google Scholar Search]


Motivation: Finding differentially expressed genes is a fundamental objective of a microarray experiment. Numerous methods have been proposed to perform this task. Existing methods are based on point estimates of gene expression level obtained from each microarray experiment. This approach discards potentially useful information about measurement error that can be obtained from an appropriate probe-level analysis. Probabilistic probe-level models can be used to measure gene expression and also provide a level of uncertainty in this measurement. This probe-level variance provides useful information which can help in the identification of differentially expressed genes.

Results: We propose a Bayesian method to include probe-level variances into the detection of differentially expressed genes from replicated experiments. A variational approximation is used for efficient parameter estimation. We compare this approximation with MAP and MCMC parameter estimation in terms of computational efficiency and accuracy. The method is used to calculate the probability of positive log-ratio (PPLR) of expression levels between conditions. Using the measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we test PPLR on a spike-in data set and a mouse time-course data set. Results show that the inclusion of probe-level measurement error improves accuracy in detecting differential gene expression.

Availability: The methods described in this paper have been implemented in an R package pplr that is currently available from http://umber.sbs.man.ac.uk/resources/puma.

Contact: Magnus Rattray
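A minimal sketch of the PPLR idea, assuming the posterior of each condition's mean log-expression is an independent Gaussian. This is a simplification: the paper estimates these posteriors with a variational approximation, which is not reproduced here. The function name `pplr` and the example numbers are illustrative only.

```python
import math


def pplr(mean_a, var_a, mean_b, var_b):
    """Probability that condition A's log-expression exceeds condition B's.

    Assumes independent Gaussian posteriors for the two condition means,
    so their difference is Gaussian with mean (mean_a - mean_b) and
    variance (var_a + var_b).
    """
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Same difference in means, but different posterior variances.
# Values near 1 suggest up-regulation in A, values near 0 suggest
# down-regulation, and values near 0.5 indicate no evidence of a change.
print(round(pplr(8.6, 0.05, 8.0, 0.04), 3))
print(round(pplr(8.6, 1.50, 8.0, 1.20), 3))
```

The two calls share the same point estimates, so a method based on point estimates alone would score them identically. Under the Gaussian assumption, the larger variances in the second call pull the probability towards 0.5.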

G. Sanguinetti, M. Rattray and N. D. Lawrence. (2006) “A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription” in Bioinformatics 22 (14), pp 1753–1759 [Software][PDF][Pubmed][DOI][Google Scholar Search]


Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. This task, however, is difficult for a number of reasons: transcription factors’ expression levels are often low and noisy, and many transcription factors are post-transcriptionally regulated. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes.

Results: We introduce a novel probabilistic model to infer transcription factor activities from microarray data when the structure of the regulatory network is known. The model is based on regression, retaining the computational efficiency to allow genome-wide investigation, but is rendered more flexible by sampling regression coefficients independently for each gene. This allows us to determine the strength with which a transcription factor regulates each of its target genes, therefore providing a quantitative description of the transcriptional regulatory network. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates of the activities. We demonstrate our model on two yeast data sets. In both cases the network structure was obtained using Chromatin Immunoprecipitation data. We show how predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell.

Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.

X. Liu, M. Milo, N. D. Lawrence and M. Rattray. (2005) “A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips” in Bioinformatics 21 (18), pp 3637–3644 [Software][PDF][Pubmed][DOI][Advance Access][Pre-print PDF][Google Scholar Search]


Motivation: Affymetrix GeneChip arrays are currently the most widely used microarray technology. Many summarisation methods have been developed to provide gene expression levels from Affymetrix probe-level data. Most of the currently popular methods do not provide a measure of uncertainty for the expression level of each gene. The use of probabilistic models can overcome this limitation. A full hierarchical Bayesian approach requires the use of computationally intensive MCMC methods that are impractical for large data sets. An alternative computationally efficient probabilistic model, mgMOS, uses Gamma distributions to model specific and non-specific binding with a latent variable to capture variations in probe affinity. Although promising, the main limitations of this model are that it does not use information from multiple chips and that it does not account for specific binding to the mismatch (MM) probes.

Results: We extend mgMOS to model the binding affinity of probe-pairs across multiple chips and to capture the effect of specific binding to MM probes. The new model, multi-mgMOS, provides improved accuracy, as demonstrated on some bench-mark data sets and a real time-course data set, and is much more computationally efficient than a competing hierarchical Bayesian approach that requires MCMC sampling. We demonstrate how the probabilistic model can be used to estimate credibility intervals for expression levels and their log-ratios between conditions.

Availability: Both mgMOS and the new model multi-mgMOS have been implemented in an R package that is currently available from http://umber.sbs.man.ac.uk/resources/puma.

G. Sanguinetti, M. Milo, M. Rattray and N. D. Lawrence. (2005) “Accounting for probe-level noise in principal component analysis of microarray data” in Bioinformatics 21 (19), pp 3748–3754 [Software][PDF][Pubmed][DOI][Advance Access][Pre-print PDF][Bioinformatics Abstract][Google Scholar Search]


Motivation: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond a standard spherical noise. This indiscriminate nature provides one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis.

Results: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM-algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to ‘denoise’ a microarray dataset leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.

Availability: The software used in the paper is available from http://www.bioinf.manchester.ac.uk/resources/puma. The microarray data are deposited in the NCBI database.

M. Milo, A. Fazeli, M. Niranjan and N. D. Lawrence. (2003) “A probabilistic model for the extraction of expression levels from oligonucleotide arrays” in Biochemical Society Transactions 31 (6), pp 1510–1512 [PDF][Google Scholar Search]


In this work we present a probabilistic model to estimate summaries of Affymetrix GeneChip probe level data. Comparisons with two different models were made both on a publicly available dataset and on a study performed in our laboratory, showing that our model performs better for consistency of fold change.

N. D. Lawrence, M. Milo, M. Niranjan, P. Rashbass and S. Soullier. (2004) “Reducing the variability in cDNA microarray image processing by Bayesian inference” in Bioinformatics 20 (4), pp 518–526 [Software][Gzipped Postscript][Pubmed][DOI][Pre-print PDF][Google Scholar Search]


Motivation: Gene expression levels are obtained from microarray experiments through the extraction of pixel intensities from a scanned image of the slide. It is widely acknowledged that variabilities can occur in expression levels extracted from the same images by different users with the same software packages. These inconsistencies arise due to differences in the refinement of the placement of the microarray ‘grids’. We introduce a novel automated approach to the refinement of grid placements that is based upon the use of Bayesian inference for determining the size, shape and positioning of the microarray ‘spots’, capturing uncertainty that can be passed to downstream analysis.

Results: Our experiments demonstrate that variability between users can be significantly reduced using the approach. The automated nature of the approach also saves hours of researchers’ time normally spent in refining the grid placement.

Availability: A MATLAB implementation of the algorithm and an image of the slide used in our experiments, as well as the code necessary to recreate them are available for non-commercial use from http://www.dcs.shef.ac.uk/~neil/vis.