Our goal in this project was to develop new methods for inferring the parameters of mechanistic models of biological systems and to apply these methods to uncover the mechanisms of transcriptional regulation. This goal was achieved by unifying two disparate approaches to network analysis: the systems biology approach of specifying differential equation models of transcription (see this book), and the statistical/machine learning approach of constructing probabilistic models of the data (see Related Papers below). Our approach was to infer the parameters of differential equation models by constructing probabilistic models that respect the relationships specified by the differential equations. This was achieved by combining information from gene expression time series data with differential equation models of transcriptional regulation. The advantage of using probabilistic models is that we can handle uncertainty in the model parameters simultaneously with experimental and biological noise.
Process the expression data from the Drosophila blastoderm and construct differential equation models of the underlying processes by generalising equations developed previously by Jaeger et al.
The expression data processing and development of the models were undertaken at the Nottingham site. The Manchester site developed models for mesoderm development in Drosophila whilst awaiting the data and, in recent submissions (Titsias et al, PLoS Computational Biology, under review), has developed the Markov chain Monte Carlo framework for inference in these models, along with a linearization of the model with initial conditions determined by Gaussian processes (Alvarez et al, TPAMI, submission imminent).
- Adapt the preliminary work to account for non-linear response models and more complicated network motifs. We will expand the Gaussian process methodology we have developed beyond the single input module motif. We will consider other network motifs such as feed forward loops and dense overlapping regions. We will integrate transcriptional delays within the model. The nonlinear framework under review allows for time delays and multiple input transcription factors. The framework relies on the Markov chain Monte Carlo approach described in this NIPS paper and this book chapter.
- Develop Monte Carlo Sampling methods: We will develop the Monte Carlo methods, both as a gold standard for comparison with our approximations and as a practical approach for parameter inference. We developed Monte Carlo approaches and an approach based on the Laplace approximation. For the non-linear models the Monte Carlo approach proved efficient enough to act as a practical substitute for the Laplace approximation. Implementation was also easier, as the Monte Carlo approach does not require second order derivatives.
- Variational Approximations: We will develop and validate Gaussian process variational approximations for the systems of interest. Such approximations are typically faster than Monte Carlo sampling and are far easier to monitor as far as convergence is concerned. We developed a range of variational approximations for Gaussian process models derived from differential equations. These techniques also apply more generically to multiple output Gaussian processes. They included this NIPS paper, this AISTATS paper, this JMLR paper and this AISTATS paper.
- Validation of Techniques Against Drosophila data: We will develop models of the gap gene network involved in the blastoderm development stage of Drosophila. Parameter inference in these models will be undertaken by the methodologies developed in the objectives above. Since the protein concentrations are known in this Drosophila system it can be used as a validation data set for the techniques we develop. In our PNAS paper we validated our techniques against ChIP-chip studies on Drosophila mesoderm. Our latest studies in preparation and review are also being compared to the Drosophila mesoderm, as well as the blastoderm.
- We developed an approach to ranking targets of a transcription factor through linear differential equation models of transcription and translation (see DISIMRANK software below, this PNAS paper and this Bioconductor package).
- We provided software implementations of our algorithms for linear response through Bioconductor as the TIGRE package.
- We developed an approach to inferring protein concentration given the gene expression of known targets in a single input motif (see GPSIM software below, this NIPS paper and this ECCB paper).
- We developed a range of approaches for efficient computation in multiple-output Gaussian process models. The first relied on conditional independence assumptions (see this NIPS paper and this JMLR paper).
- We pioneered a new class of approximation techniques for Gaussian processes in general based on a variational approximation to the posterior. We showed how this could be done for single output Gaussian processes with a NIPS paper and how the approach could be extended to multiple output Gaussian processes with an AISTATS paper.
- For non-linear response models we showed how sampling can be done efficiently using Gaussian processes even when the posterior over the GP is strongly correlated (see this NIPS paper and this book chapter).
- With collaborators we composed a collected volume of papers in this area.
- We were invited to contribute to two further collected volumes: one on Bayesian inference in time series models, the other the imminent “Handbook of Systems Biology”, edited by Mark Girolami and Michael Stumpf and soon to be published by Springer.
- We organized four workshops, three in a new series “Learning and Inference in Computational Systems Biology” and one in “Machine Learning in Systems Biology”.
- The EPSRC funding covered a post-doctoral researcher in Manchester and a year’s post-doctoral research in Nottingham. This was augmented by a visitor from Finland, funded by the EU under the FP7 PASCAL2 Network of Excellence and a PhD student funded by the School of Computer Science at the University of Manchester.
- Three follow-on grants have so far been awarded that depend in some part on the ideas developed in this project: two from the BBSRC (Grant numbers BB/H018123/2 and BB/I004769/2, the second of which is a Europe-wide consortium under the ERASysBio Plus scheme) and one from the EU under FP7: “BioPreDyn-From Data to Models: New Bioinformatics Methods and Tools for Data-Driven Predictive Dynamic Modelling in Biotechnological Applications”.
The project is sponsored by EPSRC Project Ref EP/F005687/1 and is a collaboration with Dr Nick Monk of University of Nottingham, Dr Johannes Jaeger of CRG and Dr Antti Honkela of Helsinki University of Technology (visitor and collaborator).
Personnel from ML@SITraN
The following software has been made available either wholly or partly as a result of work on this project:
- gpsim GPSIM: Gaussian Process Modelling of single input module motif networks.
- multigp MULTIGP: Modelling multiple outputs with Gaussian processes (will eventually supersede the gpsim toolbox).
- disimrank DISIMRANK: Ranking potential targets using a driven input single input model motif.
The following conference publications are associated with this project.
“Efficient multioutput Gaussian processes through variational inducing kernels” in Y. W. Teh and D. M. Titterington (eds) Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 9, Chia Laguna Resort, Sardinia, Italy, pp 25–32. [Software][PDF][Google Scholar Search](2010)
Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Álvarez and Lawrence [Alvarez:convolved08] recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias [Titsias:variational09] to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.
“Latent force models” in D. van Dyk and M. Welling (eds) Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 5, Clearwater Beach, FL, pp 9–16. [Software][PDF][Google Scholar Search](2009)
Purely data driven approaches for machine learning present difficulties when data is scarce relative to the complexity of the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations to combine data driven modeling with a physical model of the system. We show how different, physically-inspired, kernel functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility of our approach is illustrated with three case studies from computational biology, motion capture and geostatistics.
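One way to see how a mechanistic assumption yields a kernel is the first-order case. The sketch below assumes a linear first-order differential equation driven by $R$ independent latent functions with Gaussian process priors $f_r \sim \mathcal{GP}(0, k_r)$, and assumes each output starts at its basal level $x_j(0) = B_j / D_j$. The symbols $B_j$, $D_j$ and $S_{jr}$ (basal rate, decay rate, sensitivity) are notation chosen for this illustration and are not taken from the abstract.

```latex
\begin{align}
\frac{\mathrm{d}x_j(t)}{\mathrm{d}t} + D_j x_j(t)
  &= B_j + \sum_{r=1}^{R} S_{jr} f_r(t), \\
x_j(t)
  &= \frac{B_j}{D_j} + \sum_{r=1}^{R} S_{jr} \int_0^t e^{-D_j (t-u)} f_r(u)\,\mathrm{d}u, \\
\operatorname{cov}\left[x_j(t), x_k(t')\right]
  &= \sum_{r=1}^{R} S_{jr} S_{kr} \int_0^t \int_0^{t'}
     e^{-D_j (t-u)}\, e^{-D_k (t'-u')}\, k_r(u,u')\,\mathrm{d}u'\,\mathrm{d}u.
\end{align}
```

Because the solution is a linear functional of the latent functions, the outputs are jointly Gaussian, and the double integral defines a covariance function whose parameters are the physical constants of the model.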
“Variational learning of inducing variables in sparse Gaussian processes” in D. van Dyk and M. Welling (eds) Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 5, Clearwater Beach, FL, pp 567–574. [Google Scholar Search](2009)
“Sparse convolved Gaussian processes for multi-output regression” in D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds) NIPS, MIT Press, Cambridge, MA, pp 57–64. [Software][PDF][Google Scholar Search](2009)
We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network.
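The construction described in this abstract can be written out as follows. This is a sketch under assumed notation: $G_{d,q}$ denotes a smoothing kernel linking latent function $u_q$ to output $f_d$, $k_q$ is the covariance of $u_q$, and $\mathbf{u}$ collects the latent function values at a chosen set of input locations. None of these symbols appear in the abstract itself.

```latex
\begin{align}
f_d(\mathbf{x})
  &= \sum_{q=1}^{Q} \int G_{d,q}(\mathbf{x} - \mathbf{z})\, u_q(\mathbf{z})\,\mathrm{d}\mathbf{z}, \\
\operatorname{cov}\left[f_d(\mathbf{x}), f_{d'}(\mathbf{x}')\right]
  &= \sum_{q=1}^{Q} \iint G_{d,q}(\mathbf{x} - \mathbf{z})\,
     G_{d',q}(\mathbf{x}' - \mathbf{z}')\, k_q(\mathbf{z}, \mathbf{z}')\,
     \mathrm{d}\mathbf{z}\,\mathrm{d}\mathbf{z}', \\
\mathbf{K}_{\mathbf{f}\mathbf{f}}
  &\approx \operatorname{blockdiag}\left[\mathbf{K}_{\mathbf{f}\mathbf{f}}
     - \mathbf{K}_{\mathbf{f}\mathbf{u}} \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1} \mathbf{K}_{\mathbf{u}\mathbf{f}}\right]
     + \mathbf{K}_{\mathbf{f}\mathbf{u}} \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1} \mathbf{K}_{\mathbf{u}\mathbf{f}}.
\end{align}
```

In the last line the blocks correspond to individual outputs: assuming the outputs are conditionally independent given $\mathbf{u}$ keeps each output's own covariance exact while coupling different outputs only through the low-rank term.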
“Efficient sampling for Gaussian process inference using control variables” in D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds) NIPS, MIT Press, Cambridge, MA, pp 1681–1688. [PDF][Google Scholar Search](2009)
Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by continuously minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation.
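The sampler described in this abstract can be illustrated with a much simplified sketch for GP regression. It is not the published algorithm: the control inputs `x_c` are fixed by hand rather than optimised, the control variables are moved with a plain symmetric random walk, and the kernel, noise level and function names are all assumptions made for this example. With a symmetric proposal on the control variables and the function drawn from the conditional GP prior, the acceptance ratio reduces to the likelihood ratio times the prior ratio of the control variables.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def control_variable_mcmc(x, y, x_c, noise_var=0.1, n_iter=3000, step=0.15, seed=0):
    """Sample f | y in GP regression by proposing moves on control variables f_c."""
    rng = np.random.default_rng(seed)
    jitter = 1e-6
    K_cc = rbf(x_c, x_c) + jitter * np.eye(len(x_c))
    K_fc = rbf(x, x_c)
    A = np.linalg.solve(K_cc, K_fc.T).T           # K_fc K_cc^{-1}: maps f_c to E[f | f_c]
    cond_cov = rbf(x, x) - A @ K_fc.T             # Cov[f | f_c]
    w, V = np.linalg.eigh(cond_cov)
    L_cond = V * np.sqrt(np.clip(w, 0.0, None))   # matrix square root, robust to round-off

    def log_lik(f):
        # Gaussian observation noise (hypothetical choice for this sketch)
        return -0.5 * np.sum((y - f) ** 2) / noise_var

    def log_prior(f_c):
        # GP prior density of the control variables, up to a constant
        return -0.5 * f_c @ np.linalg.solve(K_cc, f_c)

    f_c = np.zeros(len(x_c))
    f = A @ f_c + L_cond @ rng.standard_normal(len(x))
    samples, accepted = [], 0
    for _ in range(n_iter):
        # 1. Propose new control variables (symmetric random walk).
        f_c_new = f_c + step * rng.standard_normal(len(x_c))
        # 2. Generate the whole function from the conditional GP prior.
        f_new = A @ f_c_new + L_cond @ rng.standard_normal(len(x))
        # 3. Metropolis-Hastings test; the conditional prior terms cancel.
        log_ratio = (log_lik(f_new) + log_prior(f_c_new)
                     - log_lik(f) - log_prior(f_c))
        if np.log(rng.uniform()) < log_ratio:
            f_c, f = f_c_new, f_new
            accepted += 1
        samples.append(f)
    return np.array(samples), accepted / n_iter

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 5.0, 20)
    y = np.sin(x) + np.sqrt(0.1) * rng.standard_normal(20)
    x_c = np.linspace(0.0, 5.0, 6)                # hand-picked control inputs
    samples, rate = control_variable_mcmc(x, y, x_c)
    print("acceptance rate:", rate)
    print("posterior mean after burn-in:", samples[1000:].mean(axis=0))
```

In this sketch, a denser set of control points lowers the variance of the conditional draw and typically raises the acceptance rate, at the cost of a higher-dimensional proposal.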
“Modelling transcriptional regulation using Gaussian processes” in B. Schölkopf, J. C. Platt and T. Hofmann (eds) NIPS, MIT Press, Cambridge, MA, pp 785–792. [Errata][Software][Gzipped Postscript][PDF][Google Scholar Search](2007)
Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian Process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
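To make the roles of the sensitivity, decay rate and baseline expression level concrete, the sketch below evaluates the mRNA profile implied by a given transcription factor profile. It assumes a linear model of the form dx/dt = B + S f(t) - D x(t), a common choice for this setting that the abstract does not spell out, and assumes the gene starts at its basal steady state x(0) = B/D. The function name, parameter names and the example pulse are hypothetical.

```python
import math

def mrna_profile(f, times, basal, sensitivity, decay, n_quad=400):
    """Evaluate x(t) = B/D + S * integral_0^t exp(-D (t - u)) f(u) du.

    f is the (assumed known) transcription factor activity as a function of time.
    The integral is approximated with the trapezoidal rule on n_quad intervals.
    """
    profile = []
    for t in times:
        if t == 0:
            integral = 0.0
        else:
            h = t / n_quad
            # End points of the trapezoidal rule carry half weight.
            total = 0.5 * (math.exp(-decay * t) * f(0.0) + f(t))
            for i in range(1, n_quad):
                u = i * h
                total += math.exp(-decay * (t - u)) * f(u)
            integral = h * total
        profile.append(basal / decay + sensitivity * integral)
    return profile

if __name__ == "__main__":
    # Hypothetical transient pulse of transcription factor activity.
    def pulse(u):
        return math.exp(-0.5 * (u - 4.0) ** 2)

    times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    for t, x in zip(times, mrna_profile(pulse, times, basal=0.5, sensitivity=2.0, decay=0.8)):
        print(f"t = {t:4.1f}  x(t) = {x:.3f}")
```

In the inference problem the direction is reversed: the expression levels are observed, f(t) is the latent function, and B, S and D are estimated alongside it.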
The following books were published as part of this project.
N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) (2010) “Learning and inference in computational systems biology”, MIT Press, Cambridge, MA.
Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data; in particular, we show results in school exams score prediction, pollution prediction and gene expression data.
We present a computational method for identifying potential targets of a transcription factor (TF) using wild-type gene expression time series data. For each putative target gene we fit a simple differential equation model of transcriptional regulation, and the model likelihood serves as a score to rank targets. The expression profile of the TF is modeled as a sample from a Gaussian process prior distribution that is integrated out using a nonparametric Bayesian procedure. This results in a parsimonious model with relatively few parameters that can be applied to short time series datasets without noticeable overfitting. We assess our method using genome-wide chromatin immunoprecipitation (ChIP-chip) and loss-of-function mutant expression data for two TFs, Twist, and Mef2, controlling mesoderm development in Drosophila. Lists of top-ranked genes identified by our method are significantly enriched for genes close to bound regions identified in the ChIP-chip data and for genes that are differentially expressed in loss-of-function mutants. Targets of Twist display diverse expression profiles, and in this case a model-based approach performs significantly better than scoring based on correlation with TF expression. Our approach is found to be comparable or superior to ranking based on mutant differential expression scores. Also, we show how integrating complementary wild-type spatial expression data can further improve target ranking performance.
Motivation: Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalisation of these latent chemical species through Gaussian process priors. Results: We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarse-grained discretization of continuous-time functions, which would lead to a large number of additional parameters to be estimated. We develop efficient exact and approximate inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference. Availability: The software and data for recreating all the experiments in this paper are available in MATLAB from http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/gpsim Contact: Neil Lawrence
The following edited chapters were published as part of this project.
“Introduction to learning and inference in computational systems biology” in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [MIT Press Site][Google Scholar Search](2010)
“A brief introduction to Bayesian inference” in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [MIT Press Site][Google Scholar Search](2010)
“Gaussian processes for missing species in biochemical systems” in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [Pubmed][MIT Press Site][Google Scholar Search](2010)
“Markov chain Monte Carlo algorithms for Gaussian processes” in D. Barber, A. T. Cemgil and S. Chiappa (eds) Bayesian Time Series Models, Cambridge University Press. [Google Scholar Search](2011)
‘What’s going to happen next?’ Time series data hold the answers, and Bayesian methods represent the cutting edge in learning what they have to say. This ambitious book is the first unified treatment of the emerging knowledge-base in Bayesian time series techniques. Exploiting the unifying framework of probabilistic graphical models, the book covers approximation schemes, both Monte Carlo and deterministic, and introduces switching, multi-object, non-parametric and agent-based models in a variety of application environments. It demonstrates that the basic framework supports the rapid creation of models tailored to specific applications and gives insight into the computational complexity of their implementation. The authors span traditional disciplines such as statistics and engineering and the more recently established areas of machine learning and pattern recognition. Readers with a basic understanding of applied probability, but no experience with time series analysis, are guided from fundamental concepts to the state-of-the-art in research and practice.
The following publications have provided background to our work in this project.
“Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering” in Genome Research 18 (6), pp 939–948 [Pubmed][DOI][Google Scholar Search](2008)
Genome-wide identification of bona-fide targets of transcription factors in mammalian cells is still a challenge. We present a novel integrated computational and experimental approach to identify direct targets of a transcription factor. This consists of measuring time-course (dynamic) gene expression profiles upon perturbation of the transcription factor under study, and in applying a novel “reverse-engineering” algorithm (TSNI) to rank genes according to their probability of being direct targets. Using primary keratinocytes as a model system, we identified novel transcriptional target genes of TRP63, a crucial regulator of skin development. TSNI-predicted TRP63 target genes were validated by Trp63 knockdown and by ChIP-chip to identify TRP63-bound regions in vivo. Our study revealed that short sampling times, in the order of minutes, are needed to capture the dynamics of gene expression in mammalian cells. We show that TRP63 transiently regulates a subset of its direct targets, thus highlighting the importance of considering temporal dynamics when identifying transcriptional targets. Using this approach, we uncovered a previously unsuspected transient regulation of the AP-1 complex by TRP63 through direct regulation of a subset of AP-1 components. The integrated experimental and computational approach described here is readily applicable to other transcription factors in mammalian systems and is complementary to genome-wide identification of transcription-factor binding sites.
“An introduction to systems biology: design principles of biological circuits”, Chapman and Hall/CRC, London.(2006)
Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. This task, however, is difficult for a number of reasons: transcription factors’ expression levels are often low and noisy, and many transcription factors are post-transcriptionally regulated. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes. Results: We introduce a novel probabilistic model to infer transcription factor activities from microarray data when the structure of the regulatory network is known. The model is based on regression, retaining the computational efficiency to allow genome-wide investigation, but is rendered more flexible by sampling regression coefficients independently for each gene. This allows us to determine the strength with which a transcription factor regulates each of its target genes, therefore providing a quantitative description of the transcriptional regulatory network. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates of the activities. We demonstrate our model on two yeast data sets. In both cases the network structure was obtained using Chromatin Immunoprecipitation data. We show how predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell. Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.
Motivation: In systems like Escherichia coli, the abundance of sequence information, gene expression array studies and small scale experiments allows one to reconstruct the regulatory network and to quantify the effects of transcription factors on gene expression. However, this goal can only be achieved if all information sources are used in concert. Results: Our method integrates literature information, DNA sequences and expression arrays. A set of relevant transcription factors is defined on the basis of literature. Sequence data are used to identify potential target genes and the results are used to define a prior distribution on the topology of the regulatory network. A Bayesian hidden component model for the expression array data allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli with convincing results. Availability: www.genetics.ucla.edu/labs/sabatti/software.html Supplementary information: The supplementary material is available at Bioinformatics online. Contact: firstname.lastname@example.org
“Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities” in Bioinformatics 22 (22), pp 2275–2281 [Errata][Software][PDF][Pubmed][DOI][Google Scholar Search](2006)
Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. Recent experimental high-throughput techniques such as Chromatin Immunoprecipitation provide important information about the architecture of the regulatory networks in the cell. However, it is very difficult to measure the concentration levels of transcription factor proteins and determine their regulatory effect on gene transcription. It is therefore an important computational challenge to infer these quantities using gene expression data and network architecture data. Results: We develop a probabilistic state space model that allows genome-wide inference of both transcription factor protein concentrations and their effect on the transcription rates of each target gene from microarray data. We use variational inference techniques to learn the model parameters and perform posterior inference of protein concentrations and regulatory strengths. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates, as well as providing a tool to detect which binding events lead to significant regulation. We demonstrate our model on artificial data and on two yeast data sets in which the network structure has previously been obtained using Chromatin Immunoprecipitation data. Predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell. Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.
“Inferring quantitative models of regulatory networks from expression data” in Bioinformatics 20 (Suppl. 1), pp 248–256 [Google Scholar Search](2004)
Motivation: Genetic networks regulate key processes in living cells. Various methods have been suggested to reconstruct network architecture from gene expression data. However, most approaches are based on qualitative models that provide only rough approximations of the underlying events, and lack the quantitative aspects that are critical for understanding the proper function of biomolecular systems. Results: We present fine-grained dynamical models of gene transcription and develop methods for reconstructing them from gene expression data within the framework of a generative probabilistic model. Unlike previous works, we employ quantitative transcription rates, and simultaneously estimate both the kinetic parameters that govern these rates, and the activity levels of unobserved regulators that control them. We apply our approach to expression data sets from yeast and show that we can learn the unknown regulator activity profiles, as well as the binding affinity parameters. We also introduce a novel structure learning algorithm, and demonstrate its power to accurately reconstruct the regulatory network from those data sets. Keywords: transcription regulation, parameter learning, structure learning, regulatory networks Contact: email@example.com