For any engineered production process it is highly desirable to perform as much process or component design in silico as possible. This minimises trial and error testing of component interactions in the laboratory/factory. Underpinning in silico design are computational tools that can confidently be employed to predict the functional consequences of parameter change. Our previous first-round BBSRC BRIC funded grant clearly identified the importance of recombinant mRNA dynamics in controlling recombinant protein production by CHO cells. Consistent with this, very recent genome-scale studies have highlighted the pre-eminence of mRNA (synthesis, stability and, primarily, translational efficiency) in controlling the relative abundance of proteins in mammalian cells generally. This project is therefore concerned with the development and application of a computational design platform, necessarily derived from a combination of genome-scale datastreams, that can be reliably employed to speed the development of mammalian cell factories through the optimal design of synthetic genes with predictable in vivo performance during whole production processes. This project will also provide important tools that can be employed for a variety of genome-scale applications. By confident prediction of mRNA dynamics at the genome scale we will be able to re-create whole CHO cell proteomes in silico from high-throughput RNA sequencing data. This computational “bridge” between layers of cellular functional organisation will greatly facilitate the in silico design of synthetic genetic systems with a desired proportion of functional components, and will predict the relative abundance of protein components of complex cellular networks for fundamental studies of CHO cell function in the engineered environment. All proteomic and transcriptomic databases and associated computational resources will be available to the BRIC community.
Biopharmaceutical companies producing the new generation of recombinant DNA derived therapeutic proteins (e.g. cancer medicines such as Herceptin and Avastin) often use mammalian cells grown in culture to make the protein product. All production processes are based, fundamentally, upon the ability of the host mammalian cell factory to use a synthetic DNA genetic “code” to manufacture the complex protein product. This is a cornerstone of modern biotechnology. However, because protein synthesis is so complex, involving many cellular resources and machines, it is extremely difficult for genetic engineers to design a DNA code that will best enable the mammalian cell factory to operate most efficiently. Moreover, as individual mammalian cell factories can be very variable, they may differ substantially in their relative ability to make the product. As a consequence, a lot of time and money has to be spent by companies on the initial phases of the biopharmaceutical development process conducting intensive screening operations to find the best cell factory (out of a large population) able to use the genetic code it has been given. For a different protein product it is necessary to start the whole development process again. In this project we will utilise recently available high information content molecular analysis technologies and computational tools to “de-convolute” the complexity of protein synthesis in mammalian cell factories. Effectively, we know that the mammalian cell factory uses its own genetic code to make thousands of its own proteins (machines) that together perform a variety of functions that enable the cell to grow and divide. The rate at which these proteins are made varies hugely, over 1000-fold, so that the cell can make each bit of protein machinery in the right quantity to do its job. 
We will measure how efficiently each cellular protein is made. Then, using advanced biological information analysis (bioinformatics) and mathematics, we will determine how the cell uses pieces of information embedded in each of its genes to vary the rate at which a specific protein is made. This will enable us to create, for the first time, a usable set of “design rules” (computer programs) that genetic engineers and cell factory developers can employ to (i) reliably design the best genetic code for any given protein product and (ii) accurately predict how much of the protein product the mammalian cell factory can make. This is important as it means that biopharmaceutical companies can design a predictable production system from scratch, enabling a more rapid transition through lengthy cell factory development processes towards (pre-)clinical trials.
This research project clearly derives from (i) underpinning BRIC 1/1 research in DCJ's lab, which generated a fundamental understanding of the control of recombinant protein synthesis by CHO cells during production processes, and (ii) a BRIC 2 Enabling Grant, which was used to sequence the CHO cell genome. Based on this pre-competitive knowledge (bioscience underpinning bioprocessing), the proposed research is clearly focused on the creation of new tools and resources that would benefit a number of clearly defined user groups:
1. UK bioindustry. This project will support UK companies developing biological medicines produced by mammalian cells in culture. We will provide our industrial partners with a data-rich resource as well as new, validated computational and informatic methods that can be implemented immediately to reduce time and costs spent in the creation of biomanufacturing systems. This represents a clear economic benefit and increased capability and competitiveness for UK bioindustry. All data and tools will be made available to BRIC partners as soon as they are generated.
2. BRIC/Bioprocessing researchers. We will produce large reference datasets and computational modelling resources (people and tools) dedicated to biomanufacturing systems. These represent a significant resource not just for industry but for any researcher engaged in pre-competitive research on CHO cell based manufacturing systems. We anticipate that adaptations of our modelling approaches could be applied to other cell factories (e.g. yeast, E. coli) or to other mammalian cell culture systems (e.g. human cell therapies). Development of the UK's ability to productively utilise genome-scale datasets to improve biomanufacturing systems is absolutely necessary.
3. Other researchers. This project directly addresses the BBSRC’s 10-year vision “towards predictive biology”, concentrating on a core problem for functional genomics: how to reliably predict cellular protein abundances from measured mRNA abundances. We anticipate that our research and development would be relevant to many projects utilising genome-scale transcriptomic data.
The project is sponsored by BBSRC (Project Ref BB/K011197/1) and is a collaboration with David James, Mark Dickman, Paul Dobson and Josselin Noirel, all of Chemical and Biological Engineering.
Personnel from ML@SITraN
- Javier Gonzalez Hernandez, postdoctoral research assistant
The following publications have provided background to our work in this project.
Purely data driven approaches for machine learning present difficulties when data is scarce relative to the complexity of the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations to combine data driven modelling with a physical model of the system. We show how different, physically-inspired, kernel functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility of our approach is illustrated with three case studies from motion capture, computational biology and geostatistics.
Kernel methods are among the most popular techniques in machine learning. From a regularization perspective they play a central role in regularization theory as they provide a natural choice for the hypotheses space and the regularization functional through the notion of reproducing kernel Hilbert spaces. From a probabilistic perspective they are the key in the context of Gaussian processes, where the kernel function is known as the covariance function. Traditionally, kernel methods have been used in supervised learning problems with scalar outputs and indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently there has been an increasing interest in methods that deal with multiple outputs, motivated partially by frameworks like multitask learning. In this monograph, we review different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.
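As an illustration of what a valid kernel function for multiple outputs can look like, the sketch below builds one of the simplest constructions, a separable kernel in which a matrix B describing the relationship between outputs is combined with a standard scalar kernel through a Kronecker product. It is a minimal example under stated assumptions and is not taken from the monograph: the function names, the squared-exponential base kernel, and the values of W and kappa are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def separable_multi_output_kernel(x1, x2, B, lengthscale=1.0):
    """Multi-output covariance K = B kron k(x1, x2).

    B is a (D x D) positive semi-definite matrix coupling the D outputs.
    """
    return np.kron(B, rbf_kernel(x1, x2, lengthscale))

# Hypothetical setup: 2 outputs observed at the same 5 inputs.
x = np.linspace(0.0, 1.0, 5)
W = np.array([[1.0], [0.5]])      # illustrative rank-1 mixing weights
kappa = np.array([0.1, 0.2])      # illustrative per-output variances
B = W @ W.T + np.diag(kappa)      # positive semi-definite by construction

K = separable_multi_output_kernel(x, x, B)

# A valid covariance matrix must be symmetric with no negative eigenvalues
# (up to numerical round-off).
min_eig = np.linalg.eigvalsh(K).min()
print(K.shape, min_eig > -1e-10)
```

Because the eigenvalues of a Kronecker product are products of the eigenvalues of its factors, K is positive semi-definite whenever both B and the base kernel matrix are, which is why this construction always yields a valid covariance.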
Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so-called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data; in particular, we show results in school exam score prediction, pollution prediction and gene expression data.