Integrative bioinformatics with data model enabled data analysis: test case industrial microorganisms (SP 3.7.1)
Project leader: R.J. Siezen, Radboud University, Nijmegen / NIZO food research
Participants: A. Smilde, University of Amsterdam; T. Breit, University of Amsterdam; J.B.T.M. Roerdink, University of Groningen; O.P. Kuipers, University of Groningen; B. Poolman, University of Groningen
Microorganisms are widely used as cell factories. To improve these factories, generate new products or enhance their performance, these factories must be studied as an integrated system. In this project, we use '~omics' data for discovery and hypothesis generation in the area of life sciences research based on a microorganism test case. For this, a new strategy is needed to enable integration of heterogeneous models and data, as well as methods for the analysis and visualization of such heterogeneous data. The use of data and knowledge models for data annotation and integration forms the basis for a powerful, robust, and scalable integrative bioinformatics methodology.
Overview of subprojects and results:
Subproject SP3.7.1.1
Project leader: R.J. Siezen, Radboud University, Nijmegen / NIZO food research
Introduction and objectives
This project encompasses a set of bioinformatics studies on genome-scale metabolic networks and their applicability to the analysis of data from high-throughput techniques. An important preceding step in the application of metabolic networks is their construction. Comparative genomic analysis of sequenced genomes plays an important role in the construction process. For that reason, a comparative genomic approach is evaluated, which has been used in a novel method to accelerate the construction of metabolic networks. The application of metabolic networks for the prediction and analysis of data is also addressed.
Results
A comparative genomics study is presented to evaluate the effect of gene duplication on function prediction by orthology (orthologs are equivalent genes in different species that originated from a single gene in the last common ancestor). The question is raised whether the orthologs that are most similar at the sequence level are indeed the most likely functional equivalents when duplicates exist. This is done by analyzing orthologs between pairs of genomes where, in one genome, the orthologous gene has duplicated after the speciation of the two genomes/species (the resulting duplicates are called inparalogs). Gene neighbourhood conservation (i.e., the positioning of genes on the genome) is used as an indicator of functional equivalence. Although in the majority of investigated cases the most similar ortholog at the sequence level is indeed the one that conserves gene neighbourhood, in a substantial fraction it is not.
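As a toy illustration of how gene neighbourhood conservation can be used to compare two inparalogs as candidate functional equivalents, the sketch below scores each candidate by the fraction of the query gene's neighbours whose orthologs lie next to that candidate. The genomes, gene names, window size `k` and the scoring function itself are hypothetical and chosen for illustration only; they are not taken from the method used in the project.

```python
def neighbours(genome, gene, k):
    """Return the genes within k positions of `gene` (linear genome, no wrap-around)."""
    i = genome.index(gene)
    return set(genome[max(0, i - k):i]) | set(genome[i + 1:i + 1 + k])


def neighbourhood_conservation(genome_a, gene_a, genome_b, gene_b, orthologs, k=2):
    """Fraction of gene_a's neighbours whose ortholog is a neighbour of gene_b."""
    nb_a = neighbours(genome_a, gene_a, k)
    nb_b = neighbours(genome_b, gene_b, k)
    if not nb_a:
        return 0.0
    conserved = sum(1 for g in nb_a if orthologs.get(g) in nb_b)
    return conserved / len(nb_a)


# Hypothetical data: gene "x" in genome A has two inparalogs, "x1" and "x2", in genome B.
genome_a = ["a1", "a2", "x", "a3", "a4"]
genome_b = ["b1", "b2", "x1", "b3", "b4", "c1", "x2", "c2"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}

scores = {
    candidate: neighbourhood_conservation(genome_a, "x", genome_b, candidate, orthologs)
    for candidate in ("x1", "x2")
}
print(scores)
```

In this made-up example, `x1` retains the full neighbourhood of `x`, whereas `x2` shares only one neighbour, so `x1` would be the more likely functional equivalent regardless of which copy is more similar in sequence.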
The methods developed have subsequently been used to accelerate the reconstruction of genome-scale metabolic networks/models with comparative genomics and manually curated networks. On the basis of the reconstructed metabolic networks, modelling was employed to study global metabolic function at the systems level. An integrative approach is presented that addresses the question of to what extent transcriptional co-regulation of genes can be explained (predicted) by systems properties of genome-scale metabolic networks. Most studies have addressed the subject of regulation using static graph-theoretical descriptions of metabolic networks. In this project, the metabolic networks of Escherichia coli and Saccharomyces cerevisiae were modelled, and a correlation was found between the type of flux coupling and the co-regulation of genes at the level of operon organization, co-expression and transcription factor binding. Moreover, flux coupling and the graph-theoretical measure of shortest path distance were evaluated in the context of co-regulation, with the conclusion that flux coupling explains co-regulation better. Furthermore, it was demonstrated that the concept of flux coupling can be used to explain specific patterns in functional genomics data and, in addition, modes of evolution that shape complex systems. It was also investigated whether asymmetric relations between reactions are reflected in evolution. For this, the presence and absence patterns of proteins/reactions across species (also referred to as phylogenetic profiles) and their ancestral states were explored. Moreover, the occurrence of asymmetric relations was examined in gene essentiality data (the effect of single gene knockouts on fitness/growth) and in expression data across environmental conditions.
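One simple way to look for an asymmetric relation in phylogenetic profiles is to check whether the presence of one reaction implies the presence of another across species, but not the other way round. The sketch below is a minimal illustration under that assumption. The profiles are invented, and the `implication_support` helper is hypothetical rather than the measure used in the project.

```python
def implication_support(profile_a, profile_b):
    """Fraction of species containing A that also contain B.

    Profiles are equal-length lists of 0/1 flags, one entry per species.
    Returns None when A is absent from every species.
    """
    b_given_a = [b for a, b in zip(profile_a, profile_b) if a]
    if not b_given_a:
        return None
    return sum(b_given_a) / len(b_given_a)


# Hypothetical presence/absence profiles of two reactions across six species.
profile_a = [1, 1, 0, 0, 1, 0]
profile_b = [1, 1, 1, 0, 1, 1]

a_implies_b = implication_support(profile_a, profile_b)
b_implies_a = implication_support(profile_b, profile_a)
print(a_implies_b, b_implies_a)
```

Here every species that has reaction A also has reaction B, while the reverse holds for only three of the five species with B; such a pattern is what an asymmetric (directional) dependency between two reactions would be expected to leave in presence/absence data.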
Finally, a view on the topic of metabolic adaptation was given. Deciphering the adaptive properties underlying the structure and function of metabolic networks is one of the interests of network biology. Many properties can be inferred from networks, such as the global topology, flux states (distribution) and mutational robustness. Are these properties the result of adaptation, favoured by selection? Evolutionary processes alternative to direct selection on the property under investigation could also play a role, but are often ignored. For example, certain systems-level traits might simply arise as by-products of selection on other traits. Metabolic network properties in the light of adaptations and by-products were reviewed and future strategies to investigate metabolic adaptations are proposed.
Subproject SP3.7.1.3
Project leader: T. Breit, University of Amsterdam
Introduction and objectives
Taking a systems approach to improve the understanding of a biological system means that heterogeneous data and models need to be integrated and analysed as such. In order to provide a systems-level understanding, techniques must be developed that allow the combination of views from different disciplines and biological levels of abstraction. As part of this project, the application of robust and generic solutions for systems-level data integration is investigated. The modelling of data and knowledge is an important aspect of this approach. International standards are studied, such as biological ontologies (e.g., Gene Ontology); the OWL standard (Web Ontology Language); RDF (Resource Description Framework); and XML (eXtensible Markup Language). The mapping and annotation of experimental data with ontologies, and their subsequent use for integration and analysis, are investigated. A component-based approach that applies standards, for instance using web services, ensures re-usability of models, data, and methods.
Results
- Adaptation of a computer-engineering model (Arthemis) aimed at the development of embedded systems by Design Space Exploration, to be used in a biological case study.
- An approach and requirements for the integration of measurement data as part of a semantic framework for computational experimentation.
- An interactive workflow for detecting ridges (regions of increased gene expression) in gene expression profiles (SigWinDetector).
- Development and application of a methodology for operon prediction in prokaryote genomes based on integration of in silico and wet-lab data.
- A genome-centric database to present the results of the operon findings to biologists.
- Proof that analyzing prokaryotic expression data on a per-operon basis results in large differences compared to analysis on a per-gene basis in gene set enrichment analyses.
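To make the last point concrete, the following sketch shows one possible way of collapsing per-gene expression values to per-operon values before an enrichment analysis. The gene names, the operon assignment and the choice of the mean as summary are assumptions made for illustration, not the procedure used in the project.

```python
from statistics import mean


def collapse_to_operons(gene_values, operons):
    """Average per-gene values within each operon.

    Genes that are not assigned to any operon are kept as single-gene units.
    """
    assigned = set()
    units = {}
    for operon_name, genes in operons.items():
        measured = [gene_values[g] for g in genes if g in gene_values]
        if measured:
            units[operon_name] = mean(measured)
        assigned.update(genes)
    for gene, value in gene_values.items():
        if gene not in assigned:
            units[gene] = value
    return units


# Hypothetical log-ratios for four genes, three of which share one operon.
gene_values = {"geneA": 2.0, "geneB": 1.0, "geneC": 3.0, "geneD": -1.0}
operons = {"operon1": ["geneA", "geneB", "geneC"]}

units = collapse_to_operons(gene_values, operons)
print(units)
```

On a per-gene basis the three co-transcribed genes count as three separate observations, whereas on a per-operon basis they count as one; this change in the number of units per gene set is one reason why the two analyses can give different enrichment results.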
In this project, the potential of using engineering methods that were originally developed for the design of embedded computer systems to analyze biological cell systems was studied. For embedded systems as well as for biological cell systems, design is a feature that defines their identity. The assembly of different components in designs of both systems can vary widely. In contrast to the biology domain, the computer-engineering domain has the opportunity to quickly evaluate the design options and consequences of its systems by methods for computer-aided design, in particular design space exploration. It was found that there are enough concrete similarities between the two systems to assume that the engineering methodology from the computer systems domain, in particular that related to embedded systems, can be applied to the domain of cellular systems. This will help to understand the myriad of different design options that cellular systems have.
Subproject SP3.7.1.4
Project leader: J.B.T.M. Roerdink, University of Groningen
Introduction and objectives
The goal of this subproject was the development of methods and tools for automatic and interactive visualization of "virtual cell" components, in particular of regulatory networks and metabolic pathways.
Results
Interaction networks in biology are very complex, since interactions take place not only at genomic, proteomic, and metabolomic levels, but also between these levels. To deal with this complexity, a software framework was established that is able to visualize such networks, and that offers interactive exploration to a researcher. As part of this effort, an application called GENeVis was developed, which allows simultaneous visualization of gene networks and gene expression time series data. It has features that were lacking in existing tools, such as mapping of expression value and corresponding confidence value to a single visual attribute, multiple time point visualization, visual comparison of multiple time series, and support for statistical data analysis. Various interaction mechanisms, such as panning, zooming, highlighting, data selection, tooltips, and subnetwork views support data analysis and exploration.
To enhance the use of GENeVis, SpotXplore was developed, which is a plug-in for Cytoscape, an open-source software platform for visualizing complex networks and integrating these with any type of attribute data.
Subproject SP3.7.1.5
Project leader: O.P. Kuipers, University of Groningen
Introduction and objectives
This project set out to determine the genetic network and operon structures of Lactococcus lactis based on experimental transcriptome data.
Results
Using DNA microarray data, many new and unexpected transcriptional correlations were discovered between the expression profiles of genes and operons. These interactions were annotated using a metabolic network developed in another related project. The research performed in this project will help in the description of known transcriptional interactions and provides interesting directions for further research. The models generated in this project hold especially promising leads for the metabolic engineering of L. lactis. This application has high industrial relevance as L. lactis is not only a workhorse in the dairy industry, but is also of interest for the production of oral vaccines.
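As a minimal sketch of how transcriptional correlations between expression profiles can be detected, the example below computes Pearson correlation coefficients for all gene pairs and keeps the strongly correlated ones. The profiles, gene names and the threshold of 0.9 are hypothetical; the project's actual analysis may have used different measures and criteria.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}

threshold = 0.9  # assumed cut-off on the absolute correlation
correlated_pairs = [
    (g1, g2)
    for g1, g2 in combinations(sorted(profiles), 2)
    if abs(pearson(profiles[g1], profiles[g2])) >= threshold
]
print(correlated_pairs)
```

Using the absolute value keeps both positively and negatively correlated pairs, since a gene that is repressed whenever another is induced is as interesting for a genetic network as two genes that are induced together.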
Subproject SP3.7.1.6
Project leader: B. Poolman, University of Groningen
Introduction and objectives
Data analysis of the protein composition of complex biological samples, for instance to discover new biomarkers for disease, inevitably starts with collecting accurate and complete data sets. Protein composition analysis (proteomics) is nowadays almost always performed by shotgun proteomics, using liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS). The LC-MS/MS technique is widely used not only for the identification but also for the quantification of proteins in complex samples. This project aims to develop methods to improve on a typical mass-spectrometry-based pipeline (Figure 1).
Results
The approach encompassed investigating and addressing various bottlenecks present in the pipeline and performing statistical analysis on the quantification data collected. This has led to the publication of several algorithms and scripts that offer advantages in terms of the amount of valid protein data (Fig. 1.1), the reliability of the data (Fig. 1.2) and their interpretation in terms of quantitative differences between samples (Fig. 1.3). These new tools in the toolbox for proteomics analysis help to facilitate proteomics workflows in applied sciences, including medical and industrial use.


