Developing clinical predictors based on high-dimensional genomics data, pathway information and directed experimentation (project SP1.3.2)

Project leader: L. Wessels, Netherlands Cancer Institute, Amsterdam
Participant: J.J. Goeman, Leiden University Medical Centre

In this project we develop statistical techniques that modify the gene selection strategy so that it takes known relationships between genes, e.g. those defining a particular pathway, into account. Testing the association of a set of genes with the outcome variable therefore becomes very important. This is augmented with approaches that derive interesting gene sets from data obtained from compendia of cancer samples and from specific data from model systems subjected to directed perturbations. The aim is to identify subtypes in breast, colon and melanoma tumour series that have a clinically relevant predictive value. Finally, the gene expression based prediction will be augmented with prognostic predictors derived from proteomics data originating from the same patient series. These approaches promise to improve the performance of the predictors, allow better treatment choices, and also advance our understanding of the biology of cancer.

Overview of subprojects and results:

Subproject 1.3.2.1 Lamina association

Introduction and objectives

How two meters of DNA are packaged into a tiny nucleus is a question that has fascinated researchers for many years. Yet the three-dimensional organization of chromosomes within the nucleus remains unknown. It is known that this organization changes during, for example, cell differentiation. How these dynamic changes occur is also largely uncharted territory. This project was performed in close collaboration with the lab of Bas van Steensel at the NKI-AVL (Amsterdam). The van Steensel lab has developed a technique called DamID, which enables one to measure, for a given protein, the sites on the genome where it preferentially binds. This has typically been a question addressed for transcription factors: if we know where they bind on the genome, we are an important step closer to knowing how these factors regulate the transcription of their targets. In the van Steensel lab, this process was reversed: by doing a DamID experiment with a protein contained in the nuclear envelope (LaminB1), one can determine where the genome interacts with the nuclear lamina.

Results

By recording the DamID profile of LaminB1, previous work revealed that the interaction pattern of the genome with the nuclear envelope is organized in large domains, the so-called Lamin Associated Domains (LADs). In total there are about 1000 of these domains (stretches of DNA) that are preferentially close to the nuclear envelope. These regions are interspersed with similarly sized regions that are preferentially detached from the nuclear envelope.
By repeating this experiment in different cell types that trace out the development of a cell, the change in genome-lamina interactions during differentiation was visualized in molecular detail. More specifically, high-resolution maps of genome – nuclear lamina interactions were generated for subsequent differentiation stages of mouse embryonic stem cells. The stem cells differentiate into neural precursor cells, which in turn become terminally differentiated astrocytes. These DamID profiles revealed that the basal chromosome architecture present in embryonic stem cells is cumulatively altered at hundreds of sites during lineage commitment and subsequent terminal differentiation. This remodelling involves both individual genes and multi-gene regions, and affects many genes that determine cellular identity. Often, genes that move away from the lamina are activated at a later stage; many others, however, remain inactive yet are prepared for activation in a next differentiation step. These results suggest that lamina-genome interactions are widely involved in the control of gene expression programs during lineage commitment and terminal differentiation.
When comparing the genome-lamina interaction patterns of the different cell types, it was found that some of these LADs are cell-type specific, while many others appear to be constitutively associated with the lamina, i.e. these LADs interact with the lamina in all the measured cell types. These LADs are referred to as constitutive LADs or cLADs. How genome sequence drives constitutive interactions with the lamina is not known. The DNA sequence properties of a set of cLADs were analyzed and found to be strongly conserved between mouse and human. Moreover, cLADs have fewer synteny breakpoints than expected, which indicates that there has been evolutionary pressure to preserve the LAD structures. This suggests an important role for these structures in gene regulation. In contrast to the conservation of the global cLAD structure, the overall sequence conservation within cLADs is low. Instead, cLADs are universally characterized by long stretches of DNA of high AT content. The average AT content is therefore a strong predictor of lamina association within the cLADs.
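To make the notion of AT content as a predictor concrete, the following minimal Python sketch computes the AT fraction of a sequence in fixed-size windows. It is an illustration only and not the analysis used in the project: the function names, the window size and the toy sequence are all assumptions introduced here.

```python
def at_content(sequence: str) -> float:
    """Fraction of A/T bases among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    # Sequences without any unambiguous base (e.g. all N) get 0.0.
    return at / total if total else 0.0


def windowed_at_content(sequence: str, window: int, step: int) -> list[float]:
    """AT fraction of each full window of length `window`, advancing by `step`."""
    return [
        at_content(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Toy sequence (hypothetical): an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATTTAAAT" + "GCGCCGGGCC"
profile = windowed_at_content(toy, window=10, step=10)
print(profile)
```

On real data the windows would be far larger than in this toy example, and the resulting profile could be compared with the measured DamID signal along the genome.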
By combining molecular biological experiments and bioinformatics analysis, this project has significantly advanced our understanding of the role that spatial genome organization plays in the regulation of cellular processes.

Subproject 1.3.2.2 Breast cancer outcome prediction

Project leader: L. Wessels, Netherlands Cancer Institute, Amsterdam

Introduction and objectives

Cancer has recently become the number one cause of death in The Netherlands. Breast cancer is the most prevalent form of cancer among females, with a lifetime risk of 12.8%. Upon diagnosis of breast cancer, the tumour is surgically removed, followed by chemotherapy treatment to eradicate any tumourigenic cells that may still be present. Some patients will develop distant metastases, which will eventually cause death, while others will remain metastasis-free. To know whom to treat (to prevent metastases), several models have been developed that predict outcome based on clinical and pathological parameters of the tumour and the patient. More recently, microarray analysis has enabled the measurement of the mRNA levels of cells (a derivative of the genes on the DNA). Since research institutes have tissue banks with fresh frozen breast tumours, this presented the opportunity to generate rich microarray datasets of large sets of tumours. Based on these microarray datasets, new predictors of breast cancer outcome, so-called classifiers, can be developed. These classifiers select a subset of genes, the ‘signature’, to predict patient outcome.
The work in this project focused on ways to improve the performance of these classifiers. Each of these classifiers is hampered by the fact that there is a relatively small number of patients, and a huge number of genes. This causes classifiers to be inaccurate. To counter this, two approaches were employed. First, the sample size was increased by constructing compendia of several microarray datasets. Second, additional, independent data sources were incorporated.

Results

Combining several datasets into a compendium is a straightforward way to increase the statistical power. However, this comes at the price of increasing the biological and technical heterogeneity within the compendium, due to differences in clinical composition of the datasets and different technologies employed to record the gene expression data. Using several breast cancer microarray datasets that were publicly available, it was found that pooling datasets does, indeed, result in better performance of the classifier.
From dataset to dataset, the sets of genes that are selected in the signatures show a remarkably low degree of overlap. The constructed compendium was employed to compare signatures derived from single datasets with those derived from multiple datasets. The average overlap was found to increase significantly when signatures are derived from multiple datasets. Thus, the small sample size problem proved to be a key cause underlying the limited overlap, and many other explanations put forward for this effect could be rejected.
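The text does not state how signature overlap was quantified. As one possible illustration, the Python sketch below uses the Jaccard index (size of the intersection divided by size of the union) averaged over all pairs of signatures. The choice of the Jaccard index, the function names and the gene labels are assumptions made for this example.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index of two gene sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_overlap(signatures: list[set[str]]) -> float:
    """Average Jaccard index over all pairs of signatures."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical signatures, one per dataset (placeholder gene names).
signatures = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_B", "GENE_C", "GENE_D"},
    {"GENE_C", "GENE_E", "GENE_F"},
]
overlap = mean_pairwise_overlap(signatures)
print(f"mean pairwise overlap: {overlap:.2f}")
```

Computing this quantity once for signatures trained on single datasets and once for signatures trained on pooled datasets would give the kind of comparison described above.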
In the second approach, where independent data is employed, a distinction was made between general, non-tumour-specific data from databases and additional tumour-specific measurements from the same samples. General data from pathway databases (e.g. the Gene Ontology) was employed to define gene sets. Based on the gene expression data it was determined whether a pathway is activated, inactivated or unchanged. Since this reduces the number of features (there are fewer pathways than genes) it increases the statistical power. However, this higher-level representation did not result in better classification performance. These gene sets do, on the other hand, constitute a much richer representation of the underlying biology. This may help biologists to pinpoint important pathways for follow-up experiments.
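How pathway status was derived from the expression data is not specified here. The Python sketch below shows one simple possibility, assuming per-gene expression values that have already been standardized (z-scores): the pathway score is the mean over its member genes, and a symmetric threshold decides between the three states. The scoring rule, the threshold of 1.0 and all gene and pathway names are hypothetical.

```python
from statistics import mean


def pathway_state(expression: dict[str, float],
                  gene_set: set[str],
                  threshold: float = 1.0) -> str:
    """Classify a pathway from the mean z-score of its measured genes."""
    values = [expression[g] for g in gene_set if g in expression]
    if not values:
        # No member gene was measured, so no change can be called.
        return "unchanged"
    score = mean(values)
    if score > threshold:
        return "activated"
    if score < -threshold:
        return "inactivated"
    return "unchanged"


# Hypothetical z-scored expression values for one tumour sample.
sample = {"G1": 2.0, "G2": 1.5, "G3": -0.2, "G4": -1.8, "G5": -1.6}

# Hypothetical gene sets, e.g. as they might be defined from a pathway database.
pathways = {
    "P_up": {"G1", "G2"},
    "P_down": {"G4", "G5"},
    "P_mixed": {"G1", "G3", "G4"},
}

states = {name: pathway_state(sample, genes) for name, genes in pathways.items()}
print(states)
```

Each sample is then described by one state per pathway instead of one value per gene, which is the reduction in the number of features mentioned above.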
Tumour-specific data, such as extensive annotation of clinical and pathological variables, is available for a few datasets. These variables are ignored when training a classifier on microarray data only. Therefore, it was investigated whether training a classifier on both data sources simultaneously improved outcome prediction. For a range of classifiers, the integration strategies result in classifiers that perform better than those derived from a single dataset. This is an indication that additional sample-specific measurements hold the promise of increasing the accuracy of outcome prediction even further.