ARANGS12 Automated and reproducible analysis of NGS data
General info
- Date
- 11 Sep 2012 - 14 Sep 2012
- Location
- Oeiras, Portugal
- Website
- http://gtpb.igc.gulbenkian.pt/bicourses/ARANGS12/
- Keywords
- NGS, automation, workflow, design
- Organiser
- Instituto Gulbenkian de Ciência
- Contact(s)
- bicourses@igc.gulbenkian.pt
- Pedro Fernandes
Description
Next generation sequencing (NGS) technologies for DNA have resulted in an even bigger deluge of data. Researchers are finding that analysing such data sets is becoming the bottleneck in their work. In many cases, several steps in these analyses are fairly generic (e.g. quality control filtering, alignment to reference sequences, typing), so off-the-shelf pipelines can be applied. In other cases, novel research approaches require the development of new analysis pipelines. Either way, all analysis steps should be repeatable, and any changes made to the data (e.g. renaming, annotation, alignment) should be recorded so that the provenance of the results is clear and inferences are reproducible. In this brief workshop we will establish several best practices for reproducibility and provenance recording in the (comparative) analysis of data obtained by NGS. Along the way, we will work through use cases that illustrate the underlying principles and introduce the commonly used technologies that enable these best practices. Workflows will be manipulated using graphical platforms (e.g. Galaxy, Taverna).
Best practices
- Standardized project organization
- Projects fully 'runnable' without user intervention
- No loss of data, metadata or source code, ensured through versioning
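As a rough illustration of what "runnable without user intervention" and provenance recording can look like in practice, here is a minimal Python sketch. It is not taken from the course material: the helper names `sha256sum` and `run_step` and the log file name `provenance.jsonl` are hypothetical, chosen only for this example. The idea is that every analysis step is launched through one wrapper, which records the exact command plus checksums of the input and output files.

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone


def sha256sum(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_step(command, inputs, outputs, log_path="provenance.jsonl"):
    """Run one analysis step and append a provenance record to log_path.

    command: the program and its arguments, as a list of strings
    inputs:  paths of files the step reads (hashed before the run)
    outputs: paths of files the step writes (hashed after the run)
    All names here are hypothetical and for illustration only.
    """
    record = {
        "command": command,
        "started": datetime.now(timezone.utc).isoformat(),
        "inputs": {str(p): sha256sum(p) for p in inputs},
    }
    # check=True raises CalledProcessError if the step exits non-zero,
    # so a failed step never gets logged as if it had succeeded.
    subprocess.run(command, check=True)
    record["outputs"] = {str(p): sha256sum(p) for p in outputs}
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

A call such as `run_step([sys.executable, "filter.py", "reads.fastq", "filtered.fastq"], ["reads.fastq"], ["filtered.fastq"])` (with `filter.py` standing in for any real tool) would run the step and append one JSON line to the log. Keeping that log under revision control alongside the scripts is one simple way to make the history of each result traceable.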
Technologies
- Next generation sequencing platforms
- File formats
- Command-line executables, command-line scripting and batching
- High-level programming with domain-specific toolkits
- Revision control systems
- Workflow environments (both visual and command-line)
Use cases
- Phylogenetic placement of metagenomic data
- Typing of pathogens
- Comparative analysis of multicellular genomic data
Target audience
This course is aimed at researchers who have recently embarked on NGS projects and now, faced with large amounts of data, would like to learn how to automate generic analysis steps and develop new ones in a reproducible (and shareable) way. The course will be illustrated with examples from specific single-celled and multicellular taxa, but the concepts also apply to species not explicitly covered in the course. The ideal attendee is a scientist who is not afraid to get their hands dirty acquiring the computer-literacy skills needed for the informatics side of data analysis.
Teachers
Rutger Vos and Darin London


