PHASAR/BioMeta – mining metabolite data from literature (SP 4.1.1)
Project leader: K. Koster, Radboud University, Nijmegen
Participants: C. Evelo, Maastricht University; B. Mons, Erasmus Medical Centre, Rotterdam; F.M.G. de Jong, Twente University; J. Leunissen, Wageningen University and Research Centre
Text Mining techniques are becoming increasingly important in bioinformatics. Many of the new developments in Text Mining are being pioneered in the bioinformatics context, rather than in mainstream Information Retrieval. The PHASAR/BioMeta project is highly interdisciplinary in set-up. On the Information Retrieval side, the PHASAR (Phrase-based Accurate Search And Retrieval) Text Mining system is being constructed for the automatic extraction of information from large amounts of literature. The resulting system will be generic in nature and, given suitable thesauri and ontologies, can be applied to other subject areas. 'Metabolites' are selected as a test case, since these are normally only mentioned in passing in articles dealing with other topics, and they occur in such diverse sources that manual extraction is practically impossible. Besides a working literature mining system, a detailed thesaurus of metabolite terminology is constructed, as well as a database of metabolites and their relations.
Overview of subprojects and results:
Subproject SP4.1.1.4
Project leader: F.M.G. de Jong, Twente University
Introduction and objectives
This project investigates whether domain-specific knowledge can be integrated into the automatic indexing and search tools for biomedical information. In the medical domain it is common practice to use terms from thesauri (term banks or concept banks) for the indexing and/or classification of texts.
Results
Several methods for (semi-)automated text classification were compared. A word-based approach was compared with a concept-based approach, and the project investigated to what extent a manual concept-based representation can be obtained automatically from text. Retrieval based on concepts alone was shown to be significantly less effective than word-based retrieval, which runs counter to the approach adopted in the information systems commonly used in the domain. The findings could help to improve the effectiveness of the access tools currently offered by search portals.
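The difference between the two representations compared above can be illustrated with a minimal Python sketch. It is not the project's implementation: the thesaurus entries, concept identifiers, function names and the simple overlap score are all hypothetical and chosen only for illustration. The toy example does not reproduce the reported result; it only shows what each representation keeps and discards.

```python
import re

# Hypothetical toy thesaurus: surface term -> concept identifier (assumption).
TOY_THESAURUS = {
    "glucose": "C001",
    "dextrose": "C001",
    "lactate": "C002",
    "lactic acid": "C002",
}

def word_representation(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def concept_representation(text, thesaurus=TOY_THESAURUS):
    """Return the concept identifiers whose thesaurus terms occur in the text."""
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    padded = f" {normalized} "  # pad so that terms match on word boundaries
    return {cid for term, cid in thesaurus.items() if f" {term} " in padded}

def overlap_score(query_repr, doc_repr):
    """Count the elements shared by the query and document representations."""
    return len(query_repr & doc_repr)

doc = "Elevated lactic acid and dextrose levels were measured."

# Query using synonyms of the document's terms.
synonym_query = "glucose lactate"
synonym_word_score = overlap_score(
    word_representation(synonym_query), word_representation(doc))
synonym_concept_score = overlap_score(
    concept_representation(synonym_query), concept_representation(doc))

# Query using words that the toy thesaurus does not cover.
plain_query = "levels measured"
plain_word_score = overlap_score(
    word_representation(plain_query), word_representation(doc))
plain_concept_score = overlap_score(
    concept_representation(plain_query), concept_representation(doc))
```

In this sketch the synonym query matches the document only through concepts, while the plain query matches only through words, because a concept-only representation drops every word that the thesaurus does not cover. Whether this loss of information explains the project's finding is not stated in the source.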

