Beyond Bioinformatics

 

         

 
 
 
 


 

 

 

 

Title

 

 

 

 

Functional gene clustering via gene annotation sentences, MeSH and GO keywords from biomedical literature

 

Authors

Jeyakumar Natarajan1, * and Jawahar Ganapathy1

 

Affiliation

1Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, India

 

Email

jkumar@mrna.tn.nic.in; * Corresponding author

 

Article Type

Hypothesis

 

Date

Received December 14, 2007; accepted December 30, 2007; published online December 30, 2007

 

Abstract

Gene function annotation remains a key challenge in modern biology, especially for high-throughput techniques such as gene expression experiments. Vital information about genes is available electronically from the biomedical literature in the form of full texts and abstracts. In addition, various publicly available databases (such as GenBank, Gene Ontology and Entrez) provide access to gene-related information at different levels of biological organization, granularity and data format. This information is used to assess and interpret the results of high-throughput experiments. To improve keyword extraction for annotational clustering and other types of analysis, we have developed a novel text mining approach based on keywords identified at the level of gene annotation sentences (in particular, sentences characterizing biological function) instead of entire abstracts. Further, to improve the expressiveness and usefulness of gene annotation terms, we investigated the combination of sentence-level keywords with terms from the Medical Subject Headings (MeSH) and Gene Ontology (GO) resources. We find that sentence-level keywords combined with MeSH terms outperform the typical ‘baseline’ set-up (term frequencies at the level of abstracts) by a significant margin, whereas the addition of GO terms improves matters only marginally. We validated our approach on a manually annotated corpus of 200 abstracts, generated from 2 cancer categories with 10 genes per category. We then applied the method to three sets of differentially expressed genes obtained from pediatric brain tumor samples. This analysis suggests novel interpretations of the discovered gene expression patterns.
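The contrast between the abstract-level baseline and sentence-level keywords can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cue-word list, the stopword list, the `mesh:` prefix and all function names are assumptions made for illustration, since the abstract does not describe how annotation sentences are identified or how terms are weighted.

```python
import re
from collections import Counter

# Hypothetical cue words marking a "functional" sentence; the real
# sentence selection used in the article is not given in the abstract.
FUNCTION_CUES = {"regulates", "inhibits", "activates", "encodes", "binds"}
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "to",
             "by", "were", "from"}


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def split_sentences(abstract):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract)
            if s.strip()]


def abstract_level_terms(abstract):
    """Baseline: term frequencies over the entire abstract."""
    return Counter(tokenize(abstract))


def sentence_level_terms(abstract, gene):
    """Count terms only in sentences naming the gene plus a cue word."""
    gene = gene.lower()
    counts = Counter()
    for sentence in split_sentences(abstract):
        tokens = tokenize(sentence)
        if gene in tokens and FUNCTION_CUES & set(tokens):
            counts.update(t for t in tokens if t != gene)
    return counts


def gene_profile(abstract, gene, mesh_terms):
    """Sentence-level keywords combined with (assumed) MeSH terms."""
    profile = sentence_level_terms(abstract, gene)
    profile.update(f"mesh:{m.lower()}" for m in mesh_terms)
    return profile


def jaccard(a, b):
    """Keyword-set overlap, usable as a similarity for clustering."""
    ka, kb = set(a), set(b)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0


if __name__ == "__main__":
    text = ("Samples were collected from patients. "
            "TP53 regulates apoptosis in tumor cells. "
            "Statistics were computed.")
    print(abstract_level_terms(text).most_common(3))
    print(gene_profile(text, "TP53", ["Apoptosis"]))
```

In this toy example the baseline counts incidental words such as "samples", whereas the sentence-level profile keeps only terms from the sentence describing the gene's function. Gene profiles built this way could then be compared pairwise (here with a simple Jaccard overlap, chosen only for brevity) as input to a clustering step.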

 

Keywords

text mining; functional clustering; microarray data analysis

Citation

Natarajan & Ganapathy, Bioinformation 2(5): 185-193 (2007)

 

Edited by

T. W. Tan & S. Ranganathan

 

ISSN

0973-2063

 

Publisher

Biomedical Informatics Publishing Group

 

Copyright

Publisher

 

Copyright Transfer Agreement

The authors of articles published in Bioinformation automatically transfer the copyright to the publisher upon formal acceptance. However, the authors reserve the right to use the information contained in the article for non-commercial purposes.

 

License

This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.