miRAFinder and GeneAFinder scripts: large-scale searching for miRNA and related information in indexed literature abstracts

In recent times, information on miRNAs and their binding sites is gaining momentum. Therefore, there is interest in the development of tools extracting miRNA related information from known literature. Hence, we describe GeneAFinder and miRAFinder scripts (open source) developed using python programming for the semi-automatic extraction and arrangement of updated information on miRNAs, genes and additional data from published article abstracts in PubMed. The scripts are suitable for custom modification as per requirement. Availability miRAFinder and GeneAFinder scripts are free and available for download at http://sites.google.com /site/malaheenee/software.


Background:
The number of abstracts for biological articles at the PubMed [1] database has increased over timeline due to steady advancement in biomedical research.There are a number of online servers that extract information in a specific manner from abstract archived databases.The MedlineRanker web server allows a flexible ranking of Medline for a given topic of interest [2].MedEvi imposes positional restriction on occurrences matching multi-term queries, based on the observation that term with semantic relations [3].The FNeTD method for clustering achieved PubMed abstracts using revealing frequent "phrases" or "words" and identifying "nearer terms" of the domain [4].The MiSearch is an adaptive biomedical literature search tool that ranks citations based on a statistical model [5].Genomics researches are having a major impact on biological and medical sciences although the function of many genes remains unknown.In recent times, PubMed database search shows about 285 thousand articles associated with breast cancer researches.While, one of causes of tumorigenesis is the suppression of gene expression via microRNAs (miRNAs) [6].There are more than 2500 human miRNAs and some of them are potential therapeutic targets for neoplastic diseases [7].Database search shows more than 32 thousand articles devoted to miRNAs.Short nucleotide sequences of miRNAs can be used as biomarkers for cancer diagnostics as they circulate in biological liquids.Therefore, it is important to reveal comparative information about each studied miRNA in the literature.It should be noted that more 13 thousand articles are associated with miRNA participation in tumourigenesis as per the PubMed database search.Thus, specific search for refined information from archived databases is often time consuming and tedious in nature.Hence, we describe miRAFinder and GeneAFinder scripts written in python to simplify such tasks during biological investigations.

Software development and usage:
Python was used to write scripts for this purpose.

Software execution command:
The following commands were used for execution.cd folder/python miRAFinder.py-g pathway_1 -f pathway_2 > file.txtWhere pathway_1 is directory to gene dictionary file (genedic.txt)and pathway_2 is directory to file with abstracts (abstract.txt).

Figure 1:
The scheme of miRAFinder and GeneAFinder scripts.
Note: 1 -Preparation of "abstract.txt"file throught the downloading the abstracts from PubMed's site in text format; 2 -treatment of each abstract text; 3 -search for keywords in different groups (gene, miRNA, specific word, etc.); 4 -use of dictionaries with gene data; 5 -identification different miRNA types (mir-or let-); 6 -definition of data presentation depend on miRAFinder or GeneAFinder script.

Input data:
The article abstracts at PubMed [7] were downloaded in text format and was saved as "abstract.txt"(Figure 1).The list of human genes was extracted from the HGNC [8] database and was saved as "genedic.txt"file.The lists of keywords of each word group are make up in depend on subject of research.

Output:
The miRAFinder script processes information of abstracts from the PubMed database.This script allows to find miRNA names and keywords as shown in Figure 1.The result of searching for contents of the following data: PMID (publication medicine identification number) of the article, miRNA name, disease localization (organ or tissue), keywords (methods, change fold, cellular processes, functions, animal species, types of cells, biological liquids, etc.) and genes.Some examples are shown in Table 1 (see supplementary material).The obtained data on gene names are important as they can be mentioned in abstract as host or target genes of miRNAs.The script uses the list of genes from HGNC database for the correct identification of gene names in the abstract (for an exception of different abbreviations).Thus, the quantity of the found genes is regulated by keyword structure of the dictionary created for each separate searching (human genes, mouse genes, rat genes, etc.).The GeneAFinder script is similar to miRAFinder (Figure 1) and it allows to find specific information and gene names in the abstract of PubMed.It is possible to find the list of important keywords for general characteristic in the text of abstract using the GeneAFinder script.Some examples are shown in Table 2 (see supplementary material).The PubMed database was used in this study due to its comfortable data type format for application in the scripts.There are a possibility to use various gene dictionaries and their characteristics to retrieve a different data set.

Caveat & future development:
The need for the effective analysis of miRNA data using computer scripts is gaining momentum in recent times.The miRAFinder and GeneAFinder scripts described in this article help to extract specific information from archived abstracts in NCBI PubMed.We showed that the scripts scan through thousands of abstracts within reasonable time frames.The information gleaned through such approach finds utility in miRNA analysis of specific diseases (e.g.cancer ).