RAmiRNA: Software suite for generation of SVM-based prediction models of mature miRNAs

MicroRNAs (miRNAs) are short endogenous non-coding RNA molecules that regulate protein-coding gene expression in animals, plants, fungi, algae and viruses through the RNA interference pathway. By virtue of their base complementarity to target transcripts, mature miRNAs repress translation, which makes them important regulatory molecules in vivo. Precursor miRNAs and mature miRNAs have been predicted in a significant number of model organisms, but prediction models aimed at less studied organisms are rare. In this work, we provide a suite of standalone software tools called RAmiRNA (RAdicalmiRNA detector) for custom development of prediction models for mature miRNAs using support vector machine (SVM) learning. RAmiRNA can be used to develop an SVM-based model for prediction of mature miRNAs in an organism or a group of organisms on a UNIX-based local machine. Additionally, RAmiRNA reports the training accuracy for a quick estimate of the predictive ability of the generated model.

Availability:
The usage manual and download link for RAmiRNA can be found at http://ircb.iiita.ac.in


Background:
MicroRNAs (miRNAs) are post-transcriptional regulators that bind to complementary sequences on target messenger RNA transcripts (mRNAs), usually resulting in translational repression and gene silencing. By affecting gene regulation, miRNAs are likely to be involved in most biological processes, some as critical as insulin secretion, hematopoietic lineage differentiation and lipid metabolism [1-3]. Since experimental cloning methods for finding new miRNAs are inefficient, time consuming and very expensive, computational approaches are becoming increasingly popular for selecting miRNA candidates for further experimental validation. Most computational methods utilize pre-miRNA sequences and/or their secondary structures to detect miRNAs or pre-miRNAs using support vector machines, random forest models and ab initio prediction models [4-6].
miRNAs arise from a precursor structure (pre-miRNA), a stem-loop structure about 80 nucleotides long on average. The pre-miRNA is in turn derived from a primary miRNA (pri-miRNA), which is the transcript of a miRNA gene. The strategies developed so far for the prediction of pre-miRNAs are broadly categorized as filter-based, machine learning, homology-based and target-centered approaches [7].
Here, we develop RAmiRNA, a toolbox for easy development of dynamic prediction models using support vector machine (SVM) learning. RAmiRNA uses an ordered pipeline of Perl scripts to extract and modify mature miRNA sequences from the miRBase database [8] and subsequently compute features for classification and prediction. RAmiRNA provides a straightforward and easy-to-use platform for building SVM-based models that can predict mature miRNAs.

Methodology:
The RAmiRNA suite approaches the problem of mature miRNA prediction with a sliding window protocol. In a sliding window approach to sequence analysis, a virtual window of a particular length is placed over a linear sequence (of nucleotides or amino acids), and a meaningful score (or set of scores; the number of nucleotides of each type, for instance) is calculated from the residues inside it. In the next step, the window is shifted by a few nucleotides (we use the term 'jump' to denote this shift) and the score is calculated again. This procedure is repeated until the whole sequence has been covered.
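The sliding window protocol described above can be sketched in a few lines of Python. This is only an illustration, not the RAmiRNA implementation (which is written in Perl): the function names, the example sequence and the GC-count score are assumptions chosen for clarity, with `w` standing for the window length and `j` for the jump.

```python
from typing import Iterator, Tuple


def sliding_windows(sequence: str, w: int, j: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs of length w, shifting the window by j each step."""
    if w <= 0 or j <= 0:
        raise ValueError("window length and jump must be positive")
    # The last valid start is len(sequence) - w, so every window is full length.
    for start in range(0, len(sequence) - w + 1, j):
        yield start, sequence[start:start + w]


def gc_count(window: str) -> int:
    """Illustrative score (an assumption): number of G and C nucleotides in the window."""
    return sum(1 for base in window.upper() if base in "GC")


if __name__ == "__main__":
    example = "ACGUACGUGGCCAUAU"  # made-up RNA string, not taken from miRBase
    for start, window in sliding_windows(example, w=8, j=4):
        print(start, window, gc_count(window))
```

A smaller jump gives more, heavily overlapping windows; a jump equal to the window length gives non-overlapping windows.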
The RAmiRNA suite applies this protocol by sliding a window over the secondary structures (stem-loops) of pre-miRNAs to calculate a set of features. These features are then fed into an SVM classifier. The classifier is built to differentiate the region containing the mature miRNA from the regions falling outside it.
The RAmiRNA suite consists of four main tools. RAmiRNA-p generates the positive dataset, which corresponds to the region of the mature miRNA. RAmiRNA-n prepares the negative dataset, which corresponds to regions falling outside the actual mature miRNA. RAmiRNA-t takes the two sets generated by RAmiRNA-p and RAmiRNA-n and combines them into one (for details, see the additional information provided in the supplementary material). It then feeds this dataset into an efficient, publicly available support vector machine tool, LibSVM-train [9], which trains on the dataset and generates the SVM prediction model. Finally, for actual testing of pre-miRNAs, RAmiRNA-g generates a test set and feeds it to LibSVM-predict, ultimately producing predictions as graphical output showing mature miRNA regions. The workflow of the RAmiRNA suite is illustrated in Figure 1. RAmiRNA-p and RAmiRNA-n automatically label the positive and negative entries, respectively, in the standard LibSVM format. LibSVM tries to form a definite boundary between the two sets, which ultimately serves as the basis of prediction for RAmiRNA-g.
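To make the labeling step concrete, the sketch below writes feature vectors in the sparse text format that LibSVM reads, where each line is a label followed by 1-based `index:value` pairs. The helper names and the toy feature values are hypothetical and are not taken from RAmiRNA; only the line format itself follows LibSVM's documented convention.

```python
from typing import Iterable, List, Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one example as '<label> 1:<v1> 2:<v2> ...' with 1-based feature indices."""
    sign = "+1" if label > 0 else "-1"
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{sign} {pairs}"


def combine(positives: Iterable[Sequence[float]],
            negatives: Iterable[Sequence[float]]) -> List[str]:
    """Merge positive (+1) and negative (-1) examples into one training set."""
    lines = [to_libsvm_line(+1, f) for f in positives]
    lines += [to_libsvm_line(-1, f) for f in negatives]
    return lines


if __name__ == "__main__":
    # Toy vectors standing in for the encoded features of each window.
    for line in combine([[0.5, 3, 0]], [[0.1, 7, 2]]):
        print(line)
```

A file made of such lines can be passed to LibSVM's scaling and training tools; the combined set plays the role of the merged output of RAmiRNA-p and RAmiRNA-n described above.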

Encoding features:
RAmiRNA utilizes some of the most basic, yet powerful, features, which broadly fall into two categories: sequence-based features and structure-based features. It encodes a set of forty-six features, which are then selected on the basis of their statistically significant contribution towards the training accuracy of the prediction model. Figure 2 illustrates the significance of the features used in RAmiRNA (see supplementary information for the complete list of features).

RAmiRNA-p and RAmiRNA-n take two inputs: a) the miRBase database in the form of a downloadable text file (miRNA.str; see supplementary information for details); b) a miRBase ID. For example, if a user wants to build a prediction model for viruses, then the ids to be supplied are ebv, hiv, bkv, rlcv etc. RAmiRNA-p and RAmiRNA-n use these inputs in slightly different ways. RAmiRNA-p extracts the mature miRNA region from the pre-miRNA structures and encodes these structural entities into numerical values, labeling them as +1. RAmiRNA-n, on the other hand, traverses the stem of the pre-miRNA structures by sliding a window of user-defined length, avoiding the area containing the mature miRNA, and encodes numerical values that are labeled as -1. Consequently, RAmiRNA-n requires two more inputs: c) a window length, 'w'; d) the jump, 'j', that the window takes along the stem of the pre-miRNA structures.

The inputs to RAmiRNA-t are the outputs of RAmiRNA-p (positive dataset) and RAmiRNA-n (negative dataset). RAmiRNA-t trains on the combined dataset to generate a classification model and reports a training accuracy, which gives a quick estimate of the reliability of the generated model. RAmiRNA-g takes this model as input, along with the same window length and jump size that were supplied to RAmiRNA-n. The tools in the RAmiRNA toolkit form an ordered set (a pipeline) of Perl scripts.
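The way the negative set avoids the mature miRNA can be sketched as follows. The text does not state how "avoiding" is defined, so this sketch assumes that a window is kept only if it does not overlap the mature region at all, and it works on plain sequence coordinates rather than on the miRNA.str structure format. The function name and the coordinates are hypothetical.

```python
from typing import List, Tuple


def negative_windows(stem_length: int, mature_start: int, mature_end: int,
                     w: int, j: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows of length w, stepped by j,
    that do not overlap the mature region [mature_start, mature_end)."""
    if w <= 0 or j <= 0:
        raise ValueError("window length and jump must be positive")
    windows = []
    for start in range(0, stem_length - w + 1, j):
        end = start + w
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = start < mature_end and end > mature_start
        if not overlaps:
            windows.append((start, end))
    return windows


if __name__ == "__main__":
    # Hypothetical 40-nt stem with a 22-nt mature miRNA at positions 10..31.
    print(negative_windows(stem_length=40, mature_start=10, mature_end=32, w=8, j=2))
```

Under this assumption, each retained window would be encoded into a feature vector and labeled -1, while the mature region itself supplies the +1 example.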

Caveat and future development:
Since RAmiRNA depends on the number of miRNAs available in the miRBase database, some of the prediction models it generates are less accurate (for instance, models for organisms with very few known miRNAs). Such models should become more reliable as miRBase grows. Other classification features (such as enzyme recognition sites) will also be considered in future updates of RAmiRNA.

Figure 1: Flowchart illustrating the working pipeline of the RAmiRNA toolkit. Part (a) of this figure demonstrates the ability of RAmiRNA-p and RAmiRNA-n to generate positive and negative datasets from a given miRBase organism id and the miRNA.str database file. Note that RAmiRNA-p utilizes the standard miRBase format of writing a pre-miRNA to identify the mature miRNAs (shown here as boxes on a stem of pre-miRNAs). In part (b), the working of RAmiRNA-t is shown. RAmiRNA-t combines the outputs of RAmiRNA-p and RAmiRNA-n and feeds them into LibSVM's 'SVM-scale' and 'SVM-train' tools sequentially to generate a classification SVM model. It also reports cross validation accuracy. Finally, part (c) elucidates the process of testing a pre-miRNA (Test.ramm) using RAmiRNA-g.

Figure 2: Statistical contribution of various features, measured by F-scores. This bar graph illustrates the contribution of the features used in RAmiRNA. Features with the highest F-scores are color coded and listed in the graph legend to differentiate them from relatively non-contributing features, which are shown as red bars.
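The text does not define the F-score used in Figure 2. One definition commonly used for feature selection with SVMs compares, for a single feature, the separation of the positive and negative class means from the overall mean against the within-class sample variances. The sketch below assumes that definition; the function name and the toy values are illustrative only.

```python
from statistics import mean, variance
from typing import Sequence


def f_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """F-score of one feature given its values in the positive and negative sets.

    Assumed definition: ((mean_pos - mean_all)^2 + (mean_neg - mean_all)^2)
    divided by (sample variance of pos + sample variance of neg).
    Each set needs at least two values for the sample variance to exist.
    """
    overall = mean(list(pos) + list(neg))
    numerator = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    denominator = variance(pos) + variance(neg)
    if denominator == 0:
        # Both classes are constant: perfectly separable if the means differ.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Toy values of one feature in the positive and negative sets.
    print(f_score([1, 2, 3], [4, 5, 6]))
```

A larger value indicates that the feature separates the two sets more clearly when considered on its own; it says nothing about how features behave in combination.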