RetroPred: A tool for prediction, classification and extraction of non-LTR retrotransposons (LINEs & SINEs) from the genome by integrating PALS, PILER, MEME and ANN.

The prediction of non-long terminal repeat (non-LTR) retrotransposons, such as long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), from DNA sequence is still an open problem in bioinformatics. To elevate the quality of annotation of LINEs and SINEs, an automated tool, "RetroPred", was developed. The pipeline allows rapid and thorough annotation of non-LTR retrotransposons. The non-LTR retrotransposable elements are initially predicted by Pairwise Aligner for Long Sequences (PALS) and Parsimonious Inference of a Library of Elementary Repeats (PILER). Predicted non-LTR elements are automatically classified into LINEs and SINEs using an artificial neural network (ANN) based on the position specific probability matrix (PSPM) generated by Multiple EM for Motif Elicitation (MEME). The ANN model showed superior performance (accuracy = 78.79 +/- 6.86 %, Q(pred) = 74.734 +/- 17.08 %, sensitivity = 84.48 +/- 6.73 %, specificity = 77.13 +/- 13.39 %) using four-fold cross validation. As proof of principle, we have thoroughly annotated the locations of LINEs and SINEs in the rice and Arabidopsis genomes using the tool, which proved to be very useful with good accuracy. Our tool is accessible at http://www.juit.ac.in/RepeatPred/home.html.


Background:
Long interspersed elements (LINEs) and short interspersed elements (SINEs) are non-LTR retrotransposons that reside within cells of a host organism, copying and inserting themselves into the host genome. Studies have revealed their ubiquity in many eukaryotic organisms, both plants and animals. However, the identification of repetitive elements still remains the Cinderella of genome annotation. This may be due both to its inherent technical (algorithmic) complexity and to the predominant interest in determining the coding portions of the genome. The situation is nevertheless surprising for several reasons. Repetitive sequences are an important feature of eukaryotic genomes and account for a large proportion of the genome: at least 50% of the human genome [1] and about 80% of some plant genomes [2] seem to be composed of repetitive elements. They have played an important role in the evolutionary game [3]. Moreover, some repetitive sequences are also an important tool in genomic analysis and discovery [4]. Finally, from a "technical" perspective, repetitive sequences in most cases represent a serious problem in the genome assembly steps. Understanding retrotransposable elements (REs) and their biological role has now become imperative in furthering research in functional and molecular genomics. One way of furthering our knowledge of RE biology is through the computational analysis of REs in complete genomic sequences. By detailed comparison of the abundance and distribution of REs, we can infer the fundamental biological properties that are shared or that differ among species.
The annotation of genomic repeats typically relies on the results of a single computational program, RepeatMasker (http://www.repeatmasker.org/). Recently it has been reported that RepeatMasker may be "neither the most efficient nor the most sensitive approach" for annotation of genomic repeats [5]. However, with the development of several new methods for transposable element and repeat detection [6-9], it is now possible to apply a "combined evidence" approach to elevate the quality of RE annotations to a level comparable to that of gene models. Strategically, we have developed an RE annotation pipeline that integrates the combined computational evidence derived from PALS [10], PILER [9], MEME [11] and ANN for detection of non-LTR elements and their classification into LINEs / SINEs.

Methodology: Implementation of the tool
The stand-alone tool "RetroPred" is implemented in three separate phases (Figure 1). The first phase is meant for identification of repeats in the genome using PALS [10] and PILER [9].

Dataset for identification of genomic repeats (LINEs and SINEs)
The genomic repeats belonging to LINEs and SINEs were obtained from several sources: Repbase (update database release 8.12, downloaded from http://www.girinst.org), TIGR, and MIPS (MIPS Repeat Element Database), giving a total of 253 LINE and 350 SINE sequences (Table 1 in supplementary material). We took 70 sequences of terminal repeats (non-LINEs and non-SINEs) from Repbase as negative sequences.

Position specific probability matrix (PSPM)
The position specific probability matrices were built separately for LINEs and SINEs using MEME. Each matrix has 4 x M real-number elements, where M is the length of the sliding window (M = 50). Each element represents the probability of a given nucleotide base appearing at a given position in an occurrence of the motif, using a zero-order Markov model. The steps followed for generation of the PSPM are described in Figure 2.
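The shape of such a matrix can be illustrated with a minimal Python sketch. It is not MEME's expectation-maximization procedure: it simply counts bases in a set of already aligned, equal-length motif occurrences. The function names, the optional pseudocount, and the column-by-column flattening order are assumptions made for illustration; the source does not state how the 4 x M values are ordered when they are passed to the network.

```python
from collections import Counter

BASES = "ACGT"

def build_pspm(sites, pseudocount=0.0):
    """Return a 4 x M matrix (rows A, C, G, T) of per-position base probabilities.

    `sites` are aligned motif occurrences of equal length M. This is plain
    frequency counting, used here as a stand-in for MEME's output.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all motif occurrences must have the same length")
    matrix = [[0.0] * width for _ in BASES]
    for pos in range(width):
        counts = Counter(site[pos].upper() for site in sites)
        total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
        for row, base in enumerate(BASES):
            if total == 0:
                # No A/C/G/T observed at this position: fall back to uniform.
                matrix[row][pos] = 1.0 / len(BASES)
            else:
                matrix[row][pos] = (counts[base] + pseudocount) / total
    return matrix

def flatten_pspm(matrix):
    """Flatten position by position (assumed order), giving 4 * M values."""
    width = len(matrix[0])
    return [matrix[row][pos] for pos in range(width) for row in range(len(BASES))]

# Toy example with M = 4; the paper uses M = 50, which gives 200 values.
sites = ["ACGT", "ACGA", "ACTT", "AAGT"]
pspm = build_pspm(sites)
inputs = flatten_pspm(pspm)
```

With M = 50 the flattened vector has 4 x 50 = 200 entries, which matches the number of network inputs reported for the ANN.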

Neural network architecture
The ANN was implemented using the Stuttgart Neural Network Simulator (SNNS) version 4.2 from Stuttgart University. The PSPM generated by MEME was used as input to the neural network. The ANN configuration consists of 200 input nodes (the 4 x 50 elements of the PSPM) and 2 output nodes to discriminate between LINEs and SINEs from the training sets (Figure 1). The number of nodes in the hidden layer was varied from 3 to 9 in order to find the optimal network that allows the most accurate assignment of LINEs and SINEs (Table 2 in supplementary material). During the learning phase, a value of 1 was assigned to LINE and SINE sequences, whereas 0 was assigned to non-LINE and non-SINE sequences. For each configuration of the ANN, 110 independent training runs were performed to evaluate the average predictive power of the network. The corresponding counts of false/true positive and negative predictions were estimated using cut-off values of 0.1 and 0.9 for non-repeats and repeats, respectively.
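The use of the two cut-offs can be sketched as follows. This is an illustrative Python fragment, not the SNNS workflow used in the study. The function name and the toy scores are hypothetical, and the handling of outputs that fall strictly between 0.1 and 0.9 is an assumption (they are counted as undecided here), since the source does not say how such outputs were treated.

```python
def confusion_counts(outputs, labels, low=0.1, high=0.9):
    """Count TP/FP/TN/FN for network outputs using two cut-offs.

    Outputs >= `high` are called repeats, outputs <= `low` are called
    non-repeats. Anything in between is tallied as 'undecided', which is
    an assumption of this sketch and not stated in the source.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0, "undecided": 0}
    for score, label in zip(outputs, labels):
        if score >= high:
            counts["TP" if label == 1 else "FP"] += 1
        elif score <= low:
            counts["TN" if label == 0 else "FN"] += 1
        else:
            counts["undecided"] += 1
    return counts

# Toy network outputs and true labels (1 = LINE/SINE, 0 = negative sequence).
outputs = [0.97, 0.93, 0.05, 0.02, 0.95, 0.08, 0.50]
labels = [1, 1, 0, 0, 0, 1, 1]
counts = confusion_counts(outputs, labels)
```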

Fourfold cross-validation
A four-fold cross-validation technique was used to validate the developed ANN model. The dataset was randomly divided into four subsets (C1, C2, C3 and C4). Each subset is unbalanced, consisting of about 60 percent LINEs/SINEs and 40 percent non-LINEs/non-SINEs. The ANN was trained with three subsets and validated (based on the performance measures) for minimum error on the remaining testing set. This was repeated four times so that each subset served once as the testing set. The final prediction result was averaged over the four testing sets.
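The splitting and averaging scheme described above can be sketched in a few lines of Python. The helper names, the fixed random seed, and the toy per-fold scores are assumptions for illustration only; the sketch does not stratify the subsets, and whether the paper's ± values are standard deviations is not stated in the source.

```python
import random
import statistics

def four_fold_splits(items, k=4, seed=0):
    """Yield (train, test) pairs; each of the k subsets is the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def summarize(fold_scores):
    """Average a per-fold measure; the spread is the sample standard deviation."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

# Toy run: ten sequence identifiers split into C1..C4.
splits = list(four_fold_splits(range(10)))

# Placeholder per-fold accuracies, not values from the paper.
mean_acc, sd_acc = summarize([0.9, 0.8, 1.0, 0.9])
```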

Performance measures
The prediction results of the ANN model developed in this study were evaluated using the equations given in the supplementary material.
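Since the equations themselves are only given in the supplementary material, the following Python sketch uses the standard textbook definitions of these measures. Treating Q(pred) as TP / (TP + FP), the probability of correct prediction, is an assumption, as are the function name and the toy counts; the authors' exact formulas may differ.

```python
import math

def performance(tp, fp, tn, fn):
    """Standard confusion-matrix measures, expressed as percentages (MCC is unitless)."""
    def pct(num, den):
        # Guard against empty classes instead of dividing by zero.
        return 100.0 * num / den if den else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        "sensitivity": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
        "q_pred": pct(tp, tp + fp),  # assumed definition of Q(pred)
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy counts, not taken from the paper.
m = performance(tp=45, fp=5, tn=40, fn=10)
```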

Results and discussion:
The ANN model developed in this study (200-7-2) was trained with the PSPM calculated using MEME. In a four-fold cross-validation test, the network reached an overall accuracy of 95.31 ± 0.78 % for prediction of LINEs and 94.53 ± 1.44 % for prediction of SINEs. The prediction results are presented in Table 3 (see supplementary material). The network achieved an MCC of 0.9351 ± 0.0355 for LINEs and 0.8835 ± 0.0306 for SINEs. The other performance measures for prediction of LINEs were: Qpred = 97.99 ± 1.53 %, sensitivity = 94.17 ± 1.44 % and specificity = 97.04 ± 2.28 %. The performance measures of the network for prediction of SINEs were: Qpred = 95.03 ± 1.87 %, sensitivity = 96.37 ± 2.29 % and specificity = 93.02 ± 3.29 %. The value of the learning parameter was set to 0.1. The vast majority of the network outputs for LINEs and SINEs fell within 0.9 to 1.0, whereas the predicted output range for non-LINEs/non-SINEs was 0.0 to 0.1 (Table 3 in supplementary material). This illustrates that the 0.1 and 0.9 cut-off values provide adequate separation of the two classes using the ANN. The performance of the networks for prediction of LINEs and SINEs was also evaluated by calculating the area under the receiver operating characteristic (ROC) curve. The area under the curve is 0.97 for prediction of LINEs and 0.84 for prediction of SINEs, indicating good discrimination by the network.
Robust de novo computational identification and classification of genomic repeats is an important unsolved problem. The most obvious difficulties are caused by multiple interacting evolutionary processes. For example, most repeats due to mobile elements were presumably intact at the time they were inserted into the genome, but today are often found as fragmented, degraded copies that may be adjacent to repeats belonging to other families and/or embedded in segmental or tandem duplications. Functional regions within segmental duplications may be conserved, producing a repeat signature that can mimic a mobile element. Raw genomic data could be searched for the presence/absence of these conserved features in an attempt at de novo identification of putative non-LTR retroelements [12]. The developed tool "RetroPred" introduces a new approach to genomic repeat identification, classification and extraction of repeat sequences. In contrast to methods that involve self-alignment of a single genome, our comparative method searches for the conserved signatures of LINEs and SINEs and relies on the sequence similarity between different occurrences of retroelements in the genome. The results demonstrate that the developed ANN-based model is adequate and can be considered an effective tool for 'in silico' annotation of LINEs and SINEs from the complete genome.
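For readers unfamiliar with the ROC measure reported in the results, the area under the curve can be computed directly from network outputs as the fraction of positive/negative pairs that are ranked correctly. The Python sketch below uses this pairwise (Mann-Whitney) formulation; the function name and toy scores are illustrative assumptions, and the source does not state which procedure the authors used to obtain their AUC values.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    example scores higher; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy examples: perfectly separated scores, then one misranked pair.
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auc_partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 1.0 means every positive outscores every negative, while 0.5 corresponds to random ranking.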

Availability:
The stand-alone program is also implemented on the Web server RetroPred, available at http://www.juit.ac.in/RepeatPred/home.html, using CGI/Perl scripts. Users can download the entire program and use it for detection, classification and extraction of the corresponding LINE and SINE sequences from an entire genome.

Conclusion:
Currently, there is no reliable systematic way of detecting retroelements and classifying them into LINEs and SINEs. Strategically, we have developed a neural network-based, fully automated computational method capable of classifying predicted genomic repeats into their subfamilies (LINEs and SINEs) based on their conserved sequence patterns. A user-friendly program, RetroPred, has been developed on the basis of this study. We have designed our system so that it can be manually curated in an efficient manner for detection, classification and extraction of LINEs and SINEs, a goal that has important implications for experimental studies of genome and chromosome biology.
___________________________________________________________________________ ISSN 0973-2063 Bioinformation 2(6): 263-270 (2008) Bioinformation, an open access forum © 2008 Biomedical Informatics Publishing Group
In the first phase, the input genomic sequence is aligned locally to itself using PALS [10], which detects the positions of transposable repeat signatures within the genome. The output file is parsed by PILER [9] to extract all the dispersed transposable elements from the genome and to cluster similar repeats together. In the second phase, the repeats are processed using MEME [11] with energy value 0.01 for discovery of conserved patterns in a window size of 50. In the third phase, the predicted genomic repeats are classified into LINEs and SINEs based on their conserved signatures using the ANN.

Figure 1: The flow diagram used for identification and configuration of artificial neural network (ANN) for classification of predicted non-LTR retrotransposons into LINEs and SINEs.

Figure 2: The steps followed for generation of position specific probability matrix (PSPM) of the datasets from three different sources using Multiple EM for Motif Elicitation (MEME).

The reliability of the developed tool for prediction, classification and extraction of genomic repeats (LINEs and SINEs) was assessed by running the program on the complete genomic sequences of rice and Arabidopsis downloaded from GenBank. The predicted results are shown in Table 4 under supplementary material (see the website http://www.juit.ac.in/RepeatPred/results.htm for more details). Our tool predicted a total of 255 LINEs (0.114 % of the entire genome) and 671 SINEs (0.292 % of the entire genome) across the 12 chromosomes of the rice genome. From the complete genome of Arabidopsis (5 chromosomes), the tool predicted a total of 46 LINEs (0.04 % of the genome) and 65 SINEs (0.082 % of the genome). The tool produces a graphical representation of the entire chromosome indicating the location of LINEs and SINEs in the chromosome (Figure 3) (see the website http://www.juit.ac.in/RepeatPred/results.htm for more details). By clicking the corresponding element, one can extract the repeat sequence. Although the tool has been tested on two genomes, it can also be used for prediction of LINEs and SINEs in other genomes.

Figure 3: Graphical output of the program showing the location of LINEs and SINEs on the chromosome. The red regions represent the locations of SINEs and the green regions represent the LINEs in the chromosomal DNA. The positions of the SINEs and LINEs are given in units of mega base pairs (Mbp).

Table 1: The sources of the dataset used for identification of genomic repeats and classification into LINEs and SINEs.

Table 2: The variation in performance of the network with increasing hidden nodes for both LINEs and SINEs.

Table 3: Performance measures of the ANN model for classification of LINEs and SINEs using four-fold cross validation.

Table 4: Prediction results for LINEs and SINEs in the rice and Arabidopsis genomes.