CEPiNS: Conserved Exon Prediction in Novel Species

Exon structure is relatively well conserved among orthologs in several large clades of species (e.g. Mammalia, Diptera, Lepidoptera) across evolutionary distances of up to 80 million years. Thus, it should be straightforward to predict the exon structures in novel species based upon the known exon structures of species that have had their genomes sequenced and well assembled. Being able to predict the exon boundaries in the genes of novel species is important given the quickly growing numbers of transcriptome sequencing projects. CEPiNS is a new pipeline for mining exon boundaries of predicted gene sets from model species and then using this information to identify the exon boundaries in a novel species through codon based alignment. The pipeline uses the freeware SPIDEY, an exon boundary prediction tool, and BLAST (BLASTN, BLASTP, TBLASTX), both of which are part of NCBI’s toolkit. CEPiNS provides an important tool to analyze the transcriptome of novel species.


Background:
Genes contain exons which are regions coding for proteins and introns which are non-coding regions. In transcription process, introns are removed by RNA splicing and the exons are joined together to form the functional messenger RNA (mRNA) [1].
Accurate prediction of precise exon-intron boundaries in genes is an essential step in the analysis of genomic sequences [2]. This gene structure is conserved between closely related species for the majority of genes [3]. In evolution, gene structure conservation may be a record of core events [4]. The aim of this project is to develop a new pipeline to predict exon sequences and their boundaries for novel species comparing a model species by using the sequence similarity method.

Methodology:
CEPiNS is a bioinformatics tool for large-scale exon prediction. This application allows study of gene structure for model and novel species by predicting exon boundaries and sequences.
Given the input of a set of gene sequences and their genomic sequences for a model species, CEPiNS generates a table of exon boundaries for these genes. CEPiNS uses BLAST [5] to identify the orthologous genes between the genes of a reference species and the predicted genes from a novel species' assembled transcriptome. Once this orthology has been established, the exons in the genomic reference species can be transferred to the novel species. The output is therefore the predicted exons in the

Preprocessing of Dataset
CEPiNS has preprocessing tool to remove alternate splicing and to predict orthologous sequences by using BLAST. For both the model and novel species, multiple copies of the same genes within a genome are removed by identifying sequences with at least 95% similarity at the nucleotide level using BLASTN and retaining only the longest. Gene sequences in both model and novel species with at least 60% similarity at protein level CEPiNS treated as orthologous sequences using BLASTP.

Exon Prediction in Model Species
SPIDEY is an mRNA to genomic DNA alignment program [6]. When intron-exon boundaries are not already annotated in the reference (model) species, SPIDEY gives the exon boundaries by using a set of mRNA or cDNA sequences and their corresponding genomic sequences. CEPiNS generates a table with the gene ID, genomic sequence ID, exon boundaries in genes and genomic sequences and length of exons by using the table created by SPIDEY. It also creates a file of exon sequences in fasta format.

Exon Prediction in Novel Species
CEPiNS uses TBLASTX for the alignment at amino acid level in all six reading frames for each predicted genes from transcriptome assembly of the novel species and its corresponding exon sequences of the reference species, which has been created by SPIDEY. CEPiNS creates a table output with cDNA ID of Reference species, cDNA ID of novel species, exon Number, genomic coordinates, mRNA coordinates, length and percent identity. It also creates exon sequences fasta file for novel species.

Software Input and output:
CEPiNS requires the input of a set of transcribed genes of a model species, the genomic sequence of the same species and of a predicted gene set from transcriptome assembly of a novel species in fasta file format. The log screen keeps tracking the steps performed and results can be viewed by clicking corresponding buttons. The final outputs are the predicted exons boundaries in a text file and exons sequences in a fasta file of the novel species.

Conclusion:
CEPiNS is a package for predicting large scale exons for novel species with Graphical User Interface (GUI) so that biologists, ecologists, geneticists and people from other backgrounds can use it very easy way. The output data offer several opportunities for further work & other tool development.