Function annotation of peptides generated from the non-coding regions of D. melanogaster genome

De novo emergence of genes is among the most fundamental sources of genetic diversity and is attracting growing attention from the scientific community. The identification of short open reading frames (sORFs) in the non-coding regions of different genomes has driven much of this recent interest. The coding potential of these newly identified sORFs has been investigated through experimental and computational approaches in recent studies. In the present work, we attempted to derive peptides from intergenic sequences of the D. melanogaster genome with a view to therapeutic applications. Towards this goal, we found strong computational evidence for 145 conformationally stable peptides from the intergenic sequences of D. melanogaster. The structures of these unique peptides were predicted using an ab initio method, and their functions were annotated using this structural information. The predicted peptides were categorised as DNA-, protein- or ion-binding proteins, electron transporters and, in a few cases, enzymes. Experimental studies are needed to validate these preliminary findings. This work provides further evidence of the untapped potential of the non-coding genome.


Background:
The term 'junk DNA' for non-genic DNA was coined by Ohno et al. in the 1970s. Non-genic DNA has held the attention of scientists ever since, increasingly because of its perceived role in regulating gene expression. However, it is still not clear whether non-coding DNA merely provides structural support for DNA folding or whether it is expressed on demand and serves as a dormant expression reservoir. Deep transcriptome sequencing has revealed transcripts that lack long or conserved open reading frames (ORFs), and there is growing evidence that these transcripts are translated into novel peptides. The coding potential of short peptides has been explored widely using different computational and experimental approaches [1][2][3]. Recently, evidence of short novel genes, or de novo genes, that function in male reproduction has been reported in the Drosophila non-coding genome [4,5].
D. melanogaster is one of the most extensively used model organisms for investigating developmental and cellular processes in higher eukaryotes. In the last few years, scientists have been exploring the possibility that novel genes arise from regions originally annotated as non-coding. Studies indicate that almost 12% of the newly emerged genes in the D. melanogaster subgroup may have arisen from non-coding DNA [7]. A common feature of de novo genes is that they are translated from short ORFs (<100 codons in length) that originate from introns and are not observed in the coding region. The intergenic region of D. melanogaster could therefore be considered a potential repository of de novo genes, many of which might be functional. The identification and functional annotation of these genes are still in their infancy. One way of finding the potential function of a non-coding region is to check whether the DNA shares characteristics with other sequences known to be functional. However, if the novel genes are 'orphans' that lack homologs, characterizing such protein-coding genes is a real challenge. The present study was carried out to predict potential peptides from the intergenic sequences of D. melanogaster and to understand their functional significance. The study extends earlier work in which novel, functional, non-natural proteins were successfully synthesized and characterized from the non-coding regions of the Escherichia coli K-12 (strain MG1655) genome [6]. In this study, we propose an in silico approach for identifying potential peptides from the non-coding genome.

Figure 1:
In silico strategy designed for identifying potential peptides from non-coding DNA

Methodology:
Refining the study sample
The non-coding DNA of D. melanogaster was used in this study. D. melanogaster is a well-characterised eukaryotic model organism with a genome size of 168.7 Mb, of which 80% is non-coding. The preliminary dataset consisted of 3500 intergenic sequences from the FlyBase database, version FB2012/04 [8]. These sequences were matched against the non-redundant (NR) protein database to verify their uniqueness, and sequences with homologs in the protein database were omitted. The aim was to extract completely novel sequences, which formed the sample data for this study.
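The filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the similarity search was run with a tool that writes 12-column tab-separated hits (the layout used by BLAST tabular output), and the function name `sequences_without_hits`, the E-value cutoff and all identifiers in the example are hypothetical.

```python
from typing import Iterable, List


def sequences_without_hits(query_ids: Iterable[str],
                           tabular_hits: Iterable[str],
                           max_evalue: float = 1e-3) -> List[str]:
    """Return the query IDs that have no database hit at or below max_evalue.

    Each hit line is expected to hold 12 tab-separated columns:
    qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    The default cutoff is an assumption, not a value from the study.
    """
    with_hits = set()
    for line in tabular_hits:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:  # column 11 holds the E-value
            with_hits.add(fields[0])
    # Preserve the input order of the queries that remain
    return [q for q in query_ids if q not in with_hits]


# Hypothetical identifiers and hits, for illustration only
ids = ["IG_0001", "IG_0002", "IG_0003"]
hits = [
    "IG_0001\thypothetical_protein_A\t85.0\t60\t9\t0\t1\t180\t5\t64\t1e-20\t95.0",
    "IG_0003\thypothetical_protein_B\t30.0\t25\t17\t1\t10\t84\t2\t26\t5.0\t22.0",
]
novel = sequences_without_hits(ids, hits)
print(novel)
```

Here `IG_0003` is retained because its only hit has an E-value above the cutoff; a stricter or looser cutoff changes which sequences count as novel.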

Novel Peptide Prediction
The intergenic sequences were translated in all six reading frames, and the resulting peptides, which do not resemble existing sequences in the NR protein database, formed the novel peptide dataset. Physicochemical characteristics of the peptides, such as molecular weight, isoelectric point and instability index, were then computed using the ExPASy ProtParam tool [9]. The structural and functional importance of these newly generated peptides was investigated in several steps, as outlined in the workflow chart (Figure 1).
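Six-frame translation can be illustrated with a short, self-contained sketch. It assumes the standard genetic code and reads three frames from the forward strand and three from the reverse complement. The helper `peptides_between_stops` and its `min_length` parameter are hypothetical, since the text does not state how individual peptides were delimited within each frame.

```python
from itertools import product
from typing import Dict, List

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from position 0; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> Dict[str, str]:
    """Translate frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def peptides_between_stops(protein: str, min_length: int = 10) -> List[str]:
    """Split a translated frame at stop codons (hypothetical delimiting rule)."""
    return [p for p in protein.split("*") if len(p) >= min_length]


frames = six_frame_translation("ATGGCCTAA")
print(frames)
```

Incomplete trailing codons are ignored, so frames of the same sequence can differ in length by one residue.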

Function prediction based on the peptide sequence
The molecular function of the peptides was predicted using the ProtFun 2.2 server [10], which makes ab initio predictions of protein function from sequence. The program relies on sequence-derived protein features such as predicted post-translational modifications (PTMs), protein sorting signals and physical/chemical properties calculated from the amino acid composition. These features are integrated into final predictions of the cellular role, enzyme class (if any) and selected Gene Ontology categories of the submitted sequence. Searching for sequence patterns helps to identify biologically meaningful sites in a sequence and thus to determine the function of a protein, because proteins of a particular family share common attributes that may derive from a common ancestor. The Prosite database [11] was used to search these novel sequences for functionally significant regions such as profiles, domains and motifs.

Antimicrobial Peptide Prediction
From the novel peptide (EKA) dataset, peptides shorter than 40 amino acids (AA) were considered further to investigate their therapeutic potential. Each novel peptide was matched against the natural antimicrobial peptides documented in the Antimicrobial Peptide Database (APD) [12]. APD contains natural antimicrobial peptides from bacteria, plants and animals with antiviral, antifungal and anticancer activities. The APD tool calculates properties of the input peptide such as net charge, peptide length, percentage of hydrophobic residues and other amino acid composition measures. These properties were then used to search the database and list the peptides most similar to the input.
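The kind of properties used in this comparison can be approximated with a small sketch. The definitions below are simplifying assumptions and are not taken from APD: net charge is counted as (K + R) minus (D + E), and the hydrophobic set is one common choice. The function names and the example sequence are hypothetical.

```python
from typing import Dict

# Assumed residue classes; APD's own definitions may differ
POSITIVE = set("KR")
NEGATIVE = set("DE")
HYDROPHOBIC = set("AILMFVWC")


def peptide_properties(peptide: str) -> Dict[str, float]:
    """Return length, approximate net charge and hydrophobic percentage."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("peptide sequence is empty")
    net_charge = (sum(aa in POSITIVE for aa in seq)
                  - sum(aa in NEGATIVE for aa in seq))
    hydrophobic_pct = 100.0 * sum(aa in HYDROPHOBIC for aa in seq) / len(seq)
    return {
        "length": len(seq),
        "net_charge": net_charge,
        "hydrophobic_pct": hydrophobic_pct,
    }


def short_peptides(peptides: Dict[str, str], max_length: int = 40) -> Dict[str, str]:
    """Keep peptides strictly shorter than max_length residues."""
    return {name: seq for name, seq in peptides.items() if len(seq) < max_length}


# Made-up sequence, for illustration only
props = peptide_properties("KKLLA")
print(props)
```

Histidine and terminal charges are ignored here, so the charge is only a rough screen and not a pH-dependent calculation.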

Peptide Structure Prediction and Evaluation
The function annotation of these novel peptides would be more reliable if 3D structure information were available. Because the peptides lack sequence homologs, their structures were predicted using an ab initio method.
The I-Tasser webserver [13] was used to predict the protein structures through a combined approach of ab initio modelling and fold recognition (threading alignment). The accuracy of the predicted structures was estimated from the I-Tasser C-score (confidence score), which is typically in the range [−5, 2]; a higher score reflects a model of better quality. A previously developed strategy for evaluating peptide structures based on intra-molecular interactions was adopted for this study. The stability centres (SCs) of each peptide were predicted using the SCide program [15]. Stability centres are residues involved in cooperative long-range contacts, which are important in maintaining the stability of the protein. The total energy of a protein was calculated from the bond, angle, torsion, non-bonded and electrostatic terms. The GROMOS force field implemented in DeepView [16] was used to compute the energy of each peptide and to confirm that the modelled structures are energetically stable. Chemically specific interactions (hydrogen bonds, ionic interactions) determine the globular structure of a protein. Hydrophobic interactions, ionic interactions and disulphide bridges were computed using the Protein Interaction Calculator (PIC) webserver [17]. Hydrogen bonds and salt bridges in the peptides were computed using the WHAT-IF server [18]. In addition, non-canonical interactions (C-H···π, C-H···O and N-H···π) were computed using the HBAT program [19].

Result and Discussion
The initial dataset consisted of 3500 intergenic sequences from the 3rd chromosome of D. melanogaster. These sequences were checked for sequence similarity against the NR protein database. Sequences with no homologs were considered unique intergenic sequences and were subjected to six-frame translation. We generated 145 novel peptides, which we named "EKA peptides" (following the naming used by Dhar et al. 2009 [6]) and deposited in the in-house EKA knowledgebase.

Sequence based function annotation
The sequences were analysed to predict the possible functions of the novel peptides. The ProtFun server categorised the biological functions of the novel proteins into energy metabolism, transcription-translation regulation, cell envelope, transport and binding, among others. Gene Ontology prediction assigned the novel peptides to classes such as transporter, growth factor, receptor, immune response and transcription regulation.
Important functional sites of the novel proteins, such as motifs, domains, patterns and profiles, were then identified using the Prosite webserver. The pie chart in Figure 2 describes the profiles and domains predicted from the novel proteins. Many phosphorylation, N-glycosylation, N-myristoylation and amidation sites were identified. In addition, several important sequence profiles such as histidine-rich and cysteine-rich regions were identified (Table 1). Leucine zipper patterns, which are characteristic of DNA-binding proteins, were predicted in 8 novel proteins and could be considered significant findings.
Among the 53 selected structurally stable novel peptides, 12 showed very significant sequence similarity to antimicrobial peptides in the APD database. These peptides were 33-40 AA long, which is comparable to the length of known antimicrobial peptides. A few of them (EKA-26, EKA-31, EKA-36 and EKA-80) showed significant sequence similarity (>35%) to antimicrobial peptides deposited in APD with activity against Gram-positive and Gram-negative bacteria, fungi and some viruses (Table 2).

Peptide Structure prediction and stability evaluation
The tertiary structures of the peptides were predicted using the I-Tasser server, and the C-scores of the model structures are given in Table 3. The C-scores of the peptides ranged from -4 to 2.5. The top 53 of the 145 peptides were selected based on C-score and structural stability. Table 3 reports, for 10 selected peptides, the I-Tasser C-score, total energy, instability index, stability centres and the number of different intra-molecular interactions such as hydrogen bonds and hydrophobic interactions. Figure 3(A, B) shows a few of the predicted EKA peptide structures.
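A selection of this kind could be expressed as a simple filter-and-rank step. The sketch below is an assumption about how such a selection might be implemented, not the criteria actually applied in the study: it keeps models whose instability index is below 40 (the usual ProtParam threshold for a stable protein) and ranks them by C-score. The record type, peptide names and all numeric values are hypothetical.

```python
from typing import List, NamedTuple


class ModelSummary(NamedTuple):
    name: str
    c_score: float            # I-Tasser confidence score (higher is better)
    instability_index: float  # ProtParam instability index


def select_models(models: List[ModelSummary],
                  max_instability: float = 40.0,
                  top_n: int = 53) -> List[ModelSummary]:
    """Keep models below the instability cutoff and return the top_n by C-score."""
    stable = [m for m in models if m.instability_index < max_instability]
    return sorted(stable, key=lambda m: m.c_score, reverse=True)[:top_n]


# Made-up records, for illustration only
models = [
    ModelSummary("PEP_A", c_score=-1.2, instability_index=35.0),
    ModelSummary("PEP_B", c_score=0.8, instability_index=55.0),
    ModelSummary("PEP_C", c_score=0.3, instability_index=22.5),
]
selected = select_models(models, top_n=2)
print([m.name for m in selected])
```

In practice the other stability measures reported in Table 3, such as total energy and the number of stability centres, could be added as further fields and filters.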

Structure based Function Prediction
A homology-based approach was used to infer the possible functions that these proteins may acquire if they were expressed inside the cell. Gene Ontology function annotation and cellular localization prediction for these stable structures were carried out using the Cofactor webserver [20]. The Cofactor analysis reported the molecular and biological features of the selected proteins with reference to their structural homologs. The novel proteins appeared to possess binding affinity to molecules, as predicted by the Gene Ontology term prediction of the I-Tasser webserver (Table 5). The structural superimposition of EKA-36 and GCN4 (PDB: 4nj2) is shown in Figure 3(C).

Conclusion:
Evidence for a large amount of functionally significant sequence in un-expressed 'junk' DNA is steadily increasing. The ENCODE project has claimed to have assigned biochemical function to 80% of the human genome [22]. This indicates that conventional annotation may have missed many important functional transcripts, whether RNA-coding or protein-coding. Recent reports of short peptides or short ORFs [23] from non-genic regions support these findings. Peptide mapping from non-coding DNA appears promising, since successful trials have been reported previously by Dhar et al. 2009 [6]. Inspired by these findings, the present study was designed to develop an in silico approach for predicting potential proteins from the intergenic region of the D. melanogaster genome that may perform some function inside the cell. From the analysis, we were able to derive potential novel peptides from intergenic sequences of the D. melanogaster genome. The 145 predicted peptides with conformational stability point to the potential of these peptides to be expressed in the cell under different conditions. The sequences and structures of the proteins were used to gather information about their potential functions. The in silico functional characterization of these novel peptides revealed important profiles and patterns such as the histidine-rich region profile, the cysteine-rich region profile and the leucine zipper pattern. The predicted DNA-binding capacity of some of these peptides can be investigated further to examine the role of these novel proteins in transcription regulation. The reliability of the predicted models and their expression capability have to be validated experimentally. Another set of peptides was found to be similar to known natural antimicrobial peptides.
This opens a way to produce novel antimicrobial peptides (AMPs) of interest from the intergenic region of the D. melanogaster genome. Since the synthesis of peptides from intergenic regions has already been demonstrated [6], there is good scope for the proposed production of novel AMPs from non-coding DNA. The preliminary findings of the present study open the way for an extensive re-annotation of the non-coding space of the D. melanogaster genome.