FAIR: A server for internal sequence repeats.

UNLABELLED
An Internet computing server has been developed to identify all the occurrences of the internal sequence repeats in a protein and DNA sequences. Further, an option is provided for the users to check the occurrence(s) of the resultant sequence repeats in the other sequence and structure (Protein Data Bank) databases. The databases deployed in the proposed computing engine are up-to-date and thus the users will get the latest information available in the respective databases. The server is freely accessible over the World Wide Web (WWW).


AVAILABILITY
http://bioserver1.physics.iisc.ernet.in/fair/


Background:
With the advent of complete genome sequencing of many organisms, the volume of biological data has been increasing rapidly. Bioinformatics has hence emerged as an indispensable field to organize and assemble the plethora of biological data into a more comprehensive state. For instance, internal sequence repeats (without any gaps) are an integral part of DNA and protein sequences. In DNA sequences, these internal repeats present in multiple copies are referred as 'direct repeat'. For example, the tandem repeats are repeated consecutively and act as markers for genomic sequence [1]. In addition, these markers are used for the mapping of genes which are responsible for human genetic diseases [2]. In eukaryotes, the short DNA tandem repeats present in the telomeric region of chromosome forms the common repeat motif [3]. Moreover, the internal sequence repeats are known to play an important role in evolution and approximately 14% of the protein sequences contain internal sequence repeats [4]. Understanding the sequence to structure correlation of internal repeats in protein structures would enhance the knowledge of researchers to further march forward in the world of molecular modeling and in-silico drug discovery. Furthermore, eukaryotic proteins are three times more likely to contain internal sequence repeats than prokaryotic proteins [4] as these eukaryotic repeats have unique functions. Thus, it is essential to retrieve the information pertaining to internal sequence repeats in an efficient manner which is useful to the research community working in the field of structural biology, molecular modelling and genomics. To this end, we have developed a web server to obtain all the internal sequence repeats in both protein and nucleotide sequences by exploiting the utilities of high-end computing and World Wide Web technologies.
The limitation of the existing programs has already been outlined earlier [5]. For example, the number of amino acid residues allowed to be entered as input is limited to 1000 in the RADAR web server [6] and 2000 in the TRUST web server [7]. On the other hands, REPPER [8] does not pose a limitation on the number of amino acid residues; however, it allows only one protein sequence to be entered at a given time. Further, the recently available web servers SWELFE [9] and Censor [10] have their own limitations. The SWELFE program permits a single sequence and uses substitution matrix for its search. In the Censor program, repeats scanning are made with the help of reference repeat database, which makes restriction in finding of non annotated internal repeats. To overcome the above issues, FAIR (Finding All Identical Repeats) web server has been deployed to identify the internal sequence repeats in protein as well as DNA sequences. Moreover, the proposed web server FAIR has been tested with a protein sequence of more than 35000 residues available in the genome database. Also, the proposed server accepts many sequences at a time. The advanced features of the proposed web server are discussed in detail in the subsequent sections.

Methodology:
The proposed web server has been implemented with two different algorithms for the identification of identical is interfaced for visualizing the three-dimensional structures.

Features:
The proposed computing server requires the sequence of interest and the minimum number of amino acids or nucleotides in a repeat. Further, the users have to specify the input sequence type (Protein or DNA). Also, users can specify the number of occurrences of a repeat. The users need to paste the sequence(s) in FASTA format or upload the sequence(s) from a file. There are options provided to search for either the identical or similar sequence repeats in a protein sequence. Incase of DNA sequences, the above option is limited to identical match only. The server produces a detailed display with a brief summary at the end of the output. The user can save the output produced by the computing server in the hard-disk of their local machine for further analysis.

Implementation:
The computing engine is developed and optimized for Fedora Core (Version 7.0) and is driven by a -2.66 GHz Quad-core Intel Xeon processor equipped with 4 Gigabytes of Random Access Memory. The operating system is chosen for security and reliability. The computing engine is tested vigorously using all platforms (windows 98/2000, XP, Linux and SGI). As mentioned, for the sequence provided in the case study, the web server displays the results in about two seconds for a single sequence (number of residues = 1436). However, the computation time varies depending upon the network speed. The computing engine is developed using PERL. In addition, Java script is deployed to develop and validate the web pages. For designing the web pages, CSS and HTML have been used. The algorithm used in this computing engine is written in C++ language. The described computing engine is freely available for academic users and noncommercial organizations over the World Wide Web at the URL: http://bioserver1.physics.iisc.ernet.in/fair/. The users are requested to cite this article in their research publications. Please send your comments and suggestions to Dr. K. Sekar (sekar@physics.iisc.ernet.in).

Case study:
The FAIR web server enables the users to identify the internal sequence repeats in the protein and nucleotide sequences. Following are the various options that recognizes FAIR web server as unique. Figure 1 displays a sample output for an input (FASTA format) protein (probable polypeptide synthase pks12) sequence from Mycobacterium tuberculosis H37Rv. Identification of repeats in polyketide synthase would be of immense importance as it is an important target for drug designing. The total number of amino acid residues in the input sequence is 4151 (only part of the input is shown in Figure 1). The minimum number of residues in the repeat is set to be greater than or equal to 120. The results produced by the computing server are shown in Figure 1. The computing engine found three long identical repeats of 215, 128 and 307 length (Figure 1). The results of the proposed server are compared with the SWELFE server for the above input sequence and are reported in Table 1 (see supplementary material). It is interesting to note that the proposed server identified many identical internal repeats whereas the SWELFE reported only similar repeats (see Table 1 in supplementary material for details).

Identical repeat in nucleotide sequence
The complete genome of Mycobacterium tuberculosis H37Rv has been given as the input. The total number of nucleotides in the input sequence is 4411532 (here again, only part of the input sequence is shown in Figure 2). For a minimum number of nucleotides in the repeat as 50, the sample output has been displayed in Figure 2. These significant large identical repeats would enable scientists to focus on sequence to structure correlation and their implications.

Similar repeats in protein sequence
The given input sequence is taken from the protein Hexokinase 1[Homo sapiens], PDB Id: 1CZA [17]. The protein consists of 917 amino-acid residues and for a minimum number of residues as ten the output obtained is shown in Figure 3. Hexokinase I, a protein found in all mammalian tissues plays an important role as a "house keeping enzyme". Further, insights into the occurrence of such structurally similar repeats would enhance the understanding on the threedimensional structure of Hexokinase I.

Conclusion:
The proposed computing server displays all identical and similar repeats in the provided input sequence. There is no limitation on the number of residues in the input sequence. The interesting feature is that the user will be able to find the occurrence of the resultant repeat in other sequence and structure databases, if any such repeat is present. Further, the three-dimensional structure of the corresponding repeat (identical or similar) can be visualized in the local machine. The computing engine will be an interactive tool to those working in the area of structural biology, molecular modeling and Bioinformatics.