Prediction and analysis of paralogous proteins in Trichomonas vaginalis genome.

Trichomonas vaginalis causes trichomoniasis, the second most common sexually transmitted disease. The draft genome sequence of T. vaginalis, published by The Institute for Genomic Research (TIGR), reveals an abnormally large genome of 160 Mb. It was speculated that a significant portion of the proteome consists of paralogous proteins. The present study aimed to identify and analyze these paralogous proteins using an all-against-all search approach. The protein dataset was retrieved from the TIGR and TrichDB FTP servers. The BLASTP program was used to perform all-against-all searches against the T. vaginalis protein database available at the NCBI genome database. About 50,000 proteins were searched, of which 2,700 were found to be paralogous under rigid selection criteria. A Pfam database search assigned a significant number of these paralogous proteins to Pfam entries: 1496 paralogous proteins fell into Pfam families, 1027 contained Pfam domains, 60 contained repeats, and 1092 paralogous protein sequences belonged to clans. Such identification and functional annotation of paralogous proteins will also help in removing paralogous proteins from the list of possible drug targets in future. The presence of a huge number of paralogous proteins across a wide range of gene families and domains may be one of the mechanisms involved in T. vaginalis genome expansion and evolution.


Background:
Trichomonas vaginalis is a unicellular, anaerobic, flagellated protozoan [1]. Infection with T. vaginalis causes trichomoniasis, the most common nonviral and second most common sexually transmitted disease (STD), resulting in more than 250 million infections in women each year worldwide [2]. T. vaginalis is transmitted mostly by sexual contact. Adverse consequences for women with trichomoniasis include an enhanced risk of human immunodeficiency virus transmission [3]; other complications resulting from infection are cervical cancer and adverse pregnancy outcomes [4].

The recently published draft genome sequence of T. vaginalis by The Institute for Genomic Research (TIGR) reveals an abnormally large genome of 160 Mb, which is ten times the previously predicted size of this genome [5]. It is still not clear why T. vaginalis possesses such a large genome, or how such massive gene expansion happened. Two mechanisms may be responsible for large-scale genome expansion: lateral gene transfer and large-scale gene duplication events. Lateral transfer is the process by which genetic information is passed from one genome to an unrelated genome, where it is stably integrated and maintained [6]. This genome is bigger than those of many other medically important protists but is characteristic of trichomonads. One reason for the large Trichomonas genome is the presence of hundreds of DNA transposons [7]. In gene duplication, by contrast, a non-functional copy of a gene is incorporated into the host genome. Many protein families underwent massive duplication. Pseudogenes are DNA sequences that were derived from a functional copy of a gene but have acquired mutations that are deleterious to function. The duplicated copy of the original functional gene is incorporated into a new chromosomal location, which may lead to expansion of the existing gene family [8].

The genome also provides a platform to construct and analyze important signaling, secretory and metabolic pathways in order to identify and validate novel targets, which can be exploited to design new drug molecules. Sequence similarity search methods provide some insight into putative functions for most gene products. A huge number of pseudogenes was thought to be present in T. vaginalis due to massive gene duplication. TIGR predicted about 50,000 genes in T. vaginalis but did not report on pseudogenes. It was speculated that a significant portion of the 50,000 genes might be pseudogenes. Proteins generally comprise one or more functional regions, commonly termed domains. The aims of the study were: (i) identification of paralogous proteins, (ii) prediction of families, domains and repeats of the identified paralogous proteins and (iii) investigation of the role of paralogous proteins in the genome expansion and evolution of T. vaginalis.

Around 50,000 proteins in FASTA format retrieved from the database were used to carry out the all-against-all database searches with the genomic BLASTP available at the NCBI server [10]. In an all-against-all search, every predicted protein sequence is used as a query in a similarity search against a database composed of the rest of the organism's own proteome, and significant matches are identified by a low E-value. The T. vaginalis proteome database is available at NCBI. Only matches with a very low E-value (reported as 0) were retained. Because many proteins comprise different combinations of a common set of domains, only pairs whose alignment covered more than 80% of the length of both the query and the subject were selected [11]. After this filtering, only alignments with more than 60% sequence identity were retained.
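The selection criteria described above (very low E-value, more than 80% alignment coverage of both query and subject, more than 60% identity) can be sketched as a filter over tabular BLAST output. This is only an illustration under stated assumptions: the study used the NCBI web server, whereas the sketch assumes tab-separated hits with the columns qseqid, sseqid, pident, qstart, qend, qlen, sstart, send, slen, evalue. The function name, the column order and the example protein identifiers are hypothetical and do not come from the paper.

```python
# Minimal sketch of the paralog selection filter (assumed input layout, see above).
# Assumed columns: qseqid sseqid pident qstart qend qlen sstart send slen evalue

def select_paralog_pairs(lines, min_identity=60.0, min_coverage=0.80, max_evalue=0.0):
    """Return sorted, de-duplicated (protein, protein) pairs that pass all criteria."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        query, subject = f[0], f[1]
        if query == subject:
            continue  # a protein always matches itself; self-hits are not paralogs
        identity = float(f[2])
        qstart, qend, qlen = int(f[3]), int(f[4]), int(f[5])
        sstart, send, slen = int(f[6]), int(f[7]), int(f[8])
        evalue = float(f[9])
        # Fraction of each sequence covered by the aligned region.
        qcov = (abs(qend - qstart) + 1) / qlen
        scov = (abs(send - sstart) + 1) / slen
        if (evalue <= max_evalue and identity > min_identity
                and qcov > min_coverage and scov > min_coverage):
            pairs.add(tuple(sorted((query, subject))))  # A-B and B-A count once
    return sorted(pairs)


# Toy hits with hypothetical identifiers A-D.
example_rows = [
    "A\tA\t100.0\t1\t300\t300\t1\t300\t300\t0.0",   # self-hit, ignored
    "A\tB\t75.0\t10\t290\t300\t5\t285\t310\t0.0",   # passes all criteria
    "B\tA\t75.0\t5\t285\t310\t10\t290\t300\t0.0",   # reciprocal hit, same pair
    "A\tC\t55.0\t1\t300\t300\t1\t300\t300\t0.0",    # identity too low
    "A\tD\t90.0\t1\t100\t300\t1\t100\t120\t0.0",    # query coverage too low
    "C\tD\t90.0\t1\t300\t300\t1\t300\t300\t1e-50",  # E-value above threshold
]
pairs = select_paralog_pairs(example_rows)
paralogous_proteins = sorted({p for pair in pairs for p in pair})
```

Requiring coverage on both query and subject is what prevents two proteins that merely share one common domain from being counted as paralogs, which is the motivation given in the text.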

Prediction of families, domain and repeats in paralogous proteins:
For functional annotation and to investigate gene family expansion, the identified set of paralogous proteins was searched against protein families using the Pfam search. The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs). The paralogous protein dataset was submitted to the Pfam server, which predicted the protein families, motifs, repeats and clans at the default Pfam parameters (http://pfam.sanger.ac.uk/) [12].

Results and Discussion:
After applying rigid selection criteria to the BLASTP search (very low E-value, >60% sequence identity and >80% alignment length), 2700 protein sequences were found to be paralogous proteins and around 47,200 proteins were identified as non-paralogous, as they did not match any other protein of the proteome under these criteria. The protein families, domains, repeats and clans of the paralogous proteins were identified with the help of the Pfam sequence search. In total, 1496 paralogous proteins were found in different Pfam families (collections of related proteins), 1027 sequences contained different Pfam domains (structural units which can be found in multiple protein contexts), 3 sequences had a Pfam motif (a short unit found outside globular domains) and 60 proteins contained different Pfam repeats (short units which are unstable in isolation but form a stable structure when multiple copies are present) (Table 1).

A large number of pseudogenes has already been reported in many protein families, for example ankyrin repeat proteins, hypothetical proteins, conserved hypothetical proteins, adenylate cyclase, vsaA, surface antigen BspA, ANK repeat proteins, CG1651-PD related proteins, ABC transporter proteins, kinases, major facilitator superfamily proteins, leucine rich repeat family proteins, and transmembrane amino acid transporter proteins [7, 13]. These pseudogenes may play an active role in the formation of paralogous proteins. New gene functions are thought to be gained by duplication of an existing gene, creating different tandem copies. Functional differentiation then occurs between the copies by mutation and selection. We found 2700 paralogous proteins distributed across a wide range of protein families, domains, clans and repeats. This clearly reflects that many protein families underwent massive duplication in the T. vaginalis genome. The expansion of genetic material and the amplification of specific genes may be examples of adaptations of T. vaginalis during its transition from an enteric environment (the habitat of most trichomonads) to a urogenital environment [5, 14]. We hope that a larger survey of individual duplicated protein families, together with more experimental data on the paralogous proteins, will shed light on biological issues such as how the genes were duplicated and their evolutionary histories.

The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in a genome. Identifying the domains present in a protein can provide insights into the function of that protein. Such identification of paralogous proteins and their functional annotation will not only give insight into the biological mechanisms of the genome but also help in the identification of novel drug targets. The identified paralogous proteins can be excluded from the list of possible drug targets, as paralogous proteins represent non-functional products of duplicated genes, known as pseudogenes [15].

ISSN 0973-2063 (online) 0973-8894 (print). Bioinformation 6(1): 31-34 (2011). © 2011 Biomedical Informatics
The identified paralogous proteins and their sequences in FASTA format can be retrieved from http://trichdb.org/trichdb/ using the T. vaginalis protein accession numbers for future analysis. The amino acid sequences of the hypothetical proteins encoded by the predicted genes can be used as queries against the protein sequence databases in a database similarity search. A match of a predicted protein sequence to one or more database sequences not only serves to identify the gene function, but also validates the gene prediction. The genome sequence can further be annotated with information on gene content and predicted structure, gene location, and functional predictions [16].
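The Pfam categorization reported in the Results (numbers of paralogous proteins with a family, domain, motif or repeat hit) amounts to counting distinct proteins per Pfam entry type. The sketch below shows one way such a tally might be computed. It assumes the Pfam results have already been parsed into (protein, Pfam entry, entry type) triples; the function name and all identifiers in the example are hypothetical placeholders, not results from the study.

```python
from collections import defaultdict

def count_proteins_by_pfam_type(hits):
    """Count distinct proteins having at least one hit of each Pfam entry type.

    hits: iterable of (protein_id, pfam_entry, entry_type) triples, where
    entry_type is a label such as "Family", "Domain", "Repeat" or "Motif".
    """
    proteins_by_type = defaultdict(set)
    for protein_id, _pfam_entry, entry_type in hits:
        proteins_by_type[entry_type].add(protein_id)
    # A protein with several hits of the same type is counted once for that type.
    return {entry_type: len(ids) for entry_type, ids in proteins_by_type.items()}


# Toy input with hypothetical protein and Pfam identifiers.
example_hits = [
    ("prot1", "entry_A", "Family"),
    ("prot2", "entry_A", "Family"),
    ("prot2", "entry_B", "Domain"),
    ("prot3", "entry_C", "Repeat"),
    ("prot3", "entry_C", "Repeat"),  # second copy of the same repeat
]
counts = count_proteins_by_pfam_type(example_hits)
```

In this sketch a protein with hits of two different types is counted under both, so the per-type counts need not add up to the number of paralogous proteins. Whether the published counts were computed this way is not stated in the text.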

Conclusion:
Collectively, these data suggest the presence of a very large number of paralogous proteins in the unicellular eukaryote T. vaginalis. The presence of paralogous proteins across a wide range of protein families, domains, repeats, clans and motifs reflects large-scale gene duplication events leading to gene family expansion. The identification of paralogous proteins indicates a possible role of gene duplication in the evolutionary expansion of the T. vaginalis genome because organisms considered to be deep-branching have both paralogs. For further investigation, the paralogous proteins can be subjected to cluster analysis in order to identify the most closely related groups of proteins.

Table 2: Predicted Pfam domains in paralogous proteins
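The cluster analysis proposed in the Conclusion is not specified further. As one simple possibility, paralogous pairs could be grouped by single-linkage clustering, where two proteins end up in the same group whenever a chain of pairwise paralog relationships connects them. The sketch below implements this with a union-find structure. The choice of single linkage, the function name and the example identifiers are assumptions made for illustration and are not taken from the study.

```python
def cluster_paralogs(pairs):
    """Group proteins into single-linkage clusters from (protein, protein) pairs."""
    parent = {}

    def find(x):
        # Follow parent links to the cluster representative, halving paths on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    # Largest clusters first; ties broken alphabetically for reproducible output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Toy pairs with hypothetical identifiers.
example_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = cluster_paralogs(example_pairs)
```

Single linkage tends to merge large families through chains of intermediate members, so a stricter method might be preferable when the goal is to find only the most closely related groups.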