Comparative genome analysis of Solanum lycopersicum and Solanum tuberosum

Solanum lycopersicum (tomato) and Solanum tuberosum (potato) are agriculturally important crop species: between them they are rich sources of starch, protein, antioxidants, lycopene, beta-carotene, vitamin C, and fiber. The genomes of S. lycopersicum and S. tuberosum are currently available. However, the linear strings of nucleotides that together comprise a genome sequence are of limited significance by themselves. Computational and bioinformatics approaches can be used to exploit these genomes for fundamental research and for improving crop varieties. Comparative genome analysis and Pfam analysis of the paralogous proteins predicted from the reviewed protein sets were performed. S. lycopersicum proteins were found to belong to more families, domains and clans than those of S. tuberosum. The regions conserved between the two genomes were mostly intergenic, followed by exons, introns and UTRs. These results can be exploited to predict regions that are similar between the genomes and to study the evolutionary relationship between them, leading towards the development of disease-resistant, stress-tolerant and otherwise improved varieties of tomato.


Background:
The Solanaceae are an important family in agriculture, as they include major crops such as Solanum lycopersicum (tomato), Solanum tuberosum (potato) and Nicotiana tabacum (tobacco). Tomato fruits are the second most consumed vegetable after potatoes, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. Potato contributes to dietary intake of starch, protein, antioxidants, and vitamins. In addition to its agricultural value, and owing to its diploid genetics and inbreeding potential, tomato is a widely used model species for fundamental research on subjects including fruit development and pathogen response [1].
Developments in sequencing technologies are providing genome sequences for a growing number of species. Deciphering a genome sequence, that is, determining the linear order of nucleotides for each chromosome in the genome, allows molecular biologists to understand and manipulate this blueprint. For plants in particular, this in turn enables breeders to engineer solutions for crop improvement more efficiently, in response to the growing demand for food and energy from modern society [2].
Draft genome sequences of tomato and potato are now available in plant databases. The nuclear genomes of potato and tomato each consist of twelve chromosomes, and are estimated to measure approximately 840 Mb and 950 Mb in size, respectively [3][4][5].
The availability of their genome sequences will provide the community with a first glimpse into genome evolution of Solanaceae (and Asterids in general) and will impact both fundamental research and breeding strategies in these species for the coming years.
The aim of the present work was to predict paralogous proteins in the tomato proteome and to carry out a comparative genome analysis of tomato and potato, in order to uncover genomic features of the two genomes and to gain insight into the similarities and differences between them.

Methodology:
Genomic data for S. lycopersicum are available at NCBI, EMBL, DDBJ and KEGG. The nucleotide and amino acid data were retrieved in FASTA format from the FTP server. These databases and tools are freely available for computational analysis.
The Sol Genomics Network (http://solgenomics.net) is a comparative genomics database and platform for Solanaceae species.
Computational tools are required for data processing, visualization, interpretation and interrogation in order to analyze the flood of new sequence data being produced. The comparison of the tomato and potato genomes was performed using the VISTA server. VISTA (http://genome.lbl.gov/vista/index.shtml) is a comprehensive suite of programs and databases for comparative analysis of genomic sequences [6].
The genomic data retrieved from the above servers were used for the selected objectives and analyzed with the help of different computational tools, software and online servers.

Prediction of Paralogous Proteins in S. lycopersicum and S. tuberosum Genomes
The reviewed sets of protein sequences of S. lycopersicum and S. tuberosum were retrieved from the UniProt database in FASTA format. All-against-all database searches with BLASTP, available at the NCBI server, were used to predict paralogous proteins in each selected set of protein sequences [7][8].
In an all-against-all search, every predicted protein sequence is used as a query in a similarity search against a database composed of the rest of the same proteome, and significant matches are identified by a low E-value. Since many proteins comprise different combinations of a common set of domains, only alignments covering more than 80% of the length of both the query and the subject were retained. After this filtering step, only alignments with a sequence identity of more than 60% were selected.
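The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run on the NCBI server: it assumes the hits have been exported as a tab-separated table with the columns qseqid, sseqid, pident, qstart, qend, qlen, sstart, send, slen and evalue, and it uses an E-value cutoff of 1e-5, which is a placeholder because the text only specifies "a low E-value". The function name filter_paralog_hits and the demo identifiers are hypothetical.

```python
import csv
import io

# Assumed column layout of the tab-separated hit table (not given in the text).
FIELDS = ["qseqid", "sseqid", "pident", "qstart", "qend", "qlen",
          "sstart", "send", "slen", "evalue"]


def filter_paralog_hits(handle, min_cov=0.80, min_ident=60.0, max_evalue=1e-5):
    """Return unordered protein pairs that pass the coverage and identity filters.

    max_evalue is an assumed placeholder; the 80% coverage and 60% identity
    thresholds follow the criteria stated in the text.
    """
    pairs = set()
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if row["qseqid"] == row["sseqid"]:
            continue  # skip a protein matching itself
        # Fraction of each sequence spanned by the alignment.
        q_cov = (int(row["qend"]) - int(row["qstart"]) + 1) / int(row["qlen"])
        s_cov = (int(row["send"]) - int(row["sstart"]) + 1) / int(row["slen"])
        if (float(row["evalue"]) <= max_evalue
                and q_cov > min_cov
                and s_cov > min_cov
                and float(row["pident"]) > min_ident):
            # Sort the two identifiers so A-B and B-A count as one pair.
            pairs.add(tuple(sorted((row["qseqid"], row["sseqid"]))))
    return pairs


# Made-up demo hits, for illustration only.
DEMO_ROWS = [
    ("P1", "P1", "100.0", "1", "200", "200", "1", "200", "200", "0.0"),
    ("P1", "P2", "75.0", "5", "195", "200", "1", "190", "210", "1e-80"),
    ("P1", "P3", "90.0", "1", "100", "200", "1", "100", "100", "1e-40"),
    ("P2", "P4", "55.0", "1", "210", "210", "1", "205", "205", "1e-30"),
    ("P2", "P1", "75.0", "1", "190", "210", "5", "195", "200", "1e-80"),
]
DEMO = "\n".join("\t".join(r) for r in DEMO_ROWS)

print(filter_paralog_hits(io.StringIO(DEMO)))
```

In the demo, the P1 and P3 hit is rejected because it covers only half of the query, and the P2 and P4 hit is rejected on identity, so a single candidate pair remains.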

Families, domains and repeats for paralogous protein sequences in S. lycopersicum and S. tuberosum
For the purpose of functional annotation, and to investigate gene family expansion, the identified set of paralogous proteins was searched against the Pfam protein families database. Each Pfam family is represented by multiple sequence alignments and hidden Markov models (HMMs) [9]. The paralogous protein dataset was submitted to the Pfam server (http://pfam.sanger.ac.uk/), which predicted the protein families, motifs, repeats and clans using the default Pfam parameters.
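Once Pfam hits are available, the numbers of families, domains, repeats and clans can be tallied with a small script. The sketch below is an assumption about how such a summary could be produced, not the Pfam server's own output: it takes hand-built tuples of (protein, Pfam entry, entry type, clan), with None where no clan is assigned. The function name summarize_pfam and all identifiers in the demo are placeholders.

```python
from collections import defaultdict


def summarize_pfam(rows):
    """Count distinct Pfam entries per type, distinct clans, and clan-less proteins.

    rows: iterable of (protein, entry, entry_type, clan) tuples; clan is None
    when the entry belongs to no clan. This input layout is hypothetical.
    """
    entries_by_type = defaultdict(set)
    clans = set()
    proteins = set()
    proteins_with_clan = set()
    for protein, entry, entry_type, clan in rows:
        proteins.add(protein)
        entries_by_type[entry_type].add(entry)
        if clan:
            clans.add(clan)
            proteins_with_clan.add(protein)
    summary = {etype: len(entries) for etype, entries in entries_by_type.items()}
    summary["Clan"] = len(clans)
    # Proteins none of whose hits fall in any clan.
    summary["proteins_without_clan"] = len(proteins - proteins_with_clan)
    return summary


# Placeholder hits, for illustration only.
DEMO_HITS = [
    ("prot1", "FAM_A", "Family", "CLAN_1"),
    ("prot1", "DOM_B", "Domain", None),
    ("prot2", "DOM_B", "Domain", None),
    ("prot3", "REP_C", "Repeat", "CLAN_2"),
    ("prot4", "FAM_A", "Family", "CLAN_1"),
]

print(summarize_pfam(DEMO_HITS))
```

Running the same summary separately on the tomato and potato hit tables would give the per-species counts that are compared in the results.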

Results and Discussion:
After performing the all-against-all searches for all reviewed protein sequences of S. lycopersicum and S. tuberosum, 60 paralogous proteins were found in S. lycopersicum and 110 in S. tuberosum. All predicted paralogous proteins of S. lycopersicum and S. tuberosum can be retrieved by using the accession numbers given in Table 1 & 2 (see supplementary material). The predicted paralogous proteins belong to different families and have different domains and repeats.

Pfam analysis of S. lycopersicum and S. tuberosum protein sequences
Most of the identified proteins in S. lycopersicum and S. tuberosum belong to different families, domains and clans (Table 3, see supplementary material), although some proteins are not assigned to any clan (Figure 1). Proteins contain functional units known as domains, and different combinations of domains give rise to different proteins. Identification of the domains in a protein is therefore essential for gaining insight into its function. Pfam also generates higher-level groupings of related families, known as clans. That proteins belong to different families, domains and clans may reflect protein family expansion and adaptation in these genomes [10].
S. lycopersicum proteins were found to belong to more families, domains and clans than those of S. tuberosum.

Comparative genomics of Solanum lycopersicum and Solanum tuberosum
The genomic regions of S. lycopersicum and S. tuberosum were compared. The two genomes were found to contain both conserved and non-conserved regions, and the genomic composition and the level of conservation differ from one region to another.
The conserved regions of the two genomes were mostly intergenic, followed by exons, introns and UTRs (untranslated regions) (Figure 2). Introns are found in the genes of most organisms and many viruses, and can be located in a wide range of genes.
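A ranking of this kind can be reproduced from a list of conserved regions labelled by annotation class. The sketch below assumes such a list is available as (start, end, class) tuples with inclusive coordinates. That layout, the function name rank_conserved_classes and the demo coordinates are all hypothetical and do not come from the VISTA output used in the study.

```python
from collections import Counter


def rank_conserved_classes(regions):
    """Rank annotation classes by the number of conserved regions they contain.

    regions: iterable of (start, end, region_class) with inclusive coordinates.
    Returns a list of (region_class, region_count, total_length) tuples,
    ordered from the most to the least frequent class.
    """
    counts = Counter()
    lengths = Counter()
    for start, end, region_class in regions:
        counts[region_class] += 1
        lengths[region_class] += end - start + 1
    return [(cls, n, lengths[cls]) for cls, n in counts.most_common()]


# Made-up conserved regions, for illustration only.
DEMO_REGIONS = [
    (100, 299, "intergenic"), (400, 549, "exon"), (700, 899, "intergenic"),
    (1000, 1099, "intron"), (1200, 1349, "exon"), (1500, 1699, "intergenic"),
    (1800, 1879, "UTR"), (2000, 2119, "exon"), (2200, 2289, "intron"),
    (2500, 2599, "intergenic"),
]

for cls, n, total in rank_conserved_classes(DEMO_REGIONS):
    print(f"{cls}\t{n} regions\t{total} bp")
```

Reporting the total length alongside the count is a design choice: a class with many short conserved regions can rank differently by base pairs than by number of regions.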
An intergenic region (sometimes also referred to as junk DNA) is a stretch of DNA sequence located between genes. The function of much of this sequence is still unknown, but intergenic regions do contain functionally important elements such as promoters and enhancers, and can therefore be involved in the regulation of gene expression.
The comparative alignment of genomic regions of S. lycopersicum and S. tuberosum revealed regions in which only conserved sequence is present in the two genomes (Figure 3). There are also regions in which conserved sequence, untranslated regions (UTRs) and exons occur together without any non-aligned region (Figure 4). Non-aligned genomic regions are also found in the alignment of the two genomes (Figure 5).
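Conservation along a pairwise alignment is commonly summarized as percent identity in a sliding window, with windows above a chosen threshold reported as conserved. The following sketch illustrates that general idea only. The window size, the handling of gaps and the function name window_identity are assumptions, and the sketch is not a description of the VISTA server's algorithm or of the parameters used in this study.

```python
def window_identity(aligned_a, aligned_b, window=100):
    """Percent identity in every window of a gapped pairwise alignment.

    aligned_a, aligned_b: equal-length aligned sequences using '-' for gaps.
    A column counts as a match only if both residues are equal and not a gap.
    Returns one percentage per window start position.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [a == b and a != "-"
               for a, b in zip(aligned_a.upper(), aligned_b.upper())]
    n_windows = len(matches) - window + 1
    return [100 * sum(matches[i:i + window]) / window
            for i in range(max(n_windows, 0))]


# Toy alignment with a tiny window so the result is easy to check by hand.
DEMO_A = "ACGTACGT--AC"
DEMO_B = "ACGTTCGTAAAC"

print(window_identity(DEMO_A, DEMO_B, window=4))
```

In the toy alignment the first window is fully identical, while the windows overlapping the gap drop to 50%, which is how non-aligned or weakly conserved stretches would stand out.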
Once the elements in a genome sequence have been identified, the next step is to assign to them a plausible biological function. Computational inference of the function of a particular sequence can be achieved either directly through sequence similarity searches, or indirectly through the identification of common motifs or domains between groups of functionally related sequences.
The large number of intergenic regions may be due to a higher repeat content in the tomato genome than in the potato genome. Many protein families represent large gene superfamilies in plants, and these genes are involved in the biosynthesis of secondary metabolites [11][12].
Alignments between genome sequences of multiple accessions or varieties of a single species allow for the study of genome diversity, evolution and insertion/deletion polymorphisms (InDels). Moreover, alignments between the genomes of related species, for example from the same genus, can be generated to identify structural variation such as translocations and inversions. The sequence variation identified by both approaches can be used to study the evolution of genomes and to generate molecular markers that can be exploited to screen large populations [13][14].
The general availability of genome sequences for crop plant species is having a tremendous impact on the genetics and breeding of these organisms. Future comparative sequence analyses of the completed tomato and potato genome sequences will address many of the unresolved questions related to genome-wide profiles of specific multigene families [15].