Data enabled prediction analysis assigns folate/biopterin transporter (BT1) family to 36 hypothetical membrane proteins in Leishmania donovani

Leishmaniasis is a neglected tropical disease caused by the pathogenic protozoan Leishmania donovani and it is transmitted by an infected sand fly. Approximately 0.4 million cases of Visceral Leishmaniasis are reported across the globe every year, of which 67% is from the Indian subcontinent. The currently available drugs have not been effective owing to their high toxicity levels, inadequate specificity, drug resistance, extended treatment periods and/or prohibitive prices. For this reason, hypothetical proteins in this pathogen, which constitute about 67% of its proteome, must be distinctly characterized and studied for their potential role as drug targets for Leishmaniasis. Domain information from PFAM and functional information from GO has been used to assign putative functions to 36 hypothetical membrane proteins in this protozoan. Furthermore, as a case study, we have performed a thorough sequence level characterization of a hypothetical protein E9BPD7 from the BT1 family of membrane proteins that transports folate/biopterin. Phylogenetic analyses of E9BPD7 have revealed interesting evolutionary correlations to BT1 family and MFS superfamily, which have significant roles in a number of diseases and drug resistance pathways.


Figure 1: Protocol used in sequence processing
Hypothetical proteins define the unknown elements of the proteomes in biological systems. In several pathogenic systems, the analysis of hypothetical proteins has provided insights to the molecular function of these proteins and led to potential therapeutic targets [4][5]. The growing need for identification of newer drug targets can thus be addressed by considering the hypothetical proteome of the etiological agent Leishmania donovani. Membrane proteins form the connecting interface between the intracellular and the extracellular environments. They are also involved in mediating the exchange of molecules and cell communication. Additionally, there is little information about the nature of these bio molecules and their role in communication with extracellular environment. Thus, identification of cell surface proteins and transmembrane proteins would help in determining crucial systems that could be explored as potential targets [6][7]. Several in silico studies on membrane proteins have helped understanding the role of this class of proteins. Using computational approaches, 7 novel outer membrane proteins and other non-paralogous proteins were identified in Edwardsiella tarda, a fish pathogen that infects Daniorerio (zebra fish) [8]. In another study, the hypothetical membrane proteins in Trypanosoma cruzi were characterised and 54 proteins involved in signal transduction processes were meticulously investigated following the identification using computer aided computer models [6]. Also, an in silico analysis of the Neisseria meningitides sero-group B proteome revealed 9 outer membrane proteins, among other potential drug targets, that could be potential vaccine candidates towards treatment of Meningococcal disease [9].
Furthermore, computational annotation of hypothetical proteins has been extensively carried out in Mycobacterium tuberculosis (Mtb) using homology information from databases like COG and GO [10].
Homology based fold predictions have also been widely used approaches to computationally annotate and attribute functional domain information to proteins. Functional domains were assigned to 64 proteins using homology and fold prediction approaches in Mycobacterium tuberculosis [11]. Similarly, using various approaches like homology, structure and fold prediction; 219 structural folds were computationally annotated to hypothetical proteins within M. tb [12]. Hence, a comprehensive in silico analysis of the hypothetical membrane proteins of Leishmania donovani was carried out to understand their roles in the biology of this notorious protozoan.  P:response to unfolded protein; P:macromolecule biosynthetic process; P:response to light stimulus; P:proteasomal protein catabolic process; P:cell growth; P:response to endoplasmic reticulum stress; P:single-organism carbohydrate metabolic process; P:positive regulation of cellular process; P:root hair cell differentiation; P:developmental growth involved in morphogenesis; P:single-organism intracellular transport C:endoplasmic reticulum membrane; F:protein binding; P:proteolysis; P:response to red or far red light; P:cell tip growth; P:protein transport; P:carbohydrate biosynthetic process; P:cellular response to unfolded protein; P:response to endoplasmic reticulum stress; P:cellular protein metabolic process; P:single-organism transport; P:intracellular transport; P:root hair elongation; P:regulation of cellular process; P:multi-organism process   Results: Figure 1 shows the flowchart depicting the methodology adopted in this study. 5,299 sequences belonging to Leishmania donovani (strain BPK282A1) and termed as 'uncharacterised' were retrieved from UNIPROT database. a coverage cut-off of 70% were used as primary criteria to select true positives, which resulted in 127 sequences finding homologs in the Swissprot database. GO mapping using Blast2GO tool was performed for these 127 sequences which resulted in 36 sequences having a membrane term in the Cellular component of the GO annotation. Thus, 36 uncharacterized proteins, which were predicted to have membrane topology using HMMTOP and TMHMM, were associated to a PFAM domain covering more than 50% of the query's length and assigned with a GO term using Blast2GO tool.

Materials and Methods
In the present study, we have utilized the data repository in public databases and bioinformatics tools to assign putative functions to the hypothetical membrane proteins in Leishmania donovani. Amongst the 7,960 sequences that make up the proteome 5,299 proteins in this protozoan are termed uncharacterized/hypothetical. This amounts to a monumental figure of 65% of the proteome that is not associated with any functional information. As detailed in Figure 1 and exercises thereof, 36 sequences have been characterized with domain information from PFAM and functional annotation was given from GO. All the information related to GO terms, Inter proscan domain associations, homologs identified during the BLAST step of Blast2GO analysis for these 36 sequences are presented in Table 1.
In the first step of Blast2Go, a BLAST search is performed to identify sequence homologs to our query. These sequence homologs are considered for attributing GO terms to the query in further steps. The sequence similarity distribution amongst the BLAST hits obtained in the first step of Blast2Go analysis indicated no BLAST hits were present with less than 30% sequence similarity with respect to the query (data not shown). Also, only hits with significant E-values were considered for further analysis (data not shown). Figure 2A-D represents the combined graphs depicting the biological processes, molecular functions and cellular components of these 36 sequences respectively. As may be appreciated from Figure 2 A-D, a sequence may have association to more than one GO term. Further, these 36 sequences were associated to a COG family using STRING-DB search. 35 of the 36 sequences found a COG family while E9BPD7 did not find any COG family. Therefore, a detailed sequence characterization of E9BPD7 was taken up as a case study.  We re-constructed the phylogenetic tree described in [28] using Maximum likelihood (ML) method with 2 different tools. 176 valid sequences belonging to 17 subfamilies were used for building the phylogenetic tree. Clustering of 176 sequences was performed at an identity cut-off of 70%, which reduced the number of representative sequences to 121 from a total of 176. The resulting tree (Tree 1) containing 17-member families of MFS are shown in Figure 3. MEGA tool was used to build ML tree with 100 bootstrap replications ( Figure 3A) while RaXML was used to build the tree with 500 bootstrap replications ( Figure 3B). Additionally, a phylogenetic tree was also built using Neighbour Joining method with 500 bootstrap replications (data not shown) to show equivalency of the phylogenetic clustering using different methods.  Figure 3). There are a total of 726 sequences distributed over 173 species in BT1 family in PFAM (version 27.0).
To retain only representative sequences from this family, we clustered 726 sequences at 35% sequence identity cut-off which resulted in obtaining 67 representative sequences for BT1 family. In order to demonstrate BT1 as a separate functional family within MFS, a phylogenetic tree (Tree 2) was constructed with 121 representative sequences of 17 member families (as for Tree 1 in Figure 3A) along with the 67 representative sequences belonging to BT1 family. Like Tree 1, the phylogenetic tree was constructed with 2 different programs with 2 different bootstrap replications. MEGA tool was used to build ML tree with 100 bootstrap replications ( Figure 4A) while RaXML was used to build the tree with 500 bootstrap replications (Figure 4B). The resulting tree is depicted as Figure 4. Furthermore, phylogenetic clustering of the members of BT1 family alone was performed to understand the sequence level clustering within this class of proteins. The special interest in clustering of BT1-family members from Trypanosomatids motivated us to perform a phylogenetic analysis of the sequences.
For this reason, we retrieved all of the 66 sequences from BT1 family belonging to Trypanosomatids which upon CD-hit clustering of 100%, reduced to 64 protein sequences. A phylogenetic tree was built using these 64 sequences from Trypanosomatids and 67 representative sequences belonging to other species of BT1 family (as used in Tree 2, Figure 4) along with our query of interest, E9BPD7. Two independent methods were used for building the phylogenetic tree. MEGA tool was used to build ML tree with 100 bootstrap replications ( Figure 5A) while RaXML was used to build the tree with 500 bootstrap replications ( Figure 5B). Using the MEME suite, we identified class specific motifs for the 4 different clusters within the tree (Figure 5B) [28]. Figure 6 shows the MEME output/sequence motifs for the four clusters.

©Biomedical Informatics (2019)
Discussion: BT1 family is part of Multi Facilitator Superfamily (MFS), a large class of membrane proteins involved in the transport of various molecules across the membrane. MFS is one of the two largest superfamilies of membrane transport proteins that is virtually distributed among all the recognised organism phyla [26]. . We performed phylogenetic analysis of BT1 family to understand the phyletic relationship and distribution of E9BPD7 amongst the members of BT1 family. To eliminate any artefacts arising out of methodology, in the present study we built the ML tree using MEGA (100 bootstrap) and RaXML (500 bootstrap) and a NJ tree using MEGA (500 bootstrap). These 3 trees are comparable with the earlier work shown by Pao et al. (1998) [26] and all the 17member families cluster identical to the phylogenetic tree previously reported. Each member family is annotated with a different colour in Figure 3A and 3B.
Owing to the finer sequence differences, we would expect BT1 family members to cluster differentially in the presence of members from other MFS families. As previously discussed, to eliminate any artefacts arising out of methodology, ML tree was built using MEGA and RaXML. Figure 4A and 4B shows BT1 protein sequences (coloured in black) clustering as a separate group, clearly indicating that BT1 family is indeed a distinguishable family of MFS. More importantly, our query of interest (E9BPD7) clusters well within the clad of BT1 family strongly suggesting that E9BPD7 is a member of BT1 family. Though Trypanosomatids belong to the Eukaryotic domain, there are numerous differences between these parasitic protozoans and other higher order eukaryotic species. These differences could also be reflected within the protein domains shared by these pathogens with other eukaryotic species. It is apparent from the tree shown in Figure 5A (MEGA) and Figure  5B (RaXML) that BT1 sequences from Trypanosomatids form a distinctive cluster (coloured RED) while other eukaryotic sequences can be categorized into 3 different subfamilies based on phylogenetic clustering. However, our sequence of interest in this case study, E9BPD7 (in BLACK), interestingly clusters with sequences from higher organisms (coloured BLUE). This suggests that our query of interest, E9BPD7 contains sequence characteristics distinct from other BT1 sequences within the phyla of Trypanosomatids.
To further explore the key sequence features and motifs of each of these sub-clusters, we performed sequence based motif analysis. From Figure 6, it is evident that the sequence motifs from clusters coloured in Blue (Figure 6A), Purple ( Figure 6B) and Green ( Figure  6C) are very similar. The conserved sequence signature shared by the individual motifs from the three clusters is preserved within the combined sequence motif that was generated using all the sequences from the Blue, Purple and Green clusters ( Figure 6D). However, the motif belonging to the Red cluster ( Figure 6E) with sequences from Trypanosomatids that contain a BT1 domain is unique in nature. Therefore, based on the sequence motif and conserved signatures, BT1 family can further be classified into two different subfamilies -a subfamily exclusively containing sequences from Trypanosomatid phyla that share a unique sequence motif and another family containing non-trypanosome sequences. It is also interesting to note that besides the classical members, trypanosomatids also contain members like E9BPD7 which share a non Trypanosomatid like BT1 motif. The sequence motif/signature information for BT1 family (Trypanosomatids and non-Trypanosomatids) that has been proposed in the present study may help in further mining and characterization of this important family of transporters.

Conclusion:
The application of various computer aided prediction tools has enabled us to characterise and assign putative functional information to 36 hypothetical membrane proteins in Leishmania donovani. This information could potentially aid in identifying drug targets for the treatment and cure of VL in the future. These 36 sequences have been annotated with functional information from PFAM and GO. Such crucial data may prove to be of great assistance in deciphering the biological roles and molecular functions of these hypothetical proteins in the protozoan. Based on our observation, it can be noted that, 35 of these 36 proteins have been associated to a COG family as well. A thorough sequence characterization of E9BPD7, an uncharacterised hypothetical protein in Leishmania donovani which does not find an association in COG database, has been undertaken. Using phylogenetic studies, E9BPD7 has been proposed to be a member of the BT1 family of MFS with high confidence. The importance of this protein belonging to MFS superfamily is obvious, as its roles are imminent in many diseases, and drug transport mechanism. Often, the resistance to antibiotics is correlated with the action of MFS resistance genes [30]. Mutations within the MFS transporters may lead to several neurodegenerative diseases and vascular disorders of the brain as well [31][32]. Further, the analysis with E9BPD7 revealed interesting information about the presence of a non-Trypanosomatid sequence motif in this sub-family. Additionally, two different sub-families, viz., Trypanosomatids sub-family and non-Trypanosomatids subfamily have been proposed within the BT1 family, based on phyletic clustering and presence of class specific sequence motifs. These two explicit motif signatures would be of immense help in sequence mining and characterization of these important classes of membrane proteins in Trypanosomatids and other organisms. Many such novel sequences within Trypanosomatids, appear to play crucial roles in the biology of these pathogens.