Recent trends in remote homology detection: an Indian Medley.

The development of remote homology detection methods is a challenging area in Bioinformatics. Sequence analysis-based approaches that address this problem have employed profiles, templates and Hidden Markov Models (HMMs). These methods often face limitations due to poor sequence similarity and the non-uniform dispersion of sequences in protein sequence space. Search procedures are often asymmetrical because some protein families are over- or under-represented, and outliers often remain undetected. Intermediate sequences that share high similarity with more than one protein can help overcome such problems. Methods such as MulPSSM and Cascade PSI-BLAST that employ intermediate sequences achieve better coverage of members in searches. Other methods employ peptide modules or conserved patterns of motifs or residues in searches, which reduces the dependence on high sequence similarity for establishing homology. We review some of these methods developed recently in India.


Background:
The last few decades have seen spectacular developments in the growth of protein sequence and structure data and in tools to analyse them. Large-scale experimental determination of the biochemical roles of proteins is a phenomenal exercise. Bioinformatics shows exciting promise in the development of methods that allow newly sequenced proteins to be related quickly to their closest experimentally verified relatives. Sequence search procedures such as PSI-BLAST [1] can detect protein relationships effectively when sequence similarities are high. However, when sequence similarities become poor (20-30%), such detection becomes a non-trivial task.
Proteins dissimilar in sequence can adopt similar structure, perform similar functions and also be homologous. This is exemplified in structural databases, which classify proteins with high similarity in sequence, structure and function into families and group families of similar structure and function into superfamilies. For superfamily members, which show poor sequence similarity, it is difficult to determine evolutionary relationship in the absence of structure. Such proteins, related despite low sequence similarity, lie in the 'twilight zone' and are termed 'remote homologues'.
Structure-based approaches can detect such relationships since protein structures are less perturbed by sequence changes. Some approaches consider overall structural similarity in defining relatedness, while others, such as that of Bhaduri et al. [2], have shown that conserved spatial interactions in a protein superfamily are excellent constraints for identifying more members of the superfamily.

Description: Sequence analysis-based approaches for remote homology detection:
Methods like PSI-BLAST build profiles iteratively during searches in sequence databases [1]. The use of 'intermediate protein sequences' that share sequence features of more than one protein is quite effective in detecting distant protein similarities [3]. As seen in Figure 1, such sequences populate sequence space and relate proteins that are traditionally difficult to relate because of poor sequence similarity. 'Intermediate sequences' that share high similarity with more than one protein, if detected and employed in profile generation, can improve the effectiveness of sequence analysis-based approaches. Sandhya et al. [4] discuss an application of such sequences in PSI-BLAST searches in fold-specific databases to detect relationships not evident through simple sequence searches.
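The idea of an intermediate sequence can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and uses simple percent identity as the similarity measure. The function names, the toy sequences and the 40% linking threshold (taken loosely from the 40-50% range quoted for Figure 1) are illustrative assumptions and are not part of any of the reviewed methods.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def find_intermediates(query, target, candidates, link_threshold=40.0):
    """Return names of candidates that are similar to BOTH query and target.

    link_threshold is a hypothetical cut-off chosen for illustration only.
    """
    return [
        name
        for name, seq in candidates.items()
        if percent_identity(query, seq) >= link_threshold
        and percent_identity(target, seq) >= link_threshold
    ]


# Toy, artificial sequences: query and target share no identical positions.
query = "AAAAAAAAAA"
target = "CCCCCCCCCC"
candidates = {
    "mid": "AAAAACCCCC",         # 50% identical to both query and target
    "near_query": "AAAAAAAACC",  # close to the query only
}
linking = find_intermediates(query, target, candidates)
```

Here only the candidate named `mid` links the two otherwise unrelated sequences, which is the role that intermediate sequences play in the searches discussed in this review.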
Several exciting and interesting efforts in remote homology search methods have been made in the Indian sub-continent in the last few years. These methods (Figure 1) show promise and effectively detect such deep relationships in proteins. In this review, we highlight salient features of these approaches.

Cascade PSI-BLAST: Hops through intermediates detect remote homologues:
Sandhya et al. [5] have recently reported a method called Cascade PSI-BLAST that propagates PSI-BLAST searches in a non-directed manner to detect distant similarities between proteins. In this method, which makes extensive use of intermediate sequences, a PSI-BLAST search termed the "first generation" is initiated for a query in a database. The hits detected are then allowed to propagate independent searches in the same database to detect new hits (the "second generation" search, and so on). The authors typically recommend up to three generations of search for better coverage and detection of relationships. An assessment of the approach on known relationships from the PALI database [6] shows that coverage improves by 15% for protein families and by ~35% for protein superfamilies over the traditional use of PSI-BLAST. This method is being made available in the public domain through a web server (http://crick.mbu.iisc.ernet.in/~CASCADE) (manuscript communicated).

Figure 1: A superfamily of proteins whose members share poor sequence similarity (<20%). 'Intermediate sequences' (in yellow) populate protein space and, owing to their high similarity (40-50%) with more than one protein, can effectively detect such remote homologues. Methods developed recently in India address the problem of remote homology detection effectively using patterns or intermediate sequences.
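The generation-wise propagation of searches in Cascade PSI-BLAST can be sketched as a breadth-first expansion over hits. This is only an illustration of the control flow described above and not the authors' implementation: the `search` callable is a hypothetical stand-in for a single PSI-BLAST run, and the toy hit table replaces a real sequence database.

```python
def cascade_search(query, search, generations=3):
    """Propagate searches from each new hit for a fixed number of generations.

    `search` is a hypothetical callable that takes a sequence identifier and
    returns the identifiers of its hits (standing in for one PSI-BLAST run).
    Returns a dict mapping each hit to the generation in which it was first found.
    """
    found = {query: 0}   # generation 0 is the query itself
    frontier = [query]   # sequences that seed the next round of searches
    for generation in range(1, generations + 1):
        next_frontier = []
        for seed in frontier:
            for hit in search(seed):
                if hit not in found:  # only new hits seed further searches
                    found[hit] = generation
                    next_frontier.append(hit)
        if not next_frontier:
            break  # no new hits, so the cascade has converged
        frontier = next_frontier
    del found[query]
    return found


# Toy hit table (assumption): each key lists the hits a search from it would return.
TOY_HITS = {
    "Q": ["A", "B"],
    "A": ["Q", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": ["E"],
}


def toy_search(seq_id):
    return TOY_HITS.get(seq_id, [])


hits_by_generation = cascade_search("Q", toy_search, generations=3)
```

In this toy example, `C` and `D` are reached only through intermediates in the second and third generations, while `E` would need a fourth generation. Tracking already-seen hits keeps each sequence from seeding more than one search, which limits the cost of propagation.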

MulPSSM: A database of multiple family profiles corresponding to a constant alignment:
Searches against sequence databases using PSSMs or profiles are more effective in identifying distant relationships than searches involving pairwise sequence alignments [1]. The effectiveness of any profile-based search depends on the quality and diversity of the sequences in the multiple sequence alignment used in profile generation.
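To make the dependence of a profile on its underlying alignment concrete, the following sketch builds a simplified position-specific scoring matrix from a small alignment. It is not the PSI-BLAST or MulPSSM procedure: it assumes a uniform background frequency, a fixed pseudocount and no sequence weighting, and the function names are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a simplified log-odds PSSM: one dict of residue scores per column.

    Assumptions: uniform background (1/20), a fixed pseudocount for every
    residue, and equal weight for every sequence in the alignment.
    """
    if not alignment or len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]  # gaps are ignored
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-column scores for an ungapped sequence of profile length."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, seq))


# Toy alignment (assumption): first two columns conserved, third variable.
toy_alignment = ["ACD", "ACE", "ACD"]
toy_pssm = build_pssm(toy_alignment)
```

A column that is conserved in the alignment gives a strongly positive score to the conserved residue and negative scores to residues never observed there, so the residues a profile can recognise are limited by the diversity of the sequences used to build it.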
Anand et al. [7] demonstrate that generating multiple family profiles enables the reliable detection of distant relationships within protein families and superfamilies. To assess the effectiveness of multiple family profiles, they generated a database of multiple-family profiles, single-family profiles and profile-HMMs for all structural families in the integrated sequence-structure database PALI [6]. Searches against these three