A comparison of four pair-wise sequence alignment methods.

Protein sequence alignment has become an essential task in modern molecular biology research. A number of alignment techniques have been documented in literature and their corresponding tools are made available as freeware and commercial software. The choice and use of these tools for sequence alignment through the complete interpretation of alignment results is often considered non-trivial by end-users with limited skill in Bioinformatics algorithm development. Here, we discuss the comparison of sequence alignment techniques based on dynamic programming (N-W, S-W) and heuristics (LFASTA, BL2SEQ) for four sets of sequence data towards an educational purpose. The analysis suggests that heuristics based methods are faster than dynamic programming methods in alignment speed.

Sequence alignment techniques such as N-W, S-W, LFASTA and BL2SEQ are routinely used in molecular biology laboratory (research) and drug discovery (development) environment. The N-W algorithm performs global alignment (comparison of entire sequences) between sequences and the S-W algorithm performs local alignment (comparison of local stretches of sequences for the identification of motifs). The LFASTA and BL2SEQ methods use heuristic (rule of thumb) to compare protein sequences. The measure of similarity in these methods is scored using similarity matrices [5,6].
The availability of several protein sequence comparison tools provide a wide range of choice for selecting appropriate tools for specific purposes. Generally these tools show varying degree of difference between them. These differences at a fine level are seldom used correctly by end-users who are nonexperts in Bioinformatics developments. Here, use we execution time as a parameter to compare sequence alignment tools using scoring matrices such as BLOSUM 45, BLOSUM 62 and BLOSUM 80 [5,6]. This comparison is of help to biologist who are non-expert in Bioinformatics to select appropriate sequence tools for specific tasks based on the available in-house infra-structural facilities. (PDB, http://www.rcsb.org/pdb) based on maximum sequence similarity. We downloaded S-20 (containing non-redundant sequences at less than 20% sequence similarity) dataset from PISCES. We extracted 200 sequences from S-20 in a random manner and created a dataset designated as DS-20. It contains non-redundant sequences at 20% sequence similarity cut-off.

Dataset #3: DS-40
We downloaded S-40 (containing non-redundant sequences at less than 40% sequence similarity) dataset from PISCES. We extracted 200 sequences from S-40 in a random manner and created a dataset designated as DS-40. It contains nonredundant sequences at 40% sequence similarity cut-off.

Dataset #4: DS-90
We downloaded S-90 (containing non-redundant sequences at less than 90% sequence similarity) dataset from PISCES. We extracted 200 sequences from S-90 in a random manner and created a dataset designated as DS-90. It contains nonredundant sequences at 90% sequence similarity cut-off.

Data statistics
The distribution of sequences with varying lengths for datasets #1 to #4 is summarized in

Sequence comparison
We performed pair-wise alignment for randomly selected sequences from one dataset to sequences in other datasets such as DS-R DS-20, DS-40 and DS-90 using N-W, S-W, LFASTA and BL2SEQ in a one-to-many alignment manner.

Alignment execution time
The execution time is the time needed to perform an alignment between two protein sequences for a given method in a 2.4 GHZ Pentium-IV processor with 512 MB of RAM.

Discussion:
Sequence alignment is an important task in sequence based molecular biology experiments in modern research. A number of sequence alignment tools are available in the internet for varying purposes (see EMBOSS). However, selection of specific tools for a Biologist who is not an expert in the field of Bioinformatics is non-trivial. Here, we describe the comparison of pair-wise sequence alignment using methods N-W, S-W, LFASTA and BL2SEQ described elsewhere [1][2][3][4]. These techniques and their corresponding tools are developed by authors with strong mathematical knowledge. This is not the case with end-users who often have difficulties in selecting tools and interpreting alignment results. The performance of these methods has been discussed extensively in graduate level TEXT books for Bioinformatics. However, a comparative study on the performance of these techniques is not explicitly available. In this study, we use execution time (alignment speed) as a parameter to compare four alignment methods. For the purpose of simplicity, the experiment is conducted in a 2.4 GHZ Pentium-IV processor with 512 MB of RAM under windows platform. Figure 1 gives the profile for execution time (alignment speed) versus sequence length for all the four methods used in the analysis using four different datasets (DS-R, DS-20; DS-40; DS-90). The analysis shows that alignment speed for heuristic methods such as LFASTA and BL2SEQ are faster than dynamic programming methods such as N-W, S-W. This provides insight to the selection of several programs that are