A comparison of MSA tools.

Multiple sequence alignment (MSA) is essential in phylogenetic, evolutionary and functional analysis, and many MSA tools are described in the literature. Here, we use eight of them (ClustalX, Align-m, T-Coffee, SAGA, ProbCons, MAFFT, MUSCLE and DIALIGN) to carry out a comparative analysis of phylogenetic trees on two datasets. The results show that no single MSA tool consistently outperforms the rest in producing reliable phylogenetic trees.


Dataset #2: DS-HOM
We downloaded 218 reference alignments from Homstrad and, as for the first dataset, divided them into three categories. Category 1: HOM_10 contains 141 reference alignments. Category 2: HOM_20 contains 54 reference alignments. Category 3: HOM_30 contains 23 reference alignments. This dataset is hereafter designated DS-HOM.

Comparison process
The eight alignment methods were run on the DS-BB and DS-HOM datasets using default parameters. Tests were performed on a 1.6-GHz Intel Pentium M with 512 MB RAM. Each method generates a total of 352 test alignments: 134 (from DS-BB) + 218 (from DS-HOM). Thus, a total of 2816 (352 × 8) test alignments were obtained. The 352 test alignments of each method and the 352 reference alignments were given as input to the Neighbor Joining method described by Saitou and Nei [14] to estimate phylogenetic test trees (TTs) and reference trees (RTs), respectively. The 352 TTs of each alignment method were then compared to the 352 RTs.
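The counts quoted above can be checked with a few lines of bookkeeping. This is only a consistency sketch: the variable names are illustrative, and the per-category breakdown of DS-BB is not restated in this section, so only its total is used.

```python
# Category sizes of DS-HOM as stated in the text.
hom_categories = {"HOM_10": 141, "HOM_20": 54, "HOM_30": 23}
ds_hom_total = sum(hom_categories.values())

# Only the DS-BB total is given in this section.
ds_bb_total = 134

# The eight alignment methods named in the abstract.
methods = ["ClustalX", "Align-m", "T-Coffee", "SAGA",
           "ProbCons", "MAFFT", "MUSCLE", "DIALIGN"]

alignments_per_method = ds_bb_total + ds_hom_total
total_test_alignments = alignments_per_method * len(methods)

print(ds_hom_total, alignments_per_method, total_test_alignments)
```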
The Robinson-Foulds distance (T_dRF) implemented in PAL [15] is used to compare each phylogenetic TT to its corresponding RT. T_dRF defines the distance between two trees as the minimum number of transformations required to obtain the topology of one tree from the topology of the other (equation 1 in the supplementary material). To evaluate the performance of each alignment method, we developed a score, dRF(M), which considers only the TTs of each method that are identical to their RTs (equation 2 in the supplementary material). This score gives the average number of identical TTs produced by each method on each dataset category. High values of dRF(M) indicate better performance by a method.
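The two quantities described above can be illustrated with a minimal sketch. It is not the PAL implementation: trees are assumed to be nested tuples of leaf names, the distance is computed in the usual way as the number of bipartitions found in one tree but not the other, and the function names (`splits`, `rf_distance`, `drf_score`) are hypothetical. The score is returned as a fraction of identical trees per category, which is one reading of the "average number of identical TTs".

```python
def leaf_set(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted."""
    leaves = leaf_set(tree)
    anchor = min(leaves)  # each split is stored as the side without this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unnormalised Robinson-Foulds distance: size of the symmetric difference."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


def drf_score(test_trees, reference_trees):
    """Fraction of test trees whose distance to their reference tree is 0."""
    identical = sum(rf_distance(t, r) == 0
                    for t, r in zip(test_trees, reference_trees))
    return identical / len(reference_trees)


# Made-up five-taxon trees, for illustration only.
rt = ((("A", "B"), "C"), ("D", "E"))
tt_same = (("A", "B"), ("C", ("D", "E")))   # same unrooted topology as rt
tt_diff = ((("A", "C"), "B"), ("D", "E"))   # A and C grouped instead of A and B

print(rf_distance(tt_same, rt), rf_distance(tt_diff, rt))
print(drf_score([tt_same, tt_diff, rt], [rt, rt, rt]))
```

Storing each split as the side that excludes a fixed leaf makes the comparison independent of where a tree happens to be rooted, which matters because Neighbor Joining trees are unrooted.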

Alignment quality assessment
We used the sum-of-pairs (SP) score implemented in the BaliBASE scoring scheme to estimate the alignment quality of each method. The SP score measures the extent to which a method succeeds in aligning some or all of the sequences in an alignment. The aim here is to show whether the alignment quality of a given method affects the reliability of its phylogenetic TT.
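As an illustration of the idea behind the SP score, the sketch below counts the residue pairs aligned in a reference alignment and reports the fraction that a test alignment reproduces. It is a simplified reading, not the BaliBASE scoring program itself: alignments are assumed to be lists of equal-length gapped strings in the same sequence order, `-` is assumed to be the gap character, and the function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified by (sequence index, position in the
    ungapped sequence), so pairs are comparable across alignments.
    """
    positions = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_index, symbol in enumerate(column):
            if symbol != "-":
                residues.append((seq_index, positions[seq_index]))
                positions[seq_index] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sp_score(test_alignment, reference_alignment):
    """Fraction of reference residue pairs also aligned in the test alignment."""
    reference_pairs = aligned_pairs(reference_alignment)
    if not reference_pairs:
        return 0.0
    shared = aligned_pairs(test_alignment) & reference_pairs
    return len(shared) / len(reference_pairs)


# Made-up two-sequence example (ACGT and AGT), for illustration only.
reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
print(sp_score(test, reference))
```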

Discussion:
Biologists use MSA as a first step in phylogenetic analysis. A number of sequence alignment tools are available on the internet. However, choosing a specific tool is not trivial for a biologist who is not an expert in bioinformatics. Many comparison studies of multiple alignment methods are available [1-3]. These studies, however, do not consider phylogenetic analysis. Here, we evaluated eight MSA tools by comparing their phylogenetic TTs. We used the Robinson-Foulds distance to compare the TTs of each alignment method with the RTs, and we derived the dRF(M) metric to estimate the percentage of identical TTs generated by each alignment method on each category of the two datasets used (DS-BB and DS-HOM). Figure 1 gives the variation of dRF(M) scores for all eight methods used in the analysis. We notice that the lower the sequence identity in a category of the DS-BB and DS-HOM datasets, the lower the percentage of identical TTs. All the methods show similar trends in dRF(M) scores. However, on categories BB_20 and BB_30 of the DS-BB dataset, MUSCLE gives a higher percentage of identical TTs than all the other methods. MUSCLE also performs better on categories HOM_10 and HOM_30 of the DS-HOM dataset.
We performed a Wilcoxon rank test for all pairs of methods (Table 1 in the supplementary material) to assess the significance of the differences in the overall Robinson-Foulds distances (T_dRF) between test and reference trees. The results suggest that the differences between methods are not statistically significant. Each method produces phylogenetic TTs as reliable as those given by ProbCons, which is described by Do and colleagues [16] as the best-performing method for generating accurate multiple alignments. Figure 2 gives the variation of SP scores for all the methods on each category of the DS-BB and DS-HOM datasets. It shows that ProbCons achieves the best performance on all categories of each dataset. The significance of the differences in overall SP scores, using the Wilcoxon rank test for all pairs of programs, is given in Table 2 (supplementary material). The differences between methods are significant, with ProbCons showing the highest alignment quality. Taken together, the results in Table 1 and Table 2 (see supplementary material) suggest that the alignment quality of the different methods does not heavily affect the reliability of their phylogenetic TTs: all of them produce TTs as good as those of ProbCons.
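The pairwise comparison of two methods can be sketched as follows. The text does not say which variant of the Wilcoxon test was used; this sketch assumes the paired signed-rank test with a two-sided normal approximation, applied to the per-alignment scores of two methods. The function name and the example values are hypothetical, and a statistics package would normally be preferred over a hand-written version.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Paired Wilcoxon signed-rank test, two-sided, normal approximation.

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while (end + 1 < n and
               abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        group = end - start + 1
        tie_term += group ** 3 - group
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(variance)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w_plus, w_minus, p_value


# Made-up per-alignment values for two hypothetical methods.
method_a = [3, 5, 1, 4]
method_b = [1, 2, 3, 0]
print(wilcoxon_signed_rank(method_a, method_b))
```

With so few pairs the normal approximation is rough; the example only shows how the rank sums are formed. In the convention of Table 2, a P-value above 0.05 would be reported as not significant.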

Conclusion:
A comparison of the phylogenetic TTs of eight MSA tools on three categories of each of two sequence datasets is discussed. All methods perform equally well in producing reliable phylogenetic TTs. Despite the significant differences in the quality of the alignments produced by the different methods, the analysis shows that the statistical difference between the phylogenetic TTs generated by each method is minimal. Other distances exist for comparing trees, such as the Nearest-Neighbor Interchange [17]. Applying the metric to large datasets would provide insights into MSA performance on divergent datasets.

In equation (1), TT_ij and RT_ij are, respectively, test tree j and reference tree j in category i (i = 1 to 3) of each of the DS-BB and DS-HOM datasets. If the TT is identical to the RT, T_dRF is equal to 0.
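As a hedged reconstruction from the definitions given in the text, and not the authors' typeset equations, equations 1 and 2 can be written as below. The symbol B(T), denoting the set of non-trivial bipartitions of a tree T, and the indicator notation are assumptions introduced for illustration. The form shown for T_dRF is the standard unnormalised Robinson-Foulds distance, which may differ from the scaling used in PAL; the form shown for dRF(M) is one reading of "the average number of identical TTs" in a category.

```latex
\begin{align}
T_{dRF}(TT_{ij}, RT_{ij}) &=
  \bigl| B(TT_{ij}) \setminus B(RT_{ij}) \bigr|
  + \bigl| B(RT_{ij}) \setminus B(TT_{ij}) \bigr| \tag{1} \\
dRF_i(M) &= \frac{1}{k} \sum_{j=1}^{k}
  \mathbf{1}\bigl[\, T_{dRF}(TT_{ij}, RT_{ij}) = 0 \,\bigr] \tag{2}
\end{align}
```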
In equation (2), M is a given alignment method and k is the number of RTs in category i of each dataset.
Table 2: Ranks and statistical significance on the DS-BB and DS-HOM datasets. Each entry in the table contains the P-value assigned by a Wilcoxon rank test to the difference between a pair of methods. The upper-right corner of the matrix is obtained from SP scores on DS-BB, the lower-left corner from SP scores on DS-HOM. If the method to the left is ranked higher than the method above, the P-value is preceded by +. If the method to the left is ranked lower, the P-value is preceded by -. If the P-value is >0.05, the difference is not considered significant and is shown in parentheses.