SeqMaT: A sequence manipulation tool for phylogenetic analysis.

Most bioinformatics tools require specialized input formats for sequence comparison and analysis. This is particularly true for molecular phylogeny programs, which accept only certain formats. In addition, it is often necessary to eliminate highly similar sequences among the input, especially when the dataset is large. Moreover, most programs have restrictions upon the sequence name. Here we introduce SeqMaT, a Sequence Manipulation Tool. It has the following functions: data format conversion, sequence name coding and decoding, redundant and highly similar sequence removal, and data mining utilities. SeqMaT was developed using Java with two versions, web-based and standalone. A standalone program is convenient to manipulate a large number of sequences, while the web version will guarantee wide availability of the tool for researchers and practitioners throughout the Internet. Availability The database is available for free at http://glee.ist.unomaha.edu/seqmat


Background:
Molecular phylogeny is a fundamental approach studying species evolution and gene function. Many phylogenetic analysis programs are available, but each program often requires a particular type of input sequence format. Most tools have restrictions regarding the allowable length of sequence name and characters used. Automatically trimming the sequence names may lose important information, such as species names. Most often the trimmed names are redundant, which cannot be accepted by any analysis program. In addition, some programs do not accept sequence names with special characters. Furthermore, the input may contain identical or highly similar sequences which need to be removed to save computational resources and improve resolution of the resulting trees. , are available for handling the sequence name problem. However, those tools do not accept FASTA files other than GenBank and DOE JGI formats; they extract the total or a part of the accession number to rename the sequence. Additionally, they don't support conversion back to the original names once the tree is created. It is thus extremely inconvenient for the user to analyze the phylogenetic result.
The computer program libraries such as BioJava [8] and BioPerl [9] provide functions to convert between different formats but require computer skills to use them. On the other hand, sequence manipulation tools focus mainly on utility functions such as removing and sorting sequences. SeqMaT combines format conversion and essential functions of sequence manipulation under a single platform, with the expectation of addressing the issues with currently available tools.

Methodology:
Based upon our practical experience, we designed a pipeline to deal with the various data formatting and conversion issues described above. Figure  1 shows data flow and major functional modules of SeqMaT. SeqMaT accepts and converts a number of sequence formats, including FASTA, Clustal, Mega, Phylip, Nexus, EMBL, GenBank, PIR, Table, Hennig86, and plain text. The functional modules include 1) finding and removing redundant sequences, 2) format conversion, 3) removing special characters in sequence names, 4) sequence name encoding and 5) sequence name decoding.

Removing duplicate sequences:
In large sequence files, it is often difficult to check for duplicate sequences and the tools may generate error messages because of this redundancy. SeqMaT provides two ways to identify the redundant sequences from the given dataset for all the major sequence formats. One way is removing identical sequences; the other way is to collapse sequences above a user provided threshold (with one representative sequence used for subsequent analysis). The latter requires aligned sequences as input.

Data format conversion:
Data format conversion can be used for converting sequences from one format to another for future analysis. SeqMaT allows conversions among all major sequence and tree formats. We have provided both sequential and interleaved options for Phylip and Mega formats.

Special characters replacement:
Most of the tools do not allow special characters such as '/' or '|' in sequence descriptions. SeqMaT provides an option for replacing the special characters and white space with an underscore character '_' in the sequence description. Currently this option works for Nexus and FASTA sequence formats.

Sequence name coding and encoding:
Automatically trimmed sequence names created by other format conversion tools may be redundant and cannot be accepted by analysis programs. SeqMaT efficiently addresses the problem of long sequence names. This includes a two-step process, sequence encoding and tree decoding.
The coding part takes a set of input sequences in FASTA, Clustal or Mega formats and for each sequence the whole description is replaced with a short alpha-numerical name. These temporary names and original descriptions are stored in a table. Two files, the match table with original and new names and a sequence file in which each sequence has a short unique name are given as output from this step. These converted sequences are easy to use in a tree analysis.
The decoding part takes the match table generated in the coding step and a tree file as input. It replaces the temporary names in the tree file with the original name (or description) from the match table. The resulting tree file has the original description for each sequence without any redundancies in it, which can be opened in any tree viewing program.

Other applications:
SeqMaT also provides additional functions to manipulate data for data analysis and mining. First, the sequence data in FASTA format can be converted to Attribute Relational File Format (ARFF), a native format of the data mining tool -WEKA. Secondly, it calculates individual residue frequencies and k-mer frequencies for a given set of sequences, with both protein and nucleotide sequences accepted. Thirdly, a user can select a particular number of representative sequences that will be randomly picked from a large dataset. Lastly, a user can extract sequences within certain dates and geographical locations. This application takes the starting and ending years from which the user needs to extract sequences and the number of sequences for each year. The program extracts the sequence data per the user choice. This function is particularly useful when collecting sequences for calculating evolutionary rates and the time of most recent common ancestry.

Software platform:
The background programs were written in Java and the web integration was done using JSP/Servlets. The standalone version is provided as a jar file and runs on any systems with the Java Runtime Environment installed.

Applications:
SeqMaT is currently used by the Bioinformatics lab at the University of Nebraska at Omaha and elsewhere.