GSIT: An integrated web-tool for identification of genomic signatures in highly similar DNA sequences

Accurate identification and characterization of infectious agent and its subtype is essential for efficient treatment of infectious diseases on a target population of patients. Comparative biology of microbial populations in vitro and in vivo can identify signatures that may be used to develop and improve diagnostic procedures. Here we report Genomic Signature Identification Tool (GSIT) a web based tool for identification and validation of genomic signatures in a group of similar DNA sequences of microorganisms. GSIT uses multiple sequence alignment to identify the unique base sites and scores them for inclusion as genomic signature for the particular strain. GSIT is a web based tool where the front-end in designed using HTML/CSS and Javascript, while back-end is run using CGI-Perl. Availability The server is freely available at the http://genome-sign.net/gsit


Background:
The diversity among newly emerging pathogens calls for a fast and accurate identification technique for the characterization of the pathogens. Genomic and molecular biology techniques such as plasmid profiling, restriction fragment length polymorphism (RFLP), polymerase chain reaction (PCR) and whole genome sequencing (WGS) are regularly applied in clinical settings for detection of pathogens [1]. However, a major bottleneck in discovery of novel variants is the availability of datasets and the approach used in their discovery. Advent of fast and accurate WGS technology bridges the gap of genomic dataset availability [2]. Comparative genomics among datasets of closely related species of same organisms is extensively used to identify genomic variation signatures. The term genomic signatures are used in various contexts such as specific expression pattern in disease or organism [3] or a stretch of nucleic acid/amino acid sequence specific to a particular trait [4]. Here we define a genomic signature as a point mutation specific only to one subtype/strain of an organism. Identification of genomic signatures would not only help in differentiating the different subtypes infecting the host but would also help in understanding the relationship between subtypes and the patients' response to treatment and may help in providing the optimal duration and type of therapy, which may ultimately improve patient management. Such approaches have been previously extensively used to develop and improve diagnostic procedures, as well as, explore the relationship between infection and drug response [5, 6]. Here we present genomic signature identification tool (GSIT), which employees comparative genomics technique for identification of genomic signatures among subtypes/strain of same species of organism and provides a list of significant genomic signatures for many applications including pathogen characterization and epidemiological applications. The input to server consists of whole genome sequence of three or more subtype/strain of same species along with a reference sequence of each subtype/strain respectively. Length difference among input sequences should be less than 10% in order to confirm quality sequence alignment. Users also need to provide their functional email id where the results will be sent after the job is completed.

Algorithm:
The web server is implemented in Perl/CGI. Once the user files are uploaded to the server, multiple sequence alignment (MSA) of all the individual subtypes and representative sequences is constructed using a MSA tool ClustalW-MPI [7] and the alignments are used to determine the consensus sequence for each subtype. The consensus sequences of each subtype are then aligned with each other in another round of MSA. The MSA generated is parsed through custom scripts, which analyze each position of alignment and determines sites with a unique base as compared to other subtypes.
Next, a probabilistic method is implemented to determine the statistical significance and subtype prediction probability of each of the unique site shortlisted in previous module. The probability of misclassification within a subtype (Pn(x | z)) and between different subtypes (Pn(x | z!)) for each site is calculated as (Please see supplementary material for equation and explanation).
The sites showing highest probability of classification within a subtype and lowest or null probability of misclassification both within and between subtype sequences are shortlisted and termed as genomic signatures of that particular subtype. A concise workflow showing the working of GSIT is shown in Figure 1.

Output:
Once the results are available, link to the result table is sent to the user via email. The output consists of a table showing the top 5 genomic signatures for each subtype along with Probabilities of misclassification. The table also shows the informative signature along with 50 bp sequences flanking the signature. Users can also download the flanking sequences to the genomic signature in fasta format as an archive file.

Proof of Concept:
We validated the pipeline using Hepatitis B as a representative model. Hepatitis B Virus (HBV) whole genome sequences (N = 174 ) obtained from Hepatitis B Virus database [s2as02.genes.nig.ac.jp] and reference sequences for the eight representative genotypes (A-H) obtained from NCBI-Genbank [http://www.ncbi.nlm.nih.gov/projects/genotyping/view.cgi ?db=2] were used to correlate genomic signature based identity of subtypes. A systematic analysis of HBV genotypes was performed using the GSIT tool. In all 10 Genomic Signature sites were identified for Hepatitis B Virus genotypes A-H.

Applications:
Genomic signatures find applications in a diverse set of clinical and non-clinical scenarios. Different subtypes of same viral species are known to show very different responses to therapy. An understanding of the relationship between genotypes and the patients' response to treatment may help in providing the optimal duration and type of therapy, which may ultimately improve patient management. The advantages of our technique include its ability to simultaneously compare strains at the whole-genome level with high sensitivity to detect subtle differences. These genomic signatures can be highly informative and can also be used for detection and monitoring of infectious diseases and their causative agents. Translating genomic signatures into a diagnostic and prognostic aid remains to be demonstrated.