Current Bioinformatics resources in combating infectious diseases

Bioinformatics tools and techniques for analyzing next-generation sequencing (NGS) data are increasingly used for the diagnosis and monitoring of infectious diseases. It is of interest to review the application of bioinformatics tools, commonly used databases and NGS data in clinical microbiology, focusing on molecular identification, genotyping, microbiome research, antimicrobial resistance analysis and the detection of unknown disease-associated pathogens in clinical specimens. This review documents available bioinformatics resources and databases that are used by medical microbiology scientists and physicians to control emerging infectious pathogens.


Background:
The application of bioinformatics tools and techniques in analyzing the increasing volume of data generated in molecular biology, genomics, transcriptomics, and proteomics is gaining momentum [1]. Moreover, the amount of information available in databases and the literature for generating molecular profiles and for collecting data related to the epidemiology of pathogens has also been growing [2]. Therefore, the use of bioinformatics tools and techniques in pathogen identification and typing, identifying markers for early diagnosis and treatment, enabling personalized interventions and predicting patient outcomes is imperative [3]. Bioinformatics-aided analysis of next-generation sequencing (NGS) data is a promising way to identify clinically relevant viruses from a variety of specimen types [4]. Similarly, bacterial pathogens such as Francisella tularensis and Leptospira santarosai were successfully identified from primary human clinical specimens using culture-independent NGS [5][6]. The application of bioinformatics techniques in the surveillance of pathogen outbreaks is also essential in fighting infectious diseases. Thus, this review documents available bioinformatics resources and databases that are used by medical microbiology scientists and physicians to control emerging infectious pathogens.

Bioinformatics Tools for Pathogen identification and typing:
Bioinformatics tools are extensively used in the identification, characterization, and typing of all kinds of pathogens. This followed the widespread use of genomic approaches in the diagnosis and management of viral, bacterial, and fungal infections. Bioinformatics has been applied in pathogen identification, detection of virulence factors, resistome analysis, and strain typing. Next-generation sequencing (NGS) technology supported by bioinformatics, phylogenetic, and pathogenomic analyses helped to identify the causative agent as a Clostridium haemolyticum isolate [3]. This isolate possesses the virulence factors necessary to establish an infection and cause all the observed symptoms. Thus, whole genome sequencing (WGS) assisted by powerful bioinformatics tools holds considerable potential for identifying pathogens isolated from human specimens [7]. The application of bioinformatics tools in analyzing WGS and ribosomal RNA (rRNA) gene sequencing data for the identification of both bacterial and fungal pathogens has become routine in recent years. A need for more advanced bioinformatics tools for the analysis of NGS rRNA sequencing data is emerging in microbiome studies [8]. The available bioinformatics tools used in sequence assembly and analysis and in microbiome studies are given in Table 1. The MG-RAST server (http://metagenomics.anl.gov) is useful for WGS metagenomics analysis, an approach that is more advanced than 16S rRNA sequencing [12]. MG-RAST is an automated analysis platform for metagenomes that provides quantitative insights into microbial populations based on sequencing data. The server provides options for upload, quality control, automated annotation and comparative analysis of shotgun and amplicon metagenomic samples as well as metatranscriptomes.
Moreover, high-throughput sequencing (HTS) combined with a bioinformatics pipeline (ezVIR) was used to evaluate the entire spectrum of known human viruses and provided results that are easy to interpret and customizable. This pipeline works by identifying the most likely viruses present in the specimen using sequence data. The ezVIR pipeline generates strain typing reports, genome coverage histograms, and cross-contamination analysis for specimens prepared in series. This pipeline was able to identify DNA or RNA viruses in most collected clinical specimens. Tools are also available for removing host sequences from the mixed pool of pathogen and human sequences that NGS produces. This filtering step is very important since viral sequences usually account for less than 1% of the resulting pool. For example, rapid identification of non-human sequences (RINS) (https://s3.amazonaws.com/changseq/kqu/) was able to precisely identify sequencing reads from non-human genomes in the dataset used and robustly produced contigs from these sequences in less than two hours [4,13]. RINS is an intersection-based pathogen detection workflow that utilizes a user-provided reference genome set for the identification of non-human sequences in deep sequencing datasets. VirusSeq is an algorithmic method that is also used for detecting known viruses and their integration sites in the human genome using NGS data. NGS supported by bioinformatics tools has been used to catalog discrete organisms within complex, polymicrobial specimens. Deep sequencing of 16S rRNA implicated Actinomadura madurae as the cause of mycetoma in a diabetic patient [16], whereas conventional microbiological and molecular methods failed due to the overgrowth of Staphylococcus aureus. Later, the use of bioinformatics analysis in the identification of a bacterial pathogen was introduced by Saeb et al. 2017 [3]. We have developed an analysis pipeline to identify and annotate the suggested pathogen.
First, the quality of the reads was assessed and reads with a score less than 20bp were removed. Secondly, the selected reads were subjected to MetaPhlAn software [17] for primary microbial identification based on unique, clade-specific marker genes. The BLAST program was used to map each read to the non-redundant nucleotide database of NCBI. A high level of contamination with human, non-pathogen sequences was observed. The TMAP (https://github.com/iontorrent/TMAP) program was then used to remove the contaminating reads. The remaining non-human target sequences were subjected to further analysis. MIRA software (version 4) [18] was used to perform de novo assembly of these non-human sequences. The selected sequences were mapped to the bacterial genomes that were top ranked based on the MetaPhlAn and BLAST findings. The pipeline used in the study was imported into the workflow system Tavaxy [3]. We further used the QIIME pipeline for taxonomic assignment and for visualization of the results [9].
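The first two steps of this pipeline, quality filtering followed by removal of reads that align to the human genome, can be sketched in Python. This is a minimal illustration under stated assumptions rather than the implementation used in the study: it assumes that the threshold of 20 refers to a mean Phred quality score (Phred+33 encoding) and that the host alignments are available as SAM records, in which FLAG bit 0x4 marks an unmapped read. All function names and the example reads are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read id, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (id, sequence, quality) from simple four-line FASTQ records."""
    cleaned = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, seq, _plus, qual = cleaned[i:i + 4]
        yield header[1:].split()[0], seq, qual


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(records: Iterable[FastqRecord],
                   min_mean_quality: float = 20.0) -> List[FastqRecord]:
    """Keep reads whose mean quality reaches the (assumed) threshold."""
    return [rec for rec in records
            if rec[2] and mean_phred(rec[2]) >= min_mean_quality]


def unmapped_read_ids(sam_lines: Iterable[str]) -> Set[str]:
    """Return ids of reads that did NOT align to the host reference."""
    ids: Set[str] = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:  # FLAG bit 0x4: segment unmapped
            ids.add(fields[0])
    return ids


# Hypothetical example data: r2 has low quality, r1 maps to the host.
fastq_lines = ["@r1", "ACGT", "+", "IIII",
               "@r2", "ACGT", "+", "####",
               "@r3", "GGCC", "+", "IIII"]
sam_lines = ["@HD\tVN:1.6",
             "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
             "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII"]

high_quality = quality_filter(parse_fastq(fastq_lines))
non_host_ids = unmapped_read_ids(sam_lines)
non_host_reads = [rec for rec in high_quality if rec[0] in non_host_ids]
print([rec[0] for rec in non_host_reads])
```

In this toy input, only read r3 passes both filters and would move on to assembly and taxonomic assignment.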

Tools for Pathogenicity and virulence:
An important bioinformatics tool for testing the pathogenicity of a newly discovered bacterial pathogen is PathogenFinder 1.1 (https://cge.cbs.dtu.dk/services/PathogenFinder/). PathogenFinder is a web server used for the prediction of bacterial pathogenicity from proteomic, genomic, or raw read data. Bacterial pathogenicity depends on groups of proteins known to be involved in pathogenicity [25]. This web server, however, utilizes a selection of protein groups created without regard to annotated function or known involvement in pathogenicity. It can predict pathogenicity for all taxonomic groups of bacteria with 88.6% accuracy. Because the approach is not biased toward known pathogenicity factors, the program could be used to discover novel ones.
A more recent method for predicting pathogenicity is PaPrBaG (Pathogenicity Prediction for Bacterial Genomes) (https://github.com/crarlus/paprbag), which is based on machine learning and provided as an R package [26]. PaPrBaG predicts pathogenicity by training on a large number of established pathogenic species in comparison with nonpathogenic bacteria. It is suitable for NGS data with very low genomic coverage. PaPrBaG is a random forest-based method for assessing the pathogenic potential of a set of reads belonging to a single genome. It helps in the prediction of novel, unknown bacterial pathogens. PaPrBaG provides a prediction even for such data, whereas other approaches discard many sequencing reads because of their low similarity to known reference genomes.
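The idea of scoring individual reads and then summarizing them for a whole genome can be illustrated with a short Python sketch. It is not the PaPrBaG code (which is an R package) and makes several assumptions: k-mer frequencies stand in for the read-level features, the per-read scores are taken as given from some already trained classifier that is not shown, and the genome-level score is assumed to be the mean of the read scores. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Sequence


def kmer_frequencies(read: str, k: int = 2) -> Dict[str, float]:
    """Relative frequency of every DNA k-mer in a read.

    k-mers containing ambiguous bases (e.g. N) are ignored.
    """
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0)
            for kmer in vocabulary}


def genome_score(read_scores: Sequence[float]) -> float:
    """Summarize per-read scores as their mean (an assumed aggregation rule)."""
    if not read_scores:
        raise ValueError("at least one read score is required")
    return sum(read_scores) / len(read_scores)


# Feature vector for one hypothetical read.
features = kmer_frequencies("ACGT", k=2)
print(len(features), features["AC"])

# Hypothetical per-read scores from a trained classifier (not shown).
print(genome_score([0.9, 0.7, 0.2]))
```

Because every read contributes a score, no read needs to be discarded for lacking a close match in a reference database, which is the property that makes this kind of approach usable at low coverage.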

Figure 1: Resistome analysis of the first nanosilver-resistant bacterium using the bioinformatics tools for identifying and combating antimicrobial resistance.

Furthermore, the genomic contigs of a pathogen produced by NGS techniques are annotated using the Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) available at NCBI. They can also be annotated for pathogenicity and virulence factors using the gene annotation service of the bacterial bioinformatics database and analysis resource PATRIC (https://www.patricbrc.org/app/Annotation). The sequences and functions of virulence genes corresponding to the major bacterial virulence factors of a specific pathogen can also be collected from GenBank and validated using the virulence factors of pathogenic bacteria database (http://www.mgc.ac.cn/VFs/), the Victors virulence factors search program (http://www.phidias.us/victors/) and the PATRIC_VF tool (https://www.patricbrc.org/) [27]. However, in order to utilize all the tools and links provided by PATRIC, the user should register at the main portal of the website.

Bioinformatics tools for identifying and combating antimicrobial resistance:
Rapid, accurate detection and understanding of resistance factors and mechanisms are in high demand in the field of antimicrobial resistance. The genome contigs can first be investigated for the presence of antibiotic resistance loci using both the PGAAP and PATRIC gene annotation services. The presence of antibiotic resistance loci in newly isolated bacterial pathogens can then be investigated using specialized search tools and services such as ResFinder 2.1, which identifies acquired antimicrobial resistance genes and/or finds chromosomal mutations in totally or partially sequenced bacterial isolates. ResFinder is a web server that provides a convenient way of identifying acquired antimicrobial resistance genes in completely sequenced isolates. It can be accessed at www.genomicepidemiology.org and is regularly updated with new resistance genes. Similarly, antibacterial biocide and metal resistance genes can be investigated using the PGAAP and PATRIC gene annotation services, the PATRIC Feature Finder search tool and BacMet (antibacterial biocide and metal resistance genes database) (http://bacmet.biomedicine.gu.se/) [30][31]. P. mirabilis SCDR1, the first nanosilver-resistant isolate, contains the pathogenicity and virulence factors needed to establish a successful infection. P. mirabilis SCDR1 contains several mechanisms for antibiotic and metal resistance, including biofilm formation, swarming motility, efflux systems, and enzymatic detoxification. It also possesses several mechanisms that may lead to the observed nanosilver resistance (Figure 1) [32].
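Screening contigs against a resistance gene database typically comes down to keeping only alignments that are similar enough and that cover enough of the reference gene. The Python sketch below shows that filtering step on BLAST tabular output (the standard 12-column `-outfmt 6` layout). It is an illustration under assumptions, not ResFinder's own code: contigs are assumed to be the query and resistance genes the subject, the gene lengths are supplied separately, the 90% identity and 60% coverage thresholds are illustrative, and the gene names, lengths and hits are invented.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    contig: str
    gene: str
    identity: float   # percent identity of the alignment
    coverage: float   # percent of the reference gene that is aligned


def screen_hits(blast_lines: Iterable[str],
                gene_lengths: Dict[str, int],
                min_identity: float = 90.0,
                min_coverage: float = 60.0) -> List[Hit]:
    """Return the best passing hit per resistance gene."""
    best: Dict[str, Hit] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        contig, gene = fields[0], fields[1]
        identity = float(fields[2])
        # Columns 9 and 10 (sstart, send) give the aligned span on the gene.
        sstart, send = int(fields[8]), int(fields[9])
        aligned = abs(send - sstart) + 1
        coverage = 100.0 * aligned / gene_lengths[gene]
        if identity < min_identity or coverage < min_coverage:
            continue
        hit = Hit(contig, gene, identity, coverage)
        previous = best.get(gene)
        if previous is None or (hit.identity, hit.coverage) > (
                previous.identity, previous.coverage):
            best[gene] = hit
    return sorted(best.values(), key=lambda h: h.gene)


# Invented example: one full-length hit, one low-identity hit,
# and one short partial hit.
gene_lengths = {"geneA": 861, "geneB": 1200}
blast_lines = [
    "contig1\tgeneA\t99.5\t861\t4\t0\t1000\t1860\t1\t861\t0.0\t1500",
    "contig2\tgeneB\t85.0\t1200\t180\t0\t50\t1249\t1\t1200\t0.0\t900",
    "contig3\tgeneA\t97.0\t300\t9\t0\t1\t300\t1\t300\t1e-100\t500",
]
resistance_hits = screen_hits(blast_lines, gene_lengths)
for hit in resistance_hits:
    print(hit.contig, hit.gene, hit.identity, round(hit.coverage, 1))
```

Only the full-length, high-identity hit on contig1 is reported here; lowering either threshold would admit more distant or partial matches at the cost of more false positives.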

Conclusion:
Several bioinformatics tools are available for analyzing data to combat and control infectious diseases, as discussed in this review. There are also several bioinformatics tools for drug resistance testing, pathogen-host interaction, infection and treatment outcomes. Nonetheless, there remains a need to facilitate and incorporate bioinformatics tools and applications in clinical microbiology and infectious diseases through the training of personnel and the development of simple yet robust, user-friendly bioinformatics pipelines.