Benchmarking of 16S rRNA gene databases using known strain sequences

16S rRNA gene analysis is the most convenient and robust method for microbiome studies. Inaccurate taxonomic assignment of bacterial strains could have deleterious effects as all downstream analyses rely heavily on the accurate assessment of microbial taxonomy. The use of mock communities to check the reliability of the results has been suggested. However, often the mock communities used in most of the studies represent only a small fraction of taxa and are used mostly as validation of sequencing run to estimate sequencing artifacts. Moreover, a large number of databases and tools available for classification and taxonomic assignment of the 16S rRNA gene make it challenging to select the best-suited method for a particular dataset. In the present study, we used authentic and validly published 16S rRNA gene type strain sequences (full length, V3-V4 region) and analyzed them using a widely used QIIME pipeline along with different parameters of OTU clustering and QIIME compatible databases. Data Analysis Measures (DAM) revealed a high discrepancy in ratifying the taxonomy at different taxonomic hierarchies. Beta diversity analysis showed clear segregation of different DAMs. Limited differences were observed in reference data set analysis using partial (V3-V4) and full-length 16S rRNA gene sequences, which signify the reliability of partial 16S rRNA gene sequences in microbiome studies. Our analysis also highlights common discrepancies observed at various taxonomic levels using various methods and databases.


Background:
Next-Generation Sequencing (NGS) techniques are capable of generating high quality, comparable data [1]. Traditional methods are left behind as sequencing of 16S rRNA gene amplicons is now a well-established robust method for bacterial identification. The method has revolutionized the microbiome research as well as the field of medicine and pathology. 16S rRNA gene sequencing has been used to decipher the human microbiome associated with various disorders, including colorectal cancer [2]. Moreover, the method has been proven to be quick and reliable in diagnostic laboratories for pathogen identification [3,4]. Although 16S rRNA gene based analysis remains to be the gold standard, proper precautions need to be taken during sequencing, preprocessing of data, and subsequent downstream analyses. The selection of variable region, choice of method for OTU clustering, selection of reference databases, and sequencing platform has been shown to play an important role in the assessment of microbial diversity [5,6]. Although V3-V4 or only V4 region is used widely by the scientific community, the choice of the region has been observed to affect community identification [7,8]. Also, it is known that sequencing errors with different sequencing platforms could reduce the reliability of the analysis [9]. Another limitation is that the taxonomic assignment is dependent on the reference database. Accuracy and resolution of taxonomic assignments might differ depending upon the quality and quantity of reference databases [10]. Genomic data for the 16S rRNA gene of bacterial type strains are pooled and preserved in several databases like Genomic-based 16S ribosomal RNA gene Database (GRD), SILVA, Ribosomal Database Project (RDP) etc. Different pipelines like Quantitative Insights Into Microbial Ecology (QIIME) [11], Mothur [12], Metagenomic Rapid Annotations using Subsystems Technology (MG-RAST) [13] etc. are used worldwide for high precision and quick analysis. Different methods are used to overcome the limitations regarding 16S rRNA gene analysis. One of them is the use of training datasets, which have an impact on the classification and robustness of analysis of big datasets of bacterial 16S rRNA gene [14]. Furthermore, the use of mock communities has been suggested to check the reliability of the sequencing output. Several laboratory specific or commercially made mock microbial communities like 'ATCC Gut microbiome whole cell mix', 'ATCC Vaginal microbiome whole cell mix' (Sourcehttps://www.atcc.org) have proven to be effective in method optimization as well as to achieve transparency across different studies [15,16]. However, though mock microbial communities serve the purpose of estimating sequencing errors, they mostly represent minimal diversity. They thus cannot be used as a standard for taxonomic identification by analysis pipeline and databases. Thus it is a necessity to have a 16S rRNA gene analysis pipeline validated using a standard data set with known taxonomic identification. Several earlier studies have compared different databases and analysis pipelines like one by Nilakanta et al., which reviewed the existing analysis pipelines for genomic data analysis and suggested that Mothur and QIIME are the two outstanding pipelines due to their comprehensive features and support documentation. The limitation of the study was that they did not use a unified dataset to compare the performance of pipelines in terms of taxonomic assignment [17]. Another study by Erica Plummer et al., used a 35 infent fecal samples for comparison of three widely used analysis methods QIIME, Mothur, and MG-RAST, for accuracy in taxonomic assignments using only V3-V5 hypervariable regions of 16S rRNA gene [18]. In the present study, we used authentic and validly published, type strain, full length (1200-1500bp) and partial (In silico extracted V3-V4 region) 16S rRNA gene sequences (n=5895). These sequences were compared against various databases with QIIME pipeline, which incorporate various algorithms for quality control, clustering similar sequences, assigning taxonomy, calculating diversity measures and visualizing. We used 16S rRNA gene sequences of type strains obtained from the RDP database as it allows the option to download the bulk dataset [19]. Three different databases were used for microbiome analysis, namely Greengenes, SILVA, and EzTaxon [20] which used for 16S rRNA gene-based microbiome studies [21][22][23][24][25][26]. Although several different analysis pipeline alternatives like Dada2, QIIME 2, deblur are currently avalilable, we chose QIIME 1 pipeline as it is well accepted by scientific community having over 22000 citations and being used globally for microbiome analysis [27-36].  1  97_d_ez  4486  3609  2  97_d_gg  4486  3609  3  97_d_silva  4486  3609  4  97_c_ez  4789  3752  5  97_c_gg  4486  3609  6  97_c_silva  4699  3699  7  99_d_gg  5650  4733  8  99_d_silva  5650  4733  9  99_c_gg  5687  4743  10  99_c_silva  5643

Selection of data set:
A total of 5895 full-length sequences of 16S rRNA gene along with seven levels of taxonomy were obtained from the RDP official website. Only good quality type strain bacterial sequences with the sequence length >1200 BP were selected were stored in '.fasta' format and used for further analysis.

Generation of Partial (V3-V4 region) from full-length sequences:
16S rRNA gene sequences were taken from the dataset. 16S rRNA gene sequences show high length variability; hence, to extract the V3-V4 region from the sequences, fuzznuc (from EMBOSS software suite) was used. It extracts the sub-sequence based on PROSITE-style patterns in nucleotide sequences. The primer set used for fuzznuc, CCTACGGGAGGCAGCAG, and TCTAAT extracted V3-V4 regions from 91.21% sequences. Fuzznuc failed to extract partial sequences from the remaining 8.79% sequences. These sequences were then manually trimmed using Mega 7 [38].

Analysis of full length and Partial (V3-V4 region) 16S rRNA gene sequences:
The open-source data analysis pipeline QIIMEv.1.8 was used for most of the core analyses. Systematic QIIME analysis protocol for analysis included, assignment of valid QIIME labels followed by Operational Taxonomic Units (OTUs) clustering with pre-decided parameters. Further, representative sequence was selected from each OTU and finally taxonomic assignment was done using three databases. OTUs were clustered using two approaches. First, using the de_novo OTU picking method which detects OTUs without comparing the query sequence to any reference. Second, using an Closed reference-based clustering approach. Also, two different identity thresholds i.e., 97% & 99%, were used for OTUs clustering. Figure 2 depicts the flow of methodology used for sequence analysis. Taxonomic identification of sequences was made using three different databases, in combination with OTU clustering method and identity thresholds, and total of ten different DAMs were created for analysis. Further diversity analysis was performed using OTUs table in '.biom' format.

Results
Total OTUs: The sample dataset used in this study was obtained from an authentic database, and the sample size, i.e., the number of 16S rRNA gene sequences was 5,895. Also, they were approximately full length (>1200bp) type strain sequences. Considering the parameters mentioned above, it was expected that number of OTUs in the DAMs would be similar to the sample data set. However, it was observed that even with the 97% and 99% identity limits in OTUs picking, the number of OTUs obtained in each variation is significantly different. The total number of OTUs detected in different DAMs is represented in Table 1. Total numbers of OTUs obtained from each DAM were compared; which were less than the Operational Taxonomic Units (number of sequences) in data set ( Figure 3). Comparative analysis showed that higher numbers of OTUs were obtained for a 99% identity threshold compared to the 97% identity threshold for the respective combination of the database used.

Overall Classification:
In the original data set, the numbers of different taxa were 1, 29, 56, 126, 277, 1422, and 5895 at the kingdom, phylum, class, order, family, genus, and species, respectively. However, in comparison to the original data set, relatively fewer taxa were detected using different DAMs. Also, differences were observed among the DAMs in the total number of taxa detected at different taxonomic levels (Figure 4(a) and Figure 4(b)). The graph shows the taxonomic ©Biomedical Informatics (2021) assignments performed with the help of QIIME for the data set using different databases. Original seven-level taxonomy for the data set consisted of 29 phyla. Although taxonomic assignment at the phylum level was observed for most of the sequences using all three databases, not all 30 phyla were observed. Classification by EzTaxon database gave rise to 27 phyla, whereas Greengenes and SILVA databases could classify 26 phyla each for the data set. A total of 1422 genera were present in the data set. SILVA and EzTaxon databases assigned more than 1000 genera in each DAM, but the third database Greengenes could assign less than 1000 genera for its all 4 DAMs. It was observed that no database could successfully assign species-level taxonomy for all the sequences. One possible explanation can be the presence of incomplete taxonomy data present in databases. Furthermore, it was observed that different taxonomic assignment is observed for the same sequences using different pipelines. Thus the same data was checked for the total number of discrepancies present in taxonomic assignments in between different DAMs. Here taxonomy from the original data set was used as a standard control, and discrepancies by each DAM and databases were compared with reference data set.

Discrepancies:
Incorrect taxonomic classification of a particular type strain sequence using DAMs is referred to as discrepancy in the taxonomic assignment. High amount of discrepancy was observed in the identification of the data set ( Figure 5). Percentage discrepancy decreases with higher taxonomy hierarchy. Also, a significantly high amount of discrepancy was observed in the data after analysis through QIIME. A total of 18.78% and 10.53% discrepancy was observed in the identification of type strain data set for the full length and partial sequences, respectively. A discrepancy in the taxonomic assignment was calculated at all taxonomic hierarchies for both full length ( Figure 6) and partial sequences (Figure 7). The detailed DAM vise information about classification and discrepancy data for different taxonomic level identifications is represented for full length sequences ( Table 2 to  Table 5) and partial sequences ( Table 6 to Table 9).

Beta Diversity:
Principle Component Analysis (PCA) graphs were plotted for understanding the variation between different samples. A significant variation was observed between databases and standard datasets. Each of the three databases formed their separate clusters in plots for full length as well as partial sequences. Clusters included their variations like 97% and 99% identity thresholds along with de novo and closed reference-based OTU picking method. The original dataset stands alone separately, and some variation exists between data set and clusters. In the case of both full length and partial sequences, SILVA database assigned taxonomy appears closest to the dataset as compared to Greengenes and EzTaxon (Figure 8).

Misclassification:
Percentage misclassification was varying depending upon the method of picking OTU, identity threshold as well as databases used for the identification. Misclassification was observed using ©Biomedical Informatics (2021) different DAMs at genus level (  10.03 (99_c_silva). A total of 48 genera belonging to 26 families were analyzed separately for taxonomic assignment discrepancies. These 48 genera were the most abundant and important in global microbiome studies. Higher misclassification observed in these genera and respective families is evidence of the necessity for correction and up-gradation of databases and software used for bacterial identification using 16S rRNA gene sequencing. Table 11 and Table 12 provide with a list of genera and families used for the top 48 discrepancy analysis, respectively. Table 13 depicts percentage misclassification observed at genus and family level for major human microbiome taxa using different DAMs (Table 13).

Discussion:
Advances in sequencing technologies have reduced the sequencing cost. Also an increase in computational power has facilitated an overwhelming number of microbiome studies in the diverse ecological niches. A primary goal of all microbiome studies is to identify the bacteria that constitute these complex communities. A valid and reliable method is a must for understanding these complex communities. The purpose of this study was to validate widely used databases like EzTaxon, SILVA, Greengenes, and data analysis pipeline QIIME. Since NGS is becoming cost-effective nowadays, researchers prefer sequencing 16S rRNA gene-based amplicon sequencing for the identification of isolates. The data obtained from sequencing is compared with standard sequences from various databases. In this study, we present results of discrepancies in taxonomic assignment occurred by different databases using the same set of sequences and analyzed through QIIME. Detailed literature review was performed for the selection of QIIME 1.8 as analysis pipeline. We narrowed down to QIIME 1 as it is one of the most used and well-accepted analysis pipeline for microbiome analysis having over 22000 citations. QIIME 1 provides with the OTU selection thresholds of 97%, which is the gold standard identity threshold for 16S rRNA gene analysis for several ©Biomedical Informatics (2021) years. Also, We included 99% identity threshold for comparison purpose. We used data from the RDP database, which provides with the option of bulk data download and is scientifically wellaccepted database for the genomic data. Using both full-length sequences (>1200 bp) and partial sequences (V3-V4) of type strain data eliminates chances of bias of selecting only a specific region of 16S rRNA gene and analyzing certain specific sequence data. One of the earlier studies has majorly focused on analyzing data using seven variable regions but has used four mock samples, amplified using the metagenomic kit and a single database [39]. phyla. The rest of the sequences either remained unclassified or got misclassified into those identified phyla. Also, during taxonomic assignments using different databases, many of the OTUs remained 'unclassified' at various taxonomic levels. Tables 2-9 provide with a number of OTUs observed to remain unclassified at each taxonomic level for full length as well as partial sequences. The reason behind this is the unavailability of taxonomy for the sequence at that particular taxonomic level in respective databases. Such 'Unclassified' incidences were seen to reduce as we went to higher taxonomic levels, but such occurrences could indicate incompleteness of databases at various taxonomic levels. Such incompleteness of databases might fail to provide the taxonomic identification till species level. Another important observation was noted about misclassification and discrepancy in taxonomic assignment. Taxonomy assigned for the sequence by QIIME pipeline with a specific database was completely different from the original taxonomy in the data set. Varying discrepancies were observed in assigned taxonomy at each taxonomic level and in each of the 10 DAMs. A total of 18.78% discrepancy in assigned taxonomy was observed at the genus level, which is a high amount of discrepancy. Genus level classification is mentioned here because genus and species are two taxonomic hierarchies that we generally use for referring any organism. The percentage of discrepancy was seen to reduce with higher taxonomic hierarchy. Table 10 enlists some of the misclassified organisms. Each database showed a different percentage of discrepancy in the taxonomic assignment possibly due to different time intervals of database update or the methodological limitations of the analysis measures. The highest discrepancy was observed as 31.58% for 97_c_gg & lowest discrepancy was observed as 4.37% for 97_d_ez at the genus level. Such misclassification i.e., assignment of different taxonomies for the same sequences by different databases, makes it difficult to choose a suitable database for analysis purposes. Generally, only single pipeline and database is used for any analysis and identification purpose by any researcher or sequencing facilities or even pathology labs [3] for identification of the pathogen. Such improper identifications could hamper analysis reliability. In recent years, the field of microbiome research has moved ahead from just exploring the microbial diversity in the sample to interventional and translational research. Hence, it is crucial to use correct bacterial strains. The use of wrong bacterial strains can be deleterious. One of the studies clearly states the misclassification issue of genus Acinetobacter with phylogenomic approach. It has now been established that the genus is poorly classified, especially for closely related species like Acinetobacter calcoaceticus and Acinetobacter baumannii complex (Acb complex) [47]. Such inaccurate assignment in taxonomic classifier might possibly lead to community member misclassification and ultimately misleading ©Biomedical Informatics (2021) conclusions and possibly treatment. In this study, we explored most widely used 16S rRNA gene databases and QIIME pipeline with various parameters. A similar approach can be used to benchmark any new pipeline or database before analyzing actual data. The reference set allows us to cross-validate taxonomic assignments. In the last decade, a lot of microbiome data is available for various ecosystems, like soil, water, plant, animal, and human. This has enabled us to get a fair idea about the taxonomic groups expected in the ecosystem. A combination of a particular pipeline and database can be selected based on the accuracy in assignment taxa expected in the ecosystem. This study gives a fair idea about the limitation of using short reads for taxonomic assignment as well as about the list of important taxa that can be misclassified using a particular analysis pipeline.

Conclusions:
99% identity threshold is better for OTU clustering for full length as well as partial length sequences than conventional 97% identity threshold. A total 18.78% and 10.53% discrepancy was observed at the genus level for the full length and partial sequences, respectively, which is a high amount of discrepancy. The discrepancy at each taxonomic level can be calculated, and the quality of data present in the database can be decided. Beta diversity analysis shows an overall distance of analyzed data from the reference data set. 99_c_silva shows most identity with reference data set as compared to other DAMs. 99_c_silva means SILVA database with a identity threshold of 99% and close reference-based OTU picking method. It is crucial to select databases, pipelines, and algorithms very carefully considering discrepancies in taxonomic assignment and selection should be done based on the necessity of the study. Also, databases should be validated, and discrepancies should be corrected in successive updates of databases.

Limitations:
Study is restricted to QIIME analysis pipeline and one can always use similar approach to benchmark any other globally used analysis pipeline like MOTHUR, MG-RAST, etc. Furthermore, despite using both 97% & 99% identity thresholds and De novo & close reference-based OTU picking approaches; study focused on only a single method i.e. uclust for OTU clustering.