A Genomic Signature for Genotyping Mycobacterium tuberculosis

Mycobacterium tuberculosis (MTB), the causative agent of tuberculosis (TB), has a vast diversity of genotypes including Beijing, CAS, EAI, Haarlem, LAM, X, Ural, T, AFRI1 and AFRI2. However, genotyping can be expensive, time consuming and in some cases, results may vary depending on methodology used. Here, we proposed a new set of 10 SNPs using a total of 249 MTB genomes, and selected by first the inclusion/ exclusion (IE) criteria using spoligotyping and phylogenies, followed by the selection of the nonsynonymous SNPs present in the most conserved cluster of orthologous groups (COG) of each genotype of MTB. Genotype assignment of the new set of 10 SNPs was validated using an additional of 34 MTB genomes and results showed 100% correlation with their known genotypes. Our set of 10 SNPs have not been previously reported and cover the MTB genotypes that are prevalent worldwide. This set of SNPs could be used for molecular epidemiology with drug resistant markers.

complicate their distribution and evolution. This is where genotyping becomes a challenge with the current methodologies MIRU-VNTR and spoligotyping, which lack sufficient discriminatory power to differentiate between families [18]. Whole genome sequencing (WGS) has been shown to give superior resolution to that of MIRU-VNTR and spoligotyping providing all possible genomic targets, information on drug resistance, genome evolution and virulence determinants [19]. However, massive sequencing costs are still expensive in countries with high TB incidence. The best approach will be to analyze a specific set of SNPs associated to a determined genotype. Recent studies have proposed sets of SNPs, one using 45 SNPs from 04 genomic sequences [20] and another using 62 SNPs from 1601 genomes [21]. However, these two studies have been based on phylogenies. In this study, we proposed a new set of 10 SNPs using a total of 249 MTB genomes selected first by the inclusion/ exclusion (IE) criteria using spoligotyping and phylogenies, followed by the selection of the nonsynonymous SNPs present in the most conserved COG (cluster of orthologous groups) of each genotype (Beijing, CAS, EAI, Haarlem, LAM, X, Ural, T, AFRI1 and AFRI2) of MTB. The addition of spoligotyping provides higher informative results on the phylogeographic distribution of MTB's genotypic diversity [22].

Phylogenetic analysis of genomic sequences:
Phylogenetic assays were performed on 173 MTB genomes (125 with genotypes determined by spoligotyping and 48 with unknown genotypes) using two different programs of alignmentfree: kSNP [26] and ParSNP [27]. Each program uses different algorithms to select SNPs, constructs the Maximum likelihood phylogenies and compares topologies. kSNP identifies SNPs based on k-mers analysis without using a reference genome. This program requires a k-value for the mer size, which is calculated by Kchooser tool. Also, in order to count the k-mers in the genomes, kSNP uses the jellyfish software [28], then compares these mers in all the genomes to find SNPs and finally creatse a SNP matrix to make phylogenetic trees. ParSNP uses maximal unique matches (MUM) to generate a multiple genome alignment and a SNP tree. This MUM for two genomes is a genetic index that considers the level of DNA conservation of the core genome and the proportion of DNA shared by these genomes. After the alignment process, the core-genome SNPs are selected and used to construct a phylogenetic tree.

Cog analysis:
To determine the most variable and conserved COGs for MTB, a COG analysis was performed on 249 MTB genomes that included the genotypes Beijing, CAS, EAI, Haarlem, LAM, X, Ural, AFRI1, AFRI2 and unknowns. Briefly, a comparative analysis based on H37Rv to determine that SNPs was performed using SamTools program [24]. Then, a SNP database was generated which included SNPs corresponding gene and COG.

SNP set selection:
Genotyping assignment was performed in two steps: (I) using the inclusion/exclusion criteria in experimental genotype (Spoligotypes) and phylogenetic prediction and, (II) using COG.

(I) Inclusion/ exclusion (IE) criteria using spoligotyping and phylogenies:
First, 125 MTB genomes with experimental information of genotypes were processed. Each genome sequence was transformed to binary codes, as presence (1) or absence (0) of a mutation in loci, based on H37Rv. Then, genomes were grouped according to their genotypes (Beijing, CAS, EAI, Haarlem, LAM, X, AFRI1 and AFRI2). Using IE criteria, loci shared with more than one genotype were discarded. Second, to reinforce the SNP selection, 124 additional MTB genomes (76 with genotypes determined by phylogenetic analysis and 48 of unknown genotypes with genotypes predicted by phylogenetic analysis) were included. Finally, the SNPs shared in more than one MTB genotype of Beijing, CAS, EAI, Haarlem, LAM, X, AFRI1 and AFRI2 were discarded. A total of 2123 specific SNPs to each genotype of MTB were identified as SNPs-IE.

(II) Nonsynonymous SNPs selection based on COG (cluster of orthologous groups):
Then, the SNPs were organized in COG functional groups [29]. The genotypes of the MTB genomes have shown variant locus distribution in COG starting from the most variable to the most conserved (S, R, I, Q, E, C, F, L, N, G, H, K, P, J, T, O, F, D, V, U and A). In this case, we selected the locus from the most conserved to the most variable gene family. Then, we selected a locus associated with a nonsynonymous SNP for each of the 10 MTB genotypes.

Results & Discussion: Lineage determination of 48 unknown MTB genomes by phylogenic analysis:
Tree topology was performed on 249 MTB genomes using kSNP and Parsnp (Figure 1). Eight clades were identified with good support: AFRI1, EAI, CAS, Beijing, Haarlem, X, LAM and T. However, clade AFRI2 remains paraphyletic with strains at different levels of the bifurcation. Of the 48 unknown MTB genomes, 03 were AFRI1, 16 were EAI, 03 were CAS, 22 were Beijing, 15 were Haarlem, 03 were X and 15 were LAM. One Haarlem strain (Haarlem/NITR2) as determined by spoligotyping fell in the X clade by both methodologies (kSNP and Parsnp). There are limitations in the assignment of genotypes by phylogenetic analysis and often require global information of the genomic sequences for an optimal lineage approximation.

Identification and selection of SNP set:
To develop the assigning system of SNPs and genotypes, we first used the IE criteria. 125 MTB genomes with known genotypes determined by spoligotyping were analyzed as described in Methods. The T genotype was not analyzed because the H37Rv falls in this genotype and was used in the process of genomic mapping. Additionally, 124 MTB genomes were included with genotypes determined by phylogenetic analysis. In a database, we integrated the 7649 SNPs from 249 genomic sequences of MTB that included the genotypes Beijing, CAS, EAI, Haarlem, LAM, X, Ural, AFRI1 and AFRI2. The number of non-specific genotype SNPs were 458, 854, 674, 624, 424, 412, 886, 1735 and 1582 respectively. After applying the IE criteria, the SNPs were organized in Group A, which contained SNPs unique for each genotype, and in Group B, which contained the SNPs shared between the different genotypes. Then, it was followed by a manual exclusion of shared SNPs. The result was a new group of SNPs (SNPs-IE) for the genotypes Beijing (n=20), CAS (n=276), EAI (n=208), Haarlem (n=15), LAM (n=37), X (n=16), Ural (n=372), AFRI1 (n=580) and AFRI2 (n=599) (Figure 2).
Finally, this was followed by the selection of the nonsynonymous SNPs present in the most conserved COG of each of MTB. Before the analysis, the COGs most variables and most conserved for each genotype of MTB were determined (Figure 3). Then, a SNP database was generated, which included SNPs, corresponding gene and COG, showing 2, 36, 20, 4, 2, 2, 40, 65 and 67 SNPs for Beijing, CAS, EAI, Haarlem, LAM, X, Ural, AFRI1 and AFRI2 respectively. We selected 10 nonsynonymous SNPs for the 9 MTB genotypes analyzed (Table 1 and 2) and an alternative SNPs set ( Table 2).

Validating genotype assignment using new proposed set of 10 SNPs:
Using an additional 34 MTB genomes with known genotypes Beijing (n=12), LAM (n=07), X (n=06), CAS (n=04), Haarlem (n=04) and EAI (n=01) we tested the ability of our new set of 10 SNPs to assign genotypes. There was 100% correlation of genotypes assignment with all the strains tested (Table 3). Table 1: Stepwise SNP set selection: IE, differential genotype ; *, the loci belonging to other strains from different genotypes and uncommon between genotypes were eliminated. **, SNP set based in COG group (AK); *** SNP set based in less variable COG group genotype.

Genotype
Beijing