Optimizing k-mer size using a variant grid search to enhance de novo genome assembly

Largely driven by huge reductions in per-base costs, sequencing nucleic acids has become a near-ubiquitous technique in laboratories performing biological and biomedical research. Most of the effort goes to re-sequencing, but assembly of de novogenerated, raw sequence reads into contigs that span as much of the genome as possible is central to many projects. Although truly complete coverage is not realistically attainable, maximizing the amount of sequence that can be correctly assembled into contigs contributes to coverage. Here we compare three commonly used assembly algorithms (ABySS, Velvet and SOAPdenovo2), and show that empirical optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome, the nematode parasite Meloidogynechitwoodi. Each assembler was challenged with about 40 million Iluumina II paired-end reads, and assemblies performed under a range of k-mer sizes. In each instance, the optimal k-mer was 127, although based on N50 values,ABySS was more efficient than the others. That the assembly was not spurious was established using the “Core Eukaryotic Gene Mapping Approach”, which indicated that 98.79% of the M. chitwoodi genome was accounted for by the assembly. Subsequent gene finding and annotation are consistent with this and suggest that k-mer optimization contributes to the robustness of assembly.


Background:
The progression of technology from Sanger sequencing to the current "next-generation" platforms has heralded striking reductions in the cost of generating data. Sequencing nucleic acids has become a near-ubiquitous technique in laboratories performing biological and bio-medical research. Sequencing comes in two forms, distinguished by their needs for assembly into a contiguous reconstruction of a larger molecule. Most prevalent are various forms of "re-sequencing" in which the sequencing reads are aligned with a reference genome to reveal bases polymorphic between samples. Computationally, this is not a difficult undertaking. The other mode is the assembly of de novo-generated, raw sequence reads into contigs that are, as close as possible a full accounting of the genome of the organism in question. In practice, except for the smallest of genomes, complete coverage is neither attainable nor usually needed. None-the-less, maximizing the amount of sequence that can be correctly assembled into contigs is desirable. Reference-free assembly is based on stacking overlapping sequences of genomic fragments of a defined size (the k-mer), generated by breaking each read into k-mer size. Here we examined three commonly used assembly platforms, and showed that optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome.
Genome assembly algorithms permit adjustment of k-mer size, and also of the related feature coverage (or depth) of the k-mer assembly. The k-mer optimizing tool "Velvetadvisor"[1], for example, estimates a theoretically optimal k-mer size as follows:   Table 1-2). X-axis indicates software, k-mer size, and coverage cut off. Y-axis on the left side indicates the length of longest contig (bp) as a function of x-axis, corresponding to grey bars. Y-axis on the right side indicates N50 length (bp), corresponding to red lines. During optimization process, to assess assemblies by N50 (red edges), it is compared of de novo assembly of ABySS, Velvet, and SOAPdenovo using different k-mer sizes and coverage cut offs.A more contiguous assembly is obtained for larger N50. At the default coverage thresholds, when k-mer sizes were increased, N50 was overall concave, peaking at 127-mer. When coverage threshold was increased within the same k-mer size, N50 was decreased within 127-mer whereas increased within 247-mer. The length of longest contig (grey bar), though not exactly identical, shows similar pattern as N50. Among the selected k-mers, the largest numbers of N50 and the length of the longest contig were achieved at 127-mer and 4.6 coverage-cut-off by ABySS. Prior to assembly of the M. chitwoodi reads, we queried "Velvetadvisor" and "KmerGenie" to compute a value for kmer size (247 and 260 respectively). Although similar, these values are not identical, and led us to explore empirical optimization of k-mer size. In this study, we show that a 'Simple Grid Search', a widely used optimization algorithm, achieves the best k-mer value for assembly. Our proposed method has three steps. Firstly, we explicitly specified an equally-spaced interval including the k-mer size predicted by 'Velvet advisor' or 'KmerGenie'. Those k-mers were evaluated according to N50.Secondly, we selected a next set of k-mers in a more-narrow interval around those k-mers with the largest N50 from the first evaluation. Lastly, we chose the best k-mer by assessingthe second set of k-mers by taking into account N50 as well as other statistics. We found that assembly size is much more sensitive to k-mer size than has been theoretically estimated (Figure 1). Importantly, we found that our empirical approach yielded an assembly with an N50 of 70,023, compared to best N50 values of 46,442 (Velvet advisor) or 42,333 (KmerGenie).

©2016
Alone, the N50 value provides no information about the quality of the assembly, which needs to be verified by some independent means. One useful metric is to detect the presence or absence of a set of genes encoding proteins established to be crucial for eukaryotes. The "Core Eukaryotic Gene Mapping Approach" (CEGMA) tool performs such an analysis using a defined database of 458 core proteins [7]. The percent of that protein set identified serves as a surrogate for the percentage genome coverage by the assembly. Additionally, the highly defined CEGMA proteins identify a reference set from which to unambiguously deduce elements of gene structure, including translation start/stop sites and intron/exon boundaries. Such gene models represent a reliable training set for gene prediction algorithms such as AUGUSTUS [8]. In our project, we further seeded the gene finders with EST data.Finally, genomic features were elucidated using RepeatMasker [9], and functional domains were predicted using InterProScan [10] and Blast2GO [11] as functional annotation.The results we present hereindicate that the assembly based on empiricallydetermined k-mers yields not just a larger N50, but also a useful genome assembly.

Methodology: Data Generation and Processing
Total genomic DNA was isolated from Meloidogynechitwoodi collected in a potato field in Washington State, and shipped as an ethanol precipitate to NCSU. Libraries with an average insert size of 700 bp were constructed to facilitate 300 bp paired-end reads, and sequences determined on an IlluminaMiSeq II instrument. Low quality reads (Phred values ≤30) were rejected, and the remainder used for assembly. Because it is likely that different assembly algorithms will give different results in a genome-specific (and an a priori unpredictable) manner, we performed k-mer optimization on three commonly-used assembly algorithms, viz., ABySS (version 1.3.7), Velvet (version 1.2.10) and SOAPdenovo (version 2.01).

De novo Assembly
In advanceof fine-tuning parameters, we estimated the recommended k-mer size using "KmerGenie" to be a 260-mer.

©2016
We trimmed this to 259-mers to suppress palindromes. The "Velvetadvisor" [1]recommended an optimal k-mer to be a 247mer (coverage cut-off 15). We performed assemblies by employing a variant of a 'Simple Grid Search' where we used kmers ranging from 259-mer to 63-mer, with intervals of 36. Other attributes were set to the default setting of coverage-cutoff for each algorithm. To assess if the largest N50 is observed at one particular k-mer size among the previously tested set, k1 (k-mer with the largest N50) and k2 (k-mer with the second largest N50) were averaged to k3, (k1+k2)/2. More values of kmer surrounding k3 at intervals of 2 were performed to identify the optimal k-mer value (Table1-1). The total number of reads aligned to contigs was also taken into account. In addition, for further parameter fine-tuning, results from different coveragecut-offs other than default settings were compared (Table2).SOAPdenovo was run under k-mer sizes equal to or less than 127-mer as it is the maximum k-mer size available in this program. To further validate our method, we arbitrarily selected two organisms for evaluation: the bacteria Neisseriagonorrhoeae (assembled genome size 2.15 Mbp) and Camelpox virus (assembled genome size 0.20 Mbp). FASTQ files were obtained from the European Nucleotide Archive (ENA) and were de novo assembled by ABySSusing k-mer sizes chosen by KmerGenie and Velvet advisor as well as by our empirical methods.  Table 2: With the selected k-mer sizes, different coverage-cut-offs were compared acrossthree software tools. Empirical optimization of k-mer sizes enhances genome assembly across different software platforms.

Gene Prediction and Automated annotation
To generate an initial training set, we queried our assemblies using CEGMA (version 2.4). To expand the training set, we incorporated cDNAs as evidence obtained from nematode.net and NCBI. These sets were processed using the AUGUSTUS web server (http://bioinf.unigreifswald.de/webaugustus/) [12]forpredictinggenes in genome ab initio.Additionally, gene annotations generated by AUGUSTUS were searched by InterProScan and Blast2GO to identify GO terms and gene families. We investigated DNA elements and repeat regions using RepeatMasker (version 4.0.5; http://repeatmasker.org) [9], and GC contents using a tool set of Biopieces (http://www.biopieces.org).

Results &Discussion:
Illumina sequencing yielded a total of 42,011,068 paired-end sequence reads (21,005,534 from each end), occupying 27.5 gigabytes in FASTQ format. The average of quality score is about 34. The reads were empirically optimized for de novo assembly.Under default settings of coverage-cut-off, the overall trend of N50 was concave, peaking at 127-mer (Figure 1  &Table1).We observed that the decrease of N50 within 127-mer and the increase of N50 within 247-mer across all the software we tested as we increased coverage-cut-offs within each k-mer size (Figure 1 &Table 2). The largest N50 within 127-mer was still larger than the largest N50 within 247-mer in ABySS. On the other hand, the largest N50 within 127-mer was smaller than the largest N50 within 247-mer in velvet. When compared across software, the largest N50 of 70,023 was achieved by ABySS tool at optimized k-mer size of 127 at the coverage ©2016 threshold of 4.6. Thus, our empirical optimization achieved better assemblies than the commonly-used k-mer predictors. For the following further analysis, we elected to use our strategy to optimize the M. chitwoodi genome.The genome size of this selected assembly is 152,604,382 (150Mb).
At the protein level, CEGMA predicted, in the M. chitwoodi genome, 245 (98.79%) of the 248 core proteins, implying near 100% genome coverage. In addition, it identified 2.23 average number of orthologs per CEG and 94.29% had more than one potential ortholog. This was supported by blasting CEGMA proteins as a query against the assembled contigs as a database, resulting in one protein hit with more than two contigs. This would imply genome duplication or a genome with high heterozygosity, as has been established for M. incognita [13] but not for M. hapla [5].The broad applicability of our approach was demonstrated in diverse species, including a bacterium and a virus. For Neisseriagonorrhoeae, the k-mers predicted by Velvet advisor and KmerGenie were 275-mer and 198-mer, respectively, yielding N50s of 28,848 and 44,552. In contrast, our method returned an N50 of 48,678 using a 155-mer. On Camelpox, the Velvet advisor k-mer of 301 resulted in a failed assembly. KmerGenie recommended a k-mer of 58, resulting in an assembly with an N50 of 179,206. By contrast, our method yielded an assembly with an N50 of 190,481.

Conclusion:
In assembling a whole genome, it is desirable to achieve a balance between computational costs and the trade-off relationships between k-mer size and its coverage; namely large k-mer size with low coverage or a small k-mer size with deep coverage. Tools "Velvetadvisor" and "KmerGenie" were developed to resolve these problems. As seen in our study, however, those tools cannot be directly applied to the experimental data. Their predicted k-mer sizes gave de novo assembly quite different from our empirically optimized assembly of M. chitwoodi. This was confirmed by our experiments with two other organisms of bacteria and virus.To overcome this, we showed that our approach, using a variant of a 'Simple Grid Search' to identifyoptimal k-mer size and coverage, led to a more complete assembly. The quality of assembly was confirmed by CEGMA, predicting 98.79% core proteins in the M. chitwoodigenome.By integrating different tools of CEGMA and AUGUSTUS, more reliable gene models could be generated. This could also improve the completeness of subsequent analyses, for example, functional analysis or comparative genomics approach. In future studies, we aim to examine the evolutionary history of the genus Meloidogyne and how that relates to, or is derived from, attributes germane to parasitism. For example, because M. chitwoodi and M. hapla are sympatric, they presumably have similar gene compliments.