HPV-QUEST: A highly customized system for automated HPV sequence analysis capable of processing Next Generation sequencing data sets

Next Generation sequencing (NGS) applied to human papilloma viruses (HPV) can provide sensitive methods to investigate the molecular epidemiology of multiple type HPV infection. A genotyping system that combines a comprehensive collection of updated HPV reference sequences with the capacity to handle NGS data sets is currently lacking. HPV-QUEST was developed as an automated and rapid HPV genotyping system. The web-based HPV-QUEST subtyping algorithm was developed using HTML, PHP, and the Perl scripting language, with MySQL as the database backend. To genotype blasted query sequences, HPV-QUEST includes a database of annotated HPV reference sequences with updated nomenclature covering 5 genera, 14 species, and 150 mucosal and cutaneous types. HPV-QUEST processes up to 10 megabases of sequences within 1 to 2 minutes. Results are reported in HTML, text, and Excel formats. They display the e-value, blast score, and local and coverage identities; provide the genus, species, type, infection site, and risk for the best matched reference HPV sequence; and are ready for additional analyses.

Availability:
The database is available for free at http://www.ijbcb.org/HPV/


Background:
Human papilloma virus (HPV), the most common sexually transmitted infection, causes cervical cancer in women, contributes to anogenital cancers in men, and is associated with oropharyngeal cancers and genital warts in men and women [1]. Currently, PCR-based assays are applied to identify HPV prevalence, ranges of oncogenic and non-oncogenic HPV types, and incidence of multiple type infection [2]. Next Generation sequencing (NGS) technology provides increased sensitivity for in-depth analysis of HPV types, although large data sets of HPV sequences present considerable barriers to analysis. Available automated HPV genotyping tools, including Virus Sequence Database [3], REGA HPV Automated Subtyping Tool [4], and NCBI blastn, are limited by a restricted number of reference sequences with incomplete annotation or outdated nomenclature, an inability to classify short sequences, or an inadequate capacity to analyze large sequence data sets efficiently. To accelerate HPV genotyping of high-throughput NGS data, an automated system including a comprehensive collection of HPV mucosal and cutaneous reference sequences with updated nomenclature was developed.

Methodology:
The web-based HPV-QUEST subtyping system is written in PHP and HTML, uses MySQL as the database management system for blast searches, and is freely available at http://www.ijbcb.org/HPV/. HPV-QUEST can process up to 10 megabases (Mb) of sequences (around 6,500 sequences of 100 bp) per run, returns results within one to two minutes, and displays up to 150 hits, with the top hit shown by default.
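For data sets larger than the per-run limit, the input can be divided before submission. The following is a minimal sketch of that idea, not part of HPV-QUEST itself: the function names `read_fasta` and `batch_records` are hypothetical, and the only value taken from the text is the 10 Mb limit per run.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_BASES_PER_RUN = 10_000_000  # 10 Mb per-run limit stated in the text

Record = Tuple[str, str]  # (FASTA header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_records(records: Iterable[Record],
                  max_bases: int = MAX_BASES_PER_RUN) -> Iterator[List[Record]]:
    """Group records into batches whose total length stays within max_bases.

    A single record longer than max_bases is emitted as its own batch.
    """
    batch: List[Record] = []
    total = 0
    for header, seq in records:
        if batch and total + len(seq) > max_bases:
            yield batch
            batch, total = [], 0
        batch.append((header, seq))
        total += len(seq)
    if batch:
        yield batch
```

Each batch can then be written back to a FASTA file and submitted as a separate run.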
HPV genotyping is based on sequences from the L1 region, which comprises 1500 nucleotides. HPV-QUEST includes a new HPV database with updated nomenclature for 150 annotated cutaneous and mucosal HPV L1 sequences, representing 5 genera, 14 species, and 150 types, compiled from complete genomes, subgenomic regions containing the L1 region, or the L1 region.

A set of Perl scripts is applied to parse the program output files and produce a result page in HTML format and a report in both text and Excel formats containing: No. (the query sequence serial number), Query id (fasta file header of the query sequence), Score (blast score), Evalue (expect value), Strand (+/+ or +/-), Local identity (percentage of matched nucleotides within the alignment region), Coverage identity (percentage of nucleotides matched with the reference sequence), Genus, Species, Type, GI (NCBI gene identification number), AN (NCBI accession number), Source (source of the reference sequence), Infection site (mucosal, cutaneous, or both), Risk (high, low, or unknown), Ref seq region (reference sequence region in the genome), Length of ref seq (nt), and Alignment (alignment of the query sequence with the reference sequence) (Figure 1B). The original query sequences are included in the report to eliminate the need to match query sequences with their corresponding results. Query sequences that fail to align with any known reference sequence in HPV-QUEST are designated as "nd". Any sequences that fail to blast, have low local identity, or have an e-value >1e-15 are considered to be of low quality or to represent a new recombinant or a new genotype.
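The two identity measures and the flagging rules can be illustrated with a short sketch. This is not the HPV-QUEST Perl code. It assumes that coverage identity divides the matched nucleotides by the query length, which the text does not state explicitly, and it uses a hypothetical local identity threshold of 90% because no value is given. The `Hit` class, the function names, and the "review" label are likewise illustrative; only the "nd" label and the 1e-15 e-value cutoff come from the text.

```python
from dataclasses import dataclass
from typing import Optional

EVALUE_CUTOFF = 1e-15      # cutoff stated in the text
MIN_LOCAL_IDENTITY = 90.0  # hypothetical threshold; the text gives no value


@dataclass
class Hit:
    """Best blast hit for one query (field names are illustrative)."""
    hpv_type: str
    matches: int           # identical nucleotides within the alignment
    alignment_length: int  # length of the aligned region
    query_length: int      # full length of the query sequence
    evalue: float


def local_identity(hit: Hit) -> float:
    """Percentage of matched nucleotides within the alignment region."""
    return 100.0 * hit.matches / hit.alignment_length


def coverage_identity(hit: Hit) -> float:
    """Percentage of matched nucleotides over the whole query (assumed denominator)."""
    return 100.0 * hit.matches / hit.query_length


def classify(hit: Optional[Hit]) -> str:
    """Return the HPV type, "nd" for no alignment, or "review" for flagged hits."""
    if hit is None:
        return "nd"
    if hit.evalue > EVALUE_CUTOFF or local_identity(hit) < MIN_LOCAL_IDENTITY:
        # Possible low quality read, new recombinant, or new genotype.
        return "review"
    return hit.hpv_type
```

Under these assumptions, a 120 nt query with 98 matches over a 100 nt alignment has a high local identity but a lower coverage identity, which is why reporting both values is useful for short NGS reads.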

Testing and validation:
HPV-QUEST version 1.0 was tested and validated in two ways. Firstly, the reference sequences used to construct the database were blasted against themselves. The typing was 100% correct, and all e-values were 0 with local and coverage identities of 100%. Secondly, a test data set of 18,000 quality HPV pyrosequences, generated with PGMY9/11 and GP5+/6+ primers using Titanium Amplicon Pyrosequencing technology from DNA extracted from genital swabs of 15 asymptomatic men recruited in an international study cohort, was processed with HPV-QUEST, and the results were compared with typing by traditional NCBI blastn with a cutoff e-value of 1e-15 [10]. The HPV genotypes and their frequency distribution obtained with HPV-QUEST agreed with the results from NCBI blastn, while the time needed to produce results ready for analysis was substantially shorter (less than 30 minutes versus more than 40 hours).
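A comparison of this kind can be sketched as follows. The sketch assumes that each method's output has already been reduced to a mapping from query id to assigned HPV type; the function names and the example read ids are hypothetical and do not reproduce the validation data.

```python
from collections import Counter
from typing import Dict


def type_frequencies(calls: Dict[str, str]) -> Counter:
    """Count how many query sequences were assigned to each HPV type."""
    return Counter(calls.values())


def concordance(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> float:
    """Fraction of queries typed by both methods that received the same type."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0.0
    agree = sum(1 for query_id in shared if calls_a[query_id] == calls_b[query_id])
    return agree / len(shared)


# Illustrative input only: three made-up reads typed by two methods.
quest_calls = {"read1": "HPV16", "read2": "HPV6", "read3": "HPV16"}
blastn_calls = {"read1": "HPV16", "read2": "HPV6", "read3": "HPV16"}

same_distribution = type_frequencies(quest_calls) == type_frequencies(blastn_calls)
per_read_agreement = concordance(quest_calls, blastn_calls)
```

Comparing the frequency distributions checks agreement at the level of the sample, while the per-read concordance is a stricter check that each individual sequence received the same type.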

Caveats and Future development:
New HPV types are discovered continuously, and HPV classification and nomenclature are updated periodically by the Reference Center for Human Papillomaviruses at the German Cancer Research Center in Heidelberg. These updates will be used to keep HPV-QUEST current. Version 2.0 will include HPV subgenomic regions other than L1, reference sequences for non-human papilloma viruses, and extensive data sets generated by Next Generation sequencing technology.