Discriminating antigen and non-antigen using proteome dissimilarity III: tumour and parasite antigens.

Computational genome analysis enables systematic identification of potential immunogenic proteins within a pathogen. Immunogenicity is a system property that arises through the interaction of host and pathogen as mediated through the medium of a immunogenic protein. The overt dissimilarity of pathogenic proteins when compared to the host proteome is conjectured by some to be the determining principal of immunogenicity. Previously, we explored this idea in the context of Bacterial, Viral, and Fungal antigen. In this paper, we broaden and extend our analysis to include complex antigens of eukaryotic origin, arising from tumours and from parasite pathogens. For both types of antigen, known antigenic and non-antigenic protein sequences were compared to human and mouse proteomes. In contrast to our previous results, both visual inspection and statistical evaluation indicate a much wider range of homologues and a significant level of discrimination; but, as before, we could not determine a viable threshold capable of properly separating non-antigen from antigen. In concert with our previous work, we conclude that global proteome dissimilarity is not a useful metric for immunogenicity for presently available antigens arising from Bacteria, viruses, fungi, parasites, and tumours. While we see some signal for certain antigen types, using dissimilarity is not a useful approach to identifying antigenic molecules within pathogen genomes.


Background:
Vaccines induce provoke protective immunity and come in both "living" and "non-living" varieties. Living vaccines are usually attenuated or weakened pathogenic microbes which retain aspects of natural infection including the ability to revert to a pernicious form. Non-living vaccines are either chemically or heat treated whole pathogens or pathogen components. Subunit vaccines comprise protein components isolated from pathogenic micro-organisms and have several advantages: they have longer shelf-lives, are more stable, and cannot regain pathogenic status. However, their identification can be arduous requiring the expense of much time and resource.
Tumour immunotherapy refers to using the host immune system to battle cancer. Strictly, a tumour is a solid lesion or neoplastic growth resulting from unregulated cell division, and may be benign, pre-malignant, or malignant; we use the term synonymously with cancer, and specifically here as a pseudonym for human tumour antigens extracted from the SEREX database. Tumour vaccines contain a specific protein derived from a tumour able to stimulate a protective immune response. Tumour vaccines are therapeutic rather than prophylactic in nature, and are typically injected subcutaneously or directly into cancerous tissue. They are a nascent form of personalised medicine, with different vaccines targeting different cancers, with the potential to identify antigens directly from the patient.
Diseases of parasite and protozoan origin cause significant mortality and morbidity: over 3.5 billion people currently suffer parasitic infection, primarily in tropical and subtropical countries, particularly pastoral regions of Asia, Latin America, and Africa; incidence in industrialized countries is relatively low. Parasites typically invade the body via mucosal surfaces. Buccal parasites, for example, after ingestion will either remain in the intestine or escape via the intestinal wall, invading other organs; while some will bore through the skin or enter via insect bites. The life-cycles of many parasites, particularly single-celled parasites, are complex with many stages involving eggs and larval forms yet usually reproduce within the host. This makes developing vaccines extremely problematic, and currently there are no licensed vaccines available targeting parasitic diseases.
Genomics is fashioning a new epoch of knowledge-led vaccine design and discovery. Known as reverse vaccinology, it combines advanced molecular biology technology with advanced in silico analysis of pathogen genomes, enabling the systematic identification of potential antigens within a pathogen. Key to this endeavour is the bioinformatics protocols used to detect antigens, such as those which predict sub-cellular location as the main determinant associated with antigens. As proposed by Kundac et al. [1], another persuasive concept is the idea of dissimilarity of antigens versus non-antigens at the sequence level. In this paper, we extend our previous analysis [2, 3] beyond bacteria, viruses, and fungi, to explore parasite and tumour antigens.

Methodology:
Datasets of known antigens obtained previously from the literature were analysed [2, 3, 4, 5]. Non-antigens were randomly selected from Swiss-Prot so that they mirrored the species distribution within the antigen sets [2, 3]. Parasite and Tumour antigens used here are listed below in Figure  1 , which afforded full management of E-value cut-offs. E-value thresholds were raised from 10 to 6000 to identify best matches even when these lacked statistical significance. We also analysed (log10 E-value )+1 values obtained from BLAST. By using the statistical package Minitab, Release 14.1, we compared antigen and non-antigen sets, as random samples of two larger, independent populations, utilising a Mann-Whitney test.

Discussion:
Previously, we have examined the difference between sets of antigens and non-antigens for Bacterial, fungal, and species, and found no clear separation of the two, concluding that proteome dissimilarity does not provide a means of sifting out potential antigens from a newly sequenced pathogen genome. Here we have expanded our analysis, focusing on parasite and tumour antigens.
Compared to our previous sequence similarity analyses [2,3], parasite non-antigens evinced more noticeable dissimilarity to the human genome than did parasite antigens, suggesting a clearer separation than before. See Figure 2. However, there were seven antigenic proteins that demonstrated high similarity to the human genome compared to one equivalently similar non-antigen. We also evaluated the genomes of four different parasite species -Cryptosporidium parvum, Distyostelium discoideum, Leishmania infantum and Trypanosoma brucei -as a background reference, or a "control" as some would put it, for this comparison of antigens and nonantigens, comparing them to both human and mouse genomes. The distribution of matches between parasite and human genomes indicates that most proteins lie inside the range characterising the antigen proteins. While a signal is clearer here than before [2, 3], the distribution is inverted compared to our expectation: a discernible proportion of the antigens were more similar, not less, than are the non-antigens. Thus for these proteins, antigenicity is encoded in a subtle and cryptic manner, not apparent simply from sequence comparison. It is possible, but likely, that using some form of similarity filter -rather than a dissimilarity filter -may provide a threshold able to indicate the potential antigenicity of parasite proteins.
We observed that the distributions of tumour antigen matches to the human and mouse genomes spread across a wide E-value range scale, much wider than seen previously [2,3]. Figure 3 illustrates the analysis of tumour antigens, non-antigens, and reference genomes relative to the human genome. The distributions characterizing antigens and non-antigens were similar, yet antigens were by visual inspection proportionally more similar to the human genome relative to non-antigens. This observation is again inverted compared to our expectations, as was the case for parasites. This distribution may in part result from the presence of both antigens and nonantigens in the host. Thus identifying a threshold able to separate tumour antigens from non-antigens would again prove difficult.
As well as visual inspection of the distributions of antigens and nonantigens for tumour and parasite, we also undertook a statistical comparison using the Mann-Whitney Test. At a 95% confidence level, the test gave a value of 0.000 for Parasite and 0.001 for Tumour. All previous assessments accepted the null hypothesis. Since these values are less than 0.05, this is indicative of a statistically significant discrepancy between the two antigen-versus-non-antigen distributions from both tumours and parasites. While the apparent significance for parasite and tumour was marked, it was somewhat at odds with the visual inspection of the histograms of similarity values. Although there may be a statistically significant difference in both paired distributions, there is again no useable cut-off capable of distinguishing antigen from non-antigen.

Figure 3:
A sequence similarity comparison between Tumour antigens and the whole human proteome. E-value set at 6000 and BLOSUM 62 matrix, between tumour Antigen and Non-antigen. All self-matching identities were excluded from the results. A background control at the genome level was not conducted.

Conclusion:
Antigens are the basis of subunit vaccines, and their identification by computationally-driven reverse vaccinology is vital, particularly in the era of emergent zoonotic infections and recrudescent disease, since many recently discovered pathogens cannot be cultivated, thus precluding the facile experimental identification of their immunogenic antigens. To this endeavour, immunoinformatics searches continually for robust and celeritous approaches to antigen prediction. One such is based on global dissimilarity searching, as suggested by Kanduc et al. [1].
When compared to our previous analyses [2, 3], both tumour and parasite distributions had a clearer and more discernible signal than before. However, both were inverted relative to our expectations, with antigens being more similar to the human genome than non-antigens. We felt this was counter-intuitive, but there may be evolutionary arguments consistent with this observation. Being so close to mammalian hosts, parasites may need to have evolved more complexity at the functional, and thus sequence and structural, levels, in order to allow their own primitive immune systems, to recognise and protect themselves from self-inflicted damage.
Of course, such may be wholly specious arguments, and the true source of the apparent differences manifest as signal, may come from a statistical quirk or an observed sampling bias.
The present work is not definitive, and there is much further work that could be done, though radically different results seem unlikely. We envisage repeating our study as more antigens become available [7]; looking perhaps at functionally or immunologically congeneric subsets within the overall data; using more sensitive and sophisticated similarity assessment operating at both the global and local levels; and combining this approach with other methodology in a more extensive, rigorous, and comprehensive analysis.