Dataset of potential targets for Mycobacterium tuberculosis H37Rv through comparative genome analysis.

Mycobacterium tuberculosis is the causative agent of tuberculosis, and H37Rv is its most studied clinical strain. We used comparative genome analysis of Mycobacterium tuberculosis H37Rv and human to build a dataset of potential drug targets. Essential genes in the H37Rv strain were identified using DEG (Database of Essential Genes). Of the 3989 genes in Mycobacterium tuberculosis H37Rv, 628 were found to be essential, and 324 of these lack similarity to the human genome. Hypothetical proteins were then removed through manual curation, yielding a dataset of 135 proteins with essential function and no homology to human.


Background:
Mycobacterium tuberculosis (Mtb), the causative agent of tuberculosis (TB), remains a major health threat. Each year, 8 million new TB cases appear and 2 million individuals die of TB [1]. Moreover, it is estimated that one third of the world's population is latently infected with Mtb, of whom ~10% will develop active disease during their lifetime. Active TB develops when the balance between natural immunity and the pathogen changes (e.g. upon waning of the protective immune response during adolescence and in HIV patients [2]). Further, about half a million new multi-drug resistant TB cases are estimated to occur every year [3]. The existing drugs, although of immense value in controlling the disease to the extent achieved today, have several shortcomings. The most important is the emergence of drug resistance, which renders even the front-line drugs inactive. In addition, drugs such as rifampicin cause frequent adverse effects, which leads to patient non-compliance. Another important problem with most of the existing antimycobacterials is their inability to act upon latent forms of the bacillus. Beyond these problems, the interaction between the human immunodeficiency virus and TB poses further challenges for anti-tubercular drug discovery [4].
The cost of research and development in the pharmaceutical industry has risen steeply and steadily over the last decade, yet bringing a new product to market still takes around ten to fifteen years [5]. This problem has been labeled an "innovation gap," and it necessitates investment in inexpensive technologies that shorten the time spent in drug discovery. As drug discovery becomes increasingly rational and less dependent on trial and error, identification of appropriate targets becomes a fundamental prerequisite. As with all other steps in drug discovery, this stage is complicated by the fact that an identified drug target must satisfy a variety of criteria before it can progress to the next stage. Important factors in this context include homology between target and host (to prevent host toxicity, such homology must be low or nonexistent [6]), activity of the target in the diseased state [7], and the essentiality of the target to the pathogen's growth and survival. Finding new targets can both enhance the discovery process and help to address the problem of drug resistance.
Traditionally, targets have been identified from established knowledge of individual proteins whose functions have been well characterized. Drug discovery has since witnessed a paradigm shift from traditional medicinal chemistry-based, ligand-oriented approaches to rational drug target identification and target-driven lead discovery, which target the molecular mechanisms of the disease. Here, we use comparative genomics to identify potential targets in Mtb. Such methods have the advantages of speed and low cost and, even more importantly, provide a systems view of the whole microbe at once, which makes it possible to ask questions that are often difficult to address experimentally.

Methodology:
Searching for the M. tuberculosis H37Rv complete genes
The complete genome sequence of M. tuberculosis H37Rv was downloaded from the National Center for Biotechnology Information (NCBI) FTP server (www.ncbi.nlm.nih.gov/FTP).

Comparative analysis with human
The protein-coding genes of the M. tuberculosis H37Rv genome were searched with BLAST against DEG (http://tubic.tju.edu.cn/deg) to identify the essential genes. The essential genes obtained from the DEG search were then compared with human genes using BLASTX. Genes that lack homology with human were considered potential drug target candidates for further drug development.
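The two-step filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in this study: it assumes both searches were saved in the standard 12-column BLAST tabular format, and the E-value cutoff (`MAX_EVALUE`), the function names and the gene identifiers are hypothetical, since the text does not report a threshold.

```python
from typing import Iterable, List, Set

# Assumed cutoff: the paper does not state the E-value threshold it used.
MAX_EVALUE = 1e-5


def queries_with_hits(blast_tab_lines: Iterable[str],
                      max_evalue: float = MAX_EVALUE) -> Set[str]:
    """Return query IDs that have at least one hit at or below max_evalue.

    Expects standard 12-column BLAST tabular rows, where column 1 is the
    query ID and column 11 is the E-value.
    """
    hits: Set[str] = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def subtract_host_homologs(all_genes: Iterable[str],
                           deg_lines: Iterable[str],
                           human_lines: Iterable[str],
                           max_evalue: float = MAX_EVALUE) -> List[str]:
    """Keep genes with a DEG hit (essential) and no human hit (non-homologous)."""
    essential = queries_with_hits(deg_lines, max_evalue) & set(all_genes)
    human_like = queries_with_hits(human_lines, max_evalue)
    return sorted(essential - human_like)


def _row(query: str, subject: str, evalue: str) -> str:
    """Build one tabular row for the toy example (all other columns are placeholders)."""
    return "\t".join([query, subject, "50.0", "100", "0", "0",
                      "1", "100", "1", "100", evalue, "100"])


# Toy data with made-up identifiers, for illustration only.
genes = ["geneA", "geneB", "geneC", "geneD"]
deg_rows = [_row("geneA", "DEG_x", "1e-40"),
            _row("geneB", "DEG_y", "1e-30"),
            _row("geneC", "DEG_z", "0.5")]      # too weak to count as essential
human_rows = [_row("geneB", "human_p", "1e-20")]  # geneB has a human homolog

candidates = subtract_host_homologs(genes, deg_rows, human_rows)
print(candidates)  # ['geneA']
```

In this sketch the essential set and the human-homolog set are plain Python sets, so the subtraction step is a single set difference; the choice of cutoff decides how many genes survive each stage.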

Functional analysis using UNIPROT
The resulting target genes were further analyzed with the UNIPROT database (www.uniprot.org) to determine their functions.
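The functional grouping can be illustrated with a short Python sketch. It assumes that each target has already been annotated with a protein name and a list of functional keywords; the dictionary layout, the function names and the toy entries are hypothetical and do not reproduce the dataset. Proteins named as hypothetical or uncharacterized are skipped, and a gene with several keywords is counted once in each category.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Name fragments treated as "no known function" (an assumption for this sketch).
UNINFORMATIVE = ("hypothetical", "uncharacterized")


def is_characterized(protein_name: str) -> bool:
    """True if the protein name does not mark it as hypothetical or uncharacterized."""
    lowered = protein_name.lower()
    return not any(term in lowered for term in UNINFORMATIVE)


def tally_functions(annotations: Dict[str, Tuple[str, List[str]]]) -> Counter:
    """Count characterized genes per functional keyword.

    annotations maps gene ID -> (protein name, list of functional keywords).
    """
    counts: Counter = Counter()
    for protein_name, keywords in annotations.values():
        if is_characterized(protein_name):
            counts.update(set(keywords))  # each gene counts once per category
    return counts


# Toy annotations with made-up identifiers, for illustration only.
toy = {
    "geneA": ("Example synthase", ["Amino-acid biosynthesis"]),
    "geneB": ("Example division protein", ["Cell cycle", "Cell division"]),
    "geneC": ("Hypothetical protein", []),
    "geneD": ("Example regulator", ["Transcription", "Amino-acid biosynthesis"]),
}

function_counts = tally_functions(toy)
for category, n in sorted(function_counts.items()):
    print(category, n)
```

Because one gene may carry several keywords, the category totals in such a tally can add up to more than the number of genes.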

Results:
Available data show 3989 protein-coding genes in the M. tuberculosis H37Rv genome. These genes were searched with BLAST against DEG, and 628 genes were found to be essential for M. tuberculosis H37Rv. Comparative analysis with human was then performed to separate genes with human homologs from those without. Genes homologous to human were excluded because they are functionally similar to human genes. Of the 628 essential genes, 324 lack similarity to the human genome in the BLASTX homology search and were identified as potential candidates for target-based drug development. We manually annotated all genes with no human homolog and removed hypothetical and uncharacterized genes to refine the results. The resulting target dataset consists of 135 potential genes, which were further classified by function using UNIPROT. Of the 135 target genes, 25 were involved in amino-acid biosynthesis, 10 in cell cycle, 8 in transcription, 8 in RNA binding and 5 in protein transport (Table 1 in supplementary material). The UNIPROT results also showed that some target genes are involved in multiple functions in different pathways (Table 2 and Table 3).

Rifampicin is a well-characterized inhibitor of DNA-dependent RNA polymerases. Resistance to rifampicin results from mutations in the drug-binding site of the polymerase that do not adversely affect the enzyme's activity [11]. Resistance to other major and 'second-line' anti-Mtb drugs is now also well known. Pyrazinamide (a close relative of nicotinamide) is also a prodrug, requiring activation by the enzyme pyrazinamidase. Mutation in the relevant pncA gene affects pyrazinamide activation, and resistance may also be facilitated by alteration of its transport into the Mtb cell [13]. Streptomycin targets translation by associating with ribosomal proteins and the 16S rRNA of the ribosomal 30S subunit. Resistance in Mtb arises from mutations in the rpsL gene (encoding the S12 protein target for streptomycin binding) and in conserved loop regions of the 16S rRNA, encoded by the rrs gene [14]. Ethambutol is known to inhibit polyamine function and cell-wall synthesis. Mutations in embB, encoding an arabinosyltransferase involved in cell-wall biogenesis, are associated with ethambutol resistance, but other mechanisms of resistance also appear to be operative [15]. Recently, the second-line anti-Mtb drug ethionamide has been shown to require activation by an Mtb flavin mono-oxygenase to convert it into the cytotoxic form [16][17]. Overexpression of InhA was found to confer resistance to both isoniazid and ethionamide, revealing an obvious route for development of antibiotic resistance in the pathogen [18].
No new anti-Mtb drugs have been developed for well over 20 years. In view of the increasing resistance to the current leading anti-Mtb drugs, novel strategies are urgently needed to avert the 'global catastrophe' forecast by the WHO. The timely determination of the genome sequence of Mtb H37Rv by Stewart Cole and co-workers in 1998 provided a much-needed boost for TB research, elucidating the genetic constitution of the pathogen and revealing many novel gene products for mechanistic and structural characterization and as potential new drug targets [19]. Bacterial genomes are generally believed to contain genes both with and without homologues in the human host, so a computational approach to drug target identification can produce a list of candidate targets for Mtb very rapidly. These methods have the advantages of speed and low cost and, even more importantly, provide a systems view of the whole microbe at once. Here, a database search found a total of 3989 genes in the M. tuberculosis H37Rv genome. After identifying the essential genes, excluding those with human homologs, annotating the remainder and removing all hypothetical genes, 135 genes were identified as potential drug targets. These genes and their products can serve as targets for future drug development, and they can also be screened against the drugs currently available for tuberculosis.

Conclusion:
Comparative genome analysis of Mtb H37Rv and human provides a simple framework for integrating the vast amount of genomic data that can be used in drug target identification. Drugs that target genes with high homology to the host can lead to unwanted toxicity; the search for new antituberculosis drugs should therefore be based on subtractive genome analysis. The analysis shows that 628 of the 3989 genes in Mycobacterium tuberculosis H37Rv are essential, of which 324 lack similarity to the human genome. Hypothetical proteins were then removed through manual curation, resulting in a dataset of 135 proteins with essential function and no homology to human.