A database for human Y chromosome protein data.

The human Y chromosome is the sex-determining chromosome. Of the 196 proteins associated with this chromosome, 107 have not yet been characterised. Here, we describe the analysis of these 107 proteins by computing various physicochemical properties from sequence and predicted structural data to elucidate molecular function. We present the derived data in the form of a database made freely available for download, review, refinement and update. Availability: http://puratham.googlepages.com/ or http://puratham.googlepages.com/ftpconnection


Background:
The presence of the Y chromosome determines male characteristics in a mammalian embryo [1]. It is one of the smallest chromosomes in the human genome (~60 Mb) and carries a limited number of genes [2]. The human Y chromosome comprises approximately 59 million base pairs (59,373,566 bp) with 61 known protein-coding genes, 25 novel protein-coding genes, 282 pseudogenes, 15 miRNA genes, 6 rRNA genes, 13  . Both oncogenes and tumour suppressor genes are hypothesised to be present on this chromosome, causing genetic disorders in male-specific organs such as the testis [2]. Human male infertility has been attributed to mutations in genes on the Y chromosome [10]. Genetic or inherited disease and specific abnormalities in the Y chromosome are major factors in male infertility. Infertile men present with many abnormal conditions, including azoospermia, oligozoospermia, teratozoospermia, asthenozoospermia, necrospermia and pyospermia [11]. Despite its central role in sex determination, genetic analysis of the Y chromosome has been limited by the paucity of available genetic markers [12]. Genes in the male-specific region of the Y chromosome (MSY) participate in diverse processes such as skeletal growth, germ cell tumorigenesis, graft rejection, gonadal sex determination, and spermatogenic failure [13]. In this study, we predict the function of the 107 hypothetical proteins of the Y chromosome. We describe the use of prediction methods to characterize their unknown function from sequence and modelled structural data, and we store the derived data in the form of a database.

Database:
We present the derived data in the form of a database made freely available for download, review, refinement and update.

Database features:
The human Y chromosome has been studied for more than 30 years. It is used as a powerful tool to study human population and evolutionary data. Its most characteristic roles are in human sex determination and in male germ cell development and maintenance. Therefore, it is important to document physical and chemical properties, sequence comparison, secondary structures, folding class, tertiary structure and biochemical information. Properties such as molecular weight, theoretical pI, estimated half-life, extinction coefficient, instability index, aliphatic index and grand average of hydropathicity, along with the number of negatively and positively charged residues, were also documented for each entry.

Weight and atoms:
In the dataset, entry A8MQV7 has the highest number of atoms (16175), with a molecular weight of 114705.1, and the highest number of negatively charged residues (139). In contrast, entry Q6KER0 has the lowest total number of atoms (159) and the lowest molecular weight (1132.3).
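
The molecular weights reported above can be reproduced approximately from sequence alone. The following is a minimal sketch, assuming standard average residue masses and a single added water molecule for the free termini; the function name `molecular_weight` and the example peptide are illustrative and are not taken from the database.

```python
# Average residue masses in daltons (standard textbook values; the source
# does not list the masses it used, so this table is an assumption).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.0153  # one water accounts for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Sum the average residue masses and add one water molecule."""
    seq = sequence.upper()
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


if __name__ == "__main__":
    # Hypothetical tripeptide, used only to exercise the function.
    print(round(molecular_weight("MKV"), 1))
```

A sequence containing a non-standard residue letter raises a `KeyError` in this sketch, which makes such entries easy to spot rather than silently mis-weighed.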

Theoretical pI:
Q13381 was found to have the highest theoretical pI of 12.31, while A6NMP8 was found to have the lowest theoretical pI of 3.29.

Half life:
Half-life is the predicted time it takes for half of the amount of a protein in a cell to disappear after synthesis. An acceptable half-life value of 30 was found for all the proteins.

Extinction coefficient:
The extinction coefficient indicates how much light a protein absorbs at a certain wavelength. In the dataset, A6NDE4 was found to have the highest extinction coefficient (79900), while A6NM12 showed the lowest value (0).
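
To illustrate how an extinction coefficient of 0 can arise, the sketch below uses the widely used composition-based estimate at 280 nm, in which only tyrosine, tryptophan and cystine contribute. The molar coefficients (1490, 5500 and 125) are the commonly quoted values for that estimate; the assumption that this is the formula behind the values above, and the function name, are ours.

```python
from typing import Tuple


def extinction_coefficient(sequence: str) -> Tuple[int, int]:
    """Estimate the molar extinction coefficient at 280 nm.

    Returns (all_cysteines_paired, all_cysteines_reduced).
    """
    seq = sequence.upper()
    n_tyr = seq.count("Y")
    n_trp = seq.count("W")
    n_cystine = seq.count("C") // 2  # two cysteines form one cystine

    reduced = n_tyr * 1490 + n_trp * 5500
    paired = reduced + n_cystine * 125
    return paired, reduced


if __name__ == "__main__":
    # Hypothetical sequence without Tyr, Trp or Cys: both estimates are zero.
    print(extinction_coefficient("MAGLKE"))
```

Under this estimate, any sequence lacking tyrosine, tryptophan and cysteine yields 0, which is one way a value such as that of A6NM12 could occur.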

Instability Index:
The instability index provides an estimate of the stability of a protein. The highest value (142.90) and the lowest value (-51.21) were recorded for entries O14606 and A8MWL0, respectively.

Aliphatic Index:
The aliphatic index of a protein is the relative volume occupied by its aliphatic side chains (alanine, valine, isoleucine and leucine). Entry A8MW33 has the highest aliphatic index (136.53) and entry A6NMP8 the lowest (7.22).
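
The definition above can be made concrete with a short sketch. It assumes the usual formulation, in which the mole percentages of alanine, valine, isoleucine and leucine are combined using relative side-chain volume weights of 2.9 (valine) and 3.9 (isoleucine and leucine); the function name and the example sequence are illustrative only.

```python
def aliphatic_index(sequence: str, a: float = 2.9, b: float = 3.9) -> float:
    """Aliphatic index = X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu)),
    where X is the mole percent of each residue in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / len(seq)

    return (
        mole_percent("A")
        + a * mole_percent("V")
        + b * (mole_percent("I") + mole_percent("L"))
    )


if __name__ == "__main__":
    # Hypothetical peptide, used only to exercise the function.
    print(round(aliphatic_index("MALVIG"), 2))
```

Because the weights exceed 1, the index can be greater than 100 for sequences rich in valine, isoleucine or leucine, consistent with the maximum of 136.53 reported above.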

GRAVY:
The GRAVY (grand average of hydropathicity) value for a peptide or protein is calculated as the sum of the hydropathy values of all the amino acids divided by the number of residues in the sequence. The maximum (0.860) and minimum (-1.404) values were recorded for entries A2RUG3 and A6NDE4, respectively.
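
The calculation described above is a simple average, sketched here under the assumption that the hydropathy values are those of the Kyte-Doolittle scale (the source does not name the scale explicitly). The function name `gravy` and the example peptide are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Sum of residue hydropathy values divided by sequence length."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(HYDROPATHY[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical peptide, used only to exercise the function.
    print(round(gravy("MALVIG"), 3))
```

Positive values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one, so the reported range of -1.404 to 0.860 spans both kinds of entry.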

Charged residues:
The highest number of negatively charged amino acids (139) was found in A8MQV7, and the highest number of positively charged residues (94) was found in Q24JR0. A NIL value was recorded for entry A6NMP8.
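
Counts of this kind can be obtained directly from the sequence. The sketch below assumes the common convention that aspartate and glutamate count as negatively charged and arginine and lysine as positively charged (histidine is not counted); the function name is illustrative.

```python
from typing import Tuple


def charged_residue_counts(sequence: str) -> Tuple[int, int]:
    """Return (negatively charged, positively charged) residue counts.

    Negative: Asp (D) + Glu (E). Positive: Arg (R) + Lys (K).
    """
    seq = sequence.upper()
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


if __name__ == "__main__":
    # Hypothetical peptide, used only to exercise the function.
    print(charged_residue_counts("MDEKRH"))
```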

Predominant residues:
The frequencies of individual residues were calculated as percentages for each entry. Entry Q13381 had the highest percentage of the amino acid serine (23.8%), followed by arginine with 22.2% in Q6KER0.
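
The per-residue percentages described above amount to a frequency count normalised by sequence length. A minimal sketch follows; the function names and the example sequence are illustrative and do not come from the database.

```python
from collections import Counter
from typing import Dict, Tuple


def residue_percentages(sequence: str) -> Dict[str, float]:
    """Percentage of each residue type present in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def predominant_residue(sequence: str) -> Tuple[str, float]:
    """Residue with the highest percentage, with that percentage."""
    percentages = residue_percentages(sequence)
    aa = max(percentages, key=percentages.get)
    return aa, percentages[aa]


if __name__ == "__main__":
    # Hypothetical peptide, used only to exercise the functions.
    print(predominant_residue("MSSRSA"))
```

For very short sequences a single residue changes the percentage substantially, so percentages for short entries should be read alongside sequence length.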

Atomic composition:
The atomic composition of each protein was calculated using ProtParam. Entry A8MQV7 has the highest number of major atoms, while the lowest number was recorded for Q6KER0.

Repeats:
Analysis shows that 39 protein sequences in the dataset contain repeats; no repeats were detected in the remaining entries. Among those with repeats, entry Q24JR0 showed the highest number of repeats (20), at positions 9-718, with different scores.

Secondary structures:
Secondary structure prediction was done using GOR, HNN and SOPMA. Entry Q8N4A2 has the highest percentage of alpha helix in the dataset. Q6KER0, A6NNB5 and A6NII1 have the highest percentage of extended strand, while Q9BZ97 has the lowest.

Random coils:
A6NII1 has the highest percentage of random coil while A6NNB5 has the lowest percentage of random coil.

Conclusion:
More than one hundred human Y chromosome proteins (107) have not yet been characterized. We used I-TASSER to model 100 structures and SWISS-MODEL for the remaining 7 proteins. RADAR detected internal repeats in 39 protein sequences. Function was predicted for 10 proteins with 100% confidence, 24 proteins with 75% confidence, 12 proteins with 50% confidence and 14 proteins with 25% confidence. Analysis shows that most proteins are similar to ribosomal protein S4E, nucleosome assembly protein (NAP) and RNA-binding proteins. We present these data in the form of a web database made freely available over the internet.