Analysis of human collagen sequences.

The extracellular matrix is fast emerging as important component mediating cell-cell interactions, along with its established role as a scaffold for cell support. Collagen, being the principal component of extracellular matrix, has been implicated in a number of pathological conditions. However, collagens are complex protein structures belonging to a large family consisting of 28 members in humans; hence, there exists a lack of in depth information about their structural features. Annotating and appreciating the functions of these proteins is possible with the help of the numerous biocomputational tools that are currently available. This study reports a comparative analysis and characterization of the alpha-1 chain of human collagen sequences. Physico-chemical, secondary structural, functional and phylogenetic classification was carried out, based on which, collagens 12, 14 and 20, which belong to the FACIT collagen family, have been identified as potential players in diseased conditions, owing to certain atypical properties such as very high aliphatic index, low percentage of glycine and proline residues and their proximity in evolutionary history. These collagen molecules might be important candidates to be investigated further for their role in skeletal disorders.

both at the genomics and proteomics level. It integrates biology with computational algorithms for better understanding of complex molecules. Such analysis can be enhanced through a combinatorial dry to wet lab approach wherein the propositions made through the biocomputational findings can help in providing a direction for further research during wet lab studies. Thus, it facilitates to appreciate and understand the structural and functional roles of protein molecules, which are the heart of human disorders. A similar approach has been adopted to characterize the family of matrix metalloproteinases (MMP), where in silico characterization was followed by experimental confirmation and MMP-7 was subsequently proved to be a potential target in cardiac hypertrophy [8,9]. Collagen and its derivatives, such as gelatin, act as substrates of MMPs and are involved in development of many pathological conditions. An extensive analysis of collagen family is hence essential to comprehend the process of matrix remodeling in diseases.
Our study reports a comparative characterization of alpha-1 sequences of human collagen using biocomputational tools. Of the three alpha chains present in collagen, alpha-1 was observed in all 28 human collagens; thus alpha-1 was used for further analysis of the collagen proteins. The physico-chemical, secondary structural, functional and phylogenetic analysis of alpha-1 sequences of human collagen was executed. This research aims to provide an insight to various protein attributes of collagen proteins and characterize this large family. It also intends to propose potential members implicated in disease conditions based on relative examination of the collagen family or the presence of any atypical characteristics in the collagen molecules. This would aid biologists in carrying out further investigations on these complex molecules, for which, the basic structure analysis is of prime importance.

Retrieval of human collagen alpha-1 protein sequences
The complete alpha-1 protein sequences of all 28 members of human collagen family reported till date were derived from UniProtKB/ SWISS-PROT, a curated protein database (http://expasy.org/sprot/) in FASTA format with the help of the accession number provided for each collagen sequence (http://www.uniprot.org/) [10]. Complete information about the origin, attributes, annotation, ontologies, binary interactions and sequence of proteins was found in this knowledgebase.

Physico-chemical characterization of collagen family
Various features including number of amino acids, molecular weight, theoretical isoelectric point (pI), amino acid composition (%), number of positively (Arg + Lys) and negatively charged (Asp + Glu) residues, extinction co-efficient, instability index, aliphatic index and Grand Average of Hydropathicity (GRAVY) were computed using ExPASy's ProtParam tool using the protein sequence in FASTA format as the input data type (http://expasy.org/tools/protparam.html). Other physico-chemical features including number of codons, bulkiness, polarity, refractivity, recognition factors, hydrophobicity, transmembrane tendency, percent buried residues, percent accessible residues, average area buried, average flexibility and relative mutability were calculated for primary structure characterization by ExPASy's ProtScale tool using the retrieved protein sequence as the input data type (http://expasy.org/tools/protscale.html). A suitable amino acid scale was chosen for computation of each parameter in this sliding windows based tool that gives each amino acid a numerical value known as the amino acid scale.

Secondary structural characterization of collagen family
The secondary structural features of proteins comprising of alpha helix, 310 helix, Pi helix, beta bridge, extended strand, beta turn, bend region, random coil, ambiguous states and other states were predicted using Self-Optimized Prediction Method with Alignment (SOPMA) tool that takes into account the information derived from alignment of protein sequences belonging to the same family (http://npsa-pbil.ibcp.fr/cgibin/npsa_automat.pl?page=npsa_sopma.html) [11]. The protein sequence in FASTA format was used as the input data type and the number of conformational states was adjusted to four in order to predict Helix, Sheet, Coil and Turn. The other parameters were set as default.

Functional characterization of collagen family
The Motif scan tool was used to scan and identify all the known motifs, their nature and location in the selected alpha-1 protein sequences of the collagen family based on a profile and pattern search (http://myhits.isb-sib.ch/cgi-bin/motif_scan) [12,13]. The protein sequence in FASTA format was used as the input data type and scanned against 'PROSITE Patterns', a selected protein profile database out of the eight available.

Phylogenetic classification of collagen family
The human alpha-1 collagen protein sequences were aligned using multiple sequence alignment tool ClustalW2 using the protein sequences in FASTA format as the input data type (http://www.ebi.ac.uk/Tools/msa/clustalw2/) [14]. The best alignment for a set of input sequences was computed and all the identities, similarities and differences were highlighted. The evolutionary relationships were established by constructing phylograms through retrieval of the alignments using Neighbor Joining (NJ) method.

Discussion:
Functional role of alpha-1 chain of each human collagen was analyzed ( Table 1, see Supplementary material). MultAlin tool was used to carry out protein sequence alignment wherein a decreased sequence similarity was witnessed with increasing number of input sequences [15]. It was inferred that the alpha-1 chain showed identities, similarities and differences at various positions along the protein sequence in all 28 members. The computation of amino acid composition of each human alpha-1 collagen sequence using ExPASY's ProtParam tool indicated very high percentages of glycine and proline as compared to other amino acids ( Table 2,

see Supplementary material).
Glycine content in all collagens was more than 12%, except Collagen 12, 14 and 20 with a value of 9.2, 10.9 and 11.5% respectively. High percentage of glycine accounts for the stability of collagen triple helical structure, since incorporation of large amino acids can cause steric hindrance [16]. Furthermore, proline content was more than 10% in most collagens except Collagen 6, 12, 14 and 20 with values 8.7, 8.6, 8.7 and 10%, respectively. Proline residues are equally essential to point outward and stabilize the helix and also to act as structural disruptor of the secondary structural elements [17]. Thus, this not only helps collagen to act as a structural molecule but also aids in processes like cell-cell adhesion, migration. Other essential physico-chemical parameters were also calculated using ExPASy's ProtParam and ProtScale tools (Table 3a and 3b see Supplementary material). The pI values for 15 collagens were found to lie in the acidic range, while for the remaining half, alkaline range was observed. Also, analysis of instability index classified most collagens as stable (instability index <40) while Collagen 15, 17, 18, 20 and 26 were declared as unstable collagens. Additionally, Collagen 20 was regarded most thermostable, with highest aliphatic index (79.61), describing the relative volume of protein occupied by aliphatic side chains, closely followed by Collagen 14 (77.67) and 12 (75.45). Furthermore, the GRAVY values, signifying interaction of collagens with water, were observed within a broad range of -0.261 to -0.919, while the hydrophobicity was found to range between -0.3335 of Collagen20 (most hydrophilic) to 0.3335 of Collagen18 (most hydrophobic). SOPMA analysis of secondary structural features of human collagen protein sequences showed a pre-dominance of random coils while least percentage of β-turns were found. α-helices were found to exceed extended strands in 13 collagens (Collagen 6,9,13,15,17,18,19,21,22, 23, 25, 26 and 28) ( Table  4 see Supplementary material). A high percentage of random coils facilitating self-assembly of the monomer units into welldefined structures can also be linked to high glycine and proline content that are vital to provide the desired flexibility and ability to bond with adjacent units in collagen monomers. This information on packaging of secondary structural elements may assist to derive potential tertiary protein structures and also promote advancements in protein engineering. Motifs of protein can provide significant knowledge about the protein's mechanism of action and nature. Signature motifs were obtained within protein sequences using Motif Scan tool (Table  5, see supplementary material). Collagens 1, 2 and 3 were found to possess the VWFC domain signature, known to participate in oligomerization, hence forming an imperative part of the complex forming proteins [18,19]. The presence of this domain can be correlated to the known characteristic of collagen molecules passing through a complex assembly process to form a triple helical structure. Additionally, Collagens 7 and 28 were observed to have pancreatic trypsin inhibitor (Kunitz) family signature. The 28 human alpha-1 collagen protein sequences were aligned based on sequence homology and phylogram was constructed with distance-based NJ method to establish evolutionary relations in the complex collagen family (Figure 1). Various clusters with close relationships were identified including Collagen 1 and 2, 12 and 14, 23 and 25 relating closely to Collagen 3, 20 and 17 respectively. These molecules may be analyzed together for possible similar properties owing to their close evolutionary history. Also, Collagen 4, reported majorly in diseases shows similarity to Collagen 13, which may also potentially play a role in pathological conditions. Collagen 9, 12, 14 and 20 are included in the FACIT (Fibril Associated Collagens with Interrupted Triple helices) group of collagen family, where alpha-1 chain of Collagen 9 is recognized for its role in skeletal and rheumatoid disorders, especially in multiple epiphyseal dysplasias [20]. Owing to their structural similarities, atypical features and proximity in evolutionary history, the other members of this family also present a prime candidature for investigation of their role in skeletal disorders.

Conclusion:
With the availability of a wide variety of computational tools, an in-depth study of the information hidden behind a protein structure is possible. Comparative studies of the members belonging to a protein family and their physico-chemical, secondary structural, functional and phylogenetic classification can help give extensive information of protein's structure, function and its relationship with other members of the family. Characterization of the alpha-1 chain of the vast collagen protein family in humans yielded new insights. Based on this comparative characterization, we hypothesize, Collagen 12, 14 and 20 as a potential protein cluster showing similarity in many properties along with an atypical behavior. These three proteins possess, low glycine and proline, very high aliphatic index and a close evolutionary relation. Since these collagens form a part of the FACIT collagen family, of which collagen 9 is established for its role in skeletal disorders, these collagen molecules might be possible disease candidates. These findings can help biologists working with ECM proteins concentrate their research on collagen proteins proposed as putative players in diseased conditions. Moreover, this study is a model for researchers to fine-tune their specific systems and comprehend their outcomes better.