CARd: Carbon distribution analysis program for protein sequences

Carbon distribution is responsible for the stability and structure of proteins. The arrangement of carbon along a protein sequence depends on how the amino acids are organized and is guided by mRNAs. An atomic-level examination is important for understanding these codes. This will ultimately help in identifying disorders and suggesting mutations. For this purpose, a carbon distribution analysis program has been developed. The program captures the hydrophobic, hydrophilic and disordered regions in a protein. The calculations are accurate and sensitive down to single amino acid resolution. The program is intended to help in mutational studies leading to protein stabilisation.


Method
The flow diagram (Figure 1) outlines how the distribution of inner lengths, grouped by carbon fraction, is counted within an outer length. The protein sequence (pink) is converted into an atomic sequence (shown in red, blue and pink). The red portion is an inner length. The blue portion is an outer length, which includes the red portion. The entire atomic sequence is shown in pink and includes both the blue and red portions. Here the outer and inner lengths are taken as 100 and 35 atoms, which gives 65 (100-35) inner lengths for statistics. These 65 inner lengths are grouped by carbon fraction. The group labeled C 11 has a carbon fraction of 11/35 = 0.31.
The flow chart (Figure 2) shows how the algorithm performs the calculation. A program has been written to do all these calculations. The program reads the protein sequence, converts it into an atomic sequence, takes a length of sequence (anything from 77 to 700 atoms), splits it into small lengths of equal size (from 32 to 350 atoms), finds the fraction of carbon atoms in each small length, and counts the number of small lengths that contain a defined fraction of carbon. There are small lengths with carbon fractions of 0.25 or 0.45, but the maximum is at around 0.31. The distribution of small lengths by carbon fraction resembles a normal distribution curve. This distribution curve is obtained for every possible outer length.
The outer length can be anything between 77 and 700 atoms, which corresponds to between 5 and 45 amino acids. An outer length between 100 and 150 atoms is sufficient for most observations; here it is 150 in all calculations. The inner length can be between 32 and 350 atoms, provided it does not exceed half of the outer length. An inner length of 35 atoms is used in all calculations here, as it is the smallest unit in which 11 carbons give a fraction of 0.31. The outer length is moved along the sequence with a step of a selected number of atoms, normally half of the outer length. Here the step is 75 atoms, as the outer length is 150. A carbon distribution profile is obtained for every possible outer length.
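The counting procedure described above can be illustrated with a minimal Python sketch. It is not the CARd source code. The function names, the use of free amino acid compositions (rather than residue compositions within a peptide), and the order in which the atoms of each amino acid are written out (C, H, N, O, S) are all assumptions made for illustration, since the text does not specify them. The inner length is slid one atom at a time, which is one reading of the statement that an outer length of 100 atoms yields 65 (100-35) inner lengths.

```python
from collections import Counter

# Free amino acid compositions as (C, H, N, O, S) atom counts.
# ASSUMPTION: the text does not say which composition or atom order CARd uses.
COMPOSITION = {
    "A": (3, 7, 1, 2, 0), "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0), "C": (3, 7, 1, 2, 1), "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0), "G": (2, 5, 1, 2, 0), "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0), "T": (4, 9, 1, 3, 0), "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}


def to_atomic_sequence(protein):
    """Expand one-letter residue codes into a string of atom symbols."""
    atoms = []
    for residue in protein:
        for symbol, count in zip("CHNOS", COMPOSITION[residue]):
            atoms.append(symbol * count)  # hypothetical atom ordering
    return "".join(atoms)


def inner_carbon_counts(outer, inner_len=35):
    """Carbon count of each inner length, slid one atom at a time."""
    return [outer[i:i + inner_len].count("C")
            for i in range(len(outer) - inner_len)]


def carbon_distribution(atoms, outer_len=150, inner_len=35, step=75):
    """Return (start atom, {carbon count: frequency}) for each outer length."""
    profiles = []
    for start in range(0, len(atoms) - outer_len + 1, step):
        outer = atoms[start:start + outer_len]
        counts = Counter(inner_carbon_counts(outer, inner_len))
        profiles.append((start, dict(sorted(counts.items()))))
    return profiles
```

Dividing each carbon count by the inner length gives the carbon fraction of that group; for example, 11 carbons in 35 atoms gives 11/35, about 0.31. The defaults follow the values used in the text (outer length 150, inner length 35, step 75).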
Generally, a normal profile has a Gaussian distribution curve with maximum frequency at a carbon fraction of 0.3145. A shift from this maximum indicates a hydrophilic (negative side) or hydrophobic (positive side) stretch. A departure from the normal distribution marks a disordered outer length, one that contains an improper amino acid distribution. When the outer length contains a proper arrangement of amino acids, the distribution is normal. Improper arrangements can therefore be identified from these calculations, which means that a stretch of sequence that is hydrophobic, hydrophilic or unstable can be predicted. The statistical mean, median, mode and standard deviation of the distribution profile can be obtained to check whether the distribution is normal. When the mean, median and mode are equal for a given stretch, the distribution is considered normal and the stretch of sequence is properly arranged.
CARd computes these statistics for every outer-length distribution profile and prints them at the end of the profile results. The program is flexible and has an option to output the mean, median, mode and standard deviation at every amino acid site. It also gives the simple average carbon fraction at an amino acid site and for a given output length. This replaces our earlier CARBANA program [2] for carbon analysis. The mean value can be plotted against amino acid number to show the overall hydrophobic or hydrophilic regions in the sequence, similar to a hydropathy plot.
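The normality check and the interpretation of shifts can be sketched as follows. This is an illustrative reading of the criteria stated above, not the CARd implementation. The function names, the tolerance `tol` used to decide when mean, median and mode count as "equal", and the use of the population standard deviation are assumptions. The input is the list of carbon fractions of the inner lengths within one outer length.

```python
import statistics

REFERENCE_FRACTION = 0.3145  # maximum of a normal profile, as stated in the text


def profile_stats(fractions):
    """Mean, median, mode and standard deviation of one distribution profile."""
    return {
        "mean": statistics.mean(fractions),
        "median": statistics.median(fractions),
        # On Python 3.8+, mode() returns the first of several equally common values.
        "mode": statistics.mode(fractions),
        "stdev": statistics.pstdev(fractions),  # population SD (assumption)
    }


def classify(fractions, tol=0.01):
    """Label an outer length; `tol` is a hypothetical tolerance, not from the text."""
    stats = profile_stats(fractions)
    centres = (stats["mean"], stats["median"], stats["mode"])
    # Mean, median and mode disagree: the distribution is not normal.
    if max(centres) - min(centres) > tol:
        return "disordered"
    # Normal distribution shifted to the negative side of the reference.
    if stats["mean"] < REFERENCE_FRACTION - tol:
        return "hydrophilic"
    # Normal distribution shifted to the positive side of the reference.
    if stats["mean"] > REFERENCE_FRACTION + tol:
        return "hydrophobic"
    return "normal"
```

For instance, a profile whose inner lengths contain 10, 11, 11, 11 and 12 carbons out of 35 atoms has mean, median and mode all equal to 11/35 and would be labeled normal under these assumptions.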

Discussion: Carbon distribution profile
CARd analysis was carried out on the superoxide dismutase (SOD) sequence. An outer length of 150 atoms and an inner length of 35 atoms were used. The carbon distribution plot (Figure 3) is obtained for every outer length, with a gap of 75 atoms between successive outer lengths. The plots show frequency versus carbon fraction. Each plot is labeled with the range of amino acids covered by that outer length. The first plot shows the distribution profile for the first 10 amino acids, and the next one is for the 11 amino acids from 6 to 16. The third outer length runs from amino acid 10 to 21 and is 11 amino acids long. In this way the distribution plot is obtained up to the last possible outer length. Each plot shows a different distribution profile. Outer lengths 72-82, 77-88, 82-93, 88-99, 114-123, 119-129 and 141-152  The outer lengths 50-63 and 56-68 contain a disulphide bond that stabilises the structure, but their carbon distribution is not in order. The disordered outer lengths need to be taken up for mutational study. Similar plots for abnormal proteins reveal that most of their stretches are disordered regions. This carbon distribution analysis program can therefore find disordered proteins, small stretches that are disordered, and the amino acids responsible for the disorder. The method is sensitive at the single amino acid level. This can be exploited for mutational studies leading to stabilisation of the proteome.

Figure 3:
CARd analysis of the SOD protein with an outer length of 150 atoms (10 amino acids) and an inner length of 35 atoms. The gap between successive outer lengths is 75 atoms. The individual plots show frequency versus carbon fraction for the short stretch labeled. For example, the first plot is for the first 10 amino acids, the second is for residues 6-16, the third is for residues 10-21, and so on. The distribution plot is shown up to the last possible outer length. Each plot shows a different distribution curve. If the distribution is normal with maximum frequency at 0.3145, the stretch is normal and stable (e.g. 77-88 and 88-99). When the curve is normal but shifted to the left or right, the stretch is a hydrophilic (e.g. 125-146) or hydrophobic (e.g. 10-21) region. If the distribution is not normal, the stretch is a disordered region (e.g. 104-114).

Mean, median, mode and standard deviation of the distribution profile
The statistical mean, median, mode and standard deviation of the distribution curve were obtained for SOD (results not shown). The program outputs the average carbon fraction together with the mean, median, mode and standard deviation of the distribution profile at every amino acid position, as shown in Table 1 (see supplementary material). This is achieved by selecting a step size of a suitable number of atoms. A plot of amino acid number versus mean value can be used to identify the hydrophobic or hydrophilic regions along the sequence. This is similar to the CARBANA output of average carbon fraction along the sequence.
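A per-residue profile of this kind can be sketched as below. This is an illustration under stated assumptions rather than the CARd code. The function name is hypothetical. Starting one outer length at the first atom of each amino acid is one possible reading of "selecting step size of suitable atoms". Only the simple average carbon fraction is computed here. The input is a list holding the atom string of each residue, in sequence order.

```python
def mean_carbon_profile(residue_atoms, outer_len=150):
    """Return (amino acid number, carbon fraction) for each full outer length.

    ASSUMPTION: one outer length is started at the first atom of every residue.
    """
    # Record the atom offset at which each residue begins.
    starts = []
    offset = 0
    for atoms in residue_atoms:
        starts.append(offset)
        offset += len(atoms)

    sequence = "".join(residue_atoms)
    profile = []
    for number, start in enumerate(starts, start=1):
        outer = sequence[start:start + outer_len]
        if len(outer) < outer_len:
            break  # no complete outer length remains
        profile.append((number, outer.count("C") / outer_len))
    return profile
```

The resulting pairs can be passed to any plotting tool to draw carbon fraction against amino acid number, in the manner of a hydropathy plot.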

Conclusion:
The CARbon distribution (CARd) analysis program has been developed to capture the hydrophobic and hydrophilic regions in proteins. The calculations are precise and sensitive to a single amino acid change. The program is expected to help in mutational studies leading to protein stabilisation. It can capture the essence of hydrophobicity in a small stretch of sequence.