Prediction of normalized signal strength on DNA sequencing microarrays by n-grams within a neural network model

We have shown previously that a feed-forward, back-propagation neural network model based on composite n-grams can predict normalized signal strengths of a microarray-based DNA sequencing experiment. The microarray comprises a 4×N set of 25-base single-stranded DNA molecules ('oligos'), specific for each of the four possible bases (A, C, G or T, for Adenine, Cytosine, Guanine and Thymine respectively) at each of N positions in the experimental DNA. The strength of binding between reference oligos and experimental DNA varies according to base complementarity, and the strongest signal in any quartet should 'call the base' at that position. Variation in base composition of, and/or base order within, oligos can affect the accuracy and/or confidence of base calls. To evaluate the effect of order, we present oligos as n-gram neural input vectors of degree 3 and measure their performance. Microarray signal intensity data were divided into training, validation and testing sets. Regression values obtained were >99.80% overall, with very low mean square errors that translate to high best validation performance values. Pattern recognition results showed high percentage confusion matrix values along the diagonal, and receiver operating characteristic curves were clustered in the upper left corner, both indices of good predictive performance. Higher-order n-grams are expected to produce even better predictions.


Background:
DNA sequences are strings of hundreds to millions of four nitrogenous bases (Adenine, Cytosine, Guanine and Thymine), represented by the letters A, C, G and T respectively. Representation of these strings as numerical values enables the application of powerful digital signal processing techniques. Desirable properties of DNA numerical representations, together with some examples, are given in [3,15]. The n-gram method was first introduced by C. E. Shannon in 1948 [9]. Neural network learning methods provide a robust approach to the approximation of real-valued, discrete-valued and vector-valued target functions [12], such as numerical DNA data. The study of artificial neural networks has been inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons [10,11,12], which communicate through a large set of interconnections assigned variable strengths (weights) in which the learned information is stored [13]. Each neuron computes a weighted sum y of its input signals. The activation function for the neurons is the sigmoid function, defined in [12] as σ(y), where y is the weighted sum of the inputs. The output of the sigmoid function ranges from 0 to 1, increasing monotonically with its input, and the weights of the interconnections between the different neurons are adjusted during the training process to achieve a desired input/output mapping.
The effect of base order is examined by replacement of mono-, di- and trinucleotide strings with their respective n-gram equivalents. The n-gram ratios for mono-, di- and trinucleotides are shown in Table 1, Table 2 and Table 3 respectively. The results with 1-gram and 2-gram and their composition have been discussed in [15]. We advance the results obtained previously by examining the influence of 3-grams on the overall performance of our predictions, based on the data evaluation functions. We also examine the effect of different numbers of neurons in the hidden layer on optimal prediction performance.
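To make the neuron computation concrete, a minimal Python sketch of a single sigmoid neuron follows; the weights, bias and inputs shown are hypothetical illustrations, not values from the trained network:

```python
import math

def sigmoid(y):
    # Logistic activation: maps the weighted input sum into (0, 1),
    # increasing monotonically with y.
    return 1.0 / (1.0 + math.exp(-y))

def neuron_output(inputs, weights, bias):
    # One neuron: weighted sum y of its input signals, then sigmoid.
    y = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(y)

# Hypothetical example: two inputs with illustrative weights and bias.
out = neuron_output([0.25, 0.50], [0.8, -0.3], 0.1)
```

During training, the weights and bias of each such neuron are the quantities adjusted to achieve the desired input/output mapping.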
The output node layer has 4 nodes, reflecting our choice of sequence signals to predict. The schematics of the DNA neural network architecture are shown in Figure 1. The DNA sequence data are first converted by a sequence-encoding schema into neural network input vectors (ratios of n-grams). After network training, the neural network predicts the normalized intensities according to the sequence information embedded in the neural interconnections.

Data evaluation functions:
In [15], we explained the concept of performance and regression values and examined the results obtained using 1-gram, 2-gram and their composition. We now check for consistency of those results with the inclusion of 3-grams, using two additional Matlab neural network data evaluation functions. Performance and regression values are also considered with this inclusion.

Confusion Matrix:
This is a 2-dimensional matrix with a row and a column for each class, produced for the training, validation, testing and complete datasets. Each matrix element shows the number of examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal. The diagonal (green) cells in each table show the number of cases that were correctly classified, and the off-diagonal (red) cells show the misclassified cases. The blue cell in the bottom right shows the total percentage of correctly classified cases (in green text) and the total percentage of misclassified cases (in red text). Figure 2 shows a confusion matrix with 4 tables, each displaying the network response for the training, validation, testing and complete datasets.
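The layout described above can be sketched in plain Python; `confusion_matrix` and `accuracy` are our own illustrative names, and the four classes stand for the base calls A, C, G and T:

```python
def confusion_matrix(actual, predicted, classes=("A", "C", "G", "T")):
    # Rows = actual class, columns = predicted class; diagonal counts
    # are correct base calls, off-diagonal counts are misclassifications.
    index = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        cm[index[a]][index[p]] += 1
    return cm

def accuracy(cm):
    # Overall percent correct: the diagonal total over all cases,
    # i.e. the figure reported in the bottom-right (blue) cell.
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 100.0 * correct / total
```

For example, with actual calls `"AACG"` and predicted calls `"AACT"`, three of the four cases lie on the diagonal and the overall accuracy is 75%.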

Methodology:
We adopt the same methods as in [15]. The dataset is from the Cambridge Reference Sequence with accession number NC_012920 and comprises 15,453 rows; 3 of its columns are the n-grams for n = 1, 2, 3, and the other 4 columns represent the normalized intensities for Adenine, Cytosine, Guanine and Thymine. We extract every 26th line of the dataset, which reduces it to 594 rows (lines). 3-grams are used independently to predict the normalized intensities for the four nucleotides ACGT, and the results obtained are compared with those obtained in [15]. We also use the 1-3-gram, 2-3-gram and 1-2-3-gram compositions to repeat the analysis and compare with earlier results.
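As a sanity check on the row counts quoted above, a small sketch (assuming 'every 26th line' means lines 26, 52, ... in 1-based numbering; the helper name is ours):

```python
def every_kth_row(rows, k):
    # Keep rows k, 2k, 3k, ... (1-based), i.e. every kth line.
    return rows[k - 1::k]

# Thinning a 15,453-row dataset by 26 leaves 594 rows, as in the text.
subset = every_kth_row(list(range(15453)), 26)
```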
The algorithmic steps for our data manipulation are as follows: [1] Compute n-gram profiles of the DNA dataset using the Python programming language.
[2] Calculate the nucleotide, dinucleotide and trinucleotide frequencies of these profiles.
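A minimal Python sketch of the two steps above might look as follows; the function name and the use of overlapping windows are our assumptions rather than the authors' exact implementation:

```python
from collections import Counter

def ngram_ratios(seq, n):
    # Ratios of overlapping n-grams in a DNA string: n=1 gives
    # nucleotide, n=2 dinucleotide and n=3 trinucleotide frequencies.
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()}
```

Concatenating the ratio dictionaries for n = 1, 2 and 3 would give composite input vectors analogous to the 1-2-3-gram composition used here.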

Results:
The neural network regression value R determines how robust the prediction is: a higher R value and a smaller mean square error (MSE) imply good prediction. We compare the performances of the networks with 1-3-gram, 2-3-gram and 1-2-3-gram with different numbers of neurons in the hidden layer, using the Matlab regression toolkit, and these results are compared with those obtained in [15]. Again, the number of neurons in the hidden layer was varied between 20 and 40 with a step size of 5, as a matter of choice and in the hope of finding the optimal network architecture. Table 7 summarizes the regression and performance values extracted from 1-2-gram and 1-3-gram with a variable number of neurons in the hidden layer. Table 8 gives the corresponding summary for 1-2-gram and 2-3-gram, and Table 9 for 1-2-gram and 1-2-3-gram. Tables 1, 2 and 3 show the percentages (ratios) of nucleotides, dinucleotides and trinucleotides, respectively, in the Affymetrix [1] dataset. Using the pattern recognition toolkit to investigate the behavior of our predictions in terms of confusion matrices (CM) and receiver operating characteristic (ROC) curves, the results with 1-3-gram are shown in Table 10. The results with 2-3-gram and 1-2-3-gram (not shown) are not as good as those obtained using 1-3-gram. This is not necessarily a trivial result, as the predictive function must accommodate all targets in the 4 × 594 sets. Using the regression toolkit, we observed that the values in Tables 4, 5 and 6 were generally better than the results obtained in [15], where the 1-2-gram composition of the n-grams was used. This is in part due to the increment in the n-grams from 2 (two) to 3 (three).
A look at Table 7, Table 8 and Table 9 shows a reduction in the validation error and an increase in the regression value when we compare the respective n-gram compositions. The average best validation performance (Bvp) obtained in [15] for 1-2-gram was 0.002793, which translated to 99.72% accuracy, with an average regression value of 0.98803. With the 1-3-gram composition, the Bvp decreased to 0.002499 and the regression value increased to 0.99112. Again, a comparison of 1-2-gram and 2-3-gram showed a decrease in best validation performance to 0.002253 and an increase in the regression value to 0.99180 for the 2-3-gram. In the case of 1-2-3-gram, the best validation performance value decreased further to 0.002056, a 26.4% reduction compared with the value obtained with 1-2-gram; the regression value also increased to 0.99191 from the 0.98803 obtained with 1-2-gram. This is again due to the increment in the n-gram number. The use of the pattern recognition toolkit to investigate the confusion matrices and receiver operating characteristic (ROC) curves showed an overall confusion matrix value of 99.8% using 40 neurons in the hidden layer, as shown in Table 10, with the points in the ROC curve lying in the upper left corner. These are good signs of near-expected results.
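For reference, the two evaluation quantities compared above can be computed directly; this is a generic sketch of MSE and the Pearson correlation R (the quantity Matlab reports in its regression plots), not the toolkit's internal code:

```python
import math

def mse(targets, predictions):
    # Mean squared error between target and predicted intensities;
    # smaller values mean better validation performance.
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def regression_r(targets, predictions):
    # Pearson correlation R between targets and predictions; values
    # near 1 indicate a robust fit.
    n = len(targets)
    mt = sum(targets) / n
    mp = sum(predictions) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(targets, predictions))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predictions))
    return cov / (st * sp)
```

Applied to the 4 × 594 target and prediction sets, these two numbers correspond to the Bvp (via MSE) and regression values tabulated above.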

Conclusion:
We can predict the signal intensities, via their normalized values, from Affymetrix data using an artificial neural network-based n-gram model. It appears that the higher the n-gram value, with an appropriate composition, the better the predictive accuracy of the model. The use of higher n-gram values and their different compositions is considered in this paper. Efforts could be made to increase the number of n-grams to see whether better results can be obtained, which we envisage to be true. An effort could also be made to find the optimal number of neurons in the hidden layer that gives maximal regression values and the lowest mean square error. An increase in the regression value to, say, 0.999 would be indicative of a much better prediction, with attendant low mean square errors as a measure of performance. As we increase the n-grams, we can also check which composition of the n-grams gives better results. Larger confusion matrix values along the diagonal and ROC curve points in the upper left corner could also be achieved for better classification.