Composition, physicochemical property and base periodicity for discriminating lncRNA and mRNA

Annotation of genome data with biological features is a challenging problem. One such problem is distinguishing lncRNA from mRNA. In this study, three groups of classification features were considered: base periodicity, physicochemical properties and nucleotide composition. We propose a simple neural network model that obtains better results through a judicious combination of these sequence features. Our approach uses a balanced dataset, a simple prediction model and a limited number of features to distinguish lncRNA from mRNA. Accordingly, we used (a) two properties of base periodicity: the peak of the power spectrum and the signal-to-noise ratio (SNR) of this peak; (b) three physicochemical properties: solvation, stacking and hydrogen-bonding energy; and (c) all dinucleotide and trinucleotide compositions. Classification was first performed with each feature group independently and then with combinations of the groups to improve performance. Classification metrics were used to compare the results for seven eukaryotic organisms across the various feature combinations. Nucleotide composition combined with either the physicochemical property group or the base periodicity group of features yields a strong classifier with more than 99 percent accuracy. Base periodicity analysis with SNR can be used as a feature for discriminating lncRNA from mRNA.

Supplementary Method 1: Explanation of the methodology used to calculate the physicochemical property features of dinucleotides and trinucleotides.
For every given lncRNA/mRNA sequence, compute the transition matrix for dinucleotides and trinucleotides separately. Then divide each element of a row of the matrix by the sum of all elements of that row, and repeat this for every row of the matrix. The revised matrix is used as a weight matrix. Using this weight matrix, we calculate the weighted mean physicochemical energy of lncRNA and mRNA for dinucleotides and trinucleotides.
Consider a toy lncRNA/mRNA sequence 5' GCTGTCGTT 3'. Divide this sequence into the trinucleotides GCT, GTC, GTT and into the dinucleotides GC, TG, TC, GT. The calculation starts with the first trinucleotide GCT, moves to the second trinucleotide GTC, and so on.
There are 64 possible trinucleotides and 16 possible dinucleotides, and they are treated as equally probable at the start because any one of the 64 or 16 can come first. The transition of trinucleotides in the toy sequence is therefore Start-GCT-GTC-GTT, giving
$$S_k = \frac{1}{64}\,\mathrm{GCT}_{physico} + \frac{3}{N}\sum_{i,j} W_{ij} V_j ,$$
and for dinucleotides the transition is Start-GC-TG-TC-GT, giving
$$S_k = \frac{1}{16}\,\mathrm{GC}_{physico} + \frac{2}{N}\sum_{i,j} W_{ij} V_j ,$$
where $V_j$ is the experimental physicochemical energy of the dinucleotide or trinucleotide, taken from the published paper mentioned above, and $W_{ij}$ is the weight in the ith row and jth column of the weight (revised transition) matrix.
$S_k$ is the calculated weighted mean physicochemical energy of the kth mRNA or lncRNA sequence, $\mathrm{GCT}_{physico}$ and $\mathrm{GC}_{physico}$ are the experimental physicochemical energies of the first trinucleotide and dinucleotide respectively, and N is the length of the sequence. Using this method, the weighted mean physicochemical energy is calculated for every lncRNA and mRNA sequence.
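The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the energy values in the usage note are made-up placeholders rather than the published experimental energies, and the sum is read as the sum of W_ij * V_j over the non-zero entries of the row-normalised transition matrix.

```python
from collections import defaultdict


def split_kmers(seq, k):
    """Split seq into non-overlapping k-mers; a trailing fragment shorter than k is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def weight_matrix(kmers):
    """Count transitions between consecutive k-mers and normalise each row to sum to 1."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(kmers, kmers[1:]):
        counts[current][following] += 1
    weights = {}
    for current, row in counts.items():
        row_sum = sum(row.values())
        weights[current] = {nxt: c / row_sum for nxt, c in row.items()}
    return weights


def weighted_energy(seq, k, energy):
    """Assumed reading of S_k: start term plus (k / N) * sum of W_ij * V_j.

    energy maps each k-mer to its physicochemical energy (placeholder values here).
    """
    kmers = split_kmers(seq, k)
    weights = weight_matrix(kmers)
    start_term = energy[kmers[0]] / 4 ** k  # 1/64 for trinucleotides, 1/16 for dinucleotides
    weighted_sum = sum(
        w_ij * energy[j] for row in weights.values() for j, w_ij in row.items()
    )
    return start_term + (k / len(seq)) * weighted_sum
```

For the toy sequence, `split_kmers("GCTGTCGTT", 3)` yields GCT, GTC, GTT, and the weight matrix has a single unit entry in each of the rows GCT and GTC. With placeholder energies such as `{"GCT": -1.0, "GTC": -2.0, "GTT": -3.0}`, `weighted_energy("GCTGTCGTT", 3, ...)` evaluates to -1/64 + (3/9) * (-2 - 3).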

Supplementary Method 2:
Base periodicity of lncRNA and mRNA
We analyzed the peak of the power spectrum and the SNR of this peak. In bioinformatics, base periodicity has been explored to unravel biological features. The Fourier spectrum of lncRNA and mRNA is defined using the binary indicator function $U_\alpha$, where $\alpha$ is a nucleotide symbol and $x_j$ is the nucleotide at the jth position of the lncRNA/mRNA sequence. The total power spectrum is the sum of the power spectra of the individual indicator functions,
$$S(f) = \sum_{\alpha} s_\alpha(f),$$
where N is the length of the RNA sequence, $f = k/N$, $k = 1, 2, 3, \ldots, N/2$, and $s_\alpha(f)$ is the partial power spectrum of nucleotide $\alpha$. The signal-to-noise ratio (SNR) is evaluated for the peak at frequency 1/3.

Supplementary explanation 1: Neural network setup explanation and illustration
We considered fifteen features for the individual models. For illustration, consider the feature combination TPTCDPDC: its eight symbols indicate that it is composed of four feature groups, TP, TC, DP and DC. The total feature dimension is therefore 3 features of TP + 5 features of TC (the first 5 PCs of trinucleotides) + 3 features of DP + 4 features of DC (the first 4 PCs of dinucleotides), i.e. 15. The neural network was implemented with software tools such as Keras. The first column of this table gives the combination of features. The second column gives the total number of hidden layers, that is, the layers between the input and output layers. The third column gives the number of artificial neurons in each hidden layer. For example, the neural network for TPTCDPDC contains 4 hidden layers with 16, 8, 4 and 2 neurons. The performance metrics for the models constructed above are provided in the Results section.
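As an illustration of the TPTCDPDC example, the sketch below derives the input dimension from the combination code and builds a network with hidden layers of 16, 8, 4 and 2 neurons in Keras. Only the feature counts and the hidden-layer sizes come from the text; the helper names, the ReLU and sigmoid activations, the Adam optimizer and the binary cross-entropy loss are assumptions made for the example, not settings reported by the authors.

```python
# Feature counts per two-letter code, as stated in the text.
FEATURES_PER_CODE = {"TP": 3, "TC": 5, "DP": 3, "DC": 4}


def split_codes(combination):
    """Split a combination string such as 'TPTCDPDC' into two-letter codes."""
    return [combination[i:i + 2] for i in range(0, len(combination), 2)]


def feature_dimension(combination):
    """Total number of input features for a feature combination."""
    return sum(FEATURES_PER_CODE[code] for code in split_codes(combination))


def build_model(n_features, hidden=(16, 8, 4, 2)):
    """Binary classifier (lncRNA vs mRNA); activations, optimizer and loss are assumed."""
    from tensorflow import keras  # imported here so the helpers above work without TensorFlow

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in hidden:
        model.add(keras.layers.Dense(units, activation="relu"))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Example (requires TensorFlow):
# model = build_model(feature_dimension("TPTCDPDC"))
```

Keeping the dimension lookup separate from the model builder makes it easy to check each row of the setup table against the stated feature counts before any training is run.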
Supplementary Figure 1: flow chart of data preparation.

Supplementary explanation 2:
Detail of lognormal distribution fitting
The MLE values of the lognormal distribution parameters are listed in Table 4. Although the peak of the power spectrum at frequency 1/3 and the SNR follow log-normal distributions for both types of RNA sequence, the parameters differ slightly because of differences in inherent variability and diversity.
As shown in Table 4, the MLE values of μ for the maximum peak are almost the same in lncRNA and mRNA. A large number of sequences have very small power spectrum values, and because the natural logarithm of a number less than 1 is negative, their logarithms are negative. This is a common characteristic of log-normal distributions when the data include values close to zero. The similarity in μ between the two sequence types suggests that they may share common statistical properties, at least in terms of central tendency. The relatively larger SNR values may influence the shape and location of the lognormal distributions: higher SNR values may shift the location parameter μ or the scale parameter σ. Hence the SNR of a sequence is more discriminating than the maximum peak.
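For reference, the maximum-likelihood estimates of a two-parameter lognormal distribution have a closed form: μ is the mean of the log-values and σ is their population standard deviation. The sketch below assumes this standard parameterization; the function name is hypothetical and the text does not state which fitting routine the authors used.

```python
import math


def lognormal_mle(values):
    """Closed-form MLE of (mu, sigma) for strictly positive data."""
    logs = [math.log(v) for v in values]
    mu = sum(logs) / len(logs)
    # The MLE divides by n (not n - 1).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma
```

When most values are below 1, as described for the peak power values, every log-value is negative and so is the estimated μ.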
C. elegans has about 28532 mRNA sequences in the NCBI transcript file and about 3152 lncRNA sequences in the NONCODE database. Few mRNAs have an SNR lower than 4, whereas more than 3000 lncRNA sequences have an SNR less than 4. SNR can therefore be regarded with confidence as a good discriminating feature.
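The peak power at frequency 1/3 and its SNR, as used in Supplementary Method 2, can be illustrated with a direct Fourier sum over the four indicator sequences. This is a sketch under assumptions: the spectrum is left unnormalised, the SNR is taken as the peak power divided by the mean of the total spectrum over k = 1 .. N/2 (a common convention that the text does not spell out), the sequence length is assumed to be a multiple of 3, and the function names are hypothetical.

```python
import cmath


def power_spectrum(seq):
    """Total power spectrum S(k/N) for k = 1 .. N//2, summed over A, C, G and T/U."""
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            # Fourier coefficient of the binary indicator sequence of this base.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, x in enumerate(seq)
                if x == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def peak_and_snr(seq):
    """Return (power at frequency 1/3, SNR of that peak) under the assumed definition."""
    n = len(seq)
    spectrum = power_spectrum(seq)
    peak = spectrum[n // 3 - 1]  # list index of k = N/3
    mean_power = sum(spectrum) / len(spectrum)
    return peak, peak / mean_power
```

A perfectly 3-periodic sequence such as `"ATG" * 10` concentrates all of its power at frequency 1/3, so its SNR equals the number of frequencies considered (15 for N = 30). A sequence without codon structure would be expected to give a much lower value, in line with the SNR threshold of 4 discussed above.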

Figure 2
Figure 2.a: Features based on physicochemical energy. Figure 2.b: Features based on PCA-based composition and base periodicity. Figure 2.c: Combined features based on physicochemical energy and PCA-based composition. Figure 2.d: Combined features based on physicochemical energy, base periodicity and PCA-based composition.

Table 4 :
MLE values of parameters of log-normal distribution of SNR and peak spectrum for lncRNA and mRNA sequences

Table 5 :
Loss of original variance of dinucleotide/trinucleotide in principal components (pc)

Table 6 :
The metrics of neural networks for the classification of various combinations of features of eukaryotic organisms Chimpanzee and C.

Table 7 :
The metrics of neural networks for the classification of various combinations of features of eukaryotic organisms Cow and Platypus using SMOTE.

Table 8 :
The metrics of neural networks for the classification of various combinations of features of eukaryotic organisms Zebra fish and Arabidopsis thaliana using SMOTE.

Table 9 :
The metrics of neural networks for the classification of various combinations of features of eukaryotic organisms Chicken and mixed sequences of the species C. elegans, Chicken and Platypus, with and without SMOTE. Classification metrics for mixed sequences of the seven considered organisms using SMOTE (second approach