Biomarker Identification from RNA-Seq Data using a Robust Statistical Approach

Biomarker identification by differentially expressed genes (DEGs) using RNA-sequencing technology is an important task to characterize the transcriptomics data. This is possible with the advancement of next-generation sequencing technology (NGS). There are a number of statistical techniques to identify DEGs from high-dimensional RNA-seq count data with different groups or conditions such as edgeR, SAMSeq, voom-limma, etc. However, these methods produce high false positives and low accuracy in presence of outliers. We describe a robust t-statistic method to overcome these drawbacks using both simulated and real RNA-seq datasets. The model performance with 61.2%, 35.2%, 21.6%, 6.9%, 74.5%, 78.4%, 93.1%, 35.2% sensitivity, specificity, MER, FDR, AUC, ACC, PPV, and NPV, respectively at 20% outliers is reported. We identified 409 DE genes with p-values<0.05 using robust t-test in HIV viremic vs avirmeic state real dataset. There are 28 up-regulated genes and 381 down-regulated genes estimated by log2 fold change (FC) approach at threshold value 1.5. The up-regulated genes form three clusters and it is found that 11 genes are highly associated in HIV- 1/AIDS. Protein-protein interaction (PPI) of up-regulated genes using STRING database found 21 genes with strong association among themselves. Thus, the identification of potential biomarkers from RNA-seq dataset using a robust t-statistical model is demonstrated.


Background:
Transcriptomics is an evolving and continually growing field of Bioinformatics for biomarker discovery [1]. RNA-sequencing (RNA-seq) technology is an important branch of transcriptomics for identification of novel genes. This technique helps to identify the differentially expressed genes or transcripts (DEGs) associated to trait of interest from the voluminous transcriptomic data. Previously microarray technology had been used by the biological and biomedical researchers for discovering the candidate genes and differentially expressed markers between two or more groups of interest. In recent years, huge amount of transcriptomic data can be generated using high-throughput next-generation sequencing (NGS) technology of cDNA (RNAseq), which ultimately yield RNA-seq count data for subsequent analysis [6]. Additionally, this approach includes the identification of disease biomarkers that may be important in the diagnosis of the different types and subtypes of diseases, with several implications in terms of prognosis and therapy [2]. This sequence-based technology has created significant scope of studying the transcriptome by enabling a wide range of novel applications, including detection of alternative splicing isoforms [3][4], detecting novel genes, gene promoters, isoforms, and allelespecific expression [5]. RNA-seq uses NGS technology to sequence cDNA that has been derived from an RNA sample, and hence generates millions of short reads [6]. These reads are then usually mapped to a reference genome and the number of reads mapping within a genomic feature of interest (such as a gene or an exon) is used as a measure of the abundance of the feature in the analyzed sample [6]. One important objective for RNA-seq is to identify DEGs under different conditions. Researchers typically target for differential expression analysis called "count matrix", where each row represents the gene (or exons or genomic loci), each column represents the sample, and each cell indicates the number of reads mapped to the gene in the sample [7]. A basic research problem in many RNA-seq analyses is the discovery of DEGs between different sample groups (e.g. healthy and disease). RNA-seq analysis has some benefits over microarrays for DE analysis including wide dynamic range and a lower background level, and the chance to detect and quantify the expression of previously unknown transcripts [8][9][10]. To deal with the increasing popularity of RNA-seq technology, several statistical tools have been developed so far by the researchers and these are being continuously updated for robust data analysis. Most of the computational methods such as edgeR [11] are based on negative binomial models. SAMseq [12] is a non-parametric method. Gene-level read counts transformation-based method is voom-limma [14]. Still it is important to keep in mind that even these methods are based on an assumption that most genes are equally expressed in the samples, and that the differentially expressed genes are divided more or less equally between upand down-regulation [2]. For detection of DE genes edgeR, SAMSeq and voom-limma are the popular methods. However these methods are sensitive to outliers. In this study we propose a new method robust t-statistic using minimum β-divergence method [15].

Methodology:
RNA-seq count data often produce noisy data called outliers during data generating steps. Presence of such outliers in the particular dataset will surely provide misleading results in the downstream analysis. There is however a number of statistical tools for identification of over-expressed and under-expressed genes, but they are sensitive to outliers in some cases. In this investigation we considered edgeR, SAMSeq and voom-limma the proposed procedures for performance analysis with the help of synthetic and real dataset (Figure 8).

Robust t-statistic for Biomarker Detection (Proposed):
The two sample-sample t-test statistic depends on the mean (µ) and variance (ϭ 2 ) of the samples. The classical mean and variance are very much sensitive to outliers may mislead detection of DEGs. So, we used robust mean and variance with the minimum β-divergence method [15] instead of classical estimators. For robust t-statistic the classical estimators are replaced by the βestimators. The proposed method tolerate outliers using the βweight function, it is also used as a weight for mean and variance estimators [15]. It is an iterative method for estimating mean and variance estimators for calculating t-statistic values. The performance of the proposed method depends on the tuning parameter β, the optimum value for β=0.2 was selected using cross validation method [15].

Data Transformation:
To transform the RNA-Seq expression data we used the z-score transformation for making the data belong normalized data family. In this study Z normalized score values were used for the normalization with the mean (g) and the standard deviation SD (gi) of the i th genes. The normalized data is used to detect the DE genes using robust t-statistic. The fold change (FC) log2 approach used for selection of up-regulated and down-regulated genes with 1.5 threshold value [16]. The steps for identifying up and down regulated genes from gene expression data (GED) are shown ( Figure  1).

Results and Discussions:
All methods were run on the same dataset and evaluated the statistical performance indices/measures such as ROC, AUC, pAUC, MER and FDR. The performance measure AUC was computed for each of the statistical methods using open source R package ROC. All R packages are freely available in the comprehensive R archive network (https://www.cran.rproject.org) and bioconductor (https://www.bioconductor.org).

Synthetic Study:
The synthetic counts data were generated for each gene from a Negative Binomial distribution. It controls the settings and the true differential expression status of each gene. We generated gene expression profiles of G=1000 genes of k=2 groups (healthy and disease) each with n1=n2=3 samples. Among the expressions of 1000 genes, we divided these expressions into two groups (expressions of important genes or DE genes, 100 and expressions of the unimportant genes or EE genes, 900). In order to determine the performance of the robust t-test in comparison of the three well-known DE genes identification methods (edgeR, SAMSeq, voom+limma) of RNA-Seq data, we investigated the performance of all fours methods in presence of outliers using the simulated dataset (G=1000) where 10% of the genes are DE genes. A method is said to be comparatively good performer if it yields larger values of TPR, TNR, AUC and small value of FDR. To demonstrate the effect in presence of outliers for performance evaluation of these DE genes identification methods, we randomly introduce 5%, 10%, 15% and 20% genes are corrupted by outliers. (Figure 2a-2d) shows the boxplot of AUC values of the four methods estimated from the simulated data at different outlier levels. It showed that the AUC of our proposed method is high. It is observed from the Table 2 that the AUC values of proposed method are 0.75, 0.71, 0.74 and 0.75 for 5%, 10%, 15% and 20% respectively. The proposed method provided the low FDR and high ROC with outliers (Figure 3 and 4). Identification of DEGs is vital issues for personalised medicine and drug discovery. From Table 2 the proposed method shows low MER (misclassification error rate) than the edgeR, SAMSeq and voomlimma methods.

Real RNA-Seq Data Analysis:
The RNA-Seq data GSE5220 was collected from the GEO database   (Figure 6a-6b) using k-means clustering algorithm. There are 21 genes showed the strong association among them (Figure 6a) and 20 genes are structured proteins. The MDFIC gene is the unstructured protein and it helps to make strong bond among other genes. The PPI network among 381 down-regulated genes is strong which forms three functional groups using k-means clustering algorithm (Figure 6c-6d). There are 45 unstructured genes whose protein structures are not available. The functional annotation of up-regulated and down-regulated genes is shown in the Figure 7a and Figure 7b respectively. The details of their cellular component functional annotations are shown in Table 3.

Conclusions:
RNA-seq data analysis helps to identify DEGs to solve important biological problems. Statistical approaches such as edgeR, SAMSeq and voom-limma are widely used in DEGs identification. However, these methods are sensitive to outliers. Hence, we report a robust t-statistic using minimum β-divergence method to overcome this issue. Analysis of synthetic data showed that the robust t-statistic method produced better performance with high AUC, low MER and FDR. The real RNAseq count data analysis detected 409 DEG using this method. There are 28 up-regulated genes and 381 down-regulated genes estimated by log2 fold change (FC) approach at threshold value 1.5. Thus, the identification of potential biomarkers from RNAseq dataset using a robust t-statistical model is demonstrated.