Tuesday, November 19, 2013

RNA-Seq Differential Expression Benchmarks

I recently published a paper whose primary purpose was to serve as a reference for the protocol that I use for RNA-Seq analysis (see main paper and supplemental figures).

The aspect of the paper that I think is most interesting to the genomics community is a comparison of statistical tools for defining differentially expressed genes, which had the greatest influence on the resulting gene lists (at least among the comparisons that I make in the paper).  So, I will review those relevant figures in this blog post.

The plots below show the robustness of the gene lists produced by a given algorithm.  In other words, the higher the "common" line on the graph, the more robust the gene lists (i.e. the higher the proportion of genes commonly called by multiple algorithms).  Most readers will probably not be as interested in the x-axis (rounding factor for RPKM values), and it only changes the gene lists for Partek and sRAP.
Analysis of Patient Cohort (Tumor versus Normal).  1-factor is just tumor versus normal, while 2-factor also includes patient ID (pairing tumor and normal samples).  cuffdiff results not shown because no genes were defined with FDR < 0.05.  sRAP not shown because gene list was very small (see Figure S3 from the paper)
Analysis of Cell Line Comparison (Mutant versus WT)
To be fair, I will certainly admit robustness is not the same as accuracy.   Uniquely identified genes may be true positives that represent a lower false negative rate.  However, this did correspond to some circumstantial evidence I've seen with other datasets where cuffdiff and edgeR have given some weird results.  The results from this paper don't actually contain the clearest examples of this, but you can take a look at the GAGE4 stats to see an example where I would at least argue that edgeR provides inflated statistical significance.

Overall, I think Partek works the best (which is what I use for COH customers), but I was also pleased with DESeq (and sRAP, but I am obviously biased).  In fact, these comparisons support earlier observations that DESeq is conservative in defining lists of differential expressed genes (Robles et al. 2012).

However, my main goal is not to simply tell you what is the single best solution.  In fact, the cell line comparison above also had paired microarray data, and I would say the concordance between the two technologies was roughly similar for most algorithms:
RNA-Seq versus Microarray Gene lists.  "Microarray DEG" = proportion of differentially expressed genes in microarray data also present in RNA-Seq gene list.  "RNA-Seq DEG" = proportion of differentially expressed genes in RNA-Seq data also present in microarray gene list.

The similarity in microarray concordance kind of reminds me of Figure 2a from Rapport et al. 2013, which compares RNA-Seq gene lists to ~1000 qPCR validated genes.  However, I think properly determining accuracy can be difficult.  For example, look at the differences between the qPCR results in Figure 2a and the ERCC spike-ins in Figure S5 for that same paper.

Instead, these are the main take-home points I would like to emphasize:

1) Simple methods comparing RPKM values (in this case, rounded and log2 transformed) for defining differentially expressed genes can work at least as well as more complicated methods that are unique for RNA-Seq analysis (at least for gene-level comparisons).  For example, one claim against count-based methods in general (including edgeR, DESeq, etc.) is that there can be confounding factors, such as changes in splicing patterns.  Although I agree this is a theoretical problem that probably does occur to some extent, it doesn't seem to be a major factor influencing concordance with microarray data, qPCR validation, etc.

2) There is probably not a solution that works best in all situations. In this paper, you can see the results look very different with the patient versus cell line datasets.  For practical reasons, a lot of benchmarks will probably use cell line datasets.  However, it is not safe to assume performance for large patient cohorts will be comparable to cell line data (or patient data with little or no biological replicates).


  1. This comment has been removed by a blog administrator.

  2. I believe the figures for the paper/post still hold up and are useful. However, my opinion has changed somewhat over time, I think the interpretation of “benchmark” reports should probably change a little bit.

    Namely, I think there needs to be some methods testing for each project. This means that "Optimizing" may be better than "Optimized," and perhaps something like "observations" should be more appropriate than "benchmarks."

    For example, I think it is useful to see that the concordance for differential expression results with different programs can vary *for the same dataset* (depending upon conditions like the number of variables used in the differential expression model, and the rounding threshold used for fold-change calculation). This is useful in itself, but it means that one study's benchmarks shouldn't strictly define criteria to use in other experiments (without critically assessing those other experiments).

    I don’t think this is quite enough to justify a formal correction, but I did think I needed to add a comment here.

  3. This comment has been removed by a blog administrator.


Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.