The aspect of the paper that I think is most interesting to the genomics community is a comparison of statistical tools for defining differentially expressed genes, which had the greatest influence on the resulting gene lists (at least among the comparisons that I make in the paper). So, I will review those relevant figures in this blog post.
The plots below show the robustness of the gene lists produced by a given algorithm. In other words, the higher the "common" line on the graph, the more robust the gene lists (i.e. the higher the proportion of genes commonly called by multiple algorithms). Most readers will probably not be as interested in the x-axis (rounding factor for RPKM values), and it only changes the gene lists for Partek and sRAP.
|Analysis of Patient Cohort (Tumor versus Normal). 1-factor is just tumor versus normal, while 2-factor also includes patient ID (pairing tumor and normal samples). cuffdiff results not shown because no genes were defined with FDR < 0.05. sRAP not shown because gene list was very small (see Figure S3 from the paper)|
|Analysis of Cell Line Comparison (Mutant versus WT)|
Overall, I think Partek works the best (which is what I use for COH customers), but I was also pleased with DESeq (and sRAP, but I am obviously biased). In fact, these comparisons support earlier observations that DESeq is conservative in defining lists of differential expressed genes (Robles et al. 2012).
However, my main goal is not to simply tell you what is the single best solution. In fact, the cell line comparison above also had paired microarray data, and I would say the concordance between the two technologies was roughly similar for most algorithms:
The similarity in microarray concordance kind of reminds me of Figure 2a from Rapport et al. 2013, which compares RNA-Seq gene lists to ~1000 qPCR validated genes. However, I think properly determining accuracy can be difficult. For example, look at the differences between the qPCR results in Figure 2a and the ERCC spike-ins in Figure S5 for that same paper.
Instead, these are the main take-home points I would like to emphasize:
1) Simple methods comparing RPKM values (in this case, rounded and log2 transformed) for defining differentially expressed genes can work at least as well as more complicated methods that are unique for RNA-Seq analysis (at least for gene-level comparisons). For example, one claim against count-based methods in general (including edgeR, DESeq, etc.) is that there can be confounding factors, such as changes in splicing patterns. Although I agree this is a theoretical problem that probably does occur to some extent, it doesn't seem to be a major factor influencing concordance with microarray data, qPCR validation, etc.
2) There is probably not a solution that works best in all situations. In this paper, you can see the results look very different with the patient versus cell line datasets. For practical reasons, a lot of benchmarks will probably use cell line datasets. However, it is not safe to assume performance for large patient cohorts will be comparable to cell line data (or patient data with little or no biological replicates).