Tuesday, February 12, 2019

Variance Stabilization and Pseudocounts / Rounding Factors

To be fair, I first want to point out that most of this content was previously posted in an answer Biostars thread.

While I could tell that thread got a fair number of views overall, there wasn't much of a reaction (at least not shortly after posting something on Twitter).  While patience is a virtue, I discovered something when I started working with my own code that I didn't initially notice with the DESeq2 example code.  Plus, I thought some formatting differences may make things more clear, and it might help to separate this point from the slightly different topic of the original Biostars thread.

So, here are those (and additional) results presented in a slightly different way.

First, I thought it was interesting that the plot for rlog values looked a lot like log-transformed values with a pseudocount of 100 (from the meanSdPlot() function in the vsn package, as used in the DESeq2 vignette):



In other words, I am saying the middle plot above (log-transformation with a higher pseudocount) looks a lot more like the rlog plot (on the far right, above).  As a minor point, it may also be worth mentioning that I reformatted the plot (to look more like the other plots that I create below), if you don't immediately recognize the plots above from the Biostars thread.  In the original thread, I thought this was worth mentioning because the log-transformed expression values are independent for each sample, but the rlog values for a given sample will vary depending upon what other samples are processed (so, all other things being equal, I thought the independent per-sample normalization would be better).

However, what I didn't initially notice is what happens if I plot the regular mean versus SD:



In both cases, the higher pseudocount causes there to be less of a bump in the SD values (for lower count genes).  However, with this alternative plot, you can notice more of a difference between the log-transformed values and the rlog values (above, middle-plot versus right-plot).

I previously discussed fold-change values the context of "rounding factors" (in this paper, and this blog post).  While the mean-versus-SD trend is similar, there are some differences from changing the order of operations (CPM with rounding factor in 2nd and 4th columns, with the qualitatively lowest increase in SD with the larger rounding factor in the 4th column):



I frequently find having independently calculated expression values (such as log2(FPKM+X) values) to be helpful in assessing results from differential expression programs (for each project).  However, to be fair, there is in fact some noticeable differences between the log-transformed expression with the higher psuedocount (or rounding factor) and the rlog values from DESeq2.

The code to create the plots above can be downloaded here.  As mentioned in the Biostars thread, I also owe a big thanks to Mike Love (because the 1st part of that code is copied from the DESeq2 vignette).

 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.