Using transcripts rather than genomic chromosomes as reference sequences is actually how I imagined RNA-Seq analysis would be conducted, before I learned about standard practices. In fact, samtools provides an 'idxstats' function whose per-reference read counts can be used to calculate normalized RPKM expression values. So, I was curious whether the extra modeling done by eXpress is really any better than this simple sort of RPKM calculation: a more complicated model can potentially improve accuracy, but it can also leave extra room for things to go wrong, lead to over-fitting, etc. For example, I have used eXpress on some de novo assembly data, and I actually found that normal de novo programs seemed to provide better results than those specifically designed for RNA-Seq data (however, to be clear, I think the results of this blog post emphasize that the problem was with the assembly and not the mRNA quantification, as I would have expected).
The short answer is "Yes" - I think it is better to use eXpress over idxstats for calculating RPKM/FPKM values.
To illustrate this, first take a look at the correlations between the eXpress FPKM values and the RPKM values calculated using idxstats:
The correlation isn't horrible, but you can see a non-trivial number of genes whose expression levels are consistently lower in eXpress than in idxstats. However, this by itself doesn't really prove that one option is better than the other. Because I feel comfortable with the gene-level mRNA quantification levels from cufflinks (and the RSEM-like algorithm implemented in Partek; for example, see Figure 5 in this paper or click here to see a direct correlation between these two results), I decided to see how the results compared when using different tools for a transcript-based reference (eXpress, idxstats) versus a genomic/chromosome-based reference (cufflinks, Partek).
Again, you see these outliers if you compare the idxstats results to cufflinks (or to Partek - click here for those results):
However, you don't see these outliers when comparing eXpress to cufflinks (or to Partek - again, click here for those results):
So, eXpress clearly provides more robust results than the simpler idxstats calculation. You can also see this in the box plot below, showing the correlation coefficients for all the mRNA quantification strategies that I tested.
Of course, systematic differences between mRNA quantification methods should (at least partially) be corrected when identifying differentially expressed genes between two groups (because the differences affect both groups). However, there are certain circumstances where the mRNA quantification levels may be used in isolation, such as for ranking the most highly expressed genes in a sample (as was the case for the de novo assembly data that I worked with). In these situations, I would definitely recommend a tool like eXpress over trying to calculate RPKM values from tools like idxstats.
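For reference, the simple idxstats-based RPKM calculation discussed above can be sketched in Python roughly as follows. This is a minimal illustration and not the script used for this post. The function name `rpkm_from_idxstats` is hypothetical, and using the total number of reads mapped to the transcript set as the denominator is an assumption (other choices, such as total sequenced reads, are possible). The only thing taken as given is the `samtools idxstats` output format: tab-separated reference name, sequence length, mapped reads, and unmapped reads.

```python
def rpkm_from_idxstats(idxstats_text):
    """Compute RPKM per reference sequence from `samtools idxstats` output.

    Each input line is tab-separated:
    reference name, sequence length, mapped reads, unmapped reads.
    """
    counts = {}
    for line in idxstats_text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        length, mapped = int(length), int(mapped)
        # Skip the "*" line (reads with no reference) and zero-length entries.
        if name == "*" or length == 0:
            continue
        counts[name] = (length, mapped)

    # Assumption: normalize by the reads mapped to this transcript set.
    total_mapped = sum(mapped for _, mapped in counts.values())
    if total_mapped == 0:
        return {name: 0.0 for name in counts}

    # RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)
    return {
        name: mapped * 1e9 / (length * total_mapped)
        for name, (length, mapped) in counts.items()
    }
```

The input text could come from something like `samtools idxstats sample.bam > sample.idxstats.txt` (file names here are placeholders). Note that this calculation simply takes the read counts at face value, with no modeling of reads that align to multiple transcripts, which is the kind of thing eXpress tries to account for.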
FYI, here are some details on the methodology for this comparison:
- MiSeq samples from GSE37703 were used for these comparisons.
- Correlations were calculated using log2(FPKM/RPKM + 0.1) expression values.
- eXpress and idxstats were run on Bowtie2 alignments of the same set of RefSeq transcripts (downloaded from the UCSC Genome Browser, with duplicated gene IDs removed). The Partek EM algorithm used a set of RefSeq sequences provided by the vendor, and cufflinks used the genes.gtf file downloaded from iGenomes on the TopHat website. Only commonly represented gene symbols were used for calculating correlations. Only genes declared "solvable" by eXpress were considered for calculating correlations. As an example, click here to view a Venn diagram of overlapping gene symbols for SRR493372.
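The correlation steps listed above (log2 transform with a 0.1 pseudocount, restriction to shared gene symbols, optional restriction to "solvable" genes) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the dictionary inputs (gene symbol mapped to FPKM/RPKM) are hypothetical, and Pearson correlation is assumed here because the type of correlation coefficient is not specified above.

```python
import math


def log2_expr(value, pseudocount=0.1):
    """log2(FPKM/RPKM + 0.1), so genes with zero expression stay defined."""
    return math.log2(value + pseudocount)


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / math.sqrt(sxx * syy)


def compare_methods(expr_a, expr_b, solvable=None):
    """Correlate two quantification results on their shared gene symbols.

    expr_a, expr_b: dicts mapping gene symbol -> FPKM or RPKM (hypothetical inputs).
    solvable: optional set of gene symbols to keep (e.g. eXpress "solvable" genes).
    Returns (correlation, number of genes used).
    """
    genes = sorted(set(expr_a) & set(expr_b))
    if solvable is not None:
        genes = [g for g in genes if g in solvable]
    xs = [log2_expr(expr_a[g]) for g in genes]
    ys = [log2_expr(expr_b[g]) for g in genes]
    return pearson(xs, ys), len(genes)
```

For example, `compare_methods(express_fpkm, idxstats_rpkm, solvable=solvable_genes)` would return the correlation and the number of genes it was computed on, where all three arguments are placeholders for tables loaded from the respective tool outputs.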
P.S. It looks like you may have to be signed into Google Docs to view the image previews properly. However, you can always download the files to view them locally.