To be fair, I first want to point out that most of this content was previously posted in an answer on a Biostars thread.
While I could tell that thread got a fair number of views overall, there wasn't much of a reaction (at least not shortly after I posted about it on Twitter). While patience is a virtue, I discovered something when I started working with my own code that I didn't initially notice with the DESeq2 example code. Plus, I thought some formatting differences might make things clearer, and it might help to separate this point from the slightly different topic of the original Biostars thread.
So, here are those (and additional) results presented in a slightly different way.
First, I thought it was interesting that the plot for rlog values looked a lot like log-transformed values with a pseudocount of 100 (from the meanSdPlot() function in the vsn package, as used in the DESeq2 vignette):
In other words, I am saying the middle plot above (log transformation with a higher pseudocount) looks a lot more like the rlog plot (on the far right, above). As a minor point, it may also be worth mentioning that I reformatted the plot (to look more like the other plots that I create below), in case you don't immediately recognize the plots above from the Biostars thread. In the original thread, I thought this was worth mentioning because the log-transformed expression values are independent for each sample, but the rlog values for a given sample will vary depending upon which other samples are processed (so, all other things being equal, I thought the independent per-sample normalization would be better).
However, what I didn't initially notice is what happens if I plot the regular mean versus SD:
In both cases, the higher pseudocount causes there to be less of a bump in the SD values (for lower count genes). However, with this alternative plot, you can notice more of a difference between the log-transformed values and the rlog values (above, middle-plot versus right-plot).
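To make the effect concrete, here is a minimal sketch of that mean-versus-SD comparison (in Python rather than the original R code, with hypothetical negative binomial counts standing in for the DESeq2 example data): a larger pseudocount shrinks the per-gene SD of the log-transformed values, especially for low-count genes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical count matrix: 1000 genes x 6 samples (simulated, not real data)
counts = rng.negative_binomial(n=5, p=0.1, size=(1000, 6))

def log_transform(counts, pseudocount):
    """Per-sample-independent log transformation: log2(count + pseudocount)."""
    return np.log2(counts + pseudocount)

def per_gene_mean_sd(mat):
    """Mean and SD across samples, for each gene (row)."""
    return mat.mean(axis=1), mat.std(axis=1, ddof=1)

mean1, sd1 = per_gene_mean_sd(log_transform(counts, 1))
mean100, sd100 = per_gene_mean_sd(log_transform(counts, 100))

# The larger pseudocount compresses the per-gene variation
print(sd1.mean(), sd100.mean())
```

Plotting `mean1` versus `sd1` (and `mean100` versus `sd100`) gives the "regular mean versus SD" view described above, as opposed to the rank-based x-axis used by meanSdPlot().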
I previously discussed fold-change values in the context of "rounding factors" (in this paper, and this blog post). While the mean-versus-SD trend is similar, there are some differences from changing the order of operations (CPM with a rounding factor in the 2nd and 4th columns, with the qualitatively lowest increase in SD with the larger rounding factor in the 4th column):
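As a rough sketch of what I mean by order of operations (a hypothetical Python illustration, not the exact R code behind the plots above), the rounding factor can be added either after converting to CPM, or to the raw counts before normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical counts and library sizes (simulated, not real data)
counts = rng.negative_binomial(n=5, p=0.1, size=(1000, 6))
lib_sizes = counts.sum(axis=0)

def log2_cpm_then_round(counts, lib_sizes, rounding_factor):
    """Normalize to CPM first, then add the rounding factor inside the log."""
    cpm = counts / lib_sizes * 1e6
    return np.log2(cpm + rounding_factor)

def log2_round_then_cpm(counts, lib_sizes, rounding_factor):
    """Add the rounding factor to the raw counts first, then normalize."""
    return np.log2((counts + rounding_factor) / lib_sizes * 1e6)

for rf in (1, 100):
    a = log2_cpm_then_round(counts, lib_sizes, rf).std(axis=1, ddof=1).mean()
    b = log2_round_then_cpm(counts, lib_sizes, rf).std(axis=1, ddof=1).mean()
    print(rf, round(a, 3), round(b, 3))
```

Either way, the larger rounding factor damps the SD for low-count genes, but the two orders give slightly different values (the second order also shifts values by the per-sample library-size term).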
I frequently find having independently calculated expression values (such as log2(FPKM+X) values) to be helpful in assessing results from differential expression programs (for each project). However, to be fair, there are in fact some noticeable differences between the log-transformed expression with the higher pseudocount (or rounding factor) and the rlog values from DESeq2.
The code to create the plots above can be downloaded here. As mentioned in the Biostars thread, I also owe a big thanks to Mike Love (because the 1st part of that code is copied from the DESeq2 vignette).
Tuesday, February 12, 2019
Friday, January 25, 2019
TopHat2 Really Isn't That Bad
TopHat is a highly cited RNA-Seq aligner. Newer methods for alignment (and for quantification without alignment) are now available, so there is debate about whether those newer (and also highly cited) methods should be used instead of the final version of TopHat2.
One of the authors on the original TopHat paper (although admittedly not on the TopHat2 paper, or on the HISAT paper, for that matter) has also discouraged people from using TopHat1. To be precise, that tweet is about TopHat1 (and I would in fact use TopHat2). However, I think the interpretation is often that HISAT/HISAT2 should completely replace TopHat2 (and that is the point I strongly disagree with).
As pointed out in this article, the TopHat website does recommend using HISAT2, as TopHat has been moved to "low maintenance" status. However, I usually prefer STAR and/or TopHat2 over HISAT2, so I still don't completely agree (for reasons explained in this post).
Since I kept bringing it up on Biostars, I thought it might be good to have a brief blog post with some points that I include in multiple responses (in terms of why I think TopHat2 can still be useful):
1) I've had situations where the more conservative alignment from TopHat2 was arguably useful to avoid alignments from unintended sequence and/or contamination.
2) I've had situations where unaligned reads from TopHat2 were useful for conversion back to FASTQ for use in RNA-Seq de novo assembly (such as for exogenous sequence and/or pathogen identification).
3) For gene expression, if I see something potentially strange with a TopHat2 alignment (for endogenous gene expression), I have yet to find an example where the trend was substantially different with STAR (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count is usually similar between TopHat2 and STAR single-end alignments for most genes).
There was only a small subset of projects where I confirmed that the conclusion was not different with a STAR alignment, but you can see this acknowledgement to get some idea about the total number of samples tested (although I apologize that you have to scroll down to see names, where the name count per protocol is correlated with the total number of projects).
I also have a more recent blog post referencing some comparisons where I tested recovery of the altered gene in a knock-out or over-expression dataset, where I start with a TopHat2 alignment.
4) At least for 51 bp Single-End Reads (with 4 cores and 8 GB of RAM per sample, on a cluster), I wouldn't consider the difference in run-time for TopHat2 (versus STAR) to be prohibitive (I think it is usually within a few hours, running samples in parallel). I believe the difference is a bigger issue with paired-end reads, but I don't encounter those as often.
That said, I am usually running TopHat2 with the --no-coverage-search parameter, which can noticeably decrease the run-time.
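Regarding point 3, if you want to check this sort of claim on your own data, per-gene tables from htseq-count (gene and tab-separated count, with trailing "__" summary lines such as __no_feature) can be compared with a small helper like this (a hypothetical Python sketch, not the exact script I used):

```python
import math

def read_htseq_counts(path):
    """Parse an htseq-count output file (gene<TAB>count), skipping the
    trailing __no_feature / __ambiguous / etc. summary lines."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            gene, value = line.rstrip("\n").split("\t")
            if not gene.startswith("__"):
                counts[gene] = int(value)
    return counts

def log2_ratio_per_gene(counts_a, counts_b, pseudocount=1):
    """Per-gene log2 ratio between two count tables (e.g., TopHat2 versus
    STAR alignments of the same sample), restricted to shared genes."""
    shared = sorted(set(counts_a) & set(counts_b))
    return {gene: math.log2((counts_a[gene] + pseudocount) /
                            (counts_b[gene] + pseudocount))
            for gene in shared}
```

Genes with log2 ratios near zero are quantified similarly by both alignments; large ratios flag the genes worth inspecting in a genome browser.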
Here are a couple possibly relevant Biostars posts:
- Comment (and follow-up responses) on "Best RNA-Seq aligner"
- Comments on Multiple Answers for "Tophat2/Bowtie failing to align some examples of partial intron retention / exon loss"
- Answer for "Tophat problem - R1 and R2 reads mapped to different strands of viral genome"
- Comment for "Tophat vs Bowtie alignment rates"
- Twitter response acknowledging some benefits to STAR with PE data
Also, there are some points that were brought up after the initial post:
- predeus had a good comment essentially mentioning that the STAR and HISAT parameters can be changed to be more conservative (for either method) and/or provide unaligned reads (for STAR). This may be worth considering during troubleshooting.
- Bastien Hervé had a good comment that the TopHat website says "Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way." I can now see possible concerns about how someone may think I was trying to imply that the developers agreed with me, when I need to be representing my opinions as independent ideas. So, I apologize about that.
- That said, I think it is important to emphasize that programs can have useful applications that may not have been planned by developers, and I think the above points about single-end 50 bp RNA-Seq samples are still relevant.
- Also, I have even had a period of time when I recommended people use minfi instead of COHCAP, even though I am the COHCAP developer / maintainer. In fact, I think that was even on the main page for the standalone version (I say "think" because I am having a hard time finding the ~2015 discussion group question where that was my answer). However, I currently believe that I may have shortchanged my own package (due to the greater popularity of minfi, and one benchmark showing similar validation for COHCAP versus minfi/bumphunter).
- It is more of a side note, but the COHCAP paper is also relevant to the discussion of not implying agreement, since one of the comments acknowledges that I shouldn't have used "City of Hope" in the algorithm name.
I will probably continue to update this in the future, but I wanted to have some place to save my thoughts in the meantime.
Update Log:
1/25/2019 - modified formatting wording (decided to add a note about this retroactively on 1/29)
1/29/2019 - not directly an update to this blog, but there was a response to the points that I proposed here: https://www.biostars.org/p/359974/#361003
1/30/2019 - added point about run-time (and specifically reference extra comment about STAR/HISAT parameters)
2/26/2019 - Added link to Biostars post about lower alignment rate (for TopHat2 versus Bowtie2)
2/27/2019 - Add point about COHCAP and not implying agreement
3/2/2019 - Add link to Twitter response
6/17/2019 - Mention that I was using the --no-coverage-search parameter
7/5/2019 - qualify difference in gene expression trends and remove splicing event (since I could also identify that gene with a later version of rMATS and a STAR alignment)
3/18/2020 - add link to note on TopHat website, as well as a link about more recent benchmarks
Sunday, June 8, 2014
Questions About Genetic Testing
I've recently been asked some questions about genetic testing, some of which I think are worth mentioning in a public discussion (although please see the note above if you are interested in getting specific recommendations):
In many cases, I think the actions that you can take to reduce disease risk are relatively generic (exercise, eat lots of fruits and vegetables, etc.), and these are things that you should do regardless of your genotype.
That said, there certainly are some circumstances where very specific action can be taken, based upon your genome sequence. I am not personally aware of all such examples, and I would recommend talking to a medical professional (such as a genetic counselor) for more information. If this is the only type of information that you wish to learn about your genome, then you may only benefit from determining the sequence for a small portion of your genome (and your family history can guide the likelihood of needing to perform any genetic tests in a clinical setting).
2) How does Promethease compare to the health reports from 23andMe? Does Promethease mostly focus on rare mutations?
Promethease is based upon annotations that come from SNPedia (similar to Wikipedia, but specifically for mutation annotations). So, I would expect the content to depend on whatever information is entered by volunteers. Given the amount of information that I see from my 23andMe data (which is mostly common variants), I would say it contains a large amount of information on common variants.
When I first ran Promethease, I remembered mostly seeing risk annotations (which you can see in my old post, comparing my top 23andMe risk associations), and I didn't remember seeing anything like the carrier status report. For example, I didn't remember seeing anything reporting me as a carrier for cystic fibrosis, which is an example of a 23andMe result that was in good concordance with my family history. However, I went back to check my specific mutation (394delTT), with probe ID i4000313. This particular probe ID makes it a bit harder to match the mutation. However, if I Google the probe ID, I can see that it is included in SNPedia, but without a detailed description. If I look up 394delTT in ClinVar, then I can get more information about this variant, and I can see that it corresponds to rs121908769. If I then look this variant up in SNPedia, I can see that it provides some information from ClinVar (although there is nothing mentioning cystic fibrosis in the brief main text), but I don't see it among any of the "cystic fibrosis" variants in my old Promethease report.
However, to be fair, I wanted to re-run my sample through Promethease to see if the reporting system has changed and/or to confirm whether it now recognizes this mutation in my data. It appears that many new features have been added within the last 3+ years, such as a more interactive interface (viewed by clicking "UI version 2" in the downloadable report). Additionally, I can tell that the "medicines" and "medical conditions" sections have been expanded to include more SNPs. However, it still doesn't recognize my cystic fibrosis carrier status, so it can't simply be considered a replacement for the old 23andMe report.
Also, in general, the information available in Promethease tends to be terse, and it won't contain the same level of detail for explaining basic concepts as would have been provided by the old 23andMe health report.
3) More specifically, I do not wish to see a doctor in order to obtain my genetic information. Also, I want something clear and easy to understand, which doesn't require doing any additional research. What are my options, now that 23andMe no longer offers health reports?
I am not aware of any direct-to-consumer test that provides information that is comparable to the old 23andMe health reports, and I am not aware of any tool to analyze your raw 23andMe data that will reproduce the old 23andMe health report (especially not with all the details to help make the results easier to understand). I believe that even Illumina's Understand Your Genome program requires meeting with a doctor to draw your blood, conduct a predisposition screen, and discuss your results.
Perhaps more importantly, I think it is worth emphasizing that the health reports previously offered by 23andMe were not really a single report: they were updated periodically as new findings were published in the genomics literature, so your estimated risk would change over time for many traits. This should be generally true for any tool connecting you to the genomics literature, as also demonstrated for the Promethease example above.
If you were only interested in the subset of results that are unlikely to change, I think you would probably have been most interested in the carrier status reports (and a handful of the other reports). There are tests like Counsyl that I would expect to be similar to the 23andMe carrier status results (and it is something that I would want to check out if I was planning on having a child), but I believe that you can only get that test through your doctor. However, this is just one example: I would recommend talking to a doctor or genetic counselor if you want more specific guidance.
In my opinion, I am most uncomfortable with the request to get a result that "doesn't require doing any additional research". Critical thinking and being able to synthesize your own opinion from multiple sources of information are important skills that should be part of everyday life, and I think "additional research" is especially important for tools designed for "research and educational purposes" (including 23andMe, Promethease, Interpretome, etc.). For example, clinical action may be limited because 1) all the genetic influences of disease risk are not known, 2) ways to mitigate genetic risk may not be known, and 3) one current limitation to low-cost options like SNP chips is that you aren't measuring your entire genome sequence (so, some important sequences may not be covered). This might be a problem for some people, but I think it is still OK for many people.
In other words, there certainly have been some cases where people discovered important findings from their 23andMe reports that were worth verifying in a clinical setting, but I think most 23andMe customers took no medical action based upon their reports. Most importantly, "no medical action" need not equate to "dissatisfied": I would personally fall into the category of a customer who was "very satisfied" yet has not changed my behavior because of any of the results. I think trying to understand how your biology is influenced by your genome sequence is a life-long goal that will probably never be fully realized, but I think there is value in being able to understand on-going genome research through the context of your own genome.