Saturday, March 9, 2019

Updated Thoughts on PatientsLikeMe

I have a previous post about PatientsLikeMe, but, importantly, I did not test creating an account until relatively recently.  I have been continually trying to take more time to critically assess results and question prior assumptions (and, in retrospect, I may not have had the best title for that previous blog post), so I thought there would be value in providing an updated perspective on this free website.

I have genomics / medical data publicly available to download on my Personal Genome Project page (for hu832966), and I have what I would consider a partial electronic medical record on my PatientsLikeMe page.  I think PatientsLikeMe is an excellent resource for sharing and learning about patient experiences, with the requirement that everybody who participates be completely open (although you have to sign in with a free account to view my profile).

For those who don't currently have PatientsLikeMe accounts, I thought I should describe a few of my experiences (from the perspective of a patient):

I have taken Citalopram at doses of 20 mg and 40 mg (and 0 mg, during intervals to test the continued benefit of the medication, when my overall stress levels were lowered and/or I learned better cognitive strategies to manage stress).  While it makes quick analysis more difficult, I think being able to see the details of people's experiences can be important.  For example, I thought it was interesting that my body's reaction to the medication seemed to change over time (each time I went back on the medication, I think the side effects were more subtle, even though I think the severity of my initial symptoms also gradually improved over time).  If this is in fact true, that would indicate some resistance / reaction that could not be completely captured by studying germline variants (if you are focusing on using DNA genotyping/sequencing for medication guidance), involving factors such as somatic variants, epigenetic modifications, etc.

I also like that PatientsLikeMe provides scores for both effectiveness and side effects (and I admittedly created a PatientsLikeMe account because some plots in the "Health Communities" in 23andMe reminded me of what I had seen for PatientsLikeMe, even though I had not previously created a PatientsLikeMe account).

On the positive side, I have seen multiple neurologists, and I had not really found any of the migraine medications that I previously took to be helpful.  However, my most recent neurologist prescribed me indomethacin, and I found that to be very helpful.  I wrote a positive evaluation for that migraine treatment, and I was surprised to see that this was a relatively rare treatment for migraines.  So, if people found commonly prescribed treatments to not be helpful, I think this site might be helpful for brainstorming alternatives.

I also reported 3 negative evaluations for drugs where I experienced moderate-to-severe side effects.  I noticed that severe side effects were self-reported for these drugs among 9-14% of members in the Community Reports (9% was comparable to other drugs that I checked, but the drug for which I had the most severe side effects in 2018 had the highest severe percentage, at 14%, and qualitatively the most frequent reports that seemed similar to my own experience).  That said, the most commonly prescribed migraine medication (which I never tried) had a reported severe side effect rate of ~20% (so, it seems to me that a self-reported "severe" side effect rate of 5-10% is normal, but 15% or 20% with hundreds or thousands of patients may be kind of high).  That said, I want to be very careful about being too negative about something that is not my area of expertise (even though the idea of something being helpful for some people and harmful for others seems relevant for genomics research).

Going back to the topic of my anti-depressant (for anxiety or depression, depending upon the time-frame of my treatment that you are talking about), the current maximum recommended dosage of Citalopram is 40 mg (with 60 mg now being considered unsafe), and that would match my own expectation (although for slightly different reasons - I had to drink coffee instead of tea due to extra drowsiness at 40 mg, and I am currently on 20 mg instead of 40 mg).  I can also see a 2016 indication from the FDA that 20 mg is the maximum recommended dose for individuals greater than 60 years of age (so, the maximum recommended dose is currently lower for older individuals).  You can also see more information about this drug in the 1998 drug approval package from the FDA.  To be clear, I am very grateful for the availability of Citalopram, and it has made a huge difference in my life, but I think this is something that may be worth discussing more (and I would probably also benefit from understanding better).

There has even been a Washington Post article describing a partnership between PatientsLikeMe and the FDA to help with drug reporting (I saw this in a recent e-mail from them, but the article is actually from 2015 - still, it is good to know other people probably have at least somewhat similar thoughts).  While they didn't mention PatientsLikeMe, I think this was also related to the topic of a more recent announcement regarding patients reporting "real-world evidence."  You can also report adverse events to the FDA through MedWatch.

Update Log:
3/9/2019: original blog post
3/26/2019: changed link in 1st paragraph (and added another link in that sentence).
6/28/2019: add MedWatch link

Tuesday, February 12, 2019

Variance Stabilization and Pseudocounts / Rounding Factors

To be fair, I first want to point out that most of this content was previously posted in an answer to a Biostars thread.

While I could tell that thread got a fair number of views overall, there wasn't much of a reaction (at least not shortly after posting something on Twitter).  While patience is a virtue, I discovered something when I started working with my own code that I didn't initially notice with the DESeq2 example code.  Plus, I thought some formatting differences might make things clearer, and it might help to separate this point from the slightly different topic of the original Biostars thread.

So, here are those (and additional) results presented in a slightly different way.

First, I thought it was interesting that the plot for rlog values looked a lot like log-transformed values with a pseudocount of 100 (from the meanSdPlot() function in the vsn package, as used in the DESeq2 vignette):



In other words, the middle plot above (log-transformation with a higher pseudocount) looks a lot more like the rlog plot (on the far right, above).  As a minor point, if you don't immediately recognize the plots above from the Biostars thread, that is because I reformatted them (to look more like the other plots that I create below).  In the original thread, I thought this was worth mentioning because the log-transformed expression values are independent for each sample, but the rlog values for a given sample will vary depending upon which other samples are processed (so, all other things being equal, I thought the independent per-sample normalization would be better).

However, what I didn't initially notice is what happens if I plot the regular mean versus SD:



In both cases, the higher pseudocount causes there to be less of a bump in the SD values (for lower count genes).  However, with this alternative plot, you can notice more of a difference between the log-transformed values and the rlog values (above, middle-plot versus right-plot).
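The general trend can be sketched with simulated data.  This is a minimal Python/NumPy illustration (not the R code from the DESeq2 vignette, which is linked below); the simulated counts and the pseudocount values of 1 and 100 are hypothetical choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small count matrix: 2000 genes x 6 samples, with gene-level
# mean expression spanning several orders of magnitude (hypothetical data).
n_genes, n_samples = 2000, 6
gene_means = rng.lognormal(mean=2.0, sigma=2.0, size=n_genes)
counts = rng.poisson(lam=np.tile(gene_means[:, None], (1, n_samples)))

def mean_vs_sd(counts, pseudocount):
    """Per-gene mean and SD of log2(count + pseudocount) values."""
    log_vals = np.log2(counts + pseudocount)
    return log_vals.mean(axis=1), log_vals.std(axis=1, ddof=1)

mean_pc1, sd_pc1 = mean_vs_sd(counts, 1)
mean_pc100, sd_pc100 = mean_vs_sd(counts, 100)

# The larger pseudocount flattens the transform for low counts,
# so the "bump" in SD for low-count genes shrinks.
low = mean_pc1 < np.median(mean_pc1)
print(sd_pc1[low].mean(), sd_pc100[low].mean())
```

With real data, the meanSdPlot() function in the vsn R package produces the ranked-mean-versus-SD version of this plot.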

I previously discussed fold-change values in the context of "rounding factors" (in this paper, and this blog post).  While the mean-versus-SD trend is similar, there are some differences from changing the order of operations (CPM with the rounding factor in the 2nd and 4th columns, with the qualitatively lowest increase in SD for the larger rounding factor in the 4th column):



I frequently find having independently calculated expression values (such as log2(FPKM+X) values) to be helpful in assessing results from differential expression programs (for each project).  However, to be fair, there are in fact some noticeable differences between the log-transformed expression with the higher pseudocount (or rounding factor) and the rlog values from DESeq2.
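To make the order-of-operations point concrete, here is a minimal Python/NumPy sketch (the simulated counts, library sizes, and the rounding factor of 10 are hypothetical values, not the ones from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical count matrix: 500 genes x 4 samples.
counts = rng.poisson(lam=rng.lognormal(2.0, 2.0, size=500)[:, None] * np.ones((1, 4)))
lib_sizes = counts.sum(axis=0)

x = 10.0  # hypothetical rounding factor

# Order 1: normalize to CPM first, then add the rounding factor.
cpm = counts / lib_sizes * 1e6
log_after = np.log2(cpm + x)

# Order 2: add the rounding factor to the raw counts first, then normalize.
log_before = np.log2((counts + x) / lib_sizes * 1e6)

# The two orders agree for high-count genes but diverge for low-count genes,
# because the same x is worth a different amount before versus after
# the per-million scaling.
print(np.abs(log_after - log_before).max())
```

The divergence at low counts is one way the mean-versus-SD plots can differ even when the overall trend looks similar.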

The code to create the plots above can be downloaded here.  As mentioned in the Biostars thread, I also owe a big thanks to Mike Love (because the 1st part of that code is copied from the DESeq2 vignette).

Friday, January 25, 2019

TopHat2 Really Isn't That Bad

TopHat is a highly cited RNA-Seq aligner.  Newer methods for alignment (and for quantification without alignment) are now available, so there is a debate about whether those newer (and also highly cited) methods should be used instead of the final version of TopHat2.

One of the authors on the original TopHat paper (although admittedly not on the TopHat2 paper, or on the HISAT paper, for that matter) has also discouraged people from using TopHat1.  To be precise, that tweet is about TopHat1 (and I would in fact use TopHat2).  However, the interpretation is often that HISAT/HISAT2 should completely replace TopHat2 (and that is what I strongly disagree with).

As pointed out in this article, the TopHat website does recommend use of HISAT2, as TopHat has been moved to a status of "low maintenance".  However, I usually prefer STAR and/or TopHat2 over HISAT2, so I still don't completely agree (for reasons explained in this post).

Since I kept bringing it up on Biostars, I thought it might be good to have a brief blog post with some points that I include in multiple responses (in terms of why I think TopHat2 can still be useful):

1) I've had situations where the more conservative alignment from TopHat2 was arguably useful to avoid alignments to unintended sequences and/or contamination.

2) I've had situations where unaligned reads from TopHat2 were useful for conversion back to FASTQ for use in RNA-Seq de novo assembly (such as for exogenous sequence and/or pathogen identification).

3) For gene expression, if I see something potentially strange with a TopHat2 alignment (for endogenous gene expression), I have yet to find an example where the trend was substantially different with STAR (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count is usually similar between TopHat2 and STAR single-end alignments for most genes).

There was only a small subset of projects where I confirmed that the conclusion was not different with a STAR alignment, but you can see this acknowledgement to get some idea about the total number of samples tested (although I apologize that you have to scroll down to see the names, where the name count per protocol is correlated with the total number of projects).

I also have a more recent blog post referencing some comparisons where I tested recovery of the altered gene in a knock-out or over-expression dataset, where I start with a TopHat2 alignment.

4) At least for 51 bp Single-End Reads (with 4 cores and 8 GB of RAM per sample, on a cluster), I wouldn't consider the difference in run-time for TopHat2 (versus STAR) to be prohibitive (I think it is usually within a few hours, running samples in parallel).  I believe the difference is a bigger issue with paired-end reads, but I don't encounter those as often.

That said, I am usually running TopHat2 with the --no-coverage-search parameter, which can noticeably decrease the run-time.
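For reference, a single-end TopHat2 run along those lines might look like the following (this is a hypothetical example invocation; the index prefix, annotation file, output directory, and FASTQ name are placeholders, not from any specific project):

```shell
# Hypothetical TopHat2 invocation for 51 bp single-end reads.
# -p 4 matches the 4-cores-per-sample setup mentioned above;
# --no-coverage-search skips the coverage-based junction search,
# which can noticeably decrease the run-time.
tophat2 -p 4 --no-coverage-search \
    -G genes.gtf \
    -o tophat_out/sampleA \
    genome_bowtie2_index \
    sampleA.fastq.gz
```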

Here are a couple possibly relevant Biostars posts:



Also, there are some points that were brought up after the initial post:

  1. predeus had a good comment essentially mentioning that the STAR and HISAT parameters can be changed to be more conservative (for either method) and/or provide unaligned reads (for STAR).  This may be worth considering during troubleshooting.
  2. Bastien HervĂ© had a good comment that the TopHat website says "Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way."  I can now see possible concerns about how someone may think I was trying to imply that the developers agreed with me, when I need to be representing my opinions as independent ideas.  So, I apologize about that.
    • That said, I think it is important to emphasize that programs can have useful applications that may not have been planned by developers, and I think the above points about single-end 50 bp RNA-Seq samples are still relevant.
    • Also, I have even had a period of time when I recommended people use minfi instead of COHCAP, even though I am the COHCAP developer / maintainer.  In fact, I think that was even on the main page for the standalone version, although I am having a hard time finding the ~2015 discussion group question where that was my answer.  However, I currently believe that I may have shortchanged my own package (due to the greater popularity of minfi, and one benchmark showing similar validation for COHCAP versus minfi/bumphunter).
      • It is more of a side note, but the COHCAP paper is also relevant to the discussion of not implying agreement, since one of the comments acknowledges that I shouldn't have used "City of Hope" in the algorithm name.


I will probably continue to update this in the future, but I wanted to have some place to save my thoughts in the meantime.

Update Log:

1/25/2019 - modified formatting / wording (decided to add a note about this retroactively on 1/29)
1/29/2019 - not directly an update to this blog, but there was a response to the points that I proposed here: https://www.biostars.org/p/359974/#361003
1/30/2019 - added point about run-time (and specifically reference extra comment about STAR/HISAT parameters)
2/26/2019 - Added link to Biostars post about lower alignment rate (for TopHat2 versus Bowtie2)
2/27/2019 - Add point about COHCAP and not implying agreement
3/2/2019 - Add link to Twitter response
6/17/2019 - Mention that I was using the --no-coverage-search parameter
7/5/2019 - qualify difference in gene expression trends and remove splicing event (since I could also identify that gene with a later version of rMATS and a STAR alignment)
3/18/2020 - add link to note on TopHat website, as well as a link about more recent benchmarks
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.