Saturday, May 4, 2019

precisionFDA and Custom Scripts for Variant Comparisons

After posting this reply to a tweet, I thought it might be a good idea to separate some of the points that I was making about comparing genotypes for the same individual (from this DeepVariant issue thread).

For those who might not know, precisionFDA provides a way to compare and re-analyze your data for free.  You need to create an account, but I could do so with a Gmail address (and an indication that you have data to upload).  I mostly show results for comparing .vcf files (either directly provided from different companies, or created via command line outside of precisionFDA).

I needed to do some minor formatting of the input files, but I provided this script to help others do the same.  I also have another script that I was using to compare .vcf files.

For the blog post, I'll start by describing the .vcf files provided from the different companies.  If readers are interested, I also have some messy notes in this repository (and subfolders), and I have raw data and reports saved on my Personal Genome Project page.

For example, these are the results for the SNPs from my script (comparing recovery of variants in my Genos Exome data within my Veritas WGS data):

39494 / 41450 (95.3%) full SNP recovery
39678 / 41450 (95.7%) partial SNP recovery

My script also compares indels (as you'll see below), but I left that out this time (because Veritas used freebayes, and I didn't convert between the two indel formats).

I defined "full" recovery as having the same genotype (such as "0/1" in both files, for a variant called as heterozygous by both variant callers).  I defined "partial" recovery as having the same variant but with a different zygosity (so, a variant at the same position that was called as "0/1" in one .vcf but "1/1" in the other .vcf would be a "partial" recovery but not a "full" recovery).
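As a minimal sketch of this classification (not my actual comparison script; it assumes unphased GT strings from the .vcf files, and the function name is just for illustration):

```python
# Hypothetical sketch of the "full" vs. "partial" recovery logic described
# above, comparing unphased VCF GT strings (e.g. "0/1") at a matched position.

def classify_recovery(gt_a, gt_b):
    """Compare genotypes for the same position in two .vcf files."""
    alleles_a = set(gt_a.split("/")) - {"0", "."}
    alleles_b = set(gt_b.split("/")) - {"0", "."}
    if not (alleles_a and alleles_b) or alleles_a != alleles_b:
        return "not recovered"   # different (or no) alternate alleles
    if gt_a == gt_b:
        return "full"            # same variant, same zygosity
    return "partial"             # same variant, different zygosity

print(classify_recovery("0/1", "0/1"))  # full
print(classify_recovery("0/1", "1/1"))  # partial
```

Phased genotypes (like "0|1") would need a little extra handling, but this captures the distinction being counted in the recovery percentages above.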

You can also see that same comparison in precisionFDA here (using the RefSeq CDS regions for the target regions), with a screenshot shown below:

[precisionFDA comparison screenshot]
So, I think these two strategies complement each other in terms of giving you slightly different views about your dataset.
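Both strategies restrict the comparison to target regions (the RefSeq CDS regions, in my case).  As a minimal sketch of that kind of interval check (with made-up coordinates and a hypothetical function name, not my actual script):

```python
# Sketch of restricting variants to target regions (e.g. RefSeq CDS),
# assuming BED-style (chrom, start, end) intervals with a 0-based start
# and 1-based end, and 1-based VCF variant positions.

def in_target_regions(chrom, pos, regions):
    """Return True if a 1-based variant position falls in any target interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)

regions = [("chr1", 100, 200), ("chr2", 500, 600)]  # hypothetical targets
print(in_target_regions("chr1", 150, regions))  # True
print(in_target_regions("chr1", 250, regions))  # False
```

In practice a tool like bedtools (or an interval tree, for large files) would be used instead of a linear scan, but the coordinate convention is the part that is easy to get wrong.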

If I re-align my reads with BWA-MEM and call variants with GATK (using some non-default parameters, like removing soft-clipped bases; similar to what is shown here for the Exome file, but not the WGS file, and using GATK version 3.x instead of 4.x) and filter for high-quality reads (within target regions), these are what the results look like (admittedly, using an unfiltered set of GATK calls to test recovery in my WGS data):

Custom Script:

20765 / 21141 (98.2%) full SNP recovery
20872 / 21141 (98.7%) partial SNP recovery
243 / 258 (94.2%) full insertion recovery
249 / 258 (96.5%) partial insertion recovery
208 / 228 (91.2%) full deletion recovery
213 / 228 (93.4%) partial deletion recovery

precisionFDA:

[precisionFDA comparison screenshot]
Since I was originally describing DeepVariant, I'll also show those results as another comparison using re-processed data (with variants called from a BWA-MEM re-alignment):

Custom Script:

51417 / 54229 (94.8%) full SNP recovery
53116 / 54229 (97.9%) partial SNP recovery
1964 / 2391 (82.1%) full insertion recovery
2242 / 2391 (93.8%) partial insertion recovery
2058 / 2537 (81.1%) full deletion recovery
2349 / 2537 (92.6%) partial deletion recovery

precisionFDA:

[precisionFDA comparison screenshot]
So, one thing that I think is worth pointing out is that you can get better concordance if you re-process the data (although the relative benefits are a little different for the two strategies provided above).

Also, in terms of DeepVariant, I was a little worried about over-fitting, but that was not a huge issue (I think it was more like an unfiltered set of GATK calls, but requiring more computational resources).  Perhaps that doesn't sound so great, but I think it is quite useful to the community to have a variety of freely available programs; for example, if DeepVariant happened to be a little better at finding the mutations for your disease, that could be quite important for your individual sample.  Plus, I got a $300 Google Cloud credit, so it was effectively free for me to use on the cloud.

As a possible point of confusion, I am encouraging people to use precisionFDA to compare (and possibly re-analyze) new data.  However, there was also a precisionFDA competition.  While I should credit DeepVariant with prompting me to test out the precisionFDA interface, my opinion is that the ability to make continual comparisons may actually be more important than that competition from a little while ago.  For example, I think different strategies with similarly high values should be considered comparable (rather than one being a lot better than the others, as might be implied by having a "winner"), and it should be noted that that competition focused on regions where "they were confident they could call variants accurately".  Perhaps that explains part of why those metrics are higher than for my data (within RefSeq CDS regions)?  Plus, I would encourage you to "explore results" for that competition to see statistics for subsets of variants, where I think the func_cds group may be more comparable to what I performed (or at least gives you an idea of how rankings can shuffle with a subset of variants that I would guess are more likely to be clinically actionable).

Saturday, March 9, 2019

Updated Thoughts on PatientsLikeMe

I have a previous post about PatientsLikeMe, but, importantly, I did not test creating an account until relatively recently.  I have been continually trying to improve my habits in terms of taking more time to critically assess results and question prior assumptions (in addition to realizing, in retrospect, that I may not have had the best title for that previous blog post), so I thought there would be value in providing an updated perspective on this free website.

I have genomics / medical data publicly available to download on my Personal Genome Project page (for hu832966) and I have what I would consider a partial electronic medical record on my PatientsLikeMe page (which I think is an excellent resource for sharing and learning about patient experiences, with the requirement that everybody who participates be completely open; however, you have to sign in with a free account to view my profile).

For those that currently don't have PatientsLikeMe accounts, I thought I should describe a few of my experiences (from the perspective of a patient):

I have taken Citalopram at doses of 20 mg and 40 mg (and 0 mg, during intervals to test the continued benefit of the medication, when my overall stress levels were lowered and/or I learned better cognitive strategies to manage stress).  While it makes quick analysis more difficult, I think being able to see the details of people's experiences can be important.  For example, I thought it was interesting that my body's reaction to the medication seemed to change over time (each time I went back on the medication, I think the side effects were more subtle, even though the severity of my initial symptoms also gradually improved over time).  If this is in fact true, that would indicate some resistance / reaction that could not be completely captured by studying germline variants (such as somatic variants, epigenetic modifications, etc.), if you are focusing on using DNA genotyping/sequencing for medication guidance.

I also like that PatientsLikeMe provides scores for both effectiveness and side-effects (and I admittedly created a PatientsLikeMe account because some plots in the "Health Communities" in 23andMe reminded me of what I had seen for PatientsLikeMe, even without previously creating a PatientsLikeMe account).

On the positive side, I have seen multiple neurologists, and I had not really found any of the previous migraine medications that I took to be helpful.  However, my most recent neurologist prescribed indomethacin, and I found that to be very helpful.  I wrote a positive evaluation for that migraine treatment, and I was surprised to see that it was a relatively rare treatment for migraines.  So, if people find commonly prescribed treatments unhelpful, I think this site might be useful for brainstorming alternatives.

I also reported 3 negative evaluations for drugs where I experienced moderate-to-severe side effects.  I noticed that severe side effects were self-reported for these drugs among 9-14% of members in the Community Reports (9% was comparable to other drugs that I checked, but the drug for which I had the most severe side effects in 2018 had the highest "severe" percentage of 14%, along with the qualitatively most frequent reports that seemed similar to my own experience).  That said, the most commonly prescribed migraine medication (which I never tried) had a reported severe side effect rate of ~20% (so, it seems to me that a self-reported "severe" side effect rate of 5-10% is normal, but 15% or 20% with hundreds or thousands of patients may be kind of high).  Nevertheless, I want to be very careful about being too negative about something that is not my area of expertise (even though the idea of something being helpful for some people and harmful for others seems relevant for genomics research).

Going back to the topic of my anti-depressant (for anxiety or depression, depending upon the time-frame of my treatment that you are talking about), the current maximum recommended dosage of Citalopram is 40 mg (with 60 mg now being considered unsafe), and that would match my own expectation (although for slightly different reasons - I had to drink coffee instead of tea due to extra drowsiness at 40 mg, and I am currently on 20 mg instead of 40 mg).  I can also see a 2016 indication from the FDA that 20 mg is the maximum recommended dose for individuals greater than 60 years of age (so, the maximum recommended dose is currently lower for older individuals).  You can also see more information about this drug in the 1998 drug approval package from the FDA.  To be clear, I am very grateful for the availability of Citalopram, and it has made a huge difference in my life, but I think this is something that may be worth discussing more (and that I would probably also benefit from understanding better).

There has even been a Washington Post article describing a partnership between PatientsLikeMe and the FDA to help with drug reporting (I saw this in a recent e-mail from them, but the article is actually from 2015 - still, it is good to know other people probably have at least somewhat similar thoughts).  While they didn't mention PatientsLikeMe, I think this was also related to the topic of a more recent announcement regarding patients reporting "real-world evidence."  You can also report adverse events to the FDA through MedWatch.

Update Log:
3/9/2019: original blog post
3/26/2019: changed link in 1st paragraph (and added another link in that sentence).
6/28/2019: added MedWatch link

Tuesday, February 12, 2019

Variance Stabilization and Pseudocounts / Rounding Factors

To be fair, I first want to point out that most of this content was previously posted in an answer to a Biostars thread.

While I could tell that thread got a fair number of views overall, there wasn't much of a reaction (at least not shortly after posting something on Twitter).  While patience is a virtue, I discovered something when I started working with my own code that I didn't initially notice with the DESeq2 example code.  Plus, I thought some formatting differences may make things more clear, and it might help to separate this point from the slightly different topic of the original Biostars thread.

So, here are those (and additional) results presented in a slightly different way.

First, I thought it was interesting that the plot for rlog values looked a lot like log-transformed values with a pseudocount of 100 (from the meanSdPlot() function in the vsn package, as used in the DESeq2 vignette):

[meanSdPlot panels: log2-transformed counts with two pseudocounts, and rlog values]
In other words, I am saying the middle plot above (log-transformation with a higher pseudocount) looks a lot more like the rlog plot (on the far right, above).  As a minor point, it may also be worth mentioning that I reformatted the plot (to look more like the other plots that I create below), if you don't immediately recognize the plots above from the Biostars thread.  In the original thread, I thought this was worth mentioning because the log-transformed expression values are independent for each sample, but the rlog values for a given sample will vary depending upon what other samples are processed (so, all other things being equal, I thought the independent per-sample normalization would be better).

However, what I didn't initially notice is what happens if I plot the regular mean versus SD:

[mean versus SD plots for the log2-transformed and rlog values]
In both cases, the higher pseudocount causes there to be less of a bump in the SD values (for lower count genes).  However, with this alternative plot, you can notice more of a difference between the log-transformed values and the rlog values (above, middle-plot versus right-plot).
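As a rough illustration of this effect (using simulated counts rather than the DESeq2 example data, so the exact values are meaningless):

```python
# Rough numpy sketch of the mean-versus-SD comparison described above:
# log2-transform simulated counts with two different pseudocounts and
# compare the per-gene SD for the lower-count genes.

import numpy as np

rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.02, size=(1000, 6))  # "genes" x "samples"

sd_by_pseudocount = {}
for pseudocount in (1, 100):
    log_expr = np.log2(counts + pseudocount)
    gene_mean = log_expr.mean(axis=1)
    gene_sd = log_expr.std(axis=1)
    low = gene_mean < np.median(gene_mean)       # focus on lower-expression genes
    sd_by_pseudocount[pseudocount] = gene_sd[low].mean()

# the higher pseudocount shrinks the SD for the lower-count genes
print(sd_by_pseudocount)
```

The compression happens because log2(x + 100) changes much more slowly than log2(x + 1) when x is small, so sampling noise in low counts contributes less to the per-gene SD.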

I previously discussed fold-change values in the context of "rounding factors" (in this paper, and this blog post).  While the mean-versus-SD trend is similar, there are some differences from changing the order of operations (CPM with rounding factor in 2nd and 4th columns, with the qualitatively lowest increase in SD with the larger rounding factor in the 4th column):

[mean versus SD plots: CPM with rounding factors, varying the order of operations]
I frequently find having independently calculated expression values (such as log2(FPKM+X) values) helpful for assessing results from differential expression programs (for each project).  However, to be fair, there are in fact some noticeable differences between the log-transformed expression with the higher pseudocount (or rounding factor) and the rlog values from DESeq2.
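For what it's worth, the kind of independent per-sample calculation I mean can be sketched as follows (with made-up counts and an illustrative rounding factor, not my actual values):

```python
# Sketch of independently normalized per-sample expression values along
# the lines of the log2(CPM + X) values mentioned above: the rounding
# factor X is added after the per-million scaling, and each sample's
# values depend only on that sample's own library size.

import numpy as np

counts = np.array([[10, 0, 250],
                   [1200, 900, 3000]], dtype=float)  # genes x samples
rounding_factor = 0.1

cpm = counts / counts.sum(axis=0) * 1e6   # counts per million, per sample
log_cpm = np.log2(cpm + rounding_factor)  # unchanged if other samples are added
print(np.round(log_cpm, 2))
```

Unlike rlog, these values never change when samples are added to or removed from the cohort, which is the independence property mentioned earlier.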

The code to create the plots above can be downloaded here.  As mentioned in the Biostars thread, I also owe a big thanks to Mike Love (because the 1st part of that code is copied from the DESeq2 vignette).

 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.