Saturday, May 4, 2019

precisionFDA and Custom Scripts for Variant Comparisons

After posting this reply to a tweet, I thought it might be a good idea to separate some of the points that I was making about comparing genotypes for the same individual (from this DeepVariant issue thread).

For those who might not know, precisionFDA provides a way to compare and re-analyze your data for free.  You need to create an account, but I could do so with a Gmail address (and an indication that you have data to upload).  I mostly show results for comparing .vcf files (either directly provided from different companies, or created via command line outside of precisionFDA).

I needed to do some minor formatting with the input files, but I provided this script to help others to the same.  I also have another script that I was using to compare .vcf files.

For the blog post, I'll start by describing the .vcf files provided from the different companies.  If readers are interested, I also have some messy notes in this repository (and subfolders), and I have raw data and reports saved on my Personal Genome Project page.

For example, this is the results for the SNPs from my script (comparing recovery of variants in my Genos Exome data within my Veritas WGS data):

39494 / 41450 (95.3%) full SNP recovery
39678 / 41450 (95.7%) partial SNP recovery

My script also compares indels (as you'll see below), but I left that out this time (because Veritas used freebayes, and I didn't convert between the two indel formats).

I defined "full" recovery as having the same genotype (such as "0/1" and "0/1", for a variant called as heterozygous by both variant callers).  I defined  "partial" recovery as having the same variant, but with a different zygosity (so, a variant at the same position, but called as "0/1" in one .vcf but called as "1/1" in the other .vcf would be a "partial" recovery but not a "full" recovery).

You can also see that same comparison in precisionFDA here (using the RefSeq CDS regions for the target regions), with a screenshot shown below:

So, I think these two strategies complement each other in terms of giving you slightly different views about your dataset.

If I re-align my reads with BWA-MEM and call variants with GATK (using some non-default parameters, like removing soft-clipped bases; similar to shown here, for Exome file, but not WGS, and I used GATK version 3.x instead of 4.x) and filter for high-quality reads (within target regions), these are what the results look like (admittedly, using an unfiltered set of GATK calls to test recovery in my WGS data):

Custom Script:

20765 / 21141 (98.2%) full SNP recovery
20872 / 21141 (98.7%) partial SNP recovery
243 / 258 (94.2%) full insertion recovery
249 / 258 (96.5%) partial insertion recovery
208 / 228 (91.2%) full deletion recovery
213 / 228 (93.4%) partial deletion recovery


Since I was originally describing DeepVariant, I'll also show those as another comparison using re-processed data (with variants called from a BWA-MEM re-alignment):

Custom Script:

51417 / 54229 (94.8%) full SNP recovery
53116 / 54229 (97.9%) partial SNP recovery
1964 / 2391 (82.1%) full insertion recovery
2242 / 2391 (93.8%) partial insertion recovery
2058 / 2537 (81.1%) full deletion recovery
2349 / 2537 (92.6%) partial deletion recovery


So, one thing that I think is worth pointing out is that you can get better concordance if you re-process the data (although the relative benefits are a little different for the two strategies provided above).

Also, in terms of DeepVariant, I was a little worried about over-fitting, but that was not a huge issue (I think it was more like an unfiltered set of GATK calls, but requiring more computational resources).  Perhaps that doesn't sound so great, but I think it is quite useful to the community to have a variety of freely available programs; for example, if DeepVariant happened to be a little better at finding the mutations for your disease, that could be quite important for your individual sample.  Plus, I got a $300 Google Cloud credit, so it was effectively free for me to use on the cloud.

As a possible point of confusion, I am encouraging people to use precisionFDA to compare (and possibly re-analyze) new data.  However, there was also a precisionFDA competition.  While I should credit DeepVariant to cause me to test out the precisionFDA interface, my opinion is that the ability to make continual comparisons may actually be more important than that competition from a little while ago.  For example, I think different strategies with high values should be comparable (not really one being a lot better than the others, as might be implied from having a "winner"), and it should be noted that that competition focused on regions where "they were confident they could call variants accurately"  Perhaps that explains part of why the metrics are higher than my data (within RefSeq CDS regions)?  Plus, I would encourage you to "explore results" for that competiation to see statistics for subsets of variants, where I think the func_cds group may be more comparable to what I performed (or at least gives you an idea of how rankings can shuffle with a subset of variants that I would guess are more likely to be clinically actionable).

No comments:

Post a Comment

Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.