Monday, August 26, 2019

Comments on Other Papers (outside Disqus) and Positive Examples of Corrections/Retractions

It is easy to find all of my comments through the Disqus comment system.  However, it is more difficult to find other comments that I have made.  So, I thought it might be good to organize some of those here (and note when there are formal corrections):

Comments with Successful Follow-Up:

Kreutz et al. 2020 (Bioinformatics) - correct typo in title after PubPeer comment

Martin et al. 2019  (Nature Genetics) - formal erratum, and earlier PubPeer comment (one typo, one suggestion that Figure S12 makes absolute accuracy more clear)

Jonsson et al. 2019 (Nature) - formal correction following article comment

Weedon et al. 2019 (bioRxiv pre-print) - extra data (including my own) was added after comment

Zhang et al. 2019 (PLOS Computational Biology) - I had a comment that I believe was able to be corrected before the final version of the paper (although the paper did later have a formal correction)

Pique-Regi et al. 2019 (bioRxiv pre-print) - discussion helped fix links in pre-print

BLAST reference issue reported to NCBI (caused by a number of different papers not recognizing cross-contamination; I believe an e-mail was sent showing incorrect annotations for a limited number of example sequences).  It looks like the problem was not permanently fixed (similar problems appeared with new data from other studies).  However, through whatever mechanism, the incorrect top PhiX BLAST hits were removed - I am not sure whether my report was the actual cause of the correction, but I posted about this on Biostars while it was a problem

Multiple but Primarily Minor Errors (not currently fixed):

Yizhak et al. 2019 - blog post (errors and suggestions) and PubPeer comment (at least one clear typo); I also posted an eLetter describing the two clearest errors; Science paper

Arguably Gives Reader Wrong Impression:

Li et al. 2021 - there are a number of things that I came to understand better by participating in the discussion process.  My original intention was to add a Disqus-style comment on the journal article, but no comment system was provided, so I posted a PubPeer comment instead.

Essentially, I think I would prefer a SNP chip (or Exome, higher-coverage WGS, Amplicon-Seq, etc.) over lcWGS (especially at the 0.5x to 1x coverage presented in this paper), and I think the value of directly measured SNP chip genotypes is being underestimated.  I agree that lcWGS imputations are often preferable to SNP chip imputations, but I think the inclusion of imputed SNP chip genotypes may not be made sufficiently clear to the broader audience.

I also mention problems with consumer lcWGS products.

Maya et al. 2020 - if I understand the paper correctly, the study does not directly investigate COVID-19 infections (the authors look for associations near ACE2 or TMPRSS2, but in other contexts).  Please scroll to the bottom of the page to see my comment.

Nature summary of PLOS ONE paper - the image shows the input (not the output) of encoding and attempted recovery; unlike Nature research reports, I didn't see a Disqus comment system (but I did mention something on Twitter)

Homburger et al. 2019 - I had enough problems with my lcWGS data (which was exceptionally low coverage for my Color data) that I would not recommend its use.  I realize that the sample size is limited (just my own data).  However, I posted my concerns on a pre-print for this article.  I think this is borderline for the previous category ("Data Contradictory to Main Conclusions," which would be more appropriate for a retraction), but I would need more data to conclude that.  I also posted a similar response on PubPeer.

I think this blog post on being able to identify myself from my data can give some sense of the concerns about the very low-coverage WGS reads that I received from Color.

Singer et al. 2019 - eDNA paper with multiple comments indicating concerns; I specifically found that there were extra PhiX reads in the NovaSeq samples.  As mentioned in the comment, I have an "answer" on a Biostars discussion more broadly related to PhiX (with the temporary success story about the BLAST database mentioned earlier in this post).

The author provided a helpful response indicating that the PhiX reads should be removed by DADA2 during downstream analysis.  However, I think we both agree that PhiX spike-ins don't have a barcode.  So, if you view that as extra cross-contamination in some samples, then this leaves the question of what else could be in the NovaSeq samples that might be harder to flag for removal.  Time permitting, I am still looking into this.

Minor Typos:

Yuan and Bar-Joseph 2019 - PubPeer comment about MSigDB citation (as GSEA)

Chen et al. 2013 - comment about typos in PLOS comment system

Doorslaer and Burk 2010 - PubPeer comment

Other:

Börjesson et al. 2022 - PubPeer question/comment

Andrews et al. 2021 - PubPeer question (because Nature Methods doesn't have a Disqus comment system; also, public acknowledgement of misunderstanding on my part)

Antipov et al. 2020 - PubPeer comment (comments left in Word document for Supplemental Figure S2)

Robitaille et al. 2020 - PubPeer comment (question, posted there due to lack of comment system)


Older et al. 2019 - comment in PLOS comment system.

Young 2019 - comment in PLOS comment system

Choonoo et al. 2019 - comment in PLOS comment system

Paz-Zulueta et al. 2018 - PubPeer comment

Mirabello et al. 2017 - PubPeer comment

Shen-Gunther et al. 2017 - PubPeer comment

Robinson and Oshlack 2010 - PubPeer question (in the interests of fairness, I believe this reflects misunderstanding on my part)

Munoz et al. 2003 - PubPeer question (I didn't find any errors in the paper, but I wondered why data was presented in a particular way, and journal didn't have a comment system)

I also have some notes on limits to the precision of genomics results available to the public (which I would probably define more as a "product" than a "paper"), among this collection of posts.

Given that it is harder for me to find these (compared to my Disqus comments), I will probably add more in the future (the comments in the "other" section are not necessarily negative - I will just probably forget them if I don't save a link here).  Also, I believe increased participation in such comment systems may make them as effective as a formal "minor comment" (particularly if the journal doesn't go back and change the original publication after a formal correction).

Checking / Correcting Citations of My 1st Author Papers

Silva et al. 2022 - journal comment

Xu et al. 2021 - PubPeer comment

Wojewodzic and Lavender et al. 2021 - preprint comment (hopefully, can be corrected before peer-reviewed publication)

Roudko et al. 2020 - journal comment (please scroll down, past references, to see comment)

Nayak et al. 2019 - Disqus comment (on pre-print)

Borchmann et al. 2019 - Disqus comment (on pre-print)

Youness and Gad 2019 - PubPeer comment

Muller et al. 2019 - Twitter comment

Bogema et al. 2018 - PubPeer comment

Einarsdottir et al. 2018 - PubPeer comment

Pranckeniene et al. 2018 - journal comment (please scroll down, past references, to see comment)

Debniak et al. 2018 - journal comment

Hu et al. 2016 - PubPeer comment (probably due to confusion on the Bioconductor page)

Wong and Chang 2015 - PubPeer comment

Wockner et al. 2015 - PubPeer comment (more of a true comment than a post-publication review for an error)

As mentioned in a couple of other posts (general and COH-specific), I am trying to correct previous errors in my papers, and I think having 4 first-author (or equivalent) papers in 2013 was probably not ideal (in terms of needing to take more time to carefully review each paper).

However, I also want to show a relatively long list of corrections / retractions (even though they still make up a subset of the total public record), with some greater emphasis on high-impact papers and/or papers with a large number of authors.  To be fair, I may not know the details of the individual corrections or retractions.  However, I hope this will encourage future researchers to be brave and responsible and to report / correct problems as soon as they discover them.  As a best-case scenario, I believe that it is important for prestigious labs to lead by example (so that everybody represents themselves fairly and finds the best long-term fit).  We do not want to encourage people to hesitate to report problems out of fear that they will lose funding and/or collaborations with peers (or out of fear that those who do worse work but admit errors less frequently and/or oversell results will be chosen over those who present more realistic expectations).  I am very confident these are achievable goals, but I think some topics may need relatively more frequent discussion.


Other Positive Examples of Corrections:

I am trying to collect examples for high-impact papers/journals and/or consortiums.

Correction for Collins et al. 2020 (Nature paper, The Genome Aggregation Database Consortium paper)

Correction for Karczewski et al. 2020 (Nature paper, The Genome Aggregation Database Consortium paper)

Correction for Minikel et al. 2020 (Nature paper, The Genome Aggregation Database Consortium paper)

Correction for Wang et al. 2020 (Nature Communications paper, The Genome Aggregation Database Consortium paper)

Correction for Whiffin et al. 2020 (Nature Communications paper, The Genome Aggregation Database Consortium paper)

Correction for Cortés-Ciriano et al. 2020 (Nature Genetics paper)

Retraction for Cho et al. 2019 (Science / Nobel Laureate paper; I learned about from Retraction Watch)

Retraction for Wei and Nielsen 2019 (Nature Medicine paper)
--> Post-publication review indicated on Twitter in advance
--> Also includes retraction of the "News & Views" article

Correction to Kelleher et al. 2019 (Nature Genetics paper)

Correction to Grishin et al. 2019 (Nature Biotechnology paper)

Publisher Correction to Exposito-Alonso et al. 2019 (Nature paper, as well as "500 Genomes Field Experiment Team" consortium paper)

Correction to Bolyen et al. 2019 - (Nature Biotechnology paper; correction for the QIIME 2 paper)

Correction to Krusche et al. 2019 - (Nature Biotechnology paper; GA4GH Small Variant Benchmarking)

Retraction of Kaidi et al. 2019 (Nature paper; positive in the sense that the author resigned and admitted fabrication rather than denying it; I learned about from Retraction Watch)

Retraction of Kaidi et al. 2019 (Science paper; positive in the sense that the author resigned and admitted fabrication rather than denying it; I learned about from Retraction Watch)

Correction to Ravichandran et al. 2019 (Genetics in Medicine paper; added conflicts of interest)

Erratum to Jiang et al. 2019 (one affiliation issue in a Nature Communications paper with a large number of authors)

Correction to Ferdowsi et al. 2018 - (mixed up Figures; Scleroderma Clinical Trials Consortium Damage Index Working Group paper)

Correction to Sanders et al. 2018 - (Nature Neuroscience paper; Whole Genome Sequencing for Psychiatric Disorders (WGSPD) consortium)

2 corrections to Gandolfi et al. 2018 paper on cat SNP chip array

Corrigendum to Oh et al. 2018 (99 Lives Consortium paper)

Correction to (a different) Gandolfi et al. 2018

Correction to Vijayakrishnan 2018 (PRACTICAL Consortium included in the author list)

Correction to Matejcic et al. 2018 (another PRACTICAL Consortium paper)

Correction to Went et al. 2018 (another PRACTICAL Consortium paper)

Correction to Schumacher et al. 2018 (Nature Genetics paper, several consortium listed as authors)

Correction to Mancuso et al. 2018 (another PRACTICAL Consortium paper)

Correction to Armenia et al. 2018 (affiliation correction; Nature Genetics paper; Stand Up To Cancer (SU2C) Consortium paper)

Withdrawal of Werling 2018 Review (I learned about from Retraction Watch)

Correction to Sud et al. 2017 (another PRACTICAL Consortium paper)

Retraction and Replacement of Favini et al. 2017 (JAMA paper; I learned about from Retraction Watch)

Corrigendum to McHenry et al. 2017 (Nature Neuroscience paper; I learned about from Retraction Watch)

2 Errata for Teng et al. 2016 (Genome Biology paper; I found it from this tweet, describing an error identified and corrected by another scientist through post-publication review; one Erratum is really a subset of the other)

Correction to Nik-Zainal et al. 2016 (Nature paper; typo not fixed in the on-line article)

Erratum to Aberdein et al. 2016 (another 99 Lives Consortium paper)

2 corrections to Vinik 2016 (NEJM paper; I learned about from Retraction Watch)

Retraction of Jia et al. 2016 (Nature Chemistry paper + Nobel Laureate author; I learned about from Retraction Watch)

Retraction of Colla et al. 2016 (JAMA paper; I learned about from Retraction Watch)

Erratum to Kocher et al. 2015 (Genome Biology paper)

Retraction of Zhang et al. 2015 (Nature paper; I learned about from Retraction Watch)

2 Retractions (?) to Akakin et al. 2015 (Journal of Neurosurgery paper; I learned about from Retraction Watch)

Addendum to Koh et al. 2015 (Nature Methods paper; figures were reversed, creating mirror images)

Retraction and Replacement for Hollon et al. 2014 (JAMA Psychiatry paper; I learned about from Retraction Watch)

Retraction of Kitambi et al. 2014 (Cell paper; I learned about from Retraction Watch)

Correction to Huang et al. 2014 (Nature paper; I learned about from Retraction Watch)

Retraction and Replacement of Li et al. 2014 (The Lancet; I learned about from Retraction Watch)

Retraction and Replacement of Siempos et al. 2014 (The Lancet Respiratory Medicine paper; I learned about from Retraction Watch)

Correction to Xu et al. 2014 (eLife paper; I learned about from Retraction Watch)

Retraction of Lortez et al. 2014 (Science paper; I learned about from Retraction Watch)

Retraction of De la Herrán-Arita et al. 2014 (Science Translational Medicine paper; I learned about from Retraction Watch)

Retraction for Dixson et al. 2014 (PNAS paper; I learned about from Retraction Watch)

Retraction of Garcia-Serrano and Frankignoul 2014 (Nature Geoscience paper; I learned about from Retraction Watch)

Retraction of Amara et al. 2013 (JEM paper; I learned about from Retraction Watch)

Retraction of Maisonneuve et al 2013 (Cell paper; I learned about from Retraction Watch)

Retraction of Yi et al. 2013 (Cell paper; I learned about from Retraction Watch)

Retraction of Nandakumar et al. 2013 (PNAS paper; I learned about from Retraction Watch)

Retraction of Venters and Pugh 2013 (Nature paper; I learned about from Retraction Watch, describing the 6th Nature retraction in 2014)

Retraction of Maisonneuve et al 2011 (PNAS paper; I learned about from Retraction Watch)

Retraction of Frede et al. 2011 (Blood paper; I learned about from Retraction Watch)

Retraction of Olszewski et al. 2010 (Nature paper; mentioned in this Retraction Watch article)

Correction to Werren et al. 2010 (Science paper; correction not actually indexed in PubMed?)

Retraction of Bill et al. 2010 (Cell paper; I learned about from Retraction Watch)

Retraction of Wang et al. 2009 (Nature paper; I learned about from Retraction Watch)

Retraction of Litovchick and Szostak 2008 (PNAS paper + Nobel Laureate author; I learned about from Retraction Watch)

Retraction of Okada et al. 2006 (Science paper; I learned about from Retraction Watch)

Withdrawal of Ruel et al. 1999 (17 years post-publication; JBC paper; I learned about from Retraction Watch)

I think this last category is really important (even though it is still nowhere near complete - these are just some representative examples that I know about).

Change Log:

8/26/2019 - public post date
8/27/2019 - add sentence about increased comment participation being more like a formal comment (in some situations)
8/29/2019 - minor changes
8/29/2019 - added examples tagged with "doing the right thing" in Retraction Watch.  Thank you very much to Dave Fernig!
8/30/2019 - changed tense of previous entry of change log (from future tense to past tense)
9/2/2019 - add another positive example of comments improving pre-print
9/5/2019 - fix typo; minor changes
9/17/2019 - add 2019 correction, which I remembered from this tweet
9/18/2019 - fix typo; add multiple other citations
9/20/2019 - add PLOS Genetics correction as successful example
9/30/2019 - add recent Nature / Nature Medicine corrections
10/8/2019 - add Nature Medicine CCR5 official retraction and Nature Biotechnology / Nature Genetics corrections
10/18/2019 - add another PLOS ONE comment, PhiX example, and started list of incorrect citations for my 1st author papers
10/24/2019 - add PubPeer comment and eDNA paper
10/25/2019 - add PubPeer comment
10/28/2019 - add more PubPeer / citation comments
10/30/2019 - add Twitter comment
11/8/2019 - add a couple more corrections
11/10/2019 - revise sentences about being "brave"
11/11/2019 - add another PubPeer comment
12/2/2019 - add lcWGS concern + PLOS "filleted" typo
12/10/2019 - add positive example for formal Nature correction
12/24/2019 - add another PubPeer comment
12/30/2019 - add another PubPeer comment
1/2/2020 - add PubPeer question + Science retraction
1/14/2020 - add another PubPeer comment
1/29/2020 - add another Disqus comment
4/24/2020 - add comment for citation of one of my papers
5/15/2020 - minor changes
6/4/2020 - add PubPeer entry for article with typo in title
6/9/2020 - add COVID-19 paper that I believe uses confusing wording
6/15/2020 - add F1000 Research comment
6/25/2020 - minor formatting changes
7/30/2020 - MetaviralSPAdes comment
7/30/2020 - add TMM question
11/6/2020 - move eDNA category
11/11/2020 - add Bioinformatics typo correction
12/11/2020 - add PLOS ONE comment
2/2/2021 - add a couple more positive correction notes + BMC Medical Genomics figure label issue
2/3/2021 - add additional corrections for consortium papers published in Nature journals on 5/27/2020
3/13/2021 - add MSigDB citation as GSEA
4/8/2021 - minor changes
4/10/2021 - add lcWGS PubPeer comment
4/14/2021 - add another example of successful post-publication review
7/28/2021 - add Nature Methods multiplet PubPeer question
8/6/2021 - add note to acknowledge my misunderstanding for Nature Methods multiplet PubPeer question
10/30/2021 - try to better explain PhiX issue temporarily fixed by NCBI; also, move a subset of examples to my Google Scholar page (since the RNA-MuTect Science paper eLetter indicating a need for corrections was automatically recognized, but I needed to manually fix some things - this made me realize I may be able to use Google Scholar to help emphasize some post-publication review)
3/15/2022 - add MethReg comment
6/2/2022 - additional COHCAP corrigendum citation references
10/4/2022 - TC-hunter PubPeer question/comment

Sunday, August 25, 2019

My Thoughts on Offering Genetic Counseling for "All of Us" from Color

I was initially not certain what to think when I saw that the NIH made a $4.6 million deal with Color for genetic counseling support for the "All of Us" program (and perhaps there are still some things that I don't completely understand).

I was previously OK with DNAnexus providing support for the precisionFDA program (although I much more strongly support the on-going benchmarks that can be provided for free, versus the original competition).  However, I believe my feelings are a little different for Color supporting All of Us.

First, participation in the All of Us program is not completely without barriers to entry.  For example, I haven't been selected for sequencing, but I was selected for specimen and data collection (blood, urine, height/weight, Fitbit measurements, survey questions, etc.).  In order to do that, I went to USC to have my samples collected (even though I work at City of Hope), although it looks like the list of partner centers may have expanded.

If you were just collecting spit, connecting with the Patient Initiated Testing companies (including but not limited to Color) would help people be involved.  However, as I note for some at-home blood collection tests, I actually prefer having a professional draw my blood.  Most importantly, if you did go to a partner center to collect your biological samples, it seems to me that you are probably somewhat close to a place where you could have a face-to-face meeting with a genetic counselor.  So, in that sense, I wonder why they didn't choose to create infrastructure internally and help provide a way to find independent genetic counselors.  I believe Aetna only covered talking to a genetic counselor on the phone (through one genetic counseling company), but I didn't have to worry about insurance coverage for my All of Us sample collection (I actually got a small gift card for my donation).  In other words, if the US government is helping cover the salaries of local genetic counselors (in addition to insurance), that seems like something I might prefer to support over Color.  Or, even without direct funding, helping people find genetic counselors (kind of like you can do on the National Society of Genetic Counselors website) and/or get a better feel for what can or cannot be robustly predicted with genetics seems like a good fit for the All of Us program.

Plus, I believe I had to wait about a month to be able to talk to a clinical pharmacist through Color (given their current set of customers, and with somebody who essentially had enough time to verbally describe the current content of the reports); so, I wonder if you might be able to talk to a different genetic counselor more quickly (and/or find a counselor with more specialized knowledge).

My second possible concern is about what exactly is meant by "technical backbone" (in a different article).  Even though I don't mention Color very much in my recent collection of posts on personal genomics experiences, I did have a subfolder for Color in my GitHub repository.  While I appreciate that they considered genomic data to be PHI (meaning you had to be able to get access to your genomics data, under HIPAA), I wasn't really satisfied with the current return of raw data from Color (and I still don't have all of my raw data).  If I am just talking to Color, this isn't a huge deal (I realize they are busy with a lot of stuff).  However, if the US government is funding Color (at least in part) for infrastructure for returning raw data, then this could bother me.  On the other hand, if the plan is to return raw data through the All of Us participant portal (separate from this deal with Color), then I am OK with that.

I also found a 2018 story about Color supporting All of Us along with two other centers.  However, I am confused about why the All of Us website listed a different set of 3 centers (not including Color) on that same date (9/25/2018).  So, perhaps there is something else that I am missing.

That said, the entire reason why I first ordered a test from Color is that I really liked their free public portal for finding more information about variants.  So, if this is what was meant by technical support, then I think that was good.

My third point may be less important, but I have another blog post that considers whether the All of Us program can help with more directly providing patients with genetic diagnostics (among other topics).  In this respect, I have mostly positive opinions about the connection with Color.  For example, the GenomeWeb article mentions the return of pharmacogenomics results (which I also mention in my GitHub notes); while that GenomeWeb article was emphasizing interactions between the FDA and the NIH, my GitHub notes talk about Color removing the mental health medication information (even though you can still see that in my earlier uploaded PDF) while the Myriad GeneSight test is still available (even though I'm not entirely clear whether those results have FDA approval).  So, if I could only choose between supporting Color or Myriad, I completely agree with the choice to support Color.  However, there are additional options, and the situation is more complicated.  So, I thought it might be important to provide this post with my ideas (as a way to encourage further discussion).

Change Log:

8/25/2019 - public post date

Saturday, August 10, 2019

Disqus / Twitter Follow-Up: Comparing My 23andMe SNP Chip Concordance with Different Veritas WGS files

I am mostly summarizing my points from the following Twitter discussion:

https://twitter.com/carolinefwright/status/1157219572514209792

The original intention was to add this as a comment in the Disqus thread for the original pre-print discussion.  However, this was fairly long, and I wanted to have a little more control over the figure formatting.

For reference, I recommended taking a look at Illumina arrays in an earlier comment, and I mentioned that there are datasets with both 23andMe SNP chip data and high-throughput sequencing data (like my own).

As a success story, that comment was followed up on, and new data was added.  In particular, a plot of the error/discordance rate between my own 23andMe data and my Veritas WGS data was posted on Twitter.

I think this is great, but I think it may be worth emphasizing that re-processing my Veritas WGS data resulted in better concordance with my Exome data.  Additionally, I have an upgraded V5 chip, so there are actually 2 sets of 23andMe data (although almost all of these results are based upon my later set of V3_V5 genotypes).

Nevertheless, I actually have 2 .vcf files for my Veritas WGS data (the provided .vcf, and the .vcf that I produced from extracted FASTQ files and reprocessed using BWA-MEM and GATK).

I did some analysis with my V3 23andMe genotypes, but I think that was mostly consistent with my V3_V5 genotypes.  Among probes on both arrays, there were only 5 discordant sites (so, I think that is how the previously lower error rate was reported: SNP chip versus same SNP chip, instead of SNP chip versus WGS).

In contrast, even the best-case scenario for my data seems to have a concordance of 97.6-99.2% (for MAF > 0.01 variants), and this was slightly lower for the V3_V5 genotypes.  If I only considered my original V3 genotypes, this would have been better than what I had previously reported for my Exome versus WGS data (98-99% for BWA-MEM + GATK re-processed variants).  However, either way, there are over a million probes on the V5 23andMe array.  So, I think something about variants being used for "research purposes" may be relevant (although I will show below that certain sets of variants do have higher reproducibility).
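
For anyone who wants to script a similar check, here is a minimal sketch of the comparison (in Python; the file names are placeholders, and a real comparison also needs to handle strand flips, multi-allelic sites, and indels, which I am skipping here):

import gzip

# Load 23andMe raw genotypes: tab-delimited rsid / chromosome / position / genotype
chip = {}
with open("23andMe_V3_V5_raw.txt") as f:  # placeholder file name
    for line in f:
        if line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        # skip no-calls and indel probes (D/I codes)
        if genotype in ("--", "NC") or "D" in genotype or "I" in genotype:
            continue
        chip[(chrom, pos)] = "".join(sorted(genotype))

# Compare against the WGS calls at shared SNP positions (note: sites that are
# homozygous reference are absent from a regular .vcf, so they are not counted here)
total = match = 0
with gzip.open("veritas_wgs.vcf.gz", "rt") as f:  # placeholder file name
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        key = (cols[0].replace("chr", ""), cols[1])
        ref, alt = cols[3], cols[4].split(",")[0]
        gt = cols[9].split(":")[0].replace("|", "/")
        if key not in chip or len(ref) != 1 or len(alt) != 1 or "." in gt:
            continue
        alleles = {"0": ref, "1": alt}
        wgs = "".join(sorted(alleles.get(a, "?") for a in gt.split("/")))
        total += 1
        match += (wgs == chip[key])

print("concordance at shared SNP sites: %.1f%% (%i sites)" % (100.0 * match / total, total))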

The trends can vary depending upon what I use for the MAF calculation:

1000 Genomes: [2 discordance-by-MAF barplots: provided vs. re-processed WGS .vcf]

gnomAD: [2 discordance-by-MAF barplots: provided vs. re-processed WGS .vcf]

Kaviar: [2 discordance-by-MAF barplots: provided vs. re-processed WGS .vcf]

I am showing 2 plots each because I have 2 .vcf files for my WGS data (the one provided by Veritas, and the BWA-MEM + GATK re-processed variant file).  While the results above are mostly similar with either Veritas WGS .vcf, there are some noticeable differences in the exact set of rare variants with different population estimates (which could be because of the composition of individuals, or because of the different sample processing strategies for each project).

When I converted my 23andMe data to VCF format, I added a FILTER status (to keep track of variants within repeats, for example).  If there was no other note, the variant had a "PASS" in the FILTER column.  If I only consider "PASS" variants, this is what those plots look like:

1000 Genomes: ["PASS"-only discordance-by-MAF barplots]

gnomAD: ["PASS"-only discordance-by-MAF barplots]

Kaviar: ["PASS"-only discordance-by-MAF barplots]

If I only consider the "PASS" variants, the accuracy also increases a little (to 98.8-99.7% for my initial V3 genotypes, and 98.009-99.997% across the range of MAF for my V3_V5 genotypes), usually in the far-left column.
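
For reference, the binning behind these barplots is essentially the following (a sketch with toy inputs; in the real analysis, the genotype dictionaries are built as in the earlier sketch, the MAF comes from 1000 Genomes / gnomAD / Kaviar annotation files, and the bin boundaries here are illustrative):

from collections import defaultdict

# Toy inputs: (chrom, pos) -> genotype string or population MAF
chip_calls = {("1", "1000000"): "AG", ("1", "2000000"): "TT"}
wgs_calls  = {("1", "1000000"): "AG", ("1", "2000000"): "CT"}
maf_table  = {("1", "1000000"): 0.32, ("1", "2000000"): 0.004}

BINS = [(0.0, 0.01), (0.01, 0.05), (0.05, 0.25), (0.25, 0.5)]  # illustrative boundaries

def maf_bin(maf):
    for low, high in BINS:
        if low <= maf < high:
            return "%g-%g" % (low, high)
    return "0.25-0.5"  # a minor allele frequency cannot exceed 0.5

tally = defaultdict(lambda: [0, 0])  # bin -> [discordant, total]
for site, maf in maf_table.items():
    if site in chip_calls and site in wgs_calls:
        b = maf_bin(maf)
        tally[b][1] += 1
        tally[b][0] += (chip_calls[site] != wgs_calls[site])

for b in sorted(tally):
    discordant, n = tally[b]
    print("MAF %s: %.2f%% discordant (n=%i)" % (b, 100.0 * discordant / n, n))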

Finally, if I start from the full set of variants, but only look at those that were discordant between my WGS .vcf files, I see better concordance for the re-processed variants if they are common (but the trend for the smaller variant sets can vary):

1000 Genomes: [WGS-discordant sites: SNP chip concordance by MAF]

gnomAD: [WGS-discordant sites: SNP chip concordance by MAF]

Kaviar: [WGS-discordant sites: SNP chip concordance by MAF]

For these plots, the largest number of variants is in the far-right column.  Since that far-right column usually has less SNP chip discordance (less red in the barplot), that is consistent with my earlier conclusion that re-processing the WGS data can produce variants with higher overall concordance (in that situation, between Exome and WGS data).  However, this can clearly vary between individual variants/positions (and the best processing strategy may depend upon where you need to call variants).

I can't really visualize the SNP chip data, and I kind of have to trust the "NC" status for "No Call" positions (I don't have access to a more raw form of the data, like the probe intensities).

However, as a general rule, I would always recommend checking your alignments for false negatives or false positives (which you can do with a free genome browser like IGV).  I added some ClinVar annotations to try to find some discordant sites to check, and I've listed a couple below:

1) I was surprised that my re-processed .vcf was missing my cystic fibrosis variant (which I have a whole other post about).  However, this was purely a formatting issue.

Namely, I threw out most indel positions (indicated by DI) because figuring out exactly what they represent is more difficult than for SNPs.  However, with my earlier V3 chip analysis, I manually converted some indels in my code (including my cystic fibrosis indel), and I converted them to match the freebayes indel format in the provided Veritas .vcf.

In the other blog post, you can clearly see that I am a cystic fibrosis carrier, even with the re-alignment.  So, I looked up the GATK format for that indel and checked the status of that variant.  Indeed, I do have a variant call for chr7 117149181 . CTT C.  So, this is in fact OK (as long as the annotation software can figure out that I have the variant, which I think was part of the problem described in the other blog post).
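
Because this kind of representation mismatch keeps coming up, here is a minimal sketch of the trimming that makes two differently padded indel records comparable (the second record below is an illustrative freebayes-style padding in the T run, not the exact line from my provided .vcf; full normalization, including left-alignment against the reference, is what a tool like bcftools norm does):

def normalize_indel(pos, ref, alt):
    """Trim shared trailing/leading bases so equivalent indel records match."""
    # trim shared trailing bases
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # trim shared leading bases (keeping one anchor base), advancing the position
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (pos, ref, alt)

# The GATK record for my TT deletion:
print(normalize_indel(117149181, "CTT", "C"))    # (117149181, 'CTT', 'C')
# A hypothetical, differently padded version of the same event:
print(normalize_indel(117149181, "CTTT", "CT"))  # (117149181, 'CTT', 'C')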

2) While it wasn't a discordant site between the .vcf files, there were only a limited number of total ClinVar pathogenic variants.  So, I happened to notice one that indicated I was homozygous for the pathogenic variant in my 23andMe data (1/1), but a variant was not called at that position with either of the Veritas WGS .vcf files (indicated by 0/0 in the genotype columns).  In other words, the SNP chip data was consistent with a SNP chip replicate, and the WGS variant call was robust to different processing methods, but the result was different for the SNP chip versus WGS.

However, I can check the alignments for both the provided and re-processed WGS data (as well as the provided and re-processed Exome data), which is what I do below:

[IGV screenshot showing alignments in the following order (top to bottom): Genos Exome Provided, Genos Exome Reprocessed, Veritas WGS Provided, Veritas WGS Reprocessed]

So, from the alignments, I would be inclined to agree with the WGS variant calls.  If there is some isoform of the gene so far diverged that the reads wouldn't align, that could be an exception; however, I don't think that situation would be best described as a SNP.  Also, I don't believe any reports indicated that I had a predisposition to Neurofibromatosis (and I don't know of anybody in my family diagnosed with that disease).  This was a custom 23andMe probe (labeled as i5003284, instead of a typical rsID).  However, the larger ANNOVAR annotation file has the ClinVar information, and I can find a dbSNP ID using the UCSC Genome Browser (for hg19, chr17:29541542).  So, in terms of checking the ANNOVAR annotation: if I actually did have two copies of an NF1 pathogenic variant, rs137854557 does have multiple reports indicating it is pathogenic (a less confident assertion than for my cystic fibrosis variant, but more than for most of my other "pathogenic" variants).
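
For spot-checking a single site outside of IGV, a small pysam sketch can count the read bases at a position (the BAM file name is a placeholder, and I am skipping base/mapping quality filters here, so IGV is still the better visual check):

import pysam
from collections import Counter

def base_counts(bam_path, chrom, pos):
    """Count read bases at a 1-based position."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # pileup() uses 0-based coordinates; truncate=True limits output to this column
        for column in bam.pileup(chrom, pos - 1, pos, truncate=True):
            for read in column.pileups:
                if read.is_del or read.query_position is None:
                    counts["del/refskip"] += 1
                else:
                    counts[read.alignment.query_sequence[read.query_position]] += 1
    return counts

# the NF1 site discussed above (hg19):
print(base_counts("veritas_wgs_realigned.bam", "chr17", 29541542))  # placeholder BAM name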

As a final note, the data and code for the analysis of my V3_V5 23andMe genotype data (and 2 Veritas WGS .vcf files) are available here.

Update Log:

8/10/2019 - public post date
1/15/2020 - add tag for "Converted Twitter Response"

Sunday, August 4, 2019

Updated Thoughts on My Genomics Results

I have notes on GitHub, but they are mostly within subfolders and may be a bit messy for some people to read.

The repository has been called "DTC" for "Direct To Consumer" genomics.  However, many of those results involved an on-line physician.  At a genomics conference, I heard the term "Patient Initiated Testing," which is more precise and covers those types of results.  So, that is why I have tagged the posts with the term "Personal PIT Experiences."

So, I have some thoughts on the topic of genomics results that are already available to the general public.  I realize that some of these may give the reader an overly negative impression.  However, I want to emphasize that a fair presentation should include both positive and negative perspectives, and I have tried to do that (or at least describe possible solutions to potential problems).

In general, I have genomics / medical data publicly available to download on my Personal Genome Project page (for hu832966) and I have what I would consider a partial electronic medical record on my PatientsLikeMe page (however, I think you have to sign in with a free account to view my profile).

With all of that being said, I want to end on a positive note: among all of my notes, I think Genes for Good is something more people should know about.  They don't really provide as much interpretation about traits / diseases, but they provide multiple versions of whatever they offer (such as different formats for genotype files, with and without imputation, as well as different ancestry results from somewhat independent methods).  I think this is really great, because you can critically assess your data/results and develop your own opinion (to try to decide what is the most fair representation of the various possible interpretations of your data).

Change Log:

8/4/2019 - public post date
8/5/2019 - minor changes
8/19/2019 - change title for lcWGS health / trait result (after Twitter DM feedback)
8/21/2019 - add "human" to title for positive lcWGS post
9/16/2019 - change title to emphasize "my" for ancestry result
2/4/2020 - further change title to emphasize "my" for ancestry result

Digging Deeper into my Cystic Fibrosis Carrier Status

One overall goal for the various subfolders in the DTC_Scripts repository was to get an idea of how much the data / results could vary between vendors.

I suppose some people might consider it surprising that the "raw" genotypes/variants could vary, but I previously discussed that in a post about re-processing raw data to get more concordant genotypes (and I also have a post about tools to make HLA assignments among this collection of posts).

Some things, like ancestry, may arguably fall under what I would call "hypothesis generation" results, in that some results may be more robust than others (and limitations to the accuracy of specific ancestry assignments are described in another post).

In contrast, this post focuses on something that I think can be utilized with relatively greater confidence (that I am a cystic fibrosis carrier).

That said, in a sense, making sure you get single-gene, rare-disease genomics analysis consistently correct is more complicated than you might expect.  However, in terms of being confident about any genomics result, I think rare variants associated with Mendelian diseases should be a strong point for genomics benefiting society.

So, here is the outline of what happened:


  • In 2011, I was genotyped (with the V3 chip) by 23andMe 
    • This indicated that I was a cystic fibrosis carrier
  • While some carrier status results have been removed (and added back in), I knew my carrier status before there were any issues with the FDA.
  • In 2016, I got Veritas Whole Genome Sequencing raw data (and a GET-Evidence and ClinVar report from the Personal Genome Project)
  • In 2017, I got Genos Exome raw data with an automated report
    • Update (3/17/2020): When I currently sign into the Genos browser, I see my pathogenic variant annotation in the CFTR gene.  I am not sure when/if this was changed, but the report does now successfully show multiple references that are correct for my cystic fibrosis carrier status.
  • In 2019, I ordered a bunch of extra tests (primarily emphasizing the interpretation over the raw data), but this included Helix Exome+ data from the Mayo GeneGuide (the raw data cost extra, and was a gVCF).
  • So, I had 3 high-throughput sequencing results that covered my cystic fibrosis variant.  However, none of them indicated that I was a cystic fibrosis carrier in a way that was immediately obvious, and I think at least one (Mayo GeneGuide) failed to report my cystic fibrosis status (even when covering a smaller number of diseases).
    • You can see my FDA MedWatch / MAUDE report for Mayo GeneGuide in MW5093889.  Helix sent me an e-mail that Mayo GeneGuide was discontinued on 4/30/2020, which you can also see on this website.
    • There are some extra formatting changes that I wasn't expecting, but you can also see my FDA MedWatch / MAUDE report for Veritas Genetics in MW5093888.  That said, I was describing my Personal Genome Project report (since I ordered the sequencing through the PGP) and I don't think Veritas specifically marketed annotating my cystic fibrosis status.  So, it might be OK if it is harder to find this report for Veritas Genetics through the search function.
    • I was particularly surprised by this for GeneGuide, since they limited the number of diseases they officially tested for (which I think was a good idea).  However, their guidelines for defining a pathogenic variant didn't include the variant covered by the 23andMe array.
    • It might also be worth mentioning that an on-line physician signed off on the other 3 results, but that didn't improve the accuracy of my cystic fibrosis carrier status.
  • With the 23andMe result, I could check the details of the variant they used to define me as carrier.  Namely, I could verify my carrier status for rs121908769 in ClinVar.
  • I might be forgetting the exact order of events after that.  However, the following gave me extra confidence that my earliest 23andMe result was in fact the "correct" one.
    • I could visualize my alignment in IGV (for my Veritas WGS and Genos Exome data) to see that I did in fact carry the variant (see below).
    • While a lot less intuitive to visualize, the Helix Exome+ data (which I had to pay extra for, beyond my GeneGuide results) also indicated that I had the variant in question, and IGV does accept a gVCF as an input file (see further below, under the .bam visualization; a minimal script for pulling the relevant gVCF lines is also sketched after this list).
    • I used the above data in response to a question on Biostars, and I was particularly pleased to discover that I got feedback that helped me gain confidence in my own result.
      • For example, I learned about a website called CFTR2, which provides information unique to cystic fibrosis and the CFTR gene.
      • Specifically, this specialized website indicated that my 394delTT variant should be considered pathogenic for cystic fibrosis (if you have two pathogenic alleles).  Please note that you have to accept the usage agreement to view the specific result linked above.
      • I also discovered some formatting issues that I believe were responsible for at least one false negative.
    • In other words, all 4 results correctly indicated that I had the variant.  The only issue was with interpretation of that variant (which was "correct" for 1 out of 4 results). 
    • I thought I talked to multiple genetic counselors, but my GeneGuide notes indicate that the genetic counselor from PWNhealth agreed that the above information indicates that I am a cystic fibrosis carrier (even though I believe they were providing guidance for a result that formally, and incorrectly, indicated that I was not a carrier).
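
As referenced in the list above, here is a minimal sketch for pulling the gVCF lines around the GATK-style representation of my 394delTT variant (hg19 chr7:117,149,181); the file name is a placeholder:

import gzip

# A gVCF reference block can start before this window and still cover the site,
# so widening the window is a quick sanity check if nothing prints.
with gzip.open("helix_exome_plus.gvcf.gz", "rt") as f:  # placeholder file name
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0] in ("7", "chr7") and 117149170 <= int(cols[1]) <= 117149190:
            print(line.rstrip("\n"))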

Veritas WGS / Genos Exome BAM (Provided + BWA-MEM Re-Alignment): [IGV screenshot]

Helix Exome+ / Mayo GeneGuide (gVCF): [gVCF screenshot]

In many ways, I still consider this a positive experience.  For example, note the following:


  • Having access to raw data allowed me to determine something that was incorrect / missing in my original report (and I think this should essentially be required)
    • That said, I hope the screenshots above show that FASTQ+BAM+VCF is probably a better set of formats to require than a gVCF alone
  • Notice that I got free feedback in a public community forum (Biostars) that provided information I didn't obtain from any of the companies that I paid for genotyping / sequencing.  This emphasizes the value of having free options for re-analysis / re-processing of your data.
  • While it might require some additional training, sometimes simply viewing your data in IGV (a free genome browser) may be helpful for genetic counselors to assess the accuracy of individual genotypes.
    • While it makes life more difficult, the majority vote (3/4 companies, if you count as I did above) would actually give the wrong answer (falsely indicating that I was not a cystic fibrosis carrier).  So, kind of like I can tell that I need to work on fewer projects more in-depth, I think it probably helps to have specialization for genetic counselors (so they can have an idea of what questions to ask, beyond what is provided in a short report).
  • I successfully learned (somewhat) more in-depth about a carrier status that could impact offspring (if my partner was also a carrier).  If planning to have a child should be decided on the scale of years (or you are assessing life-time risk for diseases with onset later in life), then taking some time to understand your genome on the scale of years may be OK (although, if you use IVF+PGT, you do need to make sure that the pre-defined variants are missing with high accuracy, on a shorter time-scale)


That said, I do think it is important to have realistic expectations about what can be done in genomics, and the need to spend a non-trivial amount of time sorting out the details for your area of expertise.

Update Log:

8/4/2019 - public post date
8/5/2019 - minor changes
8/6/2019 - minor changes
8/14/2019 - minor changes
8/15/2019 - minor changes
8/16/2019 - add link to IGV
3/17/2020 - list my ability to find CFTR pathogenic variant from Genos
4/24/2020 - add link to FDA MedWatch report (Helix + Mayo GeneGuide)
4/30/2020 - add link for Helix discontinuing Mayo GeneGuide
5/4/2020 - add link to FDA MedWatch report (Veritas Genetics)

Concerns About Using Low-Coverage Sequencing for Trait or Health Results

This is a subset of my notes from my Nebula lcWGS sequencing on GitHub:

NOTE (2/24/2020): Nebula is currently offering 30x sequencing.  So, my concerns about the low-coverage Whole Genome Sequencing (lcWGS) at ~0.5x are probably less relevant for that particular company.  However, if you get lcWGS from another company, then this information is probably still relevant.

Concerns about Specific Variants

While I very much support providing FASTQ, BAM and VCF data, one of my concerns about the Nebula results was the use of low-coverage sequencing.

So, one of the first things that I did was visualize the alignments for some of my more confidently understood variants from previous data (using IGV).

For the two alignments below, the Genos Exome is the top alignment, the Nebula low-coverage alignment is in the middle, and the Veritas Whole Genome Sequencing (WGS, regular-coverage) is at the bottom.

My cystic fibrosis variant (rs121908769): [IGV screenshot]

My APOE Alzheimer's risk variant (rs429358; Nebula alignment in the middle, variant is the red-blue bar in the right-most exon): [IGV screenshot]

For APOE, I zoomed out in the screenshot so that you could get a better perspective of the per-read error rate at other positions around the gene.

You could see my cystic fibrosis variant in the 1 read covering that position, but you can't see any reads with the APOE variant.  My concern about the use of low-coverage sequencing is due to imputation (at least for traits).  Even though this APOE variant is somewhat common (I believe in ~15% of the population), the imputation failed to identify me as having it.  You can see that from the .vcf file:

My APOE Alzheimer's risk variant:

19      45411941        rs429358        T       C       .       PASS    .       GT:RC:AC:GP:DS  0/0:0:0:0.923102,0.0768962,1.71523e-06:0.0768996

As described in the gVCF header:

GT = Genotype
RC = Count of Reads with Ref Allele
AC = Count of reads with Alt Allele
GP = Genotype Probability: Pr(0/0), Pr(1/0), Pr(1/1)
DS = Estimated Alternate Allele Dosage

The "0/0" (for genotype/GT in the last column) means that the low-coverage imputation couldn't detect my APOE variant.  In other words, I believe Nebula incorrectly estimated my genotype to be 0/0 with a probability of 92.3%, while the probability of the true genotype (0/1) was 7.7%.  I also see a blog post mentioning that these probabilities are provided to users through the web-interface, although I am having difficulty finding them without the gVCF (and you won't see them in the PDFs that I have uploaded in this section).
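
To make those fields a little more concrete, the record above can be parsed with a few lines of Python:

record = ("19\t45411941\trs429358\tT\tC\t.\tPASS\t.\t"
          "GT:RC:AC:GP:DS\t0/0:0:0:0.923102,0.0768962,1.71523e-06:0.0768996")

fields = record.split("\t")
sample = dict(zip(fields[8].split(":"), fields[9].split(":")))  # FORMAT -> sample values

gp = [float(x) for x in sample["GP"].split(",")]  # [P(0/0), P(0/1), P(1/1)]
print("called genotype:", sample["GT"])  # 0/0, with 0 supporting reads (RC and AC)
print("P(0/0) = %.1f%%, P(0/1) = %.1f%%" % (100 * gp[0], 100 * gp[1]))
# prints P(0/0) = 92.3%, P(0/1) = 7.7% -- but my true genotype is 0/1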

Update (8/5): Nebula support got in touch with me and explained that the blog post is in reference to the Nebula Research Library (rather than the "Your Traits" section).  Although I canceled my subscription, I am still able to confirm that I see this under the "Library" section (rather than "Traits," "Ancestry," or "Microbiome").

Likewise, there was no delTT variant in the VCF, so my cystic fibrosis carrier status would also be a false negative (if that variant was used in the report), even though you could actually see that deletion in the 1 read aligned at that position (1 read wasn't sufficient to have confidence in that variant call).

Overall Variant Concordance

I can also use my VCF_recovery.pl script to compare recovery of my Veritas WGS variants in my Nebula gVCF.

If you compare SNPs, then the accuracy is noticeably lower than GATK (and even lower than DeepVariant):


3,071,596 / 3,419,611 (89.8%) full SNP recovery
3,184,641 / 3,419,611 (93.1%) partial SNP recovery

The indels are harder to compare (because of the freebayes indel format).  So, in the interests of fairness, I am omitting them here (as I did when comparing the provided Genos Exome versus Veritas WGS variants).  However, instead of comparing against the provided Veritas WGS .vcf file, I can try comparing against the BWA-MEM re-aligned GATK Veritas WGS .vcf (which also had higher concordance between my Exome and WGS datasets):


3,133,635 / 3,419,611 (91.6%) full SNP recovery
3,248,277 / 3,419,611 (95.0%) partial SNP recovery
164,140 / 217,959 (75.3%) full insertion recovery
180,736 / 217,959 (82.9%) partial insertion recovery
190,452 / 266,479 (71.5%) full deletion recovery
213,131 / 266,479 (80.0%) partial deletion recovery

The GATK recovery is a little better.  However, it is very important to emphasize that the gVCF variants do not have 99% accuracy (not even as an average, and not even for SNPs alone).  I think whatever benchmark was used for that calculation was probably over-fit on some training data.  To be fair, I think the average SNP chip concordance (with higher-coverage WGS data) is also lower than some people might expect, but it is definitely higher than for this lcWGS data.
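
For reference, here is a simplified sketch of the idea behind the full / partial recovery comparison (the real VCF_recovery.pl script also has to deal with multi-allelic sites and the indel format issues mentioned above; the inputs below are toy examples):

# Toy inputs: (chrom, pos) -> sorted allele tuple
wgs_calls    = {("1", "100"): ("C", "T"), ("1", "200"): ("A", "A"), ("1", "300"): ("G", "T")}
nebula_calls = {("1", "100"): ("C", "T"), ("1", "200"): ("A", "G")}

def classify_recovery(wgs_gt, test_gt):
    """'full' = identical genotype; 'partial' = at least one shared allele
    (so every full recovery also counts as partial, matching the totals above)."""
    if test_gt is None:
        return "missed"
    if wgs_gt == test_gt:
        return "full"
    if set(wgs_gt) & set(test_gt):
        return "partial"
    return "missed"

full = partial = total = 0
for site, wgs_gt in wgs_calls.items():
    status = classify_recovery(wgs_gt, nebula_calls.get(site))
    total += 1
    full += (status == "full")
    partial += (status in ("full", "partial"))

print("%i / %i full recovery" % (full, total))        # 1 / 3 in this toy example
print("%i / %i partial recovery" % (partial, total))  # 2 / 3 in this toy example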

You can also show similar results with precisionFDA (using the BWA-MEM re-aligned GATK gVCF, which we expect to have better concordance than the provided Veritas gVCF).

For example, the overall comparison shows noticeably low recall when comparing the Nebula imputed gVCF versus the Veritas WGS BWA-MEM re-aligned gVCF:

[precisionFDA comparison screenshot]

and, to be more fair to the Exome versus WGS comparison in the blog post, the trend is similar within RefSeq CDS regions:

[precisionFDA comparison screenshot]

The screenshots are smaller than in the blog post because there was no precision-recall plot for the imputed Nebula gVCF comparisons.

So, I disagree with the use of low-coverage sequencing for traits, and I would respectfully suggest removing this section (or only making it available to those with higher-coverage sequencing).

When I was trying to upload my raw data to my Personal Genome Project page, I noticed that they had an option called "genetic data - Gencove low pass (e.g. Nebula Genomics)".  This makes me think discouraging low-coverage sequencing is something that needs to be done more broadly (at least for health traits).


Concerns about Nebula Library Results

My concerns from the previous sections are probably solved by using higher-coverage sequencing data.  So, unless you were an earlier customer with the lower-coverage sequencing data, you probably don't have to be extra careful about possibly overestimated accuracy in your genotype imputations.

However, there is one thing that I think could still be a problem for customers with higher-coverage sequencing data (if the reports are the same).  The concept is similar to my concern about the basepaws breed index (described in this blog post) and/or other Polygenic Risk Scores that I have collected for myself, but I think I can explain my concern with the top 3 percentile results that I received from Nebula:

[Nebula report screenshot: alopecia areata percentile]

As you can see from this link, the percentile above was calculated using 13 SNPs.  Seven of the thirteen SNPs are on chromosome 6, and 5/7 of those variants didn't have alignments against the main reference chromosome (for hg19) in my higher-coverage Veritas Whole Genome Sequencing data.  Nebula predicted 3-4 of those 7 chromosome 6 variants to be homozygous variants, but I am not sure whether these are correct (and the nucleotide for rs3763312 was different from the variants in dbSNP).  There was a pair of variants on chromosome 10 where I was predicted to be heterozygous at both sites.  For the remaining 6 non-chr6 variants, I had imputed genotypes for 2 heterozygous variants, 2 homozygous non-reference variants, and 2 homozygous reference variants (and they matched my higher-coverage WGS data).

I am only 34, but I definitely don't have hair that looks like the Google Images results for this disease.  So, I don't know the expected age of onset, but I think I might never get this condition (even though Nebula says that I am at the 100th percentile).

[Nebula report screenshot: rheumatoid arthritis percentile]

As you can see from this link, the percentile above was calculated using 15 SNPs.  12 of those SNPs had at least 1 variant allele relative to the reference genome, and 11/12 of those variants matched my Veritas WGS variants.  The discordant variant was rs6910071, which was homozygous for the variant allele in the Nebula lcWGS imputed variants.  So, this could have been consistent with ~90% overall accuracy, but I didn't have coverage in either dataset at this position (so, this isn't the same as using a gVCF to make a homozygous reference genotype call).

I have osteoarthritis in my lower back, but I don't believe that I (currently) have rheumatoid arthritis.

While I could believe that I am at increased risk, it is important to note that the summary only describes 4% of the variance in disease risk.  I think this should be described for all of the reports, to give a sense of the predictive power (along with other statistics).

I also noticed most of the variants were not present in ClinVar (when I was using dbSNP to check the hg19 genome coordinates and reference allele).
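
Since I keep repeating this kind of cross-check (report SNPs versus my higher-coverage WGS calls), here is a minimal sketch (the rsID list below only includes the two examples named in this post, the file name is a placeholder, and the full SNP panels are in the linked reports):

import gzip

check_ids = {"rs3763312", "rs6910071"}  # only the examples named in this post

found = {}
with gzip.open("veritas_wgs_realigned.vcf.gz", "rt") as f:  # placeholder file name
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[2] in check_ids:
            found[cols[2]] = (cols[3], cols[4], cols[9].split(":")[0])  # REF, ALT, GT

for rsid in sorted(check_ids):
    # an absent rsID can mean homozygous reference OR no coverage; a BAM / gVCF
    # check (like the IGV checks above) is needed to tell those apart
    print(rsid, found.get(rsid, "not in VCF"))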



[Nebula report screenshot: C-reactive protein (CRP) percentile]

As you can see from this link, the percentile above was calculated using 56 SNPs.

I have blood test results uploaded on my PatientsLikeMe profile, and I thought that I had a normal CRP result.  However, it appears that I might not have remembered that correctly, and I might need to wait until my next checkup to re-test my CRP level.  Either way, this is something where I think it would be relatively easy to show whether being at the 99th percentile substantially affects your observed CRP levels (or whether there are limits to what this score represents).




[Nebula report screenshot: restless leg syndrome percentile]

As you can see from this link, the percentile above was calculated using 6 SNPs.  Nebula predicted that I had a homozygous variant for 1 SNP and a heterozygous variant for 1 SNP (and reference genotypes for the other 4 variants).  However, all of these variants looked OK in my higher-coverage Veritas WGS data.

I believe that I was previously reported to be at higher risk for restless leg syndrome, but that might actually have been for deep vein thrombosis / venous thromboembolism in the earlier 23andMe reports (before the FDA required approval for a more select set of results).  I do sometimes have difficulty sitting perfectly still at night.  However, this doesn't happen all of the time, and I have never been diagnosed by a doctor as having this condition.  So, I would currently lean towards saying that I don't have restless leg syndrome.

For the 2 sets of SNPs that I checked (for alopecia areata and restless leg syndrome), I also visualized my Genos Exome alignment.  However, most of the variants were not covered by sequencing of coding regions.

In general, the journals where these results are published may make some readers think the results are useful.  However, being able to publish a result in a prestigious journal doesn't mean the associations are predictive enough to be clinically meaningful.  Also, even for a more subtle association, being in a prestigious journal doesn't necessarily mean the result can be reproduced.  For example, there are retractions in high-impact journals (you can see some in this blog post), and there are objectively wrong conclusions in papers that haven't been retracted (the science-wide error rate is mentioned in this blog post).  I don't want to cause unnecessary alarm, but I think it is important to emphasize that time, large sample sizes, and independent validation are needed to become comfortable with using genetic results to guide your medical treatment.

Nebula does provide a warning: "Disclaimer: Nebula Library is for research, information, and educational use only. This information is not medical advice, nor is it intended to be used for any diagnostic purpose. Please seek the assistance of a health care provider with any questions regarding your health."  However, I think this is easy to miss and the importance may not be fully understood among all customers.

Additional Note #1: I tried to provide a review for Nebula on Trustpilot, but I have encountered some difficulties.  You can see a screenshot of the current review (which was not accepted) here.  I still haven't gotten a response to my last attempt to submit the review and get an explanation of what I need to change.  While I am not certain this is the cause, the link that I received to submit a review initially created a review under another name.  To be fair, Nebula did pay me the $10 Amazon gift card (even when I provided a screenshot of a 2-star review, when most are 4- or 5-star reviews), and I hope that this review can eventually be posted (in which case, I will provide a link to that review, instead of this longer explanation).

Additional Note #2: You can see my report to FDA MedWatch (MW5093887) in MAUDE here.  I received an acknowledgement via mail for another report, but I just looked this report up after waiting a while.

Update Log:

8/4/2019 - public post date
8/5/2019 - add update about Nebula research library
8/6/2019 - minor changes
8/10/2019 - add link for 23andMe SNP chip versus WGS concordance
8/14/2019 - minor changes
8/15/2019 - minor changes; add box around APOE variant
8/16/2019 - minor changes
8/17/2019 - revise title (to better emphasize importance, but also presentation of just my own data)
8/16/2019 - minor changes
11/26/2019 - add arrows to Nebula samples in IGV screenshots
1/26/2020 - add screenshot of Trustpilot review
2/24/2020 - mention that Nebula is currently offering higher coverage sequencing
3/19/2020 - add concerns about Nebula Library (to match what I described in my MedWatch report)
3/20/2020 - add links to SNP details for selected library results
3/22/2020 - add link to other PRS post
4/18/2020 - add notes about checking RA post
4/24/2020 - add notes about FDA MedWatch submission
4/26/2020 - minor changes
7/6/2020 - minor changes to the "Polygenic Risk Score" label, for the last part of the blog post
 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.