Charles Warden's Science Blog

Sunday, August 4, 2019

Human Low-Coverage Sequencing is Mostly OK for Broad Ancestry and Relatedness

This is a subset of my GitHub notes on my Nebula low-coverage Whole Genome Sequencing (lcWGS) data, along with a couple of images from two Genes for Good sections (RFMix with the full probe set, and RFMix with SNP-chip down-sampling):

Ancestry Predictions

Even though I think they should only provide continental ancestry results (kind of like the 1000 Genomes "super-populations"), the ancestry was roughly similar to my other results (indicating that I am mostly European, which is accurate). Plus, I describe limits on the more specific assignments (for SNP chip data) in another blog post.

Nevertheless, if I use my imputed genotypes for RFMix chromosome painting, I get results that look roughly like my SNP chip analysis (which would be an improvement over the Genes for Good imputed SNPs, but comparable to the much smaller number of Genes for Good observed SNPs):

There is no plot for chrX, in part because there are no imputed genotypes for chrX.

For comparison, this is what the full set of observed Genes for Good probes looks like:

and this is what the larger set of imputed Genes for Good probes looks like:

In other words, I think the Nebula results are similar to (or perhaps slightly worse than) the genotypes that were directly measured for Genes for Good SNP chip probes (which, by the way, are completely free to obtain), but the imputation process also caused some issues with the ancestry for the Genes for Good SNP chip probes.

However, to be fair, the accuracy with imputation does seem to be better than with observed measurements if you decrease the probes and/or reference samples enough. Shown below is the effect of using only 66 reference samples and 15,924 probes (a 20x reduction in the 1000 Genomes unrelated reference set, and an 18x reduction in probes from my Genes for Good SNP chip):

Although, to be fair again, I think the primary problem is arguably the number of reference samples. Take a look at what happens if I use the same number of probes (15,924), but with the "full" set of 1,329 unrelated reference samples:

However, to get something that looks more like the original result, I would argue that you need to increase the probes as well. For example, please note a similar plot below with 143,320 probes (2x reduction in the starting amount):

Given that I think this looks kind of similar to the Genes for Good imputed set, I think the original Nebula imputed result (with lcWGS) for broad-level ancestry is a reasonable match to the higher coverage results (all things considered).
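The down-sampling comparisons above can be reproduced in outline with a few lines of Python. This is only a minimal sketch: the `downsample` helper and the placeholder IDs are hypothetical, the random selection is an assumption (the post does not say how the subsets were chosen), and the starting probe count of 286,640 is inferred from the statement that 143,320 probes is a 2x reduction.

```python
import random

def downsample(items, n, seed=0):
    """Return n items chosen at random without replacement, keeping the original order."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = sorted(rng.sample(range(len(items)), n))
    return [items[i] for i in keep]

# Counts from the post; N_PROBES is inferred from "143,320 probes (2x reduction)".
N_REFERENCE = 1329
N_PROBES = 2 * 143320

# Placeholder identifiers (hypothetical; real IDs would come from the data files).
reference_ids = [f"ref_{i}" for i in range(N_REFERENCE)]
probe_ids = [f"probe_{i}" for i in range(N_PROBES)]

small_reference = downsample(reference_ids, 66)
small_probes = downsample(probe_ids, 15924)

ref_fold = N_REFERENCE / len(small_reference)
probe_fold = N_PROBES / len(small_probes)
print(f"reference: {ref_fold:.1f}x reduction, probes: {probe_fold:.1f}x reduction")
```

The two fold-changes come out at roughly 20x and 18x, which matches the numbers quoted for the 66-sample / 15,924-probe comparison.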

Kinship / Identity-By-Descent (Close Family Relationships)

Similar to the IBD estimates that are posted within the Helix/Mayo GeneGuide GitHub section (since I was only provided a gVCF), I can test overall similarity among the imputed Nebula genotypes, 23andMe (CW23), Genes for Good (GFG), and Veritas WGS (BWA-MEM Re-Aligned) at 77,072 genomic positions (plotting 1000 Genomes reference samples for comparison):

By this measure, you can also clearly see which samples come from the same individual (me). However, there is a slight drop in the accuracy for the Nebula imputed values (with kinship values between 0.489181 and 0.489226, instead of between 0.499859 and 0.499962):

FAM1  ID1          FAM2  ID2          nsnp   hethet    ibs0         kinship
0     CW23         0     GFG          77072  0.605148  0            0.499962
0     Veritas.BWA  0     GFG          76310  0.605032  0            0.499865
0     Veritas.BWA  0     CW23         76310  0.605032  0            0.499859
0     Nebula       0     GFG          77072  0.584596  7.78493e-05  0.489209
0     Nebula       0     CW23         77072  0.584596  7.78493e-05  0.489181
0     Nebula       0     Veritas.BWA  76310  0.58472   6.55222e-05  0.489226

In other words, there is some loss in the genome-wide similarity using low-coverage Whole Genome Sequencing (lcWGS), but you can still clearly tell which samples all come from the same individual (me).
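To make the "same individual" reading of the table concrete, here is a minimal Python sketch that parses the rows above and labels each pair. The cutoffs are the commonly cited KING-style kinship thresholds (above 0.354 for duplicates / identical twins, then 0.177, 0.0884 and 0.0442 for 1st-, 2nd- and 3rd-degree relatives); using them here is my assumption, since the post does not state which cutoffs (if any) were applied, and the helper names are hypothetical.

```python
# Table copied from the post (whitespace-delimited).
KINSHIP_TABLE = """\
FAM1 ID1 FAM2 ID2 nsnp hethet ibs0 kinship
0 CW23 0 GFG 77072 0.605148 0 0.499962
0 Veritas.BWA 0 GFG 76310 0.605032 0 0.499865
0 Veritas.BWA 0 CW23 76310 0.605032 0 0.499859
0 Nebula 0 GFG 77072 0.584596 7.78493e-05 0.489209
0 Nebula 0 CW23 77072 0.584596 7.78493e-05 0.489181
0 Nebula 0 Veritas.BWA 76310 0.58472 6.55222e-05 0.489226
"""

# Assumed KING-style lower bounds, checked from closest to most distant.
CUTOFFS = [
    (0.354, "duplicate / identical twin"),
    (0.177, "1st-degree"),
    (0.0884, "2nd-degree"),
    (0.0442, "3rd-degree"),
]

def classify(kinship):
    """Map a kinship coefficient to a relationship label."""
    for lower, label in CUTOFFS:
        if kinship > lower:
            return label
    return "unrelated or more distant"

def parse_table(text):
    """Parse the whitespace-delimited table into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        rec["kinship"] = float(rec["kinship"])
        rows.append(rec)
    return rows

pairs = parse_table(KINSHIP_TABLE)
for rec in pairs:
    print(rec["ID1"], rec["ID2"], rec["kinship"], classify(rec["kinship"]))
```

Under these assumed cutoffs, the Nebula pairs (about 0.489) are still far above the duplicate threshold, so the drop relative to the other pairs (about 0.500) does not change the call.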

However, if the underlying data is not reliable for traits and health results, then my opinion is that the SNP chip is still the relatively better option (as something that costs less than higher coverage sequencing, while giving ancestry / relatedness results that are at least as good).

That said, to be fair, I think it really would be best if I could see similar examples from those with a different primary (Non-European) ancestry. For example, there was this New York Times article about someone whose broad-level ancestry assignments were less accurate than mine (although there was also evidence for improvement over time). If most people were predicted to be mostly European (regardless of their actual ancestry) and most customers were European, then you could have a result that looks good even though it wasn't actually a very good predictor (beyond a baseline, like "assume everybody has European ancestry").

Update Log:

8/4/2019 - public post date
8/6/2019 - minor changes
8/15/2019 - minor changes
8/21/2019 - add "human" to the title
9/15/2019 - add warning / note about others with a different main broad ancestry

My Genome-wide, Broad-Level Super-Population Ancestry was Robust, but I Observed Some False Positives in Smaller or Specific Segments

You can get an idea of the specific (country) assignments for ancestry in the various sub-folders on GitHub as well as sometimes in reports that I uploaded to my personal genome project page.

First, the good news: most companies indicate that I am mostly of European Ancestry, which is correct.

Second, the mixed news: while there were some findings that were correct, I had concerns about emphasizing a non-trivial false positive rate for some of the more specific ancestry predictions. For example, I respectfully believe it is inappropriate for 23andMe to encourage travel destinations based upon their ancestry results.

To some extent, the names themselves sometimes indicate a limit to precision. For example, if the category is "British & Irish" or "French & German," then you already don't have 1 country for a travel recommendation. While I do have both British and Irish ancestry (and accordingly, those have the best specific marker evidence), I am a little concerned about the basis of some of the more specific assignments that I currently see. For example, does overall population affect the density of likelihood that I had relatives from London? If so, I think that would be kind of like assuming I live in either LA or NYC because I am from the United States (technically, I do live in the greater LA area, but I was born in Cincinnati and raised in Atlanta - plus, I think this is probably sufficient to make my point). Also, it looks like that density plot is somewhat contradictory with the marker status, even within 23andMe.

While this sort of thing may be hard to firmly prove (for example, convergence between companies does not necessarily indicate the result is accurate, which we saw in a different way for my cystic fibrosis result), I have examples of the sort of things which I did or did not consider to be accurate below.

While some of these could be correct, I think it may sometimes be best to think of them like "hypotheses".

Positive Examples of More Specific (Relatively Recent) Ancestry

AncestryDNA predicted that I had more recent relatives in Tennessee, which is correct (on my mother's side). However, even that may have had some limits to precision, given that the 1925-1950 interval seems less relevant to what I know.
23andMe predicted that I had relatives living in Kingston Parish less than 200 years ago. This could be correct. Based upon my other family members, I can tell that this comes from my father's side with a relatively robust prediction of ~2-3% African ancestry (with large segments on multiple chromosomes).

I thought I had heard that my Great-Great-Grandfather (my Grandfather's Grandfather) was supposed to have been born from family that moved from the Caribbean to the United States (but I don't currently have confirmation of that).
I also have consistent reports of Y-chromosome lineage E-M123. While I am not sure if that is completely consistent with what I have described above, the greater African ancestry could be coming from Great-Great-Grandfather's father's mother's side (and/or his mother's side).

Effect of Filtering 23andMe Ancestry for Results with Higher Confidence Threshold

While I believe the above explanation for my African ancestry is plausible, there were 2 other specific ancestry predictions that I didn't think were right (and, in fact, those could be filtered by increasing the confidence threshold to 90%)

23andMe V3 Chip Ancestry Results (3/21/2019, 50% Confidence)

23andMe V3 Chip Ancestry Results (3/21/2019, 90% Confidence)

As noted in the GitHub notes, the East Asian & Native American and South Asian results go away with the higher confidence threshold (90%, instead of the default 50%).

What is not as clear from the above plots is that I also have notes of my percent Scandinavian ancestry varying from 11% to 3% (both with the V3 chip, at various times), and this is something that I think should have been called "Broadly European" (instead of being assigned to a country that I believe is incorrect). Accordingly, my Scandinavian ancestry also disappears if I change the confidence threshold. For other 23andMe customers, note the pull-down in the upper-right of the above screenshots. That is how you can get the more conservative predictions (even though, in my opinion, it should be the other way around, where you have to opt-in for more speculative results).
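The effect of the threshold pull-down can be illustrated with a toy Python example. To be clear, this is not 23andMe's algorithm: the `Segment` class, the fallback labels, and every confidence value below are made up purely for illustration. The sketch only shows the general idea that a specific label is kept when its confidence meets the threshold, and is otherwise replaced by a broader label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    label: str         # specific ancestry label
    broad_label: str   # broader label used when confidence is too low
    confidence: float  # made-up value, for illustration only

def apply_threshold(segments, threshold):
    """Keep the specific label only when confidence meets the threshold."""
    return [
        seg.label if seg.confidence >= threshold else seg.broad_label
        for seg in segments
    ]

# Hypothetical segments (not my real 23andMe values).
segments = [
    Segment("British & Irish", "Broadly European", 0.95),
    Segment("Scandinavian", "Broadly European", 0.60),
    Segment("South Asian", "Unassigned", 0.55),
]

for threshold in (0.50, 0.90):
    print(threshold, apply_threshold(segments, threshold))
```

With these made-up numbers, the "Scandinavian" and "South Asian" labels survive at 50% but fall back to broader labels at 90%, which is the same kind of change that I saw when switching the pull-down.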

I can also perform chromosome painting re-analysis with RFMix with 1000 Genomes reference samples, which I have shown below:

The overall picture is still that I am of mostly European ancestry. You now start to get some small SAS (South Asian) predictions, but I think this is consistent with my general suggestion that the smaller segments are more likely to be false positives.

Also, all the segments of African ancestry should be coming from my father's side. So, even though I think the above plot is good for some sense of overall estimates (for large segments), there is some sort of issue with phasing for my large SHAPEIT/RFMix chr14 segments (all the red should be 1 of my 2 copies of Chromosome 14, similar to my 23andMe results). However, to be fair, there are chromosome-discordant 50% confidence results in my 23andMe data (with the smaller segments on Chromosome 3), which are on the same chromosome for this particular SHAPEIT/RFMix result (although that may also vary with different random seeds on different days).

Going back to the official 23andMe results, I purchased an upgraded V5 chip, and you can see those results below.

23andMe V5 Chip Ancestry Results (7/11/2019, 50% Confidence)

23andMe V5 Chip Ancestry Results (7/11/2019, 90% Confidence)

You now get an East Asian and Native American segment that remains with the higher confidence threshold. However, using the same rationale as the SAS RFMix segments, I think the 0.1% segment on chr3 (surrounded by regions that were filtered with the higher confidence threshold) should receive less emphasis based upon the size of the segment. So, if you ignore that (or just look at the most common ancestry prediction), the results for the V3 and V5 chips are consistent with each other (and other companies) with the broad conclusion that I am of mostly European ancestry.

On the flip side, I should also have some Spanish ancestry, which I don't see with the 90% confidence threshold. However, I can see that ancestry with 50% confidence on chromosome 3 - in fact, that segment is estimated to be larger (2.1% versus 1.3%) in a later ancestry estimate. So, unless that ancestry is being represented in another way (defined less precisely), this could be an example of a false negative with the higher confidence threshold.

Free Alternate Ancestry Prediction Options

I describe these in more detail on the 1000 Genomes re-analysis page for my 23andMe data (on GitHub). However, I provide the general links here:

Again, it might be possible to have a false positive from multiple programs. However, I think it is an overall good thing that you have these free options for re-analysis available.

To be clear, I am defining a difference between ancestry and relatedness. In 23andMe, these are even in different sections ("Ancestry" versus "Family & Friends"). As mentioned in another blog post (please scroll towards the bottom), I believe the close family predictions should be accurate (even though I got a weird result when I uploaded my 23andMe data to FamilyTreeDNA).

However, to be clear, I think "specific" closely related individual predictions should be accurate (and I could in fact verify predicted relatives up to the range of second cousin on 23andMe and AncestryDNA), and this is different from the more distant "specific" country assignments. This matches 23andMe's definition of a "close relative." However, it is hard for me to assess the accuracy of the confidence estimates for increasingly distant "DNA relative" predictions.

Update Log:

8/4/2019 - public post date
8/6/2019 - minor changes
8/14/2019 - minor changes
8/15/2019 - mention issue of RFMix phasing for African ancestry
8/16/2019 - minor changes
9/15/2019 - change title to just refer to myself
9/16/2019 - add link about DNA.land
10/18/2019 - mention aunt with Turner Syndrome (later removed, along with entire section)
10/22/2019 - minor change
12/2/2019 - add links to impute.me and MySeq
1/27/2020 - add note about Great-Great-Grandfather (later removed)
1/29/2020 - modify notes (based upon what I could verify, even though I will probably have more revisions)
2/1/2020 - further modify notes
2/2/2020 - further modify notes
2/4/2020 - modify content throughout post (including changing the name of the section related to changing the 23andMe confidence thresholds, as well as removing some other details and the section about my mom's chromosome X)
2/5/2020 - additional changes in wording
2/6/2020 - minor changes
3/10/2020 - minor change

Emphasis on "Hypothesis-Generation" for Supplement Recommendations

This is a subset of my notes on 4 "Nutrigenomics" companies on GitHub:

To be fair, I am starting with a negative bias. However, I will be very happy if I can convince people that the companies / organizations need to i) share your raw data with you, ii) share all the details for coming to their conclusions, and iii) have some way of publicly sharing trends in their own data, with fair representations of limitations in confidence.

In other words, that is a little different than the results being inaccurate, which is harder to show (and I have to be mindful of my prior expectations, since that can cause me to be overly harsh).

That said, my primary concern with most of the Nutrigenomics results is that they weren't the best way to get your genomic data (and/or they provided risk assessments that I thought were less useful than direct measurements, such as routine blood tests covered by insurance). However, the situation was a little different for Vitagene (who does provide genotype information at 632,150 positions).

While my opinion is that I would probably still prefer 23andMe / Genes for Good / All of Us over Vitagene, I think that can largely be considered a personal preference. However, I think there is one point worth emphasizing more (which I feel more strongly about).

Namely, Vitagene provides supplement recommendations based upon your genotypes (and non-genetic information).

While I did experiment a little with 1 supplement because of these results (and may eventually test 1 or 2 more), I think it was important that I viewed the overall results as something that needed to be critically assessed.

In other words, Vitagene recommended that I take 7 supplements:

Bromelain Quercetin Complex (500 mg): Lifestyle (Joint health and Digestive health)
Probiotics (40 billion CFU): Genetics (31%, risk of Overweight, Hormonal support, Eczema, Allergies and Blood pressure health, based upon 103 variants, all reported to have "Fair" research quality), Lifestyle (Everyday stress and Digestive health), and Goals (Everyday stress and Overweight)
Vitamin D (2000 IU): Genetics (59% Vitamin D Levels, Eczema and Joint health, based upon 36 variants, all reported to have "Fair" research quality), Lifestyle (Everyday stress), and Goals (Everyday stress)
Theanine (200 mg): Lifestyle (Everyday stress), and Goals (Everyday stress)
Iron Free Multivitamin (10 Multi): Lifestyle (Energy and Nutrient intake levels)
Zinc (15 mg): Genetics (50% Overweight, based upon 52 variants, all reported to have "Fair" research quality) and Goals (Overweight)
Chromium (200 mcg): Genetics (83% Hormonal support, Overweight and Blood Sugar Health, based upon 203 variants, all reported to have "Fair" research quality) and Goals (Overweight)

To be clear, I believe I currently have a normal BMI (23.7), but I would like to trim down my gut a little bit. However, I think doing some 8 minute abs and exercise (and avoiding over-eating) is probably going to be more helpful to me than a supplement to lose weight. I only bring this up because "Overweight" appears more than once above.

More importantly, if I over-estimated the ability to accurately make supplement / drug recommendations, I think this could potentially cause harm.

For example, I would be cautious about taking 7 new drugs all of a sudden. While they attempt to give some indication of drug interactions (such as listing a potential drug interaction between Bromelain and Indomethacin), I didn't see a warning about an interaction between L-Theanine (specifically, Nature's Trove L-Theanine) and SSRIs except on the container after ordering the supplement.

Accordingly, I encountered some minor side effects both times that I took it (as I recorded in my PatientsLikeMe profile). To re-iterate, there was a warning on the container (but not from Vitagene) that said "If you are currently taking prescription antidepressants such as MAOIs or SSRIs, consult your physician before taking this product".

I also tested 50 mg of Zinc from Target (which I evaluated on PatientsLikeMe). However, I don't think it really helped with weight loss (after taking it for a little less than 2 weeks).

So, this was not a huge problem for me. However, I would be a little concerned if the typical patient/customer didn't question the full set of recommendations (and/or didn't consult a doctor prior to adding a large number of supplements into their daily routine). It may also be worth mentioning that the supplement I ended up testing (as something novel that I thought could help) did not use any of my genetic information to make that recommendation.

Update (3/8/2020): I submitted an FDA MedWatch report for this collection of blog posts, emphasizing the Vitagene result because of the mild adverse reaction (while providing the FDA with the general link). Identifying information was removed, but you can see what the public version of report MW5092056 looks like in the MAUDE Adverse Event Report database.

Update (4/27/2020 + 5/24/2020): I don't mention it in the above post (since I made the purchase noticeably later), but I think my GitHub notes on Dante Labs may also be relevant (1 of the 3 reports was a "Nutrigenomics" report). I have also submitted an FDA MedWatch report, and I will provide the identifier for that as soon as I know it.

I am not sure why the information about Dante Labs is missing (since that seems important), but you can see that report in MAUDE as MW5094322. You can also see the original draft (without the removed information or reformatted text) here.

Update (5/18/2020): When I thought I heard that Everywell was going to provide COVID-19 tests, I thought that I should report my experiences for what appears to be different metabolite levels for the regular blood draw versus the at-home blood test (MW5094002). I also mentioned that DNA results were provided, but I didn't think they were the best way to get those results and I thought the blood test results should be getting more emphasis. As mentioned toward the beginning, this was one of the results that I described in greater detail on the GitHub Nutrigenomics subfolder.

Update Log:

8/4/2019 - public post date
8/16/2019 - minor changes
8/20/2019 - mention zinc testing
8/26/2019 - fix GitHub link
8/30/2019 - add link for PatientsLikeMe zinc evaluation
11/20/2019 - add Chromium (all 7 supplements were on the GitHub page, but I previously only listed 6 in the blog post)
3/8/2020 - add link to FDA MedWatch report
4/27/2020 - add information about Dante Labs
5/18/2020 - add reference to Everywell FDA MedWatch report
5/24/2020 - add reference to Dante Labs FDA MedWatch report
