Sunday, August 4, 2019

My Genome-wide, Broad-Level Super-Population Ancestry was Robust, but I Observed Some False Positives in Smaller or Specific Segments

You can get an idea of the specific (country) assignments for ancestry in the various sub-folders on GitHub as well as sometimes in reports that I uploaded to my personal genome project page.

First, the good news: most companies indicate that I am mostly of European Ancestry, which is correct.

Second, the mixed news: while some findings were correct, I have concerns about emphasizing the more specific ancestry predictions, given what appears to be a non-trivial false positive rate for some of them.  For example, I respectfully believe it is inappropriate for 23andMe to encourage travel destinations based upon their ancestry results.

To some extent, the names themselves sometimes indicate a limit to precision.  For example, if the category is "British & Irish" or "French & German," then you already don't have 1 country for a travel recommendation.  While I do have both British and Irish ancestry (and accordingly, those have the best specific marker evidence), I am a little concerned about the basis of some of the more specific assignments that I currently see.  For example, does overall population size affect the density of likelihood that I had relatives from London?  If so, I think that would be kind of like assuming I live in either LA or NYC because I am from the United States (technically, I do live in the greater LA area, but I was born in Cincinnati and raised in Atlanta - plus, I think this is probably sufficient to make my point).  Also, it looks like that density plot is somewhat inconsistent with the marker status, even within 23andMe.

While this sort of thing may be hard to firmly prove (for example, convergence between companies does not necessarily indicate the result is accurate, which we saw in a different way for my cystic fibrosis result), below I give examples of results that I did or did not consider to be accurate.

While some of these could be correct, I think it may sometimes be best to think of them like "hypotheses".

Positive Examples of More Specific (Relatively Recent) Ancestry


  • AncestryDNA predicted that I had more recent relatives in Tennessee, which is correct (on my mother's side).  However, even that may have had some limits to precision, given that the 1925-1950 interval seems less relevant to what I know.
  • 23andMe predicted that I had relatives living in Kingston Parish less than 200 years ago.  This could be correct.  Based upon my other family members, I can tell that this comes from my father's side with a relatively robust prediction of ~2-3% African ancestry (with large segments on multiple chromosomes).
    • I thought I had heard that my Great-Great-Grandfather (my Grandfather's Grandfather) was supposed to have been born from family that moved from the Caribbean to the United States (but I don't currently have confirmation of that).  
    • I also have consistent reports of Y-chromosome lineage E-M123.  While I am not sure if that is completely consistent with what I have described above, the greater African ancestry could be coming from Great-Great-Grandfather's father's mother's side (and/or his mother's side).


Effect of Filtering 23andMe Ancestry for Results with Higher Confidence Threshold


  • While I believe the above explanation for my African ancestry is plausible, there were 2 other specific ancestry predictions that I didn't think were right (and, in fact, those could be filtered out by increasing the confidence threshold to 90%).
23andMe V3 Chip Ancestry Results (3/21/2019, 50% Confidence)

23andMe V3 Chip Ancestry Results (3/21/2019, 90% Confidence)

As noted in the GitHub notes, the East Asian & Native American and South Asian results go away with the higher confidence threshold (90%, instead of the default 50%).

What is not as clear from the above plots is that I also have notes of my percent Scandinavian ancestry varying from 11% to 3% (both with the V3 chip, at various times), and this is something that I think should have been called "Broadly European" (instead of being assigned to a country that I believe is incorrect).  Accordingly, my Scandinavian ancestry also disappears if I change the confidence threshold.  For other 23andMe customers, note the pull-down in the upper-right of the above screenshots.  That is how you can get the more conservative predictions (even though, in my opinion, it should be the other way around, where you have to opt in for the more speculative results).
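As a rough sketch of why a higher threshold can move a segment from a specific country into a broader category, here is a toy illustration.  To be clear, this is not 23andMe's actual implementation: the hierarchy, the probabilities, and the function name call_segment are all made up for the example.  The idea is simply that each segment gets the most specific label whose summed probability still meets the threshold.

```python
# Toy illustration only: hypothetical hierarchy and made-up probabilities.
# Each population maps to its broader parent category (None = top level).
PARENT = {
    "Scandinavian": "Northwestern European",
    "British & Irish": "Northwestern European",
    "Northwestern European": "European",
    "European": None,
}


def call_segment(probs, threshold):
    """Return the most specific label whose summed probability meets the threshold.

    probs maps specific populations to (hypothetical) probabilities for one segment.
    """
    # Add each specific probability to the population itself and to all of its parents.
    totals = {}
    for leaf, p in probs.items():
        node = leaf
        while node is not None:
            totals[node] = totals.get(node, 0.0) + p
            node = PARENT[node]

    def depth(node):
        # Number of steps up to the top-level category (larger = more specific).
        d = 0
        while PARENT[node] is not None:
            node = PARENT[node]
            d += 1
        return d

    passing = [n for n, t in totals.items() if t >= threshold]
    if not passing:
        return "Unassigned"
    best = max(passing, key=lambda n: (depth(n), totals[n]))
    is_specific = best not in PARENT.values()
    return best if is_specific else "Broadly " + best


segment = {"Scandinavian": 0.55, "British & Irish": 0.40}
print(call_segment(segment, 0.5))  # Scandinavian
print(call_segment(segment, 0.9))  # Broadly Northwestern European
```

In this made-up case, the "Scandinavian" call only exists at the 50% threshold, which is similar in spirit to what I saw when changing the pull-down.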

I can also perform chromosome painting re-analysis with RFMix with 1000 Genomes reference samples, which I have shown below:



The overall picture is still that I am of mostly European ancestry.  You now start to get some small SAS (South Asian) predictions, but I think this is consistent with my general suggestion that the smaller segments are more likely to be false positives.
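If you wanted to apply that suggestion to your own local ancestry output, a minimal sketch might look like the following.  This assumes the segments have already been collapsed into simple (chromosome, start, end, ancestry) tuples, which is not the raw RFMix output format, and the 5 Mb cutoff is an arbitrary value that I chose for illustration (not a validated threshold).

```python
from collections import defaultdict


def summarize_segments(segments, min_mb=5.0):
    """Summarize ancestry fractions, setting aside small segments.

    segments: list of (chrom, start_bp, end_bp, ancestry) tuples (hypothetical format).
    min_mb: arbitrary minimum segment size in megabases (an assumption, not a standard).
    Returns (fractions, flagged), where fractions only counts segments >= min_mb
    and is expressed relative to the total length of all segments.
    """
    total_bp = sum(end - start for _, start, end, _ in segments)
    kept_bp = defaultdict(int)
    flagged = []
    for chrom, start, end, ancestry in segments:
        length_bp = end - start
        if length_bp / 1e6 < min_mb:
            # Small segment: treat as a hypothesis rather than a firm call.
            flagged.append((chrom, start, end, ancestry))
        else:
            kept_bp[ancestry] += length_bp
    fractions = {anc: bp / total_bp for anc, bp in kept_bp.items()}
    return fractions, flagged


# Made-up example segments (not my actual results).
example = [
    ("chr3", 0, 90_000_000, "EUR"),
    ("chr3", 90_000_000, 91_000_000, "SAS"),
    ("chr14", 0, 9_000_000, "AFR"),
]
fractions, flagged = summarize_segments(example)
print(fractions)
print(flagged)
```

The flagged segments are not necessarily wrong, but I think they deserve less emphasis than the large segments.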

Also, all the segments of African ancestry should be coming from my father's side.  So, even though I think the above plot is good for some sense of overall estimates (for large segments), there is some sort of issue with phasing for my large SHAPEIT/RFMix chr14 segments (all the red should be on 1 of my 2 copies of Chromosome 14, similar to my 23andMe results).  However, to be fair, there are chromosome-discordant 50% confidence results in my 23andMe data (with the smaller segments on Chromosome 3), which are on the same chromosome copy for this particular SHAPEIT/RFMix result (although that may also vary with different random seeds on different days).
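One simple way to look for this kind of phasing discordance is to check whether a given ancestry shows up on both copies of the same chromosome.  The sketch below assumes a simplified, hypothetical input of (chromosome, haplotype, ancestry) tuples with the haplotype coded as 0 or 1, and it is only meaningful when you expect that ancestry to come from a single parent (as I do for my African ancestry).

```python
from collections import defaultdict


def chromosomes_with_split_ancestry(segments, ancestry):
    """Return chromosomes where `ancestry` appears on both haplotype copies.

    segments: iterable of (chrom, haplotype, ancestry), haplotype being 0 or 1
    (a hypothetical simplified format, not raw SHAPEIT/RFMix output).
    """
    haplotypes_seen = defaultdict(set)
    for chrom, haplotype, seg_ancestry in segments:
        if seg_ancestry == ancestry:
            haplotypes_seen[chrom].add(haplotype)
    # More than one haplotype for the same ancestry suggests a possible phase switch.
    return sorted(c for c, haps in haplotypes_seen.items() if len(haps) > 1)


# Made-up example (not my actual results).
example = [
    ("chr14", 0, "AFR"),
    ("chr14", 1, "AFR"),
    ("chr3", 0, "AFR"),
    ("chr3", 0, "AFR"),
    ("chr3", 1, "EUR"),
]
print(chromosomes_with_split_ancestry(example, "AFR"))  # ['chr14']
```

A chromosome on that list is not proof of an error, but it is a place where I would want to compare against another phasing result.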

Going back to the official 23andMe results, I purchased an upgraded V5 chip, and you can see those results below.

23andMe V5 Chip Ancestry Results (7/11/2019, 50% Confidence)

23andMe V5 Chip Ancestry Results (7/11/2019, 90% Confidence)


You now get an East Asian and Native American segment that remains with the higher confidence threshold.  However, using the same rationale as the SAS RFMix segments, I think the 0.1% segment on chr3 (surrounded by regions that were filtered with the higher confidence threshold) should receive less emphasis based upon the size of the segment.  So, if you ignore that (or just look at the most common ancestry prediction), the results for the V3 and V5 chips are consistent with each other (and other companies) with the broad conclusion that I am of mostly European ancestry.

On the flip side, I should also have some Spanish ancestry, which I don't see with the 90% confidence threshold.  However, I can see that ancestry with 50% confidence on chromosome 3 - in fact, that segment is estimated to be larger (2.1% versus 1.3%) in a later ancestry estimate.  So, unless that ancestry is being represented in another way (defined less precisely), this could be an example of a false negative with the higher confidence threshold.


Free Alternate Ancestry Prediction Options


  • I describe these in more detail on the 1000 Genomes re-analysis page for my 23andMe data (on GitHub).  However, I provide the general links here:
  • Again, it might be possible to have a false positive from multiple programs.  However, I think it is an overall good thing that you have these free options for re-analysis available.


To be clear, I am defining a difference between ancestry and relatedness.  In 23andMe, these are even in different sections ("Ancestry" versus "Family & Friends").  As mentioned in another blog post (please scroll towards the bottom), I believe the close family predictions should be accurate (even though I got a weird result when I uploaded my 23andMe data to FamilyTreeDNA).

However, to be clear, I think "specific" closely related individual predictions should be accurate (and I could in fact verify predicted relatives up to the range of second cousin on 23andMe and AncestryDNA), and this is different from the more distant "specific" country assignments.  That range matches 23andMe's definition of a "close relative."  However, it is hard for me to assess the accuracy of the confidence estimates for increasingly distant "DNA relative" predictions.

Update Log:

8/4/2019 - public post date
8/6/2019 - minor changes
8/14/2019 - minor changes
8/15/2019 - mention issue of RFMix phasing for African ancestry
8/16/2019 - minor changes
9/15/2019 - change title to just refer to myself
9/16/2019 - add link about DNA.land
10/18/2019 - mention aunt with Turner Syndrome (later removed, along with entire section)
10/22/2019 - minor change
12/2/2019 - add links to impute.me and MySeq
1/27/2020 - add note about Great-Great-Grandfather (later removed)
1/29/2020 - modify notes (based upon what I could verify, even though I will probably have more revisions)
2/1/2020 - further modify notes
2/2/2020 - further modify notes
2/4/2020 - modify content throughout post (including changing the name of the section related to changing the 23andMe confidence thresholds, as well as removing some other details and the section about my mom's chromosome X)
2/5/2020 - additional changes in wording
2/6/2020 - minor changes
3/10/2020 - minor change

 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.