Charles Warden's Science Blog: Human Low-Coverage Sequencing is Mostly OK for Broad Ancestry and Relatedness

Sunday, August 4, 2019

This is a subset of my notes from my Nebula lcWGS sequencing on GitHub, plus a couple of images from two Genes for Good sections (RFMix with the full probe set, and RFMix with SNP-chip down-sampling):

Ancestry Predictions

Even though I think Nebula should only provide continental ancestry results (kind of like the 1000 Genomes "super-populations"), the ancestry was roughly similar to my other results (indicating that I am mostly European, which is accurate). I also describe the limits of the more specific assignments (for SNP chip data) in another blog post.

Nevertheless, if I use my imputed genotypes for RFMix chromosome painting, I get results that look roughly like my SNP chip analysis (which would be an improvement over the Genes for Good imputed SNPs, but comparable to the much smaller number of Genes for Good observed SNPs):

There is no plot for chrX, in part because there are no imputed genotypes for chrX.

For comparison, this is what the full set of observed Genes for Good probes looks like:

and this is what the larger set of imputed Genes for Good probes looks like:

In other words, I think the Nebula results are similar to (or perhaps slightly worse than) the genotypes that were directly measured for Genes for Good SNP chip probes (which, by the way, are completely free to obtain), but the imputation process also caused some issues with the ancestry for the Genes for Good SNP chip probes.

However, to be fair, the loss of accuracy with imputation does seem to be smaller than the loss from using observed measurements if you decrease the probes and/or reference samples enough. Shown below is the effect of using only 66 reference samples and 15,924 probes (a 20x reduction in the 1000 Genomes unrelated reference set, and an 18x reduction in probes from my Genes for Good SNP chip):

Although, to be fair again, I think the primary problem is arguably the number of reference samples. Take a look at what happens if I use the same number of probes (15,924), but with the "full" set of 1,329 unrelated reference samples:

However, to get something that looks more like the original result, I would argue that you need to increase the number of probes as well. For example, please note a similar plot below with 143,320 probes (a 2x reduction from the starting amount):
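As a quick sanity check on the fold-reductions quoted above, here is a minimal Python sketch that recomputes them from the counts in this post. The starting probe count is not stated directly, so `approx_start_probes` is an assumption back-calculated from the "2x reduction" figure; all variable names are just for illustration.

```python
# Counts quoted in the post.
full_refs = 1329       # "full" set of unrelated 1000 Genomes reference samples
small_refs = 66        # down-sampled reference set
small_probes = 15924   # smallest probe set
mid_probes = 143320    # intermediate probe set, described as a 2x reduction

# Assumption: the starting probe count is back-calculated from the "2x" figure,
# so it is approximate rather than a number reported in the post.
approx_start_probes = 2 * mid_probes

ref_fold = full_refs / small_refs
probe_fold = approx_start_probes / small_probes
mid_vs_small = mid_probes / small_probes

print(f"reference reduction: {ref_fold:.1f}x")
print(f"probe reduction (approx.): {probe_fold:.1f}x")
print(f"intermediate vs. smallest probe set: {mid_vs_small:.1f}x")
```

Under that assumption, the numbers are consistent with the rounded "20x" and "18x" values, and the intermediate probe set is roughly 9 times larger than the smallest one.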

Given that I think this looks kind of similar to the Genes for Good imputed set, I think the original Nebula imputed result (with lcWGS) for broad-level ancestry is a reasonable match to the higher coverage results (all things considered).

Kinship / Identity-By-Descent (Close Family Relationships)

Similar to the IBD estimates posted within the Helix/Mayo GeneGuide GitHub section (since I was only provided a gVCF), I can test overall similarity between the imputed Nebula genotypes, 23andMe (CW23), Genes for Good (GFG), and Veritas WGS (BWA-MEM re-aligned) at 77,072 genomic positions (plotting 1000 Genomes reference samples for comparison):

By this measure, you can also clearly see which samples come from the same individual (me). However, there is a slight drop in the accuracy for the Nebula imputed values (with kinship values between 0.489181 and 0.489226, instead of between 0.499859 and 0.499962):

FAM1  ID1          FAM2  ID2          nsnp   hethet    ibs0         kinship
0     CW23         0     GFG          77072  0.605148  0            0.499962
0     Veritas.BWA  0     GFG          76310  0.605032  0            0.499865
0     Veritas.BWA  0     CW23         76310  0.605032  0            0.499859
0     Nebula       0     GFG          77072  0.584596  7.78493e-05  0.489209
0     Nebula       0     CW23         77072  0.584596  7.78493e-05  0.489181
0     Nebula       0     Veritas.BWA  76310  0.58472   6.55222e-05  0.489226

In other words, there is some loss in the genome-wide similarity using low-coverage Whole Genome Sequencing (lcWGS), but you can still clearly tell which samples all come from the same individual (me).
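To make "still clearly the same individual" a bit more concrete, here is a rough Python sketch that parses the table above and bins each pair by kinship coefficient. It assumes the kinship column is a KING-style coefficient and applies the commonly cited KING cutoffs (above 0.354 for duplicate / identical twin, then 0.177, 0.0884 and 0.0442 for 1st-, 2nd- and 3rd-degree relatives); the function names are hypothetical, and this is not necessarily how the values were interpreted in my original notes.

```python
# Pairwise table copied from the post (whitespace-delimited).
TABLE = """\
FAM1 ID1 FAM2 ID2 nsnp hethet ibs0 kinship
0 CW23 0 GFG 77072 0.605148 0 0.499962
0 Veritas.BWA 0 GFG 76310 0.605032 0 0.499865
0 Veritas.BWA 0 CW23 76310 0.605032 0 0.499859
0 Nebula 0 GFG 77072 0.584596 7.78493e-05 0.489209
0 Nebula 0 CW23 77072 0.584596 7.78493e-05 0.489181
0 Nebula 0 Veritas.BWA 76310 0.58472 6.55222e-05 0.489226
"""

# Assumed KING-style lower bounds on the kinship coefficient, highest first.
THRESHOLDS = [
    (0.354, "duplicate / MZ twin"),
    (0.177, "1st degree"),
    (0.0884, "2nd degree"),
    (0.0442, "3rd degree"),
]


def classify(kinship):
    """Return the first category whose lower bound the coefficient exceeds."""
    for lower, label in THRESHOLDS:
        if kinship > lower:
            return label
    return "unrelated"


def parse(table):
    """Turn the whitespace-delimited table into (ID1, ID2, kinship) tuples."""
    lines = table.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        rows.append((rec["ID1"], rec["ID2"], float(rec["kinship"])))
    return rows


pairs = parse(TABLE)
labels = {(a, b): classify(k) for a, b, k in pairs}
for (a, b), label in labels.items():
    print(f"{a} vs {b}: {label}")

# Gap between the lowest non-Nebula pair and the highest Nebula pair.
nebula = [k for a, b, k in pairs if "Nebula" in (a, b)]
others = [k for a, b, k in pairs if "Nebula" not in (a, b)]
drop = min(others) - max(nebula)
print(f"minimum drop for Nebula pairs: {drop:.6f}")
```

With these cutoffs, all six pairs land in the duplicate category, and the Nebula pairs sit only about 0.01 below the others, which is far from the 0.354 boundary.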

However, if the underlying data is not reliable for traits and health results, then my opinion is that the SNP chip is still the relatively better option (as something that costs less than higher coverage sequencing, while giving ancestry / relatedness results that are at least as good).

That said, to be fair, I think it really would be best if I could see similar examples from those with a different primary (non-European) ancestry. For example, there was this New York Times article about someone whose broad-level ancestry assignments were less accurate than mine (although there was also evidence for improvement over time). To illustrate the concern: if most people were predicted to be mostly European (regardless of their actual ancestry) and most customers were European, then you could have a result that looks good even though it wasn't actually a very good predictor (beyond a baseline, like "assume everybody has European ancestry").
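The baseline concern can be shown with a toy calculation. The sketch below uses a completely hypothetical customer base (the group labels and counts are invented for illustration, not taken from any company or dataset) and a "predictor" that ignores the data and always reports the majority ancestry.

```python
from collections import Counter

# Hypothetical customer base, invented purely for illustration.
truth = ["EUR"] * 90 + ["AFR"] * 4 + ["EAS"] * 3 + ["SAS"] * 2 + ["AMR"] * 1

# Baseline "predictor": ignore the data and always answer the majority label.
majority_label = Counter(truth).most_common(1)[0][0]
predicted = [majority_label] * len(truth)


def accuracy(truth, predicted):
    """Fraction of individuals whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def per_group_accuracy(truth, predicted):
    """Accuracy computed separately within each true ancestry group."""
    totals = Counter(truth)
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)
    return {group: correct[group] / totals[group] for group in totals}


overall = accuracy(truth, predicted)
by_group = per_group_accuracy(truth, predicted)

print(f"overall accuracy: {overall:.2f}")
for group, acc in by_group.items():
    print(f"{group}: {acc:.2f}")
```

In this made-up scenario the overall accuracy looks high, while every non-majority group is misclassified every time. That is why I would want to see per-ancestry results rather than a single overall number.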

Update Log:

8/4/2019 - public post date
8/6/2019 - minor changes
8/15/2019 - minor changes
8/21/2019 - add "human" to the title
9/15/2019 - add warning / note about others with a different main broad ancestry