Sunday, December 12, 2021

Human Metagenomics Comp 2021/2022: Divergence for Signatures / Scores for Same Company and Same Sample

 This is a subset of content available on GitHub, focusing on samples collected from the same company at the same time.  This also includes code to reproduce plots and statistics.

Again, I took no action based upon any of these reports.



Intuitively, I thought something about the variability for the same sample for the Viome collections seemed high.  I think the plot above might help illustrate this.

You can also see more about the underlying details below:

Company: Signature / Score

Mean

SD

Viome: Active Microbial Diversity (Percentile)

6.5

7.8

Viome: Digestive Efficacy

64.0

9.9

Viome: Gas Production

38.0

11.3

Viome: Gut Lining Health

67.5

0.7

Viome: Gut Microbiome Health

45.0

5.7

Viome: Inflammatory Activity

38.5

3.5

Viome: Metabolic Fitness

28.5

0.7

Viome: Microbiome-Induced Stress

43.5

10.6

Viome: Protein Fermentation

35.0

9.9

thryve: Gut Diversity Score

93.5

0.7

thryve: Gut Wellness Score

81.0

0.0

 To be fair, some Viome signatures / scores may be relatively robust.  Also, in general, every individual claim needs to be investigated independently (and you not should draw conclusions about one analysis based upon what you saw for a different analysis, whether that is good or bad).  However, I think there is evidence to back up my conclusion that you should not follow all recommendations, and treat the overall set of scores/signatures as hypotheses that may or may not be absolutely true for specific sample.

On the original GitHub page, I think there is some additional evidence (looking at measurements for all 3 companies) that increased variability in similar samples makes it harder to detect a noticeable phenotypic difference in the 4th sample.  However, to be fair, I think that is less rigorous than the comparisons that I am making above.


Please click here to return to the overall summary.

Change Log:
12/12/2021 - public post date
12/13/2021 - various wording revisions/corrections
12/14/2021 - change wording for last change log entry; additional revision
5/2/2022 - modify title to match new main blog post

Saturday, February 27, 2021

Why I think the basepaws "Breed Group" should be called an "Ancestry Cluster" (with abstract cluster names)

Part #1: Short Summary:

In the basepaws Facebook group, there are some noticeable issues with results for purebred cats that are not already part of the reference set (as I understand it).  To be fair, I have seen at least one report that was a good result (matching the known breed in a new test cat), but I think independent test datasets tend to be discordant with known purebred cats more often than they are concordant with new purebred cats (in the small fraction of customers where that can be tested).  Also, the majority of cats are mixed breed "Polycats" where there probably shouldn't be any close purebred relatives, and I think the results more-or-less match that expectation (if interpreted correctly, and I have found explaining precisely what I think is "correct" interpretation to be bit of a challenge).

Again, to be fair, basepaws does say "Disclaimer: The Basepaws Breed Index is not a breed test".  However, I think there is still some noticeable confusion.  So, regardless of whether my suggestion is helpful, I think it is safe to say you should not be using basepaws to confirm your cat's breed or disprove your cat's breed, and you may not have closely related cats for the "breed" with the highest estimated breed fraction in chromosome painting (which, in the wording of the current reports, may be "Breed Group" instead of "Breed Index" - either way, this is what you receive when you order the Breed + Health DNA Test, and you can see some type of chromosome painting in the various reports for Bastu).

After having several very helpful discussions, I thought of a possible suggestion that I thought might help: perhaps use a more abstract name for an ancestry group / cluster to make clear that you shouldn't use what is being provided to define breeds.

I think the best example may be the broad "Eastern" ancestry group, which is already in the reports (as a broader category) as well in previous literature on cat genomics.

If calculated accurately, I would expect Siamese cats will tend to have a higher estimate of Eastern contribution (at least for the "Traditional" Siamese cats, as I understand it).  However, I think the Colorpoint Restriction mutation might be what owners might want to check for cats with the characteristic hair color pattern.  For example, you might have to dig around a bit in the comments for this video but I believe "Lynx Point" cats can be defined from 2 markers (the Colorpoint Restriction variants, and Agouti variants for the tabby pattern).

Nevertheless, my point is that I believe giving the ancestry group the "Eastern" name adds a layer of conceptually separation than the "Siamese" (or related) breeds.

So, if technical problems are minimal, then I think this might reduce confusion for purebred cat owners (as well as minimize miscommunication from misunderstanding).  It might also reduce interest, but I would rather plan to have fewer customers than have more customers whose satisfaction might decrease over time.

It might also be worth noting that I believe most imputed genotypes are phenotypically neutral, meaning that they do not cause the traits that are used to define the breeds (although relatedness does define the breed, once it becomes established).

There are some additional details that I will discuss below, but I hope this can help for discussions (instead of me posting several comments, which may or may not be appreciated by all basepaws customers).  I also have a "Change Log" since I hope that I can update this based upon feedback, in order to try and be able to explain my point as precisely as possible.

Part #2: Intermediate-Level Details:

I have a separate blog post describing results (from multiple organizations / companies) for my cat Bastu.

However, I will copy / repeat some points below.  Some hopefully complement the main point for the "Short Summary," but some of these are separate points.

First, I think seeing 3 technical replicates for my 1 cat may be helpful in giving a sense of confidence in the results:

"Confident" Threshold



"Possible" Threshold


All 3 reports above were re-processed at the same time (for an update), but they were sequenced in different batches.

If you look at the broad ancestry groups, then there is some noticeable consistency with the technical replicates.  A mostly "Western" contribution for Bastu also makes sense, and that is also consistent with the her Ancestry Test from UC-Davis.

Please also note that the 1st 2 samples (left and middle) are low coverage Whole Genome Sequencing (lcWGS) and the report on the right is "regular" coverage sequencing (~15x).

For example, initiatively, I would not expect Bastu to be related to any Exotic breeds.  So, in addition to the general issue that accurately genotyped variation can exist before breed creation and confound the results, you can see the "Exotic" contributions go away in the higher/regular coverage sequencing data.

The main reason that I purchased the "Whole Genome Sequencing" kit was that I would get access to the raw data.  Since the price is different now, I will include the analysis part of this file below (without the header listing the original price):

Again, noticed the relatively higher Western contribution is also consistent with my re-analysis.



While I want to make clear that the PCA plots (the dots in the bottom part of the image, which I tried to describe as part "B") come from a relatively limited number of SNPs (from public cat SNP chips).  So, I don't know what the spread looks like if you could look like if you could see the individual points in the basepaws data, but I think plotting a centroid may not be the best way to represent the data (if that is what was done here).

There has also been at least 1 sample mix-up (for a different customer).  However, basepaws offered a free re-test which showed much better results.  So, that can happen, but I am guessing that is rare (and the more common issue is what I am describing in this blog post).  For example, when you look at my 2 lcWGS replicates (equivalent to a paid re-test), one is not really better than the other.  There is just a certain about of noise and limits to the methods (at least currently).

Part #3: Advanced Details:

With my ~15x cat genome data (along with my human data), I tested running imputation of down-sampled reads, which you can see here.  I am also plotting that below:



For reference, I don't have the raw lcWGS FASTQ data, but earlier versions of reports said that technical replicate #1 had 3,488,515 fragments and technical replicate #2 had 2,712,098 fragments.  For comparison, my regular coverage WGS sample had 166,490,724 paired-end reads (I think comparable to the "fragments" in those reports).

Because of the fragments smaller than the read length, my "regular" coverage WGS was closer to 15x than 20x.  I also don't know how many fragments are for lcWGS versus Amplicon-Seq in the "Breed + Health" results.  Nevertheless, I think the lcWGS "fragments" in those earlier reports roughly match an expectation of being 0.1x to 0.5x.

My main goal in the above plot was to test self identification.  With the same number of reads, the Gencove cat imputation worked a little worse for my cat's sample than my human sample (for either Gencove or open-source options for humans), but I don't think that is surprising because I think the human variation is better understood.  Nevertheless, if you follow the Gencove-cat trajectory, then I think there might be some benefits to having ~2x the fragment count (even though I would probably still not consider that to be equivalent to genotypes without imputation).  That said, to be fair, I don't know if some things (like broad ancestry groups) might be detectable with less precision than being able to self identify your own (cat's) sample.  Also, if a relative finder function is added, then Bastu's 2 lcWGS technical replicates can act as a positive control (to gauge how important this measure really was).

Likewise, I believe there is at least the Martin et al. 2021 paper emphasizing a noticeable advantage to using 4-6x WGS (over 0.5x WGS).  For example, there is a noticeable difference for the 0.5x results in Figure 4A (and as well as Table S4 and Table S5).  That is for certain populations, but it seems like that may match what I am seeing as well?

Perhaps most importantly, to give some more examples of selected problems on the human side, you can also see issues with the human lcWGS imputation here at the level of individual genotypes, which I think may be why Nebula now only provides regular/high coverage sequencing for customers.

basepaws uses Amplicon-Seq (not the lcWGS) for the health markers, so I believe those should be thought of as independent results (and the technical replicate discordance is not necessarily predictive of the concordance for the health markers).

My understanding is that basepaws is planning on adding Amplicon-Seq for trait markers in the future.  I think this will be added to the existing report for the "regular" coverage sequencing data.  However, customers will need to keep in mind this can't (or at least shouldn't) be added from the lcWGS.  My understanding is that the additional cost to add those extra Amplicon-Seq markers to previous reports will be minimal, but I think that is worth making clear new libraries need to be prepared to return those results.

It might not be the most important point, but I wanted to acknowledge that I believe what basepaws is doing with the chromosome painting is better than produced got for Bastu's sample with RFMix (which you can see here, for example, if you scroll towards the middle of the post).  However, I thought that was sufficiently far from being a reasonable result that I did not include that in my mini-report of re-analysis that I showed above.  I am also expressing similar concerns for the technical replicates above.

So, in the immediate future, I am not sure if doing something like the ADMIXTURE analysis used for the table and pie charts might help with giving robust results for those broad ancestry groups (which I think should also work with closely related cats, if you used something like plink kinship).  If there were any issued with defining relatedness with the chromosome painting results, then I think that that should be OK with enough reads for 1 genome-wide estimate (and I would rather have a close cat relative finder than a breed index/group that might cause confusion and may not be great at identifying breeds in new test samples, like what I have seen in the Facebook discussion group).  However, either way, that is the explanation for what I presented in my "mini-report" from re-analysis of raw "regular" coverage sequencing data.

Part #4: Acknowledgements:

I can't thank the other basepaws customers enough for their discussions, which have helped me sort through my thoughts.  I also hope that can continue in the future!

The basepaws staff has also provided responses to all of my e-mails.  For example, I think that is how I know the Breed Index/Group (which shouldn't be used to categorize purebred cats) comes from lcWGS and the health markers come from Amplicon-Seq (also verified here).  So, thank you for your helpful feedback, as well as encouraging a community were such discussion can take place!

P.S. I have some on-going notes about known traits (not currently in the basepaws report), but I would be happy to hear more.  This includes things like long hair.

In general, I tend to provide things like links to informational details from the UC-Davis VGL, but I am encouraging you to us the links on the informational page to purchase those separate tests.  For example, I purchased the UC-Davis "Cat Ancestry" panel that includes some trait markers as well the Optimal Selection panel (also mentioned in the earlier blog post).

If you are a breeder, I think part of the goal of Optimal Selection is to identify mating pairs with the maximal possible diversity within the breed.

Maybe it is otherwise a little off-topic, but I mention this because I think people purchasing the Breed + Health kit might in fact be interested in understanding some of the traits that they see in thier cat.

P.P.S. If a relative finder was added and both parents were in the basepaws database, then that would in fact be relevant for confirming that a cat is purebred (as long as there were no methodological issues, or especially low read coverage).  However, that is not currently offered from basepaws, so that is why I am saying that the current breed measures cannot be used to determine your cat's breed (as also said in the disclaimer from basepaws).

Change Log:

2/27/2021 - public post

2/28/2021 - minor changes (including switching wording from "Breed Index" to "Breed Group")

3/6/2021 - add notes for some other common questions and caveats + minor change

3/15/2021 - add lcWGS fragment counts (and linked image)

3/26/2021 - add note to make clear that all 3 reports were re-processed at the same time + add link to Martin et al. 2021 lcWGS paper

3/28/2021 - add extra details / minor re-organization; correct misunderstanding of table

4/18/2021 - add sentence about basepaws sample mix-up

11/21/2021 - add note about 2 markers defining "Lynx Point".

Friday, July 3, 2020

Broad Ancestry Predictions for My Exome Sample

I had a supervisor ask me to help a co-worker with making some ancestry estimates in Exome samples.

I wanted to first test application to my own samples, which I thought could at least help with troubleshooting errors and providing another independent sample for comparison (not used for training the method being tested).

I also thought it might help to put this in a blog post, since I think this can provide a concrete example of what might help in working towards labs knowingly collaborate with each other (where large portions of work may otherwise be done in different labs without their each other's knowledge).

Strategy #1: Use On-Target High Coverage GATK Variants

Based upon some earlier analysis (related to the Mayo GeneGuide analysis, but not previously posted), there is some additional decrease in IBD/kinship from a perfect self-identification for my Genos Exome sample (even though I think it may still be acceptable for some applications, such as self-identification for QC and privacy purposes):

 Sample 1Sample 2 Kinship 
 Veritas WGS Genes for Good SNP chip0.499907 
 Veritas WGS 23andMe SNP chip0.499907 
 Veritas WGSMayo GeneGuide Exome+ 0.499907 
 Veritas WGSGenos Exome (BWA-MEM Re-Aligned, GATK Variants) 0.459216

As you can see in the above link, performance was better for the Mayo GeneGuide gVCF.  However, I was not provided FASTQ files or BAM files (even when I paid for the Exome+ data to get the gVCF), and Mayo GeneGuide has been discontinued (and I would usually have considered the Genos Exome to be preferable overall).  This may be because my Genos Exome target design was for CDS (protein-coding) regions only.

While it is impossible to conduct the 2nd strategy (using off-target reads) with the Mayo GeneGuide Exome+ data, I can test comparing the 2 Exome samples (Genos and Mayo GeneGuide) for ADMIXTURE ancestry analysis:



There is also public code written for a lab for QC Array (SNP chip) application, which is conceptually similar (although my 23andMe and AncestryDNA SNP chip IBD values were higher than my Genos Exome kinship values).  This explains the naming of the table with the 1000 Genomes super-population assignments (to create the ADMIXTURE .pop file).

There is also some code for the above ADMXITURE plots here.  Those results were roughly similar with all 5 of the samples tested (Veritas WGS, Genes for Good SNP chip, 23andMe SNP chip, Genos Exome, and Mayo GeneGuide Exome+) with 2,272 1000 Genomes reference samples and 9,423 genotypes.

For comparison, if I just compare my 23andMe SNP chip to the 1000 Genomes SNP chips (using an increased count of 92,540 probes), then this is what my ADMIXTURE ancestry looks like:




Strategy #2a: Use Off-Target Reads for STITCH lcWGS Analysis

For comparison, you can see this blog post about using lcWGS for self-identification (where ~1 million reads had similar performance to the Genos CDS Exome Kinship/IBD calculation).

This leads to some improvement in the kinship between my own samples (~0.479 compared to my 23andMe and AncestryDNA SNP chips).



This estimate is similar to the on-target results (~85% European ancestry).  However, my AMR percentage was higher (~7% versus <1%), and my EAS percentage was lower (~1% versus 7%).  So, I think this was a little more accurate, but I list some important caveats below.  However, if you are looking for 1 majority ancestry assignment (and placing less emphasis on <10-20% assignments), then they are essentially the same (and using STITCH considerably increases the run-time).

Importantly, you can also see my lcWGS ancestry analysis here (but for chromosome painting rather than ADMIXTURE, where chromosome painting requires more markers), and you can see my full SNP chip results (mostly for 23andMe) here.  So, I was a surprised that the larger number of 23andMe SNPs at the end of the 1st section didn't have an ADMIXTURE EUR percentage closer to 95% in the previous section (more similar to the chromosome painting results) and was even less accurate by that measure.  However, if you say that you can only consider individuals with 1 primary ancestry assignment with a >50% or >70% contribution (for me, EUR), then these results are all compatible.

On top of that, this is not really a fair comparison for ancestry, since the imputations were made using CEU, GBR, and ACB 1000 Genomes reference samples (the set of 286 reference samples used for IBD/kinship self-recovery).

In contrast, these are the results if I only use the GIH population samples as the reference for the imputations, and kinship similar decreases a little (~0.457) and this is what the supervised ADMIXTURE assignments look like (which introduces the expected bias based upon the reference population):



The run-time was also noticeably shorter when I used 1 population for imputation (versus 3 populations for imputation).

That said, if you had a more diverse set of samples to test, then that should also be taken into consideration.  For example, if I guessed the wrong ancestry for imputation, then that can cause some noticeable problems.

While it looks like STITCH off-target read analysis might provide improvement on first impression, the need to know the ancestry for a sample in advance confounds this particular application.  Namely, I would say this re-emphasizes that you need to be careful to read too much into the EAS versus AMR differences that I reported earlier, since I used the most relevant populations for myself and those results are therefore not completely unbiased (think of it like the opposite of the SAS/GIH imputation test).

I ended up using considerably fewer off-target genotypes than I was expecting (>50,000 bp off-target sequence), but you can see that  I did need to use more sequence (closer to the target regions) for the GLIMPSE analysis below.

Strategy #2b: Use Off-Target Reads for GLIMPSE lcWGS Analysis

In the lcWGS self-identification for my own sample, the performance of GLIMPSE was lower than STITCH, but the run-time was noticeably faster and all the 1000 Genome samples were used (so, it was not designed to use a subset of most related populations to improve performance).

Because the run-time was shorter, I tested multiple flanking distances from the coding regions.  This was important because I needed to use sequence closer to the target regions to get self-identification more similar to STITCH:

 Sample 1Sample 2 Kinship 
 GLIMPSE (Genos Off- Target)
50,000 bp flanking
 23andMe SNP chip0.362039
GLIMPSE (Genos Off- Target)
10,000 bp flanking
 23andMe SNP chip0.449494
 GLIMPSE (Genos Off- Target)
 2,000 bp flanking
 23andMe SNP chip0.475468


In all 3 cases above, the number of genotypes imputed by GLIMPSE was identical (91,296 variants).  Unlike STITCH, there was no "PASS" filter for these variants.  However, if this means that the variants were more accurate when more reads were used, then I believe the number of variants that would have had a "PASS" status should have also increased.

If you use the >10,000 bp and >2,000 bp flanking sequence, then you can see my ancestry results below (respectively):

 


I am not sure if it matters that  I had to use sequence closer to the target (coding) regions, but these look similar to the regular on-target variant ancestry results (which were not as computationally intenstive and took up less storage space).

You can also see some more details about the GLIMPSE analysis on my samples here.



Finally, this was my first attempt at estimating ancestry from my Exome samples, so I think that there is likely some room for improvement.  Nevertheless, I hope this is useful as a starting point for discussion (for what I thought was more-or-less the most straightforward application of existing methods).

Disclaimer: I think it is important to emphasize that this amount work may not be enough to say the strategy is completely free of errors, and it is also not enough to say the process should be scaled up for more samples.

However, if this post and the public GitHub code can play a similar role as a public lab notebook like the LOG.txt file for the RNA-Seq differential expression limits test, then that might help with communicating effort (and/or a partial contribution to a process, for both positive and negative results).

Change Log:
7/3/2020 - public post date
7/4/2020 - minor changes (including clarification of summary/discussion)
7/5/2020 - minor changes (including adding link to initialized GLIMPSE GitHub subfolder)
7/6/2020 - minor changes
8/1/2020 - add GLIMPSE results
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.