Friday, July 3, 2020

Broad Ancestry Predictions for My Exome Sample

A supervisor asked me to help a co-worker make ancestry estimates from Exome samples.

I wanted to first test the application on my own samples, which I thought could at least help with troubleshooting errors and provide another independent sample for comparison (not used for training the method being tested).

I also thought it might help to put this in a blog post, since I think it provides a concrete example of what might help labs work towards knowingly collaborating with each other (where large portions of work may otherwise be done in different labs without each other's knowledge).

Strategy #1: Use On-Target High Coverage GATK Variants

Based upon some earlier analysis (related to the Mayo GeneGuide analysis, but not previously posted), there is some additional decrease in IBD/kinship from perfect self-identification for my Genos Exome sample (even though I think it may still be acceptable for some applications, such as self-identification for QC and privacy purposes):

Sample 1      Sample 2                                          Kinship
Veritas WGS   Genes for Good SNP chip                           0.499907
Veritas WGS   23andMe SNP chip                                  0.499907
Veritas WGS   Mayo GeneGuide Exome+                             0.499907
Veritas WGS   Genos Exome (BWA-MEM Re-Aligned, GATK Variants)   0.459216

As you can see above, performance was better for the Mayo GeneGuide gVCF.  However, I was not provided FASTQ files or BAM files (even when I paid for the Exome+ data to get the gVCF), Mayo GeneGuide has been discontinued, and I would usually have considered the Genos Exome to be preferable overall.  The lower Genos kinship may be because my Genos Exome target design was for CDS (protein-coding) regions only.
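For anyone who wants to reproduce this type of self-identification check, a minimal sketch (assuming plink and plink2 are installed, with hypothetical file names) might look something like the code below.  The real comparison also needs the two callsets restricted to shared, confidently genotyped positions, which is skipped here for brevity.

```python
# Hedged sketch of a kinship self-identification check between two of my own
# callsets (file names are hypothetical placeholders).
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Convert each callset to a plink fileset (bi-allelic SNPs only, for simplicity).
run(["plink2", "--vcf", "Veritas_WGS.vcf.gz", "--make-bed", "--out", "veritas"])
run(["plink2", "--vcf", "Genos_Exome_GATK.vcf.gz", "--make-bed", "--out", "genos"])

# Merge the two samples into one fileset (plink 1.9 syntax).
run(["plink", "--bfile", "veritas", "--bmerge", "genos", "--make-bed", "--out", "merged"])

# KING-robust kinship table (written to merged.kin0); a value near 0.5
# indicates the same individual (or identical twins).
run(["plink2", "--bfile", "merged", "--make-king-table", "--out", "merged"])
```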

While it is impossible to conduct the 2nd strategy (using off-target reads) with the Mayo GeneGuide Exome+ data, I can compare the 2 Exome samples (Genos and Mayo GeneGuide) for ADMIXTURE ancestry analysis:



There is also public code written for a lab QC Array (SNP chip) application, which is conceptually similar (although my 23andMe and AncestryDNA SNP chip IBD values were higher than my Genos Exome kinship values).  That code explains the naming of the table with the 1000 Genomes super-population assignments (used to create the ADMIXTURE .pop file).
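For example, a minimal sketch of creating that .pop file (assuming a merged plink fileset containing my sample plus the 1000 Genomes reference samples, and the standard 1000 Genomes panel file; file names are hypothetical) could look like this:

```python
# Build an ADMIXTURE .pop file from the 1000 Genomes super-population labels.
# sample -> super-population (EUR/AFR/EAS/SAS/AMR) from the panel file
panel = {}
with open("integrated_call_samples_v3.20130502.ALL.panel") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.rstrip("\n").split("\t")
        sample_id, super_pop = fields[0], fields[2]
        panel[sample_id] = super_pop

# .pop file: one line per individual in the .fam file; "-" marks samples whose
# ancestry should be estimated (my sample) in a supervised ADMIXTURE run.
with open("merged.fam") as fam, open("merged.pop", "w") as pop:
    for line in fam:
        sample_id = line.split()[1]  # IID is the 2nd column of a .fam file
        pop.write(panel.get(sample_id, "-") + "\n")
```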

There is also some code for the above ADMIXTURE plots here.  Those results were roughly similar for all 5 of the samples tested (Veritas WGS, Genes for Good SNP chip, 23andMe SNP chip, Genos Exome, and Mayo GeneGuide Exome+), using 2,272 1000 Genomes reference samples and 9,423 genotypes.
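The supervised ADMIXTURE run itself is then a single call (a hedged sketch with a hypothetical prefix; in --supervised mode, ADMIXTURE expects the .pop file to share the same prefix as the .bed file):

```python
# Hedged sketch of a supervised ADMIXTURE run with K = 5 super-populations.
import subprocess

K = 5  # AFR, AMR, EAS, EUR, SAS
# --supervised uses merged.pop for the reference labels; -j4 runs on 4 threads.
# Output: merged.5.Q (ancestry fractions per sample in merged.fam) and merged.5.P.
subprocess.run(["admixture", "--supervised", "-j4", "merged.bed", str(K)], check=True)
```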

For comparison, if I just compare my 23andMe SNP chip to the 1000 Genomes SNP chips (using an increased count of 92,540 probes), then this is what my ADMIXTURE ancestry looks like:




Strategy #2a: Use Off-Target Reads for STITCH lcWGS Analysis

For comparison, you can see this blog post about using lcWGS for self-identification (where ~1 million reads had similar performance to the Genos CDS Exome Kinship/IBD calculation).

This leads to some improvement in the kinship for my own samples (~0.479 when compared to my 23andMe and AncestryDNA SNP chips).
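To give a concrete idea of what "off-target" means here, this is a minimal pysam sketch of separating reads that do not overlap the capture targets (file names are hypothetical, and the brute-force interval check is for illustration; in practice something like bedtools intersect -v would be faster):

```python
# Hedged sketch: write reads that do not overlap the target (CDS) regions.
import pysam

# Load the capture target intervals per chromosome (BED is 0-based, half-open).
targets = {}
with open("exome_targets.bed") as bed:  # hypothetical target BED file
    for line in bed:
        chrom, start, end = line.split()[:3]
        targets.setdefault(chrom, []).append((int(start), int(end)))

def overlaps_target(read):
    """Brute-force overlap check against targets on the read's chromosome."""
    for start, end in targets.get(read.reference_name, []):
        if read.reference_start < end and read.reference_end > start:
            return True
    return False

with pysam.AlignmentFile("genos_exome.bam", "rb") as bam, \
     pysam.AlignmentFile("off_target.bam", "wb", template=bam) as out:
    for read in bam:
        if read.is_unmapped or read.reference_end is None:
            continue
        if not overlaps_target(read):
            out.write(read)
```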



This estimate is similar to the on-target results (~85% European ancestry).  However, my AMR percentage was higher (~7% versus <1%), and my EAS percentage was lower (~1% versus 7%).  So, I think this was a little more accurate, although I list some important caveats below.  If you are looking for 1 majority ancestry assignment (and placing less emphasis on <10-20% assignments), then they are essentially the same (and using STITCH considerably increases the run-time).

Importantly, you can also see my lcWGS ancestry analysis here (but for chromosome painting rather than ADMIXTURE, where chromosome painting requires more markers), and you can see my full SNP chip results (mostly for 23andMe) here.  So, I was surprised that the larger number of 23andMe SNPs at the end of the 1st section didn't give an ADMIXTURE EUR percentage closer to 95% (more similar to the chromosome painting results), and that it was even less accurate by that measure.  However, if you say that you can only consider individuals with 1 primary ancestry assignment with a >50% or >70% contribution (for me, EUR), then these results are all compatible.

On top of that, this is not really a fair comparison for ancestry, since the imputations were made using CEU, GBR, and ACB 1000 Genomes reference samples (the set of 286 reference samples used for IBD/kinship self-recovery).

In contrast, these are the results if I only use the GIH population samples as the reference for the imputations: the kinship similarly decreases a little (~0.457), and this is what the supervised ADMIXTURE assignments look like (which introduces the expected bias based upon the reference population):



The run-time was also noticeably shorter when I used 1 population for imputation (versus 3 populations for imputation).

That said, if you had a more diverse set of samples to test, then that should also be taken into consideration.  For example, if I guessed the wrong ancestry for imputation, then that can cause some noticeable problems.

While it looks like STITCH off-target read analysis might provide improvement on first impression, the need to know the ancestry for a sample in advance confounds this particular application.  Namely, I would say this re-emphasizes that you need to be careful not to read too much into the EAS versus AMR differences that I reported earlier, since I used the most relevant populations for myself and those results are therefore not completely unbiased (think of it like the opposite of the SAS/GIH imputation test).

I ended up using considerably fewer off-target genotypes than I was expecting (using the >50,000 bp flanking distance from the target regions), but you can see that I did need to use more sequence (closer to the target regions) for the GLIMPSE analysis below.

Strategy #2b: Use Off-Target Reads for GLIMPSE lcWGS Analysis

In the lcWGS self-identification for my own sample, the performance of GLIMPSE was lower than STITCH, but the run-time was noticeably faster, and all of the 1000 Genomes samples were used (so, it was not designed to use a subset of the most related populations to improve performance).

Because the run-time was shorter, I tested multiple flanking distances from the coding regions.  This was important because I needed to use sequence closer to the target regions to get self-identification more similar to STITCH:

Sample 1                                         Sample 2           Kinship
GLIMPSE (Genos Off-Target), 50,000 bp flanking   23andMe SNP chip   0.362039
GLIMPSE (Genos Off-Target), 10,000 bp flanking   23andMe SNP chip   0.449494
GLIMPSE (Genos Off-Target), 2,000 bp flanking    23andMe SNP chip   0.475468


In all 3 cases above, the number of genotypes imputed by GLIMPSE was identical (91,296 variants).  Unlike STITCH, there was no "PASS" filter for these variants.  However, if the variants were more accurate when more reads were used, then I believe the number of variants that would have had a "PASS" status should have also increased.
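For the different flanking distances, the basic idea is to pad each target interval by the flanking distance, merge overlaps, and then only keep reads that fall outside of the padded regions (i.e., farther than the flanking distance from any target).  This is a hedged stand-in for something like bedtools slop plus bedtools merge, with hypothetical file names:

```python
# Pad target intervals by a flanking distance and merge overlaps; the padded BED
# can then be used to exclude reads within that distance of the targets.
def pad_and_merge(bed_in, bed_out, flank):
    intervals = {}
    with open(bed_in) as f:
        for line in f:
            chrom, start, end = line.split()[:3]
            intervals.setdefault(chrom, []).append((max(0, int(start) - flank), int(end) + flank))
    with open(bed_out, "w") as out:
        for chrom in sorted(intervals):
            merged = []
            for start, end in sorted(intervals[chrom]):
                if merged and start <= merged[-1][1]:
                    merged[-1] = (merged[-1][0], max(merged[-1][1], end))
                else:
                    merged.append((start, end))
            for start, end in merged:
                out.write(f"{chrom}\t{start}\t{end}\n")

# The 3 flanking distances from the table above.
for flank in (50_000, 10_000, 2_000):
    pad_and_merge("exome_targets.bed", f"targets_plus_{flank}bp.bed", flank)
```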

If you use the >10,000 bp and >2,000 bp flanking sequence, then you can see my ancestry results below (respectively):

 


I am not sure if it matters that I had to use sequence closer to the target (coding) regions, but these look similar to the regular on-target variant ancestry results (which were not as computationally intensive and took up less storage space).

You can also see some more details about the GLIMPSE analysis on my samples here.



Finally, this was my first attempt at estimating ancestry from my Exome samples, so I think that there is likely some room for improvement.  Nevertheless, I hope this is useful as a starting point for discussion (for what I thought was more-or-less the most straightforward application of existing methods).

Disclaimer: I think it is important to emphasize that this amount of work may not be enough to say the strategy is completely free of errors, and it is also not enough to say the process should be scaled up for more samples.

However, if this post and the public GitHub code can play a similar role as a public lab notebook like the LOG.txt file for the RNA-Seq differential expression limits test, then that might help with communicating effort (and/or a partial contribution to a process, for both positive and negative results).

Change Log:
7/3/2020 - public post date
7/4/2020 - minor changes (including clarification of summary/discussion)
7/5/2020 - minor changes (including adding link to initialized GLIMPSE GitHub subfolder)
7/6/2020 - minor changes
8/1/2020 - add GLIMPSE results

Wednesday, May 6, 2020

Opinions Related to Gencove Pre-Print Comment


Because a pre-print comment is somewhat formal, I thought that I should separate my opinions from the main feedback.

So, I decided to put those in a blog post.  You can see my pre-print review/comment here, and these are the extra comments:

General Notes / Warnings (Completely Removed from Comment):

My Nebula lcWGS results were OK for some things (like relatedness and broad ancestry), but I found the Gencove accuracy to be unacceptable for specific variants (for myself).

While Nebula has changed to only provide higher coverage sequencing, I previously submitted an FDA MedWatch report for my own data (for the lcWGS Gencove results).

To be fair, there are also general limits to the utility of most of the Polygenic Risk Scores that I was able to test with my own data (with some informal notes in this blog post).  So, while true, mentioning that I still had concerns about the percentiles that I saw from Nebula (even with the higher coverage sequencing data) may be less relevant.

Similarly, while I want to encourage other customers to report anything they find to MedWatch (and/or PatientsLikeMe, etc), I also want to acknowledge my own limitations that this general warning is more about specific issues that I found for myself.  For example, it may help to have an independent analysis with larger sample sizes to gauge my general PRS concerns and/or be more specific in terms of which specific PRS do or do not have clinical utility with sufficient predictive power for the disease association.

Specific Comment #2) I think my own result might match the imputed correlation that is described (in terms of having ~90% accuracy).  However, I would say that is unacceptable for making clinical decisions, especially since more accurate genotypes can be defined.  It is important to be transparent and not over-estimate accuracy, so I think that part is good.  I also realize that something unacceptable for individual variants can be acceptable for other applications.  However, I think something about limits should be mentioned for the general audience, even if those limits really apply to the same Polygenic Risk Scores in higher coverage sequencing data.

I am not sure if this matters for this particular project, but I have found that it is not unusual to learn about something that may contradict an original funding goal.  I have certainly noticed that it can take me a while to realize I need to question some original assumptions, but sharing those experiences is extremely valuable to the scientific community (if the conclusions then shift to helping others avoid similar mistakes).  I also realize prior assumptions can be hard to overlook in comments/reviews as well, and there is definitely more that I can learn.  Given that you have a pre-print and there are a lot of details in the supplemental information and external files, I think that is a good sign.

Specific Comment #3) In the future, I hope that this is also the sort of thing that precisionFDA, All of Us, etc. can help with.  In fact, as an individual opinion, this makes me wonder if the SBIR funding mechanism might be able to help with directly providing generics through non-profits (especially for genomics diagnostics).  However, I don’t think that means SBIR for-profit funding would have to be completely ended to preferentially fund non-profits, and I realize that probably can’t affect this particular paper.

If the Gencove code isn’t public, then I am not sure how you could show others could reproduce a freeze of the code before testing application to new samples.  Nevertheless, I applaud that you provided some code for the publication.

Specific Comment #4) There may be a way to revise the current manuscript without adding the independent (public) test data and/or the open-source alternatives.  For example, I don’t think you need additional results for your effective coverage section, but I am more interested in the concordance measures.  If the Gencove / STITCH / GLIMPSE / IMPUTE results are similar in terms of technical replicate concordance (for the same 1000 Genomes samples), then I think that you could skip what is described for specific comment 3) for this paper.

I also noticed that the competing interests statement was in the past tense for the present employees (as I understand it).

Summary: I think the potential for lcWGS to cause additional genomic data types to be considered identifiable information is important (which I discuss in a different blog post).


Change Log:

5/6/2020 - public post

Thursday, April 30, 2020

Personal Thoughts on Collaboration and Long-Term Project Planning: Reproducibility and Depositing Data / Code

I believe that I broadly need to improve my explanations of the need / value of depositing data and providing code for the associated paper (even though that takes additional time and effort).

This is already a little different than the other sections, since it is more of a question than a suggestion.

Nevertheless, as an individual, this is what I either currently do or I need to learn more about:
  • I am actively trying to better understand the details for proper data deposit for patient data (even though I have previously assisted with GEO and SRA submissions).
    • For example, I am trying to understand how patient consent relates to the need to have a controlled-access submission (even if that increases the time necessary to deposit data, or that certain projects should not be funded if the associated data cannot be deposited appropriately).  So, being involved with a successful dbGaP submission would probably be good experience.
    • I thought the rules were similar for other databases (like ArrayExpress, ENA, EGA, etc.).
    • However, if you know of other ways to appropriately deposit data, then I would certainly be interested in hearing about them!
  • If possible, I always recommend depositing data (and you can see several papers where we did in fact do that), but I think different expectations would need to be set for supporting code (hence I have said things like "I cannot provide user support for the templates").
    • This is not to say I don't think code sharing is important.  On the contrary, I think it is important, but you have to plan for the appropriate amount of time to carefully keep track of everything needed to reproduce a result.
    • Also, if there are a lot of papers where code has not been provided in the past, then I have to work on figuring out how to explain the need to share code (and to spend more time per project, thus reducing the total number of projects that each lab/individual works on).
  • I think it would also be best if I could learn more about IRB/IACUC protocols (for both human and other animal studies).

In terms of how I can potentially emphasize the importance of data deposit and code sharing, you can see my notes below.  However, if you have other ideas about how to effectively and politely encourage PIs to deposit data and plan for enough effort to provide reproducible code (and/or help review boards not approve experiments producing data that can't be deposited), then I would certainly appreciate hearing about other experiences!

  • Even if it is not caught during peer review, I think journal data sharing requirements can apply for post-publication review?
  • The NIH has Genomic Data Sharing (GDS) policies regarding when genomics data is expected to be deposited.
    • There is also additional information about submitting genomic data, even if the study was not directly funded by the NIH.
    • There is also information about the NIH data sharing policies here.
  • While it mostly emphasized the expectation for data sharing with grants that are greater than $500,000, the NIH Data Sharing Policy and Implementation also mentions the need to make code-related information available for reproducibility under "Data Documentation".
  • I also have this blog post on the notes that I have collected about limits to data sharing, but that is more about limiting experiments than data deposit for an experiment that has already been conducted.
  • Eglen et al. 2017 has some guidelines regarding sharing code.
  • This book chapter also discusses data and code sharing in the context of reproducibility.

Change Log:

4/30/2020 - public post
8/5/2020 - public post

Notes on Limits for Data Sharing

This overlaps with my post showing low-coverage sequencing data was identifiable information (with my own data).  However, I thought having a separate post to keep track of details still had some value.
  • Institutional Certification is required for patient data.  While some data collected before January 25th, 2015 can be deposited under controlled access without "explicit consent", this is not true for more recently collected samples.
    • For this reason, I would recommend not approving genomics studies with samples collected after this point, if such consent was not obtained (either in the original protocol, or in an amended protocol).
    • This also makes it important to get amendments to your IRB protocols, when you make changes.
    • The website can change over time.  In the event that the current website does not make clear that this applies to cell lines, you can see more explicit mention of cell lines here.
      • I believe the earlier website used the same language as the subheader on this form, saying "data generated from cell lines created or clinical specimens collected".
  • This means that you should not be able to create cell lines using samples collected more recently without "explicit consent" for either public or controlled access data deposit, since it will be extremely hard to enforce the appropriate use of the data after you share the cell lines with other labs.
    • I think that is consistent with what is described in this article, which says "[consent] should be requested prior to generation" for cell lines.
      • The NIH GDS Overview says "For studies using cell lines or clinical specimens created or collected after [January 25th, 2015]...Informed consent for future research use and broad data sharing should have been obtained, even if samples are de-identified".
      • The NIH GDS FAQ also says "NIH strongly encourages investigators to transition to the use of specimens that have been consented for future research uses and broad sharing."
      • Additionally, the GEO human subject guidelines say "[it] is your responsibility to ensure that the submitted information does not compromise participant privacy[,] and is in accord with the original consent[,] in addition to all applicable laws, regulations, and institutional policies" (with or without NIH finding).
      • Plus, the NIH GDS FAQ says "investigators who download unrestricted-access data from NIH-designated repositories should not attempt to identify individual human research participants from whom the data were obtained".
    • HeLa cell lines were not obtained with the appropriate consent.  I believe that is why there is a collection of HeLa dbGaP datasets, since they are supposed to be deposited through a controlled access mechanism.  This is not always mentioned on the vendor website, and this is not always immediately enforced.  However, post-publication review applies to datasets and products (as well as papers, which can be corrected or retracted).
      • In terms of HeLa cells, the genomic data is strictly expected to be deposited as controlled access, as explained in this policy.
    • If there is a way to check consent for cell lines, then I would appreciate learning about that.
    • As far as I know, the only cell lines that are confirmed to have consent to generate genetically identifying data to release publicly are those from the Personal Genome Project participants.  However, again, I would be happy to hear from others.
    • The ATCC website says "Genetic material deposited with ATCC after 12 October 2014 falls under the Convention on Biological Diversity and its Nagoya Protocol...It is the responsibility of end users that these undertakings are complied with and we strongly recommend that customers refer to this prior to purchase."
      • My understanding is that the United States has not joined this agreement.  However, I hope that this matches the spirit of other rules or guidelines from the NIH and HHS.  If I understand everything correctly, I also hope the US joins at a later point in time.
  • In general, I think work done with low-coverage sequencing data can show that a lot of genomic data can be identifiable (which I think matches the need for controlled access and justification for not being allowed to create a cell line without the appropriate consent).
  • There is also this Blay et al. 2019 article describing kinship calculations with RNA-Seq data, also confirming the expectation that the raw FASTQ files contain identifiable information for most common RNA-Seq libraries.
    • The NIH GDS FAQ also includes "transcriptomic" and "gene expression" data as covered under GDS policies.
  • I believe the above points may relate to the 2013 Omnibus rule, connecting the GINA and HIPAA laws.  As I understand it, I think you can find an unofficial summary here.
    • I believe that also matches what is described in this link from the Health and Human Services (HHS) website (if it is related to a health care provider).
    • There are general HIPAA FAQ for Individuals here, including a description of the HIPAA privacy rule here that explains HIPAA is intended to "[set] boundaries on the use and release of health records".
    • The links most directly above are from Health and Human Services (HHS).  However, in the research context, this article mentions the importance of taking genetic information into consideration with HIPAA/PHI/de-identification (which recommends controlled access if there is not appropriate consent for public deposit, since some raw genomic data may not be able to be truly de-identified).
    • At least for someone without a legal background like myself, I think "Under GINA, genetic information is deemed to be ‘health information’ that is protected by the Privacy Rule [citation removed] even if the genetic information is not clinically significant and would not be viewed as health information for other legal purposes." from Clayton et al. 2019 might be worth considering.
    • In other words, I believe that there are both NIH and HHS rules/guidelines that require or recommend care needs to be taken for patient genomic data.
  • I think some of the information from the Design and Interpretation of Clinical Trials course from Johns Hopkins University is useful.
    • Even in the research setting, the document from that course for the "Common Rule" includes "Identifiable private information" in the definition of "Human Subject" Research.
    • In the HIPAA privacy rule booklet for that course, it also says "For purposes of the Privacy Rule, genetic information is considered to be health information."  You can also see that posted here.

There are certainly many individuals (at work, as well as at the NIH, NCI, etc.) that have been helping me understand all of this.  So, thank you all very much!

Change Log:

4/30/2020 - public post
7/30/2020 - updates
8/5/2020 - updates
7/9/2021 - add information about RNA-Seq kinship
8/12/2021 - add information about Personal Genome Project cell lines and ATCC / Nagoya Protocol; formatting changes in main text and change log
8/17/2021 - add GDS FAQ and NIH HeLa notes
8/19/2021 - add GEO note
8/27/2021 - add HIPAA notes
11/23/2021 - add HIPAA notes
5/27/2022 - add cell line institutional certification notes
1/15/2023 - add PLOS Computational Biology reference link related to HIPAA/PHI
1/16/2023 - add Common Rule reference from JHU Coursera course + Clayton et al. 2019 reference
1/28/2023 - add note to make link from HHS page more clear + minor formatting changes

Sunday, March 8, 2020

Testing Limits of Self-Identification / Relatedness using Genomic FASTQ Files

Color asked me to sign a HIPAA release in order to get access to raw genomic data, which included a FASTQ file with ~15,000 reads.  So, I thought it might be useful to get an idea about how few reads (from random low-coverage Whole Genome Sequencing) can be used to identify myself.

To be clear, I think most rules are meant to take possible future advances into consideration.  So, just because I can't identify myself doesn't mean somebody else can't identify me with fewer reads (with current or future methods).  Nevertheless, if I can identify myself, then I think there is a good chance others could probably identify themselves with a similar number of reads and/or variants (and possibly fewer reads/variants).

Down-Sampling 1000 Genomes Omni SNP Chip Data

I compared relatedness estimates for myself with the following genotypes:  1) Veritas WGS, 2) 23andMe SNP chip, 3) Genes for Good SNP chip, and 4) Nebula lcWGS (along with the matching positions from the 1000 Genomes Omni SNP chip).

Perhaps more importantly, I also show kinship/relationship estimates (from plink) for 1000 Genomes samples for parent-to-child relationships as well as more distant relationships:



As you can see, there is a bit more variability in the parent-to-child estimates with a few thousand variants.  The self-identification estimates (among pairs of my 4 samples) were always greater than 0.45, but there is noticeable overlap in the kinship estimates for 1000 Genomes parent-to-child and more distant relatives when you drop down to only using 19 variants.

So, making sure you didn't get false positives for close relationships may be important, particularly with smaller numbers of variants.  If you have SNP chip or regular Whole Genome Sequencing data, then identifying yourself would also be easier than having 2 low-coverage Whole Genome Sequencing datasets.
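As a rough sketch of how this type of down-sampling could be run (hedged; the exact commands and variant filtering for the plot above may have differed, and file names are hypothetical), plink can randomly thin the merged Omni-position fileset to smaller and smaller variant counts before re-estimating relatedness:

```python
# Hedged sketch: randomly thin the variant set and re-run the plink IBD calculation.
import subprocess

for n_variants in (100_000, 10_000, 1_000, 100, 19):
    out_prefix = f"kinship_{n_variants}"
    subprocess.run(["plink", "--bfile", "merged_omni",
                    "--thin-count", str(n_variants), "--seed", "0",
                    "--genome",  # IBD / PI_HAT estimates for all sample pairs
                    "--out", out_prefix], check=True)
```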

However, if I can get 1000s (or perhaps even 100s) of variant calls, I am currently most interested in how accurate those calls can be.

Gencove and STITCH Imputed Self-Identification

I have earlier posts showing that the Gencove imputed variants from Nebula were not acceptable for individual variant calls, but I think they provided reasonable broad ancestry and relatedness results.  To be fair, I don't believe Nebula is currently providing low coverage Whole Genome Sequencing results anymore, opting for much higher coverage (like regular Whole Genome Sequencing).  However, Color provided me with considerably fewer lcWGS reads than Nebula (and Color also has a pre-print about lcWGS Polygenic Risk Scores that I was concerned about).

So, I was interested in testing what imputed variants I could get if I uploaded FASTQ files for Gencove analysis myself (as well as an open-source option called STITCH).
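Since the question throughout this post is how few reads are needed, the FASTQ files also need to be randomly down-sampled to specific read counts before imputation.  This is a small, self-contained sketch of that step (reservoir sampling over 4-line FASTQ records); in practice something like seqtk sample with a fixed seed is more practical (especially for keeping R1/R2 files in sync), and the file names and read count below are placeholders:

```python
# Randomly down-sample a gzipped FASTQ to a fixed number of reads.
import gzip
import random

def sample_fastq(path_in, path_out, n_reads, seed=0):
    random.seed(seed)
    reservoir = []  # note: holds n_reads records in memory
    with gzip.open(path_in, "rt") as f:
        record_index = 0
        while True:
            record = [f.readline() for _ in range(4)]
            if not record[0]:
                break
            if len(reservoir) < n_reads:
                reservoir.append(record)
            else:
                j = random.randint(0, record_index)
                if j < n_reads:
                    reservoir[j] = record
            record_index += 1
    with gzip.open(path_out, "wt") as out:
        for record in reservoir:
            out.writelines(record)

sample_fastq("lcWGS_R1.fastq.gz", "lcWGS_1M_R1.fastq.gz", n_reads=1_000_000)
```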



There is also more information about running STITCH (as well as more statistics for Gencove variant concordance) within this subfolder (and this subfolder/README) on the human GitHub page.  Essentially, the performance of the human lcWGS looks good at 0.1x (if not better than the earlier Gencove genotypes that were provided to me from Nebula), but there is a drop in performance with the cat lcWGS.
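The concordance statistics themselves boil down to comparing genotype calls at shared positions between an imputed VCF and a SNP chip VCF.  This is a rough, hedged sketch of that comparison (single-sample VCFs, hypothetical file names, and no handling of strand flips or REF/ALT swaps):

```python
# Hedged sketch of genotype concordance between two single-sample VCFs.
import pysam

def genotype_dict(vcf_path):
    """Collect genotypes keyed by (chrom, pos, ref, alt) for the first sample."""
    calls = {}
    with pysam.VariantFile(vcf_path) as vcf:
        sample = list(vcf.header.samples)[0]
        for rec in vcf:
            gt = rec.samples[sample]["GT"]
            if gt is None or None in gt:
                continue  # skip missing genotypes
            alt = rec.alts[0] if rec.alts else None
            calls[(rec.chrom, rec.pos, rec.ref, alt)] = tuple(sorted(gt))
    return calls

imputed = genotype_dict("stitch_imputed.vcf.gz")        # hypothetical file name
chip = genotype_dict("23andMe_chip_positions.vcf.gz")   # hypothetical file name

shared = set(imputed) & set(chip)
matches = sum(imputed[key] == chip[key] for key in shared)
print(f"{len(shared)} shared genotypes, {matches / len(shared):.1%} concordant")
```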

I ran the STITCH analysis on a local computer, so the run-time was longer than Gencove (between 1 day and 1 week, depending upon the number of reference samples - hence, I would start with running STITCH with ~99 reference samples in the future).  However, if you were willing to pay to run analysis on the cloud (or use more local computing power), I think the run-time would be more similar if each chromosome was analyzed in parallel.  Also, STITCH is open-source, and doesn't have any limits on the minimum or maximum number of reads that can be processed.  The performance also looks similar with ~5 million 100 bp paired-end reads, so the window for more accurate results that can be returned from Gencove may be around 2 million reads.  So, I think using STITCH can have advantages in a research setting.

I welcome alternative suggestions of (open-source) methods to try, but would tentatively come up with these suggestions (for 100 bp paired-end reads, with random / even coverage across the genome):

greater than 1 million reads: good chance of self-identification

0.1 - 1 million reads: intermediate chance of self-identification (perhaps similar to patient's initials, if it narrows down a set of family members?).  Potentially "good" chance of self-identification with other methods and/or future developments.

less than 0.1 million reads: respect general privacy and allow for future improvements, but additional challenges may be encountered.  There still may also be sensitive and/or informative rare variants.

I also added the results for GLIMPSE lcWGS imputations (Rubinacci et al. 2020).  These are human results, but the performance was a little lower than STITCH (more similar to the Gencove results for my cat, but lower than the Gencove results for myself).  However, it probably should be noted that I did not specify my ancestry for GLIMPSE but I did specify a limited number of populations for STITCH.  So, if you don't know the ancestry (or the ancestry used might confound the results), then that loss of concordance may be OK.  Also, I think the GLIMPSE run-time was shorter (within 1 day) and it used the full set of 1000 Genomes samples as the reference set.

Recovery/Observation of Variants in 1 Read

Even if I don't exactly know what is most likely to be able to self-identify myself, I can try to get some idea of the best-case scenario in terms of even having 1x coverage at a potentially informative variant position.

The numbers here are a little different than the 1st section: I am looking for places where my genome varies from the reference genome, and I am considering a larger number of sites.  Nevertheless, I was curious about roughly how many reads it took to recover 500 or 1000 variants from a couple variant lists:



Notice the wide range between recovering any SNPs versus recovering a set of potentially informative SNPs.  However, it looks like you may very roughly notice problems with self-identification (for even future methods) with less than ~250,000 reads (matching the STITCH results above).  This is noticeably more than the ~15,000 reads that I had to sign a HIPAA release to get from Color, but the lower limit (for any SNPs, called "WGS SNPs" with the gray line) was ~15,000 reads (very similar to what I was provided from Color, albeit single-end instead of paired-end).

In reality, when calling variants from 1 read, you also have to consider sequencing error and a false discovery rate, the nucleotide distribution at each position is not random between the 4 nucleotides, there is linkage (non-independence) between variants, and you have 2 copies of each chromosome at each position in the genome reference.  However, if you over-simplify things and ask how many combinations of 4 nucleotides (or even 2 nucleotides) create more unique sequences than the world population, that is noticeably less than the 500 or 1000 thresholds added to the plot above.
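For what it is worth, that over-simplified uniqueness calculation is short enough to show directly (using a world population of roughly 7.8 billion, and ignoring linkage, allele frequencies, diploidy, and sequencing error, as noted above):

```python
# Smallest number of independent positions whose combinations exceed the world population.
import math

world_population = 7.8e9  # approximate world population in 2020

for n_states in (4, 2):
    n_positions = math.ceil(math.log(world_population, n_states))
    print(f"{n_states} states: {n_positions} positions -> {n_states ** n_positions:,} combinations")
# 4 states -> 17 positions; 2 states -> 33 positions (both well under 500-1000).
```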

So, if you consider rare SNPs (instead of calculating a relatedness estimate, with common SNPs), perhaps you could identify yourself with less than 50,000 reads?  Either way, if you give some rough estimate allowing for future improvements in technology, I would feel safe exercising extra caution with data that has at least 100,000 reads (collected randomly / evenly across the genome).  I also believe that erring on the side of caution for data with fewer reads is probably wise as a preventative measure, but I think the possible applications for that data is lower.

Closing Thoughts

If anybody has knowledge of any other strategies, I am interested in hearing about them.  For example, I think there may be something relevant from Gilly et al. 2019, but I don't currently have a 3rd imputation benchmark set up.  I have also tried to ask a similar question on this Biostars discussion, since it looks like Gencove is no longer freely available (even though I was able to conduct the analysis above using a free trial).  I have all of this data publicly available on my Personal Genome Project page.

I am also not saying that it is not important to consider privacy for samples with less than any of the numbers of reads that I mention above (with or without a way to self-identify myself with current methods).  For example, the FASTQ files have information about the machine, run, and barcode in them (even with only 1 read).  So, if the consumer genomics company had a map between samples and customers, then perhaps that is worth keeping in mind for privacy conversations.  Likewise, if the smaller number of variants includes disease-related variants, perhaps that is also worth considering.

I don't want to cause unnecessary alarm: as mentioned above, I have made my own data public.  However, you do have to take the type of consent into consideration when working with FASTQ files (for data deposit and data sharing).  For example, you currently need "explicit consent" for either public or controlled access of samples collected after January 25th, 2015.

Finally, I would like to thank Robert Davies for the assistance that he provided (in terms of talking about the general idea, as well as attempting to use STITCH for genotype annotations), which you can see from this GitHub discussion.  I would also like to thank several individuals for helping me learn more about the consent requirements for data deposit.

Additional References

I am interested to hear feedback from others, potentially including expansion of this list.

However, if it might help with discussion, here are some possibly useful references:

Selected Genomic Identifiability Studies (or at least relevant publications):
  • Sholl et al. 2016 - Supplemental Methods describe using 48 SNPs to "confirm patient identity and eliminate sample mix-up and cross-contamination".
  • McGuire et al. 2008 - article describing genomics and privacy with some emphasis on medical records
  • Oestreich et al. 2021 - article generally discussing genomics identifiability and privacy
  • Ziegenhain and Sandberg 2021 - in theory, considers methodology to provide anonymized processed data.  I have not tried this myself, but this could at best maximize downstream analysis.  That might be useful in that it expands processed data that could be shared with caveats.  However, some things require accurate sequencing reads as unaltered raw data.  Modified sequences should not be represented as such "raw" data.
  • Wan et al. 2022 - article generally discussing genomics identifiability and privacy
  • Russell et al. 2022 - instead of sequencing coverage (such as in this blog post), this preprint describes the impact of the amount of starting DNA material on microarray genotyping for forensics analysis.
    • Kim and Rosenberg 2022 - preprint describing characteristics affecting identifiability for STR (Short Tandem Repeat) analysis
  • Popli et al. 2022 - a preprint describing kinship estimates in low coverage sequencing data (and the amount of data for relatedness estimates is the topic for most of the content in this blog post)
While related to the more general topic, I think the goals of Lippert et al. 2017 and Venkatesaramani et al. 2021 are somewhat different than what I was trying to compare (including image analysis).

I also have some notes in this blog post, some of which are from peer reviewed publications and some from other sources (such as NIH and HHS website).  Again, I would certainly like to learn more.

In addition to the publications for STITCH/GLIMPSE/Gencove, other imputation / low-coverage analysis studies include Martin et al. 2021, Emde et al. 2021, and GLIMPSE2.  Hanks et al. 2022 also compares microarray genotyping and low coverage imputation to Whole Genome Sequencing.  If it is expected that low coverage analysis includes enough markers to be useful, then I think you are either directly or indirectly saying that level of coverage is sufficient to identify the individual (which I think is a criterion that is likely easier to meet than clinical utility).

I certainly don't want to cause any undue concern.  I think some public data is important for the scientific community, but I think it is appropriate for most individuals to agree to controlled access data sharing.  Nevertheless, I think this is an important topic, which might need additional communication in the scientific community.

Change Log:

3/8/2020 - public post
3/9/2020 - add comment about simplified unique sequence calculation
4/8/2020 - add STITCH results + minor changes
4/9/2020 - minor changes
4/16/2020 - minor change
5/1/2020 - add link for GLIMPSE (before analysis)
7/28/2020 - add GLIMPSE results
1/15/2023 - add additional references for other studies
3/16/2023 - add 48 SNP verification

Thursday, January 16, 2020

Converted Twitter Response: Fraud versus Errors


There was a Twitter discussion emphasizing how identifying problems in papers shouldn't be thought of as an attack on science (and it is important not to undermine things that have been shown more robustly).

In particular, an estimate of the science-wide fraud rate of 5-10% was suggested (and I have converted my multi-part Twitter responses into a blog post).  I think 5-10% is roughly in the ballpark of a 2% estimate mentioned in a book called “Fraud in the Lab” (which I really liked).  That is also in a similar range to the 5% loss for businesses reported by the Association of Certified Fraud Examiners (which I learned about from reading "Talking to Strangers").  I would like to think the 2% fraud rate is more accurate.  However, this report makes it look like fraud in other situations often goes unpunished, and there are costs to finding and correcting fraud.

However, I think intentions are important.  For example, I think the error rate is higher than the fraud rate.  I mention a couple citations with very different estimates in this blog post (where I hope the 15-25% error rate is more accurate, compared to considerably higher estimates).

In other words, if somebody is overworked, then they can make mistakes.  However, if a way is found for them to have an appropriate workload, then they can be a productive and helpful member of society.

That is different than somebody who intentionally creates misleading information (and/or coerces a subordinate to commit crimes), without hesitation or concern about the consequences.

I think it is also important that we train ourselves not to do things that we may regret in the long-run due to fear / pressure / competition in the short-term (as early as possible in your career).  That would be true for everybody, not just scientists.

For example, if you use your personal connections to get results published more quickly (with less rigorous / critical assessment), then that is bad for science (even if it can help you get grants/funding).  So, there can be a difference between “science” and “scientists.”

I also believe admitting mistakes is important (and needs to have better consequences than denying that you have done something wrong).  For example, I think how somebody reacts to a correction / retraction is important.

So, I think most scientists are fundamentally good, but changes in policies and other structures can improve to help them reach their full potential (and conduct science as fairly and objectively as possible).



Change Log:

1/16/2020 - public post
6/20/2020 - test reverting a draft that is later made public again (confirm that original publication date stays the same)
7/26/2020 - add business fraud rate statistic
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.