Sunday, March 8, 2020

Testing Limits of Self-Identification / Relatedness using Genomic FASTQ Files

Color asked me to sign a HIPAA release in order to get access to my raw genomic data, which included a FASTQ file with ~15,000 reads.  So, I thought it might be useful to get an idea of how few reads (from random low-coverage Whole Genome Sequencing) can be used to identify myself.

To be clear, I think most rules are meant to take possible future advances into consideration.  So, just because I can't identify myself doesn't mean somebody else couldn't identify me with fewer reads (with current or future methods).  Nevertheless, if I can identify myself, then I think there is a good chance others could identify themselves with a similar number of reads and/or variants (and possibly fewer reads/variants).
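Testing at different read counts requires down-sampling the FASTQ file.  Here is a minimal, hypothetical sketch of one way to do that (reservoir sampling); the file names are placeholders, and this is not the exact code that I used:

```python
# Hypothetical helper: down-sample a FASTQ file to a fixed number of reads,
# so self-identification can be tested at different read counts.
import random

def downsample_fastq(in_path, out_path, n_reads, seed=42):
    """Reservoir-sample n_reads 4-line FASTQ records from in_path."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    with open(in_path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            seen += 1
            if len(reservoir) < n_reads:
                reservoir.append(record)
            else:
                j = rng.randrange(seen)  # replace with decreasing probability
                if j < n_reads:
                    reservoir[j] = record
    with open(out_path, "w") as out:
        for record in reservoir:
            out.writelines(record)

# e.g., create a ~15,000 read subset comparable to the Color FASTQ
downsample_fastq("full_sample.fastq", "subset_15k.fastq", 15000)
```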

Down-Sampling 1000 Genomes Omni SNP Chip Data

I compared relatedness estimates for myself with the following genotypes:  1) Veritas WGS, 2) 23andMe SNP chip, 3) Genes for Good SNP chip, and 4) Nebula lcWGS (along with the matching positions from the 1000 Genomes Omni SNP chip).
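The exact commands aren't shown in this post, but a minimal sketch of this kind of down-sampling + relatedness comparison might look like the following (assuming plink 1.9 and a hypothetical merged binary fileset containing my genotypes alongside the 1000 Genomes samples):

```python
# Sketch: randomly down-sample to a fixed number of variants, then estimate
# pairwise relatedness (PI_HAT) with plink 1.9.  "merged" is a placeholder
# .bed/.bim/.fam prefix, and the variant counts are examples.
import subprocess

for n_variants in [19, 1000, 10000]:
    subprocess.run(
        ["plink",
         "--bfile", "merged",
         "--thin-count", str(n_variants),  # randomly keep this many variants
         "--genome",                       # pairwise IBD / relatedness estimates
         "--out", f"kinship_{n_variants}"],
        check=True,
    )
    # results are written to kinship_<n>.genome, with PI_HAT per sample pair
```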

Perhaps more importantly, I also show kinship/relationship estimates (from plink) for 1000 Genomes samples for parent-to-child relationships as well as more distant relationships:



As you can see, there is a bit more variability in the parent-to-child estimates with a few thousand variants.  The self-identification estimates (among pairs of my 4 samples) were always greater than 0.45, but there is noticeable overlap in the kinship estimates for 1000 Genomes parent-to-child and more distant relatives when you drop down to only using 19 variants.

So, guarding against false positives for close relationships may be important, particularly with smaller numbers of variants.  If you have SNP chip or regular Whole Genome Sequencing data, then identifying yourself should also be easier than if you only have 2 low-coverage Whole Genome Sequencing datasets.
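As a minimal sketch (assuming the plink --genome output from the code above, with hypothetical file names), flagging likely self-matches with the >0.45 cutoff could look like:

```python
# Sketch: flag likely self / duplicate pairs from plink's .genome output,
# using the >0.45 cutoff observed for my own sample pairs above.
import pandas as pd

genome = pd.read_csv("kinship_10000.genome", sep=r"\s+")
likely_self = genome[genome["PI_HAT"] > 0.45]
print(likely_self[["IID1", "IID2", "PI_HAT"]])
```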

However, if I can get 1000s (or perhaps even 100s) of variant calls, then what I am currently most interested in is how accurate those calls can be.

Gencove and STITCH Imputed Self-Identification

I have earlier posts showing that the Gencove imputed variants from Nebula were not acceptable for individual variant calls, but I think they provided reasonable broad ancestry and relatedness results.  To be fair, I don't believe Nebula is providing low coverage Whole Genome Sequencing results anymore, opting instead for much higher coverage (like regular Whole Genome Sequencing).  However, Color provided me with considerably fewer lcWGS reads than Nebula (and Color also has a preprint about lcWGS Polygenic Risk Scores that I was concerned about).

So, I was interested in testing what imputed variants I could get if I uploaded FASTQ files for Gencove analysis myself (as well as an open-source option called STITCH).



There is also more information about running STITCH (as well as more statistics for Gencove variant concordance) within this subfolder (and this subfolder/README) on the human GitHub page.  Essentially, the performance of the human lcWGS looks good at 0.1x (if not better than the earlier Gencove genotypes that were provided to me by Nebula), but there is a drop in performance with the cat lcWGS.
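To make the concordance comparison concrete, here is a toy sketch of the calculation (the genotype dictionaries are made-up placeholders; in practice, they would be loaded from the imputed and "truth" VCF files):

```python
# Sketch of a genotype concordance calculation: compare imputed calls against
# a higher-confidence "truth" set (for example, regular WGS calls).
def concordance(imputed, truth):
    """Fraction of shared sites where the genotype calls agree."""
    shared = set(imputed) & set(truth)
    if not shared:
        return float("nan"), 0
    matches = sum(imputed[site] == truth[site] for site in shared)
    return matches / len(shared), len(shared)

# made-up example genotypes, keyed by (chromosome, position)
imputed = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/0"}
truth   = {("chr1", 100): "0/1", ("chr1", 200): "0/1", ("chr2", 50): "0/0"}
rate, n = concordance(imputed, truth)
print(f"concordance: {rate:.2f} over {n} shared sites")  # 0.67 over 3 shared sites
```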

I ran the STITCH analysis on a local computer, so the run-time was longer than Gencove's (between 1 day and 1 week, depending upon the number of reference samples; hence, I would start by running STITCH with ~99 reference samples in the future).  However, if you were willing to pay to run the analysis on the cloud (or use more local computing power), I think the run-times would be more similar if each chromosome were analyzed in parallel.  Also, STITCH is open-source, and it doesn't have any limits on the minimum or maximum number of reads that can be processed.  The performance also looks similar with ~5 million 100 bp paired-end reads, so the window for more accurate results that can be returned from Gencove may be around 2 million reads.  So, I think using STITCH can have advantages in a research setting.
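For example, here is a minimal sketch of that per-chromosome parallelization, assuming the STITCH.R command-line wrapper that ships with STITCH.  The argument names follow the STITCH README, but the BAM list, position files, and K / nGen values are placeholders that would need to be tailored (and checked against your installed version):

```python
# Sketch: run STITCH once per chromosome, several chromosomes at a time.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_stitch(chrom):
    subprocess.run(
        ["./STITCH.R",
         f"--chr={chrom}",
         "--bamlist=bamlist.txt",        # one lcWGS BAM path per line
         f"--posfile=pos.{chrom}.txt",   # known SNP positions for this chromosome
         f"--outputdir=stitch_{chrom}/",
         "--K=10",                       # number of founder haplotypes (placeholder)
         "--nGen=100",                   # generations parameter (placeholder)
         "--nCores=1"],
        check=True,
    )

chromosomes = [f"chr{i}" for i in range(1, 23)]
with ProcessPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_stitch, chromosomes))
```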

I welcome alternative suggestions of (open-source) methods to try, but would tentatively come up with these suggestions (for 100 bp paired-end reads, with random / even coverage across the genome; also see the sketch after this list):

greater than 1 million reads: good chance of self-identification

0.1 - 1 million reads: intermediate chance of self-identification (perhaps similar to a patient's initials, if it narrows down a set of family members?).  Potentially a "good" chance of self-identification with other methods and/or future developments.

less than 0.1 million reads: respect general privacy and allow for future improvements, but additional challenges may be encountered.  There may also still be sensitive and/or informative rare variants.
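If it helps to see those tentative tiers in one place, here is a toy encoding of them (the function name is made up, and the thresholds are the rough guesses above, not validated cutoffs):

```python
# Toy encoding of the tentative read-count guidelines above, assuming
# ~100 bp paired-end reads with random / even coverage across the genome.
def self_identification_tier(n_reads):
    if n_reads > 1_000_000:
        return "good chance of self-identification"
    if n_reads >= 100_000:
        return "intermediate chance (may improve with other / future methods)"
    return "harder with current methods, but still treat with care"

print(self_identification_tier(15_000))     # Color-like read count
print(self_identification_tier(2_000_000))  # Nebula-like read count
```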

I also added the results for GLIMPSE lcWGS imputation (Rubinacci et al. 2020).  These are human results, but the performance was a little lower than STITCH (more similar to the Gencove results for my cat, but lower than the Gencove results for myself).  However, it should probably be noted that I did not specify my ancestry for GLIMPSE, but I did specify a limited number of populations for STITCH.  So, if you don't know the ancestry (or the ancestry used might confound the results), then that loss of concordance may be OK.  Also, I think the GLIMPSE run-time was shorter (within 1 day), and it used the full set of 1000 Genomes samples as the reference set.
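For reference, here is a rough sketch of the GLIMPSE v1 steps (chunking, per-chunk imputation, and ligation), as I understand them from the GLIMPSE tutorial.  All file names are placeholders, the flags may differ between GLIMPSE versions, and the chunk-file parsing is an assumption; so, treat this as an outline rather than a recipe:

```python
# Sketch of the GLIMPSE v1 workflow (chunk -> phase -> ligate) for one chromosome.
import subprocess

# 1) split the chromosome into imputation chunks
subprocess.run(["GLIMPSE_chunk", "--input", "sites.chr20.vcf.gz",
                "--region", "chr20", "--window-size", "2000000",
                "--buffer-size", "200000", "--output", "chunks.chr20.txt"],
               check=True)

# 2) impute each chunk against the 1000 Genomes reference panel
with open("chunks.chr20.txt") as chunks:
    for line in chunks:
        fields = line.split()
        # assumed columns: chunk ID, chromosome, buffered input region, output region
        chunk_id, input_region, output_region = fields[0], fields[2], fields[3]
        subprocess.run(["GLIMPSE_phase",
                        "--input", "target.chr20.vcf.gz",
                        "--reference", "1000G.chr20.bcf",
                        "--map", "chr20.gmap.gz",
                        "--input-region", input_region,
                        "--output-region", output_region,
                        "--output", f"imputed.chr20.{chunk_id}.bcf"], check=True)

# 3) ligate the per-chunk results into one file (list file has one BCF per line)
subprocess.run(["GLIMPSE_ligate", "--input", "imputed_file_list.txt",
                "--output", "imputed.chr20.merged.bcf"], check=True)
```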

Recovery/Observation of Variants in 1 Read

Even if I don't know exactly what is most likely to enable self-identification, I can try to get some idea of the best-case scenario in terms of even having 1x coverage at a potentially informative variant position.

The numbers here are a little different from the 1st section: I am looking for places where my genome varies from the reference genome, and I am considering a larger number of sites.  Nevertheless, I was curious about roughly how many reads it took to recover 500 or 1000 variants from a couple of variant lists:



Notice the wide gap between the number of reads needed to recover any SNPs versus a set of potentially informative SNPs.  However, it looks like you may very roughly start to encounter problems with self-identification (even for future methods) with less than ~250,000 reads (matching the STITCH results above).  This is noticeably more than the ~15,000 reads that I had to sign a HIPAA release to get from Color, but the lower limit for any SNPs (called "WGS SNPs", with the gray line) was ~15,000 reads (very similar to what I was provided from Color, albeit single-end instead of paired-end).
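For anyone who wants to reproduce this kind of count, here is a minimal sketch of the "at least 1 read at a variant position" calculation (assuming a coordinate-sorted, indexed BAM of the down-sampled reads and the pysam library; the file name and positions are placeholders):

```python
# Sketch: count variant sites covered by at least one aligned read.
import pysam

def covered_variants(bam_path, variant_positions):
    """Count sites overlapped by >= 1 read; positions are 0-based here."""
    covered = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, pos in variant_positions:
            if bam.count(chrom, pos, pos + 1) > 0:
                covered += 1
    return covered

sites = [("chr1", 1234566), ("chr2", 2345677)]  # placeholder variant positions
print(covered_variants("subset_15k.bam", sites))
```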

In reality, there is sequencing error and a false discovery rate to consider when calling variants from 1 read, the nucleotide distribution at each position is not random among the 4 nucleotides, variants can be linked (non-independent), and you have 2 copies of each chromosome at every position in the genome reference.  However, if you over-simplify things and ask how many combinations of 4 nucleotides (or even 2 nucleotides) create more unique sequences than the world population, that is noticeably less than the 500 or 1000 thresholds added to the plot above.
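For a quick sanity check on that over-simplified calculation (using ~8 billion as a round world population figure):

```python
# How many independent positions with 4 (or 2) possible states are needed
# before the number of combinations exceeds the world population?
import math

world_population = 8_000_000_000
print(math.ceil(math.log(world_population, 4)))  # 17 positions with 4 states
print(math.ceil(math.log(world_population, 2)))  # 33 positions with 2 states
# Both are far below the 500 / 1000 variant thresholds added to the plot above.
```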

So, if you consider rare SNPs (instead of calculating a relatedness estimate with common SNPs), perhaps you could identify yourself with less than 50,000 reads?  Either way, allowing for some rough margin for future improvements in technology, I would feel safe exercising extra caution with data that has at least 100,000 reads (collected randomly / evenly across the genome).  I also believe that erring on the side of caution for data with fewer reads is probably wise as a preventative measure, but I think the possible applications for that data are more limited.

Closing Thoughts

If anybody has knowledge of any other strategies, I am interested in hearing about them.  For example, I think there may be something relevant from Gilly et al. 2019, but I don't currently have a 3rd imputation benchmark set up.  I have also tried to ask a similar question on this Biostars discussion, since it looks like Gencove is no longer freely available (even though I was able to conduct the analysis above using a free trial).  I have all of this data publicly available on my Personal Genome Project page.

I am also not saying that privacy is unimportant for samples with fewer reads than any of the numbers that I mention above (with or without a way to self-identify myself with current methods).  For example, FASTQ files have information about the machine, run, and barcode in them (even with only 1 read).  So, if the consumer genomics company has a map between samples and customers, then perhaps that is worth keeping in mind for privacy conversations.  Likewise, if the smaller number of variants includes disease-related variants, perhaps that is also worth considering.
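For example, here is the kind of metadata an Illumina-style read name carries (this particular header is made up, and the exact fields can vary by platform and pipeline):

```python
# Illumina-style read names encode instrument, run, flowcell, and lane,
# even when a file contains only 1 read.
header = "@M00123:456:000000000-ABCDE:1:1101:15589:1332 1:N:0:ACGTACGT"
instrument, run_number, flowcell, lane = header[1:].split(":")[:4]
print(instrument, run_number, flowcell, lane)  # machine + run identifiers
```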

I don't want to cause unnecessary alarm: as mentioned above, I have made my own data public.  However, you do have to take the type of consent into consideration when working with FASTQ files (for data deposit and data sharing).  For example, you currently need "explicit consent" for either public or controlled access of samples collected after January 25th, 2015.

Finally, I would like to thank Robert Davies for the assistance that he provided (in terms of talking about the general idea, as well as attempting to use STITCH for genotype annotations), which you can see from this GitHub discussion.  I would also like to thank several individuals for helping me learn more about the consent requirements for data deposit.

Additional References

I am interested to hear feedback from others, potentially including expansion of this list.

However, if it might help with discussion, here are some possibly useful references:

Selected Genomic Identifiability Studies (or at least relevant publications):
  • Sholl et al. 2016 - Supplemental Methods describe using 48 SNPs to "confirm patient identity and eliminate sample mix-up and cross-contamination".
  • McGuire et al. 2008 - article describing genomics and privacy with some emphasis on medical records
  • Oestreich et al. 2021 - article generally discussing genomics identifiability and privacy
  • Ziegenhain and Sandberg 2021 - in theory, considers methodology to provide anonymized processed data.  I have not tried this myself, but, at best, this could maximize the amount of processed data that can be shared (with caveats).  However, some applications require accurate sequencing reads as unaltered raw data, and modified sequences should not be represented as such "raw" data.
  • Wan et al. 2022 - article generally discussing genomics identifiability and privacy
  • Russell et al. 2022 - instead of sequencing coverage (such as in this blog post), this preprint describes the impact of the amount of starting DNA material on microarray genotyping for forensics analysis.
  • Kim and Rosenberg 2022 - preprint describing characteristics affecting identifiability for STR (Short Tandem Repeat) analysis
  • Popli et al. 2022 - a preprint describing kinship estimates in low coverage sequencing data (and the amount of data for relatedness estimates is the topic for most of the content in this blog post)
While related to the more general topic, I think the goals of Lippert et al. 2017 and Venkatesaramani et al. 2021 are somewhat different than what I was trying to compare (including image analysis).

I also have some notes in this blog post, some of which are from peer-reviewed publications and some from other sources (such as the NIH and HHS websites).  Again, I would certainly like to learn more.

In addition to the publications for STITCH/GLIMPSE/Gencove, other imputation / low-coverage analysis studies include Martin et al. 2021, Emde et al. 2021, and GLIMPSE2.  Hanks et al. 2022 also compares microarray genotyping and low coverage imputation to Whole Genome Sequencing.  If it is expected that low coverage analysis includes enough markers to be useful, then I think you are either directly or indirectly saying that that level of coverage is sufficient to identify the individual (which I think is a criterion that is likely easier to meet than clinical utility).

I certainly don't want to cause any undue concern.  I think some public data is important for the scientific community, but I think it is appropriate for most individuals to agree to controlled access data sharing.  Nevertheless, I think this is an important topic, which might need additional communication in the scientific community.

Change Log:

3/8/2020 - public post
3/9/2020 - add comment about simplified unique sequence calculation
4/8/2020 - add STITCH results + minor changes
4/9/2020 - minor changes
4/16/2020 - minor change
5/1/2020 - add link for GLIMPSE (before analysis)
7/28/2020 - add GLIMPSE results
1/15/2023 - add additional references for other studies
3/16/2023 - add 48 SNP verification
 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.