Wednesday, May 6, 2020

Opinions Related to Gencove Pre-Print Comment


Because a pre-print comment is somewhat formal, I thought that I should separate my opinions from the main feedback.

So, I decided to put those in a blog post.  You can see my pre-print review/comment here, and these are the extra comments:

General Notes / Warnings (Completely Removed from Comment):

My Nebula lcWGS results were OK for some things (like relatedness and broad ancestry), but I found the Gencove accuracy to be unacceptable for specific variants (for myself).

While Nebula has changed to only provide higher coverage sequencing, I previously submitted an FDA MedWatch report for my own data (for the lcWGS Gencove results).

To be fair, there are also general limits to the utility of most of the Polygenic Risk Scores that I was able to test with my own data (with some informal notes in this blog post).  So, while it is true that I still had concerns about the percentiles that I saw from Nebula (even with the higher coverage sequencing data), mentioning that may be less relevant.

Similarly, while I want to encourage other customers to report anything they find to MedWatch (and/or PatientsLikeMe, etc), I also want to acknowledge my own limitations: this general warning is more about specific issues that I found for myself.  For example, it may help to have an independent analysis with larger sample sizes to gauge my general PRS concerns and/or be more specific about which PRS do or do not have clinical utility with sufficient predictive power for the disease association.

Specific Comment #2) I think my own result might match the imputed correlation that is described (in terms of having ~90% accuracy).  However, I would say that is unacceptable for making clinical decisions, especially since more accurate genotypes can be defined.  It is important to be transparent and not over-estimate accuracy, so I think that part is good.  I also realize that something unacceptable for individual variants can be acceptable for other applications.  However, I think something about limits should be mentioned for the general audience, even if they really apply to the same Polygenic Risk Scores in higher coverage sequencing data.

I am not sure if this matters for this particular project, but I have found that it is not unusual to learn about something that may contradict an original funding goal.  I have certainly noticed that it can take me a while to realize I need to question some original assumptions, but sharing those experiences is extremely valuable to the scientific community (if the conclusions then shift to helping others avoid similar mistakes).  I also realize prior assumptions can be hard to overlook in comments/reviews as well, and there is definitely more that I can learn.  Given that you have a pre-print and there are a lot of details in supplemental information and external files, I think that is a good sign.

Specific Comment #3) In the future, I hope that this is also the sort of thing that precisionFDA, All of Us, etc. can help with.  In fact, as an individual opinion, this makes me wonder if the SBIR funding mechanism might be able to help with directly providing generics through non-profits (especially for genomics diagnostics).  However, I don’t think that means SBIR for-profit funding would have to be completely ended to preferentially fund non-profits, and I realize that probably can’t affect this particular paper.

If the Gencove code isn’t public, then I am not sure how you could show others could reproduce a freeze of the code before testing application to new samples.  Nevertheless, I applaud that you provided some code for the publication.

Specific Comment #4) There may be a way to revise the current manuscript without adding the independent (public) test data and/or the open-source alternatives.  For example, I don’t think you need additional results for your effective coverage section, but I am more interested in the concordance measures.  If the Gencove / STITCH / GLIMPSE / IMPUTE results are similar in terms of technical replicate concordance (for the same 1000 Genomes samples), then I think that you could skip what is described for specific comment 3) for this paper.
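
As a rough illustration of the kind of concordance measure discussed above, the following Python sketch compares genotype calls for the same sample from two sources (for example, two technical replicates, or two imputation methods).  The genotypes are made up, and the function name `genotype_concordance` and the 0/1/2 alternate-allele-count encoding are assumptions for illustration only; none of this is taken from the pre-print or from Gencove.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: number of alternate alleles (0, 1, or 2); None = no call.
Genotypes = Dict[str, Optional[int]]


def genotype_concordance(calls_a: Genotypes, calls_b: Genotypes) -> Tuple[float, float, int]:
    """Return (overall concordance, non-reference concordance, sites compared)."""
    # Only compare variants that have a genotype call in both sets.
    shared = [v for v in calls_a
              if v in calls_b and calls_a[v] is not None and calls_b[v] is not None]
    if not shared:
        raise ValueError("no variants called in both sets")
    matches = sum(calls_a[v] == calls_b[v] for v in shared)
    # Restrict to sites where at least one set carries an alternate allele.
    non_ref = [v for v in shared if calls_a[v] > 0 or calls_b[v] > 0]
    non_ref_matches = sum(calls_a[v] == calls_b[v] for v in non_ref)
    overall = matches / len(shared)
    non_ref_conc = non_ref_matches / len(non_ref) if non_ref else float("nan")
    return overall, non_ref_conc, len(shared)


# Made-up genotypes for two hypothetical technical replicates.
replicate_1 = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 0, "rs5": None}
replicate_2 = {"rs1": 0, "rs2": 1, "rs3": 1, "rs4": 0, "rs5": 1}

overall, non_ref, n = genotype_concordance(replicate_1, replicate_2)
print(f"overall={overall:.2f} non_ref={non_ref:.2f} sites={n}")
```

Reporting the non-reference concordance separately can be useful because most sites are homozygous reference, which tends to inflate the overall value.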

I also noticed that the competing interests statement was in the past tense for the present employees (as I understand it).

Summary: I think the utility for lcWGS to cause additional genomic data types to be considered identifiable information is important (which I discuss in a different blog post).


Change Log:

5/6/2020 - public post

Thursday, April 30, 2020

Personal Thoughts on Collaboration and Long-Term Project Planning: Reproducibility and Depositing Data / Code

I believe that I broadly need to improve explanations for the need / value to deposit data and have code for the associated paper (even though that takes additional time and effort).

This is already a little different than the other sections, since it is more of a question than a suggestion.

Nevertheless, as an individual, this is what I either currently do or I need to learn more about:
  • I am actively trying to better understand the details for proper data deposit for patient data (even though I have previously assisted with GEO and SRA submissions).
    • For example, I am trying to understand how patient consent relates to the need to have a controlled-access submission (even if that increases the time necessary to deposit data, or means that certain projects should not be funded if the associated data cannot be deposited appropriately).  So, being involved with a successful dbGaP submission would probably be good experience.
    • I thought the rules were similar for other databases (like ArrayExpress, ENA, EGA, etc.).
    • However, if you know of other ways to appropriately deposit data, then I would certainly be interested in hearing about them!
  • If possible, I always recommend depositing data (and you see several papers where we did in fact do that), but I think different expectations would need to be set for supporting code (hence I have said things like "I cannot provide user support for the templates").
    • This is not to say I don't think code sharing is important.  On the contrary, I think it is important, but you have to plan for the appropriate amount of time to carefully keep track of everything needed to reproduce a result.
    • Also, if there are a lot of papers where code has not been provided in the past, then I have to work on figuring out how to explain the need to share code (and spend more time per project, thus reducing the total number of projects that each lab/individual works on).
  • I think it would also be best if I could learn more about IRB/IACUC protocols (for both human and other animal studies).

In terms of how I might emphasize the importance of data deposit and code sharing, you can see my notes below.  However, if you have other ideas about how to effectively and politely encourage PIs to deposit data and plan for enough effort to provide reproducible code (and/or help review boards not approve experiments producing data that can't be deposited), then I would certainly appreciate hearing about other experiences!

  • Even if it is not caught during peer review, I think journal data sharing requirements can apply for post-publication review?
  • The NIH has Genomic Data Sharing (GDS) policies regarding when genomics data is expected to be deposited.
    • There is also additional information about submitting genomic data, even if the study was not directly funded by the NIH.
    • There is also information about the NIH data sharing policies here.
  • While it mostly emphasizes the expectation for data sharing with grants that are greater than $500,000, the NIH Data Sharing Policy and Implementation also mentions the need to make code-related information available for reproducibility under "Data Documentation".
  • I also have this blog post on the notes that I have collected about limits to data sharing, but that is more about limiting experiments than data deposit for an experiment that has already been conducted.
  • Eglen et al. 2017 has some guidelines regarding sharing code.
  • This book chapter also discusses data and code sharing in the context of reproducibility.

Change Log:

4/30/2020 - public post
8/5/2020 - public post

Notes on Limits for Data Sharing

This overlaps with my post showing low-coverage sequencing data was identifiable information (with my own data).  However, I thought having a separate post to keep track of details still had some value.
  • Institutional Certification is required for patient data.  While some data collected before January 25th, 2015 can be deposited under controlled access without "explicit consent", this is not true for more recently collected samples.
    • For this reason, I would recommend not approving genomics studies with samples collected after this point, if such consent was not obtained (either in the original protocol, or in an amended protocol).
    • This also makes it important to get amendments to your IRB protocols, when you make changes.
    • The website can change over time.  In the event that the current website does not make clear that this applies to cell lines, you can see more explicit mention of cell lines here.
      • I believe the earlier website used the same language as the subheader on this form, saying "data generated from cell lines created or clinical specimens collected".
  • This means that you should not be able to create cell lines using samples collected more recently without "explicit consent" for either public or controlled access data deposit, since it will be extremely hard to enforce the appropriate use of the data after you share the cell lines with other labs.
    • I think that is consistent with what is described in this article, which says "[consent] should be requested prior to generation" for cell lines.
      • The NIH GDS Overview says "For studies using cell lines or clinical specimens created or collected after [January 25th, 2015]...Informed consent for future research use and broad data sharing should have been obtained, even if samples are de-identified".
      • The NIH GDS FAQ also says "NIH strongly encourages investigators to transition to the use of specimens that have been consented for future research uses and broad sharing."
      • Additionally, the GEO human subject guidelines say "[it] is your responsibility to ensure that the submitted information does not compromise participant privacy[,] and is in accord with the original consent[,] in addition to all applicable laws, regulations, and institutional policies" (with or without NIH funding).
      • Plus, the NIH GDS FAQ says "investigators who download unrestricted-access data from NIH-designated repositories should not attempt to identify individual human research participants from whom the data were obtained".
    • HeLa cell lines were not obtained with the appropriate consent.  I believe that is why there is a collection of HeLa dbGaP datasets, since they are supposed to be deposited through a controlled access mechanism.  This is not always mentioned on the vendor website, and this is not always immediately enforced.  However, post-publication review applies to datasets and products (as well as papers, which can be corrected or retracted).
      • In terms of HeLa cells, the genomic data is strictly expected to be deposited as controlled access, as explained in this policy.
    • If there is a way to check consent for cell lines, then I would appreciate learning about that.
    • As far as I know, the only cell lines that are confirmed to have consent to generate genetically identifying data to release publicly are those from the Personal Genome Project participants.  However, again, I would be happy to hear from others.
    • The ATCC website says "Genetic material deposited with ATCC after 12 October 2014 falls under the Convention on Biological Diversity and its Nagoya Protocol...It is the responsibility of end users that these undertakings are complied with and we strongly recommend that customers refer to this prior to purchase."
      • My understanding is that the United States has not joined this agreement.  However, I hope that this matches the spirit of other rules or guidelines from the NIH and HHS.  If I understand everything, I also hope the US joins at a later point in time.
  • In general, I think work done with low-coverage sequencing data can show that a lot of genomic data can be identifiable (which I think matches the need for controlled access and justification for not being allowed to create a cell line without the appropriate consent).
  • There is also this Blay et al. 2019 article describing kinship calculations with RNA-Seq data, also confirming the expectation that the raw FASTQ files contain identifiable information for most common RNA-Seq libraries.
    • The NIH GDS FAQ also includes "transcriptomic" and "gene expression" data as covered under GDS policies.
  • I believe the above points may relate to the 2013 Omnibus rule, connecting the GINA and HIPAA laws.  As I understand it, I think you can find an unofficial summary here.
    • I believe that also matches what is described in this link from the Health and Human Services (HHS) website (if it is related to a health care provider).
    • There are general HIPAA FAQ for Individuals here, including a description of the HIPAA privacy rule here that explains HIPAA is intended to "[set] boundaries on the use and release of health records".
    • The links most directly above are from Health and Human Services (HHS).  However, in the research context, this article mentions the importance of taking genetic information into consideration with HIPAA/PHI/de-identification (which recommends controlled access if there is not appropriate consent for public deposit, since some raw genomic data may not be able to be truly de-identified).
    • At least for someone without a legal background like myself, I think "Under GINA, genetic information is deemed to be ‘health information’ that is protected by the Privacy Rule [citation removed] even if the genetic information is not clinically significant and would not be viewed as health information for other legal purposes." from Clayton et al. 2019 might be worth considering.
    • In other words, I believe that there are both NIH and HHS rules/guidelines that require or recommend that care be taken for patient genomic data.
  • I think some of the information from the Design and Interpretation of Clinical Trials course from Johns Hopkins University is useful.
    • Even in the research setting, the document from that course for the "Common Rule" includes "Identifiable private information" in the definition of "Human Subject" Research.
    • In the HIPAA privacy rule booklet for that course, it also says "For purposes of the Privacy Rule, genetic information is considered to be health information."  You can also see that posted here.

There are certainly many individuals (at work, as well as at the NIH, NCI, etc.) who have been helping me understand all of this.  So, thank you all very much!

Change Log:

4/30/2020 - public post
7/30/2020 - updates
8/5/2020 - updates
7/9/2021 - add information about RNA-Seq kinship
8/12/2021 - add information about Personal Genome Project cell lines and ATCC / Nagoya Protocol; formatting changes in main text and change log
8/17/2021 - add GDS FAQ and NIH HeLa notes
8/19/2021 - add GEO note
8/27/2021 - add HIPAA notes
11/23/2021 - add HIPAA notes
5/27/2022 - add cell line institutional certification notes
1/15/2023 - add PLOS Computational Biology reference link related to HIPAA/PHI
1/16/2023 - add Common Rule reference from JHU Coursera course + Clayton et al. 2019 reference
1/28/2023 - add note to make link from HHS page more clear + minor formatting changes
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.