Charles Warden's Science Blog: Personal Thoughts on Collaboration and Long-Term Project Planning: Reproducibility and Depositing Data / Code

Thursday, April 30, 2020

Personal Thoughts on Collaboration and Long-Term Project Planning: Reproducibility and Depositing Data / Code

I believe that I broadly need to improve how I explain the need for (and value of) depositing data and providing code for the associated paper (even though that takes additional time and effort).

This is already a little different from the other sections, since it is more of a question than a suggestion.

Nevertheless, as an individual, this is what I either currently do or I need to learn more about:

I am actively trying to better understand the details of properly depositing patient data (even though I have previously assisted with GEO and SRA submissions).

For example, I am trying to understand how patient consent relates to the need to have a controlled-access submission (even if that increases the time necessary to deposit data, or means that certain projects should not be funded if the associated data cannot be deposited appropriately). So, being involved with a successful dbGaP submission would probably be good experience.
I thought the rules were similar for other databases (like ArrayExpress, ENA, EGA, etc.).
However, if you know of other ways to appropriately deposit data, then I would certainly be interested in hearing about them!

If possible, I always recommend depositing data (and you can see several papers where we did in fact do that), but I think different expectations would need to be set for supporting code (hence I have said things like "I cannot provide user support for the templates").

This is not to say I don't think code sharing is important. On the contrary, I think it is important, but you have to plan for the appropriate amount of time to carefully keep track of everything needed to reproduce a result.
Also, if there are a lot of papers where code has not been provided in the past, then I have to work on figuring out how to explain the need to share code (and to spend more time per project, thus reducing the total number of projects that each lab/individual works on).

I think it would also be best if I could learn more about IRB/IACUC protocols (for both human and other animal studies).

In terms of how I might emphasize the importance of data deposit and code sharing, you can see my notes below. However, if you have other ideas about how to effectively and politely encourage PIs to deposit data and plan for enough effort to provide reproducible code (and/or help review boards not approve experiments producing data that can't be deposited), then I would certainly appreciate hearing about other experiences!

Even if it is not caught during peer review, I think journal data sharing requirements can apply for post-publication review?

For example, the Nature reporting standards say that the editor can be contacted if data is not shared, and this applies to code as well as data.
PLOS has "Unacceptable Data Access Restrictions" under their data availability policy.

The NIH has Genomic Data Sharing (GDS) policies regarding when genomics data is expected to be deposited.

There is also additional information about submitting genomic data, even if the study was not directly funded by the NIH.
There is also information about the NIH data sharing policies here.

While it mostly emphasizes the expectation for data sharing for grants that are greater than $500,000, the NIH Data Sharing Policy and Implementation also mentions the need to make code-related information available for reproducibility under "Data Documentation".
I also have this blog post with the notes that I have collected about limits to data sharing, but that is more about limiting experiments than data deposit for an experiment that has already been conducted.
Eglen et al. 2017 has some guidelines regarding sharing code.
This book chapter also discusses data and code sharing in the context of reproducibility.