Charles Warden's Science Blog: Personal Thoughts on Collaboration and Long-Term Project Planning: Post-Publication Review

I have another post more broadly describing the importance of comments / corrections that I have self-imposed on my papers, as well as thoughts about the science-wide error rate.

However, those are not all from work I did in a shared resource at City of Hope. So, I thought I should summarize a subset of those points here:

COHCAP comment #1: correction of minor typos (now upgraded as a formal corrigendum)
COHCAP comment #2: my personal opinions emphasizing the following points

"City of Hope" should not have been used in the algorithm name
I've more recently gained better appreciation for the need to have testing of methods for every paper (so, I mention that readers should not consider the best COHCAP results to be completely automated). Given that COHCAP stands for "City of Hope CpG Island Analysis Pipeline" this is relevant to my discussions of "templates" versus "pipelines"

COHCAP comment #3: while the Nucleic Acids Research editors were very helpful in encouraging me to look more closely at a discrepancy in the listing of the machine for processing the 450k array, they declined to post the comment because it was ultimately determined to be an error in the GEO entry rather than the Supplemental Materials for the COHCAP paper.

I mention this in a little more detail on Google Sites; however, I was able to confirm that the HiScanSQ (not the BeadArray) was used to process the samples because i) the BeadArray is not capable of processing a 450k array and ii) City of Hope never owned a BeadArray.

2nd Author Correction: Table #2 was wrong (duplication of table #1, although the table description was correct)
2nd Author Comment #1: Use of the phrase "silhouette plot" was not precise

While this could potentially be an example of a concern for a bioinformatician within a biology lab, I worked on this paper when I was in the COH Bioinformatics Core. So, I think the most important lesson is to develop habits where you stop whenever you encounter something you don't know, and set a pace (and total number of projects) where you expect to have to take some time to learn more about what you see in the literature (and how to ask the right / best questions to collaborators that are likely also busy working on multiple projects).

2nd Author Comment #2: Use of the phrase "silhouette plot" was also used in another paper, which was published before this 2nd author paper (even though this project was started first)

I think this is important in terms of better appreciating the interdependence of labs supported by the same staff member (although I have started to try to include acknowledgements for templates, and to make notes in follow-up analysis whenever code is copied between labs prior to publication).

Middle-Author Papers

It is important that I am fair to everybody (regardless of whether they are a collaborator). However, I also realize this is a sensitive issue that requires some additional internal communication.

So, I have reduced the amount of detail for these examples. While I think there has been at least 1 correction that was initiated more than a year ago, I am (slowly) continuing to follow up whenever something is or was not correct.
Sorting through the details for corrections is like managing the correct workload for new projects. If I try to figure out what exactly happened with too many papers at once, I will be more likely to make mistakes. So, at any given time, I try to focus more on ~3 issues that I know about.
In other words, I will be honest if asked about any errors (or potential errors). However, if I have the advantage of being able to have discussions with people who I know better, then I think it is probably wise to focus on that as much as possible.
I am willing to add a link to notes about middle-author papers. However, if it is possible to wait until everything on that list has been corrected, I think that may be preferable.

So far, I don't think that I caused most of the middle-author paper errors, but I did make some mistakes on middle-author papers. So, I provide a couple of examples below, omitting some specific details:

GEO Sample Label Update: Since GEO doesn't have a change log, I thought I should mention there was one prostate cancer project that I helped prepare for a GEO upload whose GEO labels were not ideal (even though the patient IDs used for sample pairing were correct, the samples should have been called "sample" rather than "patient," and that has been corrected). This was not a huge problem, but most other GEO corrections are due to me not knowing about the machine (so, they were errors from somebody else that I didn't catch due to a gap in my knowledge). So, to be fair, I thought I needed to mention this because I was the one who accidentally created the error (rather than passing along somebody else's error).
GEO Machine / Base Calling Methods Update: There was at least one submission where the machine and methods needed to be updated, both of which involved at least some previous misunderstanding on my part.

If I were to give advice to my previous self, I would say it is important for the project lead to understand the full project (and plan to spend a substantial amount of time revising and critically assessing your results). If there is something that you don't understand, do everything you can to discuss it with the other authors prior to paper submission. After all, you will likely have to give at least a partial explanation to people asking about your project, such as in face-to-face discussions where co-authors may not be present. It is also important to capture the full amount of work required for a paper (including post-publication support).

You don't necessarily have to be a project lead to need to plan for an appropriate workload, although taking responsibility for a paper is much more difficult if you aren't a project lead (if you caused the mistake, then somebody else may experience more severe consequences for your mistake).

I think it is also important to emphasize personal limits (and the solution that recognizing them provides). If your optimum workload is 5 projects and you work on 10 projects, then you are going to encounter difficulties. However, I think it can then help if you take your time and gain a better intuition about what you don't know (and therefore what you either need to spend more time on or possibly focus less on overall). I admittedly still have to figure out exactly what produces the best work-life balance, and I think you have to wait to notice some of the accumulation of follow-up requests and/or post-publication support / review. However, I think I have gotten a better feel for what that "optimal" day is like: I just have to figure out how to consistently have that each day (on the scale of years). In other words, if you are feeling overwhelmed, then I would recommend focusing on previous positive experiences as hope that you can improve by decreasing your responsibility / workload. I also needed to learn to recognize and manage stress better (sometimes with medication).

I think a lot of what I described above can also just be simple mistakes. For example, I can tell that I make more mistakes if I work overtime on a regular basis or if I haven't been well-rested. While I didn't exactly cause all of the errors that I described above, I think it is necessary for me to take responsibility whenever I was first author (or equivalent). If I can describe myself as precisely as possible (which I realize is still a work in progress), then I hope that can also help others (for every level of collaboration in putting together a paper).

P.S. There were 2 general points (previously under the "middle author" section) that I think may be better to move to another blog. I have already moved that content (and I will provide links here when public). However, in the meantime, I would say those fall into the categories of i) what is the best way to correct minor errors (which you can see in this ResearchGate discussion) and ii) how to explain the need for, and estimate the time required to provide, the data and code needed for a result to be reproducible.

Update Log:

7/26/2019 - public post date
7/27/2019 - revise concluding paragraph
7/29/2019 - move majority of concluding paragraph back to a draft; try to be more conservative / clear with commentary
7/31/2019 - add link to COHCAP corrigendum
8/1/2019 - mention there will need to be additional corrections
8/5/2019 - minor changes (+ add back in concluding paragraph, followed by additional trimming/revision)
8/6/2019 - minor changes
8/13/2019 - mention GEO update
9/19/2019 - mention data deposit and code sharing
9/20/2019 - expand middle-author section; minor change
9/21/2019 - minor change
9/28/2019 - add experiences learned from IRB / patient consent process
9/29/2019 - fix typos; reword recent changes
10/02/2019 - add Yapeng link
10/15/2019 - add note for gene length calculation, as well as another link to blog post (with some separated content in this post)
11/1/2019 - mention ChIP-Seq issue
1/28/2020 - add intermediate set of ChIP-Seq notes
4/4/2020 - minor changes + reduce middle author content + move general points
4/24/2020 - add link to ResearchGate discussion

9/7/2020 - minor change (removing some specific information)

12/16/2020 - add another middle-author example without any details (shifting from specific to general)