Friday, July 26, 2019

Personal Thoughts on Collaboration and Long-Term Project Planning: Post-Publication Review

I have another post more broadly describing the importance of comments / corrections that I have self-imposed on my papers, as well as thoughts about the science-wide error rate.

However, those are not all from work I did in a shared resource at City of Hope.  So, I thought I should summarize a subset of those points here:


  • COHCAP comment #1: correction of minor typos (now upgraded to a formal corrigendum)
  • COHCAP comment #2: my personal opinions emphasizing the following points
    • "City of Hope" should not have been used in the algorithm name
    • I've more recently gained a better appreciation for the need to test methods for every paper (so, I mention that readers should not consider the best COHCAP results to be completely automated).  Given that COHCAP stands for "City of Hope CpG Island Analysis Pipeline," this is relevant to my discussions of "templates" versus "pipelines"
  • COHCAP comment #3: while the Nucleic Acids Research editors were very helpful in encouraging me to look more closely at a discrepancy in the listing of the machine for processing the 450k array, they declined to post the comment because it was ultimately determined to be an error in the GEO entry rather than the Supplemental Materials for the COHCAP paper.
    • I mention this in a little greater detail on Google Sites; however, I was able to confirm that the HiScanSQ (not the BeadArray) was used to process the samples because i) the BeadArray is not capable of processing a 450k array and ii) City of Hope never owned a BeadArray.
  • 2nd Author Correction: Table #2 was wrong (a duplicate of Table #1, although the table description was correct)
  • 2nd Author Comment #1: Use of the phrase "silhouette plot" was not precise
    • While this could potentially be an example of a concern for a bioinformatician within a biology lab, I worked on this paper when I was in the COH Bioinformatics Core.  So, I think the most important lesson is to develop habits where you stop whenever you encounter something you don't know, and to set a pace (and total number of projects) where you expect to take some time to learn more about what you see in the literature (and how to ask the right / best questions of collaborators who are likely also busy working on multiple projects).
  • 2nd Author Comment #2: The phrase "silhouette plot" was also used in another paper, which was published before this 2nd author paper (even though this project was started first)
    • I think this is important in terms of better appreciating the interdependence of labs supported by the same staff member (although I have started to try to add acknowledgements for templates, and to make notes in follow-up analyses whenever code is copied between labs prior to publication).
  • Middle-Author Papers
    • It is important that I am fair to everybody (regardless of whether they are a collaborator).  However, I also realize this is a sensitive issue that requires some additional internal communication.
      • So, I have reduced the amount of detail for these examples.  While I think there has been at least 1 correction that was initiated more than a year ago, I am (slowly) continuing to follow up whenever something is or was not correct.
      • Sorting through the details for corrections is like managing the workload for new projects.  If I try to figure out what exactly happened with too many papers at once, I will be more likely to make mistakes.  So, at any given time, I try to focus on ~3 issues that I know about.
      • In other words, I will be honest if asked about any errors (or potential errors).  However, if I have the advantage of being able to have discussions with people who I know better, then I think it is probably wise to focus on that as much as possible.
      • I am willing to add a link to notes about middle-author papers.  However, if it is possible to wait until everything on that list has been corrected, I think that may be preferable.
    • So far, I don't think that I caused most of the middle-author paper errors, but I did make some mistakes for middle-author papers.  So, I provide a couple of examples (omitting some specific details) below:
      • GEO Sample Label Update: Since GEO doesn't have a change log, I thought I should mention there was one prostate cancer project that I helped prepare for a GEO upload whose labels were not ideal (even though the patient IDs for sample pairing were correct, the samples should have been called "sample" rather than "patient," and that has been corrected).  This was not a huge problem, but most other GEO corrections are due to me not knowing about the machine (so, they were somebody else's errors that I didn't catch due to a gap in my knowledge).  So, to be fair, I thought I needed to mention this example because I was the one who accidentally created the error (rather than passing along somebody else's error).
      • GEO Machine / Base Calling Methods Update: There was at least one submission where machine and methods needed to be updated, both of which involved at least some previous misunderstanding on my part.

If I were to give advice to my previous self, I would say it is important for the project lead to understand the full project (and plan to spend a substantial amount of time revising and critically assessing your results).  If there is something that you don't understand, do everything you can to discuss it with the other authors prior to paper submission.  After all, you will likely have to give at least a partial explanation to people asking about your project, such as in face-to-face discussions where co-authors may not be present.  It is also important to capture the full amount of work required for a paper (including post-publication support).

You don't necessarily have to be a project lead to need to plan for an appropriate workload, although taking responsibility for a paper is much more difficult if you aren't a project lead (if you caused the mistake, then somebody else may experience more severe consequences for your mistake).

I think it is also important to emphasize personal limits (and the solution that they suggest).  If your optimum workload is 5 projects and you work on 10 projects, then you are going to encounter difficulties.  However, I think it can help if you take your time and gain a better intuition about what you don't know (and therefore what you either need to spend more time on or possibly focus less on overall).  I admittedly still have to figure out exactly what produces the best work-life balance, and I think you have to wait to notice some of the accumulation of follow-up requests and/or post-publication support / review.  However, I think I have gotten a better feel for what that "optimal" day is like: I just have to figure out how to consistently have that each day (on the scale of years).  In other words, if you are feeling overwhelmed, then I would recommend focusing on previous positive experiences as hope that you can improve by decreasing your responsibility / workload.  I also needed to learn to recognize and manage stress better (sometimes with medication).

I think a lot of what I described above can also just be simple mistakes.  For example, I can tell that I make more mistakes if I work overtime on a regular basis or if I haven't been well-rested.  While I didn't exactly cause all of the errors that I described above, I think it is necessary for me to take responsibility whenever I was first author (or equivalent).  If I can describe myself as precisely as possible (which I realize is still a work in progress), then I hope that can also help others (at every level of collaboration in putting together a paper).

P.S. There were 2 general points (previously under the "middle author" section) that I think may be better to move to another blog.  I have already moved that content (and I will provide links here when public).  However, in the meantime, I would say those fall into the categories of i) what is the best way to correct minor errors (which you can see in this ResearchGate discussion) and ii) explaining the need for (and an estimate of the time required for) providing the data and code needed for a result to be reproducible.

Update Log:

7/26/2019 - public post date
7/27/2019 - revise concluding paragraph
7/29/2019 - move majority of concluding paragraph back to a draft; try to be more conservative / clear with commentary
7/31/2019 - add link to COHCAP corrigendum
8/1/2019 - mention there will need to be additional corrections
8/5/2019 - minor changes (+ add back in concluding paragraph, followed by additional trimming/revision)
8/6/2019 - minor changes
8/13/2019 - mention GEO update
9/19/2019 - mention data deposit and code sharing
9/20/2019 - expand middle-author section; minor change
9/21/2019 - minor change
9/28/2019 - add experiences learned from IRB / patient consent process
9/29/2019 - fix typos; reword recent changes
10/02/2019 - add Yapeng link
10/15/2019 - add note for gene length calculation, as well as another link to blog post (with some separated content in this post)
11/1/2019 - mention ChIP-Seq issue
1/28/2020 - add intermediate set of ChIP-Seq notes
4/4/2020 - minor changes + reduce middle author content + move general points
4/24/2020 - add link to ResearchGate discussion
9/7/2020 - minor change (removing some specific information)
12/16/2020 - add another middle-author example without any details (shifting from specific to general)

Personal Thoughts on Collaboration and Long-Term Project Planning: Staff in Shared Resources

While I am still having discussions to understand and precisely describe my 10+ years of genomics experience (and its influence on current projects, particularly those started post-2016), I have some opinions that I would like to share for broader discussion:


  1. If bioinformatics support comes from a shared resource, I think there may be benefits to limits on percent effort (as a fraction of split salary) and/or PIs supported by individual staff members.  A colleague kindly referred me to this paper that recommended a minimum limit of 5% effort.  My tentative opinion is that there may be a benefit to having a maximum limit of 4-5 PIs (although the specifics probably vary, depending upon whether the analyst has a Master's Degree or a PhD, for example).
    • For me, I feel very comfortable in saying that I need to have a maximum limit of 3 average difficulty projects per day.
    • Also, if possible, I believe that gaining in-depth knowledge on a limited number of projects should help with publishing in higher-impact journals and/or developing novel methodology.
    • I think this matches the minimum level of effort for NCI awards, as well as the concept of having a conflict of commitment.  I also learned about both of those at work.
  2. I believe the PI / project limits for shared staff should be more strict when developing software that requires long-term support.  It is important to remember that the average (or minimum) amount of time per project will be inversely related to the total number of projects.  So, if you want to provide prompt feedback to users of your software (and fair support for all projects handled by an analyst), this needs to be taken into consideration when scheduling projects.
  3. If an analyst is supporting multiple labs, I believe the best-case scenario may involve splitting time between labs that knowingly collaborate with each other.  For example, if the set of projects among all labs is known, that may help scheduling submission (and expected revision) for papers that are expected to be published in the highest impact journals.  Regardless of whether this is done, the ability to support any given project is dependent upon concurrent support of other projects, and I think that is at least something that needs to be kept in mind.
  4. I think there should be transition periods when making changes in staff support.  While I'm not 100% certain about the details, I think a yearly or quarterly review may be a good idea.  I don't believe it is a good idea to make support decisions on a daily or weekly basis, and it often takes me 1-2 months to feel comfortable with a project that was previously worked on by somebody else.
  5. At least for me, it helps to have somewhat frequent discussions / analysis to help remember the details for a project.  For example, I would probably recommend having at least monthly discussions (and I think weekly or daily discussions / analysis are probably preferable).
  6. Likewise, I can discover and fix errors when spending a substantial amount of time on critical assessment of results.  For example, my recommendation would be to expect to have 10 "rounds" of analysis (hopefully, some of which include creative / custom / novel analysis).
    • This isn't perfect, but I know I have found (and corrected) some errors this way.  This is kind of similar to being able to make new discoveries (or correct additional errors) when you re-read a paper.
  7. It is my opinion that specialized protocols should be an area of expertise for individual labs (which may or may not be offered externally), but this should not be a responsibility / service of the core.

That said, I am expecting the "Update Log" to reflect at least a few additional changes (as I have more discussions - primarily internal at first, but I think some broader feedback is also valuable).

While I am primarily focusing on the project management part of shared staff in the points above, the concept of an optimum workload can be important in various situations.  For example, if your optimum workload is 5 projects and you are trying to take on 10+ projects, you will either do lower quality work or not have an even distribution of time on projects.  However, that does indicate a possible solution: if somebody is having difficulties with supporting a certain number of projects / PIs, they may actually be capable of producing excellent-quality work if they focus on fewer projects in more depth.
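The workload arithmetic above can be sketched as a simple calculation; this is just an illustration, and the 40-hour week and the specific project counts are assumptions I am making for the example, not recommendations:

```python
# Minimal sketch of the inverse relationship between the number of
# concurrent projects and the average time available per project,
# assuming an even split of a 40-hour week (an illustrative assumption).

def avg_hours_per_project(hours_per_week, n_projects):
    """Average weekly hours available per project under an even split."""
    return hours_per_week / n_projects

for n in (3, 5, 10):
    print(f"{n} projects -> {avg_hours_per_project(40, n):.1f} hours/project/week")
# 3 projects -> 13.3 hours/project/week
# 5 projects -> 8.0 hours/project/week
# 10 projects -> 4.0 hours/project/week
```

In other words, doubling the number of projects halves the average time per project, which is the sense in which taking on 10+ projects instead of 5 forces either lower-quality work or an uneven distribution of time.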

Update Log:
7/26/2019 - public post date
9/14/2019 - add note on discussion intervals
9/16/2019 - remove nearly duplicated sentence
9/23/2019 - mention being able to catch errors as I perform additional analysis for a particular project
9/28/2019 - add point / opinion #7
4/16/2020 - add link about effort limits and conflict of commitment
4/17/2020 - minor change

Personal Experiences with Comments and Corrections on Peer-Reviewed Papers

So far, I have at least 5 publications with examples of comments and/or corrections:


  • Coding fRNA Comment (Georgia Tech project, published while at Princeton).
    • It is important to remember that anything you publish has a certain level of permanence (and you can be contacted about a publication 10+ years later).
    • On the positive side, I think it is worth mentioning that your own desire to work towards helping people and being the best possible scientist is important: for example, peer review can help if you do everything you can before submission, but I recently provided more public data and re-analysis that I think improves the overall message for this paper (on my own initiative; for example, please see this GitHub page, with PDFs for text and comment figures).
    • So, I truly believe peer review can help, but I think personal responsibility and transparency are at least as important.
  • Corrigendum for 2-Word Typo in BAC Review (UMDNJ, Post-Princeton, Pre-COHBIC).
    • More important than the correction, I think it should be emphasized that 6 months of working in a lab (especially without ever doing the experimental protocol emphasized) is not enough time to justify writing a review.
  • As 1st and corresponding author, I have issued two comments (and a corrigendum) for the COHCAP paper describing a method for analysis of DNA methylation data (City of Hope Bioinformatics Core).
    • There was also a third comment that NAR decided not to publish (regarding an error with the machine used in GEO, which is described in more detail on my Google Sites page and briefly mentioned in the related blog post).
  • As a 2nd equal contribution author, there was a correction regarding the 2nd table, as well as 2 comments related to imprecise use of the term "silhouette plot" (City of Hope Bioinformatics Core)  
  • [initiated and completed corrections to middle-author papers and deposited datasets] (City of Hope Integrative Genomics Core, Post-Michigan).
I trimmed down the details above because I think the formatting on my Google Sites page is a little better, and I listed specifics for the details of the City of Hope papers in a separate post.  So, given that only two other papers were pre-COH, I thought it may be better to shift this post more towards the higher-level discussion.

We all have other factors that will contribute to the total amount of time on a project (such as allocating some time for personal life), and I would usually expect work responsibilities to increase over time.  For example, if you are scrambling to complete your work as a graduate student, you may want to be cautious about setting goals for jobs that would have an even greater amount of responsibility.

Some people may be afraid of pointing out similar issues in previous papers.  While possibly somewhat counter-intuitive, I think this can help build trust in the associated papers / researchers: if researchers are not transparent in their actions and overall experience with a set of results, that can contribute to public distrust (and development of bad habits that can become worse over time).  Plus, if research is an on-going process of small steps towards an eventual solution, readers should expect each paper to acknowledge some limitations and/or unsolved problems (and a fair representation of results should help in identifying the most important areas for future research).

One relatively well-known example of the impossibility of being 100% accurate in all predictions is that Nobel Laureate Linus Pauling had a PNAS paper proposing a triple-helix structure for DNA.  There was even a Retraction Watch blog post that brought up the issue of whether this paper should be retracted.  I don't believe anyone currently cites that paper with the belief that DNA has a triple-helix (rather than a double-helix) structure.  However, taking time to correct and address mistakes needs to be taken into consideration for project management, and my point is that I am trying to encourage more self-regulation of corrections (since there are papers in relatively high impact journals whose main conclusion is wrong, and they haven't been retracted or corrected).

While it is harder to pass judgment on other people's work, I hope that I can be a good example for other people to identify issues with their own previous work.  For example, one counter-argument to the claim that most scientific findings are wrong is the Jager and Leek 2014 paper, where Figure 4 shows a science-wide FDR closer to 15%.  In one sense, this is good (15% is certainly better than 50% or 95%), but I think the correction / retraction rate is probably noticeably less than 15% (so, I think more scientists need to be correcting previous papers).  From my own record (of 1st author or equivalent papers), my correction rate is currently higher than that Jager and Leek estimate (3/8, or 37.5%), but my retraction rate is currently lower (0%).  I am not saying I will never have a retraction (or additional corrections).  In fact, I think there probably will be at least a couple additional corrections (among my total publication record).  However, that is my own personal estimate, and I would like to contribute to having discussions to try and reduce this correction rate for future studies.
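The arithmetic behind that comparison is simple enough to spell out; the 15% value is the Jager and Leek estimate cited above, and the 3-of-8 counts come from my own publication record as described in this post:

```python
# Comparing personal correction / retraction rates (from the counts
# in this post) to the Jager & Leek (2014) science-wide FDR estimate.

first_author_papers = 8      # 1st author (or equivalent) papers
papers_with_corrections = 3  # papers with a comment and/or correction
retractions = 0

correction_rate = papers_with_corrections / first_author_papers
retraction_rate = retractions / first_author_papers
science_wide_fdr = 0.15      # Jager & Leek, Figure 4

print(f"Correction rate: {correction_rate:.1%}")  # 37.5%
print(f"Retraction rate: {retraction_rate:.1%}")  # 0.0%
print(correction_rate > science_wide_fdr)         # True
```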

I believe being open to discussion can cause you to temporarily lean towards agreement, as you try to see things from the other person's perspective (even if you eventually become more confident in your earlier claim).  So, even if a reviewer/editor considers a paper acceptable to publish or a grant OK to fund (within a relatively short review process), post-publication review is very important (and I think making funded grants and/or grant submissions public and available for comment may also have value).

I also think that having less formal discussions (on Biostars, blogs, pre-prints, etc.) can help the peer-reviewed version of an article to be more accurate (if people actively comment in a location that is easy to check, reviewers take public comments into consideration, and/or journals use public comments to select reviewers that will provide the most fair assessment).  For multiple platforms, the Disqus comment system provides a centralized way to look for commentary for at least one peer-reviewed journal and at least one pre-print system.  While not linked directly from the paper, PubPeer also provides independent commentary on journal articles.  I also have some examples of comments on both of those mediums on this blog post.

While not the primary purpose, I think Twitter can also be useful for peer review.  For example, consider the contribution of a Twitter discussion to this Disqus comment.  Likewise, I found out about this article about the limits to peer review from Twitter.

I also have a set of blog posts summarizing experiences that describe the need for correction / qualification of results provided to the public for genomic products (although I think catching errors in papers, or better yet pre-prints, is really the preferable solution).  While having something like the Disqus system for individual results (kind of like ClinVar, GET-Evidence, SNPedia, etc.) may have some advantages, people can currently give feedback in mediums like PatientsLikeMe (where I have described my experiences here) and the FDA MedWatch.

Update Log:

2/2019 - I would like to thank John Storey's tweet for reminding me of Jeff's SWFDR publication (in a draft, prior to the public post)
7/26/2019 - public post date
8/1/2019 - remove middle author link after realizing that there will be additional middle author corrections; add COHCAP corrigendum link; add PubPeer link based upon this tweet.
8/3/2019 - add Twitter links
8/6/2019 - switch link to updated genomics summary
1/16/2020 - add link to Disqus / PubPeer comment list
4/4/2020 - minor changes
7/11/2020 - minor changes
3/20/2022 - add Oncotarget RNA-Seq comment
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.