Sunday, June 2, 2019

What's the difference between a "Pipeline" and a "Template"?

The process of understanding each step of analysis is important for presenting the final set of results, and the process of writing the code for that analysis can help you understand the methods better (and identify questions and/or room for improvement in your current code / understanding).

I have some templates for analysis, but I call them "templates" rather than "pipelines" because the code itself usually requires some modification.  While I think it is extremely useful to have packages for specific functions, you may find that a pre-set pipeline doesn't quite produce publication-quality figures, and having a template for your own code (which is easier for you to change than somebody else's code) may make it easier to implement changes that come out of iterations of project discussions.  I have a note to this effect for most templates (such as the acknowledgement in the README for the RNA-Seq gene expression analysis "template," as well as the 2nd post-publication comment for COHCAP, which stands for "City of Hope CpG island Analysis Pipeline").

These modifications can be important for semi-automated analysis, and it is possible that other people may find there are some situations where it can be useful to have templates for intermediate results.  However, I also believe there are some other factors for discussion that are worth taking into consideration:

  • Be careful not to decrease the turn-around time for an initial result at the cost of increasing the total amount of time to get a paper to publication (or of skipping steps, which could decrease the accuracy of the publication).  This can be particularly tricky if it takes a couple of years to appreciate all the time required for follow-up requests.
  • Be aware of how other people will view your code, and what is appropriate to include in a publication. For example, even if it becomes appropriate to have "templates" for intermediate results (which I am not saying is necessarily true), taking time to understand your results is important for responsible research practices (regardless of the formality of what you are making public).  So, testing your code on multiple datasets (either within your lab, or public data from other labs) can be important for troubleshooting.
  • Unlike code published with a particular paper, the templates (by definition) are really designed with my own use in mind, and are much more difficult to support for other people (even within the same lab).  So, while looking at portions of the code may be helpful in generating ideas, support for "templates" can't really be provided (at least not in the same way as for "pipelines" or packages for a particular step of analysis).
  • While it is increasingly important to provide code to help with reproducibility and understanding of analysis, that code likely needs to be different for each paper (and time and energy will be required for supporting the separate code for each paper, for each lab).  While this is not exactly a pipeline (since that code will likely represent a template to be modified for other people's experiments, not run without any changes, or novel modifications), this is also not the same as saying you have a generalized template that you think is good to use on a large number of projects.
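To make the point about testing on multiple datasets a little more concrete, here is a minimal sketch in Python. It is not taken from any of my actual templates: the dataset names, the numbers, the `PSEUDOCOUNT` setting, and the `log2_fold_change` / `run_template` functions are all hypothetical, and the values are made up purely for illustration. The idea is only that the same step is run on several inputs (including an awkward one) before you trust it on a new project.

```python
import math
from statistics import mean

# Hypothetical toy datasets with made-up numbers (NOT real results).
# In practice these might be in-house data plus public data from other labs.
DATASETS = {
    "in_house_demo": {"control": [10.0, 12.0, 11.0], "treated": [20.0, 22.0, 24.0]},
    "public_demo": {"control": [5.0, 6.0, 7.0], "treated": [5.5, 6.5, 6.0]},
    "edge_case_zeros": {"control": [0.0, 0.0, 0.0], "treated": [1.0, 2.0, 3.0]},
}

# A "template" setting: something you expect to revisit for each project.
PSEUDOCOUNT = 1.0


def log2_fold_change(control, treated, pseudocount=PSEUDOCOUNT):
    """Return log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def run_template(datasets):
    """Run the same analysis step on every dataset and sanity-check each result."""
    results = {}
    for name, groups in datasets.items():
        value = log2_fold_change(groups["control"], groups["treated"])
        # Stop early if a dataset exposes a problem (for example, a non-finite value).
        if not math.isfinite(value):
            raise ValueError(f"Non-finite result for dataset {name!r}")
        results[name] = value
    return results


if __name__ == "__main__":
    for name, value in run_template(DATASETS).items():
        print(f"{name}: log2 fold-change = {value:.2f}")
```

The all-zero control group is there on purpose: without the pseudocount, that dataset would cause a division by zero, which is the kind of issue you might not notice if the code was only ever run on one dataset.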

In short, I believe it is important to expect some testing for each project, in order to be the most confident in the results that you present.  One possible alternative to a "template" may be having code for public demo datasets (for training), but that is probably also not a complete solution.

Update Log:
6/2/2019 - original public post

 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.