Wednesday, November 6, 2019

Requiring (At Least Some) Methods Testing for Every Project

They may currently be a little hard to find, but I wanted to point out a couple of links that are relevant to showing the value of testing different RNA-Seq methods for every project:


  • SourceForge repository for public data analysis
    • I still have a ways to go before being able to start working on a paper, but you can see how I am progressing here
    • I think the Target_Recovery_Status.xlsx file (for checking recovery of the known genetic perturbation in an experiment) is the most relevant for showing that you could not choose one method out of edgeR, DESeq2, and limma-voom that maximally recovers the known gene knock-down or over-expression in every experiment
    • I am also experimenting with having a completely public log for notes and analysis
  • Acknowledgement for GitHub RNA-Seq gene expression template
    • Includes some papers with modified methods
  • While the newer analysis tends to have smaller sample sizes, you can see noticeable differences between methods in a much larger cohort in this post
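
As a rough illustration of what "recovering the known perturbation" could mean in practice, here is a minimal Python sketch. It is not the code behind the Target_Recovery_Status.xlsx file: the function names, the example numbers, and the cutoffs (adjusted p-value of at most 0.05 and an absolute log2 fold-change of at least 0.58, roughly 1.5-fold) are all assumptions chosen only for illustration.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical container: method name -> (log2 fold-change, adjusted p-value).
# None stands in for a missing (NA) adjusted p-value.
MethodResult = Tuple[float, Optional[float]]


def target_recovered(
    log2_fc: float,
    adj_p: Optional[float],
    expected_direction: str,
    max_adj_p: float = 0.05,        # assumed cutoff, not from the post
    min_abs_log2_fc: float = 0.58,  # assumed cutoff (~1.5-fold), not from the post
) -> bool:
    """Return True if the altered gene is called in the expected direction."""
    if expected_direction not in ("up", "down"):
        raise ValueError("expected_direction must be 'up' or 'down'")
    # A missing p-value counts as a failure to recover the target.
    if adj_p is None or math.isnan(adj_p):
        return False
    if adj_p > max_adj_p or abs(log2_fc) < min_abs_log2_fc:
        return False
    return (log2_fc > 0) == (expected_direction == "up")


def recovery_by_method(
    results: Dict[str, MethodResult], expected_direction: str
) -> Dict[str, bool]:
    """Apply the same recovery rule to the target gene for every method."""
    return {
        method: target_recovered(log2_fc, adj_p, expected_direction)
        for method, (log2_fc, adj_p) in results.items()
    }


# Made-up values for a knock-down target; these are NOT real results.
example_results = {
    "edgeR": (-1.8, 0.003),
    "DESeq2": (-1.7, None),      # missing adjusted p-value
    "limma-voom": (-1.6, 0.08),  # obvious change, but not significant
}
recovery = recovery_by_method(example_results, "down")
print(recovery)
```

Running the same rule over several experiments and tabulating the results per method is one way to see whether any single method recovers the target every time.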


On Biostars (as you can see in a variety of responses, including but not limited to this one), I generally give the following recommendations:


  • If possible, test calculating p-values with edgeR, DESeq2, and limma-voom
  • I would recommend having an independently calculated expression measurement (like FPKM, Fragments Per Kilobase per Million), in order to help assess method selection
    • For example, you might see an extremely obvious change in expression for a gene (such as the one that you altered), but it might not have a significant p-value (or have a missing p-value) for one of the methods.
    • While the optimal strategy for discovery may not necessarily be the one that most stringently recovers previous results, you may be able to tell some strategies clearly don't work well on your data.
    • I would also recommend using this gene expression measurement to create heatmaps to compare clustering of replicates
      • I would typically use this instead of exporting normalized counts from the method used to calculate the p-value, but testing clustering of replicates (without defining the groups in the normalization) is another possible way to compare strategies.
      • Sometimes this can be a bit qualitative.  However, if you define your gene lists / enrichment as a "hypothesis," then I think this is made up for by having independent validation for your claim.
      • I do realize this treads the line between p-hacking and needing to test methods due to limits in precision (which I mention a little bit in this comment and this Twitter discussion).  However, as scientists, I think this is part of why it is extremely important to be transparent and admit errors as soon as we discover them (in the interests of training ourselves to be as objective as possible).
  • Robustness of identifying a result with different methods may also give you some extra confidence in the results (unless the methods are not really independent, for example)
  • If you test alternative normalization, make sure you have a visualization before and after applying that normalization (to try to assess the likelihood of over-fitting in your adjustment)
  • I also think it is important that these are open-source, freely available programs (so that you can have the ability to determine what works best for your individual project)
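
To make the "independently calculated expression" recommendation above a little more concrete, here is a minimal Python sketch of an FPKM calculation and a log2 difference between groups. The function names, the pseudocount of 0.1, and the example numbers are assumptions for illustration. The choice of denominator (for example, all aligned fragments versus only fragments assigned to genes) also varies between pipelines, so treat this as one possible convention rather than the definitive one.

```python
import math
from typing import Sequence


def fpkm(fragment_count: float, gene_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Fragments Per Kilobase of gene length per Million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and total fragments must be positive")
    # count / (length in kb) / (total in millions) == count * 1e9 / (length * total)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def mean_log2_difference(treated_fpkm: Sequence[float],
                         control_fpkm: Sequence[float],
                         pseudocount: float = 0.1) -> float:
    """Mean log2(FPKM + pseudocount) in treated minus the mean in control.

    The pseudocount avoids log2(0); 0.1 is an assumed value.
    """
    def mean_log2(values: Sequence[float]) -> float:
        if not values:
            raise ValueError("each group needs at least one sample")
        return sum(math.log2(v + pseudocount) for v in values) / len(values)

    return mean_log2(treated_fpkm) - mean_log2(control_fpkm)


# Made-up example: 500 fragments on a 2,000 bp gene, 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
print(example_fpkm)

# Made-up FPKM values for the altered gene in knock-down versus control samples.
example_difference = mean_log2_difference([0.9, 0.9], [3.9, 3.9])
print(example_difference)
```

Because this calculation does not depend on edgeR, DESeq2, or limma-voom, a large difference for the altered gene here can be compared against each method's p-value for that same gene.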


In general, these posts may also be relevant to the discussion of limits to precision in the genomics methods:



Again, it is going to be a while, but I do hope to eventually have a preprint to cover the above points (as well as some other observations that I have had from working on a variety of projects for RNA-Seq gene expression analysis).

Change Log:

11/6/2019 - public post
6/3/2020 - add link for earlier (larger) RNA-Seq benchmark
6/7/2022 - minor formatting change

2 comments:

  1. Nice explanation, it really helps me to get a better understanding of how to perform RNA-Seq analyses.

    1. I think exposure to a variety of opinions is also important, so I don't want this alone to be a guide for RNA-Seq analysis.

      However, I think there are truly some important messages that I believe need to be communicated more often, and I am very happy that you think this is helpful!


 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.