Friday, January 25, 2019

TopHat2 Really Isn't That Bad

TopHat is a highly cited RNA-Seq aligner.  Newer methods for alignment (and quantification without alignment) are now available, so there is a debate about whether those newer (and also highly cited) methods should be used instead of the final version of TopHat2.

One of the authors on the original TopHat paper (although admittedly not on the TopHat2 paper, or on the HISAT paper, for that matter) has also discouraged people from using TopHat1.  To be precise, that tweet is about TopHat1 (and I would in fact use TopHat2).  However, I think the interpretation is often that HISAT/HISAT2 should completely replace TopHat2 (and that is what I strongly disagree with).

As pointed out in this article, the TopHat website does recommend use of HISAT2, as TopHat has been moved to a status of "low maintenance".  However, I usually prefer STAR and/or TopHat2 over HISAT2, so I still don't completely agree (for reasons explained in this post).

Since I kept bringing it up on Biostars, I thought it might be good to have a brief blog post with some points that I include in multiple responses (in terms of why I think TopHat2 can still be useful):

1) I've had situations where the more conservative alignment from TopHat2 was arguably useful to avoid alignments from unintended sequence and/or contamination.

2) I've had situations where unaligned reads from TopHat2 were useful for conversion back to FASTQ for use in RNA-Seq de novo assembly (such as for exogenous sequence and/or pathogen identification).

3) For gene expression, if I see something potentially strange with a TopHat2 alignment (for endogenous gene expression), I have yet to find an example where the trend was substantially different with STAR (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count is usually similar for TopHat2 and STAR single-end alignments for most genes).

There was only a small subset where I confirmed that the conclusion was not different with a STAR alignment, but you can see this acknowledgement to get some idea about the total number of samples tested (although I apologize that you have to scroll down to see names, where the name count per protocol is correlated with the total number of projects).

I also have a more recent blog post referencing some comparisons where I tested recovery of the altered gene in a knock-out or over-expression dataset, where I start with a TopHat2 alignment.

4) At least for 51 bp Single-End Reads (with 4 cores and 8 GB of RAM per sample, on a cluster), I wouldn't consider the difference in run-time for TopHat2 (versus STAR) to be prohibitive (I think it is usually within a few hours, running samples in parallel).  I believe the difference is a bigger issue with paired-end reads, but I don't encounter those as often.

That said, I am usually running TopHat2 with the --no-coverage-search parameter, which can noticeably decrease the run-time.
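As a rough illustration of the settings mentioned above (and not the exact commands from any particular project), the Python sketch below assembles a single-end TopHat2 command with --no-coverage-search and 4 threads, plus a samtools command for converting the unaligned reads back to FASTQ.  The helper function names, sample name, and paths are hypothetical, and I am assuming that TopHat2 writes its unaligned reads to unmapped.bam in the output folder and that samtools (with the fastq subcommand) is installed.  The 8 GB of RAM would be requested from the cluster scheduler rather than from TopHat2, so it does not appear here.

```python
import shlex


def tophat2_command(index_prefix, fastq, out_dir, threads=4):
    """Build a single-end TopHat2 command line (nothing is executed here)."""
    return [
        "tophat2",
        "--no-coverage-search",  # skip the coverage-based junction search to reduce run-time
        "-p", str(threads),      # number of threads for this sample
        "-o", str(out_dir),      # output folder for this sample
        str(index_prefix),       # Bowtie2 index prefix (hypothetical path)
        str(fastq),              # single-end FASTQ file (hypothetical path)
    ]


def unmapped_to_fastq_command(out_dir):
    """Build a samtools command that writes the unaligned reads as FASTQ to stdout.

    Assumes the TopHat2 output folder contains unmapped.bam.
    """
    return ["samtools", "fastq", f"{out_dir}/unmapped.bam"]


if __name__ == "__main__":
    # Hypothetical sample; print the commands so they can be reviewed
    # or pasted into a cluster job script.
    print(shlex.join(tophat2_command("ref/genome", "sample1.fastq", "sample1_tophat2")))
    print(shlex.join(unmapped_to_fastq_command("sample1_tophat2")))
```

Printing the commands (instead of running them directly) makes it easier to check each sample before submitting jobs in parallel; they could also be passed to subprocess.run() once the paths are confirmed.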

Here are a couple of possibly relevant Biostars posts:

Also, there are some points that were brought up after the initial post:

  1. predeus had a good comment essentially mentioning that the STAR and HISAT parameters can be changed to be more conservative (for either method) and/or provide unaligned reads (for STAR).  This may be worth considering during troubleshooting.
  2. Bastien Hervé had a good comment that the TopHat website says "Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way."  I can now see possible concerns about how someone may think I was trying to imply that the developers agreed with me, when I need to be representing my opinions as independent ideas.  So, I apologize about that.
    • That said, I think it is important to emphasize that programs can have useful applications that may not have been planned by developers, and I think the above points about single-end 50 bp RNA-Seq samples are still relevant.
    • Also, I have even had a period of time when I recommended people use minfi instead of COHCAP, even though I am the COHCAP developer / maintainer.  In fact, I think that was even on the main page for the standalone version, since I am having a hard time finding a ~2015 discussion group question where that was my answer.  However, I currently believe that I may have shortchanged my own package (due to the greater popularity of minfi, and one benchmark showing similar validation for COHCAP versus minfi/bumphunter).
      • It is more of a side note, but the COHCAP paper is also relevant to the discussion of not implying agreement, since one of the comments acknowledges that I shouldn't have used "City of Hope" in the algorithm name.

I will probably continue to update this in the future, but I wanted to have some place to save my thoughts in the meantime.

Update Log:

1/25/2019 - modified formatting and wording (decided to add a note about this retroactively on 1/29)
1/29/2019 - not directly an update to this blog, but there was a response to the points that I proposed here:
1/30/2019 - added point about run-time (and specifically reference extra comment about STAR/HISAT parameters)
2/26/2019 - Added link to Biostars post about lower alignment rate (for TopHat2 versus Bowtie2)
2/27/2019 - Add point about COHCAP and not implying agreement
3/2/2019 - Add link to Twitter response
6/17/2019 - Mention that I was using the --no-coverage-search parameter
7/5/2019 - qualify difference in gene expression trends and remove splicing event (since I could also identify that gene with a later version of rMATS and a STAR alignment)
3/18/2020 - add link to note on TopHat website, as well as a link about more recent benchmarks


  1. Very interesting post. Are you aware of cases where the mapping rate is substantially lower with STAR/HISAT2 than with TopHat2? I was trying to reproduce some results from an SRA data set that happens to have shorter reads (m6A-Seq, average length 49) and the mapping rate for both STAR/HISAT2 is fairly low, while in the original paper they used TopHat2 (they didn't report the mapping stats). I was curious whether that was due to the tools or the sequencing (either specifically the m6A-Seq protocol, or the data quality). And if it is, are those mapped reads reliable, or were STAR/HISAT2 right to throw them out?

    Thanks, any thoughts would be greatly appreciated.

    1. I think it may be better to post a question like this on Biostars, to see what feedback you can get from different people.

      I can't really answer that question for certain about your data (so, I don't know if STAR is right to throw them out; for example, in some organisms, maybe the genome reference could benefit from refinement and unaligned reads may correspond to true genes), but there was a question about a lower TopHat alignment on Biostars today:

    2. Yes, that post is what led me here :) I'll post it and see what happens.

    3. Ok - also, I don't typically test alignments for each project, but I would probably expect the TopHat2 alignment rate to be lower than STAR/HISAT. So, if you already have a low alignment rate, I think TopHat2 alone probably won't solve your problem.


Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.