Sunday, June 9, 2019

Considerations for "Somatic Mutations Widespread Across Normal Tissues"

On Friday, the GenomeWeb summary titled "Somatic Mutations Widespread Across Normal Tissues, New RNA-Seq Analysis Finds" caught my attention.

I had to read the title a second time to realize that the Yizhak et al. 2019 article also mentions "normal" samples.  The reason is that Figure 1 shows tumor samples (and I would typically expect people to be using MuTect to be making somatic variant calls in tumors).

In terms of the tumor somatic variant calling (which is what I initially thought the article was emphasizing), something did seem strange about Figure 1A because (with rare exceptions) a true RNA-Seq mutation should also be present in paired DNA-Seq.  An important caveat is that the figure legend describes these as "mutations detected before filtering," but that is almost exactly what I thought may need to be made more clear in this blog post.

In fact, I was confused because Figure 1A is not the frequency of RNA-MuTect somatic variants calls.  I would have liked to see a similar plot after filtering, but they do say in the main text "to address the excessive mutations detected in only the RNA, we developed RNA-MuTect, which is based on several key filtering steps (fig. S3)...the vast majority (93%) of RNA mutations were filtered out".  In other words, the authors agree that Figure 1A indicates the presence of artifacts that need to be removed (and that the rationale for needing to develop their filtering process).

So, I agree with the authors more than I originally expected.  For example, what initially caused me the most concern is the use of the word "widespread" in the GenomeWeb summary that mentioned "normal" tissues.  However, the authors did not use the word "widespread" in their title.

Nevertheless, if there were parts of the article that confused me, then it may be confusing to other people as well.

In terms of things that you might generally need to watch out for, if you were working with a stranded RNA-Seq library, then you won't be able to correct for a strand bias (and they show a clear non-transcribed strand bias for the GTEx data in Figure S18).  Gene expression limits the ability to detect mutations, but you might also want to intentionally look for mutations retained in highly expressed genes.  Additionally, the title of the GenomeWeb summary reminded me of an RNA-editing paper that I am surprised hasn't been retracted yet (even though it had several published comments linked in PubMed and blog posts expressing concern). So, details regarding the alignment method, etc. are also important.

That said, Figure 2C makes me think there can be situations where even greater filtering may be worth considering.  For example, if one of the points of looking at RNA-Seq data is to look for variants that remained at high allele fraction (with the assumption that the cell could transcribe alleles/isoforms at different rates to decrease the allele fraction ;of the less advantageous allele), you may want to look for causal variants at greater than 20% allele allele fraction (in a pure tumor / disease sample).  Indeed, they explain the lack of validation for many DNA-Seq mutations as reduced detection in genes with low expression levels (where they cite calculations in Figure S2); However, to be fair, there are other sections of the article where I get the impression that variant fractions greater than 5% are thought to be more robust, which I agree with.

Nevertheless, to be clear, they already do noticeable variant filtering (described in the section "The RNA-MuTect pipeline" of the Supplemental Methods), which includes filtering of RNA editing sites.

Also, there are parts of the paper that use the term "pipeline," and I touch on the likely need to consider the default results as "initial" results that might require additional refinement (as a general rule) in this other post.  However, I think that is a broader issue to be communicated.

In terms of some specific comments for this paper:

a) I would have liked to see something like Figure 1A and Figure 2C for RNA-MuTect filtered somatic tumor mutations. However, I think some of this information is shown (per sample) in the bottom (and top) part of Figure 1C and Figure 4C.

Figure S5B is also somewhat similar to Figure 2C, although I believe the supplemental figure is specifically for variants related to those given signatures.

That said, while I am glad they used a conservative strategy overall, the variation in sensitivity / precision per-patient in Figure 1B makes me think RNA somatic variant calling still may require some optimization for various projects.

b) I believe there is a typo in Figure 1D (or at least the caption).  The caption makes it sound like the labels should be "Smoking UV APOBEC COSMIC POLE MSI W2".  However, I think the signatures are first defined from de-convolution, and then annotated.  This would explain how W2 is described as the APOBEC signature in the caption for Figure S5, while W2 is described as an RNA signature in the caption for Figure 1D (where APOBEC is described as W4).  Nevertheless, if W2 is supposed to be the 2nd row, that means that W2 should be described as (ii) rather than (vii) for the Figure 1D caption.

c) There is another typo in the main text:

Current: "Looking at tissue subregions, we found that non-sun-exposed skin had more mutations than nonexposed skin"

Corrected: "Looking at tissue subregions, we found that sun-exposed skin had more mutations than nonexposed skin"

d) I was a little surprised when 67 / 87 (77%) of rows in Table S6 were for DNMT3A (mostly at chr2:25468887), but I am guessing that is due to what was defined in the earlier list of 332 variants.  In general, prior knowledge for well-characterized variants may be preferable than filtering for discovery (for some situations), but I'm not sure what else to say about this specific result.

e) Please note that the scale of the y-axis is different within Figure 3C.

f) While I thought I remembered use of two alignments per sample, I am a little confused about part of the Supplemental Methods.

1. Mutation calling pipeline -- (2) realigning identified sSNVs with NovoAlign ( and performing an additional iteration of MuTect with the newly aligned BAM files.

3. The RNA-MuTect pipeline -- (3) A realignment filter for RNA-seq data where all reads aligned that span a candidate variant position from both the tumor (case) and normal (control) samples are realigned using HISAT2"

The code on-line uses HISAT2 for re-alignment (not NovoAlign).  I think the "Mutation calling pipeline" was upstream of "The RNA-MuTect pipeline"?  What could explain the difference in methods.  However, if the unfiltered set of calls is for the "Mutation calling pipeline," then that leaves a lot of false positives (that need to be filtered with something like the RNA-MuTect scripts).

Update (7/11/2019): During a journal club discussion, Integrative Genomics Core (IGC) staff helped me realize that the "1. Mutation calling pipeline" section describes both DNA and RNA-Seq data.  While the mixed sentences make it hard to follow, a possible explanation would be that Novoalign was used with the DNA-Seq data (and HISAT2 was used for the RNA-Seq data).  However, if that is true, then there should have been 3 categories in Figure 1A (DNA without filtering, DNA with re-alignment/filtering, RNA without filtering)

g) The main text describes a “yet unreported mutational signature in the RNA domainated by C>T mutations” (the majority of which were in a single colon cancer sample).  However, this seems quite similar to the oxoG artifact (Costello et al. 2013) that they describe filtering for the “RNA Mutation Calling Pipeline” (and the Haradhvala et al. 2016 8-oxoG pattern that they cite for the GTEx strand bias in Figure S18)

h) There is also at least one sentence that needs to be re-worded in the Supplemental Methods: "In comparison, RNA-MuTect that takes power into account and therefore can potentially results with a lower overlap (due to the lower number of powered sites out of total sites), achieves a median overlap of 10 mutations and an average of 18.6 mutations." (the tense / grammar is off)

i) I believe there is a typo at the top of Figure S3?  I thought the paired DNA was the validation, and I would expect most people would want to use RNA-MuTect for paired human tumor-normal RNA-Seq samples (not tumor RNA-Seq and normal DNA-Seq)

j) I believe the tissue-specific mutation hotspots in unaffected normals (Figure 4A) for skin and esophagus was not validated in the TCGA tumor cohorts for melanoma (SKCM) and esophageal cancer (ESCA) also did not have a row with the darkest red shade (for greatest mutation frequency)?

k) I believe there are 2 typos in the HISAT2 parameters (at least when using version 2.1.0)

Provided: l 0
Alternative: -I 0 (capital i, rather than lower-case l)

Provided: x 800
Alternative: -X 800 (capitalize x)

To be clear, I am certain that some somatic variants exist in normal tissues.  For example, the increased mutation rate in normal skin in Figure 2B (and otherwise low variant counts) made me more more comfortable with the normal tissue analysis.  However, I am saying such analysis needs to be careful about false positives and possible artifacts, and additional filtering on your own samples may be necessary (which, again, I believe the RNA-MuTect authors already agree with me on this issue, and that is why they developed their filtering strategy).

Also, it is my own fault that I didn't initially read to the end of the GenomeWeb article, where it was also mentioned "[we] expect that most of these clones would not ever become cancer."  I think that is important context to avoid alarm.  While I am glad that I took time to understand the article better on the weekend, I still have some room for improvement in terms of taking the right amount of time to develop an opinion / impression of a finding.

Finally, as kindly pointed to me, Science has an eLetters system (similar to the Disqus comment system).  I have pointed out the 2 most clear errors in an eLetter, and there is also another comment from another reader.

I also summarized some more general content that I decided to split into a separate post, but I think most of this content was more appropriate as a blog post than an in-article comment.

Update Log:

6/9/2019 - original public post
6/10/2019 - minor changes
6/18/2019 - add question / note on re-alignment method
6/28/2019 - update as I prepare for IGC Bioinformatics Journal Club presentation.  Fix some formatting issues introduced after that.
7/9/2019 - fix typo: Yizak --> Yazhak (I noticed this after a separate Disqus comment)
7/11/2019 - add note about mixed paragraph
7/12/2019 - add additional points i) and j) from yesterday's discussion
8/27/2019 - note expected typo for HISAT2 parameters in Table S12
5/1/2020 - minor changes + mention that Science allows published comments approved by the editors
5/15/2020 - add link to eLetter


  1. Hello. I do not understand the meaning of leakage errors in figure S3. Could you please explain it for me?

  2. Hi - I am at home, where I don't directly have access to the article.

    However, I just discovered that there was a pre-print:

    So, Figure 1E in the pre-print has a diagram for the leakage errors.

    Essentially, is there is a position that breaks a large homopolymer into two separate homopolymers, the leakage error is a great that appears to have 1 large homopolymer instead of 2 smaller ones.

    In the provided leakage error example, there are some reads that incorrectly report AAGAA as AAAAA.

  3. Thank you very much for your instruction.

    >Essentially, is there is a position that breaks a large homopolymer into two separate homopolymers, the leakage error is a great that appears to have 1 large homopolymer instead of 2 smaller ones.

    Could you tell me more above?

    1. Hi - I would say it is better to ask the authors about this (although I think I originally contacted them about the errors and I didn't get a response).

      At least some of the data is controlled access (and they don't provide enough information to re-produce every step). So, there may exist a file that lists some specific variants caused by leakage errors, but I don't know it off the top of my head (I could just use the example illustrated in the figure, of the 5 nucleotide example).

      Thank you very much for your interest!


Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.