Saturday, June 22, 2019

Speculative Opinion: Humane Experimental Therapies for Animals May Help Human Health?

This week, there were some Nature summaries about the possibility of developing mouse lemurs as a "model organism".  However, that linked article covered more than the topic of the mouse lemur as a model organism: for example, more than half of the video at the end focused on work being done at a National Park in Madagascar (including some citizen science work for education).

I thought the description of a species that was in between a mouse and a monkey was interesting, but I am not convinced about the widespread use of the mouse lemur as a model organism.  I'm also not sure about the evidence for the mouse lemur being the 2nd most abundant primate (after people); for example, one of the "least-concern" mouse lemur species had an unknown population size on one website (and the Wikipedia link lists a gibbon as being 2nd, with an order of magnitude difference in population size).  However, if there are ways to learn about possible human treatments through the experiences with cost effective ways to help the mouse lemur population, then I think that would be extremely interesting.

I believe I first heard about human treatments being inspired by animal treatments when I was reading the article "Cancer Clues from Pet Dogs," in the 2018 special issue of Scientific American on "The Science of Dogs and Cats."  I don't believe I saw specific citations for the points that I found most interesting, but I'll summarize some points below (with some papers that I found by Googling):

  • My understanding is that piroxicam treating bladder cancer in dogs influenced the use of celecoxib (Celebrex) in humans.
    • Perhaps this relates to what is described in Knapp et al. 2016?
    • The VCA link above mentions the use being off-label, so I am guessing there is even more of a backstory.
  • The article also mentions Stephen Withrow playing a role in optimizing methods to avoid limb amputation for osteosarcoma (in dogs, but now used in people).
  • The article mentions intranasal delivery of IL-2 contributing to inhaled IL-2 therapy for lung metastases in people
  • David Waters (one of the authors) describes using CT scans in dogs to test whether signal is for a true tumor (or a false positive), with the expectation that should also help improve the accuracy in the methods for people
    • I think this study by Waters et al. 1998 is a relevant citation, although I am not sure if there is a paper to show translation into human methods
    • I'm also not entirely certain if we are talking about the dogs as patients (those are the examples that I am trying to list), versus the dogs as subjects (which would be more along the lines of a "model organism" or "animal model")

Again, I'm not sure what the primary references are for these claims, where I would ideally want to see validation from independent labs.  Admittedly, that is for biological / medical concepts, and you can't really prove inspiration in the same sense (there are limits in the chronological order, when claiming "procedure X in dogs enabled the similar or same procedure X in people," although I've certainly seen examples where the official publication date of peer reviewed publications didn't actually match the order of discovery).  Nevertheless, please let me know if you can help me fill in some of the details!

More broadly, in "Reason for Hope" (in the chapter "On the Road to Damascus"), Jane Goodall describes somebody whose daughter had a heart problem and was told "her daughter was only alive because of experimental work on dogs," and that individual approached Jane in a very negative manner.  Dr. Goodall then waited for the individual to calm down, and explained that her own mother had a pig valve in her heart (even thinking through the process to point out "[it] was from a commercially slaughtered hog, but the procedure had been worked out with pigs in a laboratory").  She then said "I just feel terribly grateful to the pig that saved my mother's life...So I want to do all I can to improve conditions for pigs - in the labs and on the farms.  Don't you feel grateful to the dogs who saved your daughter?"  I think this is a good mindset.  From my end, I also think that treating animals more like people might even help with successfully translating discoveries to humans.

I also think that there are fewer experiments on cats and dogs now (most mammalian experiments are in mice), but I don't believe those numbers have dropped to zero.  That said, if you start asking people what evidence they want to see before an experiment is performed on their own pet, then that may also shift the way some people think about animal research.

Also, the video mentions creation of knock-out mutants (for a model organism, like a mouse), and this is different from the animal therapies that I describe above.  However, I think communication between those representing animal rights and those conducting animal experiments is important.  For example, you need mutual understanding and open communication to have discussions about whether an experiment is necessary or (in the case of a genetic modification, like a gene knock-out) how to troubleshoot whether that modification was successful and retained over generations.  This might even be relevant to discussions of safety for genetic editing in people.

Finally, pets may not always have health insurance (and wildlife definitely doesn't have insurance).  For example, I'm going out on a limb on this one, but maybe this could even help with some of the details of directly offering generics through non-profits.  So, if we can find cost effective ways to make discoveries and continue to provide treatment at a reasonable cost (while respecting those who contributed to that body of knowledge), then I think that could be quite interesting and important.

Change Log:
6/22/2019 - public post date
-- I also want to acknowledge that I had a Facebook post about the mouse lemur article, and that is why I expanded my thoughts for this post.
11/3/2019 - add "speculative opinion" tag

Saturday, June 15, 2019

What About Bioinformatics Companies?

A product starting under a grant (such as in academics or a non-profit) but later becoming a start-up (as a for-profit) is one possibility. However, my post on providing generics through non-profits would bring into question whether such a start-up could continue to be an independent non-profit (and/or a government entity/contract).

While I think I need to learn more before being able to say something strictly can't be provided from a for-profit organization, I think there are some things that may need to be taken into consideration within the current framework of options:

  • Perhaps require free command-line version of software for commercial software with a user interface (and emphasize the education component to learn more coding)?  Otherwise, it isn't really reproducible for most people, and it probably isn't appropriate to only use one program in all situations.
  • In terms of compromises to help with reproducibility, Novoalign has a free version that uses fewer threads, and MATLAB allows people to run programs developed in MATLAB without a license.
  • precisionFDA is designed/supported by a private company (DNAnexus), through what I assume is a government contract.  So, if you have genomics data, it is free to re-analyze / compare your own genomics data (since the FDA is paying for the costs).  I think this is a very good thing for citizens that can help them become more involved (and, hopefully, understand the difficulties of the regulatory process a little better).  I have some notes on my own experiences with precisionFDA here.  So, I support this strategy (although perhaps there can be discussion about the fact that DNAnexus is currently a for-profit company).

Also, the idea of encouraging the testing of multiple free RNA-Seq methods (something else that I would like to be able to show, at some point) may seem at odds with having commercial bioinformatics software.  While there is some truth to this, I would have the following response to such a critique:

1) I think the time frame matters when discussing software recommendations.  If the goal is to increase coding abilities in 5-10 years, then papers that need to be published sooner may need some alternative solution.  In other words, if somebody doesn't know how to code and there is a program with a graphical user interface that helps them do some analysis on their own, I think that can be good.  However, if they get a weird result (and/or a negative result) with that program (which could have a commercial license), I strongly suggest they test other programs before preparing for publication (and those other programs may need to be open-source command line programs).

2) Having extra options (which includes commercial software) gives labs more options of programs that they can test for their project.  So, if you get a weird result with the open-source software, having extra commercial options may help.  My only concern is that the fees may be a barrier to entry for some labs, and I don't want to encourage excessive use of free trials if licenses are not often eventually purchased.

So, I think there is still value in giving suggestions of how to make the most out of available open-source options, even if you use some commercial programs.  This is similar to what I do: the majority of the bioinformatics programs that I use are open-source, but I sometimes also use commercial programs (like IPA).  However, even with IPA, I would also recommend comparing results with free programs (like Enrichr).  Sometimes the free open-source programs end up being a better fit for the individual project than the commercial ones, but that often varies by project.

However, if you can't lock down one particular program to use in all situations (which has definitely been my experience), that is why I am somewhat concerned about the barriers to entry that could be caused by having to purchase licenses (and, if you don't already have a license, I would usually recommend trying out open-source options first).  I think this also helps with your ability to provide support during transitions.  For example, I code in R rather than MATLAB, so I don't have to worry about keeping a MATLAB license (and a lot of genomics packages are developed in R/Bioconductor, in addition to it being free).

Change Log:
6/15/2019 - public post date

Sunday, June 9, 2019

Considerations for "Somatic Mutations Widespread Across Normal Tissues"

On Friday, the GenomeWeb summary titled "Somatic Mutations Widespread Across Normal Tissues, New RNA-Seq Analysis Finds" caught my attention.

I had to read the title a second time to realize that the Yizhak et al. 2019 article also mentions "normal" samples.  The reason is that Figure 1 shows tumor samples (and I would typically expect people to use MuTect to make somatic variant calls in tumors).

In terms of the tumor somatic variant calling (which is what I initially thought the article was emphasizing), something did seem strange about Figure 1A because (with rare exceptions) a true RNA-Seq mutation should also be present in paired DNA-Seq.  An important caveat is that the figure legend describes these as "mutations detected before filtering," but that is almost exactly what I thought may need to be made more clear in this blog post.

In fact, I was confused because Figure 1A is not the frequency of RNA-MuTect somatic variant calls.  I would have liked to see a similar plot after filtering, but they do say in the main text "to address the excessive mutations detected in only the RNA, we developed RNA-MuTect, which is based on several key filtering steps (fig. S3)...the vast majority (93%) of RNA mutations were filtered out".  In other words, the authors agree that Figure 1A indicates the presence of artifacts that need to be removed (and that is the rationale for needing to develop their filtering process).

So, I agree with the authors more than I originally expected.  For example, what initially caused me the most concern is the use of the word "widespread" in the GenomeWeb summary that mentioned "normal" tissues.  However, the authors did not use the word "widespread" in their title.

Nevertheless, if there were parts of the article that confused me, then it may be confusing to other people as well.

In terms of things that you might generally need to watch out for, if you are working with a stranded RNA-Seq library, then you won't be able to correct for a strand bias (and they show a clear non-transcribed strand bias for the GTEx data in Figure S18).  Gene expression limits the ability to detect mutations, but you might also want to intentionally look for mutations retained in highly expressed genes.  Additionally, the title of the GenomeWeb summary reminded me of an RNA-editing paper that I am surprised hasn't been retracted yet (even though it had several published comments linked in PubMed and blog posts expressing concern).  So, details regarding the alignment method, etc. are also important.

That said, Figure 2C makes me think there can be situations where even greater filtering may be worth considering.  For example, if one of the points of looking at RNA-Seq data is to look for variants that remained at high allele fraction (with the assumption that the cell could transcribe alleles/isoforms at different rates to decrease the allele fraction of the less advantageous allele), you may want to look for causal variants at greater than 20% allele fraction (in a pure tumor / disease sample).  Indeed, they explain the lack of validation for many DNA-Seq mutations as reduced detection in genes with low expression levels (where they cite calculations in Figure S2).  However, to be fair, there are other sections of the article where I get the impression that variant fractions greater than 5% are thought to be more robust, which I agree with.
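
As a rough illustration of the allele-fraction thresholds discussed above, the following Python sketch filters a list of variant calls by alternate allele fraction.  The record fields (`ref_count`, `alt_count`), the example variants, and the function names are all hypothetical (they are not taken from RNA-MuTect), and the 5% and 20% cutoffs are simply the values mentioned in this post, not recommended defaults.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Minimal variant record (hypothetical fields, for illustration only)."""
    chrom: str
    pos: int
    ref_count: int  # reads supporting the reference allele
    alt_count: int  # reads supporting the alternate allele


def allele_fraction(call):
    """Return the alternate allele fraction, or 0.0 if there is no coverage."""
    depth = call.ref_count + call.alt_count
    return call.alt_count / depth if depth > 0 else 0.0


def filter_by_allele_fraction(calls, min_fraction):
    """Keep calls whose alternate allele fraction is greater than min_fraction."""
    return [c for c in calls if allele_fraction(c) > min_fraction]


# Made-up example calls (not real data)
example_calls = [
    VariantCall("chr1", 1000, ref_count=97, alt_count=3),   # 3% allele fraction
    VariantCall("chr2", 2000, ref_count=90, alt_count=10),  # 10% allele fraction
    VariantCall("chr3", 3000, ref_count=60, alt_count=40),  # 40% allele fraction
]

# Looser cutoff (5%) versus stricter cutoff (20%)
robust_calls = filter_by_allele_fraction(example_calls, 0.05)
high_fraction_calls = filter_by_allele_fraction(example_calls, 0.20)
print(len(robust_calls), len(high_fraction_calls))
```

In practice, a threshold like this would probably need to be considered together with coverage (a 40% allele fraction from 5 reads is not the same as from 100 reads), which this sketch does not attempt to model.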

Nevertheless, to be clear, they already do noticeable variant filtering (described in the section "The RNA-MuTect pipeline" of the Supplemental Methods), which includes filtering of RNA editing sites.

Also, there are parts of the paper that use the term "pipeline," and I touch on the likely need to consider the default results as "initial" results that might require additional refinement (as a general rule) in this other post.  However, I think that is a broader issue to be communicated.

In terms of some specific comments for this paper:

a) I would have liked to see something like Figure 1A and Figure 2C for RNA-MuTect filtered somatic tumor mutations. However, I think some of this information is shown (per sample) in the bottom (and top) part of Figure 1C and Figure 4C.

Figure S5B is also somewhat similar to Figure 2C, although I believe the supplemental figure is specifically for variants related to those given signatures.

That said, while I am glad they used a conservative strategy overall, the variation in sensitivity / precision per-patient in Figure 1B makes me think RNA somatic variant calling still may require some optimization for various projects.

b) I believe there is a typo in Figure 1D (or at least the caption).  The caption makes it sound like the labels should be "Smoking UV APOBEC COSMIC POLE MSI W2".  However, I think the signatures are first defined from de-convolution, and then annotated.  This would explain how W2 is described as the APOBEC signature in the caption for Figure S5, while W2 is described as an RNA signature in the caption for Figure 1D (where APOBEC is described as W4).  Nevertheless, if W2 is supposed to be the 2nd row, that means that W2 should be described as (ii) rather than (vii) for the Figure 1D caption.

c) There is another typo in the main text:

Current: "Looking at tissue subregions, we found that non-sun-exposed skin had more mutations than nonexposed skin"

Corrected: "Looking at tissue subregions, we found that sun-exposed skin had more mutations than nonexposed skin"

d) I was a little surprised when 67 / 87 (77%) of rows in Table S6 were for DNMT3A (mostly at chr2:25468887), but I am guessing that is due to what was defined in the earlier list of 332 variants.  In general, prior knowledge for well-characterized variants may be preferable to filtering for discovery (for some situations), but I'm not sure what else to say about this specific result.

e) Please note that the scale of the y-axis is different within Figure 3C.

f) While I thought I remembered use of two alignments per sample, I am a little confused about part of the Supplemental Methods.

1. Mutation calling pipeline -- (2) realigning identified sSNVs with NovoAlign ( and performing an additional iteration of MuTect with the newly aligned BAM files.

3. The RNA-MuTect pipeline -- (3) A realignment filter for RNA-seq data where all reads aligned that span a candidate variant position from both the tumor (case) and normal (control) samples are realigned using HISAT2"

The code on-line uses HISAT2 for re-alignment (not NovoAlign).  I think the "Mutation calling pipeline" was upstream of "The RNA-MuTect pipeline"?  What could explain the difference in methods?  However, if the unfiltered set of calls is for the "Mutation calling pipeline," then that leaves a lot of false positives (that need to be filtered with something like the RNA-MuTect scripts).

Update (7/11): During a journal club discussion, IGC staff helped me realize that the "1. Mutation calling pipeline" section describes both DNA and RNA-Seq data.  While the mixed sentences make it hard to follow, a possible explanation would be that Novoalign was used with the DNA-Seq data (and HISAT2 was used for the RNA-Seq data).  However, if that is true, then there should have been 3 categories in Figure 1A (DNA without filtering, DNA with re-alignment/filtering, RNA without filtering)

g) The main text describes a “yet unreported mutational signature in the RNA dominated by C>T mutations” (the majority of which were in a single colon cancer sample).  However, this seems quite similar to the oxoG artifact (Costello et al. 2013) that they describe filtering for the “RNA Mutation Calling Pipeline” (and the Haradhvala et al. 2016 8-oxoG pattern that they cite for the GTEx strand bias in Figure S18).

h) There is also at least one sentence that needs to be re-worded in the Supplemental Methods: "In comparison, RNA-MuTect that takes power into account and therefore can potentially results with a lower overlap (due to the lower number of powered sites out of total sites), achieves a median overlap of 10 mutations and an average of 18.6 mutations." (the tense / grammar is off)

i) I believe there is a typo at the top of Figure S3?  I thought the paired DNA was the validation, and I would expect most people would want to use RNA-MuTect for paired human tumor-normal RNA-Seq samples (not tumor RNA-Seq and normal DNA-Seq)

j) I believe the tissue-specific mutation hotspots in unaffected normals (Figure 4A) for skin and esophagus were not validated in the TCGA tumor cohorts, since the rows for melanoma (SKCM) and esophageal cancer (ESCA) did not have the darkest red shade (for greatest mutation frequency)?

k) I believe there are 2 typos in the HISAT2 parameters (at least when using version 2.1.0)

Provided: l 0
Alternative: -I 0 (capital i, rather than lower-case l)

Provided: x 800
Alternative: -X 800 (capitalize x)
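
To make the difference concrete, here is a small Python sketch that assembles a HISAT2 paired-end command line with the capitalized options.  In HISAT2, `-I` and `-X` set the minimum and maximum fragment length for paired-end alignments, while lower-case `-x` is the index basename (so capitalization matters).  The index prefix and file names are hypothetical placeholders, the other parameters from Table S12 are not reproduced here, and the command is only printed (not run).

```python
import shlex


def build_hisat2_command(index_prefix, reads_1, reads_2, output_sam,
                         min_fragment=0, max_fragment=800):
    """Assemble a HISAT2 paired-end command as an argument list."""
    return [
        "hisat2",
        "-x", index_prefix,       # lower-case x: index basename
        "-1", reads_1,            # forward reads
        "-2", reads_2,            # reverse reads
        "-I", str(min_fragment),  # capital I: minimum fragment length
        "-X", str(max_fragment),  # capital X: maximum fragment length
        "-S", output_sam,         # SAM output file
    ]


# Hypothetical file names, for illustration only
command = build_hisat2_command(
    "genome_index", "sample_R1.fastq", "sample_R2.fastq", "sample.sam"
)
print(shlex.join(command))
```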

To be clear, I am certain that some somatic variants exist in normal tissues.  For example, the increased mutation rate in normal skin in Figure 2B (and otherwise low variant counts) made me more comfortable with the normal tissue analysis.  However, I am saying such analysis needs to be careful about false positives and possible artifacts, and additional filtering on your own samples may be necessary (which, again, I believe the RNA-MuTect authors already agree with, and that is why they developed their filtering strategy).

Also, it is my own fault that I didn't initially read to the end of the GenomeWeb article, where it was also mentioned "[we] expect that most of these clones would not ever become cancer."  I think that is important context to avoid alarm.  While I am glad that I took time to understand the article better on the weekend, I still have some room for improvement in terms of taking the right amount of time to develop an opinion / impression of a finding.

Finally, it might also be nice if Science can add a Disqus comment system, similar to other journals.  I think the paper probably requires a formal correction (such as for the 3 typos in specific points b), c), and h)).  However, to be fair, I also understand that the correction process can take some time.  Either way, I think having a shorter amount of commentary that is more readily available to readers almost immediately is important.

I also summarized some more general content that I decided to split into a separate post, but I think most of this content was more appropriate as a blog post than an in-article comment.

Update Log:

6/9/2019 - original public post
6/10/2019 - minor changes
6/18/2019 - add question / note on re-alignment method
6/28/2019 - update as I prepare for IGC Bioinformatics Journal Club presentation.  Fix some formatting issues introduced after that.
7/9/2019 - fix typo: Yizak --> Yizhak (I noticed this after a separate Disqus comment)
7/11/2019 - add note about mixed paragraph
7/12/2019 - add additional points i) and j) from yesterday's discussion
8/27/2019 - note expected typo for HISAT2 parameters in Table S12

General Comments on Low-Frequency Variant / Sequence Filtering

In terms of general troubleshooting artifacts for low-frequency variants (for any data type), these are some things that I think may be worth taking into consideration:

1) Increased false positives for DNA-Seq variants at lower frequencies in normal controls

From my own publication record, I discuss this in Warden et al. 2014:

Namely, Figure 7 shows an implausibly high proportion of damaging variants with lower-frequency germline variants using VarScan with default settings (in healthy 1000 Genomes controls):

The variant frequencies (per sample, at a given variant position) are not immediately obvious from the above plot.  However, you can see those "novel" variants are more likely to be detected with the default settings in VarScan (Figure 4 of the Warden et al. paper):

Although, importantly for this discussion, those variant calls can be improved with filtering.  Notice the "novel" and "known" variant frequencies are more similar, which is one of the indications in that paper of a lower false positive rate (Figure 5 of the Warden et al. paper):

In terms of other publications, you can also see increased discrepancies for low-frequency somatic variants (less than 20% variant fraction) in the Arora et al. 2019 pre-print (in Figures 5B and 5D), as well as Figure S1 of Yizhak et al. 2019 (although the emphasis on RNA-Seq versus DNA-Seq data is a little different).

2) Possible barcoding issues, such as with PhiX sequence (which doesn't have a barcode and theoretically shouldn't be in any de-multiplexed samples, even though you usually see at least some PhiX reads).

For 2), I would be happy if this post got enough attention for people to be aware there is a non-trivial chance their top BLAST hit for a PhiX sequence could be incorrectly labeled as a 16S sequence.

While a little less clear, I also think there should be more post-publication review for this eDNA article, where I think the title is the opposite of what seems like the most parsimonious explanation.

I added this quite a bit after the original post.  However, to be fair, you may sometimes be able to get some idea about barcode hopping if you have a small fragment (where you actually sequence the barcodes past your genomic sequence).  This is what I did for my cat's basepaws sequence (~15X WGS): in that case, you could see some other valid Illumina barcode combinations among my reads, but it wasn't too bad.  I'm not sure if this could be more of an issue with the low-coverage sequence (less than 1x), but I have general concerns about that (for most applications, except broad ancestry or IBD calculations).
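
As a simplified sketch of that kind of check (this is not the basepaws script mentioned in this post), the Python code below tallies barcode combinations that are assumed to have already been extracted from the reads, and reports the fraction that do not match the expected combination.  The function name, the `i7+i5` string format, and the example barcodes are made up for illustration.

```python
from collections import Counter


def summarize_barcodes(observed_barcodes, expected_barcode):
    """Tally barcode combinations and return (counts, fraction not matching expected)."""
    counts = Counter(observed_barcodes)
    total = sum(counts.values())
    unexpected = total - counts[expected_barcode]
    fraction_unexpected = unexpected / total if total else 0.0
    return counts, fraction_unexpected


# Made-up barcode combinations, one per read (not real data)
example_barcodes = [
    "ATCACG+TTAGGC",
    "ATCACG+TTAGGC",
    "ATCACG+TTAGGC",
    "CGATGT+TTAGGC",  # a different (but otherwise valid-looking) combination
]

barcode_counts, fraction_unexpected = summarize_barcodes(
    example_barcodes, expected_barcode="ATCACG+TTAGGC"
)
print(barcode_counts.most_common())
print(fraction_unexpected)
```

The harder part (finding the barcode within the read sequence, and allowing for sequencing errors) is not covered here, so this is only meant to show the tallying step.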

Update Log:
6/9/2019 - original public post
8/2/2019 - add link to basepaws script
8/9/2019 - replace GitHub link with blog post link

Sunday, June 2, 2019

What's the difference between a "Pipeline" and a "Template"?

The process of understanding each step of analysis is important for presenting the final set of results, and the process of writing the code for that analysis can help you understand the methods better (and identify questions and/or room for improvement in your current code / understanding).

I have some templates for analysis, but I call them "templates" rather than "pipelines" because the code itself usually requires some modification.  While I think it is extremely useful to have packages for specific functions, you may find that having a pre-set pipeline doesn't quite produce publication-quality figures, and having a template for your own code (that is easier for you to change than somebody else) may make it easier to implement changes that come as the result of iterations of project discussions.  I have a note to this effect for most templates (such as the acknowledgement in the README for the RNA-Seq gene expression analysis "template," as well as the 2nd post-publication comment for COHCAP, which stands for "City of Hope CpG island Analysis Pipeline").

These modifications can be important for semi-automated analysis, and it is possible that other people may find there are some situations where it can be useful to have templates for intermediate results.  However, I also believe there are some other factors for discussion that are worth taking into consideration:

  • Be careful not to decrease the turn-around time for an initial result while increasing the total amount of time to get a paper to publication (or to skip steps in a way that could decrease the accuracy of the publication).  This can be particularly tricky if it takes a couple years to appreciate all the time required for follow-up requests.
  • Be aware of how other people will view your code, and what is appropriate to include in a publication. For example, even if it becomes appropriate to have "templates" for intermediate results (which I am not saying is necessarily true), taking time to understand your results is important for responsible research practices (regardless of the formality of what you are making public).  So, testing your code on multiple datasets (either within your lab, or public data from other labs) can be important for troubleshooting.
  • Unlike code published with a particular paper, the templates (by definition) are really designed with my own use in mind, and are much more difficult to support for other people (even within the same lab).  So, while looking at portions of the code may be helpful in generating ideas, support for "templates" can't really be provided (at least not in the same way as "pipelines" or packages for a particular step of analysis)
  • While it is increasingly important to provide code to help with reproducibility and understanding of analysis, that code likely needs to be different for each paper (and time and energy will be required for supporting the separate code for each paper, for each lab).  While this is not exactly a pipeline (since that code will likely represent a template to be modified for other people's experiments, not run without any changes, or novel modifications), this is also not the same as saying you have a generalized template that you think is good to use on a large number of projects.

In short, I believe it is important to expect some testing for each project, in order to be the most confident in the results that you present.  One possible alternative to a "template" may be having code for public demo datasets (for training), but that is probably also not a complete solution.

Update Log:
6/2/2019 - original public post
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.