Friday, May 25, 2012
Shared Scripts for Genomic Analysis
I've recently added a section to my personal website that contains some scripts that I have used for bioinformatics analysis:
https://sites.google.com/site/cwarden45/scripts
As of right now, the page contains a handful of scripts for microarray, next-generation sequencing, and qPCR analysis. I plan on updating this page periodically.
Generally speaking, the scripts aren't organized into a carefully documented package (like what you can find from Bioconductor, etc.). However, I have found them to be very useful templates for routine analysis, so I thought it might be useful to share them with others.
Labels:
microarray,
next-generation sequencing,
Perl,
qPCR,
R
Monday, January 30, 2012
Article Review: Accurate identification of A-to-I RNA editing in human by transcriptome sequencing
In this article, Bahn et al. develop a novel method to identify A-to-I RNA editing sites in next-generation sequencing data.
My favorite aspect of this paper was how the authors empirically estimated the false discovery rate of their algorithm using an ADAR siRNA knock-down in a cancer cell line that only showed normal expression levels for one member of the ADAR family (shown in Figure 2 of the paper). Experimental validation with Sanger sequencing also shows a low false positive rate for the A-to-G events (although not necessarily for non-A-to-G events).
Supplemental Table 3 is also worth checking out: it provides a good review of genome-wide RNA editing studies, including the contentious study in Science by Li et al. For example, only 34% of the RNA editing sites shared by Li et al. and this paper were A-to-G events, whereas 86-100% of the overlapping sites for all of the other studies were A-to-G events. Likewise, the differences in the histograms for RNA editing sites (Figure 2A in this paper, and Figure 1A in Li et al.) emphasize how different the analysis in Li et al. is from other similar studies in the literature.
The supplemental table also shows how few RNA editing sites overlap between studies. For example, the authors emphasize that their study recovers 854 A-to-G differences in the DARNED database, but I think it is worth keeping in mind that there were 42,045 sites in the DARNED database and 9,636 predicted RNA editing sites (using the threshold for comparison with other studies). This seems to be a common problem that isn't unique to this study (and the authors emphasize that the overlap between genes with RNA editing sites is greater than the overlap of individual RNA editing sites), but I think it is still an interesting observation worth keeping in mind for future analysis (which will hopefully have larger sets of paired DNA-Seq and RNA-Seq samples).
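To make the asymmetry of that overlap concrete, here is a quick sketch using the counts quoted above (the variable names are mine, not the paper's):

```python
# Overlap between predicted A-to-G editing sites and the DARNED database,
# using the counts quoted above.
darned_total = 42045     # sites in the DARNED database
predicted_total = 9636   # predicted editing sites at the comparison threshold
overlap = 854            # A-to-G differences recovered in DARNED

frac_of_darned = overlap / darned_total        # fraction of DARNED recovered
frac_of_predicted = overlap / predicted_total  # fraction of predictions in DARNED

print(f"{frac_of_darned:.1%} of DARNED sites recovered")      # ~2%
print(f"{frac_of_predicted:.1%} of predicted sites in DARNED") # ~9%
```

So even though 854 sounds like a substantial number of recovered sites, it represents only a small slice of either list.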
In general, I think this method does a good job of identifying and filtering likely causes of spurious RNA editing events (like those mentioned in Schrider et al. 2011). For example, the authors use a "double-filtering" strategy to focus on reads with unique alignments (where a conservative threshold is used to define alignments to potential RNA editing sites but a more liberal criterion is used to search for homologous regions that could be causing inaccurate alignments). I also liked that most of the in-depth analysis focused on sites with an editing ratio greater than 0.2.
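An editing-ratio cutoff like this is straightforward to apply once per-site read counts are available. A minimal sketch, where the example sites and field names are made up for illustration (not the paper's data structures):

```python
# Keep sites where the fraction of reads supporting the edited base
# exceeds 0.2, in the spirit of the paper's in-depth analysis.
# The sites below are hypothetical examples, not real data.
sites = [
    {"pos": "chr1:1000", "edited_reads": 12, "total_reads": 40},  # ratio 0.30
    {"pos": "chr2:2000", "edited_reads": 3,  "total_reads": 50},  # ratio 0.06
]

def editing_ratio(site):
    """Fraction of reads supporting the edited base at this site."""
    return site["edited_reads"] / site["total_reads"]

high_confidence = [s for s in sites if editing_ratio(s) > 0.2]
print([s["pos"] for s in high_confidence])  # only the chr1 site passes
```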
This study focused on analysis of the grade IV glioma cell line U87MG (RNA-Seq: GSE28040, DNA-Seq: GSE19986) and a primary breast cancer sample (EGAS00000000054). Although it probably allowed for more cost-effective analysis, I wonder if the results would have been even cleaner if the RNA-Seq and DNA-Seq data had both been newly created for this study using similar technologies (for example, the RNA-Seq data are paired-end Illumina reads, whereas the DNA-Seq data were SOLiD reads from another study). However, I think the results were clean enough that this probably didn't matter too much (based upon the ADAR knock-down data).
The novel motif discovery (Figure 5) was interesting, but I had a hard time imagining the relevance of this motif, which isn't found at a consistent distance from the A-to-I site (unlike those shown in Figure 4). That said, I would be interested in seeing any follow-up analysis that characterizes the mechanism by which this motif is involved with A-to-I editing.
I think this study provides only very limited analysis of A-to-I editing in cancer. To be fair, the sample size (one sample at a time) is probably not sufficient to make many general claims about A-to-I editing in cancer. However, I still think this aspect of the study was over-emphasized. For example, Supplemental Table 13 shows how sensitive the hypergeometric test (comparing RNA editing sites in the two samples) will be when dealing with such a large background set: all of the RNA editing events except G-to-C were statistically significant with a p-value < 0.05, even though the A-to-G overlap was the only category with more than 5 overlapping sites. In other words, I don't think statistical significance was a strong indicator of biological importance for this analysis. Likewise, it was nice that the enrichment analysis of the NCI Cancer Gene Index genes provided some candidate genes, but I don't think this study is useful in identifying a gene where A-to-I editing is highly likely to play an important role in oncogenesis.
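The sensitivity of the hypergeometric test to a large background is easy to demonstrate directly. The sketch below uses made-up numbers (not the paper's), with a log-gamma implementation to avoid enormous binomial coefficients:

```python
from math import lgamma, exp

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for a hypergeometric overlap: N background sites,
    K sites in sample 1, n sites in sample 2, X = overlap size."""
    return sum(
        exp(log_comb(K, i) + log_comb(N - K, n - i) - log_comb(N, n))
        for i in range(k, min(K, n) + 1)
    )

# With a million-site background, two lists of 100 sites each are
# expected to overlap by only 100 * 100 / 1e6 = 0.01 sites, so even
# a 2-site overlap is highly "significant" despite being tiny.
p = hypergeom_sf(2, N=1_000_000, K=100, n=100)
print(p)  # well below 0.05
```

This is exactly the pattern in Supplemental Table 13: with a large enough background, overlaps of just a handful of sites clear the p < 0.05 bar.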
Overall, I would recommend this article to anyone interested in RNA editing and next-generation sequencing analysis.
Labels:
cancer,
DNA-Seq,
genomics,
RNA Editing,
RNA-Seq
Wednesday, November 16, 2011
Notes from ICHG / ASHG 2011
Although it may be old news for anyone following the #ICHG2011 twitter feed, I figure there are still some people out there who might be interested in seeing the summary slides that I'll be presenting at a Bioinformatics Core group meeting to discuss what I learned at the conference (those slides are available here).
Generally speaking, I was very pleased with the number and variety of great speakers. Plus, there were fun activities like a circus performance to open the conference and complimentary poutine for lunch a couple of days. I kind of wish the conference had been a day or two shorter and that there had been more activities / discussions to encourage networking among smaller groups of attendees, but I think these are only minor concerns.
Overall, I would consider the conference to be a great success, and I am seriously considering attending next year in San Francisco!