An initial analysis of the TCGA ovarian cancer data was published in Nature this week. The Cancer Genome Atlas (TCGA) is a joint project of the NCI and NHGRI to study the genomic changes associated with many different types of cancer by collecting a large number of patient samples and analyzing them with mRNA gene expression microarrays, copy number arrays, methylation arrays, miRNA microarrays, and exome sequencing. The data can be freely downloaded from the TCGA Data Portal.
There is a huge amount of information presented in this paper. For comparison, the first TCGA paper provided an overview of the glioblastoma data, mostly focusing on somatic mutations and copy number alterations. There have been a number of subsequent papers studying the glioblastoma data, and the ones that I am most familiar with focused on subtypes defined by gene expression patterns (Verhaak et al. 2010) and methylation patterns (Noushmehr et al. 2010). The new ovarian cancer TCGA paper provides all of the types of information provided in the glioblastoma Nature paper in addition to the subtyping analysis that was covered in multiple high-impact, highly cited papers.
I think one of the most important take-home messages was the central role of p53 in ovarian cancer. For example, 96% of high-grade tumors showed p53 mutations, a result that has also been reported previously in publications such as Ahmed et al. 2010. In contrast, the TCGA glioblastoma paper showed p53 mutation rates of 38% and 58% for untreated and treated tumors, respectively. Interestingly, the ovarian cancer TCGA paper also revealed that the high rate of p53 mutation in ovarian cancers contributes to FOXM1 overexpression, using PARADIGM to identify pathway alterations in the new TCGA data (where pathways were defined using the NCI Pathway Interaction Database).
Another striking result was how consistent the copy-number alterations were within either ovarian tumors or glioblastomas but how different the copy-number alterations were between the two cancer types (as shown in Figure 1a).
Although I was impressed that the study defined separate subtypes for mRNA gene expression, miRNA gene expression, and CpG methylation status, I had mixed feelings about the results. For example, the subtypes defined by methylation only had "modest stability" (so, they have limited predictive power), and I thought the overlap between the mesenchymal mRNA subtype and the C2 miRNA subtype (and the proliferative mRNA subtype with the C1 miRNA subtype) was overemphasized. I was also a little disappointed that the integrative analysis didn't substantially enhance the subtype definitions (for example, I think Figure S6.4 in the ovarian cancer paper looks less impressive than Figure 3 in Verhaak et al.). However, I did find it interesting that both the glioblastoma and ovarian cancers had a "mesenchymal" subtype (although I don't think these subtypes necessarily have the same biological meaning), and I think it will definitely be interesting to further characterize the subtypes defined based upon mRNA gene expression.
I was somewhat surprised at how much the survival curves varied for the 4 data sets shown in Figure 2c. For example, the TCGA test set (N = 255) and the data from Tothill et al. 2008 (N = 237) had very different Cox p-values (0.02 and 0.00008, respectively). Nevertheless, it is not trivial to get a statistically significant result in 4 independent data sets, and I think the survival results are certainly strong enough to warrant further investigation in order to understand the cause of this variation.
Overall, I would consider this a must-read for any bioinformatician interested in cancer research.
Friday, July 1, 2011
Review of the TCGA Ovarian Cancer Paper
Labels:
integrative genomics,
microarray,
ovarian cancer,
TCGA
Sunday, June 26, 2011
Review of Biopunk
Biopunk is a book discussing biological research that isn't conducted in a traditional research setting (like an academic lab or a pharmaceutical company). The book covers a wide variety of topics, such as a philosophical discussion about what motivates good scientists, how legal and political decisions affect scientific progress, and recent developments in the field of "DIY bio" (where the book mostly focuses on personalized medicine and synthetic biology). Throughout the book, Wohlsen also provides several cool factoids, like the Bridges of Cherrapunji that are engineered from living tree roots.
One chapter focuses on DTC genetic testing, where Wohlsen provides both an overview of this industry and accounts of individuals who have utilized DTC testing. For example, Raymond McCauley conducted his own DIY bio research on metabolites in his own blood in order to better understand his 23andMe result indicating an increased risk for macular degeneration. Although Wohlsen acknowledges "McCauley did not hesitate to concede that the results do not show anything conclusive," I think this is a very cool example of how DIY Bio can help inquisitive scientists try to learn more about themselves outside a formal research setting.
My subsequent research on Raymond McCauley also led me to learn more about DIYgenomics.org, which provides tools to help users further analyze their 23andMe data for health risk, drug response, and athletic performance for individual SNPs. In some ways, this reminded me of the new, free Interpretome tool, but Interpretome can load my 23andMe data more quickly and with a more streamlined interface. Nevertheless, I think it's good to know that this option is out there.
There were also a few aspects of the book that disappointed me. For example, many accounts of biopunk research seem to focus more on buying used lab equipment off craigslist or eBay than on new technological developments that can help democratize research. It also seemed like a lot of the "biopunks" were pretty well-educated and not necessarily good examples of what I would consider amateur scientific research. Also, I was somewhat disappointed at how difficult it was to find additional information on some of the start-ups / organizations that were mentioned in the book (which has only been out for a few months).
For example, the chapter "Cancer Kitchen" discusses how John Schloendorn and Eri Gentry studied the role that the immune system played in cancer using Schloendorn's own cancer cells, which led to the creation of a DIY nonprofit called Livly to develop cancer immunotherapies (and Gentry later co-founded BioCurious, another DIY nonprofit). However, the Livly website described in the book is no longer hosted on the internet (the old url, provided on the Livly facebook page, now links to an unrelated website). Likewise, BioCurious only seems to have a facebook page with limited information. Even with limited funding, an organization can at least create a free Google Sites website (like my personal website) in order to more effectively convey information about itself.
I was also very interested in learning more about the Pink Army Cooperative (a DIY drug company attempting to deliver personalized treatments for breast cancer). This time, I was able to find a generally well-designed and informative website, but I couldn't find much information about concrete research accomplishments (to be fair though, Wohlsen does warn readers that "so far, Pink Army is more a concept than an actual co-op").
Although it was frustrating that I couldn't learn much more about these specific non-profits, Biopunk has successfully encouraged me to learn more about the DIY bio movement. Who knows, maybe I'll even stop by a meeting for my local DIYbio chapter!
Labels:
cancer,
DIYbio,
immunotherapy,
personalized medicine,
synthetic biology
Monday, May 16, 2011
Modeling Bimodal Gene Expression
Since it is often challenging to estimate parameters for mixture models (such as those used to model bimodal gene expression), I thought it might be useful to discuss some of my successes using non-linear least squares (NLS) regression to model bimodal gene expression.
Many scientists use maximum likelihood estimation (MLE) to model bimodal gene expression (such as Lim et al. 2002, Fan et al. 2005, Mason et al. 2011, etc.). My MLE model is based on the code provided in this discussion thread, so I used the mle function from the stats4 package (which is a wrapper for the standard optim function).
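For readers who want to experiment with the idea, here is a minimal sketch of a maximum likelihood fit for a two-component normal mixture. It is not the code behind this post: it is written in Python (standard library only) rather than R, and it swaps in the EM algorithm for the optim-based search that mle performs, although both aim to maximize the same likelihood. The function names, the shared standard deviation, and the simulation settings (means of 5 and 9, standard deviation of 1, 1000 samples) are assumptions chosen for illustration; only the 36% over-expressed fraction mirrors the simulated data discussed here.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def simulate(n_low, n_high, mu_low, mu_high, sigma, seed=0):
    """Draw expression values from two normal components (hypothetical settings)."""
    rng = random.Random(seed)
    values = [rng.gauss(mu_low, sigma) for _ in range(n_low)]
    values += [rng.gauss(mu_high, sigma) for _ in range(n_high)]
    return values


def fit_mixture_em(values, n_iter=200):
    """Fit a two-component normal mixture with a shared sigma by EM.

    Returns (p_high, mu_low, mu_high, sigma), where p_high is the estimated
    fraction of samples from the component with the higher mean.
    """
    lo, hi = min(values), max(values)
    n = len(values)
    # Crude starting values spread across the observed range.
    p_high = 0.5
    mu_low = lo + 0.25 * (hi - lo)
    mu_high = lo + 0.75 * (hi - lo)
    sigma = (hi - lo) / 4.0
    for _ in range(n_iter):
        # E-step: probability that each sample belongs to the high component.
        resp = []
        for x in values:
            a = (1.0 - p_high) * normal_pdf(x, mu_low, sigma)
            b = p_high * normal_pdf(x, mu_high, sigma)
            resp.append(b / (a + b))
        # M-step: update the parameters from the weighted samples.
        w_high = sum(resp)
        w_low = n - w_high
        mu_low = sum((1.0 - r) * x for r, x in zip(resp, values)) / w_low
        mu_high = sum(r * x for r, x in zip(resp, values)) / w_high
        var = sum(
            (1.0 - r) * (x - mu_low) ** 2 + r * (x - mu_high) ** 2
            for r, x in zip(resp, values)
        ) / n
        sigma = math.sqrt(var)
        p_high = w_high / n
    return p_high, mu_low, mu_high, sigma


if __name__ == "__main__":
    data = simulate(640, 360, 5.0, 9.0, 1.0)
    p_high, mu_low, mu_high, sigma = fit_mixture_em(data)
    print(f"estimated over-expressed fraction: {p_high:.2f}")
```

The estimated p_high plays the role of the "fraction of samples showing over-expression" reported in this post.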
On simulated data, both the MLE and NLS models estimate 37% of samples show over-expression (i.e. come from the distribution with the higher mean), which is very close to the true value of 36%:
However, I have found that the mle function often returns error messages when working with real data, and models built using the nls function tend to fit the data better than the MLE estimates. Even on the simulated data above, the NLS model appears to provide an ever so slightly better fit for the data (as might be expected because it directly models the density function).
In order to illustrate my point, I have also analyzed some genes exhibiting bimodal gene expression (as identified by Mason et al. 2011) in the GEO series GSE13070. For example, here is a gene that I could model relatively well with NLS regression whereas I simply couldn't produce an MLE model:
Of course, both of these tools have their limitations. For example, the data has to have a pretty clean bimodal distribution (here are 3 examples of distributions that couldn't be modeled using either method: ACTIN3, ERAP2, and MAOA (different probe)). For the NLS model, I also had to set the variance to be equal for the two distributions in order to produce a reasonable estimate of over-expression, but I believe this is usually a safe assumption.
Although I do not present the data in this blog post (because it contains unpublished results), I have also found NLS regression to be useful on other genes in several other datasets, so I know NLS regression works well with more than just the one gene that I show above.
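To make the "directly models the density function" idea concrete, here is a rough sketch of a least-squares fit of a mixture density to a histogram of the data. Again, this is an illustration under stated assumptions rather than the code used for the figures: it is in Python (standard library only), it fits to histogram bar heights (a kernel density estimate would be another reasonable target), and a simple coordinate search stands in for the Gauss-Newton iterations that R's nls uses by default. All function names, the bin count, and the simulation settings are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, p_high, mu_low, mu_high, sigma):
    """Two-component normal mixture density with a shared sigma."""
    return ((1.0 - p_high) * normal_pdf(x, mu_low, sigma)
            + p_high * normal_pdf(x, mu_high, sigma))


def simulate(n_low, n_high, mu_low, mu_high, sigma, seed=0):
    """Draw expression values from two normal components (hypothetical settings)."""
    rng = random.Random(seed)
    values = [rng.gauss(mu_low, sigma) for _ in range(n_low)]
    values += [rng.gauss(mu_high, sigma) for _ in range(n_high)]
    return values


def histogram_density(values, n_bins=30):
    """Return bin centers and density-scaled bar heights (area sums to 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    heights = [c / (len(values) * width) for c in counts]
    return centers, heights


def sse(params, centers, heights):
    """Sum of squared differences between the histogram and the mixture density."""
    p_high, mu_low, mu_high, sigma = params
    if not (0.0 < p_high < 1.0) or sigma <= 0.0:
        return float("inf")  # reject parameter values outside the valid range
    return sum((h - mixture_pdf(c, p_high, mu_low, mu_high, sigma)) ** 2
               for c, h in zip(centers, heights))


def fit_mixture_nls(values, n_bins=30, n_sweeps=500):
    """Least-squares fit by coordinate search; returns (p_high, mu_low, mu_high, sigma)."""
    centers, heights = histogram_density(values, n_bins)
    lo, hi = min(values), max(values)
    span = hi - lo
    params = [0.5, lo + 0.25 * span, lo + 0.75 * span, span / 6.0]
    steps = [0.1, span / 10.0, span / 10.0, span / 20.0]
    best = sse(params, centers, heights)
    for _ in range(n_sweeps):
        for i in range(len(params)):
            improved = False
            for direction in (1.0, -1.0):
                trial = list(params)
                trial[i] += direction * steps[i]
                value = sse(trial, centers, heights)
                if value < best:
                    params, best, improved = trial, value, True
                    break
            if not improved:
                steps[i] *= 0.5  # no gain in either direction: refine the step
    return tuple(params)


if __name__ == "__main__":
    data = simulate(1280, 720, 5.0, 9.0, 1.0)
    p_high, mu_low, mu_high, sigma = fit_mixture_nls(data)
    print(f"estimated over-expressed fraction: {p_high:.2f}")
```

Because the objective is a plain sum of squares over a few dozen histogram bars, a fit like this never evaluates the log of a near-zero likelihood. That may be part of why a least-squares approach is less prone to numerical errors than a likelihood-based one, although this sketch does not test that.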
Also, for those that are interested, here is the source code that I used to produce all of the above figures.
Labels:
bimodal gene expression,
GEO,
microarray,
modeling,
over-expression,
R