Although it may be old news for anyone following the #ICHG2011 twitter feed, I figure there are still some people out there who might be interested in seeing my summary slides that I'll be presenting at a Bioinformatics Core group meeting to discuss what I learned at the conference (those slides are available here).
Generally speaking, I was very pleased with the number and variety of great speakers. Plus, there were fun activities like a circus performance to open the conference and complimentary poutine for lunch a couple of days. I kind of wish the conference was a day or two shorter and there were more activities / discussions to encourage networking among smaller groups of the attendees, but I think these are only minor concerns.
Overall, I would consider the conference to be a great success, and I am seriously considering attending next year in San Francisco!
Wednesday, November 16, 2011
Saturday, August 6, 2011
What is it like to be a "Bioinformatics Specialist"?
I recently received a request from a complete stranger to learn more about the field of bioinformatics. Since I think others may also benefit from my answers, I've converted this e-mail conversation into a blog post. I've made some modifications to the questions and my responses, but all the main ideas are the same.
FYI, I've provided a link to my CV, so you can get a better idea about my background.
Q) Can you tell me a little bit about your work as a bioinformatics specialist and what a typical day looks like?
A) I think it is safe to assume that someone with this job description will work for at least one lab, and your goal will usually be to help biologists without a strong computational background analyze their data. In particular, I assist with microarray and next-generation sequencing data analysis. Sometimes you may work in the lab of an individual scientist, but I work in a shared resource facility. So, I work for several scientists on campus. The Bioinformatics Core is also in charge of software support, so I also assist in installing and maintaining software and hardware. Additionally, I assist in writing papers and grants, but I don’t know if it is safe to assume all Bioinformatics Specialists will be authors on papers.
Q) When you completed your MS degree, did you find the job market to be favorable?
A) I actually have an MA degree in Molecular Biology (I was in a PhD program and left the program with just a Master's degree), so this may be a little different than someone going for an independent MS degree in Bioinformatics. When I had to look for jobs, it did take a lot of effort, and I basically accepted the first offer I could get after a few months of hunting. However, there are lots of people who are unemployed or go a year or more without a job, so it could be a lot worse.
Q) How deeply would you suggest that a person searching for a job similar to yours get into programming? What programming languages are most useful for your job?
A) I would say that you pretty much can’t get a job with “Bioinformatics” in the title without significant programming experience and a firm grasp of statistics.
I am proficient in R and Perl. SQL is also very important. Python is especially useful for next-generation sequencing analysis. It is also valuable to learn Java, Apache, and PHP.
Q) Do you know of any useful resources for job seekers? For example, what do you know about bioinformatics internships?
A) I think that the importance of internships varies with your career goal. I think they are a little less important if you plan to eventually get a PhD, but I think they can be very important for individuals with a terminal BS or MS degree.
I would suggest e-mailing PIs / Scientists whose work you find interesting to see if there are any jobs – this is how I have gotten all of my jobs.
Before I got paid to do research, I had to do research for academic credit for 1-2 years. If you don’t have considerable research experience, I would consider offering to do volunteer work (or an unpaid internship).
You should also apply for jobs that you see posted for companies. However, I haven’t actually had much success with these. I think companies are legally obligated to post jobs, even if they have already found an internal person for the job. A lot of companies like to promote from within, so this may be something worth considering. Nevertheless, I think anyone who is willing to pay money to post a job on an external website is probably serious about at least considering candidates from outside of the company. For example, here are some useful resources when looking for bioinformatics jobs (in addition to places like Monster, etc.):
When you are in school, I believe there should be some sort of career services department that might be able to help you. For example, I have helped send out job postings for the Bioinformatics Core to prestigious universities looking for recent graduates.
Also, be certain to take advantage of research experience (even if it is not required) for networking purposes. Plus, research experience also has other direct benefits, such as getting practical experience that will almost certainly be useful for jobs later down the road.
Please feel free to continue the discussion with questions and comments below!
Labels:
bioinformatics,
jobs,
research
Friday, July 1, 2011
Review of the TCGA Ovarian Cancer Paper
Initial analysis of the TCGA data for ovarian cancer was published in Nature this week. The Cancer Genome Atlas (TCGA) is a joint project by the NCI and NHGRI to study genomic changes that are associated with many different types of cancer by collecting a large number of patient samples for analysis using mRNA gene expression microarrays, copy number arrays, methylation arrays, miRNA microarrays, and exome sequencing. This data can be freely downloaded using the TCGA Data Portal.
There is a huge amount of information presented in this paper. For example, the first TCGA paper provided an overview of the glioblastoma data, mostly focusing on somatic mutations and copy number alterations. There have been a number of subsequent papers studying the glioblastoma data, and the subsequent TCGA papers that I am most familiar with focused on subtypes defined by gene expression patterns (Verhaak et al. 2010) and methylation patterns (Noushmehr et al. 2010). The new ovarian cancer TCGA paper provides all of the information provided in the glioblastoma Nature paper in addition to the subtyping analysis that was covered in multiple high-impact, highly cited papers.
I think one of the most important take-home messages was the extremely important role of p53 in ovarian cancer. For example, 96% of high-grade tumors showed p53 mutations, which has also been shown previously in publications such as Ahmed et al. 2010. In contrast, the TCGA glioblastoma paper showed a p53 mutation rate of 38% and 58% for untreated and treated tumors, respectively. Interestingly, the ovarian cancer TCGA paper also revealed that the high rate of p53 mutation in ovarian cancers contributes to FOXM1 overexpression by using PARADIGM to identify pathway alterations in the new TCGA data (where pathways were defined using the NCI Pathway Interaction Database).
Another striking result was how consistent the copy-number alterations were within either ovarian tumors or glioblastomas but how different the copy-number alterations were between the two cancer types (as shown in Figure 1a).
Although I was impressed that the study defined separate subtypes for mRNA gene expression, miRNA gene expression, and CpG methylation status, I had mixed feelings about the results. For example, the subtypes defined by methylation only had "modest stability" (so, they have limited predictive power), and I thought the overlap between the mesenchymal mRNA subtype and the C2 miRNA subtype (and the proliferative mRNA subtype with the C1 miRNA subtype) was overemphasized. I was also a little disappointed that the integrative analysis didn't substantially enhance the subtype definitions (for example, I think Figure S6.4 in the ovarian cancer paper looks less impressive than Figure 3 in Verhaak et al.). However, I did find it interesting that both the glioblastoma and ovarian cancers had a "mesenchymal" subtype (although I don't think these subtypes necessarily have the same biological meaning), and I think it will definitely be interesting to further characterize the subtypes defined based upon mRNA gene expression.
I was somewhat surprised at how much the survival curves varied for the 4 data sets shown in Figure 2c. For example, the TCGA test set (N = 255) and the data from Tothill et al. 2008 (N = 237) had very different Cox p-values (0.02 and 0.00008, respectively). Nevertheless, it is not trivial to get a statistically significant result in 4 independent data sets, and I think the survival results are certainly strong enough to warrant further investigation in order to understand the cause of this variation.
Overall, I would consider this a must-read for any bioinformatician interested in cancer research.
Labels:
integrative genomics,
microarray,
ovarian cancer,
TCGA
Sunday, June 26, 2011
Review of Biopunk
Biopunk is a book discussing biological research that isn't conducted in a traditional research setting (like an academic lab or a pharmaceutical company). The book covers a wide variety of topics such as a philosophical discussion about what motivates good scientists, how legal and political decisions affect scientific progress, and recent developments in the field of "DIY bio" (where the book mostly focuses on personalized medicine and synthetic biology). Throughout the book, Wohlsen also provides several cool factoids, like the Bridges of Cherrapunji that are engineered from living tree roots.
One chapter focuses on DTC genetic testing, where Wohlsen provides both an overview of this industry and accounts of individuals who have utilized DTC testing. For example, Raymond McCauley conducted his own DIY bio research on metabolites in his own blood in order to try to better understand his 23andMe result indicating an increased risk for macular degeneration. Although Wohlsen acknowledges "McCauley did not hesitate to concede that the results do not show anything conclusive," I think this is a very cool example of how DIY Bio can help inquisitive scientists try to learn more about themselves outside a formal research setting.
My subsequent research on Raymond McCauley also led me to learn more about DIYgenomics.org, which provides tools to help users further analyze their 23andMe data for health risk, drug response, and athletic performance for individual SNPs. In some ways, this reminded me of the new, free Interpretome tool, but Interpretome can load my 23andMe data more quickly and with a more streamlined interface. Nevertheless, I think it's good to know that this option is out there.
There were also a few aspects of the book that disappointed me. For example, many accounts of biopunk research seem to focus more on buying used lab equipment off craigslist or eBay than new technological developments that can help democratize research. It also seemed like a lot of the "biopunks" were pretty well-educated and not necessarily good examples of what I would consider amateur scientific research. Also, I was somewhat disappointed at how difficult it was to find additional information on some of the start-ups / organizations that were mentioned in the book (which has only been out for a few months).
For example, the chapter "Cancer Kitchen" discusses how John Schloendorn and Eri Gentry studied the role that the immune system played in cancer using Schloendorn's own cancer cells, which led to the creation of a DIY nonprofit called Livly to develop cancer immunotherapies (and Gentry later co-founded BioCurious, another DIY nonprofit). However, the Livly website described in the book is no longer hosted on the internet (the old URL, provided on the Livly Facebook page, now links to an unrelated website). Likewise, BioCurious only seems to have a Facebook page with limited information. Even with limited funding, the company can at least create a free Google Sites website (like my personal website) in order to more effectively convey information about the company.
I was also very interested in learning more about the Pink Army Cooperative (a DIY drug company attempting to deliver personalized treatments for breast cancer). This time, I was able to find a generally well-designed and informative website, but I couldn't find much information about concrete research accomplishments (to be fair though, Wohlsen does warn readers that "so far, Pink Army is more a concept than an actual co-op").
Although it was frustrating that I couldn't learn much more about these specific non-profits, Biopunk has successfully encouraged me to learn more about the DIY bio movement. Who knows, maybe I'll even stop by a meeting for my local DIYbio chapter!
Labels:
cancer,
DIYbio,
immunotherapy,
personalized medicine,
synthetic biology
Monday, May 16, 2011
Modeling Bimodal Gene Expression
Since it is often challenging to estimate parameters for mixture models (such as those used to model bimodal gene expression), I thought it might be useful to discuss some of my successes using non-linear least squares (NLS) regression to model bimodal gene expression.
Many scientists use maximum likelihood estimation (MLE) to model bimodal gene expression (such as Lim et al. 2002, Fan et al. 2005, Mason et al. 2011, etc.). My MLE model is based on the code provided in this discussion thread, so I used the mle function from the stats4 package (which is a wrapper for the standard optim function).
On simulated data, both the MLE and NLS models estimate 37% of samples show over-expression (i.e. come from the distribution with the higher mean), which is very close to the true value of 36%:
However, I have found that the mle function often returns error messages when working with real data, and models built using the nls function tend to fit the data better than the MLE estimates. Even on the simulated data above, the NLS model appears to provide an ever so slightly better fit for the data (as might be expected because it directly models the density function).
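To make the difference between the two approaches concrete, here is one way the two fitting problems could be written down. The notation is my own assumption rather than anything taken from the cited papers: p is the fraction of over-expressed samples, \phi is the normal density, x_i are the observed expression values, and (g_j, d_j) are grid points paired with an empirical density estimate at those points (for example, histogram or kernel density heights), which is what I mean by NLS directly modeling the density function.

```latex
\begin{align}
f(x \mid \theta) &= (1 - p)\,\phi(x \mid \mu_1, \sigma_1) + p\,\phi(x \mid \mu_2, \sigma_2),
  \qquad \mu_1 < \mu_2,\quad \theta = (p, \mu_1, \mu_2, \sigma_1, \sigma_2) \\
\hat{\theta}_{\mathrm{MLE}} &= \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i \mid \theta) \\
\hat{\theta}_{\mathrm{NLS}} &= \arg\min_{\theta} \sum_{j=1}^{m} \bigl( d_j - f(g_j \mid \theta) \bigr)^2
\end{align}
```

In both cases the estimated fraction of over-expressed samples is simply the fitted p; the two methods differ only in which objective is optimized.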
In order to illustrate my point, I have also analyzed some genes exhibiting bimodal gene expression (as identified by Mason et al. 2011) in the GEO dataseries GSE13070. For example, here is a gene that I could model relatively well with NLS regression whereas I simply couldn't produce an MLE model:
Of course, both of these tools have their limitations. For example, the data has to have a pretty clean bimodal distribution (here are 3 examples of distributions that couldn't be modeled using either method: ACTIN3, ERAP2, and MAOA (different probe)). For the NLS model, I also had to set the variance to be equal for the two samples in order to produce a reasonable estimate of over-expression, but I believe this is usually a safe assumption.
Although I do not present the data in this blog post (because it contains unpublished results), I have also found NLS regression to be useful on other genes in several other datasets. So, I know NLS regression works well with more than just the one gene that I show above.
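For readers who want to experiment with the least-squares idea outside of R, here is a minimal Python sketch. It is not the R code used for the figures in this post: it fits an equal-variance two-component normal mixture to histogram heights, and it swaps in a simple derivative-free pattern search in place of the nls function. The simulated means, sample size, bin count, and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, p, mu_low, mu_high, sigma):
    """Two-component mixture with a shared sigma; p weights the higher-mean component."""
    return (1.0 - p) * normal_pdf(x, mu_low, sigma) + p * normal_pdf(x, mu_high, sigma)


def histogram_density(values, n_bins=30):
    """Return bin centers and density-scaled bin heights (heights integrate to 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    heights = [c / (len(values) * width) for c in counts]
    return centers, heights


def sum_squared_error(params, centers, heights):
    """Least-squares objective: squared gap between histogram heights and mixture density."""
    p, mu_low, mu_high, sigma = params
    if not (0.0 < p < 1.0) or sigma <= 0.0:
        return float("inf")  # reject parameter values outside the valid range
    return sum((h - mixture_pdf(c, p, mu_low, mu_high, sigma)) ** 2
               for c, h in zip(centers, heights))


def pattern_search(objective, start, step=0.5, tol=1e-4, max_passes=10000):
    """Coordinate search: nudge one parameter at a time, halve the step when stuck."""
    best, best_val = list(start), objective(start)
    for _ in range(max_passes):
        if step <= tol:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
    return best, best_val


# Simulated data (hypothetical values): 36% of samples come from the higher-mean component.
random.seed(0)
TRUE_P = 0.36
samples = [random.gauss(8.0, 1.0) if random.random() < TRUE_P else random.gauss(4.0, 1.0)
           for _ in range(1000)]

centers, heights = histogram_density(samples)

# Crude starting values: quartiles for the two means, half the overall spread for sigma.
ordered = sorted(samples)
mean = sum(samples) / len(samples)
spread = math.sqrt(sum((v - mean) ** 2 for v in samples) / len(samples))
start = [0.5, ordered[len(ordered) // 4], ordered[3 * len(ordered) // 4], spread / 2.0]

fitted, sse = pattern_search(lambda q: sum_squared_error(q, centers, heights), start)
p_hat, mu_low_hat, mu_high_hat, sigma_hat = fitted
print(f"estimated over-expressed fraction: {p_hat:.2f}")
print(f"component means: {mu_low_hat:.2f}, {mu_high_hat:.2f}; shared sd: {sigma_hat:.2f}")
```

Sharing a single sigma between the two components mirrors the equal-variance constraint described above. It removes one free parameter, which tends to make the fit less sensitive to noisy bins, at the cost of a poorer fit when the two groups really do have different spreads.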
Also, for those that are interested, here is the source code that I used to produce all of the above figures.
Labels:
bimodal gene expression,
GEO,
microarray,
modeling,
over-expression,
R
Sunday, April 17, 2011
How and Why the FDA Should Allow DTC Genetic Testing
At the beginning of this month, the FDA extended the period to submit public comments about Direct-To-Consumer (DTC) genetic testing to the Molecular and Clinical Genetics Panel of the Medical Devices Advisory Committee (referencing docket ID FDA-2011-N-0066 at http://www.regulations.gov). For more information on this topic, please check out this post from The Spittoon (the official blog for 23andMe).
I just submitted a comment to the FDA (which is essentially a shortened version of this blog post). You can currently view my comment here using Google Docs, but I do not currently see the posting on http://www.regulations.gov (I will work on verifying that the comment was successfully uploaded). In fact, there were only a few comments posted after the original 3/1/2011 deadline, and I do not see any new comments posted after the extension of the comment period that occurred on 4/1/2011. If you have not already done so, please submit a comment before the new deadline on May 1st!
In many ways, I think DTC genetic testing companies are similar to medical websites like WebMD (which is an idea I first remember seeing in this blog post comment). I personally think it would be a great disservice to society if websites like WebMD, Mayo Clinic, and MedlinePlus were banned because they provide medical advice to the public without consultation with a physician. Likewise, I also think it is very important that people be able to learn about their own genetic information without having to consult a physician (although I would certainly encourage people to seek advice from medical experts if they feel the need to do so). Although not all doctors agree that patients should have access to DTC genetic tests, there are also some doctors who dislike WebMD. I do not believe this is a valid reason to ban either type of medical information.
I want to emphasize that I am not opposed to all FDA regulation. For example, companies that intentionally mislead people should be penalized (as one example, check out this post on My Gene Profile by Daniel MacArthur). However, I do not think the FDA should ban companies that are transparent in their actions and base their analysis on published, peer-reviewed scientific research.
I think there is sufficient evidence to show that most people will have reasonable reactions to their results (for example, check out this research article in the New England Journal of Medicine). However, I think it might be helpful for the FDA to help classify which test results clearly require medical action and which ones are "research" grade tools that connect people with findings in the medical literature. In an earlier post, I discussed how a "3-tier system" might be able to help accomplish this. Essentially, we currently do have "clinical" tests and "research" tests, but I think formalizing some sort of system to distinguish between such tests could be useful (especially if it helps provide a way to maintain DTC genetic testing without requiring physicians to act as gatekeepers for this information, or if it prevents these tests from being outright banned).
In general, I think it is important for individuals to have access to a variety of opinions in order to think critically when making medical decisions. It is not good to blindly trust any source of information - whether that information comes from a doctor, a DTC genetic testing company, a government regulatory agency, or a scientist (like myself). I strongly believe that people should have access to second opinions about their genetic tests (through tools like Promethease).
I think the FDA could also potentially help improve genetic testing (for both DTC and non-DTC tests) by helping provide people access to secondary sources of information. For example, I think it would be fine for the FDA to force companies to allow users to export their data in a standard format in order to allow people to easily get second opinions about their genetic testing results. Strictly speaking, I don't think this is necessary - for example, there are 3rd party web apps that help users learn more about their 23andMe results (such as this Firefox app), and Promethease already helps users search for annotations from SNPedia (although there is a $2 fee if you want your results quickly). However, I don't think it would hurt to have a standard format that applies to all genetic testing companies.
In fact, a standard format for sequence data from genetic tests could provide a useful framework for a collaboration between the FDA and NIH to fund development of a free tool for people to analyze their genetic information. For example, MedlinePlus is an excellent resource provided by the NIH, and I think it could be really cool if the FDA would work with the NIH to help people analyze their genetic information similar to the way MedlinePlus provides traditional medical advice. If such a collaboration were to take place, then I think it would also be fair to require genetic testing companies to provide links to this 3rd party tool (as well as other tools, if they choose to do so). This could help people think more critically about their results, and I think it could also be a good way to fund research on how to best convey genomics research to the general public and incorporate publicly available data into a single risk assessment provided by this free, 3rd party tool.
I just submitted a comment to the FDA (which is essentially a shortened version of this blog post). You can currently view my comment here using Google Docs, but I do not currently see the posting on http://www.regulations.gov (I will work on verifying that the comment was successfully uploaded). In fact, there were only a few comments posted after the original 3/1/2011 deadline, and I do not see any new comments posted after the extension of the comment period that occurred on 4/1/2011. If you have not already done so, please submit a comment before the new deadline on May 1st!
In many ways, I think DTC genetic testing companies are similar to medical websites like WebMD (which is an idea I first remember seeing in this blog post comment). I personally think it would be a great disservice to society if websites like WebMD, Mayo Clinic, and MedlinePlus were banned because they provide medical advice to the public without consultation with a physician. Likewise, I also think it is very important that people be able to learn about their own genetic information without having to consult a physician (although I would certainly encourage people to seek advice from medical experts if they feel the need to do so). Although all doctors do not agree that patients should have access to DTC genetic tests, there are also some doctors that dislike WebMD. I do not believe this is a valid reason to ban either type of medical information.
I want to emphasize that I do not oppose any sort of FDA regulation. For example, companies that intentionally mislead people should be penalized (as one example, check out this post on My Gene Profile by Daniel MacArthur). However, I do not think the FDA should ban companies who are transparent in their actions and are basing their analysis of published, peer-reviewed scientific research.
I think there is sufficient evidence to show that most people will have reasonable reactions to their results (for example, check out this research article in New England Journal of Medicine). However, I think it might be helpful for the FDA to help classify which test results clearly require medical action and which ones are "research" grade tools that connect people with findings in the medical literature. In an earlier post, I discussed how a "3-tier system" might be able to help accomplish this. Essentially, we currently do have "clinical" tests and "research" tests, but I think formalizing some sort of system to distinguish between such tests could be useful (especially if it helps provides a way to maintain DTC genetic testing without the need to require physicians act as a gatekeepers for this information, or if it prevents these tests from being outright banned).
In general, I think it is important for individuals to have access to a variety of opinions in order to think critically when making medical decisions. It is not good to blindly trust any source of information - whether that information comes from a doctor, a DTC genetic testing agency, a government regulatory agency, or a scientist (like myself). I strongly believe that people should have access to second opinions about their genetic tests (through tools like Promethease).
I think the FDA could also potentially help improve genetic testing (for both DTC and non-DTC tests) by helping provide people access to secondary sources of information. For example, I think it would be fine for the FDA to force companies to allow users to export their data in a standard format in order to allow people to easily get second opinions about their genetic testing results. Strictly speaking, I don't think this is necessary - for example, there are 3rd party web apps that help users learn more about their 23andMe results (such as this Firefox app), and Promethease already helps users search for annotations from SNPedia (although there is a $2 fee if you want your results quickly). However, I don't think it would hurt to have a standard format that applies to all genetic testing companies.
In fact, a standard format for sequence data from genetic tests could provide a useful framework for a collaboration between the FDA and NIH to fund development of a free tool for people to analyze their genetic information. For example, MedlinePlus is an excellent resource provided by the NIH, and I think it could be really cool if the FDA would work with the NIH to help people analyze their genetic information similar to the way MedlinePlus provides traditional medical advice. If such a collaboration were to take place, then I think it would also be fair to require genetic testing companies to provide links to this 3rd party tool (as well as other tools, if they choose to do so). This could help people think more critically about their results, and I think it could also be a good way to fund research on how to best convey genomics research to the general public and how to incorporate publicly available data into a single risk assessment provided by this free, 3rd party tool.
Update (6/20/2020): I wrote this post before I started adding change log entries. However, I added a note because my opinions have shifted somewhat since I originally wrote this blog post. For example, you can see several FDA MedWatch reports that I have submitted within the collection of posts linked here.
Essentially, I think I have a better appreciation for the harm that can be caused if a result is rushed to the public, especially if information is distributed to a large number of people (such as tens or hundreds of thousands of customers). I still believe that situations where something is partially effective but still works better than a placebo (or has non-trivial predictive power) should be thought of differently than highly effective solutions or completely ineffective solutions. Indeed, you can see some non-genomic reports in my PatientsLikeMe post, at least one of which I also submitted as an FDA MedWatch report for side effects.
I think setting the right expectations can help, but I thought the problems that I didn’t notice before were sufficiently important that I needed to add something to this post.
I also think it is important that genomic risk calculations can be validated in independent cohorts, which makes transparency and on-going quality assessment important.
For example, I still believe that publicly available information is important, which means that you can have access to it with or without a physician. Even if there are consent limitations that require controlled access (or prohibit carrying out the experiment in the first place), I think maximizing specialist access to raw data (with accurate documentation of data sharing) is still important. If you look at the cystic fibrosis post, you can see that free and open feedback from a Biostars discussion helped with re-analysis of my raw data.
Labels:
DTC testing,
FDA,
personalized medicine,
regulation
Monday, March 14, 2011
Article Review: Epigenetic suppression of the TGF-beta pathway revealed by transcriptome profiling in ovarian cancer
In this paper, Matsumura et al. develop a method to identify methylated genes in ovarian cancer patients using gene expression data from roughly 40 ovarian cancer cell lines and 20 cultured primary tumor samples. The authors posit that this method provides a unique opportunity to study pathways affected by methylation because it directly examines gene expression.
My overall thoughts on this paper:
Pros:
- The study produced a relatively large amount of data, which is now available in GEO
- The study utilized a large amount of publicly available data, providing a very useful list of citations for anyone interested in doing bioinformatics analysis on ovarian cancer (especially those interested in methylation).
- The authors utilize useful open-source tools for pathway analysis (namely GATHER and the specialized binary regression method)
Cons:
- I think it is more likely that methylation directly suppresses EMT-related genes (such as those involved with cell adhesion) rather than repressing the TGF-beta pathway (which then regulates EMT genes).
- Unlike in other cancers, patients with methylated genes do not show a worse prognosis. In fact, I wouldn't be surprised if patients with methylated genes had a slightly better prognosis because methylation suppresses genes associated with the epithelial-mesenchymal transition (which is associated with a progression to a more aggressive cancer). This hypothesis is also supported by the stromal response data shown in Figure S9.
I think one of the most useful tools discussed in this paper is GATHER, which is very fast and has a simple user interface. GATHER provides enrichment analysis for information from various databases, such as Gene Ontology, KEGG Pathways, TRANSFAC, and MEDLINE. More detailed information about the data mined in GATHER can be found in the associated paper by Chang and Nevins.
In fact, GATHER was immediately useful in helping interpret the results of this study. For example, I used GATHER to check the enrichment for the list of 378 methylated genes described in this paper. This revealed that the TGF-beta signaling pathway was not the most significantly enriched pathway in the gene list, and the TGF-beta signaling pathway actually had the smallest number of representative genes in the methylated gene list (out of the significantly enriched pathways). GATHER was also useful for studying the enrichment of pathways in the more conservative "methyl cluster" gene list (which showed a weaker association with the TGF-beta pathway and a stronger association with other pathways, such as the focal adhesion genes). These are some of the reasons that I believe the methylation directly suppresses EMT-related genes in these ovarian cancer patients (rather than acting through the TGF-beta pathway).
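To make the idea of pathway enrichment more concrete, here is a minimal Python sketch of a generic over-representation test (a one-sided hypergeometric test). This is not GATHER's own scoring method, which may differ. The function name, the 20,000-gene background, the 80-gene pathway, and the overlap of 6 genes are all hypothetical values chosen for illustration; only the list size of 378 comes from the paper.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, list_size, overlap):
    """Probability of seeing at least `overlap` pathway genes in a gene list
    of `list_size` genes drawn without replacement from `population` genes."""
    upper = min(pathway_size, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical inputs: 20,000 background genes, an 80-gene pathway,
# and 6 pathway genes found among the 378 methylated genes.
p_value = hypergeom_enrichment_p(
    population=20000, pathway_size=80, list_size=378, overlap=6
)
print(f"one-sided enrichment p-value: {p_value:.4g}")
```

A pathway with only a few genes in the list can still be significant if the pathway itself is small, which is why I find it useful to look at both the p-value and the number of representative genes.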
Another useful open-source tool described in the paper is the binary regression method used to define the TGF-beta gene signature. The binary regression method is especially useful for biologists without a lot of coding experience because it has a MATLAB GUI with a simple, user-friendly interface (and version 2.0 is even better than the original code). In addition to defining gene and pathway signatures, the Bild lab is also currently using this binary regression algorithm to predict drug sensitivity from patient samples.
That said, there are probably a few things I should warn potential users about before giving this product my complete stamp of approval. Although I have played around with this tool a little bit (with encouraging results), I haven't had a chance to use it as much as the relatively common R packages for SVMs (in the e1071 package) and classification trees (in the tree package). Therefore, I can't really comment about the practical limitations of this algorithm.
I was also a little bit nervous when I saw that Anil Potti (who I mentioned in my previous blog post) was one of the authors on the original Nature paper by Bild et al. for the binary regression method. However, Potti wasn't involved with the early framework for this method (described by West et al.), and a retraction request for one of the retracted Potti papers states "although we believe that the underlying approach to developing predictive signatures is valid, a corruption of several validation data sets precludes conclusions regarding these signatures." Therefore, I don't think Anil Potti had any negative influence on the binary regression method.
Overall, I found this paper to be useful and informative, and I would recommend it for anyone interested in microarray analysis.
Labels:
Anil Potti,
binary regression,
GATHER,
genomics,
methylation,
microarray,
ovarian cancer
Thursday, March 10, 2011
Retractions in PubMed
For those who don't know, PubMed lists retractions (in addition to the standard stuff like articles, editorials, etc.). The details regarding how PubMed decides when to flag retractions are provided here.
With retractions on the rise, I think PubMed retraction listings can play a useful role in helping hold authors accountable for publishing misleading information. For example, I hope NIH reviewers check PubMed to watch out for PIs with tainted records.
However, I noticed an inconsistency today that made me curious about how these retractions are annotated.
For example, take a look at these two Anil Potti papers that have been retracted:
This is how retractions should look:
However, this other paper doesn't have the retraction designation, even though there is already a separate PubMed entry for this paper's retraction:
There is a 2007 erratum mentioned for the New England Journal of Medicine paper but no retraction flag (as could be seen clearly for the Nature Medicine paper).
If anyone can provide additional information on this topic, then I would certainly appreciate it.
Labels:
PubMed,
retractions,
scientific misconduct
Sunday, February 27, 2011
Thoughts on my 23andMe Results
NOTE: Getting Advice About Genetic Testing
I have been spending the past several weeks reviewing my 23andMe genetic testing results that I received at the beginning of this year, and I am now ready to write up my opinions in a blog post. In fact, I have so much to say that I have split the topic into 3 separate posts:
Part I: Getting a (Free) Second Opinion
Part II: The Importance of Non-Genetic Risk Factors
Part III: Using Predictive Models for Genetic Testing
Labels:
23andMe,
personalized medicine
My 23andMe Results: Using Predictive Models for Genetic Testing
NOTE: Getting Advice About Genetic Testing
There is a big difference between saying that a mutation has a statistically significant association with a trait and saying that a mutation has strong predictive power for a trait. To illustrate this point, I’ll focus on the 23andMe prediction for eye color.
One of the top 5 traits shown in my trait overview is “eye color.” My eye color was predicted to be “likely brown” (which was an accurate prediction). However, I wanted to look at this report more carefully because I remembered Francis Collins mentioning that 23andMe incorrectly predicted his eye color in “The Language of Life.”
I don’t know Francis Collins’ genotype, but I did notice a potential problem when I looked at the section for “Your Genetic Data.”
Genotypes for rs12913832 (from 23andMe)
Genotype | Percent Brown | Percent Green | Percent Blue
AA | 85% | 14% | 1%
AG | 56% | 37% | 7%
GG | 1% | 27% | 72%
My genotype is AG (the AG row above). I was correctly predicted to have brown eyes, but people with my genotype have only a 56% chance of having brown eyes. This means that the AG genotype is roughly as accurate as a coin flip at predicting brown eye color. I don’t think this should have come up as a top prediction, and it probably would have been better to flag this SNP as “not predictive” for this particular genotype.
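As a small illustration, the Python sketch below encodes the 23andMe frequencies from the table above and reports the most likely eye color for each genotype along with how often that call would be right. The dictionary and function names are my own, not anything from 23andMe.

```python
# Eye color frequencies by rs12913832 genotype, copied from the table above.
EYE_COLOR_BY_GENOTYPE = {
    "AA": {"brown": 0.85, "green": 0.14, "blue": 0.01},
    "AG": {"brown": 0.56, "green": 0.37, "blue": 0.07},
    "GG": {"brown": 0.01, "green": 0.27, "blue": 0.72},
}


def most_likely_color(genotype):
    """Return the most frequent eye color for a genotype and its frequency."""
    frequencies = EYE_COLOR_BY_GENOTYPE[genotype]
    color = max(frequencies, key=frequencies.get)
    return color, frequencies[color]


for genotype in ("AA", "AG", "GG"):
    color, frequency = most_likely_color(genotype)
    print(f"{genotype}: likely {color} (correct about {frequency:.0%} of the time)")
```

The calls for AA and GG are right far more often than the call for AG, even though all three would be reported as a "likely" eye color.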
Now, don’t get me wrong – I don’t think people should only have access to information about their highly predictive SNPs. In fact, I would want to be able to know the frequency of my genotype if it significantly varies for different traits. However, I don’t think information like my predicted eye color should have been something that caught my eye within 5 minutes of viewing my results.
In general, I think it would be useful to give scores to predictions in the same way that stars are given for disease risk. Ideally, it would be nice to provide all the relevant information (such as overall accuracy, sensitivity, specificity, positive predictive value, negative predictive value, etc.) about these predictive models in a table, but in practice it might be better to focus on one or two features for simplicity (especially when dealing with traits that are not binary). In order to distinguish between predictive scores and reproducibility/statistical scores (which is what I would call the 4 star system), perhaps SNPs with a positive predictive value (PPV) greater than 75% get a bronze circle, SNPs with a PPV greater than 85% get a silver circle, SNPs with a PPV greater than 95% get a gold circle, and all other SNPs get labeled as "not predictive."
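The tier idea can be sketched in a few lines of Python. The thresholds are the ones proposed above; the function names and the example counts are hypothetical, and PPV is computed in the usual way as true positives divided by all positive calls.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: the fraction of positive calls that are correct."""
    return true_positives / (true_positives + false_positives)


def predictive_tier(ppv_value):
    """Map a PPV (between 0 and 1) to the proposed bronze/silver/gold label."""
    if ppv_value > 0.95:
        return "gold"
    if ppv_value > 0.85:
        return "silver"
    if ppv_value > 0.75:
        return "bronze"
    return "not predictive"


# Hypothetical counts: 56 correct "brown" calls out of 100 calls.
example_ppv = ppv(true_positives=56, false_positives=44)
print(predictive_tier(example_ppv))
```

If the eye color frequencies are treated as PPVs (which is my own simplification), the AG genotype would be labeled "not predictive" and the AA genotype would only reach bronze, since 85% does not exceed the silver cutoff.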
These calculations become trickier when considering diseases that are significantly influenced by multiple SNPs, and this is where building a predictive model (using SVM, CART, etc.) could really be helpful in providing individuals with the most accurate predictions. In fact, these predictive models need not only consider genetic information and can also be used for non-genetic / environmental risk factors like weight, family history, blood sugar, blood pressure, etc. (which I mentioned in my second post).
Unfortunately, there is no absolute best way to build a predictive model using different SNPs (and/or non-genetic information). Based upon my experience, I think regression-based models do a good job of providing probabilities that individuals have a particular trait, and very strong associations should have similar results regardless of which machine learning technique is used. However, I don’t know if a single tool will be appropriate for creating all predictive models, and I’m sure there are some traits for which no good predictive model can be created.
Since there is not a lot of precedent for clinically-relevant predictive models incorporating genetic and non-genetic information, I think 23andMe has a great opportunity to experiment with this for their trait predictions. Since 23andMe is the largest DTC genetic testing company, they will have the most incoming data that they can use as validation sets. If they are worried about how easily individuals will interpret these results, perhaps a separate “experimental” section can provide this type of result (just like I think it might be best to test these models for the "trait" section before incorporating these results into disease risk and drug response results, which people are more likely to use when making medical decisions). Also, I should acknowledge that I don't know what's going on behind the scenes at 23andMe (for example, I don’t know how 23andMe is currently combining SNPs for disease risk etc.), so this may be something they have already started to investigate.
Finally, I would like to close this 3-part post by emphasizing that I was generally pleased with my 23andMe results. I have offered constructive criticism only because I want these results to be as clear and accurate as possible, and I think 23andMe has the potential to be an invaluable resource that empowers patients to use their genetic information to the fullest extent.
My 23andMe Results: The Importance of Non-Genetic Risk Factors
NOTE: Getting Advice About Genetic Testing
When I first saw my 23andMe results, I was very glad to see that each genetic association also had a heritability value. For example, the disease description for Type 1 Diabetes indicates that 72-88% of the disease is determined by genetics whereas the sample description for Type 2 Diabetes indicates that 26% of the disease is determined by genetics (meaning that 74% is determined by environmental factors). In other words, there is a lot more that can be done to prevent the onset of Type 2 Diabetes than Type 1 Diabetes.
I was also glad to see a “What You Can Do” section describing what actions high risk individuals could potentially take to prevent or manage their disease.
Although these two steps may represent the best way to currently convey this information for most associations, I think it would really help if non-genetic factors could directly be incorporated into risk calculations.
For example, I noticed that one of my Promethease results indicated that a long history of high blood sugar would increase my risk for CAD from 1.7x to 7x (for SNP rs1333049). If 23andMe could incorporate information about my medical history directly into my risk calculations, then I think that could make the predictions much more powerful.
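To show why this matters, here is a rough Python sketch of how a relative risk changes an absolute risk. It assumes the simplest possible model (absolute risk = baseline risk x relative risk, capped at 100%), which is not necessarily how 23andMe or Promethease calculate risk. The 5% baseline is a made-up number for illustration; only the 1.7x and 7x values come from my Promethease result.

```python
def absolute_risk(baseline_risk, relative_risk):
    """Scale a baseline (population) risk by a relative risk, capped at 1.0."""
    return min(1.0, baseline_risk * relative_risk)


HYPOTHETICAL_BASELINE = 0.05  # made-up population risk, for illustration only

genetic_only = absolute_risk(HYPOTHETICAL_BASELINE, 1.7)
with_high_blood_sugar = absolute_risk(HYPOTHETICAL_BASELINE, 7.0)

print(f"genetic risk only:       {genetic_only:.1%}")
print(f"plus high blood sugar:   {with_high_blood_sugar:.1%}")
```

Under this simple model, the same SNP goes from a modest bump in risk to a large one once the non-genetic factor is included.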
In addition to providing more precise risk assessments, I think incorporating non-genetic information could also actively help individuals manage their health. For example, if I knew that losing 30 pounds would cut my risk of developing a particular type of cardiovascular disease by 50% based upon a personalized quantitative model, then I would probably be more inclined to lose that weight than if I simply knew that eating right and exercising was generally a good idea.
That said, I think there are some fundamental changes that may need to take place before such an idea could be implemented (assuming science has progressed to the point where we could provide such models for most diseases). First, I think users need a more dynamic way to record their medical information in 23andMe. By this I mean that users need to be able to update their medical information rather than fill out surveys at one point after they create their account. For example, I responded that I didn’t have any serious illnesses (like cancer, cardiovascular disease, etc.), but I’m sure my answers to those questions will change over the course of my life.
In order to integrate both genetic and environmental risk factors into risk calculations, it may also be helpful to think about other ways to present genetic testing results (which I discuss in greater detail in my third post on predictive models).
My 23andMe Results: Getting a (Free) Second Opinion
NOTE: Getting Advice About Genetic Testing
In order to get an idea about how well the 23andMe risk calculator agrees with other algorithms (when using the exact same SNP data), I searched for other tools that I could use to analyze my genetic data.
For this post, I have compared my risk assessments from 23andMe to those provided by Promethease (which uses the information available in SNPedia). I also played around with the free version of Enlis Genome (Personal Edition), but I found the GUI to be a little buggy and they didn’t automatically prioritize risk assessments (unlike 23andMe and Promethease). So, this post will focus only on comparing my 23andMe assessment with my Promethease assessment.
To be fair, I should point out that I would not necessarily expect 100% concordance between my 23andMe and Promethease results for various reasons. For example, the “magnitude” score from Promethease is a subjective measure, and the curation methods are different for these two tools. However, I think such a comparison will still be useful because it will still be encouraging to see any predictions that are shared by both tools, and I think both of these tools provide useful information since there is no “standard” way to combine all possible SNPs associated with a particular disease.
I will focus on my increased disease risks, but the same principles could be applied to decreased disease risk, drug response, or any other trait.
All Diseases with Increased Risk
Overall, I thought that there was pretty good agreement between the two methods. This may not be apparent from the table above, but that is because my list of “higher importance” SNPs is considerably smaller but with greater overlap. For example, I would have ideally preferred to look at SNPs with a 1.5x increase in risk and an absolute risk greater than 50%. The absolute cut-off of 50% is because I would prefer to look at SNPs where I am more likely to get the disease than not get the disease. The 1.5x (or 50% increase in risk) is a somewhat arbitrary cutoff that is loosely based upon my microarray data analysis experience. Since no SNPs meet both of these criteria, I chose to look at those with a greater than 1.5x relative risk and greater than 5% absolute risk (which, in my opinion, is still quite low). Now, take a look at my more subjective SNP list.
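The filtering step described above can be sketched in Python as follows. The cutoffs (relative risk above 1.5x and absolute risk above 5%) are the ones I describe, but the class name, function name, and the three example results are hypothetical placeholders rather than my actual data.

```python
from dataclasses import dataclass


@dataclass
class RiskResult:
    condition: str
    relative_risk: float  # e.g. 1.7 means 1.7x the average risk
    absolute_risk: float  # e.g. 0.08 means an 8% lifetime risk


def higher_priority(results, min_relative=1.5, min_absolute=0.05):
    """Keep results that exceed both the relative and absolute risk cutoffs."""
    return [
        r for r in results
        if r.relative_risk > min_relative and r.absolute_risk > min_absolute
    ]


# Hypothetical placeholder results, not my actual 23andMe report.
example_results = [
    RiskResult("condition A", relative_risk=1.8, absolute_risk=0.08),
    RiskResult("condition B", relative_risk=2.5, absolute_risk=0.01),
    RiskResult("condition C", relative_risk=1.2, absolute_risk=0.30),
]

for result in higher_priority(example_results):
    print(result.condition)
```

Requiring both cutoffs removes results that are a large relative increase in a very rare disease, as well as common diseases where my risk is barely different from average.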
“Higher Priority” SNPs
Now, 2 out of the 3 SNPs have similar predictions. Although there wasn’t a high magnitude SNP in Promethease for venous thromboembolism, this could be because I subjectively considered this disease to be less well known than arthritis or diabetes, so I figured less popular diseases may have lower magnitude scores. For this reason, I decided to look into what SNPs are used by 23andMe and Promethease to determine venous thromboembolism risk. I also checked the Genome-Wide Association Study (GWAS) Catalog to try and get an idea which SNPs are the best established (according to the US National Human Genome Research Institute).
SNPs Associated with Venous Thromboembolism Risk
SNP | 23andMe | Promethease | GWAS Catalog
rs6025 | Yes | Yes | No
i3002432 | Yes | No | No
rs505922 | No | Yes | Yes
NOTE: Promethease lists 19 SNPs associated with venous thromboembolism. In order to simplify the table (and avoid listing some potentially inaccurate and/or low-confidence associations), I have only listed SNPs listed by 23andMe or the GWAS Catalog.
Now there is agreement between the 23andMe and Promethease results because both tools indicate that I have a mutation in rs6025, which results in an increased risk of developing venous thromboembolism. However, I think it is worth pointing out that the results are not quite as clean as they could be. For example, this SNP was not listed in the GWAS Catalog, and I couldn’t determine the dbSNP annotation for i3002432 (so it was relatively hard for me to cross-reference this result with other databases).
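This kind of cross-referencing is easy to express with Python sets. The sketch below uses only the three SNPs from the table above (so the Promethease set is incomplete, since it lists 19 SNPs in total), and the variable and function names are my own.

```python
# SNPs from the venous thromboembolism table above. The Promethease entry
# only includes the SNPs shown in that table, not all 19 that it lists.
SNPS_BY_SOURCE = {
    "23andMe": {"rs6025", "i3002432"},
    "Promethease": {"rs6025", "rs505922"},
    "GWAS Catalog": {"rs505922"},
}


def sources_listing(snp):
    """Return the sorted names of the sources that list a given SNP."""
    return sorted(name for name, snps in SNPS_BY_SOURCE.items() if snp in snps)


shared_by_both_tools = SNPS_BY_SOURCE["23andMe"] & SNPS_BY_SOURCE["Promethease"]
in_all_sources = set.intersection(*SNPS_BY_SOURCE.values())

print("listed by 23andMe and Promethease:", sorted(shared_by_both_tools))
print("listed by all three sources:", sorted(in_all_sources))
print("sources listing rs505922:", sources_listing("rs505922"))
```

An internal identifier like i3002432 will never match across sources in a comparison like this, which is part of why it was hard for me to cross-reference.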
Another topic that is worth considering is family history. Before I saw my results, there were 3 diseases that I wanted to check due to family history: type I diabetes, type II diabetes, and macular degeneration. Thus, it was interesting to see type I diabetes come up in both reports. Although I doubt that I will get type I diabetes (since the absolute risk is low and this disease usually appears during childhood), this information may still be useful if these mutations have other effects and/or increase the likelihood of children inheriting type I diabetes.
On the other hand, I didn’t see results indicating an increase in risk for type II diabetes and there were some conflicting results about macular degeneration. Of course, family history is not a gold standard, and I may very well never develop type II diabetes or macular degeneration. However, I think it is important to think carefully about ambiguous or uncertain results. For example, this could be done by comparing SNP association to family history as well as considering both genetic and non-genetic risk factors for disease (the latter is the topic of my second post).