Wednesday, December 4, 2013

Who was the Primary Audience for the FDA Warning to 23andMe?

NOTEGetting Advice About Genetic Testing

The other day, I put together an informal survey about public opinion on the FDA warning to 23andMe and the subsequent class action lawsuit against 23andMe.  I posted links to this survey on Twitter, Facebook, the 23andMe Community discussion group (sorry, link only visible for 23andMe users), and in the comments for two articles describing the class action lawsuit (see specific links here and here).

Since it looks like the response rate has plateaued, I am presenting the results for everybody to see in this blog post.  By the way, I don't work for 23andMe, so I can't provide authoritative "right" answers to some of the questions.  However, I will always provide my own answer.

Question #1

Pretty straightforward.  I wanted to see how responses compared for 23andMe customers versus everybody else.  Almost all responses came from 23andMe customers.

I responded "Yes"

Question #2

Most people think 23andMe is at least as good as other companies.  Although the response rate is limited, I think the 2 individuals that thought 23andMe was "less accurate" provide some potentially interesting information that I will describe in more detail throughout the post.

I said "Comparable".  For SNP calling, I am assuming that most companies are using customized Illumina SNP arrays.  If this is the case, the sequencing quality should be practically identical, regardless of where you go.

Question #3

The vast majority of those surveyed don't think they deserve a refund (which is the basis for the class action suit).  If you're counting the answers for each question, you may notice that 2 people that didn't buy a 23andMe kit accidentally answered this question (both answering "No").  So, really only 45 23andMe customers answered "No," but I think the most important point is that only 2/47 customers expressed concern about their results.

The two individuals that said "Yes" are the same two individuals who thought 23andMe was less accurate than other direct-to-consumer genetic testing companies.  Again, I only have a small sample size, but this may indicate that FDA warning has caused a lot of concern about the quality of 23andMe results that didn't previously exist.

If this is true, this is important to keep in mind because I think it represents a misinterpretation of the FDA warning.  I think it is safe to say that the main message was something along the lines of "Hey 23and and Me - I want to learn more information about your tests, please reply to us ASAP".  This is not the same as "Hey current and existing customers - we are aware of serious problems with 23andMe, so you should not buy this bogus test".  My current theory is that confused customers are taking the later interpretation.

Moreover, this appears the be the same misconception implied by a blog post trying to recruit people for the class action lawsuit:

More specifically, the blog post says "23andMe, it turns out, has been under investigation since 2008 by the FDA".  While technically this is probably true, this seems to be imply that 23andMe has been singled out for suspected wrongdoing for 5 years.  However, this is not the case.  The first news from the FDA I recall hearing was in 2010, where a warning was issued to all the major direct-to-consumer (DTC) genetic testing companies (in fact, I believe Pathway Genomics was issued that warning before 23andMe).  The issue is that these companies are providing an unprecedented service that requires thinking about regulation in new ways.  Getting FDA approval is time-consuming in normal circumstances, and the process for 23andMe will take even more effort.  This is the nature of the long-term "investigation".  The FDA warning last month relates to "510(k)s for PGS on July 2, 2012 and September 4, 2012".  We know that 23andMe was not prompt in its responses since that time.  I would consider this a foolish mistake, but 23andMe can still eventually provide a product that is officially deemed acceptable by the FDA.  In fact, I would bet this will eventually be the case.

If you're still wondering, I said "No".

Question #4

My own opinions have changed around a bit on this one.  As you might guess from the question design, I've assume the true cost to provide the test was greater than the amount charged to the customer (and I choose $100-$500).  Namely, I know 23andMe has a lot of strong financial backing, including NIH funding for their research (which I hope can also help allay concerns caused by the FDA warning: clearly, the folks in this other branch of the federal government are cool with 23andMe).

I have become less certain about the costs recently when I found out that AncestryDNA also provides a genetic test for $99 (click here to see the details from my blog post on that).  My assumption was this company has less extra funding than 23andMe.  The 23andMe array provides ~50% more SNPs than the AncestryDNA sample that I looked at, but I'm not sure how much this really makes a difference in cost for the array.  That said, 23andMe probably has a lot more work to do on the interpretation side (AncestryDNA only provides ancestry results) as well as work with surveys to conduct genomics research.  So, I would probably still give the same answer in the end, but I am less confident in my choice.

To a certain extent, these details don't really matter too much: I would consider $99 dollars to be a great deal for the genetic information alone (even without interpretation from the vendor).  It is safe to say 23andMe is not making huge profits from the collection and processing of DNA samples.

Question #6 (Yes, #6 - I'll explain)

One of my concerns is that I've seen a lot of information on the Internet that I found to be inaccurate or misleading (even from generally respected sources of information), and I was concerned this was causing an exaggeration of concern in potential or existing 23andMe customers.

This question is a follow up for a question where I asked what articles / blogs people read reading the FDA warning and/or the lawsuit.  I'm not showing those results here because they are hard to read.  There was basically just a bunch of check boxes next to hyperlinks (for sites containing both positive and negative opinions about 23andMe).  20/51 individuals didn't read any of the same links that I did, and the follow up question showed that most people didn't really care what they read on-line from these types of sources.  Neither of the individuals desiring a refund read any of those specific links.  One said "No" to this question, the other said "Maybe".  Obviously, they learned the news from somewhere (and again, we're only talking about 2 people), but I did feel better that some of the specific sources that concerned me were probably not a huge deal.

I answered "No" for this one.

Additional Results

I originally used all 10 allowed questions for the free survey, but I ended up deleting 4 of those questions.  Basically, I was also interested in learning other stuff:  What to people think the terms "true positive" or "false positive" mean (with respect to 23andMe results)?  How confident are customers in their quality of results (raw sequencing data, interpretation of results, etc.)?  What to people think the error rate is for the sequence data?  (BTW, I would use this comment / post to argue the error rate less than 0.01%)  What is the perceived difference in quality for the health versus ancestry results?  How familiar are 23andMe customers with the scientific concepts being used and the limitations to the results?  However, I ended up creating confusing questions that jumbled these issues together.

I bring this up because I was impressed by how quickly I got a response from the community of 23andMe customers.  Within hours, I had users point out that the questions couldn't be properly answered without more specific information.  In fact, I received strong worded complaints due to this problem in designing the survey.

In other words, the process of running the survey made me feel more confident in the ability of 23andMe customers to be able to handle and critique their results.  Of course, I have read studies showing this to be the case (at least most of the time), but the act of collecting the data itself provides an experience that went beyond reading statistics in a scientific publication.

Concluding Remarks

The bottom line is that I feel insulted a class action lawsuit is being sought with the assumption that all 23andMe customers deserve a refund (I assume that assumption is necessary, based upon the requested $5 million dollars).  I would never ask for this, and I strongly oppose the lawsuit.

Of course, it would be better if my survey included a larger number of customers.  Hopefully, there can be a more authoritative inquiry into this matter.  My gut says that most customers are satisfied with their 23andMe results and those that are concerned about their results have over-estimated the severity of the potential problems, and I think this limited survey supports that theory.  In the very least, I hope it encourages discussion about this topic and eases some customer concerns.

If I get significantly more responses, I will update this post.  However, I'm not expecting the final count to be much higher (and I can't actually collect more than 100 total responses).

Additional Analysis of AncestryDNA Data

NOTEGetting Advice About Genetic Testing

I have a friend that asked me what extra information she could get from her raw AncestryDNA data.  I've studied by own 23andMe data extensively, so I thought it would be useful to show what tools can also be applied to raw data from other genetic testing services.

I have not purchased an AncestryDNA kit for myself, so all the analysis that I performed was for somebody else.  Therefore, I will not mention any specific results from this analysis.

First, you may find it useful to convert your raw data to 23andMe format.  You can use this Perl script ( to convert your file.  The Perl script can be run using instructions similar to those provided here.

Here are some potentially useful tools to learn more from your AncestryDNA data:

1) Interpretome - free, but requires 23andMe format file
2) Custom scripts - free, but requires 23andMe format file (and probably some comfort with programming)
3) Promethease - can directly use raw AncestryDNA file, but it costs $5.  I would probably recommend trying to use the free options first.

At first, I wasn't certain how well these strategies would work.  For example, I thought AncestryDNA might focus on non-functional regions that wouldn't affect disease risk.  However, I think there is a decent amount of additional information that can be gained from the raw data.

For example, all the functions that I tested in Interpretome worked properly.  Additionally, I found 2,581 AncestryDNA SNPs were listed in the GWAS Catalog.  Of those 2,581 SNPs, 1,337 were risk allele the individual that I tested.  This is less than I had for my 23andMe data (3,050 with GWAS Catalog annoations, 1,626 risk alleles within GWAS Catalog variants), but I think it is decent for an assay that don't provide any health reports to the user.

For those that are interested, here is a venn diagram of the overlapping 23andMe variants (V3 23andMe chip, AncestryDNA result from a few months ago):

You can see that the 23andMe chip covers ~50% more of the genome, but the AncestryDNA chip still covers a fair number of nucleotides.  My only real problem is the lack of labeling data from X, Y, and MT chromosomes as such in the raw data.  There are chromosomes listed as "23" and "25," so I initially guessed that "24" is the Y-chromosome.  However, I noticed that randomly selected rsIDs from chromosomes "23" and "25" both came from the X-chromosome.  So, I don't see a simple way to change those into a standard format.

Nevertheless, I think the information from chromosomes 1-22 provide a lot of information to review.  I hope this post helps AncestryDNA customers find something interesting!

Tuesday, December 3, 2013

Survey of Opinions on 23andMe FDA Warning and Lawsuit

I was surprised to see so many negative (and/or confused) responses to what has been going on with 23andMe the past week or so.  So, I would like to try and better quantify what is the true public opinion on the matter.

Click here to take survey

I've never actually done this before...I'll eventually write a post describing the results, but I don't know how long it should take to get a reasonable number of results.  I'm hoping a week, but this will definitely depend on how much people pass along the survey.

UPDATE: You can see the results of this study here:

Tuesday, November 26, 2013

My Take on the FDA Warning to 23andMe

NOTEGetting Advice About Genetic Testing

I have recently participated in a forum discussion on Biostar about the newest warning to 23andMe from the FDA.  I think my responses will be of interest to a broader audience, so I have copied them here.  If I end up contributing more to the Biostar discussion, I will update my blog post as well.

Comment: The majority of published reactions seem to miss what the FDA's major issue is: about marketing and describing the service.

Response: Yeah, I agree that the wording in this warning focuses mostly on marketing.

However, the FDA has previously tried to shut down DTC genomics companies (including 23andMe) and the 3rd paragraph seems to mostly focus on the accuracy of the test. The carrier status report should really be OK for diagnosis (and I think many of the specific associations mentioned in that warning are also pretty well established). Now, there are some caveats to some results like deciding how to combine independent SNP risks and explaining the difference between a mutations that guarantee onset of a disease versus modulate risk (which may only have a modest impact on risk in many cases). I personally think 23andMe does a decent job of this already, but I'm sure there can always be room for improvement.

In other words, my understanding is that the problem was primarily with direct communication with the FDA regarding technical benchmarks that would justify marketing claims (and I think the FDA is supposed to provide permission based upon this data prior to advertising). This certainly relates to communication with customers, but I think delays in formal responses to the FDA from 23andMe were the primary problem.

Concern: There is at least one report of an individual with an inaccurate 23andMe report (click here to view).

Response: Make sure to read the entire article.  Once the bug was reported, it was fixed and the report was updated.

Concern: The issue isn't the bug. The issue is that 23andme is offering a product while making claims about how customers can use results for improved medical care. Medical professionals and should be in charge of offering medical services.

Response: I agree that communication with customers is important and some customers may not have a good sense of what it is like to be part of a research project. It is possible that this is something 23andMe needs to work on.

However, I think the best solution is not to get rid of 23andMe, but rather help improve communication regarding the confidence of results. For example, I wrote a blog post about one possible solution after a previous FDA warning was issued:

Also, I don't think this is a typical result. For example, here is a link to my results as well an article from Lifehacker (from someone with much less experience with genomics research). I'm sure I've seen more, but these are what I could think of off the top of my head.

Plus, I think an unfortunate reality is that this sort of thing will happen from time to time. I think pretty much all diagnostics will suffer from some degree of false positives, false negatives, and/or human error. I know I constantly have to update the bioinformatic programs that I design (for what I would call "research grade" analysis) - especially when hunting down bugs that are only apparent when analyzing a small number of data sets.

Concern: The fact that it turned out to just be a bug is one thing, it's pretty bad but they fixed it, but even if it had turned out to be true, it's still horrific. Imagine being sent an email saying, "Hey, you're going to get progressively disabled then die young," and no further information about what the condition is and how it's going to affect you, no kind sympathetic face offering you tissues and advice and options. This guy did his research and turned out to be fine, and that's great, but how many people fall into deep depression on getting this news? How many people kill themselves? You can't give out this kind of potentially devastating life-altering news in an email.

Response: For some people, I agree this may not be the best way to communicate results. This is probably why 23andMe adds an additional step for viewing results like this that are not present for non-medical and non-predictive results. If you aren't prepared to view results on-line, then I would probably recommend either not getting a 23andMe profile, not viewing that portion of the results, and/or contacting a genetic counselor to review the results with you. For example, my 23andMe report indicates that I am a carrier for cystic fibrosis and they provide a link on how to talk to a genetic counselor on that page:

That said, I think the concern overall is an over-reaction for the following reasons:

1) There have been several publications showing that most people have no problem responding to DTC genetic testing. I can't list all the publications off the top of my head, but here is a summary of one such article:

2) In general, the accuracy for the DNA sequencing portion of the tests (currently via an Illumina SNP array) is pretty good. For example, the FDA has recently approved Illumina sequencing for clinical application. I'm also pretty sure 23andMe has checked the accuracy of the array by comparing normal 23andMe clients to the results from people who participated in the exome sequencing pilot. That said, there is a difference between the interpretation for the carrier status results (which is relatively straightforward) and all of the other results, and my understanding is that failure to communicate these results to the FDA is one of the legitimate complaints from the FDA letter.

3) I think people need to be careful and critical in all cases. For example, let's say I had a wife who was also a cystic fibrosis carrier (identified via 23andMe) and we were thinking about kids. The first thing I would do is verify the result. For example, I could order a Counsyl test from a doctor (which might be a good alternative for some people instead of 23andMe, although I think you might still be viewing your results on-line) to verify that we were in fact both carriers. I wouldn't immediately run to an in vitro fertilization clinic. Plus, medical professionals can make wrong calls too. Also, to be clear: I have family members who are confirmed cystic fibrosis carriers, as determined by standard testing.  So, I think the probability of this being a false positive is very low, but I would always want to tread carefully.  This has certainly happened to me, which at least one time delayed hospitalization for a very serious infection. This is not an attack on the medical establishment: there is a reason I went to see the doctors in the first place. However, I think the actions made by this individual were spot on, regardless of whether something is FDA-approved and regardless of whether a result comes from a person or a computer: if something doesn't sound right, you should look into a second opinion, independent research, etc.

CommentThere is now a class action lawsuit against 23andMe:

ResponseThat is unfortunate.

It does seem to reflect a relatively common sentiment among customers who didn't really follow all the scientific details, which is something I was surprised to learn. However, I think it was also influenced by a lot of misleading or inaccurate information out there. For example, this is the one that angered me the most (because it came from someone who really should have known better):
I'm hoping this can be settled out of court (if that is still possible at this point). At least this provides a list of specific claims that I hope 23andMe will directly address to customers in an official statement - at least they can reference a plethora of 3rd party experts who can generally back them up. I certainly think they made badchoices with the timing of advertising and providing terse official responses, but I don't think that should be a $5 million mistake (especially for a service that I assume is being provided below cost)
Update: I also have put together a survey on this topic. If you can fill out and/or distribute the survey, I would appreciate it!

Tuesday, November 19, 2013

RNA-Seq Differential Expression Benchmarks

I recently published a paper whose primary purpose was to serve as a reference for the protocol that I use for RNA-Seq analysis (see main paper and supplemental figures).

The aspect of the paper that I think is most interesting to the genomics community is a comparison of statistical tools for defining differentially expressed genes, which had the greatest influence on the resulting gene lists (at least among the comparisons that I make in the paper).  So, I will review those relevant figures in this blog post.

The plots below show the robustness of the gene lists produced by a given algorithm.  In other words, the higher the "common" line on the graph, the more robust the gene lists (i.e. the higher the proportion of genes commonly called by multiple algorithms).  Most readers will probably not be as interested in the x-axis (rounding factor for RPKM values), and it only changes the gene lists for Partek and sRAP.
Analysis of Patient Cohort (Tumor versus Normal).  1-factor is just tumor versus normal, while 2-factor also includes patient ID (pairing tumor and normal samples).  cuffdiff results not shown because no genes were defined with FDR < 0.05.  sRAP not shown because gene list was very small (see Figure S3 from the paper)
Analysis of Cell Line Comparison (Mutant versus WT)
To be fair, I will certainly admit robustness is not the same as accuracy.   Uniquely identified genes may be true positives that represent a lower false negative rate.  However, this did correspond to some circumstantial evidence I've seen with other datasets where cuffdiff and edgeR have given some weird results.  The results from this paper don't actually contain the clearest examples of this, but you can take a look at the GAGE4 stats to see an example where I would at least argue that edgeR provides inflated statistical significance.

Overall, I think Partek works the best (which is what I use for COH customers), but I was also pleased with DESeq (and sRAP, but I am obviously biased).  In fact, these comparisons support earlier observations that DESeq is conservative in defining lists of differential expressed genes (Robles et al. 2012).

However, my main goal is not to simply tell you what is the single best solution.  In fact, the cell line comparison above also had paired microarray data, and I would say the concordance between the two technologies was roughly similar for most algorithms:
RNA-Seq versus Microarray Gene lists.  "Microarray DEG" = proportion of differentially expressed genes in microarray data also present in RNA-Seq gene list.  "RNA-Seq DEG" = proportion of differentially expressed genes in RNA-Seq data also present in microarray gene list.

The similarity in microarray concordance kind of reminds me of Figure 2a from Rapport et al. 2013, which compares RNA-Seq gene lists to ~1000 qPCR validated genes.  However, I think properly determining accuracy can be difficult.  For example, look at the differences between the qPCR results in Figure 2a and the ERCC spike-ins in Figure S5 for that same paper.

Instead, these are the main take-home points I would like to emphasize:

1) Simple methods comparing RPKM values (in this case, rounded and log2 transformed) for defining differentially expressed genes can work at least as well as more complicated methods that are unique for RNA-Seq analysis (at least for gene-level comparisons).  For example, one claim against count-based methods in general (including edgeR, DESeq, etc.) is that there can be confounding factors, such as changes in splicing patterns.  Although I agree this is a theoretical problem that probably does occur to some extent, it doesn't seem to be a major factor influencing concordance with microarray data, qPCR validation, etc.

2) There is probably not a solution that works best in all situations. In this paper, you can see the results look very different with the patient versus cell line datasets.  For practical reasons, a lot of benchmarks will probably use cell line datasets.  However, it is not safe to assume performance for large patient cohorts will be comparable to cell line data (or patient data with little or no biological replicates).

Monday, November 18, 2013

My American Gut Individual Report

I recently received my American Gut Individual Report for my fecal sample in the mail.  Click here to see this report as a PDF.

The first thing that I noticed was that the phylum distributions were very different than what I calculated from my raw data (click here to see those results).  Namely, ~75% of my reads aligned to Proteobacteria 16S rRNAs, but my individual report said that ~10% of my sample was from Proteobacteria.  So, I contacted the American Gut team to ask why the results were so different.  They said that it appears the shipping process allows differential growth of certain bacteria (especially Gammaprotoebacteria) that would not normally appear in fresh samples.  So, they filter out likely contaminants for the report.

This is something that I would like to learn more about, and the American Gut team also said that they are actively investigating this.  The filtering code is available on the American Gut GitHub website, but I am most curious in seeing population-level metrics about this differential growth.  Namely, I would like to see how the reduction in false positives is affecting the true positive rate.  At least in my case, ~70% of my reads are being filtered out as potential contaminants, which seems like a loss of a lot of information. Additionally, my filtered sample seems to cluster with samples that have relatively low Firmicutes counts (much closer to the original value of ~15%, see PCA plot in lower-right hand corner), when the report says the percentage should be ~65% Firmicutes (bar plot).  In other words, my guess is that the "best" interpretation of my results may lie somewhere between these two reports (where the true Firmuicutes abundance may be lower and the true Proteobacteria abundance may be higher).

That said, I know the American Gut analysis is ongoing, and there will likely be additional future reports.  For example, I know for certain that there will eventually be a report for my oral sample.  It will be interesting to see if the results of future reports change as new scientific findings are discovered.

Tuesday, November 5, 2013

Metagenomic Profiles for American Gut Subjects with Migraines

As someone who experiences migraines, I know that diet can affect the onset of a migraine.  Therefore, I was interested to see if there were any metagenomic differences among subjects with migraines from the American Gut project.

After filtering for FASTQ files greater than 1 MB, I found there were 77 American Gut subjects that suffered from migraines (at least a few times a year).  I calculated abudance levels for phylum, class, and species using MG-RAST (with alignments to RDP).  You can see my previous post for more details.

First, I wanted to see what other variables were correlated with migraine status (in order to make sure I was characterizing metagenomic changes that are specifically associated with migraines and not secondary factors).  I found two factors significantly associated with migraine status: BMI (and other weight-related factors: migraine subjects had a higher BMI) and sex (migraine subjects were more likely to be women).

METADATA t-test p-value
AGE 0.79
TOT_MASS 0.039
BMI 0.017
SEX 0.0039
FAT_PER 0.93

So, I then collected normal controls matched for BMI and sex.  There were a few migraine subjects missing gender and/or BMI information, so I only included 70 matched controls.  To be fair, this might have not mattered too much: for example, the preliminary American Gut report noted that profiles didn't seem very different between genders.  However, I thought it was best to err on the side of safety.

Unsurprisingly, there were not any substantial clustering between migraine and non-migraine subjects (otherwise, I would have expected this to be included in the American Gut preliminary report):

Unfortunately, I also didn't see any strong clustering based upon migraine frequency either:

FYI, I am showing PCA plots based upon species-level abundances for species with an average abundance of at least 100 counts per sample, but you can also view the case-vs-control phylum and class distributions as well as the migraine frequency phylum and class distributions.

Although I didn't see any major differences in the PCA plots, I went ahead and looked for any differences that occurred with a false-discovery rate (FDR) less than 0.05.  I used the sRAP package for analysis, treating the 16S rRNA abundances like gene expression values.  Although most of these results looked like artifacts from only having one subject that had migraines on a daily basis, I did consider Lactococcus lactis (p=0.00088, FDR = 0.042) to be an interesting candidate:

y-axis is abundance (count-per-thousand) on a log2 scale

I double-checked the frequency of migraine subjects with lactose intolerance: the frequency was qualitatively higher compared to subjects with less frequent migraines (50% versus 28.6%), but this difference was not statistically significant (Fisher's exact p-value = 0.28).  Additionally, I'm not sure about the interpreation if this Lactococcus lactis trend is found to be reproducible in other independent cohorts.  For example, it seems like Lactococcus lactis can regulate riboflavin production (Burgess et al. 2004), which has supposedly been used in the treatment of migraines.  In other words, causality may be hard to distinguish: perhaps subjects with severe migraine problems are taking probiotic supplements.  For example, the majority of American Gut participants reported taking some sort of supplement (I couldn't find a metadata variable specific for probiotics).  In fact, it has been reported that Lactococcus lactis might alleviate symptoms from lactose intolerance (Li et al. 2012), and the individual with daily migraine occurrences was lactose intolerant.

I certainly applaud the work done by the American Gut project: from what I could tell, they were the only major metagenomic consortium that collected migraine metadata.  However, I would feel more confident about these results if there were more subjects who commonly experienced migraines and/or if there was longitudinal data (to track metagenomic profiles during intervals when migraine subjects did or did not experience a migraine).

Of course, I should also point out that I wouldn't consider the analysis presented in this post to be comprehensive.  I would certainly be interested in seeing what other conclusions could be drawn from this data.  In fact, the processed data is publicly available in MG-RAST:

Migraine Subjects

Matched Control Subjects

If desired, you can also download by tables (normalized to counts per thousand) for phylum, class, and species distributions.

Monday, October 21, 2013

Analyze Your 16S rRNA Data Using MG-RAST

Step #1: Register for an Account

You can use this link to register:

Registration is not automated, so registration is not immediate.  It took about a day to create an account for me.

Once you receive an e-mail saying "MG-RAST - account request approved", you can sign-in for the next step.

Step #2: Upload Your Data

Go to the MG-RAST website:

I would recommend using Firefox - you will see a pop-up if you do not.

Sign into MG-RAST (username and password are entered in the upper-right hand corner of the screen).

Choose "Upload".  This is represented by a green arrow pointing upwards.  There should be one in the middle of your screen (which says "Upload") as well as in the upper-right hand corner of the screen (although this one is not specifically labeled).

The metadata step is not required.  I skipped this because I figure the American Gut data should eventually be entered into this database, and I didn't want to produce a duplicate dataset (and I probably didn't know all of the details regarding funding, sample processing, etc.).  However, I contacted the MG-RAST developers, and they actually encouraged me to make the sample public.  If you take the time to fill out the metadata for your sample, it will be processed more quickly.

Under "PREPARE DATA", Click "2. upload files" and browse for your FASTQ that you downloaded from ENA.  You will see a pop-up, but just click "close".  It is not necessary to complete the check.  You can click "3. Manage Inbox" to see when the upload is complete (if you want to wait a few minutes, you can keep clicking "update inbox" until the files are ready).  Otherwise, you can just do something else and come back later.

Step #3: Run the MG-RAST Pipeline

After the data has been uploaded, click "1. select metadata file" under "DATA SUBMISSION".  If you didn't create a metadata file, just click the box saying "I do not want to supply metadata" and click "select".

Click "2. select project".  You probably don't have an existing project, so just type in something like "American Gut" and click "select".

Under "3. select sequence file(s)", click the check marks next to the files that you want to analyze and click "select".

Unless you have some experience with metagenomic analysis, just select "4. choose pipeline options" and click "select";

Finally, choose a data submission option (if you don't provide metadata, you have to keep your data private) and click "submit job".

Step #4: Analyze Your Processed Data

The pipeline may take a while (at least a few hours and possibly as long as a week), especially if you are keeping the data private.  So, I would recommend doing something else and then signing back into MG-RAST.  You can check the status of your samples at any time by clicking the earth icon in the upper-right hand corner of the screen (or "Browse Metagenomes" in the middle of the screen).  There will be numbers next to different stages in the upper-left hand corner.  If you click the number next to "In Progress" and you see your samples, then they are not ready (but you can at least you can see where your samples are in the pipeline).  You need to be able to click the number next to "Available for Analysis" and then be able to see your samples in the next menu that is loaded.

Once your samples are available for analysis, click on the bar-plot icon in the upper-right hand corner of the screen.  There are a lot of options available for metagenomic analysis, but I will walk through what I think is the most useful analysis.

Under "Organism Abundance" on the left-hand side, click "Best Hit Classification".  Under "Data Selection", select your samples by clicking the "+" icon next to "Metagenomes".  If you left your samples as private, then they should be relatively easy to select.

Under "Annotation Sources", the default may be "M5NR".  I would strongly recommend you use a RNA database, such ad RDP, Greengenes, or M5RNA.  M5RNA is a little more interesting because it also contains Eukaroytic sequences, but I will mostly focus on RDP (so that I can compare the results to the RDP-Classifier).  Highlight the desired database click "OK".

At this point, you should have all the necessary configurations set up, and your screen should look something like this:

To analyze your selected data, click a radio button under "Data Visualization" and then click "generate".  I think the tree and table tools are the most useful.

Analyze Your 16S rRNA Data Using RDP-Classifier

Step #1: Convert Files from FASTQ to FASTA

There are lots of ways to do this, but I would recommend using Galaxy if you don't have any programming experience:

Go to the Galaxy website:

If you are an academic researcher, your institution might have a local mirror (which should be faster).  However, the link above will work for everybody.

Upload your data using "Get Data" --> "Upload File" (the functions are available on the left-hand side of the screen).  You can set the file type to "fastq", but you probably don't need to.  Updates will appear on the right-hand side of the screen, so you know when each step is complete (the box for the corresponding step will turn green).

Go to "NGS: QC and manipulation" --> "FASTQ Groomer" (should be under "ILLUMINA DATA" in grey font).  Leave all the default settings and click "Execute".  This is technically necessary because of a formatting issue.

Go to "Convert Formats" --> "FASTQ to FASTA".  Once this step is complete, click the appropriate green box on the right-hand side.  Once the box becomes larger (allowing you to see the first few lines of the file), click the purple floppy disk icon to download the FASTA file.  I would recommend renaming the FASTA file after it is downloaded, so it is easier to keep track of.

Step #2: Create an RDP Account

You can sign up using this link:

An account will be created automatically.  You will receive an e-mail with a username and password (you will be asked to change your password the first time you sign in).  Technically, you don't need an account to run the classifier.  However, I think it may be helpful if you want to play around with some other tools.

Step #3: Sign-In and Run RDP-Classifier

Using the link provided by the registration e-mail, sign into myRDP.

Now, go to this link:

There will be an option to "Choose a file (unaligned format) to upload:".  Use the browser to select the FASTA (not FASTQ) file that you downloaded from Galaxy.  Next, click "Submit".

The classifier is very fast (you should get your results in a few minutes).  The result page is somewhat hard to parse, but everything is clickable to learn more.  The number of reads is shown in parentheses.

It really helps to remember biological classifications when interpreting these results.  Here is a quick cheat sheet:

phylum > class > order (> suborder) > family > genus

Unfortunately, the classifier won't provide species-specific information.

You can also download the results in a text file.  If you do this, you can use a tool like Notepad++ to search for keywords (like phylum, genus, etc.), but I think the results are a little easier to view on the webpage.

How to Download Your American Gut Data

Step #1: Find Your Barcode(s)

Each sample has a nine digit barcode.  If you see a smaller number, add leading zeros.

For example, my barcodes were 2683 (fecal) and 2684 (oral), so I need to use 000002683 and 000002684 as my sample IDs.

Step #2: Search For Your Sample

Go to the European Nucleotide Archive (ENA) website:

Copy and paste your 9-digit barcode into the text search

I get a single result for each of my samples when I do this.  If you get multiple results, choose the metagenome sample (see image below).

If you want to double-check you have the right sample, click on the "Sample accession" link (which will start with "ERS").  If you then click the "Attributes" tab, you should be able to see your metadata.  For example, I know I live in Los Angeles, so my state better not be GA.

Step #3: Download Your FASTQ Files

If you are certain you have the right sample, click the the link for "Fastq files (ftp)" to start the download.  Note that the sample will be labeled based upon the "Run accesssion" (starting with "ERR").  For example, here are the different IDs for my samples:

Fecal: 000002683 --> ERS345317 --> ERR336561
Oral: 000002684 --> ERS344890 --> ERR336138

The .fastq files will be compressed, so you should unzip them.  I would recommend using 7zip for this.  I would also recommend renaming your files (like fecal.fastq and oral.fastq) to make it easier for you to keep track of them.

Open-Source Analysis of My Raw American Gut Data

Just like I used open-source tools to re-analyze my 23andMe data, I wanted to see what I could learn from analyzing the raw 16S rRNA reads from the American Gut Project.  The American Gut team is very well organized, and you can currently access your raw data in public databases.

I've provided a tutorial based entirely on web-based analysis, so you don't need to know any programming to follow these steps in your own data.  You can also skip the tutorial links to just see my own results.

Step #1: Get Your Data
Step #2a: Analyze Your Data in MG-RAST (preferred, but time-consuming)
Step #2b: Analyze Your Data using RDP-Classifier (quick, but less functionality)

Here are some charts that I could quickly create in Excel using data from RDP-Classifier:

If you compare these distributions to the average gut distribution from the American Gut preliminary report, then you can tell it is very different than the average participant.  Clearly, I have more Proteobacteria than the average participant (and less Firmicutes and Bacteroidetes).  However, it is also important to note that there was also a large amount of variation between participants.

I don't know how my class distributions compare to other samples, but it seems like I can at least infer that there is more variety in my oral sample than my fecal sample:

I also found it useful to see what specific genera were most highly abundant.  According to RDP-Classifer, these are the genera with more than 1000 reads:

Fecal Oral
Streptococcaceae Streptococcus 0 10614
Neisseriaceae Neisseria 0 8824
Actinomycetaceae Actinomyces 0 4129
Lachnospiraceae Oribacterium 0 2319
Veillonellaceae Veillonella  0 2124
Bacteroidaceae Bacteroides 1800 5
Enterobacteriaceae Escherichia/Shigella 8039 5

I was able to find reports that Actinomyces was associated with transformation of lymphocytes in patients with periodontal disease (Baker et al. 1976) and Streptococcus mutans plays a role in human dental decay (Loesche 1986), although I don't know if I had this specific strain (based upon the MG-RAST report, it looks like I don't).  Likewise, I showed this list to my dentist, and she recognized these two genera as being associated with dental problems.  Although I don't know how common these are overall, it seems to make sense that they could be found in an oral sample.

Obviously, I recognized Escherichia/Shigella.  The American Gut report points out that the phylum containing this genus is not highly abundant in an average participant.  At one point, I had to be hospitalized with ulcerative colitis (with a strain of E. coli producing shiga toxin), so perhaps this is related to the high abundance of this genus (although that was several years ago).

If you are patient enough to wait for your MG-RAST results, then you can make similar (but slightly cooler looking) pie charts and tables automatically.  For example, you can take a look at the corresponding plots from MG-RAST for my fecal and oral samples.  You can also create plots to compare species in multiple samples (red bars are for my oral samples, green bars are for my fecal sample):

Perhaps most importantly, MG-RAST will provide annotations down to the species level (and strain level, when possible).  The species counts aren't perfectly correlated with the genera counts (predicted from the classifier), but the the most interesting genera appeared in both lists.

uncultured bacterium
Escherichia coli ED1a
Abiotrophia para-adiacens
Actinomyces odontolyticus
Veillonella dispar
Butyrivibrio fibrisolvens
Blautia sp. Ser8
Bacteroides stercoris
Syntrophococcus sucromutans
Pseudomonas fluorescens
Bacteroides caccae
Bacteroides stercoris ATCC 43183
Bacteroides vulgatus
Rothia mucilaginosa
uncultured bacterium
Haemophilus haemolyticus
Prevotella buccalis
Ruminococcus gauvreauii
Streptococcus sanguinis
Ruminococcus torques L2-14
Escherichia coli
Abiotrophia defectiva
Gemella morbillorum
Dialister propionicifaciens
Veillonella parvula
Butyrivibrio hungatei
Atopobium minutum
Parvimonas micra
Leptotrichia shahii

This information can allowed to conduct more effective literature searches.  For example, my understanding is that the ED1a strain has not been shown to be associated with ulcerative colitis.  On the other hand, the species information allowed me to find a paper for the discovery of my specific strain of Actinomyces, which was harvested from 450 tooth cavities (Batty 2005).  Likewise, I could confirm that Streptococcus sanguinis was also pathogenic (Xu et al. 2007).

FYI, Galaxy also has some metagenomic tools.  However, running BLAST on Galaxy will take a long time.  If you are comfortable with running BLAST locally, it should be easier (but this requires some comfort using the computer).  You can also analyze your data locally using QIIME upload the results to PICRUSt for functional enrichment, if you don't need the convenience of the web-based tools listed above (MG-RAST can produce QIIME reports, but I think it is better to use a tab-delimited text file to avoid formatting problems).

I am still interested in seeing what my official individual report will look like: although I have general experience with bioinformatics analysis, the folks at American Gut have looked at a lot more metagenomic data than I have.  Likewise, I am interesting in seeing how my profiles change at different time points: once I eventually get my uBiome results, I will put together another post to compare the results.

Thursday, May 16, 2013

Gene Patents and Genetic Testing

NOTEGetting Advice About Genetic Testing

This week, Angelina Jolie wrote an editorial for the New York Times describing how she underwent a double mastectomy after testing positive for a deleterious BRCA1 mutation.

This article stirred up a lot of discussion because she mentions how "[the] cost of testing for BRCA1 and BRCA2, at more than $3,000 in the United States, remains an obstacle for many women".

Although there have been responses pointing out that insurance can significantly decrease the out-of-pocket costs to the patient (under the right circumstances), the question remains "Why is the test so expensive?".  After all, the true cost to perform the test is much less than $3,000.

The genetic test in question is Myriad's BRACAnalysis test.  This company should sound familar because the Supreme Court recently heard a case from Myriad to decide if human genes can be patented (the BRCA1/2 gene patents are the reason why the Myriad test costs so much).

There are lots of resources to learn more about the court case and BRCA testing.  For example, Chris Mason recently wrote good summary on the topic.  However, I want to re-emphasize a couple important points here:

1) It is possible to provide a genetic test covering a panel of diseases at ~1/10th of the cost.  For example, 23andMe provides carrier status for a variety of diseases (including some BRCA1/2 mutations) for $99.  Likewise, Counsyl offers genetic testing that is less than $100 with insurance or less than $1000 without insurance.  There are also various options for discovering mutations in raw sequencing data, whose cost is continually decreasing.

2) The basis of Myriad's claims to retain its patents is the high development costs to develop the test.  Regardless of the ethical arguements against gene patients (and the high likelihood that genetic tests will be based upon findings from government-funded research), this is no longer a technical concern now that the human genome has been sequenced.  There a multitude of freely available tools to help clone any region of the human genome.  There are certainly still novel discoveries out there (like novel non-coding genes, fusion genes, etc.), but I would argue the tricky part is asking the right question - the production of the cDNA is then trivial.  I see no reason why individuals should have to pay for the costs of a patent for the interpretion of a medical discovery reported in the scientific literature.
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.