Monday, October 21, 2013

Analyze Your 16S rRNA Data Using MG-RAST

Step #1: Register for an Account

You can use this link to register:

Registration is not automated, so it is not immediate.  It took about a day for my account to be created.

Once you receive an e-mail saying "MG-RAST - account request approved", you can sign in for the next step.

Step #2: Upload Your Data

Go to the MG-RAST website:

I would recommend using Firefox - you will see a pop-up if you do not.

Sign into MG-RAST (username and password are entered in the upper-right hand corner of the screen).

Choose "Upload".  This is represented by a green arrow pointing upwards.  There should be one in the middle of your screen (which says "Upload") as well as in the upper-right hand corner of the screen (although this one is not specifically labeled).

The metadata step is not required.  I skipped this because I figure the American Gut data should eventually be entered into this database, and I didn't want to produce a duplicate dataset (and I probably didn't know all of the details regarding funding, sample processing, etc.).  However, I contacted the MG-RAST developers, and they actually encouraged me to make the sample public.  If you take the time to fill out the metadata for your sample, it will be processed more quickly.

Under "PREPARE DATA", click "2. upload files" and browse for the FASTQ file that you downloaded from ENA.  You will see a pop-up, but just click "close".  It is not necessary to complete the check.  You can click "3. Manage Inbox" to see when the upload is complete (if you want to wait a few minutes, you can keep clicking "update inbox" until the files are ready).  Otherwise, you can just do something else and come back later.

Step #3: Run the MG-RAST Pipeline

After the data has been uploaded, click "1. select metadata file" under "DATA SUBMISSION".  If you didn't create a metadata file, just click the box saying "I do not want to supply metadata" and click "select".

Click "2. select project".  You probably don't have an existing project, so just type in something like "American Gut" and click "select".

Under "3. select sequence file(s)", click the check marks next to the files that you want to analyze and click "select".

Unless you have some experience with metagenomic analysis, just select "4. choose pipeline options" and click "select".

Finally, choose a data submission option (if you don't provide metadata, you have to keep your data private) and click "submit job".

Step #4: Analyze Your Processed Data

The pipeline may take a while (at least a few hours and possibly as long as a week), especially if you are keeping the data private.  So, I would recommend doing something else and then signing back into MG-RAST.  You can check the status of your samples at any time by clicking the earth icon in the upper-right hand corner of the screen (or "Browse Metagenomes" in the middle of the screen).  There will be numbers next to different stages in the upper-left hand corner.  If you click the number next to "In Progress" and you see your samples, then they are not ready (but at least you can see where your samples are in the pipeline).  You need to be able to click the number next to "Available for Analysis" and then be able to see your samples in the next menu that is loaded.

Once your samples are available for analysis, click on the bar-plot icon in the upper-right hand corner of the screen.  There are a lot of options available for metagenomic analysis, but I will walk through what I think is the most useful analysis.

Under "Organism Abundance" on the left-hand side, click "Best Hit Classification".  Under "Data Selection", select your samples by clicking the "+" icon next to "Metagenomes".  If you left your samples as private, then they should be relatively easy to select.

Under "Annotation Sources", the default may be "M5NR".  I would strongly recommend you use an RNA database, such as RDP, Greengenes, or M5RNA.  M5RNA is a little more interesting because it also contains eukaryotic sequences, but I will mostly focus on RDP (so that I can compare the results to the RDP-Classifier).  Highlight the desired database and click "OK".

At this point, you should have all the necessary configurations set up, and your screen should look something like this:

To analyze your selected data, click a radio button under "Data Visualization" and then click "generate".  I think the tree and table tools are the most useful.

Analyze Your 16S rRNA Data Using RDP-Classifier

Step #1: Convert Files from FASTQ to FASTA

There are lots of ways to do this, but I would recommend using Galaxy if you don't have any programming experience:

Go to the Galaxy website:

If you are an academic researcher, your institution might have a local mirror (which should be faster).  However, the link above will work for everybody.

Upload your data using "Get Data" --> "Upload File" (the functions are available on the left-hand side of the screen).  You can set the file type to "fastq", but you probably don't need to.  Updates will appear on the right-hand side of the screen, so you know when each step is complete (the box for the corresponding step will turn green).

Go to "NGS: QC and manipulation" --> "FASTQ Groomer" (should be under "ILLUMINA DATA" in grey font).  Leave all the default settings and click "Execute".  This is technically necessary because of a formatting issue.

Go to "Convert Formats" --> "FASTQ to FASTA".  Once this step is complete, click the appropriate green box on the right-hand side.  Once the box becomes larger (allowing you to see the first few lines of the file), click the purple floppy disk icon to download the FASTA file.  I would recommend renaming the FASTA file after it is downloaded, so it is easier to keep track of.
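
If you do have some programming experience, the conversion can also be done on your own computer.  Here is a minimal Python sketch (this is not what Galaxy runs internally), assuming each read takes up exactly four lines in the FASTQ file.  The function name and the file names are hypothetical:

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Convert four-line FASTQ records to FASTA; return the number of reads written."""
    count = 0
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            if not header.startswith("@"):
                raise ValueError(f"expected a header starting with '@', got: {header!r}")
            sequence = src.readline()
            src.readline()  # '+' separator line (ignored)
            src.readline()  # quality line (ignored)
            # FASTA headers start with '>' instead of '@'
            dst.write(">" + header[1:].rstrip("\n") + "\n")
            dst.write(sequence.rstrip("\n") + "\n")
            count += 1
    return count
```

For example, `fastq_to_fasta("fecal.fastq", "fecal.fasta")` would write the FASTA file and return the number of reads converted.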

Step #2: Create an RDP Account

You can sign up using this link:

An account will be created automatically.  You will receive an e-mail with a username and password (you will be asked to change your password the first time you sign in).  Technically, you don't need an account to run the classifier.  However, I think it may be helpful if you want to play around with some other tools.

Step #3: Sign-In and Run RDP-Classifier

Using the link provided by the registration e-mail, sign into myRDP.

Now, go to this link:

There will be an option to "Choose a file (unaligned format) to upload:".  Use the browser to select the FASTA (not FASTQ) file that you downloaded from Galaxy.  Next, click "Submit".

The classifier is very fast (you should get your results in a few minutes).  The result page is somewhat hard to parse, but everything is clickable to learn more.  The number of reads is shown in parentheses.

It really helps to remember biological classifications when interpreting these results.  Here is a quick cheat sheet:

phylum > class > order (> suborder) > family > genus

Unfortunately, the classifier won't provide species-specific information.

You can also download the results in a text file.  If you do this, you can use a tool like Notepad++ to search for keywords (like phylum, genus, etc.), but I think the results are a little easier to view on the webpage.
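
If you would rather script the keyword search than use a text editor, here is a small Python sketch.  It makes no assumption about the layout of the downloaded file beyond it being plain text, and the function name and file name are hypothetical:

```python
def lines_with_keyword(path, keyword):
    """Return every line of a text file that contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle if keyword in line.lower()]
```

For example, `lines_with_keyword("fecal_rdp_results.txt", "genus")` would return only the lines that mention "genus".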

How to Download Your American Gut Data

Step #1: Find Your Barcode(s)

Each sample has a nine digit barcode.  If you see a smaller number, add leading zeros.

For example, my barcodes were 2683 (fecal) and 2684 (oral), so I need to use 000002683 and 000002684 as my sample IDs.
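
If you have several barcodes to convert, the zero-padding can be sketched in a few lines of Python.  The function name is hypothetical, and the width of nine digits comes from the barcode format described above:

```python
def pad_barcode(barcode, width=9):
    """Left-pad a numeric barcode with zeros to the given width."""
    text = str(barcode).strip()
    if not text.isdigit():
        raise ValueError(f"barcode must contain only digits: {text!r}")
    if len(text) > width:
        raise ValueError(f"barcode is longer than {width} digits: {text!r}")
    return text.zfill(width)


print(pad_barcode(2683))    # 000002683
print(pad_barcode("2684"))  # 000002684
```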

Step #2: Search For Your Sample

Go to the European Nucleotide Archive (ENA) website:

Copy and paste your 9-digit barcode into the text search.

I get a single result for each of my samples when I do this.  If you get multiple results, choose the metagenome sample (see image below).

If you want to double-check you have the right sample, click on the "Sample accession" link (which will start with "ERS").  If you then click the "Attributes" tab, you should be able to see your metadata.  For example, I know I live in Los Angeles, so my state better not be GA.

Step #3: Download Your FASTQ Files

If you are certain you have the right sample, click the link for "Fastq files (ftp)" to start the download.  Note that the sample will be labeled based upon the "Run accession" (starting with "ERR").  For example, here are the different IDs for my samples:

Fecal: 000002683 --> ERS345317 --> ERR336561
Oral: 000002684 --> ERS344890 --> ERR336138

The .fastq files will be compressed, so you should unzip them.  I would recommend using 7zip for this.  I would also recommend renaming your files (like fecal.fastq and oral.fastq) to make it easier for you to keep track of them.
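
If you prefer not to install 7zip, the files can also be unpacked with Python.  This is a minimal sketch that assumes the download is gzip-compressed (a name ending in ".fastq.gz"); that file extension and the function name are my assumptions, so check what your downloaded file is actually called:

```python
import gzip
import shutil


def decompress_fastq(gz_path, out_path):
    """Decompress a gzip-compressed file and write the result to out_path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream the data in chunks so large files do not need to fit in memory
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `decompress_fastq("ERR336561.fastq.gz", "fecal.fastq")` would unpack and rename the file in one step.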

Open-Source Analysis of My Raw American Gut Data

Just like I used open-source tools to re-analyze my 23andMe data, I wanted to see what I could learn from analyzing the raw 16S rRNA reads from the American Gut Project.  The American Gut team is very well organized, and you can currently access your raw data in public databases.

I've provided a tutorial based entirely on web-based analysis, so you don't need to know any programming to follow these steps on your own data.  You can also skip the tutorial links to just see my own results.

Step #1: Get Your Data
Step #2a: Analyze Your Data in MG-RAST (preferred, but time-consuming)
Step #2b: Analyze Your Data using RDP-Classifier (quick, but less functionality)

Here are some charts that I could quickly create in Excel using data from RDP-Classifier:

If you compare these distributions to the average gut distribution from the American Gut preliminary report, then you can tell that my profile is very different from that of the average participant.  Clearly, I have more Proteobacteria than the average participant (and fewer Firmicutes and Bacteroidetes).  However, it is also important to note that there was a large amount of variation between participants.

I don't know how my class distributions compare to other samples, but it seems like I can at least infer that there is more variety in my oral sample than my fecal sample:

I also found it useful to see what specific genera were most highly abundant.  According to RDP-Classifier, these are the genera with more than 1000 reads:

Family              Genus                  Fecal   Oral
Streptococcaceae    Streptococcus              0  10614
Neisseriaceae       Neisseria                  0   8824
Actinomycetaceae    Actinomyces                0   4129
Lachnospiraceae     Oribacterium               0   2319
Veillonellaceae     Veillonella                0   2124
Bacteroidaceae      Bacteroides             1800      5
Enterobacteriaceae  Escherichia/Shigella    8039      5
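
I put this table together by hand, but the same filtering can be sketched in Python.  The counts below are copied from the table above; the function name and the idea of keeping a genus when either sample passes the cutoff are my assumptions for illustration:

```python
# (family, genus, fecal reads, oral reads), copied from the table above
genus_counts = [
    ("Streptococcaceae", "Streptococcus", 0, 10614),
    ("Neisseriaceae", "Neisseria", 0, 8824),
    ("Actinomycetaceae", "Actinomyces", 0, 4129),
    ("Lachnospiraceae", "Oribacterium", 0, 2319),
    ("Veillonellaceae", "Veillonella", 0, 2124),
    ("Bacteroidaceae", "Bacteroides", 1800, 5),
    ("Enterobacteriaceae", "Escherichia/Shigella", 8039, 5),
]


def abundant_genera(rows, min_reads=1000):
    """Keep rows where either sample has more than min_reads, largest count first."""
    kept = [row for row in rows if max(row[2], row[3]) > min_reads]
    return sorted(kept, key=lambda row: max(row[2], row[3]), reverse=True)


# Every row in the table already passes the 1000-read cutoff,
# so a stricter cutoff is used here to show the filter doing something.
for family, genus, fecal, oral in abundant_genera(genus_counts, min_reads=5000):
    print(f"{genus:<22}{fecal:>8}{oral:>8}")
```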

I was able to find reports that Actinomyces was associated with transformation of lymphocytes in patients with periodontal disease (Baker et al. 1976) and Streptococcus mutans plays a role in human dental decay (Loesche 1986), although I don't know if I had this specific strain (based upon the MG-RAST report, it looks like I don't).  Likewise, I showed this list to my dentist, and she recognized these two genera as being associated with dental problems.  Although I don't know how common these are overall, it seems to make sense that they could be found in an oral sample.

Obviously, I recognized Escherichia/Shigella.  The American Gut report points out that the phylum containing this genus is not highly abundant in an average participant.  At one point, I had to be hospitalized with ulcerative colitis (with a strain of E. coli producing shiga toxin), so perhaps this is related to the high abundance of this genus (although that was several years ago).

If you are patient enough to wait for your MG-RAST results, then you can make similar (but slightly cooler looking) pie charts and tables automatically.  For example, you can take a look at the corresponding plots from MG-RAST for my fecal and oral samples.  You can also create plots to compare species in multiple samples (red bars are for my oral samples, green bars are for my fecal sample):

Perhaps most importantly, MG-RAST will provide annotations down to the species level (and strain level, when possible).  The species counts aren't perfectly correlated with the genera counts (predicted from the classifier), but the most interesting genera appeared in both lists.

uncultured bacterium
Escherichia coli ED1a
Abiotrophia para-adiacens
Actinomyces odontolyticus
Veillonella dispar
Butyrivibrio fibrisolvens
Blautia sp. Ser8
Bacteroides stercoris
Syntrophococcus sucromutans
Pseudomonas fluorescens
Bacteroides caccae
Bacteroides stercoris ATCC 43183
Bacteroides vulgatus
Rothia mucilaginosa
uncultured bacterium
Haemophilus haemolyticus
Prevotella buccalis
Ruminococcus gauvreauii
Streptococcus sanguinis
Ruminococcus torques L2-14
Escherichia coli
Abiotrophia defectiva
Gemella morbillorum
Dialister propionicifaciens
Veillonella parvula
Butyrivibrio hungatei
Atopobium minutum
Parvimonas micra
Leptotrichia shahii

This information allowed me to conduct more effective literature searches.  For example, my understanding is that the ED1a strain has not been shown to be associated with ulcerative colitis.  On the other hand, the species information allowed me to find a paper for the discovery of my specific strain of Actinomyces, which was harvested from 450 tooth cavities (Batty 2005).  Likewise, I could confirm that Streptococcus sanguinis was also pathogenic (Xu et al. 2007).

FYI, Galaxy also has some metagenomic tools.  However, running BLAST on Galaxy will take a long time.  If you are comfortable with running BLAST locally, it should be easier (but this requires some comfort using the computer).  You can also analyze your data locally using QIIME and upload the results to PICRUSt for functional enrichment, if you don't need the convenience of the web-based tools listed above (MG-RAST can produce QIIME reports, but I think it is better to use a tab-delimited text file to avoid formatting problems).

I am still interested in seeing what my official individual report will look like: although I have general experience with bioinformatics analysis, the folks at American Gut have looked at a lot more metagenomic data than I have.  Likewise, I am interested in seeing how my profiles change at different time points: once I eventually get my uBiome results, I will put together another post to compare the results.
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.