Charles Warden's Science Blog

Analyze Your 16S rRNA Data Using MG-RAST

Step #1: Register for an Account

You can use this link to register: http://metagenomics.anl.gov/?page=Register

Registration is not automated, so it is not immediate. It took about a day for my account to be created.

Once you receive an e-mail saying "MG-RAST - account request approved", you can sign in for the next step.

Step #2: Upload Your Data

Go to the MG-RAST website: http://metagenomics.anl.gov/

I would recommend using Firefox - you will see a pop-up if you do not.

Sign into MG-RAST (username and password are entered in the upper-right hand corner of the screen).

Choose "Upload". This is represented by a green arrow pointing upwards. There should be one in the middle of your screen (which says "Upload") as well as in the upper-right hand corner of the screen (although this one is not specifically labeled).

The metadata step is not required. I skipped this because I figure the American Gut data should eventually be entered into this database, and I didn't want to produce a duplicate dataset (and I probably didn't know all of the details regarding funding, sample processing, etc.). However, I contacted the MG-RAST developers, and they actually encouraged me to make the sample public. If you take the time to fill out the metadata for your sample, it will be processed more quickly.

Under "PREPARE DATA", click "2. upload files" and browse for the FASTQ file that you downloaded from ENA. You will see a pop-up, but just click "close". It is not necessary to complete the check. You can click "3. Manage Inbox" to see when the upload is complete (if you want to wait a few minutes, you can keep clicking "update inbox" until the files are ready). Otherwise, you can just do something else and come back later.

Step #3: Run the MG-RAST Pipeline

After the data has been uploaded, click "1. select metadata file" under "DATA SUBMISSION". If you didn't create a metadata file, just click the box saying "I do not want to supply metadata" and click "select".

Click "2. select project". You probably don't have an existing project, so just type in something like "American Gut" and click "select".

Under "3. select sequence file(s)", click the check marks next to the files that you want to analyze and click "select".

Unless you have some experience with metagenomic analysis, just leave the settings under "4. choose pipeline options" as they are and click "select".

Finally, choose a data submission option (if you don't provide metadata, you have to keep your data private) and click "submit job".

Step #4: Analyze Your Processed Data

The pipeline may take a while (at least a few hours and possibly as long as a week), especially if you are keeping the data private. So, I would recommend doing something else and then signing back into MG-RAST. You can check the status of your samples at any time by clicking the earth icon in the upper-right hand corner of the screen (or "Browse Metagenomes" in the middle of the screen). There will be numbers next to different stages in the upper-left hand corner. If you click the number next to "In Progress" and you see your samples, then they are not ready (but at least you can see where your samples are in the pipeline). Your samples are ready once you can click the number next to "Available for Analysis" and see them in the next menu that is loaded.

Once your samples are available for analysis, click on the bar-plot icon in the upper-right hand corner of the screen. There are a lot of options available for metagenomic analysis, but I will walk through what I think is the most useful analysis.

Under "Organism Abundance" on the left-hand side, click "Best Hit Classification". Under "Data Selection", select your samples by clicking the "+" icon next to "Metagenomes". If you left your samples as private, then they should be relatively easy to select.

Under "Annotation Sources", the default may be "M5NR". I would strongly recommend you use an RNA database, such as RDP, Greengenes, or M5RNA. M5RNA is a little more interesting because it also contains eukaryotic sequences, but I will mostly focus on RDP (so that I can compare the results to the RDP-Classifier). Highlight the desired database and click "OK".

At this point, you should have all the necessary configurations set up, and your screen should look something like this:

To analyze your selected data, click a radio button under "Data Visualization" and then click "generate". I think the tree and table tools are the most useful.

Analyze Your 16S rRNA Data Using RDP-Classifier

Step #1: Convert Files from FASTQ to FASTA

There are lots of ways to do this, but I would recommend using Galaxy if you don't have any programming experience:

Go to the Galaxy website: https://usegalaxy.org/

If you are an academic researcher, your institution might have a local mirror (which should be faster). However, the link above will work for everybody.

Upload your data using "Get Data" --> "Upload File" (the functions are available on the left-hand side of the screen). You can set the file type to "fastq", but you probably don't need to. Updates will appear on the right-hand side of the screen, so you know when each step is complete (the box for the corresponding step will turn green).

Go to "NGS: QC and manipulation" --> "FASTQ Groomer" (should be under "ILLUMINA DATA" in grey font). Leave all the default settings and click "Execute". This is technically necessary because of a formatting issue.

Go to "Convert Formats" --> "FASTQ to FASTA". Once this step is complete, click the appropriate green box on the right-hand side. Once the box becomes larger (allowing you to see the first few lines of the file), click the purple floppy disk icon to download the FASTA file. I would recommend renaming the FASTA file after it is downloaded, so it is easier to keep track of.
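If you are comfortable running a small script, the same conversion can be sketched in Python. This is only a minimal sketch, not what Galaxy does internally: the function name `fastq_to_fasta` is made up for illustration, and it assumes a plain (uncompressed) FASTQ file in which every read takes exactly four lines. The file names `fecal.fastq` and `fecal.fasta` are just the renamed files suggested elsewhere in these posts.

```python
import os
from itertools import islice


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a 4-line-per-read FASTQ file to FASTA; return the read count."""
    count = 0
    with open(fastq_path) as fastq, open(fasta_path, "w") as fasta:
        while True:
            # Read the next four lines: header, sequence, separator, quality.
            record = [line.rstrip("\r\n") for line in islice(fastq, 4)]
            if not record:
                break  # end of file
            if not any(record):
                continue  # skip blank lines (for example at the end of the file)
            if (len(record) < 4
                    or not record[0].startswith("@")
                    or not record[2].startswith("+")):
                raise ValueError(f"Malformed FASTQ record near read {count + 1}")
            # FASTA keeps only the header (with '>' instead of '@') and the sequence.
            fasta.write(f">{record[0][1:]}\n{record[1]}\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; change them to match your own files.
    if os.path.exists("fecal.fastq"):
        n = fastq_to_fasta("fecal.fastq", "fecal.fasta")
        print(f"Wrote {n} sequences to fecal.fasta")
    else:
        print("fecal.fastq not found in the current folder")
```

The quality scores are simply dropped, since FASTA has no place to store them.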

Step #2: Create an RDP Account

You can sign up using this link: https://rdp.cme.msu.edu/user/createAcct.spr

An account will be created automatically. You will receive an e-mail with a username and password (you will be asked to change your password the first time you sign in). Technically, you don't need an account to run the classifier. However, I think it may be helpful if you want to play around with some other tools.

Step #3: Sign-In and Run RDP-Classifier

Using the link provided by the registration e-mail, sign into myRDP.

Now, go to this link: https://rdp.cme.msu.edu/classifier/classifier.jsp

There will be an option to "Choose a file (unaligned format) to upload:". Use the browser to select the FASTA (not FASTQ) file that you downloaded from Galaxy. Next, click "Submit".

The classifier is very fast (you should get your results in a few minutes). The result page is somewhat hard to parse, but everything is clickable to learn more. The number of reads is shown in parentheses.

It really helps to remember biological classifications when interpreting these results. Here is a quick cheat sheet:

phylum > class > order (> suborder) > family > genus

Unfortunately, the classifier won't provide species-specific information.

You can also download the results in a text file. If you do this, you can use a tool like Notepad++ to search for keywords (like phylum, genus, etc.), but I think the results are a little easier to view on the webpage.
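If you would rather script the keyword search than use a text editor, a minimal Python sketch follows. It makes no assumption about the layout of the downloaded file: it only reports the lines that contain a keyword, much like a "find all" in Notepad++. The function name `find_lines` and the file name `rdp_results.txt` are made up for illustration.

```python
import os


def find_lines(path, keyword):
    """Return (line_number, line) pairs for lines containing keyword (case-insensitive)."""
    keyword = keyword.lower()
    matches = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if keyword in line.lower():
                matches.append((number, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    results_file = "rdp_results.txt"  # hypothetical name for the downloaded text file
    if os.path.exists(results_file):
        for keyword in ("phylum", "genus"):
            hits = find_lines(results_file, keyword)
            print(f"{keyword}: {len(hits)} matching lines")
            for number, line in hits[:5]:  # show only the first few matches
                print(f"  line {number}: {line}")
    else:
        print(f"{results_file} not found in the current folder")
```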

How to Download Your American Gut Data

Step #1: Find Your Barcode(s)

Each sample has a nine-digit barcode. If you see a smaller number, add leading zeros.

For example, my barcodes were 2683 (fecal) and 2684 (oral), so I need to use 000002683 and 000002684 as my sample IDs.
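If you have several barcodes to convert, the zero-padding can be sketched in a few lines of Python. The helper name `pad_barcode` is made up for illustration. The only assumptions are the ones stated above: the sample ID is nine digits long and shorter barcodes get leading zeros.

```python
def pad_barcode(barcode, width=9):
    """Left-pad a numeric barcode with zeros until it is `width` digits long."""
    text = str(barcode).strip()
    if not text.isdigit():
        raise ValueError(f"Barcode should contain only digits: {text!r}")
    if len(text) > width:
        raise ValueError(f"Barcode is longer than {width} digits: {text!r}")
    return text.zfill(width)


for label, barcode in [("fecal", 2683), ("oral", 2684)]:
    print(label, pad_barcode(barcode))
# fecal 000002683
# oral 000002684
```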

Step #2: Search For Your Sample

Go to the European Nucleotide Archive (ENA) website: http://www.ebi.ac.uk/ena/

Copy and paste your 9-digit barcode into the text search box.

I get a single result for each of my samples when I do this. If you get multiple results, choose the metagenome sample (see image below).

If you want to double-check you have the right sample, click on the "Sample accession" link (which will start with "ERS"). If you then click the "Attributes" tab, you should be able to see your metadata. For example, I know I live in Los Angeles, so my state better not be GA.

Step #3: Download Your FASTQ Files

If you are certain you have the right sample, click the link for "Fastq files (ftp)" to start the download. Note that the sample will be labeled based upon the "Run accession" (starting with "ERR"). For example, here are the different IDs for my samples:

Fecal: 000002683 --> ERS345317 --> ERR336561
Oral: 000002684 --> ERS344890 --> ERR336138

The .fastq files will be compressed, so you should unzip them. I would recommend using 7zip for this. I would also recommend renaming your files (like fecal.fastq and oral.fastq) to make it easier for you to keep track of them.
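If you prefer a script to 7zip, the unzip-and-rename step can be sketched in Python. This is a minimal sketch under two assumptions that are not stated above: that the downloads are gzip-compressed, and that they are named after the run accession (for example `ERR336561.fastq.gz`). The function name `decompress_fastq` is made up for illustration; check your actual file names before running it.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastq(gz_path, out_path):
    """Decompress a gzip-compressed FASTQ file to out_path and return that path."""
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        # Copy in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(source, target)
    return Path(out_path)


# Run accessions from the example above, mapped to easier-to-remember names.
RUNS = {"ERR336561": "fecal", "ERR336138": "oral"}

if __name__ == "__main__":
    for run, label in RUNS.items():
        compressed = Path(f"{run}.fastq.gz")  # assumed download file name
        if compressed.exists():
            print("Wrote", decompress_fastq(compressed, f"{label}.fastq"))
        else:
            print("Not found:", compressed)
```

The original compressed file is left in place, so you can delete it yourself once you have checked the output.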

Monday, October 21, 2013
