Wednesday, March 13, 2013

Bioinformatics 101: Bioinformatics Journals

Here are some of the journals that I check on a weekly basis.  I would strongly recommend subscribing to the relevant RSS feeds (using something like Google Reader).

Bioinformatics / Computational Biology:


Other Journals:
  • Nature Methods and Nature Biotechnology - not specific to bioinformatics, but many important programs / protocols are published here
  • PLOS ONE - general subject journal, but it has some good bioinformatics articles
  • PeerJ - similar to PLOS ONE, but uses a membership system (so, you pay by author instead of by article)
  • Nature, Science, PNAS, etc.
Tutorials / Blogs:

  • OpenHelix - tutorials for popular programs; some free, some require subscription
    • OpenHelix Blog - this covers tutorials and FAQs for common bioinformatics tools. I mostly read it for the Friday SNPpets (collection of popular weekly Twitter feeds)
  • Omixon Blog - Bioinformatics company that provides free tutorials for common tools
  • Core Genomics - "personal blog written by James Hadfield who runs a Genomics core facility Cambridge" - lots of interesting technical details about next-generation sequencing
  • MassGenomics - medical genomics blog by Dan Koboldt, a staff scientist at the Genome Institute at Washington University. Consistently great article reviews.
  • Genomes Unzipped - popular blog run by several genomics researchers. I would argue that it was made popular by Daniel MacArthur (who doesn't post there as often now), but there are still other contributors who keep the blog up to date.
  • Getting Genetics Done - a well-maintained blog written mostly by Stephen Turner (Bioinformatics Core director at University of Virginia). Focuses mostly on providing technical suggestions.
  • NIH Bioinformatics Support System - probably doesn't have a feed, but contains useful tutorials

Bioinformatics 101: RNA Sequence Analysis

miRNA Resources:

  • miRBase
    • free database of miRNA sequences
  • TarBase
    • free database of experimentally validated miRNA targets
  • miRecords
    • database of miRNA-target interactions
  • IPA miRNA-target analysis
    • commercial database that includes free databases as well as a proprietary list of miRNA-target interactions found using text-mining of the literature
  • TargetScan
    • free tool to predict miRNA targets
  • sylArray
    • tool to predict miRNA targets from gene expression data.  Uses gene ranking, so it doesn't require mRNA differential expression (although you will need to check that the miRNA regulator is differentially expressed)
In general, I think you really need both miRNA expression and mRNA expression data to get reliable results when trying to identify miRNA-target interactions.

RNA-Seq Splicing Events:

  • JunctionSeq - extends DEXSeq to include junction coverage (including junctions not defined among isoforms in reference database).  Strictly speaking, it only calls differential exon and junction coverage (and provides a statistic at the gene-level), but the junction coverage can be helpful in identifying some other types of splicing events.
  • MATS - Provides differential splicing for skipped exon (SE), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), mutually exclusive exons (MXE), and retained intron (IR)
  • MISO - Provides single-sample and differential splicing for skipped exon (SE), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), mutually exclusive exons (MXE), retained intron (IR), tandem 3' UTRs (TandemUTR), alternative first exon (AFE), and alternative last exon (ALE)

RNA Secondary Structure:

RNA Domain Homology:

  • Rfam
    • may be helpful in predicting function of a non-coding RNA of unknown function

de novo Assembly Algorithms (RNA-Seq):

  • Oases
  • Trans-ABySS
  • Trinity
  • eXpress - mRNA quantification tool that works with both de novo assembled transcripts and transcripts from direct genome alignment

  • FASTX-Toolkit - popular suite of tools to quantify and manipulate sequences in .fastq and .fasta files
  • samtools - popular suite of tools to quantify and reformat .sam/.bam files
  • Picard - Java-based implementation of samtools; CollectRnaSeqMetrics can produce a coverage plot (normalized per start to end of transcript)
  • RSeQC - package to produce a variety of RNA-Seq QC figures

General RNA-Seq Analysis

Bioinformatics 101: Literature / Text Mining

Search Engines:
  • PubMed
    • popular, free tool provided by NCBI to search biomedical journal articles
    • includes links to connected NCBI resources (GEO, RefSeq, etc.)
  • Google Scholar
    • popular, free tool to search the scientific literature
    • provides citation information
    • allows authors to create their own bibliographies (which provide author-level citation metrics) 
Gene-Centric Information:
  • NCBI Gene
    • free tool curated by the NCBI
    • includes literature citations, Gene Ontology categories, alternative and official gene symbols, etc.
  • iHOP (Information Hyperlinked Over Proteins)
    • free text-mining program that predicts interactions between genes
  • PolySearch
    • free text-mining program that predicts interactions between genes, diseases, drugs, metabolites, SNPs, pathways, and/or tissues
  • IPA (Ingenuity Pathway Analysis)
    • commercial program curating information about genes, metabolites, etc.
    • most popular use is for functional enrichment analysis, but it can also be used as a general tool for searching the literature

Bioinformatics 101: Pathway Analysis

Gene List Enrichment Tools (Requires Differential Expression Analysis):

Other Systems-Level Analysis Tools (No Upstream Filtering Necessary):

Bioinformatics 101: Gene Expression Analysis

Differential Expression Tools:

  • R - statistical programming language
    • most common statistical functions (t-test, ANOVA, etc.) are built in
    • Bioconductor - suite of R packages used for bioinformatic analysis
      • limma - most commonly used differential expression tool for microarray analysis
      • edgeR - R package for RNA-Seq differential expression analysis
      • DESeq - R package for RNA-Seq differential expression analysis
  • cuffdiff
    • differential expression package within cufflinks
    • cufflinks provides transcript abundance calculations
    • strictly speaking, the developers recommend using cuffdiff for differential expression, although it is relatively common to use edgeR, DESeq, etc. for differential expression following mRNA quantification via cufflinks
  • Java TreeView
    • free tool for clustering microarray data
  • OCplus - R package for statistical power calculations (and differential expression) for microarray studies
  • Scotty - web-based tool for statistical power calculations for RNA-Seq data
  • Partek Genomics Suite
    • Commercial program that includes a number of workflows, such as microarray gene expression and RNA-Seq analysis
    • Includes statistics for differential expression analysis as well as tools for downstream functional analysis and upstream quality control assessment
lncRNA Resources:

  • MiTranscriptome - known and novel lncRNAs with cancer-associated profiles
  • TANRIC - TCGA and CCLE expression analysis for lncRNAs (including correlations with protein-coding genes and miRNAs)
  • Expression Atlas - gene expression profiles for known genes across various datasets
  • lncrnadb - includes additional annotations for known lncRNAs (such

Transcription Factor Motif Analysis:

  • IPA Upstream Regulator Analysis
    • Commercial tool that searches for enrichment of known targets for regulatory genes and molecules (such as transcription factors)
    • Can also detect if targets are consistent with activation or inhibition of the regulator
    • free tool that identifies upstream motifs enriched for gene lists
    • works on a wide variety of species, so it is useful for motif finding in less commonly studied organisms
  • Whole Genome rVISTA - calculate enrichment of transcription factor motifs predicted based upon evolutionary conservation
  • TRED (Transcriptional Regulatory Element Database) - database from CSHL for transcription factors.  Includes target gene lists for transcription factors in human, mouse, and rat
  • TRANSFAC - database of transcription factor motif sequences.  There are commercial and open-source versions of the database
  • JASPAR - open-source database of transcription factor motif sequences
General RNA-Seq Information:

Microarray Annotation Resources:
  • NetAffx
    • Affymetrix resource for probe design information
    • registration is free but required
  • GeneAnnot
    • an alternative resource for Affymetrix probe annotations

Bioinformatics 101: Protein Analysis

Protein Domain / Structure / Homology Tools:

3D-Structure Viewers:

Mass Spectrometry:
  • PRIDE - mass-spectrometry sample database managed by EMBL-EBI
  • PeptideAtlas - database for mass spectrometry data - includes links to relevant publications
  • MaxQuant - popular tool for mapping proteomics spectra from mass spectrometry data
  • ProteinProphet - another popular tool for mapping proteomics spectra to proteins
  • DanteR - R implementation of the popular DAnTE algorithm for differential expression of mass spectrometry proteomics data
  • LabKey / CPAS - open-source LIMS + basic analysis pipeline
  • PIR - UniProt Protein Information Resource: includes links to databases and peptide mapping tools
Protein-Protein Interaction Databases:
  • IntAct - database for protein-protein interactions
  • BioGRID - database of genetic and protein interactions
  • MINT - protein-protein interaction database
  • STRING - database for known and predicted protein-protein interactions
  • HIPPIE - database of human protein-protein interactions, integrating data from several other databases

  • STITCH - database of drug-protein interactions
  • PaxDb - database of protein expression across different tissues and organisms
  • MOPED - database of protein expression across different tissues and model organisms

Bioinformatics 101: Genomic Databases

Genomic Annotations:

Systems Biology Databases:
  • Gene Ontology (GO)
    • Database of functional annotations for protein-coding genes
  • KEGG - Kyoto Encyclopedia of Genes and Genomes
    • primarily used as a pathway database
  • IntAct - database for protein-protein interactions
  • Reactome
  • Regulome Explorer - software to visualize integrative genomic data from the TCGA project
  • BioGRID - database of genetic and protein interactions
  • MINT - protein-protein interaction database
  • STRING - database for known and predicted protein-protein interactions
  • STITCH - database of drug-protein interactions

Microarray / Sequencing Databases:

  • GEO - microarray database
  • ArrayExpress - microarray database
  • SRA - sequencing archive; entries are often also indexed in GEO
  • BioGPS - similar to NCBI Gene, but also includes normal tissue expression levels (from microarray data)
  • TiGER - tissue-specific gene expression database
  • CellMiner - query NCI-60 cell line data
  • TCGA Data Portal - integrative genomic data for large cancer datasets

Genomic Variation Databases:

Disease-Centric Databases:

  • General
    • OMIM - Online Mendelian Inheritance in Man
      • database of human diseases
    • SIDER - EMBL side effect database
  • Cancer
    • cBioPortal
      • User-friendly interface for querying cancer datasets (including TCGA data)
    • TCGA - The Cancer Genome Atlas
      • includes microarray and sequencing data
    • Oncomine
      • database of gene expression and copy number data from patients
      • basic access is free, but license is required for premium access
    • caArray - NCI Cancer Database

Protein Databases:

Bioinformatics 101: Image Analysis

Microscopy Image Analysis / Visualization:

  • ImageJ - NIH image viewer and analysis tool
    • Fiji - Fiji Is Just ImageJ
      • ImageJ wrapper containing a number of plug-ins for advanced analysis
  • CellProfiler
  • CellProfiler Analyst
    • tool for high-throughput image analysis
  • LSM Image Viewer
    • free software to view .lsm images
    • more advanced software is commercially available

General Tools
  • Inkscape
    • open-source alternative to Adobe Illustrator
    • Useful for creating figures for papers

Bioinformatics 101: DNA Sequence Analysis

Genome Visualization Tools:

  • UCSC Genome Browser
    • popular, free genomic visualization tool for a wide variety of organisms
    • also serves as a database for genomic sequences and features
  • Integrative Genomics Viewer (IGV)
    • very efficient tool for visualizing almost any type of genomic data
    • open-source
  • Gbrowse - open-source genome browser
  • Circos
    • circular genome plot
      • Especially useful for plotting genomic interaction results
    • official code has a steep learning curve, but you have a lot of options for precise formatting
    • also implemented in RCircos
  • POMO
    • creates an image similar to a Circos plot
    • I consider the input file much more intuitive than Circos configuration files, and plots are created via a web interface (instead of a local installation)
    • can be used to plot data from multiple species
    • I would recommend using Firefox; I've had some problems with Chrome and IE

Sequence Alignment:

  • BLAST - search for similar DNA sequences in GenBank
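  • ClustalW - multiple sequence alignment (for example, of homologous sequences from several species)
  • T-Coffee - multiple sequence alignment (for example, of homologous sequences from several species)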
  • Mauve - multi-species alignment and visualization tool to detect segments of conserved sequence

General DNA-Seq Tools:

  • samtools
    • popular, free tool to extract data from .SAM alignment files
    • Picard - Java-based version of samtools
    • see short read aligners necessary for upstream analysis
  • Galaxy
    • open-source, cloud-based suite of popular sequence analysis tools (including deep sequencing analysis)
  • GATK
    • toolkit for analysis of next-generation sequencing data
    • previously open-source, but now requires a commercial license
  • CLC Bio Genomics Workbench
    • commercial software covering a wide variety of applications such as sequence alignment, SNP/DIP detection, de novo assembly, etc.
    • CLC Bio Genomics Workbench also has the functionality of CLC Bio Main Workbench for standard sequencing analysis (cloning, primer design, etc.)
      • both are commercial programs that require a purchased license
  • SeqAnswers Software List
Copy Number / Indel Tools:

    • My favorite tool for making copy number calls in exon capture data
    • However, you will want to analyze a pool of samples (say >10) because it is not ideal for analysis of one or two samples. Can also create .bed files to import into DNAcopy.
  • PennCNV
    • Suite of tools for calling copy number alterations from microarray data
    • Includes segmentation algorithm that considers LRR and BAF values
    • PennCNV-Affy is particularly useful for processing Affy SNP chip data
    • PennCNV2 is designed to handle tumor-normal paired data, but I currently prefer the single-sample analysis from the original PennCNV package
    • Tool for calling somatic copy number alterations from SNP chip data
    • estimates tumor purity
  • DNAcopy
    • Bioconductor package that makes copy number calls (either for single sample or log2ratio for paired samples).
    • Works for either microarray or NGS data
  • ExomeCopy
    • Bioconductor package that can make copy number calls directly from .bam files.
    • I have found it most useful to produce copy number counts that I can then use for analysis in DNAcopy
  • Nexus Copy Number
    • commercial software for analysis of copy number alterations
    • works for a variety of microarray platforms as well as for deep sequencing analysis
  • VarScan
    • Can make copy number calls for individual or paired samples (as well as SNP/small indel calls).
    • The individual copy number output is basically the same as a .pileup file, but the somatic calls are relatively useful
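
For the paired-sample case mentioned above (the log2ratio input for DNAcopy), the underlying arithmetic can be sketched in a few lines of Python. This is only a minimal sketch and not the code used by DNAcopy, ExomeCopy, or VarScan: the function name log2_ratios, the pseudocount, and the simple library-size normalization are my own assumptions for illustration.

```python
import math


def log2_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Return per-region log2(tumor / normal) after library-size scaling.

    tumor_counts and normal_counts are read counts for the same regions
    (for example, exons) in the same order. The pseudocount is an assumed
    convenience to avoid taking the log of zero; it is not a setting taken
    from any specific tool.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("count lists must cover the same regions")

    tumor = [c + pseudocount for c in tumor_counts]
    normal = [c + pseudocount for c in normal_counts]

    # Scale each sample by its total so sequencing depth cancels out.
    tumor_total = sum(tumor)
    normal_total = sum(normal)

    return [
        math.log2((t / tumor_total) / (n / normal_total))
        for t, n in zip(tumor, normal)
    ]
```

Values near 0 suggest no change relative to the normal sample, positive values suggest a gain, and negative values suggest a loss. A segmentation step (which is what a package like DNAcopy provides) would still be needed to turn these per-region values into copy number calls.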

Transcription Factor Motif Analysis:

    • database of transcription factor motifs
    • a subscription is required to access the most recent annotations, but older versions are freely available
    • A plug-in is available within CLC Bio (a commercial program for genomics analysis)
    • free database of transcription factor motif sequences
  • TFsitescan
    • free tool to search for transcription factor motifs
  • MEME Suite
    • tools for ab initio motif finding
  • rVista / VISTA Suite
    • tool for searching motifs conserved across closely related organisms
  • TESS
    • transcription factor search system
    • unfortunately, this tool now has to be run locally

Mutation Analysis:
  • VarScan
    • open-source variant calling tool
    • see short read aligners necessary for upstream analysis
    • usually also requires something like samtools to create input file (.pileup file)
  • SeattleSNPs Genome Variation Server
    • tool to filter candidate variants (based upon frequency, predicted function, etc.)
  • ANNOVAR (pronounced Anno-Var)
    • tool to filter candidate variants (based upon frequency, predicted function, etc.)
    • wANNOVAR is the web-based interface
  • GWAS Catalog
    • NHGRI database of SNP-based phenotypic / disease associations
  • Promethease
    • open-source tool for personalized genomic analysis
    • it is technically free to use, but you can pay $5 to get your report more quickly
    • uses annotations from SNPedia
  • Interpretome
    • Genome interpretation tool similar to Promethease
    • In my opinion, it has a nicer interface.  However, it currently only works with raw data from 23andMe and Lumigenix.
  • SNPedia
    • crowd-sourced annotation of SNP associations
    • includes some publicly available genomes
ChIP-Seq Tools:

de novo Assembly Algorithms:

Other Tools:
  • Primer3 - PCR primer design
  • RepeatMasker - identifies repetitive elements within a DNA sequence
  • Webcutter - detects restriction enzyme sites in a DNA sequence
  • Translate - a tool that allows translation of nucleotide (DNA / RNA) sequence into a protein sequence
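
If you just want to see what a translation tool like the one listed above is doing, the basic idea can be sketched in Python. This is a minimal sketch that assumes the standard genetic code and reading frame 1 only; the names CODON_TABLE and translate are mine (not from the Translate tool), and unrecognized codons are reported as "X" as an arbitrary choice.

```python
# Standard genetic code, listed in the conventional T, C, A, G order
# for the first, second, and third codon positions.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(sequence):
    """Translate a DNA or RNA sequence in frame 1.

    Stop codons are written as '*', codons with ambiguous bases as 'X',
    and a trailing incomplete codon is ignored.
    """
    dna = sequence.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

For example, translate("ATGGCCTAA") walks through the codons ATG, GCC, and TAA. A real tool will typically also report the other five reading frames and support alternative genetic codes.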

Bioinformatics 101: Short Read Aligners

General Purpose Aligners:

  • BWA
  • Bowtie
  • Novoalign
    • commercial software covering a variety of alignment needs (RNA-Seq, miRNA-Seq, DNA-Seq, BS-Seq, etc.)
    • some functionality is also available in the free version

RNA-Seq Aligners:

BS-Seq Aligners:

Sequencing Technology Tutorials:

Bioinformatics 101: General Coding Information






  • C and R:
  • My Notes
    • With the g++ compiler, the name of the binary output is set with "-o"
      • You can use the "-g" option to include debugging information and "-Wall" to enable warning messages; error messages are reported either way
    • If mixing your code with open-source code, take the compiler into consideration.  For example, some string functions work when compiling with gcc but not with g++.


Bioinformatics 101: DNA Methylation Analysis

Enrichment-Based Analysis Tools:

Bisulfite-Conversion Based Analysis Tools:

Bioinformatics 101

I thought it would be nice to provide a set of links for bioinformatics resources that I find to be useful.  A lot of this information comes from the experience that I gained working in the Bioinformatics Core at City of Hope.  Unlike my other blog posts, I will come back and modify these lists over time (since new bioinformatics resources are constantly being developed).

In order to help organize this huge amount of information, I have divided my annotations into the following sub-posts (all with the "Bioinformatics 101" label):

Also, feel free to leave your own suggestions as comments on the relevant pages!

Sunday, March 10, 2013

Tracking My Thoughts Using the Affectiv Suite

I was most pleased with the Affectiv Suite in the EPOC control panel (click here for more technical details from Emotiv tech support), so I will briefly describe my experience in this post.

The excitement / calm signal commonly varied, whereas the engagement/disinterest and meditation signals were harder to change.

Preparing images for this blog post (typical for a random task)

The engagement / disinterest signal seemed to require a sustained change in attention for an extended period of time, which was sometimes accompanied by a clear change in the excitement / calm signal (for example, see the slight long-term increase in the image below).

Watching Futurama:

I wasn't really able to see any clear peaks in the meditation signal but I could see a decrease in the engagement signal when trying to relax (for example, see the long-term change in the image below).

Listening to music with eyes closed

In short, I think the engagement / disinterest signature was the most accurate, the excitement / calm signature may be too sensitive, and the meditation signature may not be sensitive enough.

Reading My Mind Using The Emotiv EPOC

Once I saw the TED talk on the Emotiv EPOC / EEG, I knew that I had to get my hands on that mind-reading gadget.

Emotiv offers two versions of the headset shown in the TED talk: the EEG (which allows users to access their raw EEG data) and the EPOC (which costs much less because it only allows you to run applications and doesn't give you access to the raw data).  Because I only had a casual interest in the topic, I bought the EPOC headset.

I tested the EPOC headset on myself as well as a few friends.  When my friends and I first saw the Cognitiv Suite, the response was similar to that of the crowd at the TED talk: people were impressed that the virtual box would move after training the first action. However, the excitement faded after using the Cognitiv Suite for a few minutes and trying to control multiple actions because it isn't highly accurate at detecting a specific thought. For example, if you train "push" and "left", you will probably see the box move towards you more than it moves left (or vice versa), and the action probably won't really be in sync with your thoughts.

In short, I didn't find the EPOC headset to be as cool as I was hoping it would be because it wasn't a very effective tool for providing mind control.  Nevertheless, I do think it is interesting to be able to visualize my brain activity (see the links to additional posts below).

Part #1: Review of EPOC Apps

Part #2: Tracking My Thoughts Using the Affectiv Suite

To be fair, my friends and I are only a small pool of test subjects.  For example, the Emotiv website lists papers published using data from the EEG and EPOC headsets, and you can find various videos on YouTube for interesting applications (for example, click here or here).  Likewise, I found this presentation that showcases some examples of EPOC headset data with a (simple) technical explanation about what is going on.  However, my own personal experience was more similar to this EPOC review or this article, which points out that researchers could use the device to make non-random predictions about users' PIN numbers but the predictions were not super accurate.

Finally, it may be worth noting that there are other devices with similar functionality (for example, see this list from Wikipedia), which is something that I didn't initially realize.  The Emotiv headset may really be the best option, but I would at least recommend researching some other options if you are interested in trying out a "mind reading" device.

Review of EPOC Apps

Useful / Interesting Apps:

1) EPOC Control Panel (Free) - the basic suite of tools: allows you to check the quality of the signal from the sensors on the EPOC headset, detect facial expressions (Expressiv Suite), detect mental states (Affectiv Suite), perform actions (Cognitiv Suite, see my first post), and detect head motions (mouse emulator).  You can see this letter from Emotiv tech support for a more detailed, technical explanation of these features.

The mouse emulator worked perfectly, but it uses a gyroscope in the headpiece (so, it doesn't actually depend on signals from your brain).  I could detect some interesting changes in the Affectiv Suite (see this related post), but I couldn't really get the Expressiv Suite or Cognitiv Suite to work well for me.  I could always get the headset set up correctly, but fewer than half of my friends could achieve this step (i.e. for the others, the computer couldn't read any signal from the headset).

I did have one friend where the Expressiv Suite seemed to work well, but it was never very accurate for me: it thought that I was always trying to smile, and the overall patterns just got worse when I tried to retrain the algorithm.

2) Emotiv EPOC Brain Activity Map ($9.95) - allows you to visualize the signal for alpha, beta, delta, and theta waves for all of the sensors in a 2D map.  Provides 3 different visualization types.  It would be nice if it could record averaged activity over time, but I think it is still much better than the more expensive 3D version.

3) subConch (all users - Free) - creates sounds that are supposed to match your mental state (e.g. low-pitched sound for a calm mind, high-pitched sound for an excited mind).  I also found that the application has its own website, which shows how the software is used in an art exhibition.  FYI, I almost listed it as a "Disappointing App" because it didn't initially install correctly - be sure to extract a compressed file after the installation closes (this is what I failed to do the first time).

Disappointing Apps:

1) Spirit Mountain Demo Game (Free) - you move an avatar around a 3D world from a first-person perspective, and you need to use the EPOC headset to accomplish various mind-controlled tasks.  To be clear, you move around the world with a mouse and keyboard: I initially thought you could control movement with your mind (and you probably can try to do this using some advanced option).  To be fair, I think it would be nearly impossible (at least for me) to achieve this level of control with the EPOC headset alone, and I did find a demo video where the narrator does explicitly say that he is using the mouse and keyboard.  It was also extremely slow on my computer (running Windows 7 with 1 GB of RAM).  I'm sure that I could modify the initial rendering options to improve this, but I honestly didn't think the game was fun enough to warrant the extra effort (probably because I already had so many difficulties with the Cognitiv Suite in the EPOC Control Panel).

2) Emotiv EPOC 3D Brain Activity Map (Standard Edition - $49.95) - Wraps 2D signal for alpha, beta, delta, and theta waves (as well as a customized wavelength) around a 3D head.  This is essentially the same as a 2D image because there is no calculated depth within the brain.  I found the interface a little buggier than the 2D version.  For example, using the scroll button to zoom in and out didn't really work: once you started to zoom in, you had to be either fully zoomed in or fully zoomed out (it didn't seem to measure gradations of zoom).

Also, part of the reason I wanted to check it out was because it came with the ability to record measurements over time.  This is true, but I think it is really just intended for you to be able to see different sides of the head at the same measurement point: it really isn't practical for being able to see how your activity changes over long periods of time (for example, a 30 minute recording would have to be viewed in real time).

Finally, I was absolutely furious when the program wouldn't initially install correctly.  I did eventually figure out how to solve the problem (the .exe extension was simply missing from the executable file) and tech support was prompt to offer a solution.  Nevertheless, I think this sort of thing might be reasonable to expect for a free app, but I would hope these sorts of problems would have been figured out for a program that costs $50!

3) MindKeyboard (Free) - the idea is that you can type using your mind by moving a cursor left and right along a string of letters (and using another feature, like the "push" action, to select a letter).  I liked the simple idea, but I could never get it to work (again, probably because I already had so many difficulties with the Cognitiv Suite in the EPOC Control Panel).

Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.