Charles Warden's Science Blog: May 2011

Monday, May 16, 2011

Modeling Bimodal Gene Expression

Since it is often challenging to estimate parameters for mixture models (such as those used to model bimodal gene expression), I thought it might be useful to discuss some of my successes using non-linear least sequares (NLS) regression to model bimodal gene expression.

Many scientists use maximum likelihood estimation (MLE) to model bimodal gene expression (such as Lim et al. 2002, Fan et al. 2005, Mason et al. 2011, etc.). My MLE model is based the code provided in this discussion thread, so I used the mle function from the stats4 package (which is a wrapper for the standard optim function).

On simulated data, both the MLE and NLS models estimate 37% of samples show over-expression (i.e. come from the distribution with the higher mean), which is very close to the true value of 36%:

Simulated Data

However, I have found that the mle function often returns error messages when working with real data and models built using the nls function tend to fit the data better than the MLE estimates. Even on the simulated data above, the NLS model appears to prove an ever so slightly better fit for the data (as might be expected because it directly models the density function).

In order to illustrate my point, I have also analyzed some genes exhibiting bimodal gene expression (as identified by Mason et al. 2011) in the GEO dataseries GSE13070. For example, here is a gene that I could model relatively well with NLS regression whereas I simply couldn't produce an MLE model:

Of course, both of these tools have their limitations. For example, the data has to have a pretty clean bimodal distribution (here are 3 examples of distributions that couldn't be modeling using either method: ACTIN3, ERAP2, and MAOA (different probe)). For the NLS model, I also had to set the variance to be equal for the two samples in order to produce a reasonable estimate of over-expression, but I believe this is usually a safe assumption.

Although I do not present the data in this blog post (because it contains unpublished results), I have also found NLS regression to be useful on other genes in several other datasets, So, I know NLS regression works well with more than just the one gene that I show above.

Also, for those that are interested, here is the source code that I used to produce all of the above figures.