Many scientists use maximum likelihood estimation (MLE) to model bimodal gene expression (such as Lim et al. 2002, Fan et al. 2005, Mason et al. 2011, etc.). My MLE model is based the code provided in this discussion thread, so I used the mle function from the stats4 package (which is a wrapper for the standard optim function).

On simulated data, both the MLE and NLS models estimate 37% of samples show over-expression (i.e. come from the distribution with the higher mean), which is very close to the true value of 36%:

__Simulated Data__

In order to illustrate my point, I have also analyzed some genes exhibiting bimodal gene expression (as identified by Mason et al. 2011) in the GEO dataseries GSE13070. For example, here is a gene that I could model relatively well with NLS regression whereas I simply couldn't produce an MLE model:

Of course, both of these tools have their limitations. For example, the data has to have a pretty clean bimodal distribution (here are 3 examples of distributions that couldn't be modeling using either method: ACTIN3, ERAP2, and MAOA (different probe)). For the NLS model, I also had to set the variance to be equal for the two samples in order to produce a reasonable estimate of over-expression, but I believe this is usually a safe assumption.

Although I do not present the data in this blog post (because it contains unpublished results), I have also found NLS regression to be useful on other genes in several other datasets, So, I know NLS regression works well with more than just the one gene that I show above.

Also, for those that are interested, here is the source code that I used to produce all of the above figures.