Monday, January 14, 2013

Testing Some Genomic Text-Mining Tools

I've wanted to learn more about the field of text-mining for some time, and I've recently had the goal of seeing if I can build a directed regulatory network by text-mining papers in the scientific literature.  Text-mining is certainly not a new area of genomics research: for example, Jenssen et al. 2001 built a co-citation network using text-mining.  However, this sort of network can *not* be used to identify genes that have been shown to activate or inhibit one another, and I was curious to see how easy it would currently be to build such a network.

I tested two text-mining tools for a couple example queries: iHOP and PolySearch.

This is certainly not an exhaustive test of all the text-mining tools currently available, but these two tools had nice user interfaces and a relatively high number of citations.  For those who are interested, here are some reviews that I came across in my literature search: Altman et al. 2008, Ananiadou et al. 2010, Hoffmann et al. 2005, Jensen et al. 2006, Krallinger and Valencia 2005, and Skusa et al. 2005.

I found that the test query for progesterone receptor (PGR) was a good example of the strengths and weaknesses of these two tools.

In both cases, estrogen receptor was found to be co-cited most frequently with PGR.  I was pleased with this result because the relationship between these genes is well characterized, and they are two markers commonly studied in breast cancer patients.

On the other hand, these tools weren't very helpful for identifying genes that are activated or inhibited by progesterone receptor.  I liked the fact that PolySearch allowed me to specify verbs to define the method of association, but I found that the search results for "activate; activated; activates" and "inhibit; inhibited; inhibits" produced practically identical gene lists.  As far as I could tell, iHOP didn't provide this feature for users, but I did write a short Perl script to parse sentences containing "activate" or "inhibit" (in the iHOP results), and I manually searched those sentences to try to find PGR targets.  This produced a small number of results, which were not particularly useful: for example, iHOP provided citations indicating that IGFBP-1 was both activated and inhibited by PGR.
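The verb-matching step of that script can be sketched as follows (in Python rather than Perl, for illustration).  The example sentences are invented, not actual iHOP output, and the regular expressions only cover the verb forms mentioned above:

```python
import re

# Hypothetical helper mirroring the approach described above: scan
# result sentences for regulatory verbs. The patterns match the verb
# forms "activate/activates/activated" and "inhibit/inhibits/inhibited".
ACTIVATE = re.compile(r'\bactivat(?:e|es|ed)\b', re.IGNORECASE)
INHIBIT = re.compile(r'\binhibit(?:s|ed)?\b', re.IGNORECASE)

def classify_sentence(sentence):
    """Return 'activation', 'inhibition', or None for a sentence."""
    if ACTIVATE.search(sentence):
        return 'activation'
    if INHIBIT.search(sentence):
        return 'inhibition'
    return None

# Invented example sentences, not real iHOP results
sentences = [
    "PGR activates IGFBP-1 transcription in endometrial cells.",
    "Expression of IGFBP-1 was inhibited by PGR in this cell line.",
    "PGR and ESR1 are commonly co-cited in breast cancer studies.",
]

for s in sentences:
    print(classify_sentence(s), "--", s)
```

Of course, this only flags candidate sentences; it says nothing about which gene is the source and which is the target, which is why the manual follow-up search was still necessary.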

In short, I think these tools do a good job of determining generic associations between genes by determining which genes are commonly co-cited.  However, it didn't appear that either of these tools would be a good solution for building a directed regulatory network.  I also found that each of these tools had distinct advantages and disadvantages: iHOP provided results much more quickly than PolySearch, but PolySearch can define a broader range of associations (such as gene-disease), and I liked the PolySearch query and results interface better than the iHOP interface.

To be clear, I am certainly not saying that it is impossible to accomplish my goal of defining a directed regulatory network via text-mining; I just did not find it to be practical for a text-mining neophyte like myself.  I should also specify that I was specifically interested in open-source tools - I know of some commercially available tools that provide this information (usually with the assistance of PhD-level curators), but I am trying to see if this can be done entirely in silico from scratch.  On the contrary, I am sure this is a feasible goal, and I would certainly appreciate any comments suggesting alternative ways to achieve it.

Wednesday, October 10, 2012

Summary of "Updated Phylogenetic Analysis of Polyomavirus-Host Co-Evolution"

I recently published a short article in the Journal of Bioinformatics and Research that investigated host switching in polyomaviruses (which you can also download here).

The analysis was pretty straightforward, although I thought it provided a nice example of how Mauve can help supplement traditional phylogenetic analysis.  Also, it looks like most polyomavirus phylogenies in virology journals compare the divergence of individual protein sequences, but I found that analysis of the genomic sequence seems to provide more useful, consistent results.

In the interests of full disclosure, part of the reason I want to plug this paper is that I am on the editorial board for this journal.  That said, I do honestly think it is a journal worth considering if you have a paper that isn't a good fit for a more established journal: it is open-access, turn-around time is quick, and it only costs $100 in processing charges for an accepted manuscript.

Monday, October 1, 2012

My DREAM Model for Predicting Breast Cancer Survival

This summer, I have worked on submitting a few models to the DREAM competition for predicting breast cancer survival.

Although I was originally planning to post about my model after the competition was finished, I decided to go ahead and describe my experience because 1) my model honestly didn't differ radically from the example model and 2) I don't think I have enough time to redo the whole model-building process on the new data before the 10/15 deadline.

To be clear, the performance isn't all that different for the old and new data, but there are technical details that would have to be worked out to submit the models (and I would want to take time to re-examine the best clinical variables to include in the model).  For example, here are the concordance index values for my three models on the training dataset:

 
 
                 Old Data   New Data
CWexprOnly       0.64       0.60
CWfullModel      0.72       NA
CWreducedModel   0.71       0.68
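For anyone unfamiliar with the concordance index used above, here is a minimal sketch of how it is computed: it is the fraction of comparable patient pairs in which the model's risk ordering agrees with the observed survival ordering.  This toy version ignores censoring and ties in the risk scores for simplicity, and the data are invented, not the METABRIC values:

```python
def concordance_index(times, risks):
    """Fraction of comparable pairs where higher predicted risk
    corresponds to shorter survival (censoring ignored for simplicity)."""
    concordant = comparable = 0
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            if times[i] == times[j]:
                continue  # tied survival times are not comparable
            comparable += 1
            # shorter survival should pair with higher predicted risk
            if (times[i] < times[j]) == (risks[i] > risks[j]):
                concordant += 1
    return concordant / comparable

times = [5, 3, 9, 1]          # toy survival times
risks = [0.4, 0.7, 0.1, 0.9]  # toy predicted risk scores
print(concordance_index(times, risks))  # 1.0 (perfectly concordant)
```

A value of 0.5 corresponds to random predictions, so the 0.6-0.7 range in the table represents a modest but real improvement over chance.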

The old models are supposed to be converted to work on the new data.  If this happens, I'll be able to see the performance of these models on the future datasets (an additional METABRIC test dataset plus a new, previously unpublished dataset).  That would certainly be cool, but this conversion has not yet happened.

In general, my strategy was to pick the gene expression values that correlated most strongly with survival, and I then averaged the expression of probes either positively or negatively correlated with patient survival.  On top of this, I further filtered the probes to only include those that vary between high- and low-grade patients.  My qualitative observation from working with breast cancer data has been that genes that vary with multiple clinically relevant variables seem to be more reproducible in independent cohorts.  So, I thought this might help when examining the true, new validation set.  However, I gave this much smaller weight than the survival correlation (I required the probes to have a survival correlation FDR < 1e-8 and a |correlation coefficient| > 0.25, but I only required the probes to also have a differential grade FDR < 0.01).
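The metagene construction can be sketched as follows.  The simulated data and the simple |r| > 0.25 cutoff are illustrative only; the actual analysis used FDR-adjusted p-values and the additional differential-grade filter described above:

```python
import random

# Toy sketch of the metagene construction: correlate each probe with
# survival, then average positively and negatively correlated probes
# into one metagene each. All data here are simulated.
random.seed(0)
n_patients = 50
survival = [random.gauss(0, 1) for _ in range(n_patients)]

def make_probe(signal):
    """Simulated probe: noise plus an optional survival-linked component."""
    return [random.gauss(0, 1) + signal * s for s in survival]

probes = (
    [make_probe(+1) for _ in range(5)]      # positively correlated probes
    + [make_probe(-1) for _ in range(5)]    # negatively correlated probes
    + [make_probe(0) for _ in range(190)]   # pure noise
)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = [pearson(p, survival) for p in probes]

def metagene(selected):
    """Average expression across the selected probes, per patient."""
    return [sum(vals) / len(selected) for vals in zip(*selected)]

pos_metagene = metagene([p for p, ri in zip(probes, r) if ri > 0.25])
neg_metagene = metagene([p for p, ri in zip(probes, r) if ri < -0.25])
print(len(pos_metagene), len(neg_metagene))  # one value per patient
```

Collapsing many probes into two metagene values per patient also keeps the downstream Cox models small, which should help guard against over-fitting.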

So, these three models can be described as:

CWexprOnly: Cox regression; positive and negative metagenes only

CWfullModel: Cox regression; tumor size + treatment * lymph node positive + grade + Pam50Subtype + positive metagene + negative metagene

CWreducedModel: Cox regression; tumor size + treatment * lymph node positive + positive metagene

The CWreducedModel was used to see how much of a difference it made to only include the strongest variables (and to what extent the full model might be subject to over-fitting).  The CWexprOnly model was used to see how well the gene expression alone could predict survival, without the assistance of any clinical variables.

I included the treatment * lymph node positive variable because it defined a variable similar to the strongly correlated "group" variable, without making assumptions about which variables were the most important (and, as I would later learn, the "group" variable won't be provided for the new dataset).
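To illustrate what the "*" term means in the model formulas above: it expands into both main effects plus their elementwise product.  Here is a toy expansion of the CWreducedModel covariates; the variable encodings and values are assumptions for illustration, not the actual METABRIC column definitions:

```python
# Toy patients: (tumor_size, treatment, lymph_node_positive, pos_metagene).
# Binary encodings (1 = treated, 1 = node-positive) are assumed here.
patients = [
    (2.1, 1, 1, 0.35),
    (1.4, 0, 1, -0.12),
    (3.0, 1, 0, 0.08),
]

# The "treatment * lymph node positive" term contributes three columns:
# treatment, lymph node status, and their product (the interaction).
design = [
    (size, trt, ln, trt * ln, metagene)
    for size, trt, ln, metagene in patients
]
print(design[0])  # (2.1, 1, 1, 1, 0.35)
```

The interaction column is nonzero only for patients who are both treated and node-positive, which is what lets it stand in for a composite variable like "group" without hand-picking which combination matters.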

Additionally, one observation I made prior to the model-building process was how strongly the collection site correlated with survival (see below).  This variable wasn't defined by the individual patient, and I assumed it should represent technical variation (or at least something that won't be useful in a truly independent validation dataset).  The new data diminishes the impact of this confounding variable, but the correlation is still there.


 
                     Old Data   New Data
Collection Site       0.42       0.23
Group                -0.51      -0.45
Treatment             0.29       0.28
Tumor Size           -0.18       NA
Lymph Node Status    -0.24       NA

ER, PR, and HER2 status are also important variables.  However, PR and HER2 status were missing in the old data, and I didn't record the original ER correlation.  Therefore, they are among the variables that I don't report in the table above.  Likewise, the representation of the tumor size and lymph node status variables changed between the two datasets.

This was a valuable experience for me, and I'm sure the DREAM papers that come out next year will be worth checking out.  There were some details about the organization that I think could be improved (avoid changing the data throughout the competition, find a way to limit the model of models to avoid cherry-picking of over-fitted, non-robust models, and avoid providing rewards for intermediate predictions where users could cheat by using the publicly available test dataset).  Nevertheless, I'm sure the process will be streamlined if SAGE assists with the DREAM competition next year, and I think there will be some useful observations about optimal model building from the current competition.
 
Creative Commons License
Charles Warden's Science Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.