Charles Warden's Science Blog: Testing Some Genomic Text-Mining Tools

Monday, January 14, 2013

Testing Some Genomic Text-Mining Tools

I've wanted to learn more about the field of text-mining for some time, and my recent goal has been to see whether I can build a directed regulatory network by text-mining papers in the scientific literature. Text-mining is certainly not a new area of genomics research: for example, Jenssen et al. 2001 built a co-citation network using text-mining. However, this sort of network can *not* be used to identify genes that have been shown to activate or inhibit one another, and I was curious to see how easy it currently is to build such a network.

I tested two text-mining tools on a couple of example queries: iHOP and PolySearch.

This is certainly not an exhaustive test of all the text-mining tools currently available, but these two tools had nice user interfaces and a relatively high number of citations. For those who are interested, here are some reviews that I came across in my literature search: Altman et al. 2008, Ananiadou et al. 2010, Hoffmann et al. 2005, Jensen et al. 2006, Krallinger and Valencia 2005, and Skusa et al. 2005.

I found the test query for progesterone receptor (PGR) was a good example of the strengths / weaknesses of these two tools.

In both cases, estrogen receptor was found to be co-cited most frequently with PGR. I was pleased with this result because the relationship between these genes is well characterized, and they are two markers commonly studied in breast cancer patients.
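As a rough illustration of what "co-cited" means here, the idea can be sketched as counting how often two genes are mentioned in the same paper. This is not how iHOP or PolySearch compute their rankings (this post does not cover their internals); the function names, the toy paper list, and the placeholder symbols GENE_A and GENE_B are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(papers):
    """Count how many papers mention each unordered pair of genes."""
    counts = Counter()
    for genes in papers:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(genes)), 2):
            counts[(a, b)] += 1
    return counts

def top_partners(counts, gene, n=5):
    """Return the genes most often co-cited with `gene`."""
    partners = Counter()
    for (a, b), count in counts.items():
        if a == gene:
            partners[b] += count
        elif b == gene:
            partners[a] += count
    return partners.most_common(n)

# Toy input: each set is the genes mentioned in one (made-up) paper.
papers = [
    {"PGR", "ESR1"},
    {"PGR", "ESR1", "GENE_A"},
    {"ESR1", "GENE_B"},
]
counts = cocitation_counts(papers)
print(top_partners(counts, "PGR"))  # [('ESR1', 2), ('GENE_A', 1)]
```

Note that the pairs are unordered: a count like this says two genes appear together, but nothing about which one regulates the other.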

On the other hand, these tools weren't very helpful for identifying genes that are activated or inhibited by progesterone receptor. I liked the fact that PolySearch allowed me to specify verbs to define the method of association, but I found that the search results for "activate; activated; activates" and "inhibit; inhibited; inhibits" produced practically identical gene lists. As far as I could tell, iHOP didn't provide this feature for users, but I did write a short Perl script to parse sentences containing "activate" or "inhibit" (in the iHOP results) and I manually searched those sentences to try to find PGR targets. This provided a small number of results, which were not particularly useful: for example, iHOP provided citations that IGFBP-1 was both activated and inhibited by PGR.
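The Perl script itself is not shown in this post. As a minimal sketch of the kind of keyword filtering described above, here is a Python version; the function name, the regular expressions, and the example sentences (with placeholder symbols GENE_A and GENE_B) are assumptions for illustration, not the original script or real iHOP output.

```python
import re

# Hypothetical patterns covering common forms of the two verbs.
VERB_PATTERNS = {
    "activate": re.compile(r"\bactivat(?:e|es|ed|ing)\b", re.IGNORECASE),
    "inhibit": re.compile(r"\binhibit(?:s|ed|ing)?\b", re.IGNORECASE),
}

def filter_sentences(sentences):
    """Return (verb, sentence) pairs for sentences containing either verb."""
    hits = []
    for sentence in sentences:
        for verb, pattern in VERB_PATTERNS.items():
            if pattern.search(sentence):
                hits.append((verb, sentence))
    return hits

# Made-up sentences standing in for text-mining results.
example_sentences = [
    "PGR activates expression of GENE_A in cultured cells.",
    "PGR is inhibited by GENE_B.",
    "PGR and ESR1 are commonly measured together.",
]

for verb, sentence in filter_sentences(example_sentences):
    print(verb, "|", sentence)
```

A keyword match like this cannot tell which gene is the regulator and which is the target (the second example sentence mentions PGR as the gene being inhibited, not the inhibitor), which is why the matching sentences still have to be read by hand.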

In short, I think these tools do a good job of finding generic associations between genes by identifying which genes are commonly co-cited. However, it didn't appear that either of these tools would be a good solution for building a directed regulatory network. I also found that each tool had distinct advantages and disadvantages: iHOP provided results much more quickly than PolySearch, but PolySearch can define a broader range of associations (such as gene-disease), and I liked the PolySearch query and results interface better than the iHOP interface.

To be clear, I am certainly not saying that it is impossible to accomplish my goal of defining a directed regulatory network via text-mining. On the contrary, I am sure this is a feasible goal; I just did not find it to be practical for a text-mining neophyte like myself. I should also note that I was specifically interested in open-source tools. I know of some commercially available tools that provide this information (usually with the assistance of PhD-level curators), but I am trying to see if this can be done entirely in silico, from scratch. I would certainly appreciate any comments or suggestions about alternative ways to achieve this goal.