Talks & Seminars
Title: Scaling up development and deployment of rule-based Information Extraction (IE) Systems
Dr. Ganesh Ramakrishnan, IIT Bombay and IBM India Research Lab
Date & Time: January 5, 2009 17:00
Venue: Conference Room, 'C' Block, 1st floor, Kanwal Rekhi building
Abstract:

Information Extraction (and likewise disambiguation) from terabytes of text is becoming an increasingly important requirement. In this talk I will focus on rule-based systems for this task and present two of my contributions in this area:

In the first part of the talk, I will discuss my contribution to scaling up the rule engine for the IE task. Current rule-based and machine-learning-based approaches to Information Extraction (IE) operate at the document level. I will present a radically different approach to information extraction which uses the inverted index typically created for rapid keyword-based searching of a document collection. We define a set of operations on the inverted index that allows us to create annotations defined by cascading regular expressions. The entity annotations for an entire document corpus can be created purely from the index, with no need to access the original documents. Experiments on two publicly available data sets show at least an order of magnitude speed-up over the document-at-a-time annotator in GATE (http://gate.ac.uk/). In addition, the index-based approach permits optimization of the order of index operations required for the annotation process. We present techniques which, based on estimated costs, can optimally select between different logically equivalent operator evaluation plans. I will also discuss issues in collectively optimizing the plans of several rules and present some solutions. All this discussion sets the stage for introducing RAD: a tool for Rapid Annotator Development on a document collection. I will conclude this section with an overview of the SystemT (http://www.alphaworks.ibm.com/tech/systemt) system that we have built at IBM Research, based on some lessons learnt from our IE scaling experiences.
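
To make the index-only idea concrete, here is a minimal sketch in Python. It is not the speaker's system: the function names (build_index, union, followed_by), the whitespace tokenizer, and the toy documents are all assumptions made for illustration. It shows how a pattern such as (Dr|Prof) followed by a name can be annotated by combining posting lists alone, without rescanning the documents once the index exists.

```python
from collections import defaultdict

def build_index(docs):
    """Positional inverted index: token -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.split()):
            index[token].append((doc_id, pos))
    return index

def union(index, terms):
    """Alternation, e.g. (Dr|Prof): spans (doc_id, start, end) for any term."""
    spans = set()
    for term in terms:
        spans.update((d, p, p) for d, p in index.get(term, []))
    return spans

def followed_by(left, right):
    """Concatenation: a left span immediately followed by a right span."""
    ends_by_start = defaultdict(list)
    for d, s, e in right:
        ends_by_start[(d, s)].append(e)
    return {(d, s, e2)
            for d, s, e in left
            for e2 in ends_by_start.get((d, e + 1), [])}

# Hypothetical toy corpus; the documents are read only while indexing.
docs = ["Dr Smith met Prof Jones", "Smith saw Dr Jones today"]
index = build_index(docs)

titles = union(index, ["Dr", "Prof"])
names = union(index, ["Smith", "Jones"])
person = followed_by(titles, names)  # a span set that later rules can reuse
print(sorted(person))
```

Because each operator returns a set of spans, the output of one rule can feed another, which is one way to read "cascading". An optimizer in the spirit of the abstract could, for example, choose which operand of followed_by to expand first based on estimated posting-list sizes; that step is not shown here.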

In the second part of the talk, I will briefly discuss a relational learning approach to rule construction and consolidation for extraction/disambiguation. The rules are mined using a relational learning technique called Inductive Logic Programming (ILP) and the consolidation is performed using a statistical classifier (Support Vector Machine) by propositionalizing the rules into 'features'. The process of propositionalization has been largely done either as a pre-processing step (in which a large set of possibly useful features is constructed first, and then a predictive model is constructed) or by tightly coupling feature construction and model construction (in which a predictive model is constructed with each new feature, and only those that result in a significant improvement in performance are retained). These represent two extremes. An interesting third perspective on the problem arises by taking a search-based view of rule construction. In this, we conceptually view the task as searching through subsets of all possible rules that can be constructed by the ILP system. Clearly an exhaustive search of such a space will usually be intractable. We resort instead to a randomised local search which repeatedly constructs randomly (but non-uniformly) a subset of rules and then performs a greedy local search starting from this subset. The number of possible features usually prohibits an enumeration of all local moves. Consequently, the next move in the search space is guided by the errors made by the model constructed using the current set of rules. This can be seen as sampling non-uniformly from the set of all possible local moves, with a view to selecting only those capable of improving performance. The result is a procedure in which a rule subset is initially generated in the pre-processing style, but further alterations are guided actively by actual model predictions: 'Rule Construction Using Theory-Guided Sampling and Randomized Search'.
We test this procedure on the language processing task of word-sense disambiguation. Good models have previously been obtained for this task using an SVM in conjunction with ILP rules constructed in the pre-processing style. Our results show an improvement on these previous results: predictive accuracies are usually higher, and substantially fewer rules are needed.
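
The randomised, error-guided search over rule subsets described above can be sketched roughly as follows. This is an illustrative simplification, not the procedure from the talk: hand-written Python predicates stand in for ILP-mined rules, a plain disjunction of the selected rules stands in for the SVM, and the names errors and local_search, the weighting scheme, and the toy data are all assumptions.

```python
import random

def errors(selected, rules, data):
    """Examples misclassified by the disjunction of the selected rules."""
    return [(x, y) for x, y in data
            if any(rules[i](x) for i in selected) != y]

def local_search(rules, data, restarts=5, steps=50, seed=0):
    rng = random.Random(seed)
    # Non-uniform sampling: rules agreeing with more labels are drawn more often.
    weights = [1 + sum(r(x) == y for x, y in data) for r in rules]
    best, best_err = set(), len(errors(set(), rules, data))
    for _ in range(restarts):
        k = rng.randint(1, len(rules))
        current = set(rng.choices(range(len(rules)), weights=weights, k=k))
        cur_err = len(errors(current, rules, data))
        for _ in range(steps):
            wrong = errors(current, rules, data)
            if not wrong:
                break
            x, y = rng.choice(wrong)  # the next move is guided by an error
            if y:   # false negative: consider adding a rule that covers x
                moves = [current | {i} for i in range(len(rules))
                         if i not in current and rules[i](x)]
            else:   # false positive: consider dropping a rule that fires on x
                moves = [current - {i} for i in current if rules[i](x)]
            if not moves:
                continue
            candidate = rng.choice(moves)
            cand_err = len(errors(candidate, rules, data))
            if cand_err <= cur_err:  # greedy: keep only non-worsening moves
                current, cur_err = candidate, cand_err
        if cur_err < best_err:
            best, best_err = current, cur_err
    return best, best_err

# Hypothetical toy task: the label is "even or greater than 15".
data = [(x, (x % 2 == 0) or x > 15) for x in range(20)]
rules = [lambda x: x % 2 == 0, lambda x: x > 15,
         lambda x: x < 3, lambda x: x % 5 == 0]
best, err = local_search(rules, data)
print(sorted(best), err)
```

Only moves suggested by a misclassified example are ever evaluated, so the full set of local moves is never enumerated; that is the property the abstract emphasises.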

Speaker Profile:
Details about the speaker are available at http://www.cse.iitb.ac.in/~ganesh/