WWT: Table queries on the World Wide Web.

The Web today comprises billions of semi-structured objects, such as tables and lists, that are universally accepted as an idiom for expressing relational data, even for human consumption. These are usually of considerably higher quality than completely unstructured free-format text. In the WWT project we exploit such tables and lists for various query-driven structure extraction tasks. WWT assembles relational tables on the fly from a few seed rows by aligning, segmenting, and consolidating information from raw tables and lists on the Web. When manually created ontologies are available, we enrich raw tables with links to the ontology. We developed joint graphical models to annotate table cells with the entities they likely mention, table columns with the types from which the entities in their cells are drawn, and pairs of table columns with the relations they seek to express. WWT is unique in the way it taps tables to answer quantity queries whose target is a quantity with natural variation, such as battery life of ipad and half life of plutonium. WWT responds to such queries with a ranked list of quantity distributions, suitably represented. We use a probabilistic context-free grammar (PCFG) based unit extractor on the tables, and retain several top-scoring extractions of quantities and numerals. Many of the key subtasks, such as extraction, consolidation, and annotation, rely heavily on statistical machine learning. A highlight of WWT is that it can dynamically train statistical models given very limited and only indirect supervision.
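As a rough illustration of why retaining several top-scoring extractions can help with quantity queries, the sketch below pools the best few readings of each table cell into one weighted distribution over a canonical unit. It is not the WWT unit extractor or its inference model: the candidate readings, their scores, the `TO_HOURS` table, and the function name `quantity_distribution` are all hypothetical, and the scores simply stand in for whatever a PCFG-based extractor would assign.

```python
from collections import defaultdict

# Hypothetical conversion factors to a canonical unit (hours).
TO_HOURS = {"hour": 1.0, "minute": 1.0 / 60.0, "day": 24.0}


def quantity_distribution(cells, k=2):
    """Pool the k best readings of every cell into {value_in_hours: weight}.

    Each cell is a list of (value, unit, score) candidate readings.
    Every cell contributes a total weight of 1, split among its retained
    readings in proportion to their scores.
    """
    dist = defaultdict(float)
    for candidates in cells:
        # Drop readings whose unit we cannot convert, then keep the top k.
        known = [c for c in candidates if c[1] in TO_HOURS]
        kept = sorted(known, key=lambda c: c[2], reverse=True)[:k]
        total = sum(score for _, _, score in kept)
        if total == 0:
            continue
        for value, unit, score in kept:
            canonical = round(value * TO_HOURS[unit], 2)
            dist[canonical] += score / total
    return dict(dist)


# Three cells from different (made-up) tables about battery life.
cells = [
    [(10, "hour", 0.8), (10, "minute", 0.2)],
    [(600, "minute", 0.6), (600, "hour", 0.3), (600, "day", 0.1)],
    [(9.5, "hour", 0.9)],
]
dist = quantity_distribution(cells, k=2)
best = max(dist, key=dist.get)
```

In this toy input the second cell's unit is ambiguous, but because its alternative readings are kept rather than fixed early, the "600 minutes" reading reinforces the "10 hours" reading from the first cell.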

Publications

	Sunita Sarawagi and Soumen Chakrabarti. Open-domain quantity queries on web tables: Annotation, response, and consensus models. In ACM SIGKDD, 2014. [ bib \| .pdf ] Over 40% of columns in hundreds of millions of Web tables contain numeric quantities. Tables are a richer source of structured knowledge than free text. We harness Web tables to answer queries whose target is a quantity with natural variation, such as `net worth of zuckerburg`, `battery life of ipad`, `half life of plutonium`, and `calories in pizza`. Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. Apart from the challenges of informal schema and noisy extractions, which have been known since tables were used for non-quantity information extraction, we face additional problems of noisy number formats, as well as unit specifications that are often contextual and ambiguous. Early "hardening" of extraction decisions at a table level leads to poor accuracy. Instead, we use a probabilistic context-free grammar (PCFG) based unit extractor on the tables, and retain several top-scoring extractions of quantity and numerals. Then we inject these into a new collective inference framework that makes global decisions about the relevance of candidate table snippets, the interpretation of the query's target quantity type, the value distributions to be ranked and presented, and the degree of consensus that can be built to support the proposed quantity distributions. Experiments with over 25 million Web tables and 350 diverse queries show robust, large benefits from our quantity catalog, unit extractor, and collective inference.
	Rakesh Pimplikar and Sunita Sarawagi. Answering table queries on the web using column keywords. In Proc. of the 38th Int'l Conference on Very Large Databases (VLDB), 2012. [ bib \| .pdf ] We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML webpages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper. Given a set of Web tables T₁,...,T_n, and a query Q with q sets of keywords Q₁,...,Q_q, decide for each T_i if it is relevant to Q and if so, identify the mapping between the columns of T_i and query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism of exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a 25 million web table corpus shows significant boost in accuracy over baseline IR methods.
	Rahul Gupta and Sunita Sarawagi. Joint training for open-domain extraction on the web: Exploiting overlap when supervision is limited. In WSDM, 2011. [ bib \| .pdf ] We consider the problem of jointly training structured models for extraction from multiple web sources whose records enjoy partial content overlap. This has important applications in open-domain extraction, e.g. a user materializing a table of interest from multiple relevant unstructured sources; or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a huge labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap have been developed under settings that do not generalize easily to the scale and diversity of overlap seen on Web sources. We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e. shared text segments. We present alternatives within our framework to trade off tractability, robustness to noise, and extent of agreement enforced; and propose a scheme of partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources. Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training, and scores over alternatives such as collective inference, staged training, and multi-view learning.
	Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB), 2010. [ bib \| .pdf ] Tables are a universal idiom to express relational data, even for human consumption. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from "organic" Web tables need not be constrained by availability of precious editorial time. Unfortunately, in the absence of any formal, uniform schema imposed on Web tables, Web search cannot take advantage of these high-quality sources of relational information. In this paper we propose new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns are seeking to express. We propose a new graphical model for making all these labeling decisions for each table simultaneously, rather than make separate local decisions for entities, types and relations. Experiments using the YAGO catalog, DBPedia, tables from Wikipedia, and over 25 million HTML tables from a 500 million page Web crawl uniformly show the superiority of our approach. We also evaluate the impact of better annotations on a prototype relational Web search tool. We demonstrate clear benefits of our annotations beyond indexing tables in a purely textual manner.
	Rahul Gupta and Sunita Sarawagi. Answering table augmentation queries from unstructured lists on the web. In Proc. of the 35th Int'l Conference on Very Large Databases (VLDB), 2009. [ bib \| .pdf ] We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end to end approach which given the sample query rows - (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, (d) and presents to the user the consolidated records ranked by their estimated membership in the target relation. The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision. Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.
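To make the column-mapping task from the VLDB 2012 abstract above more concrete, here is a minimal sketch that decides, for a single table, whether it is relevant to a column-keyword query and which table column each query column maps to. It is a deliberate simplification and not the paper's method: it scores only header overlap, treats each table independently instead of jointly, and brute-forces the bipartite assignment with `itertools.permutations` as a stand-in for an efficient matching algorithm. The names `overlap`, `map_columns`, and `min_score` are hypothetical.

```python
from itertools import permutations


def overlap(keywords, header):
    """Jaccard overlap between query keywords and a column header."""
    a = set(keywords.lower().split())
    b = set(header.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_columns(query_columns, table_headers, min_score=0.5):
    """Return (is_relevant, mapping) for one table.

    mapping[i] is the index of the table column assigned to query
    column i, or None if no header shares a word with it. Each table
    column is used at most once (a one-to-one bipartite assignment).
    """
    q = len(query_columns)
    if q == 0:
        return False, []
    # Extra None slots let any query column stay unmatched.
    slots = list(range(len(table_headers))) + [None] * q
    best_score, best = -1.0, None
    for assignment in permutations(slots, q):
        score = sum(
            overlap(query_columns[i], table_headers[j])
            for i, j in enumerate(assignment)
            if j is not None
        )
        if score > best_score:
            best_score, best = score, assignment
    # Report zero-overlap assignments as unmatched.
    mapping = [
        j if j is not None and overlap(query_columns[i], table_headers[j]) > 0
        else None
        for i, j in enumerate(best)
    ]
    return best_score / q >= min_score, mapping


query = ["country name", "capital city"]
relevant, mapping = map_columns(query, ["Capital", "Population", "Country"])
other_relevant, other_mapping = map_columns(query, ["Player", "Team"])
```

The brute-force search is only practical for a handful of columns; for wider tables an assignment solver would be the natural replacement.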

Datasets

Annotation data: Annotated tables with entity, type, and relationship tags used in the above VLDB 2010 paper.

List segmentation dataset used in the WSDM 2011 paper above.

Query workloads used in the KDD 2014 paper

QuTree quantity catalog used in the KDD 2014 paper

Code

Unit parser code

Talk Slides

	Overview: Query-driven relation extraction from the semi-structured Web
	Annotating Web tables with entity, type, relationship links
	Exploiting overlap for extraction from multiple sources
	Extractions from lists

Group members

	Sunita Sarawagi
	Rahul Gupta
	Girija Limaye
	Rakesh Pimplikar
	Prashant Barole
