CSAW: Curating and Searching the Annotated Web

Our ambition is to annotate mentions of named entities on billions of Web pages with IDs, thus linking them to entity nodes in Wikipedia. This will enable searching with entities and relationships at an unprecedented scale. The project has two parts: annotating token segments on Web pages with Wikipedia entity IDs, and a new aggregated search mechanism for quantities.

Papers and Talk Slides

Demo etc.

Related projects, services, products, links

Data

Quantity consensus queries (QCQs)

Download the set of queries used in the experiments. The XML format is specified below.

<iitb.QuantitySearch.Query>
  <queryID>A unique string id of the query</queryID>
  <numericID>A unique numeric id of the query</numericID>
  <queryString>Query string that was used to search the index</queryString>
  <descriptionString>Detailed description of the query string. 
     This was not used in the processing</descriptionString>
  <standardUnitName>The type of the answer quantity</standardUnitName>
  <answerSet>
    <!-- List of ground truth quantities -->
    <string>Ground truth quantity</string>
  </answerSet>
</iitb.QuantitySearch.Query>
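
A minimal sketch of reading queries in this format with Python's standard library may help. It assumes that the downloaded file wraps the query elements in a single root element. The `<queries>` wrapper, the function name `parse_queries`, and all sample values below are hypothetical and are not taken from the actual data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <queries> wrapper and every field value are invented.
SAMPLE = """<queries>
  <iitb.QuantitySearch.Query>
    <queryID>q-example</queryID>
    <numericID>1</numericID>
    <queryString>example query</queryString>
    <descriptionString>Not used in the processing</descriptionString>
    <standardUnitName>year</standardUnitName>
    <answerSet>
      <string>15</string>
      <string>17</string>
    </answerSet>
  </iitb.QuantitySearch.Query>
</queries>"""


def parse_queries(xml_text):
    """Return one dict per <iitb.QuantitySearch.Query> element."""
    root = ET.fromstring(xml_text)
    queries = []
    # iter() also visits the root itself, so a file holding a single
    # unwrapped query element is handled as well.
    for q in root.iter("iitb.QuantitySearch.Query"):
        queries.append({
            "query_id": q.findtext("queryID", default="").strip(),
            "numeric_id": int(q.findtext("numericID")),
            "query_string": q.findtext("queryString", default="").strip(),
            "unit": q.findtext("standardUnitName", default="").strip(),
            # Ground truth quantities are kept as strings, as in the file.
            "answers": [s.text.strip()
                        for s in q.findall("answerSet/string") if s.text],
        })
    return queries


queries = parse_queries(SAMPLE)
print(queries[0]["numeric_id"], queries[0]["unit"], queries[0]["answers"])
```

The answers are left as strings because the format above does not say how ground truth quantities are written.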

QCQ snippets

We built a corpus of around 70,000 Web pages by crawling the search result URLs returned by the Google search engine for our queries. The corpus was indexed using Lucene. For each query, at most 100 snippets were obtained from the Lucene index. The list of snippets obtained for each query is available. The format is as follows.

<SnippetSet qid="The numeric Id of the query">
  <snippet>
    <id>A numeric id of the snippet</id>
    <text>Snippet text</text>
    <ans>Candidate answer of the snippet</ans>
      <!-- Parsed quantity in double format.
           Two quantities are used to capture range answers, e.g. 15-17 -->
      <quantity1>...</quantity1>
      <quantity2>...</quantity2>
  </snippet>
  <snippet> ... </snippet>
</SnippetSet>
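
As an illustration, a snippet file in this format could be loaded roughly as follows. This is a sketch under assumptions: it takes `quantity1` and `quantity2` to be direct children of `<snippet>` and to be present for every snippet. How single-valued (non-range) answers are encoded is not stated here. The function name `parse_snippet_set` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; all values are invented for illustration.
SNIPPET_SAMPLE = """<SnippetSet qid="1">
  <snippet>
    <id>7</id>
    <text>typically lasts 15-17 years</text>
    <ans>15-17</ans>
    <quantity1>15.0</quantity1>
    <quantity2>17.0</quantity2>
  </snippet>
</SnippetSet>"""


def parse_snippet_set(xml_text):
    """Return (numeric query id, list of snippet dicts)."""
    root = ET.fromstring(xml_text)
    qid = int(root.attrib["qid"])
    snippets = []
    for s in root.findall("snippet"):
        snippets.append({
            "id": int(s.findtext("id")),
            "text": s.findtext("text", default=""),
            "ans": s.findtext("ans", default=""),
            # Assumption: both quantity fields are always present and numeric.
            "range": (float(s.findtext("quantity1")),
                      float(s.findtext("quantity2"))),
        })
    return qid, snippets


qid, snippets = parse_snippet_set(SNIPPET_SAMPLE)
print(qid, snippets[0]["id"], snippets[0]["range"])
```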

LETOR-format QCQ data

LETOR format QCQ data is available. Details of the LETOR format can be found at the LETOR Web site. Snippets were annotated with binary relevance judgments by volunteers.

The data contains feature vectors and relevance labels of snippets in LETOR format. It also contains feature vectors and (discretized) relevance labels of intervals. This feature vector representation of intervals was used in our Interval Ranking algorithm (Section 6 in the SIGIR paper). Both the snippet feature vectors and the interval feature vectors are divided into five folds, available under the directories Fold1 to Fold5. Each fold contains a training set and a testing set of snippets (or intervals).

The top-level directory "all" contains "Fold1" to "Fold5" of the snippet data. Each line in the training (or testing) file under these Fold directories is a snippet feature vector with a relevance label and a docid. The docid is unique within a given query. The format of the training and testing files of the snippet data is as follows (a detailed specification of the format can be found on the LETOR Web site):

relevanceLabel integerQueryID (featureID:featureValue)+ #docid
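
A line in this format can be split into its parts with a few lines of Python. The sketch below is not the project's own code. It assumes whitespace-separated tokens, and it accepts the query ID either as a bare integer or with a LETOR-style `qid:` prefix, since the exact spelling in the files is not shown here. The function name `parse_snippet_line` and the sample line are hypothetical.

```python
def parse_snippet_line(line):
    """Split one snippet line into (label, qid, features, docid)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    # Accept both "12" and "qid:12" (assumption about the file contents).
    qid = int(tokens[1].split(":")[-1])
    features = {}
    for tok in tokens[2:]:
        fid, value = tok.split(":", 1)
        features[int(fid)] = float(value)
    docid = comment.strip()
    # Tolerate an optional "docid" / "docid =" prefix before the identifier.
    if docid.startswith("docid"):
        docid = docid[len("docid"):].lstrip(" =")
    return label, qid, features, docid


# Hypothetical line, made up for illustration.
parsed = parse_snippet_line("1 qid:12 1:0.5 2:0.25 #docid = 42")
print(parsed)
```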

The directory "all/Interval/R8" contains the five folds (Fold1 to Fold5) of the interval data. Here R8 denotes the interval width tolerance parameter r=8%. Our Interval Ranking algorithm gives its best results for this value. We have not included the interval feature vectors for other values of r, to keep the download size manageable. Each line in the training and testing files (under the fold directories) of the interval data is a feature vector representation of an interval in the following format.

relevanceLabel integerQueryID (featureID:featureValue)+ #docid=listOfDocIDsSeparatedByDash

The list of docids here denotes the docids of the snippets that belong to this interval. Therefore, given the qid and the docid list of an interval, one can find the list of snippets belonging to that interval.
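
The lookup from an interval back to its snippets could be sketched as follows. This is an illustration under assumptions, not the project's implementation: it assumes docids contain no dash themselves, and that the snippets of each query have already been loaded into a dictionary keyed by docid. The names `interval_members` and `snippets_by_qid`, and the sample data, are hypothetical.

```python
def interval_members(line):
    """Return (qid, list of snippet docids) for one interval line."""
    body, _, comment = line.partition("#")
    # Accept the query ID as "12" or "qid:12" (assumption).
    qid = int(body.split()[1].split(":")[-1])
    _, _, id_list = comment.partition("=")
    docids = [d.strip() for d in id_list.split("-") if d.strip()]
    return qid, docids


# Hypothetical data: snippets indexed by query id, then by docid.
snippets_by_qid = {
    12: {"3": "snippet three", "7": "snippet seven", "9": "snippet nine"},
}

qid, docids = interval_members("1 12 1:0.5 2:0.1 #docid=3-7")
members = [snippets_by_qid[qid][d] for d in docids]
print(qid, docids, members)
```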

QCQ sample queries and snippets

Visit http://sites.google.com/site/quantityconsensus/ for samples provided with the paper.

Wikipedia annotations

This is from the annotator part of the project.

Code access

Project members

(In approximate order of recency) Soumen Chakrabarti, Uma Sawant, Shashank Gupta, Siddhanth Jain, Hrushikesh Mohapatra, Sasidhar Kasturi, Devshree Sane, Ganesh Ramakrishnan, Apoorv Sharma, Amit Singh, Sayali Kulkarni, Somnath Banerjee.

Support

Partly supported by grants from Google, HP Labs, Yahoo, Microsoft Research, NetApp and SAP.