Our ambition is to annotate mentions of named entities on billions of Web pages with IDs, thus linking them to entity nodes in Wikipedia. This will enable searching with entities and relationships at an unprecedented scale. The project has two parts: annotating token segments on Web pages with Wikipedia entity IDs, and a new aggregated search mechanism for quantities.
Download the set of queries used in the experiments. The XML format is specified below.
<iitb.QuantitySearch.Query>
  <queryID>A unique string id of the query</queryID>
  <numericID>A unique numeric id of the query</numericID>
  <queryString>Query string that was used to search the index</queryString>
  <descriptionString>Detailed description of the query string; not used in the processing</descriptionString>
  <standardUnitName>The type of the answer quantity</standardUnitName>
  <answerSet> <!-- List of ground truth quantities -->
    <string>Ground truth quantity</string>
  </answerSet>
</iitb.QuantitySearch.Query>
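As a quick illustration, a query record in this format can be read with Python's standard xml.etree module. The field values in the sample below are invented, not taken from the released query set.

```python
import xml.etree.ElementTree as ET

# A hypothetical query record in the format above (all values invented).
sample = """<iitb.QuantitySearch.Query>
  <queryID>q-example</queryID>
  <numericID>1</numericID>
  <queryString>height of mount everest</queryString>
  <descriptionString>Height of Mount Everest in metres</descriptionString>
  <standardUnitName>metre</standardUnitName>
  <answerSet>
    <string>8848</string>
  </answerSet>
</iitb.QuantitySearch.Query>"""

root = ET.fromstring(sample)
query_id = root.findtext("queryID")
numeric_id = int(root.findtext("numericID"))
# Collect every ground-truth quantity listed under answerSet.
answers = [s.text for s in root.find("answerSet").findall("string")]
```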
We built a corpus of around 70,000 web pages by crawling the search-result URLs returned by the Google search engine for our queries. The corpus was indexed using Lucene. For each query, a maximum of 100 snippets was obtained from the Lucene index. The list of snippets obtained for each query is available, in the following format.
<SnippetSet qid="The numeric id of the query">
  <snippet>
    <id>A numeric id of the snippet</id>
    <text>Snippet text</text>
    <ans>Candidate answer of the snippet</ans>
    <!-- Parsed quantity in double format. Two quantities capture range answers, e.g. 15-17 -->
    <quantity1>...</quantity1>
    <quantity2>...</quantity2>
  </snippet>
  <snippet> ... </snippet>
</SnippetSet>
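A snippet set in this format can be parsed in the same way; the sketch below (with an invented snippet) shows how the two quantity fields give the candidate answer as a range.

```python
import xml.etree.ElementTree as ET

# A hypothetical snippet set in the format above (values invented).
sample = """<SnippetSet qid="1">
  <snippet>
    <id>0</id>
    <text>The reported range is 15-17 units.</text>
    <ans>15-17</ans>
    <quantity1>15.0</quantity1>
    <quantity2>17.0</quantity2>
  </snippet>
</SnippetSet>"""

root = ET.fromstring(sample)
qid = int(root.get("qid"))
snippets = []
for sn in root.findall("snippet"):
    # quantity1/quantity2 are doubles; for a point answer both carry the same value.
    lo = float(sn.findtext("quantity1"))
    hi = float(sn.findtext("quantity2"))
    snippets.append({"id": int(sn.findtext("id")),
                     "text": sn.findtext("text"),
                     "range": (lo, hi)})
```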
The data contains feature vectors and the relevance labels of snippets in LETOR format. It also contains feature vectors and (discretized) relevance labels of intervals. This feature vector representation of intervals was used in our Interval Ranking algorithm (Section 6 in the SIGIR paper). Both the snippet feature vectors and the interval feature vectors are divided into five folds and available under the directories Fold1 to Fold5. Each fold here contains a training and testing set of snippets (or intervals).
The top-level directory "all" contains "Fold1" to "Fold5" of the snippet data. Each line in a training (or testing) file under these fold directories is a snippet feature vector with a relevance label and a docid. The docid is unique within a given query. The format of the training and testing files of the snippet data is as follows (a detailed specification of the format can be found on the LETOR website):
relevanceLabel integerQueryID (featureID:featureValue)+ #docid
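A minimal parser for one such line might look as follows. The query id is written here as a bare integer, while standard LETOR files use a "qid:N" token, so the sketch accepts both forms; the example line in the test is invented.

```python
def parse_letor_line(line):
    """Split one LETOR-style snippet line into label, query id,
    feature dict, and the trailing #docid comment."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    # Accept both a bare integer query id and the standard LETOR "qid:N" form.
    qid = int(tokens[1].split(":")[-1])
    features = {int(f): float(v)
                for f, v in (tok.split(":") for tok in tokens[2:])}
    return label, qid, features, comment.strip()
```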
The directory "all/Interval/R8" contains the five folds (Fold1 to Fold5) of the interval data. Here R8 denotes the interval width tolerance parameter r=8%. Our interval ranking algorithm gives its best results for this value. We have not included the interval feature vectors for other values of r, to keep the download size manageable. Each line in the training and testing files (under the fold directories) of the interval data is a feature vector representation of an interval, in the following format.
relevanceLabel integerQueryID (featureID:featureValue)+ #docid=listOfDocIDsSeparatedByDash

The list of docids denotes the docids of the snippets that belong to this interval. Therefore, given the qid and the docid list of an interval, one can find the list of snippets belonging to that interval.
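Recovering the member snippets of an interval then amounts to splitting the trailing comment on dashes. A small helper (the comment string in the test is invented):

```python
def interval_docids(comment):
    """Extract the snippet docids from an interval line's trailing comment,
    e.g. "docid=4-9-12" -> ["4", "9", "12"]."""
    prefix = "docid="
    if not comment.startswith(prefix):
        raise ValueError("expected a docid= comment, got: " + comment)
    return comment[len(prefix):].split("-")
```

Together with the per-query snippet files, these docids identify exactly which snippets fall inside the interval.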
Visit http://sites.google.com/site/quantityconsensus/ for samples provided with the paper.
This is from the annotator part of the project.
(In approximate order of recency) Soumen Chakrabarti, Uma Sawant, Shashank Gupta, Siddhanth Jain, Hrushikesh Mohapatra, Sasidhar Kasturi, Devshree Sane, Ganesh Ramakrishnan, Apoorv Sharma, Amit Singh, Sayali Kulkarni, Somnath Banerjee.
Partly supported by grants from Google, HP Labs, Yahoo, Microsoft Research, NetApp and SAP.