Learning to Rank for Quantity Consensus Queries

The set of queries used in the experiments is available for download. The XML format is specified below.

<iitb.QuantitySearch.Query>
  <queryID>A unique string id of the query</queryID>
  <numericID>A unique numeric id of the query</numericID>
  <queryString>Query string that was used to search the index</queryString>
  <descriptionString>Detailed description of the query string. 
     This was not used in the processing</descriptionString>
  <standardUnitName>The type of the answer quantity</standardUnitName>
  <answerSet>
    <!-- List of ground truth quantities -->
    <string>Ground truth quantity</string>
  </answerSet>
</iitb.QuantitySearch.Query>
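
The query format above can be read with a few lines of Python. The following is a minimal sketch using only the standard library, not part of the released data. The names `Query`, `parse_query` and `load_queries` are hypothetical, and the sketch assumes that the query elements are either the document root or nested somewhere under a wrapper element whose name does not matter.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    """One <iitb.QuantitySearch.Query> element (hypothetical container)."""
    query_id: str
    numeric_id: int
    query_string: str
    description: str
    unit: str
    answers: List[str]


def parse_query(elem: ET.Element) -> Query:
    def text(tag: str) -> str:
        # Collapse line breaks and runs of spaces inside the element text.
        return " ".join((elem.findtext(tag) or "").split())

    return Query(
        query_id=text("queryID"),
        numeric_id=int(text("numericID")),
        query_string=text("queryString"),
        description=text("descriptionString"),
        unit=text("standardUnitName"),
        # Each <string> under <answerSet> is one ground truth quantity.
        answers=[(s.text or "").strip() for s in elem.findall("answerSet/string")],
    )


def load_queries(xml_text: str) -> List[Query]:
    # Assumption: the wrapper element (if any) is irrelevant; iter() also
    # matches the root itself when the file holds a single query.
    root = ET.fromstring(xml_text)
    return [parse_query(e) for e in root.iter("iitb.QuantitySearch.Query")]
```

Ground truth quantities are kept as strings here, since the format above does not say how they are written.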

We built a corpus of around 70,000 web pages by crawling the search result URLs returned by the Google search engine for our queries. The corpus was indexed using Lucene. For each query, at most 100 snippets were retrieved from the Lucene index. The list of snippets obtained for each query is available. The format is as follows.

<SnippetSet qid="The numeric Id of the query">
  <snippet>
    <id>A numeric id of the snippet</id>
    <text>Snippet text</text>
    <ans>Candidate answer of the snippet</ans>
    <!-- Parsed quantities as doubles.
         Two quantities are given to capture range answers, e.g. 15-17 -->
    <quantity1>...</quantity1>
    <quantity2>...</quantity2>
  </snippet>
  <snippet> ... </snippet>
</SnippetSet>

The QCQ data is available in LETOR format. Details of the LETOR format can be found on the LETOR Web site. Snippets were annotated with binary relevance judgments by volunteers.

The data contains feature vectors and relevance labels of snippets in LETOR format. It also contains feature vectors and (discretized) relevance labels of intervals. This feature vector representation of intervals was used in our Interval Ranking algorithm (Section 6 in the SIGIR paper). Both the snippet feature vectors and the interval feature vectors are divided into five folds, available under the directories Fold1 to Fold5. Each fold contains a training set and a testing set of snippets (or intervals).

The top-level directory "all" contains "Fold1" to "Fold5" of the snippet data. Each line in the training (or testing) file under these Fold directories is a snippet feature vector with a relevance label and a docid. The docid is unique within a given query. The format of the training and testing files of the snippet data is as follows (a detailed specification of the format can be found on the LETOR website):

relevanceLabel integerQueryID (featureID:featureValue)+ #docid
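
A line in this format could be parsed as sketched below. `LetorRow` and `parse_letor_line` are hypothetical names. The sketch assumes an integer relevance label, accepts the query id either bare or with the `qid:` prefix used in standard LETOR files, and accepts the trailing comment either as a bare docid or as `docid = value`. Which of these variants the files actually use should be checked against the data.

```python
import re
from typing import Dict, NamedTuple


class LetorRow(NamedTuple):
    label: int
    qid: str
    features: Dict[int, float]
    docid: str


def parse_letor_line(line: str) -> LetorRow:
    # Everything after the first '#' is the docid comment.
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])  # assumption: labels are written as integers
    # Accept both "qid:12" (standard LETOR) and a bare "12".
    qid = tokens[1].split(":", 1)[-1]
    features = {}
    for tok in tokens[2:]:
        fid, val = tok.split(":", 1)
        features[int(fid)] = float(val)
    docid = comment.strip()
    # Strip an optional "docid =" prefix if the comment carries one.
    m = re.match(r"docid\s*=\s*(.*)", docid)
    if m:
        docid = m.group(1).strip()
    return LetorRow(label, qid, features, docid)
```

Features are stored in a dictionary keyed by feature id, which keeps the sketch valid even if some feature ids are omitted on a line.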

The directory "all/Interval/R8" contains the five folds (Fold1 to Fold5) of the interval data. Here R8 denotes the interval width tolerance parameter r=8%. Our Interval Ranking algorithm gives its best results for this value. To keep the download size manageable, we have not included the interval feature vectors for other values of r. Each line in the training and testing files (under the fold directories) of the interval data is a feature vector representation of an interval in the following format.

relevanceLabel integerQueryID (featureID:featureValue)+ #docid=listOfDocIDsSeparatedByDash

The list of docids denotes the docids of the snippets that belong to this interval. Therefore, given the qid and the docid list of an interval, one can find the snippets belonging to that interval.
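
The lookup from an interval to its snippets could be sketched as follows. `parse_interval_docids` and `snippets_in_interval` are hypothetical helpers, and `snippet_index` is an assumed dictionary keyed by (qid, docid) that the reader builds from the snippet-level files. The sketch also assumes that individual docids contain no dash, since the dash is the list separator.

```python
from typing import Dict, List, Tuple, TypeVar

T = TypeVar("T")


def parse_interval_docids(line: str) -> List[str]:
    """Extract the dash-separated docid list from the trailing comment."""
    _, sep, comment = line.partition("#")
    if not sep:
        return []  # no comment on this line
    value = comment.strip()
    if "=" in value:
        # Drop the "docid=" prefix and keep only the list itself.
        value = value.split("=", 1)[1].strip()
    return [d for d in value.split("-") if d]


def snippets_in_interval(
    qid: str,
    docids: List[str],
    snippet_index: Dict[Tuple[str, str], T],
) -> List[T]:
    # A docid is unique only within a query, so the key must include the qid.
    return [snippet_index[(qid, d)] for d in docids if (qid, d) in snippet_index]
```

Keying the index on the pair (qid, docid) rather than on the docid alone reflects the fact that docids are unique only under a given query.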

QCQ sample queries and snippets have also been published.