• 250 references from Citeseer → 32000 pairs, of which only 150 are duplicates
• Citeseer’s script used to segment each reference into author, title, year, page and rest
• 20 text and integer similarity functions
• Average of 20 runs
• Default classifier: decision tree
• Initial labeled set: just two pairs
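
To make the setup above more concrete, here is a minimal sketch of how one candidate pair of segmented references might be turned into a vector of similarity scores for the classifier. The slide does not name the 20 similarity functions, so the three used here (`token_jaccard`, `edit_ratio`, `int_similarity`), the choice of which fields count as text or integer, and the toy records are all assumptions for illustration only.

```python
from difflib import SequenceMatcher

# Assumed split of the segmented fields into text and integer fields.
TEXT_FIELDS = ("author", "title", "rest")
INT_FIELDS = ("year", "page")


def token_jaccard(a, b):
    """Jaccard overlap of lowercased word sets (1.0 when both are empty)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def edit_ratio(a, b):
    """Character-level similarity in [0, 1], computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def int_similarity(a, b):
    """1.0 for equal integers, decaying toward 0 as they move apart."""
    return 1.0 / (1.0 + abs(a - b))


def pair_features(r1, r2):
    """Apply every similarity function to the matching fields of a pair."""
    feats = []
    for field in TEXT_FIELDS:
        feats.append(token_jaccard(r1[field], r2[field]))
        feats.append(edit_ratio(r1[field], r2[field]))
    for field in INT_FIELDS:
        feats.append(int_similarity(r1[field], r2[field]))
    return feats


# Made-up toy records, not taken from the Citeseer data.
rec_a = {"author": "a smith", "title": "matching records",
         "rest": "tech report", "year": 2001, "page": 12}
rec_b = {"author": "smith a", "title": "matching of records",
         "rest": "technical report", "year": 2001, "page": 13}

features = pair_features(rec_a, rec_b)
```

Each pair yields one such vector (eight scores in this sketch, 20 in the experiment described above), and a classifier such as a decision tree is then trained on the vectors of the labeled pairs.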