Experimental analysis
• 250 references from Citeseer → 32000 pairs, of which only 150 are duplicates
• Citeseer's script used to segment references into author, title, year, page and rest
• 20 text and integer similarity functions
• Average of 20 runs
• Default classifier: decision tree
• Initial labeled set: just two pairs
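
To make the setup above concrete, here is a minimal Python sketch of how segmented references could be turned into candidate pairs and per-field similarity features. It is an illustration only, not the experiment's actual code: the sample records, `text_sim`, `int_sim` and `pair_features` are hypothetical, and the two similarity functions stand in for the 20 functions mentioned above, which the slide does not list. Note that 250 references give exactly 31125 unordered pairs, so the 32000 figure is presumably rounded or based on a slightly larger reference count.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import comb

# Hypothetical segmented references. The field names follow the slide
# (author, title, year, page, rest); the values are made up.
refs = [
    {"author": "A. Smith", "title": "Learning to match records",
     "year": 1999, "page": 12, "rest": "Proc. X"},
    {"author": "Smith, A.", "title": "Learning to match records.",
     "year": 1999, "page": 12, "rest": "Proceedings of X"},
    {"author": "B. Jones", "title": "Query optimization",
     "year": 1987, "page": 301, "rest": "Journal Y"},
]

TEXT_FIELDS = ("author", "title", "rest")
INT_FIELDS = ("year", "page")


def text_sim(a, b):
    """Assumed text similarity: case-insensitive matching ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def int_sim(a, b):
    """Assumed integer similarity for non-negative ints: 1.0 when equal,
    decreasing with the relative difference."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))


def pair_features(r1, r2):
    """One similarity score per field: text fields first, then integer fields."""
    feats = [text_sim(r1[f], r2[f]) for f in TEXT_FIELDS]
    feats += [int_sim(r1[f], r2[f]) for f in INT_FIELDS]
    return feats


# Every unordered pair of references is a candidate duplicate.
pairs = list(combinations(range(len(refs)), 2))
features = {(i, j): pair_features(refs[i], refs[j]) for i, j in pairs}

# Scale of the experiment described on the slide.
n_pairs_250 = comb(250, 2)    # exact number of unordered pairs for 250 references
dup_fraction = 150 / 32000    # share of duplicates, using the slide's figures

print(n_pairs_250)
print(f"{dup_fraction:.2%} of pairs are duplicates")
```

With fewer than 0.5% of pairs being duplicates, a randomly chosen pair is almost always a non-duplicate, which is why the tiny initial labeled set and the choice of which pairs to label next matter so much in this experiment.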