Given a list of semi-structured records, find all records that refer to the same entity.

Example applications:

Data warehousing: merging name/address lists (entity: a person or a household)

Automatic citation databases (CiteSeer): matching references (entity: a paper)

What makes deduplication hard:

Errors and inconsistencies in the data

Spotting duplicates can be hard because they may be spread far apart and may not be groupable using obvious keys

The problem is domain-specific: existing manual approaches require retuning for every new domain

Preparing training data requires too much manual search:

Hard to spot a challenging and covering set of duplicates in large lists

Even harder to find close non-duplicates that capture the nuances

Our contributions:

Interactive discovery of the deduplication function using active learning

Efficient active learning on large lists using novel indexing mechanisms

Efficient application of the learnt function on large lists using:

a novel cluster-based evaluation engine

a cost-based optimizer

The interactive learning loop (a minimal sketch follows the list):

Apply similarity functions to record pairs

Loop until the user is satisfied:

Train a classifier

Use active learning to select n instances

Collect user feedback

Augment with pairs inferred using transitivity

Add them to the training set

Output the classifier

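A minimal sketch of this loop, assuming scikit-learn, user-supplied similarity functions, and a hypothetical label_pairs() callback that collects the user's duplicate / non-duplicate judgements; uncertainty is simplified here to a single tree's predicted probability, and the committee-based variant described next replaces it with vote entropy:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def similarity_features(pairs, sim_fns):
    """One column per similarity function, one row per record pair."""
    return np.array([[f(a, b) for f in sim_fns] for a, b in pairs])

def active_dedup_loop(pairs, sim_fns, seed_labels, label_pairs, rounds=10, n=20):
    # seed_labels: {pair_index: 0/1}, assumed to contain both classes
    X = similarity_features(pairs, sim_fns)
    labeled = dict(seed_labels)
    for _ in range(rounds):                       # "loop until user satisfaction"
        idx = sorted(labeled)
        clf = DecisionTreeClassifier().fit(X[idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pairs)) if i not in labeled]
        if not unlabeled:
            break
        # Select the n most uncertain pairs (predicted probability closest to 0.5).
        p = clf.predict_proba(X[unlabeled])[:, 1]
        query = [unlabeled[j] for j in np.argsort(np.abs(p - 0.5))[:n]]
        labeled.update(label_pairs(query))        # user feedback on selected pairs
        # (pairs inferred by transitivity would also be added to `labeled` here)
    return clf
```
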
Committee-based uncertainty estimation:

Train k classifiers C1, C2, ..., Ck on the training data

For each unlabeled instance x:

Find the predictions y1, ..., yk of the k classifiers

Compute the uncertainty U(x) as the entropy of these predictions

Pick the instance with the highest uncertainty (illustrated below)

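A small sketch of the vote-entropy computation over an already-trained committee; the predict() calls assume scikit-learn-style classifiers, and how the committee itself is built is discussed next:

```python
import numpy as np

def vote_entropy(committee, X):
    """U(x): entropy of the votes y1..yk cast by the k committee members."""
    votes = np.array([clf.predict(X) for clf in committee])   # shape (k, n_instances)
    k = len(committee)
    scores = []
    for column in votes.T:                                    # votes for one instance
        _, counts = np.unique(column, return_counts=True)
        p = counts / k
        scores.append(float(-(p * np.log(p)).sum()))
    return np.array(scores)

# The next instance to show the user is the one with the highest U(x):
# query_index = int(np.argmax(vote_entropy(committee, X_unlabeled)))
```
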
Ways of creating the committee:

Data partitioning: resample the training data

Attribute partitioning

Random parameter perturbation

For probabilistic classifiers: sample the parameters from their posterior distribution given the training data; for example, a binomial parameter p has a Beta posterior with mean p (sketched below)

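A minimal sketch of that posterior-sampling idea for one binomial parameter; the counts, the uniform Beta(1, 1) prior, and the committee size of 5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binomial_parameter(successes, trials, prior_a=1.0, prior_b=1.0):
    """With a Beta(a, b) prior, the posterior of a binomial parameter is Beta(a + s, b + n - s)."""
    return rng.beta(prior_a + successes, prior_b + trials - successes)

# e.g. 30 duplicate pairs observed among 400 labeled pairs for some feature:
# each of the 5 committee members draws its own value of p from the posterior.
p_committee = [sample_binomial_parameter(30, 400) for _ in range(5)]
```
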
Parameter perturbation in decision trees:

Selecting the split attribute

Normally: the attribute with the lowest entropy

Perturbed: a random attribute whose entropy is within a close range of the lowest (sketched below)

Selecting the split point

Normally: the midpoint of the range with the lowest entropy

Perturbed: a random point anywhere within the range with the lowest entropy

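A sketch of the perturbed split-attribute choice, assuming an entropy(attribute) scoring function is available; the 5% tolerance is an illustrative value:

```python
import random

def perturbed_split_attribute(attributes, entropy, tolerance=0.05):
    """Pick a random attribute whose entropy is within `tolerance` of the best one."""
    scores = {a: entropy(a) for a in attributes}
    best = min(scores.values())
    near_best = [a for a, s in scores.items() if s <= best + tolerance]
    return random.choice(near_best)
```
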
Data partitioning works poorly when data is limited

Attribute partitioning works poorly when data is plentiful

Parameter perturbation: best overall

Randomizing the instance selection is important for generative classifiers like naïve Bayes

SVMs: good initially but not effective in choosing instances

Decision trees: best overall

Active learning is much better than random selection

With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection

Committee-based selection is close to optimal

Fraction of duplicates among the selected instances: 44%, starting from only 0.5% in the full data

Is the gain due simply to the increased fraction of duplicates?

Replacing the non-duplicates in the selected set with random non-duplicates drops accuracy to only 40%, so the gain is not just from seeing more duplicates: the informative close non-duplicates matter too

Naïve application of the learnt function would take quadratic time: even 1,000 records already give on the order of 10^6 pairs to compare

Our optimizations avoid materializing all pairs:

A grouped evaluation model

Reordering of the similarity functions

Preceding expensive functions with simpler canopies (sketched below)

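A toy illustration of the canopy idea: a cheap, loose grouping (here, shared name/address tokens) prunes the candidate pairs before any expensive similarity function runs. The record layout and tokenization are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def canopy_candidate_pairs(records, key_tokens):
    """Yield only pairs of record ids that share at least one cheap blocking token."""
    buckets = defaultdict(list)
    for rid, rec in records.items():
        for tok in key_tokens(rec):
            buckets[tok].append(rid)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

records = {1: "J. Smith, 12 Main St", 2: "John Smith, 12 Main Street", 3: "A. Jones"}
pairs = canopy_candidate_pairs(
    records, key_tokens=lambda r: r.lower().replace(",", "").split())
# Only (1, 2) survives; the expensive similarity functions never see (1, 3) or (2, 3).
```
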
Grouped evaluation model: define operators on groups of records

Select operator

Equal operator

Merge operator

Aggregation operator

Compaction operator

Each similarity function is then defined in terms of these operators, i.e. expressed as a plan over them (see the sketch after this list)

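A toy sketch of evaluating one exact-match predicate group-wise (in the spirit of the Equal operator) rather than pair by pair; the field names and records are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def equal_operator(records, field):
    """Group record ids by the exact value of `field`; pairs inside a group satisfy the predicate."""
    groups = defaultdict(list)
    for rid, rec in records.items():
        groups[rec[field]].append(rid)
    return [ids for ids in groups.values() if len(ids) > 1]

def matching_pairs(groups):
    for ids in groups:
        yield from combinations(sorted(ids), 2)

records = {
    1: {"zip": "94305", "name": "J. Smith"},
    2: {"zip": "94305", "name": "John Smith"},
    3: {"zip": "10001", "name": "A. Jones"},
}
# Instead of testing all three pairs, only the pair inside the "94305" group is produced.
print(list(matching_pairs(equal_operator(records, "zip"))))   # [(1, 2)]
```
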
In summary:

Interactive discovery of the deduplication function using active learning

Manual effort reduced to:

Providing simple similarity functions

Labeling the selected pairs

Efficient indexing mechanism

Novel cluster-based evaluation engine

Cost-based optimizer