The de-duplication problem
Given a list of semi-structured records,
   find all records that refer to the same entity
Example applications:
Data warehousing:  merging name/address lists
Entity:
Person
Household
Automatic citation databases (Citeseer): references
Entity: paper
Challenges
Errors and inconsistencies in data
Duplicates may be hard to spot because they can be spread far apart in the list:
they may not be groupable by any obvious key
Domain-specific
Existing manual approaches require retuning with every new domain
The learning approach
Experiences with the learning approach
Too much manual search in preparing training data
Hard to spot challenging and covering sets of duplicates in large lists
Even harder to find close non-duplicates that capture the nuances
The active learning approach
The ALIAS deduplication system
Interactive discovery of deduplication function using active learning
Efficient active learning on large lists using novel indexing mechanisms
Efficient application of learnt function on large lists using
Novel cluster-based evaluation engine
Cost-based optimizer
Working of ALIAS
Apply similarity functions on record pairs.
Loop until the user is satisfied:
Train classifier.
Use active learning to select n instances.
Collect user feedback.
Augment with pairs inferred using transitivity.
Add to training set.
Output classifier
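The loop above can be sketched as follows; every helper passed in (train, active_select, get_labels, infer_by_transitivity, the similarity functions) is a hypothetical stand-in for a component of the real system, not the paper's API:

```python
def alias_loop(record_pairs, similarity_fns, select_n, user_satisfied,
               train, active_select, get_labels, infer_by_transitivity):
    """Skeleton of the ALIAS interaction loop (illustrative only)."""
    # Apply similarity functions on record pairs to get feature vectors.
    instances = [[f(a, b) for f in similarity_fns] for a, b in record_pairs]
    training_set = []
    classifier = None
    while not user_satisfied(classifier):
        classifier = train(training_set)                  # train classifier
        chosen = active_select(classifier, instances, select_n)
        labeled = get_labels(chosen)                      # collect user feedback
        labeled += infer_by_transitivity(labeled)         # augment via transitivity
        training_set += labeled                           # add to training set
    return classifier
```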
Committee-based algorithm
Train k classifiers C1, C2, ..., Ck on training data
For each unlabeled instance x:
Find predictions y1, ..., yk from the k classifiers
Compute uncertainty U(x) as the entropy of the yi's
Pick instance with highest uncertainty
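The vote-entropy selection above can be sketched as follows (a minimal illustration; the committee members and instances are hypothetical stand-ins):

```python
import math
from collections import Counter

def vote_entropy(predictions):
    """Uncertainty U(x): entropy of the labels y1..yk predicted
    by the k committee members for one instance x."""
    counts = Counter(predictions)
    k = len(predictions)
    return -sum((c / k) * math.log2(c / k) for c in counts.values())

def pick_most_uncertain(instances, committee):
    """Return the unlabeled instance the committee disagrees on most."""
    return max(instances,
               key=lambda x: vote_entropy([clf(x) for clf in committee]))
```

Unanimous committees give entropy 0; a perfectly split committee gives the maximum.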
Forming a classifier committee
Data partitioning
Resampling training data
Attribute Partitioning
Random parameter perturbation
Probabilistic classifiers:
Sample from the posterior distribution on parameters given the training data.
Example: a binomial parameter p has a Beta posterior, whose mean tracks the observed success fraction
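One way to realize this is to draw each committee member's parameter from the Beta posterior; the sketch below assumes a uniform Beta(1,1) prior, so observing s successes and f failures gives a Beta(s+1, f+1) posterior:

```python
import random

def sample_committee_params(successes, failures, k, seed=0):
    """Draw k values of the binomial parameter p from its Beta
    posterior (uniform prior assumed). Each draw parameterizes
    one committee member."""
    rng = random.Random(seed)
    return [rng.betavariate(successes + 1, failures + 1) for _ in range(k)]
```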
Randomly perturbing trees
Selecting split attribute
Normally:  attribute with lowest entropy
Perturbed:  random attribute within close range of lowest
Selecting a split point
Normally:  midpoint of range with lowest entropy
Perturbed:  a random point anywhere in the range with lowest entropy
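The perturbed attribute choice can be sketched as below; the `slack` knob defining "close range" is an illustrative value, not one from the paper:

```python
import random

def perturbed_split_attribute(entropies, slack=0.05, rng=random):
    """Instead of always taking the attribute with the lowest entropy,
    pick uniformly among attributes whose entropy is within `slack`
    of the best -- yielding slightly different trees per committee
    member. `entropies` maps attribute name -> split entropy."""
    best = min(entropies.values())
    close = [a for a, e in entropies.items() if e <= best + slack]
    return rng.choice(close)
```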
Methods of creating committee
Data partitioning: poor when training data is limited
Attribute partitioning: poor when training data is plentiful
Parameter perturbation: best overall
Importance of randomization
Important to randomize selection for generative classifiers like naïve Bayes
Choosing the right classifier
SVMs good initially but not effective in choosing instances
Decision trees: best overall
Benefits of active learning
Active learning much better than random
With only 100 actively selected instances:
97% accuracy, versus only 30% with random selection
Committee-based selection close to optimal
Analyzing selected instances
Fraction of duplicates among selected instances: 44%, up from only 0.5% in the data overall
Is the gain due to increased fraction of duplicates?
Replaced non-duplicates in the selected set with random non-duplicates
Result → only 40% accuracy!
Optimizing the evaluation of a de-duplication function
Naïve application of a function would require quadratic time
1000 records would compare 10^6 pairs!
Our optimizations to avoid materializing all pairs
Grouped evaluation model
Reordering similarity functions
Precede hard functions with simpler canopies
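The canopy idea above can be sketched as cheap blocking before the expensive functions; the first-character key below is an illustrative canopy predicate, not the paper's exact one:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, canopy_key):
    """Cheap 'canopy' step: only records sharing a canopy key are
    handed to the expensive similarity functions, avoiding the
    quadratic all-pairs evaluation."""
    blocks = defaultdict(list)
    for r in records:
        blocks[canopy_key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)
```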
Evaluation Engine
Define operators on groups of records.
Select operator
Equal operator
Merge operator
Aggregation operator
Compaction operator
Define each similarity function in terms of these operators,
i.e., express it as a plan over these operators.
Features of ALIAS
Interactive discovery of deduplication function using active learning
Manual effort reduced to
Providing simple similarity functions
Labeling selected pairs
Efficient indexing mechanism
Novel cluster-based evaluation engine
Cost-based optimizer