ALIAS: An Active Learning Led Interactive Deduplication System

The goal of the ALIAS deduplication system is to automate the manual, time-consuming process of removing duplicates in large semi-structured lists. There are two main challenges of the duplicate elimination task that ALIAS addresses:

The first challenge is to define a robust deduplication function that can capture when two records refer to the same entity in spite of inconsistencies and errors in the data. ALIAS automates this task by learning the function from examples of duplicates and non-duplicates. The success of the learning approach hinges on providing a large, covering, and challenging set of examples that brings out the subtlety of the deduplication function. ALIAS interactively discovers such challenging training pairs through active learning.
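As a rough illustration of how active learning can pick challenging pairs, the sketch below trains a small committee of classifiers on the pairs labeled so far and asks the user to label the pair on which the committee disagrees most. This is a minimal sketch under stated assumptions, not the ALIAS implementation: the decision-stump learner, the bootstrap committee, and every function name here are hypothetical choices made for the example.

```python
import random
from difflib import SequenceMatcher

def similarity_features(a, b):
    """Map a record pair to one string-similarity score per attribute (assumed feature set)."""
    return [SequenceMatcher(None, a[k], b[k]).ratio() for k in sorted(a)]

def train_stump(labeled):
    """Fit a one-feature threshold rule that minimizes errors on the labeled pairs."""
    best = None
    for f in range(len(labeled[0][0])):
        for x, _ in labeled:
            t = x[f]
            errors = sum((xi[f] >= t) != yi for xi, yi in labeled)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return lambda x: x[f] >= t  # True means "duplicate"

def train_committee(labeled, size=5, seed=0):
    """Train each committee member on a bootstrap resample of the labeled pairs."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(labeled) for _ in labeled]) for _ in range(size)]

def disagreement(committee, x):
    """Size of the minority vote: 0 when unanimous, largest on an even split."""
    votes = sum(member(x) for member in committee)
    return min(votes, len(committee) - votes)

def most_uncertain(committee, unlabeled):
    """Return the unlabeled feature vector the committee is least sure about."""
    return max(unlabeled, key=lambda x: disagreement(committee, x))
```

In an interactive loop, the pair returned by `most_uncertain` would be shown to the user, its label added to the training set, and the committee retrained, so that each new label goes to a pair the current model finds hard.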

The second challenge is to evaluate the learnt deduplication function efficiently on large lists of records. If the function is treated as a black box, the only way to evaluate it is to take the Cartesian product of the entries and invoke the function on each pair to determine whether it is a duplicate. This can be intolerably expensive when the number of records is large. ALIAS instead views the function as a general AND/OR predicate over simpler similarity functions and applies a number of novel optimization techniques to defer materialization of pairs.
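To make the contrast with the black-box approach concrete, the sketch below compares a Cartesian-product evaluation with one that exploits the structure of an AND/OR predicate. It is only an illustration of the general idea, not the optimizations ALIAS actually applies: the predicate, the `name` and `phone` attributes, the 0.8 threshold, and the inverted-index candidate generation are all assumptions made for the example. Because the hypothetical predicate requires a shared name token in its top-level AND, only pairs that share a token ever need to be materialized.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def tokens(text):
    """Lower-cased whitespace tokens of a string."""
    return set(text.lower().split())

def is_duplicate(a, b):
    """Hypothetical learnt predicate: shared name token AND (similar name OR same phone)."""
    shares_token = bool(tokens(a["name"]) & tokens(b["name"]))
    similar_name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= 0.8
    same_phone = a["phone"] == b["phone"]
    return shares_token and (similar_name or same_phone)

def naive_duplicates(records):
    """Black-box evaluation: test every pair in the Cartesian product."""
    return {(i, j) for i, j in combinations(range(len(records)), 2)
            if is_duplicate(records[i], records[j])}

def candidate_pairs(records):
    """Materialize only pairs that satisfy the necessary 'shared name token' condition."""
    index = defaultdict(list)  # token -> ids of records containing it, in ascending order
    for i, record in enumerate(records):
        for tok in tokens(record["name"]):
            index[tok].append(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs

def indexed_duplicates(records):
    """Evaluate the full predicate only on the candidate pairs."""
    return {(i, j) for i, j in candidate_pairs(records)
            if is_duplicate(records[i], records[j])}
```

Both functions return the same set of duplicate pairs, but `indexed_duplicates` invokes the predicate only on pairs that share a name token, which on realistic lists is typically far fewer than all pairs.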

For more details about the project, see the slide show or read the publications.

Publications

Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Canada, July 2002.
[ bib | .pdf ]

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

Sunita Sarawagi and Alok Kirpal. Scaling up the ALIAS duplicate elimination system: A demonstration. In Proc. of the 19th IEEE Int'l Conference on Data Engineering (ICDE), Bangalore, March 2003.
[ bib | .pdf ]

Duplicate elimination is an important step in integrating data from multiple sources. The challenges involved are finding a robust deduplication function that can identify when two records are duplicates and efficiently applying the function on very large lists of records. In ALIAS the task of designing a deduplication function is eased by learning the function from examples of duplicates and non-duplicates and by using active learning to spot such examples effectively [3]. Here we investigate the issues involved in efficiently applying the learnt deduplication system on large lists of records. We demonstrate the working of the ALIAS evaluation engine and highlight the optimizations it uses to significantly cut down the number of record pairs that need to be explicitly materialized.

Sunita Sarawagi, Anuradha Bhamidipaty, Alok Kirpal, and Chandra Mouli. ALIAS: An active learning led interactive deduplication system. In Proc. of the 28th Int'l Conference on Very Large Databases (VLDB) (Demonstration session), Hong Kong, August 2002.
[ bib | .pdf ]

Deduplication, a key operation in integrating data from multiple sources, is a time-consuming, labor-intensive and domain-specific operation. We present our design of ALIAS that uses a novel approach to ease this task by limiting the manual effort to inputting simple, domain-specific attribute similarity functions and interactively labeling a small number of record pairs. We describe how active learning is useful in selecting informative examples of duplicates and non-duplicates that can be used to train a deduplication function. ALIAS provides a mechanism for efficiently applying the function on large lists of records using a novel cluster-based execution model.

Group members

Sunita Sarawagi
Ms. Anuradha Bhamidipaty
Mr. Alok Kirpal
Mr. Chandra Mouli

Code

A C++ version of the source code is packaged in this distribution.

Copyright IIT Bombay.
For questions regarding this project contact [sunita[At]it.iitb.ac.in].