We describe a system that we developed for duplicate elimination.
The focus is on interactively learning to decide when two items are duplicates, using interesting learning techniques.
This is crucial in all data integration activities.
Of interest to both practitioners and researchers.
Two specific applications:
Automatic citation databases like CiteSeer that we all like and depend on so much.
Deciding when citations refer to the same paper; there are some ruffled egos when authors or papers have their counts distributed over duplicates.
Field matching is hard, and citations most often do not come segmented into fields.
There are two classes of methods in this category:
    First, somehow get each classifier to output an uncertainty score.
    For example: for SVMs, derive it from the distance to the separator; for naive Bayes, from the posterior probability of the winning class; for decision trees, from the purity of the leaf (see the sketch below).
    There is ongoing work (including a poster at this conference) that attempts to do this, but the success of these methods is limited.
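A minimal sketch of such per-classifier uncertainty scores, assuming scikit-learn-style models (the talk does not name a library, so the classes and calls below are illustrative):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def svm_uncertainty(svm: SVC, X):
        # SVM: uncertainty as the inverse of the distance to the separator.
        return 1.0 / (np.abs(svm.decision_function(X)) + 1e-9)

    def nb_uncertainty(nb: GaussianNB, X):
        # Naive Bayes: a low posterior for the winning class means high uncertainty.
        return 1.0 - nb.predict_proba(X).max(axis=1)

    def tree_uncertainty(tree: DecisionTreeClassifier, X):
        # Decision tree: impurity of the leaf an instance falls into, read off
        # the per-leaf class proportions that predict_proba exposes.
        return 1.0 - tree.predict_proba(X).max(axis=1)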
Create a committee of classifiers, ideally by sampling from the version space defined by the limited training data.
Again, in the ideal setting, when all of them are consistent classifiers, they will agree on predictions for instances outside the confusion region.
They will disagree on instances that fall within the confusion region. By getting labels for those, you narrow the version space.
Now we will see how to create such a committee for different classifiers; a sketch follows below.
All models consistent with the given training data are possible, but some models have higher probability than others.
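One common way to approximate sampling from the version space is to train committee members on bootstrap resamples of the labeled data; this is a hedged sketch, not necessarily the construction the system itself uses:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_committee(X_train, y_train, n_members=5, seed=0):
        # Each member sees a bootstrap resample, so members differ exactly
        # where the training data underdetermines the model.
        rng = np.random.default_rng(seed)
        committee = []
        for _ in range(n_members):
            idx = rng.integers(0, len(X_train), size=len(X_train))
            member = DecisionTreeClassifier()  # base learner is an arbitrary choice here
            committee.append(member.fit(X_train[idx], y_train[idx]))
        return committee

    def disagreement(committee, X_pool):
        # Fraction of members in the minority per instance (binary 0/1 labels);
        # the highest-scoring instances lie in the confusion region.
        votes = np.stack([m.predict(X_pool) for m in committee])
        p = votes.mean(axis=0)
        return np.minimum(p, 1.0 - p)

Querying the label of the instance with the largest disagreement then narrows the version space the most.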
There are various issues in doing active learning.
Experiments give some insight into the pros and cons of the different methods.
SVMs are reputed to be high-accuracy classifiers, but here, in addition to discriminating power, what also matters is how effective they are at choosing instances.
For SVMs, uncertainty is the inverse of the distance from the separator; a sketch of the resulting selection loop follows.
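A minimal uncertainty-sampling loop under this rule, again assuming a scikit-learn-style SVM; get_label stands in for the human labeler and is hypothetical:

    import numpy as np
    from sklearn.svm import SVC

    def uncertainty_sampling(X_labeled, y_labeled, X_pool, get_label, rounds=100):
        for _ in range(rounds):
            svm = SVC(kernel="linear").fit(X_labeled, y_labeled)
            # Pick the pool instance closest to the separator (most uncertain).
            i = int(np.argmin(np.abs(svm.decision_function(X_pool))))
            X_labeled = np.vstack([X_labeled, X_pool[i:i + 1]])
            y_labeled = np.append(y_labeled, get_label(X_pool[i]))
            X_pool = np.delete(X_pool, i, axis=0)
        return SVC(kernel="linear").fit(X_labeled, y_labeled)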
With only 100 of the 30,000 instances, we could achieve peak accuracy.
Randomly selecting 100 instances gets only 30% accuracy.
Random selection needs 8,000 instances to reach 90% accuracy.
Another interesting comparison is with an optimal approach that makes the unrealistic assumption that the labels of all unlabeled instances are known, and at each step picks the instance that leads to the largest increase in accuracy. This is the best you could do.
Active learning is so close to the best!
We wanted to verify whether the main benefit of active learning was due to the increased fraction of duplicates among the selected instances.
We threw away the selected non-duplicates and substituted a randomly selected set of non-duplicates; accuracy was only 40%.