Title: A Holistic Approach to Web Scale Information Extraction
Speaker: Dr. Ashwin Machanavajjhala, Yahoo! Research
Date & Time: April 5, 2011 15:30
Venue: Lecture Hall, B Block, Third Floor, Kanwal Rekhi Building
The mushrooming of structured websites raises the possibility of automatically creating, enriching, and maintaining large databases of structured information. However, traditional information extraction and integration techniques do not scale for this purpose. In this talk, we present a fresh perspective on the web-scale information extraction problem. We first motivate the need to extract and integrate structured information from a very large number (hundreds of thousands) of websites, and especially from inconsistently formatted tail websites, in order to build a comprehensive database of objects from specific domains of interest. We then present our (generative) model for structured data generation on the Web and formalize the end-to-end web extraction problem. Finally, we consider two important problems that arise when building an end-to-end extraction system: clustering similarly formatted pages within websites, and collective extraction from heterogeneous web lists. In both cases, we present novel solutions that are more efficient and accurate than prior work, owing to our holistic view of structured data on the Web.
Ashwin Machanavajjhala is a Research Scientist in the Community Systems group at Yahoo! Research. His primary research interests lie in the area of data management, with a specific focus on mining, uncertainty management, and privacy on the Web. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University. His thesis work on defining and enforcing privacy was awarded the 2009 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. He has also received an M.S. from Cornell University and a B.Tech. in Computer Science and Engineering from the Indian Institute of Technology, Madras.
