Recrawl Scheduling Based on Information Longevity
Dr. Christopher Olston, Yahoo! Research
Date & Time: September 11, 2007 14:30
Venue: Conference Room, C-Block, 1st floor, Kanwal Rekhi Building
We study the problem of deciding when to refresh web pages in an incremental crawler. A key issue is to distinguish between ephemeral information (e.g., quote of the day), which offers little benefit when refreshed, and persistent information (e.g., blog entries), which may be worthwhile to refresh. We propose a new theoretical model of optimal refreshing that takes into account the longevity of information and generalizes previous models. Based on our theory, we develop a practical refresh scheduling policy that is adaptive (i.e., adjusts to changing page behavior) and local (i.e., does not require global optimization). These properties make our policy suitable for use in a real, parallel web crawler.
Speaker Profile:
Christopher Olston is a senior research scientist at Yahoo! Research, following a stint as an assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, which is a really nice place to rollerblade.
