Learning to extract information from large websites using sequential models

V.G.Vinod Vydiswaran, Sunita Sarawagi

Presented at the 11th International Conference on Management of Data (COMAD 2005), Goa, India, January 6-8, 2005


We propose a new method for extracting information from large websites that learns the sequence of links leading to a specific goal page on the website. Sample applications include finding computer science publications starting from university root pages and fetching company addresses from a web database.

We model the website as a graph over a set of important states chosen using application domain knowledge and train a conditional random field (CRF) over it. The CRF features are learnt from keywords extracted from and around the hyperlinks traversed and the pages fetched. Our technique achieves harvest rates twice as high as those of techniques used in generic focused crawlers.
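
As a rough illustration of the idea, the sketch below scores a path of followed links with a linear-chain model that combines state-transition features and keyword features taken from the anchor text of each link. It is a minimal sketch, not the implementation from the paper: the state names (department, faculty, publications), the weights, and the function names are all hypothetical, the score is left unnormalized, and a trained CRF would learn the weights from labelled paths. The harvest rate is computed here as the fraction of fetched pages that are goal pages, which is the usual definition in focused crawling and is assumed rather than taken from this abstract.

```python
import re
from typing import Dict, List, Tuple


def keyword_features(anchor_text: str, state: str) -> List[str]:
    """Pair each lowercase word of the anchor text with the candidate state."""
    words = re.findall(r"[a-z]+", anchor_text.lower())
    return [f"{state}|word={w}" for w in words]


def path_score(path: List[Tuple[str, str]], weights: Dict[str, float]) -> float:
    """Unnormalized linear-chain score of a labelled link path.

    `path` is a list of (anchor_text, state) pairs in traversal order.
    Features missing from `weights` contribute zero.
    """
    score = 0.0
    prev_state = "START"
    for anchor_text, state in path:
        # Transition feature between consecutive states.
        score += weights.get(f"{prev_state}->{state}", 0.0)
        # Keyword features for the hyperlink that was followed.
        for feat in keyword_features(anchor_text, state):
            score += weights.get(feat, 0.0)
        prev_state = state
    return score


def harvest_rate(num_goal_pages: int, num_pages_fetched: int) -> float:
    """Fraction of fetched pages that are goal pages (assumed definition)."""
    return num_goal_pages / num_pages_fetched if num_pages_fetched else 0.0


# Hypothetical weights; a real model would learn these from training paths.
WEIGHTS = {
    "START->department": 1.0,
    "department->faculty": 1.0,
    "faculty->publications": 2.0,
    "department|word=departments": 0.5,
    "faculty|word=faculty": 0.5,
    "publications|word=publications": 1.5,
}

good_path = [
    ("Departments", "department"),
    ("Faculty list", "faculty"),
    ("Publications", "publications"),
]
off_path = [("Campus map", "department")]

good_score = path_score(good_path, WEIGHTS)
off_score = path_score(off_path, WEIGHTS)
```

Under these made-up weights, the path that follows the department, faculty, and publications links scores higher than the single off-topic link, so a crawler guided by such scores would prefer to expand the former.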

