We propose a new method for extracting information from large websites by learning the sequences of links that lead to goal pages of a specific kind. Sample applications include finding computer science publications starting from university root pages and fetching company addresses from a web database.
We model the website as a graph over a set of important states chosen using domain knowledge of the application, and train a conditional random field (CRF) over it. The CRF features are learnt from keywords extracted from, and around, the traversed hyperlinks and the fetched pages. Our technique achieves harvest rates twice those of generic focused crawlers.
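To make the setup concrete, here is a minimal sketch of the kind of model the abstract describes: keyword features are extracted from a hyperlink's anchor text and surrounding words, and a crawl path is scored as a linear-chain sum of feature weights plus state-transition weights. All names, keywords, states, and weights below are invented for illustration; a real CRF would learn the weights from labelled crawl paths rather than use hand-set values.

```python
import re

# Hypothetical feature extraction: binary keyword features taken from a
# hyperlink's anchor text and the words around it (keywords are illustrative).
def link_features(anchor_text, context):
    words = set(re.findall(r"[a-z]+", (anchor_text + " " + context).lower()))
    keywords = {"publications", "research", "faculty", "papers", "contact"}
    return {f"kw_{w}": 1.0 for w in words & keywords}

# Illustrative weights; in the actual method these would be learnt by the CRF.
FEATURE_W = {"kw_publications": 2.0, "kw_research": 1.5,
             "kw_faculty": 1.0, "kw_papers": 2.0, "kw_contact": -0.5}
TRANSITION_W = {("root", "dept"): 0.5, ("dept", "faculty"): 0.5,
                ("faculty", "pubs"): 1.0}

def path_score(states, links):
    """Linear-chain score of one candidate crawl path.

    states -- sequence of website states (e.g. root -> dept -> faculty -> pubs)
    links  -- (anchor_text, surrounding_context) pair per traversed hyperlink
    """
    score = 0.0
    for anchor, context in links:          # emission-style feature terms
        feats = link_features(anchor, context)
        score += sum(FEATURE_W.get(k, 0.0) * v for k, v in feats.items())
    for a, b in zip(states, states[1:]):   # transition terms between states
        score += TRANSITION_W.get((a, b), 0.0)
    return score
```

A crawler would use such scores to rank frontier links, preferring paths whose keyword evidence and state transitions resemble the training sequences that reached goal pages.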