Talks & Seminars
Title: Segmenting Web-Domains and Hashtags using Length Specific Models
Dr. Sourangshu Bhattacharya, Yahoo! Labs
Date & Time: December 18, 2012 11:30
Venue: Conference Room, 1st Floor, C Block, Department of Computer Science & Engineering, Kanwal Rekhi Building
Segmentation of a string of English-language characters into a sequence of words has many applications. Here, we study two applications in the internet domain. The first is web domain segmentation, which is crucial for monetization of broken URLs. The second is a novel application that we propose and study: Twitter hashtag segmentation for increasing recall on Twitter searches. Existing methods for word segmentation use unsupervised language models. We find that, when multiple corpora are available, the joint probability model built from all of them performs significantly better than models built from the individual corpora. Motivated by this, we propose a weighted joint probability model, with weights specific to each corpus. We propose to train the weights in a supervised manner using max-margin methods. The supervised probability models improve segmentation accuracy over joint probability models. Finally, we observe that the length of segments is an important parameter for word segmentation, and we incorporate length-specific weights into our model. The length-specific models further improve segmentation accuracy over supervised probability models. For all models proposed here, the inference problem can be solved using dynamic programming. We test our methods on five different datasets: two from web domains data and three from news headlines data from an LDC dataset. The supervised length-specific models show significant improvements over unsupervised single-corpus and joint probability models. Cross-testing between the datasets confirms that supervised probability models trained on all datasets, and length-specific models trained on news headlines data, generalize well. Segmentation of hashtags results in significant improvement in recall on searches for Twitter trends.
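To make the inference step concrete, the following is a minimal Python sketch of dynamic-programming segmentation under a weighted multi-corpus score. It is an illustration under stated assumptions, not the speaker's implementation: the corpora and counts are toy data, the smoothing and unseen-word penalty are arbitrary choices, the score is assumed to be a weighted sum of per-corpus log-probabilities, and the weights are fixed by hand rather than trained with max-margin methods. The names `CORPORA`, `MAX_BUCKET`, `UNIFORM_WEIGHTS`, `log_prob`, `segment_score`, and `segment` are all hypothetical.

```python
import math

# Hypothetical toy corpora (word -> count); not data from the talk.
CORPORA = {
    "web": {"new": 50, "york": 30, "times": 40, "newyork": 1, "time": 20},
    "news": {"new": 80, "york": 60, "times": 70, "time": 90, "s": 5},
}

# Segments of this length or longer share one weight (an assumed bucketing).
MAX_BUCKET = 8

# One weight per (corpus, segment-length bucket). The talk learns such weights
# with max-margin training; here they are simply set to 1.0 for illustration.
UNIFORM_WEIGHTS = {
    (name, bucket): 1.0
    for name in CORPORA
    for bucket in range(1, MAX_BUCKET + 1)
}


def log_prob(word, counts):
    """Add-one smoothed unigram log-probability of `word` in one corpus.

    Unseen words get an extra per-character penalty (an assumed heuristic)
    so that long unknown strings are not preferred over known words.
    """
    denom = sum(counts.values()) + len(counts)
    if word in counts:
        return math.log((counts[word] + 1) / denom)
    return math.log(1 / denom) - 2.0 * len(word)


def segment_score(word, weights):
    """Weighted sum of per-corpus log-probabilities for one candidate segment."""
    bucket = min(len(word), MAX_BUCKET)
    return sum(
        weights[(name, bucket)] * log_prob(word, counts)
        for name, counts in CORPORA.items()
    )


def segment(text, weights, max_len=20):
    """Return the highest-scoring split of `text` into segments."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + segment_score(text[start:end], weights)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover segments.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("newyorktimes", UNIFORM_WEIGHTS))
```

Because the score of a segmentation decomposes into a sum over its segments, the best split of each prefix can be built from the best splits of shorter prefixes, which is what the `best` and `back` tables record. Replacing `UNIFORM_WEIGHTS` with different values per corpus or per length bucket changes only the scoring function, not the inference procedure.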
Speaker Profile:
Sourangshu Bhattacharya is a Scientist at Yahoo! Labs, Bangalore. His areas of interest include Machine Learning, Computational Advertising, and Data Mining. He has published research papers in top conferences such as ICML, NIPS, and UAI, and has been a reviewer for many conferences, including NIPS, ECML, and ADCOM. Sourangshu also has many US patents to his credit.
