Semi-Supervised Priors for Microblog Language Identification

Authors	S. Carter, E. Tsagkias, W. Weerkamp
Publication date	2011
Host editors	C. Boscarino, K. Hofmann, V. Jijkoun, E. Meij, M. de Rijke, W. Weerkamp
Book title	DIR 2011: Dutch-Belgian Information Retrieval Workshop Amsterdam
Event	Dutch-Belgian Information Retrieval workshop (DIR 2011)
Pages (from-to)	12-15
Publisher	Amsterdam: University of Amsterdam, Information and Language Processing group
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), using a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
Document type	Conference contribution
Language	English
Published at	http://edgar.meij.pro/wp-content/papercite-data/pdf/dir-2011.pdf
Downloads	Semi-Supervised Priors for Microblog Language Identification (Final published version)

UvA-DARE