Semi-Supervised Priors for Microblog Language Identification

Open Access
Authors
Publication date 2011
Host editors
  • C. Boscarino
  • K. Hofmann
  • V. Jijkoun
  • E. Meij
  • M. de Rijke
  • W. Weerkamp
Book title DIR 2011: Dutch-Belgian Information Retrieval Workshop Amsterdam
Event Dutch-Belgian Information Retrieval Workshop (DIR 2011)
Pages (from-to) 12-15
Publisher Amsterdam: University of Amsterdam, Information and Language Processing group
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Offering access to information in microblog posts requires successful language identification, which can be challenging on sparse and noisy data. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the level of individual microblog posts: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), with a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
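The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simple add-one-smoothed character-trigram scorer in place of the identifier used by the authors, and it assumes the prior is combined with the post's own score by linear interpolation. The names char_ngrams, train, posterior, identify, and the weight parameter are hypothetical, as is the tiny training sample.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language (samples: lang -> texts)."""
    return {
        lang: Counter(g for t in texts for g in char_ngrams(t, n))
        for lang, texts in samples.items()
    }


def posterior(text, models, n=3):
    """Normalised per-language scores from add-one-smoothed log-likelihoods.

    Assumption: a plain n-gram likelihood stands in for the identifier
    evaluated in the paper.
    """
    vocab_size = len(set().union(*models.values())) + 1
    logs = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        logs[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, n)
        )
    top = max(logs.values())  # subtract the maximum for numerical stability
    exps = {lang: math.exp(v - top) for lang, v in logs.items()}
    norm = sum(exps.values())
    return {lang: e / norm for lang, e in exps.items()}


def identify(post, prior_texts, models, weight=0.5):
    """Pick a language for a post, mixing in a prior from related texts.

    prior_texts may hold earlier posts by the same blogger (blogger-based
    prior) or the text of linked pages (link-based prior). The interpolation
    and the default weight are assumptions made for illustration.
    """
    scores = posterior(post, models)
    if prior_texts:
        priors = [posterior(t, models) for t in prior_texts]
        for lang in scores:
            prior = sum(p[lang] for p in priors) / len(priors)
            scores[lang] = (1 - weight) * scores[lang] + weight * prior
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real system would use far larger corpora.
models = train({
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is what we think about the weather today"],
    "nl": ["de snelle bruine vos springt over de luie hond",
           "dit is wat wij denken over het weer vandaag"],
})
guess = identify("dit is het weer", ["wat denken wij over de hond"], models)
```

The prior matters most when the post itself is very short or noisy: the longer related texts then carry most of the evidence, while a clear post is still decided mainly by its own n-grams.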
Document type Conference contribution
Language English
Published at http://edgar.meij.pro/wp-content/papercite-data/pdf/dir-2011.pdf