- Semi-Supervised Priors for Microblog Language Identification
- Dutch-Belgian Information Retrieval workshop (DIR 2011)
- Book/source title: DIR 2011: Dutch-Belgian Information Retrieval Workshop Amsterdam
- University of Amsterdam, Information and Language Processing group
- Document type: Conference contribution
- Faculty of Science (FNWI)
- Informatics Institute (IVI)
Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the level of individual microblog posts: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), with a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
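The abstract does not say which n-gram identifier is used or how the priors are combined with it. As a rough illustration of the blogger-based idea only, the following minimal sketch assumes a character-trigram model with add-one smoothing and a prior estimated from the languages assigned to the same blogger's earlier posts, added to the text score in log space. All function names, the smoothing choices, and the toy training data are assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language from {lang: [texts]}."""
    return {lang: Counter(g for t in texts for g in char_ngrams(t, n))
            for lang, texts in samples.items()}


def log_likelihood(text, counts, vocab_size, n=3):
    """Add-one smoothed log-likelihood of the text under one language model."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in char_ngrams(text, n))


def blogger_prior(previous_labels, languages, alpha=1.0):
    """Smoothed log-prior from the languages of the blogger's earlier posts.

    Hypothetical helper: the labels would come from running the identifier
    on previous posts, which is what makes the prior semi-supervised.
    """
    seen = Counter(previous_labels)
    denom = len(previous_labels) + alpha * len(languages)
    return {lang: math.log((seen[lang] + alpha) / denom) for lang in languages}


def identify(text, models, log_prior=None, n=3):
    """Pick the language with the highest text score plus optional log-prior."""
    vocab_size = len(set().union(*models.values()))
    scores = {}
    for lang, counts in models.items():
        score = log_likelihood(text, counts, vocab_size, n)
        if log_prior is not None:
            score += log_prior[lang]
        scores[lang] = score
    return max(scores, key=scores.get)


# Toy training data (assumed, for illustration only).
models = train({
    "nl": ["dit is een korte zin in het nederlands", "ik ga morgen naar de markt"],
    "en": ["this is a short sentence in english", "i am going to the market tomorrow"],
})
prior = blogger_prior(["en", "en"], list(models))
print(identify("this is the market", models))
print(identify("ok", models, log_prior=prior))
```

The prior matters most for very short or noisy posts, where the n-gram evidence is weak and the blogger's posting history can tip the decision. A link-based prior could be plugged in the same way by estimating the label distribution from the linked pages instead of from earlier posts.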