Semi-Supervised Priors for Microblog Language Identification
| Authors |
|
|---|---|
| Publication date | 2011 |
| Host editors |
|
| Book title | DIR 2011: Dutch-Belgian Information Retrieval Workshop Amsterdam |
| Event | Dutch-Belgian Information Retrieval workshop (DIR 2011) |
| Pages (from-to) | 12-15 |
| Publisher | Amsterdam: University of Amsterdam, Information and Language Processing group |
| Organisations |
|
| Abstract |
Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), using a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
|
| Document type | Conference contribution |
| Language | English |
| Published at | http://edgar.meij.pro/wp-content/papercite-data/pdf/dir-2011.pdf |
| Downloads |
Semi-Supervised Priors for Microblog Language Identification
(Final published version)
|