- User edits classification using document revision histories
- EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics
- Book/source title: EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of the Conference
- Pages (from-to)
- Stroudsburg: Association for Computational Linguistics
- Document type: Conference contribution
- Faculty of Science (FNWI)
- Informatics Institute (IVI)
Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.
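To make the edit-distance baseline mentioned in the abstract concrete, the following is a minimal sketch of how such a baseline could look, not the authors' implementation. It assumes that each user edit is available as a pair of strings (the segment before and after the edit), that distance is measured at the character level, and that small edits are more likely to be fluency edits. The function names and the threshold value of 4 are hypothetical and chosen only for illustration; in practice a threshold would be tuned on labeled data.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with the standard dynamic program."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def classify_edit(before: str, after: str, threshold: int = 4) -> str:
    """Label an edit 'fluency' if it changes few characters, else 'factual'.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return "fluency" if levenshtein(before, after) <= threshold else "factual"


if __name__ == "__main__":
    print(classify_edit("recieved", "received"))
    print(classify_edit("born in 1952", "born in Amsterdam"))
```

A single threshold of this kind ignores what was changed and looks only at how much, which is the limitation that the additional features listed in the abstract (language model probabilities, part-of-speech tags, named entities) are meant to address.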