User edits classification using document revision histories

Open Access
Authors
Publication date 2012
Host editors
  • W. Daelemans
Book title EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics
Book subtitle Proceedings of the conference: April 23-27, 2012, Avignon, France
ISBN
  • 9781937284190
Event EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics
Pages (from-to) 356-366
Publisher Stroudsburg, PA: Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.
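The edit-distance baseline mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the baseline is configured, so everything specific here is an assumption made for illustration: the character-level Levenshtein distance, the hypothetical function names `levenshtein` and `classify_edit`, the threshold value of 4, and the rule that small edits are labeled fluency edits while larger ones are labeled factual.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def classify_edit(before: str, after: str, threshold: int = 4) -> str:
    """Label an edit segment by its size alone.

    Assumption: the threshold of 4 is a hypothetical value, not one taken
    from the paper; in practice it would be tuned on labeled edits.
    """
    return "fluency" if levenshtein(before, after) <= threshold else "factual"


if __name__ == "__main__":
    print(classify_edit("recieve", "receive"))
    print(classify_edit("born in 1950", "born in Amsterdam"))
```

A rule like this relies only on the size of the change, which is presumably why the abstract adds language model probabilities, part-of-speech and named-entity comparisons, and adaptive features on top of it.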
Document type Conference contribution
Language English
Published at http://www.aclweb.org/anthology/E/E12/E12-1036.pdf http://dl.acm.org/citation.cfm?id=2380860