User edits classification using document revision histories
| Authors | |
|---|---|
| Publication date | 2012 |
| Host editors | |
| Book title | EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics |
| Book subtitle | proceedings of the conference : April 23-27 2012, Avignon France |
| ISBN | |
| Event | EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics |
| Pages (from-to) | 356-366 |
| Publisher | Stroudsburg, PA: Association for Computational Linguistics |
| Organisations | |
| Abstract | Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data. |
| Document type | Conference contribution |
| Language | English |
| Published at | http://www.aclweb.org/anthology/E/E12/E12-1036.pdf http://dl.acm.org/citation.cfm?id=2380860 |