Examining the Tip of the Iceberg: A Data Set for Idiom Translation
| Authors | |
|---|---|
| Publication date | 2018 |
| Book title | LREC 2018 : Eleventh International Conference on Language Resources and Evaluation |
| Book subtitle | May 7-12, 2018, Miyazaki, Japan |
| Event | 11th Language Resources and Evaluation Conference |
| Pages (from-to) | 925-929 |
| Publisher | Paris: European Language Resources Association (ELRA) |
| Abstract |
Neural Machine Translation (NMT) has been widely used in recent years with significant improvements for many language pairs. Although state-of-the-art NMT systems are generating progressively better translations, idiom translation remains one of the open challenges in this field. Idioms, a category of multiword expressions, are an interesting language phenomenon where the overall meaning of the expression cannot be composed from the meanings of its parts. A first important challenge is the lack of dedicated data sets for learning and evaluating idiom translation. In this paper we address this problem by creating the first large-scale data set for idiom translation. Our data set is automatically extracted from a widely used German↔English translation corpus and includes, for each language direction, a targeted evaluation set where all sentences contain idioms and a regular training corpus where sentences including idioms are marked. We release this data set and use it to perform preliminary NMT experiments as the first step towards better idiom translation.
|
| Document type | Conference contribution |
| Language | English |
| Published at | http://www.lrec-conf.org/proceedings/lrec2018/summaries/432.html |
| Other links | http://www.lrec-conf.org/proceedings/lrec2018/index.html |