Induction of latent domains in heterogeneous corpora: a case study of word alignment

Authors
Publication date 12-2017
Journal Machine Translation
Volume | Issue number 31 | 4
Pages (from-to) 225-249
Number of pages 25
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Science (FNWI)
Abstract

This paper focuses on the insensitivity of existing word alignment models to domain differences, which often yields suboptimal results on large heterogeneous data. A novel latent domain word alignment model is proposed, which induces domain-focused lexical and alignment statistics. We propose to train the model on a heterogeneous corpus under partial supervision, using a small number of seed samples from different domains. The seed samples allow estimating sharper, domain-focused word alignment statistics for sentence pairs. Our experiments show that the derived domain-focused statistics, once combined, produce significant improvements both in word alignment accuracy and in the translation accuracy of the resulting SMT systems. Going beyond these findings, we surmise that virtually any large corpus (e.g., Europarl, Hansards, Common Crawl) harbors a diversity of hidden domains, unknown in advance. We address the novel challenge of unsupervised induction of hidden domains in parallel corpora, applied within a domain-focused word-alignment modeling framework. On the technical side, we contrast flat estimation for the unsupervised induction of domains with a simple form of hierarchical estimation consisting of two steps, aimed at avoiding bad local maxima. Extensive experiments, conducted over seven different language pairs with fully unsupervised induction of domains for word alignment, demonstrate significant improvements in alignment accuracy.
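The abstract describes a latent-domain alignment model trained with partial supervision from seed samples. As a rough illustration of the idea, and not the authors' actual model, the following sketch runs EM over a toy latent-domain variant of IBM Model 1: each sentence pair carries a hidden domain, seeded pairs are clamped to their known domain, and each domain maintains its own lexical translation table. All names, the toy corpus, and the exact update scheme are assumptions for illustration.

```python
import math
from collections import defaultdict

def latent_domain_ibm1(pairs, n_domains=2, iters=25, seeds=None):
    """Toy EM for latent-domain IBM Model 1 (illustrative sketch).

    pairs: list of (src_tokens, tgt_tokens) sentence pairs.
    seeds: {pair_index: domain_id} -- partial supervision; seeded pairs
           are clamped to their domain (the "seed samples" of the paper).
    Returns (domain_prior, trans) with trans[z][(f, e)] ~ p(f | e, domain z).
    """
    seeds = seeds or {}
    src_vocab = {w for s, _ in pairs for w in s}
    uniform = 1.0 / len(src_vocab)
    trans = [defaultdict(lambda: uniform) for _ in range(n_domains)]
    prior = [1.0 / n_domains] * n_domains

    for _ in range(iters):
        counts = [defaultdict(float) for _ in range(n_domains)]
        totals = [defaultdict(float) for _ in range(n_domains)]
        z_counts = [1e-9] * n_domains
        for i, (src, tgt) in enumerate(pairs):
            # E-step over domains: p(z | pair) ∝ p(z) * IBM1 likelihood.
            log_r = []
            for z in range(n_domains):
                lp = math.log(prior[z])
                for f in src:
                    lp += math.log(sum(trans[z][(f, e)] for e in tgt) / len(tgt))
                log_r.append(lp)
            if i in seeds:
                # Partial supervision: seed pairs get a one-hot domain.
                resp = [1.0 if z == seeds[i] else 0.0 for z in range(n_domains)]
            else:
                m = max(log_r)
                w = [math.exp(l - m) for l in log_r]
                resp = [x / sum(w) for x in w]
            # E-step over alignments, weighted by domain responsibility.
            for z in range(n_domains):
                if resp[z] == 0.0:
                    continue
                z_counts[z] += resp[z]
                for f in src:
                    norm = sum(trans[z][(f, e)] for e in tgt)
                    for e in tgt:
                        c = resp[z] * trans[z][(f, e)] / norm
                        counts[z][(f, e)] += c
                        totals[z][e] += c
        # M-step: renormalise each domain's lexical table and the prior.
        for z in range(n_domains):
            trans[z] = defaultdict(lambda: uniform)
            for (f, e), c in counts[z].items():
                trans[z][(f, e)] = c / totals[z][e]
        total = sum(z_counts)
        prior = [c / total for c in z_counts]
    return prior, trans
```

With an ambiguous source word (here a hypothetical French "vol", meaning "flight" in a travel domain and "theft" in a crime domain), the two domain-specific tables diverge, mirroring the "sharper, domain-focused statistics" the abstract refers to.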

Document type Article
Language English
Published at https://doi.org/10.1007/s10590-018-9215-9
Other links https://www.scopus.com/pages/publications/85044087365