Strategies for effective utilization of training data for machine translation

Open Access
Award date 27-10-2020
ISBN 9789464210484
Number of pages 135
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Machine Translation (MT) systems are developed with the aim of automatically translating text from one language (the source language) to another (the target language). The vast majority of present-day MT systems are data-driven: they learn how to translate between the two languages of interest by estimating the probabilities of translation patterns from large parallel corpora. For many language pairs, the training data is compiled from multiple sources that often come from different domains and can have varying degrees of translation equivalence.
One limitation of standard training mechanisms is that they assume the training data is homogeneous with respect to domain and quality. As a result, these mechanisms may learn noisy or undesirable translation patterns. This thesis explores training strategies that refine the quality of the translation model and minimize the effect of variations in the training data. Most of the proposed strategies are based on the idea of iterative knowledge transfer, in which the quality of a model is refined through the predictions of other models trained in earlier steps: knowledge gathered by the models during initial training can be refined and reused to solve varied problems, such as domain adaptation and better utilization of noisy training data. Based on this idea, the thesis proposes solutions to four different problems.
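The iterative knowledge-transfer idea above can be illustrated with a deliberately tiny numeric sketch: a model fitted in one round scores the training examples, and those scores re-weight the data for the next round, so implausible (noisy) examples gradually lose influence. The data, the weighted-mean "model", and the inverse-distance scoring rule are all hypothetical stand-ins, not the thesis's actual training procedure.

```python
def fit(data, weights):
    """Stand-in 'model': the weighted mean of the observations."""
    return sum(x * w for x, w in zip(data, weights)) / sum(weights)

def refine(data, rounds=3):
    """Iteratively re-fit, letting each round's model re-weight the data."""
    weights = [1.0] * len(data)
    for _ in range(rounds):
        model = fit(data, weights)
        # Down-weight examples the current model finds implausible.
        weights = [1.0 / (1.0 + abs(x - model)) for x in data]
    return fit(data, weights)

clean = [1.0, 1.1, 0.9, 1.0]
noisy = clean + [10.0]  # one corrupted training example
```

After a few rounds, `refine(noisy)` lies much closer to the clean value 1.0 than the plain mean does, which is the intuition behind letting earlier models' predictions refine later training.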
First, this thesis demonstrates that the quality of models used in phrase-based machine translation can be improved by re-estimating them from their own predictions on the training data.
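One way to picture re-estimating a phrase-based model from its own predictions is to collect the phrase pairs the current model actually uses in its best derivations of the training sentences, and recompute translation probabilities from those usage counts. The toy derivations below are hypothetical inputs, and this relative-frequency sketch is only an illustration of the general idea, not the thesis's exact procedure.

```python
from collections import Counter

# Hypothetical: phrase pairs used in the model's best derivation of
# each training sentence pair (German -> English).
best_derivations = [
    [("das haus", "the house"), ("ist klein", "is small")],
    [("das haus", "the house"), ("ist alt", "is old")],
    [("das haus", "the building"), ("ist alt", "is old")],
]

def reestimate(derivations):
    """Re-estimate p(target | source) from phrase-pair usage counts."""
    pair_counts = Counter()
    src_counts = Counter()
    for derivation in derivations:
        for src, tgt in derivation:
            pair_counts[(src, tgt)] += 1
            src_counts[src] += 1
    return {pair: c / src_counts[pair[0]] for pair, c in pair_counts.items()}

table = reestimate(best_derivations)
```

Because the counts come only from derivations the model itself prefers, rare or noisy phrase pairs that never appear in a best derivation drop out of the re-estimated table.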
Second, this thesis investigates the poor performance of neural MT models on sentences from domains that are underrepresented in the training data. It proposes a knowledge transfer strategy for training neural MT models with stable performance across multiple domains.
Third, we address the negative effect of low-quality training data on the performance of neural MT systems. Our research demonstrates that the performance degradation caused by noise in the training data can be reduced through knowledge transfer from a strong model trained on a small amount of high-quality data.
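Transferring knowledge from a strong, clean-data model to a model trained on noisy data is commonly formulated as a distillation-style loss: the student's objective interpolates the usual hard-label cross-entropy with cross-entropy against the teacher's soft distribution. The word-level formulation and the interpolation weight `alpha` below are standard distillation ingredients used for illustration, not necessarily the exact objective used in the thesis.

```python
import math

def kd_loss(student_probs, teacher_probs, gold_index, alpha=0.5):
    """Distillation-style loss for one target-word prediction.

    Interpolates the hard-label negative log-likelihood with the
    cross-entropy between the teacher's and student's distributions.
    """
    hard = -math.log(student_probs[gold_index])
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return alpha * hard + (1.0 - alpha) * soft

# Hypothetical next-word distributions over a 3-word vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.8, 0.15, 0.05]
```

With `alpha=1.0` the loss reduces to ordinary cross-entropy on the (possibly noisy) reference; lowering `alpha` shifts trust toward the clean-data teacher's predictions.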
Fourth, this thesis discusses the limitations of individual neural network architectures for MT. We propose a combination of two different architectures to effectively model important features of the source sentences.
Document type PhD thesis
Language English