Exploration and exploitation of multilingual data for statistical machine translation

S.C. Carter

Authors	S.C. Carter
Supervisors	M. de Rijke
Cosupervisors	C. Monz
Award date	05-12-2012
ISBN	9789461821973
Number of pages	179
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Shortly after the birth of computer science, researchers realised the importance of machine translation as a task worthy of concentrated effort, but only recently have algorithms been able to provide automatic translations usable by the masses. Modern translation systems depend on bilingual corpora, a modern Rosetta Stone, from which they learn cross-lingual relationships that can be used to translate sentences that are not in the training corpus. This data is crucial. If it is insufficient, or out-of-domain, then translation quality degrades. To improve quality, we need both to perfect methods that extract usable translations from additional multilingual resources, and to improve the constituent models of a translation system so that they better exploit existing multilingual data sets. In this thesis, we focus on these dual problems. Our approach is two-fold, and the thesis is structured accordingly. In Part I we study the problem of extracting translations from the web, with a focus on exploiting the growing predominance of microblog platforms. We present novel methods for the language identification of microblog posts, and conduct a thorough analysis of existing methods that explore these microblog posts for new translations. In Part II we study the orthogonal problem of improving language models for the tasks of reranking and source-side morphological analysis. We begin by analysing a plethora of syntactic features for reranking n-best lists output by an automatic translation system. We then present a novel algorithm that allows for exact inference from high-order hidden Markov models, which we use to segment source text input. In this way, the thesis gives insight into the retrieval of relevant training data, and introduces novel methods that better utilise existing multilingual corpora.
Document type	PhD thesis
Note	SIKS dissertation series no. 2012-46. Research conducted at: Universiteit van Amsterdam
Language	English
Downloads	Thesis; Cover; Title pages; Contents; 1: Introduction; 2: Background; 3: Experimental methodology; PART I: Exploration: introduction; 4: Language identification; 5: Exploring Twitter for microblog post translation; PART II: Exploitation: introduction; 6: Discriminative syntactic reranking; 7: Source-side morphological analysis; 8: Conclusions; A: Example tweets returned by models in chapter 5; Bibliography; Samenvatting (summary in Dutch); SIKS dissertation series

UvA-DARE

Digital Academic Repository
