Exploration and exploitation of multilingual data for statistical machine translation

Open Access
Authors
  • S.C. Carter
Supervisors
Cosupervisors
Award date 05-12-2012
ISBN
  • 9789461821973
Number of pages 179
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Shortly after the birth of computer science, researchers realised the importance of machine translation as a task worth of concentrated effort, but it is only recently that algorithms are able to provide automatic translations usable by the masses. Modern translation systems are dependent on bilingual corpora, a modern Rosetta Stone, from which the learn cross-lingual relationships that can be used to translate sentences which are not in the training corpus. This data is crucial. If it is insufficient, or out-of-domain, then translation quality degrades. To improve quality, we need to both perfect methods that extract usable translation from additional multilingual resources, and improve the constituent models of a translation system to better exploit existing multilingual data sets.
In this thesis, we focus on these dual problems. Our approach is two-fold, and the thesis is structures accordingly. In part I we study the problem of extracting translations from the web, with a focus on exploiting the growing predominance of microblog platforms. We present novel methods for the language identification of microblog posts, and conduct a thorough analysis of existing methods that explore these microblog posts for new translations. In part II we study the orthogonal problem of improving language models for the tasks of reranking and source side morphological analysis. We begin by analysing a plethora of syntactic features for reranking n-best lists output from an automatic translation system. We then present a novel algorithm that allows for exact inference from high-order hidden Markov models, which we use to segment source text input. In this way, the thesis gives insight into the retrieval of relevant training data, and introduces novel methods that better utilise existing multilingual corpora.
Document type PhD thesis
Note SIKS dissertation series no. 2012-46 Research conducted at: Universiteit van Amsterdam
Language English
Downloads
Permalink to this page
cover
Back