- Exploration and exploitation of multilingual data for statistical machine translation
- Award date
- 5 December 2012
- Number of pages
- Document type
- PhD thesis
- Faculty of Science (FNWI)
- Informatics Institute (IVI)
Shortly after the birth of computer science, researchers realised the importance of machine translation as a task worth of concentrated effort, but it is only recently that algorithms are able to provide automatic translations usable by the masses. Modern translation systems are dependent on bilingual corpora, a modern Rosetta Stone, from which the learn cross-lingual relationships that can be used to translate sentences which are not in the training corpus. This data is crucial. If it is insufficient, or out-of-domain, then translation quality degrades. To improve quality, we need to both perfect methods that extract usable translation from additional multilingual resources, and improve the constituent models of a translation system to better exploit existing multilingual data sets.
In this thesis, we focus on these dual problems. Our approach is two-fold, and the thesis is structures accordingly. In part I we study the problem of extracting translations from the web, with a focus on exploiting the growing predominance of microblog platforms. We present novel methods for the language identification of microblog posts, and conduct a thorough analysis of existing methods that explore these microblog posts for new translations. In part II we study the orthogonal problem of improving language models for the tasks of reranking and source side morphological analysis. We begin by analysing a plethora of syntactic features for reranking n-best lists output from an automatic translation system. We then present a novel algorithm that allows for exact inference from high-order hidden Markov models, which we use to segment source text input. In this way, the thesis gives insight into the retrieval of relevant training data, and introduces novel methods that better utilise existing multilingual corpora.
- SIKS dissertation series no. 2012-46
Research conducted at: Universiteit van Amsterdam
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.