Combinatorial and compositional aspects of bilingual aligned corpora

Authors	S. Martzoukos
Supervisors	M. de Rijke
Cosupervisors	C. Monz
Award date	21-10-2016
Number of pages	182
Organisations	Faculty of Science (FNWI)
Abstract	This thesis investigates the building blocks of translation in Statistical Machine Translation (SMT). We find that these building blocks, namely phrase-level dictionary entries extracted from bilingual aligned corpora (training data), admit richer structure than previously known. A rigorous explanation of the extraction mechanism shows that the resulting set of building blocks is amenable to mathematical investigation, with the potential for developing tools and new frameworks for translation. To this end, we bridge previously unaddressed gaps between graph theory and probability theory within SMT in order to derive probability mass functions for phrase-level sentence segmentations and for rules of translation. For the latter, experimental results support the claim of a statistical (principle of) compositionality of translation rules, which opens the way for future work on data generation. In addition, since the constituents of composition are the original building blocks of translation, as extracted during training, we investigate whether they generalize monolingual building blocks (phrases) and, if so, of what type. This leads to identifying the role of pointwise mutual information as the distance metric on segmentation refinements. Experiments show that such a partially ordered framework is more appropriate than a standard language model approach for finding the 'natural' building blocks of monolingual corpora.
Document type	PhD thesis
Note	Research conducted at: Universiteit van Amsterdam. Series: SIKS dissertation series 2016-42.
Language	English

UvA-DARE

Digital Academic Repository
