Combinatorial and compositional aspects of bilingual aligned corpora
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 21-10-2016 |
| Number of pages | 182 |
| Organisations | |
| Abstract | The subject of investigation of this thesis is the building blocks of translation in Statistical Machine Translation (SMT). We find that these building blocks, namely phrase-level dictionary entries, which are extracted from bilingual aligned corpora (training data), admit richer structure than previously known. A rigorous explanation of the extraction mechanism shows that the resulting set of building blocks is amenable to mathematical investigation with the potential of developing tools and new frameworks for translation. To this end we bridge previously unseen gaps between graph theory and probability theory within SMT in order to derive probability mass functions for phrase-level sentence segmentations and rules of translation. For the latter, experimental results support the claim of a statistical (principle of) compositionality of translation rules which fosters future work on data generation. In addition, since the constituents of composition are the original building blocks of translation, as extracted from the training process, we investigate whether they generalize monolingual building blocks (phrases), and if so, of what type. This leads to identifying the role of pointwise mutual information as the distance metric on segmentation refinements. Experiments show that such a partially ordered framework is more appropriate than a standard language model approach for finding the 'natural' building blocks of monolingual corpora. |
| Document type | PhD thesis |
| Note | Research conducted at: Universiteit van Amsterdam Series: SIKS dissertation series 2016-42 |
| Language | English |
