- Syntactically lexicalized phrase-based SMT
- IEEE Transactions on Audio, Speech and Language Processing
- Volume | Issue number
- 16 | 7
- Pages (from-to)
- Document type
- Interfacultary Research Institutes
- Institute for Logic, Language and Computation (ILLC)
Until quite recently, extending phrase-based statistical machine translation (PBSMT) with syntactic knowledge caused system performance to deteriorate. The most recent successful enrichments of PBSMT with hierarchical structure either employ nonlinguistically motivated syntax for capturing hierarchical reordering phenomena, or extend the phrase translation table with redundantly ambiguous syntactic structures over phrase pairs. In this paper, we present an extended, harmonized account of our previous work which showed that incorporating linguistically motivated lexical syntactic descriptions, called supertags, can yield significantly better PBSMT systems at insignificant extra computational cost. We describe a novel PBSMT model that integrates supertags into the target language model and the target side of the translation model. Two kinds of supertags are employed: those from lexicalized tree-adjoining grammar and combinatory categorial grammar. Despite the differences between the two sets of supertags, they give similar improvements. In addition to integrating the Markov supertagging approach in PBSMT, we explore the utility of a new surface grammaticality measure based on combinatory operators. We perform various experiments on the Arabic-to-English NIST 2005 test set addressing the issues of sparseness, scalability, and the utility of system subcomponents. We show that even when the parallel training data grows very large, the supertagged system retains a relatively stable absolute performance advantage over the unadorned PBSMT system. Arguably, this hints at a performance gap that cannot be bridged by acquiring more phrase pairs. Our best result shows a relative improvement of 6.1% over a state-of-the-art PBSMT model, which compares favorably with the leading systems on the NIST 2005 task. We also demonstrate that the advantages of a supertag-based system carry over to German-English, where improvements of up to 8.9% relative to the baseline system are observed.
- go to publisher's site
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.