- Subdomain sensitive statistical parsing using raw corpora
- Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco
- Book/source title
- LREC 2008: Sixth International Conference on Language Resources and Evaluation: Proceedings
- Pages (from-to)
- European Language Resources Association (ELRA)
- Document type
- Conference contribution
- Interfacultary Research Institutes
- Institute for Logic, Language and Computation (ILLC)
Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain sensitive parsers, and explore methods for amalgamating their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.