Efficiently extract recurring tree fragments from large treebanks

Open Access
Authors
Publication date 2010
Host editors
  • N. Calzolari
  • K. Choukri
  • B. Maegaard
  • J. Mariani
  • J. Odijk
  • S. Piperidis
  • M. Rosner
  • D. Tapias
Book title Proceedings of the 7th international conference on Language Resources and Evaluation (LREC'10)
ISBN
  • 2951740867
  • 9782951740860
Event 7th international conference on Language Resources and Evaluation (LREC'10), Valletta, Malta
Pages (from-to) 219-226
Publisher European Language Resources Association (ELRA)
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
In this paper we describe FragmentSeeker, a tool capable of identifying all tree constructions that recur multiple times in a large phrase-structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees in a given treebank and computes the list of fragments they share. We describe the two notions of fragment we use, standard and partial fragments, and provide implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank, for which we present quantitative and qualitative analyses of the recurring structures obtained and report empirical running times. Finally, we propose ways in which our tool could contribute to research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.
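The pairwise comparison described in the abstract can be illustrated with a minimal sketch. This is not the FragmentSeeker implementation, and it does not try to reproduce the paper's exact definitions of standard or partial fragments. It assumes trees are nested tuples of the form (label, child, ...) with words as plain strings, and every function name below is hypothetical. For each pair of trees it collects the largest connected fragments that both contain, and skips any fragment already included in a larger one.

```python
from itertools import combinations

# Assumed encoding: a tree is a tuple (label, child, child, ...);
# a word is a plain string. A fragment uses the same encoding, with
# (label,) marking a frontier nonterminal that is left unexpanded.

def production(node):
    """Node label plus, for each child, (is_word, label_or_word)."""
    kids = tuple((isinstance(c, str), c if isinstance(c, str) else c[0])
                 for c in node[1:])
    return (node[0],) + kids

def internal_nodes(tree, path=()):
    """Yield (path, node) for every non-word node, parents first."""
    if isinstance(tree, str):
        return
    yield path, tree
    for i, child in enumerate(tree[1:]):
        yield from internal_nodes(child, path + (i,))

def grow(a, pa, b, pb, covered):
    """Largest shared fragment rooted at a and b (productions must match)."""
    covered.add((pa, pb))
    children = []
    for i, (ca, cb) in enumerate(zip(a[1:], b[1:])):
        if isinstance(ca, str):
            children.append(ca)  # identical word on both sides
        elif production(ca) == production(cb):
            children.append(grow(ca, pa + (i,), cb, pb + (i,), covered))
        else:
            children.append((ca[0],))  # stop here: frontier nonterminal
    return (a[0],) + tuple(children)

def shared_fragments(t1, t2):
    """Maximal fragments that occur in both t1 and t2."""
    covered = set()  # node pairs already inside a larger fragment
    found = []
    for pa, a in internal_nodes(t1):
        for pb, b in internal_nodes(t2):
            if (pa, pb) in covered or production(a) != production(b):
                continue
            found.append(grow(a, pa, b, pb, covered))
    return found

def recurring_fragments(treebank):
    """Union of the shared fragments over every pair of trees."""
    result = set()
    for t1, t2 in combinations(treebank, 2):
        result.update(shared_fragments(t1, t2))
    return result

t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBZ", "barks")))
t2 = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBZ", "sleeps")))
t3 = ("S", ("NP", ("PRP", "it")), ("VP", ("VBZ", "barks")))

for fragment in shared_fragments(t1, t2):
    print(fragment)
```

The sketch compares all node pairs directly, so its cost grows with the product of the tree sizes for each pair of trees; it is meant only to make the idea of a shared fragment concrete, not to reflect the efficiency of the tool.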
Document type Conference contribution
Language English
Published at http://www.lrec-conf.org/proceedings/lrec2010/summaries/613.html
Downloads
330898.pdf (Final published version)