The impact of document structure on keyphrase extraction

K. Hofmann; M. Tsagkias; E. Meij; M. de Rijke

doi:https://doi.org/10.1145/1645953.1646215

Authors	K. Hofmann; M. Tsagkias; E. Meij; M. de Rijke
Publication date	2009
Host editors	D. Cheung; I.-Y. Song; W. Chu; X. Hu; J. Lin; J. Li; Z. Peng
Book title	Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong
ISBN	9781605585123
Event	18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China
Pages (from-to)	1725-1728
Publisher	ACM
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Document structure may contain useful information about which parts or phrases of a document are important, but has rarely been considered as a source of information for keyphrase extraction. We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new, large corpus that consists of full-text journal articles, where the rich collection and document structure available at the publishing stage is explicitly annotated. We explore features based on the XML tags contained in the documents, and based on generic section types derived using position and cue words in section titles. For XML tags we find sections, abstract, and title to perform best, but many smaller elements may be beneficial in combination with other features. Of the generic section types, the discussion section is found to be most useful for keyphrase extraction.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1145/1645953.1646215 (Final published version)

UvA-DARE

Digital Academic Repository
