An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

Open Access
Authors
Publication date 2022
Host editors
  • J. Gama
  • T. Li
  • Y. Yu
  • E. Chen
  • Y. Zheng
  • F. Teng
Book title Advances in Knowledge Discovery and Data Mining
Book subtitle 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022 : proceedings
ISBN
  • 9783031059353
  • 9783031059377
ISBN (electronic)
  • 9783031059360
Series Lecture Notes in Computer Science
Event 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2022
Volume | Issue number II
Pages (from-to) 472-484
Number of pages 13
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Dataset repositories continuously publish a significant number of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines that are unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, it can be impossible to index datasets effectively enough, using their contextual information alone, to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness). This paper presents an indexing pipeline that extends the contextual information of datasets according to their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem, where dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and in a user study.
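
The keyword-based part of the enrichment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only keyword matching, not the topic-modeling step, and the vocabulary, field names (`title`, `description`, `domains`, `matched_keywords`), and function names are all hypothetical.

```python
import re

# Hypothetical domain vocabularies. The paper's actual rules and keyword
# lists (e.g. essential variables) are not given in this record.
DOMAIN_KEYWORDS = {
    "oceanography": {"sea surface temperature", "salinity"},
    "biodiversity": {"species abundance", "taxonomy"},
}


def normalize(text):
    """Lowercase and collapse runs of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def enrich(record, vocab=DOMAIN_KEYWORDS):
    """Return a copy of a metadata record with assumed domain annotations added."""
    raw = record.get("title", "") + " " + record.get("description", "")
    # Pad with spaces so keywords only match on whole-word boundaries.
    text = " " + normalize(raw) + " "
    matched = {}
    for domain, keywords in vocab.items():
        hits = sorted(k for k in keywords if f" {normalize(k)} " in text)
        if hits:
            matched[domain] = hits
    enriched = dict(record)
    enriched["domains"] = sorted(matched)
    enriched["matched_keywords"] = matched
    return enriched


record = {
    "title": "Sea-Surface Temperature, 2010-2020",
    "description": "Monthly salinity grids.",
}
print(enrich(record)["domains"])
```

Normalizing both the record text and the keywords before matching lets differently formatted metadata from heterogeneous repositories (hyphenation, capitalization, punctuation) map onto the same vocabulary entry.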

Document type Conference contribution
Language English
Related dataset An adaptable indexing pipeline for enriching meta information of datasets from heterogeneous repositories
Published at https://doi.org/10.1007/978-3-031-05936-0_37
Published at https://zenodo.org/record/6555644
Other links https://www.scopus.com/pages/publications/85130230485
Downloads
2022.conference.akdd.caera (Accepted author manuscript)