An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

S. Farshidi; Z. Zhao

doi:https://doi.org/10.1007/978-3-031-05936-0_37

Authors	S. Farshidi; Z. Zhao
Publication date	2022
Host editors	J. Gama; T. Li; Y. Yu; E. Chen; Y. Zheng; F. Teng
Book title	Advances in Knowledge Discovery and Data Mining
Book subtitle	26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022 : proceedings
ISBN	9783031059353; 9783031059377
ISBN (electronic)	9783031059360
Series	Lecture Notes in Computer Science
Event	26th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2022
Volume | Issue number	II
Pages (from-to)	472-484
Number of pages	13
Publisher	Cham: Springer
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Dataset repositories continuously publish a significant number of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from various disciplines with which they are unfamiliar. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, it can be impossible to index datasets effectively enough, based on their contextual information alone, to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness). This paper presents an indexing pipeline that extends the contextual information of datasets based on their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem, where dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and in a user study.
Document type	Conference contribution
Language	English
Related dataset	An adaptable indexing pipeline for enriching meta information of datasets from heterogeneous repositories
Published at	https://doi.org/10.1007/978-3-031-05936-0_37 (Final published version)
Published at	https://zenodo.org/record/6555644 (Accepted author manuscript)
Other links	https://www.scopus.com/pages/publications/85130230485
Downloads	2022.conference.akdd.caera (Accepted author manuscript)

UvA-DARE

Digital Academic Repository
