An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories
| Authors | |
|---|---|
| Publication date | 2022 |
| Host editors | |
| Book title | Advances in Knowledge Discovery and Data Mining |
| Book subtitle | 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022 : proceedings |
| ISBN | |
| ISBN (electronic) | |
| Series | Lecture Notes in Computer Science |
| Event | 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2022 |
| Volume | II |
| Pages (from-to) | 472-484 |
| Number of pages | 13 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | Dataset repositories continuously publish a significant number of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, it can be impossible to index datasets effectively enough, using their contextual information alone, to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness). This paper presents an indexing pipeline that extends the contextual information of datasets based on their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem, where dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach on a manually created gold standard and in a user study. |
| Document type | Conference contribution |
| Language | English |
| Related dataset | An adaptable indexing pipeline for enriching meta information of datasets from heterogeneous repositories |
| Published at | https://doi.org/10.1007/978-3-031-05936-0_37 |
| Published at | https://zenodo.org/record/6555644 |
| Other links | https://www.scopus.com/pages/publications/85130230485 |
| Downloads | 2022.conference.akdd.caera (Accepted author manuscript) |
