Data quality control and research asset discovery for open science
| Authors | |
|---|---|
| Supervisors | |
| Award date | 27-05-2025 |
| ISBN | |
| Number of pages | 141 |
| Organisations | |
| Abstract | Open science is a transformative movement advocating more open and collaborative research practices, in which publications, data, software, and other academic outputs are shared at the earliest stages and made available for reuse. In line with this trend, Data Quality Control (DQC) plays a crucial role in ensuring the quality of data and, thus, the correctness and reliability of scientific findings. We investigate Active Learning (AL), which interactively queries human annotators to label the most informative data points, thereby reducing the labeling burden on experts. Besides data, there has been a growing emphasis on sharing other types of research assets, such as code, computational notebooks, and software tools, to improve the reproducibility of research and facilitate collaboration across disciplines. However, the proliferation of research assets introduced by the open science movement can lead to information overload. We propose DeCNR, which models computational notebooks as bi-modal data (text and code) and uses a fused sparse-dense model for computational notebook retrieval. Building on this work, we propose MRAS, a search system capable of indexing various types of research assets from heterogeneous data sources, enabling users to discover a wide range of research resources through a single search interface. In summary, this thesis addresses two crucial aspects of open science: Data Quality Control (DQC) and Research Asset Discovery (RAD). DQC aims to ensure that data is reliable and trustworthy, while RAD seeks to facilitate the efficient retrieval of high-quality research assets. |
| Document type | PhD thesis |
| Language | English |
