Ontology- and LLM-based data harmonization for federated learning in healthcare

Natallia Kokash; Lei Wang; Thomas H. Gillespie; Adam S.Z. Belloum; Paola Grosso; Sara Quinney; Lang Li; Bernard de Bono

doi:https://doi.org/10.3389/fdgth.2026.1756555

Authors	Natallia Kokash; Lei Wang; Thomas H. Gillespie; Adam S.Z. Belloum; Paola Grosso; Sara Quinney; Lang Li; Bernard de Bono
Publication date	18-03-2026
Journal	Frontiers in Digital Health
Article number	1756555
Volume | Issue number	8
Number of pages	12
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Introduction: Semantic heterogeneity across electronic health records (EHRs) limits scalable and privacy-preserving analytics in healthcare. While federated learning (FL) enables collaborative modeling without sharing raw data, it requires consistent, ontology-aligned representations. We present an ontology- and large language model (LLM)-based data harmonization approach to support secure, interoperable FL workflows. Methods: We propose a general two-step pipeline for converting or annotating clinical text into a predefined target ontology format. First, candidate concepts are retrieved from the target vocabulary using embedding-based similarity search or ontology cross-references. Second, an LLM acts as a semantic validator, accepting or rejecting candidates based on explicit equivalence or subsumption criteria. The approach is ontology-agnostic and configurable; mapping to MONDO and HPO is demonstrated as a real-world use case. Final accepted mappings were evaluated against independent human expert assessment. Results: Across two clinical datasets, expert-LLM agreement reached up to 92%, with overall performance ranging from 78% to 91% depending on candidate-generation strategy. Retrieval alone was insufficient for reliable mapping, whereas LLM-based validation substantially improved precision while complementary retrieval strategies improved recall. Discussion: The proposed pipeline transforms ontology-based harmonization from a manual expert task into a reusable and configurable workflow suitable for federated healthcare research. By combining high-recall retrieval with LLM-based semantic adjudication, the approach enables scalable, privacy-preserving conversion of heterogeneous clinical text into standardized representations across domains.
Document type	Article
Note	With supplementary material.
Language	English
Related dataset	Ontology- and LLM-based Data Alignment Evaluation: Mapping Patient Outcomes and ICD-10 codes to MONDO and HPO ontologies
Published at	https://doi.org/10.3389/fdgth.2026.1756555 (Final published version)
Other links	https://zenodo.org/records/15411810 https://www.scopus.com/pages/publications/105038112885
Downloads	fdgth-8-1756555 (Final published version)
Supplementary materials	icd10-mondo-hpo-mapping
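
The two-step pipeline described in the abstract (candidate retrieval followed by semantic validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary IDs and labels are hypothetical placeholders rather than real MONDO or HPO entries, a bag-of-words cosine similarity stands in for an embedding model, and `toy_validator` is a rule-based stand-in for the LLM call.

```python
import math
from collections import Counter
from typing import Callable

# Hypothetical toy vocabulary; IDs and labels are placeholders, not real ontology terms.
VOCAB = {
    "EX:0001": "type 2 diabetes mellitus",
    "EX:0002": "type 1 diabetes mellitus",
    "EX:0003": "essential hypertension",
    "EX:0004": "chronic kidney disease",
}


def embed(text: str) -> Counter:
    """Stand-in embedding: token counts (a real pipeline would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(text: str, vocab: dict[str, str], k: int = 2) -> list[str]:
    """Step 1: return up to k candidate concept IDs, ranked by similarity."""
    query = embed(text)
    scored = [(cid, cosine(query, embed(label))) for cid, label in vocab.items()]
    scored = [(cid, s) for cid, s in scored if s > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored[:k]]


def toy_validator(text: str, label: str) -> bool:
    """Step 2 stand-in for the LLM: accept only if every label token occurs in the text."""
    return set(label.lower().split()) <= set(text.lower().split())


def harmonize(
    text: str,
    vocab: dict[str, str],
    validate: Callable[[str, str], bool],
    k: int = 2,
) -> list[str]:
    """Retrieve candidates, then keep only those the validator accepts."""
    return [cid for cid in retrieve(text, vocab, k) if validate(text, vocab[cid])]


if __name__ == "__main__":
    note = "patient with type 2 diabetes mellitus"
    print(retrieve(note, VOCAB))                  # candidates before validation
    print(harmonize(note, VOCAB, toy_validator))  # candidates after validation
```

In this toy example retrieval returns both diabetes terms, and the validator rejects the near-miss, which mirrors the abstract's point that retrieval alone is insufficient and validation improves precision. Passing the validator as a callable keeps the retrieval step independent of whichever model performs the adjudication.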