Ontology- and LLM-based data harmonization for federated learning in healthcare

Open Access
Authors
  • Paola Grosso
  • Sara Quinney
  • Lang Li
  • Bernard de Bono
Publication date 18-03-2026
Journal Frontiers in Digital Health
Article number 1756555
Volume 8
Number of pages 12
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Introduction: Semantic heterogeneity across electronic health records (EHRs) limits scalable and privacy-preserving analytics in healthcare. While federated learning (FL) enables collaborative modeling without sharing raw data, it requires consistent, ontology-aligned representations. We present an ontology- and large language model (LLM)-based data harmonization approach to support secure, interoperable FL workflows. 

Methods: We propose a general two-step pipeline for converting or annotating clinical text into a predefined target ontology format. First, candidate concepts are retrieved from the target vocabulary using embedding-based similarity search or ontology cross-references. Second, an LLM acts as a semantic validator, accepting or rejecting candidates based on explicit equivalence or subsumption criteria. The approach is ontology-agnostic and configurable; mapping to MONDO and HPO is demonstrated as a real-world use case. Final accepted mappings were evaluated against independent human expert assessment. 
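The two-step pipeline described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary IDs (`EX:0001` and so on) are placeholders rather than real MONDO or HPO entries, `embed` is a toy character-trigram embedding standing in for a real embedding model, and `stub_validator` is a simple rule standing in for the LLM call that would judge equivalence or subsumption.

```python
import math
from collections import Counter
from typing import Callable

# Hypothetical mini-vocabulary: IDs and labels are placeholders,
# not real MONDO/HPO entries.
VOCAB = {
    "EX:0001": "type 2 diabetes mellitus",
    "EX:0002": "diabetes mellitus",
    "EX:0003": "asthma",
    "EX:0004": "chronic kidney disease",
}


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(term: str, vocab: dict, top_k: int = 3) -> list:
    """Step 1: rank vocabulary concepts by similarity to the source term."""
    query = embed(term)
    scored = [(cid, cosine(query, embed(label))) for cid, label in vocab.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Step 2 interface: (source term, candidate label) -> accept or reject.
Validator = Callable[[str, str], bool]


def stub_validator(term: str, label: str) -> bool:
    """Placeholder for the LLM validator (an assumption of this sketch).

    Accepts a candidate when every word of its label occurs in the source
    term, a crude proxy for "candidate equals or subsumes the term".
    """
    return set(label.lower().split()) <= set(term.lower().split())


def harmonize(term: str, vocab: dict, validate: Validator, top_k: int = 3) -> list:
    """Retrieve candidates, then keep only those the validator accepts."""
    return [cid for cid, _ in retrieve(term, vocab, top_k) if validate(term, vocab[cid])]


if __name__ == "__main__":
    print(harmonize("Type 2 diabetes mellitus", VOCAB, stub_validator))
```

Keeping the validator behind a plain callable means the retrieval step can be swapped (for example, for ontology cross-references) without touching the acceptance logic, which mirrors the configurable design the abstract describes.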

Results: Across two clinical datasets, expert-LLM agreement reached up to 92%, with overall performance ranging from 78% to 91% depending on candidate-generation strategy. Retrieval alone was insufficient for reliable mapping, whereas LLM-based validation substantially improved precision while complementary retrieval strategies improved recall. 
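As an illustration of how figures of this kind can be computed, the sketch below compares LLM accept/reject decisions with expert decisions over the same candidate mappings. It assumes that agreement means the fraction of candidates on which both decisions coincide, and that precision and recall treat the expert decision as ground truth; the abstract does not spell out the exact definitions, and the decision lists are invented toy data, not results from the study.

```python
def agreement(llm: list, expert: list) -> float:
    """Fraction of candidate mappings where LLM and expert decisions match."""
    if len(llm) != len(expert) or not llm:
        raise ValueError("decision lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(llm, expert)) / len(llm)


def precision_recall(llm: list, expert: list) -> tuple:
    """Precision and recall of LLM acceptances, with expert labels as truth."""
    tp = sum(a and b for a, b in zip(llm, expert))        # both accept
    fp = sum(a and not b for a, b in zip(llm, expert))    # LLM accepts, expert rejects
    fn = sum(b and not a for a, b in zip(llm, expert))    # LLM rejects, expert accepts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy decisions (True = mapping accepted); purely illustrative.
    llm_decisions = [True, True, False, True, False]
    expert_decisions = [True, False, False, True, True]
    print(agreement(llm_decisions, expert_decisions))
    print(precision_recall(llm_decisions, expert_decisions))
```

Note that recall computed this way only covers candidates that reached the validator; concepts the retrieval step never proposed would need a separate count.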

Discussion: The proposed pipeline transforms ontology-based harmonization from a manual expert task into a reusable and configurable workflow suitable for federated healthcare research. By combining high-recall retrieval with LLM-based semantic adjudication, the approach enables scalable, privacy-preserving conversion of heterogeneous clinical text into standardized representations across domains.

Document type Article
Note With supplementary material.
Language English
Published at https://doi.org/10.3389/fdgth.2026.1756555
Other links https://zenodo.org/records/15411810 https://www.scopus.com/pages/publications/105038112885
Downloads
fdgth-8-1756555 (Final published version)
Supplementary materials