The Journal of Prevention of Alzheimer's Disease

A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts.

BACKGROUND: Harmonizing diverse healthcare datasets is challenging because naming conventions are inconsistent. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer's disease research. Large language models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.
OBJECTIVES: To evaluate how different text-embedding models perform in the harmonization of clinical variables.
DESIGN AND SETTING: We created a novel benchmark to assess how well embeddings from different language models can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer's disease cohorts. We evaluated five state-of-the-art text-embedding models on seven datasets in the context of Alzheimer's disease.
PARTICIPANTS: No patient data were used in any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.
MEASUREMENTS: Text descriptions of variables from four modalities were included in the analyses: clinical, lifestyle, demographics, and imaging.
RESULTS: Our benchmark favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer's disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.
CONCLUSIONS: Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer's disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
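
To make the evaluation idea concrete, the following is a minimal sketch of embedding-based variable matching scored by top-k accuracy. It is not the ADHTEB library's API and not necessarily the metric used in the benchmark. The function names, the three Common Data Model concepts, and the cohort variable descriptions are all hypothetical, and `toy_embed` (character trigram counts) is only a stand-in for a real text-embedding model so that the example runs without external dependencies.

```python
import math
from collections import Counter


def toy_embed(text: str, n: int = 3) -> Counter:
    """Hypothetical stand-in for a text-embedding model: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_concepts(description: str, concepts: dict, embed=toy_embed) -> list:
    """Rank CDM concepts by similarity to a cohort variable description."""
    query = embed(description)
    scored = [(name, cosine(query, embed(text))) for name, text in concepts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k_accuracy(mappings: list, concepts: dict, k: int = 1, embed=toy_embed) -> float:
    """Fraction of cohort variables whose expected concept is among the top k."""
    hits = 0
    for description, expected in mappings:
        top = [name for name, _ in rank_concepts(description, concepts, embed)[:k]]
        hits += expected in top
    return hits / len(mappings)


# Hypothetical Common Data Model concepts (identifier -> description).
CDM_CONCEPTS = {
    "age": "Age of the participant at baseline visit",
    "mmse": "Mini-Mental State Examination total score",
    "education": "Years of formal education",
}

# Hypothetical cohort variable descriptions with their curated target concept.
COHORT_MAPPINGS = [
    ("Participant age at baseline", "age"),
    ("MMSE total score", "mmse"),
    ("Education in years", "education"),
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"top-{k} accuracy:", top_k_accuracy(COHORT_MAPPINGS, CDM_CONCEPTS, k=k))
```

Passing a different `embed` callable to `top_k_accuracy` is how competing models could be compared on the same curated mappings in this sketch; a real model would return dense vectors, so `cosine` would need to be adapted accordingly.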
