CC BY-NC-ND 4.0 · Yearb Med Inform 2022; 31(01): 262-272
DOI: 10.1055/s-0042-1742522
Section 11: Public Health and Epidemiology Informatics

Towards an Interoperable Ecosystem of Research Cohort and Real-world Data Catalogues Enabling Multi-center Studies

Morris Swertz*
1   Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
Esther van Enckevort
1   Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
José Luis Oliveira
2   DETI/IEETA, University of Aveiro, Portugal
Isabel Fortier
3   Research Institute of the McGill University Health Center, Montreal, Canada
Julie Bergeron
3   Research Institute of the McGill University Health Center, Montreal, Canada
Nicolas H. Thurin
4   Univ. Bordeaux, INSERM CIC-P 1401, Bordeaux PharmacoEpi, Bordeaux, France
Eleanor Hyde
1   Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
Alexander Kellmann
1   Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
Romin Pajouheshnia
5   University of Utrecht, The Netherlands
Miriam Sturkenboom
6   Department of Datascience & Biostatistics, Julius Center, University Medical Center Utrecht, Utrecht, The Netherlands
Marianne Cunnington
7   GlaxoSmithKline, Stevenage, Herts, SG1 2NY, UK
Anne-Marie Nybo Andersen
8   University of Copenhagen, Copenhagen, Denmark
Yannick Marcon
9   Epigeny, France
Gonçalo Gonçalves
10   Human-Centered Computing and Information Science, INESC TEC, Portugal
Rosa Gini*
11   ARS Toscana, Florence, Italy


Objectives: Existing individual-level human data cover large populations on many dimensions, such as lifestyle, demography, laboratory measures, and clinical parameters. To capitalise on this promise, recent years have seen large investments in data catalogues that FAIRify data descriptions, i.e., make catalogue contents more Findable, Accessible, Interoperable and Reusable. However, the valuable diversity of these catalogues has also created heterogeneity, which makes it challenging to exploit their richness optimally.

Methods: In this opinion review, we analyse catalogues for human subject research ranging from cohort studies to surveillance, administrative and healthcare records.

Results: We observe that, although these catalogues are heterogeneous, have various scopes, and use different terminologies, the underlying concepts seem potentially harmonisable. We propose a unified framework to enable catalogue data sharing, with catalogues of multi-center cohorts nested as a special case in catalogues of real-world data sources. Moreover, we list recommendations to create an integrated community of metadata catalogues and an open catalogue ecosystem to sustain these efforts and maximise impact.

Conclusions: We propose to embrace the autonomy of motivated catalogue teams and invest in their collaboration via minimal standardisation efforts, such as clear data licensing; persistent identifiers for linking the same records between catalogues; minimal metadata ‘common data elements’ using shared ontologies; and symmetric architectures for data sharing (push/pull) with clear provenance tracks to process updates and acknowledge original contributors. Most importantly, we encourage the creation of environments for collaboration and resource sharing between catalogue developers, building on international networks such as OpenAIRE and the Research Data Alliance, as well as domain-specific ESFRIs such as BBMRI and ELIXIR.

* Corresponding authors


Publication History

Article published online:
04 December 2022

© 2022. IMIA and Thieme. This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
