FAIVOR – a push-button system for AI validation within the hospital

Research Article. Article ID: mibe000286. DOI: 10.3205/mibe000286. URN: urn:nbn:de:0183-mibe0002862

German title: FAIVOR – ein Push-Button-System zur KI-Validierung in einem Krankenhaus

Authors:

Daniël Slob (daniel.slob@maastrichtuniversity.nl), Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands

Ekaterina Akhmad, Netherlands eScience Center, Amsterdam, The Netherlands

Jesus Garcia González, Netherlands eScience Center, Amsterdam, The Netherlands

Saba Amiri, Netherlands eScience Center, Amsterdam, The Netherlands

Vedran Kasalica, Netherlands eScience Center, Amsterdam, The Netherlands

Sonja Georgievska, Netherlands eScience Center, Amsterdam, The Netherlands

Ananya Choudhury, Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands

Aiara Lobo Gomes, Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands

Andre Dekker, Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands

Johan van Soest, Department of Radiation Oncology (Maastro), GROW Research Institute for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands; Brightlands Institute for Smart Society (BISS), Faculty of Science and Engineering, Maastricht University, Heerlen, The Netherlands

Publisher: German Medical Science GMS Publishing House, Düsseldorf

610 artificial intelligence medicine evaluation adaptation deployment künstliche Intelligenz Medizin Evaluation Anpassung Anwendung EFMI STC 2025 20251017 engl This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung). 1860-9171 21 GMS Medizinische Informatik, Biometrie und Epidemiologie GMS Med Inform Biom Epidemiol 14 Eine besondere Herausforderung für vortrainierte medizinische KI-Modelle ist ihre Validierung anhand neuer, bisher unbekannter Daten. Entwickler trainieren ein Modell anhand einer begrenzten Patientengruppe, und Anwender müssen sicherstellen, dass sie es auf eine klinische Population ohne statistisch signifikante Unterschiede anwenden. Das FAIVOR-Tool zielt darauf ab, diese Probleme zu lösen, indem es ein robustes System vorschlägt, das die Bewertung und Anpassung vortrainierter medizinischer KI-Systeme an neue Datensätze ermöglicht. Das Tool umfasst den Ansatz zur Containerisierung von KI-Modellen, ein KI-Modell-Repository und ein Tool zum Abrufen und Bewerten von KI-Modellen auf den lokalen Datensätzen. Diese Ziele gewährleisten eine verantwortungsvolle, transparente und anpassungsfähige Verwendung vortrainierter Modelle in verschiedenen klinischen Umgebungen.</Pgraph></Abstract> <Abstract language="en" linked="yes"><Pgraph>One notable challenge for pretrained medical AI models is their validation on new, unseen data. Developers train a model on a limited patient population, and users must ensure that the clinical population on which they deploy it does not differ in a statistically significant way from that training population. The FAIVOR tool aims to address these matters by proposing a robust system that enables the evaluation and adaptation of pretrained medical AI systems to new datasets. 
The tool comprises an approach for containerizing AI models, an AI model repository, and an application to fetch and evaluate AI models on local datasets. These components are intended to ensure responsible, transparent, and adaptable use of pretrained models in diverse clinical settings.</Pgraph></Abstract> <TextBlock name="1 Introduction" linked="yes"> <MainHeadline>1 Introduction</MainHeadline><Pgraph>As artificial intelligence (AI) systems continue to gain importance in healthcare, ensuring their responsible deployment into clinical practice is crucial. One important aspect of deployment is assessing whether the system can achieve its set objectives on new, unseen data (hereafter referred to as “runtime data”). This process of local validation, however, is privacy-sensitive and requires measures that preserve data privacy while accommodating heterogeneous model architectures. Our FAIVOR tool is designed to tackle these challenges by facilitating local validation with the following considerations in mind: (1) strict data privacy compliance, (2) model-agnostic adaptability to various platforms and (3) robust evaluation beyond standard performance metrics. </Pgraph><Pgraph>To meet these challenges, the FAIVOR tool is developed with the following objectives: (1) containerization of AI models so that they can be downloaded inside the healthcare organisation, (2) a repository to register, search and find AI models, with references to the containerized AI models, and (3) a tool that fetches AI models from the registry and evaluates them on a local dataset. 
These objectives ensure responsible, transparent, and adaptable use of pretrained models in diverse clinical settings.</Pgraph></TextBlock> <TextBlock name="2 Methods" linked="yes"> <MainHeadline>2 Methods</MainHeadline><SubHeadline>2.1 Requirements for local validation of AI models</SubHeadline><Pgraph>Descriptive statistics of all input features and of the patient cohort should be reported to assess cohort similarity. The statistical performance of the AI model should be evaluated based on the needs of the respective clinical context <TextLink reference="1"></TextLink>, <TextLink reference="2"></TextLink>, <TextLink reference="3"></TextLink> and translated into essential metrics <TextLink reference="4"></TextLink>, <TextLink reference="5"></TextLink>, which are specified in the model metadata <TextLink reference="4"></TextLink>, <TextLink reference="6"></TextLink>. The documentation should report all requirements for input data, applicability criteria, essential metrics, and the results of previous evaluations that can be used as a reference when assessing the local validation.</Pgraph><Pgraph>For reproducibility of testing, the local validation should be complemented with adequate data registration. Therefore, the complete description of AI models and their validation results in different clinical settings should be stored in a general repository of AI models, which also provides the model in an executable format.</Pgraph><SubHeadline>2.2 Preparation of AI models</SubHeadline><Pgraph>A commonly used tool for model containerization is Docker. A Docker image can be built that includes the specified dependencies and their versions, together with the functions that calculate the actual prediction (e.g. the weight coefficients in the case of linear regression models). To interact with the model, a standardised REST API will be embedded in the Docker container. </Pgraph><Pgraph>The model validation requires access to the model description. Halilaj et al. presented an open-source repository for clinical prediction models <TextLink reference="7"></TextLink>. 
However, its model reports store only model coefficients, and it provides no means of validation. The repository for our project should provide a comprehensive model description that follows the FAIR principles as applied to AI models, to enable interoperability of the AI model.</Pgraph><SubHeadline>2.3 Tool to fetch the AI models from the registry</SubHeadline><Pgraph>A GUI-based application will retrieve models from the FAIR registry, run local validations, generate statistics, and publish results. It simplifies model evaluation and helps identify performance trends across hospitals and over time.</Pgraph></TextBlock> <TextBlock name="3 Results" linked="yes"> <MainHeadline>3 Results</MainHeadline><Pgraph>The FAIVOR tool is an on-premises system for evaluating and adapting pretrained medical machine learning models to new datasets. Its architecture is shown in Figure 1 <ImgLink imgNo="1" imgType="figure" />, and its source code is available at <Hyperlink href="https://github.com/MaastrichtU-BISS/FAIRmodels-validator">https://github.com/MaastrichtU-BISS/FAIRmodels-validator</Hyperlink>. </Pgraph><Pgraph>The AI model described in van Stiphout et al. <TextLink reference="8"></TextLink> serves as a use case to illustrate the FAIVOR tool architecture. Before starting their validation job, the developers of <TextLink reference="8"></TextLink> had uploaded their AI model – with corresponding metadata – to the model repository (<Hyperlink href="https://v2.fairmodels.org/instance/3f400afb-df5e-4798-ad50-0687dd439d9b">https://v2.fairmodels.org/instance/3f400afb-df5e-4798-ad50-0687dd439d9b</Hyperlink>). The repository guided the developers in generating FAIR metadata for their model. The model was then packaged in Docker images, following the instructions at <Hyperlink href="https://github.com/MaastrichtU-BISS/FAIRmodels-model-package/">https://github.com/MaastrichtU-BISS/FAIRmodels-model-package/</Hyperlink>. 
Once these requirements were met, the URI (URL) of the model metadata <TextLink reference="8"></TextLink> could be fetched from the repository. The newly created Docker image containing the AI system was then pulled from the location specified in the model metadata and used for validation. </Pgraph><Pgraph>An intuitive GUI guided the developers in setting up their validation job. The GUI included features that allowed the developers to track the status of each validation job, document the validation job in a standardized format, select from a range of predetermined metrics based on <TextLink reference="4"></TextLink>, <TextLink reference="5"></TextLink> (e.g. PPV, NPV), and calculate (I) summary statistics for continuous and categorical features and (II) performance evaluation metrics. </Pgraph></TextBlock> <TextBlock name="4 Discussion and conclusions" linked="yes"> <MainHeadline>4 Discussion and conclusions</MainHeadline><Pgraph>A robust, privacy-preserving tool has been developed to evaluate and adapt pretrained medical machine learning models to new datasets. The FAIVOR tool shows similarities to existing tools such as MONAI <TextLink reference="9"></TextLink>, MLflow and EvalAI <TextLink reference="10"></TextLink>. It differs from them by supporting FAIR metadata generation, focusing specifically on the clinical context, and facilitating local validation for models that use tabular data as input. Limitations, however, should be acknowledged, as the tool is still under development. At this stage, the tool only supports AI systems trained on tabular data; medical images are not yet supported. 
Future work includes developing the ability to upload evaluation results into the registry and options to compare validation jobs to facilitate on-premises continuous monitoring.</Pgraph></TextBlock> <TextBlock name="Notes" linked="yes"> <MainHeadline>Notes</MainHeadline><SubHeadline>Author’s ORCID</SubHeadline><Pgraph>Daniël Slob: <Hyperlink href="https://orcid.org/0009-0002-2084-5660">0009-0002-2084-5660</Hyperlink></Pgraph><SubHeadline>Competing interests</SubHeadline><Pgraph>The authors declare that they have no competing interests.</Pgraph></TextBlock> <References linked="yes"> <Reference refNo="1"> <RefAuthor>Bibb A</RefAuthor> <RefAuthor>Dreyer K</RefAuthor> <RefAuthor>Stibolt R</RefAuthor> <RefAuthor>Agarwal S</RefAuthor> <RefAuthor>Coombs L</RefAuthor> <RefAuthor>Treml L</RefAuthor> <RefAuthor>Elkholy M</RefAuthor> <RefAuthor>Brink L</RefAuthor> <RefAuthor>Wald C</RefAuthor> <RefTitle>Evaluation and Real-World Performance Monitoring of Artificial Intelligence Models in Clinical Practice: Try It, Buy It, Check It</RefTitle> <RefYear>2022</RefYear> <RefJournal>JACR</RefJournal> <RefPage>1489-96</RefPage> <RefTotal>Bibb A, Dreyer K, Stibolt R, Agarwal S, Coombs L, Treml L, Elkholy M, Brink L, Wald C. Evaluation and Real-World Performance Monitoring of Artificial Intelligence Models in Clinical Practice: Try It, Buy It, Check It. JACR. 2022 Nov;18(11):1489-96. DOI: 10.1016/j.jacr.2021.08.022</RefTotal> <RefLink>https://doi.org/10.1016/j.jacr.2021.08.022</RefLink> </Reference> <Reference refNo="4"> <RefAuthor>Binuya MAE</RefAuthor> <RefAuthor>Engelhardt EG</RefAuthor> <RefAuthor>Schats W</RefAuthor> <RefAuthor>Schmidt MK</RefAuthor> <RefAuthor>Steyerberg EW</RefAuthor> <RefTitle>Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review</RefTitle> <RefYear>2022</RefYear> <RefJournal>BMC Med Res Methodol</RefJournal> <RefPage>316</RefPage> <RefTotal>Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. 
Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022 Dec 12;22(1):316. DOI: 10.1186/s12874-022-01801-8</RefTotal> <RefLink>https://doi.org/10.1186/s12874-022-01801-8</RefLink> </Reference> <Reference refNo="6"> <RefAuthor>Tanguay W</RefAuthor> <RefAuthor>Acar P</RefAuthor> <RefAuthor>Fine B</RefAuthor> <RefAuthor>Abdolell M</RefAuthor> <RefAuthor>Gong B</RefAuthor> <RefAuthor>Cadrin-Chênevert A</RefAuthor> <RefAuthor>Chartrand-Lefebvre C</RefAuthor> <RefAuthor>Chalaoui J</RefAuthor> <RefAuthor>Gorgos A</RefAuthor> <RefAuthor>Chin AS</RefAuthor> <RefAuthor>Prénovault J</RefAuthor> <RefAuthor>Guilbert F</RefAuthor> <RefAuthor>Létourneau-Guillon L</RefAuthor> <RefAuthor>Chong J</RefAuthor> <RefAuthor>Tang A</RefAuthor> <RefTitle>Assessment of Radiology Artificial Intelligence Software: A Validation and Evaluation Framework</RefTitle> <RefYear>2023</RefYear> <RefJournal>Can Assoc Radiol J</RefJournal> <RefPage>326-33</RefPage> <RefTotal>Tanguay W, Acar P, Fine B, Abdolell M, Gong B, Cadrin-Chênevert A, Chartrand-Lefebvre C, Chalaoui J, Gorgos A, Chin AS, Prénovault J, Guilbert F, Létourneau-Guillon L, Chong J, Tang A. Assessment of Radiology Artificial Intelligence Software: A Validation and Evaluation Framework. Can Assoc Radiol J. 2023 May;74(2):326-33. 
DOI: 10.1177/08465371221135760</RefTotal> <RefLink>https://doi.org/10.1177/08465371221135760</RefLink> </Reference> <Reference refNo="7"> <RefAuthor>Halilaj I</RefAuthor> <RefAuthor>Oberije C</RefAuthor> <RefAuthor>Chatterjee A</RefAuthor> <RefAuthor>van Wijk Y</RefAuthor> <RefAuthor>Rad NM</RefAuthor> <RefAuthor>Galganebanduge P</RefAuthor> <RefAuthor>Lavrova E</RefAuthor> <RefAuthor>Primakov S</RefAuthor> <RefAuthor>Widaatalla Y</RefAuthor> <RefAuthor>Wind A</RefAuthor> <RefAuthor>Lambin P</RefAuthor> <RefTitle>Open Source Repository and Online Calculator of Prediction Models for Diagnosis and Prognosis in Oncology</RefTitle> <RefYear>2022</RefYear> <RefJournal>Biomedicines</RefJournal> <RefPage>2679</RefPage> <RefTotal>Halilaj I, Oberije C, Chatterjee A, van Wijk Y, Rad NM, Galganebanduge P, Lavrova E, Primakov S, Widaatalla Y, Wind A, Lambin P. Open Source Repository and Online Calculator of Prediction Models for Diagnosis and Prognosis in Oncology. Biomedicines. 2022 Oct 23;10(11):2679. DOI: 10.3390/biomedicines10112679</RefTotal> <RefLink>https://doi.org/10.3390/biomedicines10112679</RefLink> </Reference> <Reference refNo="9"> <RefAuthor>Cardoso MJ</RefAuthor> <RefAuthor>Li W</RefAuthor> <RefAuthor>Brown R</RefAuthor> <RefAuthor>Ma N</RefAuthor> <RefAuthor>Kerfoot E</RefAuthor> <RefAuthor>Wang Y</RefAuthor> <RefAuthor>Murrey B</RefAuthor> <RefAuthor></RefAuthor> <RefTitle>MONAI: An open-source framework for deep learning in healthcare [Preprint]</RefTitle> <RefYear>2022</RefYear> <RefJournal>arXiv</RefJournal> <RefPage></RefPage> <RefTotal>Cardoso MJ, Li W, Brown R, Ma N, Kerfoot E, Wang Y, Murrey B, et al. MONAI: An open-source framework for deep learning in healthcare [Preprint]. arXiv. 2022 Nov 4. 
DOI: 10.48550/arXiv.2211.02701</RefTotal> <RefLink>https://doi.org/10.48550/arXiv.2211.02701</RefLink> </Reference> <Reference refNo="10"> <RefAuthor>Yadav D</RefAuthor> <RefAuthor>Jain R</RefAuthor> <RefAuthor>Agrawal H</RefAuthor> <RefAuthor>Chattopadhyay P</RefAuthor> <RefAuthor>Singh T</RefAuthor> <RefAuthor>Jain A</RefAuthor> <RefAuthor>Singh SB</RefAuthor> <RefAuthor>Lee S</RefAuthor> <RefAuthor>Batra D</RefAuthor> <RefTitle>EvalAI: Towards Better Evaluation Systems for AI Agents [Preprint]</RefTitle> <RefYear>2019</RefYear> <RefJournal>arXiv</RefJournal> <RefPage></RefPage> <RefTotal>Yadav D, Jain R, Agrawal H, Chattopadhyay P, Singh T, Jain A, Singh SB, Lee S, Batra D. EvalAI: Towards Better Evaluation Systems for AI Agents [Preprint]. arXiv. 2019 Feb 10. DOI: 10.48550/arXiv.1902.03570</RefTotal> <RefLink>https://doi.org/10.48550/arXiv.1902.03570</RefLink> </Reference> <Reference refNo="8"> <RefAuthor>van Stiphout RG</RefAuthor> <RefAuthor>Lammering G</RefAuthor> <RefAuthor>Buijsen J</RefAuthor> <RefAuthor>Janssen MH</RefAuthor> <RefAuthor>Gambacorta MA</RefAuthor> <RefAuthor>Slagmolen P</RefAuthor> <RefAuthor>Lambrecht M</RefAuthor> <RefAuthor>Rubello D</RefAuthor> <RefAuthor>Gava M</RefAuthor> <RefAuthor>Giordano A</RefAuthor> <RefAuthor>Postma EO</RefAuthor> <RefAuthor>Haustermans K</RefAuthor> <RefAuthor>Capirci C</RefAuthor> <RefAuthor>Valentini V</RefAuthor> <RefAuthor>Lambin P</RefAuthor> <RefTitle>Development and external validation of a predictive model for pathological complete response of rectal cancer patients including sequential PET-CT imaging</RefTitle> <RefYear>2011</RefYear> <RefJournal>Radiother Oncol</RefJournal> <RefPage>126-33</RefPage> <RefTotal>van Stiphout RG, Lammering G, Buijsen J, Janssen MH, Gambacorta MA, Slagmolen P, Lambrecht M, Rubello D, Gava M, Giordano A, Postma EO, Haustermans K, Capirci C, Valentini V, Lambin P. 
Development and external validation of a predictive model for pathological complete response of rectal cancer patients including sequential PET-CT imaging. Radiother Oncol. 2011 Jan;98(1):126-33. DOI: 10.1016/j.radonc.2010.12.002</RefTotal> <RefLink>https://doi.org/10.1016/j.radonc.2010.12.002</RefLink> </Reference> <Reference refNo="2"> <RefAuthor>Altman DG</RefAuthor> <RefAuthor>Vergouwe Y</RefAuthor> <RefAuthor>Royston P</RefAuthor> <RefAuthor>Moons KG</RefAuthor> <RefTitle>Prognosis and prognostic research: validating a prognostic model</RefTitle> <RefYear>2009</RefYear> <RefJournal>BMJ</RefJournal> <RefPage>b605</RefPage> <RefTotal>Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic research: validating a prognostic model. BMJ. 2009 May 28;338:b605. DOI: 10.1136/bmj.b605</RefTotal> <RefLink>https://doi.org/10.1136/bmj.b605</RefLink> </Reference> <Reference refNo="3"> <RefAuthor>Scott I</RefAuthor> <RefAuthor>Carter S</RefAuthor> <RefAuthor>Coiera E</RefAuthor> <RefTitle>Clinician checklist for assessing suitability of machine learning applications in healthcare</RefTitle> <RefYear>2021</RefYear> <RefJournal>BMJ Health Care Inform</RefJournal> <RefPage>e100251</RefPage> <RefTotal>Scott I, Carter S, Coiera E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health Care Inform. 2021 Feb;28(1):e100251. DOI: 10.1136/bmjhci-2020-100251</RefTotal> <RefLink>https://doi.org/10.1136/bmjhci-2020-100251</RefLink> </Reference> <Reference refNo="5"> <RefAuthor>Cabitza F</RefAuthor> <RefAuthor>Campagner A</RefAuthor> <RefTitle>The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies</RefTitle> <RefYear>2021</RefYear> <RefJournal>Int J Med Inform</RefJournal> <RefPage>104510</RefPage> <RefTotal>Cabitza F, Campagner A. 
The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int J Med Inform. 2021 Sep;153:104510. DOI: 10.1016/j.ijmedinf.2021.104510</RefTotal> <RefLink>https://doi.org/10.1016/j.ijmedinf.2021.104510</RefLink> </Reference> </References> <Media> <Tables> <NoOfTables>0</NoOfTables> </Tables> <Figures> <Figure width="1526" height="712" format="png"> <MediaNo>1</MediaNo> <MediaID>1</MediaID> <Caption><Pgraph><Mark1>Figure 1: Architecture of the FAIVOR tool</Mark1></Pgraph></Caption> </Figure> <NoOfPictures>1</NoOfPictures> </Figures> <InlineFigures> <NoOfPictures>0</NoOfPictures> </InlineFigures> <Attachments> <NoOfAttachments>0</NoOfAttachments> </Attachments> </Media> </OrigData> </GmsArticle>