Making AI Safety Routine: How Sites And Sponsors Can Continuously Monitor Clinical AI
By Akshaya Bhagavathula, associate professor of epidemiology, Department of Public Health, North Dakota State University, and Michelle A. Williams, professor of epidemiology and population health, Department of Epidemiology and Population Health, Stanford University School of Medicine

AI is moving rapidly from pilot projects into routine clinical operations across U.S. health systems. Organizations are deploying AI to support diagnosis, risk prediction, documentation, and population health management, often at scale, yet many lack a clear operational approach for monitoring these tools once they are live. Unlike drugs and medical devices, clinical AI systems are rarely tracked for performance drift, bias, or unintended consequences after deployment. As a result, models that perform well at launch can quietly degrade over time, exposing patients and health systems to avoidable clinical, operational, and reputational risk. Making AI safety routine requires treating algorithms not as static products but as clinical interventions that demand continuous oversight.
While previous calls for “algorithmovigilance” draw important parallels to drug safety monitoring, these proposals have largely remained conceptual. Algorithmovigilance, defined as the systematic monitoring of algorithms for expected and unexpected health effects, provides a useful conceptual foundation, but its impact depends on translating high-level principles into operational reality. This commentary provides a concrete operational blueprint for hospital-level implementation, specifying institutional mechanisms, reporting standards, and accountability structures largely absent from prior guidance. We argue that health systems must evolve toward a unified AI health system, treating every deployed model as a continuously evaluated intervention contributing to safety intelligence. Without adequate evaluation, transparency, and ethical safeguards, AI risks deepening inequities rather than resolving them. For health system leaders, the challenge is no longer whether AI will be adopted, but whether it will be governed with the same rigor applied to other clinical interventions. Although operationalized at the hospital level, this framework is explicitly intended to support clinical researchers, implementation scientists, and real-world evidence teams engaged in life cycle evaluation of AI interventions, from pre-deployment validation through post-deployment surveillance.
Real-World Applications Show Shortcomings
Recent experiences illustrate this urgency. Diagnostic imaging, among the earliest and most commercialized clinical AI applications, demonstrates marked variability between development and real-world settings. Independent evaluations reveal degradation in both sensitivity and calibration over time and across patient subgroups, a pattern that reflects reliance on narrowly curated data sets and insufficient external validation before deployment. These findings expose a critical operational gap: Algorithms that achieve high accuracy in controlled development settings may fail when applied across diverse patient populations.
For organizations deploying AI in clinical research, including trial sponsors and research sites, this disconnect highlights a shared risk. For sponsors using AI in drug development for tasks such as patient selection, endpoint adjudication, safety monitoring, and adaptive monitoring, performance drift can systematically bias analyses, distort treatment effects, and compromise the interpretability of trial outcomes. At the site level, where AI-enabled tools increasingly support data capture, eligibility screening, imaging interpretation, and outcome assessment, unrecognized drift can introduce measurement errors, inconsistencies across sites, and protocol deviations, ultimately threatening data integrity and the validity of human research.
Similarly, population health management algorithms have demonstrated how bias can replicate structural inequalities. A widely deployed risk-prediction tool that used healthcare cost as a proxy for future health risks and care needs underestimated risk among Black patients because systemic barriers suppressed expenditures despite higher disease burden. These barriers included reduced access to specialty care, underinsurance, differences in referral patterns, and longstanding structural inequalities that limited healthcare utilization even when clinical need was high. Although the developers did not include race explicitly, bias embedded itself through structural inequities in the training data. Subsequent analyses spurred redesign of commercial algorithms and the adoption of fairness frameworks that explicitly evaluate whether models distribute benefits and errors equitably across patient groups. These frameworks prioritize outcome equity, meaning comparable health outcomes across populations, over data neutrality, which assumes that models trained on existing data are unbiased simply because protected characteristics are excluded. This episode underscores a lesson for executive leadership: Technical performance metrics alone cannot substitute for understanding how data reflect historical patterns of access and care.
Clinical decision support tools offer similar cautionary lessons. A proprietary sepsis prediction algorithm implemented in hundreds of hospitals showed poor real-world performance when independently evaluated, demonstrating low sensitivity, frequent false alarms, and limited clinical benefit. Because its internal design and validation data were proprietary, health systems had limited ability to assess performance before deployment. In clinical research, the same opacity can affect drug developers and research sites that rely on proprietary AI tools for eligibility screening, safety signal detection, endpoint assessment, or data quality checks, leaving them unable to independently verify performance, bias, or failure modes during a trial. Without transparency and post-deployment evaluation requirements, sponsors and sites may assume scientific and regulatory risk they cannot directly measure. To address this, research organizations should require pre-specified validation documentation, ongoing performance monitoring during trials, and contractual rights to audit or externally evaluate AI tools used in human research.
The introduction of generative AI into clinical documentation presents qualitatively different risks. Large language models are increasingly explored for drafting discharge summaries, progress notes, and patient communications. These systems generate probabilistic text and can produce clinically plausible hallucinations that are difficult for clinicians to detect and may propagate through billing, quality reporting, and medico-legal processes. Even low error rates can translate into meaningful system-wide risk when tools are deployed at scale.
A Troubling Pattern Is Revealed
Taken together, these examples reveal a common pattern: rapid adoption of AI without sufficient accountability once systems enter routine use. They point to the need for a unified safety infrastructure that integrates regulatory expectations, institutional governance, and real-time performance monitoring into a continuous learning ecosystem. Similar algorithmovigilance concepts have been proposed, drawing on drug safety methods to monitor AI harms. What has been missing for many organizations is a practical operational model for putting these ideas into daily practice.
In response, major institutions are advancing expectations for responsible AI governance. The FDA has outlined a regulatory action plan for adaptive AI-based medical devices, emphasizing transparency, real-world performance tracking, and post-market surveillance. The Joint Commission, together with the Coalition for Health AI, has issued guidance outlining expectations for algorithm transparency, bias assessment, and local validation prior to clinical use. Collectively, these signals make clear that passive or one-time evaluation of clinical AI will no longer be sufficient for health system leadership.
A practical framework for a learning health system for AI includes four operational elements. For clinical research and quality improvement teams, these elements define a reproducible infrastructure for embedded evaluation and real-world evidence generation once AI systems enter routine care.
Four Elements Make Up A Practical AI Framework
First, adopt pre-deployment validation using local data and subgroup performance metrics. Multidisciplinary oversight committees should review AI tools before deployment, assess performance across demographic and clinical subgroups, and evaluate potential unintended effects. For leaders, this step establishes clear accountability before clinical risk is assumed. Routine bias audits and explainability assessments should be integrated into standard quality improvement cycles, ensuring that equity concerns receive the same systematic attention as other safety measures.
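To make this step concrete, the sketch below shows what a local subgroup performance audit might look like before go-live. It is a minimal illustration, not a vendor tool or published standard: the column names, the two synthetic subgroups, and the 0.5 decision threshold are all assumptions chosen for the example.

```python
# Minimal sketch of a pre-deployment subgroup audit on a local validation cohort.
# Assumed columns: "subgroup", "outcome" (0/1), "model_score" (predicted probability).
import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score, brier_score_loss

def subgroup_audit(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Summarize discrimination, sensitivity, and calibration by subgroup."""
    rows = []
    for group, g in df.groupby("subgroup"):
        preds = (g["model_score"] >= threshold).astype(int)
        rows.append({
            "subgroup": group,
            "n": len(g),
            "auroc": roc_auc_score(g["outcome"], g["model_score"]),
            "sensitivity": recall_score(g["outcome"], preds),
            "brier": brier_score_loss(g["outcome"], g["model_score"]),
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Synthetic stand-in for a site's retrospective validation data.
    import numpy as np
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({
        "subgroup": rng.choice(["A", "B"], size=1000),
        "outcome": rng.integers(0, 2, size=1000),
        "model_score": rng.uniform(0, 1, size=1000),
    })
    print(subgroup_audit(demo))
```

In practice, an oversight committee would run this audit on the institution's own retrospective cohort and pre-specify acceptable performance gaps between subgroups before approving deployment.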
Second, implement continuous post-deployment monitoring for bias, drift, and clinical outcomes. Health systems must embed a culture of ongoing evaluation, treating algorithmic performance as a standing operational responsibility rather than a one-time checkpoint. Without this capability, organizations may be unaware of performance degradation until patient harm or regulatory scrutiny occurs.
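One lightweight way to operationalize this kind of monitoring is to compare the distribution of live model scores against a baseline sample retained from local validation, for example with a population stability index. The sketch below assumes such a baseline exists; the 0.1 and 0.25 thresholds are widely used rules of thumb, not regulatory limits.

```python
# Minimal sketch of a scheduled drift check comparing live scores to a validation baseline.
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Quantify how far the current score distribution has shifted from baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    # Clip so scores outside the baseline range fall into the edge bins.
    base_pct = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0) in sparse bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

def drift_status(psi):
    if psi < 0.10:
        return "stable"
    if psi < 0.25:
        return "investigate"
    return "escalate to the oversight committee"

# Example: validation-era scores vs. the most recent month of live use (synthetic data).
baseline_scores = np.random.default_rng(1).beta(2, 5, size=5000)
current_scores = np.random.default_rng(2).beta(2, 3, size=2000)  # shifted distribution
psi = population_stability_index(baseline_scores, current_scores)
print(f"PSI = {psi:.3f} -> {drift_status(psi)}")
```

A check like this can run on a fixed schedule, with outcome-based metrics such as sensitivity and calibration reviewed once enough follow-up data accrue.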
Third, use hospital-level AI registries linking model versions, clinical indications, and real-world outcomes. AI systems should be treated like other clinical interventions: documented, monitored for adverse outcomes, and updated as evidence evolves. These registries provide leadership with visibility into where AI is used, how it performs, and when intervention is required. Transparency about model architecture, data provenance, and intended use remains essential to informed adoption.
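A registry does not need to be elaborate to be useful. The sketch below illustrates one possible minimal schema linking a deployed model version to its indication, accountable owner, and monitoring history; the field names and the example entry are hypothetical, not a published registry standard.

```python
# Minimal sketch of a hospital-level AI registry entry with linked monitoring results.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MonitoringResult:
    period_end: date
    auroc: float
    sensitivity: float
    psi: float                       # drift vs. the validation baseline
    subgroup_gaps: dict[str, float]  # e.g., AUROC difference for a named subgroup
    action_taken: str = "none"

@dataclass
class RegistryEntry:
    model_name: str
    model_version: str
    vendor: str
    clinical_indication: str
    intended_population: str
    deployment_date: date
    validation_reference: str        # link to the local validation report
    clinical_owner: str              # accountable clinical lead
    monitoring_history: list[MonitoringResult] = field(default_factory=list)

# Hypothetical example entry.
entry = RegistryEntry(
    model_name="sepsis-risk",
    model_version="2.3.1",
    vendor="ExampleVendor",
    clinical_indication="Early warning for inpatient sepsis",
    intended_population="Adult inpatients, non-ICU wards",
    deployment_date=date(2025, 1, 15),
    validation_reference="QI-2024-118",
    clinical_owner="Chief Quality Officer",
)
entry.monitoring_history.append(
    MonitoringResult(date(2025, 6, 30), auroc=0.74, sensitivity=0.61,
                     psi=0.18, subgroup_gaps={"age>=75": -0.05},
                     action_taken="threshold review scheduled")
)
print(entry.model_name, entry.model_version, len(entry.monitoring_history))
```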
Fourth, support transparent reporting to national repositories to enable cross-institutional learning. Policymakers and regulators should support shared reporting systems that allow organizations to learn from one another’s experiences. For health systems, participation in such networks transforms isolated errors into opportunities for system-wide improvement rather than reputational risk.
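Cross-institutional learning is easier when sites submit events in a shared, machine-readable format. The sketch below shows one hypothetical report structure serialized as JSON; the schema is illustrative and is not drawn from any existing repository specification.

```python
# Minimal sketch of a standardized AI safety event report a site might submit.
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class AISafetyReport:
    reporting_site: str       # de-identified site code
    model_name: str
    model_version: str
    clinical_indication: str
    event_type: str           # e.g., "performance drift", "bias signal"
    detection_method: str     # e.g., "scheduled PSI check"
    observed_impact: str
    corrective_action: str
    report_date: str

report = AISafetyReport(
    reporting_site="SITE-042",
    model_name="sepsis-risk",
    model_version="2.3.1",
    clinical_indication="Early warning for inpatient sepsis",
    event_type="performance drift",
    detection_method="scheduled PSI check exceeded 0.25",
    observed_impact="Sensitivity fell from 0.70 to 0.61 over two quarters",
    corrective_action="Model paused pending recalibration",
    report_date=date(2025, 7, 1).isoformat(),
)

print(json.dumps(asdict(report), indent=2))
```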
Effective use of these systems requires workforce capacity at multiple levels. Clinicians need training to interpret algorithmic outputs, recognize limitations, and retain decision-making authority. Data scientists require grounding in ethics and population health principles to design systems aligned with patient welfare and equity. Without this workforce capacity, organizations risk deploying AI systems they are not equipped to supervise responsibly.
AI Must Be Safe And Accountable
Looking ahead, the next frontier in AI governance will involve real-time surveillance of algorithmic performance across institutions. Just as pharmacovigilance tracks drug safety after approval, algorithmic safety surveillance can identify performance drift, bias, and system failures early. For leaders, investing in this infrastructure is not an academic exercise but a core component of clinical risk management.
Public trust in clinical AI will depend not only on technical performance but on visible accountability. When patients and clinicians see that errors are identified, addressed, and used to improve systems, confidence can be sustained. Making AI safety routine is ultimately a leadership decision — one that determines whether innovation strengthens care or quietly introduces new forms of risk.
References:
- Davis SE, Dorn C, Park DJ, Matheny ME. Emerging algorithmic bias: fairness drift as the next dimension of model maintenance and sustainability. J Am Med Inform Assoc. 2025 May 1;32(5):845-854. doi: 10.1093/jamia/ocaf039.
- Chin MH, Afsar-Manesh N, Bierman AS, Chang C, Colón-Rodríguez CJ, Dullabh P, Duran DG, Fair M, Hernandez-Boussard T, Hightower M, Jain A, Jordan WB, Konya S, Moore RH, Moore TT, Rodriguez R, Shaheen G, Snyder LP, Srinivasan M, Umscheid CA, Ohno-Machado L. Guiding Principles to Address the Impact of Algorithm Bias on Racial and Ethnic Disparities in Health and Health Care. JAMA Netw Open. 2023 Dec 1;6(12):e2345050. doi: 10.1001/jamanetworkopen.2023.45050.
- Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, Pestrue J, Phillips M, Konye J, Penoza C, Ghous M, Singh K. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Intern Med. 2021 Aug 1;181(8):1065-1070. doi: 10.1001/jamainternmed.2021.2626.
- Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, Pimenta D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit Med. 2025;8:274. doi: 10.1038/s41746-025-01670-7.
- U.S. Food and Drug Administration. Artificial Intelligence and Machine Learning Software as a Medical Device Action Plan. Updated 2023. Available from: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-software-medical-device.
- The Joint Commission and Coalition for Health AI Join Forces to Scale the Responsible Use of AI in Delivering Better Healthcare. Available from: https://www.jointcommission.org/en-us/knowledge-library/news/2025-06-the-joint-commission-and-coalition-for-health-ai-join-forces.
About The Authors:
Akshaya S. Bhagavathula, PhD, FACE, is an associate professor of epidemiology in the Department of Public Health at North Dakota State University. He is an internationally recognized public health scientist whose work spans clinical epidemiology, digital and spatial epidemiology, pharmacoepidemiology, infodemiology, and AI-enabled risk prediction. His research examines methodological challenges related to bias, generalizability, and performance assessment when data-driven tools are used for patient selection, safety monitoring, outcome measurement, and real-world evidence generation. Dr. Bhagavathula has authored more than 350 peer-reviewed publications and serves as an associate editor of Annals of Epidemiology. He has been selected for competitive national programs in biomedical data science, including a national cohort focused on generative AI training in biomedical and clinical research, and has contributed to global public health projects across multiple countries.
Michelle A. Williams, ScD, is a professor of epidemiology and population health at Stanford University School of Medicine and associate chair for academic affairs in the Department of Epidemiology and Population Health. Her research centers on reproductive, perinatal, pediatric, and molecular epidemiology, with a long record of advancing understanding of adverse pregnancy outcomes and other population health issues through rigorous epidemiologic study designs. She co-designed and co-leads the Apple Women’s Health Study, a large national digital cohort examining gynecological health. Over her career, she has published more than 540 peer-reviewed scientific articles and been elected a member of the National Academy of Medicine in recognition of her contributions to the field. Dr. Williams previously served as dean of the faculty at the Harvard T. H. Chan School of Public Health, and she has received numerous honors for her research and mentorship.