Synthetic data in the clinical laboratory : methods, applications, and future prospects

Authors: Pillay, Tahir S.; Van Deventer, Barbara Stroh; Gwiliza, Siphokazi; Subramoney, Evette L.; Van Niekerk, Chantal

Dates: 2026-03-12; 2026-03-12; 2026-04

Citation: Pillay, T.S., Van Deventer, B.S., Gwiliza, S., Subramoney, E.L. & Van Niekerk, C. 2026, 'Synthetic data in the clinical laboratory : methods, applications, and future prospects', Clinica Chimica Acta, vol. 585, art. 120878, pp. 1-13, doi : 10.1016/j.cca.2026.120878.

ISSN: 0009-8981 (print); 1873-3492 (online)

DOI: 10.1016/j.cca.2026.120878

URI: http://hdl.handle.net/2263/108932

DATA AVAILABILITY : No data was used for the research described in the article.

Abstract: Clinical laboratories face stringent privacy constraints, limited datasets for rare conditions, and rising demands to validate AI algorithms and workflows safely. Synthetic data, artificially generated data that preserve the statistical characteristics of real clinical data without exposing patient identities, has emerged as a powerful tool to address these challenges. This review provides a comprehensive overview of synthetic data in the context of laboratory medicine. We begin by defining synthetic data and describing the main generation methods, from rule-based simulations to modern generative models (including generative adversarial networks, variational autoencoders, and diffusion models) with examples of their use in healthcare. We then delve into key applications in the clinical laboratory: quality control and method validation, education and training, machine learning development, test utilization and workflow simulation, and external quality assessment. Advantages of synthetic data, such as enhanced privacy, scalability, flexibility in simulating rare events, and cost-effectiveness, are discussed with illustrative case studies. We also examine challenges and limitations, including concerns about data fidelity, bias amplification, risks of model overfitting or re-identification attacks, and the cautious stance of regulators that still require real patient data for approvals.
Finally, we outline future directions for synthetic data in laboratory medicine, from hybrid real–synthetic datasets and privacy-enhancing techniques to evolving regulatory frameworks and the potential to democratize data access globally. While synthetic data cannot entirely replace real clinical data, especially for regulatory validation, it can significantly augment what laboratories can design, test, and achieve, provided it is used with careful validation and ethical safeguards.

HIGHLIGHTS
• Synthetic laboratory data enable safer sharing for method and algorithm development.
• Three approaches: simulation, probabilistic models, and deep generative models.
• Use cases include middleware testing, rare results, and competency training.
• Governance needs privacy risk review, documentation, and drift monitoring.

Language: en

Rights: © 2026 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Keywords: Synthetic data; Method and algorithm development; Simulation; Probabilistic models; Deep generative models; Governance; Healthcare

Type: Article