Digital Mirrors in Medicine: When the Data Isn’t Real—But the Decisions Are

Faking it, for real: How synthetic data is reshaping medicine.
By Maria Giovanna Trovato, July 11, 2025

As we move deeper into precision medicine and real-time clinical decision-making, access to high-quality data is essential, yet it is often the biggest constraint. Privacy rules, institutional silos, and fragmented systems routinely stall collaboration. That’s where synthetic data starts to look like a promising workaround.

Across federal research panels and startup accelerators, the conversation is shifting from “Why synthetic?” to “How do we use it responsibly?” Having attended several conferences this year, I’ve noticed a clear split: private companies are eager to push implementation, often with cautious optimism—even when the risks aren’t fully addressed. Meanwhile, federal panels tend to offer more warnings than endorsements, emphasizing the need for rigorous standards, traceability, and context before widespread adoption.

1. From Data Deserts to Synthetic Solutions

Traditional anonymization no longer meets the scale or fidelity required for real-world healthcare AI. De-identification often strips away crucial clinical detail, making the data less useful for machine learning[1].

That need is driving a wave of synthetic data generators: tools that mimic the statistical properties of real datasets while preserving privacy. Companies like MDClone and Syntegra are building platforms that simulate patient journeys, disease trajectories, and treatment pathways. MDClone, for example, has enabled hospitals to generate shareable, privacy-preserving data for researchers without waiting months for IRB approval. Syntegra’s Transformer-based model builds synthetic datasets that retain correlations between patient demographics and clinical features, aiming for use in both predictive modeling and AI testing.
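The core idea can be sketched in a few lines. The toy example below is a deliberate simplification, not how MDClone or Syntegra actually work: it fits a mean vector and covariance matrix to a fabricated “real” cohort, then samples a synthetic cohort from a multivariate Gaussian, so pairwise correlations carry over while no individual record is reused.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "real" cohort: age, systolic BP, creatinine (correlated features).
n = 2000
age = rng.normal(60, 12, n)
sbp = 0.6 * age + rng.normal(80, 10, n)
creat = 0.01 * age + rng.normal(0.5, 0.2, n)
real = np.column_stack([age, sbp, creat])

# Fit aggregate statistics only, then sample a synthetic cohort from them.
# No real record is copied; only the mean vector and covariance are reused.
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=n)

# Fidelity check: pairwise correlations should roughly match.
corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                  - np.corrcoef(synthetic, rowvar=False)).max()
print(round(corr_gap, 3))
```

Production generators must also handle mixed data types, skewed distributions, and temporal structure, which is precisely where naive Gaussian sampling breaks down.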

Critically, this has implications for rare disease research, where small cohorts often lead to underpowered studies[2]. According to EURORDIS, rare diseases affect 30 million Americans and 1 in 17 people in Europe, yet 95% of rare diseases lack an FDA-approved treatment[3]. With real data so scarce, high-fidelity synthetic augmentation could help researchers build more inclusive, representative models.

2. Real-World Wins... and Warnings

Case Study 1: Spine Surgery Validation

At Washington University in St. Louis, researchers validated MDClone's synthetic dataset by comparing it with real-world EHRs. They found nearly identical cohort distributions and outcome patterns, suggesting synthetic data could be viable for retrospective clinical studies[4].

Case Study 2: CKD Survival Modeling

A 2024 paper on Chronic Kidney Disease used attention-based neural networks (MCM) to generate synthetic EHRs. These models reduced calibration error by 15% and improved subgroup fairness by 9%, outperforming 15 other benchmark techniques[5].

Case Study 3: HealthGAN Fairness Failure

Despite efforts to mitigate bias, HealthGAN-generated EHR data exhibited altered fairness properties compared to real data, reinforcing the concern that bias embedded in source data doesn’t vanish—it persists in synthetic versions[6].

In broader reviews, researchers found that synthetic data often fails to preserve multi-modal physiological signals—like ECG with respiration—even in high-stakes contexts such as ICU monitoring or wearable health. Generative models “struggle to preserve inter-signal correlations inherent to physiological recordings,” which risks distorting the relationships clinicians rely on for patient care[7].

3. Funding & the Innovation Push

The U.S. government is investing heavily in infrastructure to make synthetic data viable:

  • ARPA-H’s INDEX initiative (Imaging Data Exchange) focuses on developing representative, standardized, and privacy-preserving datasets across institutions—many of which will incorporate synthetic augmentation to fill gaps in coverage. INDEX aims to accelerate discovery in medical imaging and serve as a resource for AI training.
  • The FDA-NIST AISAMD collaboration is drafting regulatory frameworks for synthetic data in medical device testing, emphasizing transparency and reproducibility.
  • NIH has begun funding open-source synthetic data frameworks as part of its Bridge2AI and BDF Toolbox efforts—especially those tied to digital twin development, where real patient data may be incomplete or unavailable.

These aren’t fringe experiments. They reflect a new consensus: we need to build safe, scalable, and inclusive AI—without relying solely on real patient records.

4. The Hard Truths

Synthetic data isn’t de-risked data

Even privacy-safe data can be dangerous if poorly constructed. Validation pipelines must go beyond statistical similarity to include real-world relevance and expert review.

Bias isn’t erased—it’s echoed

Models trained on biased real-world data can generate synthetic data that deepens inequities. If most of your source patients are white males, your synthetic dataset likely will be, too.

Generated doesn’t mean grounded

We can’t synthesize signals we don’t capture. One study showed that synthetic ICU data lacked valid multi-signal correlations—rendering digital twin simulations unreliable for predicting adverse events like sepsis or respiratory failure.

Validation falls short

Many LLM-based tools generate EHR-style data that looks plausible but fails when tested against real-world ground truth. This disconnect can lead to overconfident models with dangerous gaps.

5. What to Do About It

Hybrid pipelines

Synthetic data shouldn’t replace real data; it should complement it. In oncology, for example, synthetic patient histories have been used to pre-train models before fine-tuning them on real datasets. In neurology, researchers have trained brain tumor segmentation models on fully synthetic MRIs, achieving Dice scores of up to 90% and approaching the performance of models trained on real images[8].
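A hybrid pipeline of this kind can be sketched with toy data and a plain logistic-regression model (an illustrative assumption, not any published oncology system): pre-train on a large synthetic cohort, then fine-tune the same weights on a small real one.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, w, lr=0.1, epochs=200):
    """Plain batch gradient descent for logistic regression."""
    X = np.column_stack([np.ones(len(x)), x])
    for _ in range(epochs):
        w = w - lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def make_data(n):
    """Fabricated cohort: three features, noisy binary outcome."""
    x = rng.normal(0, 1, (n, 3))
    y = (x @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, n) > 0).astype(int)
    return x, y

x_syn, y_syn = make_data(5000)    # plentiful synthetic records
x_real, y_real = make_data(60)    # scarce real cohort (e.g., rare disease)
x_test, y_test = make_data(2000)  # held-out real data for evaluation

w0 = np.zeros(4)
w_pre = train(x_syn, y_syn, w0)            # 1) pre-train on synthetic
w_hybrid = train(x_real, y_real, w_pre)    # 2) fine-tune on real
w_real_only = train(x_real, y_real, w0)    # baseline: small real set alone

def acc(w):
    X = np.column_stack([np.ones(len(x_test)), x_test])
    return float(((sigmoid(X @ w) > 0.5).astype(int) == y_test).mean())

print(round(acc(w_hybrid), 2), round(acc(w_real_only), 2))
```

The toy setup draws synthetic and real data from the same process, so the pre-trained weights transfer cleanly; in practice the gap between generator output and real records is exactly what validation must quantify.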

Benchmarked quality metrics

Frameworks like those proposed in Vallevik et al. (2024) outline best practices for testing synthetic data on fairness, correlation preservation, and downstream model robustness[9].
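One concrete downstream-robustness check in this family is “train on synthetic, test on real” (TSTR): fit the same model once on real data and once on synthetic data, then score both against a held-out real set. A minimal numpy sketch, using fabricated stand-in cohorts in place of an actual generator’s output:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_cohort(n, rng):
    """Toy EHR-like cohort: two features and a binary outcome."""
    x = rng.normal(0, 1, (n, 2))
    logits = 1.5 * x[:, 0] - 1.0 * x[:, 1]
    y = (logits + rng.logistic(0, 1, n) > 0).astype(int)
    return x, y

def fit_linear(x, y):
    """Least-squares linear classifier (stand-in for a real model)."""
    X = np.column_stack([np.ones(len(x)), x])
    w, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)
    return w

def accuracy(w, x, y):
    X = np.column_stack([np.ones(len(x)), x])
    return float(((X @ w > 0).astype(int) == y).mean())

x_real, y_real = make_cohort(4000, rng)
x_syn, y_syn = make_cohort(4000, rng)  # stand-in for a generator's output

w_real = fit_linear(x_real[:2000], y_real[:2000])
w_syn = fit_linear(x_syn, y_syn)

held_x, held_y = x_real[2000:], y_real[2000:]
trtr = accuracy(w_real, held_x, held_y)  # train-real, test-real baseline
tstr = accuracy(w_syn, held_x, held_y)   # train-synthetic, test-real
print(round(trtr, 2), round(tstr, 2))
```

A large gap between the two scores signals that the synthetic data has lost structure the downstream model depends on; a full assessment would also cover fairness and correlation-preservation metrics.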

Regulatory engagement

With FDA/NIST’s guidance and ARPA-H’s pilot evaluations, we have a growing opportunity to shape how synthetic data is used safely across the industry.

Human-in-the-loop design

Every success story in synthetic data—from MDClone’s test case to emerging models in cardiac prediction—has relied on interdisciplinary collaboration. Engineers, ethicists, clinicians, and patients must all have a seat at the table.

Conclusion: Synthetic Isn’t a Shortcut—It’s a Tool

Synthetic data is not a miracle fix, but it can be a powerful asset when applied responsibly. When we treat it as a partner to real-world evidence—not a replacement—we move toward a more equitable, secure, and innovation-ready future in healthcare AI.

Used recklessly, it’s illusion.

Used wisely, it’s a step toward democratizing discovery.

References

1. Javaid, U. (2024, August 12). Why are legacy data anonymization techniques failing? https://www.betterdata.ai/blogs/why-are-legacy-data-anonymization-techniques-failing

2. Mendes, J. M., Barbar, A., & Refaie, M. (2025, March 18). Synthetic data generation: A privacy-preserving approach to accelerate rare disease research. Frontiers in Digital Health, 7, 1563991. https://pmc.ncbi.nlm.nih.gov/articles/PMC11958975/

3. EURORDIS. (2025, May). Putting rare diseases at the heart of the EU Life Sciences Strategy. https://www.eurordis.org/eu-life-sciences-strategy/

4. Foraker RE, Yu SC, Gupta A, et al. Spot the Difference: Comparing Results of Analyses from Real Patient Data and Synthetic Derivatives. JAMIA Open. 2020;3(4):557–566. https://pubmed.ncbi.nlm.nih.gov/33623891/

5. Kuo N-I-H, Gallego B, Jorm L. Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records. arXiv:2503.06096 [cs.LG]. 2025 Mar 8. https://arxiv.org/html/2503.06096v1

6. Bhanot K, Baldini I, Wei D, Zeng J, Bennett KP. Downstream Fairness Caveats with Synthetic Healthcare Data. arXiv [cs.LG]. 2022 Mar 9. arXiv:2203.04462. https://arxiv.org/abs/2203.04462

7. Neifar, N., Mdhaffar, A., Ben-Hamadou, A., & Jmaiel, M. (2023). Deep Generative Models for Physiological Signals: A Systematic Literature Review. https://arxiv.org/abs/2110.04902

8. Akbar MU, Larsson M, Blystad I, Eklund A. Brain tumor segmentation using synthetic MR images: A comparison of GANs and diffusion models. Sci Rep. 2024;14:4703. https://arxiv.org/abs/2306.02986

9. Vallevik VB, Babic A, Marshall SE, et al. Can I Trust My Fake Data - A Comprehensive Quality Assessment Framework for Synthetic Tabular Data in Healthcare. arXiv:2401.13716 [cs.LG]. 2024. https://arxiv.org/abs/2401.13716