Synthetic data

Official definition

Artificially generated data that mimics the statistical properties and patterns of real-world data but does not contain any actual real-world observations.

Source: AIEOG AI Lexicon (Feb 2026), adapted from NIST SP 800-188 and BIS FSI Insights No. 63

What synthetic data means in plain language

Synthetic data is artificial data designed to look and behave like real data. It is generated algorithmically to preserve the statistical patterns, distributions, and relationships found in real datasets, while containing no real individual’s information.

Synthetic data is used when real data is unavailable, insufficient, or too sensitive to use directly. For example, an institution might generate synthetic customer data to train a fraud model when real fraud examples are too few, to test systems without exposing actual customer information, or to share data across teams with lower privacy risk.

Synthetic data can be generated through various methods, including generative adversarial networks (GANs), variational autoencoders, statistical sampling, and simulation. The quality of synthetic data depends on how well it captures the meaningful patterns in the real data without overfitting to it, since an overfitted generator can inadvertently reproduce identifiable records.
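
As an illustration of the statistical sampling approach, the sketch below fits a multivariate Gaussian to a small numeric dataset and draws new rows from it. It is a minimal example, not a recommended production method: the Gaussian assumption, the function names (fit_gaussian, cholesky, sample_synthetic), and the toy "real" dataset of income and spend values are all assumptions made for illustration. Real financial data is often skewed or categorical and would need a richer model.

```python
import math
import random

def fit_gaussian(rows):
    """Estimate column means and the sample covariance matrix of numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return means, cov

def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (must be positive definite)."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def sample_synthetic(means, cov, n_rows, seed=0):
    """Draw rows from the fitted Gaussian: mean + L * z, with z standard normal."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    d = len(means)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        rows.append([means[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(d)])
    return rows

# Toy stand-in for a real dataset (hypothetical): income and a correlated spend value.
_rng = random.Random(42)
real_rows = []
for _ in range(500):
    income = _rng.gauss(50.0, 10.0)
    spend = 0.5 * income + _rng.gauss(0.0, 3.0)
    real_rows.append([income, spend])

means, cov = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(means, cov, n_rows=1000, seed=1)
```

The synthetic rows share the means and correlation structure of the source but are not copies of any source row. Because the generator is fitted to real data, the output should still go through the privacy and quality checks described later in this entry.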

Why it matters in financial services

Synthetic data addresses several practical challenges in financial services:

  • Privacy compliance. Synthetic data can enable model development and testing without exposing actual customer data, supporting compliance with privacy regulations.
  • Data scarcity. For rare events (certain fraud patterns, specific default scenarios), synthetic data can augment limited real-world examples.
  • Testing and development. Synthetic data enables realistic testing environments without the risk of exposing production data.
  • Cross-team sharing. Synthetic data can be shared across teams or organizations when real data sharing is restricted.
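
To make the data scarcity point concrete, the sketch below augments a handful of rare-event rows by interpolating between random pairs of them. This is a deliberately simplified scheme, loosely in the spirit of oversampling techniques such as SMOTE (which interpolates toward nearest neighbours rather than random pairs). The function name interpolate_rare_cases and the example fraud rows are hypothetical.

```python
import random

def interpolate_rare_cases(rare_rows, n_new, seed=0):
    """Create n_new rows, each a random point on the line between two rare examples."""
    if len(rare_rows) < 2:
        raise ValueError("need at least two rare examples to interpolate")
    rng = random.Random(seed)
    new_rows = []
    for _ in range(n_new):
        a, b = rng.sample(rare_rows, 2)   # two distinct source rows
        t = rng.random()                  # interpolation weight in [0, 1)
        new_rows.append([x + t * (y - x) for x, y in zip(a, b)])
    return new_rows

# Hypothetical rare fraud examples: [transaction amount, transactions in the last hour].
fraud_rows = [[950.0, 3.0], [1200.0, 1.0], [875.0, 4.0], [1430.0, 2.0]]
augmented = fraud_rows + interpolate_rare_cases(fraud_rows, n_new=20, seed=7)
```

Interpolated rows stay inside the range spanned by the original examples, so this approach adds volume but no genuinely new fraud patterns. It also treats every column as continuous, which would need adjusting for count or categorical fields.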

Governance considerations include privacy validation (confirming synthetic data cannot be reverse-engineered to identify real individuals), quality validation (ensuring synthetic data accurately represents real-world patterns), and appropriate use policies (defining when synthetic data is and is not acceptable).
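
One simple form of privacy validation is a distance-to-closest-record screen: flag any synthetic row that is identical to, or unusually close to, a real row. The sketch below is a minimal version under stated assumptions: the function names, the example rows, and the min_distance threshold are hypothetical, and features are assumed to be numeric and on comparable scales.

```python
import math

def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its closest real row."""
    return min(math.dist(synthetic_row, r) for r in real_rows)

def privacy_screen(synthetic_rows, real_rows, min_distance):
    """Return (index, distance) for synthetic rows closer than min_distance to a real row."""
    flagged = []
    for idx, row in enumerate(synthetic_rows):
        d = nearest_real_distance(row, real_rows)
        if d < min_distance:
            flagged.append((idx, d))
    return flagged

# Hypothetical example: row 0 is an exact copy of a real row, row 2 is a near copy.
real = [[52.0, 26.1], [48.5, 23.9], [61.2, 30.4]]
synthetic = [[52.0, 26.1], [55.0, 28.0], [48.6, 23.9]]
flagged = privacy_screen(synthetic, real, min_distance=0.5)
```

Here the screen flags rows 0 and 2. Passing a screen like this does not prove that re-identification is impossible. It only catches the most direct form of leakage, so it is best treated as one test among several.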

Key considerations for compliance teams

  1. Validate privacy properties. Test that synthetic data cannot be used to identify real individuals or re-create actual records.
  2. Assess data quality. Ensure synthetic data preserves the statistical properties needed for its intended use.
  3. Define acceptable use. Establish policies on when synthetic data can be used for model training, testing, and validation.
  4. Document generation methods. Record how synthetic data was generated, what real data it was based on, and what quality checks were performed.
  5. Test for bias. Synthetic data can reproduce or amplify biases present in the source data, and the generation process can introduce new ones. Validate for fairness.
  6. Include in data governance. Synthetic data should be subject to data governance policies, including lineage tracking.
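
Items 2 and 5 above can be started with very basic checks. The sketch below compares per-column means and standard deviations between real and synthetic rows, and compares the share of a given group in each dataset. The function names, the sample values, and the group labels are hypothetical, and these checks are a first-pass screen only: they do not replace use-case-specific quality testing or a full fairness assessment.

```python
import statistics

def column_summary(rows, col):
    """Mean and sample standard deviation of one column."""
    values = [r[col] for r in rows]
    return statistics.mean(values), statistics.stdev(values)

def fidelity_gaps(real_rows, synthetic_rows, n_cols):
    """Absolute gaps in mean and standard deviation, column by column."""
    gaps = []
    for col in range(n_cols):
        real_mean, real_sd = column_summary(real_rows, col)
        syn_mean, syn_sd = column_summary(synthetic_rows, col)
        gaps.append({
            "column": col,
            "mean_gap": abs(real_mean - syn_mean),
            "sd_gap": abs(real_sd - syn_sd),
        })
    return gaps

def group_share(labels, group):
    """Fraction of records whose label equals the given group."""
    return sum(1 for g in labels if g == group) / len(labels)

# Hypothetical data: two numeric columns plus a separate group label per record.
real_rows = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]]
synthetic_rows = [[12.0, 1.5], [20.0, 2.0], [31.0, 2.5]]
gaps = fidelity_gaps(real_rows, synthetic_rows, n_cols=2)

real_groups = ["A", "A", "B", "B"]
synthetic_groups = ["A", "A", "A", "B"]
share_shift = group_share(synthetic_groups, "B") - group_share(real_groups, "B")
```

A large gap in any column, or a noticeable shift in group shares, suggests the generator has distorted the source data and the dataset should be reviewed before use. What counts as "large" depends on the intended use and is left to policy.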
