Synthetic Financial Data: Evaluation Beyond Statistical Similarity

Quick Answer
Synthetic financial data must be evaluated on three axes: statistical fidelity, downstream utility and privacy guarantees. Marginal distribution tests like KS statistics and Jensen-Shannon divergence measure column-level similarity but miss joint distributions, temporal correlations and fraud signal preservation. Membership inference attacks against synthesizers, specifically shadow model attacks, remain the most credible privacy threat. A gold-standard evaluation framework combines multivariate fidelity metrics, task-based utility benchmarks and formal differential privacy audits before any synthetic dataset reaches production.

Synthetic financial data has moved from research curiosity to production infrastructure. Banks generate it to train fraud models without exposing real cardholder records. Fintechs use it to satisfy GDPR data minimization obligations while running model development pipelines. Regulators are beginning to accept synthetic datasets in sandbox environments under PSD2 and the CFPB's data access framework. The technology is maturing. The evaluation methodology is not.

The dominant practice today is to generate a synthetic dataset, run a battery of marginal distribution tests, report that column-level statistics match within acceptable tolerance, and ship. That workflow is dangerously incomplete. It misses correlated feature failure, temporal signal degradation and the full spectrum of privacy vulnerabilities that modern membership inference attacks can exploit.

This article examines why synthetic financial data evaluation must go further, and what a production-grade assessment framework actually looks like in 2026.

Why Marginal Distribution Tests Fall Short

Marginal distribution tests evaluate each column in isolation. The Kolmogorov-Smirnov test checks whether the empirical CDF of a synthetic feature matches the real distribution. Jensen-Shannon divergence measures the information-theoretic distance between two probability distributions over a single variable. These are legitimate statistical tools. They are simply measuring the wrong thing for financial transaction data.

A transaction dataset is not a collection of independent columns. The relationship between merchant category code, transaction amount, time-of-day and card-present flag is what drives fraud signal. A synthesizer can reproduce each marginal distribution perfectly while completely destroying the joint structure that separates legitimate grocery purchases from card-not-present fraud attempts at 3 a.m.
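A toy illustration of this failure, as a dependency-free sketch: the "synthesizer" below is a deliberately broken stand-in that permutes one column, and the hour and amount columns are invented for the example. Both marginals are preserved exactly, so a KS check passes, while the relationship between the columns disappears.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)

# Hypothetical "real" data: transaction amount depends on hour of day.
hour = [rng.uniform(0, 24) for _ in range(2000)]
amount = [10 * h + rng.gauss(0, 20) for h in hour]

# Toy "synthesizer": keep every value but break the pairing between columns.
syn_hour = hour[:]
syn_amount = amount[:]
rng.shuffle(syn_amount)

ks_hour = ks_statistic(hour, syn_hour)        # marginal check on hour
ks_amount = ks_statistic(amount, syn_amount)  # marginal check on amount
real_corr = pearson(hour, amount)             # strong dependence in the real data
syn_corr = pearson(syn_hour, syn_amount)      # dependence after "synthesis"

print(f"KS hour={ks_hour:.3f}, KS amount={ks_amount:.3f}")
print(f"corr real={real_corr:.3f}, corr synthetic={syn_corr:.3f}")
```

Both KS statistics are zero because the synthetic columns contain exactly the same values as the real ones, yet the correlation between them collapses toward zero.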

Dr. Patrick Fisher has observed this failure mode repeatedly when auditing synthetic data pipelines for mid-tier banks. The KS statistics look clean. The fraud classifier trained on synthetic data underperforms by 12 to 18 AUC percentage points on real holdout data. The marginal test passes. The model fails in production. The evaluation framework lied.

Wasserstein distance and maximum mean discrepancy give a more complete picture of distributional similarity, but even those remain global metrics. They cannot surface localized failures in rare event subspaces, which is exactly where fraud lives.

Joint Distribution Fidelity in Transaction Data

Measuring joint distribution fidelity requires moving from univariate statistics to multivariate assessments. Three approaches have demonstrated practical value in financial applications.

The first is pairwise correlation preservation. Compute the Pearson and Spearman correlation matrices for the real and synthetic datasets and measure the Frobenius norm of their difference. This is a coarse but fast check that surfaces gross structural failures in the synthesizer.
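A minimal sketch of the correlation-distance check, written without third-party libraries so it stands alone; in practice a numerical library would replace the helper functions. The column values are invented, and the function names are illustrative rather than taken from any particular toolkit.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)

def corr_matrix(columns):
    """Pairwise Pearson correlation matrix for a list of numeric columns."""
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]

def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two square matrices."""
    k = len(a)
    return math.sqrt(sum((a[i][j] - b[i][j]) ** 2 for i in range(k) for j in range(k)))

# Hypothetical two-column datasets: the synthetic one flips the sign of the relationship.
real_cols = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]
synthetic_cols = [[1, 2, 3, 4, 5], [10, 8, 6, 4, 2]]

distance = frobenius_distance(corr_matrix(real_cols), corr_matrix(synthetic_cols))
print(distance)
```

Applying the same functions to rank-transformed columns gives the Spearman version of the check. A distance near zero means the pairwise linear structure survived; what counts as too large is a threshold each team has to set for its own feature set.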

The second is mutual information scoring across feature pairs. For categorical features common in transaction data, such as merchant category, channel type and card type, mutual information captures nonlinear dependencies that correlation misses. A synthesizer that destroys mutual information between merchant category and fraud label is useless for downstream model training regardless of how clean its marginal histograms look.
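As a rough sketch of this check, the function below computes empirical mutual information between two categorical columns from their joint counts. The merchant categories and fraud labels are made-up toy values, and the preservation ratio at the end is one possible way to summarize the result, not an established standard.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Hypothetical real data: merchant category fully determines the fraud label.
real_mcc = ["grocery", "grocery", "gambling", "gambling"]
real_fraud = [0, 0, 1, 1]

# Hypothetical synthetic data: same marginals, but the label is independent of category.
syn_mcc = ["grocery", "grocery", "gambling", "gambling"]
syn_fraud = [0, 1, 0, 1]

mi_real = mutual_information(real_mcc, real_fraud)
mi_syn = mutual_information(syn_mcc, syn_fraud)

# Fraction of the real dependency that the synthetic data retains.
preservation = mi_syn / mi_real if mi_real > 0 else float("nan")
print(mi_real, mi_syn, preservation)
```

Both datasets have identical category and label frequencies, so any marginal test passes, but the synthetic pair carries none of the information that links category to fraud.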

The third is temporal autocorrelation testing. Transaction data has strong temporal structure. A cardholder's spending velocity, their typical time-between-transactions and their recency patterns are all features used in behavioral fraud scoring. Synthetic data generators, particularly those based on variational autoencoders and basic GANs, routinely destroy temporal autocorrelation because they model each record independently. Evaluating synthetic transaction data without autocorrelation analysis is evaluating a fundamentally different data type than what will be encountered in production.
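The effect of record-independent generation can be sketched with a lag-1 autocorrelation check. The series below is a simulated stand-in for one cardholder's time-between-transactions, expressed as deviations from that cardholder's average gap; the 0.8 dependence coefficient and the shuffle-based "synthesizer" are assumptions made purely for illustration.

```python
import random

def autocorrelation(series, lag=1):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

rng = random.Random(7)

# Hypothetical cardholder: each gap deviation depends on the previous one.
real_gaps = [0.0]
for _ in range(1999):
    real_gaps.append(0.8 * real_gaps[-1] + rng.gauss(0, 1))

# Toy record-independent synthesizer: same values, temporal order discarded.
synthetic_gaps = real_gaps[:]
rng.shuffle(synthetic_gaps)

real_ac = autocorrelation(real_gaps)
synthetic_ac = autocorrelation(synthetic_gaps)
print(f"lag-1 autocorrelation: real={real_ac:.2f}, synthetic={synthetic_ac:.2f}")
```

In a real evaluation the same statistic would be computed per cardholder and at several lags, then compared between the real and synthetic populations.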

Work published on arXiv and through the ACM Conference on Knowledge Discovery and Data Mining has formalized these multivariate fidelity criteria under the umbrella of "utility-preserving synthesis." The distinction from privacy-preserving synthesis is important, and the two objectives pull in opposite directions.

Membership Inference Attacks Against Synthesizers

The primary privacy threat against synthetic data is not re-identification of records in the synthetic dataset itself. It is inference about the training data used to build the synthesizer.

Membership inference attacks attempt to determine whether a specific real record was part of the training population for a generative model. In financial contexts, a successful attack against a payment network's synthesizer could reveal that a specific account or transaction pattern was present in the training corpus, which in a fraud investigation context carries significant legal and reputational risk.

Shadow model attacks, first formalized by Shokri et al. and extended substantially since, work by training multiple shadow models that mimic the target synthesizer. The attacker builds a meta-classifier that distinguishes between the output distribution for records the model "memorized" versus records it never saw. This attack class has proven effective against GAN-based synthesizers, particularly when the GAN is trained on imbalanced data, which is always the case in fraud datasets where fraud rates are typically under 0.5 percent.
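Whatever attack produces the membership scores, its success is often summarized as a membership advantage: the true positive rate on training records minus the false positive rate on records the synthesizer never saw. The sketch below does not implement shadow model training; it only scores the output of an attack, and the score lists are invented values standing in for a meta-classifier's confidence.

```python
def membership_advantage(member_scores, nonmember_scores, threshold):
    """TPR minus FPR when records scoring at or above the threshold are called members."""
    tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
    fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr - fpr

def best_advantage(member_scores, nonmember_scores):
    """Largest advantage over all thresholds taken from the observed scores."""
    thresholds = sorted(set(member_scores) | set(nonmember_scores))
    return max(
        membership_advantage(member_scores, nonmember_scores, t) for t in thresholds
    )

# Hypothetical attack confidences ("was this record in the training set?").
member_scores = [0.9, 0.8, 0.7, 0.4]      # records that were in the training data
nonmember_scores = [0.6, 0.3, 0.2, 0.1]   # held-out records the synthesizer never saw

advantage = best_advantage(member_scores, nonmember_scores)
print(advantage)
```

An advantage of 0 means the attacker does no better than guessing; an advantage of 1 means membership is fully exposed. Choosing the threshold on the same records used for reporting is optimistic, so a careful audit would select it on a separate split.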

Attribute inference is a related attack. Rather than determining membership, the adversary infers a sensitive attribute, such as credit tier or behavioral risk score, from the released synthetic dataset by exploiting correlations the synthesizer preserved. If a synthetic dataset faithfully reproduces the correlation between transaction velocity and internal risk score, and internal risk score is considered sensitive, the synthesizer has created a privacy leak from a fidelity requirement.

Neither of these attack classes is assessed by any marginal distribution test. They require dedicated adversarial evaluation. The NIST Privacy Framework and NIST SP 800-188 on de-identifying government datasets both acknowledge the inadequacy of statistical disclosure limitation metrics alone, though financial-sector-specific guidance remains fragmented as of 2026.

The Privacy-Utility Tradeoff Is Not Linear

A common misconception among teams deploying synthetic data for the first time is that the privacy-utility tradeoff is a smooth dial. Turn up privacy protection, accept proportionally less utility. This model is wrong in ways that matter operationally.

The tradeoff is highly nonlinear and data-type dependent. For financial transaction data specifically, even modest tightening of the differential privacy budget toward epsilon values below 1.0, under the standard definition of (epsilon, delta)-differential privacy, tends to produce catastrophic drops in fraud detection utility before it produces meaningful gains in privacy. The rare event subspaces that define fraud patterns are the first features destroyed by noise injection.

At epsilon values above 10.0, the formal differential privacy guarantee becomes largely nominal. The mathematical bound is technically satisfied but the practical protection against membership inference degrades substantially. Teams that report epsilon as their sole privacy metric without running empirical membership inference audits are providing a number that sounds rigorous and means very little in practice.
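One way to see why large epsilon values offer little comfort is to look at what the definition itself allows. Under (epsilon, delta)-differential privacy, any membership test satisfies TPR <= e^epsilon * FPR + δ, and combining that with the symmetric constraint bounds the attacker's advantage (TPR minus FPR) by (e^epsilon - 1 + 2δ) / (e^epsilon + 1). The sketch below evaluates that worst-case bound; it assumes the guarantee holds per training record and says nothing about what a realistic attacker actually achieves, which is why empirical audits are still needed.

```python
import math

def max_membership_advantage(epsilon, delta=0.0):
    """Worst-case TPR - FPR of any membership test against an (epsilon, delta)-DP mechanism."""
    bound = (math.exp(epsilon) - 1 + 2 * delta) / (math.exp(epsilon) + 1)
    return min(1.0, bound)

for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: advantage bound={max_membership_advantage(eps):.4f}")
```

The bound stays below 0.05 at epsilon 0.1, rises to roughly 0.46 at epsilon 1.0, and exceeds 0.999 at epsilon 10.0, at which point the formal guarantee permits an almost perfect attack.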

Dwork and Roth's text "The Algorithmic Foundations of Differential Privacy" establishes the formal definitions clearly. The gap between theoretical epsilon bounds and empirical attack success rates under realistic adversarial conditions is the space that production evaluation frameworks must address.

Own Your Data Inc. has documented this gap in deployments covered at ownmydata.ai, where the relationship between privacy budget, synthesizer architecture and downstream classifier performance varies significantly by dataset composition and fraud prevalence.

Task-Based Utility Benchmarks for Fraud Detection

Statistical fidelity metrics, even multivariate ones, measure whether the synthetic data looks like the real data. Task-based utility benchmarks measure whether models trained on synthetic data perform comparably to models trained on real data. These are different questions and both are required.

The standard task-based protocol for fraud detection synthetic data evaluation runs as follows. Train a binary classifier, typically a gradient boosted tree or a neural network appropriate to the transaction feature set, on the synthetic training data. Evaluate that classifier on a real holdout set. Compare AUC, average precision and the false positive rate at a fixed recall threshold against a baseline classifier trained on the equivalent real training set.

The gap between synthetic-trained and real-trained model performance is called the train-on-synthetic test-on-real delta, often abbreviated TSTR delta. It is the most direct measurement of whether the synthetic data is fit for its intended purpose. A TSTR delta below 2 AUC percentage points for fraud detection is generally considered acceptable for model development use cases. Larger gaps indicate the synthesizer is destroying signal in the joint feature space despite marginal distribution tests passing.
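The arithmetic behind the delta can be sketched as follows, assuming both classifiers have already been trained and have scored the same real holdout set. The labels and scores are invented, and the pairwise AUC computation is quadratic in the holdout size, so it is suitable only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def tstr_delta(labels, real_model_scores, synthetic_model_scores):
    """AUC percentage points lost by training on synthetic instead of real data."""
    return 100 * (roc_auc(labels, real_model_scores) - roc_auc(labels, synthetic_model_scores))

# Hypothetical real holdout set: 1 = fraud, 0 = legitimate.
holdout_labels = [1, 1, 0, 0, 0, 0]
real_model_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4]        # classifier trained on real data
synthetic_model_scores = [0.9, 0.35, 0.3, 0.2, 0.1, 0.4]  # classifier trained on synthetic data

delta = tstr_delta(holdout_labels, real_model_scores, synthetic_model_scores)
print(f"TSTR delta: {delta:.1f} AUC percentage points")
```

The same function can be called once per feature subset, with classifiers retrained on each subset, to surface signal loss that the aggregate number hides.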

For model development pipelines where synthetic data substitutes for real data during feature engineering and hyperparameter selection, TSTR delta must be computed per feature subset, not just for the full feature set. A synthesizer may preserve fraud signal adequately for behavioral velocity features while degrading it substantially for device fingerprint features. Aggregate TSTR metrics hide these subspace failures.

Implementation guides for open banking synthetic data pipelines, including those compatible with the UK Open Banking standards at openbanking.org.uk and CFPB Section 1033 compliant data sharing architectures, are covered in the product implementation work at mydatakey.org.

A Gold-Standard Evaluation Framework for 2026

A production-grade evaluation framework for synthetic financial data requires four components running in sequence, not as independent checkboxes.

Component 1: Multivariate Fidelity Assessment. Compute pairwise correlation matrix distance, mutual information preservation scores for key feature pairs and temporal autocorrelation coefficients for cardholder-level time series. Flag any synthesizer that scores below acceptable thresholds on joint structure metrics before proceeding.

Component 2: Adversarial Privacy Audit. Run shadow model membership inference attacks using records from the synthesizer's training set as the positive (member) class and a held-out real dataset as the negative (non-member) class. Report empirical membership inference advantage, which measures how much better than random chance an adversary can determine membership. Compute attribute inference risk for all features designated sensitive under the organization's GDPR Article 4 or CCPA Section 1798.140 definitions. This component must be run by a team independent of the synthesizer development team.

Component 3: Task-Based Utility Benchmarks. Run TSTR delta evaluation across primary downstream tasks. For fraud detection, this means binary classification AUC and average precision. For credit risk, it means rank-ordering Gini coefficient preservation. For transaction monitoring under Bank Secrecy Act obligations, it means alert recall at regulated false positive rate thresholds.

Component 4: Distributional Shift Monitoring in Deployment. Synthetic data is not a one-time artifact. Synthesizers trained on a snapshot of transaction data age as real transaction distributions drift. Continuous Population Stability Index (PSI) monitoring between the synthetic training distribution and the current real transaction distribution must be part of the production pipeline. A synthesizer validated against a 2026 Q1 transaction snapshot may be producing misleading training data by Q3 if merchant category distributions shift.
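For a categorical feature such as merchant category, the Population Stability Index in Component 4 can be sketched as below. The category names and counts are invented, and the small floor used to avoid taking the logarithm of zero for unseen categories is an implementation assumption rather than part of the PSI definition.

```python
import math
from collections import Counter

def psi(expected, actual, floor=1e-6):
    """Population Stability Index between two categorical samples.

    expected: values from the synthetic training distribution
    actual:   values from the current real transaction stream
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(e_counts) | set(a_counts):
        e = max(e_counts[category] / len(expected), floor)
        a = max(a_counts[category] / len(actual), floor)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical merchant category mix at validation time and two quarters later.
synthetic_snapshot = ["grocery"] * 50 + ["fuel"] * 50
current_real = ["grocery"] * 80 + ["fuel"] * 20

stable = psi(synthetic_snapshot, synthetic_snapshot)
shifted = psi(synthetic_snapshot, current_real)
print(f"PSI unchanged={stable:.3f}, PSI after shift={shifted:.3f}")
```

A commonly quoted rule of thumb treats PSI above 0.25 as a significant shift, but that cutoff is a convention; alerting thresholds should be calibrated against the retraining cost and the sensitivity of the downstream model.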

This four-component framework is consistent with guidance from the Bank for International Settlements on responsible use of machine learning in financial services and aligns with the FATF guidance on technology solutions for AML compliance, both of which emphasize the importance of rigorous validation for ML training data irrespective of whether that data is real or synthetic.

Implementation Considerations for Compliance Teams

Compliance officers and CISOs evaluating synthetic data programs face a specific documentation challenge. Regulators, including the OCC and the Federal Reserve, increasingly ask for model risk management documentation that covers training data quality. Synthetic data requires an additional layer of validation evidence that real data documentation does not.

At minimum, a model risk management (MRM) package for a model trained on synthetic data should include the full evaluation report from all four components described above, the epsilon and delta values for any differentially private synthesizer, empirical membership inference advantage scores and the TSTR delta across all primary evaluation tasks. Documentation of the synthesizer architecture, its training data governance and its retraining schedule should be treated as model documentation under SR 11-7 guidance from the Federal Reserve.

PCI-DSS v4 compliance teams should note that synthetic transaction data derived from real cardholder data may still be classified as cardholder data under certain scoping interpretations if the synthesis process does not meet adequate anonymization standards. The PCI Security Standards Council has not issued definitive guidance on this question as of 2026, making conservative scoping the appropriate default.

The field is moving quickly. GAN architectures are being replaced in many financial applications by diffusion-based synthesizers and transformer-based tabular generative models, both of which have different failure modes and different attack surfaces than the variational autoencoder and CTGAN implementations that most current evaluation tooling was designed against. Evaluation frameworks must be architecture-aware, not just dataset-aware.

Statistical similarity is a necessary condition for useful synthetic financial data. It has never been a sufficient one. The organizations that treat it as sufficient are building fraud models on foundations they have not tested, and accumulating privacy liability they have not measured.

Frequently Asked Questions

Why do KS tests and Jensen-Shannon divergence fail to validate synthetic transaction data?
KS tests and Jensen-Shannon divergence measure each column independently. Financial transaction data derives its predictive value from joint distributions across multiple features simultaneously. A synthesizer can pass all marginal tests while destroying the correlated feature structures that fraud detection models depend on, producing a statistically clean dataset that causes significant model degradation in production.

What is a TSTR delta and what threshold indicates a usable synthetic dataset for fraud detection?
TSTR delta is the train-on-synthetic test-on-real performance gap, measured as the AUC difference between a classifier trained on synthetic data and one trained on equivalent real data, both evaluated on a real holdout set. For fraud detection use cases, a TSTR delta below 2 AUC percentage points is generally considered acceptable for model development workflows, though subspace-level evaluation per feature group is required to surface localized signal failures.

How does membership inference attack risk change when the training dataset is highly imbalanced, as in fraud data?
High class imbalance amplifies membership inference risk significantly. Synthesizers trained on imbalanced fraud datasets tend to memorize minority-class records more aggressively because those records have outsized influence on the generative model's loss surface. Shadow model attacks exploit this by targeting the distributional signature of overfit minority-class samples. Independent adversarial audits are especially critical for fraud-domain synthesizers.

Does reporting a formal differential privacy epsilon value satisfy regulatory privacy documentation requirements for synthetic data?
Epsilon alone is insufficient for regulatory documentation under most current frameworks. The Federal Reserve's SR 11-7 model risk guidance and GDPR accountability requirements under Article 5(2) both expect empirical validation, not just theoretical bounds. A complete privacy documentation package should include empirical membership inference advantage scores alongside epsilon and delta values to demonstrate the formal guarantee translates to practical protection under realistic adversarial conditions.

Does synthetic transaction data derived from real cardholder records fall within PCI-DSS v4 scope?
The PCI Security Standards Council has not issued definitive guidance on this question as of 2026. If the synthesis process does not meet a recognized standard for anonymization that renders re-identification infeasible, conservative scoping treats synthetic derivatives as cardholder data for PCI-DSS purposes. Organizations should document their anonymization methodology and obtain qualified security assessor review before excluding synthetic datasets from PCI scope.