Transparent Model Training: On-Chain Data Provenance for Auditable AI

1. The Problem

Nobody knows what's in a training dataset.

GPT-4 was trained on some mixture of internet text, licensed data, and undisclosed sources. Stable Diffusion was trained on LAION-5B, which contained copyrighted images, medical records, and personal photographs that nobody consented to include. When lawsuits arrive — and they have — neither the model providers nor the courts can efficiently determine what was and wasn't in the training set. The data pipeline is opaque from end to end: collection, filtering, deduplication, and mixing all happen behind closed doors, documented (if at all) in internal logs that are neither standardized nor independently verifiable.

This opacity creates three problems. Model providers cannot prove compliance with data regulations (EU AI Act, proposed US legislation) because they lack verifiable records of what their models consumed. Data creators cannot determine whether their work was used in training, so they cannot enforce licensing terms or seek compensation. And downstream users of models cannot assess legal risk, because they have no way to audit the provenance of the model's knowledge.

The technical solution is straightforward. The incentive design is hard.

2. Architecture

2.1 On-Chain Dataset Commitments

Every dataset used for training is represented as a Merkle tree. Each leaf is a hash of a single training sample (a document, image, audio clip, or other data unit) along with metadata: the creator's identity (or pseudonym), a license type, a timestamp, and a content-type tag.

The Merkle root is published on-chain. This produces a compact, immutable commitment to the exact contents of the dataset at training time. Verifying that a specific sample was or was not in the dataset requires only the sample's hash and a Merkle proof — a logarithmic-size path from the leaf to the root. No one needs to download or inspect the full dataset to audit membership.

Publishing the root on a public blockchain makes the commitment tamper-evident. A model provider cannot retroactively claim their training set was different from what it was. A data creator can prove inclusion of their work by producing a valid Merkle proof against the published root.
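The commitment and proof mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not a specification: SHA-256 for both leaves and internal nodes, and duplication of the last node on odd-sized levels, are assumptions of this sketch.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256, used here for both leaves and internal nodes."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a binary Merkle tree over pre-hashed leaves."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the bool marks 'sibling is on the right'."""
    proof, level, i = [], leaves[:], index
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = i + 1 if i % 2 == 0 else i - 1
        proof.append((level[sib], sib > i))
        level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return proof

def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from leaf to root; proof size is logarithmic."""
    node = leaf
    for sibling, right in proof:
        node = h(node + sibling) if right else h(sibling + node)
    return node == root
```

A verifier holding only the published root and a proof can confirm membership without ever seeing the rest of the dataset, which is the property the audit protocol in 2.4 relies on.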

2.2 Data Registration

Creators register data on the platform. Each submission is hashed, tagged with metadata, and assigned a unique identifier. The creator specifies licensing terms: open use, commercial license with royalty, research-only, or exclusion (opt-out from all training). These terms are stored alongside the data hash and are enforceable by the platform's training pipeline.

Registration does not require the data to be hosted on-chain. The chain stores hashes and metadata. The raw data lives off-chain in conventional storage (IPFS, S3, or the creator's own servers), referenced by content-addressed hashes. This keeps on-chain costs manageable — storing a hash and metadata costs orders of magnitude less than storing a full document or image.
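A registration record might look like the following sketch. The field names and license vocabulary are hypothetical; the essential point is that the Merkle leaf covers both the content hash and the license metadata, so neither can be altered after the root is published.

```python
import hashlib
import json
import time

# Hypothetical license vocabulary for this sketch.
LICENSES = {"open", "commercial-royalty", "research-only", "opt-out"}

def register_sample(content: bytes, creator: str, license_type: str,
                    content_type: str) -> dict:
    """Build an off-chain registration record; only the leaf hash goes on-chain."""
    assert license_type in LICENSES
    record = {
        # Content-addressed pointer to the off-chain data (IPFS, S3, etc.).
        "content_hash": hashlib.sha256(content).hexdigest(),
        "creator": creator,                # identity or pseudonym
        "license": license_type,
        "content_type": content_type,
        "timestamp": int(time.time()),
    }
    # The leaf commits to the data hash AND the metadata together, so license
    # terms cannot be rewritten after the Merkle root is published.
    leaf = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    record["leaf_hash"] = leaf
    return record
```

Serializing with sorted keys makes the leaf hash deterministic for a given record, which matters when independent parties need to recompute it during an audit.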

2.3 Training Pipeline

A model trainer selects data for their training run by specifying filters: content types, license categories, quality scores, domains. The platform assembles the dataset, constructs the Merkle tree, publishes the root on-chain, and executes the training job.

The resulting model weights are associated with the published Merkle root. This creates a verifiable link: model M was trained on dataset D, where D's exact contents are committed to by root R at block height B. Any future audit starts from R.
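The pipeline step can be sketched as follows, assuming each registered sample carries a `leaf_hash` (hex) and a `license` field. The function name, the weights placeholder, and the record shape are illustrative; the output is the (M, R, B) triple described above.

```python
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle root; last node duplicated on odd-sized levels."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def run_training_job(records: list[dict], license_filter: set[str],
                     block_height: int) -> dict:
    """Filter registered samples, commit to the assembled dataset, and bind
    the resulting model weights to that commitment."""
    selected = [r for r in records if r["license"] in license_filter]
    leaves = [bytes.fromhex(r["leaf_hash"]) for r in selected]
    root = merkle_root(leaves)
    # Stand-in for the hash of actual trained weights.
    weights_hash = sha(b"model-weights-placeholder")
    return {
        "dataset_root": root.hex(),        # R: published on-chain
        "block_height": block_height,      # B: anchors the commitment in time
        "model_hash": weights_hash.hex(),  # M: linked to R
        "num_samples": len(selected),
    }
```

Any audit then starts from `dataset_root`: the trainer cannot later substitute a different dataset without the substitution being detectable against R.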

2.4 Audit Protocol

An auditor (regulator, court, data creator, or downstream user) can answer three questions:

Was sample S in the training set? Produce the hash of S and check a Merkle proof against the published root R. A valid proof establishes inclusion. Proving exclusion requires more structure: a plain Merkle tree cannot demonstrate that a hash is absent, so the platform must either commit to leaves in sorted order (so an adjacency proof can show the gap where S would fall) or allow the auditor to enumerate all leaves and confirm S is not among them.

What license governed sample S? The metadata leaf associated with S contains the license type. This is committed to by the same Merkle tree, so it cannot be altered after the fact.

Was the training run compliant with all applicable licenses? Enumerate the leaves committed under R and check that none carries a license type that conflicts with the model's intended use (e.g., a research-only sample in a commercial model's training set). This check is automatable and scales linearly with dataset size.
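The compliance check reduces to a linear scan over the committed metadata. In this sketch, the license-conflict table is a hypothetical policy chosen for illustration, not part of the protocol:

```python
# Hypothetical policy: which license types conflict with each intended use.
FORBIDDEN = {
    "commercial": {"research-only", "opt-out"},
    "research":   {"opt-out"},
}

def audit_license_compliance(leaf_metadata: list[dict],
                             intended_use: str) -> tuple[bool, list[dict]]:
    """Scan the dataset's committed metadata and flag every sample whose
    license conflicts with the model's intended use. O(n) in dataset size."""
    forbidden = FORBIDDEN[intended_use]
    violations = [m for m in leaf_metadata if m["license"] in forbidden]
    return len(violations) == 0, violations
```

Because each metadata record is committed under the same root as the sample hash, a violation found this way cannot be explained away as post-hoc relabeling.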

3. Incentive Design

The architecture is not difficult. Merkle trees, content-addressed hashing, and on-chain commitments are established primitives. The open question is why anyone would use this instead of training on scraped data with no oversight.

3.1 Why a Model Provider Would Use This

Legal protection. The EU AI Act (effective 2025) requires providers of general-purpose AI models to document their training data, including a summary of copyrighted material used. The US Copyright Office has signaled interest in similar disclosure requirements. A model trained through this platform has a ready-made compliance record. A model trained on scraped data does not. As regulatory enforcement begins, the cost of non-compliance (fines, injunctions, forced retraining) will exceed the cost of using an auditable pipeline.

Litigation defense. In the NYT v. OpenAI lawsuit and similar cases, a central factual question is whether specific copyrighted works were in the training set. Without auditable records, this question is answered through expensive, adversarial discovery. A model provider with a published Merkle root can answer it in milliseconds with a proof of inclusion (or, with a sorted-leaf design, a proof of exclusion). The legal costs saved are substantial.

Customer trust. Enterprise buyers increasingly require documentation of AI supply chains. A verifiable training provenance is a sales advantage. "We can prove exactly what our model was trained on" is a differentiator in regulated industries (finance, healthcare, legal) where the downstream user bears liability for the model's outputs.

3.2 Why a Data Creator Would Use This

Compensation. Registered data with a commercial license triggers royalty payments when used in training. The platform automates this: when a training job includes a sample with a royalty-bearing license, the fee is deducted from the trainer's account and credited to the creator. This is a revenue stream that does not exist today because creators cannot currently prove their data was used or enforce licensing terms at scale.

Control. Creators who register data can set terms. Opt-out is enforceable — the platform excludes opted-out data from all training jobs. This gives creators actual control over how their work is used, as opposed to the purely theoretical control they have today (where opt-out requests go to model providers who may or may not honor them).

Attribution. On-chain provenance is permanent. A creator can prove their data contributed to a specific model. This has value beyond royalties — for reputation, for portfolio building, for establishing priority.

3.3 What Prevents Data Extraction

If the platform exposes raw training data to the model trainer, nothing prevents them from copying it and training separately on their own infrastructure, bypassing the provenance system entirely.

Two architectures address this:

Enclave model. The platform runs training jobs in a trusted execution environment (TEE). The trainer submits a training configuration (architecture, hyperparameters, data filters). The platform assembles the data, trains the model inside the enclave, and returns only the model weights. The trainer never sees the raw data. The Merkle root commits to what was used; the enclave ensures the commitment is honest. This provides the stronger guarantee, but it requires the platform to provide compute — a capital-intensive business with thin margins and concentration risk.

Provenance model. The platform accepts that data is copyable and builds the value proposition around the audit trail rather than data access control. Raw data is accessible (possibly gated by license terms), but only training runs executed through the platform produce a verifiable provenance chain. A model provider who copies the data and trains independently gets the data but not the compliance record. In a regulatory environment where the record is what matters, this is sufficient. The analogy is open-source software licensing: the code is freely available, but commercial use without license compliance carries legal risk. The enforcement mechanism is legal, not technical.

The provenance model is weaker but more practical. It does not require the platform to operate GPU clusters. It scales by making provenance valuable rather than by making data scarce.

A hybrid is possible: the platform provides optional enclave-based training for high-sensitivity datasets (medical, financial, personally identifiable) and provenance-only tracking for general datasets.

2/9/2025