AI provenance — tier glossary
The same ten invariant tiers veric proves your data warehouse respects, with the AI-provenance tag overlay layered on top. T6 information-flow carries most of the AI-Act tag vocabulary (eu-personal-data, copyrighted-text, gdpr-erased, …); T7 erasure-completeness carries the DSAR / Art. 17 residual proofs; T8 closure carries source-attribution and license inheritance.
T-ladder is equal-weight · the tag set, not the tier shape, is what makes this an AI-provenance vertical
What you wrote is a well-formed query in the first place: it parses, has a recognizable schema reference, and isn't structurally garbage.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
The columns you reference exist with the types you assume. No silent int↔string coercion; no column rename that drifted past you.
AI-provenance tags (2)
embedding-shape · DP preview · EU AI Act · Sectoral
Vector dimensionality and dtype declared by the upstream embedding provider match the tensor consumed downstream: no silent truncation, no float64↔float32 cast, no projection-head mismatch between fine-tune base and adapter.
Regulatory anchors
- AI Act Art. 15 (accuracy/robustness)
- ISO 42001 §B.7
Example demos
- 13-embedding-leakage-from-forbidden-source
dtype-coherence · DP preview · EU AI Act · Sectoral
Numeric, string, and categorical dtypes propagate consistently through every pipeline stage. A column declared int32 in the dataset card is not silently widened or stringified before reaching the trainer.
Regulatory anchors
- AI Act Art. 15
- NIST AI RMF MEASURE-2.5
Example demos
- 09-pii-in-training-corpus
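A minimal sketch of the kind of comparison the embedding-shape and dtype-coherence tags describe. The `TensorSpec` type and `shape_mismatches` helper are hypothetical names introduced for illustration, not part of veric; the sketch only compares a declared spec against a consumed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical declaration of an embedding's dimensionality and dtype."""
    dim: int
    dtype: str


def shape_mismatches(declared: TensorSpec, consumed: TensorSpec) -> list[str]:
    """Return human-readable mismatches between provider and consumer specs."""
    problems = []
    if declared.dim != consumed.dim:
        problems.append(f"dim {declared.dim} != {consumed.dim}")
    if declared.dtype != consumed.dtype:
        problems.append(f"dtype {declared.dtype} != {consumed.dtype}")
    return problems


# Illustrative values: the provider declares float32, the consumer reads float64.
provider = TensorSpec(dim=1536, dtype="float32")
consumer = TensorSpec(dim=1536, dtype="float64")
issues = shape_mismatches(provider, consumer)
```

An empty list means the two specs agree; any entry is a candidate refutation of the tag.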
You won't materialize a null where downstream code assumes non-null. Renames don't leave dangling references that pass review.
AI-provenance tags (2)
croissant-manifest-schema · DP preview · EU AI Act
The MLCommons Croissant 1.1 manifest declared for a dataset matches the actual record set on disk: every field, distribution record, and record-set reference resolves and typechecks against the manifest schema.
Regulatory anchors
- AI Act Art. 10(2)(a)–(b)
- AI Act Annex IV §2(d)
Example demos
- 09-pii-in-training-corpus
- 14-unlicensed-image-dataset
dataset-card-schema · DP preview · EU AI Act · Sectoral
The HuggingFace dataset-card YAML front-matter (license, language, task_categories, size_categories, source_datasets) is well-formed and the declared license is recognised by the corpus crawler.
Regulatory anchors
- AI Act Art. 53(1)(c)
- ISO 42001 §B.7
Example demos
- 14-unlicensed-image-dataset
- 11-copyrighted-text-in-training-summary
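The dataset-card-schema check can be pictured roughly as follows. This is a sketch under stated assumptions: the front-matter is taken as an already-parsed dict, the five field names come from the description above, and `RECOGNISED_LICENSES` is an invented allow-list standing in for whatever the corpus crawler actually recognises.

```python
# The five front-matter fields named in the tag description.
REQUIRED_FIELDS = (
    "license",
    "language",
    "task_categories",
    "size_categories",
    "source_datasets",
)

# Assumed allow-list for illustration only.
RECOGNISED_LICENSES = {"cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}


def card_problems(front_matter: dict) -> list[str]:
    """List missing fields and an unrecognised license, if any."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in front_matter
    ]
    license_id = front_matter.get("license")
    if license_id is not None and license_id not in RECOGNISED_LICENSES:
        problems.append(f"unrecognised license: {license_id}")
    return problems


# Illustrative card: no source_datasets, and a license outside the allow-list.
card = {
    "license": "custom-eula",
    "language": ["en"],
    "task_categories": ["text-generation"],
    "size_categories": ["10K<n<100K"],
}
problems = card_problems(card)
```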
Foreign keys point at real rows. Joins target columns that actually exist. Declared types match the warehouse — no phantom columns from a hallucinated schema.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
Joins don't fan out unboundedly. Loops and recursion terminate. A boolean flag's meaning is consistent across every live consumer.
AI-provenance tags (2)
training-set-duplicates · DP preview · EU AI Act · Sectoral
The training corpus does not contain near-duplicate rows that would inflate sample-count and bias the loss surface. Cross-corpus deduplication is materialised as a tag, not a footnote.
Regulatory anchors
- AI Act Art. 10(3) (relevant, representative)
- NIST AI RMF MEASURE-2.7
Example demos
- 12-model-output-as-training-input-loop
sample-count · DP preview · EU AI Act
Per-source row counts reported in the dataset card and Annex IV §2(d) match the shard-by-shard tally produced by the training pipeline at compile time. No silent overrun, no row-multiplier from a fan-out join.
Regulatory anchors
- AI Act Annex IV §2(d)
- AI Act Art. 53(1)(d)
Example demos
- 11-copyrighted-text-in-training-summary
- 15-synthetic-data-not-flagged
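To make the sample-count comparison concrete, here is a small sketch that tallies rows per source across shards and reports where the tally disagrees with the declared counts. The function name, the source names, and the shard representation (one source label per row) are assumptions made for the example.

```python
from collections import Counter


def count_discrepancies(
    declared: dict[str, int], shards: list[list[str]]
) -> dict[str, tuple[int, int]]:
    """Map each source to (declared, observed) wherever the two differ."""
    observed = Counter(source for shard in shards for source in shard)
    sources = set(declared) | set(observed)
    return {
        s: (declared.get(s, 0), observed.get(s, 0))
        for s in sorted(sources)
        if declared.get(s, 0) != observed.get(s, 0)
    }


# Illustrative data: vendor-b contributes one more row than the card declares.
declared = {"vendor-a": 3, "vendor-b": 2}
shards = [
    ["vendor-a", "vendor-a", "vendor-b"],
    ["vendor-a", "vendor-b", "vendor-b"],
]
discrepancies = count_discrepancies(declared, shards)
```

A source that appears in the shards but not in the declaration shows up with a declared count of 0, which is how an undeclared source would surface in this sketch.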
Two runs on the same data produce the same answer. No race condition, no ORDER BY drift, no batch-evaluation order changing eligibility downstream.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
PII can't reach a public sink. Tainted data can't leak through a join, union, or aggregation. Sensitive labels propagate.
AI-provenance tags (8)
eu-personal-data · DP preview · GDPR · EU AI Act
Rows derived from EU data subjects (under GDPR Art. 4(1)) carry an `eu-personal-data` taint that must propagate through every transform. A row tagged `eu-personal-data=true` cannot reach a sink declared `eu-personal-data=false` without an explicit, lawful-basis-tagged anonymisation step.
Regulatory anchors
- GDPR Art. 5(1)(a)
- GDPR Art. 6
- AI Act Art. 10(2)(g)
- EDPB Op. 28/2024
Example incidents
- italian-dpa-chatgpt-ban-2023
- clearview-ai-scraping-2020
- replika-italian-dpa-2023
Example demos
- 09-pii-in-training-corpus
- 16-minor-pii-in-fine-tune
copyrighted-text · DP preview · Copyright / DSM · EU AI Act
Text excerpts whose source is under a copyright licence the corpus does not hold (or where the upstream TDM reservation under DSM Art. 4(3) was set) carry a `copyrighted-text` taint. Forbidden-flow refutation: no `copyrighted-text=true` row reaches a generative-output sink.
Regulatory anchors
- DSM Art. 4(3)
- AI Act Art. 53(1)(c)
- AI Act Art. 53(1)(d)
- Berne Convention Art. 9
Example incidents
- nyt-v-openai-2023
- authors-guild-v-openai-2023
- stable-diffusion-getty-2023
Example demos
- 11-copyrighted-text-in-training-summary
gdpr-erased · DP preview · GDPR
Rows belonging to a subject who has exercised GDPR Art. 17 erasure are tagged `gdpr-erased=true`. Forward-flow check: no erased row appears in a downstream training corpus, embedding store, or model checkpoint produced after the deletion timestamp.
Regulatory anchors
- GDPR Art. 17
- GDPR Art. 5(1)(e)
- EDPB Op. 28/2024 §3
Example incidents
- italian-dpa-chatgpt-ban-2023
Example demos
- 10-gdpr-erased-residual
licensed-cc0 · DP preview · Copyright / DSM · EU AI Act
Rows certified under Creative Commons CC0 (or another permissive declared licence) carry an explicit licence tag that survives every join. Used as the positive case in tag-flow proofs: the `license=CC0` set is the safe-harbour subset of the corpus.
Regulatory anchors
- AI Act Art. 53(1)(c)
- DSM Art. 4(3)
Example demos
- 11-copyrighted-text-in-training-summary
- 14-unlicensed-image-dataset
synthetic · DP preview · EU AI Act · US state laws
Rows produced by a generative model carry a `synthetic=true` tag plus a generator-lineage manifest reference. Required to satisfy AI Act Art. 50 transparency and to prevent silent model-output-as-training-input loops.
Regulatory anchors
- AI Act Art. 50
- California SB 942
Example demos
- 15-synthetic-data-not-flagged
- 12-model-output-as-training-input-loop
model-output · DP preview · EU AI Act
Outputs sampled from a deployed model. Tagging this class of row is the only way to make the model-output→training-input feedback loop visible to the verifier.
Regulatory anchors
- AI Act Art. 50
- AI Act Art. 53(1)(d)
Example incidents
- air-canada-chatbot-2024
Example demos
- 12-model-output-as-training-input-loop
training-input · DP preview · EU AI Act
The negative-space tag: every row that crossed the boundary from raw corpus to a training shard. Combined with `eu-personal-data`, `copyrighted-text`, `gdpr-erased`, etc. to form forbidden-flow predicates of the form `tag=X ⇒ training-input=false`.
Regulatory anchors
- AI Act Art. 10
- AI Act Annex IV §2(d)
Example incidents
- samsung-chatgpt-leak-2023
Example demos
- 09-pii-in-training-corpus
- 16-minor-pii-in-fine-tune
red-team-curated · EU AI Act · Sectoral
Rows that passed an internal red-team review and have a signed reviewer attestation in the audit ledger. Used as a positive gate on jailbreak / safety-evaluation corpora.
Regulatory anchors
- AI Act Art. 55
- NIST AI RMF MANAGE-2.3
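The forbidden-flow predicate `tag=X ⇒ training-input=false` from the training-input entry can be illustrated with a toy tag-propagation model. The sketch assumes that a join carries the union of both inputs' tags and that the three taints named in that entry are the forbidden set; the function names are hypothetical, and the lawful-basis anonymisation step mentioned under `eu-personal-data` is deliberately left out.

```python
# Taints that must never coexist with training-input=true (assumed set).
FORBIDDEN_IN_TRAINING = frozenset(
    {"eu-personal-data", "copyrighted-text", "gdpr-erased"}
)


def join_tags(left: frozenset, right: frozenset) -> frozenset:
    """A joined row carries the union of both inputs' tags."""
    return left | right


def violations(row_tags: frozenset, is_training_input: bool) -> set:
    """Tags that break `tag=X => training-input=false` for this row."""
    if not is_training_input:
        return set()
    return set(row_tags & FORBIDDEN_IN_TRAINING)


# Illustrative rows: a tainted CRM row joined with a CC0 public row.
crm_row = frozenset({"eu-personal-data"})
public_row = frozenset({"licensed-cc0"})
joined = join_tags(crm_row, public_row)
found = violations(joined, is_training_input=True)
```

The point of the example is that the taint survives the join: the permissive tag on one side does not wash out the sensitive tag on the other.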
Numeric values stay inside their declared bounds. Phantom columns are refused. Timestamp joins handle DST and timezone discontinuities correctly.
AI-provenance tags (3)
dsar-residual · DP preview · GDPR · FTC · US state laws
After a Data Subject Access Request erasure runs against the source warehouse, the verifier proves no path remains from the deleted row to any pinned model artefact (embedding, checkpoint, fine-tune, KV-cache). This is the P7 erasure-completeness primitive.
Regulatory anchors
- GDPR Art. 17
- FTC §5 (algorithmic disgorgement)
- CCPA §1798.105
Example incidents
- italian-dpa-chatgpt-ban-2023
- openai-memory-gdpr-art17-2024
Example demos
- 10-gdpr-erased-residual
fine-tune-residue · DP preview · GDPR
Specialised case of dsar-residual: weights produced by LoRA / full fine-tuning are themselves a downstream artefact of the source rows. Erasure-completeness must reach into adapter weights, not just the corpus shard.
Regulatory anchors
- GDPR Art. 17
- EDPB Op. 28/2024 §3
Example incidents
- openai-memory-gdpr-art17-2024
- samsung-chatgpt-leak-2023
Example demos
- 16-minor-pii-in-fine-tune
- 10-gdpr-erased-residual
embedding-leakage · DP preview · GDPR · EU AI Act
The vector store (FAISS, pgvector, Pinecone) holds embeddings derived from rows that should no longer be reachable from production sinks. The reverse-direction tag-flow proof must close over the embedding index, not just the source table.
Regulatory anchors
- GDPR Art. 17
- AI Act Art. 10(5)
Example incidents
- samsung-chatgpt-leak-2023
Example demos
- 13-embedding-leakage-from-forbidden-source
- 10-gdpr-erased-residual
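The "no path remains" condition in these erasure tags is, at its simplest, a reachability question over a lineage graph. The following sketch assumes lineage is available as an adjacency map from each node to the artefacts derived from it; the node names and the `reachable_artefacts` helper are invented for illustration and say nothing about how veric represents lineage internally.

```python
from collections import deque


def reachable_artefacts(
    lineage: dict[str, list[str]], start: str, artefacts: set[str]
) -> list[str]:
    """Breadth-first search: which pinned artefacts derive from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen & artefacts)


# Illustrative lineage: one row feeds a corpus shard and an embedding index,
# and the shard in turn feeds a LoRA adapter.
lineage = {
    "users.row42": ["corpus.shard3", "embeddings.idx7"],
    "corpus.shard3": ["lora.adapter-v2"],
    "embeddings.idx7": [],
}
artefacts = {"embeddings.idx7", "lora.adapter-v2", "checkpoint.base"}
residual = reachable_artefacts(lineage, "users.row42", artefacts)
```

Erasure-completeness in this toy model means `residual` is empty once the row is deleted; here both the embedding index and the adapter are still reachable, which is the fine-tune-residue and embedding-leakage situation described above.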
PII tier survives every operator — no abstract over-approximation collapses to public when two sensitivity labels meet.
AI-provenance tags (2)
source-attribution · DP preview · EU AI Act · US state laws · Sectoral
Per-column / per-record provenance string (vendor + dataset version + acquisition date + due-diligence record ID) survives every transform. The closure of source-attribution sets at any sink IS the Annex IV §2(d) summary.
Regulatory anchors
- AI Act Art. 10(2)(b)
- AI Act Art. 53(1)(d)
- California AB 2013
- SR 11-7 §III.E
Example incidents
- nyt-v-openai-2023
- stable-diffusion-getty-2023
- clearview-ai-scraping-2020
Example demos
- 11-copyrighted-text-in-training-summary
- 14-unlicensed-image-dataset
license-inheritance · DP preview · Copyright / DSM · EU AI Act
When two rows are joined, the resulting row inherits the meet of their licences in the licence lattice. CC-BY ⨝ proprietary = proprietary; CC-BY-SA ⨝ MIT = CC-BY-SA. The verifier rejects any sink where the inherited licence is incompatible with the declared use.
Regulatory anchors
- DSM Art. 4(3)
- AI Act Art. 53(1)(c)
- Creative Commons compatibility chart
Example incidents
- stable-diffusion-getty-2023
- authors-guild-v-openai-2023
Example demos
- 14-unlicensed-image-dataset
- 11-copyrighted-text-in-training-summary
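The two licence examples in the license-inheritance entry can be reproduced with a deliberately simplified model. The sketch assumes a total order from most permissive to most restrictive, which is an illustrative stand-in: a real licence lattice is a partial order with incompatible pairs, and the ordering below is not taken from veric.

```python
# Assumed ordering, most permissive first. Real licence compatibility is
# a partial order; a single chain is used here only to keep the sketch short.
RESTRICTIVENESS = ["CC0", "MIT", "CC-BY", "CC-BY-SA", "proprietary"]
RANK = {name: index for index, name in enumerate(RESTRICTIVENESS)}


def license_meet(left: str, right: str) -> str:
    """Return the more restrictive of two licences (the meet in this chain)."""
    return left if RANK[left] >= RANK[right] else right


# The two joins given in the tag description.
first = license_meet("CC-BY", "proprietary")
second = license_meet("CC-BY-SA", "MIT")
```

Because the meet is commutative and idempotent, the inherited licence does not depend on join order, which is what lets it survive arbitrary query plans.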
Two queries provably compute the same answer on every input. A refactor proven equivalent is a refactor that won't surprise you in production.
AI-provenance tags (2)
model-card-attestation · DP preview · EU AI Act · US state laws
Every model-card field (training data, evaluation metrics, intended use, known limitations) is backed by a compile-time fact pinned to a git SHA + dbt manifest hash. The model card is signed; tampering refutes the signature.
Regulatory anchors
- AI Act Art. 13
- AI Act Annex IV §2(e)
- Colorado AI Act developer disclosure
- Texas TRAIGA safe-harbor
Example incidents
- air-canada-chatbot-2024
Example demos
- 12-model-output-as-training-input-loop
- 15-synthetic-data-not-flagged
annex-iv-reference · DP preview · EU AI Act
Each Annex IV §1–§9 section in the published technical-doc pack is a hash-pinned reference into the audit ledger. A regulator can replay any cited fact against the original Croissant manifest + pipeline IR snapshot.
Regulatory anchors
- AI Act Art. 11
- AI Act Annex IV
- AI Act Art. 53(1)(a)
Example incidents
- eu-commission-art53-investigation-2026
Example demos
- 11-copyrighted-text-in-training-summary
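The "signed; tampering refutes the signature" property of model-card-attestation can be demonstrated with a keyed hash over a canonical serialisation of the card. The signing scheme is not specified in this glossary, so HMAC-SHA256 with a shared key is an assumption chosen for brevity (a production system would more plausibly use asymmetric signatures), and the card fields and key below are placeholders.

```python
import hashlib
import hmac
import json


def sign_card(card: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical (sorted-key, compact) JSON encoding."""
    payload = json.dumps(card, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_card(card: dict, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed and stored signatures."""
    return hmac.compare_digest(sign_card(card, key), signature)


# Placeholder key and card contents, for illustration only.
key = b"demo-key-not-for-production"
card = {
    "intended_use": "support triage",
    "git_sha": "0123abc",
    "dbt_manifest_hash": "deadbeef",
}
signature = sign_card(card, key)
tampered = {**card, "intended_use": "credit scoring"}
```

Verifying `card` against `signature` succeeds, while verifying `tampered` fails, because any change to a field changes the canonical payload.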