AI provenance — tier glossary

The same ten invariant tiers veric proves your data warehouse respects, with the AI-provenance tag overlay layered on top. T6 information-flow carries most of the AI-Act tag vocabulary (eu-personal-data, copyrighted-text, gdpr-erased, …); T7 erasure-completeness carries the DSAR / Art. 17 residual proofs; T8 closure carries source-attribution and license inheritance.

T-ladder is equal-weight · the tag set, not the tier shape, is what makes this an AI-provenance vertical

Design partner preview. The tag glossary itself is real and aligned with the Batch 4 product surface. Some example incidents and demos referenced from each tag link out to content shipping in this same batch (Agents D and F); slug links resolve once those agents land. Tags carrying unmet references show a per-card DP-preview chip.

T0 · tier definition
Shape
↗ /tiers#T0
The thing you wrote is a well-formed query in the first place: it parses, has a recognizable schema reference, and isn't structurally garbage.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
T1 · tier definition
Type
↗ /tiers#T1
The columns you reference exist with the types you assume. No silent int↔string coercion; no column rename that drifted past you.
AI-provenance tags (2)
DP preview
embedding-shape
EU AI Act · Sectoral
Vector dimensionality and dtype declared by the upstream embedding provider match the tensor consumed downstream: no silent truncation, no float64↔float32 cast, no projection-head mismatch between fine-tune base and adapter.
Regulatory anchors
- AI Act Art. 15 — accuracy/robustness
- ISO 42001 §B.7
Example demos
13-embedding-leakage-from-forbidden-source
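The embedding-shape check described above can be sketched as a comparison between a declared and a consumed spec. This is a minimal illustration only: `EmbeddingSpec` and `embedding_shape_violations` are hypothetical names, not part of veric, and a real check would read these facts from the provider's metadata and the consumer's tensor.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of what a provider declares or a consumer expects."""
    dim: int
    dtype: str  # e.g. "float32"


def embedding_shape_violations(declared: EmbeddingSpec, consumed: EmbeddingSpec) -> List[str]:
    """Return human-readable mismatches; an empty list means the specs agree."""
    problems = []
    if declared.dim != consumed.dim:
        problems.append(f"dim: declared {declared.dim}, consumed {consumed.dim}")
    if declared.dtype != consumed.dtype:
        problems.append(f"dtype: declared {declared.dtype}, consumed {consumed.dtype}")
    return problems
```

Reporting every mismatch rather than stopping at the first keeps a truncation and a cast visible in the same pass.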
DP preview
dtype-coherence
EU AI Act · Sectoral
Numeric, string, and categorical dtypes propagate consistently through every pipeline stage. A column declared int32 in the dataset card is not silently widened or stringified before reaching the trainer.
Regulatory anchors
- AI Act Art. 15
- NIST AI RMF MEASURE-2.5
Example demos
09-pii-in-training-corpus
T2 · tier definition
Null-propagation
↗ /tiers#T2
You won't materialize a null where downstream code assumes non-null. Renames don't leave dangling references that pass review.
AI-provenance tags (2)
DP preview
croissant-manifest-schema
EU AI Act
The MLCommons Croissant 1.1 manifest declared for a dataset matches the actual record set on disk: every field, distribution record, and record-set reference resolves and typechecks against the manifest schema.
Regulatory anchors
- AI Act Art. 10(2)(a)–(b)
- AI Act Annex IV §2(d)
Example demos
09-pii-in-training-corpus
14-unlicensed-image-dataset
DP preview
dataset-card-schema
EU AI Act · Sectoral
The HuggingFace dataset-card YAML front-matter (license, language, task_categories, size_categories, source_datasets) is well-formed and the declared license is recognised by the corpus crawler.
Regulatory anchors
- AI Act Art. 53(1)(c)
- ISO 42001 §B.7
Example demos
14-unlicensed-image-dataset
11-copyrighted-text-in-training-summary
T3 · tier definition
Referential integrity
↗ /tiers#T3
Foreign keys point at real rows. Joins target columns that actually exist. Declared types match the warehouse — no phantom columns from a hallucinated schema.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
T4 · tier definition
Cardinality / control-flow
↗ /tiers#T4
Joins don't fan out unboundedly. Loops and recursion terminate. A boolean flag's meaning is consistent across every live consumer.
AI-provenance tags (2)
DP preview
training-set-duplicates
EU AI Act · Sectoral
The training corpus does not contain near-duplicate rows that would inflate sample-count and bias the loss surface. Cross-corpus deduplication is materialised as a tag, not a footnote.
Regulatory anchors
- AI Act Art. 10(3) — relevant, representative
- NIST AI RMF MEASURE-2.7
Example demos
12-model-output-as-training-input-loop
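One way to picture the training-set-duplicates tag is a pairwise near-duplicate scan. The sketch below uses word-shingle Jaccard similarity, which is a swapped-in textbook technique chosen for illustration; the glossary does not say how veric detects near-duplicates, and the function names and the 0.8 threshold are assumptions.

```python
def shingles(text, k=3):
    """Set of k-word windows over lower-cased text (assumed normalisation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(rows, threshold=0.8):
    """Index pairs (i, j), i < j, whose shingle similarity meets the threshold."""
    sets = [shingles(r) for r in rows]
    return [
        (i, j)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

The quadratic scan is only suitable for small samples; a corpus-scale check would need an index such as MinHash, which is outside this sketch.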
DP preview
sample-count
EU AI Act
Per-source row counts reported in the dataset card and Annex IV §2(d) match the shard-by-shard tally produced by the training pipeline at compile time. No silent overrun, no row-multiplier from a fan-out join.
Regulatory anchors
- AI Act Annex IV §2(d)
- AI Act Art. 53(1)(d)
Example demos
11-copyrighted-text-in-training-summary
15-synthetic-data-not-flagged
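The sample-count comparison above amounts to reconciling declared per-source counts against a shard-by-shard tally. A minimal sketch, assuming each shard reports a `source -> row count` mapping (a hypothetical shape, not a documented veric format):

```python
from collections import Counter


def sample_count_mismatches(declared, shards):
    """Map each disagreeing source to (declared, tallied).

    `declared` is the per-source count from the dataset card;
    `shards` is an iterable of per-shard {source: count} mappings.
    """
    tally = Counter()
    for shard in shards:
        tally.update(shard)  # Counter.update adds counts from a mapping
    sources = set(declared) | set(tally)
    return {
        s: (declared.get(s, 0), tally.get(s, 0))
        for s in sources
        if declared.get(s, 0) != tally.get(s, 0)
    }
```

An empty result means the card and the pipeline agree; a source that appears on only one side shows up with a zero on the other.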
T5 · tier definition
Ordering / determinism
↗ /tiers#T5
Two runs on the same data produce the same answer. No race condition, no ORDER BY drift, no batch-evaluation order changing eligibility downstream.
No AI-provenance tags surface at this tier in the current default vocabulary — customers extend the tag set freely.
T6 · tier definition
Information-flow reachability
↗ /tiers#T6
PII can't reach a public sink. Tainted data can't leak through a join, union, or aggregation. Sensitive labels propagate.
AI-provenance tags (8)
DP preview
eu-personal-data
GDPR · EU AI Act
Rows derived from EU data subjects (under GDPR Art. 4(1)) carry an `eu-personal-data` taint that must propagate through every transform. A row tagged `eu-personal-data=true` cannot reach a sink declared `eu-personal-data=false` without an explicit, lawful-basis-tagged anonymisation step.
Regulatory anchors
- GDPR Art. 5(1)(a)
- GDPR Art. 6
- AI Act Art. 10(2)(g)
- EDPB Op. 28/2024
Example incidents
italian-dpa-chatgpt-ban-2023
clearview-ai-scraping-2020
replika-italian-dpa-2023
Example demos
09-pii-in-training-corpus
16-minor-pii-in-fine-tune
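The eu-personal-data rule above (taint propagates through every transform, and only an explicit anonymisation step tagged with a lawful basis clears it) can be sketched as follows. `Row`, `join_rows`, `anonymise`, and `can_reach_sink` are hypothetical names for illustration, and the lawful-basis string is an invented label.

```python
from dataclasses import dataclass
from typing import FrozenSet

TAINT = "eu-personal-data"


@dataclass(frozen=True)
class Row:
    value: str
    tags: FrozenSet[str] = frozenset()


def join_rows(left: Row, right: Row) -> Row:
    """A transform: the output carries the union of both inputs' tags."""
    return Row(f"{left.value}|{right.value}", left.tags | right.tags)


def anonymise(row: Row, lawful_basis: str) -> Row:
    """The only step allowed to drop the taint; it must name a lawful basis."""
    if not lawful_basis:
        raise ValueError("anonymisation requires a lawful-basis tag")
    return Row("<redacted>", (row.tags - {TAINT}) | {f"anonymised:{lawful_basis}"})


def can_reach_sink(row: Row, sink_allows_personal_data: bool) -> bool:
    """A sink declared eu-personal-data=false rejects any tainted row."""
    return sink_allows_personal_data or TAINT not in row.tags
```

Because the join takes the union of tags, a single tainted input is enough to taint the output, which is the propagation behaviour the tag description asks for.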
DP preview
copyrighted-text
Copyright / DSM · EU AI Act
Text excerpts whose source is under a copyright licence the corpus does not hold (or where the upstream TDM reservation under DSM Art. 4(3) was set) carry a `copyrighted-text` taint. Forbidden-flow refutation: no `copyrighted-text=true` row reaches a generative-output sink.
Regulatory anchors
- DSM Art. 4(3)
- AI Act Art. 53(1)(c)
- AI Act Art. 53(1)(d)
- Berne Convention Art. 9
Example incidents
nyt-v-openai-2023
authors-guild-v-openai-2023
stable-diffusion-getty-2023
Example demos
11-copyrighted-text-in-training-summary
DP preview
gdpr-erased
GDPR
Rows belonging to a subject who has exercised GDPR Art. 17 erasure are tagged `gdpr-erased=true`. Forward-flow check: no erased row appears in a downstream training corpus, embedding store, or model checkpoint produced after the deletion timestamp.
Regulatory anchors
- GDPR Art. 17
- GDPR Art. 5(1)(e)
- EDPB Op. 28/2024 §3
Example incidents
italian-dpa-chatgpt-ban-2023
Example demos
10-gdpr-erased-residual
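The gdpr-erased forward-flow check above compares each artefact's build time with the deletion timestamp of the rows it contains. A minimal sketch, assuming erasures are available as a `row_id -> deletion timestamp` mapping (a hypothetical shape chosen for the example):

```python
from datetime import datetime, timezone


def erased_rows_in_artifact(erasures, artifact_row_ids, artifact_built_at):
    """Row IDs that were erased before the artefact was built yet still appear in it.

    `erasures` maps row_id -> timezone-aware deletion timestamp.
    """
    return sorted(
        rid
        for rid in artifact_row_ids
        if rid in erasures and artifact_built_at > erasures[rid]
    )


# Usage: an artefact built after the deletion must not contain the erased row.
erasures = {"r1": datetime(2024, 1, 10, tzinfo=timezone.utc)}
late_build = datetime(2024, 2, 1, tzinfo=timezone.utc)
violations = erased_rows_in_artifact(erasures, ["r1", "r2"], late_build)
```

Artefacts built before the deletion timestamp are outside this forward-flow check; per the glossary, those residual cases fall under the T7 erasure tags.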
DP preview
licensed-cc0
Copyright / DSM · EU AI Act
Rows certified under Creative Commons CC0 (or another permissive declared licence) carry an explicit licence tag that survives every join. Used as the positive case in tag-flow proofs — the `license=CC0` set is the safe-harbour subset of the corpus.
Regulatory anchors
- AI Act Art. 53(1)(c)
- DSM Art. 4(3)
Example demos
11-copyrighted-text-in-training-summary
14-unlicensed-image-dataset
DP preview
synthetic
EU AI Act · US state laws
Rows produced by a generative model carry a `synthetic=true` tag plus a generator-lineage manifest reference. Required to satisfy AI Act Art. 50 transparency and to prevent silent model-output-as-training-input loops.
Regulatory anchors
- AI Act Art. 50
- California SB 942
Example demos
15-synthetic-data-not-flagged
12-model-output-as-training-input-loop
DP preview
model-output
EU AI Act
Outputs sampled from a deployed model. Tagging this class of row is the only way to make the model-output→training-input feedback loop visible to the verifier.
Regulatory anchors
- AI Act Art. 50
- AI Act Art. 53(1)(d)
Example incidents
air-canada-chatbot-2024
Example demos
12-model-output-as-training-input-loop
DP preview
training-input
EU AI Act
The negative-space tag: it marks every row that crossed the boundary from raw corpus to a training shard. It combines with `eu-personal-data`, `copyrighted-text`, `gdpr-erased`, etc. to form forbidden-flow predicates of the form `tag=X ⇒ training-input=false`.
Regulatory anchors
- AI Act Art. 10
- AI Act Annex IV §2(d)
Example incidents
samsung-chatgpt-leak-2023
Example demos
09-pii-in-training-corpus
16-minor-pii-in-fine-tune
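The forbidden-flow predicate `tag=X ⇒ training-input=false` from the training-input tag can be evaluated row by row. The sketch below assumes each row exposes its tags as a boolean mapping; that representation, the function name, and the default forbidden set are illustrative assumptions.

```python
FORBIDDEN_TAGS = ("eu-personal-data", "copyrighted-text", "gdpr-erased")


def forbidden_flow_violations(rows, forbidden=FORBIDDEN_TAGS):
    """List (row_id, tag) pairs where tag=true and training-input=true coexist.

    `rows` maps row_id -> {tag_name: bool}; a missing tag counts as false.
    """
    violations = []
    for row_id, tags in rows.items():
        if not tags.get("training-input"):
            continue  # the implication holds trivially for non-training rows
        for tag in forbidden:
            if tags.get(tag):
                violations.append((row_id, tag))
    return violations
```

An empty list means every predicate holds on the rows supplied; it says nothing about rows the caller did not pass in.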
red-team-curated
EU AI Act · Sectoral
Rows that passed an internal red-team review and have a signed reviewer attestation in the audit ledger. Used as a positive gate on jailbreak / safety-evaluation corpora.
Regulatory anchors
- AI Act Art. 55
- NIST AI RMF MANAGE-2.3
T7 · tier definition
Range / interval
↗ /tiers#T7
Numeric values stay inside their declared bounds. Phantom columns are refused. Timestamp joins handle DST and timezone discontinuities correctly.
AI-provenance tags (3)
DP preview
dsar-residual
GDPR · FTC · US state laws
After a Data Subject Access Request erasure runs against the source warehouse, the verifier proves no path remains from the deleted row to any pinned model artefact (embedding, checkpoint, fine-tune, KV-cache). This is the P7 erasure-completeness primitive.
Regulatory anchors
- GDPR Art. 17
- FTC §5 — algorithmic disgorgement
- CCPA §1798.105
Example incidents
italian-dpa-chatgpt-ban-2023
openai-memory-gdpr-art17-2024
Example demos
10-gdpr-erased-residual
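The dsar-residual claim above ("no path remains from the deleted row to any pinned model artefact") is a graph-reachability statement. A toy version, assuming lineage is available as an adjacency mapping from each node to the nodes derived from it (the node naming and function name are hypothetical):

```python
from collections import deque


def reachable_artifacts(lineage, deleted_row, pinned_artifacts):
    """Pinned artefacts still reachable from the deleted row via lineage edges.

    `lineage` maps a node to the nodes derived from it. An empty result
    means no residual path exists in this (assumed complete) graph.
    """
    seen = {deleted_row}
    queue = deque([deleted_row])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen & set(pinned_artifacts))
```

The result is only as trustworthy as the lineage graph: an edge missing from the graph is a path the search cannot see.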
DP preview
fine-tune-residue
GDPR
Specialised case of dsar-residual: weights produced by LoRA / full fine-tuning are themselves a downstream artefact of the source rows. Erasure-completeness must reach into adapter weights, not just the corpus shard.
Regulatory anchors
- GDPR Art. 17
- EDPB Op. 28/2024 §3
Example incidents
openai-memory-gdpr-art17-2024
samsung-chatgpt-leak-2023
Example demos
16-minor-pii-in-fine-tune
10-gdpr-erased-residual
DP preview
embedding-leakage
GDPR · EU AI Act
The vector store (FAISS, pgvector, Pinecone) holds embeddings derived from rows that should no longer be reachable from production sinks. The reverse-direction tag-flow proof must close over the embedding index, not just the source table.
Regulatory anchors
- GDPR Art. 17
- AI Act Art. 10(5)
Example incidents
samsung-chatgpt-leak-2023
Example demos
13-embedding-leakage-from-forbidden-source
10-gdpr-erased-residual
T8 · tier definition
Lattice PII (closure under join)
↗ /tiers#T8
PII tier survives every operator — no abstract over-approximation collapses to public when two sensitivity labels meet.
AI-provenance tags (2)
DP preview
source-attribution
EU AI Act · US state laws · Sectoral
Per-column / per-record provenance string (vendor + dataset version + acquisition date + due-diligence record ID) survives every transform. The closure of source-attribution sets at any sink IS the Annex IV §2(d) summary.
Regulatory anchors
- AI Act Art. 10(2)(b)
- AI Act Art. 53(1)(d)
- California AB 2013
- SR 11-7 §III.E
Example incidents
nyt-v-openai-2023
stable-diffusion-getty-2023
clearview-ai-scraping-2020
Example demos
11-copyrighted-text-in-training-summary
14-unlicensed-image-dataset
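The source-attribution closure above can be pictured as set union: each derived row carries the union of its parents' provenance records, and the summary at a sink is the union over everything that reached it. The `Source` fields mirror the provenance string in the tag description; the type and function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Source(NamedTuple):
    vendor: str
    dataset_version: str
    acquired: str          # acquisition date, ISO 8601 assumed
    due_diligence_id: str


def derive(*parents: FrozenSet[Source]) -> FrozenSet[Source]:
    """Provenance of a derived row: the union of all parent provenance sets."""
    return frozenset().union(*parents)


def sink_summary(rows: Iterable[FrozenSet[Source]]) -> List[Source]:
    """Sorted closure of every source that contributed to rows at a sink."""
    return sorted(derive(*rows))
```

Union never drops a record, so a source that contributed anywhere upstream still appears in the summary at the sink.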
DP preview
license-inheritance
Copyright / DSM · EU AI Act
When two rows are joined, the resulting row inherits the meet of their licences in the licence lattice. CC-BY ⨝ proprietary = proprietary; CC-BY-SA ⨝ MIT = CC-BY-SA. The verifier rejects any sink where the inherited licence is incompatible with the declared use.
Regulatory anchors
- DSM Art. 4(3)
- AI Act Art. 53(1)(c)
- Creative Commons compatibility chart
Example incidents
stable-diffusion-getty-2023
authors-guild-v-openai-2023
Example demos
14-unlicensed-image-dataset
11-copyrighted-text-in-training-summary
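The licence meet used by license-inheritance can be illustrated with a deliberately simplified model. The sketch below ranks licences on a single restrictiveness scale, so the meet is just the more restrictive of the two; a real licence lattice is a partial order, and both the ranking and the function names here are assumptions made so that the two examples in the tag description come out as stated.

```python
# Assumed restrictiveness ranks (higher = more restrictive). Illustrative only.
RESTRICTIVENESS = {
    "CC0": 0,
    "MIT": 1,
    "CC-BY": 2,
    "CC-BY-SA": 3,
    "proprietary": 4,
}


def meet(left: str, right: str) -> str:
    """Licence inherited by a joined row: the more restrictive input."""
    return max(left, right, key=RESTRICTIVENESS.__getitem__)


def sink_accepts(inherited: str, most_restrictive_allowed: str) -> bool:
    """A sink accepts rows no more restrictive than its declared ceiling."""
    return RESTRICTIVENESS[inherited] <= RESTRICTIVENESS[most_restrictive_allowed]
```

On a chain like this the meet is commutative, associative, and idempotent, so the inherited licence does not depend on join order.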
T9 · tier definition
Equivalence
↗ /tiers#T9
Two queries provably compute the same answer on every input. A refactor proven equivalent is a refactor that won't surprise you in production.
AI-provenance tags (2)
DP preview
model-card-attestation
EU AI Act · US state laws
Every model-card field (training data, evaluation metrics, intended use, known limitations) is backed by a compile-time fact pinned to a git SHA + dbt manifest hash. The model card is signed; tampering refutes the signature.
Regulatory anchors
- AI Act Art. 13
- AI Act Annex IV §2(e)
- Colorado AI Act developer disclosure
- Texas TRAIGA safe-harbor
Example incidents
air-canada-chatbot-2024
Example demos
12-model-output-as-training-input-loop
15-synthetic-data-not-flagged
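"The model card is signed; tampering refutes the signature" can be demonstrated with a keyed hash over a canonical serialisation of the card. The sketch uses HMAC-SHA256 from the Python standard library as a stand-in; the glossary does not say which signature scheme is used, and a production system would more plausibly use an asymmetric signature. The field names in the usage example are invented.

```python
import hashlib
import hmac
import json


def canonical_bytes(card: dict) -> bytes:
    """Serialise with sorted keys so equal cards always hash identically."""
    return json.dumps(card, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_card(card: dict, key: bytes) -> str:
    """Hex HMAC-SHA256 tag over the canonical card bytes."""
    return hmac.new(key, canonical_bytes(card), hashlib.sha256).hexdigest()


def verify_card(card: dict, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed tag with the stored one."""
    return hmac.compare_digest(sign_card(card, key), signature)


# Usage: any edit to a field after signing makes verification fail.
card = {"git_sha": "abc123", "intended_use": "support triage"}
signature = sign_card(card, b"demo-key")
tampered = dict(card, intended_use="credit scoring")
```

Sorting keys before hashing matters: without a canonical form, two semantically identical cards could produce different tags.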
DP preview
annex-iv-reference
EU AI Act
Each Annex IV §1–§9 section in the published technical-doc pack is a hash-pinned reference into the audit ledger. A regulator can replay any cited fact against the original Croissant manifest + pipeline IR snapshot.
Regulatory anchors
- AI Act Art. 11
- AI Act Annex IV
- AI Act Art. 53(1)(a)
Example incidents
eu-commission-art53-investigation-2026
Example demos
11-copyrighted-text-in-training-summary