run.veric.dev
AI vertical · DP preview

Getty Images v. Stability AI — January 17, 2023

Cost: unliquidated; UK High Court trial June–July 2025, judgment pending; parallel US case (D. Del.) in active discovery · Time-to-detect: months between Stable Diffusion v1 release (Aug 2022) and Getty's UK filing (Jan 2023), when watermark regurgitation was widely reproducible · Root cause class: T6 (information-flow reachability — copyrighted images → training corpus) compounded by T8 (provenance-flow — model weights regenerate the watermark of the upstream rights-holder)

What happened

On January 17, 2023, Getty Images filed suit against Stability AI in the High Court of England and Wales; on February 3, 2023, it filed a parallel case in the US District Court for the District of Delaware (1:23-cv-00135). Getty alleged that Stability had scraped more than 12 million images from Getty's website, including watermarked stock photographs, and used them as training data for the Stable Diffusion text-to-image model. Public reproductions showed Stable Diffusion outputs containing distorted but recognisable Getty watermark glyphs in the lower corners of generated images, particularly when prompted for sports, news, and editorial subject matter.

The training corpus most heavily implicated was LAION-5B — an open-source image-text dataset assembled by web-scraping with no per-image license attestation. Stability AI did not assemble LAION-5B itself but trained Stable Diffusion on it. LAION's index contained the URLs and alt-text of Getty-hosted images at scale.

The UK trial ran in summer 2025. Stability's defences included (a) that the training was performed outside the UK (server location) and (b) that the watermark regurgitation was de minimis and the result of statistical patterning, not copying. Getty's case turned on demonstrating that the model's output distribution was a derivative work of the training distribution and that the watermark was the smoking gun. As of mid-2026 the UK judgment is pending; the US case is in active discovery on a similar theory.

The case is one of the leading tests of training-corpus copyright liability for image-generation models, and it has already changed industry practice — Stability subsequently introduced an opt-out mechanism for rights-holders, and the successor model (Stable Diffusion 3) used a more carefully filtered training set with explicit license attestation.

The pattern

This case has the same shape as NYT v. OpenAI and Authors Guild v. OpenAI, but for images rather than text. A web-scraped corpus (LAION-5B) was assembled without per-image license attestation. A downstream model trainer (Stability) consumed the corpus without re-running license verification. The model's output distribution preserved enough statistical structure of the training distribution that the watermark of the rights-holder was recoverable from generated images.

The bug is not the watermark. The bug is the assembly path that lets a watermarked image reach a model artifact without attestation that the rights-holder has licensed the training use. Any pipeline where an image flows from a rights-holder's CDN through a third-party scrape index into a training corpus, without a per-image license tag flowing through the assembly graph, has this exposure.

Which tier failed

T6 information-flow at corpus assembly. The license_class tag was never assigned at the LAION-5B scrape stage — LAION's contract is "URL list of public images," not "license-cleared training corpus." The downstream consumer (Stability) treated the absence of a tag as permission. T8 provenance-flow at the output: the model card emitted by Stability did not enumerate which Getty licenses applied to which corpus subset; that absence is precisely what the courts now have to reconstruct in discovery.

The bug class is the absence of a typed flow tag travelling with the data through the corpus graph. The Getty watermark in the output is just the visible signature of an upstream tag that was never assigned.
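A minimal sketch of what "a typed flow tag travelling with the data" could look like in practice. Everything here is hypothetical and for illustration only: the `ImageRecord` type, the `LicenseClass` values, and the stage names are assumptions, not LAION's or Stability's actual schema. The point it shows is that the tag defaults to an explicit `UNDECLARED` value, so a missing tag can never be read as permission, and that the tag survives every hop through the corpus graph.

```python
from dataclasses import dataclass, replace
from enum import Enum


class LicenseClass(Enum):
    # Hypothetical tag values; a real taxonomy would be richer.
    UNDECLARED = "undeclared"
    RIGHTS_HOLDER_LICENSED_FOR_TRAINING = "rights_holder_licensed_for_training"


@dataclass(frozen=True)
class ImageRecord:
    url: str
    alt_text: str
    # Defaulting to UNDECLARED means "no tag" is a visible state, not silence.
    license_class: LicenseClass = LicenseClass.UNDECLARED
    # Names of the pipeline stages the record has passed through, in order.
    lineage: tuple = ()


def through_stage(record: ImageRecord, stage: str) -> ImageRecord:
    """Pass a record through a stage: the tag is carried unchanged, lineage grows."""
    return replace(record, lineage=record.lineage + (stage,))


# A record scraped without any license information attached.
scraped = ImageRecord(url="https://example.com/photo.jpg", alt_text="football match")
in_corpus = through_stage(through_stage(scraped, "scrape_index"), "training_corpus")
```

Because the record is immutable and every stage returns a copy, no stage can quietly drop the tag; a stage that wants to upgrade `license_class` has to do so explicitly, which is where an attestation would be attached.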

What an AG-tower-driven control would have done

A contract of the form every_record_in(image_corpus) → has_tag(license_class), with license_class: rights_holder_licensed_for_training ∈ allowed, refutes at corpus-assembly time. The verifier surfaces: "shard laion5b.shard_4429 contains 1,847,332 records with license_class: undeclared; corpus assembly gate FAIL." Either the trainer pays for licensing or removes the records; either way, the model card the regulator or court reads is honest about what is in the corpus.
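The corpus-assembly contract can be sketched as a plain check over shards. This is an illustrative sketch only, not the verifier's implementation: the record layout (dicts with an optional `license_class` key), the shard names, and the `assembly_gate` function are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical allow-list: the only tag value that clears the gate.
ALLOWED = {"rights_holder_licensed_for_training"}


def assembly_gate(shards):
    """Return one finding per (shard, disallowed license_class) pair.

    A record with no license_class key is counted as "undeclared",
    so absence of a tag is a finding rather than a pass.
    """
    findings = []
    for shard_id, records in shards.items():
        counts = Counter(r.get("license_class", "undeclared") for r in records)
        for license_class, n in sorted(counts.items()):
            if license_class not in ALLOWED:
                findings.append(
                    f"shard {shard_id} contains {n} records "
                    f"with license_class: {license_class}"
                )
    return findings


# Toy shards with made-up names and records.
shards = {
    "demo.shard_0001": [
        {"url": "a", "license_class": "rights_holder_licensed_for_training"},
        {"url": "b"},  # no tag at all
        {"url": "c"},  # no tag at all
    ],
    "demo.shard_0002": [
        {"url": "d", "license_class": "rights_holder_licensed_for_training"},
    ],
}

findings = assembly_gate(shards)
gate_status = "FAIL" if findings else "PASS"
```

On the toy data the gate fails because two records in the first shard carry no tag. The design choice that matters is the `r.get(..., "undeclared")` default: the gate is closed unless a record affirmatively carries an allowed tag.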

A second contract — flow(model_artifact) → declared_in(model_card.training_data_summary) — is the EU AI Act Art. 53(1)(d) "sufficiently detailed summary" obligation, lifted into a build-time refutation. If the corpus contains records that the summary doesn't enumerate, the model card build fails. Getty is the case that demonstrates why that contract has to be the build gate, not a post-deployment due-diligence question.
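The second contract can be sketched the same way, as a model-card build step that refuses to complete when the corpus contains sources the summary omits. This is a simplified illustration under stated assumptions: the `source` field, the `training_data_summary` list, and both function names are hypothetical, and the check is not a statement of what Art. 53(1)(d) legally requires.

```python
def undeclared_sources(corpus_records, training_data_summary):
    """Return corpus sources that the summary does not enumerate, sorted."""
    declared = set(training_data_summary)
    present = {record["source"] for record in corpus_records}
    return sorted(present - declared)


def build_model_card(corpus_records, training_data_summary):
    """Build the model card, or fail the build if the summary is incomplete."""
    missing = undeclared_sources(corpus_records, training_data_summary)
    if missing:
        raise ValueError(f"model card build FAIL: undeclared sources {missing}")
    return {"training_data_summary": sorted(set(training_data_summary))}


# Toy corpus: two sources present, only one declared in the summary.
corpus_records = [
    {"url": "a", "source": "licensed_archive"},
    {"url": "b", "source": "third_party_scrape_index"},
]

try:
    build_model_card(corpus_records, ["licensed_archive"])
    build_failed = False
except ValueError:
    build_failed = True

complete_card = build_model_card(
    corpus_records, ["licensed_archive", "third_party_scrape_index"]
)
```

Raising an exception rather than emitting a warning is the point of the sketch: the incomplete summary stops the artifact from being produced, instead of being discovered later.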

See also

Sources

See the AI-provenance tag glossary
T6 · Information-flow reachability in the canonical glossary

Each refutation in this archive is a SARIF artifact a regulator could replay tomorrow — the same artifact format the SQL-vertical playground emits today, with the AI-provenance tag glossary swapped in.

These write-ups are journalism + product framing; they are not legal advice. Regulatory citations are best-effort references to public documents at time of writing. For anticipated cases, the entry labels the framing explicitly as anticipated rather than closed.