run.veric.dev
AI vertical · DP preview

New York Times v. OpenAI & Microsoft — December 27, 2023

Cost: unliquidated; complaint pleads "billions of dollars in statutory and actual damages" · Time-to-detect: weeks to months between regurgitation discovery and filing · Root cause class: T6 (information-flow reachability — copyrighted source → training corpus → user-facing output) compounding into T8 (provenance-flow — generated text indistinguishable from licensed copy without attribution)

What happened

On December 27, 2023, The New York Times filed The New York Times Company v. Microsoft Corporation et al. (S.D.N.Y. 1:23-cv-11195), alleging that OpenAI and Microsoft trained GPT-4 and Copilot on millions of Times articles without a license. The complaint includes Exhibit J: a 100+ page appendix of side-by-side comparisons in which GPT-4, given a leading sentence, emits the next several paragraphs of paywalled Times reporting almost verbatim. The Times argues that this is direct infringement at three stages: training (copying), output (display), and the model weights themselves (a derivative work). OpenAI's public response framed the regurgitation as a "rare bug" and invoked fair-use doctrine and the transformative nature of LLM training.

Before filing, the Times had spent months in license negotiations with OpenAI; those talks broke down in mid-2023. The complaint specifically alleges that GPT-4 reproduces Times articles "near-verbatim", including reporting that readers would otherwise need a paid subscription to access. Discovery has since centered on the composition of the training corpus, the deduplication pipeline, and whether individual Times articles can be unlearned without retraining from scratch.

As of mid-2026 the case has survived a motion to dismiss on the core copyright claims; the DMCA §1202 (copyright-management-information removal) claim was dismissed and re-pleaded. The matter remains the leading US test of whether ingestion of copyrighted text into a foundation-model training corpus is fair use and, independently, whether output-stage regurgitation is its own infringement event.

The pattern

A copyrighted-source corpus was ingested into a training pipeline without a declared per-source license tag flowing through to the model artifact. The same artifact, at inference time, emitted text that was lexically derived from the ingested source. No tag and no policy gate prevented "copyright-restricted" content from flowing to a sink that would render it to an unauthenticated user. This is the same shape as a PII-reachability bug: a regulated class of input reaches a public sink because the data path was never constrained by a declared invariant.

Any pipeline that assembles its training corpus by URL-list expansion or web scraping, without a per-source license tag that flows all the way through to the model card and the inference-time output filter, has this exposure. The bug is the absence of a property, not the presence of an opcode.
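One way to picture a license tag surviving a pipeline step is the sketch below. It is illustrative only: the `Record` type, the `RESTRICTIVENESS` ordering, the license class names other than `copyrighted_news`, and the example URLs are all assumptions made for this example, not a description of any real training pipeline. It shows a deduplication step that keeps the most restrictive tag when two sources carry identical text, rather than whichever copy happened to arrive first.

```python
from dataclasses import dataclass

# Hypothetical license classes, ordered from least to most restrictive.
RESTRICTIVENESS = {"public_domain": 0, "licensed": 1, "copyrighted_news": 2}


@dataclass(frozen=True)
class Record:
    source_url: str
    text: str
    license_class: str  # must be a key of RESTRICTIVENESS


def dedupe_keep_strictest(records):
    """Collapse records with identical text, keeping the most restrictive tag.

    A naive dedup keeps whichever copy it sees first, which can silently
    replace a restricted tag with a permissive one.
    """
    strictest = {}
    for rec in records:
        kept = strictest.get(rec.text)
        if kept is None or (
            RESTRICTIVENESS[rec.license_class]
            > RESTRICTIVENESS[kept.license_class]
        ):
            strictest[rec.text] = rec
    return list(strictest.values())


# Illustrative input: the same paragraph scraped from a mirror and from
# the original publisher, plus one unrelated record.
records = [
    Record("https://mirror.example/a", "Same paragraph.", "public_domain"),
    Record("https://news.example/a", "Same paragraph.", "copyrighted_news"),
    Record("https://blog.example/b", "Another paragraph.", "licensed"),
]
deduped = dedupe_keep_strictest(records)
```

The design choice here is conservative merging: when two copies disagree about their license class, the stricter tag wins, so a permissive mirror cannot launder a restricted source.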

Which tier failed

T6 information-flow at the training stage: the copyrighted-text tag (per Article 53(1)(d) of the EU AI Act) was either never assigned to source records or was lost at the deduplication / shuffle / tokenize step.

T8 provenance-flow at the output stage: the model card carried no per-source attribution chain that would let a downstream user reconcile a generated paragraph against its training-corpus origin.

The compound failure is the bug class. T6 alone is a copyright violation; T8 alone is a transparency violation; together they are why the Times had to litigate to find out what was in the corpus.
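The output-stage half of this failure can be made concrete with a small sketch. It assumes, hypothetically, that tagged source texts are available at inference time, and it uses plain word n-gram overlap as a stand-in for whatever matching a production filter would use. The function names, the threshold, and the example sources are all invented for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def attribute_output(generated, tagged_sources, n=5, threshold=0.5):
    """Return (source_id, license_class, overlap) for sources the output echoes.

    overlap is the fraction of the output's n-grams that also occur in
    the source text.
    """
    out_grams = ngrams(generated, n)
    if not out_grams:
        return []  # output shorter than n words: nothing to compare
    hits = []
    for source_id, (license_class, text) in tagged_sources.items():
        overlap = len(out_grams & ngrams(text, n)) / len(out_grams)
        if overlap >= threshold:
            hits.append((source_id, license_class, overlap))
    return hits


# Hypothetical tagged sources: source_id -> (license_class, text).
sources = {
    "news.example/story-1": (
        "copyrighted_news",
        "the council voted late on tuesday to approve the new transit "
        "budget after hours of debate",
    ),
    "wiki.example/transit": (
        "public_domain",
        "a transit budget is a plan for spending on public transport",
    ),
}
generated = "The council voted late on Tuesday to approve the new transit budget"
hits = attribute_output(generated, sources)
```

Each hit pairs a generated passage with a source identifier and its license class, which is the minimum a downstream user would need to reconcile output against origin.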

What an AG-tower-driven control would have done

A control declared as flow(license_class:copyrighted_news) ∉ flow(model_artifact) would have been refuted at training-pipeline compile time: the verifier would trace the corpus assembly graph and surface "source nytimes.com/* enters corpus.shard_017 via 4 ingest hops; model_artifact is reachable from corpus.shard_017 via the train loop — T6 information-flow VIOLATED." A regulator could read that refutation and ask exactly one follow-up question: "is your license tag wrong, or is your pipeline wrong?" Either answer is auditable. The current state of the art is the opposite: discovery, depositions, and a years-long fact-finding mission to reconstruct what was in the corpus.
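The reachability check described above can be sketched as a breadth-first search over the corpus assembly graph. This is a minimal illustration under stated assumptions, not the verifier itself: the intermediate node names (`crawl`, `dedupe`, `tokenize`, `train_loop`), the second source, and both function names are hypothetical. Only `nytimes.com/*`, `corpus.shard_017`, `model_artifact`, and `copyrighted_news` come from the text.

```python
from collections import deque


def find_flow(edges, source, sink):
    """Breadth-first search; return a shortest path source -> sink, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in edges.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def check_control(edges, tags, forbidden_class, sink):
    """Return one witness path per forbidden-class source that reaches sink."""
    violations = []
    for node, license_class in tags.items():
        if license_class == forbidden_class:
            path = find_flow(edges, node, sink)
            if path is not None:
                violations.append(path)
    return violations


# Hypothetical pipeline graph; intermediate node names are illustrative.
edges = {
    "nytimes.com/*": ["crawl"],
    "crawl": ["dedupe"],
    "dedupe": ["tokenize"],
    "tokenize": ["corpus.shard_017"],
    "corpus.shard_017": ["train_loop"],
    "train_loop": ["model_artifact"],
    "licensed.example/*": ["eval_set"],
}
tags = {"nytimes.com/*": "copyrighted_news", "licensed.example/*": "licensed"}

violations = check_control(edges, tags, "copyrighted_news", "model_artifact")
```

An empty `violations` list means the declared control holds on this graph; a non-empty list carries a witness path for each refutation, which is what makes the follow-up question about tag versus pipeline answerable.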

See also

Sources

See the AI-provenance tag glossary
T6 · Information-flow reachability in the canonical glossary

Each refutation in this archive is a SARIF artifact a regulator could replay tomorrow — the same artifact format the SQL-vertical playground emits today, with the AI-provenance tag glossary swapped in.
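For readers unfamiliar with SARIF, the sketch below wraps a single refutation in a minimal SARIF 2.1.0 log. It is an assumption about what such an artifact could look like, not the format the playground actually emits: the tool name, the rule identifier, and the `witnessPath` property-bag key are all hypothetical.

```python
import json


def to_sarif(rule_id, message, path):
    """Wrap one refutation in a minimal SARIF 2.1.0 log (a sketch only)."""
    return {
        "version": "2.1.0",
        "runs": [
            {
                # Hypothetical tool name, for illustration.
                "tool": {"driver": {"name": "flow-verifier-sketch"}},
                "results": [
                    {
                        "ruleId": rule_id,
                        "level": "error",
                        "message": {"text": message},
                        # SARIF property bags allow tool-specific data;
                        # "witnessPath" is an invented key.
                        "properties": {"witnessPath": path},
                    }
                ],
            }
        ],
    }


log = to_sarif(
    "T6-information-flow",
    "copyrighted_news source reaches model_artifact",
    ["nytimes.com/*", "corpus.shard_017", "model_artifact"],
)
serialized = json.dumps(log, indent=2)
```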

These write-ups are journalism + product framing; they are not legal advice. Regulatory citations are best-effort references to public documents at time of writing. For anticipated cases, the entry labels the framing explicitly as anticipated rather than closed.