Authors Guild v. OpenAI — September 19, 2023

Cost: unliquidated; consolidated class seeks statutory damages of up to $150,000 per infringed work · Time-to-detect: months between author identification of training-corpus inclusion and filing · Root cause class: T6 (information-flow reachability — copyrighted books → training corpus) compounding into T8 (provenance-flow — no per-work attestation)

What happened

On September 19, 2023 the Authors Guild and seventeen named plaintiffs — including George R. R. Martin, John Grisham, Jodi Picoult, Jonathan Franzen, and David Baldacci — filed a class-action complaint against OpenAI in the Southern District of New York (1:23-cv-08292). A second class action followed on the same day (Alter et al.) and the matters were consolidated. The plaintiffs allege that OpenAI ingested at least the contents of "Books2" and "Books3" — large unlicensed e-book corpora circulated on the open web — into the GPT training pipeline. They plead that ChatGPT can produce detailed, accurate summaries and stylistic pastiches that are only achievable if the underlying texts were in the training corpus.

The complaint is structurally similar to the Times case but pleads on behalf of authors of long-form copyrighted books rather than journalism. It cites academic work (Bandy & Vincent 2021; Schaul et al. 2023 in The Washington Post) demonstrating that "Books3", a 196,640-volume corpus aggregated from Bibliotik, was used in EleutherAI's "The Pile" and is widely believed to have been ingested by major foundation models. OpenAI has not publicly confirmed or denied Books2/Books3 ingestion; the complaint frames that opacity itself as a harm.

Discovery has focused on what corpora OpenAI used, what deduplication or filtering was applied at the sourcing stage, and whether per-work provenance can be reconstructed from the model weights. As of 2026 the case is one of several consolidated class actions against frontier-model developers; the outcome will turn substantially on whether the court treats training-corpus ingestion as fair use under Authors Guild v. Google or as direct infringement under more recent caselaw.

The pattern

Same shape as NYT v. OpenAI but on long-form copyrighted text rather than news. A web-scraped corpus (in this case, an extra-legal aggregation of pirated e-books) flowed into a training pipeline without a per-source license tag. There was no assembly-time check that "every record in the corpus has a declared license_class, and license_class:pirated_books ∉ allowed_set." The author class learned of inclusion only by black-box probing of the trained model, which is exactly the kind of after-the-fact reconstruction that a provenance-flow contract is supposed to make unnecessary.

Any pipeline where the corpus assembly step accepts an opaque archive (a torrent, a bucket, a "books.tar.gz") without per-record license attestation has this exposure. The bug is the missing tag, not the missing model.

Which tier failed

T6 information-flow: the pirated_books tag never reached the corpus-assembly stage, so the policy "no piracy-derived training input" had no tag to evaluate and therefore nothing to refute. T8 provenance-flow: the model card carries no per-work attribution graph, so the only way for an author to learn whether their work was in the corpus is to prompt the deployed model and observe stylistic mimicry. The same compound failure as NYT: T6 enables the violation; T8 hides it.
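To make the T8 gap concrete, here is a minimal sketch of a per-work provenance manifest that would let an author check inclusion without probing the deployed model. Everything in it is an assumption for illustration: the names work_key, build_manifest, and lookup are hypothetical, the corpus is modeled as (shard, license_class, text) tuples, and works are keyed by an exact SHA-256 content hash, which would miss reformatted or near-duplicate editions. A real manifest would more likely key on ISBNs or fuzzy fingerprints.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def work_key(text: str) -> str:
    """Hypothetical per-work identifier: SHA-256 of the exact text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_manifest(corpus: Iterable[Tuple[str, str, str]]) -> dict:
    """Map each work's key to the shard and license_class it was ingested under.

    `corpus` yields (shard, license_class, text) tuples; this shape is an
    assumption of the sketch, not a description of any real pipeline.
    """
    manifest = {}
    for shard, license_class, text in corpus:
        manifest[work_key(text)] = {
            "shard": shard,
            "license_class": license_class,
        }
    return manifest


def lookup(manifest: dict, text: str) -> Optional[dict]:
    """Return the provenance entry for a work, or None if it was not ingested."""
    return manifest.get(work_key(text))
```

With a manifest like this published alongside the model card, the inclusion question becomes a lookup rather than an inference from stylistic mimicry.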

What an AG-tower-driven control would have done

A control of the form every_record_in(corpus) → has_tag(license_class) and license_class ∈ {licensed, public_domain, fair_use_attested} would have refuted at corpus-assembly time, with the failing record set surfaced for review: "shard books3.shard_042 contains 4,217 records with license_class: undeclared; corpus assembly gate FAIL." Replace the assembly script, re-run, get a clean SARIF, ship the model with a model card that names every shard's license. Authors who want to sue can read the SARIF and check whether they were included before any complaint hits a docket.
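The control above can be sketched as a small assembly-time gate. This is an illustration under stated assumptions, not the AG-tower implementation: the Record type, the gate function, and the choice to treat a missing tag as "undeclared" are all hypothetical, and the output is a list of plain failure messages rather than a real SARIF document.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Allowed set taken from the control statement in the text.
ALLOWED = frozenset({"licensed", "public_domain", "fair_use_attested"})


@dataclass(frozen=True)
class Record:
    """Hypothetical corpus record; license_class=None means no tag was declared."""
    shard: str
    record_id: str
    license_class: Optional[str] = None


def gate(records: Iterable[Record]) -> List[str]:
    """Return one failure message per (shard, tag) pair outside ALLOWED.

    An empty list means the gate passes.
    """
    failures: Counter = Counter()
    for r in records:
        tag = r.license_class if r.license_class is not None else "undeclared"
        if tag not in ALLOWED:
            failures[(r.shard, tag)] += 1
    return [
        f"shard {shard} contains {n:,} records with license_class: {tag}; "
        "corpus assembly gate FAIL"
        for (shard, tag), n in sorted(failures.items())
    ]
```

A caller would run gate over the assembled corpus and refuse to proceed on a non-empty result. Serializing the messages into SARIF results is left out of the sketch.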

Sources

Authors Guild et al. v. OpenAI Inc., complaint (S.D.N.Y. 1:23-cv-08292), Sep 19, 2023: https://authorsguild.org/app/uploads/2023/09/Authors-Guild-OpenAI-Class-Action-Complaint.pdf
Authors Guild press release (Sep 20, 2023): https://authorsguild.org/news/ag-and-authors-file-class-action-suit-against-openai/
Schaul, Chen & Tiku, "Inside the secret list of websites that make AI like ChatGPT sound smart," Washington Post (Apr 19, 2023): https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

These write-ups are journalism + product framing; they are not legal advice. Regulatory citations are best-effort references to public documents at time of writing. For anticipated cases, the entry labels the framing explicitly as anticipated rather than closed.