run.veric.dev
AI vertical · DP preview

Clearview AI scraping enforcement — 2020 onward

Cost: ≥€90M in Italian, Greek, French, Dutch, and UK DPA fines through 2025; permanent injunction against most private-sector US sales (BIPA settlement, 2022); ongoing enforcement
Time-to-detect: ~3 years between corpus-build start (2017) and the NYT exposé (Jan 2020) that triggered enforcement
Root cause class: T6 (information-flow reachability: public-web facial images → biometric training corpus with no consent or lawful basis), in the territory of the EU AI Act Art. 5 prohibition

What happened

Clearview AI built a facial-recognition product by scraping over 30 billion images (Clearview's own claim, March 2023) from public websites including Facebook, Instagram, LinkedIn, Twitter, Venmo, and news sites. It paired those images with face embeddings and metadata, then licensed the resulting database to law-enforcement agencies. The scrape was systematic, automated, and conducted without consent or notice to data subjects.

The New York Times published a January 18, 2020 investigation exposing the company's product and customer base. Enforcement followed across multiple jurisdictions:

  • United States — ACLU v. Clearview AI (Illinois BIPA), settled May 9, 2022. Permanent injunction against Clearview selling the database to most private US entities. Federal Trade Commission opened a parallel investigation.
  • Italy — GPDP fine of €20M (Mar 9, 2022); deletion order; ban on further processing of Italian residents' biometric data.
  • France — CNIL fine of €20M (Oct 17, 2022); subsequent €5.2M penalty (May 10, 2023) for non-compliance with the deletion order.
  • United Kingdom — ICO fine of £7,552,800 (May 23, 2022) and enforcement notice. Clearview successfully appealed jurisdiction at the First-Tier Tribunal in October 2023 (the company has no UK customers); ICO appeal pending.
  • Greece — HDPA fine of €20M (Jul 13, 2022).
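  • Netherlands — Autoriteit Persoonsgegevens fine of €30.5M (Sep 3, 2024), plus non-compliance penalty orders of up to €5.1M; the AP also said it was investigating whether the company's directors could be held personally liable.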
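The euro-denominated amounts in the list above can be tallied as a quick check on the headline cost figure. This is a minimal sketch: the dictionary labels are illustrative, and it deliberately leaves out the UK fine (denominated in sterling and under appeal) and the additional €5.1M Dutch item.

```python
from decimal import Decimal

# Euro-denominated amounts from the enforcement list, in millions of euros.
# Labels are illustrative; the UK fine (GBP) and the EUR 5.1M Dutch item are excluded.
eur_fines_millions = {
    "Italy (GPDP, Mar 2022)": Decimal("20"),
    "France (CNIL, Oct 2022)": Decimal("20"),
    "France (CNIL non-compliance penalty, May 2023)": Decimal("5.2"),
    "Greece (HDPA, Jul 2022)": Decimal("20"),
    "Netherlands (AP, Sep 2024)": Decimal("30.5"),
}

# Decimal avoids binary floating-point rounding in the sum.
total_eur_millions = sum(eur_fines_millions.values(), Decimal("0"))
print(f"EUR {total_eur_millions}M")  # prints "EUR 95.7M"
```

On these figures alone the euro total clears the €90M floor before the sterling fine is counted.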

The Italian, Greek, French, and Dutch decisions all rest on the same finding: scraping public-web images of EU residents to construct a biometric identification system has no lawful basis under GDPR Art. 6 and falls within the special-category prohibition under Art. 9. None of the decisions accepts the "publicly available data" argument: visibility on the open web is not consent for biometric processing. Under the EU AI Act (in force since August 2024, with the Art. 5 prohibitions applying from February 2, 2025), the Clearview product would also test Art. 5(1)(e), the explicit prohibition on building or expanding facial-recognition databases through "untargeted scraping of facial images from the internet or CCTV footage."

The pattern

A biometric-identifier-class data input was assembled by automated public-web scraping. There was no per-record lawful-basis attestation. There was no jurisdictional check: the corpus was global, while enforcement is per-jurisdiction. There was no path between "the scraping bot" and "the user whose face was scraped" that would let the data subject exercise their Art. 15/16/17 rights (access, rectification, erasure). The product launched, ran, and had already accumulated billions of images before regulators became aware; by Clearview's own March 2023 claim the corpus has since passed 30 billion.

Any pipeline where a biometric-class input flows into a model from a scraping source without per-record consent / lawful-basis attestation is, post-AI-Act, a per-record Art. 5(1)(e) violation. The Clearview pattern is the canonical one.

Which tier failed

T6 information-flow on training-corpus assembly: eu_personal_data:biometric flowed into a corpus with no lawful_basis attestation. The EU AI Act's Art. 5(1)(e) prohibition makes this a binary: not "missing lawful basis" but "the corpus assembly itself is prohibited." Compile-time refutation should be a build break, not a fine.

A regulator can ask three questions of any biometric-corpus operator: (a) by which lawful basis under GDPR Art. 6/9 is each record processed, (b) by which technical measures is Art. 5(1)(e) of the AI Act respected, (c) what is the deletion pathway when an Art. 17 request is exercised. Clearview's product has no defensible answer to any of them, which is why the fines have stacked.
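The three questions above map naturally onto three per-record fields. The sketch below is one hypothetical way to model that, not a format the write-up or any regulator prescribes: `RecordAttestation`, its field names, and `unanswered_questions` are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RecordAttestation:
    """Hypothetical per-record answers to the regulator's three questions."""
    record_id: str
    lawful_basis: Optional[str]      # (a) GDPR Art. 6/9 basis, None if unattested
    art5_measure: Optional[str]      # (b) technical measure respecting AI Act Art. 5(1)(e)
    erasure_pathway: Optional[str]   # (c) how an Art. 17 request reaches this record


def unanswered_questions(att: RecordAttestation) -> List[str]:
    """Return the labels (a/b/c) of the questions this record cannot answer."""
    gaps = []
    if not att.lawful_basis:
        gaps.append("a")
    if not att.art5_measure:
        gaps.append("b")
    if not att.erasure_pathway:
        gaps.append("c")
    return gaps


# A record produced by an untargeted scrape carries none of the three answers.
scraped = RecordAttestation("img-001", None, None, None)
print(unanswered_questions(scraped))  # prints ['a', 'b', 'c']
```

A record that fills all three fields returns an empty list; whether the stated basis actually holds up is a legal question the data structure cannot settle.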

What an AG-tower-driven control would have done

A contract flow(class:biometric_identifier) where source = public_scrape ∉ allowed_corpus_inputs is not subtle: it refutes at corpus-assembly time the moment a scraper's output is wired to the corpus assembler. The AI Act Art. 5(1)(e) prohibition lifts cleanly into a flow tag. Combine with every_record where data_subject_jurisdiction = EU → has_pathway(art_17_erasure) → corpus_record and the assembly stage either (a) refuses to ship, or (b) ships with a SARIF artifact that explicitly attests the lawful basis, the consent record, and the erasure pathway for every record. There is no version of those attestations that survives a 30-billion-image scrape; the contract is the ban.
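To make the contract above concrete, here is a minimal Python sketch of the same two checks applied at corpus-assembly time. It is not the AG-tower contract language or its runtime: `CorpusRecord`, `FlowRefuted`, `check_record`, `assemble_corpus`, and the contents of `ALLOWED_CORPUS_INPUTS` are all assumptions made for illustration, and the sketch raises an exception rather than emitting a SARIF artifact.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allow-list; "public_scrape" is deliberately absent.
ALLOWED_CORPUS_INPUTS = frozenset({"consented_enrollment", "licensed_dataset"})


@dataclass(frozen=True)
class CorpusRecord:
    record_id: str
    data_class: str                        # e.g. "biometric_identifier"
    source: str                            # e.g. "public_scrape"
    jurisdiction: str                      # data subject's jurisdiction, e.g. "EU"
    erasure_pathway: Optional[str] = None  # Art. 17 pathway, None if absent


class FlowRefuted(Exception):
    """Raised when a record violates the flow contract; assembly must not ship."""


def check_record(rec: CorpusRecord) -> List[str]:
    violations = []
    # Contract 1: biometric identifiers may only come from allowed sources.
    if rec.data_class == "biometric_identifier" and rec.source not in ALLOWED_CORPUS_INPUTS:
        violations.append(f"{rec.record_id}: source '{rec.source}' not in allowed_corpus_inputs")
    # Contract 2: every EU data subject's record needs an Art. 17 erasure pathway.
    if rec.jurisdiction == "EU" and rec.erasure_pathway is None:
        violations.append(f"{rec.record_id}: no art_17_erasure pathway")
    return violations


def assemble_corpus(records: Iterable[CorpusRecord]) -> List[CorpusRecord]:
    records = list(records)
    violations = [v for rec in records for v in check_record(rec)]
    if violations:
        raise FlowRefuted("\n".join(violations))
    return records
```

Wiring a scraper's output into `assemble_corpus` fails on the first contract regardless of volume, which is the "build break, not a fine" behavior the section argues for. A real control would need the source and jurisdiction tags to be trustworthy, which this sketch simply assumes.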

See also

  • See the AI-provenance tag glossary
  • T6 · Information-flow reachability in the canonical glossary

Each refutation in this archive is a SARIF artifact a regulator could replay tomorrow — the same artifact format the SQL-vertical playground emits today, with the AI-provenance tag glossary swapped in.

These write-ups are journalism + product framing; they are not legal advice. Regulatory citations are best-effort references to public documents at time of writing. For anticipated cases, the entry labels the framing explicitly as anticipated rather than closed.