Encyclopedia Britannica and Merriam-Webster Sue OpenAI for Training on ~100,000 Copyrighted Articles

Encyclopedia Britannica and Merriam-Webster filed a federal lawsuit against OpenAI on Friday, alleging the company scraped and used nearly 100,000 of their copyrighted articles to train its large language models, then reproduced that content in ChatGPT’s responses with sufficient fidelity to constitute infringement — the “memorization” claim that courts have not yet fully adjudicated. The plaintiffs are not fringe content creators; Britannica’s reference database is among the most editorially curated corpora on the internet, and Merriam-Webster’s dictionary definitions represent a specific, structured form of intellectual property where verbatim reproduction is both detectable and damaging.

The competitive dynamics here are layered. Britannica and Merriam-Webster both run digital subscription businesses built on reference content that AI assistants now render largely redundant for casual users. This lawsuit is simultaneously a legal action and a market signal: these publishers are declaring that licensing, not displacement, is the correct commercial relationship with foundation model companies. OpenAI has already signed licensing deals with The Associated Press, News Corp, and others, which plaintiffs can frame as a tacit acknowledgment that training-data provenance matters, making the “we can train on anything public” defense harder to sustain with each new deal. The plaintiffs’ lawyers are likely to cite OpenAI’s own licensing contracts as evidence that a working market for training licenses already exists.

The clearest historical analogy is the music industry’s war with Napster and early peer-to-peer networks in 2000–2001. Napster argued it was merely a neutral conduit; the labels argued the platform was built on the value of their catalogs. The labels won — not just legally but structurally — and the settlement architecture they forced (licensing, royalties, takedown compliance) became the operating template for Spotify, Apple Music, and every streaming service that followed. The AI training-data litigation wave is playing out on a compressed but parallel track: the question is not whether a licensing regime emerges, but who sets its terms and price.

Two other signals from this week connect directly. ByteDance has reportedly paused the global launch of Seedance 2.0, its video generation model, specifically because its engineers and lawyers are working to preempt further legal exposure — a real-time example of a frontier lab pulling back under copyright pressure before litigation forces it. That’s a behavioral shift. Meanwhile, The Verge’s conversation with Yahoo CEO Jim Lanzone is relevant context: Yahoo’s collapse was accelerated by its failure to control the terms on which Google indexed and monetized its content. Reference publishers watched that happen and are not repeating the mistake of waiting passively.

The structural dynamic working against OpenAI here is a licensing-pressure ratchet. Each new deal OpenAI signs with a major content owner (AP, News Corp, Reddit) sets a market rate and implicitly supports the legal theory that training requires consent. That emboldens the next wave of plaintiffs: Britannica and Merriam-Webster today, academic publishers and database operators tomorrow. More plaintiffs mean higher aggregate liability exposure, which increases OpenAI’s incentive to settle broadly, which further institutionalizes licensing norms, which raises the cost of training the next generation of models. The companies that locked in training data rights early (Google via its own properties, Meta via its data moats) face structurally lower compliance costs than any new entrant, a compounding incumbency advantage that has nothing to do with model architecture.

Why it matters:

Academic and database publishers (JSTOR, Elsevier, Springer Nature) can now use the Britannica complaint as a litigation template, pushing OpenAI and its competitors into pre-emptive licensing negotiations before a second wave of suits drives up settlement costs.
Enterprise customers of ChatGPT face emerging indemnification risk: if courts find that OpenAI’s outputs are infringing reproductions, commercial users of those outputs may be exposed to secondary liability that procurement and legal teams have not yet priced in.
Seedance, Sora, and video-generation competitors must now treat copyright clearance as a pre-launch engineering constraint, not a post-launch legal problem — compressing the release window advantage that speed-to-market has historically provided in the generative AI race.

Sources: “Encyclopedia Britannica is suing OpenAI for allegedly ‘memorizing’ its content with ChatGPT” (The Verge); “The dictionary sues OpenAI” (TechCrunch AI); “ByteDance reportedly pauses global launch of its Seedance 2.0 video generator” (TechCrunch AI)