Encyclopaedia Britannica and its subsidiary Merriam‑Webster have sued OpenAI in federal court in Manhattan, alleging the company copied their authoritative reference content without permission to train ChatGPT. The complaint, filed last Friday, accuses the Microsoft‑backed firm of scraping online encyclopedia articles and dictionary entries—some 100,000 pieces of content, the suit says—and using them to power responses that sometimes reproduce Britannica’s wording nearly verbatim.
Britannica says ChatGPT’s summaries and answers have siphoned traffic away from its site, eroding the commercial value that funds its editorial work. The publishers also assert trademark claims, arguing that the chatbot damages their brands and misleads users when it implies it is authorised to reproduce Britannica material or cites the encyclopaedia as a source during so‑called “hallucinations”.
OpenAI rejected the allegations in a brief statement, saying its models are trained on publicly available data and that such training is transformative and therefore falls within the bounds of the United States' "fair use" doctrine. That defence mirrors arguments other AI companies have made in similar litigation: authors, news organisations and, most recently, Britannica itself, in a separate suit against Perplexity AI, have brought comparable high‑stakes claims over how copyrighted material can be used to train large language models.
The case matters because it cuts to the economics that sustain professional knowledge providers. Encyclopaedias and newsrooms rely on audiences and licensing income to underwrite expensive editorial processes; if generative models can extract and repackage that value without payment, publishers say, their business models will become unsustainable. Some media groups are quietly moving in the opposite direction, striking licensing deals with tech firms instead of litigating; News Corp. and Reach are among recent examples of publishers negotiating paid access with large AI and platform companies.
Legal outcomes are far from certain. U.S. fair‑use jurisprudence weighs several factors, notably whether a use is transformative and whether it harms the market for the original work, and how those tests apply to machine‑learning training remains largely unsettled. Judges will need to decide not only whether OpenAI copied protected expression when training its models, but also whether any allegedly reproduced text in the chatbot's outputs crosses the line from lawful transformation into infringement.
Beyond the courtroom, the suit will reverberate through the industry. If publishers prevail or secure widespread licensing, AI firms will face higher content acquisition costs and more constrained training regimes, incentivising greater reliance on licensed datasets, proprietary partnerships, synthetic data and stricter provenance controls. Smaller startups without deep pockets could be squeezed out or forced to change their technical approaches, while major platforms with partner ecosystems may emerge comparatively unscathed.
Regulators and legislators are already scrutinising the technology's data practices, and this litigation will add pressure for clearer rules on scraping, attribution and remuneration. The Microsoft connection complicates remedies: Microsoft's commercial products are built on the same underlying models, so any injunction against OpenAI could affect a broad swath of enterprise software that depends on them. The Britannica suit is therefore not only a dispute over words on a page but a signal that the rules governing the data that fuels AI are about to be rewritten.
