Encyclopaedia Britannica Sues OpenAI, Escalating a Global Copyright Clash Over AI Training Data

Encyclopaedia Britannica and Merriam‑Webster have sued OpenAI in New York, alleging unauthorized copying of nearly 100,000 reference entries to train ChatGPT and claiming the AI’s outputs divert traffic and misattribute sources. OpenAI invokes "fair use" and transformative use; the case joins a wave of copyright litigation that could reshape how generative models are trained, funded and regulated.


Key Takeaways

  • Encyclopaedia Britannica and Merriam‑Webster filed suit against OpenAI in Manhattan, alleging unauthorized copying of roughly 100,000 reference articles and dictionary entries to train ChatGPT.
  • The publishers claim ChatGPT sometimes reproduces Britannica content near‑verbatim, diverts web traffic, and in some cases falsely suggests authorised use, prompting copyright and trademark claims.
  • OpenAI says its models are trained on publicly available data and that such training is protected by the U.S. fair‑use doctrine as transformative.
  • The lawsuit is part of a broader wave of litigation by authors and news organisations; some media companies are pursuing licensing deals with tech firms instead of, or alongside, legal action.
  • A ruling for publishers could raise the cost and complexity of training large models, push AI vendors toward licensed data or synthetic alternatives, and spur regulatory reforms on data use.

Editor's Desk

Strategic Analysis

This case is a turning point in the battle over who captures the economic value of online knowledge in the age of generative AI. A judiciary that narrows the scope of permissible training data will force a structural change in the AI industry: models will shift from being built on broad, untargeted crawls of the public web to curated, licensed corpora or synthetic substitutes. That outcome would advantage deep‑pocketed incumbents that can buy rights and strike partnerships, while raising the barrier to entry for smaller firms and academic researchers. Conversely, a court that accepts expansive fair use for training would preserve the current, faster path to model development but leave incumbent publishers to seek commercial remedies or new business models. Either way, the decision will be a pivotal input into policy debates and commercial strategies worldwide.


Encyclopaedia Britannica and its subsidiary Merriam‑Webster have sued OpenAI in federal court in Manhattan, alleging the company copied their authoritative reference content without permission to train ChatGPT. The complaint, filed last Friday, accuses the Microsoft‑backed firm of scraping online encyclopedia articles and dictionary entries—some 100,000 pieces of content, the suit says—and using them to power responses that sometimes reproduce Britannica’s wording nearly verbatim.

Britannica says ChatGPT’s summaries and answers have siphoned traffic away from its site, eroding the commercial value that funds editorial work. The publishers also assert trademark claims, arguing that instances in which the chatbot implies it has authorization to reproduce Britannica material—or cites the encyclopedia as a source during so‑called “hallucinations”—have damaged their brand and misled users.

OpenAI rejected the allegations in a brief statement, saying its models are trained on publicly available data and that such training falls within the bounds of the United States' "fair use" doctrine because it is transformative. That defence mirrors arguments made by other AI companies facing similar litigation; authors, news organisations and, recently, Britannica itself, in a separate suit against Perplexity AI, have brought comparable high‑stakes cases over how copyrighted material can be used to train large language models.

The case matters because it cuts to the economics that sustain professional knowledge providers. Encyclopaedias and newsrooms rely on audiences and licensing income to underwrite expensive editorial processes; if generative models can extract and repackage that value without payment, publishers say their business models will become unsustainable. Some media groups are quietly moving in the opposite direction—striking licensing deals with tech firms instead of litigating—with News Corp. and Reach among recent examples negotiating paid access with large AI and platform players.

Legal outcomes are far from certain. U.S. fair‑use jurisprudence asks whether a use is transformative and whether it harms the market for the original work; applying those tests to machine‑learning training remains largely uncharted terrain. Judges will need to decide not only whether OpenAI copied protected expression during model training, but also whether any allegedly reproduced text in the model's outputs crosses the line from lawful transformation into infringement.

Beyond the courtroom, the suit will reverberate through the industry. If publishers prevail or secure widespread licensing, AI firms will face higher content acquisition costs and more constrained training regimes, incentivising greater reliance on licensed datasets, proprietary partnerships, synthetic data and stricter provenance controls. Smaller startups without deep pockets could be squeezed out or forced to change their technical approaches, while major platforms with partner ecosystems may emerge comparatively unscathed.

Regulators and legislators are already scrutinising the technology’s data practices, and this litigation will add pressure for clearer rules on scraping, attribution and remuneration. The Microsoft connection complicates remedies: enterprise stakes are large and any injunctions could affect a broad swath of commercial products that depend on the same underlying models. The Britannica suit is therefore not only a dispute over words on a page but a signal that the rules governing the data that fuels AI are about to be rewritten.
