The Britannica Problem — Why AI Keeps Rewriting What It Already Knows

Every new model spends $100M+ re-learning Wikipedia. There are better ways.

Part III — Historical Analogies
Chapter 10: The Britannica Problem

Every time a frontier AI lab trains a new model, it starts from a blank neural network and re-teaches it everything humanity knows — at a cost of hundreds of millions of dollars. This is the Britannica Problem: the staggering waste of re-learning common knowledge from scratch with each model generation.

The name derives from a simple observation. Every large language model must independently learn that Paris is the capital of France, that water boils at 100°C, and that Shakespeare wrote Hamlet. GPT-3 learned these facts. GPT-4 learned them again. GPT-5 learned them yet again — at 5x the cost. There is no mechanism to carry forward what the previous model already mastered. Each generation rewrites all 32 volumes of the encyclopedia from scratch, just to add one new chapter. As Chapter 14 documented, frontier training runs now exceed $500M. A substantial fraction of that expenditure — perhaps the majority — goes toward re-encoding knowledge that the previous model already possessed to a high degree of accuracy.

This chapter examines the structural reasons behind this inefficiency, quantifies the waste, and evaluates the emerging alternatives — from retrieval-augmented generation to modular architectures — that could break the cycle. The Britannica Problem is not merely a technical curiosity. It is a $500M question asked anew with every model generation, and the organizations that solve it first will fundamentally alter the cost structure of artificial intelligence.

40–50% replay required: in continual pre-training, old data must be replayed to prevent forgetting.

$500M+ per training run: for ~1% improvement on already-saturated benchmarks.

~30–40% of text tokens are function words ("the", "is", "a"), yet they get equal weight in the loss function.

7 frontier models released in February 2026, all re-learning the same facts.
The Core Absurdity
Imagine if every new edition of the Encyclopaedia Britannica required rewriting all 32 volumes from scratch — just to add one new chapter. That is exactly how AI model training works today.

Every time OpenAI trains GPT-5, or Google trains Gemini 2.5, or Anthropic trains Claude 4, they start from a blank neural network and re-teach it everything: basic grammar, world geography, chemistry, history, programming, mathematics — knowledge the previous model had already mastered. The new training run doesn't update the old model. It replaces it entirely, at a cost of hundreds of millions of dollars, for increasingly marginal improvements on the knowledge benchmarks that matter.

"We are spending $500M to go from 86% to 91% on MMLU — a benchmark where the previous model already exceeded PhD-level human performance. The knowledge problem is largely solved. We keep paying to solve it again."

— The training efficiency paradox

Quantifying the Waste

The statistics above frame the scale of the problem. Seven frontier models released in February 2026 alone — each independently re-learning the same corpus of human knowledge. Each training run allocates $500M+ in compute for a net improvement of roughly 1% on benchmarks where models already exceed PhD-level performance. Meanwhile, 40–50% of compute in continual pre-training is consumed by data replay, simply to prevent the model from forgetting what it already knew. The redundancy is not a minor inefficiency; it is the dominant cost in the most expensive computing workloads ever undertaken.

To understand why this waste persists, it helps to consider the analogy that gives this problem its name. The encyclopedia comparison is not merely rhetorical — it reveals the structural mismatch between how knowledge is organized in traditional systems (incrementally, modularly) and how it is encoded in neural networks (holistically, from scratch).

The Encyclopedia Analogy
What if publishers operated like AI labs?
[Diagram: how AI training works today vs. a modular alternative. Today, each model rewrites everything to add one new chapter: GPT-3 ($4.6M, 300B tokens, MMLU 43%, "6 volumes"), GPT-4 ($100M, ~13T tokens, MMLU 86.4%, "10 volumes"), GPT-5 ($500M+, ~20T+ tokens, MMLU ~91%, "11 volumes"); total $500M–2.5B per generation, mostly redundant. The sensible approach: a shared base of knowledge, frozen and reused (cost $0, already trained once), plus modular new-knowledge, domain-expert, and alignment updates at ~$1M–10M per module; total $1M–50M per update, 0% redundant.]

What GPT-5 Re-Learned

The vast majority of training tokens — textbooks, Wikipedia, code repos, web pages — cover knowledge GPT-4 already mastered.

Only a small fraction represents truly novel training signal.
The rest? Paying hundreds of millions to re-learn what GPT-4 already knew.

What Changed?

GPT-4 → GPT-5 improvements: slightly better reasoning, slightly more knowledge, marginally fewer hallucinations.

MMLU: 86.4% → ~91% (+4.6 points)
Cost: $100M → $500M+ (5x more)
Each additional point costs exponentially more.

The True Cost of Redundancy

The encyclopedia diagram makes the economics visceral. On the left: three successive model generations, each rewriting the entire knowledge base from scratch, each costing more than the last, each yielding diminishing incremental returns. On the right: a hypothetical modular architecture where the base knowledge is frozen and reused, and only new capabilities — domain experts, alignment updates, fresh knowledge — require training. The cost differential is not 2x or 5x. It is 10x to 500x, measured in hundreds of millions of dollars per model generation.

The MMLU gap between GPT-4 and GPT-5 — roughly 4.6 percentage points, from 86.4% to approximately 91% — cost an estimated $400M in incremental training investment. Yet the vast majority of the knowledge tested by MMLU was already mastered by GPT-4. The new training run did not add new knowledge so much as re-encode existing knowledge at marginally higher fidelity, while simultaneously paying for architectural improvements in reasoning and instruction-following that represent the genuinely novel contribution. The diminishing returns cliff below illustrates this dynamic in stark terms.

The Diminishing Returns Cliff
Cost vs. accuracy on MMLU (general knowledge benchmark). Log scale on Y-axis. Each percentage point costs exponentially more than the last.
[Chart: estimated training cost (log scale) vs. MMLU accuracy. GPT-3: 43%, $4.6M. GPT-3.5: 70%, ~$10M. GPT-4: 86.4%, $100M. GPT-5: ~91%, $500M+. Extrapolations: 95% at ~$2B+, 97% at ~$10B+. The last 5% costs more than the first 90%; all frontier models cluster at 90–92%.]
The scaling law tax: Scaling laws show power-law diminishing returns: each 10x increase in compute yields progressively smaller accuracy gains. Going from 70% to 86% MMLU cost ~$90M. Going from 86% to 91% cost ~$400M. Going from 91% to 95% would cost an estimated ~$2B+. Each point becomes exponentially more expensive — and all frontier models are now converging at the same 90-92% ceiling, making the entire investment increasingly pointless for general knowledge tasks.
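The cost-per-point arithmetic behind this cliff can be reproduced directly from the chapter's figures. A minimal sketch (the dollar amounts are the approximate public estimates quoted above, not authoritative numbers):

```python
# Marginal training cost per MMLU percentage point between generations,
# using the chapter's approximate public cost estimates.
points = [
    ("GPT-3",   43.0, 4.6e6),
    ("GPT-3.5", 70.0, 1.0e7),
    ("GPT-4",   86.4, 1.0e8),
    ("GPT-5",   91.0, 5.0e8),
]

def cost_per_point(models):
    """Incremental dollars spent per MMLU point gained, per generation step."""
    out = {}
    for (name1, acc1, cost1), (name2, acc2, cost2) in zip(models, models[1:]):
        out[f"{name1} -> {name2}"] = (cost2 - cost1) / (acc2 - acc1)
    return out

for step, dollars in cost_per_point(points).items():
    print(f"{step}: ~${dollars / 1e6:.1f}M per point")
```

Under these estimates, the marginal cost climbs from roughly $0.2M per point (GPT-3 to GPT-3.5) to roughly $87M per point (GPT-4 to GPT-5), more than a 400x increase in three generation steps.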

The Forgetting Wall

If the waste is so evident, the natural question is: why not simply update the existing model? Why not take GPT-4 and teach it the incremental knowledge that GPT-5 requires, rather than starting from scratch? The answer is a significant limitation of neural network architectures called catastrophic forgetting — one of the most important reasons the Britannica Problem persists. As a model learns new information, it systematically overwrites the weights that stored old information. The standard mitigation — replaying 40–50% of old training data alongside new data — costs nearly as much as full retraining while introducing unpredictable quality regressions. Recent research (Google’s Nested Learning, 2025; Collaborative Neural Learning achieving 59–82% forgetting reduction) and architectural innovations like Mixture-of-Experts are making progress, but no approach has yet demonstrated full continuous pre-training at frontier scale without quality degradation. Practical workarounds — RAG, million-token context windows, and modular fine-tuning — reduce the operational impact, even as the theoretical problem remains unsolved.

Why Can't You Just Update? — The Catastrophic Forgetting Wall
The obvious question: if retraining is so wasteful, why not just update the existing model? The answer is a fundamental limitation of neural networks called catastrophic forgetting.
[Chart: the forgetting cliff. As fine-tuning on new data proceeds, old knowledge degrades rapidly while new knowledge improves; past the cliff, new knowledge overtakes and old knowledge is lost. With 40–50% replay, old knowledge is largely preserved.]

Catastrophic Forgetting

When you fine-tune a model on new data, the weight updates that encode new knowledge systematically overwrite the weights that stored old knowledge. Learn quantum physics, forget Shakespeare. The network has a fixed capacity — new memories evict old ones.
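The effect is easy to reproduce at toy scale. The sketch below uses a deliberately tiny linear model (NumPy, synthetic data) as a stand-in for a network: it masters task A, then fine-tunes on task B alone, and task A performance collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "tasks": linear targets generated by different true weight vectors.
X_a, X_b = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
w_true_a, w_true_b = rng.normal(size=8), rng.normal(size=8)
y_a, y_b = X_a @ w_true_a, X_b @ w_true_b

def train(w, X, y, steps=1000, lr=0.01):
    """Plain gradient descent on mean squared error for a single task."""
    for _ in range(steps):
        w = w - lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = train(np.zeros(8), X_a, y_a)   # learn task A
err_before = mse(w, X_a, y_a)      # near zero: task A mastered
w = train(w, X_b, y_b)             # fine-tune on task B only, no replay
err_after = mse(w, X_a, y_a)       # task A error jumps: weights overwritten
```

Mixing task A examples back into the second phase (replay) keeps the old error low, which is precisely the replay tax discussed in this chapter.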

The Replay Tax

The standard fix: replay 40-50% of old training data alongside new data (research shows a ~50-50 new-replay ratio works best). This preserves old knowledge but costs almost as much as full retraining. You're paying 40-50% of the original cost just to not forget — before learning anything new.
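The economics of that tax reduce to simple arithmetic. A sketch with hypothetical numbers (the $20 per million tokens figure is invented for illustration; only the 50% replay ratio comes from the text):

```python
def update_cost(new_tokens, replay_fraction, dollars_per_token):
    """Compute cost of a continual-pretraining update when a fraction
    `replay_fraction` of every batch must be replayed old data."""
    replay_tokens = new_tokens * replay_fraction / (1.0 - replay_fraction)
    return (new_tokens + replay_tokens) * dollars_per_token

# Hypothetical: 1T genuinely new tokens at $20 per million tokens of compute.
no_replay   = update_cost(1e12, 0.0, 20 / 1e6)  # $20M, all of it new signal
with_replay = update_cost(1e12, 0.5, 20 / 1e6)  # $40M, half spent re-learning
```

At a 50-50 mix, every dollar of genuinely new learning carries a dollar of re-learning: the update costs twice what the new knowledge alone would.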

Unpredictable Regressions

Even with replay, incremental updates produce unpredictable quality regressions. A model might lose its coding ability when learning new medical knowledge, or forget formatting rules when updating its math skills. There's no reliable way to predict what breaks.

Brute Force Is Simpler

Full retraining from scratch is wasteful but predictable. Incremental updating is efficient but risky. Labs choose predictability over efficiency — the cost of an unpredictable regression in production is higher than the cost of retraining.

"Catastrophic forgetting is the original sin of neural networks. Every workaround is either expensive (replay), fragile (regularization), or incomplete (parameter isolation). Full retraining is the only approach that guarantees quality — which is why everyone does it, despite the absurd cost."

— The continual learning challenge
The Function Word Problem
Around 30-40% of English text consists of function words that carry little knowledge. The standard loss function treats every token equally — "the" gets the same gradient weight as "photosynthesis".
Example sentence: "The transformer architecture was introduced in a 2017 paper by Google researchers. It uses self-attention mechanisms to process sequences in parallel, rather than sequentially like RNNs." Only some of these words are information-bearing (knowledge); the rest are function words (grammar/structure).
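The share is easy to estimate on the example above. A sketch using a small, deliberately incomplete stop-word list (any serious analysis would use a proper tokenizer and a fuller list):

```python
import re

SENTENCE = (
    "The transformer architecture was introduced in a 2017 paper by Google "
    "researchers. It uses self-attention mechanisms to process sequences in "
    "parallel, rather than sequentially like RNNs."
)

# Illustrative subset of English function words, not an exhaustive list.
FUNCTION_WORDS = {"the", "a", "an", "was", "is", "in", "by", "it", "to",
                  "of", "and", "rather", "than", "like"}

tokens = re.findall(r"[a-z0-9-]+", SENTENCE.lower())
n_function = sum(t in FUNCTION_WORDS for t in tokens)
share = n_function / len(tokens)
print(f"{n_function}/{len(tokens)} tokens are function words ({share:.0%})")
# → 11/26 tokens are function words (42%)
```

Even this crude count lands near the 30–40% figure, and under the standard loss every one of those tokens receives a full gradient update.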
Metric              Standard (next-token)     Multi-token prediction
Loss function       Equal weight per token    Predicts n tokens ahead simultaneously
Training overhead   Baseline                  Zero additional cost
Inference speed     1x                        Up to 3x faster
Code benchmarks     Baseline                  +12% HumanEval, +17% MBPP

Multi-Token Prediction

Meta's research (Gloeckle et al., 2024): train the model to predict multiple future tokens simultaneously using n independent output heads. This achieves better performance with zero additional training cost, up to 3x faster inference, and +12-17% improvement on code benchmarks (HumanEval, MBPP). By looking further ahead, the model implicitly learns richer representations.

3x inference speedup, zero training overhead
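A toy sketch of the objective, with random weights and invented dimensions (a fixed vector stands in for the transformer trunk; the shape loosely follows Gloeckle et al.: one independent output head per future offset on a shared representation):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, N_HEADS = 50, 16, 4   # toy sizes, chosen for illustration

# Shared trunk representation for one position (stand-in for a transformer).
h = rng.normal(size=HIDDEN)
# One independent linear output head per future offset.
heads = rng.normal(size=(N_HEADS, VOCAB, HIDDEN)) * 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mtp_loss(h, heads, future_tokens):
    """Mean cross-entropy over n heads: head k predicts the token k+1 ahead."""
    total = 0.0
    for k, tok in enumerate(future_tokens):
        probs = softmax(heads[k] @ h)
        total += -np.log(probs[tok])
    return total / len(future_tokens)

loss = mtp_loss(h, heads, future_tokens=[3, 17, 5, 41])
```

After training, the extra heads can be dropped or reused for speculative decoding, which is where the up-to-3x inference speedup comes from; the trunk keeps the richer representations.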

The Broader Inefficiency

Between function words consuming ~30-40% of token-level loss compute, and the 40-50% replay tax in continual pre-training, current training approaches carry substantial overhead. Research on selective token weighting, curriculum learning, and multi-token prediction all point to the same conclusion: significant compute is spent on low-information signals, and architectural innovations can reclaim much of it.

Multiple sources of training inefficiency

Rethinking What We Train

The function word analysis and forgetting wall together reveal a compound inefficiency: not only does each model re-learn the same knowledge, but a substantial fraction of the compute budget within each run is allocated to low-information tokens — articles, prepositions, and conjunctions that carry grammatical structure but minimal semantic content. Multi-token prediction, which trains the model to predict multiple future tokens simultaneously, addresses this by forcing the model to develop richer internal representations without additional training cost. The result is up to 3x faster inference and 12–17% improvement on code benchmarks — essentially free performance gains from smarter training objectives.

These inefficiencies, taken together, suggest that the current paradigm is not merely expensive but structurally suboptimal. The alternatives below represent a portfolio of approaches, each attacking a different dimension of the problem. None is a silver bullet, but in combination they point toward a future where knowledge is a reusable asset rather than a disposable byproduct of each training run.

The Alternatives That Work
Six approaches that avoid rewriting the encyclopedia from scratch. Each addresses a different aspect of the problem — and they can be combined.

1. Mixture of Experts (MoE)

Instead of one monolithic network, use many specialized sub-networks ("experts") with a router that activates only the relevant ones per query.

DeepSeek V3: 671B total parameters, but only 37B activated per query (~5.5%). Knowledge is distributed across experts, and new experts can be added modularly (though the routing mechanism may need fine-tuning).

Not rewriting Britannica — adding specialist chapters.

5-6% activation ratio
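The routing mechanism can be sketched in a few lines. This toy version (random weights, invented sizes, plain top-k softmax gating) omits the shared experts and load-balancing machinery that production systems like DeepSeek V3 add:

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, HIDDEN = 64, 2, 32   # toy scale

# Each expert is a tiny linear layer; the router is a linear scorer.
experts = rng.normal(size=(N_EXPERTS, HIDDEN, HIDDEN)) * 0.05
router = rng.normal(size=(N_EXPERTS, HIDDEN)) * 0.05

def moe_forward(x):
    """Send x to the top-k experts only; the rest stay dormant."""
    scores = router @ x
    top = np.argsort(scores)[-TOP_K:]                  # chosen expert indices
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
    y = sum(g * (experts[i] @ x) for g, i in zip(gates, top))
    return y, top

y, chosen = moe_forward(rng.normal(size=HIDDEN))
activation_ratio = TOP_K / N_EXPERTS   # ~3% of expert parameters per token here
```

Adding a new domain means appending rows to `experts` and `router`; existing experts are untouched, though as the text notes, the router typically needs fine-tuning to learn when to dispatch to the newcomer.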

2. Retrieval-Augmented Generation (RAG)

Don't bake knowledge into neural network weights — store it in a searchable external index and retrieve it at inference time.

Update knowledge by updating the document index, not retraining. No forgetting, instant updates, verifiable sources. 1,200+ RAG papers published in 2024 alone.

The encyclopedia becomes a library: you consult it, you don't memorize it.

1,200+ papers in 2024
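The mechanism fits in a screenful. Production systems use dense embeddings and a vector store; this sketch substitutes bag-of-words cosine similarity to show the shape of the idea (documents and queries are invented):

```python
from collections import Counter
import math

# Toy document index; in practice, an embedding index that can be updated
# without touching any model weights.
DOCS = {
    "capitals": "Paris is the capital of France",
    "physics":  "water boils at 100 degrees Celsius at sea level",
    "drama":    "Shakespeare wrote Hamlet around 1600",
}

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = vectorize(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, vectorize(DOCS[d])),
                    reverse=True)
    return ranked[:k]

print(retrieve("what is the capital of France"))   # → ['capitals']
```

Updating knowledge is a dictionary write rather than a training run; nothing is forgotten because nothing in the model's weights changed.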

3. Knowledge Distillation

Transfer knowledge from a large "teacher" model to a smaller "student" model, without the student needing to see the original training data.

LIMA (Zhou et al., 2023): matched or was preferred over GPT-4 in 43% of human evaluations, using just 1,000 curated examples.
Minitron (NVIDIA): requires up to 40x fewer training tokens than training from scratch to reach equivalent quality.

40x token reduction
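The core objective is compact. This sketch shows classic Hinton-style soft-label distillation (logits and temperature are invented); Minitron's actual recipe combines a loss of this kind with structured pruning:

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the
    student's: the student learns the teacher's full output distribution,
    not just the argmax label."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [5.0, 2.0, 0.1]
aligned  = distill_loss(teacher, [4.8, 2.1, 0.0])   # student mimics teacher
diverged = distill_loss(teacher, [0.0, 0.0, 5.0])   # student disagrees
```

Because the training signal is the teacher's output distribution, the student never needs the original pre-training corpus, which is how Minitron-style approaches reach equivalent quality with far fewer tokens.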

4. Continual Pre-Training

Carefully update existing models with new data instead of starting from scratch. With proper learning rate scheduling, data mixing ratios, and regularization, you can match full retraining quality.

"Reuse, Don't Retrain" (Parmar et al., 2024): demonstrated 9-18% accuracy improvements over baseline continued training through proper learning rate scheduling and data mixing strategies.

9-18% accuracy improvement
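One key ingredient is learning rate re-warming: resume with a fresh warmup ramp to a (typically reduced) peak, then decay. A sketch of the general schedule shape (all constants here are hypothetical, not the values from Parmar et al.):

```python
import math

def rewarmed_cosine_lr(step, total_steps, peak_lr=1e-4,
                       warmup_frac=0.01, min_lr=1e-6):
    """Linear warmup to a (reduced) peak, then cosine decay to min_lr:
    the typical schedule shape when resuming pre-training on new data."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [rewarmed_cosine_lr(s, total_steps=1000) for s in range(1000)]
```

Re-warming to a lower peak than the original run disturbs the existing weights less, which is part of how continued training avoids the worst of the forgetting described earlier.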

5. Model Merging / Soups

Combine multiple specialized models by merging their weights — no additional training required. Like merging encyclopedia volumes from different publishers.

FuseLLM, Model Soups, TIES-Merging: techniques that merge models trained on different domains into a single model that inherits capabilities from all parents.

Zero additional training
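The simplest merge, uniform parameter averaging (a "model soup"), is a one-liner per tensor; TIES-Merging adds delta trimming and sign election on top. A toy sketch with dictionary-of-arrays checkpoints:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fine-tuned checkpoints sharing a common base (toy 4-weight model).
base    = {"w": rng.normal(size=4)}
medical = {"w": base["w"] + 0.1 * rng.normal(size=4)}
legal   = {"w": base["w"] + 0.1 * rng.normal(size=4)}

def soup(models, coeffs=None):
    """Weighted average of checkpoints, parameter by parameter."""
    coeffs = coeffs or [1.0 / len(models)] * len(models)
    return {name: sum(c * m[name] for c, m in zip(coeffs, models))
            for name in models[0]}

merged = soup([medical, legal])   # inherits from both, zero training steps
```

Averaging only works when the checkpoints descend from the same base, which is why shared, frozen foundations make merging practical.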

6. Post-Training (DPO / RLHF)

The value frontier has moved from knowledge to alignment. Post-training techniques refine how a model uses its knowledge — not what it knows.

DeepSeek R1: the RL post-training phase cost just $294K on top of the $5.6M V3 base model. DPO is substantially simpler and computationally lighter than RLHF, eliminating the need for reward model training and on-policy sampling.

$294K RL phase + $5.6M V3 base
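DPO's per-pair objective (Rafailov et al., 2023) needs only log-probabilities from the policy and a frozen reference model, so no reward model or on-policy sampling is required. A sketch with invented log-prob values:

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin is how much more the
    policy prefers the chosen response over the rejected one, relative to
    the frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already leans the right way relative to the reference: low loss.
good = dpo_loss(-10.0, -14.0, ref_chosen=-11.0, ref_rejected=-12.0)
# Policy has drifted toward the rejected response: high loss.
bad  = dpo_loss(-13.0, -10.0, ref_chosen=-11.0, ref_rejected=-12.0)
```

The entire alignment signal is a classification-style loss over preference pairs, which is why post-training can cost five orders of magnitude less than pre-training.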
The common thread
Every alternative separates what the model knows from how it uses what it knows. Knowledge is a solved problem — it can be stored, retrieved, merged, or distilled. The expensive part should be reasoning, not memorization.
The Future Architecture — Don't Retrain, Compose
The endgame isn't one giant model. It's an orchestra of specialized instruments — a shared foundation with modular, updatable components.

The Linux Kernel Model

[Diagram: a shared foundation (GLM-5, Qwen 3, Kimi K2.5, all open-weight) underpinning multiple distributions, analogous to Ubuntu (general purpose), Red Hat (enterprise), and Alpine (lightweight).]

Just as Linux is a shared kernel with many distributions, AI should have shared base models with many fine-tuned variants. The kernel is maintained collectively. Distributions compete on integration, support, and specialization.

The Orchestra Architecture

[Diagram: a router (0.6B params) classifies each query and routes it to a specialist: a frozen, shared general base; a medical LoRA adapter (50–200MB); a legal LoRA adapter (50–200MB); or external, updatable RAG retrieval.]

A small router directs queries to the right specialist. The base is frozen and shared. Each domain expert is a lightweight LoRA adapter (~100MB). New domains = new adapters, not retraining.
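The dispatch logic can be illustrated with a trivially small stand-in for the 0.6B router (specialist names and keyword lists are invented; a real router is a trained classifier, not keyword matching):

```python
# Keyword-overlap scoring as a stand-in for a learned query classifier.
SPECIALISTS = {
    "medical_lora":  {"symptom", "diagnosis", "dosage", "patient"},
    "legal_lora":    {"contract", "liability", "statute", "clause"},
    "rag_retrieval": {"latest", "today", "current", "news"},
}

def route(query, default="general_base"):
    """Pick the specialist whose keywords best match; else the shared base."""
    words = set(query.lower().split())
    scores = {name: len(words & keywords)
              for name, keywords in SPECIALISTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route("what dosage is safe for this patient"))  # → medical_lora
print(route("explain quantum tunneling"))             # → general_base
```

Adding a new domain means adding one entry here (and, in the real architecture, one ~100MB LoRA adapter); the frozen base and existing specialists are never retrained.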

"The future is not one giant model — it is an orchestra of specialized instruments, playing from the same score."

The Economics

Approach                             Cost                  Time to deploy
Train frontier model from scratch    $100M–2.5B            6–12 months
Fine-tune 7B specialist (LoRA)       $500–3,000            Hours
100 domain-specific models           <$100,000 total       Days
RAG knowledge update                 ~$0 (index update)    Minutes
Post-training alignment (DPO)        $100K–500K            Days

A domain-fine-tuned 7B model frequently outperforms generic frontier models on domain-specific tasks — at a fraction of the cost.

From Knowledge Storage to Knowledge Architecture

The six alternatives share a common thread: they separate what the model knows from how it uses what it knows. Mixture-of-experts distributes knowledge across specialized sub-networks. RAG externalizes knowledge into searchable indexes. Distillation transfers knowledge without re-processing raw training data. Continual pre-training, model merging, and post-training alignment each reduce the volume of redundant computation in different ways. The common denominator is that knowledge — the facts, the relationships, the patterns — is treated as a reusable asset rather than a transient byproduct of each training run.

DeepSeek’s architecture is perhaps the most instructive example of where this trajectory leads. V3’s mixture-of-experts design activates only 37B of its 671B parameters per query — a 5.5% activation ratio that means 94.5% of the model’s knowledge sits dormant for any given input, available when needed but not consuming compute when it is not. R1’s $294K reinforcement learning post-training phase then added reasoning capability on top of V3’s base — creating more measurable value than the entire pre-training investment. The future is modular, composable, and updatable. The question is how quickly the industry will adopt it.

What Should Change — Three Shifts
The Britannica problem isn't inevitable. The industry needs three fundamental shifts in how it thinks about model development.
1

From Scale to Efficiency

DeepSeek demonstrated that algorithmic efficiency can dramatically reduce costs, consistent with research showing algorithmic gains outpace Moore's Law (~2x every 8 months for language models). V3 matched GPT-4 quality at 1/20th the cost. R1 achieved reasoning breakthroughs with $294K in RL post-training (on top of V3). The path forward is not bigger models — it is smarter training.

2

From Monolithic to Modular

Stop rewriting Britannica. Build composable, updatable architectures. MoE + LoRA + RAG already enables this: a frozen shared base, modular experts, and external knowledge retrieval. New knowledge doesn't require retraining — it requires a new adapter or an index update. The model becomes a platform, not a product that expires every 12 months.

3

From Pre-Training to Post-Training

The value frontier has moved. General knowledge is a commodity — every model has it, and the gap between models on MMLU is shrinking. The differentiation is now in reasoning, alignment, and task-specific adaptation. DeepSeek R1's $294K RL post-training phase (on the $5.6M V3 base) created more measurable value than additional pre-training investment. The future investment should shift toward post-training — not away from it.

We don't need to rewrite the encyclopedia.
We need to build a library system.

The knowledge is already there. The facts haven't changed. What we need is better ways to access, organize, and apply that knowledge — not to re-learn it from scratch every 12 months at $500M a pop.

What Comes Next

The Britannica Problem illuminates a structural flaw in how intelligence is produced, but it also points to an opportunity. If knowledge can be separated from reasoning, modularized, and reused, then the cost of each model generation drops dramatically — and the competitive landscape shifts toward those who can most efficiently build on shared foundations rather than those who can afford the most compute. This raises a strategic question: if the model layer is heading toward commodity economics, who benefits from accelerating that trend — and who is threatened by it? The answer is the subject of Chapter 15: The Open Source Paradox, which examines why some of the world’s largest companies are choosing to give away their most expensive assets for free.

The Britannica Problem — Analysis of AI Training Efficiency · February 2026
Data synthesized from: Kaplan et al. (arXiv:2001.08361), Hoffmann et al./Chinchilla (arXiv:2203.15556), DeepSeek V3/R1 technical reports,
NVIDIA Minitron (arXiv:2407.14679), Gloeckle et al./MTP (arXiv:2404.19737), Parmar et al. (arXiv:2407.07263),
Zhou et al./LIMA (arXiv:2305.11206), Rafailov et al./DPO (arXiv:2305.18290), Luo et al. catastrophic forgetting (arXiv:2308.08747)