The Britannica Problem — Why AI Keeps Rewriting What It Already Knows
Every new model spends $100M+ re-learning Wikipedia. There are better ways.
Every time a frontier AI lab trains a new model, it starts from a blank neural network and re-teaches it everything humanity knows — at a cost of hundreds of millions of dollars. This is the Britannica Problem: the staggering waste of re-learning common knowledge from scratch with each model generation.
The name derives from a simple observation. Every large language model must independently learn that Paris is the capital of France, that water boils at 100°C, and that Shakespeare wrote Hamlet. GPT-3 learned these facts. GPT-4 learned them again. GPT-5 learned them yet again — at 5x the cost. There is no mechanism to carry forward what the previous model already mastered. Each generation rewrites all 32 volumes of the encyclopedia from scratch, just to add one new chapter. As Chapter 14 documented, frontier training runs now exceed $500M. A substantial fraction of that expenditure — perhaps the majority — goes toward re-encoding knowledge that the previous model already possessed to a high degree of accuracy.
This chapter examines the structural reasons behind this inefficiency, quantifies the waste, and evaluates the emerging alternatives — from retrieval-augmented generation to modular architectures — that could break the cycle. The Britannica Problem is not merely a technical curiosity. It is a $500M question asked anew with every model generation, and the organizations that solve it first will fundamentally alter the cost structure of artificial intelligence.
[Figure: sources of training waste. Old data must be replayed to prevent forgetting; benchmarks are already saturated; low-information tokens get equal weight in the loss function; each generation re-learns the same facts.]
Every time OpenAI trains GPT-5, or Google trains Gemini 2.5, or Anthropic trains Claude 4, they start from a blank neural network and re-teach it everything: basic grammar, world geography, chemistry, history, programming, mathematics — knowledge that the previous model already mastered perfectly. The new training run doesn't update the old model. It replaces it entirely, at a cost of hundreds of millions of dollars, for improvements that are increasingly marginal on the knowledge benchmarks that matter.
"We are spending $500M to go from 86% to 91% on MMLU — a benchmark where the previous model already exceeded PhD-level human performance. The knowledge problem is largely solved. We keep paying to solve it again."
Quantifying the Waste
The statistics above frame the scale of the problem. Seven frontier models were released in February 2026 alone, each independently re-learning the same corpus of human knowledge. Each training run allocates $500M+ in compute for a net improvement of a few percentage points on benchmarks where models already exceed PhD-level performance. Meanwhile, 40–50% of compute in continual pre-training is consumed by data replay, simply to prevent the model from forgetting what it already knew. The redundancy is not a minor inefficiency; it is the dominant cost in the most expensive computing workloads ever undertaken.
To understand why this waste persists, it helps to consider the analogy that gives this problem its name. The encyclopedia comparison is not merely rhetorical — it reveals the structural mismatch between how knowledge is organized in traditional systems (incrementally, modularly) and how it is encoded in neural networks (holistically, from scratch).
What GPT-5 Re-Learned
The vast majority of training tokens — textbooks, Wikipedia, code repos, web pages — cover knowledge GPT-4 already mastered.
Only a small fraction represents truly novel training signal.
The rest? Paying hundreds of millions to re-learn what GPT-4 already knew.
What Changed?
GPT-4 → GPT-5 improvements: slightly better reasoning, slightly more knowledge, marginally fewer hallucinations.
MMLU: 86.4% → ~91% (+4.6 points)
Cost: $100M → $500M+ (5x more)
The returns are diminishing steeply: 5x the spend for under 5 points of benchmark gain.
The True Cost of Redundancy
The encyclopedia diagram makes the economics visceral. On the left: three successive model generations, each rewriting the entire knowledge base from scratch, each costing more than the last, each yielding diminishing incremental returns. On the right: a hypothetical modular architecture where the base knowledge is frozen and reused, and only new capabilities — domain experts, alignment updates, fresh knowledge — require training. The cost differential is not 2x or 5x. It is 10x to 500x, measured in hundreds of millions of dollars per model generation.
The MMLU gap between GPT-4 and GPT-5 — roughly 4.6 percentage points, from 86.4% to approximately 91% — cost an estimated $400M in incremental training investment. Yet the vast majority of the knowledge tested by MMLU was already mastered by GPT-4. The new training run did not add new knowledge so much as re-encode existing knowledge at marginally higher fidelity, while simultaneously paying for architectural improvements in reasoning and instruction-following that represent the genuinely novel contribution. The diminishing returns cliff below illustrates this dynamic in stark terms.
The Forgetting Wall
If the waste is so evident, the natural question is: why not simply update the existing model? Why not take GPT-4 and teach it the incremental knowledge that GPT-5 requires, rather than starting from scratch? The answer is a significant limitation of neural network architectures called catastrophic forgetting — one of the most important reasons the Britannica Problem persists. As a model learns new information, it systematically overwrites the weights that stored old information. The standard mitigation — replaying 40–50% of old training data alongside new data — costs nearly as much as full retraining while introducing unpredictable quality regressions. Recent research (Google’s Nested Learning, 2025; Collaborative Neural Learning achieving 59–82% forgetting reduction) and architectural innovations like Mixture-of-Experts are making progress, but no approach has yet demonstrated full continuous pre-training at frontier scale without quality degradation. Practical workarounds — RAG, million-token context windows, and modular fine-tuning — reduce the operational impact, even as the theoretical problem remains unsolved.
Catastrophic Forgetting
When you fine-tune a model on new data, the weight updates that encode new knowledge systematically overwrite the weights that stored old knowledge. Learn quantum physics, forget Shakespeare. The network has a fixed capacity — new memories evict old ones.
The Replay Tax
The standard fix: replay 40-50% of old training data alongside new data (research shows a ~50-50 new-replay ratio works best). This preserves old knowledge but costs almost as much as full retraining. You're paying 40-50% of the original cost just to not forget — before learning anything new.
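The replay tax can be made concrete with a back-of-the-envelope calculation (illustrative token counts, assuming cost scales linearly with tokens processed):

```python
def continual_update_cost(new_tokens: float, replay_ratio: float) -> float:
    """Total tokens processed when `replay_ratio` of each batch is old data.

    With a 50-50 mix, every new token is accompanied by one replayed
    token, so the run processes new_tokens / (1 - replay_ratio) tokens.
    """
    return new_tokens / (1.0 - replay_ratio)

# Suppose the update needs 1T genuinely new tokens and replay is 50%.
new_tokens = 1.0e12
total = continual_update_cost(new_tokens, replay_ratio=0.5)
replay_overhead = total - new_tokens

# Half of all compute goes to not forgetting, before learning anything new.
print(f"replayed tokens: {replay_overhead:.1e} ({replay_overhead / total:.0%} of the run)")
```

At a 50% replay ratio the run doubles in size before a single new fact is learned, which is why "almost as much as full retraining" is not hyperbole.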
Unpredictable Regressions
Even with replay, incremental updates produce unpredictable quality regressions. A model might lose its coding ability when learning new medical knowledge, or forget formatting rules when updating its math skills. There's no reliable way to predict what breaks.
Brute Force Is Simpler
Full retraining from scratch is wasteful but predictable. Incremental updating is efficient but risky. Labs choose predictability over efficiency — the cost of an unpredictable regression in production is higher than the cost of retraining.
"Catastrophic forgetting is the original sin of neural networks. Every workaround is either expensive (replay), fragile (regularization), or incomplete (parameter isolation). Full retraining is the only approach that guarantees quality — which is why everyone does it, despite the absurd cost."
| Metric | Standard (Next-Token) | Multi-Token Prediction |
|---|---|---|
| Prediction objective | Next token only, equal weight per token | n future tokens simultaneously |
| Training overhead | Baseline | Zero additional cost |
| Inference speed | 1x | Up to 3x faster |
| Code benchmarks | Baseline | +12% HumanEval, +17% MBPP |
Multi-Token Prediction
Meta's research (Gloeckle et al., 2024): train the model to predict multiple future tokens simultaneously using n independent output heads. This achieves better performance with zero additional training cost, up to 3x faster inference, and +12-17% improvement on code benchmarks (HumanEval, MBPP). By looking further ahead, the model implicitly learns richer representations.
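The objective can be sketched in miniature. This is a toy, dependency-free version of the multi-token idea, not Meta's implementation: head k at position t is scored against the token k+1 steps ahead, and the total loss averages the per-head cross-entropies so the shared trunk must carry information about several future tokens at once. The vocabulary, logits, and sequence are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mtp_loss(head_logits, sequence, n_heads):
    """Multi-token prediction loss over a toy vocabulary.

    head_logits[k][t] holds head k's logits at position t, which must
    predict sequence[t + k + 1], i.e. k tokens further ahead than the
    standard next-token target.
    """
    total, count = 0.0, 0
    for k in range(n_heads):
        for t in range(len(sequence) - k - 1):
            target = sequence[t + k + 1]
            probs = softmax(head_logits[k][t])
            total += -math.log(probs[target])
            count += 1
    return total / count

# Vocabulary of 3 tokens; two heads (next token, and two tokens ahead).
sequence = [0, 1, 2, 0]
confident = [
    [[0, 4, 0], [0, 0, 4], [4, 0, 0]],  # head 0: predicts t+1
    [[0, 0, 4], [4, 0, 0]],             # head 1: predicts t+2
]
uniform = [[[0, 0, 0]] * 3, [[0, 0, 0]] * 2]
```

A trunk whose representations let both heads predict correctly (the `confident` logits) scores far below one that only hedges uniformly, which is the pressure that produces richer representations.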
The Broader Inefficiency
Between function words consuming ~30-40% of token-level loss compute, and the 40-50% replay tax in continual pre-training, current training approaches carry substantial overhead. Research on selective token weighting, curriculum learning, and multi-token prediction all point to the same conclusion: significant compute is spent on low-information signals, and architectural innovations can reclaim much of it.
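One of the simplest forms of selective token weighting can be sketched directly: down-weight high-frequency function words in the loss so compute concentrates on content tokens. The stopword list, weights, and per-token losses below are illustrative, not from any cited paper.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def weighted_nll(tokens, token_nll, stopword_weight=0.25):
    """Cross-entropy with function words down-weighted.

    tokens:    target tokens (whitespace words, for illustration)
    token_nll: per-token negative log-likelihoods from the model
    """
    total, norm = 0.0, 0.0
    for tok, nll in zip(tokens, token_nll):
        w = stopword_weight if tok.lower() in STOPWORDS else 1.0
        total += w * nll
        norm += w
    return total / norm

# "the", "of", and "is" contribute a quarter of the weight of content words.
tokens = ["the", "capital", "of", "France", "is", "Paris"]
nll    = [0.1,   2.3,       0.1,  1.7,      0.1,  0.9]
loss = weighted_nll(tokens, nll)
```

The gradient signal shifts toward "capital", "France", and "Paris", where the semantic content actually lives.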
Rethinking What We Train
The function word analysis and forgetting wall together reveal a compound inefficiency: not only does each model re-learn the same knowledge, but a substantial fraction of the compute budget within each run is allocated to low-information tokens — articles, prepositions, and conjunctions that carry grammatical structure but minimal semantic content. Multi-token prediction, which trains the model to predict multiple future tokens simultaneously, addresses this by forcing the model to develop richer internal representations without additional training cost. The result is up to 3x faster inference and 12–17% improvement on code benchmarks — essentially free performance gains from smarter training objectives.
These inefficiencies, taken together, suggest that the current paradigm is not merely expensive but structurally suboptimal. The alternatives below represent a portfolio of approaches, each attacking a different dimension of the problem. None is a silver bullet, but in combination they point toward a future where knowledge is a reusable asset rather than a disposable byproduct of each training run.
1. Mixture of Experts (MoE)
Instead of one monolithic network, use many specialized sub-networks ("experts") with a router that activates only the relevant ones per query.
DeepSeek V3: 671B total parameters, but only 37B activated per query (~5.5%). Knowledge is distributed across experts, and new experts can be added modularly (though the routing mechanism may need fine-tuning).
Not rewriting Britannica — adding specialist chapters.
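The routing mechanism is easy to see in miniature. This sketch uses toy scalar "experts" and invented router scores; the point is that only the top-k experts execute per query, the same sparsity that lets DeepSeek V3 activate 37B of 671B parameters.

```python
import math

calls = []  # records which experts actually ran

def make_expert(k):
    def expert(x):
        calls.append(k)
        return (k + 1) * x
    return expert

def moe_forward(x, experts, gate_scores, top_k=2):
    """Sparse mixture-of-experts: run only the top_k highest-scoring
    experts and combine their outputs, weighted by softmax over the
    selected scores."""
    ranked = sorted(range(len(experts)), key=lambda i: -gate_scores[i])
    active = ranked[:top_k]
    exps = [math.exp(gate_scores[i]) for i in active]
    z = sum(exps)
    return sum((e / z) * experts[i](x) for e, i in zip(exps, active))

experts = [make_expert(k) for k in range(8)]
scores = [0.1, 3.0, 0.2, 0.1, 2.5, 0.1, 0.1, 0.1]  # router favors experts 1 and 4
y = moe_forward(10.0, experts, scores, top_k=2)
```

Of eight experts, only two are ever invoked; the other six hold their parameters (their "knowledge") without spending any compute on this query.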
2. Retrieval-Augmented Generation (RAG)
Don't bake knowledge into neural network weights — store it in a searchable external index and retrieve it at inference time.
Update knowledge by updating the document index, not retraining. No forgetting, instant updates, verifiable sources. 1,200+ RAG papers published in 2024 alone.
The encyclopedia becomes a library: you consult it, you don't memorize it.
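A minimal RAG loop fits in a few lines. The sketch below uses naive keyword overlap so it stays dependency-free; production systems use dense embeddings and a vector index, and the documents here are invented examples.

```python
def retrieve(query, index, top_k=2):
    """Rank documents by keyword overlap with the query (a stand-in
    for embedding similarity)."""
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda doc: -len(q & set(doc.lower().split())))
    return ranked[:top_k]

def build_prompt(query, index):
    context = "\n".join(retrieve(query, index))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Updating knowledge = editing the index, not retraining any weights.
index = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
    "Shakespeare wrote Hamlet around 1600.",
]
prompt = build_prompt("What is the capital of France?", index)
```

Appending a new document to `index` makes it available on the very next query, which is the "instant updates, no forgetting" property the section describes.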
3. Knowledge Distillation
Transfer knowledge from a large "teacher" model to a smaller "student" model, without the student needing to see the original training data.
LIMA (Zhou et al., 2023): matched or was preferred over GPT-4 in 43% of human evaluations, using just 1,000 curated examples.
Minitron (NVIDIA): requires up to 40x fewer training tokens than training from scratch to reach equivalent quality.
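The core objective behind this family of methods is the temperature-scaled soft-label loss (Hinton-style distillation): the student matches the teacher's full output distribution rather than a single hard label. A minimal sketch, with invented logits:

```python
import math

def softmax_t(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the
    student's. The student absorbs relative probabilities ("dark
    knowledge") without ever seeing the original training data."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
good_student = [3.8, 1.1, 0.3]   # close to the teacher's distribution
bad_student = [0.2, 1.0, 4.0]    # confident in the wrong class
```

The loss is zero when the distributions coincide and grows with disagreement, so gradient descent pulls the student toward the teacher's knowledge at a fraction of the token budget.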
4. Continual Pre-Training
Carefully update existing models with new data instead of starting from scratch. With proper learning rate scheduling, data mixing ratios, and regularization, you can match full retraining quality.
"Reuse, Don't Retrain" (Parmar et al., 2024): demonstrated 9-18% accuracy improvements over baseline continued training through proper learning rate scheduling and data mixing strategies.
5. Model Merging / Soups
Combine multiple specialized models by merging their weights — no additional training required. Like merging encyclopedia volumes from different publishers.
FuseLLM, Model Soups, TIES-Merging: techniques that merge models trained on different domains into a single model that inherits capabilities from all parents.
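The simplest variant, a uniform "soup", is literally parameter-wise averaging. The toy checkpoints below are invented; the precondition is real: the models must share an architecture and a common pre-trained ancestor so their weights live in the same loss basin.

```python
def uniform_soup(models):
    """Average several fine-tuned checkpoints parameter by parameter.

    models: list of dicts mapping parameter name -> value (stand-ins
    for weight tensors)."""
    n = len(models)
    return {k: sum(m[k] for m in models) / n for k in models[0]}

# Three toy 'checkpoints' fine-tuned on different domains.
med  = {"w1": 0.9, "w2": 0.2}
law  = {"w1": 1.1, "w2": 0.4}
code = {"w1": 1.0, "w2": 0.3}
merged = uniform_soup([med, law, code])
```

No gradient step is taken: the merge is a pure arithmetic operation over existing weights, which is why the technique requires "no additional training".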
6. Post-Training (DPO / RLHF)
The value frontier has moved from knowledge to alignment. Post-training techniques refine how a model uses its knowledge — not what it knows.
DeepSeek R1: the RL post-training phase cost just $294K on top of the $5.6M V3 base model. DPO is substantially simpler and computationally lighter than RLHF, eliminating the need for reward model training and on-policy sampling.
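DPO's simplicity is visible in its loss, which needs only log-probabilities from the policy and a frozen reference model for one chosen and one rejected response. The log-probability values below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a response under the
    policy (logp_*) or the frozen reference model (ref_*). No reward
    model and no on-policy sampling: the loss directly widens the margin
    by which the policy prefers the chosen answer over the rejected one,
    relative to the reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(sigmoid(margin))

# Policy already prefers the chosen response more than the reference does:
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response: the loss is higher.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

Eliminating the reward model and rollout sampling is what makes post-training runs like this orders of magnitude cheaper than pre-training.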
The Linux Kernel Model
[Diagram: a shared open-weight kernel (GLM-5, Qwen 3, Kimi K2.5) underpinning general-purpose, enterprise, and lightweight distributions.]
Just as Linux is a shared kernel with many distributions, AI should have shared base models with many fine-tuned variants. The kernel is maintained collectively. Distributions compete on integration, support, and specialization.
The Orchestra Architecture
[Diagram: a router classifies each query and sends it to a specialist; the base model is frozen and shared; each domain expert is a 50-200MB adapter; knowledge lives in an external, updatable store.]
A small router directs queries to the right specialist. The base is frozen and shared. Each domain expert is a lightweight LoRA adapter (~100MB). New domains = new adapters, not retraining.
"The future is not one giant model — it is an orchestra of specialized instruments, playing from the same score."
The Economics
| Approach | Cost | Time to Deploy |
|---|---|---|
| Train frontier model from scratch | $100M-2.5B | 6-12 months |
| Fine-tune 7B specialist (LoRA) | $500-3,000 | Hours |
| 100 domain-specific models | <$100,000 total | Days |
| RAG knowledge update | ~$0 (index update) | Minutes |
| Post-training alignment (DPO) | $100K-500K | Days |
A domain-fine-tuned 7B model frequently outperforms generic frontier models on domain-specific tasks — at a fraction of the cost.
From Knowledge Storage to Knowledge Architecture
The six alternatives share a common thread: they separate what the model knows from how it uses what it knows. Mixture-of-experts distributes knowledge across specialized sub-networks. RAG externalizes knowledge into searchable indexes. Distillation transfers knowledge without re-processing raw training data. Continual pre-training, model merging, and post-training alignment each reduce the volume of redundant computation in different ways. The common denominator is that knowledge — the facts, the relationships, the patterns — is treated as a reusable asset rather than a transient byproduct of each training run.
DeepSeek’s architecture is perhaps the most instructive example of where this trajectory leads. V3’s mixture-of-experts design activates only 37B of its 671B parameters per query — a 5.5% activation ratio that means 94.5% of the model’s knowledge sits dormant for any given input, available when needed but not consuming compute when it is not. R1’s $294K reinforcement learning post-training phase then added reasoning capability on top of V3’s base — creating more measurable value than the entire pre-training investment. The future is modular, composable, and updatable. The question is how quickly the industry will adopt it.
From Scale to Efficiency
DeepSeek demonstrated that algorithmic efficiency can dramatically reduce costs, consistent with research showing algorithmic gains outpace Moore's Law (~2x every 8 months for language models). V3 matched GPT-4 quality at 1/20th the cost. R1 achieved reasoning breakthroughs with $294K in RL post-training (on top of V3). The path forward is not bigger models — it is smarter training.
From Monolithic to Modular
Stop rewriting Britannica. Build composable, updatable architectures. MoE + LoRA + RAG already enables this: a frozen shared base, modular experts, and external knowledge retrieval. New knowledge doesn't require retraining — it requires a new adapter or an index update. The model becomes a platform, not a product that expires every 12 months.
From Pre-Training to Post-Training
The value frontier has moved. General knowledge is a commodity — every model has it, and the gap between models on MMLU is shrinking. The differentiation is now in reasoning, alignment, and task-specific adaptation. DeepSeek R1's $294K RL post-training phase (on the $5.6M V3 base) created more measurable value than additional pre-training investment. The future investment should shift toward post-training — not away from it.
We need to build a library system.
The knowledge is already there. The facts haven't changed. What we need is better ways to access, organize, and apply that knowledge — not to re-learn it from scratch every 12 months at $500M a pop.
What Comes Next
The Britannica Problem illuminates a structural flaw in how intelligence is produced, but it also points to an opportunity. If knowledge can be separated from reasoning, modularized, and reused, then the cost of each model generation drops dramatically — and the competitive landscape shifts toward those who can most efficiently build on shared foundations rather than those who can afford the most compute. This raises a strategic question: if the model layer is heading toward commodity economics, who benefits from accelerating that trend — and who is threatened by it? The answer is the subject of Chapter 15: The Open Source Paradox, which examines why some of the world’s largest companies are choosing to give away their most expensive assets for free.
Data synthesized from: Kaplan et al., scaling laws (arXiv:2001.08361); Hoffmann et al., Chinchilla (arXiv:2203.15556); DeepSeek V3 and R1 technical reports; NVIDIA Minitron (arXiv:2407.14679); Gloeckle et al., multi-token prediction (arXiv:2404.19737); Parmar et al., "Reuse, Don't Retrain" (arXiv:2407.07263); Zhou et al., LIMA (arXiv:2305.11206); Rafailov et al., DPO (arXiv:2305.18290); Luo et al., catastrophic forgetting (arXiv:2308.08747).