The Great Convergence — AI Model Gaps & Economics 2022-2026
How the gap between open & closed models collapsed, and the economics that drove it
In 2022, frontier AI models led their open-source counterparts by more than 24 months. By early 2026, that gap has collapsed to under six months for commodity tasks — and on some benchmarks, open models have pulled ahead entirely. However, closed models retain clear leads in frontier reasoning and agentic tasks (Claude Opus 4.5 achieves 80.9% on SWE-bench Verified; GPT-5 scores 100% on AIME 2026). This is the most consequential structural shift in the AI industry since the launch of ChatGPT.
The data in this chapter tells a story that would have seemed implausible just two years ago. On MMLU — the most widely cited general knowledge benchmark — the gap between the best closed and best open model shrank from 17.5 percentage points in mid-2023 to 0.3 points by late 2025. The benchmark is now effectively saturated, with both open and closed models scoring above 94% — making MMLU increasingly irrelevant for differentiation. Newer benchmarks (GPQA Diamond, SWE-bench Verified, agentic evaluations) show more mixed results where closed models often retain meaningful leads. On AIME, the competition-level mathematics benchmark, multiple models have hit 100%, and the best open model trails by just four points. Even on SWE-bench, the demanding real-world software engineering evaluation, the gap between Anthropic’s Sonnet 5 (82.1%) and Zhipu’s open-weight GLM-5 (77.8%) has narrowed to just 4.3 points.
For enterprise leaders, this convergence is not an academic curiosity — it is a strategic inflection point. It means the assumption that proprietary APIs represent the only path to frontier intelligence is no longer valid. It means vendor lock-in carries real opportunity cost. And it means the architecture decisions made today will determine whether organizations can capitalize on the rapidly expanding universe of capable models, or find themselves trapped in a single provider’s ecosystem as intelligence becomes a commodity.
See also: Model Tier Comparison (S/A/B/C) for a detailed breakdown of the current model landscape by capability tier.
[Headline stat cards: open vs closed gap (2023→2025) · $20 → $0.07/MTok (18 mo) · 2023 → Feb 2026 · Nov 2023 → Feb 2026 · rate (post-2024) · GLM-5 vs GPT-5.2 on HLE]
The Benchmarks Tell the Story
The six headline statistics above capture the velocity of convergence across every major capability dimension. A 280x cost reduction for GPT-3.5-level performance in just 18 months. SWE-bench scores that jumped from 2% to 82% in two years. GPQA Diamond results that surpassed PhD-level human experts. And perhaps most striking: on Humanity’s Last Exam (HLE) — a benchmark specifically designed to remain unsaturated — the open-weight GLM-5 now leads GPT-5.2 by 4.9 points. The frontier is no longer a single line; it is a broad zone where open and closed models trade blows.
Coding and Science: Where the Gap Closed Fastest
The SWE-bench and GPQA Diamond charts above reveal two distinct convergence patterns. In coding, both open and closed models have followed steep upward trajectories, but open models accelerated dramatically in 2025. DeepSeek V3 reached approximately 48% on SWE-bench by mid-2025 — a level that would have been state-of-the-art just a year earlier. By February 2026, open models like GLM-5 and Kimi K2.5 were scoring above 76%, closing the gap with Sonnet 5 to single digits.
In science reasoning, the picture is more nuanced. Frontier closed models still maintain a meaningful lead on GPQA Diamond — approximately 19 percentage points — but the trajectory is unmistakable. When GPQA debuted in November 2023, GPT-4 scored 39% while Llama 2 managed only 28%. By early 2026, the best closed model (Gemini 3.1 at 94.3%) has surpassed PhD experts by nearly 30 points, while open models (at approximately 75%) have surpassed those same experts by roughly 10 points. The gap remains, but both sides of the divide now operate well above the human expert threshold.
Knowledge Benchmarks: Saturation and What It Means
MMLU — once the gold standard for measuring general AI capability — is now effectively a solved benchmark. The gap between the best closed model (GPT-5 at approximately 94%) and the best open model has narrowed to just 0.3 percentage points. This saturation is itself a signal: the easy capability gains have been captured, and differentiation now happens on harder, more specialized evaluations.
On AIME (competition mathematics), the convergence pattern is even more dramatic. In 2023, both open and closed models scored around 13–15% — essentially at parity because neither was capable enough to differentiate. The gap widened sharply in 2024 as o1’s chain-of-thought reasoning reached 75% while open models languished near 40%. But by 2025, reasoning technique diffusion closed this gap almost entirely: GPT-5.2 hit 100%, and the best open model (Kimi K2.5) reached 96%. This pattern — temporary divergence followed by rapid catch-up — recurs across every benchmark and constitutes the defining rhythm of the convergence.
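A toy calculation, using only the AIME figures quoted in this section (illustrative, not an official series), makes that rhythm concrete:

```python
# Approximate best AIME scores quoted in this chapter (illustrative only).
aime = {
    2023: {"closed": 14, "open": 14},   # ~13-15% parity; neither differentiates
    2024: {"closed": 75, "open": 40},   # o1-style reasoning opens the gap
    2025: {"closed": 100, "open": 96},  # technique diffusion closes it again
}

for year, scores in aime.items():
    gap = scores["closed"] - scores["open"]
    print(f"{year}: closed-open gap = {gap:+} points")

# 2023: +0 -> 2024: +35 -> 2025: +4
# Temporary divergence, then rapid catch-up.
```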
The Economics of Convergence
The capability convergence is inseparable from an economic transformation of equal magnitude. The cost of achieving GPT-3.5-level performance has fallen from $20 per million tokens at launch in November 2022 to approximately $0.02 per million tokens in early 2026 — a 1,000x reduction in just over three years. This is not an incremental efficiency gain; it is an entirely new cost regime that changes the fundamental economics of deploying AI at scale.
Two forces drive this collapse simultaneously. First, frontier providers themselves are racing to the bottom: GPT-4 launched at $30 per million input tokens in March 2023; by February 2026, GPT-5 offers superior capability at $1.25. Second, open-source models have introduced a price floor near zero. DeepSeek V3 offers competitive performance at $0.27 per million tokens via API, and self-hosted open models like Qwen 3, Llama 4, and OLMo 3.1 can be run for the marginal cost of compute alone. The result is a price spread of nearly 800x in a single market — from $0.02 (Mistral Nemo) to $15.75 (GPT-5.2 Pro reasoning) — creating the most stratified pricing landscape in the history of enterprise software.
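To make that spread concrete, here is a minimal cost sketch, assuming a hypothetical workload of one billion tokens per month and using the illustrative per-MTok prices quoted above (not live list prices):

```python
# Cost of a 1B-token/month workload across the price tiers quoted above.
# Prices are this chapter's illustrative figures, not live list prices.
PRICE_PER_MTOK = {
    "Mistral Nemo (open)":        0.02,
    "DeepSeek V3 (open API)":     0.27,
    "GPT-5 (frontier)":           1.25,
    "GPT-5.2 Pro (reasoning)":   15.75,
}

MONTHLY_TOKENS_M = 1_000  # 1B tokens = 1,000 MTok per month (assumed workload)

for model, price in PRICE_PER_MTOK.items():
    monthly_cost = price * MONTHLY_TOKENS_M
    print(f"{model:28s} ${monthly_cost:>9,.2f}/month")

# The same workload spans ~$20/month to ~$15,750/month:
# a nearly 800x spread for a single unit of work.
```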
The Jevons Paradox
Token prices dropped 1,000x, but total AI inference spending surged 320% in 2025. Cheaper tokens = more usage. The market expanded faster than prices fell.
Open-Source Advantage
GLM-5 (MIT-licensed, $0.11/MTok) approaches the capability of Claude Opus 4.6 ($5/MTok) on most benchmarks. For many tasks, the cost-adjusted value of open models is dramatically higher than that of closed models.
Reasoning Premium
While base-model prices collapsed, reasoning/thinking modes command 10–15x premiums ($15–21/MTok). The margin has shifted from raw intelligence to compute-time reasoning.
What Drives the Convergence?
The convergence waterfall above makes the pattern unmistakable: across nearly every benchmark, the gap between open and closed models has narrowed relentlessly. But understanding why this is happening matters more than observing that it is. Three structural forces account for the bulk of the convergence, and none of them show signs of reversing.
Training technique diffusion. Breakthroughs in training methodology — reinforcement learning from human feedback, chain-of-thought reasoning, mixture-of-experts architectures — originate at frontier labs but propagate to the open-source ecosystem within months. DeepSeek V3 and R1 were watershed moments in early 2025, demonstrating that a relatively small Chinese lab could match frontier quality by applying publicly known techniques with rigorous engineering discipline. Epoch AI’s analysis of the open-weight landscape consistently shows that open models trail frontier releases by just 3–6 months on core capability benchmarks.
Open-weight releases from frontier labs. Meta’s Llama series, Alibaba’s Qwen (which surpassed 700 million downloads), and Google’s Gemma have created a virtuous cycle. Each release raises the capability floor for the entire open ecosystem, enabling fine-tuning, distillation, and adaptation that would be impossible from scratch. The open-source community is no longer building from first principles; it is standing on the shoulders of frontier research.
The Jevons Paradox in action. As the Convergence Waterfall and Economics Deep Dive panels illustrate, token prices have dropped 1,000x — yet total AI inference spending surged 320% in 2025. Cheaper intelligence does not reduce consumption; it explodes it. This demand surge funds further open-source development, which drives further cost reduction, which drives further demand. The flywheel is self-reinforcing.
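A rough back-of-envelope calculation shows why the flywheel is so powerful. It uses the chapter's own figures and the strong simplifying assumption that the 1,000x decline (measured for GPT-3.5-level performance) applies market-wide:

```python
# Back-of-envelope: what a 320% spending surge implies about token volume.
# Assumes (simplistically) a uniform 1,000x average price decline, the
# chapter's GPT-3.5-level figure, applied market-wide.
spend_growth = 4.2      # +320% means spending is 4.2x its prior level
price_decline = 1_000   # tokens cost 1/1,000th of what they did

implied_volume_growth = spend_growth * price_decline
print(f"Implied token volume growth: {implied_volume_growth:,.0f}x")  # 4,200x

# Even if the true average price fell far less than 1,000x, volume must
# still grow by hundreds of times for spend to rise while prices collapse.
```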
What Comes Next
The convergence documented in this chapter has profound implications for enterprise strategy. If the gap between frontier and open models is measured in months rather than years, then the traditional logic of choosing a single AI provider and building deeply around its ecosystem carries escalating risk. The S-tier models of today — Gemini 3.1 Pro, Claude Opus 4.6, GPT-5.2 — will be matched or exceeded by open alternatives within a matter of quarters.
This does not mean frontier models are irrelevant. As the GPQA Diamond data shows, closed models still maintain meaningful leads in the most demanding reasoning tasks, and proprietary reasoning modes command 10–15x price premiums precisely because they deliver capabilities that open models have not yet replicated. The strategic question is not whether to use frontier models, but how to architect systems that can route between frontier and commodity intelligence as the landscape continues to shift.
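A minimal sketch of what such routing can look like, assuming hypothetical tier names, illustrative prices from this chapter, and a caller-supplied complexity estimate standing in for a real difficulty classifier (this is not any vendor's API):

```python
# A minimal sketch of tier-based routing. Tier names, prices, and the
# complexity thresholds are all illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    price_per_mtok: float  # illustrative prices from this chapter

TIERS = [
    ModelTier("open-commodity (e.g. Mistral Nemo)", 0.02),
    ModelTier("open-strong (e.g. GLM-5)", 0.11),
    ModelTier("frontier (e.g. GPT-5)", 1.25),
    ModelTier("frontier-reasoning (e.g. GPT-5.2 Pro)", 15.75),
]

def route(complexity: float) -> ModelTier:
    """Map a 0-1 task-complexity estimate to the cheapest adequate tier."""
    if complexity < 0.3:
        return TIERS[0]   # classification, extraction, boilerplate
    if complexity < 0.6:
        return TIERS[1]   # general drafting, straightforward coding
    if complexity < 0.85:
        return TIERS[2]   # hard coding, multi-step analysis
    return TIERS[3]       # frontier reasoning and agentic work

print(route(0.2).name)   # -> open-commodity tier
print(route(0.9).name)   # -> frontier-reasoning tier
```

The point of the sketch is the shape, not the thresholds: expensive reasoning capacity sits behind a gate, and everything that does not need it flows to the cheapest adequate tier.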
Chapter 4 explores this question through the lens of model taxonomy — the classification framework that enables enterprises to match the right model tier to the right task, ensuring that expensive frontier intelligence is reserved for the problems that demand it while commodity tasks flow to the most cost-effective option available. The convergence is not the end of differentiation; it is the beginning of intelligent orchestration.