The Great Convergence — AI Model Gaps & Economics 2022-2026

How the gap between open & closed models collapsed, and the economics that drove it

Part I — The Intelligence Layer
Chapter 3: The Great Convergence

In 2022, frontier AI models led their open-source counterparts by more than 24 months. By early 2026, that gap has collapsed to under six months for commodity tasks — and on some benchmarks, open models have pulled ahead entirely. Closed models do, however, retain clear leads on frontier reasoning and agentic tasks (Claude Opus 4.5 achieves 80.9% on SWE-bench Verified; GPT-5.2 scores 100% on AIME 2025). This is the most consequential structural shift in the AI industry since the launch of ChatGPT.

The data in this chapter tells a story that would have seemed implausible just two years ago. On MMLU — the most widely cited general knowledge benchmark — the gap between the best closed and best open model shrank from 17.5 percentage points in mid-2023 to 0.3 points by late 2025. The benchmark is now effectively saturated, with both open and closed models scoring above 94% — making MMLU increasingly irrelevant for differentiation. Newer benchmarks (GPQA Diamond, SWE-bench Verified, agentic evaluations) show more mixed results where closed models often retain meaningful leads. On AIME, the competition-level mathematics benchmark, multiple models have hit 100%, and the best open model trails by just four points. Even on SWE-bench, the demanding real-world software engineering evaluation, the gap between Anthropic’s Sonnet 5 (82.1%) and Zhipu’s open-weight GLM-5 (77.8%) has narrowed to just 4.3 points.

For enterprise leaders, this convergence is not an academic curiosity — it is a strategic inflection point. It means the assumption that proprietary APIs represent the only path to frontier intelligence is no longer valid. It means vendor lock-in carries real opportunity cost. And it means the architecture decisions made today will determine whether organizations can capitalize on the rapidly expanding universe of capable models, or find themselves trapped in a single provider’s ecosystem as intelligence becomes a commodity.

See also: Model Tier Comparison (S/A/B/C) for a detailed breakdown of the current model landscape by capability tier.

The headline numbers, at a glance:

- MMLU gap closed: 17.5 → 0.3 percentage points, open vs closed (2023 → 2025)
- Cost reduction: 280x for GPT-3.5-level performance, $20 → $0.07/MTok in 18 months
- SWE-bench coding capability: 2% → 82% (2023 → Feb 2026)
- GPQA Diamond science reasoning: 39% → 94% (Nov 2023 → Feb 2026)
- Estimated price-decline rate: >100x per year (post-2024)
- Open beats closed on HLE: +4.9 points ahead (GLM-5 vs GPT-5.2)

The Benchmarks Tell the Story

The six headline statistics above capture the velocity of convergence across every major capability dimension. A 280x cost reduction for GPT-3.5-level performance in just 18 months. SWE-bench scores that jumped from 2% to 82% in two years. GPQA Diamond results that surpassed PhD-level human experts. And perhaps most striking: on Humanity’s Last Exam (HLE) — a benchmark specifically designed to remain unsaturated — the open-weight GLM-5 now leads GPT-5.2 by 4.9 points. The frontier is no longer a single line; it is a broad zone where open and closed models trade blows.

SWE-bench — From 2% to 82% in Two Years
Real-world software engineering capability. Shows best closed (gold) vs best open (green) model at each milestone.
[Chart data, against a ~50% human baseline. Closed: 1.96% (RAG+GPT-3.5) → 13.9% (Devin) → ~20% (GPT-4 agents) → 33.4% (Claude 3.5 Sonnet) → 49% (Claude 3.5, updated) → 62.3% (Claude 3.7) → 72% (Claude 4 Opus) → 80.9% (Opus 4.5) → 82.1% (Sonnet 5). Open: ~10% → ~18% → ~48% (DeepSeek V3) → 76.8% (Kimi K2) → 77.8% (GLM-5). Gap: 31pt → 4.3pt.]
GPQA Diamond — Surpassing Human Experts
Graduate-level science reasoning. PhD experts score ~65%. Frontier models now score 94%.
[Chart data. Closed: 39% (GPT-4) → 50.4% (Claude 3 Opus) → 77% (o1, surpassing the experts) → 87% (Opus 4.5) → 91.3% (Opus 4.6) → 94.3% (Gemini 3.1). Open: ~28% (Llama 2) → ~36% → ~51% (Llama 3.1 405B) → ~70% (Qwen/DeepSeek) → ~75%. Gap: ~22pt → ~19pt. PhD-expert baseline ~65%; non-expert ~34%.]

Coding and Science: Where the Gap Closed Fastest

The SWE-bench and GPQA Diamond charts above reveal two distinct convergence patterns. In coding, both open and closed models have followed steep upward trajectories, but open models accelerated dramatically in 2025. DeepSeek V3 reached approximately 48% on SWE-bench by mid-2025 — a level that would have been state-of-the-art just a year earlier. By February 2026, open models like GLM-5 and Kimi K2.5 were scoring above 76%, closing the gap with Sonnet 5 to single digits.

In science reasoning, the picture is more nuanced. Frontier closed models still maintain a meaningful lead on GPQA Diamond — approximately 19 percentage points — but the trajectory is unmistakable. When GPQA Diamond debuted in November 2023, GPT-4 scored 39% while Llama 2 managed only 28%. By early 2026, the best closed model (Gemini 3.1 at 94.3%) has surpassed PhD experts by nearly 30 points, while open models (at approximately 75%) have surpassed those same experts by roughly 10 points. The gap remains, but both sides of the divide now operate well above the human expert threshold.

MMLU — The Gap That Vanished
General knowledge benchmark. Gap collapsed from 17.5pt (2023) to 0.3pt (2025). Benchmark now saturated.
[Chart data. Closed: 86.4% (GPT-4) → 88.7% (GPT-4o) → 92.9% → ~93% (GPT-5). Open: 68.9% (Llama 2 70B) → 79.5% (Llama 3 70B) → 87.3% (Llama 3.1 405B) → ~92% (Qwen/DeepSeek) → ~93%. Gap: 17.5pt → 0.3pt.]
AIME 2025 — Open Models Nearly Match 100%
Competition math. Multiple models hit 100%. Open models at 96%+. Gap: just 4 points.
[Chart data, 2023 → 2026. Closed: ~13% (GPT-4) → ~75% (o1) → 100% (GPT-5.2). Open: ~15% (Llama 2) → ~40% (Qwen 2.5) → 96.1% (Kimi K2.5). Gap: ~35pt (2024) → 4pt.]

Knowledge Benchmarks: Saturation and What It Means

MMLU — once the gold standard for measuring general AI capability — is now effectively a solved benchmark. The gap between the best closed model (GPT-5 at approximately 93%) and the best open model has narrowed to just 0.3 percentage points. This saturation is itself a signal: the easy capability gains have been captured, and differentiation now happens on harder, more specialized evaluations.

On AIME (competition mathematics), the convergence pattern is even more dramatic. In 2023, both open and closed models scored around 13–15% — essentially at parity because neither was capable enough to differentiate. The gap widened sharply in 2024 as o1’s chain-of-thought reasoning reached 75% while open models languished near 40%. But by 2025, reasoning technique diffusion closed this gap almost entirely: GPT-5.2 hit 100%, and the best open model (Kimi K2.5) reached 96%. This pattern — temporary divergence followed by rapid catch-up — recurs across every benchmark and constitutes the defining rhythm of the convergence.

The Cost of Intelligence — Exponential Collapse
Cost to achieve GPT-3.5 level performance ($/MTok). Logarithmic scale. From $20 to $0.07 in 18 months.
[Chart data, log scale, Nov 2022 → Feb 2026: $20.00 (GPT-3.5 launch) → $12.00 (GPT-3.5 Turbo) → $2.00 (Turbo-era models) → $0.15 (GPT-4o mini) → $0.07 (Oct 2024, the 280x drop in 18 months) → $0.05 (GPT-5 Nano) → $0.02 (Mistral Nemo). Roughly 1,000x total by 2026.]
Frontier Model Pricing — Race to the Bottom
Input cost per million tokens for the best available model at each point in time.
[Chart data, Mar 2023 → Feb 2026. Closed: $30 (GPT-4 launch) → $10 (GPT-4 Turbo) → $5 (GPT-4o / Claude 3) → $3 (Claude 3.5 Sonnet) → $1.25 (GPT-5 / Gemini 3). Open: $0 (Llama 3, free weights) → $0.27 (DeepSeek V3) → $0.11 (GLM-5, MIT). Gap: $30 (2023) → $1.14.]

The Economics of Convergence

The capability convergence is inseparable from an economic transformation of equal magnitude. The cost of achieving GPT-3.5-level performance has fallen from $20 per million tokens at launch in November 2022 to approximately $0.02 per million tokens in early 2026 — a 1,000x reduction in just over three years. This is not an incremental efficiency gain; it is an entirely new cost regime that changes the fundamental economics of deploying AI at scale.

Two forces drive this collapse simultaneously. First, frontier providers themselves are racing to the bottom: GPT-4 launched at $30 per million input tokens in March 2023; by February 2026, GPT-5 offers superior capability at $1.25. Second, open-source models have introduced a price floor near zero. DeepSeek V3 offers competitive performance at $0.27 per million tokens via API, and self-hosted open models like Qwen 3, Llama 4, and OLMo 3.1 can be run for the marginal cost of compute alone. The result is a price range spanning nearly three orders of magnitude in a single market — from $0.02 (Mistral Nemo) to $15.75 (GPT-5.2 Pro reasoning) — creating the most stratified pricing landscape in the history of enterprise software.
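The headline multiples follow directly from the quoted prices; a short sanity check, using only the $/MTok figures cited in this chapter:

```python
# Verify the cost-reduction multiples for GPT-3.5-level performance,
# using the per-million-token prices quoted in this chapter.
launch_price = 20.00   # GPT-3.5 at launch, Nov 2022
oct24_price = 0.07     # after the Oct 2024 drop
floor_price = 0.02     # Mistral Nemo, early 2026

drop_to_oct24 = launch_price / oct24_price   # the "280x" figure
drop_total = launch_price / floor_price      # the "1,000x" figure

print(f"{drop_to_oct24:.0f}x by Oct 2024, {drop_total:.0f}x by early 2026")
# → 286x by Oct 2024, 1000x by early 2026
```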

The Convergence Waterfall — Gap Shrinkage by Benchmark
Percentage-point gap between best closed and best open model; narrower = more convergence.

Benchmark     2023     2024     2025    Feb 2026
MMLU          17.5pt   5pt      1pt     0.3pt
SWE-bench     ~20pt    ~15pt    5pt     4.3pt
GPQA          11pt     ~15pt    17pt    ~19pt (still large)
AIME          ~0pt     ~35pt    4pt     4pt
HLE           N/A      N/A      N/A     +4.9pt (open leads)
Arena Elo     ~200     ~120     ~50     ~49
Economics Deep Dive — Who Captures Value?
The AI inference market paradox: prices dropped 1,000x but spending surged 320%.
2026 price tiers, nearly a 1,000x range from cheapest to most expensive ($/MTok, input):
$0.02 Mistral Nemo · $0.05 GPT-5 Nano · $0.10 Flash Lite · $0.11 GLM-5 · $0.27 DeepSeek · $1.25 GPT-5 · $2.00 Gemini 3.1 · $3.00 Sonnet 5 · $5.00 Opus 4.6 · $15.00 GPT-5.2 Reasoning · $15.75 GPT-5.2 Pro

The Jevons Paradox

Token prices dropped 1,000x, but total AI inference spending surged 320% in 2025. Cheaper tokens = more usage. The market expanded faster than prices fell.

Open-Source Advantage

GLM-5 (MIT, $0.11/MTok) approaches Claude Opus 4.6's ($5/MTok) capability on most benchmarks. For many tasks, the cost-adjusted value of open models is dramatically higher than closed.

Reasoning Premium

While base model prices collapsed, reasoning/thinking modes command 10-15x premiums ($15-21/MTok). The margin has shifted from raw intelligence to compute-time reasoning.
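The cost-adjusted value claim in the Open-Source Advantage note can be made concrete with a crude metric: benchmark points per dollar of input tokens. The metric is our illustration (real value depends on task mix and output-token pricing); the scores and prices are the ones quoted in this chapter.

```python
def points_per_dollar(score: float, price_per_mtok: float) -> float:
    """Crude cost-adjusted value: benchmark points per $/MTok of input."""
    return score / price_per_mtok

# SWE-bench Verified scores and input prices quoted in this chapter
sonnet5 = points_per_dollar(82.1, 3.00)   # closed: ~27 points per dollar
glm5 = points_per_dollar(77.8, 0.11)      # open (MIT): ~707 points per dollar

print(f"GLM-5 delivers ~{glm5 / sonnet5:.0f}x the capability per dollar")
# → GLM-5 delivers ~26x the capability per dollar
```

By this measure the open model wins by more than an order of magnitude even while trailing on raw capability, which is the arithmetic behind the card above.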

What Drives the Convergence?

The convergence waterfall above makes the pattern unmistakable: across nearly every benchmark, the gap between open and closed models has narrowed relentlessly. But understanding why this is happening matters more than observing that it is. Three structural forces account for the bulk of the convergence, and none of them show signs of reversing.

Training technique diffusion. Breakthroughs in training methodology — reinforcement learning from human feedback, chain-of-thought reasoning, mixture-of-experts architectures — originate at frontier labs but propagate to the open-source ecosystem within months. DeepSeek V3 and R1 were watershed moments in early 2025, demonstrating that a relatively small Chinese lab could match frontier quality by applying publicly known techniques with rigorous engineering discipline. Epoch AI’s analysis of the open-weight landscape consistently shows that open models trail frontier releases by just 3–6 months on core capability benchmarks.

Open-weight releases from frontier labs. Meta’s Llama series, Alibaba’s Qwen (which surpassed 700 million downloads), and Google’s Gemma have created a virtuous cycle. Each release raises the capability floor for the entire open ecosystem, enabling fine-tuning, distillation, and adaptation that would be impossible from scratch. The open-source community is no longer building from first principles; it is standing on the shoulders of frontier research.

The Jevons Paradox in action. As the Convergence Waterfall and Economics Deep Dive panels illustrate, token prices have dropped 1,000x — yet total AI inference spending surged 320% in 2025. Cheaper intelligence does not reduce consumption; it explodes it. This demand surge funds further open-source development, which drives further cost reduction, which drives further demand. The flywheel is self-reinforcing.

Historical Timeline — Key Milestones in the Convergence
2022
GPT-3.5 launches ($20/MTok). MMLU: 70%.
ChatGPT reaches 100M users in 2 months.
No competitive open models exist yet.
2023
GPT-4 launches (MMLU: 86.4%, $30/MTok).
Claude 2, Gemini 1 Pro released.
GPQA Diamond: GPT-4 scores 39%.
Llama 2 (MMLU: 68.9%). 17.5pt gap.
Vicuna/Mistral 7B show viability.
GPT-4 Turbo cuts price to $10/MTok.
2024
GPT-4o ($2.50/MTok). Claude 3 family.
o1 surpasses PhD experts on GPQA (77%).
SWE-bench: Claude 3.5 hits 49%.
Llama 3.1 405B matches GPT-4 (87%).
MMLU gap closes to ~5pt.
GPT-3.5 perf: $20 → $0.07 (280x).
Decline rate: 50x → 200x/year.
2025
GPT-5, then GPT-5.2 (AIME: 100%).
Gemini 3 Pro (GPQA: 91.9%).
Claude 4 Opus. SWE-bench: 72%+.
DeepSeek R1/V3 shock. Speciale: IMO gold.
Qwen surpasses 700M downloads.
MMLU gap: 0.3pt. Arena gap: 1.7%.
Total AI spend surges 320% (Jevons).
2026 (Feb)
Gemini 3.1 Pro (ARC-AGI-2: 77.1%).
Claude Opus 4.6 (#1 Arena). Sonnet 5 (82.1%).
7 major releases in one month.
GLM-5 beats GPT-5.2 on HLE (+4.9pt).
Kimi K2.5: 200-300 autonomous tool calls.
Open models lead in specific benchmarks.
GLM-5: $0.11/MTok. 45x cheaper than Opus.
1,000x range: $0.02 to $21/MTok.

What Comes Next

The convergence documented in this chapter has profound implications for enterprise strategy. If the gap between frontier and open models is measured in months rather than years, then the traditional logic of choosing a single AI provider and building deeply around its ecosystem carries escalating risk. The S-tier models of today — Gemini 3.1 Pro, Claude Opus 4.6, GPT-5.2 — will be matched or exceeded by open alternatives within a matter of quarters.

This does not mean frontier models are irrelevant. As the GPQA Diamond data shows, closed models still maintain meaningful leads in the most demanding reasoning tasks, and proprietary reasoning modes command 10–15x price premiums precisely because they deliver capabilities that open models have not yet replicated. The strategic question is not whether to use frontier models, but how to architect systems that can route between frontier and commodity intelligence as the landscape continues to shift.
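A minimal sketch of such a routing layer, assuming a caller-supplied difficulty estimate; the tiers, thresholds, and model choices below are illustrative, built from this chapter's pricing data, not a recommended configuration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    input_price: float  # $/MTok, figures quoted in this chapter

# Ordered cheapest-first: commodity open, mid-tier closed, frontier closed
TIERS = [
    ModelTier("GLM-5", 0.11),
    ModelTier("Sonnet 5", 3.00),
    ModelTier("Opus 4.6", 5.00),
]

def route(difficulty: float) -> ModelTier:
    """Map a difficulty estimate in [0, 1] to a tier: commodity work
    flows to the cheapest capable model; only the hardest tail of
    requests reaches the frontier tier (thresholds are illustrative)."""
    if difficulty < 0.5:
        return TIERS[0]
    if difficulty < 0.85:
        return TIERS[1]
    return TIERS[2]

print(route(0.3).name, route(0.9).name)  # → GLM-5 Opus 4.6
```

In practice the difficulty estimate would come from the request itself (a classifier, heuristics, or a cheap first-pass model), with escalation up the tiers on failure; the point is that the routing policy, not any single vendor relationship, becomes the durable architectural commitment.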

Chapter 4 explores this question through the lens of model taxonomy — the classification framework that enables enterprises to match the right model tier to the right task, ensuring that expensive frontier intelligence is reserved for the problems that demand it while commodity tasks flow to the most cost-effective option available. The convergence is not the end of differentiation; it is the beginning of intelligent orchestration.

Sources: Epoch AI (inference price trends), LMSYS Chatbot Arena, Stanford AI Index 2025, a16z LLMflation, OpenAI, Anthropic, Google DeepMind, Artificial Analysis, VentureBeat, SWE-bench.com, Andrew Ng (pricing data) · Data as of February 28, 2026