Diminishing Returns at the Frontier

Open vs frontier large language models — a mid-2026 data refresh · June 2026 · sourced & verified

A re-examination of the thesis that the top 4–5 models are within 5–10% of each other, and that million-token context windows collapse under KV-cache economics. Updated with June-2026 data and corrected arithmetic; every figure is sourced.

Verdict: The thesis holds — with two corrections. (1) The top-3 models span only ~5 points on the aggregate Intelligence Index, but a real 8–18 point gap survives on the hardest benchmarks (HLE, TerminalBench Hard). (2) The February doc's KV-cache table mislabeled an 8B model as 70B — a real 70B at 1M context needs ~320 GB/user, not 128 GB, making the long-context argument stronger.

Contents: 1 Landscape · 2 Index spread · 3 Hard-task gap · 4 Cost of intelligence · 5 KV-cache · 6 What changed · 7 Methodology & verification · Sources

1 · The mid-2026 model landscape

Since the February snapshot, OpenAI shipped GPT-5.5 (2026-04-23) and Anthropic shipped Claude Fable 5 + Mythos 5 (2026-06-09). The leading open-weight models are all large sparse Mixture-of-Experts designs — trillion-plus total parameters, only 32–49B active per token.

Frontier (closed-weight)

Model	Vendor	Released	Context	Price $/M (in/out)	AA Index
Claude Fable 5	Anthropic	2026-06-09	1M	$10 / $50	65
Claude Opus 4.8	Anthropic	pre-June	1M	$5 / $25	61
GPT-5.5 (xhigh)	OpenAI	2026-04-23	—	unverified	60
Claude Sonnet 4.6	Anthropic	pre-June	1M	$3 / $15	~53
Grok 4.3 (high)	xAI	2026-04-30	—	lowered	53
Gemini 3.1 Pro (Preview)	Google	pre-June	—	—	— (HLE 45%)

Open-weight

Model	Vendor	Released	Architecture (total / active)	Context	License	AA Index
Kimi K2.6	Moonshot	2026-04-20	1T / 32B MoE (384 exp, MLA)	256K	MIT-class	54
Xiaomi MiMo V2.5 Pro	Xiaomi	Apr 2026	~1.02T / 42B MoE	1M	open weights	54
DeepSeek V4 Pro	DeepSeek	2026-04-22	1.6T / 49B MoE	1M	MIT	52

Figure: Context window (K tokens). Most flagships are now 1M; Kimi K2.6 is 256K.

Figure: Open-weight MoE, total vs active parameters (billions, log scale). Trillion-scale total, 32–49B active: frontier quality at mid-tier serving cost [8][9].
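One rough way to read the open-weight table is the share of parameters that actually runs per token. The sketch below computes that share from the total and active counts listed for the three models. The helper name `active_fraction` is illustrative, and MiMo's total is taken as 1.02T even though the table marks it as approximate.

```python
# Total and active parameter counts in billions, copied from the open-weight table.
# MiMo's total is approximate ("~1.02T") in the source.
MODELS = {
    "Kimi K2.6": (1000.0, 32.0),
    "Xiaomi MiMo V2.5 Pro": (1020.0, 42.0),
    "DeepSeek V4 Pro": (1600.0, 49.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a sparse MoE model's parameters that is used per token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    pct = 100.0 * active_fraction(total_b, active_b)
    print(f"{name:22s} {active_b:4.0f}B of {total_b:6.0f}B active ({pct:.1f}%)")
```

Per-token compute tracks the active count, while the weights that must be held in memory track the total, so serving cost and hosting footprint diverge for these models.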

2 · The aggregate spread is narrow

On the Artificial Analysis Intelligence Index, Claude Fable 5 (65) leads Claude Opus 4.8 (61) and GPT-5.5 (60) by 4 and 5 points respectively; the best open-weight models trail by only 3–6 points. Paying for #1 buys a small edge over a crowded field.

Figure: Artificial Analysis Intelligence Index, June 2026. Top 3 span 5 pts; top 7 span 13 pts. Source: Artificial Analysis [5][8].

3 · …but the gap is real on the hardest tasks

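Convergence is benchmark-dependent. On the two hardest evaluations the best open-weight models still trail the closed frontier by 8–18 points: 34–36% vs 44–45% on HLE and 43–46% vs 54–61% on TerminalBench Hard [8]. On agentic work there is a gap even among closed models: Grok 4.3 trails GPT-5.5 by 276 Elo on GDPval-AA [6][7].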

Figure: Hardest benchmarks, open vs frontier (%). Open-weight figures are the midpoint of the range across the top-3 open-weight models. Source: Artificial Analysis [8].
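The 8–18 point range follows from the score ranges listed under Confirmed findings (HLE 34–36% vs 44–45%; TerminalBench Hard 43–46% vs 54–61%, open vs frontier). A minimal sketch of that arithmetic, where `gap_range` is an illustrative helper: the smallest gap pairs the best open score with the weakest frontier score, and the largest gap pairs the weakest open score with the best frontier score.

```python
# Score ranges in percent, taken from the Confirmed findings list.
SCORES = {
    "HLE": {"open": (34, 36), "frontier": (44, 45)},
    "TerminalBench Hard": {"open": (43, 46), "frontier": (54, 61)},
}


def gap_range(open_range, frontier_range):
    """Return (smallest, largest) gap in points between two score ranges."""
    open_lo, open_hi = open_range
    front_lo, front_hi = frontier_range
    return front_lo - open_hi, front_hi - open_lo


for bench, ranges in SCORES.items():
    lo, hi = gap_range(ranges["open"], ranges["frontier"])
    print(f"{bench}: open-weight trails frontier by {lo} to {hi} points")
```

Taken together, the two benchmarks give a smallest gap of 8 points and a largest gap of 18.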

4 · The cost of intelligence — real, but method-dependent

The cost to reach a fixed capability has fallen sharply, but how much depends entirely on the measurement method. The famous “200×/year” is a token-price figure; a stricter benchmark-cost measure that corrects for reasoning models emitting more tokens gives 5–10×/year, and ~3×/year once hardware and competition are removed. All three are correct.

Method	What it measures	Rate	Source
Epoch AI	Token price at fixed benchmark score	50×/yr median; 200×/yr post-Jan-2024; up to 900×	[12]
Epoch AI (specific)	GPT-4-level PhD science (GPQA Diamond)	~40×/yr	[12]
Thompson et al. (MIT)	Benchmark cost, corrected for reasoning-token inflation	5–10×/yr	[10][11]
Thompson et al. (residual)	After removing hardware (~1.3×/yr) & competition	~3×/yr (pure algorithmic)	[10][11]

Concrete data point: by Grok 4 Fast (Sept 2025), an Index>60-tier model cost just $0.20 / $0.50 per million tokens (in/out) [13]. As the MIT authors put it, “much of the more dramatic 100×–1000× price decreases… are likely due to non-technical factors like increased market competition” [11].
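To see how far apart the methods sit once they compound, the sketch below converts each annual rate from the table into a two-year cumulative decline and a halving time. It assumes a constant rate over the whole period, which is a simplification and not something the cited sources claim. The function names are illustrative.

```python
import math

# Annual cost-decline factors quoted in the table (fold cheaper per year).
RATES = {
    "Epoch token-price, median": 50.0,
    "Epoch token-price, post-Jan-2024": 200.0,
    "Thompson benchmark-cost, low": 5.0,
    "Thompson benchmark-cost, high": 10.0,
    "Thompson algorithmic-only": 3.0,
}


def cumulative_decline(rate_per_year: float, years: float) -> float:
    """Total fold reduction in cost after `years` at a constant annual rate."""
    return rate_per_year ** years


def months_to_halve(rate_per_year: float) -> float:
    """Months for cost to fall by half at a constant annual rate."""
    return 12.0 * math.log(2.0) / math.log(rate_per_year)


for name, rate in RATES.items():
    total = cumulative_decline(rate, 2)
    halving = months_to_halve(rate)
    print(f"{name:34s} 2-year decline: {total:>9,.0f}x   halving time: {halving:4.1f} months")
```

Under that constant-rate assumption, two years at 200×/yr compounds to 40,000×, while two years at ~3×/yr compounds to 9×, which is why a rate quoted without its method is uninformative.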

Figure: Annual cost-decline rate by method (log scale). Epoch token-price [12] vs Thompson benchmark-cost and algorithmic-only [10][11].

Figure: Verified API pricing, $/M tokens (log scale). Anthropic prices verified; Grok 4 Fast is the low-cost Index>60 point. GPT-5.5 and open-weight self-hosting prices are unverified.

5 · KV-cache economics, corrected

KV-cache size per user is 2 × n_layers × n_kv_heads × head_dim × bytes_per_element × tokens, where the leading 2 counts the separate key and value tensors. The February doc used the 32-layer (8B) result but labeled it “70B.” Re-derived for a real 70B (80 layers, FP16):

Context	Doc claimed “70B”	Real 70B (80 layers)	Real 8B (32 layers)
4K	0.5 GB	1.25 GB	0.5 GB
32K	4 GB	10 GB	4 GB
128K	16 GB	40 GB	16 GB
1M	128 GB	320 GB	128 GB
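The table can be reproduced directly from the formula. The sketch below is one way to do it, under assumptions the text does not state: 8 KV heads and a head dimension of 128 for both models (Llama-style GQA configurations), context sizes in binary units (4K = 4,096 tokens, 1M = 1,048,576 tokens), and "GB" meaning GiB. With those values the output matches every cell in the two "Real" columns.

```python
GIB = 1024 ** 3

# Layer counts come from the text; n_kv_heads and head_dim are ASSUMED
# (Llama-style GQA configs), since the source does not list them.
CONFIGS = {
    "70B": {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128},
    "8B": {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128},
}

# Context lengths in tokens, ASSUMING binary K/M.
CONTEXTS = {"4K": 4 * 1024, "32K": 32 * 1024, "128K": 128 * 1024, "1M": 1024 * 1024}

FP16_BYTES = 2


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem, tokens):
    """2 (K and V) x layers x KV heads x head dim x bytes per element x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def kv_cache_gib(model: str, context: str) -> float:
    """Per-user KV cache in GiB for a named model and context length."""
    cfg = CONFIGS[model]
    total = kv_cache_bytes(
        cfg["n_layers"], cfg["n_kv_heads"], cfg["head_dim"], FP16_BYTES, CONTEXTS[context]
    )
    return total / GIB


for context in CONTEXTS:
    print(f"{context:>4s}  70B: {kv_cache_gib('70B', context):7.2f} GiB"
          f"   8B: {kv_cache_gib('8B', context):7.2f} GiB")
```

Because every term is a multiplier, the 70B figures are exactly 80/32 = 2.5× the 8B figures at each context length, which is the size of the February doc's labeling error.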

One H200 (141 GB) serving a 4-bit-quantized (Q4) 70B (~40 GB weights, ~95 GB free for KV) supports ~76 concurrent users at 4K, not the 270 the February doc claimed. That falls to ~2 at 128K and to less than one at 1M. Bigger GPUs do not fix this; redesigned attention does: GQA (≈4× smaller KV cache than MHA), MLA (≈5× smaller), GTA, which matches GQA quality at roughly half the KV cache, and GLA, which matches MLA with up to 2× throughput [14].
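The concurrency figures are a simple division of free memory by per-user KV cache. A minimal sketch, using the ~95 GB free figure and the per-user sizes from the "Real 70B" column of the table. It ignores activation memory, fragmentation, and paging, so treat it as an upper bound. `concurrent_users` is an illustrative name.

```python
FREE_FOR_KV_GB = 95.0  # H200 memory left after ~40 GB of Q4 weights, per the text

# Per-user KV cache for a real 70B in FP16, from the table (GB).
KV_PER_USER_GB = {"4K": 1.25, "32K": 10.0, "128K": 40.0, "1M": 320.0}


def concurrent_users(free_gb: float, per_user_gb: float) -> int:
    """Whole users whose KV cache fits in the free memory."""
    return int(free_gb // per_user_gb)


for context, per_user in KV_PER_USER_GB.items():
    users = concurrent_users(FREE_FOR_KV_GB, per_user)
    share = FREE_FOR_KV_GB / per_user
    print(f"{context:>4s}: {users:3d} users (free memory covers {share:.2f} caches)")
```

At 1M context the free memory covers about 0.3 of a single user's cache, so one request would have to be sharded across several GPUs.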

Figure: KV-cache per user vs context (GB, log scale). FP16, GQA. Re-derived from first principles; MLA = DeepSeek compressed latent.

The widely-cited “MLA cuts KV cache 93.3%” figure did not survive verification; the direction (large reduction) is well established, the exact percentage is not [16].

6 · What changed since February 2026

Item	February doc	June 2026
Top frontier model	Gemini 3.1 Pro / Opus 4.6	Claude Fable 5 (Index 65)
OpenAI flagship	GPT-5.2	GPT-5.5 (2026-04-23)
Top open-weight	GLM-5, Kimi K2.5	Kimi K2.6 / MiMo V2.5 Pro (54), DeepSeek V4 Pro (52)
“Within 5–10%”	asserted broadly	True on Index; false on HLE/TerminalBench
Cost decline	“200×/yr, 1000×” as fact	Method-dependent: 200× / 5–10× / ~3×
70B KV @ 128K / 1M	16 / 128 GB	40 / 320 GB

7 · Methodology & verification

The web data was gathered by a deep-research harness: parallel search fan-out → source fetch → claim extraction → 3-vote adversarial verification (a claim is killed when at least 2 of its 3 votes are “refute”) → synthesis. Run totals: 6 angles · 27 sources · 116 claims extracted · 25 verified (22 confirmed, 3 killed) · 110 agents. The KV-cache arithmetic was additionally re-derived by hand.
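The kill rule can be written down in a few lines. The sketch below is not the harness's code. It assumes vote tallies are written "confirm-refute" (as in "3-0" or "1-2" elsewhere in this document) and uses hypothetical helper names.

```python
REFUTE_THRESHOLD = 2  # a claim is killed at 2 or more "refute" votes out of 3


def parse_vote(vote: str) -> tuple[int, int]:
    """Split a 'confirm-refute' tally such as '2-1' into two integers."""
    confirm, refute = vote.split("-")
    return int(confirm), int(refute)


def is_killed(vote: str, threshold: int = REFUTE_THRESHOLD) -> bool:
    """True when the refute count reaches the kill threshold."""
    _, refute = parse_vote(vote)
    return refute >= threshold


for tally in ("3-0", "2-1", "1-2", "0-3"):
    status = "killed" if is_killed(tally) else "kept"
    print(f"{tally}: {status}")
```

Under this rule a 2-1 claim survives and a 1-2 claim does not, which matches how the confirmed and refuted lists are split.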

Confirmed findings

3-0 ✓ — Fable 5 + Mythos 5 released 2026-06-09; 1M context, 128k output, $10/$50; Mythos = Fable without safety classifiers. Anthropic model card + news [1][2].

3-0 ✓ — GPT-5.5 exists (announced 2026-04-23); tops AA Index among non-Anthropic at ~60. OpenAI + Wikipedia [3][4]. Pricing claim refuted (see below).

3-0 ✓ — Index top-3 span ~5 pts: Fable 5 = 65, Opus 4.8 = 61, GPT-5.5 = 60. Artificial Analysis [5].

3-0 ✓ — Grok 4.3 = 53, trails GPT-5.5 by 276 Elo on GDPval-AA agentic. AA article + model page [6][7].

3-0 ✓ — Open leaders: Kimi K2.6 (54), MiMo V2.5 Pro (54), DeepSeek V4 Pro (52) — within 3–6 pts; all trillion-scale MoE. AA + Hugging Face [8][9].

3-0 ✓ — Gap only on hardest benchmarks: HLE 34–36% vs 44–45%; TerminalBench Hard 43–46% vs 54–61%. AA [8].

3-0 ✓ — Pareto: open dominates low/mid cost; closed defines the lowest-error end. Thompson et al. + Digital Applied [10][15].

3-0 ✓ — Cost of intelligence: Epoch 50×/yr median, 200× post-Jan-2024, GPQA-level 40×/yr; Grok 4 Fast $0.20/$0.50. Epoch + AA [12][13].

3-0 / 2-1 — Method-dependence: benchmark-cost 5–10×/yr, algorithmic-only ~3×/yr; “much of 100–1000× is competition.” Thompson et al. + LessWrong [10][11].

3-0 ✓ — KV-cache is the HBM bottleneck; GTA matches GQA at ~half the KV cache; GLA matches MLA with up to 2× throughput. Tri Dao et al. [14].

Refuted — killed in verification, NOT used

Claim	Vote (confirm-refute)
GPT-5.5 priced at $5/M in, $30/M out	0-3
MLA reduces KV cache by exactly 93.3% vs DeepSeek-67B MHA	1-2
~Half of benchmark progress is associated with rising inference prices	1-2

Caveats

Never quote a cost-decline number without its method (token-price 50×/200× vs benchmark-cost 5–10× vs algorithmic ~3×).
NVIDIA H200/B200/B300 VRAM & bandwidth specs did not survive verification; the H200 = 141 GB figure is vendor-published, used as-is.
Model identities are a June-2026 snapshot; thesis-level conclusions hold regardless of specific versions.

Open questions

Verified VRAM and HBM bandwidth figures for NVIDIA H200 / B200 / B300.
The actual published GPT-5.5 API price (existence confirmed; pricing refuted).
Whether the fastest ~900×/yr price trends persist or revert toward the 50× median.
How much of the 100–1000× market price decline is competition vs algorithmic progress.

Sources (27 fetched, by quality)

#	Source	Quality
1	platform.claude.com — Fable 5 & Mythos 5 card	primary
2	anthropic.com/news — Fable 5 / Mythos 5	—
3	openai.com — Introducing GPT-5.5	secondary
4	en.wikipedia.org — GPT-5.5	—
5	Artificial Analysis — Intelligence Index	primary
6	Artificial Analysis — xAI launches Grok 4.3	primary
7	Artificial Analysis — Grok 4.3 model page	primary
8	Artificial Analysis — Recent open-weights launches	primary
9	Hugging Face — DeepSeek-V4-Pro	—
10	arXiv 2511.23455v2 — Thompson et al. (MIT)	primary
11	LessWrong — MIT FutureTech, inference progress	primary
12	Epoch AI — LLM inference price trends	primary
13	Artificial Analysis (X) — Grok 4 Fast pricing	primary
14	arXiv 2505.21487 — Grouped-Tied/Latent Attention (Tri Dao)	primary
15	Digital Applied — Efficient Frontier Q2 2026	secondary
16	arXiv 2405.04434 — DeepSeek-V2 (MLA)	primary
17–27	benchlm.ai, benchmarkingagents.com, tokencost.app, fiegenbaum.solutions, medium.com, spheron.network, sebastianraschka.com (secondary/blog corroboration)	blog

Full appendix with every claim, evidence quote, and vote: frontier-diminishing-returns-2026-research-appendix.md.