
Diminishing Returns at the Frontier

Open vs frontier large language models — a mid-2026 data refresh · June 2026 · sourced & verified

A re-examination of the thesis that the top 4–5 models are within 5–10% of each other, and that million-token context windows collapse under KV-cache economics. Updated with June-2026 data and corrected arithmetic; every figure is sourced.

Verdict: The thesis holds — with two corrections. (1) The top-3 models span only ~5 points on the aggregate Intelligence Index, but a real 8–18 point gap survives on the hardest benchmarks (HLE, TerminalBench Hard). (2) The February doc's KV-cache table mislabeled an 8B model as 70B — a real 70B at 1M context needs ~320 GB/user, not 128 GB, making the long-context argument stronger.
Contents: 1 Landscape · 2 Index spread · 3 Hard-task gap · 4 Cost of intelligence · 5 KV-cache · 6 What changed · 7 Methodology & verification · Sources

1 · The mid-2026 model landscape

Since the February snapshot, OpenAI shipped GPT-5.5 (2026-04-23) and Anthropic shipped Claude Fable 5 + Mythos 5 (2026-06-09). The leading open-weight models are all large sparse Mixture-of-Experts designs — trillion-plus total parameters, only 32–49B active per token.

Frontier (closed-weight)

Model | Vendor | Released | Context | Price $/M (in/out) | AA Index
Claude Fable 5 | Anthropic | 2026-06-09 | 1M | $10 / $50 | 65
Claude Opus 4.8 | Anthropic | pre-June | 1M | $5 / $25 | 61
GPT-5.5 (xhigh) | OpenAI | 2026-04-23 | - | unverified | 60
Claude Sonnet 4.6 | Anthropic | pre-June | 1M | $3 / $15 | ~53
Grok 4.3 (high) | xAI | 2026-04-30 | - | lowered | 53
Gemini 3.1 Pro (Preview) | Google | pre-June | - | - | HLE 45%

Open-weight

Model | Vendor | Released | Architecture (total / active) | Context | License | AA Index
Kimi K2.6 | Moonshot | 2026-04-20 | 1T / 32B MoE (384 exp, MLA) | 256K | MIT-class | 54
Xiaomi MiMo V2.5 Pro | Xiaomi | Apr 2026 | ~1.02T / 42B MoE | 1M | open weights | 54
DeepSeek V4 Pro | DeepSeek | 2026-04-22 | 1.6T / 49B MoE | 1M | MIT | 52

Chart: Context window (K tokens). Most flagships are now 1M; Kimi K2.6 is 256K.

Chart: Open-weight MoE, total vs active params (B, log scale). Trillion-scale total, 32–49B active: frontier quality at mid-tier serving cost [8][9].
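
To make the total-versus-active distinction concrete, the short sketch below computes the share of parameters active per token for the three open-weight models, using the figures from the table above. It is an illustration only: the helper name `active_fraction` is hypothetical, MiMo's total is taken as 1020B from the listed "~1.02T", and the link between active parameters and serving cost is the usual rule of thumb rather than a measured result.

```python
# Total and active parameter counts in billions, as listed in the open-weight table.
MODELS = {
    "Kimi K2.6": (1000.0, 32.0),
    "Xiaomi MiMo V2.5 Pro": (1020.0, 42.0),  # total is listed as ~1.02T (approximate)
    "DeepSeek V4 Pro": (1600.0, 49.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that take part in each token's forward pass."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total_b, active_b) in MODELS.items():
        pct = 100 * active_fraction(total_b, active_b)
        print(f"{name}: {active_b:.0f}B of {total_b:.0f}B active ({pct:.1f}%)")
```

On these figures each model activates roughly 3–4% of its weights per token. All weights still have to be held in memory, so the saving is in per-token compute, not in footprint.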

2 · The aggregate spread is narrow

On the Artificial Analysis Intelligence Index, Claude Fable 5 (65) leads Claude Opus 4.8 (61) and GPT-5.5 (60) by single digits; the best open-weight models trail by only 3–6 points. Paying for #1 buys a small edge over a crowded field.

Chart: Artificial Analysis Intelligence Index, June 2026. Top 3 span 5 pts; top 7 span 13 pts. Source: Artificial Analysis [5][8].

3 · …but the gap is real on the hardest tasks

Convergence is benchmark-dependent. On the two hardest evaluations the best open-weight models still trail the closed frontier by 8–18 points — and on agentic work Grok 4.3 trails GPT-5.5 by 276 Elo on GDPval-AA.

Chart: Hardest benchmarks, open vs frontier (%). Open-weight figures are the midpoint of the range across the top-3 open-weight models. Source: Artificial Analysis [8].
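
The 8–18 point figure can be reproduced from the score ranges reported in the verification section (HLE 34–36% open vs 44–45% frontier; TerminalBench Hard 43–46% vs 54–61%). The sketch below is one way to do that arithmetic; the function name `gap_range` is hypothetical, and it assumes the smallest gap pairs the best open score with the weakest frontier score, and the largest gap pairs the reverse.

```python
def gap_range(open_range, frontier_range):
    """Return (smallest, largest) possible gap in points between two score ranges."""
    open_lo, open_hi = open_range
    frontier_lo, frontier_hi = frontier_range
    return frontier_lo - open_hi, frontier_hi - open_lo


# (open-weight range, closed-frontier range), in percent, from the confirmed findings.
BENCHMARKS = {
    "HLE": ((34, 36), (44, 45)),
    "TerminalBench Hard": ((43, 46), (54, 61)),
}

if __name__ == "__main__":
    gaps = {name: gap_range(o, f) for name, (o, f) in BENCHMARKS.items()}
    for name, (lo, hi) in gaps.items():
        print(f"{name}: open trails frontier by {lo} to {hi} points")
    overall = (min(lo for lo, _ in gaps.values()), max(hi for _, hi in gaps.values()))
    print(f"Overall: {overall[0]} to {overall[1]} points")
```

Under that reading, HLE contributes a gap of 8 to 11 points and TerminalBench Hard 8 to 18, which together give the quoted 8–18 range.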

4 · The cost of intelligence — real, but method-dependent

The cost to reach a fixed capability has fallen sharply, but how much depends entirely on the measurement method. The famous “200×/year” is a token-price figure; a stricter benchmark-cost measure that corrects for reasoning models emitting more tokens gives 5–10×/year, and ~3×/year once hardware and competition are removed. All three are correct.

Method | What it measures | Rate | Source
Epoch AI | Token price at fixed benchmark score | 50×/yr median; 200×/yr post-Jan-2024; up to 900× | [12]
Epoch AI (specific) | GPT-4-level PhD science (GPQA Diamond) | ~40×/yr | [12]
Thompson et al. (MIT) | Benchmark cost, corrected for reasoning-token inflation | 5–10×/yr | [10][11]
Thompson et al. (residual) | After removing hardware (~1.3×/yr) & competition | ~3×/yr (pure algorithmic) | [10][11]
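
Why the choice of method matters so much becomes clearer once the annual rates are compounded. The sketch below does this for the rates in the table above, assuming (purely for illustration) that each rate holds constant over the whole period; the function name `cumulative_decline` and the two-year horizon are hypothetical choices, not something the cited sources state.

```python
# Annual cost-decline factors (x per year) taken from the table above.
RATES_PER_YEAR = {
    "Epoch token price, post-Jan-2024": 200.0,
    "Epoch token price, median": 50.0,
    "Thompson benchmark cost, upper": 10.0,
    "Thompson benchmark cost, lower": 5.0,
    "Thompson algorithmic-only": 3.0,
}


def cumulative_decline(rate_per_year: float, years: float) -> float:
    """Total cost reduction factor if the annual rate compounds unchanged."""
    return rate_per_year ** years


if __name__ == "__main__":
    years = 2  # illustrative horizon, an assumption
    for method, rate in RATES_PER_YEAR.items():
        factor = cumulative_decline(rate, years)
        print(f"{method}: {rate:g}x/yr -> {factor:,.0f}x over {years} years")
```

Over two years the same period reads as a 40,000× drop on the token-price measure and a 9× drop on the algorithmic-only measure, so a headline rate needs to be read together with its method.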

Concrete data point: by Grok 4 Fast (Sept 2025), an Index>60-tier model cost just $0.20 / $0.50 per million tokens (in/out) [13]. As the MIT authors put it, “much of the more dramatic 100×–1000× price decreases… are likely due to non-technical factors like increased market competition” [11].

Chart: Annual cost-decline rate by method (log scale). Epoch token-price [12] vs Thompson benchmark-cost & algorithmic-only [10][11].

Chart: Verified API pricing, $/M tokens (log scale). Anthropic prices verified; Grok 4 Fast = the low-cost Index>60 point. GPT-5.5 & open self-host prices unverified.

5 · KV-cache economics, corrected

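The per-user KV-cache size is 2 × n_layers × n_kv_heads × head_dim × bytes × tokens, where the leading 2 counts keys and values and bytes is the size of one cached element (2 for FP16). The February doc used the 32-layer (8B) result but labeled it “70B.” Re-derived for a real 70B (80 layers, FP16):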

Context | Doc claimed “70B” | Real 70B (80 layers) | Real 8B (32 layers)
4K | 0.5 GB | 1.25 GB | 0.5 GB
32K | 4 GB | 10 GB | 4 GB
128K | 16 GB | 40 GB | 16 GB
1M | 128 GB | 320 GB | 128 GB
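
The table can be re-derived with a few lines of Python. This is a minimal sketch under stated assumptions: the text gives only the layer counts and FP16, so `n_kv_heads=8` and `head_dim=128` are assumed values, chosen because they reproduce the table's figures. Context labels are read as binary (4K = 4,096 tokens, 1M = 1,048,576 tokens) and "GB" as 2^30 bytes, which is also an assumption.

```python
GIB = 1024 ** 3  # the table's "GB" figures come out exact in binary units


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int, tokens: int) -> int:
    """2 (keys + values) x layers x KV heads x head dim x element size x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def kv_cache_gib(n_layers: int, tokens: int, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # n_kv_heads=8 and head_dim=128 are assumptions, not stated in the text.
    # bytes_per_elem=2 corresponds to FP16.
    return kv_cache_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem, tokens) / GIB


CONTEXTS = {"4K": 4 * 1024, "32K": 32 * 1024, "128K": 128 * 1024, "1M": 1024 * 1024}

if __name__ == "__main__":
    for label, tokens in CONTEXTS.items():
        print(f"{label:>4}: 70B (80 layers) = {kv_cache_gib(80, tokens):g} GB, "
              f"8B (32 layers) = {kv_cache_gib(32, tokens):g} GB")
```

With these assumptions the 80-layer column comes out to 1.25, 10, 40 and 320 GB, and the 32-layer column to 0.5, 4, 16 and 128 GB. The 32-layer values match the figures the February doc labeled "70B".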

One H200 (141 GB) serving a Q4 70B (~40 GB weights, ~95 GB free for KV) drops from the February doc's claimed 270 concurrent users at 4K to a real ~76, then to ~2 at 128K and to under one at 1M. Bigger GPUs do not escape this; redesigned attention does: GQA (≈4× vs MHA), MLA (≈5× smaller), and GTA/GLA, which match GQA/MLA quality at roughly half the KV cache [14].
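
The concurrency figures follow from dividing the memory left for KV cache by the per-user cache size. The sketch below is a simplified model that uses the ~95 GB free figure from the text and the real-70B per-user sizes from the table; it ignores activations, fragmentation and any paging or sharing scheme, and the name `concurrent_users` is hypothetical.

```python
import math

FREE_FOR_KV_GB = 95.0  # ~95 GB free for KV on one H200, as stated in the text

# Per-user KV cache for a real 70B (80 layers, FP16), from the table above.
KV_GB_PER_USER = {"4K": 1.25, "32K": 10.0, "128K": 40.0, "1M": 320.0}


def concurrent_users(free_gb: float, kv_gb_per_user: float) -> int:
    """Whole users whose full-context KV cache fits in the free memory."""
    return math.floor(free_gb / kv_gb_per_user)


if __name__ == "__main__":
    for context, per_user in KV_GB_PER_USER.items():
        users = concurrent_users(FREE_FOR_KV_GB, per_user)
        print(f"{context:>4}: {per_user:g} GB/user -> {users} concurrent users")
```

This gives 76 users at 4K, 9 at 32K, 2 at 128K and 0 at 1M, where a single user's 320 GB cache exceeds the whole GPU.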

Chart: KV-cache per user vs context (GB, log scale). FP16, GQA. Re-derived from first principles; MLA = DeepSeek compressed latent.

The widely-cited “MLA cuts KV cache 93.3%” figure did not survive verification; the direction (large reduction) is well established, the exact percentage is not [16].

6 · What changed since February 2026

Item | February doc | June 2026
Top frontier model | Gemini 3.1 Pro / Opus 4.6 | Claude Fable 5 (Index 65)
OpenAI flagship | GPT-5.2 | GPT-5.5 (2026-04-23)
Top open-weight | GLM-5, Kimi K2.5 | Kimi K2.6 / MiMo V2.5 Pro (54), DeepSeek V4 Pro (52)
“Within 5–10%” | asserted broadly | True on Index; false on HLE/TerminalBench
Cost decline | “200×/yr, 1000×” as fact | Method-dependent: 200× / 5–10× / ~3×
70B KV @ 128K / 1M | 16 / 128 GB | 40 / 320 GB

7 · Methodology & verification

The web data was gathered by a deep-research harness: parallel search fan-out → source fetch → claim extraction → 3-vote adversarial verification (a claim needs 2 of 3 “refute” votes to be killed) → synthesis. 6 angles · 27 sources · 116 claims extracted · 25 verified · 22 confirmed, 3 killed · 110 agents. KV-cache arithmetic was additionally re-derived by hand.
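
The kill rule in the verification step can be sketched as a small tally function. This illustrates the stated rule only (a claim is killed when at least 2 of its 3 votes are "refute"), not the harness's actual code. The vote labels `"confirm"` and `"refute"` and the function names are hypothetical, and the tally is written as confirm-refute to match the vote strings used in the findings below.

```python
from collections import Counter


def tally(votes: list[str]) -> str:
    """Format votes as 'confirm-refute', e.g. '2-1'."""
    counts = Counter(votes)
    return f"{counts['confirm']}-{counts['refute']}"


def verdict(votes: list[str], kill_threshold: int = 2) -> str:
    """A claim is killed once the number of 'refute' votes reaches the threshold."""
    refutes = Counter(votes)["refute"]
    return "killed" if refutes >= kill_threshold else "confirmed"


if __name__ == "__main__":
    examples = [
        ["confirm", "confirm", "confirm"],
        ["confirm", "confirm", "refute"],
        ["confirm", "refute", "refute"],
        ["refute", "refute", "refute"],
    ]
    for votes in examples:
        print(f"{tally(votes)} -> {verdict(votes)}")
```

Under this rule a 2-1 split still counts as confirmed, while 1-2 and 0-3 are killed, consistent with the vote tallies listed in the two tables that follow.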

Confirmed findings

Vote | Finding
3-0 ✓ | Fable 5 + Mythos 5 released 2026-06-09; 1M context, 128k output, $10/$50; Mythos = Fable without safety classifiers. Anthropic model card + news [1][2].
3-0 ✓ | GPT-5.5 exists (announced 2026-04-23); tops AA Index among non-Anthropic at ~60. OpenAI + Wikipedia [3][4]. Pricing claim refuted (see below).
3-0 ✓ | Index top-3 span ~5 pts: Fable 5 = 65, Opus 4.8 = 61, GPT-5.5 = 60. Artificial Analysis [5].
3-0 ✓ | Grok 4.3 = 53, trails GPT-5.5 by 276 Elo on GDPval-AA agentic. AA article + model page [6][7].
3-0 ✓ | Open leaders: Kimi K2.6 (54), MiMo V2.5 Pro (54), DeepSeek V4 Pro (52), within 3–6 pts; all trillion-scale MoE. AA + Hugging Face [8][9].
3-0 ✓ | Gap only on hardest benchmarks: HLE 34–36% vs 44–45%; TerminalBench Hard 43–46% vs 54–61%. AA [8].
3-0 ✓ | Pareto: open dominates low/mid cost; closed defines the lowest-error end. Thompson et al. + Digital Applied [10][15].
3-0 ✓ | Cost of intelligence: Epoch 50×/yr median, 200× post-Jan-2024, GPQA-level 40×/yr; Grok 4 Fast $0.20/$0.50. Epoch + AA [12][13].
3-0 / 2-1 | Method-dependence: benchmark-cost 5–10×/yr, algorithmic-only ~3×/yr; “much of 100–1000× is competition.” Thompson et al. + LessWrong [10][11].
3-0 ✓ | KV-cache is the HBM bottleneck; GTA matches GQA at ~half the KV cache; GLA matches MLA with up to 2× throughput. Tri Dao et al. [14].

Refuted — killed in verification, NOT used

Claim | Vote
GPT-5.5 priced at $5/M in, $30/M out | 0-3
MLA reduces KV cache by exactly 93.3% vs DeepSeek-67B MHA | 1-2
~Half of benchmark progress is associated with rising inference prices | 1-2

Caveats

Open questions

Sources (27 fetched, by quality)

Full appendix with every claim, evidence quote, and vote: frontier-diminishing-returns-2026-research-appendix.md.