Open vs frontier large language models — a mid-2026 data refresh · June 2026 · sourced & verified
A re-examination of the thesis that the top 4–5 models are within 5–10% of each other, and that million-token context windows collapse under KV-cache economics. Updated with June-2026 data and corrected arithmetic; every figure is sourced.
Since the February snapshot, OpenAI shipped GPT-5.5 (2026-04-23) and Anthropic shipped Claude Fable 5 + Mythos 5 (2026-06-09). The leading open-weight models are all large sparse Mixture-of-Experts designs — trillion-plus total parameters, only 32–49B active per token.
| Model | Vendor | Released | Context | Price $/M (in/out) | AA Index |
|---|---|---|---|---|---|
| Claude Fable 5 | Anthropic | 2026-06-09 | 1M | $10 / $50 | 65 |
| Claude Opus 4.8 | Anthropic | pre-June | 1M | $5 / $25 | 61 |
| GPT-5.5 (xhigh) | OpenAI | 2026-04-23 | — | unverified | 60 |
| Claude Sonnet 4.6 | Anthropic | pre-June | 1M | $3 / $15 | ~53 |
| Grok 4.3 (high) | xAI | 2026-04-30 | — | lowered | 53 |
| Gemini 3.1 Pro (Preview) | Google | pre-June | — | — | HLE 45% |
| Model | Vendor | Released | Architecture (total / active) | Context | License | AA Index |
|---|---|---|---|---|---|---|
| Kimi K2.6 | Moonshot | 2026-04-20 | 1T / 32B MoE (384 exp, MLA) | 256K | MIT-class | 54 |
| Xiaomi MiMo V2.5 Pro | Xiaomi | Apr 2026 | ~1.02T / 42B MoE | 1M | open weights | 54 |
| DeepSeek V4 Pro | DeepSeek | 2026-04-22 | 1.6T / 49B MoE | 1M | MIT | 52 |
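To make the "trillion-plus total, only 32–49B active" description concrete, the sketch below computes the share of weights active per token from the table's figures. It assumes "~1.02T" means 1020B; the function and variable names are illustrative only.

```python
# Total and active parameter counts in billions, copied from the table above.
# Assumption: "~1.02T" is taken as 1020B.
MODELS = {
    "Kimi K2.6": (1000, 32),
    "Xiaomi MiMo V2.5 Pro": (1020, 42),
    "DeepSeek V4 Pro": (1600, 49),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

Under these figures, each of the three models activates under 5% of its weights per token.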
On the Artificial Analysis Intelligence Index, Claude Fable 5 (65) leads Claude Opus 4.8 (61) and GPT-5.5 (60) by single digits; the best open-weight models (52–54) trail GPT-5.5 by 6–8 points and the leader by 11–13. Paying for #1 buys a small edge over a crowded field.
Convergence is benchmark-dependent. On the two hardest evaluations (HLE and TerminalBench) the best open-weight models still trail the closed frontier by 8–18 points, and on agentic work Grok 4.3 trails GPT-5.5 by 276 Elo on GDPval-AA.
The cost to reach a fixed capability has fallen sharply, but the measured rate depends entirely on the method. The famous “200×/year” is a token-price figure; a stricter benchmark-cost measure that corrects for reasoning models emitting more tokens gives 5–10×/year, and ~3×/year once hardware and competition effects are removed. All three are correct because each measures a different quantity.
| Method | What it measures | Rate | Source |
|---|---|---|---|
| Epoch AI | Token price at fixed benchmark score | 50×/yr median; 200×/yr post-Jan-2024; up to 900× | [12] |
| Epoch AI (specific) | GPT-4-level PhD science (GPQA Diamond) | ~40×/yr | [12] |
| Thompson et al. (MIT) | Benchmark cost, corrected for reasoning-token inflation | 5–10×/yr | [10][11] |
| Thompson et al. (residual) | After removing hardware (~1.3×/yr) & competition | ~3×/yr (pure algorithmic) | [10][11] |
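The rates in the table are easier to compare as halving times. The sketch below assumes each rate compounds smoothly over the year, which is a simplification and not something the cited sources claim; the function names are illustrative.

```python
import math


def cumulative_decline(rate_per_year: float, years: float) -> float:
    """Total cost-reduction factor after `years` at a constant annual rate."""
    return rate_per_year ** years


def months_to_halve(rate_per_year: float) -> float:
    """Months for cost to halve, assuming smooth compounding within the year."""
    return 12 * math.log(2) / math.log(rate_per_year)


# Annual rates taken from the table above.
RATES = {
    "Epoch AI, post-Jan-2024 token price": 200,
    "Epoch AI, median token price": 50,
    "Thompson et al., upper bound": 10,
    "Thompson et al., lower bound": 5,
    "Thompson et al., algorithmic residual": 3,
}

for label, rate in RATES.items():
    print(
        f"{label}: {rate}x/yr, halves every {months_to_halve(rate):.1f} months, "
        f"{cumulative_decline(rate, 2):,.0f}x over two years"
    )
```

On this reading, 200×/yr means cost halves roughly every month and a half, while ~3×/yr means it halves roughly every seven and a half months.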
Concrete data point: by Grok 4 Fast (Sept 2025), an Index>60-tier model cost just $0.20 / $0.50 per million tokens (in/out) [13]. As the MIT authors put it, “much of the more dramatic 100×–1000× price decreases… are likely due to non-technical factors like increased market competition” [11].
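The KV-cache size in bytes is 2 × n_layers × n_kv_heads × head_dim × bytes × tokens, where the leading 2 counts keys and values and "bytes" is the size of one stored element (2 for FP16). The February doc used the 32-layer (8B) result but labeled it “70B.” Re-derived for a real 70B (80 layers, FP16):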
| Context | Doc claimed “70B” | Real 70B (80 layers) | Real 8B (32 layers) |
|---|---|---|---|
| 4K | 0.5 GB | 1.25 GB | 0.5 GB |
| 32K | 4 GB | 10 GB | 4 GB |
| 128K | 16 GB | 40 GB | 16 GB |
| 1M | 128 GB | 320 GB | 128 GB |
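The table can be reproduced from the formula with a few lines of Python. This is a minimal sketch under assumptions the text does not state: 8 KV heads and a head dimension of 128 (a Llama-3-style GQA layout), "K" meaning 1024 tokens, and "GB" meaning 2^30 bytes. These values were chosen because they match the table's numbers.

```python
GIB = 1024 ** 3  # assumption: the table's "GB" are binary gigabytes


def kv_cache_gb(n_layers: int, tokens: int, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV-cache size for one sequence; the leading 2 counts keys and values.

    n_kv_heads=8 and head_dim=128 are assumed, not given in the text.
    bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens / GIB


# Assumption: "K" = 1024 tokens and "1M" = 1024 * 1024 tokens.
CONTEXTS = {"4K": 4 * 1024, "32K": 32 * 1024, "128K": 128 * 1024, "1M": 1024 * 1024}

for label, tokens in CONTEXTS.items():
    gb_70b = kv_cache_gb(80, tokens)  # 80 layers: the real 70B
    gb_8b = kv_cache_gb(32, tokens)   # 32 layers: what the February doc computed
    print(f"{label:>4}: 70B = {gb_70b:g} GB, 8B = {gb_8b:g} GB")
```

Because only the layer count differs between the two columns, the 70B figures are exactly 80/32 = 2.5 times the 8B figures at every context length.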
One H200 (141 GB) serving a Q4 70B (~40 GB weights, ~95 GB free for KV) drops from the doc's claimed 270 concurrent users at 4K to a real ~76, to ~2 at 128K, and to under one at 1M. You don't escape this with bigger GPUs; you redesign attention: GQA (≈4× vs MHA), MLA (≈5× smaller), and GTA/GLA, which match GQA/MLA quality at roughly half the KV cache [14].
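The concurrency figures follow from dividing the free memory by the per-user KV cache. The sketch below uses the ~95 GB budget and the 70B column of the table; it assumes every user holds a full-length context and ignores activation memory and allocator overhead.

```python
import math

FREE_FOR_KV_GB = 95  # from the text: H200 memory left after ~40 GB of Q4 weights

# Per-user KV cache for the real 70B, copied from the table above.
KV_70B_GB = {"4K": 1.25, "32K": 10, "128K": 40, "1M": 320}


def full_sequences(free_gb: float, kv_gb_per_user: float) -> int:
    """Whole sequences that fit in the KV budget (simplified: no overhead)."""
    return math.floor(free_gb / kv_gb_per_user)


for label, kv_gb in KV_70B_GB.items():
    fit = full_sequences(FREE_FOR_KV_GB, kv_gb)
    print(f"{label:>4}: {FREE_FOR_KV_GB / kv_gb:.2f} -> {fit} full-length users")
```

At 1M tokens the ratio is about 0.3, so a single full-length sequence does not fit on one card under these assumptions.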
| Item | February doc | June 2026 |
|---|---|---|
| Top frontier model | Gemini 3.1 Pro / Opus 4.6 | Claude Fable 5 (Index 65) |
| OpenAI flagship | GPT-5.2 | GPT-5.5 (2026-04-23) |
| Top open-weight | GLM-5, Kimi K2.5 | Kimi K2.6 / MiMo V2.5 Pro (54), DeepSeek V4 Pro (52) |
| “Within 5–10%” | asserted broadly | True on Index; false on HLE/TerminalBench |
| Cost decline | “200×/yr, 1000×” as fact | Method-dependent: 200× / 5–10× / ~3× |
| 70B KV @ 128K / 1M | 16 / 128 GB | 40 / 320 GB |
The web data was gathered by a deep-research harness: parallel search fan-out → source fetch → claim extraction → 3-vote adversarial verification (a claim needs 2 of 3 “refute” votes to be killed) → synthesis. Run totals: 6 angles, 27 sources, 116 claims extracted, 25 verified (22 confirmed, 3 killed), 110 agents. The KV-cache arithmetic was additionally re-derived by hand.
| Killed claim | Vote (confirm-refute) |
|---|---|
| GPT-5.5 priced at $5/M in, $30/M out | 0-3 |
| MLA reduces KV cache by exactly 93.3% vs DeepSeek-67B MHA | 1-2 |
| ~Half of benchmark progress is associated with rising inference prices | 1-2 |
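The kill rule described above (2 of 3 "refute" votes) can be sketched as follows. This is not the harness's code: the function names are hypothetical, and reading a tally such as "1-2" as confirm-refute is an assumption based on the stated rule.

```python
def parse_tally(tally: str) -> tuple[int, int]:
    """Split a tally like "1-2" into (confirm, refute) counts.

    Assumption: the first number is confirm votes, the second is refute votes.
    """
    confirm, refute = (int(n) for n in tally.split("-"))
    return confirm, refute


def is_killed(tally: str, refute_threshold: int = 2) -> bool:
    """A claim is killed once refute votes reach the threshold (2 of 3)."""
    _, refute = parse_tally(tally)
    return refute >= refute_threshold


# The first three tallies come from the table above; "2-1" is a hypothetical
# surviving claim added for contrast.
for tally in ["0-3", "1-2", "1-2", "2-1"]:
    verdict = "killed" if is_killed(tally) else "confirmed"
    print(f"{tally}: {verdict}")
```

All three tallies in the table meet the threshold, which matches the "3 killed" count in the run totals.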
| # | Source | Quality |
|---|---|---|
| 1 | platform.claude.com — Fable 5 & Mythos 5 card | primary |
| 2 | anthropic.com/news — Fable 5 / Mythos 5 | — |
| 3 | openai.com — Introducing GPT-5.5 | secondary |
| 4 | en.wikipedia.org — GPT-5.5 | — |
| 5 | Artificial Analysis — Intelligence Index | primary |
| 6 | Artificial Analysis — xAI launches Grok 4.3 | primary |
| 7 | Artificial Analysis — Grok 4.3 model page | primary |
| 8 | Artificial Analysis — Recent open-weights launches | primary |
| 9 | Hugging Face — DeepSeek-V4-Pro | — |
| 10 | arXiv 2511.23455v2 — Thompson et al. (MIT) | primary |
| 11 | LessWrong — MIT FutureTech, inference progress | primary |
| 12 | Epoch AI — LLM inference price trends | primary |
| 13 | Artificial Analysis (X) — Grok 4 Fast pricing | primary |
| 14 | arXiv 2505.21487 — Grouped-Tied/Latent Attention (Tri Dao) | primary |
| 15 | Digital Applied — Efficient Frontier Q2 2026 | secondary |
| 16 | arXiv 2405.04434 — DeepSeek-V2 (MLA) | primary |
| 17–27 | benchlm.ai, benchmarkingagents.com, tokencost.app, fiegenbaum.solutions, medium.com, spheron.network, sebastianraschka.com (secondary/blog corroboration) | blog |
Full appendix with every claim, evidence quote, and vote: frontier-diminishing-returns-2026-research-appendix.md.