Model Training Economics: From Chinchilla to Vertical AI

Pre-training costs, dataset requirements, over-training ratios, and the vertical model playbook for enterprise AI
February 2026
Cross-referenced against 5 research papers
Web-Readable Report
Part IV — The Model Landscape
Chapter 14: Model Training Economics

From Chinchilla scaling laws to half-billion-dollar training runs, the economics of building frontier AI models have shifted from an academic curiosity to a geopolitical force. Understanding where the money goes — and where it is wasted — is now a prerequisite for any serious technology strategy.

In 2022, DeepMind’s Chinchilla paper established the principle that for a fixed compute budget, there exists an optimal ratio of model parameters to training tokens — roughly 20 tokens per parameter. This insight reshaped the industry almost overnight: it meant that many existing models were undertrained relative to their size, leaving performance on the table. But within two years, the industry had moved far beyond Chinchilla’s prescriptions. The new logic was brutal in its simplicity: training compute is a one-time capital expenditure, while inference compute is an ongoing operational cost. If you over-train a smaller model on vastly more data — 100x or even 1,000x the Chinchilla ratio — you pay more upfront but save enormously on every subsequent query. Qwen3’s 0.6B model, trained on 36 trillion tokens at 60,000 tokens per parameter, embodies this philosophy taken to its extreme.

The result is a dramatic bifurcation in the economics of intelligence. Training a frontier model — GPT-4 at an estimated $100M+, GPT-5 at $500M to $2.5B — remains the province of a handful of organizations with access to tens of thousands of GPUs and billions in capital. But once trained, serving those models becomes cheaper with each passing quarter as hardware improves and optimization techniques advance. This asymmetry creates winner-take-most dynamics in training (where only five to seven organizations can afford to compete) alongside commodity dynamics in inference (where prices collapse toward marginal cost). The data below maps this economic landscape in detail.

For enterprise strategists, the critical implication is that the training investment moat is real but narrowing from an unexpected direction. Vertical fine-tuning — taking an open-source base model and adapting it with domain-specific data for $0.5M to $2.5M — now delivers domain performance that rivals or exceeds frontier models costing 100x more. The economics of model training are not merely a concern for AI labs; they define the strategic options available to every organization building an intelligence capability.

60,000x
Max over-training ratio
(Qwen3 0.6B)
15T+
Tokens in Llama 3.1
(all model sizes)
$5.6M
DeepSeek V3 compute
(671B MoE)
~$60M
Llama 3.1 405B
compute-only
$0.5–2.5M
Vertical model family
(fine-tune + distill)
Section 1
From Chinchilla to Over-Training (2022 → 2026)

πŸ“ Chinchilla Baseline (2022)

DeepMind's compute-optimal ratio: ~20 tokens per parameter. Minimizes loss for a fixed compute budget, but produces models expensive to serve at inference.

Model SizeChinchilla Tokens
7B~140B
32B~640B
70B~1.4T
130B~2.6T
405B~8.1T

🚀 2026 Reality: Inference Killed Chinchilla

Practical ratios: 100–60,000+ tokens/param. Inference cost dominates TCO. Smaller + massively over-trained = cheaper to serve, comparable quality.

| Model (tokens/param) | vs Chinchilla |
|---|---|
| Chinchilla (20x) | Baseline |
| Llama 3.1 405B (38x) | 2x over |
| Llama 3.1 70B (214x) | 11x over |
| Phi-4 14B (714x) | 36x over |
| Llama 3.1 8B (1,875x) | 94x over |
| Qwen 2.5 7B (2,571x) | 129x over |
| Qwen3 0.6B (60,000x) | 3,000x over |

📊 Real-World Over-Training Ratios

| Model | Params | Training Tokens | Tokens/Param | vs Chinchilla | Source |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.6B | 36T | 60,000x | 3,000x over | Alibaba, Apr 2025 |
| Qwen 2.5 7B | 7B | 18T | 2,571x | 129x over | Alibaba, Sep 2024 |
| Llama 3.1 8B | 8B | 15T | 1,875x | 94x over | Meta, Jul 2024 |
| Phi-3 3.8B | 3.8B | 3.3T | 870x | 44x over | Microsoft, Apr 2024 |
| Phi-4 14B | 14B | ~10T | ~714x | 36x over | Microsoft, Dec 2024 |
| DeepSeek V3 | 671B MoE (37B active) | 14.8T | 400x (active) | 20x over | DeepSeek, Dec 2024 |
| Llama 3.1 70B | 70B | 15T | 214x | 11x over | Meta, Jul 2024 |
| Llama 3.1 405B | 405B | 15.6T | 38x | 2x over | Meta, Jul 2024 |

Key Insight: Smaller models get massively more over-training. A 7B model at 2,000+ tokens/param can match a 70B at 200 tokens/param on many benchmarks — while being 10x cheaper to serve. Over-training ratios grow ~3x/year. By 2027, 7B on 30T+ will be standard.
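
The ratios above can be reproduced directly from model size and token count; a minimal sketch using figures from the table:

```python
# Tokens-per-parameter and over-training multiple vs the Chinchilla-optimal
# ~20 tokens/param, using the figures from the table above.

CHINCHILLA_RATIO = 20  # tokens per parameter (Hoffmann et al., 2022)

def overtraining(params_b: float, tokens_t: float) -> tuple[float, float]:
    """Return (tokens_per_param, multiple_of_chinchilla) for params in
    billions and training tokens in trillions."""
    tokens_per_param = (tokens_t * 1e12) / (params_b * 1e9)
    return tokens_per_param, tokens_per_param / CHINCHILLA_RATIO

for name, params_b, tokens_t in [
    ("Qwen3 0.6B", 0.6, 36),
    ("Llama 3.1 8B", 8, 15),
    ("Llama 3.1 405B", 405, 15.6),
]:
    tpp, mult = overtraining(params_b, tokens_t)
    print(f"{name}: {tpp:,.0f} tok/param = {mult:,.0f}x Chinchilla")
```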

Section 2
Dataset Requirements

💾 Recommended Dataset Sizes (Feb 2026)

| Size | Chinchilla | 2026 Practical | Tok/Param | Examples |
|---|---|---|---|---|
| 7B | ~140B | 2T – 15T | 280–2,000+ | Llama 3.1 8B (15T), Qwen 2.5 7B (18T) |
| 32B | ~640B | 4T – 10T | 125–310 | Qwen 2.5 32B (~5–8T) |
| 80B | ~1.6T | 10T – 15T | 125–190 | Llama 3.1 70B (15T) |
| 130B | ~2.6T | 6T – 12T | 45–90 | Yi-1.5 range; trending 8T+ |

📦 Training Corpus Composition

| Component | Share |
|---|---|
| High-quality web (filtered, deduped) | 40–50% |
| Code (GitHub, SO, docs) | 15–25% |
| Synthetic data (from stronger models) | 10–20% |
| Books & academic papers | 10–15% |
| Domain-specific corpora | 5–15% |

🧪 Data Economics

| Method | Cost | Notes |
|---|---|---|
| Synthetic data (via API) | $0.10–1.00/1K examples | 1M examples = $100–$1,000 |
| Human annotation | $1–100/example | High quality but expensive |
| Min viable fine-tune dataset | 5,000+ examples | <2K risks degradation |

⚡ Training Compute (FLOPs)

Dense: FLOPs ≈ 6 × N × D   |   MoE: FLOPs ≈ 6 × N_active × D

| Model | Params | Tokens | FLOPs |
|---|---|---|---|
| 7B dense | 7B | 8T | 3.4 × 10²³ |
| 32B dense | 32B | 7T | 1.3 × 10²⁴ |
| 80B dense | 80B | 12T | 5.8 × 10²⁴ |
| 130B dense | 130B | 9T | 7.0 × 10²⁴ |
| DeepSeek V3 MoE | 37B active | 14.8T | 3.3 × 10²⁴ |

MoE advantage: DeepSeek V3 has 671B total params but only 37B active per token. Training FLOPs match a 55B dense model → trained for only $5.6M compute.
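
The 6 × N × D approximation is easy to check in code; a quick sketch reproducing the table entries:

```python
# Training FLOPs ~= 6 * N * D, with N the (active) parameter count and
# D the number of training tokens.

def train_flops(params_b: float, tokens_t: float) -> float:
    """Approximate training FLOPs for params in billions, tokens in trillions."""
    return 6 * (params_b * 1e9) * (tokens_t * 1e12)

print(f"7B dense on 8T tokens:         {train_flops(7, 8):.2e}")      # ~3.4e23 per the table
print(f"DeepSeek V3 (37B act., 14.8T): {train_flops(37, 14.8):.2e}")  # ~3.3e24 per the table
```

For MoE models, only the active parameter count enters the formula, which is why DeepSeek V3's 671B total parameters cost so little to train.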

The Over-Training Revolution

The data above reveals a seismic shift in how the industry thinks about model efficiency. Chinchilla’s original 20-token-per-parameter prescription minimized loss for a fixed training budget — but it ignored the far larger cost of inference. In production, a model may be queried billions of times; the marginal cost of each query is determined by model size, not training investment. By over-training a smaller model on vastly more data, labs can produce models that approach the quality of much larger counterparts while being 5x to 20x cheaper to serve.

The implications are visible in the numbers: Qwen 2.5 7B, trained at 2,571 tokens per parameter, achieves benchmark scores that would have required a 70B model just two years ago. Llama 3.1 8B, over-trained at 1,875x, now handles production workloads that previously demanded GPT-4 class capabilities. This dynamic is accelerating — over-training ratios are growing roughly 3x per year. By 2027, a 7B model trained on 30T+ tokens will be the standard entry point, and the economics of inference will have permanently decoupled from the economics of training.

The Hardware and Cost Landscape

Understanding training economics requires understanding the hardware that makes it possible. GPU specifications, cloud rental rates, and cluster configurations define the cost floor for any training run. The tables below present the current state of the art — from NVIDIA’s H100 and B200 to the emerging GB200 superchip — and map the full cost breakdown for pre-training at various model scales.

Section 3
Hardware & Training Time (2026)

🖥️ GPU Specifications & Pricing

| GPU | Year | VRAM | Perf vs A100 | Purchase | Cloud $/hr |
|---|---|---|---|---|---|
| A100 80GB | 2020 | 80 GB | 1.0x | ~$10K | $1.50–2.00 |
| H100 SXM | 2023 | 80 GB | ~2.5x | $25–40K | $2.00–3.00 |
| H200 | 2024 | 141 GB | ~3.0x | $30–35K | $2.29–5.00 |
| B200 | 2025 | 192 GB | ~6.0x | $35–40K | $4.00–6.00 |
| GB200 | 2025 | 384 GB | ~12x | $60–70K | — |

⏱️ Training Time (512× H100-eq. cluster)

Assumes 35–50% MFU; times roughly halve on Blackwell B200s.

| Size | Time | Cluster | Notes |
|---|---|---|---|
| 7B | 3–10 days | 128–512 H100s | Very accessible |
| 32B | 10–25 days | 512–1K H100s | Sweet spot for verticals |
| 80B | 3–6 weeks | 1K–2K H100s | 70B-class territory |
| 130B | 4–8 weeks | 1.5K–3K H100s | Serious infrastructure |
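
These wall-clock figures follow from a simple identity: time = total FLOPs / (GPUs × peak FLOP/s × MFU). A sketch under stated assumptions — peak BF16 throughput per H100 is taken as ~1e15 FLOP/s (an approximation; actual figures vary with precision and sparsity), and estimates are very sensitive to token count and MFU:

```python
# Rough wall-clock estimate for a pre-training run.
# time = total FLOPs / (GPUs * peak FLOP/s * MFU)

SECONDS_PER_DAY = 86_400

def train_days(params_b: float, tokens_t: float, gpus: int,
               peak_flops: float = 1e15, mfu: float = 0.40) -> float:
    """Days to train a dense model: params in billions, tokens in trillions.
    peak_flops per GPU (~1e15 for H100 BF16) and MFU are assumptions."""
    total_flops = 6 * (params_b * 1e9) * (tokens_t * 1e12)
    return total_flops / (gpus * peak_flops * mfu) / SECONDS_PER_DAY

print(f"7B on 2T tokens, 512 H100s:   {train_days(7, 2, 512):.1f} days")
print(f"32B on 7T tokens, 1024 H100s: {train_days(32, 7, 1024):.1f} days")
```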

⚡ Inference Throughput (vLLM)

| Model | GPU | Throughput | Latency |
|---|---|---|---|
| 7B | L40S | ~10,000 tok/s | <50ms |
| 8B (Llama 3.1) | H100 | ~16,200 tok/s | <30ms |
| 14B | A100 80GB | ~3,004 tok/s | ~100ms |
| 32B | A100 80GB | ~577 tok/s | ~250ms |
Section 4
Full Cost Breakdown — Pre-Training from Scratch

💰 Compute-Only Costs (Cloud Rental)

| Size | Compute Only | Total (All-In) |
|---|---|---|
| 7B | $50K – $500K | $0.5M – $3M |
| 32B | $800K – $3M | $3M – $10M |
| 80B | $2M – $6M | $10M – $25M |
| 130B | $4M – $12M | $15M – $40M |

Reference Points:

| Model | Compute Cost | Basis |
|---|---|---|
| Llama 3.1 405B | ~$60M | 30.84M H100-hrs × $2/hr |
| DeepSeek V3 (671B MoE) | ~$5.6M | MoE efficiency: 2.79M H100-hrs |
| Llama 2 70B (2023) | ~$2–8M | 1.72M A100-hrs (Karpathy) |
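
The reference points reduce to one multiplication, GPU-hours × hourly rate; a quick check against the table:

```python
# Compute-only cost = GPU-hours * hourly rental rate.

def compute_cost_musd(gpu_hours_millions: float, rate_per_hr: float) -> float:
    """Compute-only cost in $M, given GPU-hours in millions and a $/GPU-hr rate."""
    return gpu_hours_millions * rate_per_hr

print(f"Llama 3.1 405B: ~${compute_cost_musd(30.84, 2.00):.1f}M")  # ~$61.7M, i.e. the ~$60M figure
print(f"DeepSeek V3:    ~${compute_cost_musd(2.79, 2.00):.2f}M")   # ~$5.58M, i.e. the ~$5.6M figure
```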

📊 Beyond Compute: Where Real Costs Live

| Category | Multiplier | Notes |
|---|---|---|
| Data curation & annotation | 3–28x | #1 expense. $1–100/annotation. RLHF drives upper range. |
| Post-training (SFT+RLHF) | 0.5–5x | $5–50M frontier; $100K–1M vertical |
| Talent & team | 0.2–1x | Min viable: 3–5 ML engineers |
| Failed runs & experiments | 0.2–0.5x | HP search, arch experiments |
| Storage & infra | 0.1–0.2x | Checkpoints, networking |

Data is the new bottleneck. Daniel Kang (Aug 2025): annotation costs = 3.1x compute for frontier models. For RLHF-heavy alignment, ratio reaches 28x. Synthetic data reduces this but still requires human curation.

🏷️ The 7 Model Categories — Cost per Million Tokens (2025 Pricing)

| Category | Cost Range $/M | Key Models | MMLU Range | Intelligence Yield |
|---|---|---|---|---|
| Reasoning | $2–15 | o1 ($15), o3 ($2), o4-mini ($1.10), Claude Opus ($5) | 94–96% | Essential for RM>5 tasks |
| Commercial Large | $1.25–5.00 | GPT-4o ($2.50), Claude Sonnet ($3), Gemini 2.5 Pro ($1.25) | 88–92% | 26–38 quality/$ |
| Commercial Small | $0.075–1.00 | GPT-4o-mini ($0.15), Claude Haiku ($0.80), Flash-Lite ($0.075) | 82–87% | 520–620 quality/$ |
| Open-Source | $0.03–0.28 | DeepSeek V3 ($0.28), Llama 3.1 405B, Qwen 2.5 72B | 82–88.5% | 293+ quality/$ |
| Fine-Tuned | $0.03–0.32 | QLoRA 7B, fine-tuned Phi-4, domain-specific variants | Domain-high | 1,417–1,600 quality/$ |
| Instruct | $0.03–1.50 | Llama 3.2 3B, Phi-4 14B, Qwen 2.5 7B-Instruct | 63–84.8% | Varies by domain |
| Specialist | $0.001–0.10 | DistilBERT 66M, xLAM-1B, GLiNER 90M | Task-specific | 9,700 quality/$ |

Why Only a Handful of Organizations Can Afford Frontier Training

The cost tables above illustrate a stark reality: training a frontier model from scratch is an endeavor reserved for organizations with extraordinary resources. At $100M+ for a GPT-4 class model and $500M–$2.5B for the next generation, the capital requirements alone exclude all but a handful of contenders: OpenAI (backed by Microsoft’s infrastructure), Google DeepMind, Anthropic, Meta, and a small cohort of Chinese labs including DeepSeek, Alibaba, and ByteDance. Even among these, the economics are punishing — data curation and annotation costs run 3x to 28x the compute bill, failed experiments consume 20–50% of the budget, and the talent required commands compensation measured in millions per researcher per year.

But this picture has a counterpoint that reshapes the strategic landscape entirely. The seven-category pricing model — from $15/million tokens for reasoning models down to $0.001/million tokens for specialist models — reveals that the unit economics of intelligence are collapsing even as the cost of producing it rises. Fine-tuned vertical models, leveraging open-source bases and domain-specific data, can achieve 50x to 150x cost reductions versus frontier API pricing on targeted tasks. The question for most enterprises is not whether to train a frontier model, but how to ride the cost curve downward through fine-tuning, distillation, and intelligent routing.

Section 5
The Vertical Model Playbook
Open-Source Base → Curate → Over-Train → Fine-Tune → Distill → Iterate
$0.5–2.5M
Total cost first version
vs $10–25M from scratch (70B)

🔄 The 5-Step Build Process

1. Pick Base + CPT: 7B–70B open model + continued pre-training
2. Heavy Curation: filter, ontologies, knowledge graphs, synthetic data
3. SFT + Alignment: DPO/ORPO to reduce hallucinations
4. Distillation: teacher→student, 90–95% quality retained
5. Iterate Quarterly: new data → retrain → distill → deploy

Step 1: Pick Base + CPT ($100K–$2M)

| Base Model | Strength | Best For |
|---|---|---|
| DeepSeek R1 | Math precision, reasoning | Quant finance, risk |
| Qwen3 32B–72B | Structured analysis, multilingual | Compliance, analysis |
| Llama 4 Maverick | Huge ecosystem, tooling | General enterprise |
| Qwen3-235B MoE | High capability, efficient | Teacher for distillation |

Tools: Axolotl, Unsloth (2x faster, 70% less VRAM), NVIDIA NeMo

Step 2: Heavy Curation ($100K–$500K)

Data quality is the #1 determinant of vertical model performance. Invest 3–10x more in curation than compute.

  • Filter aggressively for accuracy, recency, diversity
  • Ontologies / knowledge graphs (e.g., FinCap-style)
  • Synthetic data via NeMo Data Designer or frontier APIs
  • Regulatory coverage across market cycles & geographies
  • License compliance — only redistributable data

Step 3–4: SFT + Distill ($150K–$800K)

SFT + Alignment: Instruction-tune on domain tasks. DPO/ORPO reduces hallucinations on numbers & regulations. Mixing in a residual slice of general instruction data preserves broad capability.

Distillation: Teacher→student (large fine-tuned → 7B–13B). Retains 90–95% of domain performance. Runs on a single GPU, 5–20x cheaper inference.

Examples: DeepSeek R1 → 6 distilled variants; Qwen3 uses strong-to-weak distillation
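
The teacher→student step can be sketched with the classic soft-target objective (an assumption here: Hinton-style knowledge distillation with a temperature-softened KL loss; the report does not specify each lab's exact recipe):

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # confident teacher over 3 classes
close   = [3.5, 1.2, 0.3]   # student that mimics the teacher
uniform = [0.1, 0.1, 0.1]   # uninformed student

# A student matching the teacher's distribution incurs lower loss.
print(distill_loss(teacher, close) < distill_loss(teacher, uniform))  # True
```

In practice the student is trained on a weighted mix of this soft-target loss and the ordinary next-token loss on domain data; the 90–95% retention figure refers to the resulting benchmark performance, not the loss itself.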

📅 Total Cost & Timeline (First Strong Version)

| Phase | Timeline | Cost | Deliverable |
|---|---|---|---|
| Phase 1: CPT + curation | 3–6 months | $300K–$1.5M | Domain-adapted 32B–70B model |
| Phase 2: SFT + alignment | 1–2 months | $100K–$500K | Instruction-tuned, aligned model |
| Phase 3: Distill to production | 1–2 months | $50K–$300K | 7B–13B production variants |
| Total first version | 4–8 months | $0.5M–$2.5M | Full model family |
| Ongoing (quarterly) | Continuous | $50K–$300K/qtr | Refreshed, improved models |

4–10x cheaper than from-scratch pre-training of a 70B model ($10–25M) with comparable or better domain quality

How Vertical Models Compare to Alternatives

The five-step build process above outlines the mechanics, but the strategic decision requires side-by-side comparison. How does the open-source fine-tuning path stack up against training from scratch, using frontier APIs, or staying API-only? And at what model size does the capability-to-cost trade-off become favorable? The data below answers both questions — and the finding is stark: a 14B parameter model retains 94% of frontier capability at a fraction of the inference cost, making it the sweet spot for most enterprise deployments.

βš–οΈ Approach Comparison

ApproachCostTimeDomain QualityInference
From-scratch 7B$0.5–3M3–6 moModerateLow
From-scratch 70B$10–25M6–12 moHighHigh
Open-source + FT + Distill$0.5–2.5M3–6 moHighLow
API-only (GPT-4o/Claude)$0ImmediateGenericVery High

📈 Capability Retention by Model Size

% of frontier (90.4 MMLU) capability retained:

| Size | Retained | Notes |
|---|---|---|
| 1B | 55% | Classification, routing only |
| 3B | 70% | Basic reasoning, Q&A |
| 3.8B (Phi-3) | 76% | GPT-3.5 equivalent |
| 7B (Qwen) | 83% | Strong general-purpose |
| 14B (Phi-4) | 94% | Near GPT-4o-mini |
| 24B (Mistral) | 89% | Multimodal, 128K ctx |

Sweet spot: 14B. Retains 94% of frontier capability at a fraction of cost. Covers most enterprise tasks. Phi-4 14B (84.8% MMLU) exceeds GPT-4o-mini (82%).

🎯 Fine-Tuned vs Frontier (Domain Tasks)

| Task | Frontier | Fine-Tuned 7B | Delta |
|---|---|---|---|
| Invoice classification | 95% | 98% | +3% |
| Ticket routing | 90% | 96% | +6% |
| Resume screening | 90% | 96% | +6% |
| Contract review | 91% | 75% | -16% |
| Code debugging | 91% | 58% | -33% |

85% of fine-tuned small models beat GPT-4 on their target task (LoRA Land, Predibase — 310 fine-tuned models). Frontier wins on complex, open-ended tasks.

💸 Fine-Tuning ROI Example (14B QLoRA)

| Line Item | Cost |
|---|---|
| Fine-tuning (14B, 5K examples) | $100 |
| Serving infra (L40S, 1 month) | $1,620 |
| Data prep (analyst, 40 hrs) | $4,000 |
| Total first-month investment | $5,720 |
| Monthly API alternative (Sonnet, 5M tok/day) | $45,000 |
| Monthly savings after month 1 | $39,280 |
| Payback period | <1 month |
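
The payback arithmetic above is worth making explicit: a one-time fine-tuning and data-prep outlay plus first-month serving, set against the monthly frontier-API bill it replaces.

```python
# ROI arithmetic for the 14B QLoRA example: one-time investment vs the
# monthly API bill it displaces.

finetune_cost = 100      # 14B QLoRA on 5K examples
serving_month = 1_620    # L40S, one month
data_prep     = 4_000    # analyst, 40 hrs
api_monthly   = 45_000   # Sonnet at 5M tokens/day

first_month_investment = finetune_cost + serving_month + data_prep
month1_savings = api_monthly - first_month_investment

print(f"First-month investment: ${first_month_investment:,}")  # $5,720
print(f"Savings after month 1:  ${month1_savings:,}")          # $39,280
```

From month 2 onward the only recurring cost is serving, so steady-state savings are larger still.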

The Vertical Model as Strategic Asset

The five-step process outlined above — pick a base, curate data, fine-tune, distill, iterate — represents a fundamentally different economic proposition from frontier training. At $0.5M to $2.5M for a first production version, a vertical model family costs less than 1% of a frontier training run. Yet on domain-specific tasks, these models routinely outperform general-purpose frontier models, precisely because they can allocate all of their capacity to the domain that matters rather than spreading it across the entire breadth of human knowledge.

The practical implication is that enterprises now face a build-or-buy decision that did not exist two years ago. A mid-size bank processing 5 million tokens per day can reduce its AI costs from $45K/month (using frontier APIs) to $3K/month (using a self-hosted fine-tuned 14B model) — a payback period measured in weeks, not years. The sections below map this playbook to three high-value verticals: finance, real estate, and corporate intelligence analysis.

Section 6
Vertical Applications: Finance, Real Estate & Company Analysis
Data sources, high-value tasks, recommended stacks, and ROI for each vertical
$0.8–2M
Finance
$0.5–1.5M
Real Estate
$1–2.5M
Corp. Intel

To illustrate the vertical model playbook in practice, we examine three high-value domains below. Each has distinct data sources, regulatory requirements, and task profiles — but all share the same economic logic: domain-specific fine-tuning on an open-source base delivers superior accuracy at a fraction of frontier API costs. The recommended stacks, cost estimates, and ROI calculations are based on production deployments and published benchmarks as of early 2026.

🏦 Finance & Banking

Data Sources:

SEC EDGAR (free) · Earnings transcripts · Bloomberg/Reuters · Basel III/IV / MiFID II · Synthetic Q&A · Internal (anonymized)
| Task | Complexity | Current $/M | Fine-Tuned $/M | ROI |
|---|---|---|---|---|
| Regulatory compliance | Hard | $3.00 | $0.06 | 50x |
| Risk model analysis | Hard | $5–15 | $0.10 | 50–150x |
| Earnings summarization | Medium | $2.50 | $0.04 | 63x |
| Transaction monitoring | Simple | $0.15 | $0.001 | 150x |
| Contract covenants | Hard | $3.00 | $0.06 | 50x |

Recommended Stack:

Base: DeepSeek R1 or Qwen3-32B

CPT corpus: ~2T tokens (SEC + earnings + regulatory + synthetic)

Distill to: 7B (transactions) + 14B (analysis)

Cost: $0.8M–$2M first version

ROI: Mid-size bank, 5M tokens/day → API cost $45K/mo vs self-hosted fine-tuned 14B $3K/mo. Savings: $42K/mo. Payback: <2 months.

🏢 Real Estate

Data Sources:

Public MLS · County assessor records · REIT filings (EDGAR) · CRE databases · Zoning docs · HMDA / Freddie Mac · Synthetic valuations

| Task | Complexity | Current Cost | Fine-Tuned Advantage |
|---|---|---|---|
| Property valuation (AVM) | Hard | Proprietary models | Qualitative reasoning over trends |
| Lease abstraction | Medium | $50–200/lease | $0.001/lease in seconds |
| Zoning compliance | Hard | $200–500/hr legal | Cross-ref zoning + specs auto |
| Market report gen | Medium | 2–4 hrs analyst | Draft in minutes, human review |
| Due diligence review | Hard | $300–500/hr | Screen + flag + draft summaries |

Recommended Stack:

Base: Llama 4 or Qwen3-32B

CPT corpus: ~1T tokens (listings + assessor + REIT + zoning + synthetic)

Distill to: 7B (extraction) + 14B (valuation, analysis)

Cost: $0.5M–$1.5M first version

πŸ” Company Analysis / Corporate Intelligence

Data Sources:

SEC EDGAR (full) Company websites USPTO / EPO patents Gartner / IBISWorld PE memos (internal) News corpus Synthetic SWOT
TaskComplexityCurrent CostFine-Tuned Advantage
Comparable company analysisHard4–8 hrs analystAutomated comps + qualitative
M&A target screeningHardDB query + reviewNL queries, structured + unstructured
Competitive intelligenceMediumAnalyst monitoringReal-time synthesis: filings + news + patents
Due diligence automationHardWeeks of teamsScreen, flag risks, draft summaries
Earnings analysisMedium2–4 hrs analystAuto beat/miss + context reasoning

Recommended Stack:

Base: DeepSeek R1 or Qwen3-72B

CPT corpus: ~2T tokens (SEC + news + patents + research + synthetic)

Distill to: 7B (extraction) + 14B (analysis) + 32B (due diligence)

Cost: $1M–$2.5M first version

Why Routing Determines Enterprise ROI

The vertical application economics above demonstrate that fine-tuned models can deliver 50x–150x cost reductions on domain-specific tasks. But these savings only materialize when each query reaches the right model. An enterprise routing all traffic through a single frontier API — the default approach for most organizations today — leaves enormous value on the table. The analysis below quantifies this overspend across 14 enterprise roles and 98 common tasks, revealing the gap between “all-large” deployment and optimal intelligent routing.

The pattern is consistent: high-volume, low-complexity tasks — ticket classification, FAQ answering, expense categorization — account for the majority of enterprise AI invocations but require only fine-tuned specialist models costing fractions of a cent. Sending these queries to GPT-4o or Claude Sonnet wastes 20x to 24x the optimal cost. Only a small fraction of tasks, those involving complex reasoning, legal analysis, or multi-step agentic workflows, justify premium model pricing.
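
A rule this simple can be expressed as a tiny cost-aware router. The sketch below is illustrative only — the tier names, prices, and capability scores are assumptions for demonstration, not figures from this report:

```python
# Cheapest-adequate routing: send each query to the least expensive tier
# whose capability meets the task's quality bar. All numbers illustrative.

TIERS = [  # (name, $/M tokens, rough capability score), cheapest first
    ("specialist",       0.01, 0.60),
    ("fine-tuned",       0.05, 0.80),
    ("commercial-small", 0.15, 0.85),
    ("open-source",      0.28, 0.88),
    ("commercial-large", 2.50, 0.92),
    ("reasoning",        8.00, 0.96),
]

def route(required_quality: float) -> str:
    """Return the cheapest tier whose capability meets the quality bar."""
    for name, _price, capability in TIERS:
        if capability >= required_quality:
            return name
    return TIERS[-1][0]  # fall back to the strongest tier

print(route(0.55))  # e.g. ticket classification -> specialist
print(route(0.90))  # e.g. legal analysis -> commercial-large
```

Production routers (RouteLLM among them) learn the quality threshold per query rather than hard-coding it, but the economic logic is the same.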

Section 7
Enterprise Overspend Analysis — Why Routing Matters

📉 Per-Department Overspend (All-Large vs Optimal)

| Role | Optimal $/day | All-Large $/day | Overspend |
|---|---|---|---|
| Customer Service (Support) | $0.035 | $0.84 | 24.0x |
| Supply Chain (Ops Mgr) | $0.02 | $0.41 | 20.5x |
| Marketing (Content) | $0.10 | $0.60 | 6.0x |
| Customer Service (Mgr) | $0.13 | $0.68 | 5.2x |
| Finance (Analyst) | $0.29 | $1.46 | 5.0x |
| Engineering (DevOps) | $0.33 | $1.51 | 4.6x |
| HR (Business Partner) | $0.16 | $0.69 | 4.3x |
| Supply Chain (Analyst) | $0.34 | $1.30 | 3.8x |
| Marketing (Analyst) | $0.32 | $0.93 | 2.9x |
| Sales (Rep) | $0.43 | $1.16 | 2.7x |
| HR (Recruiter) | $0.31 | $0.85 | 2.7x |
| Legal (Corp Counsel) | $0.79 | $1.96 | 2.5x |
| Finance (Compliance) | $0.57 | $1.34 | 2.3x |
| Engineering (SW Dev) | $1.20 | $2.28 | 1.9x |
| TOTAL (14 roles) | $4.96 | $15.94 | 3.2x |
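
The bottom-line ratio follows directly from the two totals in the table:

```python
# Average overspend across the 14 roles: all-large spend / optimal spend.

optimal_per_day   = 4.96    # $/day, 14 roles, optimal routing
all_large_per_day = 15.94   # $/day, same roles, everything on a large model

overspend = all_large_per_day / optimal_per_day
print(f"Average overspend: {overspend:.1f}x")  # 3.2x
```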

🔟 Top 10 Highest-Volume Tasks

| # | Task | Daily | $/Invocation | Model |
|---|---|---|---|---|
| 1 | Ticket classification | 100 | $0.000028 | Fine-Tuned |
| 2 | FAQ answering | 80 | $0.000032 | Fine-Tuned |
| 3 | Sentiment analysis | 60 | $0.000040 | Fine-Tuned |
| 4 | Expense categorization | 60 | $0.000044 | Fine-Tuned |
| 5 | Resume screening | 50 | $0.000128 | Fine-Tuned |
| 6 | Inventory anomaly | 50 | $0.000048 | Fine-Tuned |
| 7 | Alert triage | 50 | $0.000052 | Fine-Tuned |
| 8 | Transaction flagging | 50 | $0.000064 | Fine-Tuned |
| 9 | Code generation | 40 | $0.015000 | Large Commercial |
| 10 | PO extraction | 40 | $0.000060 | Fine-Tuned |

9 of top 10 highest-volume tasks are optimally served by fine-tuned models at $0.00003–$0.00013/invocation. Only code generation needs Large Commercial at $0.015.

$0.35/day
Per worker 2025
(optimal routing)
$0.025/day
Per worker 2030
(projected)
93%
Cost reduction
2025 → 2030

When Premium Models Justify Their Cost

The overspend analysis reveals a blanket problem: enterprises using a single large model for all tasks overpay by 3.2x on average, with customer service and supply chain roles wasting 20–24x the optimal cost. But the solution is not simply to route everything to the cheapest model. Some tasks — legal analysis, mathematical reasoning, complex agentic workflows — exhibit dramatic quality gains from reasoning models that justify their premium pricing. The Reasoning Multiplier framework below quantifies exactly when premium models earn their price, while the price trajectory and self-hosting comparison map the cost landscape through 2030.

🧠 Reasoning Multiplier — When Premium Models Justify Cost

| Task | Instruct | Reasoning | Gain | Cost × | RM | Verdict |
|---|---|---|---|---|---|---|
| Simple Q&A | 95% | 97% | +2.1% | 13.3x | 0.16 | Never |
| Creative Writing | 88% | 90% | +2.3% | 5.0x | 0.46 | Marginal |
| Code Gen | 90% | 95% | +5.6% | 2.7x | 2.07 | Hard only |
| Legal Analysis | 80% | 92% | +15.0% | 2.7x | 5.56 | Buy |
| Math Reasoning | 75% | 97% | +29.3% | 2.7x | 10.9 | Strong buy |
| Complex Agentic | 45% | 80% | +77.8% | 2.7x | 28.8 | Essential |
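
The RM column appears to be the quality gain divided by the cost multiple; that formula is inferred from the rows rather than stated explicitly, but it reproduces every entry:

```python
# Reasoning Multiplier, as inferred from the table:
# RM = quality gain (%) / cost multiple.

def reasoning_multiplier(gain_pct: float, cost_multiple: float) -> float:
    return gain_pct / cost_multiple

for task, gain, cost in [
    ("Simple Q&A",       2.1, 13.3),
    ("Legal Analysis",  15.0,  2.7),
    ("Complex Agentic", 77.8,  2.7),
]:
    print(f"{task}: RM = {reasoning_multiplier(gain, cost):.2f}")
```

Read this way, RM > 5 (the threshold cited in the model-category table) marks tasks where the quality gain outpaces the price premium by a wide margin.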

📉 Model Price Trajectory 2024 → 2030 ($/M input)

| Year | Large | Small | Open-Source | Fine-Tuned |
|---|---|---|---|---|
| 2024 | $5.00 | $0.15 | $0.28 | $0.06 |
| 2025 | $2.50 | $0.075 | $0.14 | $0.04 |
| 2026 | $1.25 | $0.04 | $0.07 | $0.02 |
| 2028 | $0.40 | $0.012 | $0.02 | $0.006 |
| 2030 | $0.12 | $0.004 | $0.006 | $0.002 |

API prices declining 40–50% per year. By 2028, 68% of tasks below $0.001/invocation.
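
The stated decline rate can be checked against the endpoints of the Large column:

```python
# Implied annual price decline for the Large column: $5.00 (2024) -> $0.12 (2030).

start_price, end_price, years = 5.00, 0.12, 6
annual_decline = 1 - (end_price / start_price) ** (1 / years)
print(f"Implied annual decline: {annual_decline:.0%}")  # ~46%, within the stated 40-50%/yr
```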

🏠 Self-Hosting vs API (Cost per M Tokens)

| Deployment | $/M | MMLU-eq |
|---|---|---|
| Self-hosted 7B (H100 spot) | $0.03 | ~75% |
| Self-hosted 7B (L40S) | $0.06 | ~75% |
| Gemini Flash-Lite API | $0.075 | — |
| GPT-4o-mini API | $0.15 | 82% |
| DeepSeek V3 API | $0.28 | 87% |
| Self-hosted 14B (A100) | $0.32 | ~85% |
| Claude Sonnet API | $3.00 | 90% |

Breakeven: ~2M tokens/day (~$60K/yr API). Above this, self-hosting is 10–50x cheaper.

From Economics to Architecture

The vertical application data confirms a consistent pattern across domains: fine-tuned models achieve 50x to 150x cost reductions on specialized tasks while maintaining or exceeding the accuracy of general-purpose frontier models. In finance, regulatory compliance analysis drops from $3.00 per million tokens to $0.06, and earnings summarization from $2.50 to $0.04. In real estate, lease abstraction falls from $50–200 per lease to roughly a tenth of a cent. These are not marginal improvements — they represent a structural shift in the cost of domain-specific intelligence.

The strategic question becomes one of orchestration: how to combine frontier models for tasks that demand maximum general capability with fine-tuned vertical models for high-volume domain work, routing each query to the most cost-effective model that meets the required quality threshold. This is the intelligence routing problem — and it is where the economic logic of training connects to the architectural logic of deployment.

Section 8
Connection to Atlas & OrchestraAI — Two-Company Architecture

🔗 How the Vertical Model Playbook Maps to Two Companies

OrchestraAI
Routing & Orchestration (Global, Model-Agnostic)
Simple tasks → Commercial Small (GPT-4o-mini, Haiku)
Medium tasks → Open-Source (DeepSeek V3, Llama 4)
Hard tasks → Commercial Large / Reasoning
Sovereign / Domain → Atlas Fine-Tuned Models
⟷
Routing + Revenue
AtlasAIFactory
Sovereign Fine-Tuning (EU, On-Prem)
Step 1: Continued pre-training on customer GPUs
Step 2: Expert network curates domain data
Step 3: SFT + alignment with DPO
Step 4: Distill + deploy to OrchestraAI

💰 Revenue per Transaction Type

| Event | OrchestraAI Earns | Atlas Earns |
|---|---|---|
| Simple task → GPT-4o-mini | $0.001/request | — |
| Hard task → Claude Sonnet | $0.005/request | — |
| Sovereign task → Atlas fine-tuned | $0.005/request | €0.50–1.50/GPU-hr |
| New vertical fine-tuning request | — | €2–10K/mo + Expert 20–30% |
| Fine-tuned model in production | $5–15K/mo agent premium | Ongoing throughput + SaaS |

🔄 The Data Flywheel

User assigns complex task → OrchestraAI decomposes → Routes to optimal models → Collects 5 types of data → Feeds back into training → Better models → More users

The five data types:

1. Task decomposition patterns
2. Routing decisions
3. Quality scores
4. Human DPO signals
5. Cross-step dependency graphs

🔀 RouteLLM Routing Efficiency (ICLR 2025)

74%
Queries routed away from GPT-4
95%
Quality retained
77%
Max cost savings
(blended routing)
| Benchmark | Cost Reduction |
|---|---|
| MT-Bench | 85% |
| MMLU | 45% |
| GSM8K | 35% |
| Enterprise avg | 30–70% |
Section 9
Sources & Methodology

📚 Sources

| Source | Key Data |
|---|---|
| Meta Llama 3.1 blog | 15T+ tokens, 30.84M GPU-hrs |
| Qwen3 blog (Apr 2025) | 36T tokens, 0.6B–235B family |
| DeepSeek V3 report | 14.8T tokens, $5.6M compute |
| Phi-4 paper (Microsoft) | 14B params, 84.8% MMLU |
| Epoch AI (2025–2026) | Training cost trends |
| Daniel Kang (Aug 2025) | Annotation 3.1x compute |
| Karpathy (Jul 2023) | Llama 2 70B GPU-hours |
| RouteLLM (ICLR 2025) | 74% routing, 95% quality |
| FinGAIA / Fin-R1 | Finance benchmarking |
| Red Hat OSS AI (2025) | Enterprise adoption |
| NVIDIA NeMo / TensorZero | Distillation, 5–30x savings |

Additional cross-references:

Enterprise AI disruption analysis ($607B TAM) · Model taxonomy (7 categories, RouteLLM) · Job function task mapping (98 tasks, overspend analysis) · Small models study (fine-tuning, throughput) · GPU demand projections

📅 Per-Task Cost Evolution 2024 → 2030

| Task | Complexity | 2024 | 2025 | 2026 | 2028 | 2030 |
|---|---|---|---|---|---|---|
| FAQ answering | Simple | $0.000048 | $0.000032 | $0.000016 | $0.0000048 | $0.0000016 |
| Email drafting | Simple | $0.000338 | $0.000188 | $0.000091 | $0.000031 | $0.000010 |
| Code generation | Medium | $0.025 | $0.015 | $0.0075 | $0.0024 | $0.00072 |
| Contract review | Hard | $0.150 | $0.0875 | $0.04375 | $0.014 | $0.0042 |
| Litigation risk | Frontier | $0.120 | $0.070 | $0.035 | $0.0112 | $0.00336 |

2026: 55% of tasks <$0.001. 2028: 68% <$0.001. 2030: Avg task cost $0.0006.

What Comes Next

The economics of model training expose a paradox at the heart of the current paradigm: each new generation costs exponentially more to train, yet the marginal gains in general knowledge grow ever smaller. Frontier models are converging at the 90–92% ceiling on MMLU, spending hundreds of millions of dollars for single-digit percentage point improvements. This raises a fundamental question about the architecture of training itself: why does every new model re-learn the same facts from scratch? The answer — and the enormous waste it represents — is the subject of Chapter 10: The Britannica Problem.