Core Thesis
The shift from training-dominated to inference-dominated AI spend, combined with the explosion of reasoning models (over 50% of total token generation once hidden thinking tokens are counted), creates sustained GPU demand growth through 2030, even as per-token costs decline 40-50% annually.
Three reinforcing dynamics drive the paradox:
- Token volume outpaces cost decline: Global tokens grow ~8x (2025-2030) while cost/token drops ~97%. Net GPU demand still rises ~4x.
- Reasoning models multiply hidden compute: Each visible reasoning token generates 2-5x "thinking tokens" internally. This 3.5x multiplier makes reasoning 70% of GPU load despite being 35% of visible tokens.
- Inference eclipses training: Training is one-time; inference is recurring and scales with every user. Inference share rises from 35% to 75% of AI compute by 2030.
The reasoning model revolution adds a critical multiplier to this equation. By 2030, reasoning models account for approximately 35% of visible inference tokens, but because each visible reasoning token generates 2-5x hidden "thinking tokens" internally, reasoning workloads consume roughly 70% of all inference GPU cycles. This shift from throughput-optimized to latency-optimized GPU requirements favors NVIDIA's high-end SKUs and rack-scale solutions like the GB200 NVL72, reinforcing the company's premium positioning even as commodity inference migrates to lower-cost alternatives.
From AI Spending to GPU Demand: The Calculation Chain
Every projection follows a five-step derivation. Each step transforms the previous output into a more concrete infrastructure metric.
Step 1 Project Global Token Consumption
Starting from observed weekly token consumption, we project annual volumes through 2030 with decelerating growth rates.
Starting Data (source: enterprise-ai-disruption.md §4.1)
| Metric | 2025 | 2026 (Projected) | Source |
|---|---|---|---|
| Weekly token consumption | 12.1T | ~25T+ | a16z / OpenRouter |
| YoY token growth | — | 12.7x | a16z |
| Inference as % of AI compute | 34.6% | 66% | Deloitte |
| Inference spending | $9.2B | $20.6B | Deloitte |
Base Year Calculation

12.1T tokens/week × 52 weeks ≈ 629T tokens for full-year 2025. This is the base volume for all subsequent projections.
Growth Rate Assumptions
The 12.7x YoY growth from a16z reflects explosive early adoption. We calibrate deceleration using the 2026 anchor of ~25T/week and typical S-curve dynamics.
| Period | Growth Multiplier | Rationale |
|---|---|---|
| 2025→2026 | 2.07x | Anchored to a16z data (25T/week) |
| 2026→2027 | 1.70x | Reasoning models expanding; enterprise adoption accelerating |
| 2027→2028 | 1.50x | Base effect; routing efficiency (RouteLLM 74% redirection) |
| 2028→2029 | 1.30x | Market maturing; inference efficiency compounds |
| 2029→2030 | 1.20x | Approaching saturation in current use cases |
Annual Token Projections
| Year | Growth | Annual Tokens | Weekly Equivalent |
|---|---|---|---|
| 2025 | — | 629T | 12.1T |
| 2026 | 2.07x | 1,302T | 25.0T |
| 2027 | 1.70x | 2,213T | 42.6T |
| 2028 | 1.50x | 3,320T | 63.8T |
| 2029 | 1.30x | 4,316T | 83.0T |
| 2030 | 1.20x | 5,179T | 99.6T |
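The projection table above can be reproduced in a few lines (a sketch using the base-year volume and the Step 1 growth multipliers; small differences from the table are rounding):

```python
# Project annual token consumption (trillions) from the 2025 base
# using the decelerating growth multipliers assumed in Step 1.
base_2025 = 629  # annual tokens, trillions (12.1T/week x 52)
multipliers = {2026: 2.07, 2027: 1.70, 2028: 1.50, 2029: 1.30, 2030: 1.20}

tokens = {2025: base_2025}
for year, m in multipliers.items():
    tokens[year] = tokens[year - 1] * m

for year, t in tokens.items():
    print(f"{year}: {t:,.0f}T/year (~{t / 52:.1f}T/week)")
```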
2030 Scenario Range

| Scenario | Conservative | Base | Optimistic |
|---|---|---|---|
| 2030 Annual Tokens | ~2,900T | ~5,179T | ~8,400T |
| Assumption | Growth × 0.8 | As modeled | Growth × 1.2 + new modalities |
Step 2 Break Down Tokens by Model Size
Token consumption is not uniform. Different model sizes require different GPU hardware. We map tokens into four GPU-relevant buckets.
Token Consumption by Task Type (source: enterprise-ai-disruption.md §4.2)
Four GPU-Relevant Buckets
| Bucket | Model Size | Token Share | Reference GPU | Source Mapping |
|---|---|---|---|---|
| A: Small | Sub-7B | 15% | L40S, T4, L4 | Content gen (10%) + Chat (5%) |
| B: Medium | 7-14B | 20% | A100, L40S | Data analysis + open-source instruct |
| C: Large | 32-200B+ | 30% | H100, A100 80GB | Half of Programming + analysis share |
| D: Reasoning | Frontier | 35% | H100, B200 | Reasoning (25%) + half Programming |
The Reasoning Token Multiplier
From model-taxonomy-analysis.md §2.3: the effective cost of reasoning models runs 2-5x their listed per-token price because each visible reasoning token requires generating 2-5 additional hidden thinking tokens.
Effective Token Distribution (After Reasoning Overhead)
| Bucket | Visible Share | Thinking Multiplier | Effective GPU Share |
|---|---|---|---|
| A: Small | 15% | 1.0x | 7% |
| B: Medium | 20% | 1.0x | 9% |
| C: Large | 30% | 1.0x | 14% |
| D: Reasoning | 35% | 3.5x | 70% |
Reasoning models dominate GPU demand. Despite being 35% of visible tokens, they consume ~70% of effective GPU compute due to thinking token overhead. This is why GPU demand grows faster than visible token counts.
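The weighting arithmetic can be sketched as follows, assuming a uniform 3.5x multiplier (A3). Note that uniform weighting puts the reasoning bucket at ~65% of effective compute; the table's 70% figure additionally reflects heavier per-token compute for frontier models, an assumption this sketch does not model:

```python
# Effective GPU share per bucket: visible token share scaled by the
# hidden thinking-token multiplier, then renormalized to 100%.
buckets = {
    "A: Small":     (15, 1.0),
    "B: Medium":    (20, 1.0),
    "C: Large":     (30, 1.0),
    "D: Reasoning": (35, 3.5),  # uniform multiplier assumption (A3)
}

weighted = {name: share * mult for name, (share, mult) in buckets.items()}
total = sum(weighted.values())
for name, w in weighted.items():
    print(f"{name}: {100 * w / total:.1f}% of effective GPU compute")
```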
Step 3 Map Model Sizes to GPU Requirements
Using vLLM inference benchmarks and NVIDIA hardware specifications, we map each model size bucket to GPU type and throughput.
vLLM Throughput Benchmarks (source: small-models-analysis.md §5.2)
| Model Size | GPU | Throughput (tok/s) | Latency (first token) |
|---|---|---|---|
| 7B | L40S | ~10,000 | <50ms |
| 8B (Llama 3.1) | H100 | ~16,200 | <30ms |
| 14B | A100 80GB | ~3,004 | ~100ms |
| 32B | A100 80GB | ~577 | ~250ms |
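For the GPU-hour math in Step 4, each throughput figure converts to tokens served per GPU-hour by multiplying by 3,600 seconds (a simple sketch using the benchmark rows above):

```python
# Tokens served per GPU-hour at sustained batched throughput.
benchmarks = {  # config: tokens/sec from the vLLM benchmark table
    "7B on L40S": 10_000,
    "8B on H100": 16_200,
    "14B on A100 80GB": 3_004,
    "32B on A100 80GB": 577,
}

for config, tok_per_s in benchmarks.items():
    per_hour = tok_per_s * 3_600
    print(f"{config}: {per_hour / 1e6:.1f}M tokens/GPU-hour")
```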
GPU Hardware Specifications
| GPU | Launch | VRAM | HBM Type | TDP | Perf (vs A100) | ASP (est.) |
|---|---|---|---|---|---|---|
| A100 80GB | 2020 | 80 GB | HBM2e | 400W | 1.0x | ~$10K |
| H100 SXM | 2023 | 80 GB | HBM3 | 700W | ~2.5x | ~$25-30K |
| H200 | 2024 | 141 GB | HBM3e | 700W | ~3.0x | ~$30-35K |
| B200 | 2025 | 192 GB | HBM3e | 1,000W | ~6.0x | ~$35-40K |
| GB200 | 2025 | 384 GB | HBM3e | 2,700W | ~12x | ~$60-70K |
| Rubin | 2027E | 256-384 GB | HBM4 | ~1,000-1,200W | ~15-20x | ~$40-50K |
Fleet Mix Evolution
Step 4 Calculate GPU-Hours Needed
Convert token demand into GPU-hours using throughput data, then apply real-world dampening factors.
Formula

GPU-hours = (annual tokens × bucket share × thinking multiplier × dampening) ÷ (throughput in tok/s × 3,600 s/hour)
Dampening Factors
| Factor | Value | Rationale |
|---|---|---|
| Enterprise share | 0.45 | Not all tokens need dedicated GPUs; consumer use on shared platforms is more efficient |
| Cache/dedup | 0.80 | KV-cache reuse, semantic caching reduce computation by ~20% |
| Inference optimization | 0.37 (by 2030) | Speculative decoding, quantization, continuous batching compound annually |
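Combining the factors gives each year's dampening multiplier. The per-year cumulative optimization factors below are inferred from the Step 4 table and should be read as assumptions of this sketch:

```python
# Dampening = enterprise share x cache/dedup x cumulative inference optimization.
ENTERPRISE_SHARE = 0.45
CACHE_DEDUP = 0.80
# Cumulative optimization factors inferred from the table (1.0 = none).
optimization = {2025: 1.00, 2026: 0.80, 2027: 0.60,
                2028: 0.48, 2029: 0.41, 2030: 0.37}

for year, opt in optimization.items():
    damp = ENTERPRISE_SHARE * CACHE_DEDUP * opt
    print(f"{year}: dampening = {damp:.3f}")
```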
Worked Example: 2025 (Base Case)

Reasoning bucket: 629T annual tokens × 35% visible share × 3.5 thinking multiplier × 0.36 dampening ≈ 277T effective tokens. At 2,000 tok/s, that is 277T ÷ (2,000 × 3,600) ≈ 38.5M GPU-hours, the 2025 reasoning figure in the table below.
GPU-Hours by Year (Recalculated with Hardware Improvements)
| Year | Tokens (T) | Dampening | Reasoning Throughput | Reasoning GPU-hrs | Other GPU-hrs | Total |
|---|---|---|---|---|---|---|
| 2025 | 629 | 0.360 | 2,000 | 38.5M | 3.8M | 42.3M |
| 2026 | 1,302 | 0.288 | 2,800 | 46.9M | 4.7M | 51.6M |
| 2027 | 2,213 | 0.216 | 3,600 | 46.6M | 4.7M | 51.3M |
| 2028 | 3,320 | 0.173 | 4,800 | 41.9M | 4.2M | 46.1M |
| 2029 | 4,316 | 0.148 | 5,500 | 40.7M | 4.1M | 44.8M |
| 2030 | 5,179 | 0.133 | 6,500 | 37.1M | 3.7M | 40.8M |
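As a check, the 2025 reasoning row follows directly from the inputs (a sketch; later years drift from the table by a few percent due to rounding of intermediate factors):

```python
# 2025 reasoning-bucket GPU-hours:
# tokens x visible reasoning share x thinking multiplier x dampening,
# divided by throughput (tok/s) x 3,600 s/hour.
tokens_2025 = 629e12        # annual tokens
reasoning_share = 0.35      # visible token share (Step 2)
thinking_multiplier = 3.5   # hidden thinking-token overhead (A3)
dampening = 0.360           # combined dampening factors (Step 4)
throughput = 2_000          # reasoning tok/s per GPU

effective_tokens = tokens_2025 * reasoning_share * thinking_multiplier * dampening
gpu_hours = effective_tokens / (throughput * 3_600)
print(f"2025 reasoning GPU-hours: {gpu_hours / 1e6:.1f}M")
```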
Step 5 Convert to Total GPU Fleet Size
The final step: translate GPU-hours into the total number of GPUs that must be deployed globally.
Formula

Fleet size = (annual GPU-hours × global market multiplier) ÷ (8,760 hours/year × utilization), with utilization rising from 65% to 80% over the period (A10).
Adjusting for Global Market Scope
The 12.1T tokens/week figure reflects traffic observed through one routing platform (OpenRouter). Total global inference (OpenAI, Anthropic, Google, Microsoft, enterprise private) is estimated at 3-5x higher; taking the midpoint gives roughly 55T tokens/week globally in 2025.
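The global scaling is simple arithmetic; the sketch below sweeps the stated 3-5x range, with the midpoint landing near the ~55T/week figure:

```python
# Scale single-platform token volume to an estimated global total.
OBSERVED_WEEKLY_T = 12.1  # tokens/week, trillions (one platform)

for label, mult in [("low", 3.0), ("mid", 4.5), ("high", 5.0)]:
    global_weekly = OBSERVED_WEEKLY_T * mult
    print(f"{label} ({mult}x): ~{global_weekly:.0f}T tokens/week global")
```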
GPU Fleet Projection — Three Scenarios
| Year | Conservative | Base | Optimistic |
|---|---|---|---|
| 2025 | 2.8M | 3.8M | 5.0M |
| 2026 | 3.8M | 5.2M | 7.1M |
| 2027 | 4.8M | 6.7M | 9.5M |
| 2028 | 5.8M | 8.4M | 12.3M |
| 2029 | 7.0M | 10.5M | 16.8M |
| 2030 | 8.5M | 14.2M | 22.5M |
Training vs Inference Fleet Split
Training GPU demand flattens while inference nearly quadruples. The structural shift from training to inference is the primary driver of sustained GPU demand growth. By 2030, inference is 75% of the total AI GPU fleet.
NVIDIA Revenue Trajectory
Decomposing the $607B enterprise AI TAM into GPU-addressable layers and projecting NVIDIA's capture.
TAM Layer Decomposition
| Layer | % of TAM | 2025 ($B) | 2030 ($B) | NVIDIA Role |
|---|---|---|---|---|
| GPU / Accelerator Hardware | 30% | $40.1 | $182.2 | Primary supplier (70-85%) |
| Cloud GPU Rental | 15% | $20.0 | $91.1 | Hardware supplier to clouds (~37%) |
| AI Software | 30% | $40.1 | $182.2 | CUDA / AI Enterprise (~6%) |
| AI Services | 25% | $33.4 | $151.8 | Indirect |
NVIDIA Revenue by Segment (Base Case)
| Year | TAM ($B) | GPU HW ($B) | Cloud ($B) | Software ($B) | Networking ($B) | Total ($B) |
|---|---|---|---|---|---|---|
| 2025 | 133.7 | 32.9 | 7.4 | 2.4 | 5.9 | 48.6 |
| 2026 | 178.3 | 42.8 | 9.9 | 3.2 | 7.7 | 63.6 |
| 2027 | 241.7 | 56.6 | 13.4 | 4.4 | 10.2 | 84.5 |
| 2028 | 326.7 | 74.5 | 18.1 | 5.9 | 13.4 | 111.9 |
| 2029 | 445.3 | 99.0 | 24.7 | 8.0 | 17.8 | 149.5 |
| 2030 | 607.2 | 142.1 | 33.7 | 10.9 | 25.6 | 212.3 |
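The 2025 base-case row decomposes as follows. The capture rates are inferred from the tables above and should be read as assumptions: ~82% of the GPU hardware layer, ~37% of cloud GPU rental, ~6% of AI software, with networking modeled at ~18% of GPU hardware revenue:

```python
# Decompose NVIDIA's 2025 revenue from the $133.7B enterprise AI TAM.
TAM_2025 = 133.7  # $B

layers = {  # layer: (share of TAM, inferred NVIDIA capture rate)
    "GPU hardware": (0.30, 0.82),
    "Cloud rental": (0.15, 0.37),
    "AI software":  (0.30, 0.06),
}

revenue = {name: TAM_2025 * share * capture
           for name, (share, capture) in layers.items()}
revenue["Networking"] = revenue["GPU hardware"] * 0.18  # assumed ratio

for name, r in revenue.items():
    print(f"{name}: ${r:.1f}B")
print(f"Total: ${sum(revenue.values()):.1f}B")
```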
Revenue Scenarios
| Year | Conservative ($B) | Base ($B) | Optimistic ($B) |
|---|---|---|---|
| 2025 | 42 | 49 | 55 |
| 2026 | 52 | 64 | 76 |
| 2027 | 66 | 85 | 106 |
| 2028 | 82 | 112 | 147 |
| 2029 | 108 | 150 | 201 |
| 2030 | 145 | 212 | 260 |
Revenue Scenarios and Strategic Positioning
Three revenue scenarios bracket NVIDIA's trajectory. The bear case projects $145 billion in data center revenue by 2030, driven by a capex pullback and accelerated ASIC adoption. The base case reaches $212 billion, assuming sustained enterprise adoption and moderate competitive erosion. The bull case hits $260 billion on the back of AGI-driven training surges and sovereign AI acceleration. In all three scenarios, NVIDIA remains the dominant AI infrastructure company — the question is not whether it grows, but by how much.
The Blackwell generation, shipping in volume through 2025, represents the critical inflection. Its 10x per-token cost reduction over Hopper, combined with the GB200 NVL72 rack-scale architecture, repositions NVIDIA from selling individual GPUs to selling integrated AI infrastructure at $2 million per rack. This shift from component to system sales increases NVIDIA's share of datacenter wallet and makes it harder for competitors to displace individual components.
NVIDIA Ecosystem Partners
The NVIDIA GPU ecosystem generates massive revenue for cloud providers, memory suppliers, server OEMs, and networking companies.
Cloud Providers (AWS, Azure, GCP)
HBM Memory Suppliers
Networking (Mellanox / InfiniBand)
NVLink: 900 GB/s (H100) → 1.8 TB/s (B200)
InfiniBand: 400 Gb/s per port, dominates AI workloads
Server OEMs
Software Ecosystem
CUDA: 4M+ developers, ~18 years of ecosystem (launched 2007)
TensorRT: 2-5x inference speedup
vLLM: 3.7-24x throughput vs TGI
Edge / Embedded (Jetson)
Orin Nano: ~4B models, 15W
AGX Orin: ~20B models, 60W
57% of enterprise tasks run sub-7B models — edge-deployable
Data Center Capacity Requirements
Converting GPU fleet size into power demand, data center count, and geographic distribution.
Power Consumption by GPU Generation
| GPU | TDP (GPU) | System Power (×2.2) | Per Rack (8 GPUs) |
|---|---|---|---|
| A100 SXM | 400W | 880W | 7.0 kW |
| H100 SXM | 700W | 1,540W | 12.3 kW |
| B200 | 1,000W | 2,200W | 17.6 kW |
| GB200 NVL72 (72 GPUs) | — | — | 120 kW / rack |
| Rubin (est.) | 1,000-1,200W | 2,200-2,640W | 17.6-21.1 kW |
Total AI Power Demand (Base Case)
| Year | Total GPU Fleet | Avg System Power | Total Power (GW) | Annual Energy (TWh) |
|---|---|---|---|---|
| 2025 | 9.5M | 1,400W | 13.3 GW | ~87 |
| 2026 | 9.5M | 1,700W | 16.2 GW | ~106 |
| 2027 | 10.8M | 1,900W | 20.5 GW | ~135 |
| 2028 | 12.5M | 2,100W | 26.3 GW | ~173 |
| 2029 | 14.6M | 2,200W | 32.1 GW | ~211 |
| 2030 | 18.9M | 2,300W | 43.5 GW | ~286 |
Global data center electricity was ~460 TWh in 2024 (IEA). AI GPU fleets alone could add ~286 TWh of annual demand by 2030. This is directionally consistent with IEA and Goldman Sachs estimates of data center power doubling by 2030. Power availability, not chip supply, is increasingly the binding constraint on GPU deployment.
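Power and energy follow from fleet size and average system power. The ~75% combined load factor below is inferred from the table and is an assumption of this sketch (it bundles utilization and operating overhead):

```python
# Convert GPU fleet size to power (GW) and annual energy (TWh).
LOAD_FACTOR = 0.75   # inferred combined utilization/overhead factor
HOURS_PER_YEAR = 8_760

fleet = {2025: (9.5e6, 1_400), 2030: (18.9e6, 2_300)}  # (GPUs, avg system W)

for year, (gpus, watts) in fleet.items():
    power_gw = gpus * watts / 1e9
    energy_twh = power_gw * HOURS_PER_YEAR * LOAD_FACTOR / 1_000
    print(f"{year}: {power_gw:.1f} GW, ~{energy_twh:.0f} TWh/yr")
```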
New Data Centers Needed (2025-2030)
| Year | Power Growth (GW) | Avg DC Capacity (MW) | New DCs Needed |
|---|---|---|---|
| 2025 | +3.0 | 200 | ~15 |
| 2026 | +2.9 | 250 | ~12 |
| 2027 | +4.3 | 300 | ~14 |
| 2028 | +5.8 | 400 | ~15 |
| 2029 | +5.8 | 450 | ~13 |
| 2030 | +11.4 | 500 | ~23 |
| Total | +33.2 | — | ~92 |
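Each year's facility count is the incremental power divided by the assumed average data center size for that year:

```python
# New data centers needed = incremental power / average facility capacity.
growth = {  # year: (power growth in GW, avg DC capacity in MW)
    2026: (2.9, 250),
    2030: (11.4, 500),
}

for year, (gw, mw) in growth.items():
    print(f"{year}: ~{round(gw * 1_000 / mw)} new data centers")
```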
Geographic Distribution (Estimated)
Competitive Landscape
NVIDIA faces growing competition from AMD, Google TPUs, and hyperscaler custom ASICs. But the CUDA moat remains formidable.
Current Market Share (2025)
NVIDIA Competitive Moat
| Moat Factor | Strength | Durability | Risk |
|---|---|---|---|
| CUDA ecosystem | Very Strong | 5-7 years | ROCm + Triton narrowing gap |
| NVLink / networking | Strong | 3-5 years | Ultra Ethernet Consortium |
| Full-stack integration | Very Strong | 5+ years | No competitor has full stack |
| Developer mindshare | Very Strong | 5+ years | 4M+ CUDA devs; hard to retrain |
| TensorRT optimization | Strong | 3-4 years | vLLM/SGLang are open alternatives |
2030 Market Share Projections
| Vendor | Conservative | Base | Optimistic |
|---|---|---|---|
| NVIDIA | 65% | 78% | 85% |
| AMD | 18% | 12% | 8% |
| Google TPU | 5% | 4% | 3% |
| Custom ASICs | 10% | 5% | 3% |
| Others | 2% | 1% | 1% |
Risk Factors
Key risks that could materially alter the projections in either direction.
Model efficiency gains: MoE architectures (DeepSeek V3: 671B params, only 37B active), distillation (a fine-tuned 7B beats GPT-4 on 85% of narrow tasks), and RouteLLM routing (74% of queries redirected at 95% quality) could reduce per-token GPU demand faster than modeled. If efficiency improves 50%/year instead of 30%, the 2030 fleet drops to ~8.5M.
Supply chain concentration: TSMC dependency (all NVIDIA GPUs on 4nm-class nodes), HBM shortage (30-40% capacity growth may lag 40-50% demand), and the CoWoS packaging bottleneck could constrain GPU production and slow fleet expansion.
Geopolitics: US export controls restrict AI chip sales to China (~25-30% of the potential market). Escalation could fragment the market and reduce NVIDIA's addressable base by 10-20%.
Power availability: 44 GW by 2030 requires massive grid expansion. Nuclear deals take 7-10 years, and many regions (N. Virginia, Dublin, Singapore) are already capacity-constrained. Power could become the binding constraint.
Alternative compute: Photonic computing (Lightmatter), neuromorphic chips (Intel Loihi), and custom inference ASICs promise higher efficiency but are 3-5 years from production scale; not a 2025-2030 threat.
Assumptions Register
Every assumption used in this analysis, with confidence and sensitivity ratings. Transparency enables the reader to adjust projections to their own views.
| ID | Assumption | Value | Confidence | Sensitivity | Used In |
|---|---|---|---|---|---|
| A1 | Token growth deceleration rates | 2.07x → 1.2x | Medium | Very High | Step 1 |
| A2 | Enterprise share of global tokens | 45% | Medium | High | Step 4 |
| A3 | Reasoning thinking token multiplier | 3.5x | Low-Med | Very High | Step 2 |
| A4 | Batched throughput multiplier | 5x | Medium | High | Step 3 |
| A5 | GPU ASPs by generation | Varies | Med / Low | High (rev) | Revenue |
| A6 | Rubin throughput improvement | 2.0x over B200 | Low | Medium | Step 3 |
| A7 | GPU fleet mix by year | See table | Low | Medium | Steps 3-5 |
| A8 | Cache / dedup reduction | 20% | Low-Med | Medium | Step 4 |
| A9 | Annual inference optimization | 15-25%, decelerating | Medium | High | Step 4 |
| A10 | GPU utilization rate | 65% → 80% | Med-High | Medium | Step 5 |
| A11 | OpenRouter as % of global tokens | 22% | Low | Very High | Step 5 |
| A12 | TAM layer splits | 30/15/30/25 | Medium | High | Revenue |
| A13 | Cloud provider market shares | AWS 35%, etc. | Medium | Medium | Ecosystem |
| A14 | HBM supplier shares | SK Hynix 50% | Med-High | Low | Ecosystem |
| A15 | PUE for AI data centers | 1.3 | High | Low | Data Centers |
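A3, the thinking-token multiplier, is rated Very High sensitivity; a quick sweep across the 2-5x range stated in the source shows why (assuming uniform multipliers and fixed visible shares):

```python
# Sensitivity of the reasoning bucket's effective GPU share to the
# thinking-token multiplier (A3), holding visible shares fixed.
VISIBLE_REASONING = 35  # % of visible tokens
VISIBLE_OTHER = 65

for mult in (2.0, 3.5, 5.0):
    weighted = VISIBLE_REASONING * mult
    share = 100 * weighted / (weighted + VISIBLE_OTHER)
    print(f"multiplier {mult}x -> reasoning = {share:.0f}% of GPU compute")
```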
The CUDA Moat and Its Limits
NVIDIA's competitive moat is widest in training and narrowest in inference. For training, the CUDA ecosystem, NVLink interconnect, and PyTorch-first framework integration create switching costs that no competitor has overcome. For inference, the calculus is different: models are fixed, workloads are predictable, and purpose-built ASICs can deliver 2-3x better efficiency for specific architectures. This structural distinction explains why Google TPUs and custom ASICs together could grow from a mid-single-digit share today to as much as 15% of the market by 2030 in the conservative scenario, capturing inference share while barely denting NVIDIA's training dominance. For strategic decision-makers, the implication is clear: NVIDIA remains essential for training and general-purpose inference, while specialized inference at scale increasingly rewards alternative hardware. The full GPU demand context is covered in Chapter 16: The $1.7T Compute Stack.