Core Thesis

The shift from training-dominated to inference-dominated AI spend — combined with the reasoning model explosion, which drives the majority of effective token usage once hidden thinking tokens are counted — creates sustained GPU demand growth through 2030, even as per-token costs decline 40-50% annually.

Why GPU Demand Keeps Growing Despite Cost Declines

Three reinforcing dynamics drive the paradox:

  1. Token volume outpaces cost decline: Global tokens grow ~8x (2025-2030) while cost/token drops ~97%. Net GPU demand still rises ~4x.
  2. Reasoning models multiply hidden compute: Each visible reasoning token generates 2-5x "thinking tokens" internally. This 3.5x multiplier makes reasoning 70% of GPU load despite being 35% of visible tokens.
  3. Inference eclipses training: Training is one-time; inference is recurring and scales with every user. Inference share rises from 35% to 75% of AI compute by 2030.

The reasoning model revolution adds a critical multiplier to this equation. By 2030, reasoning models account for approximately 35% of visible inference tokens — but because each visible reasoning token generates 2–5x hidden "thinking tokens" internally, reasoning workloads consume roughly 70% of all inference GPU cycles (see Step 2). This shift from throughput-optimized to latency-optimized GPU requirements favors NVIDIA's high-end SKUs and rack-scale solutions like the GB200 NVL72, reinforcing the company's premium positioning even as commodity inference migrates to lower-cost alternatives.

From AI Spending to GPU Demand: The Calculation Chain

Every projection follows a five-step derivation. Each step transforms the previous output into a more concrete infrastructure metric.

1. Total Token Consumption (2025-2030)
   Source: enterprise-ai-disruption.md §4.1 — 12.1T tokens/week → project forward
2. Tokens by Model Size Bucket
   Source: enterprise-ai-disruption.md §4.2-4.3 + job-function-task-analysis.md §4.2
3. GPU Requirements per Bucket
   Source: small-models-analysis.md §5.1-5.2 — vLLM throughput benchmarks
4. GPU-Hours Required
   Formula: Tokens / Throughput / 3600 × dampening factors
5. Total GPU Fleet Size
   Formula: GPU-Hours / (8,760 × utilization_rate) → three scenarios

Step 1: Project Global Token Consumption

Starting from observed weekly token consumption, we project annual volumes through 2030 with decelerating growth rates.

Starting Data (enterprise-ai-disruption.md §4.1)

| Metric | 2025 | 2026 (Projected) | Source |
|---|---|---|---|
| Weekly token consumption | 12.1T | ~25T+ | a16z / OpenRouter |
| YoY token growth | 12.7x | — | a16z |
| Inference as % of AI compute | 34.6% | 66% | Deloitte |
| Inference spending | $9.2B | $20.6B | Deloitte |

Base Year Calculation

Annual tokens (2025) = 12.1T tokens/week × 52 weeks = 629.2T tokens/year

Growth Rate Assumptions

The 12.7x YoY growth from a16z reflects explosive early adoption. We calibrate deceleration using the 2026 anchor of ~25T/week and typical S-curve dynamics.

| Period | Growth Multiplier | Rationale |
|---|---|---|
| 2025→2026 | 2.07x | Anchored to a16z data (25T/week) |
| 2026→2027 | 1.70x | Reasoning models expanding; enterprise adoption accelerating |
| 2027→2028 | 1.50x | Base effect; routing efficiency (RouteLLM 74% redirection) |
| 2028→2029 | 1.30x | Market maturing; inference efficiency compounds |
| 2029→2030 | 1.20x | Approaching saturation in current use cases |

Annual Token Projections

| Year | Growth | Annual Tokens | Weekly Equivalent |
|---|---|---|---|
| 2025 | — | 629T | 12.1T |
| 2026 | 2.07x | 1,302T | 25.0T |
| 2027 | 1.70x | 2,213T | 42.6T |
| 2028 | 1.50x | 3,320T | 63.8T |
| 2029 | 1.30x | 4,316T | 83.0T |
| 2030 | 1.20x | 5,179T | 99.6T |

| Scenario | Conservative | Base | Optimistic |
|---|---|---|---|
| 2030 Annual Tokens | ~2,900T | ~5,179T | ~8,400T |
| Assumption | Growth × 0.8 | As modeled | Growth × 1.2 + new modalities |
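The base-case projection above can be reproduced in a few lines. The 629T base rounds 12.1T/week × 52; the multipliers are taken from the growth-rate table.

```python
# Sketch of Step 1: project annual token consumption from the 2025 base.
BASE_ANNUAL_T = 629.0   # 12.1T tokens/week x 52 weeks, rounded (trillions)
GROWTH = {2026: 2.07, 2027: 1.70, 2028: 1.50, 2029: 1.30, 2030: 1.20}

annual = {2025: BASE_ANNUAL_T}
for year, mult in sorted(GROWTH.items()):
    annual[year] = annual[year - 1] * mult   # compound each year's multiplier

for year, tokens in sorted(annual.items()):
    print(f"{year}: {tokens:,.0f}T/year (~{tokens / 52:.1f}T/week)")
# 2030 lands near the table's 5,179T/year (~99.6T/week)
```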

Step 2: Break Tokens Down by Model Size

Token consumption is not uniform. Different model sizes require different GPU hardware. We map tokens into four GPU-relevant buckets.

Token Consumption by Task Type (enterprise-ai-disruption.md §4.2)

Programming: 50%
Reasoning / Thinking: 25%
Content Generation: 10%
Data Analysis: 8%
General Chat: 5%
Other: 2%

Four GPU-Relevant Buckets

| Bucket | Model Size | Token Share | Reference GPU | Source Mapping |
|---|---|---|---|---|
| A: Small | Sub-7B | 15% | L40S, T4, L4 | Content gen (10%) + Chat (5%) |
| B: Medium | 7-14B | 20% | A100, L40S | Data analysis + open-source instruct |
| C: Large | 32-200B+ | 30% | H100, A100 80GB | Half of Programming + analysis share |
| D: Reasoning | Frontier | 35% | H100, B200 | Reasoning (25%) + half Programming |

The Reasoning Token Multiplier

Hidden Thinking Tokens: The Most Impactful Assumption

From model-taxonomy-analysis.md §2.3: Reasoning model effective cost is 2-5x higher than input price due to hidden thinking tokens. Each visible reasoning token requires generating 2-5 additional internal tokens.

Effective GPU load = Visible tokens × Thinking multiplier
At a 35% visible share with the 3.5x midpoint multiplier: 0.35 × 3.5 = 1.225, a 122.5% equivalent token load.

Effective Token Distribution (After Reasoning Overhead)

| Bucket | Visible Share | Thinking Multiplier | Effective GPU Share |
|---|---|---|---|
| A: Small | 15% | 1.0x | 7% |
| B: Medium | 20% | 1.0x | 9% |
| C: Large | 30% | 1.0x | 14% |
| D: Reasoning | 35% | 3.5x | 70% |

Reasoning models dominate GPU demand. Despite being 35% of visible tokens, they consume ~70% of effective GPU compute due to thinking token overhead. This is why GPU demand grows faster than visible token counts.
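The normalization behind the table can be sketched in a few lines. Note that with the 3.5x midpoint multiplier the normalized reasoning share comes out near 65%; the table's ~70% figure implies a multiplier toward the upper end of the 2-5x range.

```python
# Normalize visible token shares weighted by the thinking-token multiplier.
# Multipliers are the document's assumptions (A3 in the register).
visible = {"A: Small": 0.15, "B: Medium": 0.20, "C: Large": 0.30, "D: Reasoning": 0.35}
multiplier = {"A: Small": 1.0, "B: Medium": 1.0, "C: Large": 1.0, "D: Reasoning": 3.5}

weighted = {b: visible[b] * multiplier[b] for b in visible}
total = sum(weighted.values())                       # 1.875 "effective token" units
effective = {b: w / total for b, w in weighted.items()}

for bucket, share in effective.items():
    print(f"{bucket}: {share:.0%} of effective GPU load")
# Reasoning comes out near 65% at 3.5x; ~70% implies a higher multiplier
```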

Step 3: Map Model Sizes to GPU Requirements

Using vLLM inference benchmarks and NVIDIA hardware specifications, we map each model size bucket to GPU type and throughput.

vLLM Throughput Benchmarks (small-models-analysis.md §5.2)

| Model Size | GPU | Throughput (tok/s) | Latency (first token) |
|---|---|---|---|
| 7B | L40S | ~10,000 | <50ms |
| 8B (Llama 3.1) | H100 | ~16,200 | <30ms |
| 14B | A100 80GB | ~3,004 | ~100ms |
| 32B | A100 80GB | ~577 | ~250ms |

GPU Hardware Specifications

| GPU | Launch | VRAM | HBM Type | TDP | Perf (vs A100) | ASP (est.) |
|---|---|---|---|---|---|---|
| A100 80GB | 2020 | 80 GB | HBM2e | 400W | 1.0x | ~$10K |
| H100 SXM | 2023 | 80 GB | HBM3 | 700W | ~2.5x | ~$25-30K |
| H200 | 2024 | 141 GB | HBM3e | 700W | ~3.0x | ~$30-35K |
| B200 | 2025 | 192 GB | HBM3e | 1,000W | ~6.0x | ~$35-40K |
| GB200 | 2025 | 384 GB | HBM3e | 2,700W | ~12x | ~$60-70K |
| Rubin | 2027E | 256-384 GB | HBM4 | ~1,000-1,200W | ~15-20x | ~$40-50K |

Fleet Mix Evolution

Step 4: Calculate GPU-Hours Needed

Convert token demand into GPU-hours using throughput data, then apply real-world dampening factors.

Formula

Effective_Tokens(Y, B) = Annual_Tokens(Y) × Visible_Share(B) × Thinking_Multiplier(B)
Dampened_Tokens(Y, B) = Effective_Tokens(Y, B) × Dampening(Y)
GPU_Seconds(Y, B) = Dampened_Tokens(Y, B) / Batched_Throughput(B, Y)
GPU_Hours(Y, B) = GPU_Seconds(Y, B) / 3600

Dampening Factors

| Factor | Value | Rationale |
|---|---|---|
| Enterprise share | 0.45 | Not all tokens need dedicated GPUs; consumer use on shared platforms is more efficient |
| Cache/dedup | 0.80 | KV-cache reuse and semantic caching reduce computation by ~20% |
| Inference optimization | 0.37 (by 2030) | Speculative decoding, quantization, continuous batching compound annually |

Worked Example: 2025 (Base Case)

Bucket D — Reasoning (Largest GPU consumer)
Effective tokens = 629T × 0.35 × 3.5 = 770.5T
Dampened tokens = 770.5T × 0.360 = 277.4T
GPU-seconds = 277.4T / 2,000 tok/s = 138,690M
GPU-hours = 138,690M / 3,600 = 38.5M GPU-hours
Non-reasoning GPU-hours = 3.8M
Total 2025 = 42.3M GPU-hours
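A minimal sketch that reproduces the worked example, using only the document's stated inputs (the 2025 dampening of 0.360 is enterprise share × cache/dedup):

```python
# Reproduce the 2025 Bucket D (reasoning) worked example.
ANNUAL_TOKENS_T = 629      # trillions of tokens, 2025
VISIBLE_SHARE = 0.35       # reasoning bucket's visible token share
THINKING_MULT = 3.5        # hidden thinking-token multiplier (assumption A3)
DAMPENING = 0.45 * 0.80    # enterprise share x cache/dedup = 0.360
THROUGHPUT = 2_000         # batched tok/s per GPU for reasoning models (2025)

effective_t = ANNUAL_TOKENS_T * VISIBLE_SHARE * THINKING_MULT   # ~770.5T
dampened_t = effective_t * DAMPENING                            # ~277.4T
gpu_seconds = dampened_t * 1e12 / THROUGHPUT                    # token count / tok/s
gpu_hours_m = gpu_seconds / 3600 / 1e6                          # millions of GPU-hours
print(f"Reasoning GPU-hours (2025): {gpu_hours_m:.1f}M")        # ~38.5M
```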

GPU-Hours by Year (Recalculated with Hardware Improvements)

| Year | Tokens (T) | Dampening | Reasoning Throughput (tok/s) | Reasoning GPU-hrs | Other GPU-hrs | Total |
|---|---|---|---|---|---|---|
| 2025 | 629 | 0.360 | 2,000 | 38.5M | 3.8M | 42.3M |
| 2026 | 1,302 | 0.288 | 2,800 | 46.9M | 4.7M | 51.6M |
| 2027 | 2,213 | 0.216 | 3,600 | 46.6M | 4.7M | 51.3M |
| 2028 | 3,320 | 0.173 | 4,800 | 41.9M | 4.2M | 46.1M |
| 2029 | 4,316 | 0.148 | 5,500 | 40.7M | 4.1M | 44.8M |
| 2030 | 5,179 | 0.133 | 6,500 | 37.1M | 3.7M | 40.8M |

Step 5: Convert to Total GPU Fleet Size

The final step: translate GPU-hours into the total number of GPUs that must be deployed globally.

Formula

Available_Hours/GPU/Year = 8,760 (24 hours × 365 days)
Utilization_Rate(Y) = 65% (2025) → 80% (2030)
Effective_Hours(Y) = 8,760 × Utilization_Rate(Y)
Fleet_Size(Y) = Total_GPU_Hours(Y) / Effective_Hours(Y)
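The formula itself is a one-liner. The 190M GPU-hour input below is illustrative only: Step 4's 42.3M GPU-hours scaled by the ~4.5x global-market factor discussed next; the published fleet figures also fold in fleet-mix and platform-share assumptions (A7, A11).

```python
# Sketch of Step 5: convert annual GPU-hours into fleet size.
HOURS_PER_YEAR = 8_760   # 24 hours x 365 days

def fleet_size(total_gpu_hours: float, utilization: float) -> float:
    """GPUs needed to deliver total_gpu_hours in a year at the given utilization."""
    return total_gpu_hours / (HOURS_PER_YEAR * utilization)

# Illustrative input: ~190M globally scaled GPU-hours at 2025's 65% utilization.
print(f"{fleet_size(190e6, 0.65):,.0f} GPUs")
```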

Adjusting for Global Market Scope

OpenRouter = ~22% of Global Token Volume

The 12.1T tokens/week figure represents one platform. Total global inference (OpenAI, Anthropic, Google, Microsoft, enterprise private) is estimated at 3-5x higher; at the ~4.5x midpoint, total global volume is ~55T tokens/week in 2025.

GPU Fleet Projection — Three Scenarios

| Year | Conservative | Base | Optimistic |
|---|---|---|---|
| 2025 | 2.8M | 3.8M | 5.0M |
| 2026 | 3.8M | 5.2M | 7.1M |
| 2027 | 4.8M | 6.7M | 9.5M |
| 2028 | 5.8M | 8.4M | 12.3M |
| 2029 | 7.0M | 10.5M | 16.8M |
| 2030 | 8.5M | 14.2M | 22.5M |

Training vs Inference Fleet Split

Training GPU demand flattens while inference nearly quadruples. The structural shift from training to inference is the primary driver of sustained GPU demand growth. By 2030, inference is 75% of the total AI GPU fleet.

NVIDIA Revenue Trajectory

Decomposing the $607B enterprise AI TAM into GPU-addressable layers and projecting NVIDIA's capture.

TAM Layer Decomposition

| Layer | % of TAM | 2025 ($B) | 2030 ($B) | NVIDIA Role |
|---|---|---|---|---|
| GPU / Accelerator Hardware | 30% | $40.1 | $182.2 | Primary supplier (70-85%) |
| Cloud GPU Rental | 15% | $20.0 | $91.1 | Hardware supplier to clouds (~37%) |
| AI Software | 30% | $40.1 | $182.2 | CUDA / AI Enterprise (~6%) |
| AI Services | 25% | $33.4 | $151.8 | Indirect |

NVIDIA Revenue by Segment (Base Case)

| Year | TAM ($B) | GPU HW ($B) | Cloud ($B) | Software ($B) | Networking ($B) | Total ($B) |
|---|---|---|---|---|---|---|
| 2025 | 133.7 | 32.9 | 7.4 | 2.4 | 5.9 | 48.6 |
| 2026 | 178.3 | 42.8 | 9.9 | 3.2 | 7.7 | 63.6 |
| 2027 | 241.7 | 56.6 | 13.4 | 4.4 | 10.2 | 84.5 |
| 2028 | 326.7 | 74.5 | 18.1 | 5.9 | 13.4 | 111.9 |
| 2029 | 445.3 | 99.0 | 24.7 | 8.0 | 17.8 | 149.5 |
| 2030 | 607.2 | 142.1 | 33.7 | 10.9 | 25.6 | 212.3 |
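The 2025 Base Case row can be reconstructed from the TAM-layer split. The capture rates used here (82% of GPU hardware, 37% of cloud rental, 6% of software, networking at ~18% of GPU revenue) are inferred from the tables above rather than stated as a single formula in the source.

```python
# Sketch of the 2025 revenue decomposition (Base Case row above).
TAM_2025 = 133.7                      # $B, enterprise AI TAM in 2025
gpu_hw     = TAM_2025 * 0.30 * 0.82   # 30% TAM layer x ~82% NVIDIA share -> ~32.9
cloud      = TAM_2025 * 0.15 * 0.37   # hardware supplied to cloud renters -> ~7.4
software   = TAM_2025 * 0.30 * 0.06   # CUDA / AI Enterprise capture       -> ~2.4
networking = gpu_hw * 0.18            # ~18% of GPU system revenue         -> ~5.9

total = gpu_hw + cloud + software + networking
print(f"2025 NVIDIA data center revenue: ${total:.1f}B")   # ~$48.6B
```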

Revenue Scenarios

| Year | Conservative ($B) | Base ($B) | Optimistic ($B) |
|---|---|---|---|
| 2025 | 42 | 49 | 55 |
| 2026 | 52 | 64 | 76 |
| 2027 | 66 | 85 | 106 |
| 2028 | 82 | 112 | 147 |
| 2029 | 108 | 150 | 201 |
| 2030 | 145 | 212 | 260 |

Revenue Scenarios and Strategic Positioning

Three revenue scenarios bracket NVIDIA's trajectory. The bear case projects $145 billion in data center revenue by 2030, driven by a capex pullback and accelerated ASIC adoption. The base case reaches $212 billion, assuming sustained enterprise adoption and moderate competitive erosion. The bull case hits $260 billion on the back of AGI-driven training surges and sovereign AI acceleration. In all three scenarios, NVIDIA remains the dominant AI infrastructure company — the question is not whether it grows, but by how much.

The Blackwell generation, shipping in volume through 2025, represents the critical inflection. Its 10x per-token cost reduction over Hopper, combined with the GB200 NVL72 rack-scale architecture, repositions NVIDIA from selling individual GPUs to selling integrated AI infrastructure at $2 million per rack. This shift from component to system sales increases NVIDIA's share of datacenter wallet and makes it harder for competitors to displace individual components.

NVIDIA Ecosystem Partners

The NVIDIA GPU ecosystem generates massive revenue for cloud providers, memory suppliers, server OEMs, and networking companies.

Cloud Providers (AWS, Azure, GCP)

$91.1B: Cloud GPU rental market by 2030 (from $20.1B in 2025)
Share: AWS 35%, Azure 30%, GCP 20%, Others 15%

HBM Memory Suppliers

2,380 PB: Annual HBM demand by 2030 (7M new GPUs × 340GB avg)
Share: SK Hynix 50%, Samsung 35%, Micron 15%

Networking (Mellanox / InfiniBand)

$25.6B: NVIDIA networking revenue by 2030 (~18% of GPU system cost)

NVLink: 900 GB/s (H100) → 1.8 TB/s (B200)

InfiniBand: 400 Gb/s per port, dominates AI workloads

Server OEMs

$170B: Non-GPU system integration revenue by 2030
Share: Dell 25%, Supermicro 20%, HPE 18%, Lenovo 12%

Software Ecosystem

$10.9B: NVIDIA software revenue by 2030 (AI Enterprise, DGX Cloud, Omniverse)

CUDA: 4M+ developers, ~15 years of ecosystem

TensorRT: 2-5x inference speedup

vLLM: 3.7-24x throughput vs TGI

Edge / Embedded (Jetson)

$7.3B: Edge AI inference market by 2030 (~8% of total inference)

Orin Nano: ~4B models, 15W

AGX Orin: ~20B models, 60W

57% of enterprise tasks run sub-7B models — edge-deployable

Data Center Capacity Requirements

Converting GPU fleet size into power demand, data center count, and geographic distribution.

Power Consumption by GPU Generation

| GPU | TDP (GPU) | System Power (×2.2) | Per Rack (8 GPUs) |
|---|---|---|---|
| A100 SXM | 400W | 880W | 7.0 kW |
| H100 SXM | 700W | 1,540W | 12.3 kW |
| B200 | 1,000W | 2,200W | 17.6 kW |
| GB200 NVL72 | 72 GPUs per rack | n/a | 120 kW / rack |
| Rubin (est.) | 1,000-1,200W | 2,200-2,640W | 17.6-21.1 kW |

Total AI Power Demand (Base Case)

| Year | Total GPU Fleet | Avg System Power | Total Power (GW) | Annual Energy (TWh) |
|---|---|---|---|---|
| 2025 | 9.5M | 1,400W | 13.3 GW | ~87 |
| 2026 | 9.5M | 1,700W | 16.2 GW | ~106 |
| 2027 | 10.8M | 1,900W | 20.5 GW | ~135 |
| 2028 | 12.5M | 2,100W | 26.3 GW | ~173 |
| 2029 | 14.6M | 2,200W | 32.1 GW | ~211 |
| 2030 | 18.9M | 2,300W | 43.5 GW | ~286 |

Context: 286 TWh/year ≈ 62% of current global data center consumption.

Global data center electricity was ~460 TWh in 2024 (IEA). AI inference alone could add 286 TWh by 2030. This is directionally consistent with IEA and Goldman Sachs estimates of data center power doubling by 2030. Power availability is increasingly the binding constraint on GPU deployment — not chip supply.
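The 2030 row's arithmetic can be checked directly. The ~75% duty-cycle factor below is inferred here to reconcile installed GW with the table's TWh figures; it is not stated explicitly in the source tables.

```python
# Sketch of the power/energy arithmetic behind the 2030 base-case row.
FLEET = 18.9e6       # total GPUs, 2030 base case
SYSTEM_W = 2_300     # avg system power per GPU (GPU TDP x ~2.2 facility overhead)
DUTY_CYCLE = 0.75    # inferred average utilization implied by the TWh column

power_gw = FLEET * SYSTEM_W / 1e9                     # installed power
energy_twh = power_gw * 8_760 * DUTY_CYCLE / 1_000    # GW x hours -> TWh
print(f"{power_gw:.1f} GW installed, ~{energy_twh:.0f} TWh/year")
```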

New Data Centers Needed (2025-2030)

| Year | Power Growth (GW) | Avg DC Capacity (MW) | New DCs Needed |
|---|---|---|---|
| 2025 | +3.0 | 200 | ~15 |
| 2026 | +2.9 | 250 | ~12 |
| 2027 | +4.3 | 300 | ~14 |
| 2028 | +5.8 | 400 | ~15 |
| 2029 | +5.8 | 450 | ~13 |
| 2030 | +11.4 | 500 | ~23 |
| Total | | | ~92 |
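The per-year counts follow from dividing annual power growth by average facility size (round-half-up reproduces the table's figures):

```python
# New data centers per year = power growth / average DC capacity.
growth_gw = {2025: 3.0, 2026: 2.9, 2027: 4.3, 2028: 5.8, 2029: 5.8, 2030: 11.4}
avg_dc_mw = {2025: 200, 2026: 250, 2027: 300, 2028: 400, 2029: 450, 2030: 500}

# int(x + 0.5) rounds halves up (Python's round() would round 14.5 down to 14)
new_dcs = {y: int(growth_gw[y] * 1_000 / avg_dc_mw[y] + 0.5) for y in growth_gw}
print(new_dcs, "total:", sum(new_dcs.values()))
```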

Geographic Distribution (Estimated)

United States: 45-50%
Asia-Pacific: 20-25%
Europe: 15-20%
Middle East: 5-10%

Competitive Landscape

NVIDIA faces growing competition from AMD, Google TPUs, and hyperscaler custom ASICs. But the CUDA moat remains formidable.

Current Market Share (2025)

NVIDIA: ~82%
AMD: ~10%
Google TPU: ~4%
Custom ASICs: ~3%
Others: ~1%

NVIDIA Competitive Moat

| Moat Factor | Strength | Durability | Risk |
|---|---|---|---|
| CUDA ecosystem | Very Strong | 5-7 years | ROCm + Triton narrowing gap |
| NVLink / networking | Strong | 3-5 years | Ultra Ethernet Consortium |
| Full-stack integration | Very Strong | 5+ years | No competitor has full stack |
| Developer mindshare | Very Strong | 5+ years | 4M+ CUDA devs; hard to retrain |
| TensorRT optimization | Strong | 3-4 years | vLLM/SGLang are open alternatives |

2030 Market Share Projections

| Vendor | Conservative | Base | Optimistic |
|---|---|---|---|
| NVIDIA | 65% | 78% | 85% |
| AMD | 18% | 12% | 8% |
| Google TPU | 5% | 4% | 3% |
| Custom ASICs | 10% | 5% | 3% |
| Others | 2% | 1% | 1% |

Risk Factors

Key risks that could materially alter the projections in either direction.

Model Efficiency Improvements

MoE architectures (DeepSeek V3: 671B params, only 37B active), distillation (fine-tuned 7B beats GPT-4 on 85% of narrow tasks), and RouteLLM routing (74% redirected at 95% quality) could reduce per-token GPU demand faster than modeled. If efficiency improves 50%/year instead of 30%, 2030 fleet drops to ~8.5M.

Supply Chain Constraints

TSMC dependency (all NVIDIA GPUs on 4nm), HBM shortage (30-40% capacity growth may lag 40-50% demand), CoWoS packaging bottleneck. Could constrain GPU production and slow fleet expansion.

Geopolitical / Export Controls

US export controls restrict AI chip sales to China (~25-30% of potential market). Escalation could fragment the market and reduce NVIDIA's addressable base by 10-20%.

Power Availability

44 GW by 2030 requires massive grid expansion. Nuclear deals take 7-10 years. Many regions (N. Virginia, Dublin, Singapore) are already capacity-constrained. Power could become the binding constraint.

Alternative Architectures

Photonic computing (Lightmatter), neuromorphic chips (Intel Loihi), and custom inference ASICs promise higher efficiency but are 3-5 years from production scale. Not a 2025-2030 threat.

Assumptions Register

Every assumption used in this analysis, with confidence and sensitivity ratings. Transparency enables the reader to adjust projections to their own views.

| ID | Assumption | Value | Confidence | Sensitivity | Used In |
|---|---|---|---|---|---|
| A1 | Token growth deceleration rates | 2.07x → 1.2x | Medium | Very High | Step 1 |
| A2 | Enterprise share of global tokens | 45% | Medium | High | Step 4 |
| A3 | Reasoning thinking token multiplier | 3.5x | Low-Med | Very High | Step 2 |
| A4 | Batched throughput multiplier | 5x | Medium | High | Step 3 |
| A5 | GPU ASPs by generation | Varies | Med / Low | High (rev) | Revenue |
| A6 | Rubin throughput improvement | 2.0x over B200 | Low | Medium | Step 3 |
| A7 | GPU fleet mix by year | See table | Low | Medium | Steps 3-5 |
| A8 | Cache / dedup reduction | 20% | Low-Med | Medium | Step 4 |
| A9 | Annual inference optimization | 15-25%, decelerating | Medium | High | Step 4 |
| A10 | GPU utilization rate | 65% → 80% | Med-High | Medium | Step 5 |
| A11 | OpenRouter as % of global tokens | 22% | Low | Very High | Step 5 |
| A12 | TAM layer splits | 30/15/30/25 | Medium | High | Revenue |
| A13 | Cloud provider market shares | AWS 35%, etc. | Medium | Medium | Ecosystem |
| A14 | HBM supplier shares | SK Hynix 50% | Med-High | Low | Ecosystem |
| A15 | PUE for AI data centers | 1.3 | High | Low | Data Centers |

The CUDA Moat and Its Limits

NVIDIA's competitive moat is widest in training and narrowest in inference. For training, the CUDA ecosystem, NVLink interconnect, and PyTorch-first framework integration create switching costs that no competitor has overcome. For inference, the calculus is different: models are fixed, workloads are predictable, and purpose-built ASICs can deliver 2–3x better efficiency for specific architectures. This structural distinction explains why custom ASICs gain share fastest in inference, rising from ~3% of the market today toward 5-10% by 2030 in our scenarios, while barely denting NVIDIA's training dominance. For strategic decision-makers, the implication is clear: NVIDIA remains essential for training and general-purpose inference, while specialized inference at scale increasingly rewards alternative hardware. The full GPU demand context is covered in Chapter 16: The $1.7T Compute Stack.