Bottom-Up Inference Demand Model

296 Trillion Tokens Per Day

Decomposing global AI inference demand by segment, platform, model type, and GPU requirement. Cross-validated against the GPU Demand Forecast.

2.46T Tokens/Day (2025)
34.7% Reasoning Share 2030
1.3M Provisioned GPUs 2030
1.4x Bottom-Up vs Measured
45x Installed Base / Demand
Part V — The Infrastructure Arms Race
Chapter 18: Bottom-Up Inference Demand

296 trillion tokens per day by 2030. This number is not extrapolated from a trend line — it is built bottom-up from seven consumer platforms, five enterprise copilot products, eight industry verticals, and seven developer API providers, each modeled independently by users, queries per day, tokens per query, and model mix.
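
A minimal sketch of that per-product accounting, in Python. The GitHub Copilot inputs below are this chapter's figures; treating every one of the 20 million developers as active daily is a simplification, so the output is illustrative rather than the model's exact estimate (the full model also applies engagement ratios and a model mix).

```python
# Minimal sketch of the bottom-up accounting: tokens/day for one product equals
# active users x queries (or completions) per day x tokens per query.
# Simplifying assumption: all 20M Copilot developers are active every day.

def daily_tokens(active_users: float, queries_per_day: float, tokens_per_query: float) -> float:
    """Tokens per day contributed by a single platform or product."""
    return active_users * queries_per_day * tokens_per_query

github_copilot = daily_tokens(
    active_users=20e6,      # developers (chapter figure)
    queries_per_day=150,    # code completions per developer per day (chapter figure)
    tokens_per_query=130,   # tokens per completion (chapter figure)
)

print(f"GitHub Copilot ~ {github_copilot / 1e9:.0f}B tokens/day")  # ~390B with these inputs
```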

The GPU Demand Forecast projected 200 trillion tokens per day by 2030 using top-down methods. This companion model arrives at 296 trillion — roughly 1.5x higher — and the gap itself is informative. Approximately 30–40% of actual inference runs through channels that top-down API measurements miss: embedded SaaS AI (Microsoft 365 Copilot tokens generated inside the platform, never touching a tracked API endpoint), self-hosted open source (vLLM and Ollama clusters running Llama and Mistral on enterprise GPUs, completely invisible to commercial API monitoring), on-device processing (Apple Intelligence, local Ollama, edge TPU inference), and custom enterprise pipelines calling fine-tuned models on private infrastructure.

Three segments drive inference demand. Developer and API traffic dominates in 2025 with 47% of total tokens, powered by platform infrastructure and the emerging category of agent-to-agent communication. Enterprise inference contributes 35%, split between copilot products like GitHub Copilot (20 million developers generating 150 code completions per day) and vertical-specific workloads in finance, legal, and healthcare where reasoning models consume 30–42% of tokens. Consumer inference, at 18% in 2025, grows fastest, expanding 188x from 2025 to 2030, driven by AI search replacing traditional search, agent loops multiplying queries 10–100x per user action, and multimodal inputs increasing token counts 3–5x. The model mix evolution is as important as the volume growth: reasoning models rise from 8.8% to 34.7% of inference tokens, but their 5x compute overhead means over half of all inference GPU cycles serve reasoning workloads by 2030.
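
The claim that reasoning absorbs over half of GPU cycles follows directly from the token share and the compute multiplier. A quick check, assuming a flat 5x multiplier for reasoning tokens and 1x for everything else (a simplification of the full five-type model mix):

```python
# Why a 34.7% reasoning token share can consume well over half of inference GPU
# cycles. Assumption: reasoning tokens cost 5x the compute of other tokens.

reasoning_share_2030 = 0.347
reasoning_compute_multiplier = 5.0

weighted_reasoning = reasoning_share_2030 * reasoning_compute_multiplier
weighted_other = (1 - reasoning_share_2030) * 1.0
gpu_cycle_share = weighted_reasoning / (weighted_reasoning + weighted_other)

print(f"Reasoning share of GPU cycles ~ {gpu_cycle_share:.0%}")  # ~73%
```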

900M ChatGPT Weekly Active Users
6B/min OpenAI Total Token Rate
540K Tokens/Worker/Day
<30% Industry GPU Utilization

1. Consumer Inference by Platform

Tokens per day (trillions) across 7 consumer AI platforms, 2024-2030E

Consumer inference grows 450x from 2024 to 2030. ChatGPT leads in tokens but Meta AI has the steepest growth curve via messaging app distribution.

2. Enterprise Inference by Category

Tokens per day (trillions) across copilot products, custom deployments, and vertical AI

Enterprise inference is driven by copilot products in 2025 but custom enterprise deployments overtake by 2028 as enterprises build proprietary AI pipelines.

3. Developer/API Inference by Provider

Tokens per day (trillions) across commercial APIs and self-hosted open source

Open-source self-hosted inference is the hidden giant — invisible to API tracking, running on enterprise GPUs via vLLM, Ollama, and TGI.

4. Total Inference Demand by Segment

Stacked area showing consumer, enterprise, and developer/API contribution to total tokens/day

Developer/API dominates in 2024-2025 (47% share) but enterprise grows to 30% and consumer to 28% by 2030 as adoption broadens beyond developers.

The Measurement Gap and the Dark Pool

A persistent gap exists between bottom-up and top-down token estimates: 3.0x in 2024, narrowing to 1.5x by 2030. This gap is not a model error but a measurement limitation. The top-down estimate, derived from a16z and OpenRouter data, captures tokens flowing through tracked commercial API endpoints. It systematically undercounts embedded SaaS AI, self-hosted open-source inference, on-device processing, and internal enterprise pipelines. As more enterprise AI migrates to tracked cloud APIs and monitoring improves, the gap narrows — but a "dark pool" of self-hosted inference processing an estimated 20 trillion tokens per day by 2030 will remain permanently invisible to API tracking.

5. Model Mix Evolution

Distribution of inference tokens across 5 model types, 2024-2030E

Commercial Small models peak at 40.6% in 2024-2025, then yield share to Reasoning (34.7%) and Fine-Tuned (28.9%) by 2030. Large models collapse from 33.6% to 6.0%.

6. GPU Demand by Segment

Inference GPU demand (thousands of H100-equivalents) by consumer, enterprise, and developer

Enterprise is more GPU-intensive per token than consumer due to higher reasoning model usage (30-35% in finance/legal vs 5% in consumer).
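
A rough quantification of that intensity gap, assuming the chapter's 5x reasoning multiplier and treating all non-reasoning tokens as 1x compute:

```python
# Relative GPU compute per token for a segment, given its reasoning token share.
# Assumption: reasoning tokens cost 5x the compute of non-reasoning tokens.

def compute_per_token(reasoning_share: float, multiplier: float = 5.0) -> float:
    return reasoning_share * multiplier + (1 - reasoning_share) * 1.0

enterprise = compute_per_token(0.32)   # ~30-35% reasoning in finance/legal
consumer = compute_per_token(0.05)     # ~5% reasoning in consumer

print(f"Enterprise needs ~{enterprise / consumer:.1f}x the GPU compute per token")  # ~1.9x
```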

The 45x Overcapacity Paradox

The bottom-up model computes 1.3 million provisioned GPUs needed for inference by 2030 (a 321,000-GPU theoretical minimum multiplied by a 4x provisioning overhead for redundancy, latency SLAs, geographic distribution, and model diversity). The installed base for inference by 2030 is projected at 57.6 million H100-equivalents, a 45x overcapacity. This gap explains a seeming paradox: how can "GPU scarcity" coexist with aggregate overcapacity? Because the installed base is fragmented across thousands of providers, each running different models at low utilization. The scarcity is local and model-specific, not aggregate. Companies solving the allocation problem through better orchestration, continuous batching, and multi-model serving may capture more value than those simply adding supply.
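
The arithmetic behind the 45x figure, using the chapter's 2030 base-case values:

```python
# Provisioning math behind the 45x overcapacity figure (chapter base case, 2030).

theoretical_min_gpus = 321_000        # H100-equivalents at full utilization
provisioning_overhead = 4.0           # redundancy, latency SLAs, geography, model diversity
installed_base = 57_600_000           # projected inference-allocated installed base

provisioned = theoretical_min_gpus * provisioning_overhead   # ~1.28M
overcapacity = installed_base / provisioned                  # ~45x

print(f"Provisioned: {provisioned / 1e6:.2f}M GPUs, overcapacity: {overcapacity:.0f}x")
```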

7. GPU Supply vs Inference Demand

Bottom-up provisioned demand vs top-down forecasts (K H100-eq, log scale)

The 45x gap between provisioned demand (1.3M) and installed base (57.6M) by 2030 quantifies the AI capex arms race. Hyperscalers build for strategic positioning, not just token demand.

8. Inference Cost Evolution

Cost per million tokens by model type (log scale), 2024-2030E

All model types follow parallel cost decline curves. Reasoning maintains a 10-50x premium over fine-tuned models, but even reasoning drops below $1/M tokens by 2030.

9. Tokens Per Query by Use Case

Average tokens per query/completion across different platforms and products

A 9.2x spread: Perplexity deep research averages 1,200 tokens vs GitHub Copilot completions at 130 tokens. Platform mix matters as much as user count.

The Hidden Variable: Token Intensity per Query

The 9.2x spread in tokens per query across platforms, from Perplexity's 1,200-token deep-research completions to GitHub Copilot's 130-token code suggestions, reveals why user counts alone are a misleading proxy for inference demand. A single Perplexity power user generating 5 deep-research queries per day consumes roughly as many tokens as 46 GitHub Copilot completions. Claude's 800-token average and ChatGPT's 700-token average reflect the growing dominance of extended, multi-turn conversations over simple Q&A, a trend that intensifies as reasoning models, which generate 5–10x more internal tokens than visible output, become the default for complex tasks. The consumer-wide average grows from 532 tokens per query in 2025 to 1,267 by 2030, a 2.4x increase that compounds on top of user growth and query frequency expansion.
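
A quick check of that per-user comparison, using the per-query figures cited above:

```python
# One research-heavy user versus lightweight code-completion traffic,
# expressed in GitHub Copilot completion-equivalents.

perplexity_user = 5 * 1_200          # 5 deep-research queries/day at 1,200 tokens each
copilot_completion = 130             # tokens per GitHub Copilot completion

equivalent_completions = perplexity_user / copilot_completion
print(f"One Perplexity power user ~ {equivalent_completions:.0f} Copilot completions/day")  # ~46
```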

This token intensity gradient carries direct implications for GPU provisioning. Platforms skewing toward research-heavy, reasoning-intensive workloads (Perplexity, Claude) require disproportionately more GPU capacity per user than high-volume, low-complexity platforms (Meta AI at 200 tokens, Microsoft Copilot Free at 280 tokens). As the industry shifts from simple chatbot interactions to agentic workflows — where a single user action triggers chains of 10–100 subordinate queries, each generating 800–1,200 tokens — the effective token demand per engaged user could increase by an order of magnitude. The platforms that win the inference economics race will be those that route intelligently across model tiers: reserving expensive reasoning capacity for the queries that demand it, while handling the long tail with efficient small and fine-tuned models at one-tenth the cost.
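
A sketch of what tier-aware routing implies for blended cost. The router, its complexity threshold, the workload split, and the per-million-token prices below are all placeholder assumptions (chosen to reflect roughly a 10x gap between reasoning and small models), not this chapter's cost model; only the structure of the comparison is the point.

```python
# Hypothetical sketch: route queries to a model tier by estimated complexity,
# then compare blended cost against sending everything to the reasoning tier.
# Prices, threshold, and workload mix are assumptions for illustration only.

PRICE_PER_M_TOKENS = {"reasoning": 5.00, "small": 0.50}   # $/1M tokens (assumed)

def route(complexity: float) -> str:
    """Send only genuinely hard queries (complexity > 0.8) to the reasoning tier."""
    return "reasoning" if complexity > 0.8 else "small"

# Assumed workload: 10% hard queries, 90% routine, 800 tokens per query.
queries = [0.9] * 10 + [0.3] * 90
tokens_per_query = 800

routed_cost = sum(tokens_per_query / 1e6 * PRICE_PER_M_TOKENS[route(c)] for c in queries)
all_reasoning_cost = len(queries) * tokens_per_query / 1e6 * PRICE_PER_M_TOKENS["reasoning"]

print(f"Routed: ${routed_cost:.3f} vs all-reasoning: ${all_reasoning_cost:.3f} "
      f"({all_reasoning_cost / routed_cost:.1f}x saving)")
```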

10. Scenario Analysis: Total Inference Tokens

Bull/Base/Bear scenarios for total inference demand (tokens/day, log scale)

The bull-bear spread is 4.4x by 2030 (518T vs 119T tokens/day). The key variable: enterprise adoption rate and agent multiplication factor.

Bounding the Uncertainty: A 4.4x Demand Cone

The scenario analysis exposes the fragility of any single-point inference forecast. By 2030, the bull case projects 518 trillion tokens per day while the bear case yields 119 trillion — a 4.4x spread that represents over $80 billion in cumulative GPU capital expenditure variance. The two variables that drive the widest divergence are enterprise adoption rate (90% by 2028 in the bull case versus 40% in the bear) and the agent multiplication factor (100x queries per user action versus 3x). Together, these account for roughly 7.5x of demand swing before efficiency offsets compress the range. Notably, all three scenarios share identical 2024–2025 actuals, meaning the divergence is entirely forward-looking: the market has not yet generated the data needed to distinguish which trajectory is correct.

The asymmetry of consequences matters more than the point estimate. In the bull case, provisioned GPU demand reaches 2.2 million H100-equivalents — still far below the projected 57.6 million inference-allocated installed base, implying 26x overcapacity. In the bear case, provisioned demand of 514,000 GPUs against the same installed base implies 112x overcapacity and a potential write-down cycle rivaling the fiber-optic overbuild of 2000–2002. For investors, the base case of 296 trillion tokens per day is less important than the shape of the demand cone: even the optimistic scenario does not justify current capex trajectories on a pure inference-demand basis. The strategic rationale for building ahead of demand — training capacity, competitive positioning, sovereign AI mandates — must carry the economic argument where inference volumes alone cannot.
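
The overcapacity ratios under each scenario, holding the 57.6 million inference-allocated installed base fixed:

```python
# Overcapacity by 2030 scenario, using the chapter's provisioned-GPU figures
# against the same projected installed base.

installed_base_k = 57_600                                   # thousands of H100-equivalents
provisioned_k = {"bull": 2_200, "base": 1_284, "bear": 514} # base = 321K x 4 overhead

for name, demand_k in provisioned_k.items():
    print(f"{name}: {installed_base_k / demand_k:.0f}x overcapacity")
# bull ~26x, base ~45x, bear ~112x
```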

11. Consumer Platform Growth Trajectories

Monthly active users (millions) for 7 consumer AI platforms

ChatGPT leads in absolute users, but Meta AI's distribution advantage (WhatsApp, Instagram, Facebook) gives it the steepest growth trajectory to 3B MAU by 2030.

Distribution as Destiny: The Consumer Platform Race

The consumer AI platform growth trajectories reveal a market where distribution advantages are compounding faster than product advantages. ChatGPT leads with 900 million MAU in 2025 and projects to 3 billion by 2030, but Meta AI’s embedded distribution across WhatsApp, Instagram, and Facebook gives it a steeper growth slope — from 700 million to 3 billion MAU over the same period — without requiring users to adopt a new application. Google’s Gemini follows a similar embedded strategy through AI Overviews in Search, scaling from 650 million to 2.8 billion MAU. The standalone AI applications (Claude at 100 million to 800 million, Perplexity at 45 million to 500 million) grow faster in percentage terms but face a structural ceiling: they must earn every user through deliberate adoption, while the embedded players inherit billions of existing sessions.

The strategic implication runs deeper than market share. Embedded platforms generate lower tokens per query (Meta AI at 200 tokens versus Claude at 800) but dramatically higher query frequency at population scale. Meta AI's 40 million DAU in 2025, generating just 2 queries per day at 200 tokens each, produces 16 billion tokens daily, a fraction of ChatGPT's 182 billion. But by 2030, Meta AI's 900 million DAU at 12 queries per day and 700 combined tokens per query yields 7.56 trillion tokens daily, closing much of the relative gap with ChatGPT's 37.4 trillion despite generating far less value per interaction. This pattern, in which low-intensity, high-breadth distribution closes the raw token-volume gap with high-intensity, narrow-distribution products, will shape which infrastructure providers benefit most. The winners in inference infrastructure will be those who can serve massive volumes of small-model, low-latency requests (the Meta AI pattern) while simultaneously handling smaller volumes of compute-intensive reasoning chains (the Claude and Perplexity pattern).
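
Re-deriving the Meta AI figures quoted above from DAU, query frequency, and tokens per query:

```python
# Meta AI token volume from the chapter's engagement inputs.

def tokens_per_day(dau: float, queries_per_day: float, tokens_per_query: float) -> float:
    return dau * queries_per_day * tokens_per_query

meta_2025 = tokens_per_day(dau=40e6, queries_per_day=2, tokens_per_query=200)
meta_2030 = tokens_per_day(dau=900e6, queries_per_day=12, tokens_per_query=700)

print(f"Meta AI 2025 ~ {meta_2025 / 1e9:.0f}B tokens/day")    # 16B
print(f"Meta AI 2030 ~ {meta_2030 / 1e12:.2f}T tokens/day")   # 7.56T
```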

12. Segment Deep Dive

Detailed breakdowns by segment for 2025

2025 Segment Composition: A Market in Transition

The segment deep-dive for 2025 captures a market at an inflection point. Developer and API inference commands 47% of total tokens (1.16 trillion per day), reflecting the dominance of platform-layer infrastructure — OpenAI’s API alone processes 350 billion tokens daily, with Azure OpenAI adding another 200 billion. Enterprise contributes 35% (853 billion tokens per day), split unevenly: copilot products like M365 Copilot and GitHub Copilot account for 714 billion of that total, while custom enterprise deployments and vertical-specific workloads remain nascent at 90 billion and 49 billion respectively. Consumer, despite commanding the most public attention, generates just 18% of tokens (442 billion per day). This composition will shift materially by 2030 as enterprise custom deployments scale from near-zero to 3.85 trillion tokens daily and consumer platforms multiply their engaged user bases by 10–15x.

The model mix within each segment tells an equally important story. Consumer inference in 2025 skews heavily toward commercial small models (40–48% of tokens across most platforms), reflecting the cost discipline required at consumer price points. Enterprise verticals tell a different story: finance and legal already allocate 30–35% of their inference tokens to reasoning models, accepting the 5x compute premium because accuracy on regulatory and contractual analysis justifies the cost. The developer segment sits between these poles, with open-source self-hosted inference (vLLM, Ollama, TGI) processing 180 billion tokens per day entirely outside commercial API tracking — a “dark pool” that will grow to 20 trillion tokens daily by 2030. For GPU manufacturers, this segmentation reveals that the future inference market is not one workload but three distinct demand profiles, each with different model mixes, latency requirements, and willingness to pay — and optimizing for all three simultaneously is the central hardware design challenge of the next five years.
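
A consistency check on the 2025 composition, comparing the segment totals quoted above against the 2.46T headline figure:

```python
# 2025 segment decomposition check against the headline total (small rounding
# differences are expected, since the segment figures are rounded).

total_2025_t = 2.46   # trillion tokens/day (headline)
segments_t = {"developer/api": 1.16, "enterprise": 0.853, "consumer": 0.442}

print(f"Sum of segments: {sum(segments_t.values()):.2f}T vs headline {total_2025_t}T")
for name, tokens_t in segments_t.items():
    print(f"{name}: {tokens_t / total_2025_t:.0%}")
# developer/api ~47%, enterprise ~35%, consumer ~18%
```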

Reference: Aggregate Demand Summary

Key metrics by year across all segments

Year | Consumer (T/day) | Enterprise (T/day) | Developer (T/day) | Total (T/day) | Theoretical GPU (K) | Overhead | Provisioned GPU (K) | Installed Base (K) | Inference % of Total

Strategic Implications

Three conclusions follow from the bottom-up model. For GPU manufacturers: optimize for inference diversity, not just throughput. The model mix evolution shows reasoning growing to 35% of tokens with 5x compute overhead, meaning inference GPUs need large memory for reasoning chains more than raw FLOPS. For enterprises: the model mix decision is the cost decision. Choosing fine-tuned and small models over frontier models reduces compute cost by 3.2x, and the shift from 33.6% large models in 2024 to 6.0% by 2030 represents a massive cost reallocation. For investors: the 45x gap between provisioned inference demand and installed base by 2030 quantifies the AI capex bubble risk. If inference demand follows the bear case of 119 trillion tokens per day, GPU overcapacity reaches 112x, creating significant write-down exposure. The full supply-side picture is in Chapter 16: The $1.7T Compute Stack and Chapter 17: NVIDIA Infrastructure Analysis.