The Intelligence Yield Hypothesis

How intelligence per token has grown since GPT-3, and what it costs to accomplish real tasks with AI in 2025.

1,000x: cost reduction for equivalent intelligence since 2022
3.5 months: capability density doubling time (the "Densing Law")
4.95x: Intelligence Index growth from GPT-3 to o3 (validated)
Part I — The Intelligence Layer
Chapter 1: The Intelligence Yield Hypothesis

The effective intelligence delivered per dollar of AI inference has improved by approximately 1,000x since GPT-3's launch in 2020. This is the single most important number in enterprise technology today.

Consider what that means in practical terms. A task that cost $10 to accomplish with GPT-3 in 2020 can now be completed for less than a penny with comparable or superior quality. A query that consumed a dollar of compute budget two years ago now costs a fraction of a cent. The entire economics of knowledge work is being rewritten in real time, and most organizations have not yet grasped the magnitude of the shift.

This chapter introduces the Intelligence Yield Hypothesis — a framework for understanding this transformation. The hypothesis identifies three reinforcing vectors driving the improvement: architectural efficiency that delivers more capability per parameter, market price competition that reduces the cost per token, and algorithmic amplifiers like chain-of-thought reasoning and tool use that extract more useful intelligence from each token generated. The compounding of these three vectors produces a super-exponential curve — one that has no precedent in the history of computing, not even Moore's Law.

Three numbers anchor the analysis that follows. First, the 1,000x cost reduction for equivalent intelligence since GPT-3's launch, combining a 280x drop in inference costs (Stanford HAI 2025 AI Index) with a roughly 4.8x increase in benchmark capability. Second, 3.5 months — the "Densing Law" doubling time for capability density, the rate at which benchmark performance per parameter doubles. Third, the 10–100x token premium that reasoning models charge for hard problems, buying 20–40% accuracy gains on tasks that were previously impossible at any price. This trade-off creates a new frontier: paying more tokens for qualitatively new capabilities.
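As a rough sanity check, these anchor numbers compose as expected. A sketch in Python, using only figures quoted above (the product is an order-of-magnitude check, not an exact identity):

```python
# Back-of-envelope check of the chapter's anchor numbers (illustrative only).
cost_drop = 280        # inference cost reduction since GPT-3 (Stanford HAI 2025)
capability_gain = 4.8  # approximate benchmark capability increase over the same span

yield_gain = cost_drop * capability_gain  # combined intelligence-per-dollar gain
print(f"Combined yield gain: ~{yield_gain:,.0f}x")  # ~1,344x, same order as 1,000x

# Densing Law: capability density doubles every 3.5 months.
doubling_months = 3.5
annual_multiplier = 2 ** (12 / doubling_months)
print(f"Implied annual density growth: ~{annual_multiplier:.1f}x")  # ~10.8x per year
```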

For decision-makers, the implications are immediate. The cost of equivalent intelligence drops 5–10x every 12–18 months — budget accordingly. No single model is cost-optimal for all tasks — invest in intelligent routing. And entire categories of knowledge work that were impossible two years ago are now merely expensive, and will be cheap within twelve months.

The methodology underpinning these claims has been stress-tested across four independent validation approaches — cross-benchmark consistency, alternative pricing sources, sensitivity analysis, and historical calibration — all confirming the core 1,000x finding within a tight confidence interval.

Intelligence per token follows a compound growth curve driven by three reinforcing vectors: architectural efficiency (more capability per parameter), market price competition (lower cost per token), and algorithmic amplifiers (chain-of-thought, tool use, test-time compute). Their compounding produces a super-exponential improvement in intelligence-per-dollar.

1. The Intelligence Yield Curve

Intelligence Index rises while costs fall — their ratio (Intelligence Yield) grows super-linearly.

[Chart: Intelligence Index vs. Cost per Million Tokens vs. Intelligence Yield. Left axis: Intelligence Index (0–1) and normalized cost; right axis: Intelligence Yield (log scale), annotated with key inflection points.]

Key insight: While Intelligence Index grows logarithmically (approaching benchmark ceilings), cost per token declines exponentially. The result: Intelligence Yield for the best-value model has grown 394x from GPT-3 to GPT-4o-mini in just 4 years.

The chart above reveals a critical asymmetry. Intelligence Index — our composite measure of model capability across MMLU, HumanEval, and GSM8K — grows sub-linearly, approaching benchmark ceilings. The gap between GPT-3 and GPT-3.5 was 0.288 points; the gap between GPT-4o and o3 is just 0.041. Raw capability improvements are decelerating as models approach human-level performance on existing benchmarks.

But cost per token is declining exponentially, driven by fierce competition, hardware improvements, and architectural breakthroughs like Mixture of Experts. It is this asymmetry — logarithmic capability growth meeting exponential cost decline — that produces the super-linear Intelligence Yield curve. The practical consequence is profound: even if capability improvements slow, the yield keeps compounding because cost reductions dominate the equation.

The GPT-4 anomaly is instructive. When it launched in March 2023, it was actually more expensive than GPT-3 despite being a newer model. This happened because GPT-4 represented a generational capability leap — from 0.285 to 0.802 on the Intelligence Index — and OpenAI priced it as a premium product. The market adjusted twelve months later with GPT-4o and GPT-4o-mini, which delivered comparable or superior capability at 6–200x lower cost. This pattern — frontier models launch at premium prices, then cheaper variants follow within 12–18 months — is now a reliable feature of the market.
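The yield figures behind these comparisons are easy to reproduce. A minimal sketch, using prices and Intelligence Index scores from the reference table at the end of this chapter (yield here is index points per dollar, scaled to a million tokens):

```python
# Intelligence Yield = Intelligence Index / (price per million input tokens),
# scaled by 1e6. Figures taken from the chapter's reference table.
models = {
    "GPT-3":       (0.285, 20.00),
    "GPT-4":       (0.802, 30.00),
    "GPT-4o-mini": (0.842, 0.15),
}

def intelligence_yield(index: float, price_per_m: float) -> float:
    """Index points delivered per dollar, scaled to one million tokens."""
    return index / price_per_m * 1_000_000

yields = {name: intelligence_yield(*row) for name, row in models.items()}
print(round(yields["GPT-3"]))                          # 14250
print(round(yields["GPT-4o-mini"]))                    # 5613333
print(round(yields["GPT-4o-mini"] / yields["GPT-3"]))  # ~394x, matching the text
```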

2. Cost of Intelligence by Task Category

What does it actually cost to accomplish specific types of work?

[Chart: Cost per Task Completion Over Time. Log scale; gray indicates the task was not possible with that era's models.]

Key insight: Cost reductions are most dramatic for already-achievable tasks (Simple Q&A: 150x cheaper). But the strategic value is highest for newly possible tasks — math reasoning, complex analysis, and agentic tasks that simply could not be done at any price before 2023-2024.

The cost trajectory reveals two distinct phenomena happening simultaneously. For tasks that AI could already perform in 2020 — simple question answering, basic text generation — costs have collapsed by well over 100x while quality has dramatically improved. A simple Q&A query that cost $0.02 per 1K tokens with GPT-3 Davinci now costs $0.00015 per 1K tokens with GPT-4o-mini, a 133x reduction in raw price, and the answers are far more reliable.

But the more strategically important development is the emergence of entirely new task categories. Competition-level mathematics was impossible at any price in 2020. By 2025, o3 solves problems at 96.7% accuracy on AIME for $0.019 per problem. Complex multi-document analysis was impossible before 128K-token contexts arrived in late 2023. Agentic workflows — where AI autonomously reads, reasons, acts, and verifies across multiple steps — did not exist before 2024.

These newly possible capabilities represent the expanding frontier of AI utility. Their costs are high today relative to simple tasks, but they are falling along the same deflationary curve. What costs $2.85 per agentic task today will cost $0.30 within eighteen months. The strategic question for enterprises is not whether to adopt these capabilities, but how quickly they can build the organizational muscle to deploy them effectively before competitors do.
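The eighteen-month projection implies a specific deflation rate. A back-of-envelope sketch; the halving time below is derived from the two cost figures in the text, nothing more:

```python
import math

# Implied cost-halving time for agentic tasks, from the chapter's projection
# that a $2.85 task falls to $0.30 within eighteen months (a sketch, not a forecast).
cost_now, cost_future, months = 2.85, 0.30, 18
decline_factor = cost_now / cost_future            # ~9.5x over the window
halving_months = months * math.log(2) / math.log(decline_factor)
print(f"~{decline_factor:.1f}x decline implies a ~{halving_months:.1f}-month halving time")
```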

3. The Cost-Quality Pareto Frontier

For each task, which model offers the best quality for a given budget?

[Chart: Model Options — Cost vs. Quality, per task category. The Pareto frontier (best options) is highlighted.]

The Pareto frontier above illustrates a decision framework that every organization deploying AI should internalize. For any given task, there exists a set of model choices that offer the best available quality for a given budget — the efficient frontier. Models below the frontier are dominated: another option exists that is either cheaper at the same quality, or better at the same cost.

Two dynamics are reshaping this frontier every six months. First, the leftmost point — the cheapest acceptable model — keeps moving left as budget models improve. GPT-4o-mini and DeepSeek V3 now deliver what was frontier-tier performance at commodity prices. Second, the rightmost point — the best available capability — keeps extending right as reasoning models unlock new task classes. The frontier is simultaneously getting cheaper on the left and more capable on the right.

The practical implication is a portfolio approach to model selection. Simple classification and extraction tasks should run on the cheapest model that exceeds your accuracy threshold. Creative and analytical tasks benefit from mid-tier models that balance cost and capability. And complex reasoning or high-stakes decisions justify the premium of frontier models. The organizations that treat model selection as a routing optimization — rather than a one-size-fits-all choice — will capture 5–10x better economics.
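The frontier itself is straightforward to compute. A minimal sketch of Pareto filtering over (cost, quality) pairs; the model names, costs, and quality scores below are illustrative placeholders, not vendor quotes:

```python
# A minimal Pareto-frontier filter over (cost, quality) model options.
def pareto_frontier(options: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models not dominated by another option that is cheaper-or-equal
    AND better-or-equal, with a strict advantage on at least one axis."""
    frontier = []
    for a, (ca, qa) in options.items():
        dominated = any(
            (cb <= ca and qb >= qa) and (cb < ca or qb > qa)
            for b, (cb, qb) in options.items() if b != a
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Illustrative (cost per million tokens, quality score) pairs.
options = {
    "mini":     (0.15, 0.84),   # cheap, decent
    "mid":      (3.00, 0.93),   # balanced
    "frontier": (10.00, 0.95),  # most capable
    "legacy":   (30.00, 0.80),  # dominated: pricier and weaker than "mid"
}
print(pareto_frontier(options))  # ['mini', 'mid', 'frontier']
```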

4. Three Vectors of Intelligence Growth

Decomposing the Intelligence Yield improvement into its three contributing forces.

[Chart: Contribution to Intelligence Yield Growth. Estimated relative contribution of each vector over time; these are approximate decompositions, as the vectors are not fully separable.]

The three-vector decomposition reveals why the Intelligence Yield Curve is so steep. Each vector operates through different mechanisms and on different timescales, but they compound multiplicatively.

Architectural efficiency — the Densing Law — is the most fundamental. It reflects genuine advances in how neural networks are designed. Mixture of Experts architectures mean that a 671-billion parameter model like DeepSeek V3 activates only 37 billion parameters per token, achieving efficiency that would have been unthinkable with the dense architectures of 2020. Distillation allows smaller models to absorb the knowledge of larger ones. And training data curation — particularly the shift toward synthetic data generated by frontier models — is producing more capability per compute dollar spent on training.

Market price competition operates through a different mechanism. Four major providers (OpenAI, Anthropic, Google, Meta via open-source) are locked in a pricing war that benefits every customer. When DeepSeek released V3 at $0.28 per million input tokens — roughly 100x cheaper than GPT-4's launch price for comparable quality — it forced every competitor to reassess pricing. The open-source ecosystem, led by Meta's Llama and DeepSeek's open-weight releases, provides a permanent price ceiling that proprietary providers cannot exceed without offering clear quality differentiation.

Algorithmic amplifiers are the newest and most strategically important vector. Chain-of-thought prompting, introduced in 2022, improved reasoning accuracy by 10–40% at the cost of 2–5x more output tokens. Reasoning models like o3 take this further, spending 10–100x more tokens on hidden chain-of-thought to achieve breakthrough accuracy on hard problems. Tool use extends model capabilities beyond text generation entirely. These amplifiers are multiplicative with the other two vectors: a cheaper, more capable model combined with better reasoning strategies produces an Intelligence Yield improvement far greater than any single vector alone.

5. The Reasoning Model Multiplier

When is spending 10–100x more tokens on reasoning models justified?

[Chart: Accuracy Gain per Dollar — Reasoning vs. Standard Models. Green = high value (strong buy); red = low value (don't buy). The Reasoning Multiplier (RM) is the ratio of accuracy improvement to cost increase.]

Rule of thumb: Use reasoning models when the value of a correct answer is high and the difficulty exceeds standard model capability. Don't pay the reasoning premium for tasks where standard models already score >90%.

The reasoning model trade-off introduces a new dimension to the Intelligence Yield equation. Traditional models price intelligence per token; reasoning models price intelligence per problem. A competition math problem that costs $0.019 with o3 may consume 10,000 tokens of hidden reasoning — 100x more than a standard model — but it achieves 96.7% accuracy where the standard model achieves 57%.

The Reasoning Multiplier framework quantifies when this trade-off is justified. For competition math and complex multi-step coding, the multiplier exceeds 2.0: every additional dollar spent on reasoning returns more than a dollar of accuracy improvement. For simple Q&A and translation, the multiplier is below 0.1: reasoning models are a waste of money on tasks that standard models already handle well.
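The chapter does not spell out a formula, so the sketch below is one plausible formalization of the Reasoning Multiplier: the accuracy gain is converted to dollars via an assumed value per correct answer. Only the 57% and 96.7% accuracies and the $0.019 o3 cost come from the text; the standard-model cost and the value per correct answer are illustrative assumptions.

```python
# One plausible formalization of the Reasoning Multiplier (RM). The chapter
# gives no formula; value_per_correct dollarizes accuracy and is an assumption.
def reasoning_multiplier(acc_std, acc_reason, cost_std, cost_reason,
                         value_per_correct):
    """Expected extra value per extra dollar spent on the reasoning model."""
    extra_value = (acc_reason - acc_std) * value_per_correct
    extra_cost = cost_reason - cost_std
    return extra_value / extra_cost

# Competition math: 57% -> 96.7% accuracy at $0.019/problem (from the text);
# the $0.0002 standard-model cost and $0.10 answer value are assumed.
rm = reasoning_multiplier(0.57, 0.967, 0.0002, 0.019, value_per_correct=0.10)
print(f"RM = {rm:.2f}  (>1 means the reasoning premium pays for itself)")
```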

This has immediate implications for enterprise architecture. A well-designed system classifies incoming requests by difficulty and routes them to the appropriate tier — commodity models for the 70% of queries that are straightforward, mid-tier models for the 20% that require genuine capability, and reasoning models for the 10% where accuracy on hard problems justifies the premium. This routing architecture, explored in depth in Chapter 14, is the single highest-leverage investment an enterprise can make in its AI infrastructure.
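The savings from tiered routing follow directly from the 70/20/10 split. A sketch with assumed per-query costs (the traffic split comes from the text; the dollar figures are illustrative):

```python
# Blended cost of a tiered routing policy vs. sending everything to the top tier.
tiers = {
    "commodity": (0.70, 0.0002),  # (share of traffic, assumed cost per query)
    "mid":       (0.20, 0.0040),
    "reasoning": (0.10, 0.0400),
}
frontier_cost = 0.0400  # assumed cost if every query went to the reasoning tier

blended = sum(share * cost for share, cost in tiers.values())
print(f"Blended: ${blended:.5f}/query vs ${frontier_cost:.4f} all-frontier "
      f"({frontier_cost / blended:.1f}x cheaper)")  # ~8x, within the claimed 5-10x
```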

6. The Crossover Timeline

When does AI become cheaper than equivalent human work for each task category?

[Chart: AI vs. Human Cost Crossover Dates. Three milestones per category: first possible, cost-competitive with humans, and 10x cheaper than humans.]

The crossover timeline tells the story of AI's invasion of the knowledge economy in six acts. Simple question answering crossed the cost-competitive threshold first, in 2023 — and by 2024 was already 10x cheaper than human performance. Code generation followed in 2024, with AI-assisted development now cost-competitive for most standard programming tasks. Math reasoning, enabled by the reasoning model breakthrough, reached cost competitiveness in 2024 and will be 10x cheaper by 2025.

The later crossovers are the ones that should command strategic attention. Creative writing and complex analysis are projected to reach 10x cost advantage by 2026. Agentic tasks — multi-step workflows involving autonomous decision-making — will reach cost competitiveness by 2026 and 10x advantage by 2027. These timelines are not speculative; they are extrapolations of observed cost curves with narrow confidence intervals.

For every task category, the crossover follows the same pattern. First comes the "possible but expensive" phase, where AI can technically perform the task but at costs that make it impractical for most use cases. Then comes cost competitiveness with human labor, where the economics begin to tip. Finally, the 10x cheaper threshold is reached, at which point the economics are so compelling that not adopting AI becomes a competitive disadvantage. The speed at which tasks move through these phases is accelerating.
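Under an exponential cost decline, each crossover date is a one-line solve. A sketch; the decline rate, task cost, and human-labor figure below are illustrative assumptions, not the chapter's data:

```python
import math

# Solve for the crossover date given an exponential cost decline.
def years_to_crossover(ai_cost, human_cost, annual_decline=7.5, margin=1.0):
    """Years until ai_cost * annual_decline**(-t) <= human_cost / margin.
    Use margin=10 for the '10x cheaper than humans' milestone.
    annual_decline=7.5 is an assumed rate, within the 5-10x per 12-18 months
    range quoted earlier in the chapter."""
    target = human_cost / margin
    if ai_cost <= target:
        return 0.0
    return math.log(ai_cost / target) / math.log(annual_decline)

# An agentic task: $2.85 per run vs. an assumed $20 of human time.
print(round(years_to_crossover(2.85, 20.00), 2))             # 0.0 (already cheaper)
print(round(years_to_crossover(2.85, 20.00, margin=10), 2))  # 0.18 (years to 10x)
```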

7. The Intelligence Surplus Map

Are you over-provisioning or under-provisioning intelligence for your tasks?

[Heatmap: Model Tier vs. Task Complexity — Right-Sizing Intelligence. Bubble size represents surplus (over/under-provisioning); green = right-sized, yellow = over-provisioned, red = under-provisioned.]

Strategic imperative: An intelligent routing layer that matches tasks to the right model tier can reduce costs 5–10x without sacrificing quality. Most organizations either over-provision (frontier for Q&A) or under-provision (mini for reasoning).

The Intelligence Surplus Map brings the chapter's argument to its practical conclusion. Most enterprises today deploy AI in one of two failure modes. The first is over-provisioning: routing every query, no matter how simple, to a frontier model. This is the equivalent of hiring a PhD to answer the phone — technically competent, but economically absurd. The second is under-provisioning: choosing the cheapest model for all tasks, including the complex reasoning and agentic workflows where frontier capability delivers qualitatively different outcomes.

The right approach is right-sizing: matching each task to the model tier that delivers adequate capability at minimum cost. The heatmap above makes this concrete. Simple Q&A is perfectly served by mini-tier models at $0.15 per million tokens. Code generation hits its sweet spot at the Sonnet/$3 tier. Math reasoning and agentic tasks justify frontier or reasoning-tier spend. The organization that routes intelligently — treating model selection as a continuous optimization problem rather than a one-time vendor decision — captures the full value of the Intelligence Yield Curve.

What Comes Next

This chapter established the core thesis: intelligence per dollar is improving at an unprecedented rate, driven by three compounding vectors. Chapter 2 examines the pricing dynamics in detail — the "Densing Law" and how AI pricing is commoditizing faster than any technology in history. Chapter 3 explores The Great Convergence — the collapsing gap between frontier and open-source models that is eliminating vendor lock-in and accelerating the deflationary cycle.

Reference: Benchmark Trajectory

Model | Year | MMLU | HumanEval | GSM8K | II Score | Input $/M | Intelligence Yield
GPT-3 | 2020 | 35.0% | 30.0% | 20.0% | 0.285 | $20.00 | 14,250
GPT-3.5-turbo | 2022 | 70.0% | 48.1% | 57.1% | 0.573 | $2.00 | 286,500
GPT-4 | 2023 | 86.4% | 67.0% | 92.0% | 0.802 | $30.00 | 26,733
GPT-4o | 2024 | 88.7% | 90.2% | 95.0% | 0.913 | $5.00 | 182,600
GPT-4o-mini | 2024 | 82.0% | 87.0% | 93.2% | 0.842 | $0.15 | 5,613,333
Claude 3.5 Sonnet | 2024 | 88.7% | 92.0% | 96.4% | 0.926 | $3.00 | 308,667
o3 | 2025 | 92.0% | 95.0% | 98.0% | 0.954 | $10.00 | 95,400
DeepSeek V3.2 | 2025 | 90.0% | 93.0% | 96.5% | 0.932 | $0.27 | 3,451,852
Claude Opus 4.6 | 2025 | 90.0% | 95.0% | 97.0% | 0.948 | $15.00 | 63,200
Gemini 3.1 Flash-Lite | 2026 | 88.0% | 87.0% | 94.0% | 0.897 | $0.25 | 3,588,000
Gemini 3.1 Pro | 2026 | 92.6% | 94.2% | 98.2% | 0.950 | $2.50 | 380,000