Deep Research — Visual & World Models

The Visual Intelligence Revolution

Comprehensive analysis of 38 image, video, and world models — architectures, hardware requirements, economics, and the road to real-time world simulation.

Models Cataloged

Architecture Families

40B

Largest Video Model

$115B

Visual AI TAM 2030

1.0

Open vs Commercial Gap

Part IV — The Model Landscape

Chapter 13: Visual & World Models

The visual AI landscape has undergone a complete architectural revolution in 24 months — an upheaval as consequential as the attention-is-all-you-need moment that transformed natural language processing.

This chapter maps the state of the art across 38 image, video, and world models spanning 8 architecture families. The headline finding is a clear cascade of commoditization: image generation commoditized in 2025 and is now a solved category, with FLUX.1 schnell generating images in one second on a consumer GPU for free. Video generation is the current frontier, with 8 commercial models competing on resolution (up to 4K), duration (up to 60 seconds), and quality — at training costs of $15M–$80M that create real barriers to entry. World models represent the next compute frontier, requiring 10–100x more compute than video for interactive 3D simulation with physics and causality.

The total addressable market reflects this progression: $25 billion for image generation, $40 billion for video, and $50 billion for world models by 2030 — $115 billion in aggregate. But the most strategically significant finding may be the open-source gap. Unlike large language models, where frontier capability requires $100M+ investments and 200B+ parameters, competitive visual models can be trained for $10M at 12 billion parameters (FLUX.1) or even $28K at 600 million parameters (PixArt-Sigma). The commercial-to-open quality gap has already narrowed to 1.0 point on a 10-point scale and is projected to close to 0.2 by 2030.

Key Insights

Seven findings from analyzing the visual AI landscape

1. Architecture Evolution Timeline

From pixel-space diffusion to real-time world simulation — how visual generation architectures evolved

Architecture Introduction Timeline

Year each architecture family was introduced, with current status

Why DiT Won the Architecture Race

Every frontier visual model released since mid-2024 — Sora, Veo, FLUX.1, SD3, HunyuanVideo, Kling — uses the Diffusion Transformer (DiT) architecture, replacing the U-Net that dominated from 2020 to 2023. The shift was driven by three factors. First, DiT performance improves log-linearly with compute, just like GPT models, enabling companies to confidently invest $50M+ in training knowing the quality improvement is predictable. Second, the entire LLM training and inference stack — Flash Attention, tensor parallelism, KV-cache optimization — transfers directly to DiT with no new infrastructure required. Third, the MMDiT variant (used in SD3 and FLUX.1) processes text and image tokens through separate transformer streams with cross-attention, enabling deeper prompt understanding than U-Net's cross-attention layers ever achieved.

Alongside DiT, flow matching is replacing the Markov chain denoising of DDPM. Where DDPM required 20–50 inference steps, flow matching learns direct ODE paths that need only 4–10 steps for equal quality. Combined with distillation, FLUX.1-schnell achieves 4-step generation in under one second on an RTX 4090. This convergence of DiT + flow matching is now the default paradigm for all frontier models.

2. Architecture Deep Dive

Select an architecture to see its components, representative models, and compute characteristics

3. Commercial Image Models

Quality, speed, cost, and control scores for leading commercial image generators

Model Comparison — Quality × Speed × Cost × Control

Radar chart comparing key dimensions across commercial image models

Three Commercial Moats in Image Generation

The commercial image generation market has bifurcated into three distinct moats. Midjourney leads on pure aesthetic quality (9.5/10), self-funded with an estimated $300M+ in revenue and no external investors — the most capital-efficient company in the space. DALL-E 3 owns distribution, integrated into ChatGPT's 200M+ user base, with unmatched text rendering capability driven by GPT-4 prompt rewriting. Adobe Firefly holds the enterprise safety moat, trained exclusively on licensed content, making it the only defensible choice for organizations concerned about copyright litigation. Its deep integration with Photoshop and Creative Cloud creates a product moat competitors cannot replicate.

But the strategic insight is that image generation is already a commodity. The real frontier — where training costs create barriers, where quality gaps persist, and where the next $40 billion in market value will be created — is video.

4. Commercial Video Models

Resolution, duration, and quality scores for frontier video generation models

Video Model Capabilities

Quality, motion quality, and consistency scores side by side

5. Open-Source Image Models

Quality versus VRAM requirements — finding the efficiency frontier

Quality vs VRAM Requirement

Higher quality with less VRAM = better value. Bubble size = parameter count.

Efficiency leader: PixArt-Sigma achieves 7.0 quality on just 8GB VRAM with only 600M parameters — trained for just $28K. FLUX.1 leads on raw quality (9.0) but needs 24GB.

6. Open-Source Video Models

The rapidly evolving landscape of open video generation models

Open-Source Video: Quality vs Model Size

Parameters (B) vs quality score. Larger models generally achieve higher quality.

The Democratization Wave

Open-source visual models are closing the commercial quality gap faster than their LLM counterparts. FLUX.1 at 12 billion parameters already matches or exceeds DALL-E 3 on most image benchmarks. HunyuanVideo 1.5 from Tencent approaches Sora 1 quality and, with FP8 quantization, runs on a consumer RTX 4090. Open-Sora 2.0 proves that competitive video models can be trained for just $200K — compared to $80M for Sora. The efficiency frontier is anchored by PixArt-Sigma, which achieves 7.0/10 quality with only 600 million parameters, 8 GB VRAM, and $28K in total training cost.

Four structural factors accelerate this convergence. Architecture papers are published openly before commercial deployment. Training recipes, data pipelines, and hyperparameters are increasingly shared. Competitive image models require only 12B parameters and $10M versus 200B+ and $100M+ for frontier LLMs. And a massive community fine-tuning ecosystem — thousands of LoRA adaptations and custom checkpoints — compounds quality improvements on open base models. The chart below tracks this convergence quantitatively.

7. Open-Source vs Commercial Quality Gap

The gap is closing faster than in LLMs — open-source reaches near-parity by 2028

Quality Convergence Over Time

Commercial vs open-source quality scores (1-10 scale) and the narrowing gap

Key driver: Unlike LLMs, visual model architectures are published openly. FLUX.1 was built by ex-Stability researchers who published the architecture, then released weights. Training recipes (data, schedules, tricks) are increasingly shared.

8. World Models — The Next Frontier

Neural networks that simulate interactive 3D environments with physics and causality

World Models: Maturity vs Scale

Parameter count vs maturity level across world model approaches

World Models: From Passive Video to Interactive Simulation

World models represent the most consequential shift in visual AI — the transition from passive artifacts to interactive environments that respond to user actions in real time. Where video generation produces a clip and stops, world models generate continuously at 24–60 frames per second, conditioning each frame on user actions with physically accurate responses. The compute constraints are fundamentally different: a 40-millisecond latency budget per frame (versus minutes per video clip), continuous GPU utilization rather than one-time generation, and consistency requirements that span not just temporal coherence but physics and 3D geometry.

NVIDIA Cosmos, Genie 3, and World Labs Marble lead the field, with training costs of $40M–$50M. The open-source Genesis physics engine achieves 430,000+ FPS for simulation on a single GPU, providing the foundation layer for open world model research. The $50B TAM projected for 2030 will be driven by gaming, autonomous vehicles, robotics policy training, and industrial simulation — and it will demand 10–100x more compute than video generation, reshaping GPU demand forecasts for the next decade.

9. Hardware Requirements Matrix

VRAM requirements by task type and model size — the memory bandwidth bottleneck explained

GPU Comparison: Throughput by Model Type

Comparing GPUs across image, video, and world model workloads

Memory Bandwidth vs FLOPS Growth

FLOPS grew 80x but bandwidth only 17x — this is why inference is memory-bound

Quantization Impact

Trading precision for speed and memory savings

VRAM Requirements by Task

10. Training Cost Analysis

From $28K (PixArt) to $80M+ (Sora) — training costs span 3 orders of magnitude

Training Cost by Model (Log Scale)

Logarithmic scale — each grid line is a 10x increase

The paradox: Cost per quality-unit drops ~50%/year, but absolute frontier costs RISE because quality targets increase faster. PixArt achieves 2023 quality for $28K, but Sora 2 costs ~$80M because the bar moved to physics-aware video.

Frontier Training Cost Trajectory

Absolute training cost for frontier models over time

The Training Cost Paradox

The training cost analysis above reveals a paradox at the heart of visual AI economics. Cost per quality-unit drops approximately 50% per year — PixArt achieves 2023-competitive quality for $28K today, down from the $600K SDXL required at the same quality level. But absolute frontier costs are rising, because quality targets advance faster than efficiency gains. Sora cost $80M not because training is inefficient but because the bar moved from static images to physics-aware video generation. By 2030, frontier world model training is projected to reach $500M.

This creates a two-tier market. Frontier labs — OpenAI, Google DeepMind, NVIDIA — spend $100M+ pushing the capability boundary. Everyone else fine-tunes open-source models for $1K–$100K to match last year's frontier. Both tiers are viable strategies; the question is whether an organization needs to be at the frontier or one generation behind it. For most enterprise use cases, last year's frontier at 1/1000th the cost is the rational choice.

11. Inference Economics

Per-generation costs: API vs self-hosted, across image/video/world model workloads

Inference Cost: API vs Self-Hosted

Comparing API pricing against self-hosted cost (amortized H100 at $2/hr)

Inference Cost Decline Trajectory

Index: 2024 = 100. Cost declining ~50% per year from hardware + algorithmic improvements.

12. Capability Evolution (2022-2030)

How resolution, duration, frame rate, and quality scores evolve over time

Capability Scores Over Time

Consistency, audio, interactivity, physics accuracy, and 3D understanding (0-10 scale)

Video Duration & Resolution Progression

Maximum video duration (seconds) and resolution progression by year

The Capability S-Curves Converge

The capability evolution radar reveals a striking pattern: visual AI’s five core dimensions — consistency, audio, interactivity, physics accuracy, and 3D understanding — are advancing at dramatically different rates. Consistency, the earliest capability to mature, reached approximately 8/10 by 2025; it is effectively solved for production use. Audio integration and physics accuracy, by contrast, matured more slowly and remain the binding constraints on real-world deployment as of 2026. By 2030, all five dimensions are projected to converge above 8/10, creating the conditions for general-purpose visual generation that can replace not just stock photography but live-action production itself.

The video duration and resolution progression chart tells an equally consequential story. Maximum duration expanded from 4 seconds in 2022 to 60 seconds in 2025 — a 15x improvement in three years — while frame rates climbed from 8 FPS to 30 FPS, crossing the threshold of perceptual smoothness. By 2028, the trajectory points to 5-minute clips at 60 FPS in 4K — sufficient for commercial-grade short-form content production. For enterprise decision-makers, the implication is clear: visual content pipelines built around stock footage licensing, studio shoots, and post-production editing will face fundamental disruption within the next 2–3 product cycles. Organizations should begin piloting generative video in low-stakes workflows now to build internal capability before the technology reaches full production readiness.

13. Market Landscape

Funding, valuations, and total addressable market for visual AI

Total Addressable Market by Segment

Image gen, video gen, and world models — $115B total by 2030

Company Landscape: Funding vs Estimated Revenue

Bubble size = valuation. Companies positioned by funding raised vs estimated visual AI revenue.

Enterprise Adoption by Vertical

Current (2024) vs projected (2027) adoption rates

A $115 Billion Market with Uneven Distribution

The total addressable market chart reveals a market that will triple from roughly $35 billion in 2024 to $115 billion by 2030 — but the growth is heavily concentrated in the later-stage capabilities. Image generation, the most mature segment, grows modestly from its current base to approximately $25 billion. Video generation, the current frontier, is projected to reach $40 billion as production costs for advertising, entertainment, and training content collapse by an order of magnitude. World models — rapidly maturing since 2025 — represent the largest single segment at $50 billion by 2030, driven by gaming, autonomous vehicle simulation, and industrial digital twins. The strategic implication: the largest revenue pools will accrue to the capabilities that are hardest to commoditize today.

The company landscape bubble chart exposes a striking capital efficiency gap. Midjourney, with zero external funding, generates an estimated $300M+ in visual AI revenue — the highest revenue-to-funding ratio in the field. By contrast, several well-funded startups have raised billions but generate modest revenue relative to their capital base. Enterprise adoption data adds a further dimension: advertising and media lead at roughly 45% adoption in 2024, projected to reach 80% by 2027, while manufacturing and healthcare lag below 15% adoption today. This adoption gap represents a significant opportunity for organizations willing to move early in under-penetrated verticals, where first-mover advantages in domain-specific fine-tuning and workflow integration remain available.

What Comes Next

The commoditization cascade follows a predictable trajectory: image generation commoditized in 2025, video generation will commoditize by 2027, 3D generation by 2032, and real-time world simulation by 2035. Each wave follows the same pattern — a frontier lab invests $50M–$100M+ to achieve a breakthrough, open-source replicates it within 6–12 months, and within 24 months the capability runs on consumer hardware. The strategic window for proprietary advantage at each stage is narrow and shrinking.

For enterprise strategists, the implication is clear: the value in visual AI is shifting from model capability (increasingly commoditized) to application-layer integration, domain-specific fine-tuning, and workflow orchestration. The organizations that capture the most value will not be those with the best generative models, but those that most effectively embed visual generation into their advertising pipelines, product photography workflows, training content systems, and simulation environments. The economics of how these models are trained — the infrastructure costs, GPU demand, and scaling laws that underpin the entire landscape — are the subject of Part IV, beginning with Chapter 14: Model Training Economics.

14. Future Projections & Milestones

Architecture trends, capability milestones, and the commoditization cascade

Architecture Trends

Capability Milestones Timeline

Commoditization Cascade

When each capability becomes a commodity (open-source matches commercial quality, self-hosted cost drops to near-zero)

The Commoditization Cascade and Its Strategic Window

The commoditization cascade chart quantifies the central strategic dynamic of visual AI: each capability follows a predictable progression from frontier breakthrough to commodity, and the window between these two states is narrowing with each generation. Image generation commoditized in 2025 — roughly 30 months after Stable Diffusion 1.0 proved the category viable. Video generation is commoditizing faster than expected — by early 2026, four of six major AI video models (Kling v3, Sora 2, Veo 3.1, Seedance 2.0) generate synchronized audio natively, and AI-generated video ads represent 30–40% of all digital video advertising. Full commoditization is on track for 2027. 3D generation and real-time world simulation follow in sequence, with projected commoditization by 2032 and 2035 respectively. The architecture trends timeline confirms this acceleration: DiT + flow matching is already the dominant paradigm, unified multimodal architectures are arriving in 2025–2026, and neural scene representations will mature by 2027–2028.

The capability milestones provide a concrete planning framework for enterprise strategists. Photorealistic image generation and minute-length video are already achieved milestones. Real-time 4K video generation, interactive world simulation, and full physics-aware generation remain ahead — each representing an inflection point at which new categories of enterprise application become viable. The strategic imperative is to time investments not to when a capability emerges at the frontier, but to when it commoditizes sufficiently for reliable, cost-effective production deployment. Organizations that invest too early pay a premium for unreliable capability; those that invest too late cede first-mover advantages in workflow integration and domain specialization to faster-moving competitors.

15. Full Model Comparison

Click any column header to sort. All 38 models across image, video, and world model categories.

Model	Type	Category	Company	Params (B)	Architecture	Resolution	Quality	Speed	Released

Reading the Full Landscape

The 38-model comparison table encapsulates the state of visual AI in a single sortable view — and the patterns it reveals are as important as any individual model’s specifications. Sorting by quality score, the top tier is dominated by commercial models (Midjourney 6, Veo 2, Sora) scoring 9.0–9.5, while the best open-source models (FLUX.1, HunyuanVideo 1.5) cluster at 8.5–9.0 — a gap of less than one point. Sorting by speed score inverts the ranking: open-source models with distilled checkpoints and consumer-GPU optimization consistently outperform API-gated commercial offerings. The architecture column shows near-universal DiT adoption across 2024–2025 releases, confirming the convergence thesis documented throughout this chapter.

For enterprise architects evaluating this landscape, three patterns should guide procurement and build-versus-buy decisions. First, parameter count alone is a poor proxy for quality — PixArt-Sigma at 0.6B parameters achieves higher quality scores than several models 10–20x its size, because architecture efficiency and training data quality dominate brute-force scale. Second, the commercial-versus-open distinction matters less for capability and more for licensing, safety guarantees, and support SLAs. Third, the release date column reveals the pace of obsolescence: models released before mid-2024 are already outperformed by newer entrants on nearly every dimension. Any procurement decision in visual AI should assume a 12–18 month refresh cycle and architect for model-swappable pipelines accordingly.

← Chapter 12: Small Models Chapter 14: Training Economics →