
Fastest AI Models in 2026: Speed Benchmarks & Latency Rankings

Speed matters. For real-time chat, code completion, and production APIs, latency and throughput can make or break the user experience. Here's how every major model stacks up.

⚡ Key Takeaways

  • Groq-hosted models dominate raw throughput thanks to custom LPU hardware
  • Smaller models (Haiku, 4o-mini, Flash) are 3-10x faster than their larger siblings
  • TTFT (time to first token) matters more than throughput for interactive apps
  • Speed-per-dollar is the most useful metric for production — raw speed alone is misleading

Why Speed Matters for AI Applications

In AI application development, speed isn't just about convenience — it directly impacts user experience, cost efficiency, and application architecture. A model that generates 200 tokens per second streams text faster than anyone can read it, so output feels instantaneous. A model at 20 tokens per second feels sluggish, and users notice.

There are two key speed metrics to understand:

  • Time to First Token (TTFT): How long after sending a request before the first token arrives. Critical for chat interfaces where users are waiting for a response to begin. Sub-500ms TTFT feels instant.
  • Output Throughput (tokens/second): How fast the model generates tokens once it starts. Important for long-form generation, code completion, and any use case where total generation time matters.

For production applications, there's a third metric that matters even more: speed per dollar. A model that's 2x faster but 10x more expensive is a bad deal. We track all three metrics with live data on our speed leaderboard.
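The two streaming metrics above can be measured directly from any token stream. Below is a minimal sketch: `measure_stream` times TTFT and post-first-token throughput over an iterator of tokens, and `fake_stream` is a simulated model stream (the delays and token counts are illustrative assumptions, not benchmark data).

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and output throughput (tok/s) over a stream of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    elapsed = time.perf_counter() - start
    gen_time = elapsed - ttft  # time spent generating after the first token
    throughput = (count - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, throughput

def fake_stream(n_tokens, ttft_s, tok_per_s):
    """Simulated stream: one delay before the first token, then steady output."""
    time.sleep(ttft_s)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(1.0 / tok_per_s)
        yield "tok"

# A nominal 500 tok/s model with 50 ms TTFT
ttft, tps = measure_stream(fake_stream(50, ttft_s=0.05, tok_per_s=500))
```

The same `measure_stream` works on a real provider's streaming iterator; separating TTFT from throughput matters because, as the tables below show, some models pair fast generation with multi-second first-token delays.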

🏎️ Top 20 Fastest Models (Throughput)

Ranked by output tokens per second. Live data from Artificial Analysis.

| # | Model | tok/s | TTFT | Price/1M out |
|---|-------|-------|------|--------------|
| 1 | Mercury 2 | 907 | 3.76s | $0.75 |
| 2 | Granite 4.0 H Small | 524 | 8.68s | $0.25 |
| 3 | Granite 3.3 8B (Non-reasoning) | 386 | 7.20s | $0.25 |
| 4 | NVIDIA Nemotron 3 Super 120B… | 370 | 0.55s | $0.75 |
| 5 | Gemini 2.5 Flash-Lite Previe… | 365 | 3.39s | $0.40 |
| 6 | Gemini 2.5 Flash-Lite Previe… | 344 | 0.36s | $0.40 |
| 7 | Gemini 2.5 Flash-Lite (Non-r… | 323 | 0.25s | $0.40 |
| 8 | gpt-oss-20B (low) | 321 | 0.46s | $0.20 |
| 9 | Gemini 2.5 Flash-Lite (Reaso… | 318 | 18.42s | $0.40 |
| 10 | gpt-oss-20B (high) | 316 | 0.41s | $0.20 |
| 11 | Nova Micro | 303 | 0.35s | $0.14 |
| 12 | Devstral Small (Jul '25) | 301 | 0.34s | $0.30 |
| 13 | Ministral 3 3B | 301 | 0.26s | $0.10 |
| 14 | gpt-oss-120B (low) | 255 | 0.51s | $0.60 |
| 15 | gpt-oss-120B (high) | 253 | 0.49s | $0.60 |
| 16 | Grok 4.20 Beta 0309 (Reasoni… | 238 | 10.94s | $6.00 |
| 17 | Gemini 2.5 Flash (Reasoning) | 227 | 12.02s | $2.50 |
| 18 | Nova 2.0 Omni (Non-reasoning) | 226 | 0.67s | $2.50 |
| 19 | Nova 2.0 Lite (low) | 224 | 3.55s | $2.50 |
| 20 | Gemini 3.1 Flash-Lite Preview | 219 | 7.37s | $1.50 |

⏱️ Lowest Latency (TTFT)

Time to first token — critical for interactive applications.

💰 Best Speed-per-Dollar

Throughput divided by output price — the metric that matters for production.

| # | Model | tok/s/$ | tok/s | $/1M out |
|---|-------|---------|-------|----------|
| 1 | Ministral 3 3B | 3006 | 301 | $0.10 |
| 2 | Nova Micro | 2163 | 303 | $0.14 |
| 3 | Granite 4.0 H Small | 2095 | 524 | $0.25 |
| 4 | Llama 3.1 Instruct 8B | 1936 | 194 | $0.10 |
| 5 | LFM2 24B A2B | 1756 | 211 | $0.12 |
| 6 | gpt-oss-20B (low) | 1603 | 321 | $0.20 |
| 7 | gpt-oss-20B (high) | 1580 | 316 | $0.20 |
| 8 | Granite 3.3 8B (Non-reasoning) | 1545 | 386 | $0.25 |
| 9 | Ministral 3 8B | 1278 | 192 | $0.15 |
| 10 | Mercury 2 | 1209 | 907 | $0.75 |
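The ranking above is simple to reproduce: divide output throughput by output price. The sketch below uses three of the table's own figures; results may differ slightly from the table, which is computed from unrounded live data.

```python
# (tok/s, $ per 1M output tokens) — figures from the speed-per-dollar table
models = {
    "Ministral 3 3B": (301, 0.10),
    "Nova Micro": (303, 0.14),
    "Mercury 2": (907, 0.75),
}

def speed_per_dollar(tok_s, price_per_m):
    """Output throughput divided by output price."""
    return tok_s / price_per_m

# Rank fastest-per-dollar first
ranked = sorted(models.items(), key=lambda kv: speed_per_dollar(*kv[1]), reverse=True)
```

Note how Mercury 2's 907 tok/s still lands it last among these three: raw speed can't compensate for a 5-7x price gap.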

The Speed Landscape in 2026

The AI speed landscape has bifurcated into two distinct tiers. On one side, specialized inference providers like Groq (using custom LPU chips) and Cerebras (using wafer-scale engines) push smaller models to extreme speeds — sometimes exceeding 1,000 tokens per second. On the other, cloud providers run larger, more capable models at moderate speeds of 50-200 tok/s.

For most applications, the practical question isn't "which model is fastest?" but rather "which model is fast enough while meeting my quality requirements?" A model that runs at 500 tok/s but produces mediocre output isn't useful. The sweet spot is usually a mid-tier model (Sonnet, 4o, Flash) that provides 80-200 tok/s with high quality.

When Raw Speed Matters

There are specific use cases where extreme speed is worth paying for:

  • Real-time autocomplete — code completion, search suggestions, inline editing
  • Batch processing — processing millions of items where total time matters
  • Voice AI — conversational agents where response latency determines naturalness
  • Gaming / interactive — NPC dialogue, game master AI, real-time narrative

Optimizing for Speed in Production

Beyond model selection, there are architectural strategies for improving perceived speed:

  • Streaming responses — show tokens as they arrive rather than waiting for the full response
  • Prompt caching — Anthropic, OpenAI, and Google all offer prompt caching that dramatically reduces TTFT for repeated prefixes
  • Model routing — use fast small models for simple tasks, reserve large models for complex ones
  • Speculative decoding — some providers offer this to boost throughput at the infrastructure level
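The model-routing strategy above can be as simple as a length-and-keyword check in front of your API calls. The sketch below is illustrative only: the model names and the complexity heuristic are assumptions, not a production policy.

```python
# Hypothetical model identifiers — substitute your provider's real model names.
FAST_MODEL = "small-fast-model"
LARGE_MODEL = "large-capable-model"

def route(prompt: str, max_fast_words: int = 200) -> str:
    """Route simple prompts to a fast, cheap model; complex ones to a larger model."""
    # Crude complexity heuristic: prompt length plus a few trigger keywords.
    complex_markers = ("analyze", "prove", "refactor", "multi-step")
    looks_complex = any(m in prompt.lower() for m in complex_markers)
    if len(prompt.split()) > max_fast_words or looks_complex:
        return LARGE_MODEL
    return FAST_MODEL

route("What time is it?")        # short + simple → fast model
route("Please analyze this codebase and refactor the auth module")  # → large model
```

In practice, teams often replace the keyword heuristic with a tiny classifier model, but even a rule like this can shift the bulk of traffic onto the fast tier.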

Explore the Full Speed Leaderboard

Interactive speed rankings with filtering and real-time data.

Open Speed Leaderboard →

Related Articles