Fastest AI Models in 2026: Speed Benchmarks & Latency Rankings
Speed matters. For real-time chat, code completion, and production APIs, latency and throughput can make or break the user experience. Here's how every major model stacks up.
⚡ Key Takeaways
- Groq-hosted models dominate raw throughput thanks to custom LPU hardware
- Smaller models (Haiku, 4o-mini, Flash) are 3-10x faster than their larger siblings
- TTFT (time to first token) matters more than throughput for interactive apps
- Speed-per-dollar is the most useful metric for production — raw speed alone is misleading
Why Speed Matters for AI Applications
In AI application development, speed isn't just about convenience — it directly impacts user experience, cost efficiency, and application architecture. A model that generates 200 tokens per second enables real-time streaming that feels like a fast human typist. A model at 20 tokens per second feels sluggish, and users notice.
There are two key speed metrics to understand:
- Time to First Token (TTFT): The delay between sending a request and receiving the first token. Critical for chat interfaces, where users are waiting for the response to begin. Sub-500ms TTFT feels instant.
- Output Throughput (tokens/second): How fast the model generates tokens once it starts. Important for long-form generation, code completion, and any use case where total generation time matters.
For production applications, there's a third metric that matters even more: speed per dollar. A model that's 2x faster but 10x more expensive is a bad deal. We track all three metrics with live data on our speed leaderboard.
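The two timing metrics above are easy to measure yourself from any streaming API. The sketch below is a minimal, provider-agnostic example: `measure_stream` accepts any iterable of tokens, so the `fake_stream` generator here is a stand-in, not a real API client.

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and output throughput for a streaming response.

    `token_iter` is any iterable that yields tokens as they arrive
    (e.g. the chunks of a streaming API response).
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    # Throughput is conventionally reported over the generation phase,
    # i.e. excluding the initial wait for the first token.
    if count > 1 and total > ttft:
        tok_per_s = (count - 1) / (total - ttft)
    else:
        tok_per_s = 0.0
    return ttft, tok_per_s

def fake_stream(n=50, delay=0.002):
    # Stand-in for a real API stream: yields n tokens with a fixed gap.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
```

In a real client you would pass the SDK's streaming iterator straight into `measure_stream`; the timing logic is identical.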
🏎️ Top 20 Fastest Models (Throughput)
Ranked by output tokens per second. Live data from Artificial Analysis.
| # | Model | tok/s | TTFT | Price/1M out |
|---|---|---|---|---|
| 1 | Mercury 2 | 907 | 3.76s | $0.75 |
| 2 | Granite 4.0 H Small | 524 | 8.68s | $0.25 |
| 3 | Granite 3.3 8B (Non-reasoning) | 386 | 7.20s | $0.25 |
| 4 | NVIDIA Nemotron 3 Super 120B… | 370 | 0.55s | $0.75 |
| 5 | Gemini 2.5 Flash-Lite Previe… | 365 | 3.39s | $0.40 |
| 6 | Gemini 2.5 Flash-Lite Previe… | 344 | 0.36s | $0.40 |
| 7 | Gemini 2.5 Flash-Lite (Non-r… | 323 | 0.25s | $0.40 |
| 8 | gpt-oss-20B (low) | 321 | 0.46s | $0.20 |
| 9 | Gemini 2.5 Flash-Lite (Reaso… | 318 | 18.42s | $0.40 |
| 10 | gpt-oss-20B (high) | 316 | 0.41s | $0.20 |
| 11 | Nova Micro | 303 | 0.35s | $0.14 |
| 12 | Devstral Small (Jul '25) | 301 | 0.34s | $0.30 |
| 13 | Ministral 3 3B | 301 | 0.26s | $0.10 |
| 14 | gpt-oss-120B (low) | 255 | 0.51s | $0.60 |
| 15 | gpt-oss-120B (high) | 253 | 0.49s | $0.60 |
| 16 | Grok 4.20 Beta 0309 (Reasoni… | 238 | 10.94s | $6.00 |
| 17 | Gemini 2.5 Flash (Reasoning) | 227 | 12.02s | $2.50 |
| 18 | Nova 2.0 Omni (Non-reasoning) | 226 | 0.67s | $2.50 |
| 19 | Nova 2.0 Lite (low) | 224 | 3.55s | $2.50 |
| 20 | Gemini 3.1 Flash-Lite Preview | 219 | 7.37s | $1.50 |
⏱️ Lowest Latency (TTFT)
Time to first token — critical for interactive applications.
| # | Model | TTFT (s) | tok/s |
|---|---|---|---|
| 1 | Apriel-v1.5-15B-Thinker | 0.186 | 145 |
| 2 | Apriel-v1.6-15B-Thinker | 0.195 | 139 |
| 3 | LFM2 24B A2B | 0.222 | 211 |
| 4 | NVIDIA Nemotron Nano 12B v2 … | 0.229 | 132 |
| 5 | Olmo 3.1 32B Instruct | 0.230 | 55 |
| 6 | Llama Nemotron Super 49B v1.… | 0.235 | 51 |
| 7 | NVIDIA Nemotron Nano 9B V2 (… | 0.246 | 127 |
| 8 | Llama Nemotron Super 49B v1.… | 0.253 | 51 |
| 9 | Gemini 2.5 Flash-Lite (Non-r… | 0.253 | 323 |
| 10 | Ministral 3 3B | 0.255 | 301 |
| 11 | Cogito v2.1 (Reasoning) | 0.274 | 92 |
| 12 | Mistral 7B Instruct | 0.274 | 188 |
| 13 | Ministral 3 8B | 0.279 | 192 |
| 14 | Ministral 3 14B | 0.285 | 125 |
| 15 | NVIDIA Nemotron 3 Nano 30B A… | 0.292 | 80 |
💰 Best Speed-per-Dollar
Throughput divided by output price — the metric that matters for production.
| # | Model | tok/s/$ | tok/s | $/1M out |
|---|---|---|---|---|
| 1 | Ministral 3 3B | 3006 | 301 | $0.10 |
| 2 | Nova Micro | 2163 | 303 | $0.14 |
| 3 | Granite 4.0 H Small | 2095 | 524 | $0.25 |
| 4 | Llama 3.1 Instruct 8B | 1936 | 194 | $0.10 |
| 5 | LFM2 24B A2B | 1756 | 211 | $0.12 |
| 6 | gpt-oss-20B (low) | 1603 | 321 | $0.20 |
| 7 | gpt-oss-20B (high) | 1580 | 316 | $0.20 |
| 8 | Granite 3.3 8B (Non-reasoning) | 1545 | 386 | $0.25 |
| 9 | Ministral 3 8B | 1278 | 192 | $0.15 |
| 10 | Mercury 2 | 1209 | 907 | $0.75 |
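The speed-per-dollar metric is simple division: throughput over output price. A quick sketch using the rounded figures from the table above (small differences from the listed values come from rounding in the source data):

```python
# (tok/s, $ per 1M output tokens), taken from the rounded table figures
models = {
    "Ministral 3 3B":      (301, 0.10),
    "Nova Micro":          (303, 0.14),
    "Granite 4.0 H Small": (524, 0.25),
    "Mercury 2":           (907, 0.75),
}

def speed_per_dollar(tok_s, price_per_m):
    # Tokens/second delivered per dollar of output spend.
    return tok_s / price_per_m

ranked = sorted(models.items(),
                key=lambda kv: speed_per_dollar(*kv[1]),
                reverse=True)
```

Note how Mercury 2, the raw-throughput leader, drops to the bottom of this subset once its higher price is factored in.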
The Speed Landscape in 2026
The AI speed landscape has split into two distinct tiers. On one side, specialized inference providers like Groq (using custom LPU chips) and Cerebras (using wafer-scale engines) push smaller models to extreme speeds — sometimes exceeding 1,000 tokens per second. On the other, cloud providers run larger, more capable models at moderate speeds of 50-200 tok/s.
For most applications, the practical question isn't "which model is fastest?" but rather "which model is fast enough while meeting my quality requirements?" A model that runs at 500 tok/s but produces mediocre output isn't useful. The sweet spot is usually a mid-tier model (Sonnet, 4o, Flash) that provides 80-200 tok/s with high quality.
When Raw Speed Matters
There are specific use cases where extreme speed is worth paying for:
- Real-time autocomplete — code completion, search suggestions, inline editing
- Batch processing — processing millions of items where total time matters
- Voice AI — conversational agents where response latency determines naturalness
- Gaming / interactive — NPC dialogue, game master AI, real-time narrative
Optimizing for Speed in Production
Beyond model selection, there are architectural strategies for improving perceived speed:
- Streaming responses — show tokens as they arrive rather than waiting for the full response
- Prompt caching — Anthropic, OpenAI, and Google all offer prompt caching that dramatically reduces TTFT for repeated prefixes
- Model routing — use fast small models for simple tasks, reserve large models for complex ones
- Speculative decoding — some providers offer this to boost throughput at the infrastructure level
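The model-routing strategy above can be as simple as a cheap heuristic in front of your API calls. A minimal sketch, where the model names and the complexity rule are illustrative placeholders, not real identifiers:

```python
# Placeholder model names -- substitute your provider's actual model IDs.
FAST_MODEL = "small-fast-model"
STRONG_MODEL = "large-capable-model"

def route(prompt: str, needs_reasoning: bool = False) -> str:
    """Pick a model via a simple complexity heuristic.

    Short prompts that don't require reasoning go to the fast, cheap
    model; long or reasoning-heavy prompts go to the stronger one.
    """
    if needs_reasoning or len(prompt.split()) > 200:
        return STRONG_MODEL
    return FAST_MODEL
```

In production, the heuristic is often replaced by a small classifier model, but even a word-count threshold like this can shift the bulk of simple traffic onto the fast tier.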
Explore the Full Speed Leaderboard
Interactive speed rankings with filtering and real-time data.
Open Speed Leaderboard →