
Fastest AI Models in 2026: Speed Benchmarks & Latency Rankings

Speed matters. For real-time chat, code completion, and production APIs, latency and throughput can make or break the user experience. Here's how every major model stacks up.

⚡ Key Takeaways

  • Groq-hosted models dominate raw throughput thanks to custom LPU hardware
  • Smaller models (Haiku, 4o-mini, Flash) are 3-10x faster than their larger siblings
  • TTFT (time to first token) matters more than throughput for interactive apps
  • Speed-per-dollar is the most useful metric for production — raw speed alone is misleading

Why Speed Matters for AI Applications

In AI application development, speed isn't just about convenience — it directly impacts user experience, cost efficiency, and application architecture. A model that generates 200 tokens per second streams text faster than anyone can read it, so output feels instantaneous. A model at 20 tokens per second feels sluggish, and users notice.

There are two key speed metrics to understand:

  • Time to First Token (TTFT): How long after sending a request before the first token arrives. Critical for chat interfaces where users are waiting for a response to begin. Sub-500ms TTFT feels instant.
  • Output Throughput (tokens/second): How fast the model generates tokens once it starts. Important for long-form generation, code completion, and any use case where total generation time matters.

For production applications, there's a third metric that matters even more: speed per dollar. A model that's 2x faster but 10x more expensive is a bad deal. We track all three metrics with live data on our speed leaderboard.
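The two streaming metrics above can be measured directly from any token stream. Below is a minimal sketch: `measure_stream` times TTFT and post-first-token throughput over an iterator of tokens, and `fake_stream` is a simulated model stream (the delays and token counts are illustrative assumptions, not benchmark data).

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and output throughput (tok/s) over a stream of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    elapsed = time.perf_counter() - start
    gen_time = elapsed - ttft  # time spent generating after the first token
    throughput = (count - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, throughput

def fake_stream(n_tokens, ttft_s, tok_per_s):
    """Simulated stream: one delay before the first token, then steady output."""
    time.sleep(ttft_s)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(1.0 / tok_per_s)
        yield "tok"

# A nominal 500 tok/s model with 50 ms TTFT
ttft, tps = measure_stream(fake_stream(50, ttft_s=0.05, tok_per_s=500))
```

The same `measure_stream` works on a real provider's streaming iterator; separating TTFT from throughput matters because, as the tables below show, some models pair fast generation with multi-second first-token delays.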

🏎️ Top 20 Fastest Models (Throughput)

Ranked by output tokens per second. Live data from Artificial Analysis.

| # | Model | tok/s | TTFT | Price/1M out |
|---|-------|-------|------|--------------|
| 1 | Mercury 2 | 907 | 3.76s | $0.75 |
| 2 | Granite 4.0 H Small | 524 | 8.68s | $0.25 |
| 3 | Granite 3.3 8B (Non-reasoning) | 386 | 7.20s | $0.25 |
| 4 | NVIDIA Nemotron 3 Super 120B… | 370 | 0.55s | $0.75 |
| 5 | Gemini 2.5 Flash-Lite Previe… | 365 | 3.39s | $0.40 |
| 6 | Gemini 2.5 Flash-Lite Previe… | 344 | 0.36s | $0.40 |
| 7 | Gemini 2.5 Flash-Lite (Non-r… | 323 | 0.25s | $0.40 |
| 8 | gpt-oss-20B (low) | 321 | 0.46s | $0.20 |
| 9 | Gemini 2.5 Flash-Lite (Reaso… | 318 | 18.42s | $0.40 |
| 10 | gpt-oss-20B (high) | 316 | 0.41s | $0.20 |
| 11 | Nova Micro | 303 | 0.35s | $0.14 |
| 12 | Devstral Small (Jul '25) | 301 | 0.34s | $0.30 |
| 13 | Ministral 3 3B | 301 | 0.26s | $0.10 |
| 14 | gpt-oss-120B (low) | 255 | 0.51s | $0.60 |
| 15 | gpt-oss-120B (high) | 253 | 0.49s | $0.60 |
| 16 | Grok 4.20 Beta 0309 (Reasoni… | 238 | 10.94s | $6.00 |
| 17 | Gemini 2.5 Flash (Reasoning) | 227 | 12.02s | $2.50 |
| 18 | Nova 2.0 Omni (Non-reasoning) | 226 | 0.67s | $2.50 |
| 19 | Nova 2.0 Lite (low) | 224 | 3.55s | $2.50 |
| 20 | Gemini 3.1 Flash-Lite Preview | 219 | 7.37s | $1.50 |

⏱️ Lowest Latency (TTFT)

Time to first token — critical for interactive applications.

💰 Best Speed-per-Dollar

Throughput divided by output price — the metric that matters for production.

| # | Model | tok/s/$ | tok/s | $/1M out |
|---|-------|---------|-------|----------|
| 1 | Ministral 3 3B | 3006 | 301 | $0.10 |
| 2 | Nova Micro | 2163 | 303 | $0.14 |
| 3 | Granite 4.0 H Small | 2095 | 524 | $0.25 |
| 4 | Llama 3.1 Instruct 8B | 1936 | 194 | $0.10 |
| 5 | LFM2 24B A2B | 1756 | 211 | $0.12 |
| 6 | gpt-oss-20B (low) | 1603 | 321 | $0.20 |
| 7 | gpt-oss-20B (high) | 1580 | 316 | $0.20 |
| 8 | Granite 3.3 8B (Non-reasoning) | 1545 | 386 | $0.25 |
| 9 | Ministral 3 8B | 1278 | 192 | $0.15 |
| 10 | Mercury 2 | 1209 | 907 | $0.75 |
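The ranking above is simple to reproduce: divide output throughput by output price. The sketch below uses three of the table's own figures; results may differ slightly from the table, which is computed from unrounded live data.

```python
# (tok/s, $ per 1M output tokens) — figures from the speed-per-dollar table
models = {
    "Ministral 3 3B": (301, 0.10),
    "Nova Micro": (303, 0.14),
    "Mercury 2": (907, 0.75),
}

def speed_per_dollar(tok_s, price_per_m):
    """Output throughput divided by output price."""
    return tok_s / price_per_m

# Rank fastest-per-dollar first
ranked = sorted(models.items(), key=lambda kv: speed_per_dollar(*kv[1]), reverse=True)
```

Note how Mercury 2's 907 tok/s still lands it last among these three: raw speed can't compensate for a 5-7x price gap.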

The Speed Landscape in 2026

The AI speed landscape has bifurcated into two distinct tiers. On one side, specialized inference providers like Groq (using custom LPU chips) and Cerebras (using wafer-scale engines) push smaller models to extreme speeds — sometimes exceeding 1,000 tokens per second. On the other, cloud providers run larger, more capable models at moderate speeds of 50-200 tok/s.

For most applications, the practical question isn't "which model is fastest?" but rather "which model is fast enough while meeting my quality requirements?" A model that runs at 500 tok/s but produces mediocre output isn't useful. The sweet spot is usually a mid-tier model (Sonnet, 4o, Flash) that provides 80-200 tok/s with high quality.

When Raw Speed Matters

There are specific use cases where extreme speed is worth paying for:

  • Real-time autocomplete — code completion, search suggestions, inline editing
  • Batch processing — processing millions of items where total time matters
  • Voice AI — conversational agents where response latency determines naturalness
  • Gaming / interactive — NPC dialogue, game master AI, real-time narrative

Optimizing for Speed in Production

Beyond model selection, there are architectural strategies for improving perceived speed:

  • Streaming responses — show tokens as they arrive rather than waiting for the full response
  • Prompt caching — Anthropic, OpenAI, and Google all offer prompt caching that dramatically reduces TTFT for repeated prefixes
  • Model routing — use fast small models for simple tasks, reserve large models for complex ones
  • Speculative decoding — some providers offer this to boost throughput at the infrastructure level
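The model-routing strategy above can be as simple as a length-and-keyword check in front of your API calls. The sketch below is illustrative only: the model names and the complexity heuristic are assumptions, not a production policy.

```python
# Hypothetical model identifiers — substitute your provider's real model names.
FAST_MODEL = "small-fast-model"
LARGE_MODEL = "large-capable-model"

def route(prompt: str, max_fast_words: int = 200) -> str:
    """Route simple prompts to a fast, cheap model; complex ones to a larger model."""
    # Crude complexity heuristic: prompt length plus a few trigger keywords.
    complex_markers = ("analyze", "prove", "refactor", "multi-step")
    looks_complex = any(m in prompt.lower() for m in complex_markers)
    if len(prompt.split()) > max_fast_words or looks_complex:
        return LARGE_MODEL
    return FAST_MODEL

route("What time is it?")        # short + simple → fast model
route("Please analyze this codebase and refactor the auth module")  # → large model
```

In practice, teams often replace the keyword heuristic with a tiny classifier model, but even a rule like this can shift the bulk of traffic onto the fast tier.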

Explore the Full Speed Leaderboard

Interactive speed rankings with filtering and real-time data.

Open Speed Leaderboard →

Related Articles