How do AI models perform on real agent tasks? PinchBench scores 446+ models across coding, reasoning, tool use, and instruction following, with live pricing data.
Balanced score across all agent capabilities
| # | Model | Provider | Score | Input $/M | Output $/M | Speed (tokens/s) | TTFT | Efficiency (score/$) |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 (xhigh) | OpenAI | 67.1 | $1.75 | $14.00 | 63 | 74.41s | 13.9 |
| 2 | Gemini 3.1 Pro Preview | Google | 66.8 | $2.00 | $12.00 | 115 | 31.49s | 14.8 |
| 3 | GPT-5.4 (xhigh) | OpenAI | 66.0 | $2.50 | $15.00 | 84 | 147.97s | 11.7 |
| 4 | Gemini 3 Pro Preview (high) | Google | 65.7 | $2.00 | $12.00 | 115 | 37.41s | 14.6 |
| 5 | Gemini 3 Flash Preview (Reasoning) | Google | 64.3 | $0.50 | $3.00 | 180 | 6.08s | 57.1 |
| 6 | GPT-5.3 Codex (xhigh) | OpenAI | 63.9 | $1.75 | $14.00 | 78 | 65.90s | 13.3 |
| 7 | Claude Opus 4.5 (Reasoning) | Anthropic | 63.4 | $5.00 | $25.00 | 64 | 10.18s | 6.3 |
| 8 | GPT-5.1 (high) | OpenAI | 63.4 | $1.25 | $10.00 | 85 | 31.07s | 18.4 |
| 9 | GPT-5.2 (medium) | OpenAI | 61.6 | $1.75 | $14.00 | – | – | 12.8 |
| 10 | GPT-5 Codex (high) | OpenAI | 61.6 | $1.25 | $10.00 | 216 | 12.05s | 17.9 |
| 11 | GLM-4.7 (Reasoning) | Z AI | 60.9 | $0.60 | $2.20 | 79 | 0.73s | 60.9 |
| 12 | GPT-5 (high) | OpenAI | 60.2 | $1.25 | $10.00 | 94 | 83.34s | 17.5 |
| 13 | GPT-5.1 Codex (high) | OpenAI | 59.7 | $1.25 | $10.00 | 139 | 7.13s | 17.4 |
| 14 | Grok 4.20 Beta 0309 (Reasoning) | xAI | 59.4 | $2.00 | $6.00 | 238 | 10.94s | 19.8 |
| 15 | Kimi K2 Thinking | Kimi | 59.2 | $0.60 | $2.50 | 104 | 0.66s | 55.1 |
| 16 | DeepSeek V3.2 (Reasoning) | DeepSeek | 58.9 | $0.28 | $0.42 | 31 | 1.54s | 187.0 |
| 17 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 58.7 | $5.00 | $25.00 | 54 | 12.36s | 5.9 |
| 18 | GPT-5 (medium) | OpenAI | 58.6 | $1.25 | $10.00 | 69 | 45.44s | 17.1 |
| 19 | GPT-5.2 Codex (xhigh) | OpenAI | 58.5 | $1.75 | $14.00 | 123 | 9.13s | 12.2 |
| 20 | MiMo-V2-Flash (Reasoning) | Xiaomi | 58.3 | $0.10 | $0.30 | 125 | 1.49s | 388.3 |
Cost-efficiency chart: score per dollar (higher = better value); only models with pricing data are shown.
Speed vs. score chart: models in the top right are both fast and capable.
PinchBench evaluates AI models on real-world agent tasks spanning coding, reasoning, tool use, and instruction following. Unlike academic benchmarks that test isolated capabilities, PinchBench combines multiple benchmark dimensions to reflect how models perform as autonomous agents in practical workflows.
PinchBench covers 6 scenarios:

- Coding Agent (code generation, debugging, terminal use)
- Reasoning & Logic (math, science, multi-step problems)
- Instruction Following (format compliance, structured output)
- Research & Analysis (scientific reasoning, knowledge)
- Tool Use & Agentic (multi-turn orchestration, planning)
- Overall (a balanced score across all of the above)
Each scenario uses a weighted combination of relevant benchmarks. For example, Coding Agent combines LiveCodeBench, TerminalBench, SciCode, and the Artificial Analysis Coding Index. Scores are normalized to 0–100. Cost efficiency is calculated as score divided by the blended price per million tokens.
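As a concrete illustration of that arithmetic, here is a minimal Python sketch. The Coding Agent weights are hypothetical, and the 3:1 input:output price blend is an assumption that happens to reproduce the Efficiency column above; neither is PinchBench's published methodology.

```python
# Illustrative sketch only (not PinchBench's actual code): a scenario score as a
# weighted average of normalized benchmark scores, plus a cost-efficiency figure.
# The weights and the 3:1 input:output price blend are assumptions.

def scenario_score(benchmark_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores, each already normalized to 0-100."""
    total_weight = sum(weights.values())
    return sum(benchmark_scores[name] * w for name, w in weights.items()) / total_weight

def cost_efficiency(score: float, input_per_m: float, output_per_m: float) -> float:
    """Score per blended dollar per million tokens, assuming a 3:1 input:output mix."""
    blended_price = (3 * input_per_m + output_per_m) / 4
    return score / blended_price

# Hypothetical weights and sub-scores for the Coding Agent scenario.
coding_weights = {"LiveCodeBench": 0.3, "TerminalBench": 0.3, "SciCode": 0.2, "AA Coding Index": 0.2}
coding_scores  = {"LiveCodeBench": 72.0, "TerminalBench": 55.0, "SciCode": 48.0, "AA Coding Index": 61.0}

print(round(scenario_score(coding_scores, coding_weights), 1))   # 59.9
# E.g. GPT-5.2 (xhigh): 67.1 score at $1.75 in / $14.00 out per million tokens
print(round(cost_efficiency(67.1, 1.75, 14.00), 1))              # 13.9, matching the table
```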
Academic benchmarks test specific skills in controlled conditions. Real agent tasks require combining multiple skills: a model might score well on individual benchmarks but struggle when tasks require coding + tool use + instruction following simultaneously. PinchBench's weighted scenario scores better approximate this combined performance.
PinchBench data refreshes hourly from the Artificial Analysis API, ensuring you see the latest benchmark scores and pricing for all models.
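As a rough sketch of such an hourly refresh loop (not the actual pipeline), the snippet below polls a JSON feed and writes it to a local cache. The URL and response schema are placeholders, not the real Artificial Analysis API.

```python
# Illustrative refresh loop only; FEED_URL and the response shape are placeholders,
# not the actual Artificial Analysis API.
import json
import time
import urllib.request

FEED_URL = "https://example.com/models.json"  # placeholder endpoint
CACHE_PATH = "models_cache.json"
REFRESH_SECONDS = 60 * 60  # hourly

def refresh_once() -> None:
    """Fetch the latest scores/pricing feed and overwrite the local cache."""
    with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
        data = json.load(resp)
    with open(CACHE_PATH, "w") as f:
        json.dump(data, f)

if __name__ == "__main__":
    while True:
        try:
            refresh_once()
        except OSError as err:  # network failure: keep serving the previous cache
            print(f"refresh failed: {err}")
        time.sleep(REFRESH_SECONDS)
```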