Live data · Updated hourly

PinchBench: Real-World AI Agent Benchmarks

How do AI models perform on real agent tasks? PinchBench scores 446+ models across coding, reasoning, tool use, and instruction following, with live pricing data.

Models Tested: 446
Scenarios: 6
Avg Score: 30.3
Best Value: gpt-oss-20B (high)
โญ Overall

Balanced score across all agent capabilities

intelligence index (15%)coding index (15%)math index (10%)gpqa (10%)livecodebench (10%)ifbench (10%)tau2 (10%)terminalbench hard (10%)hle (10%)
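The weighted blend above can be sketched in a few lines. This is a minimal illustration, not PinchBench's actual pipeline: the weights come from the list above, but the function name and the example benchmark values are ours.

```python
# Overall score as a weighted average of per-benchmark scores (0-100 scale).
# Weights are taken from the list above; missing benchmarks are skipped and
# the remaining weights are renormalized (an assumption on our part).

WEIGHTS = {
    "intelligence_index": 0.15,
    "coding_index": 0.15,
    "math_index": 0.10,
    "gpqa": 0.10,
    "livecodebench": 0.10,
    "ifbench": 0.10,
    "tau2": 0.10,
    "terminalbench_hard": 0.10,
    "hle": 0.10,
}

def overall_score(benchmarks: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model has results for."""
    total = sum(benchmarks[name] * w for name, w in WEIGHTS.items() if name in benchmarks)
    weight = sum(w for name, w in WEIGHTS.items() if name in benchmarks)
    return total / weight if weight else 0.0

# Illustrative: a model scoring 60 on every benchmark gets an Overall of 60.
scores = {name: 60.0 for name in WEIGHTS}
print(round(overall_score(scores), 1))  # 60.0
```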
🥇 #1 · GPT-5.2 (xhigh) · OpenAI · Score 67.1
Price: $4.81 · Speed: 63 · Efficiency: 13.9

🥈 #2 · Gemini 3.1 Pro Preview · Google · Score 66.8
Price: $4.50 · Speed: 115 · Efficiency: 14.8

🥉 #3 · GPT-5.4 (xhigh) · OpenAI · Score 66.0
Price: $5.63 · Speed: 84 · Efficiency: 11.7
| # | Model | Provider | Score | Input $/M | Output $/M | Speed | TTFT | Efficiency |
|---|-------|----------|-------|-----------|-----------|-------|------|------------|
| 1 | GPT-5.2 (xhigh) | OpenAI | 67.1 | $1.75 | $14.00 | 63 | 74.41s | 13.9 |
| 2 | Gemini 3.1 Pro Preview | Google | 66.8 | $2.00 | $12.00 | 115 | 31.49s | 14.8 |
| 3 | GPT-5.4 (xhigh) | OpenAI | 66.0 | $2.50 | $15.00 | 84 | 147.97s | 11.7 |
| 4 | Gemini 3 Pro Preview (high) | Google | 65.7 | $2.00 | $12.00 | 115 | 37.41s | 14.6 |
| 5 | Gemini 3 Flash Preview (Reasoning) | Google | 64.3 | $0.50 | $3.00 | 180 | 6.08s | 57.1 |
| 6 | GPT-5.3 Codex (xhigh) | OpenAI | 63.9 | $1.75 | $14.00 | 78 | 65.90s | 13.3 |
| 7 | Claude Opus 4.5 (Reasoning) | Anthropic | 63.4 | $5.00 | $25.00 | 64 | 10.18s | 6.3 |
| 8 | GPT-5.1 (high) | OpenAI | 63.4 | $1.25 | $10.00 | 85 | 31.07s | 18.4 |
| 9 | GPT-5.2 (medium) | OpenAI | 61.6 | $1.75 | $14.00 | — | — | 12.8 |
| 10 | GPT-5 Codex (high) | OpenAI | 61.6 | $1.25 | $10.00 | 216 | 12.05s | 17.9 |
| 11 | GLM-4.7 (Reasoning) | Z AI | 60.9 | $0.60 | $2.20 | 79 | 0.73s | 60.9 |
| 12 | GPT-5 (high) | OpenAI | 60.2 | $1.25 | $10.00 | 94 | 83.34s | 17.5 |
| 13 | GPT-5.1 Codex (high) | OpenAI | 59.7 | $1.25 | $10.00 | 139 | 7.13s | 17.4 |
| 14 | Grok 4.20 Beta 0309 (Reasoning) | xAI | 59.4 | $2.00 | $6.00 | 238 | 10.94s | 19.8 |
| 15 | Kimi K2 Thinking | Kimi | 59.2 | $0.60 | $2.50 | 104 | 0.66s | 55.1 |
| 16 | DeepSeek V3.2 (Reasoning) | DeepSeek | 58.9 | $0.28 | $0.42 | 31 | 1.54s | 187.0 |
| 17 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 58.7 | $5.00 | $25.00 | 54 | 12.36s | 5.9 |
| 18 | GPT-5 (medium) | OpenAI | 58.6 | $1.25 | $10.00 | 69 | 45.44s | 17.1 |
| 19 | GPT-5.2 Codex (xhigh) | OpenAI | 58.5 | $1.75 | $14.00 | 123 | 9.13s | 12.2 |
| 20 | MiMo-V2-Flash (Reasoning) | Xiaomi | 58.3 | $0.10 | $0.30 | 125 | 1.49s | 388.3 |

💰 Best Cost Efficiency · Overall

Score per dollar (higher = better value). Only models with pricing data are included.

| # | Model | Score per $ | Price |
|---|-------|-------------|-------|
| 1 | gpt-oss-20B (high) | 474.5 | $0.09 |
| 2 | Gemma 3n E4B Instruct | 456.0 | $0.03 |
| 3 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 420.5 | $0.10 |
| 4 | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 413.4 | $0.07 |
| 5 | Qwen3.5 9B (Reasoning) | 396.2 | $0.11 |
| 6 | MiMo-V2-Flash (Reasoning) | 388.3 | $0.15 |
| 7 | gpt-oss-20B (low) | 382.0 | $0.09 |
| 8 | MiMo-V2-Flash (Feb 2026) | 343.5 | $0.15 |
| 9 | Step 3.5 Flash | 327.2 | $0.15 |
| 10 | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 319.1 | $0.09 |

⚡ Score vs Speed · Overall

Models in the top-right are both fast and capable.

| Provider | Model | Score | Speed |
|----------|-------|-------|-------|
| Inception | Mercury 2 | 44.3 | 907 |
| IBM | Granite 4.0 H Small | 16.4 | 524 |
| NVIDIA | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 46.0 | 363 |
| Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 37.1 | 365 |
| OpenAI | gpt-oss-20B (high) | 44.6 | 316 |
| Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 31.1 | 344 |
| IBM | Granite 3.3 8B (Non-reasoning) | 10.6 | 386 |
| OpenAI | gpt-oss-20B (low) | 35.9 | 320 |
| Google | Gemini 2.5 Flash-Lite (Reasoning) | 29.5 | 318 |
| xAI | Grok 4.20 Beta 0309 (Reasoning) | 59.4 | 238 |

Frequently Asked Questions

What is PinchBench and how does it differ from traditional benchmarks?

PinchBench evaluates AI models on real-world agent tasks spanning coding, reasoning, tool use, and instruction following. Unlike academic benchmarks that test isolated capabilities, PinchBench combines multiple benchmark dimensions to reflect how models perform as autonomous agents in practical workflows.

Which scenarios does PinchBench test?

PinchBench covers 6 scenarios: Coding Agent (code generation, debugging, terminal use), Reasoning & Logic (math, science, multi-step problems), Instruction Following (format compliance, structured output), Research & Analysis (scientific reasoning, knowledge), Tool Use & Agentic (multi-turn orchestration, planning), and an Overall balanced score.

How are scores calculated?

Each scenario uses a weighted combination of relevant benchmarks. For example, Coding Agent combines LiveCodeBench, TerminalBench, SciCode, and the Artificial Analysis Coding Index. Scores are normalized to 0-100. Cost efficiency is calculated as score divided by price per million tokens.
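The cost-efficiency formula described above is a simple ratio. A minimal sketch, with illustrative values (the function name is ours, and we assume "price" means a blended per-million-token rate as shown in the leaderboard):

```python
def cost_efficiency(score: float, price_per_m_tokens: float) -> float:
    """Score divided by blended price per million tokens, per the FAQ above."""
    return score / price_per_m_tokens

# Illustrative: a model scoring 60.0 at a blended $0.50 per million tokens.
print(cost_efficiency(60.0, 0.50))  # 120.0
```

This is why cheap small models dominate the cost-efficiency ranking: halving the price doubles the ratio, while score differences between frontier and mid-tier models are comparatively small.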

Why do real-world results differ from academic benchmarks?

Academic benchmarks test specific skills in controlled conditions. Real agent tasks require combining multiple skills: a model might score well on individual benchmarks but struggle when a task demands coding, tool use, and instruction following simultaneously. PinchBench's weighted scenario scores better approximate this combined performance.

How often is the data updated?

PinchBench data refreshes hourly from the Artificial Analysis API, ensuring you see the latest benchmark scores and pricing for all models.