Live data · Updated hourly

PinchBench – Real-World AI Agent Benchmarks

How do AI models perform on real agent tasks? PinchBench scores 510+ models across coding, reasoning, tool use, and instruction following, with live pricing data.

Models Tested: 510 · Scenarios: 6 · Avg Score: 32.2 · Best Value: Qwen3.5 0.8B (Non-reasoning)
โญ Overall

Balanced score across all agent capabilities

Weights: Intelligence Index (15%), Coding Index (15%), Math Index (10%), GPQA (10%), LiveCodeBench (10%), IFBench (10%), Tau2 (10%), TerminalBench Hard (10%), HLE (10%).
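Concretely, the Overall score is just this weighted sum. A minimal sketch, assuming each component is already on the 0-100 scale the FAQ describes (the field names are illustrative, not PinchBench's actual schema):

```typescript
// Component results for one model, each already normalized to 0-100.
// Field names are illustrative, not PinchBench's actual schema.
type ComponentScores = {
  intelligenceIndex: number; // 15%
  codingIndex: number;       // 15%
  mathIndex: number;         // 10%
  gpqa: number;              // 10%
  livecodebench: number;     // 10%
  ifbench: number;           // 10%
  tau2: number;              // 10%
  terminalbenchHard: number; // 10%
  hle: number;               // 10%
};

// Weights from the Overall scenario above; they sum to 1.0.
const OVERALL_WEIGHTS: { [K in keyof ComponentScores]: number } = {
  intelligenceIndex: 0.15,
  codingIndex: 0.15,
  mathIndex: 0.1,
  gpqa: 0.1,
  livecodebench: 0.1,
  ifbench: 0.1,
  tau2: 0.1,
  terminalbenchHard: 0.1,
  hle: 0.1,
};

// Overall = weighted sum of the component scores.
function overallScore(scores: ComponentScores): number {
  return (Object.keys(OVERALL_WEIGHTS) as (keyof ComponentScores)[]).reduce(
    (sum, key) => sum + OVERALL_WEIGHTS[key] * scores[key],
    0
  );
}
```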
🥇 #1 · OpenAI · GPT-5.5 (xhigh)
Score: 68.4 · Price: $11.25 · Speed: 63 · Efficiency: 6.1

🥈 #2 · OpenAI · GPT-5.5 (high)
Score: 67.1 · Price: $11.25 · Speed: 61 · Efficiency: 6.0

🥉 #3 · OpenAI · GPT-5.2 (xhigh)
Score: 67.1 · Price: $4.81 · Speed: 69 · Efficiency: 13.9
| # | Model | Provider | Score | Input $/M | Output $/M | Speed (tok/s) | TTFT | Efficiency (score/$) |
|---|-------|----------|-------|-----------|------------|---------------|------|----------------------|
| 1 | GPT-5.5 (xhigh) | OpenAI | 68.4 | $5.00 | $30.00 | 63 | 46.82s | 6.1 |
| 2 | GPT-5.5 (high) | OpenAI | 67.1 | $5.00 | $30.00 | 61 | 19.21s | 6.0 |
| 3 | GPT-5.2 (xhigh) | OpenAI | 67.1 | $1.75 | $14.00 | 69 | 62.91s | 13.9 |
| 4 | Gemini 3.1 Pro Preview | Google | 66.8 | $2.00 | $12.00 | 133 | 23.51s | 14.8 |
| 5 | Gemini 3 Pro Preview (high) | Google | 65.7 | $2.00 | $12.00 | 125 | 69.37s | 14.6 |
| 6 | GPT-5.4 (xhigh) | OpenAI | 65.4 | $2.50 | $15.00 | 80 | 159.82s | 11.6 |
| 7 | GPT-5.5 (medium) | OpenAI | 65.4 | $5.00 | $30.00 | 61 | 3.52s | 5.8 |
| 8 | Gemini 3 Flash Preview (Reasoning) | Google | 64.3 | $0.50 | $3.00 | 197 | 5.65s | 57.1 |
| 9 | Claude Opus 4.5 (Reasoning) | Anthropic | 63.4 | $6.25 | $25.00 | 62 | 11.93s | 5.8 |
| 10 | GPT-5.1 (high) | OpenAI | 63.3 | $1.25 | $10.00 | 118 | 21.14s | 18.4 |
| 11 | GPT-5.3 Codex (xhigh) | OpenAI | 63.2 | $1.75 | $14.00 | 80 | 52.48s | 13.1 |
| 12 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 61.8 | $6.25 | $25.00 | 64 | 14.54s | 5.7 |
| 13 | Kimi K2.6 | Kimi | 61.8 | $0.95 | $4.00 | 40 | 1.35s | 36.1 |
| 14 | GPT-5.2 (medium) | OpenAI | 61.6 | $1.75 | $14.00 | – | – | 12.8 |
| 15 | GPT-5 Codex (high) | OpenAI | 61.6 | $1.25 | $10.00 | 171 | 6.50s | 17.9 |
| 16 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 61.5 | $1.74 | $3.48 | 31 | 1.19s | 28.3 |
| 17 | Muse Spark | Meta | 61.3 | – | – | – | – | – |
| 18 | GLM-4.7 (Reasoning) | Z AI | 60.9 | $0.60 | $2.20 | 107 | 0.77s | 60.9 |
| 19 | MiMo-V2.5-Pro | Xiaomi | 60.8 | $1.00 | $3.00 | 55 | 2.32s | 40.5 |
| 20 | Grok 4.3 | xAI | 60.4 | $1.25 | $2.50 | 86 | 10.26s | 38.6 |

💰 Best Cost Efficiency · Overall

Score per dollar (higher = better value). Only models with pricing data are included; a sketch of the calculation follows the table.

| # | Model | Score per $ | Price ($/M) |
|---|-------|-------------|-------------|
| 1 | Qwen3.5 0.8B (Non-reasoning) | 822.6 | $0.02 |
| 2 | Qwen3.5 4B (Reasoning) | 654.3 | $0.06 |
| 3 | Qwen3.5 0.8B (Reasoning) | 607.5 | $0.02 |
| 4 | Qwen3.5 2B (Non-reasoning) | 601.8 | $0.04 |
| 5 | Qwen3.5 2B (Reasoning) | 567.8 | $0.04 |
| 6 | Qwen3.5 4B (Non-reasoning) | 553.1 | $0.06 |
| 7 | gpt-oss-20B (high) | 506.9 | $0.09 |
| 8 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 460.0 | $0.10 |
| 9 | Gemma 3n E4B Instruct | 455.7 | $0.03 |
| 10 | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 413.4 | $0.07 |
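Per the FAQ below, cost efficiency is score divided by price per million tokens. The single "price" on the cards appears to be a 3:1 input:output blend; that ratio is an inference from the listed numbers ($5.00 in / $30.00 out blends to the $11.25 shown for GPT-5.5), not a documented formula. A minimal sketch under that assumption:

```typescript
// ASSUMPTION: blended price uses a 3:1 input:output token mix. This is
// inferred from the page's numbers ($5.00 in / $30.00 out -> $11.25),
// not from any documented PinchBench formula.
function blendedPrice(inputPerM: number, outputPerM: number): number {
  return (3 * inputPerM + outputPerM) / 4;
}

// Cost efficiency per the FAQ: score divided by price per million tokens.
function costEfficiency(score: number, inputPerM: number, outputPerM: number): number {
  return score / blendedPrice(inputPerM, outputPerM);
}

// Sanity checks against the leaderboard:
// costEfficiency(68.4, 5.0, 30.0) ≈ 6.1   (GPT-5.5 xhigh)
// costEfficiency(60.9, 0.6, 2.2)  ≈ 60.9  (GLM-4.7 Reasoning)
```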

⚡ Score vs Speed · Overall

Models in the top-right are both fast and capable. The points recovered from the chart are tabulated below; a sketch for picking out that frontier follows the table.

| Provider | Model | Score | Speed (tok/s) |
|----------|-------|-------|---------------|
| Inception | Mercury 2 | 44.3 | 736 |
| IBM | Granite 3.3 8B (Non-reasoning) | 10.6 | 380 |
| Alibaba | Qwen3.5 2B (Non-reasoning) | 24.1 | 333 |
| Google | Gemini 3.1 Flash-Lite Preview | 40.8 | 277 |
| NVIDIA | Nemotron 3 Nano Omni 30B A3B Reasoning | 27.9 | 308 |
| Amazon | Nova Micro | 12.7 | 341 |
| OpenAI | gpt-oss-20B (high) | 44.6 | 256 |
| Google | Gemini 3 Flash Preview (Reasoning) | 64.3 | 197 |
| OpenAI | gpt-oss-20B (low) | 35.9 | 266 |
| OpenAI | gpt-oss-120B (high) | 52.9 | 208 |
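One way to make "top-right" precise is the Pareto frontier: the models that no other model beats on both axes at once. A minimal sketch over points shaped like the table above (the helper is hypothetical, not part of PinchBench):

```typescript
type Point = { model: string; score: number; speed: number };

// A point is on the frontier ("top-right") if no other point is at least as
// good on both axes and strictly better on at least one.
function paretoFrontier(points: Point[]): Point[] {
  return points.filter(
    (p) =>
      !points.some(
        (q) =>
          q !== p &&
          q.score >= p.score &&
          q.speed >= p.speed &&
          (q.score > p.score || q.speed > p.speed)
      )
  );
}

// Applied to the table above, this keeps Mercury 2 (fastest), Gemini 3 Flash
// Preview (Reasoning) (highest score), and the two gpt-oss (high) models
// between them; every other point is dominated.
```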

Frequently Asked Questions

What is PinchBench and how does it differ from traditional benchmarks?

PinchBench evaluates AI models on real-world agent tasks spanning coding, reasoning, tool use, and instruction following. Unlike academic benchmarks that test isolated capabilities, PinchBench combines multiple benchmark dimensions to reflect how models perform as autonomous agents in practical workflows.

Which scenarios does PinchBench test?

PinchBench covers 6 scenarios: Coding Agent (code generation, debugging, terminal use), Reasoning & Logic (math, science, multi-step problems), Instruction Following (format compliance, structured output), Research & Analysis (scientific reasoning, knowledge), Tool Use & Agentic (multi-turn orchestration, planning), and an Overall balanced score.

How are scores calculated?

Each scenario uses a weighted combination of relevant benchmarks. For example, Coding Agent combines LiveCodeBench, TerminalBench, SciCode, and the Artificial Analysis Coding Index. Scores are normalized to 0-100. Cost efficiency is calculated as score divided by price per million tokens.
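The page doesn't spell out how raw benchmark results are mapped to 0-100; min-max scaling is one plausible reading, sketched here as an assumption rather than PinchBench's documented method. The weighted combination itself follows the pattern shown under the Overall card above.

```typescript
// ASSUMPTION: min-max scaling to 0-100. PinchBench does not document its
// normalization; this is one plausible implementation, not the actual one.
function normalizeToHundred(raw: number, min: number, max: number): number {
  if (max === min) return 0; // no spread: nothing meaningful to scale
  return (100 * (raw - min)) / (max - min);
}
```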

Why do real-world results differ from academic benchmarks?

Academic benchmarks test specific skills in controlled conditions. Real agent tasks require combining multiple skills: a model might score well on individual benchmarks but struggle when a task requires coding, tool use, and instruction following simultaneously. PinchBench's weighted scenario scores better approximate this combined performance.

How often is the data updated?

PinchBench data refreshes hourly from the Artificial Analysis API, ensuring you see the latest benchmark scores and pricing for all models.
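For illustration, an hourly refresh loop might look like the sketch below. The URL and response handling are placeholders; the real Artificial Analysis endpoint and schema are not shown on this page.

```typescript
// Placeholder URL: the actual Artificial Analysis endpoint is not documented here.
const DATA_URL = "https://example.com/api/llm-benchmarks";
const HOUR_MS = 60 * 60 * 1000;

let cache: unknown = null; // latest successfully fetched payload

async function refresh(): Promise<void> {
  try {
    const res = await fetch(DATA_URL);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    cache = await res.json(); // replace cache only on success
  } catch (err) {
    console.error("refresh failed, keeping stale data:", err); // stale-on-error
  }
}

refresh();                     // fetch once at startup
setInterval(refresh, HOUR_MS); // then once per hour
```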