How much does the OpenAI GPT-5.4 API cost?

GPT-5.4 API pricing is $2.50 per million input tokens and $15.00 per million output tokens. Use our calculator at aiapicost.com for exact cost estimates based on your usage.
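As a rough sketch of the arithmetic behind those rates, the snippet below multiplies token counts by the per-million prices quoted above. The helper name `estimate_cost_usd` and the example token counts are illustrative assumptions; this is not the aiapicost.com calculator or any official SDK.

```python
from decimal import Decimal

# Rates quoted above for GPT-5.4, in USD per 1 million tokens.
INPUT_USD_PER_MILLION = Decimal("2.50")
OUTPUT_USD_PER_MILLION = Decimal("15.00")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Made-up workload: 10,000 input tokens and 2,000 output tokens.
    print(estimate_cost_usd(10_000, 2_000))
```

For that made-up workload, the input side costs $0.025 and the output side $0.03, so the request totals $0.055.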

Which AI model is cheapest for API usage?

The cheapest AI API models change frequently. Use aiapicost.com to compare real-time pricing across 400+ models from OpenAI, Anthropic, Google, DeepSeek, and more. DeepSeek and open-source models typically offer the lowest per-token costs.

How do AI API token costs work?

AI APIs charge per token; one token is roughly 0.75 English words. Costs are split into input tokens (what you send) and output tokens (what the model generates). Output tokens are typically 2-5x more expensive than input tokens, and sometimes more: the GPT-5.4 rates quoted above work out to 6x. Prices are quoted per 1 million tokens.
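The 0.75-words-per-token rule of thumb can be turned into a quick estimator, sketched below. The function name `estimate_tokens` is an assumption made for illustration, and the result is only a heuristic: real tokenizers differ by model and language, so treat the output as a ballpark figure.

```python
import math

# Rule of thumb from the text: one token is roughly 0.75 words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Roughly estimate a token count from a whitespace word count.

    Hypothetical helper: it does not run a real tokenizer.
    """
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    prompt = "Summarize the quarterly report in three bullet points."
    print(estimate_tokens(prompt))
```

By this rule, a 750-word prompt is about 1,000 tokens.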

Claude vs ChatGPT: which is better?

Both are top-tier models. Claude excels at coding and instruction-following, while GPT-5.4 offers broader multimodal capabilities. Compare them head-to-head at aiapicost.com/compare with real benchmark data.

How is the Anthropic reasoning eval different from GPQA or HLE?

GPQA, AIME, and HLE are public benchmarks with fixed answer keys. The Anthropic reasoning eval uses proprietary Anthropic scoring across three sub-scores (novel reasoning, chain-of-thought robustness, and calibration) and produces a composite Anthropic Reasoning Index (ARI). It complements public benchmarks by testing real-world reasoning that production AI applications need.

AI Benchmarks 2026 — Claude Fable 5 leads, Anthropic Reasoning Eval (500+ Models, Free & Live)

Welcome to the most comprehensive free AI benchmark leaderboard, tracking 500+ language models across 12 industry-standard benchmarks. Our data is updated hourly from Artificial Analysis, ensuring you always see the latest performance numbers for reasoning, coding, math, and general knowledge — including the newest Anthropic reasoning eval (Claude Mythos benchmark), GPT-5.4 releases, and Gemini 3 Pro results.

Each model is evaluated on GPQA Diamond (graduate-level science reasoning), AIME 2025 (competition math), MMLU-Pro (multitask language understanding), HLE (Humanity's Last Exam, the hardest reasoning benchmark), LiveCodeBench and SWE-bench (coding ability), and the Artificial Analysis Intelligence Index (a composite score). Unlike other leaderboards that only show scores, we pair every benchmark result with real-time per-token API pricing, so you can compare both performance and cost efficiency in one place.
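One way to combine a benchmark score with per-token pricing is a simple "score per dollar" ratio, sketched below. Everything here is an assumption made for illustration: the `ModelEntry` type, the 3:1 input-to-output blend used for the price, and the placeholder entries (which are not real leaderboard data). It is not necessarily how the site ranks models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    benchmark_score: float          # e.g. a percentage on one benchmark
    input_usd_per_million: float
    output_usd_per_million: float


def blended_price(entry: ModelEntry, input_share: float = 0.75) -> float:
    """Blend input and output prices (assumed 3:1 input-to-output mix)."""
    return (input_share * entry.input_usd_per_million
            + (1.0 - input_share) * entry.output_usd_per_million)


def score_per_dollar(entry: ModelEntry) -> float:
    """Benchmark points per blended USD per million tokens."""
    price = blended_price(entry)
    if price <= 0:
        raise ValueError("blended price must be positive")
    return entry.benchmark_score / price


if __name__ == "__main__":
    # Placeholder entries with made-up numbers, not real results.
    entries = [
        ModelEntry("model-a", 80.0, 2.50, 15.00),
        ModelEntry("model-b", 70.0, 0.50, 1.50),
    ]
    for entry in sorted(entries, key=score_per_dollar, reverse=True):
        print(f"{entry.name}: {score_per_dollar(entry):.2f} points per USD")
```

A ratio like this rewards cheap models heavily, so it is best read next to the raw score rather than in place of it.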

Looking for the fastest AI model? Check the Speed Rankings for tokens-per-second throughput and time-to-first-token (TTFT) latency. Need reasoning power? The Math & Reasoning guide breaks down the top models by GPQA and AIME scores. Looking for the top LiveCodeBench models of 2026? The interactive table below ranks all models by LiveCodeBench (LCB) scores, updated hourly. Comparing AI coding tools? See AI coding plans & subscriptions side by side. For OpenAI API pricing as of July 2026 and all other model costs, see our AI API Pricing Guide with its live cost calculator.

Benchmarks Tracked (July 2026)

• GPQA Diamond — graduate-level science Q&A
• AIME 2025 — American Invitational Mathematics Examination
• MMLU-Pro — multitask language understanding across 14 disciplines
• HLE — Humanity's Last Exam (hardest reasoning)
• LiveCodeBench — live coding benchmark
• SWE-bench — software engineering agent benchmark
• MATH-500 — competition-level math problems
• SciCode — scientific coding problems
• IFBench — instruction-following benchmark
• TerminalBench — terminal / CLI task completion
• Claude Mythos benchmark — Anthropic reasoning eval
• Intelligence Index — Artificial Analysis composite

What is the Anthropic Reasoning Eval? (Claude Mythos benchmark)

The Anthropic reasoning eval (also called the Claude Mythos benchmark) is Anthropic's in-house evaluation suite for measuring how well Claude models reason through novel, open-ended problems that don't have a single correct answer. It complements industry benchmarks like GPQA Diamond, AIME 2025, and HLE by testing the kind of multi-step, real-world reasoning that production AI applications need — chain-of-thought robustness, refusal calibration, and the ability to recognize when a problem is underspecified.

Anthropic's flagship Claude Mythos model is the current top performer on the eval, with a composite Anthropic Reasoning Index (ARI) score of 76.2 — narrowly ahead of Claude Opus 4.8 (ARI 74.5) and Claude Sonnet 4.6 (ARI 71.0). The full eval leaderboard (refreshed quarterly by Anthropic) includes all Claude 4.x and 5.x family models plus a public set of open-weights comparison models.

Because the eval uses proprietary Anthropic scoring, its results are not aggregated into the Artificial Analysis Intelligence Index or other public composite scores. We list the Claude Mythos benchmark alongside GPQA Diamond, AIME 2025, MMLU-Pro, and HLE so that visitors looking for one canonical reference can compare all major reasoning benchmarks on a single page. For the official methodology and per-task breakdowns, see Anthropic's blog.

How is the Anthropic Reasoning Eval scored?

The eval produces three sub-scores: novel reasoning (solving problems the model hasn't seen during training), chain-of-thought robustness (consistency across multi-step reasoning chains), and calibration (how well the model recognizes unanswerable or underspecified prompts). The composite ARI is a weighted geometric mean of the three sub-scores, with the most weight on novel reasoning. A higher ARI is better. Scores above 70 are considered state-of-the-art as of mid-2026.
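A weighted geometric mean of three sub-scores can be sketched as follows. The weights below are hypothetical: the text says only that the composite leans toward novel reasoning and does not publish the actual values. The sub-score numbers in the example are placeholders, not real eval results.

```python
import math

# Hypothetical weights; the real weighting is not stated in the text.
HYPOTHETICAL_WEIGHTS = {
    "novel_reasoning": 0.5,
    "cot_robustness": 0.25,
    "calibration": 0.25,
}


def weighted_geometric_mean(scores: dict, weights: dict) -> float:
    """Compute exp(sum(w_i * ln(s_i)) / sum(w_i)) over matching keys."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must have the same keys")
    if any(value <= 0 for value in scores.values()):
        raise ValueError("sub-scores must be positive")
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    log_sum = sum(weights[key] * math.log(scores[key]) for key in scores)
    return math.exp(log_sum / total_weight)


if __name__ == "__main__":
    # Placeholder sub-scores, made up for illustration.
    sub_scores = {"novel_reasoning": 80.0, "cot_robustness": 60.0, "calibration": 60.0}
    print(round(weighted_geometric_mean(sub_scores, HYPOTHETICAL_WEIGHTS), 2))
```

Unlike an arithmetic average, a geometric mean is pulled down sharply by one weak sub-score, so a model cannot offset poor calibration with strong novel reasoning alone.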

Top Performing AI Models (July 2026)

Live leaderboard across the benchmarks users ask about most — GPQA Diamond, AIME 2025, MMLU-Pro, HLE, LiveCodeBench and more. Claude Fable 5 leads the Intelligence Index at 59.9, followed by Claude Opus 4.8 (55.7) and GPT-5.5 (54.8). Click any model for full pricing, speed, and side-by-side comparisons.