The order, as I see it
Ranked by the llmbusse index: a weighted blend tilted toward the benchmarks that predict real work. Data as of July 2026.
| # | Model | Maker | Index | | | | | | Context | Input price | Output price | License |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.7-Max | Alibaba (Qwen) | 88.9 | 92 | 80 | — | 97 | 92 | 1M | $2.5 | $7.5 | closed |
| 2 | GPT-5.2 | OpenAI | 88.7 | 92 | 80 | — | 100 | — | 400K | $1.75 | $14 | closed |
| 3 | DeepSeek-V4-Pro | DeepSeek | 88.1 | 90 | 81 | 88 | 95 | 94 | 1M | $0.435 | $0.87 | open |
| 4 | Gemini 3.1 Pro | Google DeepMind | 87.7 | 94 | 81 | 91 | — | — | 1M | $2 | $12 | closed |
| 5 | Gemini 3 Pro | Google DeepMind | 86.5 | 92 | 76 | 90 | 95 | — | 1M | $2 | $12 | closed |
| 6 | Grok 4 | xAI | 86.3 | 88 | — | — | 92 | 79 | 256K | $3 | $15 | closed |
| 7 | Claude Opus 4.8 | Anthropic | 86.2 | 94 | 89 | — | — | 69 | 1M | $5 | $25 | closed |
| 8 | Gemini 3 Flash | Google DeepMind | 86.1 | 90 | 78 | — | 95 | — | 1M | $0.5 | $3 | closed |
| 9 | GPT-5.5 | OpenAI | 85.7 | 94 | 81 | — | — | 83 | 1M | $5 | $30 | closed |
| 10 | DeepSeek-V4-Flash | DeepSeek | 85.2 | 88 | 79 | 86 | — | 92 | 1M | $0.14 | $0.28 | open |
| 11 | Gemini 3.1 Flash-Lite | Google DeepMind | 83.4 | 87 | — | 89 | — | 72 | 1M | $0.25 | $1.5 | closed |
| 12 | DeepSeek-V3.2 | DeepSeek | 78.8 | 82 | 70 | 85 | 84 | — | 164K | $0.28 | $0.42 | open |
| 13 | DeepSeek-R1-0528 | DeepSeek | 74.4 | 81 | 58 | 85 | 88 | 73 | 128K | $0.55 | $2.19 | open |
| 14 | Llama 4 Maverick | Meta | 65.5 | 70 | — | 81 | — | 43 | 1M | $0.2 | $0.6 | open |
Every number is sourced on each model page. Disagree with the weighting? The method is public.
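For a feel of what "a weighted blend" means in practice, here is a minimal sketch. It is not the published method: the benchmark keys and weights are hypothetical placeholders, and skipping missing scores (the dashes in the table) and renormalising the remaining weights is an assumption about how gaps might be handled.

```python
from typing import Optional

# Hypothetical weights for illustration only; the real ones live on the method page.
WEIGHTS = {"bench_a": 0.40, "bench_b": 0.35, "bench_c": 0.25}


def blended_index(
    scores: dict[str, Optional[float]],
    weights: dict[str, float] = WEIGHTS,
) -> Optional[float]:
    """Weighted mean over the benchmarks a model actually reports.

    Scores that are None (a dash in the table) are skipped, and the
    remaining weights are renormalised so they still sum to 1.
    This gap handling is an assumption, not the documented method.
    """
    present = {
        name: value
        for name, value in scores.items()
        if name in weights and value is not None
    }
    if not present:
        return None  # nothing comparable was published, so no score
    total_weight = sum(weights[name] for name in present)
    return sum(weights[name] * value for name, value in present.items()) / total_weight


if __name__ == "__main__":
    # A made-up model that reports two of the three hypothetical benchmarks.
    example = {"bench_a": 90.0, "bench_b": 80.0, "bench_c": None}
    print(round(blended_index(example), 1))
```

Under this scheme a model that reports nothing comparable gets no score at all, which is the same reason the models in the next section are listed but not ranked.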
Off the board
Strong or brand-new models I can't rank, because their makers stopped publishing the comparable benchmarks. Listed so you know they exist, not scored because I won't fake it.
- Claude Fable 5, Anthropic
The most capable Claude, and top of the independent intelligence index at launch. It is off the board here because Anthropic stopped publishing the comparable suite, and because its safety classifiers refuse enough real work that even benchmarking it is a fight. Powerful, gated, expensive.
- Claude Sonnet 5, Anthropic
Near-Opus agentic muscle at half the Opus price, and the one I would hand most day-jobs. Off the board only because it is days old and reports vendor evals, not the standard suite. Watch this one.
- GPT-5.6 Sol, OpenAI
OpenAI's newest, and you probably cannot have it: limited preview, rollout throttled at the government's request. It tops a coding benchmark and got flagged for the highest reward-hacking rate METR had ever measured. Read that headline number with tongs.
- GLM-5.2, Z.ai (Zhipu AI)
The strongest open-weights model on the independent index, MIT-licensed, and a genuine problem for the closed labs. Off the board here only because Z.ai skipped SWE-bench Verified, MMLU-Pro and LiveCodeBench, so I cannot place it fairly against the rest.
- Grok 4.3, xAI
Cheap, fast, and genuinely good at tool use. Off the board because xAI now publishes almost nothing you can compare, leaning on their own agentic metrics instead. Trust it for agents, verify everything else.
- Gemini 3.5 Flash, Google DeepMind
Google's newest Flash, tuned for agents and, by their own tables, beating last year's Pro. Off the board because they withheld the standard academic evals. The vendor charts look great; I want the neutral ones.
- GPT-5.3-Codex, OpenAI
The Codex-line coding specialist, now being quietly retired in favour of the general models. Good SWE-bench, thin on everything else, which is why it sits here rather than on the board.