AI Socratic Leaderboard

Scores across benchmarks

Socratic Score = mean of the normalized benchmark scores. LM Arena is normalized as (Elo - 1000) / 400 × 100 and Vending-Bench as balance / $10,000 × 100; SWE-bench, ARC-AGI, and HLE are used as-is (0-100%).
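The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation; in particular, the handling of missing benchmark results (skipping them and averaging over the rest) is an assumption.

```python
# Sketch of the Socratic Score under the stated normalization.
# Assumption: benchmarks with no result (None) are skipped, and the
# score is the mean of the remaining normalized values.

def normalize_arena(elo: float) -> float:
    """LM Arena: (Elo - 1000) / 400 x 100."""
    return (elo - 1000) / 400 * 100

def normalize_vending(balance: float) -> float:
    """Vending-Bench: balance / $10,000 x 100."""
    return balance / 10_000 * 100

def socratic_score(arena=None, swe=None, arc=None, hle=None, vending=None):
    parts = []
    if arena is not None:
        parts.append(normalize_arena(arena))
    if vending is not None:
        parts.append(normalize_vending(vending))
    # SWE-bench, ARC-AGI, and HLE are already percentages (0-100).
    parts += [v for v in (swe, arc, hle) if v is not None]
    return sum(parts) / len(parts)

# Hypothetical model: Elo 1400, SWE-bench 70%, $5,000 vending balance.
print(socratic_score(arena=1400, swe=70.0, vending=5_000))  # (100 + 50 + 70) / 3 ≈ 73.33
```

Note that an Elo above 1400 normalizes to more than 100, so LM Arena can contribute disproportionately to the mean.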

| # | Model | Score | LM Arena | SWE-bench | ARC-AGI-2 | HLE | Vending | Vibe Bench |
|---|-------|-------|----------|-----------|-----------|-----|---------|------------|
| 🥇 | Anthropic Claude Opus 4.6 | 81.2 | 1504 | 75.6% | 69.2% ($3.47) | - | $8,017.59 | 50% |
| 🥈 | OpenAI GPT-5.2 | 70.4 | 1481 | 72.8% | 72.9% ($38.99) | - | $3,591.33 | 24% |
| 🥉 | Google Gemini 3 Pro | 63.3 | 1485 | 69.6% | 54.0% ($30.57) | 38.3% | $5,478.16 | 16% |
| 4 | xAI Grok 4 | 51.3 | 1493 | - | 29.4% ($30.40) | 24.5% | - | 3% |
| 5 | DeepSeek V3 | 37.0 | - | 70.0% | 4.0% ($0.12) | - | - | - |
| 6 | Alibaba Qwen3 | 1.3 | - | - | 1.3% ($0.00) | - | - | - |

Vibe Bench

Community favorites · Socratic Feb · 38 responses

| Tool | Mentions |
|------|----------|
| Anthropic Claude | 19 |
| Anthropic Claude Code | 12 |
| OpenAI ChatGPT | 9 |
| Google Gemini | 6 |
| OpenAI Codex | 6 |
| Open Source | 5 |
| Cursor | 4 |
| Other Tools | 2 |
| xAI Grok | 1 |
| Perplexity | 0 |
| Windsurf | 0 |

Vibe Bench: trends

Mentions per event (absolute count)

SWE-bench Verified

Real-world software engineering tasks

| # | Model | % Resolved |
|---|-------|------------|
| 🥇 | Anthropic Claude 4.5 Opus (high reasoning) | 76.8% |
| 🥈 | Google Gemini 3 Flash (high reasoning) | 75.8% |
| 🥉 | MiniMax M2.5 (high reasoning) | 75.8% |
| 4 | Anthropic Claude Opus 4.6 | 75.6% |
| 5 | OpenAI GPT-5-2 Codex | 72.8% |
| 6 | Zhipu AI GLM-5 (high reasoning) | 72.8% |
| 7 | OpenAI GPT-5-2 (high reasoning) | 72.8% |
| 8 | OpenAI GPT 5.2 Codex | 72.8% |
| 9 | Anthropic Claude 4.5 Sonnet (high reasoning) | 71.4% |
| 10 | Kimi K2.5 (high reasoning) | 70.8% |

ARC-AGI-2 (Semi-Private)

Abstract reasoning capabilities

| # | Model | Score |
|---|-------|-------|
| 🥇 | Google Gemini 3 Deep Think (2/26) | 84.6% |
| 🥈 | OpenAI GPT-5.4 Pro (xHigh) | 83.3% |
| 🥉 | Google Gemini 3.1 Pro (Preview) | 77.1% |
| 4 | OpenAI GPT-5.4 (xHigh) | 74.0% |
| 5 | OpenAI GPT-5.2 (Refine.) | 72.9% |
| 6 | Anthropic Claude Opus 4.6 (120K, High) | 69.2% |
| 7 | Anthropic Claude Opus 4.6 (120K, Max) | 68.8% |
| 8 | OpenAI GPT-5.4 (High) | 67.5% |
| 9 | Anthropic Claude Opus 4.6 (120K, Medium) | 66.3% |
| 10 | Anthropic Claude Opus 4.6 (120K, Low) | 64.6% |

Humanity's Last Exam (HLE)

Expert-level reasoning across disciplines

| # | Model | Accuracy |
|---|-------|----------|
| 🥇 | Google Gemini 3 Pro | 38.3% |
| 🥈 | OpenAI GPT-5 | 25.3% |
| 🥉 | xAI Grok 4 | 24.5% |
| 4 | Google Gemini 2.5 Pro | 21.6% |
| 5 | OpenAI GPT-5-mini | 19.4% |
| 6 | Anthropic Claude 4.5 Sonnet | 13.7% |
| 7 | Google Gemini 2.5 Flash | 12.1% |
| 8 | DeepSeek-R1* | 8.5% |
| 9 | OpenAI o1 | 8.0% |
| 10 | OpenAI GPT-4o | 2.7% |

LM Arena - Text

Crowdsourced human evaluations · 1 day ago

| # | Model | Score | Votes |
|---|-------|-------|-------|
| 🥇 | Anthropic claude-opus-4-6 | 1504 | 8,945 |
| 🥈 | Google gemini-3.1-pro-preview | 1500 | 4,042 |
| 🥉 | Anthropic claude-opus-4-6-thinking | 1500 | 8,073 |
| 4 | xAI grok-4.20-beta1 | 1493 | 5,071 |
| 5 | Google gemini-3-pro | 1485 | 39,673 |
| 6 | OpenAI gpt-5.2-chat-latest-20260210 | 1481 | 5,502 |
| 7 | OpenAI gpt-5.4-high | 1480 | 2,290 |
| 8 | Google gemini-3-flash | 1473 | 30,621 |
| 9 | xAI grok-4.1-thinking | 1473 | 39,058 |
| 10 | Anthropic claude-opus-4-5-20251101-thinking-32k | 1471 | 32,254 |

Vending-Bench 2 (Andon Labs)

Long-term agentic coherence

| # | Model | Balance |
|---|-------|---------|
| 🥇 | Anthropic Claude Opus 4.6 | $8,017.59 |
| 🥈 | Anthropic Claude Sonnet 4.6 | $7,204.14 |
| 🥉 | OpenAI GPT-5.3-Codex | $5,940.12 |
| 4 | Google Gemini 3 Pro | $5,478.16 |
| 5 | Anthropic Claude Opus 4.5 | $4,967.06 |
| 6 | Zhipu AI GLM-5 | $4,432.12 |
| 7 | Anthropic Claude Sonnet 4.5 | $3,838.74 |
| 8 | Google Gemini 3.1 Pro Custom Tools | $3,774.25 |
| 9 | Google Gemini 3 Flash | $3,634.72 |
| 10 | OpenAI GPT-5.2 | $3,591.33 |
