AI Socratic

AI Leaderboard

Socratic Score = mean of the normalized benchmark scores. LM Arena: (Elo - 1000) / 400 × 100. Vending-Bench: balance / $10,000 × 100. SWE-bench, ARC-AGI-2, and HLE are used as-is (0-100%).
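The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function names are hypothetical, and the page does not say how missing benchmarks are handled or whether the Prediction and Vibe Bench columns feed into the score. The sketch assumes missing values are simply skipped, so it should not be expected to reproduce the published Score column.

```python
from statistics import mean
from typing import Optional


def normalize_lm_arena(elo: float) -> float:
    """Map an LM Arena Elo rating to the stated scale: (Elo - 1000) / 400 * 100."""
    return (elo - 1000) / 400 * 100


def normalize_vending(balance_usd: float) -> float:
    """Map a Vending-Bench balance to the stated scale: balance / $10,000 * 100."""
    return balance_usd / 10_000 * 100


def socratic_score(
    lm_arena_elo: Optional[float] = None,
    swe_bench: Optional[float] = None,      # percent, used as-is
    arc_agi: Optional[float] = None,        # percent, used as-is
    hle: Optional[float] = None,            # percent, used as-is
    vending_balance: Optional[float] = None,
) -> Optional[float]:
    """Mean of whichever normalized benchmarks are present.

    Assumption (not stated on the page): missing benchmarks are skipped
    rather than counted as zero.
    """
    parts = []
    if lm_arena_elo is not None:
        parts.append(normalize_lm_arena(lm_arena_elo))
    if vending_balance is not None:
        parts.append(normalize_vending(vending_balance))
    for pct in (swe_bench, arc_agi, hle):
        if pct is not None:
            parts.append(pct)
    return mean(parts) if parts else None
```

Note that an Elo above 1400 normalizes to more than 100 under the stated formula; whether the site caps that value is not specified.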

AI Socratic Leaderboard

Scores across benchmarks

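| # | Vendor | Model | Score | LM Arena | SWE-bench | ARC-AGI-2 | HLE | Vending | Prediction | Vibe Bench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 | Google | Gemini 3.1 Pro | 82.0 | 1492 | - | 77.1% ($0.96) | - | - | +38.0% | 15% |
| 🥈 | OpenAI | GPT 5.4 | 73.8 | 1484 | - | 83.3% ($16.41) | - | $6,144.18 | +1.2% | 25% |
| 🥉 | Anthropic | Claude Opus 4.6 | 72.6 | 1504 | 75.6% | 69.2% ($3.47) | - | $8,017.59 | -24.1% | 50% |
| 4 | OpenAI | GPT 5.2 | 70.6 | 1477 | 72.8% | 72.9% ($38.99) | - | - | -26.8% | 25% |
| 5 | Anthropic | Claude Sonnet 4.6 | 66.2 | - | - | 60.4% ($2.70) | - | $7,204.14 | - | 50% |
| 6 | xAI | Grok 4.20 | 62.9 | 1486 | - | 65.1% ($0.92) | - | $4,662.85 | -20.0% | 3% |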

Vibe Bench

Community favorites · Socratic Feb · 40 responses

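| Vendor | Tool | Count |
| --- | --- | --- |
| Anthropic | Claude | 20 |
| Anthropic | Claude Code | 13 |
| OpenAI | ChatGPT | 10 |
| Google | Gemini | 6 |
| OpenAI | Codex | 6 |
| | Open Source | 5 |
| Cursor | Cursor | 4 |
| | Other Tools | 3 |
| xAI | Grok | 1 |
| Perplexity | Perplexity | 0 |
| Windsurf | Windsurf | 0 |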
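The Vibe Bench percentages in the scores table (50%, 25%, 15%, 3%) are consistent with dividing the Claude, ChatGPT, Gemini, and Grok counts here by the 40 responses and rounding half up. The page does not state this, so the sketch below is an inference; the function name `vibe_share` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def vibe_share(count: int, responses: int = 40) -> int:
    """Share of responses naming a tool, as a whole percent.

    Assumption: the percentage is count / responses, rounded half up
    (so 2.5% becomes 3%). Python's built-in round() rounds half to even
    and would give 2, hence the use of Decimal here.
    """
    pct = Decimal(count) * 100 / Decimal(responses)
    return int(pct.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Counts taken from the Vibe Bench list for Socratic Feb (40 responses).
for tool, count in [("Claude", 20), ("ChatGPT", 10), ("Gemini", 6), ("Grok", 1)]:
    print(f"{tool}: {vibe_share(count)}%")
```

Because respondents could apparently name more than one favorite (the counts sum to more than 40), these shares are not expected to add up to 100%.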

Vibe Bench: trends

Mentions per event (absolute count)

SWE-bench

Real-world software engineering tasks

| # | Vendor | Model | % Resolved |
| --- | --- | --- | --- |
| 🥇 | Anthropic | Claude 4.5 Opus (high reasoning) | 76.8% |
| 🥈 | Google | Gemini 3 Flash (high reasoning) | 75.8% |
| 🥉 | MiniMax | MiniMax M2.5 (high reasoning) | 75.8% |
| 4 | Anthropic | Claude Opus 4.6 | 75.6% |
| 5 | OpenAI | GPT-5-2 Codex | 72.8% |
| 6 | Zhipu AI | GLM-5 (high reasoning) | 72.8% |
| 7 | OpenAI | GPT-5-2 (high reasoning) | 72.8% |
| 8 | OpenAI | GPT 5.2 Codex | 72.8% |
| 9 | Anthropic | Claude 4.5 Sonnet (high reasoning) | 71.4% |
| 10 | | Kimi K2.5 (high reasoning) | 70.8% |

ARC-AGI-2 Semi-Private

Abstract reasoning capabilities

| # | Vendor | Model | Score |
| --- | --- | --- | --- |
| 🥇 | Google | Gemini 3 Deep Think (2/26) | 84.6% |
| 🥈 | OpenAI | GPT-5.4 Pro (xHigh) | 83.3% |
| 🥉 | Google | Gemini 3.1 Pro (Preview) | 77.1% |
| 4 | OpenAI | GPT-5.4 (xHigh) | 74.0% |
| 5 | OpenAI | GPT-5.2 (Refine.) | 72.9% |
| 6 | Anthropic | Claude Opus 4.6 (120K, High) | 69.2% |
| 7 | Anthropic | Claude Opus 4.6 (120K, Max) | 68.8% |
| 8 | OpenAI | GPT-5.4 (High) | 67.5% |
| 9 | Anthropic | Claude Opus 4.6 (120K, Medium) | 66.3% |
| 10 | xAI | Grok 4.20 (Reasoning) | 65.1% |

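HLE

Expert-level reasoning across disciplines

| # | Vendor | Model | Accuracy |
| --- | --- | --- | --- |
| 🥇 | Google | Gemini 3 Pro | 38.3% |
| 🥈 | OpenAI | GPT-5 | 25.3% |
| 🥉 | xAI | Grok 4 | 24.5% |
| 4 | Google | Gemini 2.5 Pro | 21.6% |
| 5 | OpenAI | GPT-5-mini | 19.4% |
| 6 | Anthropic | Claude 4.5 Sonnet | 13.7% |
| 7 | Google | Gemini 2.5 Flash | 12.1% |
| 8 | DeepSeek | DeepSeek-R1* | 8.5% |
| 9 | OpenAI | o1 | 8.0% |
| 10 | OpenAI | GPT-4o | 2.7% |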

LM Arena

Crowdsourced human evaluations

| # | Vendor | Model | Score | Votes |
| --- | --- | --- | --- | --- |
| 🥇 | Anthropic | claude-opus-4-6-thinking | 1504 | 16,278 |
| 🥈 | Anthropic | claude-opus-4-6 | 1496 | 17,416 |
| 🥉 | | muse-spark | 1493 | 3,268 |
| 4 | Google | gemini-3.1-pro-preview | 1492 | 20,531 |
| 5 | Google | gemini-3-pro | 1486 | 41,585 |
| 6 | xAI | grok-4.20-beta1 | 1486 | 9,689 |
| 7 | OpenAI | gpt-5.4-high | 1484 | 9,681 |
| 8 | xAI | grok-4.20-beta-0309-reasoning | 1478 | 9,781 |
| 9 | OpenAI | gpt-5.2-chat-latest-20260210 | 1477 | 15,704 |
| 10 | xAI | grok-4.20-multi-agent-beta-0309 | 1476 | 10,112 |

Vending-Bench

Long-term agentic coherence

| # | Vendor | Model | Balance |
| --- | --- | --- | --- |
| 🥇 | Anthropic | Claude Opus 4.6 | $8,017.59 |
| 🥈 | Anthropic | Claude Sonnet 4.6 | $7,204.14 |
| 🥉 | OpenAI | GPT-5.4 | $6,144.18 |
| 4 | OpenAI | GPT-5.3-Codex | $5,940.12 |
| 5 | Zhipu AI | GLM-5.1 (New) | $5,634.41 |
| 6 | Google | Gemini 3 Pro | $5,478.16 |
| 7 | Alibaba | Qwen 3.6 Plus (New) | $5,114.87 |
| 8 | Anthropic | Claude Opus 4.5 | $4,967.06 |
| 9 | xAI | Grok 4.20 | $4,662.85 |
| 10 | Zhipu AI | GLM-5 | $4,432.12 |

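Prediction

AI prediction market performance

| # | Vendor | Agent | Return | Sharpe |
| --- | --- | --- | --- | --- |
| 🥇 | Zhipu AI | GLM 5 | 40.3% | 0.06 |
| 🥈 | Google | Gemini 3.1 Pro | 38.0% | 0.04 |
| 🥉 | OpenAI | GPT 5.4 | 1.2% | 0.03 |
| 4 | Zhipu AI | GLM 4.7 | -15.4% | -0.12 |
| 5 | | Mystery Model Alpha | -20.0% | -0.05 |
| 6 | Anthropic | Claude Opus 4.6 | -24.1% | -0.04 |
| 7 | Anthropic | Claude Opus 4.5 | -26.6% | -0.10 |
| 8 | OpenAI | GPT 5.2 | -26.8% | -0.09 |
| 9 | xAI | Grok 4.1 | -30.8% | -0.07 |
| 10 | Google | Gemini 3 Pro | -31.0% | -0.11 |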
