AI Leaderboard
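Socratic Score = mean of the normalized benchmarks available for a model. LM Arena: (Elo - 1000) / 400 × 100. Vending-Bench: balance / $10,000 × 100. SWE-bench, ARC-AGI-2, and HLE are used as-is (0-100%). The published scores are consistent with each normalized value being capped at 100 and with missing benchmarks ("-") being left out of the mean; Vibe Bench is not part of the score.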
AI Socratic Leaderboard
Scores across benchmarks
| # | Model | Score | LM Arena | SWE-bench | ARC-AGI-2 | HLE | Vending | Vibe Bench |
|---|---|---|---|---|---|---|---|---|
| 🥇 | | 81.2 | 1504 | 75.6% | 69.2% ($3.47) | - | $8,017.59 | 50% |
| 🥈 | | 70.4 | 1481 | 72.8% | 72.9% ($38.99) | - | $3,591.33 | 24% |
| 🥉 | | 63.3 | 1485 | 69.6% | 54.0% ($30.57) | 38.3% | $5,478.16 | 16% |
| 4 | | 51.3 | 1493 | - | 29.4% ($30.40) | 24.5% | - | 3% |
| 5 | | 37.0 | - | 70.0% | 4.0% ($0.12) | - | - | - |
| 6 | | 1.3 | - | - | 1.3% ($0.00) | - | - | - |
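The Socratic Score formula stated at the top of the page can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the function name `socratic_score` and its parameter names are hypothetical, and two behaviours are assumptions inferred from the Score column rather than stated in the source, namely that each normalized value is capped at 100 and that benchmarks shown as "-" are skipped.

```python
from typing import Optional


def socratic_score(
    lm_arena_elo: Optional[float] = None,
    swe_bench_pct: Optional[float] = None,
    arc_agi_pct: Optional[float] = None,
    hle_pct: Optional[float] = None,
    vending_balance_usd: Optional[float] = None,
) -> float:
    """Mean of the normalized benchmarks that are present (hypothetical helper)."""
    normalized = []
    if lm_arena_elo is not None:
        # LM Arena: (Elo - 1000) / 400 x 100
        normalized.append((lm_arena_elo - 1000) / 400 * 100)
    if vending_balance_usd is not None:
        # Vending-Bench: balance / $10k x 100
        normalized.append(vending_balance_usd / 10_000 * 100)
    # SWE-bench, ARC-AGI-2, and HLE are already percentages and used as-is.
    for pct in (swe_bench_pct, arc_agi_pct, hle_pct):
        if pct is not None:
            normalized.append(pct)
    if not normalized:
        raise ValueError("at least one benchmark result is required")
    # Assumption: cap each component at 100 (inferred from the table, not stated).
    capped = [min(value, 100.0) for value in normalized]
    return round(sum(capped) / len(capped), 1)


# First row of the table: Elo 1504, SWE-bench 75.6%, ARC-AGI-2 69.2%, $8,017.59
print(socratic_score(lm_arena_elo=1504, swe_bench_pct=75.6,
                     arc_agi_pct=69.2, vending_balance_usd=8017.59))  # 81.2
```

Without the cap, the first row would average to about 87.7 rather than the listed 81.2, which is why the cap is included here.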
Vibe Bench
Community favorites · Socratic Feb · 38 responses
| Favorite | Mentions |
|---|---|
| | 19 |
| | 12 |
| | 9 |
| | 6 |
| | 6 |
| Open Source | 5 |
| | 4 |
| Other Tools | 2 |
| | 1 |
| | 0 |
| | 0 |
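The Vibe Bench percentages in the main leaderboard table (50%, 24%, 16%, 3%) appear to be these counts divided by the 38 responses. That relationship is an inference from the numbers, not something the page states, and the helper name `vibe_share` is hypothetical. A minimal sketch:

```python
def vibe_share(mentions: int, responses: int = 38) -> int:
    """Share of respondents who mentioned an entry, as a whole percentage.

    Assumption: the leaderboard's Vibe Bench column is mentions / responses.
    """
    if responses <= 0:
        raise ValueError("responses must be positive")
    return round(mentions / responses * 100)


for count in (19, 9, 6, 1):
    print(count, f"{vibe_share(count)}%")
```

Because the counts sum to more than 38, respondents could evidently name more than one favorite, so the shares do not add up to 100%.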
Vibe Bench: trends
Mentions per event (absolute count)
SWE-bench Bash
Verified · Real-world software engineering tasks
| # | Model | % Resolved |
|---|---|---|
| 🥇 | | 76.8% |
| 🥈 | | 75.8% |
| 🥉 | | 75.8% |
| 4 | | 75.6% |
| 5 | | 72.8% |
| 6 | | 72.8% |
| 7 | | 72.8% |
| 8 | | 72.8% |
| 9 | | 71.4% |
| 10 | Kimi K2.5 (high reasoning) | 70.8% |
ARC-AGI-2
Semi-Private · Abstract reasoning capabilities
| # | Model | Score |
|---|---|---|
| 🥇 | | 84.6% |
| 🥈 | | 83.3% |
| 🥉 | | 77.1% |
| 4 | | 74.0% |
| 5 | | 72.9% |
| 6 | | 69.2% |
| 7 | | 68.8% |
| 8 | | 67.5% |
| 9 | | 66.3% |
| 10 | | 64.6% |
Humanity's Last Exam
HLE · Expert-level reasoning across disciplines
| # | Model | Accuracy |
|---|---|---|
| 🥇 | | 38.3% |
| 🥈 | | 25.3% |
| 🥉 | | 24.5% |
| 4 | | 21.6% |
| 5 | | 19.4% |
| 6 | | 13.7% |
| 7 | | 12.1% |
| 8 | | 8.5% |
| 9 | | 8.0% |
| 10 | | 2.7% |
LM Arena - Text
1 day ago · Crowdsourced human evaluations
| # | Model | Score | Votes |
|---|---|---|---|
| 🥇 | | 1504 | 8,945 |
| 🥈 | | 1500 | 4,042 |
| 🥉 | | 1500 | 8,073 |
| 4 | | 1493 | 5,071 |
| 5 | | 1485 | 39,673 |
| 6 | | 1481 | 5,502 |
| 7 | | 1480 | 2,290 |
| 8 | | 1473 | 30,621 |
| 9 | | 1473 | 39,058 |
| 10 | | 1471 | 32,254 |
Vending-Bench 2
Andon Labs · Long-term agentic coherence
| # | Model | Balance |
|---|---|---|
| 🥇 | | $8,017.59 |
| 🥈 | | $7,204.14 |
| 🥉 | | $5,940.12 |
| 4 | | $5,478.16 |
| 5 | | $4,967.06 |
| 6 | | $4,432.12 |
| 7 | | $3,838.74 |
| 8 | | $3,774.25 |
| 9 | | $3,634.72 |
| 10 | | $3,591.33 |