AI Socratic

Over the past two years, the AI NY community has been reviewing and discussing benchmarks while tracking the rapid progress of new models. It has become increasingly clear that most models are heavily overfitted to these benchmarks. As a result, the benchmarks are less a true measure of real-world performance and more a way to track the pace of new model releases.

| Model | ARC-AGI 1 (score / cost) | ARC-AGI 2 (score / cost) | OpenRouter (this week) | Live Bench (live average) | Humanity's Last Exam | Average Score |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 (High) | 65.7% / $0.50 | 9.9% / $0.73 | 62.7B tokens | 77.39 | 25.32 | 44.57 |
| Grok 4 (Thinking) | 66.7% / $1.11 | 16% / $2.17 | 93.1B tokens | 70.32 | N/A | 51.00 |
| Claude Sonnet 4 (Thinking 16K) | 40% / $0.366 | 5.9% / $0.48 | 545B tokens | 71.02 | 7.76 | 31.16 |
| Gemini Flash 2.5 (Thinking 16K) | 33% / $0.21 | 2% / $0.31 | 259B tokens | 64.42 | 12.08 | 27.87 |
| Gemini 2.5 Pro (Thinking 16K) | 41% / $0.484 | 4.0% / $0.72 | 150B tokens | 65.70 | 21.64 | 33.08 |
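The post does not say how the Average Score column is computed. The figures are consistent with a plain mean of the four score columns (ARC-AGI 1, ARC-AGI 2, Live Bench, Humanity's Last Exam), skipping N/A entries and leaving out the OpenRouter token counts. The sketch below works from that assumption; the function name `average_score` and the `rows` layout are hypothetical and chosen only for illustration.

```python
from typing import Optional

def average_score(scores: list[Optional[float]]) -> float:
    """Mean of the available scores; None stands for an N/A entry."""
    present = [s for s in scores if s is not None]
    if not present:
        raise ValueError("no scores available")
    return sum(present) / len(present)

# Assumed column order: ARC-AGI 1, ARC-AGI 2, Live Bench, Humanity's Last Exam.
# Values are copied from the table; costs and token counts are left out.
rows = {
    "GPT-5 (High)": [65.7, 9.9, 77.39, 25.32],
    "Grok 4 (Thinking)": [66.7, 16.0, 70.32, None],
    "Claude Sonnet 4 (Thinking 16K)": [40.0, 5.9, 71.02, 7.76],
    "Gemini Flash 2.5 (Thinking 16K)": [33.0, 2.0, 64.42, 12.08],
    "Gemini 2.5 Pro (Thinking 16K)": [41.0, 4.0, 65.70, 21.64],
}

for model, scores in rows.items():
    print(f"{model}: {average_score(scores):.2f}")
```

Under this assumption each computed mean lands within about 0.01 of the published value, so the column may have been truncated rather than rounded. Note that Grok 4 tops the column partly because its missing Humanity's Last Exam score, the lowest-scoring benchmark for every other model, is dropped from its mean.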

ARC-AGI 1

ARC-AGI 2

OpenRouter Leaderboard
