AI Socratic
August 2025
Research

Benchmarks & Metrics: Models Increasingly Overfitted to Leaderboards

Over the past two years, the AI NY community has been actively reviewing and discussing benchmarks while tracking the rapid progress of new models. What has become increasingly clear is that most models are heavily overfitted to these benchmarks. As a result, benchmark scores are less a true measure of real-world performance and more a way to track the pace of new model releases.

| Model | ARC-AGI 1 (score / cost) | ARC-AGI 2 (score / cost) | OpenRouter (tokens this week) | Live Bench (live average) | Humanity's Last Exam | Average Score |
|---|---|---|---|---|---|---|
| GPT-5 (High) | 65.7% / $0.50 | 9.9% / $0.73 | 62.7B | 77.39 | 25.32 | 44.57 |
| Grok 4 (Thinking) | 66.7% / $1.11 | 16% / $2.17 | 93.1B | 70.32 | N/A | 51.00 |
| Claude Sonnet 4 (Thinking 16K) | 40% / $0.366 | 5.9% / $0.48 | 545B | 71.02 | 7.76 | 31.16 |
| Gemini Flash 2.5 (Thinking 16K) | 33% / $0.21 | 2% / $0.31 | 259B | 64.42 | 12.08 | 27.87 |
| Gemini 2.5 Pro (Thinking 16K) | 41% / $0.484 | 4.0% / $0.72 | 150B | 65.70 | 21.64 | 33.08 |
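
The post does not say how the Average Score column is computed. As a rough check, the numbers look consistent with an unweighted mean of the four score columns (ARC-AGI 1, ARC-AGI 2, Live Bench, Humanity's Last Exam), ignoring costs and OpenRouter token volume and skipping N/A entries. That formula is an assumption inferred from the table, and the sketch below only recomputes it:

```python
from statistics import mean

# Columns: ARC-AGI 1 %, ARC-AGI 2 %, Live Bench, Humanity's Last Exam, reported average.
# None marks an N/A entry. Values are copied from the table above.
rows = {
    "GPT-5 (High)": (65.7, 9.9, 77.39, 25.32, 44.57),
    "Grok 4 (Thinking)": (66.7, 16.0, 70.32, None, 51.00),
    "Claude Sonnet 4 (Thinking 16K)": (40.0, 5.9, 71.02, 7.76, 31.16),
    "Gemini Flash 2.5 (Thinking 16K)": (33.0, 2.0, 64.42, 12.08, 27.87),
    "Gemini 2.5 Pro (Thinking 16K)": (41.0, 4.0, 65.70, 21.64, 33.08),
}

def average_score(scores):
    """Assumed formula: unweighted mean of the available scores (N/A skipped)."""
    return mean(s for s in scores if s is not None)

# Recompute the average for each model and compare it with the reported value.
recomputed = {model: average_score(vals[:4]) for model, vals in rows.items()}
gaps = {model: abs(recomputed[model] - vals[4]) for model, vals in rows.items()}

for model in rows:
    print(f"{model}: recomputed {recomputed[model]:.2f}, reported {rows[model][4]:.2f}")
```

Under this assumption every recomputed value lands within a couple of hundredths of the reported one. Note that a model with an N/A entry is averaged over fewer benchmarks, which is how Grok 4 ends up with the highest average despite missing its Humanity's Last Exam score.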

ARC-AGI 1

ARC-AGI 2

OpenRouter Leaderboard

Federico Ulfo
Research

Genie 3: World Model Generates Interactive Environments

Genie 3 is a groundbreaking world model that transforms simple text prompts into immersive, interactive virtual environments that you can explore with the direction keys, as you would in a video game. It marks a significant leap in how AI comprehends and simulates real-world dynamics: agents can navigate and interact with generated worlds in real time, and the worlds retain the changes made to them.

It runs at 24 fps at 720p for a few minutes, and it can generate a world from a single text prompt or image. Actions are handled by auto-regressive frame generation conditioned on the user's trajectory.
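
To make "auto-regressive frame generation" concrete, here is a toy sketch of the control loop only. It is not Genie 3's architecture: `toy_next_frame` is a hypothetical hand-written stand-in for a learned model, and the 3x3 grid and action names are invented for illustration. The point is the shape of the loop, where each new frame is produced from the frames so far plus the latest user action, then appended to the history.

```python
from typing import List, Tuple

Frame = Tuple[Tuple[int, ...], ...]  # a tiny grid of pixels; 1 marks the agent

# Hypothetical action set: (row delta, column delta) per direction key.
ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def toy_next_frame(history: List[Frame], action: str) -> Frame:
    """Stand-in for a learned world model: predicts the next frame
    from the frame history and the latest action."""
    last = history[-1]
    rows, cols = len(last), len(last[0])
    # Locate the agent pixel in the most recent frame.
    r, c = next((i, j) for i in range(rows) for j in range(cols) if last[i][j] == 1)
    dr, dc = ACTIONS[action]
    # Move the agent, clamping at the grid borders.
    nr = min(max(r + dr, 0), rows - 1)
    nc = min(max(c + dc, 0), cols - 1)
    grid = [[0] * cols for _ in range(rows)]
    grid[nr][nc] = 1
    return tuple(tuple(row) for row in grid)

def rollout(first_frame: Frame, actions: List[str]) -> List[Frame]:
    """Auto-regressive loop: every generated frame is fed back as context."""
    history = [first_frame]
    for action in actions:
        history.append(toy_next_frame(history, action))
    return history

first = ((0, 0, 0), (0, 1, 0), (0, 0, 0))
frames = rollout(first, ["right", "down", "down"])
```

A real world model would condition on many past frames, which is what lets it stay consistent when the user returns to a place they already visited. The toy function only looks at the last one.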

Federico Ulfo
Research

Blog Post: The Second Half by Shunyu Yao

This is one of the best blog posts of 2025, by the OpenAI researcher Shunyu Yao. It is a playbook for what will matter most in AI research and the startup ecosystem, and for how to prepare.

In the first half, the focus was on developing new training methods and models. In the second half, the focus shifts from solving problems to defining problems and writing the right evals for them.

Reinforcement learning (RL) key components

In RL there are three main components:

  • Algorithm, the learning rule or optimization method that updates the agent's policy. This is the machinery of RL: from early methods like REINFORCE and Q-learning, to more advanced ones like actor-critic, PPO, or TRPO. Algorithms define how the agent learns from experience.

  • Environment, the world in which the agent operates. It provides the state, feedback, and reward signals. In research, environments range from toy grids and Atari games to robotics simulators and large-scale language interaction settings. The environment defines what the agent is learning about.

  • Priors, the knowledge the agent brings before any RL begins. Priors can come from pre-training on vast datasets.
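
The three components can be separated in a few lines of code. The sketch below is a toy illustration, not anything from the blog post: the environment is a made-up 5-state corridor, the algorithm is tabular Q-learning with a random behaviour policy, and the "prior" is simply the initial Q-table, with `informed_prior` standing in (hypothetically) for knowledge obtained before any RL.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right

def env_step(state, action_idx):
    """Environment: returns (next_state, reward, done)."""
    nxt = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def greedy_steps(q, max_steps=50):
    """Steps the greedy policy needs to reach the goal, or None if it never does."""
    state = 0
    for t in range(1, max_steps + 1):
        action = 0 if q[state][0] >= q[state][1] else 1  # ties go left
        state, _, done = env_step(state, action)
        if done:
            return t
    return None

def q_learning(q, episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Algorithm: off-policy Q-learning driven by uniformly random actions."""
    rng = random.Random(seed)
    q = [row[:] for row in q]  # do not mutate the caller's prior
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap per episode
            a = rng.randrange(2)
            nxt, reward, done = env_step(state, a)
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q

# Priors: what the agent knows before any RL happens.
blank_prior = [[0.0, 0.0] for _ in range(N_STATES)]
informed_prior = [[0.0, 0.1] for _ in range(N_STATES)]  # hypothetical bias toward "right"

before_blank = greedy_steps(blank_prior)        # acting on a blank prior, no RL
before_informed = greedy_steps(informed_prior)  # acting on the informed prior, no RL
after_rl = greedy_steps(q_learning(blank_prior))
```

With the blank prior the greedy agent never leaves the start state until the algorithm has run, while the informed prior already walks straight to the goal with zero RL. That is a miniature version of the argument that follows: a good prior can matter more than the choice of learning rule.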

For decades, RL researchers obsessed over algorithms: REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO. Sutton and Barto's classic textbook is almost entirely about algorithms, with little attention to environments or priors.

That mindset carried into mainstream AI. The first half of the modern AI era was dominated by new training methods and model architectures: AlexNet, the transformer, GPT-3. Progress then stacked up as new algorithms and models kept beating benchmarks. But the game changed when RL finally generalized.

The working recipe now looks like this: massive language pre-training (priors) + scale + reasoning-as-action inside an RL loop.

It turned out that the most important part of RL might not be the algorithm or the environment but the priors, which can be obtained in a way totally unrelated to RL.

So now the frontier shifts from solving problems to defining the right problems. Evaluation takes center stage, and the core benchmark is no longer a leaderboard score but the "utility problem." In fact, traditional benchmarks don't translate well to real-world tasks.

This is a great blog post that takes a few readings before it really sinks in.

https://x.com/ShunyuYao12/status/1911671943457345675

https://x.com/Hesamation/status/1960717949771092429

Federico Ulfo
