AI Socratic

AI Socratic July-Sep 2025 Part 1 — The Genie3 Is Out of The Box 🍌

Federico Ulfo, August 28, 2025
Market News

Updated on Sep 6th

The top AI developments from Jul 15–Sep 15, debated Socratically over AI dinners in NY and SF. This is our 1st draft.

Sign up to receive the mailing list!

Upcoming Events

This time around we’ll have two events: one in New York, and, for the first time, one in San Francisco at the Frontier Tower. We’ll discuss the top news and updates from this blog post using the Socratic method, and we’ll have a lineup of demos and presentations!

AI Dinner

Sep 10th, Solana Skyline
New York City

Link: https://lu.ma/ai-dinner-13.0

AI Dinner

Sep 17th, Frontier Tower
San Francisco

Link: https://lu.ma/ai-dinner-13.2

Benchmarks & Metrics

Over the past two years, the AI NY community has been actively reviewing and discussing various benchmarks while tracking the rapid progress of new models. What has become increasingly clear is that most models are heavily overfitted to these benchmarks. As a result, they are less a true measure of real-world performance and more a way to track the pace of new model releases.

| Model | ARC-AGI 1 | ARC-AGI 2 | OpenRouter (this week) | LiveBench (live average) | Humanity’s Last Exam | Average Score |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 (High) | 65.7% / $0.50 | 9.9% / $0.73 | 62.7B tokens | 77.39 | 25.32 | 44.57 |
| Grok 4 (Thinking) | 66.7% / $1.11 | 16% / $2.17 | 93.1B tokens | 70.32 | N/A | 51.00 |
| Claude Sonnet 4 (Thinking 16K) | 40% / $0.366 | 5.9% / $0.48 | 545B tokens | 71.02 | 7.76 | 31.16 |
| Gemini Flash 2.5 (Thinking 16K) | 33% / $0.21 | 2% / $0.31 | 259B tokens | 64.42 | 12.08 | 27.87 |
| Gemini 2.5 Pro (Thinking 16K) | 41% / $0.484 | 4.0% / $0.72 | 150B tokens | 65.70 | 21.64 | 33.08 |

Sources: ARC-AGI 1 leaderboard, ARC-AGI 2 leaderboard, OpenRouter leaderboard.
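For what it’s worth, the Average Score column appears to be the plain mean of the four benchmark scores (ARC-AGI 1, ARC-AGI 2, LiveBench, Humanity’s Last Exam), with the OpenRouter token counts and any N/A entries excluded; that interpretation is ours, not stated anywhere. A minimal Python sketch of the calculation, using the numbers from the table:

```python
# Hedged sketch: recompute the Average Score column as the plain mean of the
# four benchmark scores, skipping N/A entries (token counts are not scores).
SCORES = {
    "GPT-5 (High)":                [65.7, 9.9, 77.39, 25.32],
    "Grok 4 (Thinking)":           [66.7, 16.0, 70.32, None],  # no HLE score reported
    "Claude Sonnet 4 (Thinking)":  [40.0, 5.9, 71.02, 7.76],
    "Gemini Flash 2.5 (Thinking)": [33.0, 2.0, 64.42, 12.08],
    "Gemini 2.5 Pro (Thinking)":   [41.0, 4.0, 65.70, 21.64],
}

for model, scores in SCORES.items():
    valid = [s for s in scores if s is not None]
    print(f"{model:30s} {sum(valid) / len(valid):.2f}")
```

Running this reproduces the averages in the table to within a 0.01 rounding difference.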

OpenAI Launches GPT-5 with Advanced Reasoning

OpenAI launched GPT-5 on August 7, 2025, touting "PhD-level intelligence" with built-in advanced reasoning for coding, planning, and agentic tasks. It was immediately available to Free, Plus, Pro, and Team users, with Enterprise and Edu following. The rollout faced backlash over glitches, hallucinations, and a "colder" tone, prompting OpenAI to restore legacy models for paid users.

Tweets reflected mixed sentiment: praise for its coding strengths, criticism of the chaotic debut and its emotional impact, and plenty of people missing GPT-4o.

| Category | Metric / Claim | Key Benefit |
| --- | --- | --- |
| Cost Efficiency | 25× cheaper than GPT-4 | Enables wider access (link) |
| Coding Win Rate | 70%+ vs. GPT-4 | Excels in complex tasks (link) |
| Reasoning Boost | 40% improvement over GPT-4 | Better for long chains and tool use |
| Training Scale | 170K-180K H100 GPUs | Massive compute investment for multimodal advancements (link) |

Google Updates

Gemini 2.5 Deep Think

Google launched Gemini 2.5 Deep Think, an advanced reasoning mode in Gemini 2.5 Pro, available to Google AI Ultra subscribers via the Gemini app. It uses parallel thinking and reinforcement learning for better problem-solving, generating and refining hypotheses over time. Strengths include creative projects, science, math, and coding. Activate via "Deep Think" in the prompt bar; limited daily prompts. Integrates tools like code execution and search for detailed, safe responses.
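Google hasn’t said exactly how Deep Think’s parallel thinking is implemented; a common reading is best-of-N hypothesis generation followed by iterative refinement. The sketch below is purely illustrative, and generate(), score(), and refine() are hypothetical stand-ins for model and verifier calls:

```python
import random

# Illustrative sketch of "parallel thinking": sample several hypotheses,
# keep the strongest one, then refine it a few times. The three helpers
# below are stubs standing in for real model / verifier calls.

def generate(prompt: str, seed: int) -> str:
    return f"hypothesis-{seed} for: {prompt}"

def score(candidate: str) -> float:
    return random.random()            # real system: a learned verifier or reward model

def refine(candidate: str) -> str:
    return candidate + " (refined)"   # real system: another reasoning pass over the candidate

def deep_think(prompt: str, n_parallel: int = 8, n_refine: int = 2) -> str:
    candidates = [generate(prompt, seed=i) for i in range(n_parallel)]  # parallel hypotheses
    best = max(candidates, key=score)                                   # pick the strongest
    for _ in range(n_refine):                                           # refine it over time
        best = refine(best)
    return best

print(deep_think("prove the inequality holds for all n"))
```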

| Benchmark | Score | Notes |
| --- | --- | --- |
| Humanity’s Last Exam | 21.6% | Expert-level; beats predecessors. |
| GPQA (Science) | 86.4% | High accuracy in STEM. |
| AIME 2025 (Math) | 88.0% | Near SOTA. |
| LiveCodeBench (Code) | 69.0% | Outperforms o3, Grok 4. |
| Aider Polyglot | 82.2% | Multi-language edits. |
| SWE-bench | 67.2% | Real-world tasks. |
| MMMU | 82.0% | Multimodal reasoning. |

Genie 3: World Model Enhances AI Environmental Understanding

Genie 3 is a groundbreaking world model that transforms simple text prompts into immersive, interactive virtual environments you can explore with the arrow keys, just like in a video game. This marks a significant leap in how AI comprehends and simulates real-world dynamics, enabling agents to navigate and interact with generated worlds in real time and to remember changes.

It runs at 24 fps and 720p for a few minutes, can generate a world from a single text prompt or image, and handles actions through auto-regressive frame generation conditioned on the user’s trajectory.
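As a mental model of that loop (our sketch, not DeepMind’s code), each frame is generated conditioned on the prompt, the frames produced so far, and the user’s latest key press. The WorldModel below is a stub that only illustrates the interface:

```python
import numpy as np

FPS, HEIGHT, WIDTH = 24, 720, 1280  # the figures quoted for Genie 3

class WorldModel:
    """Stub standing in for the learned video generator."""
    def __init__(self, prompt: str):
        self.prompt = prompt

    def next_frame(self, history: list[np.ndarray], action: str) -> np.ndarray:
        # Real model: predict the next frame auto-regressively from
        # (prompt, frame history, latest action). Stub: random pixels.
        return np.random.randint(0, 256, (HEIGHT, WIDTH, 3), dtype=np.uint8)

def explore(prompt: str, actions: list[str], seconds: int = 2) -> list[np.ndarray]:
    model = WorldModel(prompt)
    frames: list[np.ndarray] = []
    for t in range(seconds * FPS):
        action = actions[t % len(actions)]             # e.g. arrow keys pressed by the user
        frames.append(model.next_frame(frames, action))
    return frames

frames = explore("a foggy coastal village at dawn", ["forward", "forward", "left"])
print(len(frames), frames[0].shape)                    # 48 frames of 720p RGB
```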

Nano-Banana 🍌: Consistent Image Generation Application

AI companies love fruit codenames, and now it’s Google’s turn with “Nano Banana,” the codename for Gemini 2.5 Flash Image, a major upgrade to its Gemini AI suite.

Why is it a game changer?

  • Consistency Over Everything: Nano Banana shines by ensuring visual elements—faces, pets, objects—remain consistent across multiple edits. It addresses the common “close but no cigar” problem in AI editing.
  • Editing, Not Just Generating: Unlike many AI tools, it doesn’t just spawn new images—it excels at nuanced, multi-step edits. Think combining images, modifying existing ones, or transforming backgrounds—all while keeping things coherent.
  • Deep Integration Across Platforms: It’s now integrated into both web and mobile versions of the Gemini app, available to free and paid users alike. So yes, everyone gets in on the action.

Nano Banana has already launched, and we had the privilege of trying it. I can tell you it delivers what it promises for image editing, but it still hallucinates when the image has too many details.
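If you want to try it programmatically rather than in the Gemini app, here is a hedged sketch using the google-genai Python SDK; the model id gemini-2.5-flash-image-preview and the inline_data response handling reflect our reading of Google’s public API docs, not anything confirmed in this post:

```python
from google import genai
from PIL import Image

# Sketch: ask Gemini 2.5 Flash Image ("Nano Banana") for a targeted edit
# while keeping the rest of the picture consistent.
client = genai.Client(api_key="YOUR_API_KEY")
source = Image.open("dog.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=["Put a tiny party hat on the dog, keep everything else unchanged", source],
)

# The edited image comes back as an inline binary part alongside any text parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("dog_party_hat.png", "wb") as f:
            f.write(part.inline_data.data)
```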

Blog Posts

The Second Half

This is one of the best blog posts of 2025, written by OpenAI researcher Shunyu Yao: a playbook for what will matter most in AI research and the startup ecosystem, and how to prepare.

In the first half, the focus was on developing new training methods and models. In the second half, the focus shifts from solving problems to defining problems and writing the right evals for them.

Reinforcement learning (RL) key components

In RL there are three main components (a toy sketch follows this list):

  • Algorithm, the learning rule or optimization method that updates the agent’s policy. This is the machinery of RL: from early methods like REINFORCE and Q-learning, to more advanced ones like actor-critic, PPO, or TRPO. Algorithms define how the agent learns from experience.

  • Environment, the world in which the agent operates. It provides the state, feedback, and reward signals. In research, environments range from toy grids and Atari games to robotics simulators and large-scale language interaction settings. The environment defines what the agent is learning about.

  • And priors, the knowledge the agent brings before any RL begins. Priors can come from pre-training on vast datasets.
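To make the split concrete, here is a toy, self-contained sketch (our illustration, not from the blog post): the environment is a three-armed bandit, the algorithm is plain REINFORCE, and the priors are just the policy’s initial logits, which in a modern system would come from pre-training.

```python
import math
import random

class BanditEnv:
    """Environment: defines what the agent learns about (3 arms, hidden payout probabilities)."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs

    def step(self, action):
        return 1.0 if random.random() < self.payout_probs[action] else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Priors: the policy's starting parameters. Here we hand-pick logits that
# already lean toward arm 2; in modern systems this role is played by pre-training.
logits = [0.0, 0.0, 1.0]
env = BanditEnv(payout_probs=[0.2, 0.5, 0.8])
lr = 0.1

# Algorithm: plain REINFORCE, nudging the log-probability of each sampled
# action in proportion to the reward it produced.
for _ in range(2000):
    probs = softmax(logits)
    action = sample(probs)
    reward = env.step(action)
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]  # d log pi(action) / d logit_i
        logits[i] += lr * reward * grad

print("learned policy:", [round(p, 2) for p in softmax(logits)])
```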

For decades, RL researchers obsessed over algorithms—REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO. Sutton and Barto’s classical textbook is almost entirely about algorithms, with little attention to environments or priors.

That mindset carried into mainstream AI. The first half of the modern AI era was dominated by new training methods and model architectures: AlexNet, the transformer, GPT-3. Progress then stacked new algorithms and models on top of benchmark wins. But the game changed when RL finally generalized.

The working recipe now looks like this: massive language pre-training (priors) + scale + reasoning-as-action inside an RL loop.
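One way to picture “reasoning-as-action inside an RL loop” is a ReAct-style agent whose trajectory ends in a reward an RL algorithm could then optimize. The sketch below is purely illustrative; llm() and calculator() are hypothetical stubs:

```python
# Illustrative only: a pre-trained model (the priors) alternates reasoning and
# tool calls; the environment scores the final answer, producing the reward an
# RL algorithm would learn from.

def llm(transcript: str) -> str:
    # Stand-in for the pre-trained language model.
    if "Observation: 42" in transcript:
        return "Final Answer: 42"
    return "Action: calculator[6 * 7]"

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy tool; never eval untrusted input

def run_episode(task: str, max_steps: int = 5) -> tuple[str, float]:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            answer = step.removeprefix("Final Answer: ").strip()
            reward = 1.0 if answer == "42" else 0.0      # environment-defined reward
            return answer, reward
        if step.startswith("Action: calculator["):
            expr = step[len("Action: calculator["):-1]
            transcript += f"Observation: {calculator(expr)}\n"
    return "", 0.0

print(run_episode("What is 6 * 7?"))
```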

It turned out the most important part of RL might not be the algorithm or the environment, but the priors, which can be obtained in a way totally unrelated to RL.

So now the frontier shifts from solving problems to defining the right problems, evaluation takes center stage, and the core benchmark is no longer a leaderboard score but the “utility problem.” In fact, traditional benchmarks don’t translate well to real-world tasks.

This is a great blog post that requires a few readings before it really sinks in.

https://x.com/ShunyuYao12/status/1911671943457345675

https://x.com/Hesamation/status/1960717949771092429

Mechanize: How to Fully Automate Software Engineering

Current AI exhibits Moravec’s paradox: tasks that are really hard for humans (math, coding) come easily to it, while tasks that are easy for humans (a baby choosing the right shape) remain very hard. Some people think we can’t get past this paradigm.

The Bitter Lesson (Richard Sutton) tells us that scale (brute force) can solve this problem. RL is also a solution, but it has been stuck on narrow problems.

This blog argues that full automation needs more than just bigger models: it requires training on tons of real human coding data and then letting AI learn in rich, realistic environments with RL. Right now those RL environments are way too basic, and grading open-ended engineering work is tough.

So, AI shifts engineers to higher-level planning, testing, and coordination. And here’s the twist: software engineering might be both the first and last white-collar job to be totally automated, since building a “drop-in remote worker” AI is a much bigger challenge than just writing code.

https://x.com/MechanizeWork/status/1945528661131849790

Gartner: AI Hype Cycle

In 2025, AI agents reached the peak of the hype cycle. Focus is now shifting to delivering stable business applications.

https://www.gartner.com/en/newsroom/press-releases/2025-08-05-gartner-hype-cycle-identifies-top-ai-innovations-in-2025

Nic Carter Tweet on AI opportunities and negative effects

Nic thinks AI is going to increase the Gini coefficient, drawing parallels to the post-1850 drop in the horse population and foreseeing job disruptions, regulation, and a socialist backlash.

The NVDA rally changed the way he sees the future of AI investing. He was an early investor in CoreWeave. In 2020 he read Gwern’s Scaling Hypothesis, the most important blog post of that time; in 2022 Stable Diffusion came out.

I'm probably doing a terrible job at summarizing this tweet, so just go read it for yourself: https://x.com/nic__carter/status/1797635177973158182.

Videos And Podcasts

Can Goodness Compete?

A philosophical exploration of whether a post-AGI future has a Schelling point that is good for humanity.

https://www.youtube.com/embed/i6RMHcMbqQw

The Fractured Entangled Representation Hypothesis

Current diffusion models train on massive data and create images that happen to be correct, but they are “castles in the sand.”

https://open.spotify.com/episode/2T75g7BLaztaWSeXWO0G18?si=m7BQkUYcTSGOQuCQRO6Jcg

Instead of building top-down with SGD, these new models start from the bottom up, composing images piece by piece. They show the example of a skull and explain how, step by step, the representation emerges, locking in one piece at a time. This is a 15-minute summary of the full episode.

Demis Hassabis: Future of AI, Simulating Reality, Physics and Video Games | Lex Fridman Podcast

https://www.youtube.com/watch?v=-HzgcbRXUK8

Demis Hassabis and Lex discuss how AI is modeling complex patterns in nature, the path toward artificial general intelligence (AGI), and the societal impact of rapid technological change. The conversation covers breakthroughs like AlphaFold, the future of video games, the importance of creativity and research “taste,” and the philosophical questions surrounding consciousness and human purpose. Hassabis emphasizes the need for responsible AI development, collaboration, and ensuring that the benefits of these technologies are shared widely.

How AI image generation draws from physics | Guest video by @WelchLabsVideo

https://www.youtube.com/watch?v=iv-5mZ_9CPY

A geometric exploration of how diffusion models work 🔥 and how gradient descent is represented.

MLST — DeepMind Genie 3 - Simulate The World

https://www.youtube.com/watch?v=ekgvWeHidJs&t=604s

MLST interview with the DeepMind Genie 3 team. Genie 3 is a general-purpose world model that can generate an unprecedented diversity of interactive environments, create and remember world details, and exhibit emergent rules like parallax. Video games and virtual reality are about to have their GPT moment. It’s a real-time model that renders at 24 fps and 720p and can sustain a world for several minutes.

Theory Of Everything: The (Terrifying) Theory That Your Thoughts Were Never Your Own

https://www.youtube.com/watch?v=Ca_RbPXraDE

The theory that your thoughts were never yours: in other words, language is an entity with its own evolution that gets injected into our brains, and there are likely other parallel entities (or programs) running in our biology. Does the brain work just like an LLM?

This is a mind-bending live discussion and podcast at the University of Toronto with Prof. Elan Barenholtz and William Hahn, moderated by Curt Jaimungal.


Full Sources List

As usual, there are far too many news items, articles, papers, fun memes, and tweets to write about them all. Here’s the complete list, in case you want to explore what happened over the past month.

AI Agents

AI Builders

Books, Blogs, X Threads

DeAI

Diffusion Models

Fundraising, grants, programs

GEO Politics

Hardware and Infra

IMO Gold

Learning

LLMs

Lol, Memes

MCPs

Metrics & Benchs

Opinions

Philosophy and AGI

RAG

Random

Research

Videos

  • 3Blue1Brown x Welch Labs is the crossover we didn’t know we needed https://x.com/novasarc01/status/1948784701180674232
  • Demis Hassabis discusses Genie 3, Deep Think, and the IMO gold medal with Logan Kilpatrick. Agent-based legacy (AlphaGo/Zero) with new "thinking models" aimed at planning, reasoning and AGI. Jagged intelligence: strong in some areas and weak in others. Traditional benchmarks are saturated; DeepMind is launching the Kaggle Game Arena. https://www.youtube.com/watch?v=njDochQ2zHs
  • MLST: Pushing compute to the limit of physics, with Guillaume Verdon. Fun MLST episode with Beff Jezos on e/acc and thermodynamic computers, mostly an intellectual word soup, but nonetheless fun to listen to
    https://open.spotify.com/episode/50Qm2NoerRIrV0jt4tYODE

Visuals


Sign up to receive the mailing list! | Read all the past AI Socratic reports