AI Socratic
February 2026
Research

World Models Are Taking Over

This was a huge month for simulated worlds. The vibe shifted from “make pretty video” to “build systems that can live inside a world.” Here's the list of world models worth mentioning:

  • Project Genie: playable worlds from text prompts (rolling out to Google AI Ultra subscribers in the US).
  • World API (World Labs): persistent 3D worlds from text, images, video.
  • D4RT (DeepMind): unifies video into full 4D (space + time) representations.
  • Roblox 4D generation powered by Cube Foundation Model (beta).
  • Alibaba’s LingBot-World (open-source Genie competitor) getting attention.
  • Hot take making the rounds: “World models, not video generation models, will dominate AI in 2026.”

Why does it matter? Agents that can reason are good. Agents that can reason inside a stable world are the foundation of robotics, game economies, simulation-heavy science, and eventually autonomous businesses.

Yann LeCun is the most prominent LLM skeptic. He thinks world models are the right direction because, to reach AGI, a model needs to predict the outcome of its actions.

Sources: deepmind project genie, worldlab AI, tweet 1, DeepMind Teaching AI to see the World in 4d, tweet 2, roblox tweet, Alibaba Tweet, hot take tweet, Yann Tweet

Federico Ulfo
Research

Recursive Language Models (RLM)

This paper from December 2025 has seen renewed interest recently, thanks to stronger LLMs and the proven success of sub-agents in Claude Code.

RLMs are an agent architecture that overcomes LLM context and reasoning limits by giving agents programmatic control over their own context via a REPL.

An RLM works over a mutable context and can recursively spawn sub-agents to work on sub-tasks.

An RLM agent has:

  • A context object: a mutable structure that can scale to very large contexts
  • A recursive agent function: rlm_agent(query, context) → response, which can spawn child agents
  • A Python execution environment: enabling search, filtering, and computation over context

The agent alternates between writing code, inspecting results, and delegating work recursively.
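
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_llm` is a hypothetical stand-in for a real model call, the context is a plain list of lines, and the keyword filter is hard-coded here, whereas in an RLM the model itself writes that code in the REPL.

```python
from typing import Callable, List

# Hypothetical stand-in for a model call: (query, text) -> answer.
LLM = Callable[[str, str], str]


def rlm_agent(query: str, context: List[str], llm: LLM, max_lines: int = 2) -> str:
    """Answer `query` over `context`, delegating to child agents when needed."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")

    # 1. Compute over the context in code: keep lines sharing a keyword with the query.
    keywords = [w.lower() for w in query.split() if len(w) > 3]
    relevant = [ln for ln in context if any(k in ln.lower() for k in keywords)]
    if not relevant:
        relevant = context  # nothing matched, fall back to the full context

    # 2. Base case: small enough to hand to the model directly.
    if len(relevant) <= max_lines:
        return llm(query, "\n".join(relevant))

    # 3. Recursive case: split, spawn a child agent per half, then combine.
    mid = len(relevant) // 2
    partials = [rlm_agent(query, half, llm, max_lines)
                for half in (relevant[:mid], relevant[mid:])]
    return llm(query, "\n".join(partials))


def toy_llm(query: str, text: str) -> str:
    # Toy model (assumption): "answers" by echoing the lines it was shown.
    return " | ".join(text.splitlines())


context = [f"req {i}: timeout" if i % 3 == 0 else f"req {i}: ok" for i in range(12)]
answer = rlm_agent("find timeout requests", context, toy_llm)
print(answer)
```

With `max_lines=2`, the four matching lines do not fit in one call, so the agent spawns two children and merges their answers. The model never sees the eight irrelevant lines.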

RLM with Google's Agent Development Kit (ADK)

ADK adapts RLMs for production by providing low-level control over execution, memory, and orchestration via BaseAgent. Key features:

  • Lazy context loading from files instead of massive in-memory prompts
  • Parallel recursive delegation for scalable reasoning
  • Tool-first reasoning over code before language
  • Built-in observability for debugging recursive behavior

ADK preserves the core RLM idea—recursive, compute-over-context agents—while making it practical to deploy at scale.
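
To make the first two features concrete, here is a rough standard-library sketch of lazy context loading and parallel delegation. It deliberately does not use the ADK API: `iter_chunks`, `child_agent`, and `count_errors` are hypothetical names, and the "sub-agent" is a plain function that counts error lines.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterator, List


def iter_chunks(path: Path, lines_per_chunk: int) -> Iterator[List[str]]:
    """Read the file chunk by chunk instead of pasting it into one prompt."""
    chunk: List[str] = []
    with path.open() as fh:
        for line in fh:
            chunk.append(line.rstrip("\n"))
            if len(chunk) == lines_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


def child_agent(chunk: List[str]) -> int:
    # Hypothetical sub-agent: a real one would call a model on its chunk.
    return sum("ERROR" in ln for ln in chunk)


def count_errors(path: Path, lines_per_chunk: int = 100, workers: int = 4) -> int:
    """Fan chunks out to child agents in parallel, then combine their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: Executor.map submits every chunk up front, so chunks are
        # read from disk here but each child still only sees its own slice.
        return sum(pool.map(child_agent, iter_chunks(path, lines_per_chunk)))


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "log.txt"
    log.write_text("\n".join(
        f"line {i} ERROR" if i % 4 == 0 else f"line {i} ok" for i in range(10)))
    total = count_errors(log, lines_per_chunk=3)

print(total)
```

The design point is that no single call ever holds the whole file: the parent only sees the small per-chunk results it has to combine.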

Sources: Zhang Tweet, RLM with ADK Tweet, paper

Federico Ulfo
Research

Prompt Repetition Improves Non-Reasoning LLMs

LLMs read prompts left to right, so early context can’t “know” what question is coming. This paper tests a simple fix: repeat the entire prompt twice, giving every token a second chance to attend to everything else.

Across seven benchmarks and seven major models (Gemini, ChatGPT, Claude, DeepSeek), accuracy improves, sometimes dramatically, without longer outputs or meaningful slowdown, and without fine-tuning, extra training, or other prompting techniques.
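
The technique is simple enough to show as a helper. This is a minimal sketch: the function name `repeat_prompt` and the blank-line separator are assumptions for illustration, so check the paper for the exact template the authors used.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    The separator is an assumption; pass separator="" to concatenate
    the copies directly.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


question = "Which option is correct? A) 3  B) 5  C) 7\nContext: 2 + 3 = ?"
doubled = repeat_prompt(question)
print(doubled)
```

The doubled string is sent as the user message in place of the original. Only the input grows, which is why output length stays the same.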

Sources: tweet, paper (https://arxiv.org/pdf/2512.14982)

Federico Ulfo
Research

Patterning: The Duality Of Interpretability

“Neural networks are grown, not programmed.” This paper aims to change that. Mechanistic interpretability (mechinterp) investigates how models generalize beyond their training data by studying the internal structure that training produces. The authors introduce patterning as its dual: given a desired structure, determine what data produces it.

This is done with the language of susceptibilities. In physics, susceptibilities measure how a system responds to perturbations. Here, we think of the neural network as such a system, and of shifts in the training distribution as such perturbations.
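
As a schematic illustration of the analogy, here it is in generic linear-response notation, which is not necessarily the paper's own. The symbols are assumptions introduced for illustration: h parametrizes a small shift in the training distribution, φ is some observable of the trained network's internal structure, χ is the susceptibility, and χ⁺ is a pseudo-inverse.

```latex
\chi \;=\; \left.\frac{\partial \langle \phi \rangle_h}{\partial h}\right|_{h=0},
\qquad
\Delta \langle \phi \rangle \;\approx\; \chi \, \Delta h
\qquad\Longrightarrow\qquad
\Delta h \;\approx\; \chi^{+} \, \Delta \langle \phi \rangle_{\text{target}}
```

Read left to right: interpretability measures how structure responds to a data shift, and patterning runs the same relation in reverse, asking which data shift would produce a desired change in structure.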

Figure: a small language model (3M) across training, visualised with a new interpretability technique, susceptibilities. We call this handsome critter the rainbow serpent.

In a synthetic parentheses balancing task, we show that, given two solutions that both achieve perfect training accuracy and loss, we can effectively steer the solution that the model chooses to implement. We do this using only in-distribution data.

This is closely related to, but distinct from, influence functions and training data attribution. Those study the effects of data at the behavioral level, such as the impact of a data point on test loss, whereas patterning is concerned with the structure underlying that behavior.

Sources: tweet, paper, NN are grown tweet

Federico Ulfo
Research

Anthropic: How Misalignment Scales with Bigger Models

AI failures on hard tasks tend to be incoherent and unpredictable (“hot mess”) rather than systematically pursuing the wrong goal.

  • More scale ≠ more coherence: bigger models don’t reliably behave more consistently and can get worse on very hard problems.
  • Longer reasoning can backfire: “overthinking” increases error variance; ensembling helps but isn’t practical for real-time agents.
  • Safety implication: future risks look more like industrial accidents from complexity and goal misspecification than deliberate, coherent misalignment.
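
One way to see why ensembling helps with incoherent errors but not with systematic ones is a small worked calculation. This is a toy model under stated assumptions (binary answers, independent samples, majority vote), not the analysis from the blog post. The function names are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Chance that a majority of n independent samples is correct,
    when each sample is correct with probability p (binary answers)."""
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def systematic_vote_accuracy(fraction_right: float, n: int) -> float:
    """A model with systematic errors repeats the same answer on every
    sample, so all n votes agree and voting changes nothing."""
    return fraction_right


# Incoherent errors: each sample is right 60% of the time, independently.
incoherent_single = majority_vote_accuracy(0.6, 1)    # 0.6
incoherent_ensemble = majority_vote_accuracy(0.6, 5)  # about 0.683

# Systematic errors: right on 60% of questions, always wrong on the rest.
systematic_ensemble = systematic_vote_accuracy(0.6, 5)  # still 0.6

print(incoherent_single, incoherent_ensemble, systematic_ensemble)
```

Five votes lift the incoherent model from 60% to roughly 68%, at five times the inference cost. That cost is why the approach is hard to use in real-time agents.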

Takeaway for AI engineers: build simple systems that are easy to test, then combine them. In other words, the SOLID and KISS principles translate from software engineering to AI.

Source: blog

Federico Ulfo