Updates — Voices from the AI Socratic Community

June 2025

Jun 16, 2025Random

AI Dinner 11.0 at Arize.ai Office

The next AI dinner will be on June 18th, and it will be hosted in the Arize.ai office.

We'll discuss the top news and updates from this blog post using the Socratic methodolgy. As well as going through few presentations:

Kerem Kazan will present "Commenting on chess moves in natural language (training LLM via RL to play chess)".
Jianing Qi will present the paper Learning to Reason Across Parallel Samples for LLM Reasoning. Do RL reasoning on top of the model outputs with a tiny LLM. Is RL optimizing output distribution or learning new abilities?

Federico Ulfo

Comment

Jun 16, 2025Videos & Podcasts

AI.Engineer SF Conference: Takeaways and Recordings

Despite numerous snafus—such as poor to nonexistent internet, big screens malfunctioning, unclear ticket access, general organizational issues, insufficient food for attendees, and high prices, and most speakers were primarily promoting their products. Still, this conference turned out to be one of my favorite events of the year and attracted an incredibly high level of talent.

Jory Pestorious wrote this fantastic summary of the top insights from the ai engineer conference:

Competition is about clear ideas; everyone can build today, so code and models are not a moat anymore.
Engineering excellence equals articulation excellence. Write specs that humans understand and that AI can execute.
Engineers need to learn how to use AI tools to 10x their output, or they’ll be left behind — harsh reality.
No code should go unreviewed; code debt grows faster than AI can fix it.
Claude code 🔥
MCP is becoming the standard; time to fully embrace it.

Link: http://jorypestorious.com/blog/ai-engineer-spec.

Another interesting report comes from Thomas Gear:

AI coding complexity doubles every 70 days
50% of engineers use LLMs
RAG leads customization at 70%
costs dropped 600x (to $0.10/million tokens)
models shrink (405B to 24B) while staying powerful
Gemini’s market share jumped to 35% with 50x inference growth
companies now design for AI to handle 80% of work.

Link: https://x.com/tg\_bytes/status/1931938102861271042

Here’s the recording of the general track. All the advanced talks in the RL track were just incredible and made the conference worth it.

https://www.youtube.com/watch?v=z4zXicOAF28&t=18946s

The highlight of the tech talk for me was Dan Han from Unsloth — a RL beast! https://x.com/danielhanchen/status/1930752903960211608.

Federico Ulfo

Comment

Jun 16, 2025Agents

Anthropic: How We Built Our Multi-Agent Research System

Anthropic shares how they built Claude's new multi-agent Research feature, an architecture where a lead Claude agent spawns and coordinates subagents to explore complex queries in parallel. They use this orchestrator-worker architecture:

Traditional approaches using Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch some set of chunks that are most similar to an input query and use these chunks to generate a response. Anthropic Advanced Research architecture uses a multi-step search that dynamically finds relevant information in parallel, adapts to new findings, and analyzes results to formulate high-quality answers.

Token-efficient Scaling Performance gains correlate strongly with token usage and parallel tool calls. By distributing work across multiple agents and context windows, Claude’s system scales reasoning capacity efficiently. However, this comes with a 15× token cost over standard chats, making it suitable for high-value queries only.

Think like your agents.
Teach the orchestrator how to delegate.
Scale effort to query complexity.
Tools design and selection are critical. MCP servers gives tools access on steroids.
Let agents improve themselves. The agents can diagnose when something fails, and fix it by rewriting the MCP description. This process saved 40% times.
Start wide, narrow down.
Guide the thinking process.
Parallel tool calling transforms speed and performance. Parallelism can cut up to 90% of the total time.

Flexible Evaluation + Production Reliability Anthropic uses LLM-as-judge scoring with rubrics for factuality, citation, and efficiency, alongside human testing to catch subtle failures. For reliability, they built resumable stateful agents with checkpointing, rainbow deployments, and full observability of agent decision traces, crucial for debugging non-deterministic, long-running agents.

Blog: https://anthropic.com/engineering/built-multi-agent-research-system Tweet: https://x.com/omarsar0/status/1933941545675206936.

https://x.com/swyx/status/1933981734456230190

Claude Code CLI 🔥

AI coding tooling & coding agents being packaged into products, and even worse, cloud products, is the wrong path. Command Line is the way!

Tutorial on how to use it: https://x.com/rasmickyy/status/1931078993022730248

Federico Ulfo

Comment

Jun 16, 2025Models

Google Keeps on Shipping: Gemini 2.5 Pro and Veo3

Currently Google is the best vertically integrated company. Google simply does not get enough credit for the TPU, as one of the reason to be one of the top player in AI.

https://x.com/ArtificialAnlys/status/1933254125757870104

With the launch of Gemini 2.5 Pro they're now only second to OpenAI o3-pro.

Google have been constantly shipping

Veo3 is now positioned #2 in the best video models

Federico Ulfo

Comment

Jun 16, 2025Models

OpenAI Leads with o3-pro at 80% Lower Cost

o3 pro performs just like or better than o3 in most benchmarks including ARC1 and ARC2. What's incredible about it, is the cut in cost by 80%, basically costing as much as 4o-mini!

https://x.com/ArtificialAnlys/status/1932489573462081898

Twitter did what Twitter does, speculating on o3 using distillation, but some insider says that OpenAI have been using Codex internally to optimize the heck out of it, obtaining the incredible 80% without performance losses.

OpenAI retention curve is a wet dream for most investors. Their 1 month retention has skyrocketed from <60% 2yrs ago to an unprecedented ~90%! Youtube was best-in-class with ~85%. 6mo retention is trending to ~80%. Rapidly rising smile curve.

https://x.com/deedydas/status/1932619060057084193

Federico Ulfo

Comment

Jun 16, 2025Fundraising & Startups

Meta Takes 49% Stake in Scale AI for $14.3 Billion

After the acquisition Alexandr will take the role of head of AI at Meta. Some drama is already unraveling saying he's toxic and only great at fundraising. We shall see.

Meta is currently offering $2M+/yr in offers for AI talent and still losing them to OpenAI and Anthropic (who has 80% retention rate). Source.

Federico Ulfo

Comment

Jun 16, 2025Models

Mistral Releases Magistral RL Reasoning Model

The Mistral team at it again with Magistral! A reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.

GRPO with edits:

1. Removed KL Divergence 2. Normalize by total length (Dr. GRPO style) 3. Minibatch normalization for advantages 4. Relaxing trust region

https://arxiv.org/pdf/2506.10910

Simon Wilson: all LLM API vendors are converging to the same product:

Code execution: Python in a sandbox
Web search — like Anthropic, Mistral seem to use Brave
Document library aka hosted RAG
Image generation (FLUX for Mistral)
Model Context Protocol

Federico Ulfo

Comment

Jun 16, 2025Philosophy & Ethics

Philosophy: Defining Intelligence and Life

How do you define intelligence? How do you define Life?

https://x.com/blaiseaguera/status/1924514755982606493

https://x.com/reedbndr/status/1927495304380559744

Federico Ulfo

Comment

Jun 16, 2025Philosophy & Ethics

Sam Altman: The Gentle Singularity

"We are past the event horizon" this phrase might go into our future history books.

The gentle singularity is a short read, jam packed with philosophical takes and with some fun fact: ChatGPT queries uses 0.34 watt-hours (few minutes of a normal lightbulb) and a tea spoon of water.

Federico Ulfo

Comment

Jun 16, 2025Random

Learning: Best AI Guides and YouTube Channels

OpenAI, Google, and Anthropic released the best guides on AI: https://x.com/CodeByPoonam/status/1932813119111508214
Top 10 YouTube channels to learn AI from scratch: https://x.com/Hesamation/status/1927649769360412813

Federico Ulfo

Comment

Jun 16, 2025Vibe Coding

AI Builders and Tools: "Cursor of X" Era and Dia Browser

We entered in the era of the "Cursor Of X" startups.
Dia Browser is out, go get it: diabrowser.com.

Federico Ulfo

Comment

Jun 16, 2025Research

Apple's "The Illusion of Thinking" and the Rebuttal

This research paper from Apple is been quite controversial for several reasons, first of which, Apple lagging behind the AI race: https://machinelearning.apple.com/research/illusion-of-thinking.

Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really “thinking”? Or are they just throwing more compute towards pattern matching?

Apple designed an experiment using Tower of Hanoi to test these models. Well turns out, it was a memory problem, the models were failing because they were going out of context.

Asking the model to be more concise in fact enabled o3 to solve the Tower of Hanoi, as shown in the paper of The Illusion of the Illusion of Thinking — this paper also sign as the first time an LLM, Claude Opus, is listed as an author on arXiv.

https://x.com/rohanpaul_ai/status/1933296859730301353

LLMs are not the only one faking it either: The Illusion Of Human Thinking.

Federico Ulfo

Comment

Jun 16, 2025Research

MEMOIR: Scalable Lifelong Model Editing Framework

We introduce MEMOIR — a scalable framework for lifelong model editing that reliably rewrites thousands of facts sequentially using a residual memory module. https://x.com/qinym710/status/1933514852313563228

Federico Ulfo

Comment

Jun 16, 2025Research

SEAL: LLMs That Update Their Own Weights

What if an LLM could update its own weights? Meet SEAL: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward. https://x.com/jyo_pari/status/1933350025284702697.

Federico Ulfo

Comment

Jun 16, 2025Research

Reasoning Features Learnt by SAEs Transfer Across Models

🔥 “Reasoning” features learnt by an SAEs (Sparce Auto Encoders) can be transferred as is across MODELS and datasets is super cool and similar in spirit to Mistral’s finding that there exists a low dim reasoning direction https://x.com/nrehiew_/status/1933308951334170712

Federico Ulfo

Comment

Jun 16, 2025Research

Sakana AI: Text-to-LoRA Hypernetwork

🔥 SakanaAILabs: We’re excited to introduce Text-to-LoRA: a Hypernetwork that generates task-specific LLM adapters (LoRAs) based on a text description of the task. Catch our presentation at #ICML2025! https://x.com/SakanaAILabs/status/1932972420522230214

Federico Ulfo

Comment

Jun 16, 2025Research

DeepSeek Drops Best End-to-End Large Model Training Paper

🔥 deedydas: DeepSeek just dropped the single best end-to-end paper on large model training. https://x.com/deedydas/status/1924512147947848039

Federico Ulfo

Comment

Jun 16, 2025Research

deedydas: The Best 340-Page AI Report

🔥 deedydas: The BEST AI report in the world just dropped and I read all 340 pages so you don’t have to. https://x.com/deedydas/status/1929381310856151280

Federico Ulfo

Comment

Jun 16, 2025Videos & Podcasts