AI Socratic
June 2025
Random

AI Dinner 11.0 at Arize.ai Office

The next AI dinner will be on June 18th, and it will be hosted in the Arize.ai office.

We'll discuss the top news and updates from this blog post using the Socratic method, and go through a few presentations.

Federico Ulfo
Videos & Podcasts

AI.Engineer SF Conference: Takeaways and Recordings

Despite numerous snafus (poor to nonexistent internet, malfunctioning big screens, unclear ticket access, general organizational issues, insufficient food for attendees, high prices, and speakers who were mostly promoting their products), this conference turned out to be one of my favorite events of the year and attracted an incredibly high level of talent.

Jory Pestorious wrote this fantastic summary of the top insights from the AI Engineer conference:

  1. Competition is about clear ideas; everyone can build today, so code and models are not a moat anymore.
  2. Engineering excellence equals articulation excellence. Write specs that humans understand and that AI can execute.
  3. Engineers need to learn how to use AI tools to 10x their output, or they’ll be left behind — harsh reality.
  4. No code should go unreviewed; code debt grows faster than AI can fix it.
  5. Claude Code 🔥
  6. MCP is becoming the standard; time to fully embrace it.

Link: http://jorypestorious.com/blog/ai-engineer-spec.

Another interesting report comes from Thomas Gear:

  • AI coding complexity doubles every 70 days
  • 50% of engineers use LLMs
  • RAG leads customization at 70%
  • costs dropped 600x (to $0.10/million tokens)
  • models shrink (405B to 24B) while staying powerful
  • Gemini’s market share jumped to 35% with 50x inference growth
  • companies now design for AI to handle 80% of work.

Link: https://x.com/tg_bytes/status/1931938102861271042

Here’s the recording of the general track. All the advanced talks in the RL track were just incredible and made the conference worth it.

https://www.youtube.com/watch?v=z4zXicOAF28&t=18946s

The highlight of the tech talks for me was Daniel Han from Unsloth, an RL beast! https://x.com/danielhanchen/status/1930752903960211608.

Agents

Anthropic: How We Built Our Multi-Agent Research System

Anthropic shares how they built Claude's new multi-agent Research feature: a lead Claude agent spawns and coordinates subagents that explore complex queries in parallel, in an orchestrator-worker architecture.

Traditional approaches built on Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch the set of chunks most similar to an input query and use these chunks to generate a response. Anthropic's Research architecture instead uses a multi-step search that dynamically finds relevant information in parallel, adapts to new findings, and analyzes results to formulate high-quality answers.

Token-efficient Scaling

Performance gains correlate strongly with token usage and parallel tool calls. By distributing work across multiple agents and context windows, Claude’s system scales reasoning capacity efficiently. However, this comes with a 15× token cost over standard chats, making it suitable for high-value queries only.

  • Think like your agents.
  • Teach the orchestrator how to delegate.
  • Scale effort to query complexity.
  • Tool design and selection are critical. MCP servers give agents tool access on steroids.
  • Let agents improve themselves. The agents can diagnose when something fails and fix it by rewriting the MCP tool description. This process cut task completion time by 40%.
  • Start wide, narrow down.
  • Guide the thinking process.
  • Parallel tool calling transforms speed and performance. Parallelism can cut total time by up to 90%.
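To make the orchestrator-worker idea and the parallelism point concrete, here is a minimal sketch in Python. It is not Anthropic's implementation: `plan_subtasks` and `run_subagent` are hypothetical stand-ins for model calls, and the sleep only simulates model and tool latency. What it shows is the fan-out/fan-in shape, where the lead agent waits for the slowest subagent rather than for the sum of all of them.

```python
import asyncio


def plan_subtasks(query: str) -> list[str]:
    """Hypothetical planner: a real lead agent would ask the model to decompose the query."""
    return [f"{query} / angle {i}" for i in range(1, 4)]


async def run_subagent(subtask: str) -> str:
    """Hypothetical worker: stands in for a subagent running searches and tool calls."""
    await asyncio.sleep(0.05)  # placeholder for model + tool latency (assumption)
    return f"findings for: {subtask}"


async def lead_agent(query: str) -> str:
    subtasks = plan_subtasks(query)
    # Fan out: every subagent runs concurrently, each with its own context.
    results = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # Fan in: the lead agent merges the findings (a real system would synthesize them).
    return "\n".join(results)


if __name__ == "__main__":
    print(asyncio.run(lead_agent("AI agent startups")))
```

With three subagents the simulated wall-clock time stays close to one subagent's latency, which is the effect the last bullet describes.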

Flexible Evaluation + Production Reliability

Anthropic uses LLM-as-judge scoring with rubrics for factuality, citation, and efficiency, alongside human testing to catch subtle failures. For reliability, they built resumable stateful agents with checkpointing, rainbow deployments, and full observability of agent decision traces, crucial for debugging non-deterministic, long-running agents.
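The checkpointing idea can be sketched in a few lines. This is only an illustration under assumptions: the step names, the JSON file format, and the `run_with_checkpoints` helper are all hypothetical and not taken from Anthropic's post. The point is that state is persisted after every step, so a crashed run resumes from the last completed step instead of starting over.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path: Path) -> dict:
    """Run (name, fn) steps in order, saving state after each one so a crash can resume."""
    if checkpoint_path.exists():
        # Resume: load whatever a previous run already finished.
        state = json.loads(checkpoint_path.read_text())
    else:
        state = {"done": [], "results": {}}

    for name, fn in steps:
        if name in state["done"]:
            continue  # completed in an earlier run, skip it
        # Each step receives the results gathered so far.
        state["results"][name] = fn(state["results"])
        state["done"].append(name)
        checkpoint_path.write_text(json.dumps(state))

    return state["results"]
```

If a later step raises, the file still holds the earlier results, and calling the function again with the same path picks up where it stopped.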

Blog: https://anthropic.com/engineering/built-multi-agent-research-system

Tweet: https://x.com/omarsar0/status/1933941545675206936.

https://x.com/swyx/status/1933981734456230190

Claude Code CLI 🔥

Packaging AI coding tools and coding agents into products, and even worse, cloud products, is the wrong path. The command line is the way!

Tutorial on how to use it: https://x.com/rasmickyy/status/1931078993022730248

Models

OpenAI Leads with o3-pro at 80% Lower Cost

o3-pro performs as well as or better than o3 on most benchmarks, including ARC-AGI-1 and ARC-AGI-2. What's incredible is the accompanying 80% price cut on o3, which now costs about as much as GPT-4.1!

https://x.com/ArtificialAnlys/status/1932489573462081898

Twitter did what Twitter does, speculating that o3 uses distillation, but some insiders say that OpenAI has been using Codex internally to optimize the heck out of it, achieving the incredible 80% cut without performance losses.

OpenAI's retention curve is a wet dream for most investors. Their 1-month retention has skyrocketed from under 60% two years ago to an unprecedented ~90%! YouTube was best-in-class with ~85%. 6-month retention is trending toward ~80%. A rapidly rising smile curve.

https://x.com/deedydas/status/1932619060057084193

Models

Mistral Releases Magistral RL Reasoning Model

The Mistral team is at it again with Magistral, a reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.

GRPO with edits:

  1. Removed KL divergence
  2. Normalize by total length (Dr. GRPO style)
  3. Minibatch normalization for advantages
  4. Relaxing trust region
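As a rough illustration of how these four edits change a GRPO-style objective, here is a small pure-Python sketch. It is not Mistral's code: the function names are hypothetical, the clipping values are illustrative assumptions, and "relaxing trust region" is interpreted here as an upper clipping bound that is larger than the lower one.

```python
from statistics import mean, pstdev


def group_advantages(group_rewards):
    """Advantage of each sample = its reward minus the mean reward of its group."""
    mu = mean(group_rewards)
    return [r - mu for r in group_rewards]


def minibatch_normalize(advantages, eps=1e-8):
    """Edit 3: normalize advantages across the whole minibatch, not per group."""
    m, s = mean(advantages), pstdev(advantages)
    return [(a - m) / (s + eps) for a in advantages]


def clipped_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate over token-level probability ratios.

    ratios[i][t] is the new/old policy ratio for token t of sequence i;
    advantages[i] is the (sequence-level) advantage of sequence i.
    eps_low and eps_high are illustrative values, not taken from the paper.
    """
    total_tokens = sum(len(seq) for seq in ratios)
    total = 0.0
    for seq, adv in zip(ratios, advantages):
        for r in seq:
            # Edit 4: asymmetric clip range with a looser upper bound.
            clipped = min(max(r, 1 - eps_low), 1 + eps_high)
            total += min(r * adv, clipped * adv)
    # Edit 2: divide by the total token count of the batch, not per sequence.
    # Edit 1: no KL penalty term is subtracted.
    return total / total_tokens
```

A training loop would maximize this value (or minimize its negative) with respect to the policy parameters that produce the ratios.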

https://arxiv.org/pdf/2506.10910

Simon Willison: all LLM API vendors are converging on the same product:

  • Code execution: Python in a sandbox
  • Web search: like Anthropic, Mistral seems to use Brave
  • Document library aka hosted RAG
  • Image generation (FLUX for Mistral)
  • Model Context Protocol
Research

Apple's "The Illusion of Thinking" and the Rebuttal

This research paper from Apple has been quite controversial for several reasons, the first being that Apple is lagging behind in the AI race: https://machinelearning.apple.com/research/illusion-of-thinking.

Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really “thinking”? Or are they just throwing more compute towards pattern matching?

Apple designed an experiment using Tower of Hanoi to test these models. Well, it turns out it was a memory problem: the models were failing because they were running out of context.

Asking the model to be more concise in fact enabled o3 to solve the Tower of Hanoi, as shown in the paper The Illusion of the Illusion of Thinking. This paper also marks the first time an LLM, Claude Opus, is listed as an author on arXiv.
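The context argument is easy to check with a little arithmetic: an optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so writing out every move grows exponentially, while a program that generates the moves stays a few lines long. The sketch below shows both; the `tokens_per_move` figure is purely an assumption for illustration, not a number from either paper.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, via, dst)  # move the n-1 smaller disks out of the way
    yield (n, src, dst)                           # move the largest disk
    yield from hanoi_moves(n - 1, via, dst, src)  # move the smaller disks back on top


def tokens_needed(n, tokens_per_move=10):
    """Rough output size if every move is written out (tokens_per_move is an assumption)."""
    return (2**n - 1) * tokens_per_move


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, 2**n - 1, tokens_needed(n))
```

At 15 disks the full move list is already 32,767 moves, so a model asked to enumerate them can hit its output limit long before it stops reasoning correctly.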

https://x.com/rohanpaul_ai/status/1933296859730301353

LLMs are not the only ones faking it either: The Illusion Of Human Thinking.
