AI Socratic
May 2025
Random

AI Dinner 10.0 at Greycroft, May 21st

Another AI dinner at the Greycroft office. This month's focus will be on A2A (Google's Agent2Agent protocol), Ilya Sutskever's top 30 papers, AlphaEvolve, and the latest in AI. We'll use this very blog post to structure our conversations!

AI NYC is a community of AI researchers, engineers, and founders. We meet once a month for a symposium-style Socratic discussion of research papers and LLMs, and to philosophize about the latest in AI.

Federico Ulfo
Random

Two major June AI conferences: AI Engineer SF vs AI Tech Week NY

There are two major conferences happening next month. To decide which one to attend, I scraped the speakers and events from both and put them in a Google Sheet, which you can check here: https://docs.google.com/spreadsheets/d/13Z2PKyFtbQaZm8iDguzxTZ4gukZnAI2BCg9-cVJFqfA/edit?gid=2008052934#gid=2008052934

I decided to go to SF this time around. The main reason is that the conference puts the best AI speakers together in one place, while AI Tech Week is scattered all over the place and has way too much noise.

Federico Ulfo
Models

Google I/O 2025: Gemini agents, AI-first Android, and new hardware

Google I/O 2025 doubled down on Gemini-powered agents, AI-first Android, and a dash of new hardware: the clearest signal yet that “Google does the Googling” for you.

Key Highlights

  • AI Mode for Search – rolls out to all U.S. users, running dozens of Gemini-driven sub-queries and soon tapping Project Mariner to carry out up to 10 web tasks with a Teach-and-Repeat workflow.
  • Gemini 2.5 & Chrome – Deep Think reasoning mode arrives for complex math and code, while Gemini comes built into Chrome for summarization and navigation across tabs.
  • Imagen 4, Veo 3 & Flow – next-gen image and video models plus the Flow AI filmmaking app let creators stitch 8-second clips into longer AI movies.
  • Project Astra upgrades – the multimodal agent goes proactive with Search Live, speaking up unprompted and handling tasks as you point your camera.
  • Android 16 preview – Material 3 Expressive redesign, AI weather-reactive wallpapers, scam-call shields, Private Space, and system-wide Gemini hooks.
  • Wear OS 6 – gets the same Material 3 flair, adaptive circular UI and a 10 % battery bump for Pixel Watch and beyond.
  • Project Aura XR glasses – Xreal partnership teases wide-FOV smart glasses with on-device Gemini assistance.

Google shipped really hard with this.

Federico Ulfo
Agents

DeepMind AlphaEvolve: evolutionary coding agent discovers new algorithms

AlphaEvolve was received positively after its 14 May 2025 reveal. Powered by Gemini 2.0 models, the evolutionary coding agent discovers and refines algorithms that are already saving compute, speeding up hardware, and cracking open math problems.

Key Highlights

  • Superhuman Algorithms – beats the 56-year-old Strassen method for 4 × 4 complex matrix multiplication, using 48 scalar multiplications instead of 49.
  • Compute Savings – a new Borg scheduling heuristic recovers ≈ 0.7 % of Google’s global compute fleet.
  • Hardware & Training Boosts – a 23 % speed-up on a key Gemini kernel (cutting overall training time by 1 %) and up to 32.5 % faster FlashAttention; a leaner Verilog redesign ships in next-gen TPUs.
  • Evolutionary Engine – pairs Gemini Flash for breadth with Gemini Pro for depth, guided by automated evaluators.
  • Broad Discovery – applied to 50+ open math problems, it rediscovered the best known results in ≈ 75 % of cases and improved on them in ≈ 20 %.
  • Early Access – sign-ups are open for an academic Early Access Program (EAP); a wider rollout is under exploration.
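For context on the first highlight: Strassen's 1969 scheme multiplies two 2 × 2 matrices with 7 scalar multiplications instead of 8, and applying it recursively to 4 × 4 matrices costs 7 × 7 = 49 multiplications, the baseline AlphaEvolve reportedly cut to 48 for complex-valued matrices. The sketch below is the textbook Strassen recursion, not AlphaEvolve's new algorithm. `MultCounter` is a hypothetical helper that counts scalar products, and the code assumes square matrices whose size is a power of two.

```python
class MultCounter:
    """Hypothetical helper that tallies scalar multiplications."""

    def __init__(self) -> None:
        self.count = 0


Matrix = list[list[complex]]


def add(x: Matrix, y: Matrix) -> Matrix:
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x: Matrix, y: Matrix) -> Matrix:
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m: Matrix) -> tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split an n x n matrix (n even) into four n/2 x n/2 quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def join(c11: Matrix, c12: Matrix, c21: Matrix, c22: Matrix) -> Matrix:
    return [l + r for l, r in zip(c11, c12)] + [l + r for l, r in zip(c21, c22)]


def strassen(a: Matrix, b: Matrix, counter: MultCounter) -> Matrix:
    """Multiply square matrices whose size is a power of two."""
    if len(a) == 1:
        counter.count += 1  # the only place a scalar product happens
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of the naive eight.
    m1 = strassen(add(a11, a22), add(b11, b22), counter)
    m2 = strassen(add(a21, a22), b11, counter)
    m3 = strassen(a11, sub(b12, b22), counter)
    m4 = strassen(a22, sub(b21, b11), counter)
    m5 = strassen(add(a11, a12), b22, counter)
    m6 = strassen(sub(a21, a11), add(b11, b12), counter)
    m7 = strassen(sub(a12, a22), add(b21, b22), counter)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(sub(m1, m2), add(m3, m6))
    return join(c11, c12, c21, c22)


def naive(a: Matrix, b: Matrix) -> Matrix:
    """Schoolbook multiplication, used here only as a correctness check."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


A = [[complex(i + j, i - j) for j in range(4)] for i in range(4)]
B = [[complex(i * j + 1, j - 2 * i) for j in range(4)] for i in range(4)]
counter = MultCounter()
assert strassen(A, B, counter) == naive(A, B)
print(counter.count)  # 49, i.e. 7 * 7 scalar multiplications for a 4 x 4 input
```

The recursion only saves work because each level trades one multiplication for several additions; AlphaEvolve's reported result shaves one more multiplication off the 4 × 4 case, which is what the 48-versus-49 comparison refers to.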

How It Shines

  • Provably Novel – solutions are mathematically verified as new, not memorized.
  • Real-World Impact – live in data centers, chip design, and LLM training pipelines today.
  • Engineer-Friendly – outputs human-readable code, easing adoption and debugging.
  • Open Horizons – same framework targets materials science, drug discovery, sustainability, and more.

AlphaEvolve is DeepMind’s boldest leap toward AI-driven scientific discovery—an agent that literally evolves code, freeing humans to focus on bigger ideas.
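AlphaEvolve's internals are only partly public, but the loop described above (propose a change, score it with an automated evaluator, keep the best candidates) can be sketched in a few lines. In this toy, assumed purely for illustration, a "program" is just the three coefficients of a quadratic, the evaluator scores how closely it matches a hidden target, and a random `propose_mutation` function stands in for the Gemini Flash/Pro proposers. Every name here is hypothetical.

```python
import random

SAMPLE_XS = range(-3, 4)
HIDDEN = (3, -2, 5)  # hypothetical target coefficients the search should recover


def evaluate(candidate: tuple[int, int, int]) -> int:
    """Automated evaluator: negative squared error vs. the hidden quadratic (0 is perfect)."""
    a, b, c = candidate
    ha, hb, hc = HIDDEN
    return -sum((a * x * x + b * x + c - (ha * x * x + hb * x + hc)) ** 2 for x in SAMPLE_XS)


def propose_mutation(parent: tuple[int, int, int], rng: random.Random) -> tuple[int, int, int]:
    """Stand-in for the LLM proposer: nudge each coefficient by -1, 0 or +1."""
    return tuple(v + rng.choice((-1, 0, 1)) for v in parent)


def evolve(generations: int = 2000, population_size: int = 8, seed: int = 0):
    rng = random.Random(seed)
    start = (0, 0, 0)
    population = [(start, evaluate(start))]
    for _ in range(generations):
        parent, _parent_score = rng.choice(population)
        child = propose_mutation(parent, rng)
        population.append((child, evaluate(child)))
        # Elitist selection: keep only the best-scoring candidates.
        population.sort(key=lambda item: item[1], reverse=True)
        del population[population_size:]
    return population[0]


best, best_score = evolve()
print(best, best_score)
```

The design point the sketch tries to capture is that the evaluator, not the proposer, decides what survives; swapping the random mutation for a language model changes how good the proposals are, not the shape of the loop.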

In this episode of Machine Learning Street Talk, the team that worked on AlphaEvolve goes into the details of the breakthrough and shares their insights:

https://youtu.be/vC9nAosXrJw?si=pu3UjCJzJYImRgn-


Federico Ulfo
Agents

OpenAI Codex: cloud SWE agent built on codex-1 reasoning model

Codex is powered by codex-1, a version of OpenAI’s o3 reasoning model fine-tuned for software engineering tasks.

Key Highlights

  • Parallel Tasking: Works on features, bug fixes, tests, and codebase questions concurrently, each in its own isolated cloud sandbox; tasks take 1–30 minutes to complete.
  • Verifiable Actions: Every task provides terminal logs, test outputs, and change citations for transparent review & integration.
  • Configurable Agent Behavior: Use AGENTS.md files to instruct Codex on codebase navigation, testing commands, and project conventions.
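The parallel, sandboxed tasking described above is essentially fan-out delegation. Here is a minimal asyncio sketch under stated assumptions: `run_task` is a hypothetical stand-in that simulates one sandboxed agent task with a sleep and returns a log, it is not the Codex API, and the task names are made up.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    name: str
    passed: bool
    log: list[str] = field(default_factory=list)


async def run_task(name: str, duration_s: float) -> TaskResult:
    """Hypothetical stand-in for one agent task in its own isolated sandbox."""
    log = [f"[{name}] sandbox started"]
    await asyncio.sleep(duration_s)  # simulated work: edit files, run tests
    log.append(f"[{name}] tests passed")
    return TaskResult(name=name, passed=True, log=log)


async def delegate(tasks: dict[str, float]) -> list[TaskResult]:
    """Start every task at once; gather returns results in submission order."""
    return list(await asyncio.gather(*(run_task(n, d) for n, d in tasks.items())))


results = asyncio.run(delegate({"fix-login-bug": 0.3, "add-unit-tests": 0.2, "refactor-utils": 0.1}))
for result in results:
    print(result.name, result.passed, result.log)
```

Because the three simulated tasks overlap, the whole batch finishes in roughly the time of the slowest one (about 0.3 s) rather than the sum of all three, which is the intuition behind delegating work in parallel instead of queueing it.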

How It Shines

  • Coding Proficiency: Scores strongly on SWE-bench Verified and on OpenAI’s internal SWE evaluations, delivering clean, review-ready patches (LinkedIn).
  • Autonomous Collaboration: Proposes pull requests, refactors large codebases, and answers complex code questions independently (WSJ, @EconomicTimes).
  • Security-Focused: Runs in sandboxed containers with no internet access during execution and is trained to refuse requests for malicious software (@EconomicTimes).
  • Async Workflow Revolution: Shifts development from linear task queues to parallel, AI-driven task delegation, keeping engineers in flow longer (Medium).

Codex launched as a research preview to gather feedback, prioritize safety, and iterate rapidly in one of the most competitive spaces, alongside GitHub Copilot, Google Gemini, Anthropic Claude, and emerging startups.

OpenAI’s vision is a unified developer experience where real-time pairing and asynchronous agent workflows converge: imagine editing code in your IDE, spawning Codex tasks on demand, and receiving progress updates and results without context switching (LinkedIn).

Codex is reportedly more capable at handling multi-step parallel coding tasks than standalone o3-based code models. In my experience, for quick suggestions Copilot still feels snappier, but Codex’s parallelism is unmatched when you need to orchestrate complex refactors & testing pipelines.

https://www.youtube.com/watch?time_continue=6&v=wSAkqlzSZyw

"Your App Is Just A ChatGPT Wrapper" they said!

https://x.com/t31kx/status/1921214839961067734

Federico Ulfo
Models

LLM Vibe Check & Benchmarks: OpenRouter, LMArena, and IQ

Top models according to OpenRouter: notably, Gemini 2.5 is climbing the ladder, while Anthropic’s Claude 3.7 Sonnet is slowly going down.

Companies are overfitting their models to the benchmarks. LMArena (@lmarena_ai) has become the go-to evaluation for AI progress, yet a recent release shows how hard it is to keep evaluations on @lmarena_ai fair, despite the best intentions.
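For readers new to arena-style evaluation: LMArena ranks models from crowdsourced pairwise votes. As far as I know, its published leaderboard fits a Bradley-Terry model over all votes, but the classic online Elo update below, a close cousin, shows how a single vote shifts two ratings, and therefore why tuning a model toward what voters prefer moves its score. The K-factor of 32 and the starting rating of 1500 are conventional assumptions, not LMArena's settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both ratings after one head-to-head vote (points move from loser to winner)."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; model A wins three votes in a row.
a, b = 1500.0, 1500.0
for _ in range(3):
    a, b = elo_update(a, b, a_won=True)
    print(round(a, 1), round(b, 1))
```

Each successive win is worth a little less, because the expected score of the now-higher-rated model rises; an upset against a much stronger opponent moves ratings the most.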

A collection of benchmarks from Hugging Face.

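IQ benchmark scores have changed dramatically in just one year. o3 reportedly has an IQ of 160, four standard deviations above the mean on the usual scale, which would put it at roughly 1 in 31,600 people, or about the top 250,000 people in the world.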
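A quick check of what an IQ of 160 means, assuming the common convention that IQ scores are normally distributed with mean 100 and standard deviation 15, and a world population of about 8 billion (both assumptions, not figures from the post). The standard library's `statistics.NormalDist` is enough:

```python
from statistics import NormalDist

IQ_SCALE = NormalDist(mu=100, sigma=15)  # common IQ convention (assumption)
WORLD_POPULATION = 8.0e9  # rough 2025 figure (assumption)


def share_above(iq: float) -> float:
    """Fraction of the population expected to score above `iq`."""
    return 1 - IQ_SCALE.cdf(iq)


tail = share_above(160)
print(f"z-score: {IQ_SCALE.zscore(160):.1f}")
print(f"share above 160: {tail:.2e} (about 1 in {1 / tail:,.0f})")
print(f"people above 160 worldwide: {tail * WORLD_POPULATION:,.0f}")
```

Under these assumptions 160 sits four standard deviations above the mean, which works out to roughly a quarter of a million people worldwide; the exact count moves with the population figure and with whether a test uses SD 15 or SD 16.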

Federico Ulfo
Research

Sakana's Continuous Thought Machines (CTM) architecture

Continuous Thought Machines (CTM)

Sakana AI proposes a new neural architecture, the Continuous Thought Machine (CTM), built from the ground up to use neural dynamics as a core representation for intelligence. By treating neural dynamics as a first-class citizen, the CTM shows some interesting emergent behavior, and it is naturally easier to interpret.

A CTM can decide to think less once it finds a pattern, much as humans do, which lets it save compute and energy.
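The "think less" idea is a form of adaptive computation. As I understand the CTM paper, the model produces a prediction at every internal tick and tracks a certainty score (1 minus the normalized entropy of that prediction). The early-exit rule below is a toy illustration of that idea, not Sakana's implementation: `make_toy_step` is hypothetical and fakes predictions that sharpen toward one class, quickly for an "easy" input and slowly for a "hard" one, and the 0.9 threshold is an arbitrary assumption.

```python
import math


def certainty(probs: list[float]) -> float:
    """1 - normalized entropy: 0.0 for a uniform guess, 1.0 for a one-hot answer."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1 - entropy / math.log(len(probs))


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_step(sharpness: float):
    """Hypothetical stand-in for one internal tick: class 0 gains evidence each tick."""
    def step(tick: int) -> list[float]:
        return softmax([sharpness * tick, 0.0, 0.0, 0.0])
    return step


def think(step_fn, max_ticks: int = 50, threshold: float = 0.9):
    """Run internal ticks until the prediction is confident enough, then stop early."""
    probs: list[float] = []
    for tick in range(1, max_ticks + 1):
        probs = step_fn(tick)
        if certainty(probs) >= threshold:
            return tick, probs
    return max_ticks, probs


easy_ticks, _ = think(make_toy_step(sharpness=2.0))
hard_ticks, _ = think(make_toy_step(sharpness=0.2))
print(f"easy input: {easy_ticks} ticks, hard input: {hard_ticks} ticks")
```

The easy input stops after a handful of ticks while the hard one keeps computing; compute spent per input becomes a function of how quickly the model becomes sure of its answer.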

https://x.com/hardmaru/status/1921751428508582329

Federico Ulfo

Why the AI wave is different: Cursor's rise from $100M to $300M ARR

Insights on why the AI wave is different, illustrated by Cursor's rise from $100M to $300M ARR in a few months. The thesis:

  • The AI wave is different from the cloud wave.
  • AI is being bought, while SaaS was being sold: AI products are pulled by demand, so they grow faster.
  • A data advantage might be the only (or ultimate) moat in AI; yet GitHub Copilot had the data, distribution, and resource advantages, and Cursor is still winning thanks to better UX.
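To make "$100M to $300M ARR in a few months" concrete: tripling over n months implies a compound monthly growth rate of 3^(1/n) - 1. The post does not say exactly how many months, so the month counts below are assumptions chosen only to show the range.

```python
def implied_monthly_growth(start_arr: float, end_arr: float, months: int) -> float:
    """Constant monthly growth rate that turns start_arr into end_arr over `months`."""
    return (end_arr / start_arr) ** (1 / months) - 1


for months in (3, 4, 6):  # "a few months" is unspecified; these values are assumptions
    rate = implied_monthly_growth(100e6, 300e6, months)
    print(f"{months} months -> {rate:.1%} growth per month")
```

Even at the slow end of that range, sustaining roughly 20% month-over-month growth at this revenue scale is what makes the "AI is bought, not sold" pull argument plausible.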

Federico Ulfo
Videos & Podcasts

Rich Sutton on AI alignment and Decentralization [15 min video]

"The short version is that I don't agree with AI-safety folks about what question we should be asking. Rather than asking how we can control the goals of the AIs, I think we should be asking how we can have a good future without controlling their goals (just as we have a pretty good present without controlling other peoples' goals)." - Richard Sutton

https://www.youtube.com/watch?v=Hnt-oBA086U&t=85s

Federico Ulfo