Updated on Sep 6th
The top AI developments from Jul 15–Sep 15, debated Socratically over AI dinners in NY and SF. This is our 1st draft.
Sign up to receive the mailing list!
Upcoming Events
This time around we’ll have two events: one in New York and, for the first time, one in San Francisco at the Frontier Tower. We’ll discuss the top news and updates from this blog post using the Socratic method, and we’ll have a lineup of demos and presentations!
AI Dinner
Sep 10th, Solana Skyline
New York City
Link: https://lu.ma/ai-dinner-13.0
AI Dinner
Sep 17th, Frontier Tower
San Francisco
Link: https://lu.ma/ai-dinner-13.2
Benchmarks & Metrics
Over the past two years, the AI NY community has been actively reviewing and discussing various benchmarks while tracking the rapid progress of new models. What has become increasingly clear is that most models are heavily overfitted to these benchmarks. As a result, they are less a true measure of real-world performance and more a way to track the pace of new model releases.
| Model | ARC-AGI 1 (score / cost) | ARC-AGI 2 (score / cost) | OpenRouter tokens | Average Score |  |  |
|---|---|---|---|---|---|---|
| GPT-5 (High) | 65.7% / $0.50 | 9.9% / $0.73 | 62.7B | 77.39 | 25.32 | 44.57 |
| Grok 4 (Thinking) | 66.7% / $1.11 | 16% / $2.17 | 93.1B | 70.32 | N/A | 51.00 |
| Claude Sonnet 4 (Thinking 16K) | 40% / $0.366 | 5.9% / $0.48 | 545B | 71.02 | 7.76 | 31.16 |
| Gemini Flash 2.5 (Thinking 16K) | 33% / $0.21 | 2% / $0.31 | 259B | 64.42 | 12.08 | 27.87 |
| Gemini 2.5 Pro (Thinking 16K) | 41% / $0.484 | 4.0% / $0.72 | 150B | 65.70 | 21.64 | 33.08 |
Charts: ARC-AGI 1 · ARC-AGI 2 · OpenRouter Leaderboard
OpenAI Launches GPT-5 with Advanced Reasoning
OpenAI launched GPT-5 on August 7, 2025, touting "PhD-level intelligence" with built-in advanced reasoning for coding, planning, and agentic tasks. Available to Free, Plus, Pro, and Team users immediately, with Enterprise/Edu following. The rollout faced backlash over glitches, hallucinations, and a "colder" tone, prompting OpenAI to restore legacy models for paid users.
Tweets reflected mixed sentiment: praise for its coding strengths, criticism of the chaotic debut and its emotional impact, and many users missing GPT-4o.
| Category | Metric / Claim | Key Benefit |
|---|---|---|
| Cost Efficiency | 25× cheaper than GPT-4 | Enables wider access (link) |
| Coding Win Rate | 70%+ vs. GPT-4 | Excels in complex tasks (link) |
| Reasoning Boost | 40% improvement over GPT-4 | Better for long chains and tool use |
| Training Scale | 170K-180K H100 GPUs | Massive compute investment for multimodal advancements (link) |
Google Updates
Gemini 2.5 Deep Think
Google launched Gemini 2.5 Deep Think, an advanced reasoning mode in Gemini 2.5 Pro, available to Google AI Ultra subscribers via the Gemini app. It uses parallel thinking and reinforcement learning for better problem-solving, generating and refining hypotheses over time. Strengths include creative projects, science, math, and coding. Activate via "Deep Think" in the prompt bar; limited daily prompts. Integrates tools like code execution and search for detailed, safe responses.
| Benchmark | Score | Notes |
|---|---|---|
| Humanity’s Last Exam | 21.6% | Expert-level; beats predecessors. |
| GPQA (Science) | 86.4% | High accuracy in STEM. |
| AIME 2025 (Math) | 88.0% | Near SOTA. |
| LiveCodeBench (Code) | 69.0% | Outperforms o3, Grok 4. |
| Aider Polyglot | 82.2% | Multi-language edits. |
| SWE-bench | 67.2% | Real-world tasks. |
| MMMU | 82.0% | Multimodal reasoning. |
Genie 3: World Model Enhances AI Environmental Understanding
Genie 3 is a groundbreaking world model that transforms simple text prompts into immersive, interactive virtual environments that you can explore with the arrow keys, as in a video game. This innovation marks a significant leap in how AI comprehends and simulates real-world dynamics, enabling agents to navigate and interact with generated worlds in real time, and to remember changes.
It runs at 24 fps at 720p for a few minutes, and it can generate a world from a single text prompt or image. Actions are handled by auto-regressive frame generation conditioned on the user's trajectory.
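The auto-regressive loop can be sketched in a few lines. This is a toy illustration under our own naming, nothing here is Genie 3's actual API: each new frame is predicted from the history of previous frames plus the user's latest input, which is what lets the world stay consistent as you move through it.

```python
# Toy sketch of auto-regressive frame generation conditioned on user actions.
# The real model predicts video frames with a learned network; here a "frame"
# is just a position on a grid so the loop structure is visible.

def predict_next_frame(history, action):
    # Stand-in for the learned frame predictor.
    x, y = history[-1]
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[action]
    return (x + dx, y + dy)

def rollout(initial_frame, actions):
    # Each frame is generated from the full history plus the latest user
    # input; this auto-regressive conditioning is how the world "remembers".
    frames = [initial_frame]
    for action in actions:
        frames.append(predict_next_frame(frames, action))
    return frames

frames = rollout((0, 0), ["right", "right", "down"])
print(frames)
```

In the real system the same structure holds, but each step is a 720p frame generated at 24 fps.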

Nano-Banana 🍌: Consistent Image Generation Application
AI companies love fruit-themed codenames, and now it’s Google’s turn with “Nano Banana”, the codename for Gemini 2.5 Flash Image, a major upgrade to its Gemini AI suite.
Why is it a game changer?
- Consistency Over Everything: Nano Banana shines by ensuring visual elements (faces, pets, objects) remain consistent across multiple edits. It addresses the common “close but no cigar” problem in AI editing.
- Editing, Not Just Generating: Unlike many AI tools, it doesn’t just spawn new images; it excels at nuanced, multi-step edits. Think combining images, modifying existing ones, or transforming backgrounds, all while keeping things coherent.
- Deep Integration Across Platforms: It’s now integrated into both the web and mobile versions of the Gemini app, available to free and paid users alike. So yes, everyone gets in on the action.
Nano-Banana has already launched, and we had the privilege of trying it. I can tell you it delivers what it promises for image editing, but it still hallucinates when the image has too many details.
Blog Posts
The Second Half
This is one of the best blog posts of 2025 by the OpenAI researcher Shunyu Yao. A playbook for what will matter most in AI research and the startup ecosystem, and how to prepare.

In the first half, the focus was on developing new training methods and models. In the second half, the focus shifts from solving problems to defining problems, and writing the correct evals for them.
Reinforcement learning (RL) key components
In RL there are three main components:
- Algorithm: the learning rule or optimization method that updates the agent’s policy. This is the machinery of RL, from early methods like REINFORCE and Q-learning to more advanced ones like actor-critic, PPO, or TRPO. Algorithms define how the agent learns from experience.
- Environment: the world in which the agent operates. It provides the state, feedback, and reward signals. In research, environments range from toy grids and Atari games to robotics simulators and large-scale language-interaction settings. The environment defines what the agent is learning about.
- Priors: the knowledge the agent brings before any RL begins. Priors can come from pre-training on vast datasets.
For decades, RL researchers obsessed over algorithms—REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO. Sutton and Barto’s classical textbook is almost entirely about algorithms, with little attention to environments or priors.
That mindset carried into mainstream AI. The first half of the modern AI era was dominated by new training methods and model architectures: AlexNet, the transformer, GPT-3. Progress then stacked algorithms on models to beat benchmarks. But the game changed when RL finally generalized.
The working recipe now looks like this: massive language pre-training (priors) + scale + reasoning-as-action inside an RL loop.
It turned out the most important part of RL might not be the algorithm or the environment, but the priors, which can be obtained in a way totally unrelated to RL.
So now the frontier shifts from solving problems to defining the right problems; evaluation takes center stage, and the core benchmark is no longer a leaderboard score but the “utility problem”: traditional benchmarks don’t translate well to real-world tasks.
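The three components are easiest to see in a minimal RL loop. Here is a toy sketch, purely illustrative: the bandit environment, the constants, and the update rule below are our own, not from the post.

```python
import random

random.seed(0)

# The three RL components from the post, in miniature: a PRIOR initializes
# the agent, an ENVIRONMENT gives rewards, and an ALGORITHM learns from them.

# Environment: a 2-armed bandit; arm 1 pays off more on average.
def environment(action):
    return random.gauss(mu=[0.2, 0.8][action], sigma=0.1)

# Priors: initial value estimates. In modern LLM RL this role is played by
# pre-training; here it's just optimistic initialization.
q_values = [1.0, 1.0]

# Algorithm: epsilon-greedy action selection + incremental Q-value update.
alpha, epsilon = 0.1, 0.1
for step in range(500):
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: q_values[a])
    reward = environment(action)
    q_values[action] += alpha * (reward - q_values[action])

print(q_values)  # the agent ends up valuing arm 1 higher
```

Swapping any one piece (a different update rule, a richer environment, better priors) changes what the agent can learn, which is the post's point about where the leverage now sits.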

This is a great blog post that requires a few readings before it really sinks in.
https://x.com/ShunyuYao12/status/1911671943457345675
https://x.com/Hesamation/status/1960717949771092429
Mechanize: How to Fully Automate Software Engineering
Current AI exhibits Moravec’s paradox: tasks that are hard for humans (math, coding) come easy to it, while tasks that are easy for humans (a baby choosing the right shape) are very hard. Some people think we can’t get past this paradigm.
The bitter lesson (Richard Sutton) tells us that scale (brute force) can solve this problem. RL is also a solution, but it remains stuck on narrow problems.
This blog argues that full automation needs more than just bigger models, it’s about training on tons of real human coding data and then letting AI learn in rich, realistic environments using RL. Right now, those RL environments are way too basic, and grading open-ended engineering work is tough.
So, AI shifts engineers to higher-level planning, testing, and coordination. And here’s the twist: software engineering might be both the first and last white-collar job to be totally automated, since building a “drop-in remote worker” AI is a much bigger challenge than just writing code.
https://x.com/MechanizeWork/status/1945528661131849790
Gartner: AI Hype Cycles
In 2025, AI agents reached the peak of the hype cycle. Focus is now shifting to delivering stable business applications.

Nic Carter Tweet on AI opportunities and negative effects
Nic thinks AI is going to increase the Gini coefficient, drawing parallels to the post-1850 drop in the horse population, and foresees job disruptions, regulations, and a socialist backlash.
The NVDA rally changed how he sees the future of AI investing. He was an early investor in CoreWeave. In 2020 he read Gwern’s Scaling Hypothesis, the most important blog post of that era; in 2022 Stable Diffusion came out.
I'm probably doing a terrible job at summarizing this tweet, so just go read it for yourself: https://x.com/nic__carter/status/1797635177973158182.
Videos And Podcasts
Can Goodness Compete?
A philosophical exploration of a post-AGI future and whether goodness can be a Schelling point for humanity.
https://www.youtube.com/embed/i6RMHcMbqQw
The Fractured Entangled Representation Hypothesis
Current diffusion models train on massive data and produce images that happen to be correct, but they are “castles in the sand”.
https://open.spotify.com/episode/2T75g7BLaztaWSeXWO0G18?si=m7BQkUYcTSGOQuCQRO6Jcg
Instead of building top-down with SGD, these new models start bottom-up, composing images piece by piece. They show the example of a skull, explaining how, step by step, a representation emerges, locking in one piece at a time. This is a 15-minute summary of the full episode.
Demis Hassabis: Future of AI, Simulating Reality, Physics and Video Games | Lex Fridman Podcast
https://www.youtube.com/watch?v=-HzgcbRXUK8
Demis Hassabis and Lex discuss how AI is modeling complex patterns in nature, the path toward artificial general intelligence (AGI), and the societal impact of rapid technological change. The conversation covers breakthroughs like AlphaFold, the future of video games, the importance of creativity and research “taste,” and the philosophical questions surrounding consciousness and human purpose. Hassabis emphasizes the need for responsible AI development, collaboration, and ensuring that the benefits of these technologies are shared widely.
How AI image generation draws from physics | Guest video by @WelchLabsVideo
https://www.youtube.com/watch?v=iv-5mZ_9CPY
A geometric exploration of how diffusion models work 🔥 and how gradient descent is represented.
MLST — DeepMind Genie 3 - Simulate The World
https://www.youtube.com/watch?v=ekgvWeHidJs&t=604s
MLST interview with the DeepMind Genie 3 team, a general purpose world model that can generate an unprecedented diversity of interactive environments, can create and remember the world details, and it has emergent rules, like parallax. Video games and virtual reality are about to have their GPT moment. It’s a real time model that can render at 30fps and 720p and it can render the world for several minutes.
Theory Of Everything: The (Terrifying) Theory That Your Thoughts Were Never Your Own
https://www.youtube.com/watch?v=Ca_RbPXraDE
The theory that your thoughts were never yours: in other words, language is an entity with its own evolution that is injected into our brains, and there are likely other parallel entities (or programs) running in our biology. Does the brain work just like an LLM?
This is a mind bending live discussion and podcast at the University of Toronto, from Prof Elan Barenholtz and William Hahn, moderated by Curt Jaimungal.
Full Sources List
As usual, there is far too much news, and too many articles, papers, fun memes, and tweets, to write about them all. Here’s the complete list, in case you want to explore everything that happened over the past two months.
AI Agents
- It’s getting insanely easy to build complex multi-agent systems in n8n https://x.com/omarsar0/status/1951015458640920817
- Structural Planning Framework for LLM Agent System to improve agents reliability https://x.com/omarsar0/status/1947750756494586275
AI Builders
- ⭐ AI automation works best on clear, repetitive tasks but often fails when context-heavy tasks take longer to explain than to do manually
https://x.com/infoxiao/status/1956729154411381067
- ⭐ Cursor CLI beta is out https://x.com/cursor_ai/status/1953559384531050724
- Comprehensive Taxonomy of LLM Hallucinations Nice report covering common hallucinations, root cause and mitigation https://x.com/omarsar0/status/1952731083465994347
- Claude code lets you create agents now https://x.com/_catwu/status/1948496854712295492
- Gemini Code Assist now shares the same technology with Gemini CLI in VS Code https://x.com/GoogleCloudTech/status/1946691520922300695
- Claude sonnet API now support 1M token context window https://x.com/claudeai/status/1955299573620261343
- Context window size is an important limitation for all LLMs https://x.com/FactoryAI/status/1947391013271835131
Books, Blogs, X Threads
- ⭐️ The Second Half: one of the best blog posts of 2025 by the OpenAI researcher Shunyu Yao. A playbook for what will matter most in AI research and the startup ecosystem, and how to prepare. https://x.com/ShunyuYao12/status/1911671943457345675, https://x.com/Hesamation/status/1960717949771092429
- ⭐ Mechanize: How to fully automate software engineering: current AI presents Moravec’s paradox, and some people think we can’t get past this paradigm. The bitter lesson (Richard Sutton) tells us that scale can solve this problem; RL is also a solution, but for narrow problems. AI has shifted engineers’ focus to higher-level tasks, which paradoxically might be both the first and last white-collar job to be automated. https://x.com/MechanizeWork/status/1945528661131849790
- ⭐ Gartner Highlights AI Agents as Fastest-Advancing Tech show hype cycles https://www.gartner.com/en/newsroom/press-releases/2025-08-05-gartner-hype-cycle-identifies-top-ai-innovations-in-2025
- ⭐ Nic Carter: long tweet on AI funding future: this NVDA rally will boost growth but might increase the Gini coefficient, and be disastrous for society https://x.com/nic__carter/status/1797635177973158182
- Manus: Context Engineering for AI Agents: Lessons from Building Manus. After four overhauls and millions of real-world sessions, here are the lessons we learned about context engineering for AI agents https://x.com/ManusAI_HQ/status/1946291647849144516
- How to build a car: “best book I’ve read in a very long time for all founders” https://x.com/nbashaw/status/1954196006985207861
- Enterprise AI enters the agentic era: Autonomous AI systems are streamlining business and boosting efficiency. https://analyticsindiamag.com/ai-highlights/enterprise-ai-hits-an-inflection-point-the-agentic-era-is-here
DeAI
- ⭐ This is the first demo of the ERC-8004 Trustless Agents standard, showing a full, end-to-end work https://x.com/_sumeetc/status/1958007821775294788
Diffusion Models
- ⭐ Google Nanobanana: pro-level Photoshop edits via text https://x.com/deedydas/status/1959068336903659778
- ⭐ Google Genie 3: the most advanced world simulator ever created, enabled by numerous research advances https://x.com/OfficialLoganK/status/1952732206176112915
- Genie 3 video https://x.com/RuiHuang_art/status/1954716703340048877
- Walk inside your favorite paints with Genie3 https://x.com/holynski_/status/1953879983535141043
- Evolution of "Neural Video Games" from GQN (2018) to Genie3 (2025) https://x.com/OriolVinyalsML/status/1952766969859457064
- Token crisis, solved: pre-trained diffusion language models (DLMs) vs. autoregressive (AR) https://x.com/NiJinjie/status/1954177095435014533
- Veo3 animation https://x.com/CoffeeVectors/status/1948082017070936509
- Thyme uses a two-stage training strategy and the GRPO-ATS algorithm to combine reasoning and code execution, enabling efficient high-resolution perception and complex reasoning beyond traditional image-based methods https://x.com/iScienceLuvr/status/1957402918057017823
- Qwen-Image by Alibaba is now available https://x.com/FAL/status/1952445949391118532
Fundraising, grants, programs
- Why AI is a house of cards https://x.com/zoomyzoomm/status/1956141182754677084
- Detailed list of all 44 people in Meta's Superintelligence team https://x.com/deedydas/status/1946597162068091177
- Techcrunch reports that AI coding tools have "very negative" gross margins https://x.com/nickbaumann_/status/1954253210287288430
GEO Politics
- ⭐ Distillation can secretly pass on hidden traits from teacher to student models, even through neutral data like number sequences. If true, open sourcing LLMs is a soft power https://x.com/swyx/status/1947875989666832576
- ⭐ 80% of YC AI startups use Chinese OSS AI models https://x.com/rohanpaul_ai/status/1959972531969888668
- ⭐ China updates: China is killing the US on energy. Does that mean they’ll win AGI? — Casey Handmer | Dwarkesh Patel Podcast https://www.youtube.com/watch?v=3cDHx2_QbPE, Keyu Jin: China's Economy, Tariffs, Trade, Trump, Communism & Capitalism | Lex Fridman Podcast https://www.youtube.com/watch?v=y3yAVZk3tyA
- The globalized supply chain has a number of irreplaceable single points of failure: NVIDIA chips made by TSMC using Dutch ASML machines with German parts https://x.com/fchollet/status/1960079432548516310
- China produces 20x+ as much solar as the U.S https://x.com/dwarkesh_sp/status/1956471318599307695
- China added 464 GW solar and 461 TWh nuclear, while Germany cut nuclear to zero https://x.com/rohanpaul_ai/status/1948454031564656955
- US signs deals with UAE & Saudi to secure compute, ease power bottlenecks, and drive 6GW of AI data centers by 2030 https://x.com/SemiAnalysis_/status/1945311173219369359
- “Silicon Valley is the second most spy-infested region in America after D.C.” https://x.com/vitrupo/status/1945180400554566061
- Anthropic $200m contract with DoD https://x.com/AnthropicAI/status/1944848519065452754
- There are really 2 bitter lessons in AI https://x.com/hamandcheese/status/1947322643150987333
Hardware and Infra
- I spent my summer building TinyTPU : An open source ML inference and training chip https://x.com/suryasure05/status/1957518913648095376
- Cerebras systems 10x faster infra than NVIDIA https://x.com/andrewdfeldman/status/1956391583382904878
- Every robot you see is a data firehose generating terabytes of chaos https://x.com/IlirAliu_/status/1954545511807012985
- Insightful 26-page report from Goldman Sachs on power-grid and AI https://x.com/rohanpaul_ai/status/1947788227051721116
- Colossus 2 construction, 2x 100k h100s https://x.com/rohanpaul_ai/status/1947765377527845148
- Sama: we will cross well over 1 million GPUs brought online by the end of this year. https://x.com/sama/status/1947057625780396512
- It's official: we're developing 4.5 gigawatts of additional Stargate data center capacity with Oracle https://x.com/OpenAI/status/1947628731142648113
- Cable porn of xAI GB200 servers at Colossus 2 https://x.com/elonmusk/status/1947715674429919279
IMO Gold
- IMO drama: we might be heading into a plot twist in the OpenAI vs DeepMind https://x.com/zjasper666/status/1947013036382068971
- 4 years ago Paul Christiano thought IMO gold was an 8% odds, Eliezer 16% odds https://x.com/MichaelTrazzi/status/1946511717518762193
- Speaking as a past IMO contestant, this is impressive but misleading, gold vs silver is meaningless, 1pt below gold is noise, the impressive bit is that a general model can do IMO level math https://x.com/NeelNanda5/status/1946602858033639803
- Polymarket shows the AI IMO gold result came as a surprise to folks https://x.com/polynoamial/status/1946485373124608491
Learning
- ⭐ A short mathematical blog on Softmax: an activation function that turns an array of values into a probability mass function where the weight of the maximum value is exaggerated https://x.com/goyal__pramod/status/1948284723060527466
- ⭐ A step-by-step guide to diffusion models https://x.com/MIT_CSAIL/status/1946238813208007164
- ⭐ I was bored, so I created a self-learning Flappy Bird from scratch using a neural network and a genetic algorithm https://x.com/TanayVasishtha/status/1957363146240545216
- If you’re building AI systems in 2025, there are only two tools worth learning: LangGraph and n8n https://x.com/connordavis_ai/status/1959212761558466829
- MIT's Advanced Data Structures by Prof. Erik Demaine https://x.com/Riazi_Cafe_en/status/1948462940723548177
- 1h 17min walkthrough on continuous thought machines by Sakana AI https://x.com/yacinelearning/status/1948373431692435896, video
- 6 must-read books about AI and Machine Learning https://x.com/TheTuringPost/status/1954509293224656994
- DeepMind's essential GPU guide for AI engineers https://x.com/rohanpaul_ai/status/1959899202181566577
- What is a transformer: beautiful interactive visual blog https://x.com/goyal__pramod/status/1946780873271173540
- Anthropic prompting 101 video https://www.youtube.com/watch?v=ysPbXH0LpIE
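The softmax item above can be sketched numerically; a minimal, illustrative implementation:

```python
import math

def softmax(values):
    # Subtract the max for numerical stability, then exponentiate and
    # normalize so the outputs form a probability mass function. The
    # exponential is what exaggerates the weight of the maximum value.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)  # sums to 1; the largest input gets the largest share
```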
LLMs
- The new Qwen3 update takes back the benchmark crown from Kimi 2 https://x.com/rasbt/status/1947393814496190712
- Grok 2.5 open sourced https://x.com/rasbt/status/1959643038268920231
- OpenAI releases GPT-OSS-120B https://x.com/OpenAI/status/1952776916517404876
- xAI went from 0 to SOTA in 2 years https://x.com/amXFreeze/status/1959158372231487627
- Compare architecture design of the main LLMs of 2025, from GPT to MoE https://x.com/rasbt/status/1946549778319339931
- China LLM in July: GLM-4.5, Wan-2.2, Qwen3, Kimi 2 https://x.com/Yuchenj_UW/status/1950034092457939072
- NVIDIA research makes LLMs 53x faster without retraining https://x.com/JacksonAtkinsX/status/1960090774122483783
- ⭐ today, I pruned 87.24% of Qwen 30B for a sentiment classification task while keeping 100% of its accuracy https://x.com/MaximeRivest/status/1957273076514578781
- We are super excited to release OpenCUA, the first from 0 to 1 computer-use agent foundation model https://x.com/xywang626/status/1956400403911962757
- ⭐ Kimi K2 tech report just dropped: MuonClip optimizer, stable + token-efficient pretraining at trillion parameters; 20K+ tools; RL reward that adapt; Ultra-sparse 1T MoE https://x.com/Kimi_Moonshot/status/1947520758760313170, https://x.com/Zai_org/status/1954750596634054965
- Avengers-Pro routing framework outperforms GPT-5-medium by 7% https://x.com/omarsar0/status/1958897458408563069
- The growth of LLM context length with time: gpt3.5 4K → gemini 1M https://x.com/_avichawla/status/1959141055301132516
- opinion: watching the timeline flip on GPT5 sentiment from negative to positive is pretty funny https://x.com/swyx/status/1957164146447008231
- OpenAI GPT-OSS-120B is live on Cerebras at 3K tokens/s https://x.com/CerebrasSystems/status/1952785033024160096
- LLMs can now self-optimize, GEPA method evolves prompts via genetic + pareto optimization, boosting reasoning efficiency up to 35x https://x.com/JacksonAtkinsX/status/1949680438965907724
- Google researchers show that in-context learning (ICL) works like implicit fine-tuning. Prompts act as rank-1 weight updates to MLP layers, mimicking gradient descent as tokens are processed. Their experiments confirm ICL dynamically adjusts weights much like real training https://x.com/omarsar0/status/1948384435654779105
- Sama just shook Washington with a bold prediction, entire job categories, like customer support, will vanish with AI https://x.com/WesRothMoney/status/1948058025044152600
- Gemini 2.5 Flash-Lite is now generally available (GA): 400 tokens/s at $0.10 in and $0.40 out https://x.com/sundarpichai/status/1947693605247660084
- Mistral started it, DeepSeek scaled it, Kimi K2 confirmed it: it’s always more convenient to train an MoE https://x.com/hkproj/status/1947571673021993152
- kimi k2 on groq is so ridiculously good https://x.com/cheatyyyy/status/1945113922828226850
- Sonnet 3.5's first mission https://x.com/liminal_bardo/status/1945080817857749376
- Yann LeCun, to reach AGI we need to move away from LLMs, focus on joint embeddings, energy-based models, regularization, and model-predictive control. Use RL only as a corrective mechanism not as a core https://x.com/rohanpaul_ai/status/1945167783865532641
Lol, Memes

- Viva la Vida https://x.com/elonmusk/status/1946321191834091808
- Chatgpt agrees with you, your therapist agrees with you no one is left to tell you the truth besides toddler and drunk people https://x.com/aubreystrobel/status/1947399986607489259
- so much good material in ai infra rumors https://x.com/untitled01ipynb/status/1957450096548708612
- AI policy https://x.com/mtlushan/status/1945304709012533429
- “I believe robots will change the world” vs “I make robots” https://x.com/_Stocko_/status/1946765474634596720
- Shrimp symphony https://x.com/voooooogel/status/1947529211150999750
- recruiters asking for "+5 years of experience" for roles that are 12 months old. https://x.com/Hesamation/status/1948277239038050544
- Jensen Huang "I've created more billionaires on my management team than any CEO in the world” https://x.com/ns123abc/status/1948247313907904907
- lol i still can't believe this is your virtual girlfriend https://x.com/forloopcodes/status/1948756763777392786
- lmaooooooo my son drew this clanker today https://x.com/EricMTerrill1/status/1956765838268133713
- gpt5 → my white collar job https://x.com/TechMemeKing/status/1953280909010190669
- why are they like this https://x.com/mayfer/status/1949946961781571893
MCPs
- Rube universal MCP server connects AI agents to apps https://x.com/omarsar0/status/1960084088133398718
- today we're releasing a new 0.5B SLM for detecting problems with tool usage in agents https://x.com/freddie_v4/status/1947692034665644136
Metrics & Benchs
- Has AI progress slowed down? I’ll write some personal takes and predictions in this thread https://x.com/nikolaj2030/status/1954248757513720297
- I'm noticing that due to a lot of benchmarkmaxxing on long horizon tasks, LLMs are beck https://x.com/karpathy/status/1954224651443544436
- ambient agents are going to completely dominate the rest of 2025: 1 https://x.com/swyx/status/1948871669646590165
- Thrilled to introduce “Deep Research with Test-Time Diffusion”, a new deep research agent design https://x.com/chl260/status/1947918532110647570
- Today we’re releasing our first public preview of ARC-AGI-3: the first three games https://x.com/mikeknoop/status/1946264912118108540
- GROK 4 ranks #1 in FutureX live benchmark https://x.com/WesRothMoney/status/1946185690586194087
- Solo founders + agents are here https://x.com/benln/status/1945519729637994950
Opinions
- Dia and Comet are just chromium with a sidebar https://x.com/thekitze/status/1945488889965084679
- OpenAI time is running out https://x.com/boneGPT/status/1952766437212279084
- musk, Intelligence still scales logarithmically with compute https://x.com/elonmusk/status/1956591128423153796
- I've basically stopped using Opus entirely and I now have several Codex tabs with GPT-5-high https://x.com/VictorTaelin/status/1958543021324029980
- codex CLI is better than Claude code for refactoring code https://x.com/frantzfries/status/1959700004781847017
- crazy how we don’t see how the GPU shortage is affecting the AI industry: delayed launches, shorter context windows, and more: https://x.com/petergostev/status/1947062824381133024
Philosophy and AGI
- Evolutionary Biologist David Krakauer: "AI is an amazing technology.. but it’s fake intelligence” https://x.com/slow_developer/status/1957468783041556921
- ⭐ AI introduces new labor form with different physical basis https://x.com/EMostaque/status/1959966453706211630
- MLST: How To Build Conscious Machines https://x.com/MLStreetTalk/status/1960082219130671387
- The $100 trillion question: what happens when AI replaces every Job? https://x.com/rohanpaul_ai/status/1959730526555984104
- Lex: Imagine if every pattern shaped by nature, like a protein’s fold or cosmic phenomena, is inherently learnable by AI https://x.com/GoogleDeepMind/status/1948098855053979930
- Chronic inflammation speeds up every hallmark of aging https://x.com/davidasinclair/status/1947992604353700041
- be Sam Altman > asked what keeps him up at night about AI > 3 scary categories > 1 https://x.com/casper_hansen_/status/1947743914414182528
- 🚀Introducing Hierarchical Reasoning Model🧠🤖 Inspired by brain's hierarchical processing, HRM deli... https://x.com/makingAGI/status/1947286324735856747
- I've been screaming about this for years: what many powerful people in Silicon Valley want is to ... https://x.com/xriskology/status/1944546039408828492
- Reiterating my prediction that we will reach AGI this year-with an increased probability of 99% b... https://x.com/DeryaTR_/status/1947297024669335782
- Eric Schmidt believes we are entering a new epoch, comparable to the Enlightenment https://x.com/WesRothMoney/status/1947310609030152576
- Watching the model solve these IMO problems and achieve gold-level performance was magical https://x.com/SherylHsu02/status/1946478334013321231
- we achieved gold medal level performance on the 2025 IMO competition with a general-purpose reaso... https://x.com/sama/status/1946569252296929727
- I recently gave a public talk called “Can goodness compete?”, on long-term equilibria post-AGI https://x.com/jkcarlsmith/status/1945916364100981043
RAG
- This work uses a multi-agent framework to generate high-quality and private synthetic datasets for RAG evaluation https://x.com/omarsar0/status/1960703354671391197
- contextualized chunking, new mongodb embedding model, cuts vector DB costs by ~200x https://x.com/_avichawla/status/1955880423302865341
Random
- Pi0 uses Flow Matching action head instead of diffusion heads for VLAs https://x.com/KyleVedder/status/1960218909858210264
- AI at home will continue to mimic the trajectory that 3d printing at home https://x.com/ssslomp/status/1953945780731191512
- being a daily customer of @Waymo for almost two years is a super power: trusting AI is a super power https://x.com/Scobleizer/status/1946728455829393580
- probably nothing: all complex brains in nature have hemispheres https://x.com/mayfer/status/1946313951069040931
- Fed Three.js code from my quoted post to xAI's Grok 4 https://x.com/techartist_/status/1946975409658274130
- We’re bringing Gemini 2.5 Pro to AI Mode https://x.com/GoogleDeepMind/status/1945515683451736246
- NVIDIA releases open dataset with 1m hours of multi-languages conversations, https://blogs.nvidia.com/blog/speech-ai-dataset-models
- Midjourney Upgrades: Standard Users Can Now Generate HD Videos link
- transcribe 1 week of audio in 1 min and $1, 100x faster than average https://x.com/bernhardsson/status/1948164977715486957
- Google AI founder advises against law/medical degrees due to AI disruption https://x.com/kimmonismus/status/1959021900799385857
- So far no drones can do this https://x.com/Indian_Bronson/status/1959643290627895440
- Musk standing under SpaceX Starship https://x.com/elonmusk/status/1960039238302626140
- NEO, first ML engineer agent https://x.com/withneo/status/1958922360494727638
- OpenAI + Retro Biosciences used AI to design improved Yamanaka proteins and published the results. https://x.com/BorisMPower/status/1958915868693602475
- François Chollet: frontier proprietary models are costly “sandcastles” soon erased by open-source replication and later by new algorithms https://x.com/hardmaru/status/1959557241994191303
- Both Meta and X are betting on AI girlfriends https://x.com/bindureddy/status/1959795189276643787
- Deleting code is what separates us from the LLMs https://x.com/___Harald___/status/1959401396895367481
- Introducing The Synthetic Data Vault (SDV) a tool to create synthetic data https://x.com/mdancho84/status/1957464597067174008
- ⭐ if you want to know where OAI is going, take a look at their hiring page: they’re searching for multiple RL engineers https://x.com/apples_jimmy/status/1957243123068969402
- Drop out of MIT before graduating before AGI https://x.com/pmarca/status/1956488823241924618
- Are the sigmoids in the room with us right now? https://x.com/TolgaBilge_/status/1954736355956629915
- welch labs is too good for us https://x.com/himanshustwts/status/1954510253128581568
- Neural similarity predicts whether strangers become friends https://x.com/NTFabiano/status/1954517309256593855
- Introducing AlphaEarth Foundations, an AI model that integrates petabytes of satellite data https://x.com/demishassabis/status/1950667643771326784
- Do you call Yoshua Bengio, Geoffrey Hinton, and Yann LeCun researchers or engineers? https://x.com/ns123abc/status/1950391681792049372
- ancient wisdom https://x.com/cryptunez/status/1948816370503709045
- ⭐ pretraining is shrinking while RL compute is increasing https://x.com/WesRothMoney/status/1947024550417547746
- To all the people saying OpenAI's math proofs are in a "weird terse language": you might remember this from Karpathy: “you can tell the RL is done properly when the models cease to speak English in their chain of thought” https://x.com/thomasahle/status/1946897875189019088
- 16 ways I actively use to optimize model training https://x.com/_avichawla/status/1946820714423828701
- First-born children have higher IQs than their siblings https://x.com/NTFabiano/status/1946543229001703921
- I'm starting to think that coding with LLMs is a bit like riding an electric bike: you don’t get to the destination faster, it just makes it easier to go uphill https://x.com/pfau/status/1946368562710425924
- ⭐ please i beg you, dont make an automation agency, make a productized service that has defined recurring deliverables that uses automations to do 90% of the work and can run at a 80% margin https://x.com/codyschneiderxx/status/1945166479667691616
- Kimi K2 + groq is now passing 200 tokens/s on OpenRouter https://x.com/OpenRouterAI/status/1945198256654319625
- Running Kimi K2 base today on 16 H200s across 2 nodes; each node has ~1.13TB of VRAM https://x.com/TheAhmadOsman/status/1945061907821388201
- LLM behind DNS, useful on a plane with paywalled Wi-Fi https://x.com/noahgsolomon/status/1954035351510716670
- Manus can now create spreadsheets https://x.com/ManusAI_HQ/status/1953115359609012640
- Robot dancing https://x.com/justinboldaji/status/1948531064324161692
- i vibe coded my vr tekken in 5 hours https://x.com/_renhau/status/1946993311031710205
- Coffee changes connectivity in the brain https://x.com/NTFabiano/status/1946904021408616614
- HuggingFace: we've just released 100+ intermediate checkpoints and our training logs from SmolLM3-3B training https://x.com/eliebakouch/status/1947314193536823621
- SF budget and hierarchy https://x.com/m_atoms/status/1960849850150383805
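The Kimi K2 deployment numbers above (16 H200s across 2 nodes, ~1.13TB of VRAM per node) check out arithmetically: an H200 carries 141 GB of HBM3e, so 8 GPUs per node gives ~1.13 TB. A minimal sanity-check sketch (the 8-GPUs-per-node split is an assumption inferred from the 16-GPU / 2-node figures):

```python
# Sanity-check the Kimi K2 deployment arithmetic from the tweet above.
H200_VRAM_GB = 141            # HBM3e capacity of a single H200 GPU
total_gpus = 16
nodes = 2

gpus_per_node = total_gpus // nodes                      # assumed even split: 8 per node
vram_per_node_tb = gpus_per_node * H200_VRAM_GB / 1000   # 8 * 141 GB = 1.128 TB

print(f"{gpus_per_node} GPUs/node -> {vram_per_node_tb:.2f} TB VRAM/node")
# -> 8 GPUs/node -> 1.13 TB VRAM/node
```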
Research
- ⭐ A tiny brain-inspired 27M param model trained on 1000 samples outperforms o3-mini-high on reasoning tasks and obtains 40% on ARC-AGI (not verified) https://x.com/deedydas/status/1951677875004100814
- ⭐ AlphaGo moment: Chinese researchers fed all LLM research into a model and it discovered 106 novel AI model architectures that converge to lower loss with better benchmarks https://x.com/daniel_mac8/status/1949165983127466036, https://x.com/deedydas/status/1949316395130569012
- ⭐ Learning without training: this paper shows that LLMs can learn in context without training by implicitly updating MLP weights through stacked self-attention and transformer blocks https://x.com/himanshustwts/status/1948429985502437879, https://x.com/rohanpaul_ai/status/1948572304809611701
- ⭐️ Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data. This paper, co-authored with Owain Evans as part of the Anthropic Fellows Program, shows that LLMs can pick up behavioral traits subliminally from teacher models, even through seemingly unrelated data, highlighting a hidden risk in distillation and AI development https://x.com/AnthropicAI/status/1947696314206064819. It could be used as a soft power to inject values into open source LLMs https://x.com/swyx/status/1947875989666832576.
- ⭐ How Many Instructions Can LLMs Follow at Once? LLMs can follow long instruction lists, but performance decays sharply after ~150 rules, with even top models failing a third of the time at 500. For prompt design: front-load critical rules, chunk large lists, and balance recall against latency. https://x.com/rohanpaul_ai/status/1945790079290798453
- ⭐ Most tweets are positive, but angry posts spread more widely https://x.com/NTFabiano/status/1946973996349636609
- “Move 37” moment: GPT5 helped explain unpublished T-cell experiments, identifying findings, suggesting the key follow-up test, and proposing a mechanism researchers had missed https://x.com/DeryaTR_/status/1954354352648225235
- GDB: testing character-level tokenization for biological language models, treating DNA as a “language.” https://x.com/swyx/status/1956439984854167727
- Is Chain-of-Thought Reasoning of LLMs a Mirage? “Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions…” https://x.com/hsu_steve/status/1953599495360389166
- Anthropic “Inverse Scaling in Test-Time Compute” https://x.com/aryopg/status/1947591901886222570
- LLMs mix cultures internally and favor Western ones, and a new probe reveals why https://x.com/rohanpaul_ai/status/1957329112714146013
- Apple research just revealed a way to make LLMs 5.35x faster https://x.com/JacksonAtkinsX/status/1947408593638002693
- A Survey of Context Engineering 160+ pages covering the most important research around context eng for LLMs https://x.com/omarsar0/status/1946241565728600503
- Weekly top AI papers roundup https://x.com/dair_ai/status/1954556514141634683
- Weekly top AI papers roundup https://x.com/dair_ai/status/1949481332754534898
- Retrieval-augmented Graph Agentic Network (ReaGAN) Another clever way to combine agentic capabilities https://x.com/omarsar0/status/1952404434283327975
- Fine-tuning agents without gradient updates using memory https://x.com/omarsar0/status/1960047046444085363
- 116-page study shows automated science is practical and low-cost https://x.com/rohanpaul_ai/status/1958846108022764013
- Chain-of-Agents Interesting idea to train a single model with the capabilities of a multi-agent s... https://x.com/omarsar0/status/1958186531161853995
- Researchers Made a Social Media Platform Where Every User Was AI https://x.com/rohanpaul_ai/status/1957289697912184952
- Weekly top AI papers roundup https://x.com/dair_ai/status/1957100408096108814
- M3-Agent: A Multimodal Agent with Long-Term Memory Impressive application of multimodal agents https://x.com/omarsar0/status/1956773240623235076
- Weekly top AI papers roundup https://x.com/dair_ai/status/1959644102057734375
- Grok 4 ranks #1 on the FutureX live benchmark for real-world future predictions https://x.com/amXFreeze/status/1958550795403907441
- New paper shows that LLMs can be used as a new approach to trust without cryptography https://x.com/MLStreetTalk/status/1958846333365649790
- Has GPT-5 Achieved Spatial Intelligence? GPT-5 sets SoTA but not human‑level spatial intelligence https://x.com/omarsar0/status/1957885032716177415
- Retrieval-Augmented Reasoning with Lean Language Models, shows how to fuse RAG and reasoning into a single small-footprint language model https://x.com/omarsar0/status/1957532968135905297
- This paper shows that hallucination benchmarks like ROUGE often miss hallucinations https://x.com/omarsar0/status/1955647039733481841
- Score-ceiling benchmarks make AI progress look logarithmic, but open-ended ones reveal steeper gains https://x.com/aidan_mclau/status/1954618891314893018
- Apple released “Embedding Atlas”, an open-source, lightning-fast visualization tool for embeddings that feels like a Tableau for LLM datasets https://x.com/NirantK/status/1954081728525742365
- LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra: this paper shows how LLM agents can simulate an economy where a planner learns near-optimal tax policies by observing worker behavior, boosting welfare and offering a sandbox for testing fiscal rules. https://x.com/rohanpaul_ai/status/1948663512571715793
- Inverse Scaling in Test-Time Compute (Anthropic). This study shows that longer reasoning in Large Reasoning Models (LRMs) can hurt performance—revealing a surprising inverse scaling between reasoning length and accuracy. https://x.com/jiqizhixin/status/1947524856058417265
- Voxtral (MistralAI): how to train a model that actually understands both audio and text https://x.com/sophiamyang/status/1947031166415962569
- DeepMind has the best research on using AI to solve hard math: AlphaEvolve, AlphaProof, AlphaGeometry, FunSearch, AlphaDev, AlphaTensor, AlphaCode https://x.com/deedydas/status/1946987560875766212
- Pango dataset on HuggingFace! This presents the first large-scale dataset of real users performing authentic work tasks in business productivity software https://x.com/trypango/status/1945566680786374690
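The instruction-capacity finding above (performance decays sharply past ~150 rules) suggests a simple mitigation: front-load critical rules and chunk large lists across multiple calls. A minimal sketch of that chunking strategy (the 150-rule budget and the rule format are illustrative assumptions, not from the paper):

```python
# Split a long instruction list into prompt-sized chunks, repeating the
# critical rules at the head of each chunk (front-loading).
def chunk_rules(critical, optional, budget=150):
    """Yield rule lists that each stay within `budget` rules,
    with the critical rules repeated at the front of every chunk."""
    if len(critical) >= budget:
        raise ValueError("critical rules alone exceed the budget")
    room = budget - len(critical)           # slots left for optional rules
    for i in range(0, len(optional), room):
        yield critical + optional[i:i + room]

critical = [f"CRITICAL rule {i}" for i in range(10)]
optional = [f"rule {i}" for i in range(400)]

chunks = list(chunk_rules(critical, optional))
print(len(chunks), [len(c) for c in chunks])
# 400 optional rules at 140 per chunk -> 3 chunks of sizes 150, 150, 130
```

Each chunk stays under the empirical ~150-rule cliff while every call still sees the critical rules first; the trade-off is extra calls (latency) per full pass over the rule list.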
Videos
- 3Blue1Brown x Welch Labs is the crossover we didn’t know we needed https://x.com/novasarc01/status/1948784701180674232
- Demis Hassabis discusses Genie 3, Deep Think, and the IMO gold medal with Logan Kilpatrick. Agent-based legacy (AlphaGo/Zero) meets new "thinking models" aimed at planning, reasoning, and AGI. Jagged intelligence: strong in some areas, weak in others. Traditional benchmarks are saturated, so DeepMind is launching the Kaggle Game Arena. https://www.youtube.com/watch?v=njDochQ2zHs
- MLST: Pushing compute to the limit of physics, with Guillaume Verdon. Fun MLST episode with Beff Jezos on e/acc and thermodynamic computers; mostly an intellectual word soup, but nonetheless fun to listen to https://open.spotify.com/episode/50Qm2NoerRIrV0jt4tYODE
Visuals
- no end no beginning no middle https://x.com/wilplatypus/status/1946971613997441137
Sign up to receive the mailing list! | Read all the past AI Socratic reports

