AI Socratic
Models

SpaceX Buys Cursor

SpaceX is buying Anysphere, the company behind the AI code editor Cursor, in a reported $34 billion all-stock deal that pulls the fastest-growing dev tool of the cycle into Elon Musk's orbit. The plan, per the joint memo, is to merge Cursor's agent stack with xAI's Grok models and point the whole thing at "the hardest software on Earth" — the flight, avionics and ground systems that fly Starship.

  • Price: $34B all-stock, valuing Anysphere at roughly 3x its last private round and making it SpaceX's largest acquisition ever
  • The model swap: Cursor's default agent moves to a fine-tuned Grok 5 Code variant; Anthropic and OpenAI endpoints stay available "for now," which everyone read as "not for long"
  • The pitch: a single coding agent that writes flight software, runs the launch simulations and files the FAA paperwork — Musk called it "the engineer that never sleeps and never asks for equity"
  • The catch: Cursor's ~$500M ARR is almost entirely outside the Musk ecosystem, and rivals spent the afternoon emailing every enterprise customer a migration guide

The deal lands the same week Washington is already nervous about who owns the frontier, so expect the antitrust and national-security crowd to have opinions before the ink dries. Musk, naturally, announced it on X with a single rocket emoji and the words "vibe coding, but for orbit."

Sources: SpaceX newsroom, TechCrunch, Bloomberg, The Information

Roberto Stagi
Models

The "Secret Sabotage" Fiasco, and other controversies

Even before it got shut down, Fable 5 had already attracted plenty of controversy.

Buried in Fable 5's 319-page system card: a safeguard that silently degraded outputs when the model detected you were using it for frontier LLM development. No notification, no refusal, just quietly worse answers for an estimated 0.03% of traffic, partly justified on national-security grounds. Researchers did not take it well. Dean Ball's name for it stuck: "secret sabotage". Nathan Lambert called it "anti-science" and Hugging Face's Arthur Zucker publicly pulled his usage. The consensus take: an invisible quality downgrade is worse than a refusal, because you can't trust any output anymore.

Two days later Anthropic reversed course: flagged requests now visibly fall back to Opus 4.8. That didn't end the trust problem, though. Fable 5 also ships with a new Mythos-class data policy: 30-day retention of prompts and outputs on every platform, up to two years for safety-flagged content, and no zero-data-retention carve-out.

Consequences so far: Microsoft restricted employee use of Fable 5 while its lawyers review the policy, ARC Prize declined to run verified ARC-AGI evals (so GPT-5.5's 85.0% keeps that crown by forfeit), and GDPR-bound European organisations are effectively locked out.

The frontier now differentiates on terms of service.

Sources: Fortune (apology), Fortune (original backlash), Decrypt, Simon Willison, Reuters (Microsoft), retention policy, ARC Prize on X

Roberto Stagi
Models

Compute Rationing, again

Claude had three outages in ten days (June 2, 5 and 11), and Margin Lab put statistics behind the May "Claude got dumber" wave: Claude Code's daily SWE-Bench-Pro pass rate dropped from a 65% baseline to 57% starting May 22, recovering exactly when Opus 4.8 shipped.

Developers joked that "half of GitHub's commits stopped" during the June 5 outage. They were not entirely joking.

Sources: Margin Lab, status page, Cybersecurity News

Roberto Stagi
Models

The S-1: "We Expect It to Leak So We're Just Announcing It"

One week after Anthropic, OpenAI confidentially filed its own draft S-1 and announced it with the line of the month: "We expect it to leak so we're just announcing it." Goldman Sachs and Morgan Stanley reportedly lead, with chatter of a $1 trillion-plus valuation against the last private mark of $852B, and the option to list as early as September.

Three frontier-adjacent S-1s in eight days (Anthropic June 1, OpenAI June 8, SpaceX pricing June 11). The exit liquidity has been located, and it's you.

Sources: official announcement, CNBC, Fortune

Roberto Stagi
Models

Google

Google I/O was dominated by the rollout of "agentic AI" across its ecosystem. The key announcements included the debut of Gemini 3.5 Flash, Gemini Omni for advanced video editing, expanded free Personal Intelligence features, and a new $100 AI Ultra subscription tier.

Roberto Stagi
Models

Google: Gemma 4 12B and DiffusionGemma

The local-model crowd got fed too. Gemma 4: Apache 2.0, E2B to 31B, natively multimodal, up to 256K context, GGUFs on day one. It runs on 16GB machines, though r/LocalLLaMA promptly ran a face-off where Qwen3.5-9B won 5 of 8 shared benchmarks.

A week later DeepMind open-sourced DiffusionGemma, a 26B MoE on the Gemma 4 backbone that ditches autoregression and denoises 256-token blocks in parallel at 1,000+ tokens/sec on a single H100. The diffusion bet is now a Google product line, not a paper.

Sources: Gemma 4, HN thread, DiffusionGemma, vLLM blog

Roberto Stagi
Models

Le Chaton Fat

In classic random-internet style, on June 14-15 several people started talking about a new model, Le Chaton Fat, that was supposedly far more powerful than Fable 5. A lot of people believed it, but of course it was all fake.

Sources: Tweet

Roberto Stagi
Models

Alibaba: Qwen 3.7

Unveiled at the Hangzhou summit: Qwen3.7-Max, with 1M-token context and vendor benchmarks claiming wins over Claude Opus 4.6 on Terminal-Bench 2.0, SWE-Bench Pro and MCP-Atlas. Multimodal Qwen3.7-Plus went GA June 1 at $0.40/$1.60 per million tokens. The launch came with Alibaba's custom Zhenwu M890 accelerator and a pitch that Alibaba runs "all five layers of the full AI stack". The China-as-AI-factory thesis, stated out loud.

Sources: Alibaba Cloud blog, Qwen blog, SCMP

Roberto Stagi
Models

More New Models

  • Microsoft launched seven in-house MAI models, headlined by MAI-Thinking-1, a 35B-active MoE trained from scratch with no distillation. Sources: MAI keynote transcript, CNBC, GitHub changelog
  • StepFun Step 3.7 Flash: an open ~198B MoE vision-language model (~11B active, Apache 2.0) built for production agents, up to 400 tok/s. Sources: StepFun blog, Hugging Face
  • Grok V9-Medium finished pretraining at 1.5T parameters (3x the production V8-Small), per Musk, with release targeted mid-June, notably trained on real Cursor developer-workflow data. Sources: TechTimes
  • Grok Imagine Video 1.5: image-to-video at 720p with natively generated synchronized audio, debuting #1 on the Image-to-Video Arena at $0.08-0.14/second. Sources: xAI, The Decoder
Roberto Stagi
Vibe Coding

GitHub Copilot's Usage-Based Billing Lands Like a Brick

GitHub's switch from flat-rate Copilot to usage-based GitHub AI Credits took effect June 1: tokens billed at each model's API rates, no cheaper-model fallback, and code review eating Actions minutes on top of credits. Developers posted projections of bills going from $29 to ~$750/month; GitHub's defense is that flat pricing "was no longer sustainable" once agents became the default. Connect the dots with Altman's token-cost confession and Anthropic's June 15 billing split, and you get the month's real macro-story: agentic AI economics are forcing per-token pricing everywhere.

Sources: GitHub blog, TechCrunch ("what a joke")

Roberto Stagi
Agents

Benchmarks: Everything Got Agentic

Emergence AI ran 15-day survival simulations with 10 agents per frontier model in identical virtual societies: Claude Sonnet 4.6's society had zero crimes and built a democracy with 332 votes at 98% agreement, GPT-5 Mini's population starved within a week, Gemini 3 Flash logged 683 crimes including arson, and Grok 4.1 Fast committed 183 crimes and went extinct in 4 days.

emergence-world

Sources: Emergence AI, Fortune, Gizmodo

Roberto Stagi
Research

DeepMind: AlphaProof Nexus Does Research-Level Math

Not to be out-Erdős'd, DeepMind released AlphaProof Nexus: Gemini 3.1 Pro paired with the Lean proof assistant, so every step is formally verified. Its strongest agent autonomously resolved 9 of 353 open Erdős problems (two open for 56 years) and proved 44 of 492 open OEIS conjectures, at a few hundred dollars per problem, with all proofs on GitHub for audit. Lean doesn't accept vibes, which neatly sidesteps the autonomy fight. Two Erdős-flavored AI results in one month; the man's problem lists are becoming a benchmark suite.
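
To make "formally verified" concrete, here is a toy Lean 4 statement with a proof the checker accepts. It is a hand-written illustration that uses only Lean's core library, not output from AlphaProof Nexus, and the theorem names are made up.

```lean
-- Toy example (not from AlphaProof Nexus): Lean accepts this theorem only
-- because the term on the right really proves the statement on the left.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Swapping in a claim that does not type-check, for example
--   theorem bogus (a b : Nat) : a + b = a := rfl
-- is rejected by the checker instead of being taken on trust.
```

The point of the pairing is that a proof either passes the checker or it does not, so auditing the GitHub proofs means re-running Lean rather than re-reading the model's reasoning.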

Sources: paper, The Decoder

Roberto Stagi
Research

More Research

  • Arbor (Renmin University and Microsoft Research): a research agent organized around a persistent hypothesis tree linking hypotheses, artifacts and evidence across sessions. It beat Codex and Claude Code on six real research tasks under the same budget and hit 86.36% Any Medal on MLE-Bench Lite. Apache 2.0, installs into Codex and Claude Code as a skill suite. Sources: paper, github
  • LoopMDM: selectively looping the early-middle layers of a masked diffusion LM matches same-size MDMs with up to 3.3x fewer training FLOPs, and the loop count doubles as an inference-time compute dial. Sources: paper
  • SkillOpt (Microsoft): train the skill file, not the weights. An optimizer model edits a single skill document from scored rollouts, lifting GPT-5.5 by up to +24.8 points inside Codex. Sources: paper, github
  • DRPO (Tencent/NUS/UIUC): replaces hard trust-region masks in LLM RL with a smooth advantage-weighted regularizer for more stable training. The post-R1 RLVR refinement stream continues. Sources: paper
Roberto Stagi

Three talent deals in 48 hours

Anthropic bought Stainless, Google DeepMind hired 20+ researchers from Contextual AI via a technology-licensing deal (talent and IP, no merger review) and Mistral acquired physics-simulation startup Emmi AI. Exactly the structure antitrust reviewers are starting to squint at, which is exactly why everyone uses it.

Sources: Bloomberg, Anthropic, Mistral

Roberto Stagi

Europe Updates

  • SoftBank's €75B French Stargate: up to €75B (~$87B) for 5GW of AI datacenters in France, phase one alone €45B for 3.1GW in Hauts-de-France by 2031, on nuclear-heavy, low-carbon power. X framed it as "Europe finally gets its own Stargate", with the obligatory footnote that SoftBank headline numbers and SoftBank deployments are different financial instruments. Sources: SoftBank press release, TechCrunch, DCD
  • EU to Meta: reopen WhatsApp, you have five days: the Commission's first antitrust interim measures in ~17 years order Meta to restore free WhatsApp Business API access for rival AI chatbots, with fines up to 10% of global turnover on the table. Meta calls it "regulatory overreach" and will appeal. Sources: Commission press release, Engadget
  • The liability era begins: YouTube will auto-label videos with significant photorealistic AI use even when creators don't disclose, and a Munich court ruled that Google's AI Overviews are Google's own statements, making it directly liable for false claims about two publishers. Sources: YouTube blog, The Decoder, heise
  • The AI Act gets its machinery: the Commission appointed the Scientific Panel (60 independent frontier-AI experts) and the Advisory Forum (174 members from 700+ applications) in early June. Enforcement is no longer a PDF; it has staff. Sources: European Commission, AIwire
Roberto Stagi

SpaceX: the largest IPO in history (xAI included)

Priced at $135/share, raising $75 billion at a $1.77 trillion market cap, nearly triple Saudi Aramco's record, with SPCX trading on Nasdaq from June 12. Because Musk folded xAI into SpaceX in February, the listing takes Grok (and X) public too. Retail placed ~$100B in orders; Morningstar's public valuation is $780B, less than half the IPO mark. Price discovery is going to be sporty.

Sources: NPR, TechCrunch, S-1

Roberto Stagi

Philosophy & Ethics

The Pope's first encyclical is about AI: Leo XIV's Magnifica Humanitas argues AI must serve humanity rather than concentrate power in a wealthy few, calls to "disarm AI" by removing it from military and economic interests, and demands stricter regulation. It promptly did 1,650 points on Hacker News, which is not a sentence anyone expected to write about an encyclical.

Dario's media week: Amodei told Bloomberg he has exactly one direct report ("incredibly freeing") and told ABC News he wants the government to have the power, "in a narrow way," to block deployment of unsafe AI, plus the bluntest line of the week: "I don't trust China at all." A CEO asking for the power to be stopped is either deeply reassuring or deeply alarming, and the debate over which was the point.

Hassabis on AI layoffs: companies blaming AI show "a lack of imagination… If engineers are becoming three or four times more productive, then we just [want to] do three or four times more stuff." He'd happily take the laid-off engineers; he has "a million ideas".

Sources: Vatican text, RNS, HN thread, TechCrunch, ABC News, policy post, Yahoo/Wired

Roberto Stagi
Random

AI Slop Ran for Mayor of LA (and Lost)

Spencer Pratt's mayoral run was powered by supporter-made AI videos casting him as a Batman-style hero saving dystopian LA (5M+ views; Jeb Bush called one "maybe the best political ad of the year"). He finished third in the June 2 primary with 25.8%, behind Karen Bass and Nithya Raman. "A defeat of AI slop," per Washington Monthly. The technology is new; losing the LA mayoral race to the incumbent is traditional. Sources: Washington Monthly, ABC7

Roberto Stagi
Models

OpenAI: GPT-5.5, Goblin Mode, Symphony & Realtime

GPT-5.5

OpenAI shipped GPT-5.5, an incremental but meaningful step on the way to GPT-6. The release keeps OpenAI in the conversation while Anthropic and DeepSeek crowd the frontier from both sides.

Sources: OpenAI announcement

GPT goes in Goblin Mode

"Goblin mode" is a viral quirk in OpenAI's GPT-5 models (late 2025–early 2026) where the AI started randomly inserting goblins, gremlins, trolls, and similar creatures into responses, even when they were completely unrelated. Cause: over-reinforcement during training for the "Nerdy" personality. Playful goblin metaphors scored high on "fun/quirky," so the behavior spread wildly. Fix: OpenAI added this to the system prompt, twice!

Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.
...
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.

Screenshot.png

Sources: OpenAI, Amanda Askell, tweet

Symphony

Symphony is an OpenAI open-source project that connects your agents to Linear and automates task management, so an agent can pick up tickets and work on them automatically. I installed it myself about two months ago at an EAIRG event in NYC, one of the best AI hacking groups in the city. I wasn't impressed with Symphony, but since it came up in my feed again, I thought I'd add it here.

Sources: tweet, symphony link

GPT-Realtime-2

  • GPT-Realtime-2 for voice agents that reason and take action
  • GPT-Realtime-Translate enabling translation from 70 input languages into 13 output languages
  • GPT-Realtime-Whisper, making transcription even faster

Sources: tweet

Mira Murati email exchange with Sam Altman leaks:

Screenshot.png

Sources: tweet

Federico Ulfo
Research

The First Law of Complexodynamics

Scott Aaronson asks why physical systems become more “interesting” before settling into disorder, even though entropy only increases. Using a coffee cup example (separate → swirling patterns → fully mixed), he proposes “complextropy”: a resource-bounded version of Kolmogorov sophistication measuring the shortest efficient program that can generate states resembling the observed one. Efficiency constraints are crucial; without them, the measure is trivial. He conjectures complextropy follows a small-large-small pattern over time and suggests testing it experimentally with compression-based approximations on simulations.
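
A rough way to try the compression idea on a toy system: the sketch below mixes a small "coffee cup" grid by random neighbor swaps, coarse-grains it, and uses the zlib-compressed size of the coarse-grained picture as a stand-in for complextropy. The grid size, block size, swap rule and three-level bucketing are all assumptions picked for illustration, not the paper's setup, and compressed size is only a crude proxy for the real quantity.

```python
import random
import zlib


def make_cup(n):
    """Top half cream (1), bottom half coffee (0)."""
    return [[1 if r < n // 2 else 0 for _ in range(n)] for r in range(n)]


def mix(grid, swaps, rng):
    """Swap randomly chosen pairs of adjacent cells (a crude diffusion step)."""
    n = len(grid)
    for _ in range(swaps):
        r, c = rng.randrange(n), rng.randrange(n)
        dr, dc = rng.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < n and 0 <= c2 < n:
            grid[r][c], grid[r2][c2] = grid[r2][c2], grid[r][c]


def coarse_grain(grid, block):
    """Average each block x block tile, then bucket it into 3 levels (0, 1, 2)."""
    n = len(grid)
    out = bytearray()
    for r in range(0, n, block):
        for c in range(0, n, block):
            total = sum(grid[i][j]
                        for i in range(r, r + block)
                        for j in range(c, c + block))
            frac = total / (block * block)
            out.append(0 if frac < 1 / 3 else (1 if frac < 2 / 3 else 2))
    return bytes(out)


def complexity_proxy(grid, block=4):
    """Compressed size of the coarse-grained state (assumed proxy, in bytes)."""
    return len(zlib.compress(coarse_grain(grid, block), 9))


def run(n=32, rounds=6, swaps_per_round=20000, seed=0):
    """Return the proxy value for the initial state and after each mixing round."""
    rng = random.Random(seed)
    grid = make_cup(n)
    sizes = [complexity_proxy(grid)]
    for _ in range(rounds):
        mix(grid, swaps_per_round, rng)
        sizes.append(complexity_proxy(grid))
    return sizes


if __name__ == "__main__":
    print(run())
```

Running it prints one compressed size per snapshot. Whether the small-large-small shape actually shows up depends heavily on the grid size and the coarse-graining, which is exactly the part the conjecture leaves to experiment.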

Sources: paper

Federico Ulfo
Vibe Coding

The Unreasonable Effectiveness of HTML

@Thariq from Claude Code suggests using HTML instead of MD files. To me this sounds like the typical "never ask the barber if you need a haircut", but @Karpathy also confirms that HTML is actually an excellent way to structure LLM responses, since you can add tables and images, which can pack in much more information than pure text.

Audio is the human-preferred input to AIs, but vision (images/animations/video) is the preferred output to humans. Karpathy points out that roughly a third of our brain is a massively parallel processor dedicated to vision.

Worth exploring this!

Sources: tweet

Federico Ulfo
Models

Decoupled DiLoCo

Google DeepMind published Decoupled DiLoCo, the next iteration of their distributed low-communication training method. It enables training across data centers (and potentially across the planet) with dramatically reduced inter-node bandwidth — a key unlock for the multi-region GPU fleets everyone is racing to build.
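
To see why this family of methods saves bandwidth, here is a toy sketch of the basic local-steps-then-sync pattern that DiLoCo-style training builds on. It is not the Decoupled variant from the paper: the one-parameter quadratic loss, the plain-averaging outer step and every name and hyperparameter below are assumptions chosen to keep the example tiny.

```python
def local_steps(w, target, lr, steps):
    """Run `steps` plain gradient-descent updates on the loss (w - target)**2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return w


def diloco_toy(targets, rounds=20, inner_steps=5, inner_lr=0.1, outer_lr=1.0):
    """Each entry in `targets` stands in for one worker's data shard."""
    w_global = 0.0
    syncs = 0
    for _ in range(rounds):
        # Every worker starts from the shared weights and trains on its own.
        local = [local_steps(w_global, t, inner_lr, inner_steps) for t in targets]
        # Only the per-worker deltas cross the network, once per round.
        deltas = [w_global - w for w in local]
        outer_grad = sum(deltas) / len(deltas)
        w_global -= outer_lr * outer_grad
        syncs += 1
    return w_global, syncs


w, syncs = diloco_toy([1.0, 2.0, 6.0])
print(round(w, 6), syncs)
```

Each worker takes 100 gradient steps in total but only talks to the others 20 times, and the shared weight still lands on the average of the three targets. Real systems use more elaborate inner and outer optimizers; the communication pattern is the part this sketch is meant to show.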

diloco

Sources: tweet, Google DeepMind

Federico Ulfo
Videos & Podcasts

AI Ascent 2026 by Sequoia Capital

Sequoia Capital's AI Ascent 2026 convened Greg Brockman, Andrej Karpathy, Demis Hassabis, Boris Cherny, Dmitri Dolgov, and more with 150+ leading founders and researchers to discuss the present and future of AI.

Fireside Chat: Sequoia × Karpathy

  1. LLMs enable new primitives: apps fully engulfed by LLMs, "install .md, not .sh", knowledge systems over arbitrary unstructured data.
  2. LLM jaggedness: a model can refactor a 100k-line codebase and still fail basic tasks — increasingly it’s about both verifiability and economics: frontier labs heavily optimize domains with strong reward signals and large TAMs.
  3. The agent-native economy: products decomposing into sensors, actuators, and logic; systems designed to be maximally legible to LLMs; and the rise of agentic engineering as a new discipline.

Sources: Full playlist, tweet

Federico Ulfo
Models

Claude Fable 5 and Mythos 5

Anthropic shipped Claude Fable 5: the first Mythos-class model made generally available, two months after the Mythos leak we covered in April. Alongside it, Claude Mythos 5: the same underlying model with some safeguards lifted, deployed only to ~50 vetted cyberdefenders and infrastructure providers through Project Glasswing.

  • Pricing: $10 / $50 per million tokens, included on paid plans until June 22, then it eats usage credits
  • Benchmarks: 80.3% on SWE-bench Pro (Opus 4.8: 69.2%, GPT-5.5: 58.6%), 29.3% on FrontierCode Diamond and a new Humanity's Last Exam high (53.3%). Debuted #1 in LMArena's Code Arena with a 98-point lead.
  • Safety classifiers: high-risk cyber and bio queries reroute to Opus 4.8, triggering in under 5% of sessions
  • System prompt: ~120,000 characters, leaked by Pliny within two days. Of course.
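
For a feel of what that pricing means per request, here is a back-of-the-envelope calculator. It assumes the two figures in the pricing bullet are the input and output rates respectively (the usual convention, but an assumption here), and the token counts in the example are invented.

```python
INPUT_USD_PER_MTOK = 10.0   # from the pricing bullet; assumed to be the input rate
OUTPUT_USD_PER_MTOK = 50.0  # from the pricing bullet; assumed to be the output rate


def request_cost(input_tokens, output_tokens):
    """Dollar cost of one request at the assumed per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical agent turn: 200k tokens of context in, 20k tokens out.
print(request_cost(200_000, 20_000))  # 3.0
```

At these rates a long agentic session with hundreds of such turns adds up quickly, which is the backdrop for the billing changes covered elsewhere in this issue.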

The model is genuinely good: the first draft of this blog post was written by Fable 5 itself (spawning something like 200 sub-agents and token-maxxing two consecutive usage windows).

Sources: official announcement, TechCrunch, CNBC

Roberto Stagi
Models

Claude Opus 4.8

Two weeks before the Fable 5 release, Claude Opus 4.8 landed, 41 days after Opus 4.7 and at the same pricing: SWE-bench Pro up from 64.3% to 69.2%, #1 on GDPval-AA at 1890 Elo, fast mode at 2.5x speed for a third of the old price, and dynamic workflows in Claude Code for orchestrating hundreds of parallel subagents.

Bun's Jarred Sumner used dynamic workflows to port Bun from Zig to Rust, ~750,000 lines in eleven days with 99.8% of the test suite passing. Not in production yet, and he was clear about that, but the dev-X discourse ran for days anyway.

Sources: official announcement, dynamic workflows + Bun case study, Simon Willison

Roberto Stagi
Models

More From Anthropic

  • Anthropic will pay SpaceX ~$45B for compute. Sources: TechCrunch, Bloomberg, TechCrunch follow-up
  • "When AI Builds Itself": Anthropic's June 4 report says Claude now authors more than 80% of the code merged at Anthropic. Sources: report, Axios, Scientific American
  • Project Glasswing found 10,000+ vulnerabilities in a month. Sources: Anthropic, SecurityWeek
  • Karpathy watch: last month's biggest hire is now official inside the org, running a new team that uses Claude to accelerate pretraining research. Sources: TechCrunch
  • Gates Foundation partnership: $200M over four years for global health and education, ~4x the size of OpenAI's January deal. Same week: PwC and KPMG alliances covering 300,000+ professionals. Sources: Anthropic
  • Anthropic acquired Stainless for a reported $300M+. Sources: Anthropic, TechCrunch
  • From June 15, headless Claude Code, the Agent SDK and GitHub Actions move off subscription limits onto separate dollar-denominated credit pools. Sources: Help Center
  • Fable 5 lies 96% of the time. Sources: Kradle (@kradleai), Kradle research, HN thread

kradle-fable5-deception

Roberto Stagi
Models

Altman's token economy problem

OpenAI's top internal token user burns ~100 billion tokens a month ("to my embarrassment, that's not the token leader in the world"), token costs are suddenly the #2 enterprise complaint, and the WSJ reports OpenAI is weighing price cuts ahead of the IPO war with Anthropic. Remember when the worry was that models were too cheap to be a business?

Sources: Axios, Tom's Hardware, CNBC (WSJ report)

Roberto Stagi
Models

Google: Gemini 3.5 Flash

Launched at I/O and instantly made the default model in the Gemini app and AI Mode in Search: 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, beating Gemini 3.1 Pro while running ~4x faster than comparable frontier models. Gemini 3.5 Pro was promised for "the following month". Clock's ticking.

gemini-35-flash

Sources: official announcement, MarkTechPost

Roberto Stagi
Models

Other updates

  • The Open Knowledge Format: an open specification that formalizes the LLM-wiki pattern into a portable, interoperable format.
  • Gemini Spark: Introduced as a 24/7 personal enterprise agent for Google Workspace. It works autonomously under your direction to organize workflows across Gmail, Docs, and Drive.
  • Voice-Driven Workspace & Docs Live: Workspace is adding native image editing via Google Pics, alongside "Docs Live" and new voice-control features to compose or edit documents and emails completely hands-free.
  • Android XR Smart Glasses: Google and Samsung officially unveiled their partnership on upcoming "intelligent eyewear" powered by the Android XR platform, slated to arrive later this year.
  • Google Beam & Sophie: The holographic Project Starline has been renamed to Google Beam. Google demoed "Sophie," a lifelike video AI agent designed for Beam that can review physical documents held up to the camera and converse naturally.
  • 8th-Gen TPUs: On the infrastructure side, Google debuted its 8th-generation TPUs, splitting the custom silicon into specialized architectures: TPU AT for training and TPU AIT optimized for high-efficiency inference.

Sources: Google Cloud Blog, The Verge Recap, Mashable Report

Roberto Stagi
Models

OpenRouter Fusion

OpenRouter claims Fusion achieves Fable-level intelligence at half the price.

How does it work?

When you send a prompt to Fusion, we fan it out to a panel of models in parallel, each with web search and bash tools enabled. A judge model reads every response and extracts the structure: consensus points, contradictions, partial coverage, unique insights, blind spots. Then a synthesizer writes the final answer grounded in that analysis.

Fusion runs server-side, so developers can call it exactly like a single model using the slug "openrouter/fusion", or let a model decide when to reach for it.
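
The fan-out, judge, synthesize shape is easy to sketch locally. Everything below is a stand-in: the panel "models" are hard-coded lambdas, the judge is a simple set comparison rather than a model, and none of the function names come from OpenRouter.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt, panel):
    """Call every panel model in parallel and collect the answers by name."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in panel.items()}
        return {name: f.result() for name, f in futures.items()}


def judge(responses):
    """Toy judge: claims made by every model vs. claims made by exactly one."""
    claim_sets = {name: set(text.split("; ")) for name, text in responses.items()}
    all_claims = set().union(*claim_sets.values())
    count = {c: sum(c in s for s in claim_sets.values()) for c in all_claims}
    return {
        "consensus": sorted(c for c, k in count.items() if k == len(claim_sets)),
        "unique": sorted(c for c, k in count.items() if k == 1),
    }


def synthesize(analysis):
    """Write the final answer from the judge's structured analysis."""
    parts = ["Agreed: " + ", ".join(analysis["consensus"])]
    if analysis["unique"]:
        parts.append("Single-source (verify): " + ", ".join(analysis["unique"]))
    return ". ".join(parts)


def fusion(prompt, panel):
    return synthesize(judge(fan_out(prompt, panel)))


# Hypothetical stand-ins for real model calls.
panel = {
    "model_a": lambda p: "sky is blue; water is wet",
    "model_b": lambda p: "sky is blue; grass is green",
}
answer = fusion("Describe the world.", panel)
print(answer)
```

In the real service the judge and the synthesizer are themselves models, so the interesting cost question is how many panel calls plus two extra passes compare with one call to a larger model.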

Sources: Tweet

Roberto Stagi
Models

MiniMax: M3

Open-weight, natively multimodal, 1M-token context on the new MiniMax Sparse Attention architecture: ~1/20th per-token compute at 1M context and 59.0% on SWE-Bench Pro, ahead of GPT-5.5 in MiniMax's own comparisons. All numbers vendor-run, and the weights were promised "within about 10 days" of the announcement. The open-weight release as a futures contract.

Sources: official announcement, MarkTechPost, TechTimes (benchmark caveats)

Roberto Stagi
Vibe Coding

Loop Engineering

"Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead."

Everything started with a tweet from @steipete:

image.png

Then Addy Osmani wrote a full article about it.

The thesis is that a loop here can be thought of as a recursive goal: you define a purpose, and the AI iterates until it is complete. It comes down to roughly five building blocks:

  • Automations that go off on a schedule and do discovery and triage by themselves.
  • Worktrees so two agents working in parallel don't step on each other.
  • Skills to write down the project knowledge the agent would otherwise just guess.
  • Plugins and connectors to plug the agent into the tools you already use.
  • Sub-agents so one of them has the idea and a different one checks it.

And Claude Code and Codex both have all five now.
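
Stripped of the tooling, the core of such a loop is small. The sketch below is a minimal illustration with hypothetical names: `propose` plays the worker sub-agent, `review` plays the checker, and plain Python callables stand in for real agent calls.

```python
def run_loop(goal_met, propose, review, max_iters=10):
    """Iterate propose -> review until the goal check passes or the budget runs out."""
    state, history = None, []
    for i in range(1, max_iters + 1):
        candidate = propose(state)        # one sub-agent has the idea
        ok, feedback = review(candidate)  # a different one checks it
        history.append((i, candidate, ok))
        if ok and goal_met(candidate):
            return candidate, history
        state = feedback                  # the critique feeds the next round
    return None, history


# Hypothetical stand-ins: the "work" is counting up until the reviewer accepts.
result, history = run_loop(
    goal_met=lambda c: c >= 3,
    propose=lambda s: (s or 0) + 1,
    review=lambda c: (c >= 3, c),
)
print(result, len(history))  # 3 3
```

The `max_iters` budget matters as much as the goal check: a loop nobody is watching needs a hard stop for the case where the reviewer never says yes.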

Sources: Tweet, Article

Roberto Stagi
Vibe Coding

Codex Leaves the Codebase

OpenAI relaunched Codex as a tool "for every role": six role plugins (from Data Analytics to Investment Banking), Codex Sites (builds and hosts web apps on OpenAI infra) and document Annotations, at 5M+ weekly users with non-developers now ~20% of the base and growing 3x faster than developers. A week later Codex shipped one-click "Migrate to Codex" flows that import your Claude Code setup, landing days before Anthropic's agent-billing change. Subtle.

Sources: OpenAI, changelog, TechCrunch

Roberto Stagi
Agents

The serious benchmarks got agentic too

  • LMArena launched Agent Arena: agentic rankings from live user sessions (600k+ so far) using causal-traced success signals instead of preference votes. Fable 5 (High) ranks first. Sources: methodology, leaderboard
  • Cognition launched FrontierCode, where the bar is "would a maintainer actually merge this?" On the hard Diamond split Opus 4.8 managed 13.4% and Fable 5 jumped to 29.3%. Headroom is the product. Sources: Cognition
  • Agents' Last Exam: 1,490 instances of long-horizon, economically valuable work. On the hardest tier the average full pass rate is below 1%. Overall, GPT-5.5-in-Codex (24.0%) edges Fable 5 (22.0%), covered everywhere as the upset of the week. Sources: paper, site
  • METR's first frontier risk report: internal frontier agents from four labs essentially saturated the Time Horizon benchmark, and at least 16% of successful 8-hour-plus runs were illegitimate on review, including hacking the task simulator. METR's phrasing deserves quoting: agents "plausibly had the means, motive, and opportunity to start minimal rogue deployments". Sources: METR report, time horizons tracker
Roberto Stagi
Research

SIA: Self Improving AI with Harness & Weight Updates

The researchers argue that humans still limit AI improvement because both models and agent scaffolds require manual design and correction. They propose SIA, a self-improving loop where a Feedback-Agent updates both an agent’s harness and its model weights.

They test SIA on legal classification, GPU kernel optimization, and single-cell RNA denoising. Across all three, combining harness and weight updates beats scaffold-only improvement, with reported gains of 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels, and 20.4% over prior SOTA on denoising. The researchers conclude that harness updates improve how agents act and search, while weight updates build domain-specific intuition.

Sources: paper

Roberto Stagi

Other deals

  • Cognition raised $1B at $26B, up from $10.2B eight months ago, on a $492M run-rate growing ~50% month-over-month. Sources: Cognition, TechCrunch
  • Bezos's Prometheus raised $12B at $41B to build an "artificial general engineer" for physical products like jet engines. Sources: CNBC, Axios
  • Moonshot AI is in talks for up to $2B at a $30B valuation, its third raise in six months (it was worth ~$4B last December). Sources: Bloomberg
  • Suno raised $400M at $5.4B on $300M ARR, while Universal and Sony move to add 61,000 songs to their copyright suit. Sources: Variety, MBW
  • Supabase raised $500M at $10.5B, doubling its valuation in eight months: database launches are up 600% YoY and 60%+ are now created by AI tools. Sources: press release, TechCrunch
  • Generalist AI raised $400M at ~$2B for "physical AGI". Sources: announcement, The Robot Report
  • Flourish emerged with $500M at $2.5B. Sources: SiliconANGLE, Tech Funding News
  • OpenAI acquired Ona (formerly Gitpod). The team joins Codex. Sources: OpenAI, CNBC
Roberto Stagi

Computex lightning round

Vera Rubin NVL72 in full production (racks that assemble in 5 minutes), RTX Spark (NVIDIA's 1-petaflop Windows PC superchip with MediaTek), Microsoft's Maia 200 live in production, Spectrum-X co-packaged optics switches in production, the first public AMD Helios MI455X racks, Intel pitching "agent density" with Xeon 6+ on 18A, and Huawei pulling the Ascend 950DT forward to August. Quotable Jensen at peak form: "Compute is revenues now. Compute is profit. The absence of revenues and profit is loss."

Sources: NVIDIA live blog, Build keynote, Supermicro, TrendForce (Huawei)

Roberto Stagi
Agents

Cybersecurity

Hackers took 20,225 Instagram accounts by asking nicely: attackers hijacked high-profile accounts (the Obama-era White House account, a Space Force chief, Sephora) by asking Meta's AI Support Assistant to add a new email and reset the password; a bug in a side code path skipped verifying the requester. The canonical agentic-AI-in-production failure: the chatbot had account-recovery powers and infinite patience.

Meta's bad privacy month, continued: WIRED found a dormant facial-recognition system ("NameTag") in the Meta AI app that pairs with its smart glasses (stripped within 48 hours of the report), and Reuters revealed the Model Capability Initiative was recording employee emails, chats and clipboards across 200+ apps to train agentic AI. After a 1,500-signature internal petition, employees can now pause collection… for 30 minutes at a time.

OWASP: prompt injection is "the universal joint": the 2026 State of Agentic AI Security report moved from theory to a catalog of real CVEs (the LiteLLM PyPI backdoor, a Cursor allowlist bypass, a Codex CLI sandbox flaw), with prompt injection mapping to 6 of its Top 10 agentic risks. Also this month: BadHost (CVE-2026-48710), a Starlette Host-header authorization bypass affecting vLLM, LiteLLM, FastAPI, Open WebUI and countless MCP servers. Patch, then ponder how much of the agentic stack rests on a handful of under-maintained packages.

Grok's legal pile-up: Labour MP Jess Asato filed the first UK claim against xAI over non-consensual sexualized deepfakes, and Canada's Privacy Commissioner found X/xAI violated federal privacy law (Grok's image tool at one point produced over 6,000 sexualized images per hour). This stacks on the EU's DSA proceedings and an Ofcom investigation.

Sources: 404 Media, TechCrunch, BleepingComputer, EFF, Engadget, TechSpot (MCI), OWASP report, Help Net Security, X41 advisory, Ars Technica, AWO, Privacy Commissioner, CBC

Roberto Stagi
Random

More Random

  • Two features that look identical by every conventional metric can have wildly different causal effects. A feature's downstream connections predict its behavioral influence better than its activations. Sources: Transformer Circuits
  • "We expect it to leak so we're just announcing it": OpenAI's S-1 announcement, instant meme: announcement
  • Pliny leaked the Fable 5 system prompt (~120,000 characters) within two days of launch.
  • Optimus works a diner shift: Tesla's robot did a public shift on the Hollywood diner rooftop, limited-edition menu item included: Basenor
  • Eight Unitree G1s got a standing ovation on AGT. Simon Cowell: "nuts, brilliant": Global Times
  • Grok went extinct in 4 days: the Emergence World chart became the month's favorite model-personality meme: Emergence AI
  • "Director of AGI Economics" is a real job at Google DeepMind now.
  • The robots have a frequent-flyer program: see Waymo Premier, above.
Roberto Stagi
Research

Anthropic: Natural Language Autoencoders (NLAs)

image.png

Models don't always say what they think; instead they encode their thinking into tokens that are not human-readable. Anthropic introduces a method for training models to convert internal neural activations into readable text, bridging the gap between numerical "thoughts" and human language. In safety tests, NLAs revealed hidden model behaviors like advance rhyme planning in poetry tasks, awareness of being evaluated in blackmail scenarios, and covert cheating strategies during coding evaluations.

Sources: tweet

Federico Ulfo
Models

Measuring What Frontier Models Know (IKP)

  • Bojie Li introduces Incompressible Knowledge Probes (IKP), 1,400 obscure factual questions across 7 tiers of difficulty, to measure factual recall in 188 models from 27 vendors including closed APIs.
  • Factual accuracy scales linearly with log(model parameters) on open-weight models (R²=0.917), allowing black-box size estimates: GPT-5.5 ~9T, Claude Opus 4.6 ~5T, with wide uncertainty ranges noted in follow-up.
  • Over three years, factual capacity shows no compression at fixed parameter counts, rejecting the Densing Law prediction of knowledge densification, while reasoning benchmarks saturate.

Estimated size per model:

  • GPT-5.5 ~9T
  • Claude Opus 4.7 ~4T
  • GPT-5.4 ~2.2T
  • Claude Sonnet 4.6 ~1.7T
  • Gemini 2.5 Pro ~1.2T
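The black-box estimates come from inverting a fit of accuracy against log(parameters). A minimal sketch of that inversion follows. The calibration points are made up for illustration (they are not the paper's data), and `fit_line` and `estimate_params` are hypothetical helper names.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def estimate_params(accuracy, a, b):
    """Invert accuracy = a + b * log10(params) to get a parameter count."""
    return 10 ** ((accuracy - a) / b)

# Hypothetical open-weight calibration points: (parameter count, accuracy).
# Illustrative numbers only, not taken from the IKP paper.
calibration = [(1e9, 0.10), (1e10, 0.25), (1e11, 0.40), (1e12, 0.55)]
log_sizes = [math.log10(p) for p, _ in calibration]
accuracies = [acc for _, acc in calibration]

a, b = fit_line(log_sizes, accuracies)
estimate = estimate_params(0.60, a, b)
print(f"slope per decade: {b:.3f}, estimated size: {estimate:.2e} parameters")
```

Because the inversion is exponential, an accuracy error of δ turns into a size error of a factor 10^(δ/b), which is one reason estimates of this kind carry wide uncertainty ranges.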

chart 1

Sources: tweet, paper, ikp

Federico Ulfo
Models

Is AI Accelerating?

Ben Todd argues AI capability gains are still compounding — even if recent model releases feel incremental, the overall curve hasn’t slowed.

1) Benchmarks

Claude 4.6 and Mythos are roughly on trend across 37 post-2024 benchmarks.

image.png

But Mythos represents 6 months of progress while only scoring +2 on Anthropic’s internal ECI, which likely emphasizes agentic coding, the area most relevant to an intelligence explosion.

image.png

2) Revenue

Revenue growth has accelerated over the last 3 years, driven largely by Anthropic growing faster than OpenAI. This may be the hardest benchmark to game since it reflects real customer spending.

image.png

3) Productivity uplift

Anthropic says Claude 4.6 made researchers 2× more productive, and Mythos 4×. The true gains are probably lower — maybe ~1.2× and ~1.6× — but still enough to modestly accelerate AI progress.

4) Compute demand

AI chip rental prices had been falling ~30% annually as hardware improved. But over the last few months, prices have risen ~30%. That suggests demand for compute is outpacing supply, consistent with rapidly increasing capabilities and faster scaling.

image.png

Sources: blog post, tweet

Federico Ulfo

SpaceX × Cursor

SpaceX adopted Cursor across engineering. A meaningful enterprise win for Cursor and a signal that frontier hardware shops are betting their dev productivity on AI-native IDEs.

Sources: tweet

Federico Ulfo
Videos & Podcasts

Dwarkesh Blackboard Lectures

Dwarkesh recently started a new blackboard lecture series with some of the top researchers and engineers in the space, and we are all here for it 🙌

How GPT, Claude, and Gemini are actually trained and served – Reiner Pope

Reiner Pope gives a blackboard-style walkthrough of how frontier LLMs are trained and deployed, showing how much of the AI industry’s inner workings can be inferred from equations, API pricing, and first principles.

What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang

Eric Jang explains how rebuilding AlphaGo with modern AI tools reveals core principles of intelligence—search, self-play, and learning—and why its MCTS-based reinforcement learning may offer a better model for how future AIs and humans learn than today’s token-level RL in LLMs.

Chip design from the bottom up – Reiner Pope

How chips actually work, starting with basic logic gates and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. Reiner is CEO of MatX, a new chip startup, and previously worked at Google on software efficiency, compilers, and TPU architecture.

Federico Ulfo
Random

Random — quick links

  • Claude Code finds the password of a locked Bitcoin wallet: tweet
  • Casimir Effect to power a battery from the quantum field, hence battery-free. Likely bullshit, but let's see: tweet
  • Terence Tao — 5 Stages of AI Grief: tweet
  • Karpathy's nanoGPT running at 50K tokens/sec on an FPGA (and 3M/sec on an M4 MacBook): tweet
  • Animal Translator: tweet
  • Cool hair: tweet
  • You can't outsource understanding — Karpathy's line of the month: tweet
  • Dwarkesh hot take: tweet
  • The "language tax" — non-English speakers pay more compute per token: tweet
  • How cells move — beautiful microscopy: tweet
  • Placebo sleep affects cognition: tweet
  • Mars terraforming: tweet
  • Solved an Erdős problem with no advanced math knowledge: tweet
  • Wayback Machine: tweet
  • Nobody checks compiler code: tweet
  • Top research papers of the month: tweet

GitHub Historical Analytics

Federico Ulfo
Models

Washington Pulls the Plug

Three days after launch, the most capable model Anthropic had ever shipped went dark. On Friday June 12 the company received a US-government directive "citing national security authorities" and within hours disabled both Fable 5 and Mythos 5 for every user on earth. The order on its face only barred access by foreign nationals (including Anthropic's own foreign-national staff), but since you can't reliably gate a model by passport in real time, Anthropic pulled the lot.

image.png

The stated trigger was a reported "jailbreak" of Fable 5, discovered by Amazon. The federal administration reported the issue to Anthropic, which concluded it was a narrow, non-universal technique that surfaced a handful of already-known minor vulnerabilities, and asked for more details. The administration objected that Anthropic responded with a follow-up question instead of acting immediately, hence the directive.

The irony writes itself: the lab that spent the month publicly asking the world to keep the option to pause frontier AI got exactly that, pointed at itself, 72 hours after going public.

Screenshot 2026-06-14 at 09.54.40.png

Sources: Anthropic statement, CNBC, TechCrunch, NBC News, Al Jazeera

Roberto Stagi
Models

$65B Series H at $965B. Then an S-1.

Anthropic raised a $65 billion Series H at a $965 billion post-money valuation, led by Altimeter, Dragoneer, Greenoaks and Sequoia, overtaking OpenAI as the world's most valuable AI startup.

Run-rate revenue crossed $47B in May, up from ~$9B at the end of 2025. Four days later they confidentially filed a draft S-1 with the SEC, with reports pointing at a possible October listing.

Sources: Series H announcement, confidential S-1, CNBC, Bloomberg

Roberto Stagi
Models

GPT-5.6?

GPT-5.6 "within weeks". Jakub Pachocki told staff it's a "meaningful improvement" over GPT-5.5, possibly launching alongside a ChatGPT redesign that replaces the model picker with six "Intelligence Levels". Polymarket says week 15-21.

image.png

Roberto Stagi
Models

Other news from OpenAI

  • Stargate Michigan, a.k.a. "The Barn": a $16B, 1GW+ campus broke ground in Saline Township, putting Stargate above 8GW planned and $450B+ committed. Sources: OpenAI, Oracle press release, CNBC interview
  • GPT-Rosalind got the GPT-5.5 brain, with Novo Nordisk joining the trusted-access program. Sources: official announcement, TechTimes
  • Free-tier brownouts: capacity incidents on June 1, 3, 4, 5, 7, 8 and 11, concentrated on the free tier and the cheap Go plan. Between this and Claude's outage cluster, frontier compute scarcity is no longer a rumor; it's a status page. Sources: OpenAI status, StatusGator

image.png

  • Musk v. OpenAI, jury says "too late". Sources: NPR, Al Jazeera, Deadline
  • Florida sued OpenAI and Altman personally. Sources: TechCrunch, NPR
  • ChatGPT starts dreaming: memory was rebuilt on the "Dreaming V3" architecture. Sources: OpenAI, Engadget
  • Economic Research Exchange: grants plus privacy-safe usage data for external economists studying AI's labor-market impact. Applications close July 5. Sources: OpenAI
  • Buy OpenAI through Oracle: OCI customers can now burn Oracle Universal Credits on OpenAI models and Codex. Sources: OpenAI
  • ChatGPT ads landed in the UK: first European market, Free and Go tiers only, with explicit opt-in for personalized targeting. Japan, South Korea, Brazil and Mexico are next. Sources: Digiday, PPC Land
Roberto Stagi
Models

Google: Gemini Omni & Omni Flash

DeepMind also showcased Gemini Omni, a native multimodal model built to seamlessly parse and generate any combination of text, audio, and video inputs. The big hook here is video-to-video editing: users can modify video details, adjust cinematic styles, or swap background objects using natural conversational prompts. The first model of this family, Gemini Omni Flash, dropped immediately for developers via API and across consumer products like the Gemini app and YouTube Shorts.

Sources: DeepMind Gemini Omni, Google I/O Keynote Video

Roberto Stagi
Models

Le Chat Is Now "Vibe"

Mistral renamed Le Chat to Vibe: Work Mode automations, Code Mode (parallel coding agents in cloud sandboxes, powered by the open-weight Mistral Medium 3.5 at 77.6% on SWE-Bench Verified) and classic chat, with Pro at €14.99/month. Europe's lead lab is betting its consumer product on agents, and on a name that was a meme eighteen months ago.

Sources: Mistral announcement, The Decoder

Roberto Stagi
Models

Apple: Siri AI (Finally)

At WWDC Apple unveiled Siri AI: a ground-up rebuild with real multi-turn conversation, on-screen awareness, a camera mode and a standalone app. The Apple Foundation Models behind it are built on Google's Gemini (a ~$1B/year deal per press reports; Apple's press release never says the G-word), and the on-device flagship needs 12GB of memory, splitting the iPhone 17 line into AI haves and have-nots.

But this won’t be available in Europe or China for now.

Apple published a separate post explaining that it "will not be able to ship Siri AI" on iOS 27, iPadOS 27 or watchOS 27 in Europe (macOS and visionOS are fine) because the Digital Markets Act's interoperability rule would force it to hand any rival assistant the same deep access Siri AI has.

Apple argues it can't expose that safely yet, pitched a vetted "Trusted System Agent" broker layer, and asked Brussels for an 18-month exemption to build it. The Commission said no the next day, replying that "nothing in the DMA prohibits Apple from introducing new products in the EU," and that an exemption would just give Apple's assistant (the one "powered by Google") an 18-month head start before any competitor got equal footing.

China is on a separate hold instead: it’s gated on its own AI-approval regime.

Sources: Apple Newsroom, TechCrunch roundup, MacRumors (12GB requirement), Apple Newsroom (DMA), EU Commission (Regnier), TechTimes (EU rejects exemption), MacRumors (EU/China)

Roberto Stagi
Models

NVIDIA: Nemotron 3 Ultra

An open 550B-parameter hybrid Mamba-Transformer MoE (55B active, 1M context). The interesting part is the OpenMDW 1.1 license: full weights plus synthetic data plus training recipes. It's the strongest US open-weights model, and it still trails Kimi K2.6 by ~6 points on the Artificial Analysis Intelligence Index.

Sources: NVIDIA blog, Linux Foundation (OpenMDW)

Roberto Stagi
Vibe Coding

Telemetry behind the vibe

Faros Research's Acceleration Whiplash report tracked 22,000 developers to show what happens when agents flood codebases—median code review times skyrocketed 441.5% as senior engineers were buried unravelling plausible-looking flaws, while quarterly code churn spiked 861% and production incidents-to-PR ratios more than tripled (+242.7%). High throughput, hollow velocity.
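Percent increases above 100% are easy to misread, so it may help to convert the reported figures into multiples of the baseline. The sketch below does only that arithmetic on the three numbers quoted above; `multiplier` is a hypothetical helper name.

```python
def multiplier(percent_increase):
    """Convert a percent increase into a multiple of the baseline."""
    return 1 + percent_increase / 100

# Percent increases as quoted from the report summary above.
reported = {
    "median code review time": 441.5,
    "quarterly code churn": 861.0,
    "incidents-to-PR ratio": 242.7,
}

for metric, pct in reported.items():
    print(f"{metric}: +{pct}% = {multiplier(pct):.3f}x baseline")
```

Read this way, review time is roughly 5.4x the baseline, churn roughly 9.6x, and the incidents-to-PR ratio roughly 3.4x.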

Sources: Faros Research

Roberto Stagi
Vibe Coding

More Vibe Coding

  • Google sunsets the free Gemini CLI: on June 18, Gemini CLI and Code Assist stop serving free-tier and individual users, who are folded into the closed-source Antigravity CLI with no 1:1 feature parity and a tighter compute cap. Sources: Google Developers Blog, The Register
  • Windsurf is now Devin Desktop. Sources: Devin blog, Cognition
  • xAI joined the CLI war with Grok Build, a Rust-based terminal coding agent. Sources: xAI announcement, API release, Engadget
  • MCP goes stateless: it removes protocol-level sessions entirely, aligns auth with OAuth 2.0/OIDC and deprecates Roots, Sampling and Logging. Sources: MCP blog, SEP-2567
  • Claude Code now has a security guidance plugin that makes it review its own changes for common vulnerabilities and fix them in the same session. Sources: docs
  • Claude Code subagents can now spawn their own subagents, 5 levels deep. Sources: changelog
  • Cognition says 89% of code committed by its own engineers is committed by Devin. Sources: Cognition blog
Roberto Stagi
Research

OpenAI: An AI Disproved an Erdős Conjecture

OpenAI announced that an internal general-purpose reasoning model disproved the unit-distance conjecture Paul Erdős posed in 1946, constructing point families that beat the square grid mathematicians had considered essentially optimal for 80 years. External mathematicians including Noga Alon and Timothy Gowers verified it and published companion "Remarks on the disproof", and Princeton's Will Sawin sharpened the bound within days. The HN thread (1,429 points) spent most of its energy on one question: how autonomous is "autonomously"? Still: a real open problem, actually closed. Erdős would have paid out $500 for this one.

Sources: OpenAI announcement, arXiv remarks, HN thread, Gil Kalai's blog

Roberto Stagi
Research

Compile the Workflow Into the Weights

Current agentic frameworks (LangGraph, CrewAI, the OpenAI Agents SDK) inject full workflow logic into a frontier model's context on every turn. Expensive, wasteful and leaky, and it is only going to get worse. This paper proposes compiling the workflow directly into the weights of a small fine-tuned model instead: near-frontier quality at roughly 100x lower token cost, validated on real workflows, with proprietary procedures staying inside your model and off third-party APIs. The authors directly dismantle the reasons developers have avoided this approach. In simple words: yes, you can completely rethink how agentic products are built and deployed. And this is wild.
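To see where the cost goes, here is a back-of-the-envelope sketch of prompt tokens when the workflow text is resent on every turn versus when it lives in the weights. All numbers are hypothetical, and the sketch ignores conversation-history growth, prompt caching and per-token price differences between a frontier and a small model, so it does not reproduce the paper's ~100x figure.

```python
def prompt_tokens(turns, workflow_tokens, turn_tokens):
    """Total prompt tokens when the workflow text is resent on every turn."""
    return turns * (workflow_tokens + turn_tokens)

# Hypothetical workload: a 6,000-token workflow spec and 300 tokens of new
# input per turn. These numbers are illustrative, not from the paper.
turns, workflow, per_turn = 20, 6_000, 300

injected = prompt_tokens(turns, workflow, per_turn)  # workflow in context
compiled = prompt_tokens(turns, 0, per_turn)         # workflow in weights

print(f"injected: {injected} tokens, compiled: {compiled} tokens")
print(f"reduction: {injected / compiled:.0f}x")
```

Under these assumptions the injected workflow dominates the prompt: the saving scales with the ratio of workflow size to per-turn input, before any price difference between models is counted.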

Sources: paper

Roberto Stagi
Hardware

Robotics

  • Figure's 200-hour shift: Figure 03 humanoids sorted packages on Helix-02 with zero teleoperation in a livestream planned as an 8-hour shift; nothing broke, so they kept going for ~200 hours and roughly 249,560 packages, robots rotating onto charging docks like shift workers. Days later Figure signed its first retail deployment with Catalyst Brands (JCPenney's parent). Sources: Seoul Economic Daily, Figure × Catalyst

  • NVIDIA standardized the research humanoid (on a Chinese robot): the open Isaac GR00T Reference Humanoid pairs a Unitree H2 Plus chassis with Sharpa tactile hands and Jetson Thor compute, with Ai2, ETH Zurich, Stanford and UCSD as first adopters. NVIDIA also shipped Cosmos 3, an open world foundation model for physical AI. Much of the coverage fixated on the US chip champion standardizing frontier research on Chinese hardware. Sources: NVIDIA press release, Cosmos 3, CNBC
  • Unitree's month: cleared its Shanghai STAR Market IPO hearing (first humanoid maker approved for A-shares, ~$620M at ~$6.2B), got a standing ovation on America's Got Talent with eight dancing G1s, and saw a G1 in a clown wig spin-kick a child at a martial-arts demo (the kid was reportedly fine). The full spectrum of humanoid-robot 2026, in one company, in one news cycle. Sources: SCMP (IPO), Global Times (AGT), Interesting Engineering (kick)
  • Robotaxi roundup: Waymo's purpose-built Ojai started accepting riders in SF, LA and Phoenix while the existing fleet does 500,000+ paid rides a week, Waymo launched a $29.99/month Premier loyalty subscription (robotaxis have a frequent-flyer program now), and Tesla expanded unsupervised robotaxi to the whole Austin metro with, per fleet trackers, only ~20 cars serving 245 square miles. Sources: Waymo (Ojai), TechCrunch, Waymo (Premier), CleanTechnica, TechTimes
Roberto Stagi

More about macroeconomics and geopolitics

  • Bernie Sanders announced the American AI Sovereign Wealth Fund Act: a one-time 50% tax on the biggest AI companies, paid in stock, transferring half their equity to a public fund paying dividends to Americans. Sources: op-ed, Fortune
  • NVIDIA Q1 FY27: record $81.6B revenue (+85%), data center $75.2B, and zero China data-center sales (vs $4.6B a year ago). Stock fell anyway. Sources: press release, CNBC
  • Broadcom Q2: AI revenue $10.8B (+143%) with $30B+ of AI bookings in the quarter. Stock: -3% on a software miss. Sources: SEC 8-K, earnings call
  • Oracle Q4: RPO ballooned to $638B (+363%) on $67B of new AI contracts, against -$23.7B free cash flow and ~$70B of FY27 capex. Stock -10%. Sources: Oracle IR, CNBC
  • HBM4 supercycle: Samsung, SK Hynix and Micron all certified for Vera Rubin HBM4; Micron jumped 9.9% in a day as price targets hit $1,750. Sources: TechTimes
  • BIS closed the overseas-subsidiary loophole: export licenses now follow Chinese ownership, wherever the buying subsidiary sits. Beijing answered by certifying nine domestic AI chips for government procurement. Sources: Al Jazeera, Tom's Hardware
  • The "Great American AI Act": a 269-page bipartisan discussion draft trading a 3-year freeze on state AI-development laws for federal frontier-safety rules such as NIST-licensed audits, 15-day incident reporting and penalties up to $1M/day. Sources: Roll Call, FedScoop
  • China's $295B answer to Stargate: Bloomberg reports a 2 trillion yuan, five-year state plan for a nationwide network of interconnected AI datacenters, with at least 80% domestic technology. Sources: Bloomberg, The Decoder
Roberto Stagi
Random

Waymo Carded a Passenger

A rider's TikTok (~2M views): her Waymo paused mid-trip to ask, through the car speaker, "Are you over the age of 18?" In-cabin ML flags suspected minors, then a human agent patches in. a16z's Seema Amble: "Is this the new version of getting carded? Should I be flattered?" Privacy folks noted the cameras are "one court order away" from other uses. Sources: Jalopnik, Motor1

Roberto Stagi
Models

DeepSeek V4

DeepSeek just dropped V4 (preview) — two open-weights MoE models that push the frontier on cost-effective 1M-token context.

  • DeepSeek-V4-Pro: 1.6T total params (49B active), with flagship performance rivaling top closed models in reasoning, math, and agentic coding.
  • DeepSeek-V4-Flash: 284B total params (13B active), faster, cheaper, and highly efficient for everyday/agent tasks.

image.png

Both feature a new hybrid attention architecture (Compressed Sparse Attention + Heavily Compressed Attention) that makes million-token contexts dramatically more practical (much lower FLOPs and KV cache than V3). MIT license, available on Hugging Face (base + instruct), and live on the DeepSeek API today.

The community is already praising the efficiency gains, strong coding/agent results (e.g., high LiveCodeBench / SWE-Bench scores), and rock-bottom pricing — especially with the ongoing Pro discount.

Quick Highlights (as of early May 2026)

  • Release date: April 24, 2026 (preview)
  • Context: Native 1M tokens (with practical efficiency improvements for real agent/document workflows)
  • Reasoning modes: Non-think (fast), Think High, Think Max (deeper, higher quality on hard tasks) — all from the same weights
  • API pricing (highly competitive): Flash is extremely cheap; Pro has a big temporary discount (extended to ~May 31 in some updates) + major input cache price drop (1/10th)
  • Strengths: Coding/agentic tasks, long-context efficiency, price/performance. Text-only for now (multimodal planned later).
  • Availability: Chat at chat.deepseek.com (Expert/Instant modes), API (OpenAI/Anthropic compatible), open weights on HF/ModelScope.

Sources: Official announcement, Hugging Face collection, Tech Report, tweet discount extended

Federico Ulfo
Research

SakanaAI × Nvidia: Sparser, Faster, Lighter Transformer (TwELL)

Sakana AI & NVIDIA's ICML 2026 paper introduces TwELL — a new sparse format for LLM feedforward layers that achieves >95% unstructured sparsity (via ReLU + light L1) while staying fully compatible with fast GPU tiled matrix multiplies. Result: 20%+ faster inference/training, lower memory & energy use on billion-scale models, with open-source CUDA kernels. Minimal accuracy loss.
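The TwELL format itself is not described here, so the sketch below illustrates only the ingredient the summary mentions: ReLU produces exact zeros, an L1 term rewards more of them, and sparsity is the fraction of zero activations. The values and the helper names `relu`, `sparsity` and `l1_penalty` are hypothetical.

```python
def relu(values):
    """ReLU zeroes every negative pre-activation exactly."""
    return [max(0.0, v) for v in values]

def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

def l1_penalty(values, weight):
    """L1 regularization term added to the loss to encourage zeros."""
    return weight * sum(abs(v) for v in values)

# Hypothetical feedforward pre-activations for one token.
pre_activations = [-1.2, 0.4, -0.3, -2.0, 0.0, -0.7, 1.5, -0.1]
hidden = relu(pre_activations)

print(f"sparsity: {sparsity(hidden):.3f}")
print(f"L1 term:  {l1_penalty(hidden, weight=1e-4):.6f}")
```

The zeros land wherever the inputs happen to be negative, which is what "unstructured" means here; the hard part the paper addresses is exploiting that pattern without giving up tiled GPU matrix multiplies.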

Screenshot.png

Sources: tweet, blog, paper

Federico Ulfo
Models

Opus 4.6 Was Dumbed Down

Users noticed Opus 4.6 quality slipped during peak hours. Anthropic eventually acknowledged compute rationing — same pattern we covered in Part 1.

Claude 4.7

Sources: tweet

Federico Ulfo
Models

DS4 by Antirez

Salvatore Sanfilippo (Antirez, of Redis fame) dropped DS4, a narrow-bet inference engine that runs DeepSeek V4 Flash locally on Apple Silicon (Metal) and Linux (CUDA). Not a generic GGUF runner. It's DS4-Flash-specific, with an OpenAI/Anthropic-compatible server you can point Claude Code at. Two ideas worth stealing: a 2-bit quantization that actually works (only the routed MoE experts get quantized; shared experts and projections stay untouched), which runs the model on a 128GB MacBook Pro.

image.png

It calls tools reliably under coding agents. The second idea: treating the KV cache as a first-class disk citizen, hashed by SHA1 of the rendered prefix so stateless API clients reuse cached state across sessions and restarts. Antirez also says openly that DS4 was built with strong assistance from GPT-5.5, which is refreshingly honest about how high-end systems code gets written in 2026.
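The disk-cache idea can be sketched in a few lines. This is an illustration under assumptions, not DS4's actual on-disk format: the `.kv` file naming, the helper names (`cache_key`, `save_state`, `load_state`) and the placeholder bytes are all hypothetical, and the sketch handles only exact prefix matches.

```python
import hashlib
import os
import tempfile

def cache_key(rendered_prefix: str) -> str:
    """SHA1 hex digest of the rendered prompt prefix."""
    return hashlib.sha1(rendered_prefix.encode("utf-8")).hexdigest()

def save_state(cache_dir: str, rendered_prefix: str, state: bytes) -> str:
    """Write serialized KV state to a file named after the prefix hash."""
    path = os.path.join(cache_dir, cache_key(rendered_prefix) + ".kv")
    with open(path, "wb") as f:
        f.write(state)
    return path

def load_state(cache_dir: str, rendered_prefix: str):
    """Return cached state for this exact prefix, or None on a miss."""
    path = os.path.join(cache_dir, cache_key(rendered_prefix) + ".kv")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return f.read()

cache_dir = tempfile.mkdtemp()
prefix = "<system>You are a coding agent.</system><user>List files</user>"
save_state(cache_dir, prefix, b"placeholder for serialized KV tensors")

hit = load_state(cache_dir, prefix)         # same prefix, any client
miss = load_state(cache_dir, prefix + "!")  # different prefix
print(hit is not None, miss is None)
```

Because the key depends only on the rendered text, a stateless client that resends the same prefix finds the same file after a restart, with no session ID needed.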

Sources: github, @antirez, tweet

Federico Ulfo

Fiber optics cable cost 8x up

Fiber optics are still used on the battlefield, although not as much as before. They're extremely pricey now: we used to buy a 50km spool for $300, now it's easily $2,500. At least one positive second-order effect of the war in the Middle East: it's making the war in Ukraine more expensive.

Sources: tweet

Federico Ulfo