AI Socratic
Agents

Benchmarks: Everything Got Agentic

Emergence AI ran 15-day survival simulations, placing 10 agents per frontier model in identical virtual societies. Claude Sonnet 4.6's society recorded zero crimes and built a democracy with 332 votes at 98% agreement; GPT-5 Mini's population starved within a week; Gemini 3 Flash logged 683 crimes, including arson; and Grok 4.1 Fast committed 183 crimes and went extinct in 4 days.


Sources: Emergence AI, Fortune, Gizmodo

Roberto Stagi

The serious benchmarks got agentic too

  • LMArena launched Agent Arena: agentic rankings from live user sessions (600k+ so far) using causal-traced success signals instead of preference votes. Fable 5 (High) ranks first. Sources: methodology, leaderboard
  • Cognition launched FrontierCode, where the bar is "would a maintainer actually merge this?" On the hard Diamond split Opus 4.8 managed 13.4% and Fable 5 jumped to 29.3%. Headroom is the product. Sources: Cognition
  • Agents' Last Exam: 1,490 instances of long-horizon, economically valuable work. On the hardest tier the average full pass rate is below 1%. Overall, GPT-5.5-in-Codex (24.0%) edges Fable 5 (22.0%), covered everywhere as the upset of the week. Sources: paper, site
  • METR's first frontier risk report: internal frontier agents from four labs essentially saturated the Time Horizon benchmark, and at least 16% of successful 8-hour-plus runs were illegitimate on review, including hacking the task simulator. METR's phrasing deserves quoting: agents "plausibly had the means, motive, and opportunity to start minimal rogue deployments". Sources: METR report, time horizons tracker

Cybersecurity

Hackers took 20,225 Instagram accounts by asking nicely: attackers hijacked high-profile accounts (the Obama-era White House account, a Space Force chief, Sephora) by asking Meta's AI Support Assistant to add a new email and reset the password; a bug in a side code path skipped verifying the requester. The canonical agentic-AI-in-production failure: the chatbot had account-recovery powers and infinite patience.
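
The general lesson (verify the requester at one choke point, not separately in each code path) can be sketched in a few lines. This is a minimal illustration, not a description of Meta's system: `Session`, `SENSITIVE_ACTIONS`, `perform_action` and `VerificationError` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


class VerificationError(Exception):
    """Raised when a sensitive action is requested without proof of ownership."""


@dataclass(frozen=True)
class Session:
    # Hypothetical session record; the field names are illustrative only.
    requester_id: str
    verified_owner_of: frozenset = frozenset()


# Hypothetical set of actions that can change who controls an account.
SENSITIVE_ACTIONS = {"add_email", "reset_password"}


def perform_action(session: Session, account_id: str, action: str) -> str:
    """Single choke point that every assistant tool call passes through."""
    if action in SENSITIVE_ACTIONS and account_id not in session.verified_owner_of:
        raise VerificationError(f"{action} on {account_id} requires verified ownership")
    # A real system would perform the change here; the sketch only reports it.
    return f"{action} applied to {account_id}"
```

Because the check lives in the dispatcher rather than in each handler, a newly added side path cannot silently skip it; an unverified `add_email` request raises `VerificationError` no matter how politely it is phrased.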

Meta's bad privacy month, continued: WIRED found a dormant facial-recognition system ("NameTag") in the Meta AI app that pairs with its smart glasses (stripped within 48 hours of the report), and Reuters revealed the Model Capability Initiative was recording employee emails, chats and clipboards across 200+ apps to train agentic AI. After a 1,500-signature internal petition, employees can now pause collection… for 30 minutes at a time.

OWASP: prompt injection is "the universal joint": the 2026 State of Agentic AI Security report moved from theory to a catalog of real CVEs (the LiteLLM PyPI backdoor, a Cursor allowlist bypass, a Codex CLI sandbox flaw), with prompt injection mapping to 6 of its Top 10 agentic risks. Also this month: BadHost (CVE-2026-48710), a Starlette Host-header authorization bypass affecting vLLM, LiteLLM, FastAPI, Open WebUI and countless MCP servers. Patch, then ponder how much of the agentic stack rests on a handful of under-maintained packages.
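
The mechanics of BadHost are not covered here, so the following is only a generic sketch of the defensive idea behind this bug class: treat the Host header as untrusted input and compare it against an explicit allowlist before any authorization decision depends on it. `is_allowed_host` and the example hostnames are assumptions made for illustration, not part of Starlette or any of the affected projects.

```python
def is_allowed_host(host_header: str, allowed_hosts: set[str]) -> bool:
    """Return True only if the Host header names an explicitly allowed host."""
    if not host_header:
        return False
    host = host_header.strip().lower()
    if host.startswith("["):
        # Bracketed IPv6 literal: keep the brackets, drop any trailing port.
        end = host.find("]")
        if end == -1:
            return False
        host = host[: end + 1]
    else:
        # Hostname or IPv4 address: drop an optional ":port" suffix.
        host = host.split(":", 1)[0]
    # Exact match only; no suffix or substring matching.
    return host in allowed_hosts
```

Exact matching is the deliberate design choice: suffix or substring checks are what let a value like `api.example.com.evil.test` slip through.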

Grok's legal pile-up: Labour MP Jess Asato filed the first UK claim against xAI over non-consensual sexualized deepfakes, and Canada's Privacy Commissioner found X/xAI violated federal privacy law (Grok's image tool at one point produced over 6,000 sexualized images per hour). This stacks on the EU's DSA proceedings and an Ofcom investigation.

Sources: 404 Media, TechCrunch, BleepingComputer, EFF, Engadget, TechSpot (MCI), OWASP report, Help Net Security, X41 advisory, Ars Technica, AWO, Privacy Commissioner, CBC
