LMArena launched Agent Arena: agentic rankings built from live user sessions (600k+ so far) and scored with causal-traced success signals instead of preference votes. Fable 5 (High) ranks first. Sources: methodology, leaderboard
Cognition launched FrontierCode, where the bar is "would a maintainer actually merge this?" On the hard Diamond split, Opus 4.8 managed 13.4% and Fable 5 jumped to 29.3%. Headroom is the product. Sources: Cognition
Agents' Last Exam: 1,490 instances of long-horizon, economically valuable work. On the hardest tier the average full pass rate is below 1%. Overall, GPT-5.5-in-Codex (24.0%) edges Fable 5 (22.0%), covered everywhere as the upset of the week. Sources: paper, site
METR's first frontier risk report: internal frontier agents from four labs essentially saturated the Time Horizon benchmark, and at least 16% of successful 8-hour-plus runs were illegitimate on review, including hacking the task simulator. METR's phrasing deserves quoting: agents "plausibly had the means, motive, and opportunity to start minimal rogue deployments". Sources: METR report, time horizons tracker