The Data Black Hole at the Center of AI

Federico Ulfo

June 23, 2026 · 3 min read

Dwarkesh Patel published this crisp 12-minute video on the current state of LLM scaling: The Data Black Hole at the Center of AI. Totally worth watching.

His core idea is sample efficiency: how much data a system needs before it can operate competently in a domain. He argues that training sample efficiency hasn't improved dramatically in recent years. What has improved instead is the breadth and quality of the data distribution. He frames RL as a synthetic data generation process and explains how much bespoke, expert-generated data sits behind frontier models, which is one reason the data-labeling industry has become a multi-billion-dollar market.

A human absorbs roughly 200 million words from birth to adulthood, while frontier models train on tens to hundreds of trillions of tokens, a difference approaching a millionfold. Likewise, a teenager can learn to drive in around 20 hours. Patel walks through the common objections (evolution as a form of "pretraining," multimodal learning, scaling laws) and explains why he finds them insufficient.
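To make the size of that gap concrete, here is a rough back-of-the-envelope sketch. The 200 million figure comes from the paragraph above; the three corpus sizes are illustrative points I picked from the "tens to hundreds of trillions" range, and treating one token as roughly one word is a simplifying assumption, not a claim from the video.

```python
# Back-of-the-envelope comparison of human vs. model data exposure.
# Assumptions (not from the video): the corpus sizes below are illustrative
# points in the "tens to hundreds of trillions" range, and one token is
# treated as roughly one word.

HUMAN_WORDS = 200_000_000  # ~200 million words by adulthood (figure quoted in the text)

TRILLION = 10**12
corpus_sizes = {
    "10T": 10 * TRILLION,
    "100T": 100 * TRILLION,
    "200T": 200 * TRILLION,
}


def data_ratio(model_tokens: int, human_words: int = HUMAN_WORDS) -> float:
    """Return how many times more data the model sees than the human."""
    return model_tokens / human_words


ratios = {label: data_ratio(tokens) for label, tokens in corpus_sizes.items()}

for label, ratio in ratios.items():
    print(f"{label} tokens: about {ratio:,.0f}x a human's lifetime of words")
```

Under these assumptions the gap runs from about 50,000x at 10 trillion tokens to 1,000,000x at 200 trillion, which is why "approaching a millionfold" only holds toward the upper end of the range.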

A few thoughts

The strongest argument is the one about scaling laws. Even with unlimited parameters, scaling may reduce data requirements by only about an order of magnitude, while humans appear thousands to millions of times more sample efficient. If true, humans may be operating on a fundamentally different curve than current AI systems, which is a much bigger challenge to the "just scale it" thesis than the industry often acknowledges.

Another interesting observation: open models are often surprisingly close to closed models because capabilities can be distilled through APIs. Data and behavior transfer more easily than proprietary architectural breakthroughs.

The biggest open question is whether sample-inefficient AI systems can eventually solve the sample-efficiency problem themselves. Can today's AI systems accelerate AI research enough to improve the next generation of models?

My view: current models are not autonomously self-improving, but humans using AI systems as research assistants are already accelerating experimentation, coding, analysis, and iteration. The interesting question is how far this feedback loop can go.

Again, this is one of the crispest pieces I've seen on the future of AI scaling. Worth spending 12 minutes on!

About the Author

Federico Ulfo

Founder, Engineer

AI Socratic

Founder of AI Socratic

New York City
