SFT Memorizes, RL Generalizes
DeepSeek has shown the power of Reinforcement Learning (RL) without Supervised Fine-Tuning (SFT). What does RL learn differently from SFT? As the title says: SFT memorizes, RL generalizes.
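A toy example makes the difference in training signal concrete. The sketch below assumes the "policy" is just a softmax over four candidate answers; `softmax`, `sft_grad`, and `rl_grad` are hypothetical names used only for illustration. SFT minimizes the negative log-probability of one demonstrated answer. The RL loss here is the negative expected reward, and its policy gradient is computed exactly instead of being estimated from samples REINFORCE-style. This is plain policy gradient, not DeepSeek's actual training algorithm, and it does not by itself demonstrate the memorize/generalize finding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_grad(logits, target):
    """Gradient of the SFT loss -log p(target) w.r.t. the logits: p - onehot(target)."""
    p = softmax(logits)
    return [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]


def rl_grad(logits, rewards):
    """Exact gradient of the RL loss -E_{y~p}[r(y)] w.r.t. the logits.

    For a softmax policy, dE[r]/dz_k = p_k * (r_k - E[r]). A REINFORCE-style
    method would estimate this from sampled answers; here it is computed exactly.
    """
    p = softmax(logits)
    baseline = sum(pk * rk for pk, rk in zip(p, rewards))  # expected reward E[r]
    return [-pk * (rk - baseline) for pk, rk in zip(p, rewards)]


# Answers 0 and 2 are both correct; the SFT demonstration happens to be answer 0.
print(sft_grad([0, 0, 0, 0], 0))             # [-0.75, 0.25, 0.25, 0.25]
print(rl_grad([0, 0, 0, 0], [1, 0, 1, 0]))   # [-0.125, 0.125, -0.125, 0.125]
```

A gradient-descent step moves logits against the gradient, so the SFT update raises only the demonstrated answer and pushes down the equally correct answer 2, while the RL update raises every answer that earns reward. SFT imitates a specific target; RL rewards any outcome that scores well.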
As AIs Get Smarter, They Develop Coherent Value Systems
As AIs get smarter, they develop their own coherent value systems.
For example, they value human lives in the order Pakistan > India > China > US.
These are not just random biases but internally consistent values that shape their behavior, with many implications for AI alignment.
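One simple way to make "internally consistent" concrete is transitivity of pairwise preferences: if a model prefers A to B and B to C, it should also prefer A to C, and a transitive, complete set of strict pairwise choices defines a single ranking. The sketch below checks that property and recovers the ranking by counting wins. It is a simplified illustration, not the method used in the research reported here; the functions `is_transitive` and `ranking` are hypothetical, and the pairwise data is hypothetical, built only to match the ordering quoted above.

```python
from itertools import permutations


def is_transitive(items, prefers):
    """Return True if no ordered triple violates transitivity.

    prefers(a, b) is True when a is strictly preferred to b.
    """
    for a, b, c in permutations(items, 3):
        if prefers(a, b) and prefers(b, c) and not prefers(a, c):
            return False
    return True


def ranking(items, prefers):
    """Sort items by pairwise wins; this is the preference order when choices are transitive and complete."""
    wins = {x: sum(prefers(x, y) for y in items if y != x) for x in items}
    return sorted(items, key=lambda x: wins[x], reverse=True)


# Hypothetical pairwise choices, constructed to match the ordering quoted above.
ORDER = ["Pakistan", "India", "China", "US"]
CHOICES = {(a, b) for i, a in enumerate(ORDER) for b in ORDER[i + 1:]}


def prefers(a, b):
    return (a, b) in CHOICES


print(is_transitive(ORDER, prefers))                            # True
print(ranking(["US", "China", "Pakistan", "India"], prefers))   # ['Pakistan', 'India', 'China', 'US']

# An incoherent, cyclic set of choices: A > B, B > C, but C > A.
CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}
print(is_transitive(["A", "B", "C"], lambda a, b: (a, b) in CYCLE))  # False
```

Random biases would tend to produce cycles like the last example; coherent values show up as choices that survive the transitivity check and collapse into one ranking.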
Humanity's Last Exam Dataset Released
Humanity's Last Exam is a dataset of 3,000 questions with known, verifiable answers, developed with hundreds of subject-matter experts to capture the frontier of human knowledge and reasoning.
New Distributed Training Paper from Google DeepMind