AI Socratic

This paper received the top score at NeurIPS 2025. It asks: does RL make LLMs better reasoners?

The authors study Reinforcement Learning with Verifiable Rewards (RLVR) and find that while it improves accuracy at small k (only a few sampled attempts per problem), it doesn't create new reasoning patterns, meaning the base model still determines the upper limit of reasoning ability.
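
To make "small k" concrete: assuming k here refers to the pass@k metric (a problem counts as solved if at least one of k sampled answers is correct), a rough sketch of the widely used unbiased pass@k estimator could look like the following. The function names and the per-problem counts are hypothetical, chosen only to illustrate the pattern, and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count for each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy data: two problems, 100 samples each.
# The "base" model rarely solves either problem but can solve both;
# the "rl" model solves the first reliably and never solves the second.
N_SAMPLES = 100
base_correct = [20, 2]
rl_correct = [90, 0]

for k in (1, 64):
    base_score = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl_score = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"k={k:>2}  base={base_score:.3f}  rl={rl_score:.3f}")
```

In this toy setup the "rl" counts win at k=1 while the "base" counts win at k=64, which is one way a model can look better at small k without covering any problem the base model could not already solve.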

Interestingly, it’s distillation, not RL, that shows genuine signs of emergent reasoning 😮.

link: x.com/jiqizhixin/status/1987710546674856051
web: limit-of-rlvr.github.io
