Emerging Reasoning with Reinforcement Learning Is Both Effective and Efficient

hkust-nlp.notion.site

4 points by greenflag a year ago · 1 comment

jedharris a year ago

They replicate the DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data. They show that long Chain-of-Thought (CoT) and self-reflection can emerge on a 7B model with only 8K MATH examples, and they achieve surprisingly strong results on complex mathematical reasoning. They fully open-source their training code and details to the community to inspire more work on reasoning.
