Understanding R1-Zero-Like Training: A Critical Perspective

github.com

160 points by pama 2 months ago


mentalgear - 2 months ago

Overall the industry needs more review, less hype. I was shocked to find out SWE-bench Verified [0] is anything but verified.

[0] The benchmark used by all major vendors to "showcase" coding ability turns out to be <10% properly solved: https://www.youtube.com/watch?v=QnOc_kKKuac

drakenot - 2 months ago

I've seen the same "Superficial Self-Reflection" mentioned in their linked blog post[0] as well, where the conclusion doesn't naturally follow the output of the thinking tokens. I think people are fooled by this, but if you take the time to inspect the "chain of thought" tokens they often don't match the final output answer.

I don't deny that performance for certain logic tasks goes up with these models but I don't fully understand what role the thinking tokens take in these cases.

[0] https://oatllm.notion.site/oat-zero

scribu - 2 months ago

If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.

I love this sort of “anti-hype” research. We need more of it.

fancyfredbot - 2 months ago

The article starts by saying

"DeepSeek-V3-Base already exhibit 'Aha moment'."

I tried to read the screenshot they present as evidence of this, and indeed it does say "Aha!". But both the preceding reasoning and the following conclusion look like gibberish to me. I'm not sure what we're supposed to conclude here and I gave up reading the article after this inauspicious start.

mirekrusin - 2 months ago

So they achieved R1-Zero-like performance without those long CoTs that sometimes never end or hurt inference time, and with a fraction of the fine-tuning resources?

ilrwbwrkhv - a month ago

I mean anybody who understands the maths knows that there is no real reasoning from these models. The only ones hyping this up are VC bros who want the return on their investment money.
