Will Scaling Reasoning Models Like o3 and R1 Unlock Superhuman Reasoning?

I asked: As we scale up training and inference compute for reasoning models, will they show: A) Strong general logical reasoning skills that work across most logically-bound tasks, B) Some generalization to other logic tasks, but perhaps requiring domain-specific retraining, or C) Minimal generalization?

Person Opinion Affiliations

Finbarr Timbers

Ex-DeepMind

Artfintel DeepMind

We'll achieve superhuman performance on specific tasks with verifiable rewards. I see no evidence for general transfer, but it seems extremely plausible.

Gwern Branwen

Independent Researcher

Gwern

"Everyone neglects to ask, what are we scaling?" Depends on what data they scale up on. The more you scale up on a few domains like coding, the less I expect transfer, as they become ultra-specialized.

Jacob Buckman

Founder of Manifest AI

Manifest AI Google Brain

Generalization can't really be predicted like that except empirically. All I know is as you add more compute and data you go from minimal transfer to some transfer to broad transfer. I have no clue where on that spectrum we stand when we run out of compute or data.

Karthik Narasimhan

Co-author of GPT paper

Sierra, OpenAI

I expect some generalization with domain-specific retraining.

Near

Independent

I think the "spikiness" of intelligence will continue to be notable (models which are extremely good at some things yet quite 'dumb' at others), but that it is easy to improve generalization in the areas we care about, since it just requires some data/RL fun.

Nathan Lambert

Post-Training Lead at Allen AI

Allen AI

New models trained to reason heavily about every subject will come to have better average performance than standard autoregression. In domains with explicit verifiers, this performance will be superhuman; in domains without, reasoning will still enable better performance, but maybe not more economical performance.

Pengfei Liu

Shanghai Jiao Tong University

SJTU

Increased compute and inference time will drive reasoning capabilities to expert-level performance where rich feedback loops exist. However, the development of general reasoning will be gated by two factors: the availability of problems requiring genuine deep thinking, and access to high-quality expert cognitive process data or well-defined reward signals.

Ross Taylor

Led reasoning at Meta AI

Meta AI

I think general reasoning will come fairly quickly. Right now it's easier to scale in domains where problems are easy to verify with an external signal. The generalisation will come if models themselves become good verifiers across domains.

Shannon Sands

Nous Research

Nous Research

There's at least some generalisation to other tasks like logic puzzles, but it might require more domain-specific training to improve on many more out-of-domain tasks.

Steve Newman

Co-founder of Google Docs

AI Soup Google Docs

This is a trillion-dollar question. If I had to guess: we'll see some transfer of reasoning skills across domains, but (on anything resembling current architectures) some specialized training will be needed in each domain. We'll learn a lot one way or another this year.

Tamay Besiroglu

Co-founder of Epoch AI

Epoch AI

I think minimal transfer is wrong because reasoning is a very general skill that you can apply to perform a wide range of actions. Planning, for instance, is something that requires good reasoning.

Teortaxes

Independent

I think there will be a period of strong 'natively verifiable reasoning overhang' which translates to more general verifiers using models' strong coding ability and general knowledge+tools, then they grok more general regularities of sound reasoning, and the next generation can natively generate good reasoning data for all domains.

Xeophon

Independent

We will see some generalization to other domains that the model was not explicitly trained on. For example, R1 writes better and more creative stories than V3, the model it is based on. To push this further, models need to be trained on more data in other domains.

Chris Barber

Creator of this

Independent

Synthesis: The expert takes point to generalization, for now, across all logically bound domains where we can construct verifiers, with a trend toward broad transfer in the future. More notes from experts: @chrisbarber.

Follow-up Questions

Tamay suggested a good follow-up would be to "elicit ideas for experiments that they would expect to turn out one way conditional on 'Weak transfer' and another if 'Strong transfer' is correct". Let me know if you have ideas.

Questions

What follow-up questions should I ask?
What's the even more important question to ask?

To answer, tag or DM me at @chrisbarber or email me at [email protected]

Thank you to Amir Haghighat, Arun Rao, Ash Bhat, Avery Lamp, Charlie Songhurst, Connor Mann, Daniel Kang, Dhruv Singh, Eric Jang, Ethan Beal-Brown, Finbarr Timbers, Flo Crivello, Griffin Choe, Gwern, Herrick Fang, Jacob Buckman, James Betker, Jay Hack, Josh Singer, Julian Michael, Katja Grace, Karthik Narasimhan, Logan Graham, Matt Figdore, Mike Choi, Nathan Lambert, Nicholas Carlini, Nitish Kulkarni, Pengfei Liu, Rick Barber, Robert Nishihara, Robert Wachen, Rohit Krishnan, Ron Bhattacharyay, Ross Taylor, Shannon Sands, Spencer Greenberg, Steve Newman, Tamay Besiroglu, Teknium, Teortaxes, Tim Shi, Tim Wee, Tyler Cowen, and Xeophon.