
We trained Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 data pipeline. The model outperforms Sky-T1 and o1-preview on math and code reasoning benchmarks, and nearly matches DeepSeek-R1-Distill-Qwen-32B while being trained on 47x fewer examples:
| Benchmark | Bespoke-Stratos-32B | Sky-T1-32B | o1-preview | DeepSeek-R1 (reported) | DeepSeek-R1-Distill-Qwen-32B (ours / reported) |
|---|---|---|---|---|---|
| AIME2024 | 63.3 | 43.3 | 40.0 | 79.8 | 66.7 / 72.6 |
| MATH500 | 93.0 | 82.4 | 81.4 | 97.3 | 89.8 / 94.3 |
| GPQA-Diamond | 58.1 | 56.8 | 75.2 | 71.5 | 61.1 / 62.1 |
| LiveCodeBench v2 Easy | 96.7 | 86.3 | 92.9 | - | 91.2 / - |
| LiveCodeBench v2 Medium | 75.2 | 56.8 | 54.9 | - | 75.7 / - |
| LiveCodeBench v2 Hard | 26.2 | 17.9 | 16.3 | - | 38.2 / - |
| LiveCodeBench v2 All | 71.1 | 57.9 | 59.1 | - | 72.2 / - |
We open-source everything, including the reasoning dataset, to continue experimenting together with the community!
Please also refer to Sky-T1’s codebase for the training and evaluation code.
Data Curation
We used Bespoke Curator to create the synthetic reasoning dataset. We ported the Sky-T1 data pipeline into Curator, making it faster and fault-tolerant, which let us generate the reasoning dataset with DeepSeek-R1 in 1.5 hours.
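For illustration, here is a minimal sketch of the trace-generation step using a plain OpenAI-compatible client rather than the actual Curator pipeline (Curator adds the batching, caching, and fault tolerance on top of calls like this). The endpoint URL, model name, and `reasoning_content` field are assumptions about an R1-serving provider, not confirmed details of our setup.

```python
# Minimal sketch of trace generation (not the actual Curator pipeline).
# Assumptions: an OpenAI-compatible endpoint serving DeepSeek-R1, the model name
# "deepseek-reasoner", and a `reasoning_content` field on the response message.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def generate_trace(problem: str) -> dict:
    """Ask the teacher model for a reasoning trace plus a final solution."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": problem}],
    )
    message = response.choices[0].message
    return {
        "problem": problem,
        "reasoning": getattr(message, "reasoning_content", None),  # chain of thought, if exposed
        "solution": message.content,                                # final answer
    }

traces = [generate_trace(p) for p in ["Prove that sqrt(2) is irrational."]]
```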
Rejection sampling involves filtering out reasoning traces whose solutions are incorrect. This is challenging for code verification, which we sped up using a Ray cluster. We are currently integrating a code execution verifier directly into Curator, so stay tuned.
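As a rough sketch of how code verification can be parallelized with Ray, the snippet below runs each candidate solution against its tests in a sandboxed subprocess and keeps only passing traces. The test-harness format, timeout, and field names are illustrative assumptions, not our exact setup.

```python
# Sketch of rejection sampling for code problems, parallelized with Ray.
import subprocess
import tempfile

import ray

ray.init()

@ray.remote
def passes_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run the candidate solution plus its tests in a subprocess and report pass/fail."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_code_traces(traces):
    """Keep only traces whose final solution passes its tests (assumed 'solution'/'tests' keys)."""
    futures = [passes_tests.remote(t["solution"], t["tests"]) for t in traces]
    verdicts = ray.get(futures)
    return [t for t, ok in zip(traces, verdicts) if ok]
```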
We followed the same recipe as Sky-T1, with the following differences:
- We used DeepSeek-R1 as the teacher reasoning model instead of QwQ.
- The Sky-T1 recipe used gpt-4o-mini to reformat QwQ’s traces, whereas we did not reformat DeepSeek-R1’s. We found that DeepSeek-R1’s reasoning traces were sufficiently well-formatted and coherent for parsing and fine-tuning even without an intermediate reformatting step.
- We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions (see the sketch after this list). We found that Sky-T1’s parsing logic, which relies on regex and sympy, often fails to extract the correct answer from a solution and thus tends to filter out solutions that were actually correct (an issue also reported here). Using gpt-4o-mini reduced the number of false negatives, increasing the share of retained correct solutions from 25% to 73%.
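A minimal sketch of this LLM-based math filter is below: instead of extracting the answer with regex/sympy, we ask gpt-4o-mini whether the candidate solution’s final answer matches the ground truth. The judge prompt and the YES/NO protocol are illustrative assumptions, not the exact prompt used in the pipeline.

```python
# Sketch of an LLM-as-judge filter for math solutions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking a math solution.
Ground-truth answer: {gold}
Candidate solution (final part): {candidate}
Do they give the same final answer? Reply with exactly YES or NO."""

def is_correct(candidate_solution: str, gold_answer: str) -> bool:
    """Return True if gpt-4o-mini judges the candidate's final answer equivalent to the gold answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            # Only the tail of the solution is sent, to keep the judge call short.
            "content": JUDGE_PROMPT.format(gold=gold_answer, candidate=candidate_solution[-2000:]),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```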
7B model
We also release Bespoke-Stratos-7B, a fine-tune of Qwen2.5-7B-Instruct.
| Benchmark | Bespoke-Stratos-7B | Qwen2.5-7B-Instruct | DeepSeek-R1-Distill-Qwen-7B (ours / reported) |
|---|---|---|---|
| AIME2024 | 20.0 | 10.0 | 43.3 / 55.5 |
| MATH500 | 82.0 | 74.2 | 89.4 / 92.8 |
| GPQA-Diamond | 37.8 | 33.3 | 44.9 / 49.1 |
| LiveCodeBench v2 Easy | 71.4 | 65.9 | 81.3 / - |
| LiveCodeBench v2 Medium | 25.5 | 18.9 | 42.2 / - |
| LiveCodeBench v2 Hard | 1.6 | 3.3 | 2.4 / - |
| LiveCodeBench v2 All | 36.1 | 31.9 | 46.6 / - |
The authors of Sky-T1 noted that they saw little or no improvement when training 7B or 14B models with their data.
With only 17k examples, we find that distillation is effective even at the 7B scale, possibly due to the higher quality of the data. For comparison, DeepSeek-R1-Distill-Qwen-7B used 800k examples.
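For reference, a minimal supervised fine-tuning sketch with Hugging Face TRL is below. It assumes the open-sourced dataset is available as bespokelabs/Bespoke-Stratos-17k on Hugging Face in a ShareGPT-style schema; the column mapping and hyperparameters are illustrative assumptions, and Sky-T1’s codebase (referenced above) contains the actual training setup.

```python
# Minimal SFT sketch with TRL; dataset schema and hyperparameters are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")  # ~17k traces

def to_messages(example):
    # Convert assumed ShareGPT-style turns ({"from": ..., "value": ...}) into the
    # {"role": ..., "content": ...} chat format that SFTTrainer consumes.
    roles = {"system": "system", "human": "user", "user": "user", "gpt": "assistant"}
    messages = [{"role": "system", "content": example["system"]}] if example.get("system") else []
    messages += [{"role": roles.get(t["from"], "user"), "content": t["value"]}
                 for t in example["conversations"]]
    return {"messages": messages}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="bespoke-stratos-7b",
        num_train_epochs=3,              # illustrative hyperparameters, not the exact recipe
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```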
Thoughts and future work
We are pleasantly surprised by the results, but acknowledge that benchmarks tell only one side of the story. We invite the community to try out the models and evaluate them on other benchmarks so we can figure out what to improve.
There are many open questions we would like to explore. For example, what is the Pareto frontier between student model size and the number of SFT examples?
Beyond this work, we are excited about what reasoning distillation unlocks. We are building Curator to democratize the creation of powerful reasoning models and agents by enterprises and developers.
Citation
@misc{bespoke_stratos,
  author = {Bespoke Labs},
  title = {Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation},
  year = {2025},
  howpublished = {www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation},
  note = {Accessed: 2025-01-22}
}
Acknowledgement
We are standing on the shoulders of giants. Bespoke Labs would like to thank the Berkeley Sky Computing Lab for their work on Sky-T1 and for releasing the code and data, DeepSeek for releasing the DeepSeek-R1 model, and the DataComp community for insightful discussions.