
We trained Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 data pipeline. The model outperforms Sky-T1 and o1-preview on math and code reasoning benchmarks, and nearly matches DeepSeek-R1-Distill-Qwen-32B while being trained on 47x fewer examples:
| Benchmark | Bespoke-Stratos-32B | Sky-T1-32B | o1-preview | DeepSeek-R1 (reported) | DeepSeek-R1-Distill-Qwen-32B (ours / reported) |
|---|---|---|---|---|---|
| AIME2024 | 63.3 | 43.3 | 40.0 | 79.8 | 66.7 / 72.6 |
| MATH500 | 93.0 | 82.4 | 81.4 | 97.3 | 89.8 / 94.3 |
| GPQA-Diamond | 58.1 | 56.8 | 75.2 | 71.5 | 61.1 / 62.1 |
| LiveCodeBench v2 Easy | 96.7 | 86.3 | 92.9 | - | 91.2 / - |
| LiveCodeBench v2 Medium | 75.2 | 56.8 | 54.9 | - | 75.7 / - |
| LiveCodeBench v2 Hard | 26.2 | 17.9 | 16.3 | - | 38.2 / - |
| LiveCodeBench v2 All | 71.1 | 57.9 | 59.1 | - | 72.2 / - |
We open-source everything, including the reasoning dataset, to continue experimenting together with the community!
Please also refer to Sky-T1’s codebase for the training and evaluation code.
Data Curation
We used Bespoke Curator to create the synthetic reasoning dataset. We ported the Sky-T1 data pipeline into Curator, making it faster and fault-tolerant, which let us generate the reasoning dataset with DeepSeek-R1 in 1.5 hours.
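For illustration, here is a minimal sketch of the trace-generation step using a plain OpenAI-compatible client rather than the actual Curator pipeline (Curator adds the batching, caching, and fault tolerance on top of calls like this). The endpoint URL, model name, and `reasoning_content` field are assumptions about an R1-serving provider, not confirmed details of our setup.

```python
# Minimal sketch of trace generation (not the actual Curator pipeline).
# Assumptions: an OpenAI-compatible endpoint serving DeepSeek-R1, the model name
# "deepseek-reasoner", and a `reasoning_content` field on the response message.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def generate_trace(problem: str) -> dict:
    """Ask the teacher model for a reasoning trace plus a final solution."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": problem}],
    )
    message = response.choices[0].message
    return {
        "problem": problem,
        "reasoning": getattr(message, "reasoning_content", None),  # chain of thought, if exposed
        "solution": message.content,                                # final answer
    }

traces = [generate_trace(p) for p in ["Prove that sqrt(2) is irrational."]]
```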
Rejection sampling involves filtering out reasoning traces whose solutions are incorrect. This is challenging for code verification, which we sped up using a Ray cluster. We are currently integrating a code execution verifier directly into Curator, so stay tuned.
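As a rough sketch of how code verification can be parallelized with Ray, the snippet below runs each candidate solution against its tests in a sandboxed subprocess and keeps only passing traces. The test-harness format, timeout, and field names are illustrative assumptions, not our exact setup.

```python
# Sketch of rejection sampling for code problems, parallelized with Ray.
import subprocess
import tempfile

import ray

ray.init()

@ray.remote
def passes_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run the candidate solution plus its tests in a subprocess and report pass/fail."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_code_traces(traces):
    """Keep only traces whose final solution passes its tests (assumed 'solution'/'tests' keys)."""
    futures = [passes_tests.remote(t["solution"], t["tests"]) for t in traces]
    verdicts = ray.get(futures)
    return [t for t, ok in zip(traces, verdicts) if ok]
```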
We followed the same recipe as Sky-T1, with the following differences:
- We used DeepSeek-R1 as the teacher reasoning model instead of QwQ.
- The Sky-T1 recipe used gpt-4o-mini to reformat QwQ’s traces, whereas we did not reformat DeepSeek-R1’s. We found that DeepSeek-R1’s reasoning traces were sufficiently well-formatted and coherent for parsing and fine-tuning even without an intermediate reformatting step.
- We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions (see the sketch after this list). We found that Sky-T1’s parsing logic, which relies on regex and sympy, often fails to extract the correct answer from a solution and thus tends to filter out solutions that were actually correct (an issue also reported here). Using gpt-4o-mini reduced the number of false negatives, increasing the share of retained correct solutions from 25% to 73%.
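A minimal sketch of this LLM-based math filter is below: instead of extracting the answer with regex/sympy, we ask gpt-4o-mini whether the candidate solution’s final answer matches the ground truth. The judge prompt and the YES/NO protocol are illustrative assumptions, not the exact prompt used in the pipeline.

```python
# Sketch of an LLM-as-judge filter for math solutions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking a math solution.
Ground-truth answer: {gold}
Candidate solution (final part): {candidate}
Do they give the same final answer? Reply with exactly YES or NO."""

def is_correct(candidate_solution: str, gold_answer: str) -> bool:
    """Return True if gpt-4o-mini judges the candidate's final answer equivalent to the gold answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            # Only the tail of the solution is sent, to keep the judge call short.
            "content": JUDGE_PROMPT.format(gold=gold_answer, candidate=candidate_solution[-2000:]),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```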
7B model
We also release Bespoke-Stratos-7B, a fine-tune of Qwen2.5-7B-Instruct.
| Benchmark | Bespoke-Stratos-7B | Qwen2.5-7B-Instruct | DeepSeek-R1-Distill-Qwen-7B (ours / reported) |
|---|---|---|---|
| AIME2024 | 20.0 | 10.0 | 43.3 / 55.5 |
| MATH500 | 82.0 | 74.2 | 89.4 / 92.8 |
| GPQA-Diamond | 37.8 | 33.3 | 44.9 / 49.1 |
| LiveCodeBench v2 Easy | 71.4 | 65.9 | 81.3 / - |
| LiveCodeBench v2 Medium | 25.5 | 18.9 | 42.2 / - |
| LiveCodeBench v2 Hard | 1.6 | 3.3 | 2.4 / - |
| LiveCodeBench v2 All | 36.1 | 31.9 | 46.6 / - |
The authors of Sky-T1 noted that they saw little or no improvement when training 7B or 14B models with their data.
With only 17k examples, we find that distillation is effective even at the 7B scale, possibly due to the higher quality of the data. For comparison, DeepSeek-R1-Distill-Qwen-7B used 800k examples.
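For reference, a minimal supervised fine-tuning sketch with Hugging Face TRL is below. It assumes the open-sourced dataset is available as bespokelabs/Bespoke-Stratos-17k on Hugging Face in a ShareGPT-style schema; the column mapping and hyperparameters are illustrative assumptions, and Sky-T1’s codebase (referenced above) contains the actual training setup.

```python
# Minimal SFT sketch with TRL; dataset schema and hyperparameters are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")  # ~17k traces

def to_messages(example):
    # Convert assumed ShareGPT-style turns ({"from": ..., "value": ...}) into the
    # {"role": ..., "content": ...} chat format that SFTTrainer consumes.
    roles = {"system": "system", "human": "user", "user": "user", "gpt": "assistant"}
    messages = [{"role": "system", "content": example["system"]}] if example.get("system") else []
    messages += [{"role": roles.get(t["from"], "user"), "content": t["value"]}
                 for t in example["conversations"]]
    return {"messages": messages}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="bespoke-stratos-7b",
        num_train_epochs=3,              # illustrative hyperparameters, not the exact recipe
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```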
Thoughts and future work
We are pleasantly surprised by the results, but acknowledge that benchmarks tell only one side of the story. We invite the community to try out the models and evaluate them on other benchmarks so we can figure out what to improve.
There are many open questions we would like to explore. For example, what is the Pareto frontier between student model size and the number of SFT examples?
Beyond this work, we are excited about what reasoning distillation unlocks. We are building Curator to democratize the creation of powerful reasoning models and agents by enterprises and developers.
Citation
@misc{bespoke_stratos,
  author = {Bespoke Labs},
  title = {Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation},
  year = {2025},
  howpublished = {www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation},
  note = {Accessed: 2025-01-22}
}
Acknowledgement
We are standing on the shoulders of giants. Bespoke Labs would like to thank the Berkeley Sky Computing Lab for their work on Sky-T1 and for releasing the code and data, DeepSeek for releasing the DeepSeek-R1 model, and the DataComp community for insightful discussions.