Published on Jun 5
Abstract
Empirical assessments reveal significant fluctuations in the benchmark evaluation results of DeepSeek-R1-Distill models, calling into question the reliability of claimed performance improvements and motivating a more rigorous evaluation paradigm.
Reasoning models such as the DeepSeek-R1-Distill series have been widely adopted by the open-source community because of their strong performance in mathematics, science, programming, and other domains. However, our study shows that their benchmark evaluation results fluctuate significantly depending on a range of factors: subtle differences in evaluation conditions can produce substantial variations in reported scores. We observe similar behavior in other open-source reasoning models fine-tuned from the DeepSeek-R1-Distill series, as well as in the QwQ-32B model, which makes their claimed performance improvements difficult to reproduce reliably. We therefore advocate establishing a more rigorous paradigm for model performance evaluation and present our empirical assessments of the DeepSeek-R1-Distill series models.
Paper: arxiv.org/abs/2506.04734