Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality

Bi, Ziqian; Chen, Lu; Song, Junhao; Luo, Hongying; Ge, Enze; Huang, Junmin; Wang, Tianyang; Chen, Keyu; Liang, Chia Xin; Wei, Zihan; Liu, Huafeng; Tian, Chunjie; Guan, Jibin; Yeong, Joe; Xu, Yongzhi; Wang, Peng; Hao, Junfeng

Computer Science > Computation and Language

arXiv:2508.12140 (cs)

[Submitted on 16 Aug 2025]

Authors: Ziqian Bi, Lu Chen, Junhao Song, Hongying Luo, Enze Ge, Junmin Huang, Tianyang Wang, Keyu Chen, Chia Xin Liang, Zihan Wei, Huafeng Liu, Chunjie Tian, Jibin Guan, Joe Yeong, Yongzhi Xu, Peng Wang, Junfeng Hao

Abstract: This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships in which accuracy improvements follow a predictable pattern with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens), suitable for real-time applications; balanced (256 to 512 tokens), offering optimal cost-performance tradeoffs for routine clinical support; and high-accuracy (above 512 tokens), justified only for critical diagnostic tasks. Notably, smaller models demonstrate disproportionately larger benefits from extended thinking, with 15 to 20% improvements compared to 5 to 10% for larger models, suggesting a complementary relationship where thinking budget provides greater relative benefits for capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning processes than cardiovascular or respiratory medicine. The consistency between Qwen3's native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
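
As a rough illustration of the three efficiency regimes named in the abstract, the sketch below maps a thinking budget in tokens to a regime label. The names `Regime` and `classify_budget` are hypothetical, and the handling of the boundary values 256 and 512 is an assumption, since the abstract lists both as endpoints of adjacent ranges. Treating an unlimited budget as `None` is also an assumption made for this sketch.

```python
from enum import Enum


class Regime(Enum):
    """Efficiency regimes as described in the abstract (labels only)."""
    HIGH_EFFICIENCY = "high-efficiency"  # 0 to 256 tokens, real-time applications
    BALANCED = "balanced"                # 256 to 512 tokens, routine clinical support
    HIGH_ACCURACY = "high-accuracy"      # above 512 tokens, critical diagnostic tasks


def classify_budget(tokens):
    """Map a thinking budget in tokens to a regime.

    `tokens=None` stands for an unlimited budget (an assumption of this sketch).
    """
    if tokens is None:
        return Regime.HIGH_ACCURACY
    if tokens < 0:
        raise ValueError("thinking budget must be non-negative")
    # Assumption: the shared boundary values 256 and 512 fall in the lower regime.
    if tokens <= 256:
        return Regime.HIGH_EFFICIENCY
    if tokens <= 512:
        return Regime.BALANCED
    return Regime.HIGH_ACCURACY
```

A caller might use such a mapping to pick a budget tier per request, for example `classify_budget(300)` for a routine support query. The thresholds come from the abstract; everything else here is illustrative.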

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2508.12140 [cs.CL]
	(or arXiv:2508.12140v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2508.12140

Submission history

From: Hongying Luo
[v1] Sat, 16 Aug 2025 19:25:06 UTC (15,577 KB)
