Computer Science > Machine Learning

arXiv:2510.24700 (cs)
[Submitted on 28 Oct 2025]

Title: Greedy Sampling Is Provably Efficient for RLHF


Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight is deeply rooted in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
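
For readers outside the area, the KL-regularized objective and the Bradley-Terry (BT) preference model referenced in the abstract take the following standard forms (a sketch in generic notation, with reward r, reference policy \pi_{\mathrm{ref}}, prompt distribution \rho, and regularization weight \beta; this notation is ours, not taken from the paper):

\[
\max_{\pi}\;\mathbb{E}_{x\sim\rho,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,\mathbb{E}_{x\sim\rho}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
\]
whose maximizer admits the closed form
\[
\pi^{*}(y\mid x)\;\propto\;\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),
\]
which is presumably the structural property of the optimal policy class that the abstract alludes to. Under the BT model, preference feedback between two responses is generated as
\[
\mathbb{P}(y_{1}\succ y_{2}\mid x)\;=\;\sigma\big(r(x,y_{1})-r(x,y_{2})\big),\qquad \sigma(z)=\tfrac{1}{1+e^{-z}}.
\]
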
Comments: NeurIPS 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Cite as: arXiv:2510.24700 [cs.LG]
  (or arXiv:2510.24700v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2510.24700
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Di Wu
[v1] Tue, 28 Oct 2025 17:52:08 UTC (457 KB)

