Computer Science > Machine Learning

arXiv:2510.24700 (cs)
[Submitted on 28 Oct 2025]

Title: Greedy Sampling Is Provably Efficient for RLHF


Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight is deeply rooted in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
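
For readers outside the area, the KL-regularized objective and the Bradley-Terry (BT) preference model referenced in the abstract take the following standard forms (a sketch in generic notation, with reward r, reference policy \pi_{\mathrm{ref}}, prompt distribution \rho, and regularization weight \beta; this notation is ours, not taken from the paper):

\[
\max_{\pi}\;\mathbb{E}_{x\sim\rho,\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,\mathbb{E}_{x\sim\rho}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],
\]
whose maximizer admits the closed form
\[
\pi^{*}(y\mid x)\;\propto\;\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),
\]
which is presumably the structural property of the optimal policy class that the abstract alludes to. Under the BT model, preference feedback between two responses is generated as
\[
\mathbb{P}(y_{1}\succ y_{2}\mid x)\;=\;\sigma\big(r(x,y_{1})-r(x,y_{2})\big),\qquad \sigma(z)=\tfrac{1}{1+e^{-z}}.
\]
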
Comments: NeurIPS 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Cite as: arXiv:2510.24700 [cs.LG]
  (or arXiv:2510.24700v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2510.24700
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Di Wu
[v1] Tue, 28 Oct 2025 17:52:08 UTC (457 KB)

