ADPO: Anchored Direct Preference Optimization

Zixian Wang

Computer Science > Machine Learning

arXiv:2510.18913v5 (cs)

[Submitted on 21 Oct 2025 (v1), last revised 6 Nov 2025 (this version, v5)]

Title: ADPO: Anchored Direct Preference Optimization

Authors: Zixian Wang

Abstract: Direct Preference Optimization (DPO) is effective but brittle under annotator noise and distribution shift because it operates on hard, pairwise labels and regularizes only log-probability differences. We introduce Anchored Direct Preference Optimization (ADPO), a framework that extends preference learning to soft, listwise supervision via reference anchoring. ADPO minimizes KL(q || softmax((s - s_ref) / tau_anc)), which (i) recovers supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO as special cases through suitable choices of the target q, the anchor policy, and the temperature; (ii) induces an implicit trust region governed by the softmax Fisher metric, independent of the anchor; and (iii) supports stable dynamic-anchor updates. Empirically, we observe a task-dependent tradeoff: dynamic anchors improve online exploration under noise, while fixed anchors excel at offline distillation, achieving up to 170x to 5000x reductions in student-teacher KL on our benchmarks.
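
The objective stated in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes that s and s_ref are per-candidate scores (for example, log-probabilities under the policy and the anchor) for one list of candidates, and that q is a target distribution over the same list; the abstract does not define these terms. The function names adpo_loss and dpo_loss are hypothetical. The second function is the standard pairwise DPO loss, included to check one plausible reading of the "DPO as a special case" claim: two candidates, a hard target q = (1, 0), and tau_anc = 1 / beta.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adpo_loss(q, s, s_ref, tau_anc=1.0):
    """KL(q || softmax((s - s_ref) / tau_anc)) for one candidate list.

    q      : target distribution over candidates (non-negative, sums to 1)
    s      : scores from the policy being trained (assumed: log-probabilities)
    s_ref  : scores from the anchor policy (same assumption)
    tau_anc: anchor temperature
    """
    p = softmax([(a - b) / tau_anc for a, b in zip(s, s_ref)])
    # Terms with q_i == 0 contribute nothing to the KL divergence.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def dpo_loss(s_w, s_l, ref_w, ref_l, beta):
    """Standard pairwise DPO loss: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * ((s_w - ref_w) - (s_l - ref_l))
    return math.log1p(math.exp(-margin))


# Illustrative numbers only (not from the paper).
beta = 0.5
s, s_ref = [-1.0, -2.0], [-1.5, -1.8]
pairwise = adpo_loss([1.0, 0.0], s, s_ref, tau_anc=1.0 / beta)
reference = dpo_loss(s[0], s[1], s_ref[0], s_ref[1], beta)
```

Under these assumptions, pairwise and reference agree up to floating-point error, because a two-way softmax of the anchored scores reduces to a sigmoid of their difference. A soft or listwise q uses the same adpo_loss function with a longer list.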

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2510.18913 [cs.LG]
	(or arXiv:2510.18913v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.18913

Submission history

From: Zixian Wang
[v1] Tue, 21 Oct 2025 05:53:13 UTC (754 KB)
[v2] Mon, 27 Oct 2025 12:50:13 UTC (2,856 KB)
[v3] Sat, 1 Nov 2025 10:49:23 UTC (2,784 KB)
[v4] Wed, 5 Nov 2025 14:26:44 UTC (2,795 KB)
[v5] Thu, 6 Nov 2025 06:55:06 UTC (2,793 KB)
