Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Qi, Biqing; Li, Pengfei; Li, Fangyuan; Gao, Junqi; Zhang, Kaiyan; Zhou, Bowen

Computer Science > Artificial Intelligence

arXiv:2406.05534 (cs)

[Submitted on 8 Jun 2024]


Authors: Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

Abstract: Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, because human preferences span multiple domains, direct continual training can lead to catastrophic forgetting, limiting DPO's performance and efficiency. Inspired by how intraspecific competition drives species evolution, we propose Online Fast-Slow chasing DPO (OFS-DPO) for preference alignment, which simulates competition through fast and slow chasing among models to facilitate rapid adaptation. Specifically, we first derive the regret upper bound for online learning, validating our motivation with a min-max optimization pattern. Based on this, we introduce two identical modules using Low-Rank Adaptation (LoRA) with different optimization speeds to simulate intraspecific competition, and propose a new regularization term to guide their learning. To further mitigate catastrophic forgetting in cross-domain scenarios, we extend OFS-DPO with a LoRA module combination strategy, resulting in Cross-domain Online Fast-Slow chasing DPO (COFS-DPO). This method leverages linear combinations of fast-module parameters from different task domains, fully utilizing historical information to achieve continual value alignment. Experimental results show that OFS-DPO outperforms DPO in in-domain alignment, while COFS-DPO excels in cross-domain continual learning scenarios.
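As a point of reference for the abstract, the following is a minimal sketch of the two generic building blocks it names: the standard DPO loss on a single preference pair, and a linear combination of per-domain parameter vectors. It is not the paper's OFS-DPO objective. The regularization term, the fast and slow update rules, and the way combination weights are chosen are not given in the abstract and are not reproduced here. The function names `dpo_loss` and `combine_fast_modules`, the default `beta=0.1`, and the flat-list parameter representation are assumptions made for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (not the OFS-DPO objective)."""
    # Scaled margin: how much more strongly the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def combine_fast_modules(domain_params, weights):
    """Linearly combine per-domain parameter vectors (hypothetical helper).

    domain_params: list of equal-length lists of floats, one per task domain.
    weights: one float per domain; how to choose them is left open here.
    """
    if not domain_params or len(domain_params) != len(weights):
        raise ValueError("need at least one domain and one weight per domain")
    size = len(domain_params[0])
    combined = [0.0] * size
    for params, weight in zip(domain_params, weights):
        if len(params) != size:
            raise ValueError("all parameter vectors must have the same length")
        for i, value in enumerate(params):
            combined[i] += weight * value
    return combined
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value once the policy favors the chosen response more than the reference does.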

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2406.05534 [cs.AI]
	(or arXiv:2406.05534v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2406.05534

Submission history

From: Biqing Qi
[v1] Sat, 8 Jun 2024 17:30:54 UTC (1,395 KB)
