An Analysis of Action-Value Temporal-Difference Methods That Learn State Values

Daley, Brett; Nagarajan, Prabhat; White, Martha; Machado, Marlos C.

arXiv:2507.09523 (cs)

[Submitted on 13 Jul 2025]

Authors: Brett Daley, Prabhat Nagarajan, Martha White, Marlos C. Machado

Abstract: The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value function (e.g., Q-learning and Sarsa). Significantly less attention has been given to methods that bootstrap from two asymmetric value functions: i.e., methods that learn state values as an intermediate step in learning action values. Existing algorithms in this vein can be categorized as either QV-learning or AV-learning. Though these algorithms have been investigated to some degree in prior work, it remains unclear if and when it is advantageous to learn two value functions instead of just one -- and whether such approaches are theoretically sound in general. In this paper, we analyze these algorithmic families in terms of convergence and sample efficiency. We find that while both families are more efficient than Expected Sarsa in the prediction setting, only AV-learning methods offer any major benefit over Q-learning in the control setting. Finally, we introduce a new AV-learning algorithm called Regularized Dueling Q-learning (RDQ), which significantly outperforms Dueling DQN in the MinAtar benchmark.
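
To make "learning state values as an intermediate step in learning action values" concrete, the following is a minimal tabular sketch of a QV-learning-style update. It follows the commonly cited form of QV-learning, in which both tables move toward the same state-value target r + gamma * V(s'); it is not taken from this paper's text. The function name, the dictionary-based tables, and the step sizes `alpha` and `beta` are assumptions made for illustration.

```python
from collections import defaultdict


def qv_learning_update(Q, V, s, a, r, s_next, done,
                       alpha=0.1, beta=0.1, gamma=0.99):
    """Apply one tabular QV-learning-style step (illustrative sketch).

    Q maps (state, action) pairs to action-value estimates.
    V maps states to state-value estimates.
    Both estimates bootstrap from V at the next state, not from Q.
    """
    # Shared bootstrapped target, computed before either table changes.
    target = r if done else r + gamma * V[s_next]
    # State-value update (a TD(0) step with step size beta).
    V[s] += beta * (target - V[s])
    # Action-value update toward the same state-value target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target


# Hypothetical two-state example with zero-initialized tables.
Q = defaultdict(float)
V = defaultdict(float)
first_target = qv_learning_update(Q, V, s=0, a=1, r=1.0, s_next=1, done=False)
second_target = qv_learning_update(Q, V, s=1, a=0, r=0.0, s_next=0, done=False)
```

For contrast, a single-function method such as Q-learning would build its target from the action-value table itself (a maximum over next-state action values), so the sketch differs only in where the bootstrap comes from.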

Comments:	Published in RLC/RLJ 2025
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.09523 [cs.LG]
	(or arXiv:2507.09523v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2507.09523

Submission history

From: Prabhat Nagarajan
[v1] Sun, 13 Jul 2025 07:34:33 UTC (650 KB)
