Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Zhou, Wei; Li, Yiying; Yang, Yongxin; Wang, Huaimin; Hospedales, Timothy M.

arXiv:2003.05334v2 (cs)

[Submitted on 11 Mar 2020 (v1), last revised 2 Nov 2020 (this version, v2)]

Title: Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Authors: Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, Timothy M. Hospedales

Abstract: Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference learning, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy based learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with the contemporary Off-PAC methods DDPG, TD3 and the state-of-the-art SAC.
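The idea of an extra actor loss that is itself trained online to speed up learning can be illustrated on a toy problem. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the fixed critic Q(s, a) = -(a - 2s)^2, the scalar linear actor a = phi * s, the two-parameter quadratic auxiliary loss, the hand-derived gradients, and every function name are hypothetical. The meta-objective used here (validation main loss after the auxiliary step minus the loss before it) is one plausible reading of "trained to accelerate learning"; the paper uses neural networks and its exact objective may differ.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def main_loss(phi, states):
    # Toy critic-provided actor loss: -Q(s, a) with a = phi * s and a
    # hypothetical fixed critic Q(s, a) = -(a - 2 s)^2 (optimum at phi = 2).
    return mean([(phi * s - 2.0 * s) ** 2 for s in states])

def main_grad(phi, states):
    # Derivative of main_loss with respect to phi.
    return mean([2.0 * (phi * s - 2.0 * s) * s for s in states])

def aux_grad(phi, omega, states):
    # Hypothetical auxiliary ("meta-critic") loss:
    #   h(phi; omega) = omega[0] * mean(a^2) + omega[1] * mean(a * s), a = phi * s.
    # Returns dh/dphi and its derivative with respect to each omega component.
    m2 = mean([s * s for s in states])
    basis = (2.0 * phi * m2, m2)
    return omega[0] * basis[0] + omega[1] * basis[1], basis

def train(steps=500, lr=0.05, meta_lr=0.05, batch=32, seed=0):
    rng = random.Random(seed)
    phi, omega = 0.0, [0.0, 0.0]
    for _ in range(steps):
        s_trn = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        s_val = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        g_aux, basis = aux_grad(phi, omega, s_trn)
        # Actor step with the main loss only, then with the auxiliary loss added.
        phi_old = phi - lr * main_grad(phi, s_trn)
        phi_new = phi_old - lr * g_aux
        # Meta-loss: main_loss(s_val, phi_new) - main_loss(s_val, phi_old).
        # Only phi_new depends on omega, with d(phi_new)/d(omega) = -lr * basis.
        g_val = main_grad(phi_new, s_val)
        omega = [w - meta_lr * g_val * (-lr * b) for w, b in zip(omega, basis)]
        phi = phi_new  # the actor keeps the step that used both losses
    return phi, omega

if __name__ == "__main__":
    phi, omega = train()
    print("learned phi:", phi, "auxiliary-loss parameters:", omega)
```

In this sketch the auxiliary parameters are updated at every step on the same single task as the actor, which mirrors the online, single-task setting the abstract describes, as opposed to meta-learning over a family of tasks.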

Comments:	NeurIPS 2020
Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2003.05334 [cs.LG]
	(or arXiv:2003.05334v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2003.05334

Submission history

From: Wei Zhou
[v1] Wed, 11 Mar 2020 14:39:49 UTC (5,186 KB)
[v2] Mon, 2 Nov 2020 04:53:38 UTC (36,503 KB)
