Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation

Liu, Zhihan; Zhang, Yufeng; Fu, Zuyue; Yang, Zhuoran; Wang, Zhaoran


arXiv:2108.08765 (cs)

[Submitted on 19 Aug 2021]


Authors: Zhihan Liu, Yufeng Zhang, Zuyue Fu, Zhuoran Yang, Zhaoran Wang

Abstract: In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from that of the expert policy on a certain predefined reward set. In this paper, we study GAIL in both online and offline settings with linear function approximation, where both the transition and reward functions are linear in the feature maps. Besides the expert demonstration, in the online setting the agent can interact with the environment, while in the offline setting the agent can only access an additional dataset collected beforehand. For online GAIL, we propose an optimistic generative adversarial policy optimization algorithm (OGAP) and prove that OGAP achieves $\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here $N_1$ represents the number of trajectories in the expert demonstration, $d$ is the feature dimension, and $K$ is the number of episodes. For offline GAIL, we propose a pessimistic generative adversarial policy optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain the optimality gap of PGAP, which achieves the minimax lower bound in the utilization of the additional dataset. Assuming sufficient coverage on the additional dataset, we show that PGAP achieves an optimality gap of $\widetilde{\mathcal{O}}(H^{2}dK^{-1/2}+H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2})$. Here $N_2$ represents the number of trajectories in the additional dataset with sufficient coverage.
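
To make the scaling of the two bounds in the abstract concrete, the following minimal sketch evaluates their leading terms with all constants and logarithmic factors dropped (the $\widetilde{\mathcal{O}}$ notation hides them), so the numbers only indicate growth rates and are not actual guarantees. The function names and sample values are illustrative assumptions and do not come from the paper; $H$ is assumed here to be the episode horizon, which the abstract does not define.

```python
import math


def ogap_regret_terms(H, d, K, N1):
    """Leading terms of the online (OGAP) regret bound.

    Constants and log factors are dropped; names are hypothetical.
    """
    term_K = H**2 * d**1.5 * math.sqrt(K)       # H^2 d^{3/2} K^{1/2}
    term_N1 = K * H**1.5 * d / math.sqrt(N1)    # K H^{3/2} d N_1^{-1/2}
    return term_K, term_N1


def pgap_gap_terms(H, d, K, N1, N2):
    """Leading terms of the offline (PGAP) optimality gap under sufficient coverage.

    Constants and log factors are dropped; names are hypothetical.
    """
    term_K = H**2 * d / math.sqrt(K)            # H^2 d K^{-1/2}
    term_N2 = H**2 * d**1.5 / math.sqrt(N2)     # H^2 d^{3/2} N_2^{-1/2}
    term_N1 = H**1.5 * d / math.sqrt(N1)        # H^{3/2} d N_1^{-1/2}
    return term_K, term_N2, term_N1


def per_episode(terms, K):
    """Divide each regret term by K to get its per-episode average."""
    return tuple(t / K for t in terms)


if __name__ == "__main__":
    # Arbitrary sample values chosen only for illustration.
    for K in (100, 400):
        avg = per_episode(ogap_regret_terms(H=10, d=4, K=K, N1=25), K)
        print(K, avg)
    print(pgap_gap_terms(H=10, d=4, K=100, N1=25, N2=64))
```

Dividing the online bound by $K$ shows the pattern shared with the offline gap: the $K$-dependent term shrinks like $K^{-1/2}$, while the $N_1$ term stays fixed no matter how many episodes are run, so under these bounds it can only be reduced by collecting more expert trajectories.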

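Comments:	54 pages, in submission
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:2108.08765 [cs.LG]
	(or arXiv:2108.08765v1 [cs.LG] for this version)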
	https://doi.org/10.48550/arXiv.2108.08765

Submission history

From: Yufeng Zhang
[v1] Thu, 19 Aug 2021 16:16:00 UTC (80 KB)
