Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Asghari, Seyed Mohammad; Ouyang, Yi; Nayyar, Ashutosh


arXiv:2001.10122 (cs)

[Submitted on 27 Jan 2020]




Abstract: Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL), primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each controlled by one agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication between the agents, we show that no learning policy can achieve regret that is sublinear in $T$, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism between the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides an $\tilde{O}(\sqrt{T})$ regret bound. (Here, $\tilde{O}(\cdot)$ hides constants and logarithmic factors.) Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.
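The setup described in the abstract can be made concrete with a small simulation. The following is a minimal sketch, not the authors' algorithm: it builds two scalar, dynamically decoupled linear systems whose only coupling is an off-diagonal term in the quadratic cost, then measures the regret of a fixed linear policy. All numerical values, the unit-variance Gaussian noise, and the helper names `riccati` and `cumulative_cost` are assumptions made for illustration. The benchmark used here is the optimal centralized average cost with known dynamics, which may differ from the benchmark used in the paper.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration (not taken from the paper).
A = np.diag([0.9, 0.8])        # block-diagonal: the two systems are dynamically decoupled
B = np.eye(2)                  # agent i's input drives only system i
Q = np.array([[1.0, 0.5],
              [0.5, 1.0]])     # off-diagonal entries couple the systems through the cost
R = np.eye(2)


def riccati(A, B, Q, R, iters=1000):
    """Iterate the discrete-time Riccati recursion; return P and the gain K (u = -K x)."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def cumulative_cost(K, T, rng):
    """Total quadratic cost over T steps under u_t = -K x_t with unit-variance noise."""
    x = np.zeros(2)
    total = 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + rng.standard_normal(2)
    return total


P, K = riccati(A, B, Q, R)
J_star = np.trace(P)           # optimal average cost when the noise covariance is the identity

T = 5000
# Regret(T) = cumulative cost minus T times the benchmark average cost.
regret_opt = cumulative_cost(K, T, np.random.default_rng(0)) - T * J_star
regret_zero = cumulative_cost(np.zeros((2, 2)), T, np.random.default_rng(1)) - T * J_star

print("optimal gain, regret per step:", regret_opt / T)
print("zero control, regret per step:", regret_zero / T)
```

Under these assumptions, a fixed suboptimal policy such as zero control accumulates regret that grows linearly in $T$, while the optimal gain keeps the per-step regret near zero. A learning algorithm with an $\tilde{O}(\sqrt{T})$ bound sits between these two cases: its per-step regret vanishes as $T$ grows.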

Subjects:	Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:2001.10122 [cs.LG]
	(or arXiv:2001.10122v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2001.10122

Submission history

From: Seyed Mohammad Asghari
[v1] Mon, 27 Jan 2020 23:37:41 UTC (1,878 KB)
