Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption

Hong, Yige; Xie, Qiaomin; Chen, Yudong; Wang, Weina

arXiv:2306.00196 (cs)

[Submitted on 31 May 2023 (v1), last revised 16 Jan 2024 (this version, v3)]

Authors: Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang

Abstract: We study the infinite-horizon restless bandit problem with the average reward criterion, in both discrete-time and continuous-time settings. A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. This is done by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we do not require \emph{any} additional assumptions beyond the standard unichain condition. In both settings, our work is the first asymptotic optimality result that does not require UGAP.
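The simulate-and-steer idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or code: the two-state arm, the transition table `P`, the single-armed policy `PI_BAR`, the tie-breaking rule in `priority`, and the coupling of synchronized arms are all assumptions made here for illustration, and `PI_BAR` is not derived from any relaxation of the problem. Each arm keeps a virtual state that follows the single-armed policy exactly, while the real actions follow the virtual advice wherever the activation budget allows.

```python
import random

# Hypothetical two-state arm: P[a][s] is the probability that the next
# state is 1 when action a is taken in state s. The numbers are made up.
P = {0: [0.2, 0.6], 1: [0.7, 0.9]}
# Hypothetical single-armed policy: probability of activating in each state.
PI_BAR = [0.8, 0.2]


def ftva_step(real, virtual, budget, rng):
    """Advance all arms by one step in place; return (real_actions, virtual_actions)."""
    n = len(real)
    # 1. Ask the single-armed policy for advice at each arm's virtual state.
    v_act = [int(rng.random() < PI_BAR[virtual[i]]) for i in range(n)]

    # 2. Activate exactly `budget` arms, following the advice where possible.
    #    Assumed tie-breaking: keep already-synchronized arms on their advice.
    def priority(i):
        synced = real[i] == virtual[i]
        if v_act[i] == 1:
            return (0, not synced)  # advised active: synchronized arms first
        return (1, synced)          # advised passive: unsynchronized arms first

    r_act = [0] * n
    for i in sorted(range(n), key=priority)[:budget]:
        r_act[i] = 1

    # 3. Transitions. An arm whose real state and action both match the
    #    virtual ones shares the virtual transition, so it stays synchronized.
    for i in range(n):
        follows = real[i] == virtual[i] and r_act[i] == v_act[i]
        next_virtual = int(rng.random() < P[v_act[i]][virtual[i]])
        if follows:
            next_real = next_virtual
        else:
            next_real = int(rng.random() < P[r_act[i]][real[i]])
        virtual[i], real[i] = next_virtual, next_real
    return r_act, v_act


def simulate(n=100, alpha=0.4, steps=200, seed=0):
    """Run the toy scheme; return the budget and per-step (activations, synced fraction)."""
    rng = random.Random(seed)
    budget = round(alpha * n)
    real = [rng.randint(0, 1) for _ in range(n)]
    virtual = [rng.randint(0, 1) for _ in range(n)]
    history = []
    for _ in range(steps):
        r_act, _ = ftva_step(real, virtual, budget, rng)
        synced = sum(r == v for r, v in zip(real, virtual)) / n
        history.append((sum(r_act), synced))
    return budget, history
```

Calling `simulate()` returns the budget together with, for each step, the number of activated arms (always equal to the budget) and the fraction of arms whose real state agrees with the virtual state. In this toy setup an arm that has synchronized can fall out of sync only when the budget forces it to deviate from its advice.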

Comments:	NeurIPS 2023. 35 pages, 8 figures
Subjects:	Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
MSC classes:	90C40
ACM classes:	G.3; I.6
Cite as:	arXiv:2306.00196 [cs.LG]
	(or arXiv:2306.00196v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.00196

Submission history

From: Yige Hong
[v1] Wed, 31 May 2023 21:26:43 UTC (1,161 KB)
[v2] Sun, 10 Dec 2023 05:59:41 UTC (2,808 KB)
[v3] Tue, 16 Jan 2024 05:42:06 UTC (2,205 KB)
