Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics

Choi, Sunmook; Sattar, Yahya; Jedra, Yassir; Fazel, Maryam; Dean, Sarah

Computer Science > Machine Learning

arXiv:2510.16208 (cs)

[Submitted on 17 Oct 2025]

Title: Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics

Authors: Sunmook Choi, Yahya Sattar, Yassir Jedra, Maryam Fazel, Sarah Dean

Abstract: We study a nonstationary bandit problem where rewards depend on both actions and latent states, the latter governed by unknown linear dynamics. Crucially, the state dynamics also depend on the actions, resulting in tension between short-term and long-term rewards. We propose an explore-then-commit algorithm for a finite horizon $T$. During the exploration phase, random Rademacher actions enable estimation of the Markov parameters of the linear dynamics, which characterize the action-reward relationship. In the commit phase, the algorithm uses the estimated parameters to design an optimized action sequence for long-term reward. Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret. Our analysis handles two key challenges: learning from temporally correlated rewards, and designing action sequences with optimal long-term reward. We address the first challenge by providing near-optimal sample complexity and error bounds for system identification using bilinear rewards. We address the second challenge by proving an equivalence with indefinite quadratic optimization over a hypercube, a known NP-hard problem. We provide a sub-optimality guarantee for this problem, enabling our regret upper bound. Lastly, we propose a semidefinite relaxation with Goemans-Williamson rounding as a practical approach.
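The commit-phase difficulty mentioned in the abstract can be illustrated on a toy instance. The sketch below is not the paper's algorithm: it exhaustively maximizes a quadratic form over sign vectors, which takes 2^n evaluations and hints at why a relaxation is attractive in practice. Restricting the search to the vertices {-1, +1}^n, the matrix `Q_EXAMPLE`, and the helper names `quadratic_form` and `best_sign_vector` are all assumptions made for illustration and do not come from the paper.

```python
from itertools import product


def quadratic_form(Q, x):
    """Return x^T Q x for a square matrix Q given as a list of rows."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))


def best_sign_vector(Q):
    """Exhaustively maximize x^T Q x over x in {-1, +1}^n.

    The loop visits all 2^n sign vectors, so this is only a reference
    solution for tiny instances.
    """
    n = len(Q)
    best_x, best_val = None, float("-inf")
    for x in product((-1, 1), repeat=n):
        val = quadratic_form(Q, x)
        if val > best_val:  # strict '>' keeps the first maximizer found
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical symmetric matrix with zero trace, hence indefinite.
# It is an illustration only, not data from the paper.
Q_EXAMPLE = [
    [0.0, 1.0, -2.0],
    [1.0, 0.0, 0.5],
    [-2.0, 0.5, 0.0],
]

x_star, v_star = best_sign_vector(Q_EXAMPLE)
print(x_star, v_star)
```

Because the quadratic form is even in x, the maximizers come in pairs x and -x. The exhaustive search is feasible only for very small n, which is the regime where it can serve as a check on an approximate method.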

Subjects:	Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:2510.16208 [cs.LG]
	(or arXiv:2510.16208v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.16208

Submission history

From: Sunmook Choi
[v1] Fri, 17 Oct 2025 20:41:14 UTC (710 KB)
