Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

Ho, Qi Heng; Feather, Martin S.; Rossi, Federico; Sunberg, Zachary N.; Lahijanian, Morteza

Computer Science > Artificial Intelligence

arXiv:2406.02871 (cs)

[Submitted on 5 Jun 2024]

Title: Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives

Authors: Qi Heng Ho, Martin S. Feather, Federico Rossi, Zachary N. Sunberg, Morteza Lahijanian

Abstract: Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making under transition and observation uncertainties. This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP), where the goal is to maximize the probability of reaching some target states. This is also a core problem in model checking with logical specifications and is naturally undiscounted (discount factor is one). Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP. Specifically, we focus on trial-based heuristic search value iteration techniques and present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space (informed search via value bounds) while addressing their drawbacks in handling loops for indefinite-horizon problems. The algorithm produces policies with two-sided bounds on optimal reachability probabilities. We prove convergence to an optimal policy from below under certain conditions. Experimental evaluations on a suite of benchmarks show that our algorithm outperforms existing methods in almost all cases in both probability guarantees and computation time.

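Comments:	Accepted to the Conference on Uncertainty in Artificial Intelligence (UAI) 2024
Subjects:	Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Robotics (cs.RO); Systems and Control (eess.SY)
Cite as:	arXiv:2406.02871 [cs.AI]
	(or arXiv:2406.02871v1 [cs.AI] for this version)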
	https://doi.org/10.48550/arXiv.2406.02871

Submission history

From: Qi Heng Ho
[v1] Wed, 5 Jun 2024 02:33:50 UTC (300 KB)
