Expressive Reward Synthesis with the Runtime Monitoring Language

Donnelly, Daniel; Ferrando, Angelo; Belardinelli, Francesco

Computer Science > Machine Learning

arXiv:2510.16185 (cs)

[Submitted on 17 Oct 2025 (v1), last revised 21 Oct 2025 (this version, v2)]

Abstract: A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite-state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
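To make the expressivity gap described in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not RML syntax and not the authors' implementation: the class names `RewardMachine` and `CountingMonitor`, the event labels `"a"` and `"b"`, and the helper `total_reward` are all hypothetical. The first class is a plain finite-state reward machine; the second adds a single integer counter, which is enough to reward traces of the form a^n b^n, a standard example of a non-regular language that no finite-state machine can recognise for unbounded n.

```python
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Finite-state reward machine: (state, event) -> (next_state, reward)."""
    transitions: dict
    state: str = "u0"

    def step(self, event: str) -> float:
        # Undefined (state, event) pairs keep the current state and give zero reward.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward


@dataclass
class CountingMonitor:
    """Hypothetical monitor with one integer counter (an assumption, not RML).

    Pays reward 1.0 the first time the events seen so far form a^n b^n (n >= 1).
    """
    count: int = 0
    phase: str = "a"  # "a": reading a's, "b": reading b's, then "done" or "fail"

    def step(self, event: str) -> float:
        if self.phase in ("done", "fail"):
            return 0.0
        if event == "a":
            if self.phase == "a":
                self.count += 1      # remember how many a's were seen
            else:
                self.phase = "fail"  # an 'a' after a 'b' breaks the pattern
            return 0.0
        if event == "b":
            if self.count == 0:
                self.phase = "fail"  # a 'b' with no matching 'a'
                return 0.0
            self.phase = "b"
            self.count -= 1          # match this 'b' against one earlier 'a'
            if self.count == 0:
                self.phase = "done"
                return 1.0
            return 0.0
        return 0.0  # events other than 'a' and 'b' are ignored


def total_reward(machine, events) -> float:
    """Feed a sequence of events to a machine and sum the rewards."""
    return sum(machine.step(e) for e in events)


if __name__ == "__main__":
    rm = RewardMachine({("u0", "a"): ("u1", 0.0), ("u1", "b"): ("u2", 1.0)})
    print(total_reward(rm, "ab"))
    print(total_reward(CountingMonitor(), "aaabbb"))
    print(total_reward(CountingMonitor(), "aab"))
```

The finite-state machine can only distinguish a fixed number of situations, so it can reward "an a followed by a b" but not "exactly as many b's as a's" for arbitrary n. The counter is a stand-in for the kind of memory the abstract attributes to RML; how RML itself represents such memory is not shown here.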

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (stat.ML)
Cite as:	arXiv:2510.16185 [cs.LG]
	(or arXiv:2510.16185v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.16185

Submission history

From: Francesco Belardinelli
[v1] Fri, 17 Oct 2025 19:54:59 UTC (198 KB)
[v2] Tue, 21 Oct 2025 10:04:30 UTC (200 KB)
