An Empirical Evaluation of True Online TD(λ)

van Seijen, Harm; Mahmood, A. Rupam; Pilarski, Patrick M.; Sutton, Richard S.

arXiv:1507.00353 (cs)

[Submitted on 1 Jul 2015]

Title: An Empirical Evaluation of True Online TD(λ)

Authors: Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Richard S. Sutton

Abstract: The true online TD(λ) algorithm has recently been proposed (van Seijen and Sutton, 2014) as a universal replacement for the popular TD(λ) algorithm, in temporal-difference learning and reinforcement learning. True online TD(λ) has better theoretical properties than conventional TD(λ), and the expectation is that it also results in faster learning. In this paper, we put this hypothesis to the test. Specifically, we compare the performance of true online TD(λ) with that of TD(λ) on challenging examples, random Markov reward processes, and a real-world myoelectric prosthetic arm. We use linear function approximation with tabular, binary, and non-binary features. We assess the algorithms along three dimensions: computational cost, learning speed, and ease of use. Our results confirm the strength of true online TD(λ): 1) for sparse feature vectors, the computational overhead with respect to TD(λ) is minimal; for non-sparse features the computation time is at most twice that of TD(λ), 2) across all domains/representations the learning speed of true online TD(λ) is often better, but never worse than that of TD(λ), and 3) true online TD(λ) is easier to use, because it does not require choosing between trace types, and it is generally more stable with respect to the step-size. Overall, our results suggest that true online TD(λ) should be the first choice when looking for an efficient, general-purpose TD method.

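Comments:	European Workshop on Reinforcement Learning (EWRL) 2015
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1507.00353 [cs.AI]
	(or arXiv:1507.00353v1 [cs.AI] for this version)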
	https://doi.org/10.48550/arXiv.1507.00353

Submission history

From: Harm van Seijen
[v1] Wed, 1 Jul 2015 20:03:49 UTC (178 KB)
