STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Chen, Yinfang; Pan, Jiaqi; Clark, Jackson; Su, Yiming; Zheutlin, Noah; Bhavya, Bhavya; Arora, Rohan; Deng, Yu; Jha, Saurabh; Xu, Tianyin

arXiv:2506.02009 (cs)

[Submitted on 27 May 2025]

Title: STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Authors: Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, Tianyin Xu

Abstract: In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, and mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in the success rate on failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.

Comments:	10 pages of main text, 40 pages in total.
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2506.02009 [cs.DC]
	(or arXiv:2506.02009v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2506.02009

Submission history

From: Yinfang Chen
[v1] Tue, 27 May 2025 19:15:19 UTC (591 KB)
