AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities of Large Language Models

Wang, Yu; Liu, Yijian; Ji, Liheng; Luo, Han; Li, Wenjie; Zhou, Xiaofei; Feng, Chiyun; Wang, Puji; Cao, Yuhan; Zhang, Geyuan; Li, Xiaojian; Xu, Rongwu; Chen, Yilei; He, Tianxing

arXiv:2507.09580v1 (cs)

[Submitted on 13 July 2025]

Title: AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities of Large Language Models

Authors: Yu Wang, Yijian Liu, Liheng Ji, Han Luo, Wenjie Li, Xiaofei Zhou, Chiyun Feng, Puji Wang, Yuhan Cao, Geyuan Zhang, Xiaojian Li, Rongwu Xu, Yilei Chen, Tianxing He

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptographic capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. To gain deeper insight into the current state of cryptographic proficiency in LLMs, we introduce human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and constructing routine proofs. However, they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io.

Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2507.09580 [cs.CR]
	(or arXiv:2507.09580v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2507.09580

Submission history

From: Yu Wang
[v1] Sun, 13 Jul 2025 11:11:01 UTC (757 KB)
