Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?

Khera, Bhakti; Alamian, Rezvan; Scherz, Pascal A.; Goetz, Stephan M.

Computer Science > Computers and Society

arXiv:2507.10576 (cs)

[Submitted on 11 Jul 2025]

Authors: Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz

Abstract: The legal field already uses various large language models (LLMs) in practice, but their quantitative performance and the reasons for it are underexplored. We evaluated several open-source and proprietary LLMs, including GPT-series, Anthropic, DeepSeek, and Llama-3 variants, on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Llama 3.1 8B served through Amazon Web Services (AWS) lagged at 0.50 accuracy, and a Python-deployed Llama 3.1 8B scored 0.55. The latter two are within the range of mere guessing for the two-answer forced-choice design. None of the evaluated models could have passed the examination fully, as accuracy never exceeded the average threshold of 0.90 required for professional-level standards; this includes models that are regularly promoted for their assumed beyond-PhD and bar-admitted-lawyer-level performance. GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence. Human patent experts evaluated the textual justifications and uncovered various critical shortcomings of each model. They valued clarity and legal rationale over the raw correctness of the answers, which revealed a misalignment between automatic metrics and expert judgment. Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the continued need for expert oversight. Future work should target logical consistency, robust multimodality, and adaptive prompting to approach human-level patent proficiency. In summary, despite the outstanding performance of recent large models, the general public might overestimate their capabilities. The field has a long way to go to develop a virtual patent attorney. This paper points out several specific limitations that need solutions.
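To make the metrics in the abstract concrete, the following is a minimal sketch of how accuracy and F1 can be computed for a two-answer forced-choice test, and how a score can be compared against pure guessing. It is not the authors' evaluation code. The function names, the toy answer lists, and the choice of binary F1 on the "true" class are assumptions for illustration; the abstract does not state the number of test items or the F1 averaging scheme.

```python
from math import comb


def accuracy_and_f1(y_true, y_pred, positive=True):
    """Return (accuracy, F1) for binary labels; F1 is computed for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def p_value_vs_guessing(n_correct, n_total):
    """One-sided exact binomial p-value: P(X >= n_correct) when each answer
    is a fair coin flip (chance level 0.5 for two-answer forced choice)."""
    tail = sum(comb(n_total, k) for k in range(n_correct, n_total + 1))
    return tail / 2 ** n_total


# Toy data, invented for illustration only (not from the paper).
y_true = [True, False, True, True, False, False, True, False]
y_pred = [True, False, False, True, True, False, True, False]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")

# Hypothetical test length of 100 items (an assumption, not a reported value).
print(p_value_vs_guessing(55, 100))  # score near the chance level
print(p_value_vs_guessing(82, 100))  # score well above the chance level
```

Under the assumed test length, a score of 55 out of 100 is not distinguishable from coin flipping at conventional significance levels, which is the sense in which the abstract places the two Llama 3.1 8B deployments "within the range of mere guessing". The actual conclusion depends on the real number of items, which the abstract does not report.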

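Comments:	39 pages, 21 figures
Subjects:	Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Cite as:	arXiv:2507.10576 [cs.CY]
	(or arXiv:2507.10576v1 [cs.CY] for this version)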
	https://doi.org/10.48550/arXiv.2507.10576

Submission history

From: Stefan Goetz
[v1] Fri, 11 Jul 2025 09:42:23 UTC (8,919 KB)
