Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Kim, Jaekyeom; Sohn, Sungryull; Jo, Gerrard Jeongwon; Choi, Jihoon; Bae, Kyunghoon; Lee, Hwayoung; Park, Yongmin; Lee, Honglak

arXiv:2503.02784 (cs)

[Submitted on 4 Mar 2025 (v1), last revised 14 Mar 2025 (this version, v3)]

Title: Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Authors: Jaekyeom Kim, Sungryull Sohn, Gerrard Jeongwon Jo, Jihoon Choi, Kyunghoon Bae, Hwayoung Lee, Yongmin Park, Honglak Lee

Abstract: This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone; instead, tracking dataset redistribution and its full lifecycle is essential. However, this process is too complex for legal experts to handle manually at scale. Tracking dataset provenance, verifying redistribution rights, and assessing evolving legal risks across multiple stages require a level of precision and efficiency that exceeds human capabilities. Addressing this challenge effectively demands AI agents that can systematically trace dataset redistribution, analyze compliance, and identify legal risks. We develop an automated data compliance system called NEXUS and show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts. Our massive legal analysis of 17,429 unique entities and 8,072 license terms using this approach reveals discrepancies in legal rights between the original datasets before redistribution and their redistributed subsets, underscoring the necessity of data lifecycle-aware compliance. For instance, we find that out of 2,852 datasets with commercially viable individual license terms, only 605 (21%) are legally permissible for commercialization. This work sets a new standard for AI data governance, advocating for a framework that systematically examines the entire lifecycle of dataset redistribution to ensure transparent, legal, and responsible dataset management.
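The central claim, that a dataset's own license can look commercially usable while its upstream sources are not, can be illustrated with a minimal sketch. This is not the NEXUS system, whose internals the abstract does not describe; the `Dataset` class, its fields, and the `commercially_permissible` function are hypothetical names chosen for illustration, and the rule that every upstream source must permit commercial use is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical record: a dataset, its stated license, and what it redistributes."""
    name: str
    commercial_ok: bool                           # what the dataset's own license says
    sources: list = field(default_factory=list)   # upstream Dataset objects


def commercially_permissible(ds, _seen=None):
    """Assumed rule: usable only if ds AND every upstream source permit it."""
    seen = set() if _seen is None else _seen
    if id(ds) in seen:          # guard against cyclic redistribution chains
        return True
    seen.add(id(ds))
    return ds.commercial_ok and all(
        commercially_permissible(src, seen) for src in ds.sources
    )


# A subset whose own license permits commercial use, built on a source that does not.
raw = Dataset("raw_source", commercial_ok=False)
subset = Dataset("curated_subset", commercial_ok=True, sources=[raw])
standalone = Dataset("standalone", commercial_ok=True)

print(commercially_permissible(subset))      # False: blocked by the upstream source
print(commercially_permissible(standalone))  # True: no upstream restrictions

# The proportion reported in the abstract: 605 of 2,852 datasets.
print(f"{605 / 2852:.0%}")                   # 21%
```

Checking only `subset.commercial_ok` would report the subset as usable; walking the redistribution chain is what surfaces the restriction.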

Subjects:	Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.02784 [cs.CY]
	(or arXiv:2503.02784v3 [cs.CY] for this version)
	https://doi.org/10.48550/arXiv.2503.02784

Submission history

From: Jaekyeom Kim
[v1] Tue, 4 Mar 2025 16:57:53 UTC (3,627 KB)
[v2] Thu, 6 Mar 2025 18:45:51 UTC (3,627 KB)
[v3] Fri, 14 Mar 2025 16:58:30 UTC (3,627 KB)
