Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs

Cai, Erica; O'Connor, Brendan

arXiv:2506.12367 (cs)

[Submitted on 14 Jun 2025]

Title: Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs

Authors: Erica Cai, Brendan O'Connor

Abstract: Knowledge graphs (KGs) are useful for analyzing social structures, community dynamics, institutional memberships, and other complex relationships across domains from sociology to public health. While recent advances in large language models (LLMs) have improved the scalability and accessibility of automated KG extraction from large text corpora, the impacts of extraction errors on downstream analyses are poorly understood, especially for applied scientists who depend on accurate KGs for real-world insights. To address this gap, we conducted the first evaluation of KG extraction performance at two levels: (1) micro-level edge accuracy, which is consistent with standard NLP evaluations, and manual identification of common error sources; (2) macro-level graph metrics that assess structural properties such as community detection and connectivity, which are relevant to real-world applications. Focusing on affiliation graphs of person membership in organizations extracted from social register books, our study identifies a range of extraction performance where biases across most downstream graph analysis metrics are near zero. However, as extraction performance declines, we find that many metrics exhibit increasingly pronounced biases, with each metric tending toward a consistent direction of either over- or under-estimation. Through simulations, we further show that error models commonly used in the literature do not capture these bias patterns, indicating the need for more realistic error models for KG extraction. Our findings provide actionable insights for practitioners and underscore the importance of advancing extraction methods and error modeling to ensure reliable and meaningful downstream analyses.
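
To make the two evaluation levels concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' code: the toy affiliation edges, the function names, the choice of "mean memberships per person" as the macro-level metric, and the uniform random edge-deletion error model are all assumptions made for this example.

```python
import random


def edge_precision_recall(gold_edges, extracted_edges):
    """Micro level: precision and recall of extracted (person, org) edges."""
    gold, pred = set(gold_edges), set(extracted_edges)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def mean_memberships_per_person(edges):
    """Macro level: average number of organizations per person.

    People who lose all of their edges vanish from the graph, so they
    are not counted in the denominator.
    """
    unique_edges = set(edges)
    people = {person for person, _ in unique_edges}
    return len(unique_edges) / len(people) if people else 0.0


def drop_edges_uniformly(edges, drop_prob, seed=0):
    """Hypothetical error model: delete each edge independently."""
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() >= drop_prob]


# Toy gold-standard affiliation graph (assumed data).
gold = [
    ("ann", "club_a"), ("ann", "club_b"),
    ("bob", "club_a"),
    ("cat", "club_b"), ("cat", "club_c"),
    ("dan", "club_c"),
]

noisy = drop_edges_uniformly(gold, drop_prob=0.3, seed=1)
precision, recall = edge_precision_recall(gold, noisy)
bias = mean_memberships_per_person(noisy) - mean_memberships_per_person(gold)
print(f"precision={precision:.2f} recall={recall:.2f} metric bias={bias:+.2f}")
```

Repeating the last step over many seeds and drop probabilities would trace how the bias of a graph metric moves as edge-level recall falls, which is the kind of relationship the abstract describes. A deletion-only model never produces spurious edges, so it is a deliberately simple stand-in rather than a realistic model of extraction error.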

Comments:	30 pages
Subjects:	Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Cite as:	arXiv:2506.12367 [cs.CL]
	(or arXiv:2506.12367v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2506.12367

Submission history

From: Erica Cai
[v1] Sat, 14 Jun 2025 06:14:06 UTC (3,256 KB)
