Context-Aware Scientific Knowledge Extraction on Linked Open Data using Large Language Models

Rubaiat, Sajratul Y.; Jamil, Hasan M.

Computer Science > Information Retrieval

arXiv:2506.17580 (cs)

[Submitted on 21 Jun 2025]

Title: Context-Aware Scientific Knowledge Extraction on Linked Open Data using Large Language Models

Authors: Sajratul Y. Rubaiat, Hasan M. Jamil

Abstract: The exponential growth of scientific literature makes it hard for researchers to extract and synthesize knowledge. Traditional search engines return many sources without direct, detailed answers, while general-purpose LLMs may offer concise responses that lack depth or omit current information. LLMs with search capabilities are also limited by their context windows, yielding short, incomplete answers. This paper introduces WISE (Workflow for Intelligent Scientific Knowledge Extraction), a system that addresses these limits with a structured workflow to extract, refine, and rank query-specific knowledge. WISE uses an LLM-powered, tree-based architecture to refine data, focusing on query-aligned, context-aware, and non-redundant information. Dynamic scoring and ranking prioritize the unique contributions of each source, and adaptive stopping criteria minimize processing overhead. WISE delivers detailed, organized answers by systematically exploring and synthesizing knowledge from diverse sources. Experiments on HBB gene-associated diseases demonstrate that WISE reduces processed text by over 80% while achieving significantly higher recall than baselines such as search engines and other LLM-based approaches. ROUGE and BLEU metrics reveal that WISE's output is more unique than that of other systems, and a novel level-based metric shows that it provides more in-depth information. We also explore how the WISE workflow can be adapted to diverse domains such as drug discovery, materials science, and social science, enabling efficient knowledge extraction and synthesis from unstructured scientific papers and web sources.
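The abstract does not spell out how scoring, ranking, and stopping are computed, so the following is only a minimal sketch of the general idea, not the WISE implementation. It assumes each source has already been reduced to a set of extracted fact strings, scores a source by how many not-yet-covered facts it adds, and stops once the best remaining source adds fewer than `min_gain` new facts. The function name `rank_by_novelty`, the `min_gain` parameter, and the demo data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_by_novelty(
    sources: Dict[str, Set[str]],
    min_gain: int = 1,
) -> Tuple[List[str], Set[str]]:
    """Greedily order sources by how many not-yet-seen facts each adds.

    Stops early once the best remaining source adds fewer than `min_gain`
    new facts (a simple stand-in for an adaptive stopping criterion).
    """
    covered: Set[str] = set()
    ranked: List[str] = []
    remaining = dict(sources)  # shallow copy so the caller's dict is untouched

    while remaining:
        # Re-score every remaining source against what is already covered.
        gains = {name: len(facts - covered) for name, facts in remaining.items()}
        # Highest gain first; ties broken by name so the order is deterministic.
        best = min(gains, key=lambda name: (-gains[name], name))
        if gains[best] < min_gain:
            break  # the remaining sources are redundant, so skip them
        covered |= remaining.pop(best)
        ranked.append(best)

    return ranked, covered


if __name__ == "__main__":
    # Hypothetical sources, each reduced to a set of fact identifiers.
    demo = {
        "paper_a": {"f1", "f2", "f3"},
        "paper_b": {"f2", "f3"},
        "paper_c": {"f3", "f4"},
    }
    order, facts = rank_by_novelty(demo)
    print(order)
    print(sorted(facts))
```

In this toy run, `paper_b` is never selected because everything it contains is already covered by `paper_a`, which illustrates how redundancy-aware scoring can reduce the amount of text that needs processing.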

Subjects:	Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Emerging Technologies (cs.ET)
Cite as:	arXiv:2506.17580 [cs.IR]
	(or arXiv:2506.17580v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2506.17580

Submission history

From: Sajratul Yakin Rubaiat
[v1] Sat, 21 Jun 2025 04:22:34 UTC (211 KB)
