StreamLink: Large-Language-Model Driven Distributed Data Engineering System

Feng, Dawei; Mei, Di; Tan, Huiri; Ren, Lei; Lou, Xianying; Tan, Zhangxi

arXiv:2505.21575 (cs)

[Submitted on 27 May 2025]

Authors: Dawei Feng, Di Mei, Huiri Tan, Lei Ren, Xianying Lou, Zhangxi Tan

Abstract: Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink, an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle data at scale. A key design philosophy of StreamLink is to respect user data privacy by using locally fine-tuned LLMs instead of a public AI service such as ChatGPT. With domain-adapted LLMs, we improve the system's understanding of users' natural language queries in various scenarios and simplify the generation of database queries, such as Structured Query Language (SQL), for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, users can interact with complex database systems at different scales in a user-friendly and secure manner: SQL generation achieves over 10% higher execution accuracy than baseline methods, and users can find the items they care about most among hundreds of millions of items within a few seconds using natural language.
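
The abstract does not say how the syntax and security checkers are implemented. As a rough illustration of the idea of gating a generated query before it is executed, the sketch below uses simple rule-based stand-ins instead of the LLM-based checkers the paper describes, and SQLite instead of a Spark or Hadoop backend. All function names, the keyword list, and the `items` table are hypothetical and are not taken from StreamLink.

```python
import re
import sqlite3

# Hypothetical keyword list; a rule-based stand-in for an LLM-based security checker.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)


def is_read_only(sql: str) -> bool:
    """Security check: accept only a single SELECT with no write keywords.

    Deliberately conservative: a SELECT whose string literal contains a
    forbidden word is also rejected.
    """
    body = sql.strip().rstrip(";")
    if ";" in body:  # more than one statement
        return False
    if not body.lower().startswith("select"):
        return False
    return FORBIDDEN.search(body) is None


def is_valid_syntax(conn: sqlite3.Connection, sql: str) -> bool:
    """Syntax check: ask SQLite to plan the query without running it."""
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql.strip().rstrip(";"))
        return True
    except sqlite3.Error:
        return False


def run_checked(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a generated query only if it passes both checks."""
    if not is_read_only(sql):
        raise ValueError("rejected by security check")
    if not is_valid_syntax(conn, sql):
        raise ValueError("rejected by syntax check")
    return conn.execute(sql).fetchall()


# Toy in-memory database standing in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'alpha'), (2, 'beta')")

# In a real system this string would come from the text-to-SQL model.
generated_sql = "SELECT name FROM items WHERE id = 2"
rows = run_checked(conn, generated_sql)
```

Running the security check first means a destructive statement is rejected before it reaches the database at all, even for planning.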

Comments:	Accepted by CIKM Workshop 2024, https://sites.google.com/view/cikm2024-rag/papers?authuser=0#h.ddm5fg2z885t
Subjects:	Databases (cs.DB); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.21575 [cs.DB]
	(or arXiv:2505.21575v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2505.21575

Submission history

From: Dawei Feng
[v1] Tue, 27 May 2025 07:44:16 UTC (438 KB)
