Automatic Metadata Extraction for Text-to-SQL

Shkapenyuk, Vladislav; Srivastava, Divesh; Johnson, Theodore; Ghane, Parisa

arXiv:2505.19988v2 (cs)

[Submitted on 26 May 2025 (v1), last revised 3 Jun 2025 (this version, v2)]

Title: Automatic Metadata Extraction for Text-to-SQL

Authors: Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, Parisa Ghane

Abstract: Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks, ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research. Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation. We explore the use of two standard metadata extraction techniques and one newer one: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs for its test database, so we prepared a submission that uses profiling alone and does not use any specially tuned model (we used GPT-4o). From Sept 1 to Sept 23, 2024, and from Nov 11 to Nov 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the number 1 spot on Mar 11, 2025, and are still at number 1 at the time of writing (May 2025).

Comments:	37 pages
Subjects:	Databases (cs.DB)
Cite as:	arXiv:2505.19988 [cs.DB]
	(or arXiv:2505.19988v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2505.19988

Submission history

From: Theodore Johnson
[v1] Mon, 26 May 2025 13:43:43 UTC (823 KB)
[v2] Tue, 3 Jun 2025 15:23:03 UTC (823 KB)
