Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Zhu, Yuqi; Zhong, Yi; Zhang, Jintian; Zhang, Ziheng; Qiao, Shuofei; Luo, Yujie; Du, Lun; Zheng, Da; Zhang, Ningyu; Chen, Huajun

arXiv:2506.19794 (cs)

[Submitted on 24 Jun 2025 (v1), last revised 14 Aug 2025 (this version, v4)]

Authors: Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen

Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.

Comments:	Work in progress
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as:	arXiv:2506.19794 [cs.CL]
	(or arXiv:2506.19794v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2506.19794

Submission history

From: Ningyu Zhang
[v1] Tue, 24 Jun 2025 17:04:23 UTC (1,401 KB)
[v2] Mon, 7 Jul 2025 14:20:16 UTC (1,398 KB)
[v3] Tue, 5 Aug 2025 10:29:19 UTC (1,427 KB)
[v4] Thu, 14 Aug 2025 00:35:54 UTC (1,429 KB)
