CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

Zhao, Kaiwen; Balaji, Bharathan; Lee, Stephen


arXiv:2508.03489 (cs)

[Submitted on 5 Aug 2025]


Abstract: Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often combine tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, we focus on the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions when the data contain inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems fine-tuned on table and text data.


Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2508.03489 [cs.CL]
	(or arXiv:2508.03489v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2508.03489

Submission history

From: Kaiwen Zhao
[v1] Tue, 5 Aug 2025 14:20:10 UTC (938 KB)
