arXiv:2508.03489 (cs)
[Submitted on 5 Aug 2025]

Title: CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

Authors: Kaiwen Zhao, Bharathan Balaji, Stephen Lee
Abstract: Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems fine-tuned on table and text data.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2508.03489 [cs.CL]
  (or arXiv:2508.03489v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2508.03489
arXiv-issued DOI via DataCite

Submission history

From: Kaiwen Zhao
[v1] Tue, 5 Aug 2025 14:20:10 UTC (938 KB)