DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

Kasuba, Badri Vishal; Chaudhuri, Parag; Ramakrishnan, Ganesh

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.21316v1 (cs)

[Submitted on 26 Jun 2025 (this version), latest version 16 Jul 2025 (v2)]

Title: DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

Authors: Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan

Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multilingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models (VLMs) reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. The code and dataset are available at https://github.com/kasuba-badri-vishal/DhrishtiKon.

Comments:	Work in progress
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.21316 [cs.CV]
	(or arXiv:2506.21316v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.21316

Submission history

From: Badri Vishal Kasuba
[v1] Thu, 26 Jun 2025 14:32:23 UTC (2,721 KB)
[v2] Wed, 16 Jul 2025 01:55:35 UTC (4,399 KB)
