DRISHTIKON: Visual Grounding at Multiple Granularities in Documents

Kasuba, Badri Vishal; Chaudhuri, Parag; Ramakrishnan, Ganesh

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.21316 (cs)

[Submitted on 26 Jun 2025 (v1), last revised 16 Jul 2025 (this version, v2)]

Title: DRISHTIKON: Visual Grounding at Multiple Granularities in Documents

Authors: Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan

Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for Document Intelligence and Visual Question Answering (VQA) systems. We present DRISHTIKON, a multi-granular and multi-block visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at the block, line, word, and point levels. We introduce the Multi-Granular Visual Grounding (MGVG) benchmark, a curated test set of diverse circular notifications from various sectors, each manually annotated with fine-grained, human-verified labels across multiple granularities. Extensive experiments show that our method achieves state-of-the-art grounding accuracy, with line-level granularity providing the best balance between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations reveal that leading vision-language models struggle with precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios with multi-granular grounding support. Code and dataset are made available for future research.
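The abstract does not spell out the region matching algorithm, so the following is only an illustrative sketch of the general idea of grounding an answer string in OCR output at several granularities. It is not the authors' method. The `OcrLine` type, the helper names, the fuzzy-match threshold, and the sample document are all assumptions made for this example; word-level grounding is omitted because it would need per-word boxes.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

Box = tuple  # (x0, y0, x1, y1) in pixels; hypothetical convention


@dataclass(frozen=True)
class OcrLine:
    """One OCR text line with its bounding box (hypothetical structure)."""
    text: str
    box: Box


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground_answer(answer: str, lines: list, threshold: float = 0.8) -> list:
    """Line level: keep lines that contain the answer or closely resemble it.

    The 0.8 threshold is an arbitrary assumption, not a value from the paper.
    """
    needle = answer.lower()
    return [
        line for line in lines
        if needle in line.text.lower() or similarity(answer, line.text) >= threshold
    ]


def union_box(boxes: list) -> Box:
    """Block level: the smallest box enclosing all matched line boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def center_point(box: Box) -> tuple:
    """Point level: the center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union, a common score for comparing a predicted
    box against an annotated one."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up OCR output for a short circular.
sample_lines = [
    OcrLine("Circular No. 12/2024", (10, 10, 200, 30)),
    OcrLine("Effective date: 1 April 2024", (10, 40, 260, 60)),
    OcrLine("All branches must comply.", (10, 70, 240, 90)),
]

hits = ground_answer("1 April 2024", sample_lines)
block = union_box([h.box for h in hits])
point = center_point(block)
```

In this sketch the line-level result is the list of matched lines, the block-level result is their enclosing box, and the point-level result is the center of that box. A real system would also need to handle answers that span several lines or blocks, which the abstract identifies as an important case.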

Comments:	Work in progress
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.21316 [cs.CV]
	(or arXiv:2506.21316v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.21316

Submission history

From: Badri Vishal Kasuba
[v1] Thu, 26 Jun 2025 14:32:23 UTC (2,721 KB)
[v2] Wed, 16 Jul 2025 01:55:35 UTC (4,399 KB)
