CAT: Content-Adaptive Image Tokenization

Shen, Junhong; Tirumala, Kushal; Yasunaga, Michihiro; Misra, Ishan; Zettlemoyer, Luke; Yu, Lili; Zhou, Chunting


arXiv:2501.03120 (cs)

[Submitted on 6 Jan 2025]


Abstract: Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
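The idea of encoding simpler images into fewer tokens can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the score thresholds, the candidate ratios (8, 16, 32), and the 256x256 input size are hypothetical and are not taken from the abstract. It also assumes the complexity score is already available as a number in [0, 1], whereas the abstract describes obtaining it from captions with an LLM.

```python
# Hypothetical sketch: names, thresholds, and ratios are illustrative
# assumptions, not values stated in the abstract.

def choose_compression_ratio(complexity: float) -> int:
    """Map a complexity score in [0, 1] to a per-side downsampling factor."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if complexity < 0.33:
        return 32  # simple image: compress aggressively
    if complexity < 0.66:
        return 16
    return 8       # complex image: keep more tokens


def num_tokens(height: int, width: int, ratio: int) -> int:
    """Count latent positions when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        ratio = choose_compression_ratio(score)
        print(f"score={score} ratio={ratio} tokens={num_tokens(256, 256, ratio)}")
```

Under these assumed ratios, a simple 256x256 image would be represented by 64 latent positions and a complex one by 1024, which is the kind of variable-length representation the abstract refers to.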

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.03120 [cs.CV]
	(or arXiv:2501.03120v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.03120

Submission history

From: Junhong Shen
[v1] Mon, 6 Jan 2025 16:28:47 UTC (9,753 KB)
