Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Weller, Orion; Ricci, Kathryn; Marone, Marc; Chaffin, Antoine; Lawrie, Dawn; Van Durme, Benjamin

arXiv:2507.11412 (cs)

[Submitted on 15 Jul 2025]

Authors: Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme

Abstract: The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2507.11412 [cs.CL]
	(or arXiv:2507.11412v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2507.11412

Submission history

From: Orion Weller
[v1] Tue, 15 Jul 2025 15:31:51 UTC (88 KB)
