TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

Yang, Ke; Kindratenko, Volodymyr; Zhai, ChengXiang

arXiv:2501.00522 (cs)

[Submitted on 31 Dec 2024]

Title: TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

Authors: Ke Yang, Volodymyr Kindratenko, ChengXiang Zhai

Abstract: Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.00522 [cs.CL]
	(or arXiv:2501.00522v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00522

Submission history

From: Ke Yang
[v1] Tue, 31 Dec 2024 16:08:15 UTC (162 KB)
