METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Liu, Ollie; Jaghouar, Sami; Hagemann, Johannes; Wang, Shangshang; Wiemels, Jason; Kaufman, Jeff; Neiswanger, Willie


arXiv:2501.02045 (q-bio)

[Submitted on 3 Jan 2025]

Title: METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Authors: Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger

Abstract: We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
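The abstract mentions byte-pair encoding (BPE) tokenization tailored for metagenomic sequences. As a rough illustration of the general technique only, the sketch below learns BPE merges over nucleotide strings in plain Python. It is not the authors' tokenizer: the function names (`train_bpe`, `merge_pair`, `encode`), the toy reads, the number of merges, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(reads, num_merges):
    """Learn up to `num_merges` merges from an iterable of nucleotide strings."""
    # Start from single-nucleotide tokens (A, C, G, T, ...).
    corpus = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an arbitrary choice made here so the result is deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(read, merges):
    """Tokenize one read by applying the learned merges in the order learned."""
    tokens = list(read)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy reads, not data from the paper.
toy_reads = ["ACGTACGT", "ACGTTT"]
toy_merges = train_bpe(toy_reads, num_merges=3)
print(toy_merges)
print(encode("ACGTACGT", toy_merges))
```

Concatenating the tokens returned by `encode` always reproduces the original read, so the tokenization is lossless; a production tokenizer would typically be trained with an optimized library on a far larger sample of reads.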

Subjects:	Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2501.02045 [q-bio.GN]
	(or arXiv:2501.02045v1 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2501.02045

Submission history

From: Willie Neiswanger
[v1] Fri, 3 Jan 2025 18:44:43 UTC (5,988 KB)
