Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Gao, Kaiyuan; Wang, Yusong; Guan, Haoxiang; Wang, Zun; Pei, Qizhi; Hopcroft, John E.; He, Kun; Wu, Lijun

Computer Science > Machine Learning

arXiv:2412.01564 (cs)

[Submitted on 2 Dec 2024]

Title: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu

Abstract: The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES is well established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty of designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance, with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok shows consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
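The two ideas in the abstract, local spherical coordinates and discrete tokenization, can be illustrated with a minimal sketch. This is not the authors' implementation. The frame construction (three reference atoms `a`, `b`, `c`), the helper names, and the toy codebook are all assumptions made for illustration. In a real VQ-VAE the codebook is learned and the nearest-neighbor lookup is applied to encoder outputs, not to raw coordinates.

```python
import math

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return (u[0] / n, u[1] / n, u[2] / n)

def local_spherical(a, b, c, p):
    """(r, theta, phi) of atom p in a frame built from reference atoms a, b, c.

    Hypothetical construction: origin at c, z-axis along c->b, x-axis from
    Gram-Schmidt of b->a against z. Assumes a, b, c are not collinear.
    """
    ez = unit(sub(b, c))
    w = sub(a, b)
    d = dot(w, ez)
    ex = unit((w[0] - d * ez[0], w[1] - d * ez[1], w[2] - d * ez[2]))
    ey = cross(ez, ex)
    v = sub(p, c)
    r = math.sqrt(dot(v, v))
    theta = math.acos(max(-1.0, min(1.0, dot(v, ez) / r)))  # polar angle
    phi = math.atan2(dot(v, ey), dot(v, ex))                # azimuth
    return r, theta, phi

def quantize(x, codebook):
    """Index of the nearest codebook vector (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[k])))

def rotate_z(u, ang):
    c, s = math.cos(ang), math.sin(ang)
    return (c * u[0] - s * u[1], s * u[0] + c * u[1], u[2])

# Four atoms: three references followed by the atom being described.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.4, 1.6, 0.5)]
# Apply a rigid motion (rotation about z plus a translation) to every atom.
shift = (1.0, -2.0, 3.0)
moved = [tuple(x + t for x, t in zip(rotate_z(u, 0.7), shift)) for u in atoms]

before = local_spherical(*atoms)
after = local_spherical(*moved)

# Toy, hand-written codebook over (r, theta, phi); purely illustrative.
codebook = [(1.0, 1.5, 0.0), (1.5, 1.9, -2.8), (1.5, 2.0, 2.8)]
token = quantize(before, codebook)
```

Because the frame is built from the molecule's own atoms, `before` and `after` agree up to floating-point error, which is the SE(3)-invariance property the abstract calls for. The `quantize` step then maps the continuous triple to a single discrete index that a language model could consume.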

Comments:	17 pages, 6 figures, preprint
Subjects:	Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Cite as:	arXiv:2412.01564 [cs.LG]
 	(or arXiv:2412.01564v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.01564

Submission history

From: Kaiyuan Gao
[v1] Mon, 2 Dec 2024 14:50:44 UTC (1,017 KB)
