Structural bias in three-dimensional autoregressive generative machine learning of organic molecules

Koczor-Benda, Zsuzsanna; Gilkes, Joe; Bartucca, Francesco; Al-Fekaiki, Abdulla; Maurer, Reinhard J.


arXiv:2503.21328 (physics)

[Submitted on 27 March 2025]

Title: Structural bias in three-dimensional autoregressive generative machine learning of organic molecules

Authors: Zsuzsanna Koczor-Benda, Joe Gilkes, Francesco Bartucca, Abdulla Al-Fekaiki, Reinhard J. Maurer

Abstract: A range of generative machine learning models for the design of novel molecules and materials have been proposed in recent years. Models that can generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce novel, valid, and unique molecules. However, equally important is their ability to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here, we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training datasets composed of large, functional organic molecules. We assess the elemental composition, size- and bond-length distributions, as well as the functional group and chemical space distribution of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training dataset distribution, producing molecules that are, on average, more unsaturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and based on composite datasets, which can help partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO-LUMO gap, which is a common target for functional molecule design.
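The abstract reports that generated molecules are, on average, more unsaturated and richer in heteroatoms than the training data. As a rough illustration of how such a comparison could be set up, the sketch below computes two simple per-molecule descriptors from molecular formulas and averages them over two sets. It is not the authors' analysis pipeline: the function names, the toy formula lists, and the choice of descriptors are all assumptions made for illustration. The degree of unsaturation uses the standard formula (2C + 2 + N - H - X) / 2, which assumes ordinary valences (divalent O and S).

```python
import re
from statistics import mean

# All names and data below are hypothetical, chosen only for illustration.
HALOGENS = {"F", "Cl", "Br", "I"}

def parse_formula(formula):
    """Parse a simple formula such as 'C6H6' into a dict of element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def degree_of_unsaturation(counts):
    """Rings plus pi bonds: (2C + 2 + N - H - X) / 2, assuming ordinary valences."""
    c = counts.get("C", 0)
    n = counts.get("N", 0)
    h = counts.get("H", 0)
    x = sum(counts.get(el, 0) for el in HALOGENS)
    return (2 * c + 2 + n - h - x) / 2

def heteroatom_fraction(counts):
    """Fraction of heavy (non-hydrogen) atoms that are not carbon."""
    heavy = sum(v for k, v in counts.items() if k != "H")
    hetero = sum(v for k, v in counts.items() if k not in ("C", "H"))
    return hetero / heavy if heavy else 0.0

def summarize(formulas):
    """Average both descriptors over a list of molecular formulas."""
    parsed = [parse_formula(f) for f in formulas]
    return {
        "mean_dou": mean(degree_of_unsaturation(p) for p in parsed),
        "mean_hetero_fraction": mean(heteroatom_fraction(p) for p in parsed),
    }

# Toy stand-ins for a training set and a generated set (not data from the paper).
training = ["C6H14", "C6H6", "C2H6O"]   # hexane, benzene, ethanol
generated = ["C5H5N", "C4H4O", "C6H6"]  # pyridine, furan, benzene

if __name__ == "__main__":
    print("training: ", summarize(training))
    print("generated:", summarize(generated))
```

Formula-level descriptors like these ignore three-dimensional structure and functional groups, so they can only hint at a bias; the study itself relies on functional group analysis, principal component analysis, and decision tree models.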

Comments:	18 pages, 7 figures, 14 pages of supplementary material
Subjects:	Chemical Physics (physics.chem-ph)
Cite as:	arXiv:2503.21328 [physics.chem-ph]
	(or arXiv:2503.21328v1 [physics.chem-ph] for this version)
	https://doi.org/10.48550/arXiv.2503.21328

Submission history

From: Reinhard Maurer
[v1] Thu, 27 Mar 2025 10:08:06 UTC (9,236 KB)
