COLOR: A compositional linear operation-based representation of protein sequences for identification of monomer contributions to properties

Pandey, Akash; Chen, Wei; Keten, Sinan


arXiv:2501.06371v1 (q-bio)

[Submitted on 10 Jan 2025]


Title: COLOR: A compositional linear operation-based representation of protein sequences for identification of monomer contributions to properties

Authors: Akash Pandey, Wei Chen, Sinan Keten


Abstract: The properties of biological materials like proteins and nucleic acids are largely determined by their primary sequence. While certain segments in the sequence strongly influence specific functions, identifying these segments, or so-called motifs, is challenging due to the complexity of sequential data. Although deep learning (DL) models can accurately capture sequence-property relationships, the degree of nonlinearity in these models limits the assessment of monomer contributions to a property, a critical step in identifying key motifs. Recent advances in explainable AI (XAI) offer attention and gradient-based methods for estimating monomeric contributions. However, these methods are primarily applied to classification tasks, such as binding site identification, where they achieve limited accuracy (40-45%) and rely on qualitative evaluations. To address these limitations, we introduce a DL model with interpretable steps, enabling direct tracing of monomeric contributions. We also propose a metric ($\mathcal{I}$), inspired by the masking technique in the field of image analysis and natural language processing, for quantitative analysis on datasets mainly containing distinct properties of anti-cancer peptides (ACP), antimicrobial peptides (AMP), and collagen. Our model exhibits 22% higher explainability, pinpoints critical motifs (RRR, RRI, and RSS) that significantly destabilize ACPs, and identifies motifs in AMPs that are 50% more effective in converting non-AMPs to AMPs. These findings highlight the potential of our model in guiding mutation strategies for designing protein-based biomaterials.

Subjects:	Biomolecules (q-bio.BM)
Cite as:	arXiv:2501.06371 [q-bio.BM]
	(or arXiv:2501.06371v1 [q-bio.BM] for this version)
	https://doi.org/10.48550/arXiv.2501.06371

Submission history

From: Akash Pandey
[v1] Fri, 10 Jan 2025 22:42:42 UTC (7,031 KB)
