Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Alsamkary, Hazem; Elshaffei, Mohamed; Soudy, Mohamed; Ossman, Sara; Amr, Abdallah; Abdelsalam, Nehal Adel; Elkerdawy, Mohamed; Elnaggar, Ahmed

Computer Science > Machine Learning

arXiv:2505.20036v1 (cs)

[Submitted on 26 May 2025]

Title: Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Authors: Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar

Abstract: Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and to the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset comprising 8,207 unique protein-protein interaction entries, obtained by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset applies a stringent sequence identity threshold of at most 30% to ensure robust splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving an increase of up to 12% in Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
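The abstract names embeddings concatenation (EC) as a conventional baseline and Spearman correlation as the evaluation metric, but it does not spell out either one. The sketch below is one plausible reading, not the authors' implementation: it assumes EC mean-pools the per-residue embeddings of each binding partner and joins the two pooled vectors, and it computes Spearman correlation as the Pearson correlation of average ranks. The function names (`mean_pool`, `embeddings_concatenation`, `average_ranks`, `spearman`) and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average per-residue vectors into one fixed-size vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]


def embeddings_concatenation(
    partner_a: Sequence[Sequence[float]],
    partner_b: Sequence[Sequence[float]],
) -> Vector:
    """EC-style baseline (assumed form): pool each partner, then concatenate."""
    return mean_pool(partner_a) + mean_pool(partner_b)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings": two residues for one partner, one for the other.
    features = embeddings_concatenation([[1.0, 3.0], [3.0, 5.0]], [[0.0, 2.0]])
    print(features)
    # Toy predicted vs. measured affinities (made-up numbers for illustration).
    print(spearman([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]))
```

In practice the per-residue vectors would come from a PLM encoder and the pooled features would feed a regression head. Because Spearman correlation depends only on rank order, it rewards a model for ordering complexes correctly even when the predicted affinity scale is off.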

Comments:	15 pages, 4 figures
Subjects:	Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Cite as:	arXiv:2505.20036 [cs.LG]
	(or arXiv:2505.20036v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2505.20036

Submission history

From: Mohamed Elshaffei
[v1] Mon, 26 May 2025 14:23:08 UTC (837 KB)
