Spatio-Temporal Representation Decoupling and Enhancement for Federated Instrument Segmentation in Surgical Videos

Fang, Zheng; Qi, Xiaoming; Feng, Chun-Mei; Pei, Jialun; Si, Weixin; Jin, Yueming

Electrical Engineering and Systems Science > Image and Video Processing

arXiv:2506.23759 (eess)

[Submitted on 30 June 2025]

Title: Spatio-Temporal Representation Decoupling and Enhancement for Federated Instrument Segmentation in Surgical Videos

Authors: Zheng Fang, Xiaoming Qi, Chun-Mei Feng, Jialun Pei, Weixin Si, Yueming Jin

Abstract: Surgical instrument segmentation under Federated Learning (FL) is a promising direction that enables multiple surgical sites to collaboratively train a model without centralizing their datasets. However, very few FL works exist in surgical data science, and FL methods for other modalities do not consider the inherent characteristics of the surgical domain: i) different scenarios show diverse anatomical backgrounds but highly similar instrument representations; ii) surgical simulators exist that enable large-scale synthetic data generation with minimal effort. In this paper, we propose a novel personalized FL scheme, Spatio-Temporal Representation Decoupling and Enhancement (FedST), which leverages surgical domain knowledge during both local-site and global-server training to boost segmentation. Concretely, our model adopts a Representation Separation and Cooperation (RSC) mechanism in local-site training, which decouples the query embedding layer to be trained privately so that it encodes each site's background. Meanwhile, the other parameters are optimized globally to capture the consistent representations of instruments, including the temporal layer, which captures similar motion patterns. A text-guided channel selection is further designed to highlight site-specific features, facilitating model adaptation to each site. Moreover, in global-server training, we propose Synthesis-based Explicit Representation Quantification (SERQ), which defines an explicit representation target based on synthetic data to synchronize model convergence during fusion, improving model generalization.
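The split between privately trained and globally optimized parameters described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the aggregation rule, so plain sample-size-weighted averaging (FedAvg-style) is assumed here, and the parameter names (`query_embed.weight`, `temporal.weight`), the prefix constant, and both helper functions are hypothetical. Text-guided channel selection and SERQ are not modelled.

```python
from typing import Dict, List

State = Dict[str, List[float]]

# Hypothetical name prefix marking the privately trained query embedding layer.
PRIVATE_PREFIX = "query_embed."


def aggregate_shared(site_states: List[State], site_sizes: List[int],
                     private_prefix: str = PRIVATE_PREFIX) -> State:
    """Average every non-private parameter across sites, weighted by sample count.

    Assumption: FedAvg-style weighting; the abstract does not specify the rule.
    """
    total = sum(site_sizes)
    shared_keys = [k for k in site_states[0] if not k.startswith(private_prefix)]
    global_state: State = {}
    for key in shared_keys:
        averaged = [0.0] * len(site_states[0][key])
        for state, size in zip(site_states, site_sizes):
            weight = size / total
            for i, value in enumerate(state[key]):
                averaged[i] += weight * value
        global_state[key] = averaged
    return global_state


def load_shared(local_state: State, global_state: State) -> State:
    """Overwrite shared parameters with the global ones; private ones stay local."""
    updated = {k: list(v) for k, v in local_state.items()}
    for key, values in global_state.items():
        updated[key] = list(values)
    return updated


# Toy example: two sites, each with one private and one shared parameter.
site_a = {"query_embed.weight": [1.0, 2.0], "temporal.weight": [0.0, 4.0]}
site_b = {"query_embed.weight": [9.0, 9.0], "temporal.weight": [2.0, 0.0]}

global_state = aggregate_shared([site_a, site_b], site_sizes=[1, 3])
new_a = load_shared(site_a, global_state)
```

In this toy run the server only ever sees `temporal.weight`, while each site keeps its own `query_embed.weight` untouched across rounds, which is the general pattern of personalized FL with a private layer.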

Subjects:	Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.23759 [eess.IV]
	(or arXiv:2506.23759v1 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2506.23759

Submission history

From: Zheng Fang
[v1] Mon, 30 Jun 2025 12:08:02 UTC (3,088 KB)
