Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations

Claflin, Omar

arXiv:2507.00269 (q-bio)

[Submitted on 30 Jun 2025]

Abstract: Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves a 41.3% reconstruction improvement and a 51.6% reduction in KL divergence errors. This architecture spontaneously develops a bimodal feature organization: low-squared-norm features contribute to integration pathways, while the rest contribute directly to the residual. Small nonlinear components (3% of parameters) achieve 16.5% standalone improvements, demonstrating parameter-efficient capture of computational relationships crucial for behavior. Additionally, intervention experiments using 2x2 factorial stimulus designs demonstrate that integration features exhibit selective sensitivity to experimental manipulations and produce systematic behavioral effects on model outputs, including significant interaction effects across semantic dimensions. This work provides systematic evidence for (1) dual encoding in neural representations and (2) meaningful nonlinearly encoded feature interactions, and (3) introduces an architectural paradigm shift from post-hoc feature analysis to integrated computational design, establishing foundations for next-generation SAEs.
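The abstract describes a jointly trained model with a linear pathway for feature identity and a small nonlinear pathway for feature integration, but it gives no equations. The following is only a minimal sketch of one way such a forward pass and joint loss could look, written in plain Python. Every name (`forward`, `joint_loss`, `W_int_in`, `W_int_out`), the layer sizes, the ReLU bottleneck, and the L1 penalty are illustrative assumptions and are not taken from the paper.

```python
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def init(rows, cols, rng, scale=0.1):
    """Small random Gaussian matrix with the given shape."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def forward(x, p):
    """Return (x_hat, features) for one activation vector x."""
    f = relu(matvec(p["W_enc"], x))      # sparse feature activations (identity)
    x_id = matvec(p["W_dec"], f)         # linear reconstruction from features
    h = relu(matvec(p["W_int_in"], f))   # assumed small nonlinear bottleneck
    x_int = matvec(p["W_int_out"], h)    # assumed integration contribution
    x_hat = [a + b for a, b in zip(x_id, x_int)]
    return x_hat, f

def joint_loss(x, p, l1=1e-3):
    """One shared objective: reconstruction error plus an L1 sparsity term."""
    x_hat, f = forward(x, p)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1 * sum(abs(a) for a in f)

rng = random.Random(0)
d_model, n_feat, n_hidden = 8, 32, 4     # hypothetical sizes, chosen arbitrarily
params = {
    "W_enc": init(n_feat, d_model, rng),
    "W_dec": init(d_model, n_feat, rng),
    "W_int_in": init(n_hidden, n_feat, rng),
    "W_int_out": init(d_model, n_hidden, rng),
}
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
x_hat, f = forward(x, params)
print(len(x_hat), len(f), joint_loss(x, params))
```

In this sketch, "joint" means that both pathways receive gradients from the same loss, in contrast to a sequential setup in which the linear SAE would be fit first and the nonlinear component fit afterwards on its residual error. Setting `W_int_out` to zeros reduces the model to an ordinary linear-decoder SAE.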

Subjects:	Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.00269 [q-bio.NC]
	(or arXiv:2507.00269v1 [q-bio.NC] for this version)
	https://doi.org/10.48550/arXiv.2507.00269

Submission history

From: Omar Claflin
[v1] Mon, 30 Jun 2025 21:26:58 UTC (644 KB)
