On the Emergence of Linear Analogies in Word Embeddings

Korchinski, Daniel J.; Karkada, Dhruva; Bahri, Yasaman; Wyart, Matthieu

Computer Science > Computation and Language

arXiv:2505.18651v1 (cs)

[Submitted on 24 May 2025]

Title: On the Emergence of Linear Analogies in Word Embeddings

Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart

Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure, for example $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$, whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, where the number of eigenvectors sets the dimension of the embeddings, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and with the analogy benchmark introduced by Mikolov et al.
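The spectral construction described in the abstract can be illustrated with a small toy sketch. It is not the authors' model or code: the specific interaction form $M(i,j) = \exp\left(\sum_k s_k a_{ik} a_{jk}\right)$, the strengths `s`, the attribute labels ("royal", "male"), and all function names are assumptions chosen for illustration. Under these assumptions, $\log M$ has rank equal to the number of attributes, so an embedding built from its top eigenvectors is linear in the attributes and the king/queen analogy closes up to floating-point error.

```python
import itertools
import numpy as np

def attribute_words(d):
    """All 2**d toy words; each row holds d binary attributes in {+1, -1}."""
    return np.array(list(itertools.product([1, -1], repeat=d)))

def cooccurrence_ratio(A, strengths):
    """Toy M(i,j) = exp(sum_k s_k a_ik a_jk). Illustrative assumption only."""
    return np.exp((A * strengths) @ A.T)

def spectral_embedding(S, dim):
    """Embed rows with the top `dim` eigenvectors of the symmetric matrix S,
    each scaled by the square root of its eigenvalue."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def word_index(A, attrs):
    """Row index of the word whose attribute vector equals `attrs`."""
    return int(np.flatnonzero((A == np.array(attrs)).all(axis=1))[0])

def analogy_error(W, a, b, c, d):
    """Relative norm of W[a] - W[b] + W[c] - W[d]; 0 means a perfect analogy."""
    return np.linalg.norm(W[a] - W[b] + W[c] - W[d]) / np.linalg.norm(W[d])

A = attribute_words(3)                 # 8 words, 3 attributes
s = np.array([0.5, 0.3, 0.1])          # hypothetical attribute strengths
M = cooccurrence_ratio(A, s)
W = spectral_embedding(np.log(M), dim=3)

# Hypothetical labels: attribute 0 ~ "royal", attribute 1 ~ "male".
king = word_index(A, (1, 1, 1))
man = word_index(A, (-1, 1, 1))
woman = word_index(A, (-1, -1, 1))
queen = word_index(A, (1, -1, 1))

err = analogy_error(W, king, man, woman, queen)
print(f"relative analogy error: {err:.2e}")
```

Here `dim` is set to the number of attributes because, in this toy, $\log M$ has exactly that many nonzero eigenvalues; with real corpus statistics the matrix would be noisy and the choice of dimension would matter.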

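Comments:	Main: 12 pages, 3 figures. Appendix: 8 pages, 7 figures
Subjects:	Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Cite as:	arXiv:2505.18651 [cs.CL]
	(or arXiv:2505.18651v1 [cs.CL] for this version)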
	https://doi.org/10.48550/arXiv.2505.18651

Submission history

From: Daniel Korchinski
[v1] Sat, 24 May 2025 11:42:26 UTC (907 KB)
