The broken sample problem revisited: Proof of a conjecture by Bai-Hsing and high-dimensional extensions

Jiao, Simiao; Wu, Yihong; Xu, Jiaming

arXiv:2503.14619 (math)

[Submitted on 18 Mar 2025]

Abstract: We revisit the classical broken sample problem: Two samples of i.i.d. data points $\mathbf{X}=\{X_1,\cdots, X_n\}$ and $\mathbf{Y}=\{Y_1,\cdots,Y_m\}$ are observed without correspondence with $m\leq n$. Under the null hypothesis, $\mathbf{X}$ and $\mathbf{Y}$ are independent. Under the alternative hypothesis, $\mathbf{Y}$ is correlated with a random subsample of $\mathbf{X}$, in the sense that $(X_{\pi(i)},Y_i)$'s are drawn independently from some bivariate distribution for some latent injection $\pi:[m] \to [n]$. Originally introduced by DeGroot, Feder, and Goel (1971) to model matching records in census data, this problem has recently gained renewed interest due to its applications in data de-anonymization, data integration, and target tracking. Despite extensive research over the past decades, determining the precise detection threshold has remained an open problem even for equal sample sizes ($m=n$). Assuming $m$ and $n$ grow proportionally, we show that the sharp threshold is given by a spectral and an $L_2$ condition of the likelihood ratio operator, resolving a conjecture of Bai and Hsing (2005) in the positive. These results are extended to high dimensions and settle the sharp detection thresholds for Gaussian and Bernoulli models.
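The two hypotheses described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes a bivariate Gaussian model with standard normal marginals and a hypothetical correlation parameter `rho`, and the function name `sample_broken` is likewise made up for this example. Under the null it draws $\mathbf{X}$ and $\mathbf{Y}$ independently; under the alternative it draws a uniformly random injection $\pi:[m]\to[n]$ and makes $Y_i$ correlated with $X_{\pi(i)}$.

```python
import numpy as np


def sample_broken(n, m, rho, alternative, rng):
    """Draw one broken sample (X, Y) with m <= n.

    Assumption for illustration: a bivariate Gaussian model with
    N(0, 1) marginals and correlation `rho`. Returns (x, y, pi),
    where pi is the latent injection (None under the null).
    """
    if m > n:
        raise ValueError("the setup requires m <= n")
    x = rng.standard_normal(n)
    if not alternative:
        # Null hypothesis: Y is independent of X.
        return x, rng.standard_normal(m), None
    # Alternative: pi is a uniformly random injection [m] -> [n],
    # taken as the first m entries of a random permutation of [n].
    pi = rng.permutation(n)[:m]
    noise = rng.standard_normal(m)
    # Y_i has an N(0, 1) marginal and correlation rho with X_{pi(i)}.
    y = rho * x[pi] + np.sqrt(1.0 - rho**2) * noise
    return x, y, pi


rng = np.random.default_rng(0)
x, y, pi = sample_broken(n=2000, m=1000, rho=0.9, alternative=True, rng=rng)
# With pi known, the correlation is visible; the observer does not see pi.
print(np.corrcoef(x[pi], y)[0, 1])
```

In the actual testing problem `pi` is hidden, so a test may depend only on the unordered samples `x` and `y`; the sketch returns `pi` purely so the planted correlation can be checked.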

Comments:	35 pages, 3 figures
Subjects:	Statistics Theory (math.ST); Information Theory (cs.IT)
Cite as:	arXiv:2503.14619 [math.ST]
	(or arXiv:2503.14619v1 [math.ST] for this version)
	https://doi.org/10.48550/arXiv.2503.14619

Submission history

From: Simiao Jiao
[v1] Tue, 18 Mar 2025 18:15:13 UTC (105 KB)
