Optimal disclosure risk assessment

Camerlenghi, Federico; Favaro, Stefano; Naulet, Zacharie; Panero, Francesca

arXiv:1902.05354 (math)

[Submitted on 14 Feb 2019]

Title: Optimal disclosure risk assessment

Authors: Federico Camerlenghi, Stefano Favaro, Zacharie Naulet, Francesca Panero

Abstract: Protection against disclosure is a legal and ethical obligation for agencies releasing microdata files for public use. Consider a microdata sample of size $n$ from a finite population of size $\bar{n}=n+\lambda n$, with $\lambda>0$, such that each record contains two disjoint types of information: identifying categorical information and sensitive information. Any decision about releasing data is supported by the estimation of measures of disclosure risk, which are functionals of the number of sample records with a unique combination of values of identifying variables. The most common measure is arguably the number $\tau_{1}$ of sample unique records that are population uniques. In this paper, we first study nonparametric estimation of $\tau_{1}$ under the Poisson abundance model for sample records. We introduce a class of linear estimators of $\tau_{1}$ that are simple, computationally efficient and scalable to massive datasets, and we give uniform theoretical guarantees for them. In particular, we show that they provably estimate $\tau_{1}$ all the way up to the sampling fraction $(\lambda+1)^{-1}\propto (\log n)^{-1}$, with vanishing normalized mean-square error (NMSE) for large $n$. We then establish a lower bound for the minimax NMSE for the estimation of $\tau_{1}$, which allows us to show that: i) $(\lambda+1)^{-1}\propto (\log n)^{-1}$ is the smallest possible sampling fraction; ii) the estimators' NMSE is near optimal, in the sense of matching the minimax lower bound, for large $n$. This is the main result of our paper, and it provides a precise answer to an open question about the feasibility of nonparametric estimation of $\tau_{1}$ under the Poisson abundance model and for a sampling fraction $(\lambda+1)^{-1}<1/2$.
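To make the quantities in the abstract concrete, the following is a minimal sketch of what $\tau_{1}$ and the sampling fraction $(\lambda+1)^{-1}$ measure. It is not the paper's linear estimator: it computes $\tau_{1}$ directly from a fully known population, which is exactly the information a releasing agency lacks in practice. The function names `sampling_fraction` and `tau1`, and the toy records, are hypothetical and chosen only for illustration; each record is reduced to a single key standing for its combination of identifying variables.

```python
from collections import Counter


def sampling_fraction(lam: float) -> float:
    """Return n / n_bar = 1 / (lambda + 1), using n_bar = n + lambda * n."""
    return 1.0 / (lam + 1.0)


def tau1(sample_keys, population_keys) -> int:
    """Count sample-unique keys that are also unique in the population.

    Assumption for this sketch: the whole population is observable,
    so tau_1 can be counted exactly instead of being estimated.
    """
    sample_counts = Counter(sample_keys)
    population_counts = Counter(population_keys)
    return sum(
        1
        for key, count in sample_counts.items()
        if count == 1 and population_counts[key] == 1
    )


# Hypothetical toy data: 8 population records, 4 of them sampled,
# so n = 4, n_bar = 8 and lambda = 1.
population = ["a", "a", "b", "c", "c", "c", "d", "e"]
sample = ["a", "b", "c", "d"]

lam = (len(population) - len(sample)) / len(sample)
fraction = sampling_fraction(lam)
risk_count = tau1(sample, population)
```

In this toy case all four sampled keys are sample uniques, but only `"b"` and `"d"` are also population uniques, so the count is 2 at a sampling fraction of 1/2. The estimation problem studied in the paper is to recover this count from the sample alone.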

Subjects:	Statistics Theory (math.ST)
MSC classes:	62G05, 62C20
Cite as:	arXiv:1902.05354 [math.ST]
	(or arXiv:1902.05354v1 [math.ST] for this version)
	https://doi.org/10.48550/arXiv.1902.05354

Submission history

From: Zacharie Naulet
[v1] Thu, 14 Feb 2019 13:53:43 UTC (64 KB)
