Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support

Phung, Trung; Reese, Kyle; Shpitser, Ilya; Bhattacharya, Rohit

arXiv:2507.16107 (stat)

[Submitted on 21 Jul 2025]

Title: Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support

Authors: Trung Phung, Kyle Reese, Ilya Shpitser, Rohit Bhattacharya

Abstract: A common approach for handling missing values in data analysis pipelines is multiple imputation via software packages such as MICE (Van Buuren and Groothuis-Oudshoorn, 2011) and Amelia (Honaker et al., 2011). These packages typically assume the data are missing at random (MAR), and impose parametric or smoothing assumptions upon the imputing distributions in a way that allows imputation to proceed even if not all missingness patterns have support in the data. Such assumptions are unrealistic in practice, and induce model misspecification bias on any analysis performed after such imputation. In this paper, we provide a principled alternative. Specifically, we develop a new characterization for the full data law in graphical models of missing data. This characterization is constructive, is easily adapted for the calculation of imputation distributions for both MAR and MNAR (missing not at random) mechanisms, and is able to handle lack of support for certain patterns of missingness. We use this characterization to develop a new imputation algorithm -- Multivariate Imputation via Supported Pattern Recursion (MISPR) -- which uses Gibbs sampling, by analogy with the Multivariate Imputation with Chained Equations (MICE) algorithm, but which is consistent under both MAR and MNAR settings, and is able to handle missing data patterns with no support without imposing additional assumptions beyond those already imposed by the missing data model itself. In simulations, we show MISPR obtains comparable results to MICE when data are MAR, and superior, less biased results when data are MNAR. Our characterization and imputation algorithm based on it are a step towards making principled missing data methods more practical in applied settings, where the data are likely both MNAR and sufficiently high dimensional to yield missing data patterns with no support at available sample sizes.
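
The abstract's closing point, that high-dimensional data yield missingness patterns with no support at available sample sizes, can be illustrated with a small counting exercise. The sketch below is not the MISPR algorithm and is not taken from the paper; the helper names (`missingness_patterns`, `pattern_support`, `simulate`) and the simulation settings (200 rows, 12 variables, each entry independently missing with probability 0.3) are assumptions chosen only for illustration. It tallies how many of the 2^12 possible missingness patterns actually occur in the sample.

```python
import random
from collections import Counter


def missingness_patterns(rows):
    """Map each row to a tuple of 0/1 flags (1 = the value is missing, i.e. None)."""
    return [tuple(int(value is None) for value in row) for row in rows]


def pattern_support(rows):
    """Count how many rows exhibit each distinct missingness pattern."""
    return Counter(missingness_patterns(rows))


def simulate(n_rows, n_vars, p_missing, seed=0):
    """Hypothetical data generator: every entry is dropped independently.

    This is a deliberately simple mechanism used only to produce patterns;
    it is an assumption of this sketch, not a mechanism from the paper.
    """
    rng = random.Random(seed)
    return [
        [None if rng.random() < p_missing else rng.gauss(0.0, 1.0)
         for _ in range(n_vars)]
        for _ in range(n_rows)
    ]


n_rows, n_vars = 200, 12
rows = simulate(n_rows=n_rows, n_vars=n_vars, p_missing=0.3)
support = pattern_support(rows)

n_possible = 2 ** n_vars   # every subset of variables could be missing
n_supported = len(support)  # patterns that appear in at least one row

print(f"possible patterns:  {n_possible}")
print(f"supported patterns: {n_supported}")
print(f"unsupported share:  {1 - n_supported / n_possible:.3f}")
```

Because a sample of 200 rows can exhibit at most 200 distinct patterns, most of the 4096 possible patterns necessarily have no support here, whatever the missingness mechanism is.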

Comments:	45 pages
Subjects:	Methodology (stat.ME); Machine Learning (cs.LG)
Cite as:	arXiv:2507.16107 [stat.ME]
	(or arXiv:2507.16107v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.2507.16107

Submission history

From: Trung Phung
[v1] Mon, 21 Jul 2025 23:18:36 UTC (57 KB)
