What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

Cherapanamjeri, Yeshwanth; Daskalakis, Constantinos; Ilyas, Andrew; Zampetakis, Manolis

arXiv:2205.03246v2 (math)

[Submitted on 6 May 2022 (v1), last revised 10 Dec 2022 (this version, v2)]

Authors: Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis

Abstract: In the classical setting of self-selection, the goal is to learn $k$ models simultaneously from observations $(x^{(i)}, y^{(i)})$, where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem, where the models are linear. In the known-index case, we require $\mathrm{poly}(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with $\mathrm{poly}(d) \exp(\mathrm{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\mathrm{poly}(k)$, and (3) for $k = 2$ we provide an algorithm that achieves any error $\varepsilon$ with $\mathrm{poly}(d, 1/\varepsilon)$ sample and time complexity.
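To make the data model concrete, the following is a minimal simulation sketch of linear regression under the $\max$ self-selection criterion. It is not the paper's estimation algorithm. The Gaussian covariates, the independent Gaussian noise on each model's output, and the names `sample_max_self_selection`, `W`, and `noise_std` are assumptions introduced only for illustration.

```python
import numpy as np

def sample_max_self_selection(W, n, noise_std=1.0, seed=0):
    """Draw n observations under max self-selection (illustrative sketch).

    W has shape (k, d): one row of linear coefficients per model.
    Assumed setup (not taken from the abstract): x ~ N(0, I_d) and each
    model's output is w_j . x plus independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    k, d = W.shape
    x = rng.standard_normal((n, d))
    # Outputs of all k models on every input, shape (n, k).
    outputs = x @ W.T + noise_std * rng.standard_normal((n, k))
    # Selection criterion: only the largest of the k outputs is observed.
    idx = outputs.argmax(axis=1)
    y = outputs[np.arange(n), idx]
    return x, y, idx

# Two hypothetical models in d = 2 dimensions.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x, y, idx = sample_max_self_selection(W, n=10_000)

# Known-index setting: (x, y, idx) are all available to the learner.
# Unknown-index setting: only (x, y) are available.
# Naive baseline for the known-index case: ordinary least squares on the
# samples where model j was selected. Because selection depends on the
# outputs, this baseline should not be expected to recover W[j] in general.
naive = []
for j in range(W.shape[0]):
    mask = idx == j
    w_hat, *_ = np.linalg.lstsq(x[mask], y[mask], rcond=None)
    naive.append(w_hat)
naive = np.array(naive)
```

Comparing `naive` with `W` gives a quick empirical sense of the selection bias that dedicated estimators need to correct; dropping `idx` from the returned data corresponds to the unknown-index setting.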

Subjects:	Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2205.03246 [math.ST]
	(or arXiv:2205.03246v2 [math.ST] for this version)
	https://doi.org/10.48550/arXiv.2205.03246

Submission history

From: Andrew Ilyas
[v1] Fri, 6 May 2022 14:03:05 UTC (191 KB)
[v2] Sat, 10 Dec 2022 23:09:51 UTC (211 KB)
