Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data

Folgoc, Loic Le; Baltatzis, Vasileios; Alansary, Amir; Desai, Sujal; Devaraj, Anand; Ellis, Sam; Manzanera, Octavio E. Martinez; Kanavati, Fahdi; Nair, Arjun; Schnabel, Julia; Glocker, Ben

Computer Science > Machine Learning

arXiv:2108.00250 (cs)

[Submitted on 31 Jul 2021]

Title: Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data

Authors: Loic Le Folgoc, Vasileios Baltatzis, Amir Alansary, Sujal Desai, Anand Devaraj, Sam Ellis, Octavio E. Martinez Manzanera, Fahdi Kanavati, Arjun Nair, Julia Schnabel, Ben Glocker

Abstract: Datasets are rarely a realistic approximation of the target population. For example, prevalence is misrepresented, image quality is above clinical standards, and so on. This mismatch is known as sampling bias. Sampling biases are a major hindrance for machine learning models. They cause significant gaps between model performance in the lab and in the real world. Our work is a solution to prevalence bias. Prevalence bias is the discrepancy between the prevalence of a pathology and its sampling rate in the training dataset, introduced upon collecting data or due to the practitioner rebalancing the training batches. This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias. Concretely, a bias-corrected loss function, as well as bias-corrected predictive rules, are derived under the principles of Bayesian risk minimization. The loss exhibits a direct connection to the information gain. It offers a principled alternative to heuristic training losses and complements test-time procedures based on selecting an operating point from summary curves. It integrates seamlessly into the current paradigm of (deep) learning using stochastic backpropagation and naturally with Bayesian models.
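To make the abstract's definition of prevalence bias concrete, here is a minimal sketch of the textbook prior-shift correction: class probabilities from a model fit at the training sampling rates are reweighted by the ratio of true prevalence to sampling rate and renormalized. This is an illustration only and is not the bias-corrected loss or predictive rule derived in the paper, which the abstract does not spell out. The function name `correct_for_prevalence` and the example rates (balanced training batches, 1% true prevalence) are assumptions chosen for the example.

```python
import numpy as np

def correct_for_prevalence(p_train, train_prior, true_prior):
    """Reweight class probabilities by true_prior / train_prior and renormalize.

    Hypothetical helper for illustration; assumes the class-conditional
    distribution of the inputs is the same in training and deployment.
    """
    p_train = np.asarray(p_train, dtype=float)
    weights = np.asarray(true_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    unnormalized = p_train * weights
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

# Assumed scenario: the model was trained on balanced batches (50/50),
# while the pathology has a 1% prevalence in the target population.
train_prior = [0.5, 0.5]   # [healthy, pathology] sampling rates in training
true_prior = [0.99, 0.01]  # [healthy, pathology] rates in the population

# An uninformative model output falls back to the true prevalence.
p_uninformative = correct_for_prevalence([0.5, 0.5], train_prior, true_prior)

# A confident "pathology" output is tempered by the low prevalence.
p_confident = correct_for_prevalence([0.2, 0.8], train_prior, true_prior)
print(p_uninformative, p_confident)
```

The sketch shows why rebalanced training inflates the predicted probability of a rare class: the same model output corresponds to a much lower probability once the true prevalence is accounted for.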

Subjects:	Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Cite as:	arXiv:2108.00250 [cs.LG]
	(or arXiv:2108.00250v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2108.00250

Submission history

From: Loic Le Folgoc
[v1] Sat, 31 Jul 2021 14:36:33 UTC (4,773 KB)
