Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond

Ceccon, Marina; Cornacchia, Giandomenico; Pezze, Davide Dalle; Fabris, Alessandro; Susto, Gian Antonio

doi:10.1016/j.eswa.2025.128266

Computer Science > Machine Learning

arXiv:2507.08866 (cs)

[Submitted on 9 July 2025]

Title: Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond

Authors: Marina Ceccon, Giandomenico Cornacchia, Davide Dalle Pezze, Alessandro Fabris, Gian Antonio Susto

Abstract: Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented. Through a case study with popular fairness datasets, we demonstrate the effectiveness of the DBP in predicting the risk of discriminatory outcomes and the utility of fairness-enhancing interventions. Overall, this article bridges algorithmic fairness research and anti-discrimination policy through a data-centric lens.

Comments:	Accepted at Expert Systems with Applications
Subjects:	Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Cite as:	arXiv:2507.08866 [cs.LG]
	(or arXiv:2507.08866v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2507.08866
Journal reference:	Expert Systems with Applications, Volume 292, 1 November 2025, 128266
Related DOI:	https://doi.org/10.1016/j.eswa.2025.128266

Submission history

From: Marina Ceccon
[v1] Wed, 9 Jul 2025 15:52:11 UTC (227 KB)
