Comparative Analysis of Machine Learning-Based Imputation Techniques for Air Quality Datasets with High Missing Data Rates

Yan, Sen; O'Connor, David J.; Wang, Xiaojun; O'Connor, Noel E.; Smeaton, Alan F.; Liu, Mingming

Computer Science > Machine Learning

arXiv:2412.13966 (cs)

[Submitted on 18 Dec 2024 (v1), last revised 25 Dec 2024 (this version, v2)]

Title: Comparative Analysis of Machine Learning-Based Imputation Techniques for Air Quality Datasets with High Missing Data Rates

Authors: Sen Yan, David J. O'Connor, Xiaojun Wang, Noel E. O'Connor, Alan F. Smeaton, Mingming Liu

Abstract: Urban pollution poses serious health risks, particularly in relation to traffic-related air pollution, which remains a major concern in many cities. Vehicle emissions contribute to respiratory and cardiovascular issues, especially for vulnerable and exposed road users like pedestrians and cyclists. Therefore, accurate air quality monitoring with high spatial resolution is vital for good urban environmental management. This study aims to provide insights for processing spatiotemporal datasets with high missing data rates. In this study, the challenge of high missing data rates is a result of the limited data available and the fine granularity required for precise classification of PM2.5 levels. The data used for analysis and imputation were collected from both mobile sensors and fixed stations by Dynamic Parcel Distribution, the Environmental Protection Agency, and Google in Dublin, Ireland, where the missing data rate was approximately 82.42%, making accurate Particulate Matter 2.5 level predictions particularly difficult. Various imputation and prediction approaches were evaluated and compared, including ensemble methods, deep learning models, and diffusion models. External features such as traffic flow, weather conditions, and data from the nearest stations were incorporated to enhance model performance. The results indicate that diffusion methods with external features achieved the highest F1 score, reaching 0.9486 (Accuracy: 94.26%, Precision: 94.42%, Recall: 94.82%), with ensemble models achieving the highest accuracy of 94.82%, illustrating that good performance can be obtained despite a high missing data rate.

Comments:	Accepted by IEEE CIETES 2025, with 8 pages, 3 figures, and 2 tables
Subjects:	Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Cite as:	arXiv:2412.13966 [cs.LG]
	(or arXiv:2412.13966v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.13966

Submission history

From: Sen Yan
[v1] Wed, 18 Dec 2024 15:45:08 UTC (11,280 KB)
[v2] Wed, 25 Dec 2024 13:39:20 UTC (11,280 KB)