GalaxiesML: a dataset of galaxy images, photometry, redshifts, and structural parameters for machine learning

Do, Tuan; Boscoe, Bernie; Jones, Evan; Li, Yun Qi; Alfaro, Kevin

arXiv:2410.00271 (astro-ph)

[Submitted on 30 Sep 2024]

Authors: Tuan Do (1), Bernie Boscoe (2), Evan Jones (1), Yun Qi Li (1,3), Kevin Alfaro (1) ((1) UCLA, (2) Southern Oregon University, (3) University of Washington)

Abstract: We present a dataset built for machine learning applications consisting of galaxy photometry, images, spectroscopic redshifts, and structural properties. This dataset comprises 286,401 galaxy images and photometry from the Hyper-Suprime-Cam Survey PDR2 in five imaging filters ($g,r,i,z,y$) with spectroscopically confirmed redshifts as ground truth. Such a dataset is important for machine learning applications because it is uniform, consistent, and has minimal outliers but still contains a realistic range of signal-to-noise ratios. We make this dataset public to help spur development of machine learning methods for the next generation of surveys such as Euclid and LSST. The aim of GalaxiesML is to provide a robust dataset that can be used not only for astrophysics but also for machine learning, where image properties cannot be validated by the human eye and are instead governed by physical laws. We describe the challenges associated with putting together a dataset from publicly available archives, including outlier rejection, duplication, establishing ground truths, and sample selection. This is one of the largest public machine learning-ready training sets of its kind, with redshifts ranging from 0.01 to 4. The redshift distribution of this sample peaks at a redshift of 1.5 and falls off rapidly beyond redshift 2.5. We also include an example application of this dataset for redshift estimation, demonstrating that using images for redshift estimation produces more accurate results than using photometry alone. For example, between redshifts of 0.1 and 1.25, the bias in the redshift estimate is a factor of 10 lower when using images than when using photometry alone. Results from datasets such as this will help inform how best to use data from the next generation of galaxy surveys.
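The abstract compares redshift-estimation bias over a fixed redshift range without stating how bias is defined. The sketch below shows one common photometric-redshift convention, the median of (z_pred - z_spec) / (1 + z_spec), restricted to a spectroscopic-redshift interval. The definition, the function names `photoz_bias` and `bias_in_range`, and the toy arrays are all assumptions for illustration; they are not taken from the paper or its example repository.

```python
import numpy as np


def photoz_bias(z_pred, z_spec):
    """Median normalized residual (z_pred - z_spec) / (1 + z_spec).

    This is an assumed, commonly used definition of photo-z bias,
    not necessarily the one adopted by the GalaxiesML authors.
    """
    z_pred = np.asarray(z_pred, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    if z_pred.shape != z_spec.shape:
        raise ValueError("z_pred and z_spec must have the same shape")
    return float(np.median((z_pred - z_spec) / (1.0 + z_spec)))


def bias_in_range(z_pred, z_spec, z_min, z_max):
    """Bias computed only for objects with z_min <= z_spec <= z_max."""
    z_pred = np.asarray(z_pred, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    mask = (z_spec >= z_min) & (z_spec <= z_max)
    if not mask.any():
        raise ValueError("no objects fall inside the requested range")
    return photoz_bias(z_pred[mask], z_spec[mask])


if __name__ == "__main__":
    # Toy values only; real inputs would be model predictions and the
    # spectroscopic redshifts shipped with the dataset.
    z_spec = np.array([0.05, 0.5, 1.0, 2.0])
    z_pred = np.array([0.50, 0.5, 1.2, 2.0])
    print(bias_in_range(z_pred, z_spec, 0.1, 1.25))
```

Selecting on the spectroscopic redshift rather than the predicted one keeps the sample identical for every model being compared, so an image-based and a photometry-only estimator are evaluated on the same galaxies.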

Comments:	19 pages, 6 figures, data available at https://doi.org/10.5281/zenodo.11117528, example code of usage at https://github.com/astrodatalab/galaxiesml_examples
Subjects:	Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Cite as:	arXiv:2410.00271 [astro-ph.CO]
	(or arXiv:2410.00271v1 [astro-ph.CO] for this version)
	https://doi.org/10.48550/arXiv.2410.00271

Submission history

From: Tuan Do
[v1] Mon, 30 Sep 2024 22:46:44 UTC (7,379 KB)
