Maximum Information Extraction Via Clustering and Minimization of Shannon Entropy

Becchi, Matteo; Pavan, Giovanni Maria


arXiv:2504.12990 (physics)

[Submitted on 17 Apr 2025 (v1), last revised 16 Jul 2025 (this version, v7)]


Authors: Matteo Becchi (1), Giovanni Maria Pavan (1) ((1) Politecnico di Torino, Dipartimento di Scienze Applicate e Tecnologia)

Abstract: In the analysis of any type of system, guaranteeing maximum information extraction from its data is non-trivial. Confidence in successful information extraction typically builds on prior knowledge of the studied system or on the user's experience. However, a robust and objective criterion for ensuring maximum information extraction from data is difficult to define. Here, we introduce a data-driven approach that employs Shannon entropy as a transferable metric to assess and quantify Maximum Information Extraction (MInE) from data via their clustering into statistically relevant micro-domains. The method is general and can be applied to virtually any type of data or system. We demonstrate its efficiency by analyzing, as a first example, time-series data extracted from molecular dynamics simulations of water and ice coexisting at the solid/liquid transition temperature. The method allows quantifying the information contained in the data distributions (time-independent component) and the additional information gain attainable by analyzing data as time-series (i.e., accounting for the information contained in data time-correlations). The different micro-domains that can be effectively resolved and classified in the system are characterized by their own entropies, which are found to be consistent with experimentally known thermodynamic parameters. A second test case demonstrates that the MInE approach is also effective for high-dimensional datasets and clearly shows that including poorly informative but noisy extra components/features in high-dimensional analyses may be not only useless but even detrimental to maximum information extraction. This provides a robust parameter-free approach and quantitative metrics for data analysis, and for the study of any type of system from its data.
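As a rough illustration of the idea in the abstract, the sketch below scores a clustering by how much it lowers the Shannon entropy of the data. It is not the authors' implementation: the histogram-based entropy estimate, the definition of the gain as the total entropy minus the population-weighted entropy of the clusters, and the names `shannon_entropy`, `information_gain`, and `n_bins` are all assumptions made here for illustration.

```python
import numpy as np


def shannon_entropy(values, bin_edges):
    """Shannon entropy (in bits) of a histogram of `values` over fixed bins."""
    counts, _ = np.histogram(values, bins=bin_edges)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(values, labels, n_bins=32):
    """Entropy of all the data minus the weighted entropy of its clusters.

    Assumed definition (hypothetical, for illustration only): every cluster
    is histogrammed on the same bins as the full dataset and weighted by the
    fraction of points it contains.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    h_total = shannon_entropy(values, edges)
    h_clustered = 0.0
    for k in np.unique(labels):
        mask = labels == k
        h_clustered += mask.mean() * shannon_entropy(values[mask], edges)
    return h_total - h_clustered


# Synthetic data: two well-separated groups of equal size.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(10.0, 1.0, 500)])
true_labels = np.repeat([0, 1], 500)
random_labels = rng.integers(0, 2, size=data.size)

gain_true = information_gain(data, true_labels)      # clustering that matches the groups
gain_random = information_gain(data, random_labels)  # clustering unrelated to the data
gain_none = information_gain(data, np.zeros(data.size, dtype=int))  # no clustering
```

Under this assumed definition, a clustering that separates the two groups recovers more information than an arbitrary one, and leaving the data unclustered yields no gain.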

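Comments:	Main text: 11 pages, 4 figures; Supplementary Material: 3 pages, 2 figures. v7: main text expanded, one figure added
Subjects:	Data Analysis, Statistics and Probability (physics.data-an)
Cite as:	arXiv:2504.12990 [physics.data-an]
	(or arXiv:2504.12990v7 [physics.data-an] for this version)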
	https://doi.org/10.48550/arXiv.2504.12990

Submission history

From: Matteo Becchi
[v1] Thu, 17 Apr 2025 14:54:46 UTC (1,730 KB)
[v2] Fri, 18 Apr 2025 12:44:56 UTC (1,730 KB)
[v3] Tue, 22 Apr 2025 14:01:56 UTC (1,730 KB)
[v4] Wed, 23 Apr 2025 10:08:15 UTC (1,730 KB)
[v5] Mon, 28 Apr 2025 11:50:09 UTC (1,731 KB)
[v6] Tue, 27 May 2025 07:38:49 UTC (1,726 KB)
[v7] Wed, 16 Jul 2025 15:58:44 UTC (1,893 KB)
