Goodness-of-Fit Tests for Large Datasets

Lazariv, Taras; Lehmann, Christoph

arXiv:1810.09753 (stat)

[Submitted on 23 Oct 2018]

Title: Goodness-of-Fit Tests for Large Datasets

Authors: Taras Lazariv, Christoph Lehmann

Abstract: Nowadays, data analysis in the world of Big Data is typically connected to data mining and descriptive or exploratory statistics, e.g. cluster analysis, classification, or regression analysis. Besides these techniques, there is a huge area of methods from inferential statistics that are rarely considered in connection with Big Data. Nevertheless, inferential methods are also useful for Big Data analysis, especially for quantifying uncertainty. This article provides some insights into methodological and technical issues concerning inferential methods in the Big Data area, in order to bring together Big Data and inferential statistics despite the difficulties this entails. We present an approach that allows testing goodness-of-fit without model assumptions, relying on the empirical distribution. In particular, the method is able to utilize information from large datasets, and it rests on a clear theoretical background. We concentrate on the widely used Kolmogorov-Smirnov test, which is applied for testing goodness-of-fit in statistics. Our approach can be parallelized easily, which makes it applicable to distributed datasets, particularly on a compute cluster. With this contribution, we address an audience interested in the technical and methodological background of implementing inferential statistical methods with Big Data tools such as Spark.

Subjects:	Applications (stat.AP); Computation (stat.CO)
Cite as:	arXiv:1810.09753 [stat.AP]
	(or arXiv:1810.09753v1 [stat.AP] for this version)
	https://doi.org/10.48550/arXiv.1810.09753

Submission history

From: Taras Lazariv
[v1] Tue, 23 Oct 2018 10:03:38 UTC (56 KB)
