Sample Size Planning for Classification Models

Beleites, Claudia; Neugebauer, Ute; Bocklitz, Thomas; Krafft, Christoph; Popp, Jürgen

doi:10.1016/j.aca.2012.11.007


arXiv:1211.1323 (stat)

[Submitted on 6 Nov 2012 (v1), last revised 3 May 2015 (this version, v3)]

Abstract: In biospectroscopy, suitably annotated and statistically independent samples (e.g. patients, batches, etc.) for classifier training and testing are scarce and costly. Learning curves show the model performance as a function of the training sample size and can help to determine the sample size needed to train good classifiers. However, building a good model is not enough: its performance must also be proven. We discuss learning curves for typical small sample size situations with 5 to 25 independent samples per class. Although the classification models achieve acceptable performance, the learning curve can be completely masked by the random testing uncertainty due to the equally limited test sample size. Consequently, we determine the test sample sizes necessary to achieve reasonable precision in the validation and find that 75 to 100 samples will usually be needed to test a good but not perfect classifier. Such a data set will then allow refined sample size planning on the basis of the achieved performance. We also demonstrate how to calculate the sample sizes necessary to show the superiority of one classifier over another: this often requires hundreds of statistically independent test samples or is even theoretically impossible. We demonstrate our findings with a data set of ca. 2550 Raman spectra of single cells (five classes: erythrocytes, leukocytes and three tumour cell lines BT-20, MCF-7 and OCI-AML3) as well as by an extensive simulation that allows precise determination of the actual performance of the models in question.
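The two sample size questions raised in the abstract (how many test cases are needed for a reasonably precise performance estimate, and how many are needed to show that one classifier beats another) can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a Wilson score interval for a binomial proportion and a textbook normal-approximation formula for comparing two independent proportions, and the function names `wilson_width`, `min_test_size` and `n_per_classifier` are hypothetical names chosen for this illustration.

```python
import math
from statistics import NormalDist


def wilson_width(p, n, conf=0.95):
    """Width of the Wilson score interval for an observed proportion p on n test cases.

    Assumption: the n test cases are statistically independent.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # two-sided normal quantile
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return 2 * half


def min_test_size(p, max_width, conf=0.95, n_max=1_000_000):
    """Smallest n for which the Wilson interval at proportion p is no wider than max_width.

    The width decreases monotonically in n, so a linear scan finds the minimum.
    Returns None if no n up to n_max is sufficient.
    """
    for n in range(1, n_max + 1):
        if wilson_width(p, n, conf) <= max_width:
            return n
    return None


def n_per_classifier(p1, p2, alpha=0.05, power=0.8):
    """Approximate number of independent test cases per classifier needed to
    detect a difference between true proportions p1 and p2.

    Assumptions: one-sided test, unpaired design (separate test sets),
    normal approximation to the binomial distribution.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; equal rates cannot be separated")
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((num / (p1 - p2)) ** 2)


if __name__ == "__main__":
    # Interval width for an observed sensitivity of 0.90 at different test set sizes.
    for n in (25, 50, 100):
        print(n, round(wilson_width(0.90, n), 3))
    # Test cases needed for an interval no wider than 0.12 at a proportion of 0.90.
    print(min_test_size(0.90, 0.12))
    # Test cases per classifier to distinguish 0.90 from 0.95.
    print(n_per_classifier(0.90, 0.95))
```

Under these assumptions, an observed proportion of 0.90 on 25 test cases leaves an interval more than 0.2 wide, while separating a 0.90 classifier from a 0.95 classifier takes a few hundred independent test cases each, which matches the orders of magnitude stated in the abstract. A paired design, in which both classifiers are tested on the same cases, would call for a different calculation.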

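Comments:	The paper is published in Analytica Chimica Acta (special issue "CAC2012"). This is a reformatted version of the accepted manuscript with a few typos corrected and with links to the official publication, including the supplementary material (pages 11-16 and the supplementary* files in the source). Slides of a presentation given at the Clircon conference (22 April 2015, Exeter, UK) are available as an ancillary PDF file.
Subjects:	Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
MSC classes:	92E99, 97K80, 62K99
ACM classes:	G.3
Cite as:	arXiv:1211.1323 [stat.AP]
	(or arXiv:1211.1323v3 [stat.AP] for this version)
	https://doi.org/10.48550/arXiv.1211.1323
Journal reference:	Analytica Chimica Acta, 760 (2013) 25-33
Related DOI:	https://doi.org/10.1016/j.aca.2012.11.007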

Submission history

From: Claudia Beleites
[v1] Tue, 6 Nov 2012 17:42:00 UTC (335 KB)
[v2] Fri, 4 Jan 2013 13:28:52 UTC (151 KB)
[v3] Sun, 3 May 2015 09:09:38 UTC (2,669 KB)
