Improving statistical learning methods via features selection without replacement sampling and random projection

Khan, Sulaiman; Ahmad, Muhammad; Ullah, Fida; Ibañez, Carlos Aguilar; Rodriguez, José Eduardo Valdez

Quantitative Biology > Quantitative Methods

arXiv:2506.00053 (q-bio)

[Submitted on 28 May 2025]

Abstract: Cancer is fundamentally a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression, leading to uncontrolled cell growth and metastasis. High-dimensional microarray datasets pose challenges for classification models because of the "small n, large p" problem, which results in overfitting. This study makes three key contributions: 1) We propose a machine learning-based approach that integrates the Feature Selection Without Replacement (FSWOR) technique with a projection method to improve classification accuracy. 2) We apply the Kendall statistical test to identify the most significant genes in the brain cancer microarray dataset (GSE50161), reducing the feature space from 54,675 to 20,890 genes. 3) We apply machine learning models using k-fold cross-validation; our model incorporates ensemble classifiers with LDA projection and Naïve Bayes and achieves a test score of 96%, outperforming existing methods by 9.09%. The results demonstrate the effectiveness of our approach in high-dimensional gene expression analysis, improving classification accuracy while mitigating overfitting. This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
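The first two steps named in the abstract (a Kendall-based gene filter followed by drawing features without replacement) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the significance criterion, so a hypothetical cutoff `min_abs_tau` on |tau-b| stands in for it, and FSWOR is interpreted here as splitting the retained features into disjoint random subsets. The function names and the toy data are invented for the example.

```python
import random
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = untied_x = untied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx != 0:
                untied_x += 1
            if dy != 0:
                untied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    if untied_x == 0 or untied_y == 0:
        return 0.0  # a constant sequence carries no rank information
    return (concordant - discordant) / sqrt(untied_x * untied_y)


def filter_features(columns, labels, min_abs_tau):
    """Indices of feature columns whose |tau-b| with the labels reaches the cutoff.

    Assumption: a plain |tau| cutoff replaces the paper's unstated test criterion.
    """
    return [k for k, col in enumerate(columns)
            if abs(kendall_tau_b(col, labels)) >= min_abs_tau]


def sample_without_replacement(feature_ids, n_subsets, seed=0):
    """Split feature ids into disjoint random subsets, so no feature is drawn twice.

    Assumption: this is one reading of FSWOR, not a confirmed description of it.
    """
    rng = random.Random(seed)
    shuffled = list(feature_ids)
    rng.shuffle(shuffled)
    return [shuffled[k::n_subsets] for k in range(n_subsets)]


# Toy data: 6 samples, 4 hypothetical "genes" stored column-wise; labels are 0/1.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # rises with the label
    [5.0, 4.0, 5.0, 4.0, 5.0, 4.0],  # only weakly related to the label
    [2.9, 3.0, 3.2, 1.1, 0.8, 1.0],  # falls as the label rises
    [0.5, 0.7, 0.6, 0.9, 1.1, 1.0],  # rises with the label
]
kept = filter_features(columns, labels, min_abs_tau=0.5)
subsets = sample_without_replacement(kept, n_subsets=2)
```

In a fuller pipeline each subset could feed one member of an ensemble, with LDA projection, Naïve Bayes, and k-fold cross-validation applied afterwards; those stages are omitted here because the abstract does not describe how they are configured.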

Subjects:	Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Cite as:	arXiv:2506.00053 [q-bio.QM]
	(or arXiv:2506.00053v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2506.00053

Submission history

From: Fida Ullah
[v1] Wed, 28 May 2025 22:36:46 UTC (957 KB)
