Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data

Deng, Fei; Feng, Catherine H; Gao, Nan; Zhang, Lanjing

arXiv:2501.14248 (q-bio)

[Submitted on 24 Jan 2025]

Title: Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data

Authors: Fei Deng (1), Catherine H Feng (1 and 2), Nan Gao (3 and 4), Lanjing Zhang (1, 4, 5 and 6) ((1) Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ, (2) Harvard University, Cambridge, MA, (3) Department of Biological Sciences, School of Arts & Sciences, Rutgers University, Newark, NJ, (4) Department of Pharmacology, Physiology, and Neuroscience, New Jersey Medical School, Rutgers University, Newark, NJ, (5) Department of Pathology, Princeton Medical Center, Plainsboro, NJ, (6) Rutgers Cancer Institute of New Jersey, New Brunswick, NJ.)

Abstract: Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent RNA microarray and RNA-seq datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. TCGA breast cancer microarray and RNA-seq datasets were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p>0.85) and differentially expressed genes (DEG, p<0.05) were selected based on ANOVA p values and used for subsequent data normalization and classification, respectively. Models trained on data from one platform were tested on the other platform. Our data show that selecting NDEG and DEG could effectively improve classification performance. Normalization methods based on parametric statistical analysis were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seem to achieve better performance. Therefore, NDEG-based normalization appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
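The gene-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the synthetic data, and the use of `scipy.stats.f_oneway` are assumptions. The abstract does not define LOG_QN, so the sketch assumes it means a log2 transform followed by quantile normalization across samples. How NDEG enter the normalization is also not stated in the abstract, so the two steps are kept separate here.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_pvalues(expr, labels):
    """One-way ANOVA p-value per gene; expr has shape (samples, genes)."""
    groups = [expr[labels == g] for g in np.unique(labels)]
    # f_oneway tests each column (gene) independently along axis 0.
    return f_oneway(*groups).pvalue


def select_genes(expr, labels, ndeg_p=0.85, deg_p=0.05):
    """Return index arrays (ndeg, deg) using the thresholds from the abstract."""
    p = anova_pvalues(expr, labels)
    return np.flatnonzero(p > ndeg_p), np.flatnonzero(p < deg_p)


def log_quantile_normalize(expr):
    """Assumed LOG_QN: log2(x + 1), then map every sample onto one distribution."""
    logged = np.log2(expr + 1.0)
    # Rank of each gene within its own sample (ties broken by position).
    ranks = np.argsort(np.argsort(logged, axis=1), axis=1)
    # Reference distribution: mean of the sorted values across samples.
    reference = np.sort(logged, axis=1).mean(axis=0)
    return reference[ranks]


# Synthetic stand-in for an expression matrix: 60 samples, 200 genes, 3 subtypes.
rng = np.random.default_rng(0)
labels = np.repeat(np.array(["A", "B", "C"]), 20)
expr = rng.gamma(shape=20.0, scale=5.0, size=(60, 200))
expr[labels == "B", :10] *= 4.0  # make the first 10 genes differ by subtype

ndeg_idx, deg_idx = select_genes(expr, labels)
normalized = log_quantile_normalize(expr)
```

After quantile normalization every sample shares the same set of values and only the gene ranks differ between samples, which is the sense in which the method is rank-based (nonparametric). An assumed LOG_QNZ variant would additionally z-score each gene across samples.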

Comments:	35 pages, 5 figures, 2 tables
Subjects:	Quantitative Methods (q-bio.QM); Genomics (q-bio.GN); Computation (stat.CO); Methodology (stat.ME)
Cite as:	arXiv:2501.14248 [q-bio.QM]
	(or arXiv:2501.14248v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2501.14248
Journal reference:	https://media.sciltp.com/articles/2505000683/2505000683.pdf

Submission history

From: Lanjing Zhang
[v1] Fri, 24 Jan 2025 05:23:51 UTC (956 KB)
