DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

Sun, Zhe; Wang, Ting; Deng, Ke; Wang, Xiao-Feng; Lafyatis, Robert; Ding, Ying; Hu, Ming; Chen, Wei

arXiv:1704.02007 (stat)

[Submitted on 6 Apr 2017]

Abstract: Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells, with direct counting of transcript copies using Unique Molecular Identifiers (UMIs). Despite these technological advances, statistical methods and computational tools for analyzing droplet-based scRNA-Seq data are still lacking. In particular, model-based approaches for clustering large-scale single cell transcriptomic data remain under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variation across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, to benchmark and validate DIMM-SC, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis for which prior biological knowledge is available. Both the simulation studies and the real data applications demonstrated that, overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability than existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretation; such uncertainty estimates are typically unavailable from existing clustering methods.
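As a rough illustration of how a model of this kind can attach an uncertainty to each cell's cluster assignment, the sketch below computes posterior cluster probabilities for one cell under a Dirichlet-multinomial mixture. It is a minimal sketch and not the authors' implementation: the abstract does not give the exact likelihood, so the Dirichlet-multinomial form, the function names, and the toy parameter values are all assumptions made here, and the parameter-estimation part of the expectation-maximization algorithm is not shown (the cluster parameters are simply taken as given).

```python
import math


def log_dirichlet_multinomial(counts, alpha):
    """Log-likelihood of one cell's UMI counts under a Dirichlet-multinomial.

    The multinomial coefficient is omitted because it is identical for every
    cluster and cancels when posterior probabilities are normalized.
    """
    total = sum(counts)
    alpha_sum = sum(alpha)
    ll = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + total)
    for x, a in zip(counts, alpha):
        ll += math.lgamma(x + a) - math.lgamma(a)
    return ll


def posterior_membership(counts, alphas, weights):
    """Posterior probability of each cluster for a single cell (an E-step)."""
    log_post = [
        math.log(w) + log_dirichlet_multinomial(counts, a)
        for a, w in zip(alphas, weights)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical toy setting: 3 genes, 2 clusters, equal mixing weights.
alphas = [[8.0, 1.0, 1.0], [1.0, 1.0, 8.0]]  # assumed Dirichlet parameters
weights = [0.5, 0.5]                         # assumed mixing proportions
cell = [12, 1, 0]                            # UMI counts for one cell
probs = posterior_membership(cell, alphas, weights)
```

Here `probs` holds one probability per cluster for the cell; a value close to 1 indicates a confident assignment, while values near 0.5 flag a cell whose cluster membership is ambiguous.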

Subjects:	Machine Learning (stat.ML); Quantitative Methods (q-bio.QM)
Cite as:	arXiv:1704.02007 [stat.ML]
	(or arXiv:1704.02007v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1704.02007

Submission history

From: Wei Chen
[v1] Thu, 6 Apr 2017 20:01:29 UTC (4,942 KB)
