Quantitative Biology > Genomics

arXiv:2505.07896 (q-bio)
[Submitted on 12 May 2025]

Title: Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability


Authors:Douglas Jiang, Zilin Dai, Luxuan Zhang, Qiyi Yu, Haoqi Sun, Feng Tian
Abstract: Understanding cell identity and function from single-cell sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and embed these descriptions as vectors using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as the domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
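The expression-weighted embedding step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the gene names, expression values, and 3-dimensional vectors are toy stand-ins for real LLM embeddings of NCBI Gene description text (e.g. from text-embedding-3-large).

```python
import numpy as np

def cell_embedding(expr, gene_embeddings, top_n=3):
    """Expression-weighted average of gene-description embeddings
    over the top-N most highly expressed genes in one cell.

    expr: dict mapping gene symbol -> expression value for a single cell.
    gene_embeddings: dict mapping gene symbol -> 1-D embedding of its
    NCBI Gene description (toy vectors here).
    """
    # rank genes by expression and keep the top N
    top = sorted(expr, key=expr.get, reverse=True)[:top_n]
    # normalize expression values of the selected genes into weights
    weights = np.array([expr[g] for g in top], dtype=float)
    weights /= weights.sum()
    # stack the corresponding description embeddings and average
    vecs = np.stack([gene_embeddings[g] for g in top])
    return weights @ vecs

# toy example: 4 genes with 3-dimensional "description embeddings"
expr = {"CHAT": 9.0, "MNX1": 6.0, "GAPDH": 3.0, "ACTB": 1.0}
emb = {
    "CHAT":  np.array([1.0, 0.0, 0.0]),
    "MNX1":  np.array([0.0, 1.0, 0.0]),
    "GAPDH": np.array([0.0, 0.0, 1.0]),
    "ACTB":  np.array([1.0, 1.0, 1.0]),
}
vec = cell_embedding(expr, emb, top_n=3)  # ACTB is dropped (rank 4)
```

With top_n=3 the weights are 9/18, 6/18, and 3/18, so the cell vector is 0.5·CHAT + (1/3)·MNX1 + (1/6)·GAPDH; the lowest-expressed gene never contributes.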
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Cite as: arXiv:2505.07896 [q-bio.GN]
  (or arXiv:2505.07896v1 [q-bio.GN] for this version)
  https://doi.org/10.48550/arXiv.2505.07896
arXiv-issued DOI via DataCite

Submission history

From: Feng Tian [view email]
[v1] Mon, 12 May 2025 03:39:33 UTC (2,993 KB)