Survey and Improvement Strategies for Gene Prioritization with Large Language Models

Neeley, Matthew; Qi, Guantong; Wang, Guanchu; Tang, Ruixiang; Mao, Dongxue; Liu, Chaozhong; Pasupuleti, Sasidhar; Yuan, Bo; Xia, Fan; Liu, Pengfei; Liu, Zhandong; Hu, Xia

arXiv:2501.18794 (q-bio)

[Submitted on 30 Jan 2025]

Title: Survey and Improvement Strategies for Gene Prioritization with Large Language Models

Authors: Matthew Neeley, Guantong Qi, Guanchu Wang, Ruixiang Tang, Dongxue Mao, Chaozhong Liu, Sasidhar Pasupuleti, Bo Yuan, Fan Xia, Pengfei Liu, Zhandong Liu, Xia Hu

Abstract: Rare diseases are challenging to diagnose due to limited patient data and genetic diversity. Despite advances in variant prioritization, many cases remain undiagnosed. While large language models (LLMs) have performed well in medical exams, their effectiveness in diagnosing rare genetic diseases has not been assessed. To identify causal genes, we benchmarked various LLMs for gene prioritization. Using multi-agent and Human Phenotype Ontology (HPO) classification, we categorized patients based on phenotypes and solvability levels. As gene set size increased, LLM performance deteriorated, so we used a divide-and-conquer strategy to break the task into smaller subsets. At baseline, GPT-4 outperformed other LLMs, achieving nearly 30% accuracy in ranking causal genes correctly. The multi-agent and HPO approaches helped distinguish confidently solved cases from challenging ones, highlighting the importance of known gene-phenotype associations and phenotype specificity. We found that cases with specific phenotypes or clear associations were more accurately solved. However, we observed biases toward well-studied genes and input order sensitivity, which hindered gene prioritization. Our divide-and-conquer strategy improved accuracy by overcoming these biases. By utilizing HPO classification, novel multi-agent techniques, and our LLM strategy, we improved causal gene identification accuracy compared to our baseline evaluation. This approach streamlines rare disease diagnosis, facilitates reanalysis of unsolved cases, and accelerates gene discovery, supporting the development of targeted diagnostics and therapies.
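
The divide-and-conquer idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual procedure, so everything here is an assumption made for illustration: the function name `divide_and_conquer_rank`, the parameters `chunk_size` and `keep`, the tournament-style merging of chunk winners, and the per-round shuffle (included only as one plausible way to reduce input order sensitivity). The `rank_fn` argument is a hypothetical stand-in for an LLM call that returns a small gene list ordered from most to least likely causal.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical ranker type: takes a small gene list and returns it
# ordered best-first. In practice this might wrap an LLM prompt.
Ranker = Callable[[Sequence[str]], List[str]]


def divide_and_conquer_rank(
    genes: Sequence[str],
    rank_fn: Ranker,
    chunk_size: int = 5,
    keep: int = 2,
    seed: int = 0,
) -> List[str]:
    """Rank a large candidate list by ranking small chunks at a time.

    Each round splits the pool into chunks of at most `chunk_size`,
    ranks every chunk independently, and keeps the top `keep` genes
    from each. Rounds repeat until the pool fits in a single chunk,
    which is then ranked once more to produce the final order.
    """
    if not 1 <= keep < chunk_size:
        raise ValueError("keep must satisfy 1 <= keep < chunk_size")

    rng = random.Random(seed)
    pool = list(genes)
    while len(pool) > chunk_size:
        # Assumption: shuffle so chunk membership and order vary per round.
        rng.shuffle(pool)
        survivors: List[str] = []
        for start in range(0, len(pool), chunk_size):
            chunk = pool[start:start + chunk_size]
            survivors.extend(rank_fn(chunk)[:keep])
        pool = survivors  # strictly smaller, because keep < chunk_size
    return rank_fn(pool)
```

Because each call to `rank_fn` sees at most `chunk_size` genes, the ranker never has to handle the full candidate set at once, which is the property the abstract attributes to splitting the task into smaller subsets. The returned list contains only the final-round survivors, not a full ordering of every input gene.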

Comments:	11 pages, 4 figures, 10 pages of supplementary figures
Subjects:	Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.18794 [q-bio.GN]
	(or arXiv:2501.18794v1 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2501.18794

Submission history

From: Matthew Neeley
[v1] Thu, 30 Jan 2025 23:03:03 UTC (5,940 KB)
