Finding easy regions for short-read variant calling from pangenome data

Li, Heng

Quantitative Biology > Genomics

arXiv:2507.03718 (q-bio)

[Submitted on 4 July 2025]

Title: Finding easy regions for short-read variant calling from pangenome data

Authors: Heng Li

Abstract: Background: While benchmarks on short-read variant calling suggest error rates below 0.5%, they are only applicable to predefined confident regions. For a human sample without such regions, the error rate could be 10 times higher. Although multiple sets of easy regions have been identified to alleviate the issue, they fail to consider non-reference samples or are biased towards existing short-read data or aligners. Results: Here, using hundreds of high-quality human assemblies, we derived a set of sample-agnostic easy regions where short-read variant calling reaches high accuracy. These regions cover 87.9% of GRCh38, 92.7% of coding regions and 96.4% of ClinVar pathogenic variants. They achieve a good balance between coverage and easiness and can be generated for other human assemblies or for species with multiple well-assembled genomes. Conclusion: This resource provides a convenient and powerful way to filter spurious variant calls for clinical or research human samples.

Subjects:	Genomics (q-bio.GN)
Cite as:	arXiv:2507.03718 [q-bio.GN]
	(or arXiv:2507.03718v1 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2507.03718

Submission history

From: Heng Li
[v1] Fri, 4 Jul 2025 17:11:15 UTC (68 KB)
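
The abstract describes using the easy regions to filter spurious variant calls. The sketch below illustrates one way such a filter could work; it is not the author's implementation. It assumes the regions are available as BED-style lines (chromosome, 0-based start, exclusive end) and that each call is a (chromosome, 1-based position) pair. The function names `load_regions`, `in_easy_region` and `filter_calls` are hypothetical, and only the variant's start position is tested, so calls that span a region boundary would need extra handling.

```python
from bisect import bisect_right
from collections import defaultdict


def load_regions(bed_lines):
    """Parse BED-style lines into sorted, merged intervals per chromosome.

    Assumes three whitespace-separated columns: chrom, 0-based start,
    exclusive end. Returns {chrom: (starts, ends)} with parallel lists.
    """
    by_chrom = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    regions = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlapping or adjacent: extend
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        regions[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return regions


def in_easy_region(regions, chrom, pos):
    """Return True if a 1-based position lies inside any interval."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    pos0 = pos - 1  # convert to 0-based to match BED coordinates
    i = bisect_right(starts, pos0) - 1  # last interval starting at or before pos0
    return i >= 0 and pos0 < ends[i]


def filter_calls(regions, calls):
    """Keep only calls whose position falls in an easy region."""
    return [call for call in calls if in_easy_region(regions, call[0], call[1])]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    bed = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t50"]
    calls = [("chr1", 100), ("chr1", 101), ("chr1", 301), ("chr2", 50)]
    print(filter_calls(load_regions(bed), calls))
```

Merging the intervals first lets each lookup be a single binary search, which matters when the region set covers most of a genome.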
