GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI

Walters, Skylar Sargent; Valderrama, Arthea; Smits, Thomas C.; Kouřil, David; Nguyen, Huyen N.; L'Yi, Sehi; Lange, Devin; Gehlenborg, Nils

Quantitative Biology > Genomics

arXiv:2510.13816 (q-bio)

[Submitted on 19 Sep 2025]

Title: GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI

Authors: Skylar Sargent Walters, Arthea Valderrama, Thomas C. Smits, David Kouřil, Huyen N. Nguyen, Sehi L'Yi, Devin Lange, Nils Gehlenborg

Abstract: Data visualization is a fundamental tool in genomics research, enabling the exploration, interpretation, and communication of complex genomic features. While machine learning models show promise for transforming data into insightful visualizations, current models lack the training foundation for domain-specific tasks. In an effort to provide a foundational resource for genomics-focused model training, we present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. Building on prior work with statistical plots, our approach adapts to the complexity of genomics data and the specialized representations used to depict them. We further incorporate multiple linked queries and visualizations, along with justifications for design choices, figure captions, and image alt-texts for each item in the dataset. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis: a dataset consisting of 1.14 million single-query data points, 628k query pairs, and 589k query chains. The GQVis dataset and generation code are available at https://huggingface.co/datasets/HIDIVE/GQVis and https://github.com/hms-dbmi/GQVis-Generation.

Subjects:	Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as:	arXiv:2510.13816 [q-bio.GN]
	(or arXiv:2510.13816v1 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.2510.13816

Submission history

From: Devin Lange
[v1] Fri, 19 Sep 2025 21:29:13 UTC (1,435 KB)
