Computer Science > Databases

arXiv:2510.00039 (cs)
[Submitted on 26 Sep 2025]

Title: AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents


Authors:Hossein Sholehrasa, Amirhossein Ghanaatian, Doina Caragea, Lisa A. Tell, Jim E. Riviere, Majid Jaberi-Douraki
Abstract: Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments. However, PK data are often embedded in complex, heterogeneous tables with variable structures and inconsistent terminologies, posing significant challenges for automated PK data retrieval and standardization. We present AutoPK, a novel two-stage framework for accurate and scalable extraction of PK data from complex scientific tables. In the first stage, AutoPK identifies and extracts PK parameter variants using large language models (LLMs), a hybrid similarity metric, and LLM-based validation. The second stage filters relevant rows, converts the table into a key-value text format, and uses an LLM to reconstruct a standardized table. Evaluated on a real-world dataset of 605 PK tables, including captions and footnotes, AutoPK shows significant improvements in precision and recall over direct LLM baselines. For instance, AutoPK with LLaMA 3.1-70B achieved an F1-score of 0.92 on half-life and 0.91 on clearance parameters, outperforming direct use of LLaMA 3.1-70B by margins of 0.10 and 0.21, respectively. Smaller models such as Gemma 3-27B and Phi 3-12B with AutoPK achieved 2-7 fold F1 gains over their direct use, with Gemma's hallucination rates reduced from 60-95% down to 8-14%. Notably, AutoPK enabled open-source models like Gemma 3-27B to outperform commercial systems such as GPT-4o Mini on several PK parameters. AutoPK enables scalable and high-confidence PK data extraction, making it well-suited for critical applications in veterinary pharmacology, drug safety monitoring, and public health decision-making, while addressing heterogeneous table structures and terminology and demonstrating generalizability across key PK parameters. Code and data: https://github.com/hosseinsholehrasa/AutoPK
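The abstract's "hybrid similarity metric" for matching PK parameter variants (e.g., "t1/2", "T½", "Half life (h)") to canonical names is not specified on this page; the following is only a minimal illustrative sketch, assuming a hybrid of character-level and token-level similarity. All names and the 0.5 weighting are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a hybrid similarity for matching PK parameter
# variants to a canonical name. The paper's actual metric is not given
# in this abstract; here we average a character-level edit ratio with a
# token-level Jaccard overlap.
from difflib import SequenceMatcher
import re


def _tokens(s: str) -> set[str]:
    # Lowercase alphanumeric tokens, e.g. "Half-Life (h)" -> {"half", "life", "h"}
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def hybrid_similarity(a: str, b: str, w: float = 0.5) -> float:
    # Character-level similarity (robust to hyphenation and spacing)
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Token-level Jaccard similarity (robust to word order and extra units)
    ta, tb = _tokens(a), _tokens(b)
    tok_sim = len(ta & tb) / len(ta | tb) if ta | tb else 1.0
    return w * char_sim + (1 - w) * tok_sim


def best_match(variant: str, canonical: list[str]) -> str:
    # Map a noisy table header to the closest canonical PK parameter name
    return max(canonical, key=lambda c: hybrid_similarity(variant, c))


canonical_params = ["half-life", "clearance", "volume of distribution"]
print(best_match("Half life (h)", canonical_params))  # "half-life"
```

In the paper's pipeline such a score would be one signal among several, with an LLM validation step deciding whether a candidate match is accepted.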
Comments: Accepted at the 37th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2025)
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as: arXiv:2510.00039 [cs.DB]
  (or arXiv:2510.00039v1 [cs.DB] for this version)
  https://doi.org/10.48550/arXiv.2510.00039
arXiv-issued DOI via DataCite

Submission history

From: Hossein Sholehrasa
[v1] Fri, 26 Sep 2025 22:05:32 UTC (709 KB)
