Databases

Showing new listings for Friday, 26 September 2025

Total of 3 entries

[1] arXiv:2509.20781 (cross-list from cs.LG)

Title: Sig2Model: A Boosting-Driven Model for Updatable Learned Indexes

Alireza Heidari, Amirhossein Ahmad, Wei Zhang, Ying Xiong

Comments: 22 pages, 11 figures

Subjects: Machine Learning (cs.LG) ; Databases (cs.DB) ; Performance (cs.PF)

Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the cumulative distribution function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant (sum of F(k) equals 1) requires global model retraining, which blocks queries and limits the queries-per-second (QPS) metric. Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sig2Model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in data distribution with localized sigmoid functions while preserving bounded error guarantees and deferring full retraining; (2) proactive update training via Gaussian mixture models (GMMs) that identifies high-update-probability regions for strategic placeholder allocation to speed up updates; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and GMM parameters via gradient-based learning. We evaluate Sig2Model against state-of-the-art updatable learned indexes on real-world and synthetic workloads, and show that Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory.
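As a rough illustration of the general learned-index idea the abstract starts from (a model approximating the CDF of sorted keys, followed by a bounded local search), here is a minimal Python sketch. It is not Sig2Model: it has no sigmoid boosting, GMM placeholders, or update handling, and the names `fit_linear_cdf` and `LinearLearnedIndex` are hypothetical. It assumes a non-empty list of numeric keys and uses a single least-squares line as the model.

```python
import bisect


def fit_linear_cdf(keys):
    """Least-squares fit of position ~ slope * key + intercept over sorted keys."""
    n = len(keys)
    mean_key = sum(keys) / n
    mean_pos = (n - 1) / 2
    var_key = sum((k - mean_key) ** 2 for k in keys)
    cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(keys))
    slope = cov / var_key if var_key else 0.0
    return slope, mean_pos - slope * mean_key


class LinearLearnedIndex:
    """Static learned index: one linear model plus an error-bounded search window."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumed non-empty
        self.slope, self.intercept = fit_linear_cdf(self.keys)
        # Largest distance between a predicted and a true position; this bounds
        # how far the local search has to look around any prediction.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 99])
position = index.lookup(23)
```

Inserting a new key into such a structure shifts the true positions of all larger keys, so `max_err` no longer holds and the model must be refit. That refit is the retraining cost the abstract refers to.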

[2] arXiv:2403.19884 (replaced)

Title: Representing Knowledge and Querying Data using Double-Functorial Semantics

Michael Lambert (University of Massachusetts-Boston), Evan Patterson (Topos Institute)

Comments: In Proceedings ACT 2024, arXiv:2509.18357

Journal-ref: EPTCS 429, 2025, pp. 174-189

Subjects: Category Theory (math.CT) ; Databases (cs.DB) ; Logic in Computer Science (cs.LO)

Category theory offers a mathematical foundation for knowledge representation and database systems. Popular existing approaches model a database instance as a functor into the category of sets and functions, or as a 2-functor into the 2-category of sets, relations, and implications. The functional and relational models are unified by double functors into the double category of sets, functions, relations, and implications. In an accessible, example-driven style, we show that the abstract structure of a 'double category of relations' is a flexible and expressive language in which to represent knowledge, and we show how queries on data in the spirit of Codd's relational algebra are captured by double-functorial semantics.
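To make the first of the existing approaches mentioned above concrete (a database instance as a functor into sets and functions), here is a small Python sketch. It covers only the functorial model, not the double-functorial semantics of the paper. The schema, the data, and the names `Arrow`, `is_set_valued_instance`, and `apply_path` are all hypothetical. Objects of the schema are sent to finite sets, arrows to total functions stored as dictionaries, and a path of arrows is interpreted by composing those functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arrow:
    """A schema arrow (foreign key or attribute) from table src to table dst."""
    name: str
    src: str
    dst: str


def is_set_valued_instance(objects, arrows, sets, funcs):
    """Check that every object has a set and every arrow a total function into it."""
    if any(obj not in sets for obj in objects):
        return False
    for arrow in arrows:
        f = funcs.get(arrow.name)
        if f is None:
            return False
        # Total on the source set, and landing inside the target set.
        if set(f.keys()) != sets[arrow.src] or not set(f.values()) <= sets[arrow.dst]:
            return False
    return True


def apply_path(path, funcs, x):
    """Interpret a path of composable arrows by composing their functions."""
    for arrow in path:
        x = funcs[arrow.name][x]
    return x


# Hypothetical schema: Employee --works_in--> Department --manager--> Employee
works_in = Arrow("works_in", "Employee", "Department")
manager = Arrow("manager", "Department", "Employee")
objects = ["Employee", "Department"]
arrows = [works_in, manager]

sets = {"Employee": {"ada", "bob", "cy"}, "Department": {"eng", "ops"}}
funcs = {
    "works_in": {"ada": "eng", "bob": "ops", "cy": "eng"},
    "manager": {"eng": "ada", "ops": "bob"},
}

valid = is_set_valued_instance(objects, arrows, sets, funcs)
boss_of_cy = apply_path([works_in, manager], funcs, "cy")
```

Following the path `works_in` then `manager` plays the role of a simple join-style query: it returns the manager of an employee's department. Relations that are not functions fall outside this sketch, which is the gap the relational and double-categorical models address.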

[3] arXiv:2506.13989 (replaced)

Title: AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering

Johan Östman, Edvin Callisen, Anton Chen, Kristiina Ausmees, Emanuel Gårdh, Jovan Zamac, Jolanta Goldsteine, Hugo Wefer, Simon Whelan, Markus Reimegård

Comments: 29 pages, 22 figures

Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI) ; Databases (cs.DB) ; Machine Learning (cs.LG)

Money laundering enables organized crime by moving illicit funds into the legitimate economy. Although trillions of dollars are laundered each year, detection rates remain low because launderers evade oversight, confirmed cases are rare, and institutions see only fragments of the global transaction network. Since access to real transaction data is tightly restricted, synthetic datasets are essential for developing and evaluating detection methods. However, existing datasets fall short: they often neglect partial observability, temporal dynamics, strategic behavior, uncertain labels, class imbalance, and network-level dependencies. We introduce AMLGentex, an open-source suite for generating realistic, configurable transaction data and benchmarking detection methods. AMLGentex enables systematic evaluation of anti-money laundering systems under conditions that mirror real-world challenges. By releasing multiple country-specific datasets and practical parameter guidance, we aim to empower researchers and practitioners and provide a common foundation for collaboration and progress in combating money laundering.
