Statistics > Machine Learning

arXiv:2506.04813 (stat)
[Submitted on 5 Jun 2025]

Title: Distributional encoding for Gaussian process regression with qualitative inputs

Authors: Sébastien Da Veiga (ENSAI, CREST, RT-UQ)
Abstract: Gaussian Process (GP) regression is a popular and sample-efficient approach for many engineering applications, where observations are expensive to acquire, and is also a central ingredient of Bayesian optimization (BO), a widely used method for optimizing black-box functions. However, when some or all input variables are categorical, building a predictive and computationally efficient GP remains challenging. Starting from the naive target encoding idea, where each original categorical value is replaced with the mean of the target variable for that category, we propose a generalization based on distributional encoding (DE), which uses all samples of the target variable for a category. To handle this type of encoding inside the GP, we build upon recent results on characteristic kernels for probability distributions based on the maximum mean discrepancy and the Wasserstein distance. We also discuss several extensions for classification, multi-task learning, and the incorporation of auxiliary information. Our approach is validated empirically, and we demonstrate state-of-the-art predictive performance on a variety of synthetic and real-world datasets. DE is naturally complementary to recent advances in BO over discrete and mixed spaces.
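To make the encoding concrete, here is a minimal sketch under assumptions, not the paper's exact construction; the function names `category_kernel`, the base-kernel lengthscale `ls`, and the scale parameter `theta` are hypothetical choices for illustration. Each category is encoded by the empirical distribution of the target values observed for it; two categories are compared via the biased empirical maximum mean discrepancy, MMD^2(P, Q) = E[k(X, X')] + E[k(Y, Y')] - 2 E[k(X, Y)]; and a Gaussian-type kernel exp(-MMD^2 / theta) over categories is formed, the kind of characteristic-kernel construction the abstract refers to, and a valid covariance that could be placed on the qualitative input of a GP.

```python
# Minimal sketch of distributional encoding (DE) with an MMD-based
# category kernel; names and hyperparameters are illustrative.
import numpy as np

def rbf(a, b, ls=1.0):
    """Gaussian base kernel between two 1-D sample vectors."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls ** 2)

def mmd2(x, y, ls=1.0):
    """Biased empirical squared MMD between sample sets x and y."""
    return rbf(x, x, ls).mean() + rbf(y, y, ls).mean() - 2.0 * rbf(x, y, ls).mean()

def category_kernel(samples, theta=1.0, ls=1.0):
    """Kernel over categories: k(c, c') = exp(-MMD^2(P_c, P_c') / theta).

    `samples` maps each category to the target values observed for it,
    i.e. its distributional encoding.
    """
    cats = list(samples)
    K = np.empty((len(cats), len(cats)))
    for i, ci in enumerate(cats):
        for j, cj in enumerate(cats):
            K[i, j] = np.exp(-mmd2(samples[ci], samples[cj], ls) / theta)
    return cats, K

# Toy usage: target samples observed under three categories.
rng = np.random.default_rng(0)
samples = {
    "A": rng.normal(0.0, 1.0, 30),
    "B": rng.normal(0.1, 1.0, 30),  # close to A in distribution
    "C": rng.normal(3.0, 0.5, 30),  # far from both A and B
}
cats, K = category_kernel(samples)
print(cats)
print(np.round(K, 3))
```

In the toy data, categories A and B have nearly identical target distributions and so receive a kernel value close to 1, while C is nearly decorrelated from both; a one-hot or ordinal encoding would treat all three pairs symmetrically, which is the limitation DE is designed to avoid.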
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as: arXiv:2506.04813 [stat.ML]
  (or arXiv:2506.04813v1 [stat.ML] for this version)
  https://doi.org/10.48550/arXiv.2506.04813
arXiv-issued DOI via DataCite

Submission history

From: Sébastien Da Veiga
[v1] Thu, 5 Jun 2025 09:35:02 UTC (5,609 KB)