
License: CC BY 4.0
arXiv:2506.03780v1 [stat.ML] 04 Jun 2025

High-Dimensional Learning in Finance*

*Replication code is available from the author.

Hasan Fallahgoul
Monash University
Hasan Fallahgoul, Monash University, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, 9 Rainforest Walk, 3800 Victoria, Australia. E-mail: hasan.fallahgoul@monash.edu.
(This version: June 4, 2025)
Abstract

Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine three key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set dependent alternatives. Second, I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. Third, VC-dimension analysis reveals that ridgeless regression’s effective complexity is bounded by sample size rather than nominal feature dimension. Comprehensive numerical validation confirms these theoretical predictions, revealing systematic breakdown of claimed theoretical properties across realistic parameter ranges. These results show that when sample size is small and features are high-dimensional, observed predictive success is necessarily driven by low-complexity artifacts, not genuine high-dimensional learning.

Key words: Portfolio choice, machine learning, random matrix theory, PAC-learning

JEL classification: C3, C58, C61, G11, G12, G14

1 Introduction

The integration of machine learning methods into financial prediction has emerged as one of the most active areas of research in empirical asset pricing (Kelly et al. 2024, Gu et al. 2020, Bianchi et al. 2021, Chen et al. 2024, Feng et al. 2020). The appeal is clear: while financial markets generate increasingly high-dimensional data, traditional econometric methods remain constrained by limited sample sizes and the curse of dimensionality. Machine learning promises to uncover predictive relationships that elude traditional linear models by leveraging nonlinear approximations and high-dimensional overparameterized representations, thereby expanding the frontier of return predictability and portfolio construction.

Yet despite rapid adoption and impressive empirical successes, our theoretical understanding of when and why machine learning methods succeed in financial applications remains incomplete. This gap is particularly pronounced for high-dimensional methods applied to the notoriously challenging problem of return prediction, where signals are weak, data are limited, and spurious relationships abound. A fundamental question emerges: under what conditions can sophisticated machine learning methods genuinely extract predictive information from financial data, and when might apparent success arise from simpler mechanisms?

The pioneering work of Kelly et al. (2024) has significantly advanced our theoretical understanding by establishing rigorous conditions under which complex machine learning models can outperform traditional approaches in financial prediction. Their theoretical framework, grounded in random matrix theory, demonstrates that the conventional wisdom about overfitting may not apply in high-dimensional settings, revealing a genuine 'virtue of complexity' under appropriate conditions. This breakthrough provides crucial theoretical foundations for understanding when and why sophisticated methods succeed in finance.

Building on these theoretical advances, this paper examines how practical implementation details interact with established mechanisms. This becomes important as recent empirical analysis by Nagel (2025) suggests that high-dimensional methods may achieve success through multiple pathways that differ from theoretical predictions. Several questions emerge: What are the information-theoretic requirements for learning with weak signals? How do implementation choices affect underlying mathematical properties? When do complexity benefits reflect different learning mechanisms? Understanding these interactions helps characterize the complete landscape of learning pathways in high-dimensional finance applications.

This paper provides theoretical foundations for answering these questions through three main contributions that help characterize the different mechanisms through which high-dimensional methods achieve predictive success in financial prediction.

First, I extend the theoretical analysis to practical implementations, showing how the standardization procedures commonly used for numerical stability modify the kernel approximation properties that underlie existing theory. While Random Fourier Features (RFF) theory rigorously proves convergence to shift-invariant Gaussian kernels under idealized conditions (Rahimi & Recht 2007, Sutherland & Schneider 2015), I prove that the within-sample standardization employed in every practical implementation modifies these theoretical properties. The standardized features converge instead to training-set dependent kernels that violate the mathematical foundations required for kernel methods. This breakdown explains why methods cannot achieve the kernel learning properties established by existing theory and must rely on fundamentally different mechanisms.

Rahimi & Recht (2007) prove that for features $z_i(x)=\sqrt{2}\cos(\omega_i^{\top}x+b_i)$ with $\omega_i\sim\mathcal{N}(0,\gamma^{2}I)$ and $b_i\sim\text{Uniform}[0,2\pi]$, the empirical kernel $\frac{1}{P}\sum_{i=1}^{P}z_i(x)z_i(x')$ converges in probability to the Gaussian kernel $k(x,x')=\exp(-\gamma^{2}\|x-x'\|^{2}/2)$ as $P\to\infty$. This convergence requires that the features maintain their original distributional properties and scaling. However, I prove that the within-sample standardization $\tilde{z}_i(x)=z_i(x)/\hat{\sigma}_i$ employed in every practical implementation, where $\hat{\sigma}_i^{2}=\frac{1}{T}\sum_{t=1}^{T}z_i(x_t)^{2}$, fundamentally alters the convergence properties. The standardized features converge instead to training-set dependent kernels $k^{*}_{\text{std}}(x,x'|\mathcal{T})\neq k_G(x,x')$ that violate the shift-invariance and stationarity properties required for kernel methods. A detailed analysis of how standardization breaks the specific theoretical conditions appears in Section 3, following the formal proof of this breakdown.
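
To make the contrast concrete, the following minimal sketch (Python/NumPy; illustrative dimensions and bandwidth, not the paper's replication code) compares the plain RFF kernel estimate and its within-sample standardized counterpart with the Gaussian kernel for one pair of query points.

```python
# Minimal sketch (illustrative, not the paper's replication code): compare the
# plain RFF kernel estimate and the within-sample standardized estimate with the
# Gaussian kernel k_G(x, x') = exp(-gamma^2 ||x - x'||^2 / 2).
import numpy as np

rng = np.random.default_rng(0)
K, T, P, gamma = 15, 12, 12_000, 0.2                 # illustrative values

X_train = rng.normal(size=(T, K))                    # training inputs x_1, ..., x_T
x, x_prime = rng.normal(size=K), rng.normal(size=K)  # query points

omega = rng.normal(scale=gamma, size=(P, K))         # omega_i ~ N(0, gamma^2 I_K)
b = rng.uniform(0.0, 2.0 * np.pi, size=P)            # b_i ~ Uniform[0, 2 pi]

def z(u):
    """Random Fourier features z_i(u) = sqrt(2) cos(omega_i' u + b_i)."""
    return np.sqrt(2.0) * np.cos(omega @ u + b)

Z_train = np.sqrt(2.0) * np.cos(X_train @ omega.T + b)  # T x P feature matrix
sigma2_hat = (Z_train ** 2).mean(axis=0)                # within-sample second moments

k_gauss = np.exp(-0.5 * gamma**2 * np.sum((x - x_prime) ** 2))
k_rff = np.mean(z(x) * z(x_prime))                      # approaches k_gauss as P grows
k_std = np.mean(z(x) * z(x_prime) / sigma2_hat)         # training-set dependent limit

print(f"Gaussian kernel       : {k_gauss:.4f}")
print(f"Plain RFF estimate    : {k_rff:.4f}")
print(f"Standardized estimate : {k_std:.4f}")
```

Re-running the sketch with a different draw of the training window changes only the standardized estimate, previewing the training-set dependence formalized in Section 3.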

Second, I derive sharp sample complexity bounds that characterize the information-theoretic limits of high-dimensional learning in financial settings. Using PAC-learning theory,^1 I establish both exponential and polynomial lower bounds showing when reliable extraction of weak predictive signals becomes impossible regardless of the sophistication of the employed method. These bounds reveal that reliable learning over function spaces with thousands of parameters requires stronger conditions than are typically available in financial applications. For example, methods claiming to harness 12,000 parameters with 12 monthly observations require signal-to-noise ratios exceeding realistic bounds by orders of magnitude, suggesting that predictive success may arise through mechanisms that differ from the theoretical framework.

^1 In PAC-learning (Valiant 1984), a predictor is "probably approximately correct" if, with $T\gtrsim(\text{capacity})/\varepsilon^{2}$ samples, its risk is within $\varepsilon$ of optimal with probability $1-\delta$; I apply these bounds (see Kearns & Vazirani 1994) to gauge when weak return signals are learnable.
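
As a back-of-envelope illustration of this requirement, the sketch below evaluates the classical scaling $T=O(\text{complexity}\cdot\log(1/\varepsilon)/\varepsilon^{2})$; the capacity and accuracy values are hypothetical and constants are ignored.

```python
# Rough PAC-style sample-size check (illustrative; constants ignored, capacity and
# accuracy values are hypothetical).
import math

def required_samples(capacity: float, epsilon: float) -> float:
    """Order-of-magnitude requirement T ~ capacity * log(1/eps) / eps^2."""
    return capacity * math.log(1.0 / epsilon) / epsilon**2

epsilon = 0.05                        # hypothetical target accuracy
for capacity in (12, 12_000):         # effective (<= T) versus nominal parameter count
    print(f"capacity = {capacity:>6}: T >= {required_samples(capacity, epsilon):,.0f}")
```

Even under the smaller, sample-size-bounded capacity, the implied requirement is orders of magnitude beyond 12 monthly observations.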

Third, I characterize the effective complexity of high-dimensional methods through VC-dimension analysis and sharp learning thresholds.^2 I prove that ridgeless regression operates over function spaces with complexity bounded by sample size rather than parameter count, regardless of nominal dimensionality. Combined with precise learning thresholds that depend on signal strength, feature dimension, and sample size, these results provide practitioners with concrete tools for evaluating when available data suffice for reliable prediction versus when apparent performance must arise through alternative mechanisms.

^2 Effective complexity, often called the effective degrees of freedom, is the trace of the "hat" matrix, $\operatorname{df}=\operatorname{tr}[H]$ with $H=Z(Z^{\top}Z)^{\dagger}Z^{\top}$ (where $\dagger$ denotes the Moore-Penrose pseudoinverse, so the expression is well defined when $P>T$). For minimum-norm (ridgeless) regression $H$ is an idempotent projector of rank at most $T$, so $\operatorname{df}\leq T$ irrespective of the nominal dimension $P$; see Hastie et al. (2009, Chapter 7), Bartlett et al. (2020), and Hastie et al. (2022).
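
The bound $\operatorname{df}\leq T$ is easy to verify numerically; the sketch below uses random Gaussian features as a stand-in for standardized RFFs and computes the trace of the minimum-norm hat matrix via the pseudoinverse (illustrative dimensions).

```python
# Sketch: effective degrees of freedom df = tr(H) of minimum-norm (ridgeless)
# regression. With T << P, H = Z Z^+ is an idempotent projector of rank T.
import numpy as np

rng = np.random.default_rng(0)
T, P = 12, 12_000
Z = rng.normal(size=(T, P))     # illustrative feature matrix (rows = observations)

H = Z @ np.linalg.pinv(Z)       # T x T hat matrix of the minimum-norm least-squares fit
df = np.trace(H)
print(f"nominal dimension P = {P}, effective df = {df:.2f}, sample size T = {T}")
```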

While these theoretical results provide clear mathematical boundaries on learning feasibility, their practical relevance depends on how they manifest across the parameter ranges typically employed in financial applications. The gap between asymptotic theory and finite-sample reality can be substantial, particularly when dealing with the moderate dimensions and sample sizes common in empirical asset pricing. Moreover, the breakdown of kernel approximation under standardization represents a fundamental departure from assumed theoretical properties that requires empirical quantification to assess its practical severity.

To bridge this theory-practice gap, I conduct comprehensive numerical validation of the kernel approximation breakdown across realistic parameter spaces that span the configurations used in recent high-dimensional financial prediction studies (Kelly et al. 2024, Nagel 2025). The numerical analysis examines how within-sample standardization destroys the theoretical Gaussian kernel convergence that underlies existing RFF frameworks, quantifying the magnitude of approximation errors under practical implementation choices. These experiments reveal that standardization-induced kernel deviations reach mean absolute errors exceeding 40% relative to the theoretical Gaussian kernel in typical configurations ($P=12{,}000$, $T=12$), with maximum deviations approaching 80% in high-volatility training windows. The kernel approximation failure manifests consistently across different feature dimensions and sample sizes, with relative errors scaling approximately as $\sqrt{\log P/T}$, in line with theoretical predictions. The numerical validation thus provides concrete evidence that practical implementation details create substantial violations of the theoretical assumptions underlying high-dimensional RFF approaches, with error magnitudes sufficient to fundamentally alter method behavior.
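
The reported scaling can be probed with a small Monte Carlo sweep. The sketch below uses an illustrative parameter grid and input distribution (not the paper's experimental design) and reports the average absolute gap between the standardized kernel estimate and the Gaussian kernel next to $\sqrt{\log P/T}$.

```python
# Illustrative sweep: average |k_std - k_G| across (P, T) configurations, reported
# alongside sqrt(log P / T). Grid and distributions are assumptions of this sketch.
import numpy as np

rng = np.random.default_rng(1)
K, gamma, n_query = 15, 0.2, 50

def mean_abs_gap(P, T):
    X_train = rng.normal(size=(T, K))
    omega = rng.normal(scale=gamma, size=(P, K))
    b = rng.uniform(0.0, 2.0 * np.pi, size=P)
    sigma2 = (2.0 * np.cos(X_train @ omega.T + b) ** 2).mean(axis=0)  # sigma_hat_i^2
    gaps = []
    for _ in range(n_query):
        x, xp = rng.normal(size=K), rng.normal(size=K)
        zx = np.sqrt(2.0) * np.cos(omega @ x + b)
        zxp = np.sqrt(2.0) * np.cos(omega @ xp + b)
        k_std = np.mean(zx * zxp / sigma2)
        k_gauss = np.exp(-0.5 * gamma**2 * np.sum((x - xp) ** 2))
        gaps.append(abs(k_std - k_gauss))
    return float(np.mean(gaps))

for P, T in [(1_000, 12), (12_000, 12), (12_000, 120)]:
    print(f"P = {P:>6}, T = {T:>4}: gap = {mean_abs_gap(P, T):.3f}, "
          f"sqrt(log P / T) = {np.sqrt(np.log(P) / T):.3f}")
```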

Together, these results explain why methods may achieve predictive success through multiple pathways, including both sophisticated learning and simpler pattern-matching mechanisms. These findings provide practitioners with frameworks for evaluating different sources of predictive performance in high-dimensional models, and for understanding when such methods can genuinely contribute to predictive performance versus when they exploit statistical artifacts.

1.1 Literature Review

This paper builds on three distinct but interconnected theoretical traditions to provide foundations for understanding high-dimensional learning in financial prediction.

The Probably Approximately Correct (PAC) framework (Valiant 1984, Kearns & Vazirani 1994) provides fundamental tools for characterizing when reliable learning is information-theoretically feasible. Classical results establish that achieving generalization error $\varepsilon$ with confidence $1-\delta$ requires sample sizes scaling with the complexity of the function class, typically $T=O(\text{complexity}\cdot\log(1/\varepsilon)/\varepsilon^{2})$ (Shalev-Shwartz & Ben-David 2014). Recent advances in high-dimensional learning theory (Belkin et al. 2019, Bartlett et al. 2020, Hastie et al. 2022) have refined these bounds for overparameterized models, showing that the effective rather than nominal complexity determines learning difficulty. However, these results have not been systematically applied to the specific challenges of financial prediction, where weak signals and limited sample sizes create particularly demanding learning environments.

The RFF methodology (Rahimi & Recht 2007) provides computationally efficient approximation of kernel methods through random trigonometric features, with theoretical guarantees assuming convergence to shift-invariant kernels under appropriate conditions (Rudi & Rosasco 2017). Subsequent work has characterized the approximation quality and convergence rates for various kernel classes (Mei & Montanari 2022), establishing RFF as a foundation for scalable kernel learning. However, existing theory assumes idealized implementations that may not reflect practical usage. In particular, no prior work has analyzed how the standardization procedures commonly employed to improve numerical stability affect the fundamental convergence properties that justify the theoretical framework.

The phenomenon of "benign overfitting" in overparameterized models has generated substantial theoretical interest (Belkin et al. 2019, Bartlett et al. 2020), with particular focus on understanding when adding parameters can improve rather than harm generalization performance. The VC dimension provides a classical measure of model complexity that connects directly to generalization bounds (Vapnik 1998), while recent work on effective degrees of freedom (Hastie et al. 2022) shows how structural constraints can limit the true complexity of nominally high-dimensional methods. These insights have been applied to understanding ridge regression in high-dimensional settings, but the connections to kernel methods and the specific constraints imposed by ridgeless regression in financial applications remain underexplored.

The application of machine learning to financial prediction has generated extensive empirical literature (Gu et al. 2020, Kelly et al. 2024, Chen et al. 2024), with particular attention to high-dimensional methods that can potentially harness large numbers of predictors (Feng et al. 2020, Bianchi et al. 2021). The theoretical framework of Kelly et al. (2024) provides crucial insights into when high-dimensional methods can succeed, particularly their demonstration that ridgeless regression can achieve positive performance despite seemingly problematic complexity ratios. This paper extends their analysis by examining how practical implementation considerations interact with these theoretical mechanisms.

This paper contributes to each of these literatures by providing the first unified theoretical analysis that connects sample complexity limitations, kernel approximation breakdown, and effective complexity bounds to explain the behavior of high-dimensional methods in financial prediction.

The remainder of the paper proceeds as follows. Section 2 establishes the theoretical framework and formalizes the theory-practice disconnect in RFF implementations. Section 3 proves that within-sample standardization fundamentally breaks kernel approximation, explaining why claimed theoretical properties cannot hold in practice. Section 4 establishes information-theoretic barriers to high-dimensional learning, showing that genuine complexity benefits are impossible under realistic financial conditions. Section 6 concludes. All technical details are relegated to a supplementary document containing Appendices A and B, which is available upon request from the author.

2 Background and Framework

This section establishes the theoretical framework for analyzing high-dimensional prediction methods in finance. I first formalize the return prediction problem, then examine the critical disconnect between RFF theory and practical implementation that underlies my main results.

2.1 The Financial Prediction Problem

Consider the fundamental challenge of predicting asset returns using high-dimensional predictor information. I observe predictor vectors $x_t\in\mathbb{R}^{K}$ and subsequent returns $r_{t+1}\in\mathbb{R}$ for $t=1,\ldots,T$, with the goal of learning a predictor $\hat{f}:\mathbb{R}^{K}\to\mathbb{R}$ that minimizes the expected squared loss $\mathbb{E}[(r_{t+1}-\hat{f}(x_t))^{2}]$.

The challenge lies in the fundamental characteristics of financial prediction: signals are weak relative to noise, predictors exhibit complex persistence patterns, and available sample sizes are limited by the nonstationarity of financial markets. These features create a particularly demanding environment for high-dimensional learning methods.

I formalize this environment through three core assumptions that capture the essential features while maintaining sufficient generality for my theoretical analysis.

Assumption 1 (Financial Prediction Environment).

The return generating process is $r_{t+1}=f^{*}(x_t)+\epsilon_{t+1}$ where:

  (a) $f^{*}:\mathbb{R}^{K}\to\mathbb{R}$ is the true regression function with $\mathbb{E}[f^{*}(x)^{2}]\leq B^{2}$;

  (b) $\epsilon_{t+1}$ is noise with $\mathbb{E}[\epsilon_{t+1}\,|\,x_t]=0$ and $\mathbb{E}[\epsilon_{t+1}^{2}\,|\,x_t]=\sigma^{2}$;

  (c) the signal-to-noise ratio $\text{SNR}:=B^{2}/\sigma^{2}=O(K^{-\alpha})$ for some $\alpha>0$;

  (d) predictors follow $x_t=\Phi x_{t-1}+u_t$ with $u_t\sim\mathcal{N}(0,\Sigma_u)$ and eigenvalues of $\Phi$ in $(0,1)$.

This assumption captures the essential features of financial prediction that distinguish it from typical machine learning applications. The bounded signal condition and weak SNR scaling reflect the empirical reality that financial predictors typically explain only 1-5% of return variation (Welch & Goyal 2008). The persistence in predictors (eigenvalues of $\Phi$ in $(0,1)$) captures the well-documented dynamics of financial variables such as dividend yields and interest rate spreads, which proves crucial for understanding why short training windows lead to mechanical pattern matching rather than genuine learning.
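
For concreteness, the sketch below simulates a stylized version of this environment with a linear $f^{*}$, AR(1) predictors, and a hypothetical calibration in which the signal explains roughly 2% of return variance; it illustrates Assumption 1 and is not the paper's data.

```python
# Stylized simulation of Assumption 1 (hypothetical calibration): persistent AR(1)
# predictors and returns driven by a weak linear signal plus noise.
import numpy as np

rng = np.random.default_rng(0)
K, T, phi, sigma_eps = 15, 360, 0.95, 0.05      # illustrative values

beta = rng.normal(size=K)
beta *= np.sqrt(0.02) * sigma_eps / np.linalg.norm(beta)   # SNR = B^2 / sigma^2 = 2%

X, r = np.zeros((T, K)), np.zeros(T)
x = np.zeros(K)
for t in range(T):
    x = phi * x + rng.normal(scale=np.sqrt(1.0 - phi**2), size=K)  # stationary AR(1)
    X[t] = x
    r[t] = x @ beta + rng.normal(scale=sigma_eps)                  # weak signal + noise

print(f"share of return variance explained by the true signal: "
      f"{np.var(X @ beta) / np.var(r):.2%}")
```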

Assumption 2 (Random Fourier Features Construction).

High-dimensional predictive features are constructed as $z_i(x)=\sqrt{2}\cos(\omega_i^{\top}x+b_i)$ where $\omega_i\sim\mathcal{N}(0,\gamma^{2}I_K)$ and $b_i\sim\text{Uniform}[0,2\pi]$ for $i=1,\ldots,P$. In practical implementations, these features are standardized within each training sample: $\tilde{z}_i(x)=z_i(x)/\hat{\sigma}_i$ where $\hat{\sigma}_i^{2}=\frac{1}{T}\sum_{t=1}^{T}z_i(x_t)^{2}$.

This assumption formalizes the RFF methodology as actually implemented in practice, including the crucial standardization step that has not been analyzed in existing theoretical frameworks. The standardization appears in every practical implementation to improve numerical stability, yet as I prove, it fundamentally alters the mathematical properties of the method.

Assumption 3 (Regularity Conditions).

The input distribution has bounded support and finite moments, ensuring a well-defined feature covariance $\Sigma_z=\mathbb{E}[z(x)z(x)^{\top}]$ satisfying $c_z I_P\preceq\Sigma_z\preceq C_z I_P$ for constants $0<c_z\leq C_z$. Training samples satisfy standard non-degeneracy conditions.^3

^3 Specifically, the matrix $A=[2x_t^{\top}\;\;2]_{t=1}^{T}$ has full column rank $T$, ensuring the geometric properties needed for my convergence analysis. See Appendix B for technical details.

These technical conditions ensure that concentration inequalities apply and that my convergence results hold with high probability. The conditions are mild and satisfied in typical financial applications.^4

^4 For example, in the KMZ setup with $K=15$ predictors and $T=12$-month training windows, these conditions hold almost surely since continuous economic variables generically satisfy the required independence properties.

Assumption 4 (Affine Independence of the Sample).

Let $x_1,\ldots,x_T\in\mathbb{R}^{K}$ with $T\geq 5$. The $(K+1)\times T$ matrix $A=[2x_t^{\top}\;\;2]_{t=1}^{T}$ has full column rank $T$ (equivalently, the augmented vectors $(x_t,1)$ are affinely independent).

This assumption enters my analysis through the small-ball probability estimates needed to establish convergence of standardized kernels. The full-rank requirement ensures that the linear change of variables $(\omega,b)\mapsto(2\omega^{\top}x_t+2b)_{t=1}^{T}$ is bi-Lipschitz on bounded sets, enabling the geometric control that yields exponential small-ball bounds and finiteness of key expectations. In Kelly et al.'s empirical design with $K=15$ predictors and $T=12$ months, the matrix $A$ is $16\times 12$, and since its elements are continuous macroeconomic variables, affine dependence has Lebesgue measure zero, making this assumption mild.
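
A quick numerical check (with synthetic draws standing in for the macroeconomic predictors; purely illustrative) confirms that such a $16\times 12$ matrix generically has full column rank.

```python
# Illustrative check of Assumption 4 for K = 15, T = 12: the (K+1) x T matrix
# A = [2 x_t ; 2] generically has full column rank T for continuously distributed
# predictors. Synthetic data stand in for the macro predictors.
import numpy as np

rng = np.random.default_rng(0)
K, T = 15, 12
X = rng.normal(size=(T, K))                          # stand-in predictor sample
A = np.vstack([2.0 * X.T, 2.0 * np.ones((1, T))])    # (K+1) x T augmented matrix
print("column rank of A:", np.linalg.matrix_rank(A), "out of", T)
```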

Assumption 5 (Sub-Gaussian RFFs).

For every unit vector $u\in\mathbb{R}^{P}$, the scalar $u^{\top}z(x)$ is $\kappa$-sub-Gaussian under $x\sim\mu$: $\mathbb{E}[\exp(t\,u^{\top}z(x))]\leq\exp(\tfrac{1}{2}\kappa^{2}t^{2})$ for all $t\in\mathbb{R}$.

Assumption 5 requires that linear combinations $u^{\top}z(x)$ of the random Fourier features are sub-Gaussian with parameter $\kappa$. This concentration condition is essential for applying uniform convergence results and obtaining non-asymptotic bounds on the empirical feature covariance matrix that appear in the sample complexity analysis. The assumption is standard in high-dimensional learning theory and is automatically satisfied for RFF with bounded support: since $z_i(x)=\sqrt{2}\cos(\omega_i^{\top}x+b_i)\in[-\sqrt{2},\sqrt{2}]$, each feature is bounded, and linear combinations of bounded random variables are sub-Gaussian with parameter $\kappa=O(\sqrt{P})$. This ensures that concentration inequalities apply to the feature covariance estimation, enabling the PAC-learning bounds while remaining satisfied in all practical RFF implementations.

2.2 The Theory-Practice Disconnect in Random Fourier Features

The foundation of high-dimensional prediction methods in finance rests on RFF theory, yet a fundamental disconnect exists between theoretical guarantees and practical implementation. Understanding this disconnect is crucial for interpreting what these methods actually accomplish.

2.2.1 Theoretical Guarantees Under Idealized Conditions

The RFF methodology (Rahimi & Recht 2007) provides rigorous theoretical foundations for kernel approximation. For target shift-invariant kernels $k(x,x')=k(x-x')$, the theory establishes that:

$$k_{\text{RFF}}(x,x')=\frac{1}{P}\sum_{i=1}^{P}z_i(x)z_i(x')\;\xrightarrow{P\to\infty}\;k_G(x,x')=\exp\!\left(-\frac{\gamma^{2}}{2}\|x-x'\|^{2}\right) \qquad (2.1)$$

in probability, under the condition that features maintain their original distributional properties. This convergence enables kernel methods to be approximated through linear regression in the RFF space, with all the theoretical guarantees that kernel learning provides.

2.2.2 What Actually Happens in Practice

Every practical RFF implementation deviates from the theoretical setup in a seemingly minor but mathematically crucial way. To improve numerical stability and ensure comparable scales across features, practitioners standardize features using training sample statistics:

$$\tilde{z}_i(x)=\frac{z_i(x)}{\hat{\sigma}_i},\qquad \hat{\sigma}_i^{2}=\frac{1}{T}\sum_{t=1}^{T}z_i(x_t)^{2} \qquad (2.2)$$

This standardization fundamentally alters the mathematical properties of the method. The standardized empirical kernel becomes:

$$k_{\text{std}}(x,x')=\frac{1}{P}\sum_{i=1}^{P}\frac{z_i(x)z_i(x')}{\hat{\sigma}_i^{2}} \qquad (2.3)$$

This standardized kernel no longer converges to the Gaussian kernel. Instead, as I prove in Theorem 1, it converges to a training-set dependent limit $k^{*}_{\text{std}}(x,x'|\mathcal{T})\neq k_G(x,x')$ that violates the shift-invariance and stationarity properties required for kernel methods.

3 How Standardization Modifies Kernel Approximation

Having established the theory-practice disconnect in Section 2, I now prove rigorously that standardization fundamentally alters the kernel approximation properties that justify RFF methods. This breakdown explains why high-dimensional methods cannot achieve their claimed theoretical properties and must rely on simpler mechanisms.

3.1 Main Result

Theorem 1 (Modified Convergence of Gaussian-RFF Approximation under Standardization).

Let Assumptions 1-5 hold. For query points $x,x'\in\mathbb{R}^{K}$, define the standardized kernel function:

$$h(\omega,b)=\frac{2\cos(\omega^{\top}x+b)\cos(\omega^{\top}x'+b)}{1+\frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_t+2b)}$$

where $(\omega,b)\sim\mathcal{N}(0,\gamma^{2}I_K)\times\text{Uniform}[0,2\pi]$.

Then:

  (a) For every fixed $x,x'\in\mathbb{R}^{K}$, the standardized kernel estimator converges almost surely:

$$k_{\text{std}}^{(P)}(x,x'):=\frac{1}{P}\sum_{i=1}^{P}h(\omega_i,b_i)\;\xrightarrow[P\to\infty]{\text{a.s.}}\;k_{\text{std}}^{*}(x,x'):=\mathbb{E}[h(\omega,b)]$$

  (b) The limit kernel $k_{\text{std}}^{*}$ depends on the particular training set $\mathcal{T}=\{x_1,\ldots,x_T\}$, whereas the Gaussian kernel $k_G(x,x')=\exp(-\frac{\gamma^{2}}{2}\|x-x'\|^{2})$ is training-set independent. Consequently, $k_{\text{std}}^{*}\neq k_G$ in general.

The proof proceeds in two steps. First, I establish that the standardized kernel function $h(\omega,b)$ has finite expectation despite the random denominator, enabling application of the strong law of large numbers for part (a). This requires controlling the probability that the empirical variance $\hat{\sigma}^{2}$ becomes arbitrarily small, which I achieve through geometric analysis exploiting the full-rank condition. Second, I prove training-set dependence by explicit construction: scaling any training point $x_j\mapsto\alpha x_j$ with $\alpha>1$ yields different limiting kernels, establishing that $k_{\text{std}}^{*}\neq k_G$. The complete technical proof appears in Appendix A.

3.2 Analysis of the Breakdown

To understand the implications of Theorem 1, I examine precisely how standardization violates the conditions under which RFF theory operates. Rahimi & Recht (2007) prove convergence to the Gaussian kernel under two essential conditions: distributional alignment of the frequencies $\omega_i$ and phases $b_i$ with the target kernel's Fourier transform, and preservation of the prescribed scaling $z_i(x)=\sqrt{2}\cos(\omega_i^{\top}x+b_i)$.

Standardization $\tilde{z}_i(x)=z_i(x)/\hat{\sigma}_i$ systematically violates both conditions. The original features have theoretical properties derived from the specified distributions of $\omega_i$ and $b_i$, but the standardization factor $1/\hat{\sigma}_i$ varies with the training set, altering the effective distribution in a data-dependent manner. The expectation $\mathbb{E}[\tilde{z}_i(x)\tilde{z}_i(x')]$ now depends on $\hat{\sigma}_i$, disrupting the direct mapping to $k_G(x,x')$. Additionally, the fixed scaling $\sqrt{2}$ that ensures correct kernel approximation is replaced by a random, sample-dependent factor, breaking the fundamental relationship between feature products and kernel values.

These modifications have important mathematical implications. The standardized features yield an empirical kernel that converges to $k_{\text{std}}^{*}(x,x'|\mathcal{T})$, which is training-set dependent rather than depending only on $\|x-x'\|$ like the Gaussian kernel. The resulting kernel is not shift-invariant, since $\hat{\sigma}_i$ reflects the absolute positions of the training points and shifting the data changes $\hat{\sigma}_i$. This creates temporal non-stationarity, as kernel properties change when training windows roll forward.
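
A short sketch illustrates the loss of shift-invariance (illustrative dimensions and a hypothetical shift vector $c$; not the paper's replication code): with the training window held fixed, translating both query points by $c$ leaves the Gaussian kernel unchanged but moves the standardized estimate.

```python
# Sketch: shift-invariance check. Translating both query points by a common vector c
# leaves k_G(x, x') unchanged (it depends only on x - x'), while the standardized
# RFF kernel changes because sigma_hat_i is tied to the fixed training points.
import numpy as np

rng = np.random.default_rng(0)
K, T, P, gamma = 15, 12, 50_000, 0.2
X_train = rng.normal(size=(T, K))
x, xp = rng.normal(size=K), rng.normal(size=K)
c = 2.0 * np.ones(K)                                 # hypothetical common shift

omega = rng.normal(scale=gamma, size=(P, K))
b = rng.uniform(0.0, 2.0 * np.pi, size=P)
sigma2 = (2.0 * np.cos(X_train @ omega.T + b) ** 2).mean(axis=0)  # fixed training scales

def k_std(u, v):
    zu = np.sqrt(2.0) * np.cos(omega @ u + b)
    zv = np.sqrt(2.0) * np.cos(omega @ v + b)
    return np.mean(zu * zv / sigma2)

k_gauss = np.exp(-0.5 * gamma**2 * np.sum((x - xp) ** 2))
print(f"k_G(x, x') = k_G(x+c, x'+c): {k_gauss:.4f}")
print(f"k_std(x, x')               : {k_std(x, xp):.4f}")
print(f"k_std(x+c, x'+c)           : {k_std(x + c, xp + c):.4f}")
```

Because the same $(\omega_i,b_i)$ draws are reused in both evaluations, the gap between the two standardized values reflects the standardization itself rather than Monte Carlo noise.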

3.3 Implications for Financial Applications

Theorem 1 resolves the fundamental puzzles in high-dimensional financial prediction by revealing that claimed theoretical properties simply do not hold in practice. KMZ develop their theoretical analysis assuming RFF converge to Gaussian kernels. Their random matrix theory characterization, effective complexity bounds, and optimal shrinkage formula all depend critically on this convergence. However, their empirical implementation employs standardization, which fundamentally alters the convergence properties, creating a notable difference between theory and practice.

With a modified kernel structure, methods may perform learning that differs from the theoretical framework, potentially involving pattern-matching mechanisms based on training-sample dependent similarity measures. The standardized kernel creates similarity measures based on training-sample dependent weights rather than genuine predictor relationships. This explains Nagel's (2025) empirical finding that high-complexity methods produce volatility-timed momentum strategies regardless of underlying data properties. The broken kernel structure makes the theoretically predicted learning more challenging, leading methods to weight returns based on alternative similarity measures within the training window.

The apparent virtue of complexity may therefore arise through different mechanisms than originally theorized. The KMZ method cannot achieve its theoretical properties due to standardization, so any success must arise through alternative mechanisms. This resolves the central puzzle of how methods claiming to harness thousands of parameters succeed with tiny training samples: they may operate through mechanisms that differ from the high-dimensional framework, potentially involving simpler pattern-matching approaches that happen to work in specific market conditions.

4 Fundamental Barriers to High-Dimensional Learning

The kernel approximation breakdown in Section 3 reveals that methods cannot achieve their claimed theoretical properties. This section establishes that even if this breakdown were corrected, fundamental information-theoretic barriers would still prevent genuine high-dimensional learning in financial applications. These results explain why methods must rely on the mechanical pattern matching that emerges from broken kernel structures.

4.1 Sample Complexity Lower Bounds

I establish fundamental limits on learning over the high-dimensional function spaces that methods claim to harness, requiring an additional regularity condition for my convergence analysis.

Theorem 2 (Exponential lower bound, random design).

Assume the data-generation scheme of Assumptions 1, 2, and 3, and suppose additionally that the RFF are $\kappa$-sub-Gaussian: for every unit vector $u\in\mathbb{R}^{P}$, $\mathbb{E}[e^{t\,u^{\top}z(x)}]\leq\exp(\tfrac{1}{2}\kappa^{2}t^{2})$ for all $t\in\mathbb{R}$ (Assumption 5).

Let $\mathcal{F}_P=\{x\mapsto w^{\top}z(x):\|w\|_2\leq B\}$, and denote by $\sigma^{2}$ the noise variance. Then, for every $T,P\geq 1$,

$$\inf_{\hat{f}_T}\;\sup_{\|w\|_2\leq B}\;\mathbb{E}_{x,\mathcal{D}_T,\epsilon}\bigl[(\hat{f}_T(x)-w^{\top}z(x))^{2}\bigr]\;\geq\;c\,B^{2}\exp\!\Bigl(-\frac{8\,T\,C_z\,B^{2}}{P\,\sigma^{2}}\Bigr), \qquad (4.1)$$

for a universal constant $c=c(c_z,C_z)>0$.

Moreover, there is a constant $C_0=C_0(\kappa,c_z,C_z)$ such that whenever $T\geq C_0 P$,

$$\mathbb{P}_Z\!\Bigl[\forall\,\hat{f}_T:\;\sup_{\|w\|_2\leq B}\mathbb{E}_{x,\epsilon}\bigl[(\hat{f}_T(x)-w^{\top}z(x))^{2}\bigr]\;\geq\;c^{\star}B^{2}\exp\!\Bigl(-\frac{8\,T\,C_z\,B^{2}}{P\,\sigma^{2}}\Bigr)\Bigr]\;\geq\;1-e^{-T}, \qquad (4.2)$$

with c=c(cz,Cz)>0superscript𝑐superscript𝑐subscript𝑐𝑧subscript𝐶𝑧0c^{\star}=c^{\star}(c_{z},C_{z})>0italic_c start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = italic_c start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_c start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) > 0.

The proof uses a minimax argument with Fano's inequality. I construct a $2\delta$-packing $\{w_{1},\ldots,w_{M}\}\subset B_{2}^{P}(B)$ of $M=(B/(2\delta))^{P}$ well-separated parameters. The Kullback-Leibler (KL) divergence between the corresponding data distributions satisfies $\mathrm{KL}(P_{j}\|P_{\ell})\leq\frac{2TC_{z}B^{2}}{\sigma^{2}}$. Fano's inequality then implies that any decoder has error probability $\Pr[\hat{J}\neq J]\geq 1/2$. Since low estimation risk would enable perfect identification (contradicting Fano), I obtain $\mathbb{E}[(\hat{f}_{T}(x)-f_{J}(x))^{2}]\geq c_{z}\delta^{2}$. Optimizing over $\delta$ yields the exponential bound. The high-probability version conditions on well-conditioned designs using matrix concentration.

Theorem 2 applies directly to machine learning methods employing RFF as actually implemented in practice. The theoretical framework covers the complete practical pipeline in which random feature weights $\{\omega_{i},b_{i}\}_{i=1}^{P}$ are drawn from specified distributions, standardization procedures are applied for numerical stability, and learning proceeds over the resulting linear-in-features function class $\mathcal{F}_{P}=\{x\mapsto w^{\top}z(x):\|w\|_{2}\leq B\}$ using any estimation method, including OLS, ridge regression, LASSO, or ridgeless regression.

The bounds establish information-theoretic impossibility in two complementary forms: the expectation bound averaged over all possible feature realizations, and the high-probability bound showing that the same limitations hold for most individual feature draws. KMZ follows precisely this framework with $P=12{,}000$ features and $T=12$ training observations, making both versions directly applicable to their empirical analysis. The universal constant $c=c(c_{z},C_{z},\kappa)$ depends on the feature covariance bounds from the regularity conditions and the sub-Gaussian parameter controlling concentration properties, but remains bounded away from zero under standard assumptions for financial applications.
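For concreteness, the following sketch evaluates the right-hand side of (4.1) at the KMZ configuration, using illustrative values $B^{2}=\sigma^{2}=C_{z}=1$ and $c=0.01$ (assumptions of this example rather than calibrated quantities); with $T/P$ this small the exponential factor is essentially one, so the minimax risk stays near $c\,B^{2}$.

```python
import numpy as np

def exponential_lower_bound(T, P, B2=1.0, sigma2=1.0, C_z=1.0, c=0.01):
    """Right-hand side of (4.1): c * B^2 * exp(-8 T C_z B^2 / (P sigma^2))."""
    return c * B2 * np.exp(-8.0 * T * C_z * B2 / (P * sigma2))

# KMZ-style configuration: P = 12,000 random features, T = 12 training observations.
print(exponential_lower_bound(T=12, P=12_000))  # ~0.0099: the exponent is about -0.008,
                                                # so the bound is essentially c * B^2.
```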

Theorem 3 (Polynomial minimax lower bound – high probability).

Assume Assumptions 1, 2, 3, and the sub-Gaussian feature condition Assumption 5. Put

$$\tilde{c}:=\frac{c_{z}}{64\,C_{z}}>0,\qquad \mathcal{F}_{P}:=\{\,x\mapsto w^{\top}z(x):\ \|w\|_{2}\leq B\}.$$

  (a) In-expectation bound. For every $T,P\geq 1$,

$$\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}_{x,\mathcal{D}_{T},\epsilon}\bigl[(\hat{f}_{T}(x)-w^{\top}z(x))^{2}\bigr]\ \geq\ \tilde{c}\,\min\!\Bigl\{B^{2},\ \tfrac{\sigma^{2}}{T}\log P\Bigr\}.$$

  (b) High-probability bound. There exists a constant $C_{0}=C_{0}(\kappa,C_{z})$ such that whenever $T\geq C_{0}P$,

$$\mathbb{P}_{Z}\Bigl[\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}_{x,\epsilon}\bigl[(\hat{f}_{T}(x)-w^{\top}z(x))^{2}\mid Z\bigr]<\tilde{c}\,\min\!\Bigl\{B^{2},\ \tfrac{\sigma^{2}}{T}\log P\Bigr\}\Bigr]\ \leq\ e^{-T}.\tag{4.3}$$

Thus the same lower bound holds for each realised design matrix with probability at least $1-e^{-T}$.

The proof uses a standard-basis packing with refined concentration analysis. I construct $M=P+1$ functions using the canonical basis: $w_{0}=0$ and $w_{j}=\delta e_{j}$ for $j=1,\ldots,P$, where $\delta=\min\{B/4,\ \sigma/(4\sqrt{TC_{z}\log P})\}$. The population covariance bound $\Sigma_{z}\succeq c_{z}I_{P}$ ensures the separation $\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}\geq 2c_{z}\delta^{2}$. For the KL bound, $\mathbb{E}[\mathrm{KL}(P_{j}\|P_{\ell})]\leq\frac{2TC_{z}\delta^{2}}{\sigma^{2}}\leq\frac{\log P}{8}$, enabling Fano's inequality with error probability $\geq 1/2$. The risk-identification argument yields $\mathbb{E}[(\hat{f}_{T}(x)-f_{J}(x))^{2}]\geq\frac{c_{z}}{4}\delta^{2}$, giving the polynomial bound.
For part (b), I condition on the "good design" event $\{\lambda_{\max}(T^{-1}Z^{\top}Z)\leq 2C_{z}\}$, which holds with probability $\geq 1-e^{-T}$ when $T\geq C_{0}P$, and then apply the same argument with adjusted constants.

The universal constants in my polynomial lower bounds have a transparent structure that illuminates the fundamental barriers to high-dimensional learning in finance. The expectation bound employs $\tilde{c}=c_{z}/(64C_{z})$, which depends on the feature covariance bounds and measures the quality of the feature construction. For RFF with the standard construction $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})$, where $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I_{K})$, the theoretical expectation $\mathbb{E}[z_{i}(x)^{2}]=1$ suggests $C_{z}\approx 1$ under ideal conditions, yielding $\tilde{c}\approx 1/64\approx 0.016$ in the best case. Even when the condition number $\kappa=C_{z}/c_{z}$ increases to moderate values around 10, I obtain $\tilde{c}\approx 0.0016$, a manageable degradation in the constant.

The high-probability bound introduces additional dependence on the sub-Gaussian parameter $\kappa$ through both the concentration quality and the threshold requirement $T\geq C_{1}(\kappa,c_{z},C_{z})\log P$ for the probabilistic guarantee to hold. Throughout my analysis, I employ the conservative choice $\tilde{c}=0.01$, deliberately favoring the possibility of learning by using constants that are optimistic relative to typical financial applications. This conservative approach strengthens my impossibility conclusions: even when I bias the analysis toward finding that learning should be possible, the fundamental barriers persist. The comparable magnitude of constants between fixed and random feature settings confirms that these information-theoretic limitations are robust to implementation details and reflect inherent properties of high-dimensional learning with limited financial data.
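To see what the polynomial bound implies at realistic scales, the following sketch evaluates the right-hand side of Theorem 3(a) with the conservative choice $\tilde{c}=0.01$ and the illustrative normalizations $B^{2}=\sigma^{2}=1$ (assumptions of this example).

```python
import numpy as np

def polynomial_lower_bound(T, P, B2=1.0, sigma2=1.0, c_tilde=0.01):
    """Right-hand side of Theorem 3(a): c_tilde * min{B^2, (sigma^2 / T) * log P}."""
    return c_tilde * min(B2, (sigma2 / T) * np.log(P))

# KMZ-style configuration: (sigma^2 / T) * log P = log(12,000) / 12 ~ 0.78 < B^2 = 1,
# so the bound evaluates to roughly 0.01 * 0.78 ~ 0.0078.
print(polynomial_lower_bound(T=12, P=12_000))
```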

4.2 Effective Complexity: The VC Dimension Reality

The Vapnik-Chervonenkis (VC) dimension provides a fundamental measure of model complexity that directly connects to generalization performance and sample complexity requirements (Vapnik & Chervonenkis 1971, Vapnik 1998). For a hypothesis class $\mathcal{H}$, the VC dimension is the largest number of points that can be shattered (i.e., correctly classified under all possible binary labelings) by functions in $\mathcal{H}$. This combinatorial measure captures the essential complexity of a learning problem: classes with higher VC dimension require more samples to achieve reliable generalization.

The connection between VC dimension and sample complexity is formalized through uniform convergence bounds. Classical results show that for a hypothesis class with VC dimension $d$, achieving generalization error $\varepsilon$ with confidence $1-\delta$ requires sample size $T=O\bigl(d\log(1/\varepsilon)/\varepsilon+\log(1/\delta)/\varepsilon\bigr)$ (Blumer et al. 1989, Shalev-Shwartz & Ben-David 2014). This relationship reveals why effective model complexity, rather than nominal parameter count, determines learning difficulty.
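As a rough illustration of this relationship, the sketch below plugs the effective dimension $d=T=12$ implied by Theorem 4 into the classical bound, treating the unspecified $O(\cdot)$ constant as one; the resulting figure is indicative only.

```python
import numpy as np

def pac_sample_size(d, eps, delta, const=1.0):
    """Illustrative sample size d*log(1/eps)/eps + log(1/delta)/eps, with the
    unspecified O(.) constant set to `const` (an assumption of this example)."""
    return const * (d * np.log(1.0 / eps) / eps + np.log(1.0 / delta) / eps)

# Effective dimension d = T = 12 (see Theorem 4), target error 5%, confidence 95%:
print(round(pac_sample_size(d=12, eps=0.05, delta=0.05)))  # ~779 observations, far above T = 12
```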

In the context of high-dimensional financial prediction, VC dimension analysis becomes crucial for understanding what machine learning methods actually accomplish. While methods may claim to leverage thousands of parameters, their effective complexity, as measured by VC dimension, may be much lower due to structural constraints imposed by the optimization procedure. Ridgeless regression in the overparameterized regime ($P>T$) provides a particularly important case study, as the interpolation constraint fundamentally limits the achievable function class regardless of the ambient parameter dimension.

Theorem 4 (Effective VC Dimension of Ridgeless RFF Regression).

Let $z:\mathcal{X}\to\mathbb{R}^{P}$ be a fixed feature map (e.g. standardized RFF) and define the linear function class

$$\mathcal{F}_{P}=\bigl\{\,f_{w}(x)=w^{\top}z(x)\ :\ \|w\|_{2}\leq B\bigr\},\qquad B>0.$$

Fix a training sample $(x_{1},\dots,x_{T})$ with $T<P$ and denote $Z=[\,z(x_{1})\ \cdots\ z(x_{T})]^{\top}\in\mathbb{R}^{T\times P}$. Write $k_{i}(x)=z(x_{i})^{\top}z(x)$ and $k(x)=(k_{1}(x),\dots,k_{T}(x))^{\top}$. The corresponding ridgeless (minimum-norm) regression functions are

$$\mathcal{F}_{\mathrm{ridge}}^{(Z)}=\bigl\{\,f_{\alpha}(x)=\alpha^{\top}k(x)\ :\ \alpha\in\mathbb{R}^{T}\bigr\}.$$

Let $r=\operatorname{rank}(ZZ^{\top})\leq T$. Then

  (a) $\mathrm{VC}\bigl(\{\operatorname{sign}(f):f\in\mathcal{F}_{P}\}\bigr)=P$.

  (b) $\mathrm{VC}\bigl(\{\operatorname{sign}(f):f\in\mathcal{F}_{\mathrm{ridge}}^{(Z)}\}\bigr)=r\leq T$. In particular, if $ZZ^{\top}$ is invertible (full row rank), the VC dimension equals $T$.

KMZ correctly note that, after minimum-norm fitting, the effective degrees of freedom of their RFF model equal the sample size ($T=12$), not the nominal dimension ($P=12{,}000$): “the effective number of parameters in the construction of the predicted return is only $T=12$…”. Theorem 4 rigorously justifies this statement by showing that the VC dimension of ridgeless RFF regression is bounded above by $T$.
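A minimal numerical sketch illustrates the point, using synthetic Gaussian features as a stand-in for standardized RFF (an assumption of the example): with $P=12{,}000$ and $T=12$, the minimum-norm prediction can always be rewritten as $\alpha^{\top}k(x)$ with $\alpha\in\mathbb{R}^{T}$, and $\operatorname{rank}(ZZ^{\top})=T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 12, 12_000                      # sample size and nominal feature dimension
Z = rng.standard_normal((T, P))        # placeholder feature matrix (stand-in for standardized RFF)
y = rng.standard_normal(T)

w = np.linalg.pinv(Z) @ y              # minimum-norm (ridgeless) coefficients lie in the row space of Z
alpha = np.linalg.solve(Z @ Z.T, y)    # T-dimensional dual coefficients (ZZ^T invertible a.s.)

x_new = rng.standard_normal(P)         # features z(x) of a new query point
k_new = Z @ x_new                      # kernel vector k(x) = (z(x_t)^T z(x))_{t=1..T}

print(np.allclose(w @ x_new, alpha @ k_new))   # True: predictions use only T effective parameters
print(np.linalg.matrix_rank(Z @ Z.T))          # r = 12 = T, far below the nominal P = 12,000
```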

This observation, however, leaves open the central question that KMZ label the “virtue of complexity”: does the enormous RFF dictionary contribute predictive information beyond what a $T$-dimensional linear model could extract? In kernel learning the tension is familiar: one combines an extremely rich representation (in principle, infinite-dimensional) with an estimator whose statistical capacity is implicitly capped at $T$. Overfitting risk is therefore limited, but any real performance gain must come from the non-linear basis supplied by the features rather than from high effective complexity per se.

4.3 Sharp Learning Thresholds

The previous bounds establish that learning is difficult, but they do not precisely characterize the boundary between feasible and infeasible regimes. Understanding this boundary is crucial for financial applications, where practitioners must decide whether the available data suffice for reliable prediction. The following analysis addresses this gap by deriving sharp learning thresholds that depend on the signal-to-noise ratio, feature dimension, and sample size.

Definition 1 (Learning Threshold).

For target prediction error $\varepsilon>0$, define the learning threshold as

$$\mathrm{SNR}_{\mathrm{threshold}}(\varepsilon):=\frac{\tilde{c}^{-1}\log P}{T}\cdot\frac{\varepsilon}{B^{2}},$$

where $\mathrm{SNR}=B^{2}/\sigma^{2}$ is the signal-to-noise ratio and $\tilde{c}$ is the universal constant from Theorem 3.

The polynomial learning threshold reveals why my characterization provides actionable guidance where cruder exponential bounds would not. Unlike exponential characterizations that scale catastrophically with the feature dimension $P$, my threshold scales as $\frac{\log P}{T}$, a fundamental difference that enables meaningful evaluation with realistic parameters.

This distinction proves crucial for understanding high-dimensional financial prediction. The threshold exhibits intuitive monotonicity properties: easier targets (larger $\varepsilon$) require weaker signals, while higher complexity relative to sample size (larger $P/T$) demands stronger signals. More importantly, the explicit dependence on sample size $T$ shows precisely how additional observations reduce the required signal strength, revealing that sample complexity alone does not determine learning difficulty.

The practical significance becomes clear when evaluating typical financial applications. For the parameters employed by KMZ ($P=12{,}000$ features, $T=12$ observations, targeting $\varepsilon=0.01$ accuracy), my threshold requires signal-to-noise ratios exceeding $0.78$, compared to observed financial signal strengths of $0.01$–$0.05$. This gap of nearly two orders of magnitude places such applications decisively outside the learnable regime, providing theoretical validation that apparent success must arise through mechanisms other than genuine high-dimensional learning.
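A short calculation reproduces the threshold value, under the normalization $B^{2}=1$ and the conservative $\tilde{c}=0.01$ used above (assumptions of this illustration).

```python
import numpy as np

def snr_threshold(eps, T, P, B2=1.0, c_tilde=0.01):
    """Learning threshold from Definition 1: (c_tilde^{-1} * log P / T) * (eps / B^2)."""
    return (np.log(P) / (c_tilde * T)) * (eps / B2)

# KMZ-style configuration with target error eps = 0.01 and normalized B^2 = 1:
print(round(snr_threshold(eps=0.01, T=12, P=12_000), 2))  # ~0.78, versus observed SNRs of 0.01-0.05
```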

The sharp nature of this transition explains why high-dimensional methods may appear to succeed or fail unpredictably: small changes in problem parameters can move applications across the fundamental boundary between learnable and unlearnable regimes.

Theorem 5 (Sharp Learning Threshold for RFF-based Predictors).

Consider the RFF prediction problem under Assumptions 1–3, with target prediction error $\varepsilon>0$ and universal constant $\tilde{c}=\frac{c_{z}}{64C_{z}}$ from Theorem 3. Then there exists a sharp phase transition characterized by the complexity-to-sample ratio $\frac{\log P}{T}$:

  (a) Phase I (Impossible Learning): If

$$\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}\quad\text{and}\quad B^{2}\geq\frac{\varepsilon}{\tilde{c}},$$

  then learning with error $<\varepsilon$ is impossible.

  (b) Phase II (Possible Learning): If

$$\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}\quad\text{and}\quad B^{2}\geq\frac{\varepsilon}{\tilde{c}},$$

  then learning with error $<\varepsilon$ becomes information-theoretically feasible with sufficiently sophisticated estimators.

  (c) Trivial Regime: If $B^{2}<\frac{\varepsilon}{\tilde{c}}$, then the function class is too simple relative to the target accuracy, and standard parametric rates apply.

Corollary 1 (Weak Signal Learning Impossibility).

Under the weak signal assumption $\mathrm{SNR}=O(K^{-\alpha})$ with $P=O(K^{\beta})$, there exists a critical sample size threshold $T_{\mathrm{critical}}(K)=\Omega\bigl(\frac{\sigma^{2}\beta\log K}{\varepsilon}\bigr)$ such that learning is impossible when $T<T_{\mathrm{critical}}(K)$.

Applying these thresholds to Kelly et al.'s reported performance reveals the impossibility of their claimed mechanism. Their high-complexity model achieves $R^{2}=0.6\%$ (corresponding to $\varepsilon=0.994$) with parameters $P=12{,}000$ and $T=12$.

The complexity-to-sample ratio $\frac{\log(12{,}000)}{12}\approx 0.78$ appears manageable, but the signal strength requirement $B^{2}\geq\frac{0.994}{0.01}=99.4$ demands that predictive signals explain at least 9,940% of return variance, an impossible requirement for any financial predictor.

This analysis confirms that their empirical success cannot arise from genuine learning over the claimed high-dimensional function space, providing theoretical validation for the mechanical pattern matching explanation.

These results resolve the central puzzle by showing that the apparent 'virtue of complexity' may reflect mechanisms that differ both from the claimed high-dimensional learning (which is information-theoretically impossible) and from the advertised theoretical properties (which standardization modifies). Instead, the methods achieve success through mechanical pattern matching that emerges when kernel approximation fails.

The standardization procedure means that these methods in effect implement volatility-timed momentum strategies operating in low-dimensional spaces bounded by sample size. This reframes the evaluation question from "how can complex methods work with limited data?" to "how can we distinguish mechanical artifacts from genuine learning?"

The following section provides empirical validation of these theoretical predictions, demonstrating the kernel breakdown and learning impossibility in practice.

5 Empirical Validation of Kernel Approximation Breakdown

This section provides comprehensive empirical validation of Theorem 1 through systematic parameter exploration across the entire space of practical RFF implementations. My experimental design spans realistic financial prediction scenarios, testing whether standardization preserves the Gaussian kernel approximation properties that underlie existing theoretical frameworks. The results provide definitive evidence that standardization fundamentally breaks RFF convergence properties, confirming that methods cannot achieve their claimed theoretical guarantees in practice.

5.1 Data Generation and Model Parameters

I generate realistic financial predictor data following the autoregressive structure typical of macroeconomic variables used in return prediction. For each parameter combination $(T,K)$, I construct predictor matrices $X\in\mathbb{R}^{T\times K}$ where

$$X_{t}=\Phi X_{t-1}+u_{t},\qquad u_{t}\sim\mathcal{N}(0,\Sigma_{u}).\tag{5.1}$$

The persistence parameters $\Phi=\mathrm{diag}(\phi_{1},\ldots,\phi_{K})$ are drawn from the range $[0.82,0.98]$ to match the high persistence of dividend yields, interest rates, and other financial predictors (Welch & Goyal 2008). The correlation structure $\Sigma_{u}=\rho\mathbf{1}\mathbf{1}^{\top}+(1-\rho)I_{K}$ with $\rho=0.1$ captures modest cross-correlation among predictors.

Random Fourier Features are constructed as $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})$, where $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I_{K})$ and $b_{i}\sim\mathrm{Uniform}[0,2\pi]$. Standardization is applied as $\tilde{z}_{i}(x)=z_{i}(x)/\hat{\sigma}_{i}$, where $\hat{\sigma}_{i}^{2}=T^{-1}\sum_{t=1}^{T}z_{i}(x_{t})^{2}$, following universal practice in RFF implementations.
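A minimal sketch of this data-generating process and feature construction follows, using illustrative grid values ($T=12$, $K=15$, $P=12{,}000$, $\gamma=2.0$); the actual replication code may differ in details.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_predictors(T, K, rho=0.1, phi_range=(0.82, 0.98)):
    """AR(1) predictors X_t = Phi X_{t-1} + u_t with equicorrelated Gaussian shocks (eq. 5.1)."""
    phi = rng.uniform(*phi_range, size=K)
    Sigma_u = rho * np.ones((K, K)) + (1.0 - rho) * np.eye(K)
    X = np.zeros((T, K))
    x = np.zeros(K)
    for t in range(T):
        x = phi * x + rng.multivariate_normal(np.zeros(K), Sigma_u)
        X[t] = x
    return X

def rff(X, P, gamma):
    """RFF z_i(x) = sqrt(2) cos(omega_i'x + b_i) plus in-sample standardization."""
    K = X.shape[1]
    omega = rng.normal(0.0, gamma, size=(K, P))
    b = rng.uniform(0.0, 2.0 * np.pi, size=P)
    Z = np.sqrt(2.0) * np.cos(X @ omega + b)
    sigma_hat = np.sqrt((Z ** 2).mean(axis=0))   # within-sample scale: sigma_hat_i^2 = T^{-1} sum_t z_i(x_t)^2
    return Z, Z / sigma_hat                      # standard and standardized features

X = simulate_predictors(T=12, K=15)
Z, Z_std = rff(X, P=12_000, gamma=2.0)
```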

My parameter exploration covers the comprehensive space:

  • Number of features: $P\in\{100,500,1000,2500,5000,10000,15000,20000\}$

  • Training window: $T\in\{6,12,24,60\}$ months

  • Kernel bandwidth: $\gamma\in\{0.5,1.0,1.5,2.0,2.5,3.0\}$

  • Input dimension: $K\in\{5,10,15,20,30\}$.

5.2 Experimental Goal

The primary objective is to test whether standardization preserves the convergence $k^{(P)}_{\mathrm{std}}(x,x')\xrightarrow{P\to\infty}k_{G}(x,x')$ established in Rahimi & Recht (2007). Under the null hypothesis that standardization has no effect, both standard and standardized RFF should exhibit identical convergence properties and error distributions. Theorem 1 predicts systematic breakdown with training-set dependent limits $k^{*}_{\mathrm{std}}(x,x'\mid\mathcal{T})\neq k_{G}(x,x')$.

I conduct 1,000 independent trials per parameter combination, generating fresh training data, RFF weights, and query points for each trial. This provides robust statistical power to detect systematic effects across the parameter space while controlling for random variations in specific realizations.

5.3 Comparison Metrics

My empirical analysis employs four complementary approaches to characterize the extent and nature of kernel approximation breakdown. I begin by examining convergence properties through the mean absolute error $|k^{(P)}(x,x')-k_{G}(x,x')|$ between empirical and true Gaussian kernels, tracking how approximation quality evolves as $P\to\infty$. This directly tests whether standardized features preserve the fundamental convergence properties established in Rahimi & Recht (2007).

To quantify the systematic nature of performance deterioration, I construct degradation factors as the ratio $\mathbb{E}[|\mathrm{error}_{\mathrm{standardized}}|]/\mathbb{E}[|\mathrm{error}_{\mathrm{standard}}|]$ across matched parameter combinations. Values exceeding unity indicate that standardization worsens kernel approximation, while larger ratios represent more severe breakdown. This metric provides a scale-invariant measure of standardization effects that facilitates comparison across different parameter regimes.

Statistical significance is assessed through Kolmogorov-Smirnov two-sample tests comparing error distributions between standard and standardized RFF implementations. Under the null hypothesis that standardization preserves distributional properties, these tests should yield non-significant results. Systematic rejection of this null across parameter combinations provides evidence that standardization fundamentally alters the mathematical behavior of RFF methods beyond what could arise from random variation.

Finally, I conduct comprehensive parameter sensitivity analysis to identify the conditions under which breakdown effects are most pronounced. Heatmap visualizations reveal how degradation severity depends on $(P,T,\gamma,K)$ combinations, enabling us to characterize the parameter regimes where theoretical guarantees are most severely compromised. This analysis is particularly relevant for understanding the implications for existing empirical studies that employ specific parameter configurations.
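The two headline metrics can be computed as in the following sketch; the error arrays shown are placeholders standing in for the per-trial kernel approximation errors produced by the experiments.

```python
import numpy as np
from scipy.stats import ks_2samp

def degradation_factor(err_standardized, err_standard):
    """Ratio E|error_standardized| / E|error_standard|; values above one indicate that
    standardization worsens the kernel approximation."""
    return np.mean(np.abs(err_standardized)) / np.mean(np.abs(err_standard))

def ks_statistic(err_standardized, err_standard):
    """Two-sample Kolmogorov-Smirnov test comparing the two error distributions."""
    stat, pvalue = ks_2samp(err_standardized, err_standard)
    return stat, pvalue

# Placeholder error arrays (in the actual experiments these come from 1,000 trials
# per (P, T, gamma, K) combination):
rng = np.random.default_rng(0)
err_std = rng.normal(0.0, 0.005, size=1000)   # hypothetical standard-RFF errors
err_sdz = rng.normal(0.0, 0.025, size=1000)   # hypothetical standardized-RFF errors
print(degradation_factor(err_sdz, err_std), ks_statistic(err_sdz, err_std)[0])
```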

5.4 Results and Validation of Theorem 1

5.4.1 Universal Convergence Failure

Figure 1 provides decisive evidence of convergence breakdown. Standard RFF (blue circles) exhibit the theoretically predicted $P^{-1/2}$ convergence rate, with mean absolute error declining from approximately $0.06$ at $P=100$ to approximately $0.003$ at $P=20{,}000$. This confirms that unstandardized features preserve Gaussian kernel approximation properties.

In stark contrast, standardized RFF (red squares) completely fail to converge, plateauing around $0.02$–$0.03$ mean error regardless of $P$. For large $P$, standardized features are $6\times$ worse than standard RFF, demonstrating that additional features provide no approximation benefit when standardization is applied. This plateau behavior directly validates Theorem 1's prediction that standardized features converge to training-set dependent limits rather than the target Gaussian kernel.

5.4.2 Systematic Degradation Across Parameter Space

Figure 2 reveals that breakdown occurs universally across all parameter combinations, with no regime where standardization preserves kernel properties. The degradation patterns exhibit clear economic intuition and align closely with the theoretical mechanisms underlying Theorem 1.

The most pronounced effects emerge along the feature dimension, where degradation increases dramatically with $P$, ranging from $1.2$ times at $P=100$ to $6.0$ times at $P=20{,}000$. This escalating pattern reflects the cumulative nature of standardization artifacts: as more features undergo within-sample standardization, the collective distortion of kernel approximation properties intensifies. Each additional standardized feature contributes random scaling factors that compound to produce increasingly severe departures from the target Gaussian kernel.

Sample size effects provide particularly compelling evidence for the breakdown mechanism. Smaller training windows exhibit severe degradation, reaching $41.6$ times deterioration for $T=6$ months. This extreme sensitivity to sample size occurs because standardization relies on empirical variance estimates $\hat{\sigma}_{i}^{2}$ that become increasingly unreliable with limited data. When training windows shrink to the 6–12 month range typical in financial applications, these variance estimates introduce substantial noise that fundamentally alters the scaling relationships required for kernel convergence. The magnitude of this effect, exceeding $40$ times degradation in realistic scenarios, demonstrates that standardization can completely overwhelm any approximation benefits from additional features.

Kernel bandwidth parameters reveal additional structure in the breakdown pattern. Low bandwidth values ($\gamma=0.5$) produce $12.8$ times degradation, while higher bandwidths stabilize around $3.1$ times deterioration. This occurs because tighter kernels, which decay more rapidly with distance, are inherently more sensitive to the scaling perturbations introduced by standardization. Small changes in feature magnitudes translate into disproportionately large changes in kernel values when the bandwidth is narrow, amplifying the distortions created by training-set dependent scaling factors.

In contrast, input dimension effects remain remarkably stable, with degradation ranging only between $3.1$ and $4.6$ times across $K\in[5,30]$. This stability confirms that breakdown stems primarily from the standardization procedure itself rather than the complexity of the underlying input space. Whether using 5 or 30 predictor variables, the fundamental mathematical properties of standardized RFF remain equally compromised, suggesting that the kernel approximation failure is intrinsic to the standardization mechanism rather than an artifact of high-dimensional inputs.

5.4.3 Parameter Sensitivity Analysis

Figure 3 provides detailed parameter sensitivity analysis through degradation factor heatmaps. The $(P,T)$ interaction reveals that combinations typical in financial applications, such as $P\geq 5{,}000$ features with $T\leq 12$ months, produce degradation factors exceeding $3\times$. This directly impacts methods like Kelly et al. (2024) using $P=12{,}000$ and $T=12$.

The $(P,\gamma)$ interaction shows that standardization effects compound: high complexity ($P\geq 10{,}000$) combined with tight kernels ($\gamma\leq 1.0$) yields degradation exceeding $10\times$. These parameter ranges are commonly employed in high-dimensional return prediction, suggesting widespread applicability of my breakdown results.

5.4.4 Statistical Significance

The error distributions between standard and standardized RFF are fundamentally different across the entire parameter space, providing strong statistical evidence against the null hypothesis that standardization preserves kernel approximation properties. Figure 4 presents Kolmogorov-Smirnov test statistics that consistently exceed 0.5 across most parameter combinations, with many approaching the theoretical maximum of 1.0. Such large test statistics indicate that the cumulative distribution functions of standard and standardized RFF errors diverge substantially, ruling out the possibility that observed differences arise from sampling variation.

The statistical evidence is most compelling in parameter regimes commonly employed in financial applications. For high feature counts ($P\geq 5{,}000$), KS statistics approach 0.9, while short training windows ($T\leq 12$) yield statistics near 1.0. These values correspond to p-values that are effectively zero, providing overwhelming evidence to reject the null hypothesis of distributional equivalence. The magnitude of these test statistics exceeds typical significance thresholds by orders of magnitude, establishing statistical significance that is both robust and economically meaningful.

The systematic pattern of large KS statistics across parameter combinations demonstrates that the breakdown identified in Theorem 1 is not confined to specific implementation choices or edge cases. Instead, the distributional differences persist universally across realistic parameter ranges, indicating that standardization fundamentally alters the stochastic properties of RFF approximation errors. This statistical evidence complements the degradation factor analysis by confirming that the observed differences represent genuine distributional shifts rather than changes in central tendency alone.

These results establish that standardization creates systematic, statistically significant alterations to RFF behavior that cannot be attributed to random variation, specific parameter selections, or implementation artifacts. The universality and magnitude of the statistical evidence provide definitive support for the conclusion that practical RFF implementations cannot achieve the theoretical kernel approximation properties that justify their use in high-dimensional prediction problems.

5.4.5 Alternative Kernel Convergence

Figure 5 provides empirical validation of Theorem 1's central prediction that within-sample standardization fundamentally alters Random Fourier Features convergence properties. The analysis compares three distinct convergence behaviors across varying feature dimensions $P\in\{100,500,1000,2500,5000,12000\}$:

The blue line demonstrates that standard (non-standardized) RFF achieve the theoretical convergence rate $P^{-1/2}$ to the Gaussian kernel $k_{G}(x,x')=\exp(-\gamma^{2}\|x-x'\|^{2}/2)$, validating the foundational result of Rahimi & Recht (2007). The convergence follows the expected Monte Carlo rate, with mean absolute error decreasing from approximately $0.06$ at $P=100$ to $0.005$ at $P=12{,}000$.

The red line reveals the fundamental breakdown predicted by Theorem 1: standardized RFF fail to converge to the Gaussian kernel, instead exhibiting slower convergence with substantially higher errors. At $P=12{,}000$, the error remains above $0.02$, four times larger than the standard case, demonstrating that standardization prevents achievement of the theoretical guarantees.

Most importantly, the green line confirms Theorem 1's constructive prediction by showing that standardized RFF do converge to the modified limit $k^{*}_{\mathrm{std}}(x,x'\mid\mathcal{T})$. This convergence exhibits the canonical $P^{-1/2}$ rate, reaching error levels below $0.015$ at $P=12{,}000$, thereby validating my theoretical characterization of the standardized limit.

My empirical validation employs the sample standard deviation standardization actually used in practice:

$$\hat{\sigma}^{2}_{i}=\frac{1}{T}\sum_{t=1}^{T}z_{i}^{2}(x_{t})-\Bigl[\frac{1}{T}\sum_{t=1}^{T}z_{i}(x_{t})\Bigr]^{2},\tag{5.2}$$

$$\tilde{z}_{i}(x)=\frac{z_{i}(x)}{\hat{\sigma}_{i}},\tag{5.3}$$

rather than the simpler RMS normalization $\hat{\sigma}^{2}_{i}=\frac{1}{T}\sum_{t=1}^{T}z_{i}^{2}(x_{t})$ that might be assumed theoretically. This distinction strengthens rather than weakens my validation for two crucial reasons.
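The two normalizations differ only in whether the in-sample mean is removed, as the following sketch makes explicit; either way, the scaling is estimated from the training window, which is the source of the training-set dependence.

```python
import numpy as np

def standardize_sample_sd(Z):
    """Sample-standard-deviation standardization (eqs. 5.2-5.3): divide each feature
    column by its in-sample standard deviation (mean removed in the scale estimate)."""
    sigma_hat = Z.std(axis=0)                 # square root of eq. (5.2), computed over the training window
    return Z / sigma_hat

def standardize_rms(Z):
    """Simpler RMS normalization: divide each feature column by its in-sample root mean square."""
    rms = np.sqrt((Z ** 2).mean(axis=0))
    return Z / rms

# Both rescalings depend on the training sample, so the implied empirical kernel
# (1/P) * z_tilde(x)' z_tilde(x') inherits a training-set dependence absent from k_G.
```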

First, Theorem 1's fundamental insight, namely that any reasonable standardization procedure breaks Gaussian kernel convergence and creates training-set dependence, remains intact regardless of the specific standardization formula. The theorem establishes that standardized features converge to some training-set dependent limit $k^{*}_{\mathrm{std}}\neq k_{G}$, with the exact form depending on implementation details.

Second, testing against the actual standardization procedure used in practical implementation ensures that my theoretical predictions match real-world behavior. The fact that standardized RFF converge to the correctly computed $k^{*}_{\mathrm{std}}$ rather than to $k_{G}$ provides the strongest possible validation: my theory successfully predicts the behavior of methods as actually implemented, not merely as idealized.

The convergence patterns thus confirm all key predictions of Theorem 1: standardization breaks the foundational convergence guarantee of RFF theory, creates training-set dependent kernels that violate shift-invariance, and produces systematic errors that persist even with large feature counts. These findings validate my theoretical framework while highlighting the critical importance of analyzing methods as actually implemented rather than as theoretically idealized.

5.4.6 Implications for Existing Theory

My results provide definitive empirical validation of Theorem 1 across the entire parameter space relevant for financial applications. The universal nature of the degradation, ranging from modest $1.2\times$ effects to extreme $40\times$ breakdown, demonstrates that standardization fundamentally alters RFF convergence properties regardless of implementation details.

Notably, parameter combinations employed by leading studies exhibit substantial degradation: Kelly et al. (2024)'s configuration ($P=12{,}000$, $T=12$, $\gamma=2.0$) falls in the $3$–$6\times$ degradation range, while more extreme combinations approach $10\times$ or higher degradation. This suggests that empirical successes documented in the literature cannot arise from the theoretical kernel learning mechanisms that justify these methods.

The systematic nature of these effects, combined with their large magnitudes, supports the conclusion that alternative explanations—such as the mechanical pattern matching identified by Nagel (2025)—are required to understand why high-dimensional RFF methods achieve predictive success despite fundamental theoretical breakdown.

6 Conclusion

This paper resolves fundamental puzzles in high-dimensional financial prediction by providing rigorous theoretical foundations that explain when and why complex machine learning methods succeed or fail. My analysis contributes three key results that together clarify the apparent contradictions between theoretical claims and empirical mechanisms in recent literature.

First, I prove that within-sample standardization—employed in every practical Random Fourier Features implementation—fundamentally breaks the kernel approximation that underlies existing theoretical frameworks. This breakdown explains why methods operate under different conditions than theoretical assumptions and must rely on simpler mechanisms than advertised.

Second, I establish sharp sample complexity bounds showing that reliable extraction of weak financial signals requires sample sizes and signal strengths far exceeding those available in typical applications. These information-theoretic limits demonstrate that apparent high-dimensional learning often reflects mechanical pattern matching rather than genuine complexity benefits.

Third, I derive precise learning thresholds that characterize the boundary between learnable and unlearnable regimes, providing practitioners with concrete tools for evaluating when available data suffices for reliable prediction versus when apparent success arises through statistical artifacts.

These results explain why methods claiming sophisticated high-dimensional learning often succeed through simple volatility-timed momentum strategies operating in low-dimensional spaces bounded by sample size. Rather than discouraging complex methods, my findings provide a framework for distinguishing genuine learning from mechanical artifacts and understanding what such methods actually accomplish.

The theoretical insights extend beyond the specific methods analyzed, offering guidance for evaluating any high-dimensional approach in challenging prediction environments. As machine learning continues to transform finance, rigorous theoretical understanding remains essential for distinguishing genuine advances from statistical mirages and enabling more effective application of these powerful but often misunderstood techniques.

References

  • Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G. & Tsigler, A. (2020), ‘Benign overfitting in linear regression’, Proceedings of the National Academy of Sciences 117(48), 30063–30070.
  • Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019), ‘Reconciling modern machine‐learning practice and the bias–variance trade‐off’, Proceedings of the National Academy of Sciences 116(32), 15849–15854.
  • Bianchi et al. (2021) Bianchi, D., Büchner, M. & Tamoni, A. (2021), ‘Bond risk premiums with machine learning’, Review of Financial Studies 34(2), 1046–1089.
  • Blumer et al. (1989) Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989), ‘Learnability and the vapnik-chervonenkis dimension’, Journal of the ACM 36(4), 929–965. Key paper connecting VC dimension to PAC learnability.
  • Chen et al. (2024) Chen, L., Pelger, M. & Zhu, J. (2024), ‘Deep learning in asset pricing’, Management Science 70(2), 714–750.
  • Feng et al. (2020) Feng, G., Giglio, S. & Xiu, D. (2020), ‘Taming the factor zoo: A test of new factors’, Journal of Finance 75(3), 1327–1370.
  • Gu et al. (2020) Gu, S., Kelly, B. & Xiu, D. (2020), ‘Empirical asset pricing via machine learning’, Review of Financial Studies 33(5), 2223–2273.
  • Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S. & Tibshirani, R. J. (2022), ‘Surprises in high-dimensional ridgeless least squares interpolation’, Annals of Statistics 50(2), 949–986.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 edn, Springer, New York.
  • Kearns & Vazirani (1994) Kearns, M. J. & Vazirani, U. V. (1994), An Introduction to Computational Learning Theory, MIT Press.
  • Kelly et al. (2024) Kelly, B., Malamud, S. & Zhou, K. (2024), ‘The virtue of complexity in return prediction’, Journal of Finance 79(1), 459–503.
  • Mei & Montanari (2022) Mei, S. & Montanari, A. (2022), ‘The generalization error of random features regression: Precise asymptotics and the double descent curve’, Communications on Pure and Applied Mathematics 75(4), 667–766.
  • Nagel (2025) Nagel, S. (2025), ‘Seemingly virtuous complexity in return prediction’, Working paper .
  • Rahimi & Recht (2007) Rahimi, A. & Recht, B. (2007), Random features for large-scale kernel machines, in ‘Advances in Neural Information Processing Systems’, Vol. 20, pp. 1177–1184.
  • Rudi & Rosasco (2017) Rudi, A. & Rosasco, L. (2017), Generalization properties of learning with random features, in ‘Advances in Neural Information Processing Systems’, Vol. 30, pp. 3215–3225.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. & Ben-David, S. (2014), Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, Cambridge, UK. Modern textbook with clear exposition of VC theory and PAC learning.
  • Sutherland & Schneider (2015) Sutherland, D. J. & Schneider, J. (2015), On the error of random fourier features, in ‘Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence’, pp. 862–871.
  • Tropp (2012) Tropp, J. A. (2012), ‘User-friendly tail bounds for sums of random matrices’, Foundations of Computational Mathematics 12(4), 389–434, https://doi.org/10.1007/s10208-011-9099-z. See especially Theorem 6.2 for the matrix Bernstein inequality used in the proof.
  • Tropp (2015) Tropp, J. A. (2015), An Introduction to Matrix Concentration Inequalities, Vol. 8 of Foundations and Trends in Machine Learning, Now Publishers, Boston.
  • Valiant (1984) Valiant, L. G. (1984), ‘A theory of the learnable’, Communications of the ACM 27(11), 1134–1142.
  • Vapnik (1998) Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York. Comprehensive treatment of VC theory and statistical learning.
  • Vapnik & Chervonenkis (1971) Vapnik, V. N. & Chervonenkis, A. Y. (1971), ‘On the uniform convergence of relative frequencies of events to their probabilities’, Theory of Probability & Its Applications 16(2), 264–280. Foundational paper introducing VC dimension.
  • Welch & Goyal (2008) Welch, I. & Goyal, A. (2008), ‘A comprehensive look at the empirical performance of equity premium prediction’, Review of Financial Studies 21(4), 1455–1508.
Figure 1: Convergence Analysis: Kernel Approximation Error vs Number of Features
This figure shows mean absolute error between empirical and true Gaussian kernels as a function of the number of Random Fourier Features $P$. Standard RFF (blue circles) exhibit the theoretically predicted $P^{-1/2}$ convergence rate (dashed gray line), while standardized RFF (red squares) fail to converge, plateauing around 0.02–0.03 regardless of $P$. The systematic divergence demonstrates that standardization breaks the fundamental convergence properties established in Rahimi & Recht (2007). Results are averaged over 1,000 trials with $T=12$, $K=15$, and $\gamma=2.0$.
Figure 2: Degradation Factor Across Parameter Space
This figure displays degradation factors (ratio of standardized to standard RFF errors) across four key parameters. Panel (a) shows increasing degradation with feature count $P$, reaching $6\times$ at $P=20{,}000$. Panel (b) reveals extreme degradation for small training windows, exceeding $40\times$ at $T=6$. Panel (c) demonstrates sensitivity to kernel bandwidth $\gamma$, with tighter kernels showing worse degradation. Panel (d) shows stable degradation across input dimensions $K$. All degradation factors exceed unity, confirming systematic breakdown across the entire parameter space. Each point represents the mean over 1,000 trials.
Figure 3: Parameter Sensitivity Analysis
The left panel shows a degradation-factor heatmap for $(P,T)$ combinations; financial applications typically use $P\geq 5{,}000$ and $T\leq 12$, exhibiting degradation factors exceeding $3\times$. The extreme degradation at $T=6$ (reaching $41.6\times$) occurs because variance estimates become unreliable with limited training data. The right panel displays the $(P,\gamma)$ interaction, showing that high complexity combined with tight kernels yields degradation exceeding $10\times$. These parameter ranges are commonly employed in high-dimensional return prediction, suggesting widespread applicability of the breakdown results.
Figure 4: Statistical Significance: Kolmogorov-Smirnov Test Statistics
This figure presents Kolmogorov-Smirnov test statistics comparing error distributions between standard and standardized RFF across the parameter space. All panels show KS statistics substantially exceeding typical significance thresholds, indicating fundamentally different error distributions. Panel (a) demonstrates increasing statistical significance with feature count $P$, reaching KS $\approx 0.9$ for large $P$. Panel (b) shows extreme significance for small training windows ($T\leq 12$). Panels (c) and (d) reveal strong effects across kernel bandwidth $\gamma$ and input dimension $K$. These results provide overwhelming statistical evidence against the null hypothesis that standardization preserves RFF properties, with effect sizes far exceeding what could arise from random variation.
Figure 5: Convergence Patterns
Empirical Validation of Theorem 1: Convergence Patterns Under Different Standardization Procedures. This figure demonstrates the fundamental breakdown of Random Fourier Features convergence properties under standardization. The blue line (circles) shows standard RFF achieving the theoretically predicted $P^{-1/2}$ convergence rate to the Gaussian kernel $k_{G}(x,x^{\prime})=\exp(-\gamma^{2}\|x-x^{\prime}\|^{2}/2)$, validating Rahimi & Recht (2007). The red line (squares) reveals that standardized RFF fail to converge to the Gaussian kernel, plateauing at error levels $4\times$ higher than standard RFF at $P=12{,}000$. Most importantly, the green line (triangles) confirms Theorem 1’s constructive prediction: standardized RFF do converge to the modified limit $k^{*}_{\text{std}}(x,x^{\prime}|T)$ at the canonical $P^{-1/2}$ rate. This validates our theoretical characterization while demonstrating that standardization creates training-set dependent kernels that violate the shift-invariance properties required for kernel methods. Results are averaged over 20 trials with $T=12$, $K=15$, and $\gamma=2.0$.

Appendix A Technical Proofs for Kernel Approximation Breakdown

This appendix provides complete mathematical proofs for the results in Section 3. We establish that within-sample standardization of Random Fourier Features fundamentally breaks the Gaussian kernel approximation that underlies the theoretical framework of high-dimensional prediction methods.

A.1 Model Setup and Notation

We analyze the standardized Random Fourier Features used in practical implementations. Draw $(\omega,b)\sim\mathcal{N}(0,\gamma^{2}I_{K})\times\text{Uniform}[0,2\pi]$, independently of the training set $\mathcal{T}=\{x_{t}\}_{t=1}^{T}$. For query points $x,x^{\prime}\in\mathbb{R}^{K}$, define the standardized kernel function:

\[
h(\omega,b)=\frac{2\cos(\omega^{\top}x+b)\,\cos(\omega^{\top}x^{\prime}+b)}{1+\frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_{t}+2b)}=\frac{N(\omega,b)}{D(\omega,b)}
\]

Given $P$ i.i.d. copies $(\omega_{i},b_{i})$, we write $k^{(P)}_{\text{std}}:=P^{-1}\sum_{i=1}^{P}h(\omega_{i},b_{i})$.
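A brief Monte Carlo sketch of these definitions follows; the synthetic inputs and all parameter values are illustrative assumptions, not the paper’s replication code.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, gamma, P = 12, 15, 2.0, 200_000          # illustrative values
X = rng.standard_normal((T, K))                # training set {x_t}
x, xp = rng.standard_normal(K), rng.standard_normal(K)   # query points

omega = gamma * rng.standard_normal((P, K))    # omega_i ~ N(0, gamma^2 I_K)
b = rng.uniform(0.0, 2.0 * np.pi, size=P)      # b_i ~ Uniform[0, 2*pi]

N = 2.0 * np.cos(omega @ x + b) * np.cos(omega @ xp + b)          # N(omega_i, b_i)
D = 1.0 + np.mean(np.cos(2.0 * (X @ omega.T) + 2.0 * b), axis=0)  # D(omega_i, b_i)
k_std_P = np.mean(N / D)                       # k^(P)_std(x, x'): sample average of h
# rare near-zero denominators occur with small probability (cf. Lemma 1)

k_G = np.exp(-gamma**2 * np.linalg.norm(x - xp)**2 / 2)
print(k_std_P, k_G)                            # the two quantities generally differ
```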

A.2 Proof of Theorem 1

The proof proceeds in two steps: establishing almost-sure convergence in part (a) and demonstrating training-set dependence in part (b).

A.2.1 Step 1: Integrability and Almost-Sure Convergence

We first establish that $h(\omega,b)$ has finite expectation, enabling application of the strong law of large numbers.

Write

\[
\hat{\sigma}^{2}:=\frac{2}{T}\sum_{t=1}^{T}\cos^{2}(\omega^{\top}x_{t}+b)=1+S_{T},\qquad S_{T}:=\frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_{t}+2b)
\]

Since $|h|\leq 2\hat{\sigma}^{-2}$, integrability of $h$ follows once we show $\mathbb{E}[\hat{\sigma}^{-2}]<\infty$. Lemma 1 proves this claim.

Using $\mathbb{P}(\hat{\sigma}^{-2}>u)=\mathbb{P}(\hat{\sigma}^{2}<u^{-1})$, we obtain:

\[
\mathbb{E}[\hat{\sigma}^{-2}]=\int_{0}^{\infty}\mathbb{P}(\hat{\sigma}^{-2}>u)\,du\;\leq\;1+\int_{1}^{\infty}C_{T}\,u^{-T/2}\,du\;<\;\infty
\]

for every $T\geq 5$. Hence $\mathbb{E}|h|<\infty$.
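For completeness, the tail integral in the display above evaluates in closed form, so the bound is explicit:
\[
\int_{1}^{\infty}C_{T}\,u^{-T/2}\,du=\frac{2C_{T}}{T-2},\qquad\text{hence}\qquad\mathbb{E}[\hat{\sigma}^{-2}]\;\leq\;1+\frac{2C_{T}}{T-2}<\infty .
\]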

Since the variables $h(\omega_{i},b_{i})$ are i.i.d. with finite mean, Kolmogorov’s strong law yields:

\[
k^{(P)}_{\text{std}}(x,x^{\prime})=\frac{1}{P}\sum_{i=1}^{P}h(\omega_{i},b_{i})\;\xrightarrow[\text{a.s.}]{\;P\to\infty\;}\;k^{*}_{\text{std}}(x,x^{\prime}):=\mathbb{E}[h(\omega,b)]
\]

This establishes part (a) of Theorem 1.

A.2.2 Step 2: Training-Set Dependence

We now prove that the limiting kernel $k^{*}_{\text{std}}$ depends on the training set, unlike the Gaussian kernel.

Define the radial function:

\[
G(u):=\mathbb{E}\bigl[(1+\cos(2\omega^{\top}u+2b))^{-1}\bigr]=g(r),\qquad r:=\|u\|\geq 0
\]

Now fix $x=x^{\prime}=0$ and compare two training sets:

\[
\mathcal{T}=\{x_{1},\ldots,x_{T}\},\qquad\mathcal{T}^{\prime}=\{\alpha x_{1},x_{2},\ldots,x_{T}\},\qquad\alpha>1
\]

Write $D:=1+S_{T}$ and $g_{\mathcal{T}}(b):=\mathbb{E}[D^{-1}\,|\,b]$. Conditioning on $b$:

\[
k^{*}_{\text{std}}(0,0\,|\,\mathcal{T})=\int_{0}^{2\pi}(1+\cos 2b)\,g_{\mathcal{T}}(b)\,\frac{db}{2\pi}
\]

Because $U_{t}(b)=2\omega^{\top}x_{t}+2b$ is Gaussian with variance $4\gamma^{2}\|x_{t}\|^{2}$, scaling $x_{1}\mapsto\alpha x_{1}$ strictly enlarges $\operatorname{Var}U_{1}(b)$. By Lemma 2, this implies $g_{\mathcal{T}}(b)\neq g_{\mathcal{T}^{\prime}}(b)$ on a set of $b$ of positive measure. Since $1+\cos 2b>0$ on that set, the two integrals—and therefore the two kernels—differ:

\[
k^{*}_{\text{std}}(0,0\,|\,\mathcal{T})\neq k^{*}_{\text{std}}(0,0\,|\,\mathcal{T}^{\prime})
\]

The Gaussian kernel $k_{G}(x,x^{\prime})=\exp\bigl(-\tfrac{\gamma^{2}}{2}\|x-x^{\prime}\|^{2}\bigr)$ depends only on $\|x-x^{\prime}\|$ and is training-set independent. Hence $k^{*}_{\text{std}}\neq k_{G}$, which proves part (b) of Theorem 1.
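A short Monte Carlo sketch makes this dependence visible; the synthetic training points, the choice $\alpha=3$, and all other values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, gamma, n_draws = 12, 15, 2.0, 400_000    # illustrative values
X = rng.standard_normal((T, K))                # training set T
X_alt = X.copy()
X_alt[0] *= 3.0                                # T' replaces x_1 by alpha * x_1, alpha = 3

def k_std_at_zero(train):
    # Monte Carlo estimate of k*_std(0, 0 | train) = E[ 2 cos^2(b) / (1 + S_T) ]
    omega = gamma * rng.standard_normal((n_draws, K))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_draws)
    S_T = np.mean(np.cos(2.0 * (train @ omega.T) + 2.0 * b), axis=0)
    return np.mean(2.0 * np.cos(b) ** 2 / (1.0 + S_T))   # rare small denominators: Lemma 1

print(k_std_at_zero(X), k_std_at_zero(X_alt))  # estimates differ, while k_G(0,0) = 1 regardless
```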

A.3 Supporting Lemmas

Lemma 1 (Small–ball estimate).

Let $x_{1},\dots,x_{T}\in\mathbb{R}^{d}$ satisfy the affine-independence assumption

\[
\mathrm{rank}\begin{pmatrix}x_{1}&\cdots&x_{T}\\ 1&\cdots&1\end{pmatrix}=T. \tag{A}
\]

Draw $\omega\sim\mathcal{N}(0,\gamma^{2}I_{d})$ and $b\sim\mathrm{Unif}[0,2\pi]$ independently and set

\[
\hat{\sigma}^{2}=1+\frac{1}{T}\sum_{t=1}^{T}\cos\bigl(2\omega^{\top}x_{t}+2b\bigr).
\]

Then there exists a constant $C_{T}<\infty$ (depending on $T$, $\gamma$, $\{x_{t}\}$) such that for every $\varepsilon\in(0,1)$,

\[
\mathbb{P}\bigl(\hat{\sigma}^{2}\leq\varepsilon\bigr)\;\leq\;C_{T}\,\varepsilon^{T/2}.
\]
Proof of Lemma 1.

We convert the small-ball event into a geometric one. Using the inequality $\cos(2\theta_{t})\geq-1+\frac{1}{4}\Delta_{t}^{2}$, where $\Delta_{t}$ is the distance of $2\theta_{t}$ to the nearest point where the cosine equals $-1$, the condition $\hat{\sigma}^{2}\leq\varepsilon$ forces the vector $(\Delta_{1},\dots,\Delta_{T})$ to lie inside a $T$-dimensional Euclidean ball of radius $O(\sqrt{T\varepsilon})$, whose volume scales like $\varepsilon^{T/2}$. Because the affine map $(\omega,b)\mapsto(\theta_{1},\dots,\theta_{T})$ has full rank, the Gaussian density of its image is uniformly bounded, so the probability of the event is at most a constant times this volume, yielding the bound $\mathbb{P}(\hat{\sigma}^{2}\leq\varepsilon)\leq C_{T}\varepsilon^{T/2}$.

Put $\theta_{t}:=\omega^{\top}x_{t}+b$, $t=1,\dots,T$, and write $\delta:=1-\varepsilon\in(0,1)$. Then

\[
\hat{\sigma}^{2}=1+\frac{1}{T}\sum_{t=1}^{T}\cos(2\theta_{t})\quad\Longrightarrow\quad\bigl\{\hat{\sigma}^{2}\leq\varepsilon\bigr\}=\Bigl\{\frac{1}{T}\sum_{t=1}^{T}\cos(2\theta_{t})\leq-\delta\Bigr\}.
\]

For every $\varphi\in\mathbb{R}$ with $|\varphi-\pi-2\pi k|\leq\pi/2$ ($k\in\mathbb{Z}$) we have

\[
\cos\varphi\;\geq\;-1+\frac{1}{4}\bigl(\varphi-\pi-2\pi k\bigr)^{2} \tag{A.1}
\]

(Use $\cos u\leq 1-u^{2}/4$ for $|u|\leq\pi/2$, set $u=\varphi-\pi-2\pi k$, and note $\cos\varphi=-\cos u$.)

Let $\Phi_{t}:=2\theta_{t}=2\omega^{\top}x_{t}+2b$ and define the $T$ distances

\[
\Delta_{t}:=\min_{k\in\mathbb{Z}}\bigl|\Phi_{t}-\pi-2\pi k\bigr|\in[0,\pi].
\]

Applying (A.1) with $\varphi=\Phi_{t}$ gives $\cos\Phi_{t}\geq-1+\tfrac{1}{4}\Delta_{t}^{2}$.

Because $\frac{1}{T}\sum_{t}\cos\Phi_{t}\leq-\delta$ and $\cos\Phi_{t}\geq-1+\tfrac{1}{4}\Delta_{t}^{2}$,

\[
-1+\frac{1}{4}\,\frac{1}{T}\sum_{t=1}^{T}\Delta_{t}^{2}\;\leq\;-\delta\quad\Longrightarrow\quad\frac{1}{T}\sum_{t=1}^{T}\Delta_{t}^{2}\;\leq\;4(1-\delta)=4\varepsilon.
\]

Hence the event $\{\hat{\sigma}^{2}\leq\varepsilon\}$ implies

\[
\bigl(\Delta_{1},\dots,\Delta_{T}\bigr)\in B_{\varepsilon}:=\bigl\{z\in\mathbb{R}^{T}:\|z\|_{2}^{2}\leq 4T\varepsilon\bigr\}.
\]

The Lebesgue volume of $B_{\varepsilon}$ is $\mathrm{vol}(B_{\varepsilon})=\kappa_{T}(4T\varepsilon)^{T/2}$, with $\kappa_{T}$ the unit-ball volume in $\mathbb{R}^{T}$.

Write $Y:=2(\omega^{\top}x_{1},\dots,\omega^{\top}x_{T})^{\top}\in\mathbb{R}^{T}$. By linearity, $Y\sim\mathcal{N}(0,\Sigma)$ with $\Sigma=4\gamma^{2}(x_{i}^{\top}x_{j})_{i,j\leq T}$. Assumption (A) implies $\Sigma$ is non-singular, so $Y$ possesses a density $f_{Y}(y)=\frac{1}{\sqrt{(2\pi)^{T}\det\Sigma}}\,e^{-\frac{1}{2}y^{\top}\Sigma^{-1}y}$, satisfying $0<\sup_{y\in\mathbb{R}^{T}}f_{Y}(y)=(2\pi)^{-T/2}(\det\Sigma)^{-1/2}=:M_{T}<\infty$.

For every fixed $b\in[0,2\pi]$ we have $(\Phi_{1},\dots,\Phi_{T})=Y+2b\,\mathbf{1}$, so

\[
\mathbb{P}\bigl(\hat{\sigma}^{2}\leq\varepsilon\mid b\bigr)\;\leq\;\int_{B_{\varepsilon}-2b\,\mathbf{1}}f_{Y}(y)\,dy\;\leq\;M_{T}\,\mathrm{vol}(B_{\varepsilon}).
\]

Finally,

\[
\mathbb{P}\bigl(\hat{\sigma}^{2}\leq\varepsilon\bigr)=\frac{1}{2\pi}\int_{0}^{2\pi}\mathbb{P}\bigl(\hat{\sigma}^{2}\leq\varepsilon\mid b\bigr)\,db\;\leq\;M_{T}\,\mathrm{vol}(B_{\varepsilon})\;=\;C_{T}\,\varepsilon^{T/2},
\]

where $C_{T}:=M_{T}\,\kappa_{T}(4T)^{T/2}$. ∎
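As a sanity check on the lemma, a small simulation (all inputs synthetic, parameter values illustrative) compares the empirical small-ball probability with the $\varepsilon^{T/2}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(3)
T, K, gamma, n_draws = 6, 15, 2.0, 1_000_000   # T = 6 keeps the small-ball event visible
X = rng.standard_normal((T, K))
omega = gamma * rng.standard_normal((n_draws, K))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_draws)
sigma2 = 1.0 + np.mean(np.cos(2.0 * (X @ omega.T) + 2.0 * b), axis=0)   # sigma_hat^2 per draw

for eps in (0.4, 0.2, 0.1, 0.05):
    print(f"eps = {eps:4.2f}   P(sigma^2 <= eps) = {np.mean(sigma2 <= eps):.2e}"
          f"   eps^(T/2) = {eps ** (T / 2):.2e}")
```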

Lemma 2 (Strict Radial Monotonicity).

The derivative satisfies $g^{\prime}(r)<0$ for every $r>0$.

Proof of Lemma 2.

Using the identity $(1+\cos\phi)^{-1}=\tfrac{1}{2}\sum_{n=0}^{\infty}(-1)^{n}I_{n}(\phi)$ and the isotropy of $\omega$, $g(r)=\tfrac{1}{2}\sum_{n=0}^{\infty}(-1)^{n}I_{n}(2\gamma r)$. Each $I_{n}$ is positive and increasing on $(0,\infty)$, and the series converges absolutely on compacts, so term-wise differentiation gives $g^{\prime}(r)<0$. ∎

A.4 Connection to Existing Literature

Our analysis uses only the first four moments of the $\mathcal{N}(0,\gamma^{2}I_{K})$ draw for $\omega$, which aligns with the finite-moment conditions imposed in Kelly et al. (2024). Specifically, their Condition 0 requires $\mathbb{E}\|\omega\|^{4}<\infty$, which our standard Gaussian assumption satisfies. No additional distributional assumptions beyond those already present in the KMZ framework are required for our impossibility results.

The breakdown we establish is therefore endemic to the standardized RFF approach as implemented, rather than an artifact of stronger technical conditions. This reinforces the fundamental nature of the theory-practice disconnect we identify.

Appendix B Technical Proofs for Section 4

Proof of Theorem 2.

The strategy is the classical minimax/Fano route: (i) build a large packing of well-separated parameters, (ii) show that their induced data distributions are statistically indistinguishable, (iii) invoke Fano’s inequality to bound any decoder’s error, and (iv) convert decoder error into a lower bound on prediction risk.

Packing construction. Fix a radius $0<\delta<B/2$. Because the Euclidean ball $\mathbb{B}_{2}^{P}(B)$ in $\mathbb{R}^{P}$ has volume growth proportional to $B^{P}$, it contains a $2\delta$-packing $\{w_{1},\dots,w_{M}\}$ of size $M=(B/(2\delta))^{P}$; hence $\log M=P\log\bigl(B/(2\delta)\bigr)$. Define $f_{j}(x):=w_{j}^{\top}z(x)$. For each index $j$ let $\mathbb{P}_{j}$ denote the joint distribution of the training sample $\mathcal{D}_{T}=\{(x_{t},r_{t})\}_{t=1}^{T}$ generated according to $r_{t}=f_{j}(x_{t})+\epsilon_{t}$ with independent Gaussian noise $\epsilon_{t}\sim\mathcal{N}(0,\sigma^{2})$.

Average KL divergence. Let $Z\in\mathbb{R}^{T\times P}$ be the random design matrix whose $t$-th row is $z(x_{t})^{\top}$. Conditioned on $Z$, the log-likelihood ratio between $\mathbb{P}_{j}$ and $\mathbb{P}_{\ell}$ is Gaussian, and one checks

\[
\mathrm{KL}\bigl(\mathbb{P}_{j}\,\|\,\mathbb{P}_{\ell}\bigm|Z\bigr)=\frac{\|Z(w_{j}-w_{\ell})\|_{2}^{2}}{2\sigma^{2}}.
\]

Taking expectation over $Z$ and using $\mathbb{E}[Z^{\top}Z]=T\Sigma_{z}$ gives

\[
\mathrm{KL}(\mathbb{P}_{j}\,\|\,\mathbb{P}_{\ell})=\frac{T}{2\sigma^{2}}\,(w_{j}-w_{\ell})^{\top}\Sigma_{z}(w_{j}-w_{\ell})\;\leq\;\frac{2T\,C_{z}\,B^{2}}{\sigma^{2}}\;=:\;K_{T}.
\]

(The inequality uses $\Sigma_{z}\succeq 0$ and $\lambda_{\max}(\Sigma_{z})\leq C_{z}$.)

Fano’s inequality. Draw an index $J$ uniformly from $[M]$ and let $\hat{J}$ be any measurable decoder based on the sample $\mathcal{D}_{T}$. Fano’s max-KL form (e.g. Cover & Thomas 2006, Eq. 16.32) yields

\[
\mathbb{P}(\hat{J}\neq J)\;\geq\;1-\frac{K_{T}+\log 2}{\log M}.
\]

Choosing the packing radius $\delta$ such that the right-hand side equals $1/2$ (so that any decoder errs at least half the time) gives

\[
\delta\;\leq\;\frac{B}{2}\exp\!\Bigl(-\frac{4TC_{z}B^{2}}{P\sigma^{2}}-\frac{2\log 2}{P}\Bigr). \tag{B.1}
\]
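As an illustration of how these quantities interact, the following snippet evaluates $K_{T}$, the critical packing radius from (B.1), and the resulting Fano lower bound on decoding error; the values of $B$, $C_{z}$, and $\sigma^{2}$ are assumptions chosen only for orientation, not calibrated estimates.

```python
import numpy as np

# Hypothetical illustrative values; B, C_z, and sigma^2 are assumptions.
T, P, sigma2, C_z, B = 12, 12_000, 0.02, 1.0, 0.5

K_T = 2.0 * T * C_z * B**2 / sigma2                       # bound on pairwise KL divergence
delta = (B / 2.0) * np.exp(-4.0 * T * C_z * B**2 / (P * sigma2)
                           - 2.0 * np.log(2.0) / P)       # critical packing radius (B.1)
log_M = P * np.log(B / (2.0 * delta))                     # log-cardinality of the packing
fano_rhs = 1.0 - (K_T + np.log(2.0)) / log_M              # equals 1/2 by construction
print(K_T, delta, log_M, fano_rhs)
```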

Link between prediction risk and decoder error. Let $\hat{f}_{T}$ be an arbitrary estimator and put $\varepsilon:=\mathbb{E}_{x,\mathcal{D}_{T},\epsilon}\bigl[(\hat{f}_{T}(x)-f_{J}(x))^{2}\bigr]$. Because the nearest-neighbour decoder chooses $\hat{J}=\arg\min_{j}\|\hat{f}_{T}-f_{j}\|_{L^{2}(\mu)}$, the triangle inequality gives

\[
\|f_{\hat{J}}-f_{J}\|_{L^{2}(\mu)}\;\leq\;2\sqrt{\varepsilon}.
\]

Meanwhile each pair $(j,\ell)$ in the packing satisfies $\|w_{j}-w_{\ell}\|_{2}\geq 2\delta$; since $\Sigma_{z}\succeq c_{z}I_{P}$, $\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}\geq 4c_{z}\delta^{2}$. Consequently, if $\varepsilon<c_{z}\delta^{2}$ the decoder must succeed ($\hat{J}=J$), contradicting $\mathbb{P}(\hat{J}\neq J)\geq\tfrac{1}{2}$. Hence

\[
\varepsilon\;\geq\;c_{z}\,\delta^{2}. \tag{B.2}
\]

Expectation lower bound. Substituting (B.1) into (B.2) and absorbing the harmless factor $e^{-4\log 2/P}$ into a constant $c=\tfrac{1}{4}c_{z}e^{-4\log 2/P}$ yields

\[
\varepsilon\;\geq\;c\,B^{2}\exp\!\Bigl(-\frac{8TC_{z}B^{2}}{P\sigma^{2}}\Bigr),
\]

which is the desired in-expectation bound.
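Plugging the same hypothetical constants into this expression gives a concrete risk floor; again, $c_{z}$, $C_{z}$, $B$, and $\sigma^{2}$ are assumed values used only for illustration.

```python
import numpy as np

T, P, sigma2 = 12, 12_000, 0.02          # assumed sample size, feature count, noise variance
C_z, c_z, B = 1.0, 1.0, 0.5              # assumed covariance bounds and parameter radius
c = 0.25 * c_z * np.exp(-4.0 * np.log(2.0) / P)
risk_floor = c * B**2 * np.exp(-8.0 * T * C_z * B**2 / (P * sigma2))
print(risk_floor)   # lower bound on the minimax prediction risk under these assumed constants
```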

High-probability refinement over the design. Finally, define the “well-conditioned design” event

\[
\mathcal{E}:=\Bigl\{\bigl\|T^{-1}Z^{\top}Z-\Sigma_{z}\bigr\|_{\mathrm{op}}\leq\tfrac{1}{2}c_{z}\Bigr\}.
\]

For sub-Gaussian rows, the matrix Bernstein inequality (Tropp 2012, Theorem 6.2) guarantees $\mathbb{P}_{Z}(\mathcal{E}^{c})\leq e^{-T}$ provided $T\geq C_{0}(\kappa,c_{z},C_{z})\,P$. On $\mathcal{E}$ the empirical Gram matrix satisfies $\tfrac{1}{2}c_{z}I_{P}\preceq T^{-1}Z^{\top}Z\preceq 2C_{z}I_{P}$, so the previous KL-and-distance calculations hold with constants $(2C_{z},\tfrac{1}{2}c_{z})$. Repeating the Fano-risk argument under $\mathcal{E}$ therefore gives

\[
\varepsilon \;\geq\; c^{\ast}B^{2}\exp\!\Bigl(-\frac{8TC_zB^{2}}{P\sigma^{2}}\Bigr),\qquad c^{\ast}=\tfrac{1}{8}c_z e^{-4\log 2/P},
\]

for every fixed $Z\in\mathcal{E}$. Taking the outer expectation over the design and using $\mathbb{P}_Z(\mathcal{E}^{c})\leq e^{-T}$ produces the advertised high-probability bound. ∎
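For illustration, the following minimal Python sketch (not part of the formal argument or the replication code) estimates how often the well-conditioned design event $\mathcal{E}$ holds under the simplifying assumption $\Sigma_z=I_P$, so that $c_z=C_z=1$, with Gaussian rows. The requirement $T\geq C_0(\kappa,c_z,C_z)\,P$ shows up empirically: the event becomes likely only once $T$ is a sufficiently large multiple of $P$.

```python
import numpy as np

rng = np.random.default_rng(0)

def event_probability(T, P, c_z=1.0, n_rep=200):
    """Monte Carlo frequency of the well-conditioned design event
    E = { || T^{-1} Z'Z - Sigma_z ||_op <= c_z / 2 } with Sigma_z = I_P."""
    hits = 0
    for _ in range(n_rep):
        Z = rng.standard_normal((T, P))        # sub-Gaussian rows, Sigma_z = I_P
        gram_err = Z.T @ Z / T - np.eye(P)
        if np.linalg.norm(gram_err, 2) <= c_z / 2:   # operator (spectral) norm
            hits += 1
    return hits / n_rep

P = 50
for T in (P, 4 * P, 16 * P, 64 * P):
    print(f"T = {T:5d}  P = {P}  P(E) ~ {event_probability(T, P):.2f}")
```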

Proof of Theorem 3.

Part (a). Let $e_1,\dots,e_P$ be the standard basis of $\mathbb{R}^{P}$. Define

\[
\delta=\min\Bigl\{\tfrac{B}{4},\;\tfrac{\sigma}{4}\sqrt{\tfrac{\log P}{T\,C_z}}\Bigr\},\qquad w_0=\mathbf{0},\qquad w_j=\delta\,e_j\;(j=1,\dots,P).
\]

All $w_j$ lie in the ball $\|w\|_2\leq B$, and $\|w_j-w_\ell\|_2=\sqrt{2}\,\delta$ for $j\neq\ell$.

For $j\neq\ell$ let $\mathbb{P}_j,\mathbb{P}_\ell$ be the distributions of the $T$ samples when the true parameter is $w_j$ or $w_\ell$. Conditioned on the design matrix $Z$, both are Gaussian with means $Zw_j,Zw_\ell$ and covariance $\sigma^{2}I_T$. Taking the expectation over $Z$,

\[
\mathbb{E}\bigl[D_{\mathrm{KL}}(\mathbb{P}_j\|\mathbb{P}_\ell)\bigr]=\frac{T}{2\sigma^{2}}(w_j-w_\ell)^{\top}\Sigma_z(w_j-w_\ell)\;\leq\;\frac{2T\,C_z\delta^{2}}{\sigma^{2}}\;\leq\;\frac{\log P}{8}.
\]

With $M=P+1$ equiprobable hypotheses, Fano's inequality gives

\[
\Pr[\hat{J}\neq J]\;\geq\;1-\frac{\tfrac{1}{8}\log P+\log 2}{\log(P+1)}\;\geq\;\tfrac{1}{2}\qquad(\text{for }P\geq 16).
\]
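As a quick numerical check of this display (my arithmetic, not from the source): at $P=16$ the right-hand side equals $1-(\tfrac{1}{8}\ln 16+\ln 2)/\ln 17\approx 0.63\geq\tfrac{1}{2}$, and it only increases with $P$.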

Because $\Sigma_z\succeq c_z I_P$,

\[
\|f_j-f_\ell\|_{L^{2}(\mu)}^{2}=(w_j-w_\ell)^{\top}\Sigma_z(w_j-w_\ell)\;\geq\;2c_z\delta^{2}.
\]

If $\hat{f}_T$ attains mean-squared error $\varepsilon<\frac{c_z}{4}\delta^{2}$, then on the event $\{\hat{J}\neq J\}$ the triangle inequality forces $\|f_{\hat{J}}-f_J\|_{L^{2}(\mu)}<\sqrt{2c_z}\,\delta$, contradicting the separation just shown. Therefore

\[
\varepsilon\;\geq\;\Pr[\hat{J}\neq J]\,\frac{c_z}{2}\,\delta^{2}\;\geq\;\frac{c_z}{4}\,\delta^{2}.
\]

Substituting the definition of $\delta$ yields

\[
\varepsilon\;\geq\;\frac{c_z}{4}\,\min\Bigl\{\tfrac{B^{2}}{16},\;\tfrac{\sigma^{2}\log P}{16\,T\,C_z}\Bigr\}=\frac{c_z}{64C_z}\,\min\Bigl\{B^{2},\;\tfrac{\sigma^{2}}{T}\log P\Bigr\}.
\]

Setting $\tilde{c}=c_z/(64C_z)$ completes the proof.
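For intuition on the magnitudes involved, the short Python sketch below (an illustration only; the parameter values are hypothetical and the constants are simplified to $c_z=C_z=1$, so $\tilde{c}=1/64$) evaluates the packing radius $\delta$ and the resulting lower bound $\tilde{c}\min\{B^{2},\sigma^{2}\log P/T\}$ at finance-like orders of magnitude.

```python
import numpy as np

def minimax_lower_bound(B, sigma, P, T, c_z=1.0, C_z=1.0):
    """Part (a) lower bound c_tilde * min{B^2, sigma^2 log(P) / T}
    together with the packing radius delta used in the proof."""
    c_tilde = c_z / (64.0 * C_z)
    delta = min(B / 4.0, (sigma / 4.0) * np.sqrt(np.log(P) / (T * C_z)))
    bound = c_tilde * min(B ** 2, sigma ** 2 * np.log(P) / T)
    return bound, delta

# Hypothetical monthly-return-style inputs: weak signal, noisy target.
bound, delta = minimax_lower_bound(B=0.05, sigma=0.15, P=12_000, T=360)
print(f"packing radius delta = {delta:.4g}, minimax lower bound = {bound:.4g}")
```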

Part (b)

1. Packing of the coefficient ball.

Let $e_1,\dots,e_P$ be the canonical basis of $\mathbb{R}^{P}$ and set

\[
\delta:=\min\Bigl\{\tfrac{B}{4},\;\tfrac{\sigma}{4}\sqrt{\tfrac{\log P}{T\,C_z}}\Bigr\},\qquad w_0:=\mathbf{0},\qquad w_j:=\delta\,e_j\;(j=1,\dots,P).
\]

All $w_j$ lie in the Euclidean ball $\|w\|_2\leq B$ and satisfy $\|w_j-w_\ell\|_2=\sqrt{2}\,\delta$ for $j\neq\ell$.

2. A “good-design” event.

Define

\[
\mathcal{E}:=\Bigl\{\,\lambda_{\max}\bigl(T^{-1}Z^{\top}Z\bigr)\leq 2C_z\Bigr\}.
\]

By the sub-Gaussian matrix Bernstein inequality (e.g., Tropp (2015)) there is a constant $C_0=C_0(\kappa,C_z)$ such that

\[
\mathbb{P}_Z(\mathcal{E}^{\mathrm{c}})\;\leq\;e^{-T}\qquad\text{for all }T\geq C_0P. \tag{B.3}
\]
3. KL bound conditional on $\mathcal{E}$.

For $j\neq\ell$ the conditional Kullback–Leibler divergence equals

\[
\mathrm{KL}(\mathbb{P}_j\|\mathbb{P}_\ell\mid Z)=\frac{\|Z(w_j-w_\ell)\|_2^{2}}{2\sigma^{2}}\;\leq\;\frac{T\,\lambda_{\max}(T^{-1}Z^{\top}Z)\,\|w_j-w_\ell\|_2^{2}}{2\sigma^{2}}.
\]

On $\mathcal{E}$ we have $\lambda_{\max}(T^{-1}Z^{\top}Z)\leq 2C_z$, so

\[
\mathrm{KL}(\mathbb{P}_j\|\mathbb{P}_\ell\mid Z)\;\leq\;\frac{2T\,C_z\,\delta^{2}}{\sigma^{2}}\;\leq\;\frac{\log P}{8}.
\]
4. Fano's inequality conditional on $Z\in\mathcal{E}$.

With the $M=P+1$ hypotheses $\{w_0,w_1,\dots,w_P\}$ equiprobable, Fano's inequality (max-KL form) gives, conditional on $Z\in\mathcal{E}$,

\[
\Pr(\hat{J}\neq J\mid Z)\;\geq\;1-\frac{\tfrac{1}{8}\log P+\log 2}{\log(P+1)}\;\geq\;\tfrac{1}{2}\qquad(P\geq 16).
\]
5. Relating risk to identification (conditional).

Since the population covariance satisfies $c_z I_P\preceq\Sigma_z$, for any $j\neq\ell$

\[
\|f_j-f_\ell\|_{L^{2}(\mu)}^{2}=(w_j-w_\ell)^{\top}\Sigma_z(w_j-w_\ell)\;\geq\;2c_z\delta^{2}.
\]

An argument identical to that of part (a) shows that for every estimator $\hat{f}_T$ and every $Z\in\mathcal{E}$

\[
\mathbb{E}_{x,\epsilon}\bigl[(\hat{f}_T(x)-w^{\top}z(x))^{2}\mid Z\bigr]\;\geq\;\frac{c_z}{4}\,\delta^{2}=\tilde{c}\,\min\Bigl\{B^{2},\;\tfrac{\sigma^{2}}{T}\log P\Bigr\}.
\]
6. Remove the conditioning.

Inequality (4.3) follows by combining the design-probability bound (B.3) with the conditional risk bound of step 5. This completes the proof. ∎
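The identification step can also be checked by simulation. The sketch below is my own construction, not the paper's replication code: it assumes $\Sigma_z=I_P$ (so $C_z=1$), Gaussian design and noise, and takes $\hat{J}$ to be the maximum-likelihood (least residual sum of squares) index over the packing $\{w_0,w_1,\dots,w_P\}$. In the weak-signal regime dictated by $\delta$ it should return a misidentification rate well above the $1/2$ guaranteed by Fano.

```python
import numpy as np

rng = np.random.default_rng(1)

def misidentification_rate(T, P, sigma, B, n_rep=200):
    """Monte Carlo estimate of Pr[J_hat != J] for the packing w_0 = 0, w_j = delta*e_j,
    with J_hat the least-RSS hypothesis; Sigma_z = I_P (C_z = 1) is assumed."""
    delta = min(B / 4.0, (sigma / 4.0) * np.sqrt(np.log(P) / T))
    errors = 0
    for _ in range(n_rep):
        J = int(rng.integers(0, P + 1))          # true hypothesis index in {0, ..., P}
        Z = rng.standard_normal((T, P))
        w = np.zeros(P)
        if J >= 1:
            w[J - 1] = delta
        y = Z @ w + sigma * rng.standard_normal(T)
        rss0 = float(np.sum(y ** 2))             # residual sum of squares for w_0 = 0
        rss_j = np.sum((y[:, None] - delta * Z) ** 2, axis=0)
        J_hat = 0 if rss0 <= rss_j.min() else int(np.argmin(rss_j)) + 1
        errors += (J_hat != J)
    return errors / n_rep

print(misidentification_rate(T=240, P=500, sigma=0.10, B=1.0))
```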

Proof of Theorem 4.

All VC statements are made conditional on the fixed training sample $(x_1,\dots,x_T)$. Throughout we use the standard fact that homogeneous linear threshold functions in $\mathbb{R}^{d}$ have VC dimension $d$ (e.g., Vapnik (1998)).

(a) Linear class $\mathcal{F}_P$.

Because $\operatorname{sign}(\lambda w^{\top}z(x))=\operatorname{sign}(w^{\top}z(x))$ for every $\lambda>0$, the norm bound $\|w\|_2\leq B$ does not remove any labelings that an unconstrained homogeneous hyperplane in $\mathbb{R}^{P}$ could realise. Hence the set $\{\operatorname{sign}(w^{\top}z(x)):\|w\|_2\leq B\}$ has the same VC dimension as the class of all homogeneous linear separators in $\mathbb{R}^{P}$, namely $P$.

(b) Ridgeless class $\mathcal{F}_{\textnormal{ridge}}^{(Z)}$.

For any training targets $y\in\mathbb{R}^{T}$ the ridgeless solution is $\hat{w}=Z^{\top}(ZZ^{\top})^{\dagger}y$, where $\dagger$ denotes the Moore–Penrose pseudoinverse. Consequently every predictor can be written as

\[
f_\alpha(x)=\alpha^{\top}k(x),\qquad\text{with }\alpha=(ZZ^{\top})^{\dagger}y\in\mathbb{R}^{T}.
\]

Define the data-dependent feature map

\[
\phi_Z:\mathcal{X}\to\mathbb{R}^{T},\qquad\phi_Z(x):=k(x).
\]

Its image lies in the $r$-dimensional subspace $\operatorname{im}(ZZ^{\top})\subseteq\mathbb{R}^{T}$, so after an appropriate linear change of basis $\phi_Z(\mathcal{X})\subseteq\mathbb{R}^{r}$. Thus the hypothesis class

\[
\mathcal{H}_Z=\bigl\{\,x\mapsto\operatorname{sign}(\alpha^{\top}\phi_Z(x)):\alpha\in\mathbb{R}^{T}\bigr\}
\]

is (up to an invertible linear map) exactly the class of homogeneous linear separators in $\mathbb{R}^{r}$. By the cited VC fact, $\mathrm{VC}(\mathcal{H}_Z)=r$. Because $r\leq T$, we obtain the claimed bound. If $ZZ^{\top}$ is invertible, then $r=T$, giving equality. ∎
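A brief numeric illustration of this rank argument (my sketch, assuming a Gaussian design): even with $P\gg T$, the ridgeless coefficient vector $\hat{w}=Z^{\top}(ZZ^{\top})^{\dagger}y$ lies in the row space of $Z$, whose dimension is $r=\operatorname{rank}(ZZ^{\top})\leq T$, so the fitted predictor exercises at most $T$ effective directions.

```python
import numpy as np

rng = np.random.default_rng(2)

T, P = 60, 600                                   # heavily over-parameterized: P >> T
Z = rng.standard_normal((T, P))
y = rng.standard_normal(T)

# Ridgeless (minimum-norm) fit: w_hat = Z' (Z Z')^+ y
K = Z @ Z.T
w_hat = Z.T @ np.linalg.pinv(K) @ y

r = np.linalg.matrix_rank(K)
print("rank(Z Z') =", r, " (<= T =", T, ")")

# w_hat lies in the row space of Z: projecting onto that subspace changes nothing.
P_row = Z.T @ np.linalg.pinv(K) @ Z              # projector onto the row space of Z
print("||w_hat - P_row w_hat|| =", np.linalg.norm(w_hat - P_row @ w_hat))
```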

Proof of Theorem 5.

The proof follows from the minimax lower bound of Theorem 3:

\[
\inf_{\hat{f}_T}\sup_{\|w\|_2\leq B}\mathbb{E}_{x,\mathcal{D}_T,\epsilon}\bigl[(\hat{f}_T(x)-w^{\top}z(x))^{2}\bigr]\;\geq\;\tilde{c}\,\min\Bigl\{B^{2},\;\frac{\sigma^{2}\log P}{T}\Bigr\}.
\]

For learning with error $<\varepsilon$ to be impossible, we need:

\[
\tilde{c}\,\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}\geq\varepsilon\quad\Longleftrightarrow\quad\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}\geq\frac{\varepsilon}{\tilde{c}}.
\]

Phase I (Impossible Learning): Suppose $\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}$ and $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$.

Since both arguments of the minimum are at least $\frac{\varepsilon}{\tilde{c}}$, we have:

\[
\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}\geq\frac{\varepsilon}{\tilde{c}}.
\]

Therefore:

\[
\inf_{\hat{f}_T}\sup_{\|w\|_2\leq B}\mathbb{E}\bigl[(\hat{f}_T(x)-w^{\top}z(x))^{2}\bigr]\;\geq\;\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon,
\]

establishing that learning with error $<\varepsilon$ is information-theoretically impossible.

Phase II (Possible Learning): Suppose $\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}$ and $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$.

Since $\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}\leq B^{2}$, we have:

\[
\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}=\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}.
\]

The lower bound becomes:

\[
\tilde{c}\,\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}=\tilde{c}\,\frac{\sigma^{2}\log P}{T}<\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon.
\]

Since the information-theoretic lower bound is $<\varepsilon$, learning with error $<\varepsilon$ is not ruled out by fundamental limitations.

Trivial Regime: If $B^{2}<\frac{\varepsilon}{\tilde{c}}$, then regardless of the value of $\frac{\sigma^{2}\log P}{T}$:

\[
\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}\leq B^{2}<\frac{\varepsilon}{\tilde{c}}.
\]

Hence:

\[
\tilde{c}\,\min\Bigl\{B^{2},\frac{\sigma^{2}\log P}{T}\Bigr\}\leq\tilde{c}\,B^{2}<\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon.
\]

The function class is too simple relative to the target accuracy $\varepsilon$, and standard parametric rates apply. ∎
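The case analysis above can be packaged as a small helper. The sketch below is illustrative only: the function name and the example values of $B$, $\sigma$, $P$, $T$, $\varepsilon$, and $\tilde{c}$ are hypothetical, with $\tilde{c}=c_z/(64C_z)$ supplied by the user.

```python
import math

def learning_regime(B, sigma, P, T, eps, c_tilde):
    """Classify a parameter configuration according to the three cases above.
    c_tilde is the constant from Theorem 3, e.g. c_z / (64 * C_z)."""
    threshold = eps / c_tilde
    complexity = sigma ** 2 * math.log(P) / T
    if B ** 2 < threshold:
        return "Trivial regime: function class too simple for target accuracy eps"
    if complexity >= threshold:
        return "Phase I: learning with error < eps is information-theoretically impossible"
    return "Phase II: learning with error < eps is not ruled out"

# Illustrative (hypothetical) values only.
print(learning_regime(B=0.05, sigma=0.15, P=12_000, T=360, eps=5e-6, c_tilde=1 / 64))
```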

Proof of Corollary 1.

We analyze when the conditions of Theorem 5 are satisfied.

For impossibility, we need $\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}$. Since $P=O(K^{\beta})$, we have $\log P=O(\beta\log K)$, so:

\[
\frac{\sigma^{2}\beta\log K}{T}\geq\frac{\varepsilon}{\tilde{c}}\quad\Longrightarrow\quad T\leq\frac{\tilde{c}\sigma^{2}\beta\log K}{\varepsilon}.
\]

We also need $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$. Under the weak-signal assumption, $B^{2}=O(K^{-\alpha})\sigma^{2}$, so:

\[
O(K^{-\alpha})\sigma^{2}\geq\frac{\varepsilon}{\tilde{c}}\quad\Longrightarrow\quad K\leq O\Bigl(\bigl(\tfrac{\tilde{c}\sigma^{2}}{\varepsilon}\bigr)^{1/\alpha}\Bigr).
\]

Learning is impossible when both conditions hold simultaneously:

\begin{align}
T &\leq \frac{\tilde{c}\sigma^{2}\beta\log K}{\varepsilon} \quad\text{(complexity condition)} \tag{B.4}\\
K &\leq O\Bigl(\bigl(\tfrac{\tilde{c}\sigma^{2}}{\varepsilon}\bigr)^{1/\alpha}\Bigr) \quad\text{(non-triviality condition)} \tag{B.5}
\end{align}

For fixed $T$, the complexity condition gives:

\[
K\geq\exp\Bigl(\frac{T\varepsilon}{\tilde{c}\sigma^{2}\beta}\Bigr)=:K_0(T).
\]

Learning is impossible when $K_0(T)\leq K\leq O\bigl((\tilde{c}\sigma^{2}/\varepsilon)^{1/\alpha}\bigr)$.

This interval is non-empty when $T$ is sufficiently small, specifically when:

\[
T\leq\frac{\tilde{c}\sigma^{2}\beta}{\varepsilon}\log\Bigl(O\bigl((\tilde{c}\sigma^{2}/\varepsilon)^{1/\alpha}\bigr)\Bigr)=O\Bigl(\frac{\sigma^{2}\beta\log(\sigma^{2}/\varepsilon)}{\alpha\varepsilon}\Bigr).
\]
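To make the non-emptiness condition concrete, the following sketch (illustrative only; the constants inside the $O(\cdot)$'s are set to one and all numeric inputs are hypothetical) computes $K_0(T)$ and the non-triviality cap and reports whether the impossibility interval is non-empty.

```python
import math

def impossibility_interval(T, eps, sigma, alpha, beta, c_tilde):
    """Return (K_0(T), K_max): learning with error < eps is impossible for
    K_0(T) <= K <= K_max, taking the constants inside the O(.)'s equal to one."""
    K0 = math.exp(T * eps / (c_tilde * sigma ** 2 * beta))
    K_max = (c_tilde * sigma ** 2 / eps) ** (1.0 / alpha)
    return K0, K_max

K0, K_max = impossibility_interval(T=360, eps=5e-6, sigma=0.15,
                                   alpha=0.5, beta=2.0, c_tilde=1 / 64)
print(f"K_0(T) = {K0:.3g}, K_max = {K_max:.3g}, interval non-empty: {K0 <= K_max}")
```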