High-Dimensional Learning in Finance^†^†thanks: Replication code is available from the author.

Hasan Fallahgoul
Monash University Hasan Fallahgoul, Monash University, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, 9 Rainforest Walk, 3800 Victoria, Australia. E-mail: hasan.fallahgoul@monash.edu.

(This version: June 4, 2025
Link to Most Recent Version)

Abstract

Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine three key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set dependent alternatives. Second, I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. Third, VC-dimension analysis reveals that ridgeless regression’s effective complexity is bounded by sample size rather than nominal feature dimension. Comprehensive numerical validation confirms these theoretical predictions, revealing systematic breakdown of claimed theoretical properties across realistic parameter ranges. These results show that when sample size is small and features are high-dimensional, observed predictive success is necessarily driven by low-complexity artifacts, not genuine high-dimensional learning.

Key words: Portfolio choice, machine learning, random matrix theory, PAC-learning

JEL classification: C3, C58, C61, G11, G12, G14

1 Introduction

The integration of machine learning methods into financial prediction has emerged as one of the most active areas of research in empirical asset pricing (Kelly et al. 2024, Gu et al. 2020, Bianchi et al. 2021, Chen et al. 2024, Feng et al. 2020). The appeal is clear: while financial markets generate increasingly high-dimensional data, traditional econometric methods remain constrained by limited sample sizes and the curse of dimensionality. Machine learning promises to uncover predictive relationships that elude traditional linear models by leveraging nonlinear approximations and high-dimensional overparameterized representations, thereby expanding the frontier of return predictability and portfolio construction.

Yet despite rapid adoption and impressive empirical successes, our theoretical understanding of when and why machine learning methods succeed in financial applications remains incomplete. This gap is particularly pronounced for high-dimensional methods applied to the notoriously challenging problem of return prediction, where signals are weak, data are limited, and spurious relationships abound. A fundamental question emerges: under what conditions can sophisticated machine learning methods genuinely extract predictive information from financial data, and when might apparent success arise from simpler mechanisms?

The pioneering work of Kelly et al. (2024) has significantly advanced our theoretical understanding by establishing rigorous conditions under which complex machine learning models can outperform traditional approaches in financial prediction. Their theoretical framework, grounded in random matrix theory, demonstrates that the conventional wisdom about overfitting may not apply in high-dimensional settings, revealing a genuine ’virtue of complexity’ under appropriate conditions. This breakthrough provides crucial theoretical foundations for understanding when and why sophisticated methods succeed in finance.

Building on these theoretical advances, this paper examines how practical implementation details interact with established mechanisms. This becomes important as recent empirical analysis Nagel (2025) suggests that high-dimensional methods may achieve success through multiple pathways that differ from theoretical predictions. Several questions emerge: What are the information-theoretic requirements for learning with weak signals? How do implementation choices affect underlying mathematical properties? When do complexity benefits reflect different learning mechanisms? Understanding these interactions helps characterize the complete landscape of learning pathways in high-dimensional finance applications.

This paper provides theoretical foundations for answering these questions through three main contributions that help characterize the different mechanisms through which high-dimensional methods achieve predictive success in financial prediction.

First, I extend the theoretical analysis to practical implementations, showing how the standardization procedures commonly used for numerical stability modify the kernel approximation properties that underlie existing theory. While Random Fourier Features (RFF) theory rigorously proves convergence to shift-invariant Gaussian kernels under idealized conditions (Rahimi & Recht 2007, Sutherland & Schneider 2015), I prove that the within-sample standardization employed in every practical implementation modifies these theoretical properties. The standardized features converge instead to training-set dependent kernels that violate the mathematical foundations required for kernel methods. This breakdown explains why methods cannot achieve the kernel learning properties established by existing theory and must rely on fundamentally different mechanisms.

Rahimi & Recht (2007) prove that for features $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})$ with $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I)$ and $b_{i}\sim\text{Uniform}[0,2\pi]$ , the empirical kernel $\frac{1}{P}\sum_{i=1}^{P}z_{i}(x)z_{i}(x^{\prime})$ converges in probability to the Gaussian kernel $k(x,x^{\prime})=\exp(-\gamma^{2}\|x-x^{\prime}\|^{2}/2)$ as $P\to\infty$ . This convergence requires that features maintain their original distributional properties and scaling. However, I prove that the within-sample standardization $\tilde{z}_{i}(x)=z_{i}(x)/\hat{\sigma}_{i}$ employed in every practical implementation—where $\hat{\sigma}_{i}^{2}=\frac{1}{T}\sum_{t=1}^{T}z_{i}(x_{t})^{2}$ —fundamentally alters the convergence properties. The standardized features converge instead to training-set dependent kernels $k^{*}_{\text{std}}(x,x^{\prime}|\mathcal{T})\neq k_{G}(x,x^{\prime})$ that violate the shift-invariance and stationarity properties required for kernel methods. A detailed analysis of how standardization breaks the specific theoretical conditions appears in Section 3, following the formal proof of this breakdown.

Second, I derive sharp sample complexity bounds that characterize the information-theoretic limits of high-dimensional learning in financial settings. Using PAC-learning theory,¹¹1In PAC-learning (Valiant 1984), a predictor is “probably approximately correct” if, with $T\!\gtrsim\!(\text{capacity})/\varepsilon^{2}$ samples, its risk is within $\varepsilon$ of optimal with probability $1-\delta$ ; I apply these bounds (see Kearns & Vazirani 1994) to gauge when weak return signals are learnable. I establish both exponential and polynomial lower bounds showing when reliable extraction of weak predictive signals becomes impossible regardless of the sophistication of the employed method. These bounds reveal that learning over function spaces with thousands of parameters suggests that reliable learning may require stronger conditions than typically available in typical financial applications. For example, methods claiming to harness 12,000 parameters with 12 monthly observations require signal-to-noise ratios exceeding realistic bounds by orders of magnitude, suggesting that predictive success may arise through mechanisms that differ from the theoretical framework.

Third, I characterize the effective complexity of high-dimensional methods through VC dimension analysis and sharp learning thresholds.²²2Effective complexity—often called the effective degrees of freedom—is the trace of the “hat” matrix, $\operatorname{df}=\operatorname{tr}\!\bigl{[}H\bigr{]}$ where $H=Z\,(Z^{\top}Z)^{-1}\!Z^{\top}$ . For minimum-norm (ridgeless) regression $H$ is an idempotent projector of rank $T$ , so $\operatorname{df}\leq T$ irrespective of the nominal dimension $P$ ; see Hastie et al. (2009, Chapter 7), Bartlett et al. (2020), and Hastie et al. (2022). I prove that ridgeless regression operates over function spaces with complexity bounded by sample size rather than parameter count, regardless of nominal dimensionality. Combined with precise learning thresholds that depend on signal strength, feature dimension, and sample size, these results provide practitioners with concrete tools for evaluating when available data suffices for reliable prediction versus when apparent performance must arise through alternative mechanisms.

While these theoretical results provide clear mathematical boundaries on learning feasibility, their practical relevance depends on how they manifest across the parameter ranges typically employed in financial applications. The gap between asymptotic theory and finite-sample reality can be substantial, particularly when dealing with the moderate dimensions and sample sizes common in empirical asset pricing. Moreover, the breakdown of kernel approximation under standardization represents a fundamental departure from assumed theoretical properties that requires empirical quantification to assess its practical severity.

To bridge this theory-practice gap, I conduct comprehensive numerical validation of the kernel approximation breakdown across realistic parameter spaces that span the configurations used in recent high-dimensional financial prediction studies (Kelly et al. 2024, Nagel 2025). The numerical analysis examines how within-sample standardization destroys the theoretical Gaussian kernel convergence that underlies existing RFF frameworks, quantifying the magnitude of approximation errors under practical implementation choices. These experiments reveal that standardization-induced kernel deviations reach mean absolute errors exceeding 40% relative to the theoretical Gaussian kernel in typical configurations ( $P=12{,}000$ , $T=12$ ), with maximum deviations approaching 80% in high-volatility training windows. The kernel approximation failure manifests consistently across different feature dimensions and sample sizes, with relative errors scaling approximately as $\sqrt{\log P/T}$ in line with theoretical predictions. The numerical validation thus provides concrete evidence that practical implementation details create substantial violations of the theoretical assumptions underlying high-dimensional RFF approaches, with error magnitudes sufficient to fundamentally alter method behavior.

Together, these results explain why methods may achieve predictive success through multiple pathways, including both sophisticated learning and simpler pattern-matching mechanisms. These findings provide practitioners with frameworks for understanding and evaluating different sources of predictive performance in high-dimensional models. for understanding when such methods can genuinely contribute to predictive performance versus when they exploit statistical artifacts.

1.1 Literature Review

This paper builds on three distinct but interconnected theoretical traditions to provide foundations for understanding high-dimensional learning in financial prediction.

The Probably Approximately Correct (PAC) framework (Valiant 1984, Kearns & Vazirani 1994) provides fundamental tools for characterizing when reliable learning is information-theoretically feasible. Classical results establish that achieving generalization error $\varepsilon$ with confidence $1-\delta$ requires sample sizes scaling with the complexity of the function class, typically $T=O(\text{complexity}\cdot\log(1/\varepsilon)/\varepsilon^{2})$ (Shalev-Shwartz & Ben-David 2014). Recent advances in high-dimensional learning theory (Belkin et al. 2019, Bartlett et al. 2020, Hastie et al. 2022) have refined these bounds for overparameterized models, showing that the effective rather than nominal complexity determines learning difficulty. However, these results have not been systematically applied to the specific challenges of financial prediction, where weak signals and limited sample sizes create particularly demanding learning environments.

The RFF methodology (Rahimi & Recht 2007) provides computationally efficient approximation of kernel methods through random trigonometric features, with theoretical guarantees assuming convergence to shift-invariant kernels under appropriate conditions (Rudi & Rosasco 2017). Subsequent work has characterized the approximation quality and convergence rates for various kernel classes (Mei & Montanari 2022), establishing RFF as a foundation for scalable kernel learning. However, existing theory assumes idealized implementations that may not reflect practical usage. In particular, no prior work has analyzed how the standardization procedures commonly employed to improve numerical stability affect the fundamental convergence properties that justify the theoretical framework.

The phenomenon of ”benign overfitting” in overparameterized models has generated substantial theoretical interest (Belkin et al. 2019, Bartlett et al. 2020), with particular focus on understanding when adding parameters can improve rather than harm generalization performance. The VC dimension provides a classical measure of model complexity that connects directly to generalization bounds (Vapnik 1998), while recent work on effective degrees of freedom (Hastie et al. 2022) shows how structural constraints can limit the true complexity of nominally high-dimensional methods. These insights have been applied to understanding ridge regression in high-dimensional settings, but the connections to kernel methods and the specific constraints imposed by ridgeless regression in financial applications remain underexplored.

The application of machine learning to financial prediction has generated extensive empirical literature (Gu et al. 2020, Kelly et al. 2024, Chen et al. 2024), with particular attention to high-dimensional methods that can potentially harness large numbers of predictors (Feng et al. 2020, Bianchi et al. 2021). The theoretical framework of Kelly et al. (2024) provides crucial insights into when high-dimensional methods can succeed, particularly their demonstration that ridgeless regression can achieve positive performance despite seemingly problematic complexity ratios. This paper extends their analysis by examining how practical implementation considerations interact with these theoretical mechanisms.

This paper contributes to each of these literatures by providing the first unified theoretical analysis that connects sample complexity limitations, kernel approximation breakdown, and effective complexity bounds to explain the behavior of high-dimensional methods in financial prediction.

The remainder of the paper proceeds as follows. Section 2 establishes the theoretical framework and formalizes the theory-practice disconnect in RFF implementations. Section 3 proves that within-sample standardization fundamentally breaks kernel approximation, explaining why claimed theoretical properties cannot hold in practice. Section 4 establishes information-theoretic barriers to high-dimensional learning, showing that genuine complexity benefits are impossible under realistic financial conditions. Section 6 concludes. All technical details are relegated to a supplementary document containing Appendices A and B, which are available upon request from the author.

2 Background and Framework

This section establishes the theoretical framework for analyzing high-dimensional prediction methods in finance. I first formalize the return prediction problem, then examine the critical disconnect between RFF theory and practical implementation that underlies my main results.

2.1 The Financial Prediction Problem

Consider the fundamental challenge of predicting asset returns using high-dimensional predictor information. I observe predictor vectors $x_{t}\in\mathbb{R}^{K}$ and subsequent returns $r_{t+1}\in\mathbb{R}$ for $t=1,\ldots,T$ , with the goal of learning a predictor $\hat{f}:\mathbb{R}^{K}\to\mathbb{R}$ that minimizes expected squared loss $\mathbb{E}[(r_{t+1}-\hat{f}(x_{t}))^{2}]$ .

The challenge lies in the fundamental characteristics of financial prediction: signals are weak relative to noise, predictors exhibit complex persistence patterns, and available sample sizes are limited by the nonstationarity of financial markets. These features create a particularly demanding environment for high-dimensional learning methods.

I formalize this environment through three core assumptions that capture the essential features while maintaining sufficient generality for my theoretical analysis.

Assumption 1 (Financial Prediction Environment).

The return generating process is $r_{t+1}=f^{*}(x_{t})+\epsilon_{t+1}$ where:

(a)

$f^{*}:\mathbb{R}^{K}\to\mathbb{R}$ is the true regression function with $\mathbb{E}[f^{*}(x)^{2}]\leq B^{2}$
(b)

$\epsilon_{t+1}$ is noise with $\mathbb{E}[\epsilon_{t+1}|x_{t}]=0$ and $\mathbb{E}[\epsilon_{t+1}^{2}|x_{t}]=\sigma^{2}$
(c)

The signal-to-noise ratio $\text{SNR}:=B^{2}/\sigma^{2}=O(K^{-\alpha})$ for some $\alpha>0$
(d)

Predictors follow $x_{t}=\Phi x_{t-1}+u_{t}$ with $u_{t}\sim\mathcal{N}(0,\Sigma_{u})$ and eigenvalues of $\Phi$ in $(0,1)$

This assumption captures the essential features of financial prediction that distinguish it from typical machine learning applications. The bounded signal condition and weak SNR scaling reflect the empirical reality that financial predictors typically explain only 1-5% of return variation (Welch & Goyal 2008). The persistence in predictors (eigenvalues of $\Phi$ in $(0,1)$ ) captures the well-documented dynamics of financial variables like dividend yields and interest rate spreads, which proves crucial for understanding why short training windows lead to mechanical pattern matching rather than genuine learning.

Assumption 2 (Random Fourier Features Construction).

High-dimensional predictive features are constructed as $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})$ where $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I_{K})$ and $b_{i}\sim\text{Uniform}[0,2\pi]$ for $i=1,\ldots,P$ . In practical implementations, these features are standardized within each training sample: $\tilde{z}_{i}(x)=z_{i}(x)/\hat{\sigma}_{i}$ where $\hat{\sigma}_{i}^{2}=\frac{1}{T}\sum_{t=1}^{T}z_{i}(x_{t})^{2}$ .

This assumption formalizes the RFF methodology as actually implemented in practice, including the crucial standardization step that has not been analyzed in existing theoretical frameworks. The standardization appears in every practical implementation to improve numerical stability, yet as I prove, it fundamentally alters the mathematical properties of the method.

Assumption 3 (Regularity Conditions).

The input distribution has bounded support and finite moments, ensuring well-defined feature covariance $\Sigma_{z}=\mathbb{E}[z(x)z(x)^{\top}]$ satisfying $c_{z}I_{P}\preceq\Sigma_{z}\preceq C_{z}I_{P}$ for constants $0<c_{z}\leq C_{z}$ . Training samples satisfy standard non-degeneracy conditions.³³3Specifically, the matrix $A=[2x_{t}^{\top}\;2]_{t=1}^{T}$ has full column rank $T$ , ensuring the geometric properties needed for my convergence analysis. See Appendix B for technical details.

These technical conditions ensure that concentration inequalities apply and that my convergence results hold with high probability. The conditions are mild and satisfied in typical financial applications.⁴⁴4For example, for the KMZ setup with $K=15$ predictors and $T=12$ training windows, these conditions hold almost surely since continuous economic variables generically satisfy the required independence properties.

Assumption 4 (Affine Independence of the Sample).

Let $x_{1},\ldots,x_{T}\in\mathbb{R}^{K}$ with $T\geq 5$ . The $(K+1)\times T$ matrix $A=[2x_{t}^{\top}\;2]_{t=1}^{T}$ has full column rank $T$ (equivalently, the augmented vectors $(x_{t},1)$ are affinely independent).

This assumption enters my analysis through the small-ball probability estimates needed to establish convergence of standardized kernels. The full-rank requirement ensures that the linear change of variables $(\omega,b)\mapsto(2\omega^{\top}x_{t}+2b)_{t=1}^{T}$ is bi-Lipschitz on bounded sets, enabling geometric control that yields exponential small-ball bounds and finiteness of key expectations. In Kelly et al.’s empirical design with $K=15$ predictors and $T=12$ months, the matrix $A$ is $16\times 12$ , and since elements are continuous macroeconomic variables, affine dependence has Lebesgue measure zero, making this assumption mild.

Assumption 5 (Sub-Gaussian RFFs).

For every unit vector $u\in\mathbb{R}^{P}$ , the scalar $u^{\top}z(x)$ is $\kappa$ -sub-Gaussian under $x\sim\mu$ : $\;\mathbb{E}[\exp(t\,u^{\top}z(x))]\leq\exp(\tfrac{1}{2}\kappa^{2}t^{2})$ for all $t\in\mathbb{R}$ .

Assumption 5 requires that linear combinations $u^{\top}z(x)$ of the random Fourier features are sub-Gaussian with parameter $\kappa$ , ensuring $\mathbb{E}[\exp(t\,u^{\top}z(x))]\leq\exp(\frac{1}{2}\kappa^{2}t^{2})$ for all unit vectors $u\in\mathbb{R}^{P}$ and scalars $t$ . This concentration condition is essential for applying uniform convergence results and obtaining non-asymptotic bounds on the empirical feature covariance matrix that appear in our sample complexity analysis. The assumption is standard in high-dimensional learning theory and is automatically satisfied for RFF with bounded support: since $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})\in[-\sqrt{2},\sqrt{2}]$ , each feature is bounded, and linear combinations of bounded random variables are sub-Gaussian with parameter $\kappa=O(\sqrt{P})$ . This ensures that concentration inequalities apply to the feature covariance estimation, enabling our PAC-learning bounds while remaining satisfied in all practical RFF implementations.

2.2 The Theory-Practice Disconnect in Random Fourier Features

The foundation of high-dimensional prediction methods in finance rests on RFF theory, yet a fundamental disconnect exists between theoretical guarantees and practical implementation. Understanding this disconnect is crucial for interpreting what these methods actually accomplish.

2.2.1 Theoretical Guarantees Under Idealized Conditions

The RFF methodology (Rahimi & Recht 2007) provides rigorous theoretical foundations for kernel approximation. For target shift-invariant kernels $k(x,x^{\prime})=k(x-x^{\prime})$ , the theory establishes that:

k_{\text{RFF}}(x,x^{\prime})=\frac{1}{P}\sum_{i=1}^{P}z_{i}(x)z_{i}(x^{\prime}% )\xrightarrow{P\to\infty}k_{G}(x,x^{\prime})=\exp\left(-\frac{\gamma^{2}}{2}\|% x-x^{\prime}\|^{2}\right)

(2.1)

in probability, under the condition that features maintain their original distributional properties. This convergence enables kernel methods to be approximated through linear regression in the RFF space, with all the theoretical guarantees that kernel learning provides.

2.2.2 What Actually Happens in Practice

Every practical RFF implementation deviates from the theoretical setup in a seemingly minor but mathematically crucial way. To improve numerical stability and ensure comparable scales across features, practitioners standardize features using training sample statistics:

\tilde{z}_{i}(x)=\frac{z_{i}(x)}{\hat{\sigma}_{i}},\quad\hat{\sigma}_{i}^{2}=% \frac{1}{T}\sum_{t=1}^{T}z_{i}(x_{t})^{2}

(2.2)

This standardization fundamentally alters the mathematical properties of the method. The standardized empirical kernel becomes:

k_{\text{std}}(x,x^{\prime})=\frac{1}{P}\sum_{i=1}^{P}\frac{z_{i}(x)z_{i}(x^{% \prime})}{\hat{\sigma}_{i}^{2}}

(2.3)

This standardized kernel no longer converges to the Gaussian kernel. Instead, as I prove in Theorem 1, it converges to a training-set dependent limit $k^{*}_{\text{std}}(x,x^{\prime}|\mathcal{T})\neq k_{G}(x,x^{\prime})$ that violates the shift-invariance and stationarity properties required for kernel methods.

3 How Standardization Modifies Kernel Approximation

Having established the theory-practice disconnect in Section 2, I now prove rigorously that standardization fundamentally alters the kernel approximation properties that justify RFF methods. This breakdown explains why high-dimensional methods cannot achieve their claimed theoretical properties and must rely on simpler mechanisms.

3.1 Main Result

Theorem 1 (Modified Convergence of Gaussian-RFF Approximation under Standardization).

Let Assumptions 1–5 hold. For query points $x,x^{\prime}\in\mathbb{R}^{K}$ , define the standardized kernel function:

h(\omega,b)=\frac{2\cos(\omega^{\top}x+b)\cos(\omega^{\top}x^{\prime}+b)}{1+% \frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_{t}+2b)}

where $(\omega,b)\sim\mathcal{N}(0,\gamma^{2}I_{K})\times\text{Uniform}[0,2\pi]$ .

Then:

(a)

For every fixed $x,x^{\prime}\in\mathbb{R}^{K}$ , the standardized kernel estimator converges almost surely:

k_{\text{std}}^{(P)}(x,x^{\prime}):=\frac{1}{P}\sum_{i=1}^{P}h(\omega_{i},b_{i% })\xrightarrow[P\to\infty]{\text{a.s.}}k_{\text{std}}^{*}(x,x^{\prime}):=% \mathbb{E}[h(\omega,b)]

(b)

The limit kernel $k_{\text{std}}^{*}$ depends on the particular training set $\mathcal{T}=\{x_{1},\ldots,x_{T}\}$ , whereas the Gaussian kernel $k_{G}(x,x^{\prime})=\exp(-\frac{\gamma^{2}}{2}\|x-x^{\prime}\|^{2})$ is training-set independent. Consequently, $k_{\text{std}}^{*}\neq k_{G}$ in general.

The proof proceeds in two steps. First, I establish that the standardized kernel function $h(\omega,b)$ has finite expectation despite the random denominator, enabling application of the strong law of large numbers for part (a). This requires controlling the probability that the empirical variance $\hat{\sigma}^{2}$ becomes arbitrarily small, which I achieve through geometric analysis exploiting the full-rank condition. Second, I prove training-set dependence by explicit construction: scaling any training point $x_{j}\mapsto\alpha x_{j}$ with $\alpha>1$ yields different limiting kernels, establishing that $k_{\text{std}}^{*}\neq k_{G}$ . The complete technical proof appears in Appendix A.

3.2 Analysis of the Breakdown

To understand the implications of Theorem 1, I examine precisely how standardization violates the conditions under which RFF theory operates. Rahimi & Recht (2007) prove convergence to the Gaussian kernel under two essential conditions: distributional alignment of frequencies $\omega_{i}$ and phases $b_{i}$ with the target kernel’s Fourier transform, and preservation of the prescribed scaling $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{\top}x+b_{i})$ .

Standardization $\tilde{z}_{i}(x)=z_{i}(x)/\hat{\sigma}_{i}$ systematically violates both conditions. The original features have theoretical properties derived from specified distributions of $\omega_{i}$ and $b_{i}$ , but the standardization factor $1/\hat{\sigma}_{i}$ varies with the training set, altering the effective distribution in a data-dependent manner. The expectation $\mathbb{E}[\tilde{z}_{i}(x)\tilde{z}_{i}(x^{\prime})]$ now depends on $\hat{\sigma}_{i}$ , disrupting the direct mapping to $k_{G}(x,x^{\prime})$ . Additionally, the fixed scaling $\sqrt{2}$ that ensures correct kernel approximation is replaced by a random, sample-dependent factor, breaking the fundamental relationship between feature products and kernel values.

These modifications have important mathematical implications. The standardized features yield an empirical kernel that converges to $k_{\text{std}}^{*}(x,x^{\prime}|\mathcal{T})$ , which is training-set dependent rather than depending only on $\|x-x^{\prime}\|$ like the Gaussian kernel. The resulting kernel is not shift-invariant since $\hat{\sigma}_{i}$ reflects absolute positions of training points, and shifting the data changes $\hat{\sigma}_{i}$ . This creates temporal non-stationarity as kernel properties change when training windows roll forward.

3.3 Implications for Financial Applications

Theorem 1 resolves the fundamental puzzles in high-dimensional financial prediction by revealing that claimed theoretical properties simply do not hold in practice. KMZ develop their theoretical analysis assuming RFF converge to Gaussian kernels. Their random matrix theory characterization, effective complexity bounds, and optimal shrinkage formula all depend critically on this convergence. However, their empirical implementation employs standardization, which fundamentally alters the convergence properties, creating a notable difference between theory and practice.

With modified kernel structure, methods may perform learning that differs from the theoretical framework, potentially involving pattern-matching mechanisms based on training-sample dependent similarity measures. The standardized kernel creates similarity measures based on training-sample dependent weights rather than genuine predictor relationships. This explains Nagel (2025) empirical finding that high-complexity methods produce volatility-timed momentum strategies regardless of underlying data properties. The broken kernel structure makes the theoretically predicted learning more challenging, leading methods to weight returns based on alternative similarity measures within the training window.

The apparent virtue of complexity may arise through different mechanisms than originally theorized. Their method cannot achieve its theoretical properties due to standardization, so any success must arise through alternative mechanisms. This resolves the central puzzle of how methods claiming to harness thousands of parameters succeed with tiny training samples: they may operate through mechanisms that differ from the high-dimensional framework, potentially involving simpler pattern-matching approaches that happen to work in specific market conditions.

4 Fundamental Barriers to High-Dimensional Learning

The kernel approximation breakdown in Section 3 reveals that methods cannot achieve their claimed theoretical properties. This section establishes that even if this breakdown were corrected, fundamental information-theoretic barriers would still prevent genuine high-dimensional learning in financial applications. These results explain why methods must rely on the mechanical pattern matching that emerges from broken kernel structures.

4.1 Sample Complexity Lower Bounds

I establish fundamental limits on learning over the high-dimensional function spaces that methods claim to harness, requiring an additional regularity condition for my convergence analysis.

Theorem 2 (Exponential lower bound, random design).

Assume the data–generation scheme of Assumptions 1, 2, and 3, and suppose additionally that the RFF are $\kappa$ -sub-Gaussian: for every unit $u\in\mathbb{R}^{P}$ , $\mathbb{E}\!\bigl{[}e^{t\,u^{\!\top}z(x)}\bigr{]}\leq\exp\!\bigl{(}\tfrac{1}{2% }\kappa^{2}t^{2}\bigr{)},\;t\in\mathbb{R}$ (Assumption 5).

Let $\mathcal{F}_{P}=\{x\mapsto w^{\!\top}z(x):\|w\|_{2}\leq B\},$ and denote by $\sigma^{2}$ the noise variance. Then, for every $T,P\geq 1$ ,

\inf_{\hat{f}_{T}}\;\sup_{\|w\|_{2}\leq B}\;\mathbb{E}_{x,\mathcal{D}_{T},% \epsilon}\bigl{[}(\hat{f}_{T}(x)-w^{\!\top}z(x))^{2}\bigr{]}\;\;\geq\;\;c\,B^{% 2}\exp\!\Bigl{(}-\frac{8\,T\,C_{z}\,B^{2}}{P\,\sigma^{2}}\Bigr{)},

(4.1)

for a universal constant $c=c(c_{z},C_{z})>0$ .

Moreover, there is a constant $C_{0}=C_{0}(\kappa,c_{z},C_{z})$ such that whenever $T\geq C_{0}P$ ,

\mathbb{P}_{Z}\!\Bigl{[}\forall\,\hat{f}_{T}\;:\;\sup_{\|w\|_{2}\leq B}\mathbb% {E}_{x,\epsilon}(\hat{f}_{T}(x)-w^{\!\top}z(x))^{2}\;\geq\;c^{\star}B^{2}\exp% \!\Bigl{(}-\frac{8\,T\,C_{z}\,B^{2}}{P\,\sigma^{2}}\Bigr{)}\Bigr{]}\;\geq\;1-e% ^{-T},

(4.2)

with $c^{\star}=c^{\star}(c_{z},C_{z})>0$ .

The proof uses a minimax argument with Fano’s inequality. I construct a $2\delta$ -packing $\{w_{1},\ldots,w_{M}\}\subset B_{2}^{P}(B)$ with $M=(B/(2\delta))^{P}$ well-separated parameters. The Kullback-Leibler (KL) divergence between corresponding data distributions satisfies $\mathrm{KL}(P_{j}\|P_{\ell})\leq\frac{2TC_{z}B^{2}}{\sigma^{2}}$ . Fano’s inequality then implies any decoder has error probability $\Pr[\hat{J}\neq J]\geq 1/2$ . Since low estimation risk would enable perfect identification (contradicting Fano), I obtain $\mathbb{E}[(\hat{f}_{T}(x)-f_{J}(x))^{2}]\geq c_{z}\delta^{2}$ . Optimizing $\delta$ yields the exponential bound. The high-probability version conditions on well-conditioned designs using matrix concentration.

Theorems 2 applies directly to machine learning methods employing RFF as actually implemented in practice. The theoretical framework covers the complete practical pipeline where random feature weights $\{\omega_{i},b_{i}\}_{i=1}^{P}$ are drawn from specified distributions, standardization procedures are applied for numerical stability, and learning proceeds over the resulting linear-in-features function class $\mathcal{F}_{P}=\{x\mapsto w^{T}z(x):\|w\|_{2}\leq B\}$ using any estimation method including OLS, ridge regression, LASSO, or ridgeless regression.

The bounds establish information-theoretic impossibility in two complementary forms: the expectation bound averaged over all possible feature realizations, and the high-probability bound showing that the same limitations hold for most individual feature draws. KMZ follows precisely this framework with $P=12{,}000$ features and $T=12$ training observations, making both versions directly applicable to their empirical analysis. The universal constant $c=c(c_{z},C_{z},\kappa)$ depends on the feature covariance bounds from the regularity conditions and the sub-Gaussian parameter controlling concentration properties, but remains bounded away from zero under standard assumptions for financial applications.

Theorem 3 (Polynomial minimax lower bound – high probability).

Assume Assumptions 1, 2, 3, and the sub-Gaussian feature condition Assumption 5. Put

\tilde{c}\;:=\;\frac{c_{z}}{64\,C_{z}}\;>0,\qquad\mathcal{F}_{P}:=\{\,x\mapsto w% ^{\!\top}z(x):\;\|w\|_{2}\leq B\}.

(a)

In-expectation bound. For every $T,P\geq 1$

\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}_{x,\mathcal{D}_{T},\epsilon% }\bigl{[}(\hat{f}_{T}(x)-w^{\!\top}z(x))^{2}\bigr{]}\;\;\geq\;\;\tilde{c}\,% \min\!\Bigl{\{}B^{2},\;\tfrac{\sigma^{2}}{T}\log P\Bigr{\}}.

(b)

High-probability bound. There exists a constant $C_{0}=C_{0}(\kappa,C_{z})$ such that whenever $T\geq C_{0}P$ ,

\mathbb{P}_{Z}\!\Bigl{[}\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}_{x,% \epsilon}\bigl{[}(\hat{f}_{T}(x)-w^{\!\top}z(x))^{2}\mid Z\bigr{]}\;<\;\tilde{% c}\,\min\!\Bigl{\{}B^{2},\;\tfrac{\sigma^{2}}{T}\log P\Bigr{\}}\Bigr{]}\;\leq% \;e^{-T}.

(4.3)

Thus the same lower bound holds for each realised design matrix with probability at least $1-e^{-T}$ .

The proof uses a standard basis packing with refined concentration analysis. I construct $M=P+1$ functions using the canonical basis: $w_{0}=0$ and $w_{j}=\delta e_{j}$ for $j=1,\ldots,P$ , where $\delta=\min\{B/4,\sigma/(4\sqrt{TC_{z}\log P})\}$ . The population covariance $\Sigma_{z}\succeq c_{z}I_{P}$ ensures separation $\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}\geq 2c_{z}\delta^{2}$ . For the KL bound, $\mathbb{E}[\mathrm{KL}(P_{j}\|P_{\ell})]\leq\frac{2TC_{z}\delta^{2}}{\sigma^{2% }}\leq\frac{\log P}{8}$ , enabling Fano’s inequality with error probability $\geq 1/2$ . The risk-identification argument yields $\mathbb{E}[(\hat{f}_{T}(x)-f_{J}(x))^{2}]\geq\frac{c_{z}}{4}\delta^{2}$ , giving the polynomial bound. For part (b), I condition on the ”good design” event $\{\lambda_{\max}(T^{-1}Z^{\top}Z)\leq 2C_{z}\}$ with probability $\geq 1-e^{-T}$ when $T\geq C_{0}P$ , then apply the same argument with adjusted constants.

The universal constants in my polynomial lower bounds have transparent structure that illuminates the fundamental barriers to high-dimensional learning in finance. The expectation bound employs $\tilde{c}=c_{z}/(64C_{z})$ , which depends on the feature covariance bounds and measures the quality of the feature construction. For RFF with the standard construction $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{T}x+b_{i})$ where $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I_{K})$ , the theoretical expectation $\mathbb{E}[z_{i}(x)^{2}]=1$ suggests $C_{z}\approx 1$ under ideal conditions, yielding $\tilde{c}\approx 1/64\approx 0.016$ in the best case. Even when the conditioning number $\kappa=C_{z}/c_{z}$ increases to moderate values around 10, I obtain $\tilde{c}\approx 0.0016$ , representing manageable degradation in the constant.

The high-probability bound introduces additional dependence on the sub-Gaussian parameter $\kappa$ through both the concentration quality and the threshold requirement $T\geq C_{1}(\kappa,c_{z},C_{z})\log P$ for the probabilistic guarantee to hold. Throughout my analysis, I employ the conservative choice $\tilde{c}=0.01$ , deliberately favoring the possibility of learning by using constants that are optimistic relative to typical financial applications. This conservative approach strengthens my impossibility conclusions: even when I bias the analysis toward finding that learning should be possible, the fundamental barriers persist. The comparable magnitude of constants between fixed and random feature settings confirms that these information-theoretic limitations are robust to implementation details and reflect inherent properties of high-dimensional learning with limited financial data.

4.2 Effective Complexity: The VC Dimension Reality

The Vapnik-Chervonenkis (VC) dimension provides a fundamental measure of model complexity that directly connects to generalization performance and sample complexity requirements (Vapnik & Chervonenkis 1971, Vapnik 1998). For a hypothesis class $\mathcal{H}$ , the VC dimension is the largest number of points that can be shattered (i.e., correctly classified under all possible binary labelings) by functions in $\mathcal{H}$ . This combinatorial measure captures the essential complexity of a learning problem: classes with higher VC dimension require more samples to achieve reliable generalization.

The connection between VC dimension and sample complexity is formalized through uniform convergence bounds. Classical results show that for a hypothesis class with VC dimension $d$ , achieving generalization error $\varepsilon$ with confidence $1-\delta$ requires sample size $T=O(d\log(1/\varepsilon)/\varepsilon+\log(1/\delta)/\varepsilon)$ (Blumer et al. 1989, Shalev-Shwartz & Ben-David 2014). This relationship reveals why effective model complexity, rather than nominal parameter count, determines learning difficulty.

In the context of high-dimensional financial prediction, VC dimension analysis becomes crucial for understanding what machine learning methods actually accomplish. While methods may claim to leverage thousands of parameters, their effective complexity—as measured by VC dimension—may be much lower due to structural constraints imposed by the optimization procedure. Ridgeless regression in the overparameterized regime ( $P>T$ ) provides a particularly important case study, as the interpolation constraint fundamentally limits the achievable function class regardless of the ambient parameter dimension.

Theorem 4 (Effective VC Dimension of Ridgeless RFF Regression).

Let $z:\mathcal{X}\to\mathbb{R}^{P}$ be a fixed feature map (e.g. standardized RFF) and define the linear function class

\mathcal{F}_{P}\;=\;\Bigl{\{}\,f_{w}(x)=w^{\top}z(x)\;:\;\|w\|_{2}\leq B\Bigr{% \}},\qquad B>0.

Fix a training sample $(x_{1},\dots,x_{T})$ with $T<P$ and denote $Z=[\,z(x_{1})\;\cdots\;z(x_{T})]^{\top}\!\in\mathbb{R}^{T\times P}$ . Write $k_{i}(x)=z(x_{i})^{\top}z(x)$ and $k(x)=(k_{1}(x),\dots,k_{T}(x))^{\top}$ . The corresponding ridgeless (minimum-norm) regression functions are

\mathcal{F}_{\textnormal{ridge}}^{(Z)}\;=\;\bigl{\{}\,f_{\alpha}(x)=\alpha^{% \top}k(x)\;:\;\alpha\in\mathbb{R}^{T}\bigr{\}}.

Let $r=\operatorname{rank}(ZZ^{\top})\leq T$ . Then

(a)

$\mathrm{VC}\!\bigl{(}\{\operatorname{sign}(f)\,:\,f\in\mathcal{F}_{P}\}\bigr{)% }=P$ .
(b)

$\mathrm{VC}\!\bigl{(}\{\operatorname{sign}(f)\,:\,f\in\mathcal{F}_{\textnormal% {ridge}}^{(Z)}\}\bigr{)}=r\leq T$ . In particular, if $ZZ^{\top}$ is invertible (full row rank), the VC dimension equals $T$ .

KMZ correctly note that, after minimum–norm fitting, the effective degrees of freedom of their RFF model equal the sample size ( $T=12$ ), not the nominal dimension ( $P=12{,}000$ ): “the effective number of parameters in the construction of the predicted return is only $T=12$ …”. Theorem 4 rigorously justifies this statement by showing that the VC dimension of ridgeless RFF regression is bounded above by $T$ .

This observation, however, leaves open the central question that KMZ label the “virtue of complexity”: does the enormous RFF dictionary contribute predictive information beyond what a $T$ -dimensional linear model could extract? In kernel learning the tension is familiar: one combines an extremely rich representation (in principle, infinite–dimensional) with an estimator whose statistical capacity is implicitly capped at $T$ . Overfitting risk is therefore limited, but any real performance gain must come from the non-linear basis supplied by the features rather than from high effective complexity per se.

4.3 Sharp Learning Thresholds

The previous bounds establish that learning is difficult, but do not precisely characterize the boundary between feasible and infeasible regimes. I derive sharp thresholds that separate learnable from unlearnable scenarios.

Understanding such thresholds is crucial for financial applications where practitioners must decide whether available data is sufficient for reliable prediction. While my previous bounds show that learning is difficult, they do not precisely characterize the boundary between feasible and infeasible learning regimes. The following analysis addresses this gap by establishing sharp learning thresholds that depend on the signal-to-noise ratio, feature dimension, and sample size.

Definition 1 (Learning Threshold).

For target prediction error $\varepsilon>0$ , define the learning threshold as:

\text{SNR}_{\text{threshold}}(\varepsilon):=\frac{\tilde{c}^{-1}\log P}{T}% \cdot\frac{\varepsilon}{B^{2}}

where $\text{SNR}=B^{2}/\sigma^{2}$ is the signal-to-noise ratio and $\tilde{c}$ is the universal constant from Theorem 3.

The polynomial learning threshold reveals why my characterization provides actionable guidance where cruder exponential bounds would not. Unlike exponential characterizations that scale catastrophically with feature dimension $P$ , my threshold scales as $\frac{\log P}{T}$ —a fundamental difference that enables meaningful evaluation with realistic parameters.

This distinction proves crucial for understanding high-dimensional financial prediction. The threshold exhibits intuitive monotonicity properties: easier targets (larger $\varepsilon$ ) require weaker signals, while higher complexity relative to sample size (larger $P/T$ ) demands stronger signals. More importantly, the explicit dependence on sample size $T$ shows precisely how additional observations reduce required signal strength, revealing that sample complexity alone does not determine learning difficulty.

The practical significance becomes clear when evaluating typical financial applications. For the parameters employed by KMZ— $P=12{,}000$ features, $T=12$ observations, targeting $\varepsilon=0.01$ accuracy—my threshold requires signal-to-noise ratios exceeding $0.78$ , compared to observed financial signal strengths of $0.01$ – $0.05$ . This gap of nearly two orders of magnitude places such applications decisively outside the learnable regime, providing theoretical validation that apparent success must arise through mechanisms other than genuine high-dimensional learning.

The sharp nature of this transition explains why high-dimensional methods may appear to succeed or fail unpredictably: small changes in problem parameters can move applications across the fundamental boundary between learnable and unlearnable regimes.

Theorem 5 (Sharp Learning Threshold for RFF-based Predictors).

Consider the RFF prediction problem under Assumptions 1–3, with target prediction error $\varepsilon>0$ and universal constant $\tilde{c}=\frac{c_{z}}{64C_{z}}$ from Theorem 3. Then there exists a sharp phase transition characterized by the complexity-to-sample ratio $\frac{\log P}{T}$ :

(a)

Phase I (Impossible Learning): If

\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}\quad\text{and}% \quad B^{2}\geq\frac{\varepsilon}{\tilde{c}},

then learning with error $<\varepsilon$ is impossible.

(b)

Phase II (Possible Learning): If

\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}\quad\text{and}\quad B% ^{2}\geq\frac{\varepsilon}{\tilde{c}},

then learning with error $<\varepsilon$ becomes information-theoretically feasible with sufficiently sophisticated estimators.

(c)

Trivial Regime: If $B^{2}<\frac{\varepsilon}{\tilde{c}}$ , then the function class is too simple relative to the target accuracy, and standard parametric rates apply.

Corollary 1 (Weak Signal Learning Impossibility).

Under the weak signal assumption $\text{SNR}=O(K^{-\alpha})$ with $P=O(K^{\beta})$ , there exists a critical sample size threshold $T_{\text{critical}}(K)=\Omega(\frac{\sigma^{2}\beta\log K}{\varepsilon})$ such that learning is impossible when $T<T_{\text{critical}}(K)$ .

Applying these thresholds to Kelly et al.’s reported performance reveals the impossibility of their claimed mechanism. Their high-complexity model achieves $R^{2}=0.6\%$ (corresponding to $\varepsilon=0.994$ ) with parameters $P=12,000$ , $T=12$ .

The complexity-to-sample ratio $\frac{\log(12,000)}{12}\approx 0.78$ appears manageable, but the signal strength requirement $B^{2}\geq\frac{0.994}{0.01}=99.4$ demands that predictive signals explain at least 9,940

This analysis confirms that their empirical success cannot arise from genuine learning over the claimed high-dimensional function space, providing theoretical validation for the mechanical pattern matching explanation.

These results resolve the central puzzle by showing that apparent ’virtue of complexity’ may reflect mechanisms that differ from both the predicted high-dimensional learning (information-theoretically impossible) and the theoretical properties (which are modified by standardization). Instead, methods achieve success through mechanical pattern matching that emerges when kernel approximation fails.

The standardization procedure ensures methods accidentally implement volatility-timed momentum strategies operating in low-dimensional spaces bounded by sample size. This transforms evaluation from ”how can complex methods work with limited data?” to ”how can we distinguish mechanical artifacts from genuine learning?”

The following section provides empirical validation of these theoretical predictions, demonstrating the kernel breakdown and learning impossibility in practice.

5 Empirical Validation of Kernel Approximation Breakdown

This section provides comprehensive empirical validation of Theorem 1 through systematic parameter exploration across the entire space of practical RFF implementations. My experimental design spans realistic financial prediction scenarios, testing whether standardization preserves the Gaussian kernel approximation properties that underlie existing theoretical frameworks. The results provide definitive evidence that standardization fundamentally breaks RFF convergence properties, confirming that methods cannot achieve their claimed theoretical guarantees in practice.

5.1 Data Generation and Model Parameters

I generate realistic financial predictor data following the autoregressive structure typical of macroeconomic variables used in return prediction. For each parameter combination $(T,K)$ , I construct predictor matrices $X\in\mathbb{R}^{T\times K}$ where:

\displaystyle X_{t}

\displaystyle=\Phi X_{t-1}+u_{t},\quad u_{t}\sim\mathcal{N}(0,\Sigma_{u})

(5.1)

The persistence parameters $\Phi=\text{diag}(\phi_{1},\ldots,\phi_{K})$ are drawn from the range $[0.82,0.98]$ to match the high persistence of dividend yields, interest rates, and other financial predictors (Welch & Goyal 2008). The correlation structure $\Sigma_{u}=\rho\mathbf{1}\mathbf{1}^{T}+(1-\rho)I_{K}$ with $\rho=0.1$ captures modest cross-correlation among predictors.

Random Fourier Features are constructed as $z_{i}(x)=\sqrt{2}\cos(\omega_{i}^{T}x+b_{i})$ where $\omega_{i}\sim\mathcal{N}(0,\gamma^{2}I_{K})$ and $b_{i}\sim\text{Uniform}[0,2\pi]$ . Standardization is applied as $\tilde{z}_{i}(x)=z_{i}(x)/\hat{\sigma}_{i}$ where $\hat{\sigma}_{i}^{2}=T^{-1}\sum_{t=1}^{T}z_{i}(x_{t})^{2}$ following universal practice in RFF implementations.

My parameter exploration covers the comprehensive space:

•

Number of features: $P\in\{100,500,1000,2500,5000,10000,15000,20000\}$
•

Training window: $T\in\{6,12,24,60\}$ months
•

Kernel bandwidth: $\gamma\in\{0.5,1.0,1.5,2.0,2.5,3.0\}$
•

Input dimension: $K\in\{5,10,15,20,30\}$ .

5.2 Experimental Goal

The primary objective is to test whether standardization preserves the convergence $k^{(P)}_{\text{std}}(x,x^{\prime})\xrightarrow{P\to\infty}k_{G}(x,x^{\prime})$ established in Rahimi & Recht (2007). Under the null hypothesis that standardization has no effect, both standard and standardized RFF should exhibit identical convergence properties and error distributions. Theorem 1 predicts systematic breakdown with training-set dependent limits $k^{*}_{\text{std}}(x,x^{\prime}|\mathcal{T})\neq k_{G}(x,x^{\prime})$ .

I conduct 1,000 independent trials per parameter combination, generating fresh training data, RFF weights, and query points for each trial. This provides robust statistical power to detect systematic effects across the parameter space while controlling for random variations in specific realizations.

5.3 Comparison Metrics

My empirical analysis employs four complementary approaches to characterize the extent and nature of kernel approximation breakdown. I begin by examining convergence properties through mean absolute error $|k^{(P)}(x,x^{\prime})-k_{G}(x,x^{\prime})|$ between empirical and true Gaussian kernels, tracking how approximation quality evolves as $P\to\infty$ . This directly tests whether standardized features preserve the fundamental convergence properties established in Rahimi & Recht (2007).

To quantify the systematic nature of performance deterioration, I construct degradation factors as the ratio $\mathbb{E}[|\text{error}_{\text{standardized}}|]/\mathbb{E}[|\text{error}_{% \text{standard}}|]$ across matched parameter combinations. Values exceeding unity indicate that standardization worsens kernel approximation, while larger ratios represent more severe breakdown. This metric provides a scale-invariant measure of standardization effects that facilitates comparison across different parameter regimes.

Statistical significance is assessed through Kolmogorov-Smirnov two-sample tests comparing error distributions between standard and standardized RFF implementations. Under the null hypothesis that standardization preserves distributional properties, these tests should yield non-significant results. Systematic rejection of this null across parameter combinations provides evidence that standardization fundamentally alters the mathematical behavior of RFF methods beyond what could arise from random variation.

Finally, I conduct comprehensive parameter sensitivity analysis to identify the conditions under which breakdown effects are most pronounced. Heatmap visualizations reveal how degradation severity depends on $(P,T,\gamma,K)$ combinations, enabling us to characterize the parameter regimes where theoretical guarantees are most severely compromised. This analysis is particularly relevant for understanding the implications for existing empirical studies that employ specific parameter configurations.

5.4 Results and Validation of Theorem 1

5.4.1 Universal Convergence Failure

Figure 1 provides decisive evidence of convergence breakdown. Standard RFF (blue circles) exhibit the theoretically predicted $P^{-1/2}$ convergence rate, with mean absolute error declining from $\approx 0.06$ at $P=100$ to $\approx 0.003$ at $P=20,000$ . This confirms that unstandardized features preserve Gaussian kernel approximation properties.

In stark contrast, standardized RFF (red squares) completely fail to converge, plateauing around $0.02$ - $0.03$ mean error regardless of $P$ . For large $P$ , standardized features are $\mathbf{6\times}$ worse than standard RFF, demonstrating that additional features provide no approximation benefit when standardization is applied. This plateau behavior directly validates Theorem LABEL:thm:standardization_breakdown’s prediction that standardized features converge to training-set dependent limits rather than the target Gaussian kernel.

5.4.2 Systematic Degradation Across Parameter Space

Figure 2 reveals that breakdown occurs universally across all parameter combinations, with no regime where standardization preserves kernel properties. The degradation patterns exhibit clear economic intuition and align closely with the theoretical mechanisms underlying Theorem 1.

The most pronounced effects emerge along the feature dimension, where degradation increases dramatically with $P$ , ranging from 1.2 times at $P=100$ to 6.0 times at $P=20,000$ . This escalating pattern reflects the cumulative nature of standardization artifacts: as more features undergo within-sample standardization, the collective distortion of kernel approximation properties intensifies. Each additional standardized feature contributes random scaling factors that compound to produce increasingly severe departures from the target Gaussian kernel.

Sample size effects provide particularly compelling evidence for the breakdown mechanism. Smaller training windows exhibit severe degradation, reaching 41.6 times deterioration for $T=6$ months. This extreme sensitivity to sample size occurs because standardization relies on empirical variance estimates $\hat{\sigma}_{i}^{2}$ that become increasingly unreliable with limited data. When training windows shrink to the 6-12 month range typical in financial applications, these variance estimates introduce substantial noise that fundamentally alters the scaling relationships required for kernel convergence. The magnitude of this effect—exceeding 40 times degradation in realistic scenarios—demonstrates that standardization can completely overwhelm any approximation benefits from additional features.

Kernel bandwidth parameters reveal additional structure in the breakdown pattern. Low bandwidth values ( $\gamma=0.5$ ) produce 12.8 times degradation, while higher bandwidths stabilize around 3.1 times deterioration. This occurs because tighter kernels, which decay more rapidly with distance, are inherently more sensitive to the scaling perturbations introduced by standardization. Small changes in feature magnitudes translate into disproportionately large changes in kernel values when the bandwidth is narrow, amplifying the distortions created by training-set dependent scaling factors.

In contrast, input dimension effects remain remarkably stable, with degradation ranging only between 3.1 and 4.6 times across $K\in[5,30]$ . This stability confirms that breakdown stems primarily from the standardization procedure itself rather than the complexity of the underlying input space. Whether using 5 or 30 predictor variables, the fundamental mathematical properties of standardized RFF remain equally compromised, suggesting that the kernel approximation failure is intrinsic to the standardization mechanism rather than an artifact of high-dimensional inputs.

5.4.3 Parameter Sensitivity Analysis

Figure 3 provides detailed parameter sensitivity analysis through degradation factor heatmaps. The $(P,T)$ interaction reveals that combinations typical in financial applications—such as $P\geq 5,000$ features with $T\leq 12$ months—produce degradation factors exceeding $3\times$ . This directly impacts methods like Kelly et al. (2024) using $P=12,000$ and $T=12$ .

The $(P,\gamma)$ interaction shows that standardization effects compound: high complexity ( $P\geq 10,000$ ) combined with tight kernels ( $\gamma\leq 1.0$ ) yields degradation exceeding $10\times$ . These parameter ranges are commonly employed in high-dimensional return prediction, suggesting widespread applicability of my breakdown results.

5.4.4 Statistical Significance

The error distributions between standard and standardized RFF are fundamentally different across the entire parameter space, providing strong statistical evidence against the null hypothesis that standardization preserves kernel approximation properties. Figure 4 presents Kolmogorov-Smirnov test statistics that consistently exceed 0.5 across most parameter combinations, with many approaching the theoretical maximum of 1.0. Such large test statistics indicate that the cumulative distribution functions of standard and standardized RFF errors diverge substantially, ruling out the possibility that observed differences arise from sampling variation.

The statistical evidence is most compelling in parameter regimes commonly employed in financial applications. For high feature counts ( $P\geq 5,000$ ), KS statistics approach 0.9, while short training windows ( $T\leq 12$ ) yield statistics near 1.0. These values correspond to p-values that are effectively zero, providing overwhelming evidence to reject the null hypothesis of distributional equivalence. The magnitude of these test statistics exceeds typical significance thresholds by orders of magnitude, establishing statistical significance that is both robust and economically meaningful.

The systematic pattern of large KS statistics across parameter combinations demonstrates that the breakdown identified in Theorem 1 is not confined to specific implementation choices or edge cases. Instead, the distributional differences persist universally across realistic parameter ranges, indicating that standardization fundamentally alters the stochastic properties of RFF approximation errors. This statistical evidence complements the degradation factor analysis by confirming that the observed differences represent genuine distributional shifts rather than changes in central tendency alone.

These results establish that standardization creates systematic, statistically significant alterations to RFF behavior that cannot be attributed to random variation, specific parameter selections, or implementation artifacts. The universality and magnitude of the statistical evidence provide definitive support for the conclusion that practical RFF implementations cannot achieve the theoretical kernel approximation properties that justify their use in high-dimensional prediction problems.

5.4.5 Alternative Kernel Convergence

Figure 5 provides empirical validation of Theorem 1’s central prediction that within-sample standardization fundamentally alters Random Fourier Features convergence properties. The analysis compares three distinct convergence behaviors across varying feature dimensions $P\in[100,500,1000,2500,5000,12000]$ :

The blue line demonstrates that standard (non-standardized) RFF achieve the theoretical convergence rate $P^{-1/2}$ to the Gaussian kernel $k_{G}(x,x^{\prime})=\exp(-\gamma^{2}\|x-x^{\prime}\|^{2}/2)$ , validating the foundational result of Rahimi & Recht (2007). The convergence follows the expected Monte Carlo rate, with mean absolute error decreasing from approximately $0.06$ at $P=100$ to $0.005$ at $P=12{,}000$ .

The red line reveals the fundamental breakdown predicted by Theorem 1: standardized RFF fail to converge to the Gaussian kernel, instead exhibiting slower convergence with substantially higher errors. At $P=12{,}000$ , the error remains above $0.02$ —four times larger than the standard case—demonstrating that standardization prevents achievement of the theoretical guarantees.

Most importantly, the green line confirms Theorem 1’s constructive prediction by showing that standardized RFF do converge to the modified limit $k^{*}_{\text{std}}(x,x^{\prime}|T)$ . This convergence exhibits the canonical $P^{-1/2}$ rate, reaching error levels below $0.015$ at $P=12{,}000$ , thereby validating my theoretical characterization of the standardized limit.

My empirical validation employs the sample standard deviation standardization actually used in practice:

	$\displaystyle\hat{\sigma}^{2}_{i}$	$\displaystyle=\frac{1}{T}\sum_{t=1}^{T}z_{i}^{2}(x_{t})-\left[\frac{1}{T}\sum_% {t=1}^{T}z_{i}(x_{t})\right]^{2}$		(5.2)
	$\displaystyle\tilde{z}_{i}(x)$	$\displaystyle=\frac{z_{i}(x)}{\hat{\sigma}_{i}}$		(5.3)

rather than the simpler RMS normalization $\hat{\sigma}^{2}_{i}=\frac{1}{T}\sum_{t=1}^{T}z_{i}^{2}(x_{t})$ that might be assumed theoretically. This distinction strengthens rather than weakens my validation for two crucial reasons.

First, Theorem 1’s fundamental insight—that any reasonable standardization procedure breaks Gaussian kernel convergence and creates training-set dependence—remains intact regardless of the specific standardization formula. The theorem establishes that standardized features converge to some training-set dependent limit $k^{*}_{\text{std}}\neq k_{G}$ , with the exact form depending on implementation details.

Second, testing against the actual standardization procedure used in practical implementation ensures that my theoretical predictions match real-world behavior. The fact that standardized RFF converge to the correctly computed $k^{*}_{\text{std}}$ rather than to $k_{G}$ provides the strongest possible validation: my theory successfully predicts the behavior of methods as actually implemented, not merely as idealized.

The convergence patterns thus confirm all key predictions of Theorem 1: standardization breaks the foundational convergence guarantee of RFF theory, creates training-set dependent kernels that violate shift-invariance, and produces systematic errors that persist even with large feature counts. These findings validate my theoretical framework while highlighting the critical importance of analyzing methods as actually implemented rather than as theoretically idealized.

5.4.6 Implications for Existing Theory

My results provide definitive empirical validation of Theorem 1 across the entire parameter space relevant for financial applications. The universal nature of degradation—ranging from modest $1.2\times$ effects to extreme $40\times$ breakdown—demonstrates that standardization fundamentally alters RFF convergence properties regardless of implementation details.

Notably, parameter combinations employed by leading studies exhibit substantial degradation: Kelly et al. (2024)’s configuration ( $P=12,000$ , $T=12$ , $\gamma=2.0$ ) falls in the $3$ - $6\times$ degradation range, while more extreme combinations approach $10\times$ or higher degradation. This suggests that empirical successes documented in the literature cannot arise from the theoretical kernel learning mechanisms that justify these methods.

The systematic nature of these effects, combined with their large magnitudes, supports the conclusion that alternative explanations—such as the mechanical pattern matching identified by Nagel (2025)—are required to understand why high-dimensional RFF methods achieve predictive success despite fundamental theoretical breakdown.

6 Conclusion

This paper resolves fundamental puzzles in high-dimensional financial prediction by providing rigorous theoretical foundations that explain when and why complex machine learning methods succeed or fail. My analysis contributes three key results that together clarify the apparent contradictions between theoretical claims and empirical mechanisms in recent literature.

First, I prove that within-sample standardization—employed in every practical Random Fourier Features implementation—fundamentally breaks the kernel approximation that underlies existing theoretical frameworks. This breakdown explains why methods operate under different conditions than theoretical assumptions and must rely on simpler mechanisms than advertised.

Second, I establish sharp sample complexity bounds showing that reliable extraction of weak financial signals requires sample sizes and signal strengths far exceeding those available in typical applications. These information-theoretic limits demonstrate that apparent high-dimensional learning often reflects mechanical pattern matching rather than genuine complexity benefits.

Third, I derive precise learning thresholds that characterize the boundary between learnable and unlearnable regimes, providing practitioners with concrete tools for evaluating when available data suffices for reliable prediction versus when apparent success arises through statistical artifacts.

These results explain why methods claiming sophisticated high-dimensional learning often succeed through simple volatility-timed momentum strategies operating in low-dimensional spaces bounded by sample size. Rather than discouraging complex methods, my findings provide a framework for distinguishing genuine learning from mechanical artifacts and understanding what such methods actually accomplish.

The theoretical insights extend beyond the specific methods analyzed, offering guidance for evaluating any high-dimensional approach in challenging prediction environments. As machine learning continues to transform finance, rigorous theoretical understanding remains essential for distinguishing genuine advances from statistical mirages and enabling more effective application of these powerful but often misunderstood techniques.

References

(1)
Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G. & Tsigler, A. (2020), ‘Benign overfitting in linear regression’, Proceedings of the National Academy of Sciences 117(48), 30063–30070.
Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019), ‘Reconciling modern machine‐learning practice and the bias–variance trade‐off’, Proceedings of the National Academy of Sciences 116(32), 15849–15854.
Bianchi et al. (2021) Bianchi, D., Büchner, M. & Tamoni, A. (2021), ‘Bond risk premiums with machine learning’, Review of Financial Studies 34(2), 1046–1089.
Blumer et al. (1989) Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989), ‘Learnability and the vapnik-chervonenkis dimension’, Journal of the ACM 36(4), 929–965. Key paper connecting VC dimension to PAC learnability.
Chen et al. (2024) Chen, L., Pelger, M. & Zhu, J. (2024), ‘Deep learning in asset pricing’, Management Science 70(2), 714–750.
Feng et al. (2020) Feng, G., Giglio, S. & Xiu, D. (2020), ‘Taming the factor zoo: A test of new factors’, Journal of Finance 75(3), 1327–1370.
Gu et al. (2020) Gu, S., Kelly, B. & Xiu, D. (2020), ‘Empirical asset pricing via machine learning’, Review of Financial Studies 33(5), 2223–2273.
Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S. & Tibshirani, R. J. (2022), ‘Surprises in high-dimensional ridgeless least squares interpolation’, Annals of Statistics 50(2), 949–986.
Hastie et al. (2009) Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 edn, Springer, New York.
Kearns & Vazirani (1994) Kearns, M. J. & Vazirani, U. V. (1994), An Introduction to Computational Learning Theory, MIT Press.
Kelly et al. (2024) Kelly, B., Malamud, S. & Zhou, K. (2024), ‘The virtue of complexity in return prediction’, Journal of Finance 79(1), 459–503.
Mei & Montanari (2022) Mei, S. & Montanari, A. (2022), ‘The generalization error of random features regression: Precise asymptotics and the double descent curve’, Communications on Pure and Applied Mathematics 75(4), 667–766.
Nagel (2025) Nagel, S. (2025), ‘Seemingly virtuous complexity in return prediction’, Working paper .
Rahimi & Recht (2007) Rahimi, A. & Recht, B. (2007), Random features for large-scale kernel machines, in ‘Advances in Neural Information Processing Systems’, Vol. 20, pp. 1177–1184.
Rudi & Rosasco (2017) Rudi, A. & Rosasco, L. (2017), Generalization properties of learning with random features, in ‘Advances in Neural Information Processing Systems’, Vol. 30, pp. 3215–3225.
Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. & Ben-David, S. (2014), Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, Cambridge, UK. Modern textbook with clear exposition of VC theory and PAC learning.
Sutherland & Schneider (2015) Sutherland, D. J. & Schneider, J. (2015), On the error of random fourier features, in ‘Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence’, pp. 862–871.
Tropp (2012) Tropp, J. A. (2012), ‘User-friendly tail bounds for sums of random matrices’, Foundations of Computational Mathematics 12(4), 389–434. See especially Theorem 6.2 for the matrix Bernstein inequality used in the proof.
https://doi.org/10.1007/s10208-011-9099-z
Tropp (2015) Tropp, J. A. (2015), An Introduction to Matrix Concentration Inequalities, Vol. 8 of Foundations and Trends in Machine Learning, Now Publishers, Boston.
Valiant (1984) Valiant, L. G. (1984), ‘A theory of the learnable’, Communications of the ACM 27(11), 1134–1142.
Vapnik (1998) Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York. Comprehensive treatment of VC theory and statistical learning.
Vapnik & Chervonenkis (1971) Vapnik, V. N. & Chervonenkis, A. Y. (1971), ‘On the uniform convergence of relative frequencies of events to their probabilities’, Theory of Probability & Its Applications 16(2), 264–280. Foundational paper introducing VC dimension.
Welch & Goyal (2008) Welch, I. & Goyal, A. (2008), ‘A comprehensive look at the empirical performance of equity premium prediction’, Review of Financial Studies 21(4), 1455–1508.

Refer to caption — Figure 1: Convergence Analysis: Kernel Approximation Error vs Number of Features

Appendix A Technical Proofs for Kernel Approximation Breakdown

This appendix provides complete mathematical proofs for the results in Section 3. We establish that within-sample standardization of Random Fourier Features fundamentally breaks the Gaussian kernel approximation that underlies the theoretical framework of high-dimensional prediction methods.

A.1 Model Setup and Notation

We analyze the standardized Random Fourier Features used in practical implementations. Draw $(\omega,b)\sim\mathcal{N}(0,\gamma^{2}I_{K})\times\text{Uniform}[0,2\pi]$ , independently of the training set $\mathcal{T}=\{x_{t}\}_{t=1}^{T}$ . For query points $x,x^{\prime}\in\mathbb{R}^{K}$ , define the standardized kernel function:

h(\omega,b)=\frac{2\cos(\omega^{\top}x+b)\cos(\omega^{\top}x^{\prime}+b)}{1+% \frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_{t}+2b)}=\frac{N(\omega,b)}{D(% \omega,b)}

Given $P$ i.i.d. copies $(\omega_{i},b_{i})$ , we write $k^{(P)}_{\text{std}}:=P^{-1}\sum_{i=1}^{P}h(\omega_{i},b_{i})$ .

A.2 Proof of Theorem 1

The proof proceeds in two steps: establishing almost-sure convergence in part (a) and demonstrating training-set dependence in part (b).

A.2.1 Step 1: Integrability and Almost-Sure Convergence

We first establish that $h(\omega,b)$ has finite expectation, enabling application of the strong law of large numbers.

Write

\hat{\sigma}^{2}:=\frac{2}{T}\sum_{t=1}^{T}\cos^{2}(\omega^{\top}x_{t}+b)=1+S_% {T},\quad S_{T}:=\frac{1}{T}\sum_{t=1}^{T}\cos(2\omega^{\top}x_{t}+2b)

Since $|h|\leq 2\hat{\sigma}^{-2}$ , integrability of $h$ follows once we show $\mathbb{E}[\hat{\sigma}^{-2}]<\infty$ . Lemma 1 proves this claim.

Using $\mathbb{P}(\hat{\sigma}^{-2}>u)=\mathbb{P}(\hat{\sigma}^{2}<u^{-1})$ , we obtain:

\mathbb{E}[\hat{\sigma}^{-2}]=\int_{0}^{2}\mathbb{P}(\hat{\sigma}^{-2}>u)\,du% \leq 1+\int_{1}^{\infty}C_{T}u^{-T/2}\,du<\infty

for every $T\geq 5$ . Hence $\mathbb{E}|h|<\infty$ .

Since the variables $h(\omega_{i},b_{i})$ are i.i.d. with finite mean, Kolmogorov’s strong law yields:

k^{(P)}_{\text{std}}(x,x^{\prime})=\frac{1}{P}\sum_{i=1}^{P}h(\omega_{i},b_{i}% )\xrightarrow[\text{a.s.}]{P\to\infty}k^{*}_{\text{std}}(x,x^{\prime}):=% \mathbb{E}[h(\omega,b)]

This establishes part (i) of Theorem 1.

A.2.2 Step 2: Training-Set Dependence

We now prove that the limiting kernel $k^{*}_{\text{std}}$ depends on the training set, unlike the Gaussian kernel.

Define the radial function:

G(u):=\mathbb{E}[(1+\cos(2\omega^{\top}u+2b))^{-1}]=g(r:=\|u\|),\quad r\geq 0

Now fix $x=x^{\prime}=0$ and compare two training sets:

\mathcal{T}=\{x_{1},\ldots,x_{T}\},\quad\mathcal{T}^{\prime}=\{\alpha x_{1},x_% {2},\ldots,x_{T}\},\quad\alpha>1

Write $D:=1+S_{T}$ and $g_{\mathcal{T}}(b):=\mathbb{E}[D^{-1}|b]$ . Conditioning on $b$ :

k^{*}_{\text{std}}(0,0|\mathcal{T})=\int_{0}^{2\pi}(1+\cos 2b)g_{\mathcal{T}}(% b)\frac{db}{2\pi}

Because $U_{t}(b)=2\omega^{\top}x_{t}+2b$ is Gaussian with variance $4\gamma^{2}\|x_{t}\|^{2}$ , scaling $x_{1}\mapsto\alpha x_{1}$ strictly enlarges $\text{Var }U_{1}(b)$ . By Lemma 2, this implies $g_{\mathcal{T}}(b)\neq g_{\mathcal{T}^{\prime}}(b)$ on a set of $b$ of positive measure. Since $1+\cos 2b>0$ on that set, the two integrals—and therefore the two kernels—differ:

k^{*}_{\text{std}}(0,0|\mathcal{T})\neq k^{*}_{\text{std}}(0,0|\mathcal{T}^{% \prime})

The Gaussian kernel $k_{G}(x,x^{\prime})=\exp(-\frac{\gamma^{2}}{2}\|x-x^{\prime}\|^{2})$ depends only on $\|x-x^{\prime}\|$ and is training-set independent. Hence $k^{*}_{\text{std}}\neq k_{G}$ , which proves part (b) of Theorem 1.

A.3 Supporting Lemmas

Lemma 1 (Small–ball estimate).

Let $x_{1},\dots,x_{T}\in\mathbb{R}^{d}$ satisfy the affine–independence assumption

\mathrm{rank}\!\begin{pmatrix}x_{1}&\!\cdots\!&x_{T}\\[2.0pt] 1&\!\cdots\!&1\end{pmatrix}=T.

(A)

Draw $\omega\sim\mathcal{N}(0,\gamma^{2}I_{d})$ and $b\sim\mathrm{Unif}[0,2\pi]$ independently and set

\hat{\sigma}^{2}=1+\frac{1}{T}\sum_{t=1}^{T}\cos\!\bigl{(}2\omega^{\top}x_{t}+% 2b\bigr{)}.

Then there exists a constant $C_{T}<\infty$ (depending on $T,\gamma,\{x_{t}\}$ ) such that for every $\varepsilon\in(0,1)$ ,

\mathbb{P}\!\bigl{(}\hat{\sigma}^{2}\leq\varepsilon\bigr{)}\;\leq\;C_{T}\,% \varepsilon^{T/2}.

Proof of Lemma 1.

We convert the small–ball event into a geometric one. Using the inequality $\cos(2\theta_{t})\geq-1+\frac{1}{4}\Delta_{t}^{2}$ , where $\Delta_{t}$ is the distance of $\theta_{t}$ to the points where $\cos(2\theta_{t})=-1$ , the condition $\hat{\sigma}^{2}\leq\varepsilon$ forces the vector $(\Delta_{1},\dots,\Delta_{T})$ to lie inside an $T$ -dimensional Euclidean ball of radius $O(\sqrt{T\varepsilon})$ , whose volume scales like $\varepsilon^{T/2}$ . Because the affine map $(\omega,b)\mapsto(\theta_{1},\dots,\theta_{T})$ has full rank, the Gaussian density of its image is uniformly bounded, so the probability of the event is at most a constant times this volume, yielding the bound $\mathbb{P}(\hat{\sigma}^{2}\leq\varepsilon)\leq C_{T}\varepsilon^{T/2}$ .

Put $\theta_{t}:=\omega^{\top}x_{t}+b,\;t=1,\dots,T$ and write $\delta:=1-\varepsilon\in(0,1)$ . Then

\hat{\sigma}^{2}=1+\frac{1}{T}\sum_{t=1}^{T}\cos\bigl{(}2\theta_{t}\bigr{)}% \quad\Longrightarrow\quad\{\hat{\sigma}^{2}\leq\varepsilon\}\;=\;\Bigl{\{}% \frac{1}{T}\sum_{t=1}^{T}\cos(2\theta_{t})\leq-\delta\Bigr{\}}.

For every $\varphi\in\mathbb{R}$ with $|\varphi-\pi-2\pi k|\leq\pi/2\;(k\in\mathbb{Z})$ we have

\displaystyle\cos\varphi\;\geq\;-1+\frac{1}{4}\bigl{(}\varphi-\pi-2\pi k\bigr{% )}^{2}

(A.1)

(Use $\cos u\geq 1-u^{2}/4$ for $|u|\leq\pi/2$ and set $u=\varphi-\pi-2\pi k$ .)

Let $\Phi_{t}:=2\theta_{t}=2\omega^{\top}x_{t}+2b$ and define the $T$ distances

\Delta_{t}:=\min_{k\in\mathbb{Z}}\bigl{|}\Phi_{t}-\pi-2\pi k\bigr{|}\in[0,\pi].

Applying A.1 with $\varphi=\Phi_{t}$ gives $\cos\Phi_{t}\geq-1+\tfrac{1}{4}\Delta_{t}^{2}.$

Because $\frac{1}{T}\sum_{t}\cos\Phi_{t}\leq-\delta$ and $\cos\Phi_{t}\geq-1+\tfrac{1}{4}\Delta_{t}^{2}$ ,

-1+\frac{1}{4}\frac{1}{T}\sum_{t=1}^{T}\Delta_{t}^{2}\;\leq\;-\delta\;% \Longrightarrow\;\frac{1}{T}\sum_{t=1}^{T}\Delta_{t}^{2}\;\leq\;4(1-\delta)=4\varepsilon.

Hence the event $\{\hat{\sigma}^{2}\leq\varepsilon\}$ implies

\bigl{(}\Delta_{1},\dots,\Delta_{T}\bigr{)}\in B_{\varepsilon}:=\Bigl{\{}z\in% \mathbb{R}^{T}:\|z\|_{2}^{2}\leq 4T\varepsilon\Bigr{\}}.

The Lebesgue volume of $B_{\varepsilon}$ is $\mathrm{vol}(B_{\varepsilon})=\kappa_{T}(4T\varepsilon)^{T/2},$ with $\kappa_{T}$ the unit-ball volume in $\mathbb{R}^{T}$ .

Write $Y:=2(\omega^{\top}x_{1},\dots,\omega^{\top}x_{T})^{\!\top}\in\mathbb{R}^{T}.$ By linearity, $Y\sim\mathcal{N}(0,\Sigma)$ with $\Sigma=4\gamma^{2}(x_{i}^{\top}x_{j})_{i,j\leq T}.$ Assumption (A) implies $\Sigma$ is non–singular, so $Y$ possesses a density $f_{Y}(y)=\tfrac{1}{\sqrt{(2\pi)^{T}\det\Sigma}}\,e^{-\frac{1}{2}y^{\top}\Sigma% ^{-1}y},$ satisfying $0<\sup_{y\in\mathbb{R}^{T}}f_{Y}(y)=:(2\pi)^{-T/2}(\det\Sigma)^{-1/2}=:M_{T}<\infty.$

For every fixed $b\in[0,2\pi]$ we have $\bigl{(}\Phi_{1},\dots,\Phi_{T}\bigr{)}=Y+2b\,\mathbf{1},$ so

\mathbb{P}\bigl{(}\hat{\sigma}^{2}\leq\varepsilon\mid b\bigr{)}\;\leq\;\int_{y% \in B_{\varepsilon}-2b\,\mathbf{1}}\!f_{Y}(y)\,dy\;\leq\;M_{T}\,\mathrm{vol}(B% _{\varepsilon}).

Finally,

\mathbb{P}\bigl{(}\hat{\sigma}^{2}\leq\varepsilon\bigr{)}=\frac{1}{2\pi}\int_{% 0}^{2\pi}\mathbb{P}\bigl{(}\hat{\sigma}^{2}\leq\varepsilon\mid b\bigr{)}\,db\;% \leq\;M_{T}\,\mathrm{vol}(B_{\varepsilon})\;=\;C_{T}\,\varepsilon^{T/2},

where $C_{T}:=M_{T}\,\kappa_{T}(4T)^{T/2}$ . ∎

Lemma 2 (Strict Radial Monotonicity).

The derivative satisfies $g^{\prime}(r)<0$ for every $r>0$ .

Proof of Lemma 2.

Using the identity $(1+\cos\phi)^{-1}=\tfrac{1}{2}\sum_{n=0}^{\infty}(-1)^{n}I_{n}(\phi)$ and the isotropy of $\omega$ , $g(r)=\tfrac{1}{2}\sum_{n=0}^{\infty}(-1)^{n}I_{n}(2\gamma r)$ . Each $I_{n}$ is positive and increasing on $(0,\infty)$ , and the series converges absolutely on compacts, so term-wise differentiation gives $g^{\prime}(r)<0$ . ∎

A.4 Connection to Existing Literature

Our analysis uses only the first four moments of the $\mathcal{N}(0,\gamma^{2}I_{K})$ draw for $\omega$ , which aligns with the finite-moment conditions imposed in Kelly et al. (2024). Specifically, their Condition 0 requires $\mathbb{E}\|\omega\|^{4}<\infty$ , which our standard Gaussian assumption satisfies. No additional distributional assumptions beyond those already present in the KMZ framework are required for our impossibility results.

The breakdown we establish is therefore endemic to the standardized RFF approach as implemented, rather than an artifact of stronger technical conditions. This reinforces the fundamental nature of the theory-practice disconnect we identify.

Appendix B Technical Proofs for Section 4

Proof of Theorem 2.

The strategy is the classical minimax/Fano route: (i) build a large packing of well-separated parameters, (ii) show that their induced data distributions are statistically indistinguishable, (iii) invoke Fano’s inequality to bound any decoder’s error, and (iv) convert decoder error into a lower bound on prediction risk.

Packing construction. Fix a radius $0<\delta<B/2$ . Because the Euclidean ball $\mathbb{B}_{2}^{P}(B)$ in $\mathbb{R}^{P}$ has volume growth proportional to $B^{P}$ , it contains a $2\delta$ -packing $\{w_{1},\dots,w_{M}\}$ of size $M=(B/(2\delta))^{P}$ ; hence $\log M=P\log\!\bigl{(}B/(2\delta)\bigr{)}$ . Define $f_{j}(x):=w_{j}^{\!\top}z(x)$ . For each index $j$ let $\mathbb{P}_{j}$ denote the joint distribution of the training sample $\mathcal{D}_{T}=\{(x_{t},r_{t})\}_{t=1}^{T}$ generated according to $r_{t}=f_{j}(x_{t})+\epsilon_{t}$ with independent Gaussian noise $\epsilon_{t}\sim\mathcal{N}(0,\sigma^{2})$ .

Average KL divergence. Let $Z\in\mathbb{R}^{T\times P}$ be the random design matrix whose $t$ -th row is $z(x_{t})^{\!\top}$ . Conditioned on $Z$ the log-likelihood ratio between $\mathbb{P}_{j}$ and $\mathbb{P}_{\ell}$ is Gaussian, and one checks

\mathrm{KL}\!\bigl{(}\mathbb{P}_{j}\|\mathbb{P}_{\ell}\,\bigm{|}\,Z\bigr{)}\;=% \;\frac{\|Z(w_{j}-w_{\ell})\|_{2}^{2}}{2\sigma^{2}}.

Taking expectation over $Z$ and using $\mathbb{E}[Z^{\!\top}Z]=T\Sigma_{z}$ gives

\mathrm{KL}(\mathbb{P}_{j}\|\mathbb{P}_{\ell})\;=\;\frac{T}{2\sigma^{2}}(w_{j}% -w_{\ell})^{\!\top}\Sigma_{z}(w_{j}-w_{\ell})\;\leq\;\frac{2T\,C_{z}\,B^{2}}{% \sigma^{2}}\;=:\;K_{T}.

(The inequality uses $\Sigma_{z}\succeq 0$ and $\lambda_{\max}(\Sigma_{z})\leq C_{z}$ .)

Fano’s inequality. Draw an index $J$ uniformly from $[M]$ and let $\hat{J}$ be any measurable decoder based on the sample $\mathcal{D}_{T}$ . Fano’s max–KL form (e.g. CoverThomas2006, Eq. 16.32) yields

\mathbb{P}(\hat{J}\neq J)\;\geq\;1-\frac{K_{T}+\log 2}{\log M}.

Choosing the packing radius $\delta$ such that the right-hand side equals $1/2$ (so that any decoder errs at least half the time) gives

\delta\;\leq\;\frac{B}{2}\exp\!\Bigl{(}-\frac{4TC_{z}B^{2}}{P\sigma^{2}}-\frac% {2\log 2}{P}\Bigr{)}.

(B.1)

Link between prediction risk and decoder error. Let $\hat{f}_{T}$ be an arbitrary estimator and put $\varepsilon:=\mathbb{E}_{x,\mathcal{D}_{T},\epsilon}\!\bigl{[}(\hat{f}_{T}(x)-% f_{J}(x))^{2}\bigr{]}$ . Because the nearest-neighbour decoder chooses $\hat{J}=\arg\min_{j}\|\hat{f}_{T}-f_{j}\|_{L^{2}(\mu)}$ , the triangle inequality gives

\|f_{\hat{J}}-f_{J}\|_{L^{2}(\mu)}\;\leq\;2\sqrt{\varepsilon}.

Meanwhile each pair $(j,\ell)$ in the packing satisfies $\|w_{j}-w_{\ell}\|_{2}\geq 2\delta$ ; since $\Sigma_{z}\succeq c_{z}I_{P}$ , $\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}\geq 4c_{z}\delta^{2}$ . Consequently, if $\varepsilon<c_{z}\delta^{2}$ the decoder must succeed ( $\hat{J}=J$ ), contradicting $\mathbb{P}(\hat{J}\neq J)\geq\tfrac{1}{2}$ . Hence

\varepsilon\;\geq\;c_{z}\,\delta^{2}.

(B.2)

Expectation lower bound. Substituting (B.1) into (B.2) and absorbing the harmless factor $e^{-4\log 2/P}$ into a constant $c=\tfrac{1}{4}c_{z}e^{-4\log 2/P}$ yields

\varepsilon\;\geq\;c\,B^{2}\exp\!\Bigl{(}-\tfrac{8TC_{z}B^{2}}{P\sigma^{2}}% \Bigr{)},

which is the desired in-expectation bound.

High-probability refinement over the design. Finally, define the “well-conditioned design” event

\mathcal{E}:=\Bigl{\{}\,\bigl{\|}T^{-1}Z^{\!\top}Z-\Sigma_{z}\bigr{\|}_{% \mathrm{op}}\leq\tfrac{1}{2}c_{z}\Bigr{\}}.

For sub-Gaussian rows, the matrix Bernstein inequality (Tropp 2012, Theorem 6.2) guarantees $\mathbb{P}_{Z}(\mathcal{E}^{c})\leq e^{-T}$ provided $T\geq C_{0}(\kappa,c_{z},C_{z})\,P$ . On $\mathcal{E}$ the empirical Gram matrix satisfies $\tfrac{1}{2}c_{z}I_{P}\preceq T^{-1}Z^{\!\top}Z\preceq 2C_{z}I_{P}$ , so the previous KL-and-distance calculations hold with constants $(2C_{z},\tfrac{1}{2}c_{z})$ . Repeating the Fano–risk argument under $\mathcal{E}$ therefore gives

\varepsilon\;\geq\;c^{\ast}B^{2}\exp\!\Bigl{(}-\tfrac{8TC_{z}B^{2}}{P\sigma^{2% }}\Bigr{)},\qquad c^{\ast}=\tfrac{1}{8}c_{z}e^{-4\log 2/P},

for every fixed $Z\in\mathcal{E}$ . Taking outer expectation over the design and using $\mathbb{P}_{Z}(\mathcal{E}^{c})\leq e^{-T}$ produces the advertised high-probability bound. ∎

Proof of Theorem 3.

Part (a) Let $e_{1},\dots,e_{P}$ be the standard basis of $\mathbb{R}^{P}$ . Define

\delta\;=\;\min\bigl{\{}\,\tfrac{B}{4},\;\tfrac{\sigma}{4}\sqrt{\tfrac{\log P}% {T\,C_{z}}}\,\bigr{\}},\quad w_{0}=\mathbf{0},\quad w_{j}=\delta\,e_{j}\;(j=1,% \dots,P).

All $w_{j}$ lie in the ball $\|w\|_{2}\leq B$ , and $\|w_{j}-w_{\ell}\|_{2}=\sqrt{2}\,\delta\;(j\neq\ell)$ .

For $j\neq\ell$ let $\mathbb{P}_{j},\mathbb{P}_{\ell}$ be the distributions of the $T$ samples when the true parameter is $w_{j}$ or $w_{\ell}$ . Conditioned on the design matrix $Z$ , both are Gaussians with means $Zw_{j},Zw_{\ell}$ and covariance $\sigma^{2}I_{T}$ . Taking expectation over $Z$ ,

\mathbb{E}[D_{\mathrm{KL}}(\mathbb{P}_{j}\|\mathbb{P}_{\ell})]\;=\;\frac{T}{2% \sigma^{2}}(w_{j}-w_{\ell})^{\!\top}\Sigma_{z}(w_{j}-w_{\ell})\;\leq\;\frac{2T% \,C_{z}\delta^{2}}{\sigma^{2}}\;\leq\;\frac{\log P}{8}\;<\;\tfrac{1}{2}\log P.

With $M=P+1$ equiprobable hypotheses, Fano gives

\Pr[\hat{J}\neq J]\;\geq\;1-\frac{\tfrac{1}{2}\log P+\log 2}{\log(P+1)}\;\geq% \;\tfrac{1}{2}\quad(\text{for }P\geq 16).

Because $\Sigma_{z}\succeq c_{z}I_{P}$ ,

\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}=(w_{j}-w_{\ell})^{\!\top}\Sigma_{z}(w_{j}-% w_{\ell})\geq 2c_{z}\delta^{2}.

If $\hat{f}_{T}$ attains mean-squared error $\varepsilon<\frac{c_{z}}{4}\delta^{2}$ , then on the event $\hat{J}\neq J$ the triangle inequality forces $\|f_{\hat{J}}-f_{J}\|_{L^{2}(\mu)}<\sqrt{2c_{z}}\,\delta$ , contradicting the separation just shown. Therefore

\varepsilon\;\geq\;\Pr[\hat{J}\neq J]\,\frac{c_{z}}{2}\,\delta^{2}\;\geq\;% \frac{c_{z}}{4}\,\delta^{2}.

Substituting the definition of $\delta$ yields

\varepsilon\;\geq\;\frac{c_{z}}{4}\;\min\!\Bigl{\{}\tfrac{B^{2}}{16},\;\tfrac{% \sigma^{2}\log P}{16\,T\,C_{z}}\Bigr{\}}\;=\;\frac{c_{z}}{64C_{z}}\;\min\!% \Bigl{\{}B^{2},\;\tfrac{\sigma^{2}}{T}\log P\Bigr{\}}.

Setting $\tilde{c}=c_{z}/(64C_{z})$ completes the proof.

Part (b)

1. Packing of the coefficient ball.

Let $e_{1},\dots,e_{P}$ be the canonical basis in $\mathbb{R}^{P}$ and set

\delta\;:=\;\min\!\Bigl{\{}\tfrac{B}{4},\;\tfrac{\sigma}{4}\sqrt{\tfrac{\log P% }{T\,C_{z}}}\Bigr{\}},\qquad w_{0}:=\mathbf{0},\quad w_{j}:=\delta\,e_{j}\;(j=% 1,\dots,P).

All $w_{j}$ lie in the Euclidean ball $\|w\|_{2}\leq B$ and satisfy $\|w_{j}-w_{\ell}\|_{2}=\sqrt{2}\,\delta\;(j\neq\ell).$

2. A “good-design” event.

Define

\mathcal{E}:=\Bigl{\{}\,\lambda_{\max}\!\bigl{(}T^{-1}Z^{\!\top}Z\bigr{)}\leq 2% C_{z}\Bigr{\}}.

By the sub-Gaussian matrix Bernstein inequality (e.g., Tropp (2015)) there is a constant $C_{0}=C_{0}(\kappa,C_{z})$ such that

\mathbb{P}_{Z}(\mathcal{E}^{\mathrm{c}})\;\leq\;e^{-T}\quad\text{for all }T% \geq C_{0}P.

(B.3)

3. KL bound conditional on $\mathcal{E}$ .

For $j\neq\ell$ the conditional Kullback–Leibler divergence equals

\mathrm{KL}(\mathbb{P}_{j}\|\mathbb{P}_{\ell}\mid Z)\;=\;\frac{\|Z(w_{j}-w_{% \ell})\|_{2}^{2}}{2\sigma^{2}}\;\leq\;\frac{T\,\lambda_{\max}(T^{-1}Z^{\!\top}% Z)\,\|w_{j}-w_{\ell}\|_{2}^{2}}{2\sigma^{2}}.

On $\mathcal{E}$ we have $\lambda_{\max}(T^{-1}Z^{\!\top}Z)\leq 2C_{z}$ , so

\mathrm{KL}(\mathbb{P}_{j}\|\mathbb{P}_{\ell}\mid Z)\;\leq\;\frac{2T\,C_{z}\,% \delta^{2}}{\sigma^{2}}\;\leq\;\frac{\log P}{8}\;<\;\tfrac{1}{2}\log P.

4. Fano’s inequality conditional on $Z\in\mathcal{E}$ .

With the $M=P+1$ hypotheses $\{w_{0},w_{1},\dots,w_{P}\}$ equiprobable, Fano (max-KL form) gives, conditional on $Z\in\mathcal{E}$ ,

\Pr(\hat{J}\neq J\mid Z)\;\geq\;1-\frac{\tfrac{1}{2}\log P+\log 2}{\log(P+1)}% \;\geq\;\tfrac{1}{2}\quad(P\geq 16).

5. Relating risk to identification (conditional).

Since the population covariance satisfies $c_{z}I_{P}\preceq\Sigma_{z}$ , for any $j\neq\ell$

\|f_{j}-f_{\ell}\|_{L^{2}(\mu)}^{2}=(w_{j}-w_{\ell})^{\!\top}\Sigma_{z}(w_{j}-% w_{\ell})\;\geq\;2c_{z}\delta^{2}.

Argument identical to Theorem 3 shows that for every estimator $\hat{f}_{T}$ and every $Z\in\mathcal{E}$

\mathbb{E}_{x,\epsilon}\!\bigl{[}(\hat{f}_{T}(x)-w^{\!\top}z(x))^{2}\mid Z% \bigr{]}\;\geq\;\frac{c_{z}}{4}\,\delta^{2}\;=\;\tilde{c}\,\min\!\Bigl{\{}B^{2% },\;\tfrac{\sigma^{2}}{T}\log P\Bigr{\}}.

(B.3)

6. Remove the conditioning.

Inequality (4.3) follows by combining (B.3) (probability of $\mathcal{E}$ ) with (B.3). This completes the proof. ∎

Proof of Theorem 4.

All VC statements are made conditional on the fixed training sample $(x_{1},\dots,x_{T})$ . Throughout we use the standard fact that homogeneous linear threshold functions in $\mathbb{R}^{d}$ have VC dimension $d$ (e.g., Vapnik (1998)).

(a) Linear class $\mathcal{F}_{P}$ .

Because $\operatorname{sign}(\lambda w^{\top}z(x))=\operatorname{sign}(w^{\top}z(x))$ for every $\lambda>0$ , the norm bound $\|w\|_{2}\leq B$ does not remove any labelings that an unconstrained homogeneous hyperplane in $\mathbb{R}^{P}$ could realise. Hence the set $\{\operatorname{sign}(w^{\top}z(x)):\|w\|_{2}\leq B\}$ has the same VC dimension as all homogeneous linear separators in $\mathbb{R}^{P}$ , namely $P$ .

(b) Ridgeless class $\mathcal{F}_{\textnormal{ridge}}^{(Z)}$ .

For any training targets $y\in\mathbb{R}^{T}$ the ridgeless solution is $\hat{w}=Z^{\top}(ZZ^{\top})^{\dagger}y$ , where ^† denotes the Moore–Penrose pseudoinverse. Consequently every predictor can be written as

f_{\alpha}(x)=\alpha^{\top}k(x),\qquad\text{with }\alpha=(ZZ^{\top})^{\dagger}% y\in\mathbb{R}^{T}.

Define the data–dependent feature map

\phi_{Z}:\mathcal{X}\to\mathbb{R}^{T},\qquad\phi_{Z}(x)\;:=\;k(x).

Its image lies in the $r$ -dimensional subspace $\operatorname{im}(ZZ^{\top})\subseteq\mathbb{R}^{T}$ , so $\phi_{Z}(\mathcal{X})\subseteq\mathbb{R}^{r}$ after an appropriate linear change of basis. Thus the hypothesis class

\mathcal{H}_{Z}\;=\;\bigl{\{}\,x\mapsto\operatorname{sign}(\alpha^{\top}\phi_{% Z}(x)):\alpha\in\mathbb{R}^{T}\bigr{\}}

is (up to an invertible linear map) exactly the class of homogeneous linear separators in $\mathbb{R}^{r}$ . By the cited VC fact, $\mathrm{VC}(\mathcal{H}_{Z})=r$ . Because $r\leq T$ , we obtain the claimed bound. If $(ZZ^{\top})$ is invertible, then $r=T$ , giving equality. ∎

Proof of Theorem 5.

The proof follows from the polynomial lower bound in Theorem 3:

\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}_{x,\mathcal{D}_{T},\epsilon% }[(\hat{f}_{T}(x)-w^{\top}z(x))^{2}]\geq\tilde{c}\min\left\{B^{2},\frac{\sigma% ^{2}\log P}{T}\right\}.

For learning to be impossible with error $<\varepsilon$ , we need:

\tilde{c}\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}\geq\varepsilon% \quad\Leftrightarrow\quad\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}% \geq\frac{\varepsilon}{\tilde{c}}.

Phase I (Impossible Learning): Suppose $\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}$ and $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$ .

Since both arguments of the minimum satisfy the bound $\geq\frac{\varepsilon}{\tilde{c}}$ , we have:

\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}\geq\frac{\varepsilon}{% \tilde{c}}.

Therefore:

\inf_{\hat{f}_{T}}\sup_{\|w\|_{2}\leq B}\mathbb{E}[(\hat{f}_{T}(x)-w^{\top}z(x% ))^{2}]\geq\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon,

establishing that learning with error $<\varepsilon$ is information-theoretically impossible.

Phase II (Possible Learning): Suppose $\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}$ and $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$ .

Since $\frac{\sigma^{2}\log P}{T}<\frac{\varepsilon}{\tilde{c}}\leq B^{2}$ , we have:

\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}=\frac{\sigma^{2}\log P}{T}% <\frac{\varepsilon}{\tilde{c}}.

The lower bound becomes:

\tilde{c}\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}=\tilde{c}\frac{% \sigma^{2}\log P}{T}<\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon.

Since the information-theoretic lower bound is $<\varepsilon$ , learning with error $<\varepsilon$ is not ruled out by fundamental limitations.

Trivial Regime: If $B^{2}<\frac{\varepsilon}{\tilde{c}}$ , then regardless of the value of $\frac{\sigma^{2}\log P}{T}$ :

\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}\leq B^{2}<\frac{% \varepsilon}{\tilde{c}}.

Hence:

\tilde{c}\min\left\{B^{2},\frac{\sigma^{2}\log P}{T}\right\}\leq\tilde{c}B^{2}% <\tilde{c}\cdot\frac{\varepsilon}{\tilde{c}}=\varepsilon.

The function class is too simple relative to target accuracy $\varepsilon$ , and standard parametric rates apply. ∎

Proof of Corollary 1.

We analyze when the conditions of Theorem 5 are satisfied.

For impossibility, we need $\frac{\sigma^{2}\log P}{T}\geq\frac{\varepsilon}{\tilde{c}}$ . Since $P=O(K^{\beta})$ , we have $\log P=O(\beta\log K)$ , so:

\frac{\sigma^{2}\beta\log K}{T}\geq\frac{\varepsilon}{\tilde{c}}\quad% \Rightarrow\quad T\leq\frac{\tilde{c}\sigma^{2}\beta\log K}{\varepsilon}.

We also need $B^{2}\geq\frac{\varepsilon}{\tilde{c}}$ . Under Assumption LABEL:ass:weak_signal, $B^{2}=O(K^{-\alpha})\sigma^{2}$ , so:

O(K^{-\alpha})\sigma^{2}\geq\frac{\varepsilon}{\tilde{c}}\quad\Rightarrow\quad K% \leq O\left(\left(\frac{\tilde{c}\sigma^{2}}{\varepsilon}\right)^{1/\alpha}% \right).

Learning is impossible when both conditions hold simultaneously:

	$\displaystyle T$	$\displaystyle\leq\frac{\tilde{c}\sigma^{2}\beta\log K}{\varepsilon}\quad\text{% (complexity condition)}$		(B.4)
	$\displaystyle K$	$\displaystyle\leq O\left(\left(\frac{\tilde{c}\sigma^{2}}{\varepsilon}\right)^% {1/\alpha}\right)\quad\text{(non-triviality condition)}$		(B.5)

For fixed $T$ , the complexity condition gives:

K\geq\exp\left(\frac{T\varepsilon}{\tilde{c}\sigma^{2}\beta}\right)=:K_{0}(T).

Learning is impossible when $K_{0}(T)\leq K\leq O\left(\left(\frac{\tilde{c}\sigma^{2}}{\varepsilon}\right)% ^{1/\alpha}\right)$ .

This interval is non-empty when $T$ is sufficiently small, specifically when:

T\leq\frac{\tilde{c}\sigma^{2}\beta}{\varepsilon}\log\left(O\left(\left(\frac{% \tilde{c}\sigma^{2}}{\varepsilon}\right)^{1/\alpha}\right)\right)=O\left(\frac% {\sigma^{2}\beta\log(\sigma^{2}/\varepsilon)}{\alpha\varepsilon}\right).

∎

High-Dimensional Learning in Finance††thanks: Replication code is available from the author.