The curse of dimensionality is easy to understand: volume grows exponentially with dimension, so we need exponentially many samples relative to the number of features before our data says anything meaningful. What isn't so clear is why injecting randomness helps alleviate the issue. We use Monte Carlo integration as a backdrop to investigate this phenomenon.
We open with a seemingly unrelated lemma. In time, we will see that this lemma is the backbone of the whole phenomenon.
Lemma. On a bounded domain $(\Omega, \mu)$ (so that $\mu(\Omega) < \infty$), for $q \ge p$ we have the inclusion $L^q(\Omega, \mu) \subset L^p(\Omega, \mu)$. In fact, up to a constant $C = C(\mu(\Omega), p, q)$ which is independent of the dimension,

$$\|f\|_p \le C\,\|f\|_q.$$
Proof. This is a straightforward application of Hölder's inequality. We begin by setting $s$ as the Hölder conjugate to $q/p$, i.e. $\frac{p}{q} + \frac{1}{s} = 1$. We compute

$$\int_\Omega |f|^p \le \left( \int_\Omega \left(|f|^p\right)^{q/p} \right)^{p/q} \left( \int_\Omega 1^s \right)^{1/s} = \|f\|_q^p \cdot \mu(\Omega)^{1/s}.$$

Taking $p$-th roots gives the claim, with $C = \mu(\Omega)^{1/(ps)}$.
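To make the inequality concrete, here is a quick numerical sanity check (our own illustration, not part of the argument): we approximate both norms by Riemann sums on a one-dimensional domain and verify $\|f\|_p \le \mu(\Omega)^{1/(ps)}\,\|f\|_q$. The test function and grid size are arbitrary choices.

```python
# Sanity check of the lemma on Omega = [0, 2], so mu(Omega) = 2.
# Illustrative sketch; the test function f is an arbitrary choice.
import numpy as np

p, q = 2.0, 4.0
a, b = 0.0, 2.0
x = np.linspace(a, b, 200_001)
f = np.exp(np.sin(3 * x)) * x            # any bounded function on [a, b]

def lp_norm(values, r):
    # (int_Omega |f|^r dx)^(1/r), approximated by a Riemann sum
    return ((np.abs(values) ** r).mean() * (b - a)) ** (1 / r)

s = q / (q - p)                          # Hoelder conjugate of q/p
C = (b - a) ** (1 / (p * s))             # C = mu(Omega)^{1/(ps)}
print(lp_norm(f, p), "<=", C * lp_norm(f, q))   # the inequality holds
```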
Actually, for our purposes we'll need to control only the variance of a random variable $Y$. In this case we can achieve a (usually) sharper constant of $1$. For this, we only need the convexity underlying the $q$-norms.
Lemma. For a random variable $Y \in L^q$ with $q \ge 2$, we have

$$\mathbb{V}[Y] \le \|Y\|_q^2.$$
In other words, the standard deviation of $Y$ is controlled by each of its higher moments.
Proof. This is a simple application of Jensen's inequality. When applied to the $s$-norm for $s \ge 1$, it states

$$\mathbb{E}[Y] \le \left(\mathbb{E}[|Y|^s]\right)^{1/s}.$$

Applying this to the random variable $Y^2$ with $s = q/2$, we see

$$\mathbb{V}[Y] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2 \le \mathbb{E}[Y^2] \le \left(\mathbb{E}\left[(Y^2)^{q/2}\right]\right)^{2/q} = \|Y\|_q^2.$$
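As a quick empirical check (again our own illustration; the lognormal distribution is an arbitrary choice with all moments finite), the sample variance indeed sits below $\|Y\|_q^2$ for every $q \ge 2$:

```python
# Empirical check that Var[Y] <= ||Y||_q^2 for q >= 2.
# The lognormal distribution is an arbitrary choice with finite moments.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.lognormal(mean=0.5, sigma=0.8, size=1_000_000)
for q in (2, 3, 4, 6):
    q_norm_sq = (np.abs(Y) ** q).mean() ** (2 / q)   # ||Y||_q^2
    print(q, Y.var(), "<=", q_norm_sq)
```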
Let's turn to integration. Let $\Omega \subset \mathbb{R}^n$ be a box. Classically, we'd draw evenly spaced grid lines and integrate by sampling $f$ at all the grid corners. As discussed, the curse of dimensionality makes this a bad idea: the number of corners grows exponentially in $n$. Instead, let $\mu$ denote the uniform probability measure on $\Omega$; to keep formulas clean, assume $\Omega$ has unit volume, so that $d\mu = dx$ on $\Omega$. We draw $m$ iid samples $X_1, \dots, X_m$ from $(\Omega, \mu)$. For convenience, let $X$ be an auxiliary random variable also sampled from $(\Omega, \mu)$. Our first goal is to show the random Riemann sum $\frac{1}{m}\sum_{i=1}^m f(X_i)$ forms an unbiased estimator for the true integral $\int_\Omega f\,dx$. We compute
$$\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m f(X_i)\right] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[f(X_i)] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[f(X)] = \mathbb{E}[f(X)] = \int_{\mathbb{R}^n} f\,\mathbb{1}_\Omega\,dx = \int_\Omega f\,dx.$$
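Unbiasedness is easy to witness in code: averaging many independent runs of the estimator recovers the true integral. The following is a minimal sketch under the unit-volume assumption above; the name mc_integrate and the test integrand are our own illustrative choices.

```python
# Minimal Monte Carlo integration on the unit box [0,1]^n (unit volume,
# matching the normalization in the text). Names here are illustrative.
import numpy as np

def mc_integrate(f, n_dims, m, rng):
    # The random Riemann sum (1/m) * sum_i f(X_i) over m iid uniform samples
    X = rng.random((m, n_dims))
    return f(X).mean()

rng = np.random.default_rng(1)
f = lambda X: np.sin(np.pi * X).prod(axis=1)   # exact integral: (2/pi)^n
n = 5
estimates = [mc_integrate(f, n, m=1_000, rng=rng) for _ in range(2_000)]
print(np.mean(estimates), (2 / np.pi) ** n)    # the two means agree
```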
By the strong law of large numbers, as $m \to \infty$ we get that $\frac{1}{m}\sum_{i=1}^m f(X_i) \to \int_\Omega f\,dx$ almost surely (see the sketch below). As with all such asymptotic results, this is useless in practice on its own: we need information on the rate of convergence.
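First, though, a sketch of the almost-sure convergence along a single sample path (same illustrative unit-box setup as above):

```python
# Watching the SLLN: the running average of f(X_i) along one sample path
# settles onto the true integral. Same illustrative setup as above.
import numpy as np

rng = np.random.default_rng(3)
n, truth = 3, (2 / np.pi) ** 3
X = rng.random((100_000, n))
vals = np.sin(np.pi * X).prod(axis=1)
running = np.cumsum(vals) / np.arange(1, len(vals) + 1)
for m in (10, 100, 1_000, 10_000, 100_000):
    print(m, running[m - 1], "->", truth)
```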
Henceforth, assume $f \in L^q(\Omega)$ for some $q > 2$. For example, if $\Omega$ is closed (hence compact) and $f$ is continuous, we automatically get $f \in L^\infty(\Omega)$. Rates of convergence are measured in the $L^2$ sense (why?), so setting $S_m := \frac{1}{m}\sum_{i=1}^m f(X_i)$ we see

$$\mathbb{E}\left(\frac{1}{m}\sum_{i=1}^m f(X_i) - \int_\Omega f\,dx\right)^{\!2} = \mathbb{E}\left(S_m - \mathbb{E}[S_m]\right)^2 = \mathbb{V}[S_m] = \frac{1}{m^2}\,\mathbb{V}\left[\sum_{i=1}^m f(X_i)\right] = \frac{1}{m}\,\mathbb{V}[f(X)] \le \frac{C(\mu(\Omega), q, 2)}{m}\,\|f(X)\|_q^2,$$
where the penultimate equality uses the independence of the $X_i$, and the last inequality follows from our initial lemma: $\mathbb{V}[f(X)] \le \mathbb{E}[f(X)^2] = \|f(X)\|_2^2 \le C^2\,\|f(X)\|_q^2$, absorbing the square into the constant. In fact, our second lemma lets us take this constant to be $1$. Either way, the expected squared error of Monte Carlo integration is $O(m^{-1})$, so the error measured in an $L^2$ sense decays at the rate $O(m^{-1/2})$, with a constant independent of the dimension.
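This dimension-independent $O(m^{-1/2})$ rate is visible numerically. In the sketch below (same illustrative integrand as above, now in dimension $10$), the root-mean-square error roughly halves each time $m$ quadruples, so $\sqrt{m} \cdot \mathrm{RMSE}$ stays roughly constant:

```python
# Observing the O(m^{-1/2}) rate in dimension 10: sqrt(m) * RMSE is
# roughly constant as m grows. Same illustrative integrand as above.
import numpy as np

rng = np.random.default_rng(2)
n, runs = 10, 300
truth = (2 / np.pi) ** n
for m in (100, 400, 1_600, 6_400):
    est = np.array([np.sin(np.pi * rng.random((m, n))).prod(axis=1).mean()
                    for _ in range(runs)])
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    print(m, f"{rmse:.2e}", f"{rmse * np.sqrt(m):.2e}")
```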