The Russian Tail Bounds

We record a simple proof of Markov's inequality via indicator functions, then discuss how to tighten it using various transformations.

16 March 2025

Markov's inequality is fundamentally a statement about indicator functions of superlevel sets. Given a non-negative function $X:\Omega\to \mathbb{R}^{\geq 0}$ and a number $t>0$, consider the superlevel set $S_t$ defined by

\begin{equation} S_t:=\{\omega: X(\omega)\geq t\}. \end{equation}

It is clear, by construction, that the smallest value $X$ can take on $S_t$ is $t$. In other words, on $S_t$,

\begin{equation} X \geq t. \end{equation}

If we partition $\Omega=S_t \sqcup (S_t)^c$, then we can convert (2) into an inequality on $\Omega$ by

\begin{equation} X \geq \begin{cases} t & \text{in } S_t\\ 0 & \text{in } (S_t)^c \end{cases} = t\, \mathbb{1}_{S_t}, \end{equation}

where $\mathbb{1}_S$ denotes the indicator function of the set $S$. Let $\nu$ denote the measure on $\Omega$. Integrating (3) yields

\begin{equation} \int_\Omega X \; d\nu \geq t \int_\Omega \mathbb{1}_{S_t}\; d\nu = t\, \nu(S_t). \end{equation}

Dividing through by $t$ yields Markov's inequality. When $\nu=\mathbb{P}$ is a probability measure, we can interpret this as a weak tail bound. For a random variable $X$ with finite mean $\mu=\mathbb{E}[X]$, applying Markov's inequality to the non-negative variable $|X-\mu|$ gives

\begin{equation} \mathbb{P}(|X-\mu|\geq t) \leq \frac{\mathbb{E}|X-\mu|}{t} = O(t^{-1}). \end{equation}
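
As a quick sanity check, here is a minimal numerical sketch (my own illustration, not part of the argument above; the Exponential(1) distribution, sample size, and thresholds are arbitrary choices) comparing the empirical tail against the bound in (5).

```python
import numpy as np

# Sketch: empirically check Markov's tail bound P(|X - mu| >= t) <= E|X - mu| / t.
# The Exponential(1) distribution and the thresholds below are arbitrary choices.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)
mu = X.mean()
dev = np.abs(X - mu)

for t in [0.5, 1.0, 2.0, 4.0]:
    empirical = (dev >= t).mean()
    markov = dev.mean() / t
    print(f"t={t}: empirical tail {empirical:.4f} <= Markov bound {markov:.4f}")
```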

Let's begin improving this. Using the second moment yields a tighter bound with quadratic decay at infinity. To that end, we additionally assume that $X$ has finite variance. The key insight is that

\begin{equation} \{\omega: |X(\omega)-\mu|\geq t\} = \{\omega:|X(\omega)-\mu|^2 \geq t^2\}, \end{equation}

since squaring is monotonically increasing on the non-negative reals. Applying $\mathbb{P}$ to (6) and using Markov's inequality yields Chebyshev's inequality

\begin{equation} \begin{split} \mathbb{P}(|X-\mu|\geq t) &= \mathbb{P}(|X-\mu|^2 \geq t^2)\\ &\leq \frac{\mathbb{E}[|X-\mu|^2]}{t^2}\\ &= \frac{\mathbb{V}[X]}{t^2}\\ &= O(t^{-2}). \end{split} \end{equation}
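
To see the quadratic decay kick in, here is a small sketch along the same lines (again my own illustration, with arbitrary choices of distribution and thresholds) printing Markov's and Chebyshev's bounds side by side; for large $t$, (7) beats (5).

```python
import numpy as np

# Sketch: compare Markov's O(1/t) bound with Chebyshev's O(1/t^2) bound
# on the same tail, for an Exponential(1) sample (mean 1, variance 1).
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)
mu, var = X.mean(), X.var()
dev = np.abs(X - mu)

for t in [1.0, 2.0, 4.0, 8.0]:
    empirical = (dev >= t).mean()
    markov = dev.mean() / t
    chebyshev = var / t**2
    print(f"t={t}: empirical {empirical:.2e}, Markov {markov:.2e}, Chebyshev {chebyshev:.2e}")
```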

Our Pavlovian reaction to witnessing Chebyshev's inequality is to try higher and higher central moments in order to improve the polynomial decay. Chernoff does us one better by obtaining an exponential bound using all the central moments at once via the exponential function $\exp(z)=\sum_{k=0}^\infty \frac{z^k}{k!}$. We additionally assume the mgf of $X$ is defined in a neighborhood of $0$, and let $s>0$ be a small enough free parameter. Then,

\begin{equation} \begin{split} \mathbb{P}(X-\mu\geq t) &= \mathbb{P}(\exp(s(X-\mu))\geq \exp(st))\\ &\leq \frac{\mathbb{E}[\exp(s(X-\mu))]}{\exp(st)}\\ &= \frac{\mathbb{E}\left[\frac{\exp(sX)}{\exp(s\mu)}\right]}{\exp(st)}\\ &= \exp(-s(\mu+t))\,\mathbb{E}[\exp(sX)]\\ &= O(\exp(-st)). \end{split} \end{equation}

Note that in the fourth line, all the randomness is concentrated in $X$, allowing us to pull the deterministic factor $\exp(s\mu)$ out of the expectation. Minimizing over $s$ in the domain where the mgf of $X$ is defined yields Chernoff's inequality. It's important to note that we've dropped the absolute value in the discussion of Chernoff, so this is really a one-sided bound on the upper tail; the lower tail $\mathbb{P}(X-\mu\leq -t)$ can be handled by applying the same argument to $-X$.
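
To make the minimization over $s$ concrete, here is a short sketch (my own illustration; the Exponential(1) example, whose mgf is $1/(1-s)$ for $s<1$, and the grid search are arbitrary choices) that evaluates the Chernoff bound numerically and compares it with the exact upper tail.

```python
import numpy as np

# Sketch: Chernoff bound for the upper tail of X ~ Exponential(1), which has
# mean mu = 1 and mgf E[exp(sX)] = 1/(1 - s) for s < 1. The bound
# exp(-s(mu + t)) * E[exp(sX)] is minimized over s by a simple grid search.
mu = 1.0
s_grid = np.linspace(1e-4, 1 - 1e-4, 10_000)

def chernoff_bound(t: float) -> float:
    bounds = np.exp(-s_grid * (mu + t)) / (1.0 - s_grid)
    return bounds.min()

for t in [1.0, 2.0, 4.0, 8.0]:
    exact = np.exp(-(mu + t))  # P(X - mu >= t) = P(X >= 1 + t) = exp(-(1 + t))
    print(f"t={t}: exact tail {exact:.2e}, Chernoff bound {chernoff_bound(t):.2e}")
```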