Markov's inequality is fundamentally a statement about indicator functions of superlevel sets. Given a non-negative function $X \colon \Omega \to \mathbb{R}_{\ge 0}$ and a number $t > 0$, consider the superlevel set $S_t$ defined by
$$S_t := \{\omega : X(\omega) \ge t\}. \tag{1}$$
By construction, the minimum value that $X$ can take on $S_t$ is $t$. In other words, on $S_t$,
$$X \ge t. \tag{2}$$
If we partition $\Omega = S_t \sqcup (S_t)^c$, then we can convert (2) into an inequality on all of $\Omega$:
$$X \ge \begin{cases} t & \text{in } S_t \\ 0 & \text{in } (S_t)^c \end{cases} = t\,\mathbb{1}_{S_t}, \tag{3}$$
where $\mathbb{1}_S$ denotes the indicator function of the set $S$. Let $\nu$ denote the measure on $\Omega$. Integrating (3) yields
$$\int_\Omega X \, d\nu \ge t \int_\Omega \mathbb{1}_{S_t} \, d\nu = t\,\nu(S_t). \tag{4}$$
Dividing through by $t$ yields Markov's inequality. When $\nu = \mathbb{P}$ is a probability measure, we can interpret this as a weak tail bound. For a non-negative random variable $X$ with finite mean $\mu = \mathbb{E}[X]$ (note $\mathbb{E}|X - \mu| \le \mathbb{E}[X] + \mu = 2\mu < \infty$),
$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\mathbb{E}|X - \mu|}{t} = O(t^{-1}). \tag{5}$$
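As a sanity check, here is a small simulation (my own illustration, not part of the argument) comparing the empirical tail of an Exponential(1) variable against the Markov bound $\mathbb{E}[X]/t$:

```python
import random

random.seed(0)

# Sample from an Exponential(1) distribution, so E[X] = 1.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

# Markov's inequality: P(X >= t) <= E[X] / t for every t > 0.
for t in (2.0, 4.0, 8.0):
    tail = sum(x >= t for x in samples) / n
    bound = mean / t
    assert tail <= bound
    print(f"t = {t}: empirical tail {tail:.4f} <= Markov bound {bound:.4f}")
```

The true tail here is $e^{-t}$, which decays far faster than the $O(t^{-1})$ bound; that gap is exactly why the sharper inequalities below are worth their extra assumptions.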
Let's begin improving this. Using the second moment yields a tighter bound, with quadratic decay at infinity. To this end, we additionally assume that $X$ has finite variance. The key insight is that
$$\{\omega : |X(\omega) - \mu| \ge t\} = \{\omega : |X(\omega) - \mu|^2 \ge t^2\}, \tag{6}$$
since squaring is monotonically increasing on $[0, \infty)$. Applying $\mathbb{P}$ to (6) and using Markov's inequality yields Chebyshev's inequality:
$$\mathbb{P}(|X - \mu| \ge t) = \mathbb{P}(|X - \mu|^2 \ge t^2) \le \frac{\mathbb{E}[|X - \mu|^2]}{t^2} = \frac{\mathbb{V}[X]}{t^2} = O(t^{-2}). \tag{7}$$
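A quick empirical check of the Chebyshev bound, this time for a Uniform(0, 1) sample (again an illustration I am adding, with the distribution chosen arbitrarily):

```python
import random

random.seed(1)

# Uniform(0, 1): mean 1/2, variance 1/12.
n = 100_000
samples = [random.random() for _ in range(n)]
mu = sum(samples) / n
var = sum((x - mu) ** 2 for x in samples) / n

# Chebyshev's inequality: P(|X - mu| >= t) <= Var[X] / t^2.
for t in (0.3, 0.4, 0.45):
    tail = sum(abs(x - mu) >= t for x in samples) / n
    bound = var / t**2
    assert tail <= bound
    print(f"t = {t}: empirical tail {tail:.4f} <= Chebyshev bound {bound:.4f}")
```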
Our Pavlovian reaction to witnessing Chebyshev's inequality is to try higher and higher central moments in order to improve the polynomial decay. Chernoff does us one better by obtaining an exponential bound using all the central moments at once via the exponential function $\exp(z) = \sum_{k=0}^{\infty} \frac{z^k}{k!}$. We additionally assume that the moment generating function (mgf) of $X$ is defined in a neighborhood of $0$, and let $s > 0$ be a free parameter small enough to lie in that neighborhood. Then,
$$\begin{aligned}
\mathbb{P}(X - \mu \ge t) &= \mathbb{P}\big(\exp(s(X - \mu)) \ge \exp(st)\big) \\
&\le \frac{\mathbb{E}[\exp(s(X - \mu))]}{\exp(st)} \\
&= \frac{\mathbb{E}[\exp(-s\mu)\exp(sX)]}{\exp(st)} \\
&= \exp(-s(\mu + t))\,\mathbb{E}[\exp(sX)] \\
&= O(\exp(-st)).
\end{aligned} \tag{8}$$
Note that in the fourth line, all the randomness is concentrated in $X$, allowing us to pull the deterministic factor $\exp(-s\mu)$ out of the expectation. Minimizing over $s$ in the domain where the mgf of $X$ is defined yields Chernoff's inequality. It's important to note that we've dropped the absolute value in the discussion of Chernoff, so this is really a one-sided bound on the upper tail $\{X \ge \mu + t\}$; applying the same argument to $-X$ bounds the lower tail.
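To make the minimization over $s$ concrete, here is a small sketch (my own example) for a standard normal $X$, whose mgf $\mathbb{E}[\exp(sX)] = \exp(s^2/2)$ is defined for all $s$:

```python
import math

# Chernoff bound for X ~ N(0, 1), where mu = 0 and E[exp(sX)] = exp(s^2 / 2):
#   P(X >= t) <= exp(-s t) * E[exp(s X)] = exp(-s t + s^2 / 2)  for any s > 0.
def chernoff_bound(t: float, s: float) -> float:
    return math.exp(-s * t + s**2 / 2)

t = 3.0
# Minimize over a grid of s values; calculus puts the optimum at s = t,
# giving the Gaussian tail bound exp(-t^2 / 2).
best = min(chernoff_bound(t, 0.1 * k) for k in range(1, 101))
assert abs(best - math.exp(-t**2 / 2)) < 1e-9
print(f"optimized Chernoff bound at t = {t}: {best:.6f}")
```

For comparison, the true tail $\mathbb{P}(X \ge 3) \approx 0.00135$, while the optimized bound is $e^{-4.5} \approx 0.0111$: Chernoff captures the correct $e^{-t^2/2}$ exponential rate, losing only a lower-order factor.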