Suppose we have $m$ data points $(x_1,y_1),\dots,(x_m,y_m)$, where $y_i\in\{\pm 1\}$. Recall the primal problem of SVM in its constrained form
\begin{equation}
\begin{split}
&\min_{w,b,\epsilon} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m\epsilon_i,\\
&\text{s.t.} \quad \text{ }y_i(w^Tx_i+b) \geq 1 - \epsilon_i,\\
&\text{and}\quad \epsilon_i \geq 0.
\end{split}
\end{equation}
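For fixed $(w,b)$, the optimal slacks are $\epsilon_i = \max(0,\,1 - y_i(w^Tx_i+b))$, so the primal objective reduces to the familiar hinge-loss form. A minimal NumPy sketch of this evaluation, on made-up toy data:

```python
import numpy as np

# Made-up toy data: two points per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

def primal_objective(w, b):
    # The smallest slacks satisfying both constraints are
    # eps_i = max(0, 1 - y_i (w^T x_i + b)), giving the hinge-loss form.
    margins = y * (X @ w + b)
    eps = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * eps.sum()

# For this (w, b) every margin exceeds 1, so only the norm term remains.
print(primal_objective(np.array([0.5, 0.5]), 0.0))
```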
We set up the Lagrangian from the constraints as follows. Introduce $2m$ nonnegative multipliers $\alpha_i,\beta_i\geq 0$ and define
\begin{equation}
\begin{split}
&\mathcal{L}(w,b,\epsilon,\alpha,\beta)=\\
&\frac{1}{2}\|w\|^2 + C\sum_{i=1}^m\epsilon_i + \sum_{i=1}^m \alpha_i(1-\epsilon_i-(y_i(w^Tx_i+b))) - \sum_{i=1}^m \beta_i\epsilon_i.
\end{split}
\end{equation}
Our aim is to use the first-order optimality conditions to get equations in our old variables, substitute them into $\mathcal{L}$, and then rewrite everything in terms of our new variables. Maximizing the Lagrangian after the substitution converts the primal problem into the dual. Taking the derivative in the $w^j$ direction gives
\begin{equation}
\begin{split}
0=\frac{\partial \mathcal{L}}{\partial w^j}&=\frac{1}{2}\frac{\partial}{\partial w^j}\sum_{k=1}^n (w^k)^2-\frac{\partial}{\partial w^j}\sum_{i=1}^m \alpha_i y_i \left(\sum_{k=1}^n w^k x^k_i\right)\\
&=\frac{1}{2}\sum_{k=1}^n 2w^k\delta^k_j - \sum_{i=1}^m\sum_{k=1}^n \alpha_iy_ix_i^k\delta^k_j\\
&=w^j-\sum_{i=1}^m \alpha_iy_ix^j_i.
\end{split}
\end{equation}
Since this holds for each coordinate $w^j$, this implies
\begin{equation}
w=\sum_{i=1}^m \alpha_i y_ix_i.
\end{equation}
Taking the derivative with respect to $b$ gives
\begin{equation}
0=\frac{\partial \mathcal{L}}{\partial b}=-\sum_{i=1}^m \alpha_iy_i.
\end{equation}
Taking the derivative in the $\epsilon_j$ direction gives
\begin{equation}
\begin{split}
0=\frac{\partial \mathcal{L}}{\partial \epsilon_j}&=C\frac{\partial}{\partial \epsilon_j}\sum_{i=1}^m \epsilon_i- \frac{\partial}{\partial \epsilon_j}\sum_{i=1}^m \alpha_i\epsilon_i- \frac{\partial}{\partial \epsilon_j} \sum_{i=1}^m \beta_i\epsilon_i\\
&=C\sum_{i=1}^m\delta_{ij}-\sum_{i=1}^m\alpha_i\delta_{ij}-\sum_{i=1}^m\beta_i\delta_{ij}\\
&=C-\alpha_j-\beta_j.
\end{split}
\end{equation}
This, of course, implies
\begin{equation}
\beta_j=C-\alpha_j.
\end{equation}
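The three stationarity conditions derived above ($\partial\mathcal{L}/\partial w^j$, $\partial\mathcal{L}/\partial b$, $\partial\mathcal{L}/\partial\epsilon_j$) can be sanity-checked against finite differences; all data and multiplier values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary problem data and multipliers (illustrative values only).
m, n, C = 5, 3, 2.0
X = rng.normal(size=(m, n))
y = rng.choice([-1.0, 1.0], size=m)
alpha = rng.uniform(0, C, size=m)
beta = rng.uniform(0, 1, size=m)
w = rng.normal(size=n)
b, eps = 0.7, rng.uniform(0, 1, size=m)

def lagrangian(w, b, eps):
    return (0.5 * w @ w + C * eps.sum()
            + alpha @ (1.0 - eps - y * (X @ w + b))
            - beta @ eps)

# Analytic gradients from the derivation above.
grad_w = w - (alpha * y) @ X     # dL/dw^j = w^j - sum_i alpha_i y_i x_i^j
grad_b = -(alpha * y).sum()      # dL/db = -sum_i alpha_i y_i
grad_eps = C - alpha - beta      # dL/deps_j = C - alpha_j - beta_j

h = 1e-6
fd_b = (lagrangian(w, b + h, eps) - lagrangian(w, b - h, eps)) / (2 * h)
assert abs(fd_b - grad_b) < 1e-6
for j in range(n):
    e = np.zeros(n); e[j] = h
    fd = (lagrangian(w + e, b, eps) - lagrangian(w - e, b, eps)) / (2 * h)
    assert abs(fd - grad_w[j]) < 1e-6
for j in range(m):
    e = np.zeros(m); e[j] = h
    fd = (lagrangian(w, b, eps + e) - lagrangian(w, b, eps - e)) / (2 * h)
    assert abs(fd - grad_eps[j]) < 1e-6
print("all gradients match")
```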
Substituting these constraints in gives us the dual objective function $\mathcal{L}^*$, which we maximize below.
\begin{equation}
\begin{split}
\max_{\alpha,\beta} \mathcal{L}^* &= \max_{\alpha}\frac{1}{2} \left\|\sum_{i=1}^m \alpha_i y_i x_i \right\|^2+ \underbrace{C\sum_{i=1}^m\epsilon_i}_{(*)}- \sum_{i=1}^m (\underbrace{C}_{(*)}-\underbrace{\alpha_i}_{(**)}) \epsilon_i\\
&\quad \quad +\sum_{i=1}^m \alpha_i \underbrace{- \sum_{i=1}^m \alpha_i\epsilon_i}_{(**)} - \sum_{i=1}^m \alpha_iy_i(w^Tx_i+\cancel{b})\\
&=\max_\alpha \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^m \alpha_i\\
&\quad \quad - \sum_{i=1}^m \alpha_i y_i \left(\sum_{j=1}^m \alpha_j y_j x_j^T \right)x_i\\
&=\max_\alpha -\frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^m \alpha_i\\
&=\min_\alpha \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^m \alpha_i\\
\end{split}
\end{equation}
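The simplification above hinges on the identity $\left\|\sum_i \alpha_i y_i x_i\right\|^2 = \sum_{i,j}\alpha_i\alpha_jy_iy_jx_i^Tx_j$, which is quick to confirm numerically on arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
X = rng.normal(size=(m, n))
y = rng.choice([-1.0, 1.0], size=m)
alpha = rng.uniform(0, 1, size=m)

# Left-hand side: squared norm of w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
lhs = w @ w

# Right-hand side: quadratic form with Q_ij = y_i y_j x_i^T x_j.
Q = (y[:, None] * X) @ (y[:, None] * X).T
rhs = alpha @ Q @ alpha

assert np.isclose(lhs, rhs)
print(lhs, rhs)
```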
For our constraints, one comes from $\sum_{i=1}^m\alpha_iy_i=0$. A second comes from $\alpha_i\geq 0$. The last comes from
\begin{equation}
C = \alpha_i+\beta_i\geq \alpha_i,
\end{equation}
since $\beta_i\geq 0$. This gives the full dual problem of SVM as
\begin{equation}
\begin{split}
&\min_\alpha \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^m \alpha_i\\
&\text{s.t.} \quad \sum_{i=1}^m\alpha_iy_i=0,\\
&\text{and}\quad 0\leq \alpha_i\leq C.
\end{split}
\end{equation}
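As a sanity check of the whole derivation, the dual can be handed to a general-purpose solver. The sketch below uses scipy.optimize.minimize (SLSQP) on made-up linearly separable data, then recovers $w=\sum_i\alpha_iy_ix_i$ and $b$ from a support vector with $0<\alpha_i<C$; in practice a dedicated QP or SMO solver would be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

Q = (y[:, None] * X) @ (y[:, None] * X).T  # Q_ij = y_i y_j x_i^T x_j

def dual_objective(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(
    dual_objective,
    x0=np.zeros(m),
    jac=lambda a: Q @ a - np.ones(m),
    bounds=[(0.0, C)] * m,                               # 0 <= alpha_i <= C
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x
w = (alpha * y) @ X
# Recover b from a support vector i with 0 < alpha_i < C,
# where y_i (w^T x_i + b) = 1 holds with equality.
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))
b = y[sv] - w @ X[sv]
print(w, b)
print(np.sign(X @ w + b))  # should reproduce y
```

On this toy set the support vectors are the two inner points, and the recovered hyperplane separates the classes exactly.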