SVM, Slack Variable and Kernel

Maximize margin

The basic idea of SVM is to maximize the margin between the separating hyperplane and the nearest training points, the support vectors.

The distance from a point $\boldsymbol{x}$ to the hyperplane can be expressed as $r=\frac{\left|\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b\right|}{\|\boldsymbol{w}\|}$.
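Purely for illustration, this distance formula can be checked with a minimal NumPy snippet (the hyperplane and point below are made-up toy values):

```python
import numpy as np

# Distance from a point x to the hyperplane w^T x + b = 0:
# r = |w^T x + b| / ||w||   (toy values, purely illustrative)
w = np.array([3.0, 4.0])
b = -1.0
x = np.array([2.0, 1.0])

r = np.abs(w @ x + b) / np.linalg.norm(w)
print(r)  # |3*2 + 4*1 - 1| / 5 = 9 / 5 = 1.8
```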

For the binary classification task with labels $y \in \{-1,+1\}$, a correctly separating hyperplane satisfies $y_{i}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b\right)>0$ for every training point $\left(\boldsymbol{x}_{i}, y_{i}\right)$.

Since $\boldsymbol{w}$ and $b$ can be rescaled without changing the hyperplane, we can impose the condition

$$\left\{\begin{array}{ll}\boldsymbol{w}^{\prime \mathrm{T}} \boldsymbol{x}_{i}+b^{\prime} \geqslant+1, & y_{i}=+1, \\ \boldsymbol{w}^{\prime \mathrm{T}} \boldsymbol{x}_{i}+b^{\prime} \leqslant-1, & y_{i}=-1,\end{array}\right.$$

where $\boldsymbol{w}^{\prime}=\frac{\boldsymbol{w}}{r}$ and $b^{\prime}=\frac{b}{r}$ (the primes are dropped in what follows).

In the margin-maximization process, only the support vectors, i.e. the training points that lie exactly on the margin boundaries, matter. These vectors satisfy the condition above with equality, and therefore the margin can be written as $\gamma=\frac{2}{\|\boldsymbol{w}\|}$.

Maximizing the margin is therefore equivalent to

$$\min _{\boldsymbol{w}, b} \frac{1}{2}\|\boldsymbol{w}\|^{2} \quad \text { s.t. } \quad y_{i}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b\right) \geqslant 1, \quad i=1, \ldots, m.$$
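To see this objective in action, here is a small sketch assuming scikit-learn's SVC; the toy data and the very large $C$ (to approximate the hard margin) are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; a very large C approximates the hard-margin SVM.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))  # ~2*sqrt(2) for this data
```

For this data the reported margin should be close to the distance between the two closest points of opposite classes, $(1,1)$ and $(3,3)$.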

Lagrangian

To solve this constrained optimization problem, we introduce Lagrange multipliers $\alpha_{i} \geqslant 0$ and obtain the Lagrangian

$$L(\boldsymbol{w}, b, \boldsymbol{\alpha})=\frac{1}{2}\|\boldsymbol{w}\|^{2}+\sum_{i=1}^{m} \alpha_{i}\left(1-y_{i}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b\right)\right).$$

Dual Form

Before applying kernel functions, the dual representation is introduced.

Using the Lagrange multipliers, we convert the problem from $\min _{\boldsymbol{w}, b} \frac{1}{2}\|\boldsymbol{w}\|^{2}$ to $\min _{\boldsymbol{w}, b} \max _{\boldsymbol{\alpha}} L(\boldsymbol{w}, b, \boldsymbol{\alpha})$, which by duality is equivalent to

$$\max _{\boldsymbol{\alpha}} \min _{\boldsymbol{w}, b} L(\boldsymbol{w}, b, \boldsymbol{\alpha}).$$

To minimize the inner function, we set the partial derivatives with respect to $\boldsymbol{w}$ and $b$ to zero, which gives

$$\boldsymbol{w}=\sum_{i=1}^{m} \alpha_{i} y_{i} \boldsymbol{x}_{i}, \qquad 0=\sum_{i=1}^{m} \alpha_{i} y_{i}.$$

Substituting these results back into the original $L(\boldsymbol{w}, b, \boldsymbol{\alpha})$, the dual representation becomes

$$\max _{\boldsymbol{\alpha}} \sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j} \quad \text { s.t. } \quad \sum_{i=1}^{m} \alpha_{i} y_{i}=0, \quad \alpha_{i} \geqslant 0.$$
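The stationarity condition $\boldsymbol{w}=\sum_{i} \alpha_{i} y_{i} \boldsymbol{x}_{i}$ can be checked numerically. A minimal sketch assuming scikit-learn, whose `dual_coef_` attribute stores $\alpha_{i} y_{i}$ for the support vectors (the random toy data are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Verify w = sum_i alpha_i * y_i * x_i on random two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# non-support vectors have alpha_i = 0 and drop out of the sum.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # expected: True
```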

Then we can apply the kernel, because the training data enter the dual only through the inner products $\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j}$.

Assume a kernel function of the form

$$k(\boldsymbol{x}, \boldsymbol{y})=\boldsymbol{\phi}(\boldsymbol{x})^{\mathrm{T}} \boldsymbol{\phi}(\boldsymbol{y}).$$

Writing the Gram matrix $K$ with entries $K_{i j}=\boldsymbol{\phi}\left(\boldsymbol{x}_{i}\right)^{\mathrm{T}} \boldsymbol{\phi}\left(\boldsymbol{x}_{j}\right)=k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)$, the dual representation becomes

$$\max _{\boldsymbol{\alpha}} \sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i} \alpha_{j} y_{i} y_{j} K_{i j},$$

with the constraints $\sum_{i=1}^{m} \alpha_{i} y_{i}=0$ and $\alpha_{i} \geqslant 0$.


The advantages of using the kernel trick: see the blog post Kernel Function.

Some common forms of kernels: see the blog post Kernel Function.
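To make the kernel substitution concrete, here is a short sketch assuming scikit-learn; the RBF kernel and the XOR-like toy labels are illustrative choices. It builds the Gram matrix $K$ explicitly and trains an SVM with a precomputed kernel:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, Y, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # a non-linearly-separable pattern

K_train = rbf_kernel(X, X)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_new = rng.normal(size=(5, 2))
K_new = rbf_kernel(X_new, X)                 # kernel between new points and training points
print(clf.predict(K_new))
```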


Slack Variable

When the two classes cannot be completely separated, slack variables are introduced, turning the hard margin into a soft margin. Only the points that fall inside the soft margin (or beyond it) are the ones we need to pay attention to, so in what follows only the slack variables of such points are nonzero.

The classification constraints are replaced with

$$y_{n}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{n}+b\right) \geqslant 1-\xi_{n}, \qquad \xi_{n} \geqslant 0, \qquad n=1, \ldots, N.$$

There is one slack variable $\xi_{n}$ for each training data point.

Data points with $\xi_{n}=0$ are correctly classified and lie either on the margin or on the correct side of the margin. Points for which $0<\xi_{n} \leqslant 1$ lie inside the margin but on the correct side of the decision boundary, and points for which $\xi_{n}>1$ lie on the wrong side of the decision boundary and are misclassified.
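These three cases are easy to inspect numerically. A small sketch assuming scikit-learn (for a linear SVM, `decision_function` returns $\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b$; the overlapping toy data are made up) that computes $\xi_{n}=\max \left(0,1-y_{n}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{n}+b\right)\right)$ and counts each case:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping two-class toy data, so some slack variables must be nonzero.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# xi_n = max(0, 1 - y_n * (w^T x_n + b))
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

print("xi = 0 (on or outside the margin, correct side):", np.sum(xi == 0))
print("0 < xi <= 1 (inside the margin, correct side):  ", np.sum((xi > 0) & (xi <= 1)))
print("xi > 1 (wrong side, misclassified):             ", np.sum(xi > 1))
```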

Slack variables allow for overlapping class distributions; however, this framework is still sensitive to outliers, because the penalty for misclassification increases linearly with $\xi$. (A misclassified outlier therefore gets a very large value of $\xi$.)

Therefore, the function we are going to minimize is

$$C \sum_{n=1}^{N} \xi_{n}+\frac{1}{2}\|\boldsymbol{w}\|^{2},$$

where the parameter $C>0$ controls the trade-off between the slack variable penalty and the margin. Because any point that is misclassified has $\xi_{n}>1$, it follows that $\sum_{n} \xi_{n}$ is an upper bound on the number of misclassified points. The parameter $C$ is therefore analogous to (the inverse of) a regularization coefficient, because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit $C \rightarrow \infty$, we recover the earlier hard-margin SVM.
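The effect of $C$ is easy to see empirically. A short sketch assuming scikit-learn that fits a linear soft-margin SVM on overlapping toy data for several values of $C$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    errors = np.sum(clf.predict(X) != y)
    print(f"C={C:g}: margin={margin:.3f}, training errors={errors}, "
          f"support vectors={clf.support_.size}")
```

Typically a small $C$ gives a wider margin with more margin violations, while a large $C$ shrinks the margin and behaves more like the hard-margin case.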

The corresponding Lagrangian is given by

$$L(\boldsymbol{w}, b, \boldsymbol{\xi}, \boldsymbol{a}, \boldsymbol{\mu})=\frac{1}{2}\|\boldsymbol{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n}-\sum_{n=1}^{N} a_{n}\left(y_{n}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{n}+b\right)-1+\xi_{n}\right)-\sum_{n=1}^{N} \mu_{n} \xi_{n},$$

where $a_{n} \geqslant 0$ and $\mu_{n} \geqslant 0$ are Lagrange multipliers, together with the corresponding KKT conditions (see the book).

We can also convert this to the dual form; when setting the derivatives to zero, the multiplier $\mu_{n}$ is eliminated and we obtain the additional box constraint $0 \leqslant a_{n} \leqslant C$, while the dual objective keeps the same form as before.
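The box constraint can be checked directly on a fitted model. Again a sketch assuming scikit-learn, where `dual_coef_` stores $a_{n} y_{n}$ for the support vectors, so its absolute value equals $a_{n}$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

a_n = np.abs(clf.dual_coef_).ravel()           # |a_n * y_n| = a_n since |y_n| = 1
print(np.all((a_n >= 0) & (a_n <= C + 1e-9)))  # expected: True (box constraint)
```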

(There is a lot more that could be written down, which I will omit here. For more details, see Chapter 7.1.1 of Bishop's PRML.)

Reference

  1. 周志华, 《机器学习》 (Machine Learning)
  2. C. M. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 7.1.1