EM Algorithm

Posted on 2019-05-28 | In Machine Learning

EM algorithm is used to solve the problems which have the latent variable. Like the example mentioned in the reference 1 about the heights of boys and girls.

Before explaining the EM algorith, we review the generative models. See the blog of Discriminative and Generative models and Naive Bayes.

Building Probabilistic Models

To describe a system with uncertainty we use random variables, $X$, $Y$, $Z$, etc.

The variables are described by probability mass function $\mathbb{P}(X, Y, Z)$ or if our variables are continuous, but probability densities $f_{X, Y, Z}(x, y, z)$. We build in dependencies in this joint distribution.

We often think of our observations as given and the predictions as random variables. The object is to find a probability $\mathbb{P}(C | \boldsymbol{x})$. Modeling this directly is what we know as discriminative model.

In generative models, we think of the problem as the joint process of generating the features and outputs together. This leads to a joint distribution $\mathbb{P}(\boldsymbol{X}, Y)$ where $X$ are your features and $Y$ is your output you are trying to predict. Write the approach like

$\mathbb{P}(Y | \boldsymbol{X})=\frac{\mathbb{P}(\boldsymbol{X}, Y)}{\mathbb{P}(\boldsymbol{X})}=\frac{\mathbb{P}(\boldsymbol{X}, Y)}{\sum_{Y} \mathbb{P}(\boldsymbol{X}, Y)}$

Latent Variable

Latent variables: The random variables involed in our models, but we don’t observe and we don’t care about.

If we have a latent variable $Z$ and observed variable $\boldsymbol{X}$ and we are predicting a variable $Y$ then we would marginalise over the latent variable

$\mathbb{P}(\boldsymbol{X}, Y)=\sum_{Z} \mathbb{P}(\boldsymbol{X}, Y, Z)$

Suppose we were observing the decays from two types of short-lived particle. $X$ is our observation. Assume $X$ is normally distributed with unknown means and variances: $\Theta=\left\{\mu_{1}, \sigma_{1}^{2}, \mu_{2}, \sigma_{2}^{2}\right\}$. Let $Z\in \{0,1\}$ be an indicator that it is particle 1.

The probability of $X$ is given by

$f(X | Z, \mathbf{\Theta})=Z \mathcal{N}\left(X | \mu_{1}, \sigma_{1}^{2}\right)+(1-Z) \mathcal{N}\left(X | \mu_{2}, \sigma_{2}^{2}\right)$

Note:
$\begin{aligned} f(X | \Theta) &=\sum_{Z \in\{0,1\}} f(X, Z | \Theta)=\sum_{Z \in\{0,1\}} f(X | Z, \Theta) \mathbb{P}(Z) \\ &=\mathbb{E}_{Z}[f(X | Z, \Theta)]=p \mathcal{N}\left(X | \mu_{1}, \sigma_{1}^{2}\right)+(1-p) \mathcal{N}\left(X | \mu_{2}, \sigma_{2}^{2}\right) \end{aligned}$

Then the likelihood of our observed data

$f(\mathcal{D} | \Theta)=\prod_{X \in \mathcal{D}} f(X | \Theta)$

The likelihood function is a non-linear function of the parameters so cannot be immediately maximised. We have a difficulty in that our latent variable $Z$ will depend on the parameter $\Theta$, and our likelihood will depend on the latent variable.

Solution:

$\Theta^{(t+1)}=\underset{\Theta}{\operatorname{argmax}} \sum_{Z} \mathbb{P}\left(Z | \mathcal{D}, \Theta^{(t)}\right) \log (f(\mathcal{D} | Z, \Theta))$

proceed iteratively by maximising the expected log-likelihood with respect to the current set of parameters.

EM for Mixture of Gaussians

(EM算法是求基于隐变量上的似然函数的期望。就像这样的形式$p\cdot log(f(D|Z,\Theta))+(1-p)\cdot log(f(D|Z,\Theta))$)

(to be continued)

The algorithm (Bishop PRML):

Initialize the $\boldsymbol{\mu}_{k}$, covariance $\boldsymbol{\Sigma}_{k}$ and mixing coefficients $\pi_{k}$, and evaluate the initial value of the log likelihood.
E step. Evaluate the responsibilities using the current parameter values
$\gamma\left(z_{n k}\right)=\frac{\pi_{k} \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{k}, \mathbf{\Sigma}_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j}\right)}$
（计算每个点n对应到k隐变量的概率）
M step. Re-estimate the parameters using the current responsibilities.
$\begin{aligned} \boldsymbol{\mu}_{k}^{\mathrm{new}} &=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n} \\ \boldsymbol{\Sigma}_{k}^{\mathrm{new}} &=\frac{1}{N_{k}} \sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\mathrm{new}}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}^{\mathrm{new}}\right)^{\mathrm{T}} \\ \pi_{k}^{\mathrm{new}} &=\frac{N_{k}}{N} \end{aligned}$
where
$N_{k}=\sum_{n=1}^{N} \gamma\left(z_{n k}\right)$
（重新计算每类隐变量k的均值$\mu$和发那个方差矩阵$\Sigma$，以及重新分配隐变量分布的概率）
Evaluate the log likelihood
$\ln p(\mathbf{X} | \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})=\sum_{n=1}^{N} \ln \left\{\sum_{k=1}^{K} \pi_{k} \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}\right)\right\}$
and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied return to step 2.

（计算log 似然函数，这个在最开始也计算了一次。如果没有达到要求，重新迭代再进行计算）

(not fully understand…)

Example of Coin Tossing

MLE: there is no latent variable. The event are all belong to one distribution.

EM: There are latent variables. And we don’t know the $p(Z)$ where $Z$ is the latent variable.

(See the example mentioned in reference 1)

Reference

zhihu 人人都懂EM算法
adam slide Generative Models
Bishop PRML Chapter 9.2

Bayes Rule, Prior and Posterior

Posted on 2019-05-27 | In Machine Learning

Bayes’ Rule

The form of bayes’ rule is

$\mathbb{P}\left(\mathcal{H}_{i} | \mathcal{D}\right)=\frac{\mathbb{P}\left(\mathcal{D} | \mathcal{H}_{i}\right) \mathbb{P}\left(\mathcal{H}_{i}\right)}{\mathbb{P}(\mathcal{D})}$

$\mathbb{P}\left(\mathcal{H}_{i} | \mathcal{D}\right)$ is the posterior probability of a hypothesis $\mathcal{H}_{i}$ (i.e. the probability of $\mathcal{H}_{i}$ after we know the data)
$\mathbb{P}\left(\mathcal{D} | \mathcal{H}_{i}\right)$ is the likelihood of the data given the hypothesis. Note, that we calculated this from the forward problem
$\mathbb{P}\left(\mathcal{H}_{i}\right)$ is the prior probability (i.e. the probability of $\mathcal{H}_{i}$ before we know the data)
$\mathbb{P}(\mathcal{D})$ is the evidence. It is the normalising constant given by
$\mathbb{P}(\mathcal{D})=\sum_{i=1}^{n} \mathbb{P}\left(\mathcal{H}_{i}, \mathcal{D}\right)=\sum_{i=1}^{n} \mathbb{P}\left(\mathcal{D} | \mathcal{H}_{i}\right) \mathbb{P}\left(\mathcal{H}_{i}\right)$

In most of our task, what we want is the posterior probability.

The bayes’ rule converts this into the forward problem.

Prior and Posterior

(to be continued)

When the posterior is the same as the prior then the likelihood and prior distributions are said to be conjugate. The prior then is the conjugate prior.

In the xxxxxx, we want to maximize the posterior probabilities(MAP), which is different with the fully bayesian inference.(2017-2018 exam paper AML)

MAP and MLE

MAP: maximize the posterior distribution

MLE: maximum likelihood estimation

In MAP, we should add a prior distribution and by adding observations and then to maximize the posterior distribution $p(w|X)$ to get the parameter.

最大似然估计是求参数$\theta$, 使似然函数$P(x_0|\theta)$最大。最大后验概率估计则是想求$\theta$使$P(x_0|\theta)P(\theta)$最大,求得的$ \theta $不单单让似然函数大，$ \theta $自己出现的先验概率(也是得到的后验概率）也得大.

Reference

LSTM

Posted on 2019-05-27 | Edited on 2019-07-30 | In Deep Learning

avatar

The Structure of the LSTM, which is shown in the image above.

Items in the Structure

Inputs $\boldsymbol{z}(t)=(\boldsymbol{x}(t), \boldsymbol{y}(t-1))$

The input is combined the input $x(t)$ and the output of last time $y(t-1)$
Network Updates ($W_*$ are the free parameters)
$\begin{array}{ll}{\boldsymbol{f}(t)=\boldsymbol{\sigma}\left(\mathbf{W}_{f} \boldsymbol{z}(t)\right)} & {\boldsymbol{g}(t)=\boldsymbol{\sigma}\left(\boldsymbol{W}_{g} \boldsymbol{z}(t)\right)} \\ {\boldsymbol{h}(t)=\tanh \left(\boldsymbol{W}_{h} \boldsymbol{z}(t)\right)} & {\boldsymbol{o}(t)=\boldsymbol{\sigma}\left(\boldsymbol{W}_{o} \boldsymbol{z}(t)\right)}\end{array}$
$\boldsymbol f(t)$ is the forget gate, $\boldsymbol g(t)$ decide the whether and what to remember from the input of this time, the $\boldsymbol h(t)$ is the input block, the $\boldsymbol o(t)$ is the output gate.
Long-term memory update
$\boldsymbol{c}(t)=\boldsymbol{f}(t) \otimes \boldsymbol{c}(t-1)+\boldsymbol{g}(t) \otimes \boldsymbol{h}(t)$
It can be seen as the influence of the forget gate on the hidden state plus the memory update.
(对隐含变量是否进行遗忘，加上这次的选择记忆阶段得到下一次的隐含状态的输入$c(t)$)
Output $\boldsymbol{y}(t)=\boldsymbol{o}(t) \otimes \tanh (\boldsymbol{c}(t))$

The product of the output gate and the $tanh(\boldsymbol c(t))$.

Note

We can train an LSTM by unwrapping it in time.

It involves four dense layers with sigmoidal(or tanh) output: those gates

The LSTM is typucally very slow to train.

There are a few variants of LSTMs, but all are very similar. The most popular is probably Gated Recurrent Unit (GRU).

(to be continued)

GRU

LSTM有很多变体，其中较大改动的是Gated Recurrent Unit (GRU)，这是由 Cho, et al. (2014)提出。它将忘记门和输入门合成了一个单一的更新门。同样还混合了细胞状态和隐藏状态，和其他一些改动。最终的模型比标准的 LSTM模型要简单。效果和LSTM差不多，但是参数少了1/3，不容易过拟合。

avatar

Reference

Adam lstm slide COMP6208
RNN, LSTM and GRU

Kernel Tricks

Posted on 2019-05-27 | In Machine Learning

Kernels

Many linear parametric models can be re-written into an equivalent dual representation in which the predictions are also based on linear combinations of a kernel function evaluated at the training data points. As we shall see, for models which are based on a fixed nonlinear feature space mapping $\phi(x)$, the kernel function is given by the relation

$k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)$

Dual Representation and Kernel: See an classical example of SVM from this blog.

Some common forms of kernels:

Linear Kernel 线性核 $\kappa \left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) =\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j}$
Polynomial Kernel 多项式核 $\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) =\left(\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j}\right)^{d}$
Gaussian Kernel 高斯核 $\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) =\exp \left(-\frac{\left|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right|^{2}}{2 \sigma^{2}}\right)$
Laplace Kernel 拉普拉斯核 $\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) =\exp \left(-\frac{\left|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right|}{\sigma}\right)$
Sigmoid Kernel $\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) =\tanh \left(\beta \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j}+\theta\right)$

By replacing the $-\frac{1}{2\sigma^2}$ with $\lambda$ in the Gaussian Kernel, we can get the RBF kernel, which can be express as $\exp \left(-\gamma|\boldsymbol{x}-\boldsymbol{y}|^{2}\right)$.

Note that the feature vector that corresponds to the Gaussian kernel has infinite dimensionality.

Kernel Tricks

So, what is kernel?

In machine learning, a kernel is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem.

The kernel trick allows us to map data into high-dimensional feature space $\boldsymbol{x} \rightarrow \phi(\boldsymbol{x})$. This can be carried out for sufficiently simple machines where parameter optimisation involve the dot product $\phi^{\mathbf{T}}(\boldsymbol{x}) \phi(\boldsymbol{y})$. The kernel function is positive semi-definite and the decomposition is always possible (see the properties in the end of this blog). In fact, we never need to explicitly calculate the extended features $\phi(\boldsymbol{x})$. This often makes working in the extended feature space very efficient as $K(\boldsymbol{x},\boldsymbol{y})$ may be quick to calculate.

Constructing Kernels

One approach to build valid kernel is to choose a feature space mapping $\phi(x)$ and use it to find the corresponding kernel. Here the kernel function is defined for one-dimensional input space by

$k\left(x, x^{\prime}\right)=\phi(x)^{\mathrm{T}} \boldsymbol{\phi}\left(x^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(x) \phi_{i}\left(x^{\prime}\right)$

where $\phi(x)$ is the basis function.

Another approach is to construct the kernel directly. In this way, we should have a method to test whether the kernel we build is valid. A necessary and sufficient condition for a function $k(x,x)$ to be a valid kernel is that the Gram matrix $K$, whose elements are given by $k(x,x)$, should be positive semi-definite for all possible choices of the set $x_n$. Note that a positive semi-definite matrix is not the same thing as a matrix whose elements are nonnegative.

One powerful technique for constructing new kernels is to build out of simple kernels as building blocks. See the properties from the PRML book in $P_{296}$.

Some questions about the Kernel

Why it is important that the kernel function is positive semi-definite?
Kernel functions need to be positive semi-definite so that they have sensible (nonnegative) distances. That is the margins are positive.
Three properties that a positive semi-definite kernel should have:
- The eigenvalues of a positive semi-definite kernel function are nonnegative.
- A positive semi-definite kernel function can always be written as $K(x,y)=\sum_i \phi_i(x)\phi_i(y)$ for some sets of real functions $\phi_i(x)$
- The quadratic form satisfies $\int f(\boldsymbol{x}) K(\boldsymbol{x}, \boldsymbol{y}) f(\boldsymbol{y}) \mathrm{d} \boldsymbol{x} \mathrm{d} \boldsymbol{y} \geq 0$ for any real function $f(x)$.
Why kernel trick allows an SVM to seperate data points that are not linearly separable?
The kernel trick projects data into the extended feature space (the space of engenfunctions of the kernel). Although an SVM finds a linear separating plane in this extended space, as the extended features are typically a non-linear function of the original features. This corresponds to finding a non-linear separating surface in the original space.

Reference

周志华《机器学习》
Bishop PRML Chapter 6

Hexo Local Search not Respond

Posted on 2019-05-27 | In Hexo

When your article have some unlegal characters, the search function will not work.
See the https://segmentfault.com/q/1010000013084615 discussion for more details.

To debug, go to the url to see which blog rise this issue. http://localhost:4000/search.xml

(通过注释并且在本地调试，发现是kernels trick这个文件有问题。)

Solution

应该是有非法字符，local search无法加载出来非utf-8编码的字符。通过注释排除法，找到问题根源。

Reference

Bias and Variance

Posted on 2019-05-27 | In Machine Learning

Bias and Variance

A good learner classifier should have a good generalisation error.

Generalisation: how well do we do on unseen data as opposed to the training data

The problems in the Machine Learning can be over-constrained and under-constrained.

Over-constrained: We have conflicting data to deal with. There are more equations than variables. In this case, the learner has insufficient flexibility to correctly predict all the training data. To solve this problem, we can minimise an error function, which means that we find a machine that explained the training data as best it can.
Under-constrained: There are many possible solutions that are consistent with the data . Need to choose a plausible solution.

Bias: the generalisation performance of the mean machine.

$\hat{f}_{m}(\boldsymbol{x})=\mathbb{E}_{\mathcal{D}}\left[\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)\right]$

which $\hat{f}_{m}(\boldsymbol{x})$ is the mean predictor(machine) value. And the bias is defined as

$B=\sum_{x \in \mathcal{X}} p(\boldsymbol{x})\left(\hat{f}_{m}(\boldsymbol{x})-f(\boldsymbol{x})\right)^{2}$

（可以看成最终训练出的分类器在训练集上进行预测，得到的所有值的平均值与每一个原target计算error）

Variance: measures the expected variation from the average machine due to the fluctuations caused by the using a finite training set.

$V=\mathbb{E}_{\mathcal{D}}\left[\sum_{x \in \mathcal{X}} p(\boldsymbol{x})\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-\hat{f}_{m}(\boldsymbol{x})\right)^{2}\right]$

（训练出来的分类器，对每个训练集的输入进行预测，计算这些值的分布的方差）

Decomposition

The formulas of bias and variance are already defined above. Here we are going to show the decomposition.

The expected generalisation（平均泛化误差） is written as

$\overline{E}_{G}=\mathbb{E}_{\mathcal{D}}\left[E_{G}\left(\boldsymbol{w}_{\mathcal{D}}\right)\right]=\mathbb{E}_{\mathcal{D}}\left[\sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x})\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-f(\boldsymbol{x})\right)^{2}\right]$ $=\sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x}) \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-f(\boldsymbol{x})\right)^{2}\right]$ $=\sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x}) \mathbb{E}_{\mathcal{D}}\left[\left(\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-\hat{f}_{m}(\boldsymbol{x})\right)+\left(\hat{f}_{m}(\boldsymbol{x})-f(\boldsymbol{x})\right)\right)^{2}\right]$ $\begin{aligned}=\sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x})( & \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-\hat{f}_{m}(\boldsymbol{x})\right)^{2}+\left(\hat{f}_{m}(\boldsymbol{x})-f(\boldsymbol{x})\right)^{2}\right] \\ &+\mathbb{E}_{\mathcal{D}}\left[2\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-\hat{f}_{m}(\boldsymbol{x})\right)\left(\hat{f}_{m}(\boldsymbol{x})-f(\boldsymbol{x})\right)\right] ) \end{aligned}$

The second term will vanish sicne the $\hat{f}_{m}(\boldsymbol{x})=\mathbb{E}_{\mathcal{D}}\left[\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)\right]$. Finally we can rewrite the generalisation formula as

$\begin{aligned} \mathbb{E}_{\mathcal{D}}\left[E_{G}\left(\boldsymbol{w}_{\mathcal{D}}\right)\right]=\mathbb{E}_{\mathcal{D}} &\left[\sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x})\left(\hat{f}\left(\boldsymbol{x} | \boldsymbol{w}_{\mathcal{D}}\right)-\hat{f}_{m}(\boldsymbol{x})\right)^{2}\right] \\+\sum & \sum_{\boldsymbol{x} \in \mathcal{X}} p(\boldsymbol{x})\left(\hat{f}_{m}(\boldsymbol{x})-f(\boldsymbol{x})\right)^{2}=V+B \end{aligned}$

Bias-Variance Dilemma

The composition mentioned above encodes how sensitive the machine is to the data.

The dilemma arises because a simple machine will typically have a large bias, but small variance, while a complicated machine will have a small bias but large variance.

Reference

Adam slide01 COMP6208
Additional reading Bishop PRML Chapter 3.2

Radial Basis Function, RBF network

Posted on 2019-05-27 | In Machine Learning

Radial basis function

radial basis function is that each basis function depends only on the radial distance (typically Euclidean) from a centre $\mu_j$, so that $\phi_j(x)=h(x-\mu_j)$.

$f(x)$ is expressed as a linear combination of radial basis functions, one
centred on every data point

$f(\mathbf{x})=\sum_{n=1}^{N} w_{n} h\left(\left\|\mathbf{x}-\mathbf{x}_{n}\right\|\right)$

The values of the coefficients ${w_n}$ are found by least squares.

RBF Network

(to be continued)

Question: What are the similarities and differences between MLP, RBF networks and SVMs?

All three techniques are based on the perceptron. In MLPs, the earlier layers are perceptrons, in RBFs they are radial basis functions and SVMs thet are the features corresponding to the eigenvalues of a kernel.
All three can be used for regression and classification.
MLPs are trained using back-propagation of errors. They have non-unique solution. Complexity depends on nunber of hidden nodes. Liable to over-fit the training data. Often use ad hoc methods such as early stopping to stop over-fitting. Can have many output nodes.
RBFs typically use unsupervised learning to choose the centres for the input layer. The labelled data used to train the final layer(a perceptron). Training is fast. Can have many output nodes. Often use regulariser on the output layer. The solution found is unique.
SVMs use a kernel function to perform a mapping into a very high dimensional feature space. An optimally stable perceptron is used in the feature space. This controls the capacity of the learning machine reducing the problem of over-fitting. The learning algorithm uses quadratic optimisation. The computation complexity grows as the number training patterns cubed. For very large datasets SVMs can become impractical. The solution found is unque.

Reference

ML学习笔记之——径向基网络
Bishop PRML chapter 6.3
question from (COMP3008 2009-2010 Q4)

SVM, Slack Variable and Kernel

Posted on 2019-05-26 | Edited on 2019-05-27 | In Machine Learning

Maximize margin

The basic idea of SVM is to maximize the margin between the support vectors and the hyperplane.

The margin can be expressed as $r=\frac{\left|\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b\right|}{|\boldsymbol{w}|}$

For the binary classification task $y\in(-1,+1)$, the equation satisfies:

$y_i(w^Tx_i+b)\geqslant r\|w\|$

Since the $w$ and $b$ can be scaled, we can get the condition

$y_i(w'^Tx_i+b')\geqslant 1$

where $w’=\frac{w}{r}, b’=\frac{b}{r}$.

In the maximizing margin process, we only consider the support vectors which locate on the margin of hyperplane. These vectors satisfy the equation, therefore, the margin can be written as $\gamma=\frac{2}{|\boldsymbol{w}|}$.

Maximizing the margin is equivalent to

$\min _{\boldsymbol{w}, b} \frac{1}{2}\|\boldsymbol{w}\|^{2}$ $\text { s.t. } \quad y_{i}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b\right) \geqslant 1, \quad i=1,2, \ldots, m$

Lagrangian

To solve the optimization problem, apply the Lagrangian multiplier, and we get the form of the question:

$L(w,b,\alpha)=\frac{1}{2}\|w\|^2-\sum_i^p \alpha_i(y_i(w^Tx_i+b-1))$ $\begin{array}{l}{\text { s.t. } \sum_{i=1}^{m} \alpha_{i} y_{i}=0} \\ {\alpha_{i} \geqslant 0, \quad i=1,2, \ldots, m}\end{array}$

Dual Form

Before applying the kernel functions, the dual representations are introduced.

Using the lagrandian multiplier, we convert our problem from $\min _{\boldsymbol{w}, b} \frac{1}{2}|\boldsymbol{w}|^{2}$ to $\min _{\boldsymbol{w}, b}\max_{\boldsymbol{\alpha}} L(w,b,\alpha)$. And it is equivalent to

$\max_{\boldsymbol{\alpha}}\{ \min _{\boldsymbol{w}, b} L(w,b,\alpha)\}$

To minimize the inner loss function, we calculate the partial derivative of $w$ and $b$, and make them equal to $0$, an then we get the the

$\begin{aligned} \boldsymbol{w}=& \sum_{i=1}^{m} \alpha_{i} y_{i} \boldsymbol{x}_{i} \\ 0 &=\sum_{i=1}^{m} \alpha_{i} y_{i} \end{aligned}$

Substitute the results back to the origin $L(w,b,\alpha)$ loss function, now the dual representation is

$\max _{\boldsymbol{\alpha}} \left ( \sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{i} \alpha_{j} y_{i} y_{j} \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x}_{j} \right)$ $\begin{array}{l}{\text { s.t. } \sum_{i=1}^{m} \alpha_{i} y_{i}=0} \\ {\alpha_{i} \geqslant 0, \quad i=1,2, \ldots, m}\end{array}$

Then we can apply the kernel.

Assum the kernel function which has the form like this

$\kappa\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left\langle\phi\left(\boldsymbol{x}_{i}\right), \phi\left(\boldsymbol{x}_{j}\right)\right\rangle=\phi\left(\boldsymbol{x}_{i}\right)^{\mathrm{T}} \phi\left(\boldsymbol{x}_{j}\right)$

Write the matrix form $K=\boldsymbol\phi(\boldsymbol x)\boldsymbol\phi(\boldsymbol y)$, and rewrite the dual representation as the matrix form

$\boldsymbol c^T \boldsymbol\alpha-\frac{1}{2}\boldsymbol\alpha^T K \boldsymbol\alpha$

with the constraints.

The advantage of using the kernel trick: blog Kernel Function

Some common form of kernels: blog Kernel Function

Slack Variable

对于不能完全分隔开的情况，引入松弛变量，使硬间隔(hard margin)转换成弱间隔(soft margin)。落在软间隔中的点才是我们要关注的东西，所以之后的slack variable只有在间隔内的才是不等于0的。

The form of classification constraints are replaced with

$\begin{array}{c}{\text { s.t. } \quad y_{i}\left(\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b\right) \geqslant 1-\xi_{i}} \\ {\xi_{i} \geqslant 0, i=1,2, \ldots, m}\end{array}$

Each slack variable is for one training data point.

Data points with $ \xi_n=0$ are correctly classified and are either on the margin or on the correct side of the margin. Points for which $0<\xi_{n} \leqslant 1$ lie inside the margin, but on the correct boundary, and those data points for which $\xi_n>1$ lie on the wrong side of the decision boundary and are misclassified.

Slack variable allow for overlapping class distributions, however this framework is still sensitive to outliers because the penalty for misclassification increase linearly with $\xi$. (错误分类的离群点，会有很大的 $\xi$ 绝对值)

Therefore, the function we are going to minimize is

$C \sum_{n=1}^{N} \xi_{n}+\frac{1}{2}\|\mathbf{w}\|^{2}$

where the parameter $C>0$ controls the trade-off between the salck variable penality and the margin. Because any point that is misclassified has $\xi_n>1$, it follows that the $\sum_{n} \xi_{n}$ is an upper bound on the number of misclassified points. Therefore, the parameter $C$ is analogous to a regularization coefficient because it controls the trade-off between minimizing training errors and controlling the model complexity. In the limit $C \rightarrow \infty$, we will recover the earlier SVM with the hard margin.

The corresponding Lagrangian is given by

$L(\mathbf{w}, b, \mathbf{a})=\frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n}-\sum_{n=1}^{N} a_{n}\left\{t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n}\right\}-\sum_{n=1}^{N} \mu_{n} \xi_{n}$

and there are the constraints(see the book)

We can also convert it to the dual form and during the calculating the derivative, we get the another constraint $0 \leqslant a_{n} \leqslant C$.

(since there are a lot of content which can be written down. I will omit these information. If you want to get to know more, see the Chapter 7.1.1 of PRML of Bishop)

Reference

周志华《机器学习》
Bishop PRML Chapter 7.1.1

Matrix Derivatives

Posted on 2019-05-26 | In Linear Algebra

Common Used in Machine Learning

There are two types of layout in the expression of matrix derivatives: denominator and numberator layout.

Here we always use the denominator layout.

vector to vector

$\frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}=\mathbf{A}^{\top}$ $\frac{\partial \mathbf{x}^{\top} \mathbf{A}}{\partial \mathbf{x}}= \mathbf{A}$

scaler to vector

$\frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right) \mathbf{x}$ $\frac{\partial \mathbf{x}^{\mathbf{T}} \mathbf{x}}{\partial \mathbf{x}}=2 \mathbf{x}$

These two conclusions are very popular in the derivative calculation of Machine Learning.

Reference

矩阵求导、几种重要的矩阵及常用的矩阵求导公式

Naive Bayes

Posted on 2019-05-26 | In Machine Learning

What is Generative Models

See the last article: discriminative models and generative models.

The Naive Bayes belongs to the generative models, which model the distribution of the posterior and the process of generating the inputs.

Assumption of Naive Bayes

The naive bayes assumption is that all the data is conditionally independent, so if $D=(d_i|i=1,…,n)$ then

$p(\mathcal{D} | \boldsymbol{\theta})=\prod_{i=1}^{n} p\left(d_{i} | \boldsymbol{\theta}\right)$

(which also shown in the PRML P46)

Example of implementing spam filter

To implement a spam filter we can treat all the words in the email as independent of each other. Given an email $\left\langle w_{1}, w_{2}, \dots, w_{n}\right\rangle$ we can compute the probability of it being spam as

$p(\operatorname{spam} | \mathcal{D})=\frac{\prod_{i=1}^{n} p\left(w_{i} | \operatorname{spam}\right) p(\operatorname{spam})}{p(\mathcal{D})}$

where the $p(spam)$ is the empirically measured frequency of spam emails. To compute the likelihood we use a database of spam and non spam emails

$p\left(w_{i} | s p a m\right)=\frac{\# \text { of occurances of } w_{i} \text { in spam database }}{\# \text { of words in spam database }}$

Here I use the assumption we mentioned above, the likelihood $p(D|spam)$ is defined by the multiplication of each $p(w_i|spam)$. (We might include pseudo counts to make this more robust). The probability of the data $D$ is

$p(\mathcal{D})=p(\mathcal{D} | \operatorname{spam}) p(\operatorname{spam})+p(\mathcal{D} | \neg \operatorname{spam}) p(\neg \operatorname{spam})$

We use exactly the same procedure to compute $p(D|\neg spam)$ as we did to compute the $p(D|spam)$.

By calculating the posterior probabilities, we can get the approximate prediction on whether the email is spam or not.

Reference

Bishop PRML chapter 1.5.4
Shuogh blog: Discriminative and generative models
AML(comp3008) 2012-2013 exam paper