EM Algorithm

The EM algorithm is used to solve problems that involve latent variables, like the example in reference 1 about the heights of boys and girls.

Before explaining the EM algorithm, we first review generative models. See the blog posts on Discriminative and Generative models and on Naive Bayes.

Building Probabilistic Models

To describe a system with uncertainty we use random variables, $X$, $Y$, $Z$, etc.

The variables are described by a probability mass function $\mathbb{P}(X, Y, Z)$, or, if the variables are continuous, by a probability density $f_{X, Y, Z}(x, y, z)$. We build dependencies into this joint distribution.
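For example (just an illustration of how an independence assumption shapes the joint; nothing later depends on this particular choice), the chain rule always gives

$$\mathbb{P}(X, Y, Z) = \mathbb{P}(Z)\,\mathbb{P}(Y \mid Z)\,\mathbb{P}(X \mid Y, Z),$$

and an assumption such as "$X$ is independent of $Z$ given $Y$" builds dependency structure in by simplifying the last factor to $\mathbb{P}(X \mid Y)$.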

We often think of our observations as given and the predictions as random variables. The objective is to find a probability $\mathbb{P}(C \mid \boldsymbol{x})$. Modelling this conditional directly is what we call a discriminative model.

In generative models, we think of the problem as the joint process of generating the features and the outputs together. This leads to a joint distribution $\mathbb{P}(\boldsymbol{X}, Y)$, where $\boldsymbol{X}$ are the features and $Y$ is the output you are trying to predict. We can write the approach as

$$\mathbb{P}(Y \mid \boldsymbol{X}) = \frac{\mathbb{P}(\boldsymbol{X} \mid Y)\,\mathbb{P}(Y)}{\mathbb{P}(\boldsymbol{X})} \propto \mathbb{P}(\boldsymbol{X}, Y),$$

i.e. we model a prior $\mathbb{P}(Y)$ and a class-conditional $\mathbb{P}(\boldsymbol{X} \mid Y)$, and obtain the prediction through Bayes' rule.

Latent Variable

Latent variables: random variables that are involved in our model but that we do not observe and are not directly interested in.

If we have a latent variable $Z$, an observed variable $\boldsymbol{X}$, and we are predicting a variable $Y$, then we marginalise over the latent variable:

$$\mathbb{P}(\boldsymbol{X}, Y) = \sum_{z} \mathbb{P}(\boldsymbol{X}, Y, Z = z).$$

Suppose we were observing the decays from two types of short-lived particle, and $X$ is our observation. Assume the observation from each particle type is normally distributed, with unknown means and variances $\Theta=\left\{\mu_{1}, \sigma_{1}^{2}, \mu_{2}, \sigma_{2}^{2}\right\}$. Let $Z \in \{0, 1\}$ indicate that the observation came from particle 1.

The probability of $X$ is given by marginalising over $Z$:

$$f(x \mid \Theta) = \pi\,\mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + (1 - \pi)\,\mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2}),$$

where $\pi = \mathbb{P}(Z = 1)$ is the (also unknown) mixing probability.

  • Note:

Then the likelihood of our observed data $x_{1}, \ldots, x_{N}$ is

$$L(\Theta) = \prod_{n=1}^{N} \left[\pi\,\mathcal{N}(x_{n} \mid \mu_{1}, \sigma_{1}^{2}) + (1 - \pi)\,\mathcal{N}(x_{n} \mid \mu_{2}, \sigma_{2}^{2})\right].$$

The likelihood is a non-linear function of the parameters, so it cannot be maximised in closed form. The difficulty is that the latent variable $Z$ depends on the parameters $\Theta$, while the likelihood in turn depends on the latent variable.
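To see the difficulty concretely, take the logarithm of the likelihood above (no new assumptions here, just the standard expansion):

$$\ln L(\Theta) = \sum_{n=1}^{N} \ln \left[\pi\,\mathcal{N}(x_{n} \mid \mu_{1}, \sigma_{1}^{2}) + (1 - \pi)\,\mathcal{N}(x_{n} \mid \mu_{2}, \sigma_{2}^{2})\right].$$

The sum over the two components sits inside the logarithm, so setting the derivatives with respect to the parameters to zero does not give closed-form solutions.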

Solution:

Proceed iteratively: using the current parameter values, form the expected (complete-data) log-likelihood, and then maximise it to update the parameters.
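As a concrete sketch of this iteration for the two-particle example above (two univariate Gaussians plus a mixing weight): a minimal illustration under my own variable names, not a reference implementation.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a 1-D mixture of two Gaussians (the two-particle example)."""
    # Crude initialisation: put the two means at the extremes of the data.
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.var()
    pi1 = 0.5  # P(Z = 1), the mixing probability

    def normal_pdf(v, mu, var):
        return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    for _ in range(n_iter):
        # E step: responsibility that each observation came from particle 1.
        p1 = pi1 * normal_pdf(x, mu1, s1)
        p2 = (1 - pi1) * normal_pdf(x, mu2, s2)
        gamma = p1 / (p1 + p2)

        # M step: responsibility-weighted re-estimates of the parameters.
        n1, n2 = gamma.sum(), (1 - gamma).sum()
        mu1 = (gamma * x).sum() / n1
        mu2 = ((1 - gamma) * x).sum() / n2
        s1 = (gamma * (x - mu1) ** 2).sum() / n1
        s2 = ((1 - gamma) * (x - mu2) ** 2).sum() / n2
        pi1 = n1 / len(x)

    return mu1, s1, mu2, s2, pi1
```

Running it on simulated data, e.g. `x = np.concatenate([np.random.normal(0.0, 1.0, 500), np.random.normal(5.0, 2.0, 500)])`, should recover means near 0 and 5 (up to which component ends up labelled "particle 1").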

EM for Mixture of Gaussians

(The EM algorithm takes the expectation of the log-likelihood over the latent variable, i.e. something of the form $p \cdot \log f(D \mid Z = 1, \Theta) + (1 - p) \cdot \log f(D \mid Z = 0, \Theta)$.)
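Written out in general notation (this is the standard EM objective; $Q$ and $\Theta^{\text{old}}$ are the usual textbook symbols, not notation used elsewhere in these notes):

$$Q(\Theta, \Theta^{\text{old}}) = \mathbb{E}_{Z \mid D, \Theta^{\text{old}}}\left[\ln f(D, Z \mid \Theta)\right] = \sum_{z} p(z \mid D, \Theta^{\text{old}})\,\ln f(D, z \mid \Theta).$$

With a single binary latent variable, this reduces to essentially the two-term form in the note above.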

(to be continued)

The algorithm (Bishop PRML):

  1. Initialize the means $\boldsymbol{\mu}_{k}$, covariances $\boldsymbol{\Sigma}_{k}$ and mixing coefficients $\pi_{k}$, and evaluate the initial value of the log likelihood.

  2. E step. Evaluate the responsibilities using the current parameter values:

    $$\gamma(z_{nk}) = \frac{\pi_{k}\,\mathcal{N}(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})}{\sum_{j=1}^{K} \pi_{j}\,\mathcal{N}(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{j}, \boldsymbol{\Sigma}_{j})}$$

    (Compute, for each point $n$, the probability that it belongs to latent component $k$.)

  3. M step. Re-estimate the parameters using the current responsibilities:

    $$\boldsymbol{\mu}_{k}^{\text{new}} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\,\boldsymbol{x}_{n}$$

    $$\boldsymbol{\Sigma}_{k}^{\text{new}} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\,(\boldsymbol{x}_{n} - \boldsymbol{\mu}_{k}^{\text{new}})(\boldsymbol{x}_{n} - \boldsymbol{\mu}_{k}^{\text{new}})^{\mathsf{T}}$$

    $$\pi_{k}^{\text{new}} = \frac{N_{k}}{N}$$

    where

    $$N_{k} = \sum_{n=1}^{N} \gamma(z_{nk}).$$

    (Recompute the mean $\mu$ and covariance matrix $\Sigma$ of each latent component $k$, and re-estimate the mixing probabilities of the latent variable distribution; the objective that these updates maximise is written out after the algorithm.)

  4. Evaluate the log likelihood

    $$\ln p(\boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_{k}\,\mathcal{N}(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}) \right\}$$

    and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

    (Compute the log-likelihood; this was also evaluated once at the very start. If the convergence criterion is not met, go back and iterate again.)
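For reference, the quantity that the M step maximises is the expectation of the complete-data log-likelihood under the responsibilities from the E step (Bishop PRML, Section 9.2):

$$\mathbb{E}_{Z}\left[\ln p(\boldsymbol{X}, \boldsymbol{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_{k} + \ln \mathcal{N}(\boldsymbol{x}_{n} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}) \right\}.$$

Setting its derivatives with respect to $\boldsymbol{\mu}_{k}$, $\boldsymbol{\Sigma}_{k}$ and $\pi_{k}$ (the last one with a Lagrange multiplier enforcing $\sum_{k} \pi_{k} = 1$) to zero gives exactly the M-step updates above.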

(not fully understood yet…)

Example of Coin Tossing

MLE: there is no latent variable; the events all belong to one distribution.

EM: there are latent variables, and we do not know $p(Z)$, where $Z$ is the latent variable.

(See the example mentioned in reference 1)
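Below is a minimal sketch of the classic two-coin version of this example: each group of tosses comes from one of two coins with unknown biases, and which coin was used is the latent variable. The numbers and names are invented for illustration, and for simplicity the prior over the two coins is fixed at 1/2 rather than being re-estimated (a fuller version would also update it in the M step).

```python
import numpy as np
from scipy.stats import binom

def em_two_coins(heads, tosses, n_iter=50):
    """EM for two coins with unknown biases; the coin used per group is latent."""
    theta_a, theta_b = 0.6, 0.5  # arbitrary initial guesses for the two biases

    for _ in range(n_iter):
        # E step: posterior probability that each group came from coin A,
        # assuming a uniform prior over the two coins.
        like_a = binom.pmf(heads, tosses, theta_a)
        like_b = binom.pmf(heads, tosses, theta_b)
        w_a = like_a / (like_a + like_b)
        w_b = 1.0 - w_a

        # M step: responsibility-weighted maximum-likelihood estimates.
        theta_a = np.sum(w_a * heads) / np.sum(w_a * tosses)
        theta_b = np.sum(w_b * heads) / np.sum(w_b * tosses)

    return theta_a, theta_b

# Each entry: number of heads in one group of 10 tosses.
heads = np.array([5, 9, 8, 4, 7])
tosses = np.full(5, 10)
print(em_two_coins(heads, tosses))
```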

Reference

  1. Zhihu: 人人都懂EM算法
  2. Adam's slides: Generative Models
  3. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 9.2