We Need to Go Deeper


Discriminative and Generative Models

Posted on 2019-05-25 | In Machine Learning

The classification problem can be broken down into two separate stages:

  • The inference stage: use training data to learn a model for the posterior $p(C_k|x)$
  • The decision stage: use these posterior probabilities to make optimal class assignments

To solve the classification problem, there are three distinct approaches.

Generative Models

To solve the inference problem, we determine the class-conditional densities $p(x|C_k)$ for each class $C_k$ individually, and also infer the prior class probabilities $p(C_k)$. Then we use Bayes' theorem in the form

$p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)}$

to find the posterior class probabilities $p(C_k|x)$.

The denominator can be calculated as $p(x) = \sum_k p(x|C_k)\,p(C_k)$.

Equivalently, the joint distribution $p(x, C_k)$ can also be modelled directly and then normalized to obtain the posterior probabilities.

Given the posterior probabilities, we use decision theory to determine the class membership of each input $x$. Methods of this kind are called generative models, because they model the distribution of the inputs as well as the outputs. The name "generative" comes from the fact that, by sampling from them, it is possible to generate synthetic data points in the input space.

Examples of generative models (a minimal sketch of the recipe follows the list):

  • Naive Bayes, Latent Dirichlet allocation, Gaussian Process…
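
To make this recipe concrete, here is a minimal sketch (a toy example of my own with made-up 1-D data, not taken from PRML) of a generative classifier with Gaussian class-conditional densities:

import numpy as np
from scipy.stats import norm

# hypothetical 1-D training data for two classes
x0 = np.array([1.0, 1.2, 0.8, 1.1])   # samples from class C_0
x1 = np.array([3.0, 2.8, 3.2, 3.1])   # samples from class C_1

# inference stage: estimate p(x|C_k) as a Gaussian and p(C_k) from class frequencies
mu = [x0.mean(), x1.mean()]
sigma = [x0.std(), x1.std()]
prior = np.array([len(x0), len(x1)], dtype=float)
prior /= prior.sum()

def posterior(x):
    # Bayes' theorem: p(C_k|x) = p(x|C_k) p(C_k) / p(x)
    likelihood = np.array([norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    joint = likelihood * prior
    return joint / joint.sum()   # the denominator is p(x) = sum_k p(x|C_k) p(C_k)

# decision stage: assign the class with the largest posterior probability
print(posterior(2.0), posterior(2.0).argmax())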

Discriminative Models

We first solve the inference problem of determining the posterior class probabilities $p(C_k|x)$ directly, and then make predictions using decision theory.

Methods that model the posterior probabilities $p(C_k|x)$ directly are called discriminative models.

or

Find a function $f(x)$, called a discriminant function, which maps each input $x$ directly onto a class label.

Examples of discriminative models (a short sketch follows the list):

  • kNN, perceptron, decision tree, linear regression, logistic regression, SVM, neural network…
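
As a minimal sketch of the second approach (again a toy example of my own, assuming scikit-learn is available), logistic regression models the posterior $p(C_k|x)$ directly, and the class label is then read off from it in the decision stage:

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical 1-D training data for two classes
X = np.array([[1.0], [1.2], [0.8], [3.0], [2.8], [3.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # posterior probabilities p(C_k|x)
print(clf.predict([[2.0]]))         # decision stage: the class assignment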

The Merits of Each Method

Generative models are the most demanding, since they involve finding the joint distribution over both $x$ and $C_k$. For many applications, $x$ has high dimensionality, and consequently we may need a large training set in order to determine the class-conditional densities (the quantities we need to estimate here) to reasonable accuracy.
One distinctive use case of generative models is outlier detection. The marginal density of the data, $p(x)$, can be determined using the formula mentioned above. This is useful for detecting new data points that have low probability under the model, and for which the predictions may be of low accuracy; this is known as outlier detection or novelty detection. A minimal sketch is given below.
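
A minimal sketch of this idea (my own example, assuming we simply fit a single Gaussian to the inputs): flag new points whose marginal density $p(x)$ falls below a threshold.

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical training inputs
X = np.random.randn(500, 2)

# model the marginal density p(x) with one Gaussian fitted to the training inputs
density = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

# new points with low probability under the model are flagged as outliers
x_new = np.array([[0.1, -0.2], [6.0, 6.0]])
threshold = 1e-3
print(density.pdf(x_new) < threshold)   # the second, far-away point is flagged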

Discriminative approaches are simpler. The second approach obtains the posterior probabilities $p(C_k|x)$ directly from the data points. The third approach is simpler still: we use the training data to find a discriminant function $f(x)$ that maps each $x$ directly onto a class label, combining the inference and decision stages into a single learning problem. However, with the third method we no longer have access to the posterior probabilities.

Reference

  1. Bishop PRML Chapter 1.5.4

Data Mining Information Theory

Posted on 2019-05-25 | Edited on 2019-05-29 | In Data Mining

Information Theory and Feature Selection

  • Outline:
    • Information
    • Entropy
    • Mutual information
    • Mutual information for feature selection

Information

Information can also be seen as uncertainty or surprise.

$I = -\log_2 p(x)$. Since $p(x)$ is the probability of the event $x$, we have $p(x) \le 1$ and therefore $I \ge 0$.

Shannon entropy:

$H(X) = -\sum_{x} p(x)\log_2 p(x)$

(entropy = the probability of each event × the information of that event, summed over all events)

Shannon entropy is a measure of uncertainty.

(Shannon entropy describes the degree of disorder; in fact, the concept of information is defined from this perspective: the greater the uncertainty, the more information the event carries.)
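
For example (a small numerical sketch of my own), the self-information and entropy of a biased coin:

import numpy as np

p = np.array([0.9, 0.1])      # a biased coin
info = -np.log2(p)            # self-information of each outcome, in bits
entropy = np.sum(p * info)    # H(X) = -sum_x p(x) log2 p(x)
print(info)                   # [0.152, 3.322]: the rarer outcome carries more information
print(entropy)                # about 0.47 bits, less than the 1 bit of a fair coin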

K-L Divergence

Given two probability distributions $f(x)$ and $g(x)$, the K-L divergence is: $D(f\|g) = \sum_{x \in X} f(x)\log_2\frac{f(x)}{g(x)}$

  • It compares the entropy of two distributions over the same random variable.
  • Heuristically: it is the number of additional bits needed when encoding a random variable with distribution $f(x)$ using a code optimized for $g(x)$.

It can be expanded as $D(f\|g) = \sum_{x \in X} \left[-f(x)\log_2 g(x) + f(x)\log_2 f(x)\right]$, where the first term is the cost of encoding $f(x)$ using the encoding scheme of $g(x)$. Therefore, it can be seen as a distance between two encoding schemes (or distributions).

When minimizing the K-L divergence with respect to $g$ while the reference distribution $f$ is held fixed, the task is equivalent to minimizing the cross entropy. It can be written as: $D(f\|g) = \sum_{x \in X} f(x)\log_2 f(x) - \sum_{x \in X} f(x)\log_2 g(x)$

The second term is what we use in the cross entropy loss function.

The form of the cross entropy: $H(f, g) = -\sum_{x \in X} f(x)\log_2 g(x)$

Note:
The K-L divergence cannot be used as a distance metric between $f$ and $g$, since it is not symmetric: $D(f\|g)$ is not equal to $D(g\|f)$.
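
A small numerical sketch (my own example) of the K-L divergence, the cross entropy, and the asymmetry:

import numpy as np

f = np.array([0.5, 0.5])   # reference distribution
g = np.array([0.9, 0.1])   # model distribution

def kl(p, q):
    # D(p||q) = sum_x p(x) log2 (p(x)/q(x))
    return np.sum(p * np.log2(p / q))

cross_entropy = -np.sum(f * np.log2(g))   # H(f, g)
entropy_f = -np.sum(f * np.log2(f))       # H(f)

print(kl(f, g), kl(g, f))          # about 0.74 vs 0.53: not symmetric
print(cross_entropy - entropy_f)   # equals D(f||g), i.e. H(f, g) = H(f) + D(f||g)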

Conditional Entropy

Here $I$ is the realized information, which is the difference between the entropy $H(C)$ and the conditional entropy $H(C|X=x)$. The realized information is defined as: $I = H(C) - H(C|X=x)$

Given the observation $X=x$, the entropy of $C$ usually decreases; the remaining uncertainty is written as $H(C|X=x)$.

The realized information is not necessarily positive: if it is negative, observing $X=x$ actually increases the entropy.

Form of the conditional entropy (from PRML): $H(Y | X)=-\sum_{i=1}^{m} \sum_{j=1}^{n} p\left(x_{i}, y_{j}\right) \log _{2} p\left(y_{j} | x_{i}\right)$

Mutual Information

Mutual information is the expected information a feature gives us about a class:

$I[C ; X]=H(C)-\sum \operatorname{Pr}(X=x) H(C | X=x)$

Note:

  • Mutual information is always non-negative.
  • It is zero only when $X$ and $C$ are statistically independent.
  • It is symmetric in $X$ and $C$.

Example of calculating the mutual information, for an indicator feature $X$ and class $C$ (counts of stories):

Class $C$    X = "Paint"    X = "Not Paint"
Art          12             45
Music        0              45

The entropy of C: $H(C) = -\left(\frac{57}{102}\log_2\frac{57}{102} + \frac{45}{102}\log_2\frac{45}{102}\right) \approx 0.99$

$H[C|X=”paint”]=0$, since observing “paint” makes it certain that the story is about Art.

$H[C|X=”not paint”]=1.0$, which we can calculate from the distribution.

$I[C;X]=H[C]-\Pr(X=1)H[C|X=1]-\Pr(X=0)H[C|X=0] = 0.99 - \frac{12}{102}\cdot 0 - \frac{90}{102}\cdot 1 = 0.11$ (where $X=1$ means “paint” and $X=0$ means “not paint”)

Therefore, the mutual information is 0.11, which is the expected reduction in uncertainty.
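
The same numbers can be reproduced with a few lines of Python (a small sketch of my own):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# joint counts from the table: rows = (Art, Music), columns = ("Paint", "Not Paint")
counts = np.array([[12, 45],
                   [0, 45]])
n = counts.sum()   # 102

H_C = entropy(counts.sum(axis=1) / n)                        # about 0.99
H_C_paint = entropy(counts[:, 0] / counts[:, 0].sum())       # 0.0
H_C_not_paint = entropy(counts[:, 1] / counts[:, 1].sum())   # 1.0
p_x = counts.sum(axis=0) / n                                 # Pr(X="paint"), Pr(X="not paint")

I = H_C - p_x[0] * H_C_paint - p_x[1] * H_C_not_paint
print(I)   # about 0.11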

Note:

In decision trees, mutual information is used as the information gain, which is the criterion for choosing the best feature to split on. See the zhihu article for more detail.

This is also the way most people find informative features.

Joint and Conditional Entropy

$H[X, Y]=-\sum_{x, y} \operatorname{Pr}(X=x, Y=y) \log _{2} \operatorname{Pr}(X=x, Y=y)$

It is essentially the entropy of the joint distribution.

Using this, the conditional mutual information can be derived:

$I[C ; Y | X]=H[C | X]-H[C | Y, X]$

  • We ask how much information $Y$ contains about $C$ if we “control” for $X$.

Interaction

Conditional mutual information $I[C;Y|X]$ is non-negative:

  • But it might be smaller than, larger than, or equal to $I[C;Y]$.

  • If $I[C;Y|X]=0$: $C$ and $Y$ are conditionally independent given $X$; otherwise there is an interaction between $X$ and $Y$ (regarding their information about $C$).

  • $I[C;Y|X]<I[C;Y]$: some of the information in $Y$ about $C$ is redundant given $X$.

  • Use this to define the interaction information: $I(C;Y;X)=I(C;Y|X)-I(C;Y)$

    (I am actually not very familiar with this interaction measure.)

Reference

  1. CAML机器学习系列2:深入浅出ML之Entropy-Based家族
  2. The slide from Markus: information
  3. Bishop PRML

Python Init Modules

Posted on 2019-05-14 | Edited on 2019-07-31 | In Python

The purpose of the __init__.py file is to turn a folder into a Python package; every Python package contains an __init__.py file.


__all__ = []

__all__ is a special variable that defines which attributes, functions, or classes are imported into other modules when a wildcard import (from module import *) is used.
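
A minimal sketch with a hypothetical package mypkg (the package name and functions are just placeholders):

# mypkg/__init__.py
__all__ = ['foo']   # only `foo` is exported by a wildcard import

def foo():
    return 'foo'

def bar():          # not listed in __all__, so `from mypkg import *` skips it
    return 'bar'

# client code
# from mypkg import *
# foo()   # works
# bar()   # NameError: not imported by the wildcard
# an explicit `from mypkg import bar` still works, since __all__ only affects `import *`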

Reference

  1. Python __init__.py 作用详解
  2. Python模块导入时全局变量”__all__“的作用 https://blog.csdn.net/chuan_day/article/details/79694319

Deep Learning Batch Normalization

Posted on 2019-05-13 | In Deep Learning

Why do we need batch normalization in a neural network?

It helps the neural network converge more quickly.

It brings different features onto the same scale, removing the influence of their differing ranges.

It also helps prevent exploding and vanishing gradients.
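
A minimal PyTorch sketch (my own example) of placing a batch normalization layer between a linear layer and its activation:

import torch
import torch.nn as nn

# a small MLP with batch normalization after the first linear layer
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes each of the 64 features over the batch
    nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(32, 20)   # a batch of 32 samples with 20 features
print(model(x).shape)     # torch.Size([32, 1])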

Reference

  1. zhihu: 神经网络中的归一化除了减少计算量,还有什么作用?
  2. towards data science: Batch normalization in Neural Networks

Deep Learning RNN sequence model

Posted on 2019-05-09 | In Deep Learning

Taking down a note about the bugs I came across while doing lab 7.3.

Key words:
pack_padded_sequence, pad_packed_sequence, the output of the LSTM model.

The code is listed below:

class ImprovedRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text_len):
        text, lengths = text_len
        embedded = self.embedding(text)  # (sentence length, batch size, embedding_dim)
        # pack_padded_sequence strips the padding, so the LSTM only processes the real
        # tokens; the PackedSequence stores the data plus the per-step batch sizes
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
        # The LSTM returns (output, (h_n, c_n)); h_n already holds the hidden state at
        # the last real time step of every sequence, so there is no need to call
        # pad_packed_sequence and index the padded output manually (indexing it with
        # [-1] does not give the true last step for the shorter, padded sentences).
        _, (last_state, _) = self.lstm(embedded)
        lstm_final_out = last_state[-1]  # (batch size, hidden_dim), last (and only) layer
        out = self.fc(lstm_final_out)
        return out

INPUT_DIM = len(TEXT.vocab) # 25002
EMBEDDING_DIM = 50
HIDDEN_DIM = 100
OUTPUT_DIM = 1

imodel = ImprovedRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# TODO: Train and evaluate the model
# YOUR CODE HERE
optimizer = optim.Adam(imodel.parameters(), lr=0.01)  # optimize the parameters of imodel

torchbearer_trial = Trial(imodel, optimizer, criterion, metrics=['acc', 'loss']).to(device)
torchbearer_trial.with_generators(train_generator=MyIter(train_iterator), val_generator=MyIter(valid_iterator), test_generator=MyIter(test_iterator))
# torchbearer_trial.with_train_generator(MyIter(train_iterator))
torchbearer_trial.run(epochs=5)
torchbearer_trial.predict()

Linux Server set thread for Pytorch

Posted on 2019-05-07 | In Linux

While doing my coursework for Advanced Machine Learning, I ran my deep learning scripts on the server. The first time, I saw that the %CPU of my job was always very high. (The admin then killed my job, since it was blocking other jobs… sorry, I didn’t know this at the time.)

To avoid affecting other users, we should run the following commands before running our PyTorch scripts, to limit the number of threads used.

OMP_NUM_THREADS=1
export OMP_NUM_THREADS

This allows our job to use only a single OpenMP thread.
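
PyTorch also provides torch.set_num_threads, which can be used inside the script to the same effect:

import torch
torch.set_num_threads(1)   # limit intra-op CPU parallelism to a single thread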

Git Rebase to keep commit log clean

Posted on 2019-05-05 | In Git

When using Git, we often develop a new feature on a separate branch. Such a branch tends to accumulate many commits like “fix typo”, “correct the error”, etc. When we merge the branch into the master branch, we don’t want these noisy commits to appear in the commit log of the master branch.

To merge development branch to master branch:

git checkout master
git merge development

If we want to keep the commit log clean, we should use rebase.

Rebase

Example:

# Start developing a new feature
$ git checkout -b new-feature master
# Change some code
$ git commit -a -m "Start developing a feature"
# The previous commit has a small problem, fix it
$ git commit -a -m "Fix something from the previous commit"

# Urgent hotfix: change something directly on the master branch
$ git checkout master
# Change some code
$ git commit -a -m "Fix security hole"

# Start the interactive rebase
$ git checkout new-feature
$ git rebase -i master
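
In the editor that `git rebase -i master` opens, the fix-up commit can be squashed into the first one. A sketch of what the edited todo list might look like (the hashes here are just placeholders):

pick a1b2c3d Start developing a feature
squash e4f5a6b Fix something from the previous commit

# After saving, Git combines the two commits into one and lets you edit the final
# commit message, so only a single clean commit ends up being merged into master.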

Reference

  1. Git tips: 合并 commit 保持分支干净整洁

Linux Run Script even after logging out

Posted on 2019-05-02 | Edited on 2019-05-04 | In Linux

Nohup

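A minimal sketch of the usual nohup pattern (the script name is just a placeholder); the command keeps running after we log out, and its output goes to a file:

$ nohup python train.py > train.log 2>&1 &

The trailing & puts the job in the background, and nohup makes it ignore the hangup signal sent when the session closes.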

Reference

  1. linux后台执行命令:&和nohup
  2. Nohup Command in Linux: Linux Hint

VAE Variational Autoencoder

Posted on 2019-05-01 | In Deep Learning

The variational autoencoder differs from an ordinary autoencoder and has its own special characteristics.

The encoder learns an encoding of the image and produces its latent representation vector (here, the parameters of a Gaussian distribution are learned).

To train the encoder and decoder, the loss function consists of two parts (a sketch of this loss follows the list):

  • A KL divergence term expressing the difference between the latent vector’s distribution and a standard normal distribution
  • A reconstruction term given by the mean squared error between the generated image and the original image
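
A minimal PyTorch sketch of this two-part loss (my own, following the standard VAE formulation), assuming the encoder outputs the mean mu and log-variance logvar of the latent Gaussian:

import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # reconstruction term: mean squared error between the generated image and the original
    recon_loss = F.mse_loss(recon_x, x, reduction='sum')
    # KL divergence between N(mu, sigma^2) and the standard normal N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss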

Reference

  1. 部分公式推导 KL divergence
  2. Github example code

Next() in Python

Posted on 2019-04-30 | Edited on 2019-08-06 | In Python

To fetch an item from a generator, next() can be used: it returns the next item from the iterator.

If a variable is an iterable but not an iterator (for example, a list), next() can be used together with iter().

Python code snippet:

a=[1,2,3]
next(a)
# output: TypeError: 'list' object is not an iterator

b=iter(a)
next(b)
# output: 1
next(b)
# output: 2
next(b)
# output: 3
next(b)
# StopIteration

Example:

import torch
import torchvision
import torchvision.transforms as transforms

batch_size = 256
image_dim = 28 * 28  # FashionMNIST images are 28x28, flattened into a 784-vector

# dataset construction
transform = transforms.Compose([
    transforms.ToTensor(),                          # convert to tensor
    transforms.Lambda(lambda x: x.view(image_dim))  # flatten into vector
])

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',
    train=True,
    download=True,
    transform=transform
)

train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=batch_size
)

# Fetch one batch with the next() function.
# Since the object returned by DataLoader is not an iterator itself, iter() is needed;
# each item yielded is an (images, labels) pair.
images, labels = next(iter(train_loader))
