Data Mining: Information Theory

Information Theory and Feature Selection

  • Outline:
    • Information
    • Entropy
    • Mutual information
    • Mutual information for feature selection

Information

Information can also be seen as uncertainty or surprise.

$I(x)=-\log_2 p(x)$. Since $p(x)$ is the probability of event $x$, we have $p(x) \le 1$, so $I(x) \ge 0$: rarer events carry more information.

Shannon entropy:

$H(X) = \sum_{x} p(x) I(x) = -\sum_{x} p(x) \log_2 p(x)$

(entropy = the probability of each event * the information of that event, summed over all events)

Shannon entropy is the measure of uncertainty.

(Shannon entropy describes the degree of disorder; the concept of information is in fact defined from this perspective: the greater the uncertainty, the more information the event carries.)
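
As a quick sanity check of these definitions, here is a minimal sketch in Python (NumPy assumed; the function names are my own) that computes self-information and Shannon entropy for a discrete distribution:

```python
import numpy as np

def self_information(p):
    """Information (surprise) of a single event with probability p, in bits."""
    return -np.log2(p)

def shannon_entropy(probs):
    """Expected information of a discrete distribution, in bits."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]          # convention: 0 * log 0 = 0
    return float(-np.sum(probs * np.log2(probs)))

print(self_information(0.5))          # 1.0 bit for a fair coin flip
print(shannon_entropy([0.5, 0.5]))    # 1.0: maximal uncertainty for two outcomes
print(shannon_entropy([0.9, 0.1]))    # ~0.47: a biased coin is less uncertain
```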

K-L Divergence

Given two probability distributions $f(x)$ and $g(x)$ over the same random variable, the K-L divergence is:

$D(f \| g) = \sum_{x \in X} f(x) \log_2 \frac{f(x)}{g(x)}$

  • It compares the entropies of two distributions over the same random variable.
  • Heuristically: the number of additional bits needed to encode a random variable distributed as $f(x)$ when using a code optimized for $g(x)$.

It can be expanded as $D(f \| g) =\sum_{x \in X} [-f(x)\log_2 g(x)+f(x)\log_2 f(x)]$; the first term is the cost of encoding $f(x)$ with the code built for $g(x)$, and the second term subtracts the optimal cost under $f(x)$ itself. Therefore, it can be seen as a distance between the two encodings (or distributions).

When minimizing the K-L divergence over $g$ with the reference distribution $f$ fixed, the task is equivalent to minimizing the cross entropy. It can be written as: $D(f \| g) =\sum_{x \in X}f(x)\log_2 f(x) - \sum_{x \in X}f(x)\log_2 g(x)$

The second term, $-\sum_{x \in X}f(x)\log_2 g(x)$, is what we use as the cross-entropy loss function.

The form of cross entropy:

$H(f, g) = -\sum_{x \in X} f(x) \log_2 g(x)$

Note:
The K-L divergence cannot be used as a distance metric between $f$ and $g$, since it is not symmetric: $D(f \| g)$ is not equal to $D(g \| f)$ in general.
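
A minimal sketch of the cross entropy and the K-L divergence (my own helper names; it assumes $g(x) > 0$ wherever $f(x) > 0$):

```python
import numpy as np

def entropy(f):
    f = np.asarray(f, dtype=float)
    f = f[f > 0]
    return float(-np.sum(f * np.log2(f)))

def cross_entropy(f, g):
    """H(f, g) = -sum_x f(x) log2 g(x): bits needed to encode f with a code built for g."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    mask = f > 0
    return float(-np.sum(f[mask] * np.log2(g[mask])))

def kl_divergence(f, g):
    """D(f || g): the extra bits paid for using g's code instead of f's own."""
    return cross_entropy(f, g) - entropy(f)

f, g = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(kl_divergence(f, g))   # > 0
print(kl_divergence(g, f))   # a different value: the divergence is not symmetric
```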

Conditional Entropy

Given an observation $X = x$, the remaining uncertainty about the class $C$ is the conditional entropy $H(C \mid X=x)$.

The realized information $I$ is the reduction in entropy due to this observation:

$I = H(C) - H(C \mid X=x)$

The realized information is not necessarily positive: if it is negative, observing $X = x$ actually increases the entropy of $C$.

Form of the conditional entropy (from PRML): $H(Y \mid X)=-\sum_{i=1}^{m} \sum_{j=1}^{n} p\left(x_{i}, y_{j}\right) \log _{2} p\left(y_{j} \mid x_{i}\right)$
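
The PRML formula translates directly into code; a sketch assuming the joint distribution is given as a table with rows indexed by $x_i$ and columns by $y_j$:

```python
import numpy as np

def conditional_entropy(joint):
    """H(Y | X) = -sum_{i,j} p(x_i, y_j) log2 p(y_j | x_i).

    `joint[i, j]` holds Pr(X = x_i, Y = y_j).
    """
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)                       # marginal p(x_i)
    p_y_given_x = np.divide(joint, p_x,
                            out=np.zeros_like(joint), where=p_x > 0)
    mask = joint > 0
    return float(-np.sum(joint[mask] * np.log2(p_y_given_x[mask])))

# A uniform 2x2 joint distribution: knowing X tells us nothing about Y.
joint = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
print(conditional_entropy(joint))   # 1.0 bit
```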

Mutual Information

Mutual information is the expected information a feature gives us about a class:

$I[C ; X]=H(C)-\sum_{x} \operatorname{Pr}(X=x) H(C \mid X=x)$

Note:

  • Mutual information is never negative.
  • It is 0 if and only if $X$ and $C$ are statistically independent.
  • It is symmetric in $X$ and $C$: $I[C;X] = I[X;C]$.

Example of calculating the mutual information:

                Indicator $X$
  Class $C$     "Paint"    "Not Paint"
  Art           12         45
  Music         0          45

The entropy of $C$: $H(C)=-\left(\frac{57}{102} \log_2\frac{57}{102}+ \frac{45}{102}\log_2\frac{45}{102}\right)=0.99$

$H[C \mid X=\text{"Paint"}]=0$, since observing "Paint" tells us with certainty that the story is about Art.

$H[C \mid X=\text{"Not Paint"}]=1.0$, since given "Not Paint" the two classes are equally likely ($45/90$ each).

$I[C;X]=H[C]-\operatorname{Pr}(X=1)H[C \mid X=1]-\operatorname{Pr}(X=0)H[C \mid X=0] = 0.99-\frac{12}{102}\cdot 0-\frac{90}{102}\cdot 1 = 0.11$, where $X=1$ denotes "Paint" and $X=0$ denotes "Not Paint".

Therefore, the mutual information is 0.11 bits, which is the expected reduction in uncertainty about $C$ from observing $X$.
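
The same numbers can be reproduced programmatically; a sketch using the count table above (helper names are my own):

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

# Counts from the table: rows = (Art, Music), columns = ("Paint", "Not Paint").
counts = np.array([[12, 45],
                   [ 0, 45]], dtype=float)
n = counts.sum()                                      # 102 stories in total

H_C = entropy(counts.sum(axis=1) / n)                 # H(C)
p_x = counts.sum(axis=0) / n                          # Pr(X = "Paint"), Pr(X = "Not Paint")
H_C_given_x = [entropy(counts[:, j] / counts[:, j].sum()) for j in range(2)]

I_CX = H_C - sum(p * h for p, h in zip(p_x, H_C_given_x))
print(round(H_C, 2), round(I_CX, 2))                  # 0.99 0.11
```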

Note:

In decision trees, mutual information is used as the information gain: the criterion for choosing the best feature to split on. See Zhihu for more detail.

This is also the most common way to find informative features; a sketch follows below.
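
A simple way to use this for feature selection is to rank each discrete feature by its information gain with respect to the class. The sketch below uses hypothetical toy data and my own helper names; scikit-learn's mutual_info_classif offers a comparable ranking for real feature matrices.

```python
import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """I[C; X] between one discrete feature and the class labels."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    h_c = entropy_from_counts(np.unique(labels, return_counts=True)[1])
    h_c_given_x = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        h_c_given_x += (len(subset) / len(labels)
                        * entropy_from_counts(np.unique(subset, return_counts=True)[1]))
    return h_c - h_c_given_x

# Hypothetical toy data: rank two binary features by information gain.
labels = np.array(["art"] * 57 + ["music"] * 45)
paint = np.array([1] * 12 + [0] * 90)                            # the indicator from the example above
noise = np.random.default_rng(0).integers(0, 2, size=len(labels))  # an uninformative feature
print(information_gain(paint, labels))                           # ~0.11
print(information_gain(noise, labels))                           # ~0.0
```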

Joint and Conditional Entropy

$H[X, Y]=-\sum_{x, y} \operatorname{Pr}(X=x, Y=y) \log _{2} \operatorname{Pr}(X=x, Y=y)$

This is the entropy of the joint distribution of $X$ and $Y$.

Using this, the conditional mutual information can be derived:

$I[C ; Y | X]=H[C | X]-H[C | Y, X]$

  • We ask how much information $Y$ contains about $C$ if we "control" for $X$ (see the sketch below).
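
A sketch of this quantity from a full joint table $p(c, y, x)$, using the equivalent identity $I[C;Y \mid X] = H[C,X] + H[Y,X] - H[X] - H[C,Y,X]$ (the array layout and names are my own convention):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_mutual_information(joint):
    """I[C; Y | X] from a joint table with joint[c, y, x] = Pr(C=c, Y=y, X=x)."""
    joint = np.asarray(joint, dtype=float)
    h_cx = entropy(joint.sum(axis=1))        # H[C, X], marginalizing out Y
    h_yx = entropy(joint.sum(axis=0))        # H[Y, X], marginalizing out C
    h_x = entropy(joint.sum(axis=(0, 1)))    # H[X]
    h_cyx = entropy(joint)                   # H[C, Y, X]
    return h_cx + h_yx - h_x - h_cyx

# C and Y independent given X, so I[C; Y | X] should be 0.
p_x = np.array([0.5, 0.5])
p_c_given_x = np.array([[0.9, 0.2], [0.1, 0.8]])   # p(c | x), column = value of x
p_y_given_x = np.array([[0.3, 0.7], [0.7, 0.3]])   # p(y | x)
joint = np.einsum('cx,yx,x->cyx', p_c_given_x, p_y_given_x, p_x)
print(round(conditional_mutual_information(joint), 6))   # 0.0
```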

Interaction

Conditional mutual information $I[C ; Y \mid X]$ is non-negative:

  • But might be smaller/larger/equal to $I[C;Y]$

  • If $I[C;Y|X]=0$: $C$ and $Y$ are conditionally independent given $X$; otherwise there is an interaction between $X$ and $Y$ (regarding their information about $C$)

  • $I[C;Y|X]<I[C;Y]$: Some of the information in Y about C is redundant given X

  • Use this to define the interaction information: $I(C;Y;X)=I(C;Y|X)-I(C;Y)$

    (I am actually not very familiar with this interaction measure; see the sketch below for its definition in code.)
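
A self-contained sketch of the interaction information from a joint table $p(c, y, x)$; the XOR example is the classic case of pure synergy (array layout and names are my own):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def interaction_information(joint):
    """I(C; Y; X) = I(C; Y | X) - I(C; Y) for a joint table joint[c, y, x]."""
    joint = np.asarray(joint, dtype=float)
    i_cy_given_x = (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
                    - entropy(joint.sum(axis=(0, 1))) - entropy(joint))
    joint_cy = joint.sum(axis=2)
    i_cy = entropy(joint_cy.sum(axis=1)) + entropy(joint_cy.sum(axis=0)) - entropy(joint_cy)
    return i_cy_given_x - i_cy

# XOR example: C = Y xor X with Y, X independent fair bits.
# Y alone tells us nothing about C, but together with X it determines C: pure synergy.
joint = np.zeros((2, 2, 2))
for y in (0, 1):
    for x in (0, 1):
        joint[y ^ x, y, x] = 0.25
print(interaction_information(joint))   # 1.0
```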

Reference

  1. CAML机器学习系列2: 深入浅出ML之Entropy-Based家族
  2. Slides from Markus: Information
  3. Bishop, Pattern Recognition and Machine Learning (PRML)