Decision Tree

A decision tree is like a flow chart: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf gives a prediction.

One merit of decision trees is that they are interpretable: it is easy to see how they made a certain decision. Besides, decision trees can be “hand crafted” by experts; they can also be built up using machine learning techniques.
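
For example, a hand-crafted tree is just a chain of feature tests, where each if/else is one split node. A rough illustrative sketch (the feature names and thresholds are made up, loosely following the iris data):

```python
def classify_flower(petal_length, petal_width):
    """A tiny hand-crafted decision tree: each if/else is one split node."""
    if petal_length < 2.5:        # first split: short petals
        return "setosa"
    elif petal_width < 1.8:       # second split: narrow petals
        return "versicolor"
    else:
        return "virginica"

print(classify_flower(1.4, 0.2))  # -> setosa
```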

Criteria for Choosing the Split

For classification:

  • ID3: maximise the information gain (based on the information entropy). The information gain is the mutual information between the feature and the class label. See the blog for more details, and the sketch after this list.

  • CART: maximise the impurity decrease (based on the Gini impurity)

    $\Delta I = I(\text{root}) - w_{\text{left}}\,I(\text{left}) - w_{\text{right}}\,I(\text{right})$

    where root is the node to be split, left and right are the left and right branches, and $w_{\text{left}}$, $w_{\text{right}}$ are the fractions of samples sent to each branch.
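
A minimal sketch of both classification criteria for a binary split, assuming the labels are given as plain lists (note that the child terms are weighted by their share of the samples):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """ID3 criterion: entropy of the parent minus the weighted child entropies."""
    n = len(parent)
    return entropy(parent) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def impurity_decrease(parent, left, right):
    """CART criterion: Gini impurity of the parent minus the weighted child impurities."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

labels = ["+", "+", "+", "-", "-", "-"]
print(information_gain(labels, labels[:3], labels[3:]))   # 1.0, a perfect split
print(impurity_decrease(labels, labels[:3], labels[3:]))  # 0.5
```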

For regression:

  • CART: use the variance instead of Gini or entropy. Choose the split that decreases the variance the most (a rough sketch follows below).
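
A rough sketch of the regression criterion, with the same weighting as above (plain population variance is used here; implementations differ in such details):

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Regression criterion: variance of the parent minus the weighted child variances."""
    n = len(parent)
    return variance(parent) - len(left) / n * variance(left) - len(right) / n * variance(right)

y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
print(variance_reduction(y, y[:3], y[3:]))  # large reduction: the split separates the two clusters
```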

Note:

When computing the information gain, the impurity decrease or the variance reduction, remember to weight the left and right children by their fractions of the samples.
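
For example, with made-up numbers: a parent node with 10 samples and Gini 0.5 is split into a left child with 8 samples and Gini 0.4 and a right child with 2 samples and Gini 0:

$\Delta I = 0.5 - \tfrac{8}{10} \cdot 0.4 - \tfrac{2}{10} \cdot 0 = 0.18$

Dropping the weights would instead give $0.5 - 0.4 - 0 = 0.1$.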

CART

CART stands for Classification And Regression Tree. In this part, regression using the variance is introduced.

The procedure of building the CART:

  • Find the best split for each feature, i.e. the one that minimises the impurity measure
  • Find the feature whose best split decreases the impurity the most
  • Use the best split on that feature to split the node
  • Repeat the same procedure for each of the leaf nodes (a rough sketch follows below)

(First find the best split point for each feature, then compare the best results across all features to decide which feature to split on, and split. Iterate this procedure.)
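
A rough sketch of this loop for regression, assuming the data is a list of feature rows plus a list of targets and every split has the form feature <= threshold (a minimal illustration, not an efficient implementation):

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets):
    """Steps 1-2: for every feature try every threshold, keep the split with the
    lowest weighted child variance; return (feature index, threshold) or None."""
    best, best_score, n = None, float("inf"), len(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            score = len(left) / n * variance(left) + len(right) / n * variance(right)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, targets, min_samples=2):
    """Steps 3-4: split the node on the winning feature and recurse on each child."""
    split = best_split(rows, targets) if len(targets) >= min_samples else None
    if split is None or variance(targets) == 0.0:
        return {"leaf": sum(targets) / len(targets)}   # leaf predicts the mean target
    f, threshold = split
    go_left = [row[f] <= threshold for row in rows]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([r for r, g in zip(rows, go_left) if g],
                           [t for t, g in zip(targets, go_left) if g], min_samples),
        "right": build_tree([r for r, g in zip(rows, go_left) if not g],
                            [t for t, g in zip(targets, go_left) if not g], min_samples),
    }

tree = build_tree([[1.0], [2.0], [3.0], [10.0], [11.0]], [1.1, 0.9, 1.0, 5.0, 5.2])
```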

Pruning

Decision trees are prone to overfitting. One solution is pruning. This can be done in a variety of ways, including:

  • Reduced Error Pruning
  • Entropy Based Merging

Reduced Error Pruning

Procedure:

  • Start at the leaf nodes
  • Look one level up, at the last decision split
  • Replace that split with a leaf node predicting the majority class
  • If the validation set classification accuracy is not affected, then keep the change

This is a simple and fast algorithm that can simplify overly complex decision trees.
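
A rough sketch of the idea, assuming the dict-based tree shape from the CART sketch above, with leaves storing a predicted class under "leaf" and internal nodes also storing the majority class of their training samples under "majority" (an assumption for this illustration):

```python
def predict(node, row):
    """Follow the splits down to a leaf; the leaf stores the predicted class."""
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def reduced_error_prune(node, root, val_rows, val_labels):
    """Work bottom-up: try replacing each split by a leaf predicting the majority
    class, and keep the replacement if validation accuracy does not drop."""
    if "leaf" in node:
        return
    reduced_error_prune(node["left"], root, val_rows, val_labels)
    reduced_error_prune(node["right"], root, val_rows, val_labels)
    before = accuracy(root, val_rows, val_labels)
    backup = dict(node)                        # remember the split
    node.clear()
    node["leaf"] = backup["majority"]          # replace the split with a majority-class leaf
    if accuracy(root, val_rows, val_labels) < before:
        node.clear()
        node.update(backup)                    # accuracy dropped, so undo the pruning
```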

Entropy Based Merging

Procedure:

  • Choose a pair of leaf nodes with the same parent
  • Compute the entropy gain from merging them
  • If it is lower than a threshold, merge the nodes.

Unlike reduced error pruning, this doesn’t require additional (validation) data.
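
A rough sketch of the merge test, assuming each leaf stores the class counts of the training samples that reached it; the entropy gain of the merge is the entropy of the pooled counts minus the weighted entropies of the two leaves:

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def entropy_gain_of_merge(left_counts, right_counts):
    """Entropy of the pooled counts minus the weighted entropies of the two leaves."""
    merged = Counter(left_counts) + Counter(right_counts)
    n_left, n_right = sum(left_counts.values()), sum(right_counts.values())
    n = n_left + n_right
    separate = (n_left / n) * entropy_of_counts(left_counts) + \
               (n_right / n) * entropy_of_counts(right_counts)
    return entropy_of_counts(merged) - separate

# Merge the two sibling leaves if the gain stays below a chosen threshold.
print(entropy_gain_of_merge({"yes": 9, "no": 1}, {"yes": 8, "no": 2}))  # ~0.01, safe to merge
```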
