Ensemble Learning
Bagging
When to use bagging?
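Use bagging with strong (complex) models.
The model that overfits most easily is actually not the neural network but the decision tree. If you grow the tree deep enough, you can even reach 100% accuracy on the training data, but that is meaningless; it is just overfitting.
Bagging combines a group of models that each overfit easily; as the saying goes, a flurry of wild punches can beat the old master. A random forest is bagging applied to decision trees: many decision trees are combined to form the forest.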
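One way to see why combining helps, as a sketch under the assumption that the $B$ bagged models $f_1, \dots, f_B$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration and are not in the original notes): the variance of their average is

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

The second term shrinks as $B$ grows, so averaging many high-variance models reduces variance. The first term does not shrink, which is why less correlated models benefit more from averaging.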
How to get different classifiers?
- Re-sampling your data set to form a new set
- Re-weighting your data set to form a new set
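The two options can be compared on a toy example. The sketch below is only an illustration: the helper names `resample_counts` and `weighted_error` and the four-example data set are assumptions, not part of the original notes. It shows that re-sampling with replacement amounts to giving each example an integer weight (its draw count), while re-weighting sets the weights directly.

```python
import random
from collections import Counter

def resample_counts(n, rng):
    """Bootstrap n indices with replacement; return how often each index was drawn."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return [counts[i] for i in range(n)]

def weighted_error(weights, predictions, labels):
    """Weighted misclassification rate: weight of wrong examples over total weight."""
    wrong = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    return wrong / sum(weights)

labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # hypothetical classifier; the second example is wrong

# Uniform weights: every example counts the same.
uniform_error = weighted_error([1, 1, 1, 1], predictions, labels)

# Re-weighting: the misclassified example now counts three times as much.
reweighted_error = weighted_error([1, 3, 1, 1], predictions, labels)

# Re-sampling: draw counts act as integer weights on the original examples.
counts = resample_counts(len(labels), random.Random(0))
resampled_error = weighted_error(counts, predictions, labels)

print(uniform_error, reweighted_error, counts, resampled_error)
```

Either way, a classifier trained on the new set sees a different effective distribution over the same examples, which is what makes the classifiers differ.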
Random Forest
The data set for each tree is generated by bootstrapping, which resamples the data set with replacement. In a random forest, we average trees that are much less correlated. To achieve this, each tree is trained not only on a different data subset but also on a subset of $m \ll p$ of the $p$ features. Typically $m$ ranges from $1$ to $\sqrt{p}$. The individual trees are not that good, but by averaging over a huge number of trees we can get pretty good results.
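The sampling step described above can be sketched as follows. This is a minimal illustration, not a full random forest: the tree training itself is left out, and the names `draw_tree_inputs` and `majority_vote` are hypothetical. It draws the feature subset once per tree, as the paragraph describes; many implementations instead redraw the subset at every split.

```python
import math
import random

def draw_tree_inputs(n_rows, n_features, rng, m=None):
    """Pick the bootstrap rows and the m features that one tree trains on."""
    if m is None:
        # Assumption: use the upper end of the 1..sqrt(p) range.
        m = max(1, math.isqrt(n_features))
    # Bootstrap: n_rows draws with replacement, so some rows repeat and some are left out.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Feature subset: m distinct features drawn without replacement.
    features = sorted(rng.sample(range(n_features), m))
    return rows, features

def majority_vote(labels):
    """Combine per-tree class predictions; ties are broken arbitrarily."""
    return max(set(labels), key=labels.count)

rng = random.Random(42)
plans = [draw_tree_inputs(n_rows=8, n_features=16, rng=rng) for _ in range(3)]
for rows, features in plans:
    print("rows:", rows, "features:", features)

# Hypothetical predictions from three trees for one input.
vote = majority_vote(["cat", "dog", "cat"])
print(vote)
```

Each tree would then be fit on its own `rows` and `features`. For classification the per-tree outputs are combined by voting; for regression they would be averaged.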
Boosting
Use boosting with relatively weak models.
AdaBoost
AdaBoost can convert a weak learner into a strong learner (classifier).
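As a separate illustration (not the demo mentioned below), here is a minimal sketch of AdaBoost with one-dimensional decision stumps as the weak learner. The toy data set, the function names, and the choice of stumps are all assumptions made for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` when x >= threshold, and -polarity otherwise."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Search all thresholds and polarities for the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase weights of misclassified points, decrease the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, xs, ys):
    return sum(predict(ensemble, x) != y for x, y in zip(xs, ys))

# Toy data that no single stump can separate (labels are +1 / -1).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

print(training_errors(adaboost(xs, ys, 1), xs, ys))
print(training_errors(adaboost(xs, ys, 3), xs, ys))
```

On this toy set a single stump misclassifies three points, while the weighted vote of three stumps fits all ten training points, which is the weak-to-strong conversion in miniature.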
A simple AdaBoost demo of my own
Reference
- Video lectures by Hung-yi Lee, National Taiwan University
- Course resources: Hung-yi Lee