Toward ensemble methods: A primer with Random Forest
The Kaggle Higgs competition has caught my attention lately. In this particular challenge, the goal is to build a predictive model out of a bunch of data taken from the LHC detectors. The numbers are the following: 250,000 properly labeled instances, 30 real-valued features (containing missing values), and two classes, namely background {b} and signal {s}. The site also provides a test set of 550,000 unlabeled instances. There are nearly 1,300 participants as I write this post, and a lot of distinct methods are being put to the test to win the challenge. Very interestingly, ensemble methods sit at the top of the leaderboard, and the surprise is XGBoost, a gradient boosting method that makes use of binary trees.
After checking the computational horsepower of XGBoost for myself, I decided to take a closer look at ensemble methods. To start with, I implemented a Random Forest, an algorithm that consists of many independent binary trees generated by bagging data samples. The idea is dead simple: generate many small trees (with low predictive value, referred to as “weak classifiers”), each trained on a distinct data sample taken from the original data set using what is called sampling “with replacement”: when generating a data subset we draw instances from the original data set at random, allowing repeated ones. After training the n trees, the final prediction is made by combining the votes of all the weak classifiers (the forest). This way, theoretically, the ensemble reduces variance without increasing bias [1] (wow!). There are a bunch of commercial devices that use Random Forest as a base algorithm, and the most famous is probably Microsoft Kinect [2].
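To make the bagging-plus-voting idea concrete, here is a minimal sketch (my own illustration, not the exact code I wrote): `train_tree` is a hypothetical stand-in for whatever decision-tree learner you use, trees are assumed to expose a `predict` method, and labels are assumed to be encoded as small non-negative integers.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) instances with replacement (repeated rows are allowed)."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def train_forest(X, y, n_trees, train_tree, rng=None):
    """Train n_trees weak classifiers, each on its own bootstrap sample."""
    rng = rng if rng is not None else np.random.default_rng()
    return [train_tree(*bootstrap_sample(X, y, rng)) for _ in range(n_trees)]

def predict_forest(forest, X):
    """Combine the forest's votes: majority vote per instance."""
    votes = np.array([tree.predict(X) for tree in forest])  # shape: (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

In practice you would also cap the tree depth or the minimum leaf size so the individual trees stay “weak” and cheap to train.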
What makes Random Forest such a competitive technique is the way it generates the trees: following Hunt et al.’s algorithm, at each stage we recursively select the best feature (called a “predictor” in the literature) and perform a split, growing the tree. To select the best feature and partition we need some measure of impurity; I used information gain (i.e., how much information we gain by performing a given split compared to the base case, that is, no split). Random Forest tweaks this search by not allowing the use of all the features, but only a random subset of them (typically the square root of the number of features) at each split. This way, a random tree avoids always picking the same (and strongest) features, thereby decorrelating the trees. Random Forest can also be used for feature selection, based on the same split procedure.
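As a small sketch of the split search described above (again an illustration under my own assumptions, not the author’s exact code), here is entropy-based information gain plus the random feature subsampling step with the square-root rule:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of the class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Gain of splitting y into y[left_mask] / y[~left_mask] vs. the no-split base case."""
    n, n_left = len(y), left_mask.sum()
    if n_left == 0 or n_left == n:      # degenerate split: nothing gained
        return 0.0
    w_left = n_left / n
    w_right = 1.0 - w_left
    return entropy(y) - w_left * entropy(y[left_mask]) - w_right * entropy(y[~left_mask])

def candidate_features(n_features, rng):
    """Random Forest twist: consider only ~sqrt(n_features) features, picked at random."""
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)
```

At each node the tree would evaluate candidate splits only on the features returned by `candidate_features` and keep the one with the highest information gain.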
As a concluding remark, after implementing and testing the algorithm, I can tell you that Random Forest is FAST! It is surprising how quickly it learns and how it classifies large-scale data in just a fraction of a second. Moreover, the technique is inherently parallel, so practitioners can squeeze out even more throughput. Wow!
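Because each tree depends only on its own bootstrap sample, the training loop parallelizes with no coordination at all. A minimal sketch using the standard library, reusing the hypothetical `bootstrap_sample` and `train_tree` from the earlier snippets:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def train_forest_parallel(X, y, n_trees, train_tree, n_workers=4):
    """Train the trees in separate processes; they do not depend on each other."""
    rng = np.random.default_rng()
    # Draw all bootstrap samples up front.
    Xs, ys = zip(*(bootstrap_sample(X, y, rng) for _ in range(n_trees)))
    # Note: train_tree must be a top-level (picklable) function for a process pool.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(train_tree, Xs, ys))
```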
References
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani, "An Introduction to Statistical Learning with Applications in R," Springer, 2013. ISSN 1431-875X.
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-Time Human Pose Recognition in Parts from a Single Depth Image," in CVPR, IEEE, June 2011.