Toward ensemble methods: A primer with Random Forest
The Kaggle-Higgs competition has attracted my attention very much lately. In this particular challenge, the goal is to generate a predictive model out of a bunch of data taken from distinct LHC’s detectors. The numbers are the following: 250000 properly labeled instances, 30 real-valued features (containing missing values), and two classes, namely background { b } and signal { s }. Also, the site provides a test set consisting of 550000 unlabeled instances. There are nearly 1300 participants while I am writing this post, and a lot of distinct methods are put to the test to win the challenge. Very interestingly, ensemble methods are all in the head of the leaderboard, and the surprise is XGBoost , a gradient boosting method that makes use of binary trees. After checking the computational horsepower of the XGBoost algorithm by myself, I decided to take a closer look at ensemble methods. To start with, I implemented a Random Forest , an algorithm that consists of many indep...