Thursday, September 11, 2014

Gradient Boosting: Fostering accuracy even further

As in many real-world situations, union makes algorithms stronger. With this philosophy in mind, ensemble methods combine several weak classifiers into a massive one---in terms of accuracy. In the last post we learnt a primer with Random Forest. Therefore, the next cornerstone is gradient boosting. I mentioned Gradient Boosting many times in this blog, but I only commented the fundamental ideas, without discussing further the details. In this entry I will share my two cents. Let me introduce a little bit of history, first: recall the Kaggle-Higgs competition. The top scores in the leaderboard have been obtained by using distinct forms of gradient boosting, and XGBoost is the direct responsible of many of these. The question is, hence, how does this algorithm work?

Figure 1. A high-level description of the Gradient Boosting method I programmed. Click to enlarge.

Informally, Gradient Boosting generates a sequence of classifiers in the form of an additive expansion, that is, at each iteration a new weak classifier is trained to improve the accuracy of the previous ones, in the same fashion as AdaBoost. However, and differently from this one, the gradient of the loss function (i.e., the function used to check how well our algorithm performs) is used to guide the search toward a very accurate solution (i.e., continual improvement by means of a gradient). Another characteristic is that Gradient Boosting works best with a regression tree as a base model. Combining several of those, the common loss function used is the mean squared error, which is differentiable (and also straightforward to handle). Figure 1 shows the high-level description of the algorithm.

So far, so good, we have an algorithm that is straightforward to implement and that offers a high degree of accuracy. However, the trick lies in the trees. Gradient Boosting exploits the benefits of well-inducted trees, and this is, probably, the most difficult part of the technique. There are many possibilities, and in my particular case I implemented the simplest form of a binary regression tree as a basis. This tree performs the splits by computing the gain in variance (informally, how much variety is in the output variable), and without considering any form of backfitting (i.e., tree pruning). Also, I just “ignored” the unknown values, making the variance computation with the known available. Despite this rather simplistic approach, I obtained a score of 3.31 in the Approximate Median Significance (AMS) metric (almost the same as in the original Higgs challenge reference study, which is of 3.34!). The configuration was pretty out-of-the-shelf, consisting of 300 boosted trees, a learning rate of 0.1 and a limitation of 10 leaves per tree. Just for instance, XGBoost uses unpruned binary trees, and its amazing accuracy comes from the tricks used for growing the trees (e.g., mainly handling unknown values).

As concluding remarks, I have to say that I enjoyed the experience. OK, I did not win the challenge (the leader has a score of 3.85 in the aforementioned AMS metric) but I learned a lot, discovering---and, of course, implementing---these marvelous machine learning techniques. Following the rules of the challenge, I will upload and share the source code (it is programmed in a user-friendly C++) when it finishes (September the 15th).


[1] G. James, D. Witten, T. Hastie and R. Tiibshirani, "An Introduction to Statistical Learning With Applications in R," ISSN 1431-875X. Springer, 2013.
[2] J. Friedman, "Greedy function approximation: Gradient boosting machine," the Annals of Statistics 2001, Vol. 29, No. 5, 1189–1232. 

Friday, August 8, 2014

Toward ensemble methods: A primer with Random Forest

The Kaggle-Higgs competition has attracted my attention very much lately. In this particular challenge, the goal is to generate a predictive model out of a bunch of data taken from distinct LHC’s detectors. The numbers are the following: 250000 properly labeled instances, 30 real-valued features (containing missing values), and two classes, namely background {b} and signal {s}. Also, the site provides a test set consisting of 550000 unlabeled instances. There are nearly 1300 participants while I am writing this post, and a lot of distinct methods are put to the test to win the challenge. Very interestingly, ensemble methods are all in the head of the leaderboard, and the surprise is XGBoost, a gradient boosting method that makes use of binary trees. 

After checking the computational horsepower of the XGBoost algorithm by myself, I decided to take a closer look at ensemble methods. To start with, I implemented a Random Forest, an algorithm that consists of many independent binary trees generated by means of bagging data samples. The idea is dead simple: generate many small trees (with low predictive value, referred to as “weak classifiers”) using distinct data samples taken from the original data set (using what it is called “replacement”: when generating the data subset we take, randomly, instances from the original data set, allowing repeated ones on it). After the training of the n-trees, the final prediction is made by combining all the votes of the weak classifiers (the forest). This way, theoretically, the algorithm reduces bias and variance (wow!). There are a bunch of commercial devices that use Random Forest as a base algorithm, and the most famous is, probably, Microsoft Kinect.

Being a very competitive technique, the trick lies in the way Random Forest generates the trees: making use of Hunt et al. algorithm, at each stage, and recursively, we select the best feature (in the literature these are called “predictors”) and perform a split, growing the tree. For selecting the best feature and partition we have to use some measure of impurity. I used the information gain (i.e., how much information we gain by performing the actual split in comparison to the base case--that is, no split). Random Forest tweaks this search by not allowing the use of all the features, but a subset of these (typically the square root of the number of features), and picking them at random. This way, the Random tree avoids picking always the same (and strongest) features, therefore decorrelating the trees. Also, Random Forest can be used for feature selection, using the aforementioned split procedure.

As concluding remarks, and after the implementation and test of the algorithm, I can tell that Random Forest is FAST! It is surprising how fast it is when learning and how it manages to classify large-scale data in just a fraction of a second. Moreover, this technique is inherently parallel, so practitioners can obtain even more through put. Wow!

[1] G. James, D. Witten, T. Hastie and R. Tiibshirani, "An Introduction to Statistical Learning With Applications in R," ISSN 1431-875X. Springer, 2013.
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-Time Human Pose Recognition in Parts from a Single Depth Image," in CVPR, IEEE, June 2011.

Wednesday, June 4, 2014

Boosting the accuracy rate by means of AdaBoost

Several machine learning techniques foster the cooperation between subsolutions for obtaining an accurate outcome by combining many of them. Michigan-style LCSs are one of these families. However, the most well known are those that implement Boosting, and AdaBoost is the most successful of them (or, at least, the most studied one and the first to implement the ideas of Boosting).

AdaBoost generates accurate predictions by combining several weak classifiers (also referred to as “the wisdom of the crowd”). The most outstanding fact about AdaBoost is that it is a deadly simple algorithm. The key of its success lies in the combination of many weak classifiers: these are very limited and their error rate is just slightly better than a random choice (hence their name). The typical implementation uses decision stumps (i.e., binary trees) that minimize the error between the prediction and the ground truth (that is, the desired outcome), so the classifiers have the form "if variable_1 < 3.145 then class = -1; otherwise class = +1". Another characteristic is that AdaBoost only handles two classes, namely {-1} and {+1}.

Its scheme is very simple: it trains classifiers that predict correctly a small part of the problem space and then it combines all these by using a weighting mechanism. Then, AdaBoost decides the class of the example by computing the sign (+1 or -1) of the aggregated predictions. Recall that the weights are computed based on the error achieved by the distinct weak predictors (see the image below).

Figure 1. The learning process of AdaBoost. The Di's are the distinct weights applied to each weak classifier.

Boosting techniques have attracted my attention lately since these are the ones that provide the best results in the distinct Kaggle competitions. As usual, I implemented a little R code for playing a bit with this wonderful technique. It is listed in the following. Notice that I implemented the univariate version of the AdaBoost algorithm, keeping the code very simple and easy to understand and extend.