As
in many realworld situations, union makes algorithms stronger. With this
philosophy in mind, ensemble methods combine several
weak classifiers into a massive onein terms of accuracy. In the
last post we learnt a
primer with Random Forest. Therefore, the next cornerstone is gradient boosting. I
mentioned Gradient Boosting many times in this blog, but I only commented the
fundamental ideas, without discussing further the details. In this entry I will
share my two cents. Let me introduce a little bit of history, first: recall the
KaggleHiggs competition. The top scores in the leaderboard have been obtained
by using distinct forms of gradient boosting, and
XGBoost is the direct responsible
of many of these. The question is, hence,
how does this algorithm work?

Figure 1. A highlevel description of the Gradient Boosting method I programmed. Click to enlarge. 
Informally,
Gradient Boosting generates a sequence of classifiers in the form of an
additive expansion, that is, at each iteration a new
weak classifier is trained to improve the accuracy of the previous
ones, in the same fashion as
AdaBoost. However, and differently from this one,
the gradient of the loss function (i.e., the function used to check how well
our algorithm performs) is used to guide the search toward a very accurate
solution (i.e., continual improvement by means of a gradient). Another
characteristic is that Gradient Boosting works best with a regression tree as a
base model. Combining several of those, the common loss function used
is the mean squared error, which is differentiable (and also straightforward to
handle). Figure 1 shows the highlevel description of the algorithm.
So
far, so good, we have an algorithm that is straightforward to implement and
that offers a high degree of accuracy. However, the trick lies in the trees.
Gradient Boosting exploits the benefits of
wellinducted
trees, and this is, probably, the most difficult part of the technique. There
are many possibilities, and in my particular case I implemented the simplest
form of a binary regression tree as a basis. This tree performs the splits by
computing the gain in variance (informally, how much variety is in the output
variable), and without considering any form of
backfitting (i.e., tree pruning). Also, I just “ignored” the
unknown values, making the variance computation with the known available. Despite
this rather simplistic approach, I obtained a score of
3.31 in the Approximate
Median Significance (AMS) metric (almost the same as in the
original Higgs challenge reference study, which is of
3.34!). The configuration was pretty
outoftheshelf, consisting of 300 boosted trees, a learning rate of 0.1 and a
limitation of 10 leaves per tree. Just for instance, XGBoost uses unpruned
binary trees, and its amazing accuracy comes from the tricks used for growing
the trees (e.g., mainly handling unknown values).
As
concluding remarks, I have to say that I enjoyed the experience. OK, I did not
win the challenge (the leader has a score of 3.85 in the aforementioned AMS
metric) but I learned a lot, discoveringand, of course, implementingthese
marvelous machine learning techniques. Following the rules of the challenge, I
will upload and share the source code (it is programmed in a userfriendly C++)
when it finishes (September the 15th).
References
[1] G. James, D. Witten, T. Hastie and
R. Tiibshirani, "An Introduction to Statistical Learning With
Applications in R," ISSN 1431875X. Springer, 2013.
[2] J. Friedman, "Greedy function approximation: Gradient boosting machine," the Annals of Statistics 2001, Vol. 29, No. 5, 1189–1232.