Skip to main content

Comparing Two (Or More) Classifiers

The problem of comparing two (or more) classifiers on a set of problems is not trivial. For that matter I want to share some fundamental definitions and procedures about it. This post is mainly based on the work of Demsar (2006). I will use a more informal way of describing this issue for the sake of clarity.

First, let's start with some (not-so-) complex and essential definitions.

1) The null hypothesis (or H_0). The hypothesis we want to prove false using, in our case, a statistical test; typically that all classifiers perform the same on average.

2) The alternative hypothesis (or H_1). The opposite hypothesis to the null hypothesis. In our particular case that not all the classifiers perform the same on average.

3) Type I Error or a false positive. Rejecting the null hypothesis (incorrectly) when is actually true.

4) Type II Error or a false negative. Conversely to type I error, not rejecting the null hypothesis when is actually false.

5) The level of significance. This is an important concept. It tells us whether we found a true pattern in data or not (i.e., that it was just chance). For example that classifier X is better, on average, than classifier Y for the given set of problems, with a certain probability of avoiding both type I error and type II error. This probability is coined as alpha.

6) Alpha. The probability value which identifies the level of significance. The larger this value, the more chance of committing type II error, and also we have less statistical power. Conversely, the lower this value, the more chance of committing type I error. Typical values are 0.05 and 0.1.

7) The computed p-value (or simply p). The p-value is the smallest level of significance that results in the rejection of the null hypothesis. This is a key concept, because if a test of significance gives a computed p-value lower than or equal to the significance level alpha, the null hypothesis is rejected.

Knowing this, we can proceed with the statistical test itself. For this purpose, I will use the Friedman's test. It is, probably, the most well known non-parametric test (that is: this test does not assume any particular probability distribution, as opposed to parametric tests like ANOVA). It ranks the algorithms based on their performance, for instance the accuracy or the F-Measure, for each data set separately. The reader is referred to (Demsar, 2006) for further details on this.
Notice that there are several software packages that perform the Friedman's ranking.

After the ranking, the Friendman statistic is computed. This results in a value that is used for rejecting (or not) the null hypothesis, using a chi-squared distribution with k - 1 degrees of freedom, in our case the number of algorithms minus one.

To further improve the precision of this test, a correction is performed. This is the Iman-Davenport correction (also referred to as F_F), which follows an F distribution instead, with k - 1 and (k - 1)(N - 1) degrees of freedom. In our case it refers to (1) the number of algorithms minus one and (2) the multiplication of the number of algorithms minus one and the number of data sets minus one. Again, the reader is referred to (Demsar, 2006) for more details.

In this point we can compute the p-value and check whether the null hypothesis is rejected or not. The problem is that computing the p-value (by hand) is not easy (I will not enter in further details), so we typically make use of p-value calculators for this issue (or other related software). In my case, I use R, an excellent and free software. It is very straightforward, assuming a confidence level of 95% (i.e., alpha = 0.05), seven algorithms and 30 distinct data sets, and a computed F_F value of 7.268, simply type 1 - pf( 7.268, 7 - 1, (7-1) * (30 - 1) ) in the console. The result I get is much less than 0.05 (in fact the actual value is 0.000000610959), hence we reject the null hypothesis that the classifiers perform the same on average. Therefore, we can use a post-hoc test (or more than just one) to check whether there are statistically significant differences, on average, among algorithms. Examples of these tests are the Nemenyi's test, the Bonferroni-Dunn's test or the Holm's step-down procedure---see (Demsar, 2006; García and Herrera, 2008) for further details on post-hoc tests.


Demsar J. (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30

García S, Herrera F (2008) An extension on ”Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9:2677–2694


Popular posts from this blog

Toward ensemble methods: A primer with Random Forest

The Kaggle-Higgs competition has attracted my attention very much lately. In this particular challenge, the goal is to generate a predictive model out of a bunch of data taken from distinct LHC’s detectors. The numbers are the following: 250000 properly labeled instances, 30 real-valued features (containing missing values), and two classes, namely background {b} and signal {s}. Also, the site provides a test set consisting of 550000 unlabeled instances. There are nearly 1300 participants while I am writing this post, and a lot of distinct methods are put to the test to win the challenge. Very interestingly, ensemble methods are all in the head of the leaderboard, and the surprise is XGBoost, a gradient boosting method that makes use of binary trees.
After checking the computational horsepower of the XGBoost algorithm by myself, I decided to take a closer look at ensemble methods. To start with, I implemented a Random Forest, an algorithm that consists of many independent binary trees…

High Performance Computing, yet another brief installation tutorial

Today’s mid- and high-end computers come with a tremendous hardware, mostly used in video games and other media software, that can be exploited for advanced computation, that is: High Performance Computing (HPC). This is a hot topic in Deep Learning as modern graphic cards come with huge streaming process power and large and quick memory. The most successful example is in Nvidia’s CUDA platform. In summary, CUDA significantly speeds up the fitting of large neural nets (for instance: from several hours to just a few minutes!).
However, the drawbacks come when setting up the scenario: it is non-trivial to install the requirements and set it running, and personally I had a little trouble the first time as many packages need to be manually compiled and installed in a specific order. The purpose of this entry is to reflect what I did for setting up Theano and Keras with HPC using an Nvidia’s graphic card (in my case a GT730) using GNU/Linux. To do so, I will start assuming a clean Debian …

A conceptual machine for cloning a human driver

If anything else, a Machine Learning practitioner has to get a global view and think on how far our conceptual machines can go. This is a little tale of my recent experience with Deep Learning. To start with, say that I am enrolled in the Udacity’s nanodegree inSelf Driving Car Engineer. Here we are learning the techniques needed for building a car controller capable of driving at the human level. The interesting part of the course (to be read as where one truly learns) is in the so called projects: these are nothing but assignments in which the student has to solve a challenging task, mainly by using Convolutional Neural Networks (convnets), the most well-known Deep Learning models. So far, the most mind-blowing task and the focus of this entry is project 3, were we have to record our driving behavior in a car simulator and let a convnet to clone it and successfully generalize the act of driving. It is remarkable that a convnet learns to control the car from raw color images jointly…