Comparing two (or more) classifiers over a set of problems is not trivial. For that reason, I want to share some fundamental definitions and procedures for doing it. This post is mainly based on the work of Demsar (2006). For the sake of clarity, I will describe the topic informally.

First, let's start with some essential (and not-so-complex) definitions.

**1)** The null hypothesis (or H_0). The hypothesis we try to reject using, in our case, a statistical test; typically, that all classifiers perform the same on average.

**2)** The alternative hypothesis (or H_1). The opposite of the null hypothesis; in our particular case, that not all the classifiers perform the same on average.

**3)** Type I error, or false positive. Incorrectly rejecting the null hypothesis when it is actually true.

**4)** Type II error, or false negative. Conversely to a type I error, not rejecting the null hypothesis when it is actually false.

**5)** The level of significance. This is an important concept. It is the threshold we use to decide whether a pattern found in the data (for example, that classifier X is better, on average, than classifier Y for the given set of problems) is a true one or just due to chance. More precisely, it is the maximum probability of committing a type I error that we are willing to accept. This probability is denoted *alpha*.

**6)** Alpha. The probability value that identifies the level of significance. The larger this value, the greater the chance of committing a type I error, but also the greater the statistical power. Conversely, the lower this value, the greater the chance of committing a type II error. Typical values are 0.05 and 0.1.

**7)** The computed p-value (or simply *p*). The p-value is the smallest level of significance that results in the rejection of the null hypothesis. This is a key concept: if a test of significance gives a computed p-value lower than or equal to the significance level alpha, the null hypothesis is rejected.
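
To make the role of alpha concrete, here is a small R sketch. It assumes a hypothetical experiment in which all classifiers are equally good, so the null hypothesis is true by construction and every rejection is a type I error. The scores are random noise, and the seed, the number of repetitions and the variable names are my own choices; the test used is R's built-in `friedman.test`.

```r
# Hypothetical simulation: H_0 is true by construction, so each rejection
# counts as a type I error.
set.seed(1)       # arbitrary seed, only for reproducibility
alpha <- 0.05     # level of significance
k <- 7            # number of classifiers
N <- 30           # number of data sets
n_rep <- 2000     # number of simulated experiments

p_values <- replicate(n_rep, {
  # Rows are data sets, columns are classifiers; scores are pure noise.
  scores <- matrix(runif(N * k), nrow = N, ncol = k)
  friedman.test(scores)$p.value
})

# Fraction of experiments in which H_0 is (wrongly) rejected.
type_I_rate <- mean(p_values <= alpha)
print(type_I_rate)
```

The printed fraction should stay close to alpha: with this decision rule, a true null hypothesis is rejected in roughly 5% of the experiments.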

Knowing this, we can proceed with the statistical test itself. For this purpose, I will use Friedman's test. It is probably the best-known non-parametric test (that is, it does not assume any particular probability distribution, as opposed to parametric tests like ANOVA). It ranks the algorithms by their performance, for instance the accuracy or the F-measure, on each data set separately: the best-performing algorithm gets rank 1, the second best rank 2, and so on, with ties receiving average ranks. The reader is referred to (Demsar, 2006) for further details.
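
The ranking step can be sketched in a few lines of R. The accuracy values below are made up for illustration, and the data set and classifier names are hypothetical.

```r
# Hypothetical accuracies: rows are data sets, columns are classifiers.
acc <- matrix(c(0.90, 0.85, 0.80,
                0.70, 0.75, 0.65,
                0.88, 0.88, 0.79,
                0.95, 0.91, 0.93),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("D", 1:4), c("A", "B", "C")))

# Rank within each data set; rank 1 is the best (highest accuracy).
# Tied classifiers receive the average of the ranks they span.
ranks <- t(apply(acc, 1, function(row) rank(-row, ties.method = "average")))

# Average rank of each classifier over all data sets.
avg_ranks <- colMeans(ranks)
print(ranks)
print(avg_ranks)
```

On the third data set, A and B are tied for first place, so both get rank 1.5. The average ranks are what the Friedman statistic is built from.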

Note that several software packages perform Friedman's ranking.

After the ranking, the Friedman statistic is computed. This results in a value that is used for rejecting (or not) the null hypothesis, using a chi-squared distribution with *k - 1* degrees of freedom; in our case, the number of algorithms minus one.
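
As a rough illustration, the statistic can be computed by hand from the average ranks, using the formula given in Demsar (2006), and compared with R's built-in `friedman.test`. The scores are randomly generated (hypothetical), and the small boost given to the last classifier is only there so that the classifiers are not all equal.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
N <- 30        # number of data sets
k <- 7         # number of algorithms

# Hypothetical scores: rows are data sets, columns are algorithms.
scores <- matrix(runif(N * k), nrow = N, ncol = k)
scores[, k] <- scores[, k] + 0.3   # make the last algorithm somewhat better

# Average rank of each algorithm (rank 1 = best score on a data set).
ranks <- t(apply(scores, 1, function(row) rank(-row)))
R <- colMeans(ranks)

# Friedman statistic (no correction for ties) and its chi-squared p-value.
chi2_F <- 12 * N / (k * (k + 1)) * (sum(R^2) - k * (k + 1)^2 / 4)
p_chi2 <- pchisq(chi2_F, df = k - 1, lower.tail = FALSE)

# The built-in test should agree, because these scores contain no ties.
builtin <- friedman.test(scores)
print(c(by_hand = chi2_F, builtin = unname(builtin$statistic)))
print(c(by_hand = p_chi2, builtin = builtin$p.value))
```

When ties are present, `friedman.test` applies a tie correction, so the two values can differ slightly.
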

Because the Friedman statistic is overly conservative, a correction is usually applied. This is the Iman-Davenport correction (also referred to as *F_F*), which follows an F distribution instead, with *k - 1* and *(k - 1)(N - 1)* degrees of freedom: (1) the number of algorithms minus one and (2) the product of the number of algorithms minus one and the number of data sets minus one. Again, the reader is referred to (Demsar, 2006) for more details.
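
A minimal sketch of the correction in R follows. It uses the formula from Demsar (2006), F_F = (N - 1) chi2_F / (N (k - 1) - chi2_F). The function name `iman_davenport` and the Friedman value of 36 are hypothetical and only serve the example.

```r
# Iman-Davenport statistic computed from the Friedman chi-squared statistic.
iman_davenport <- function(chi2_F, N, k) {
  (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
}

N <- 30        # number of data sets
k <- 7         # number of algorithms
chi2_F <- 36   # hypothetical Friedman statistic, chosen for illustration

F_F <- iman_davenport(chi2_F, N, k)
df1 <- k - 1                # first degrees of freedom
df2 <- (k - 1) * (N - 1)    # second degrees of freedom

# Upper-tail probability of the F distribution, i.e. the p-value.
p_value <- pf(F_F, df1, df2, lower.tail = FALSE)
print(c(F_F = F_F, df1 = df1, df2 = df2, p_value = p_value))
```

Here `pf(..., lower.tail = FALSE)` gives the same quantity as `1 - pf(...)`, but it avoids the subtraction, which is slightly safer when the p-value is very small.
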

At this point we can compute the p-value and check whether the null hypothesis is rejected or not. Computing the p-value by hand is not easy (I will not go into further details), so we typically use a p-value calculator or other statistical software. In my case, I use R, an excellent and free software environment. It is very straightforward: assuming a confidence level of 95% (i.e., alpha = 0.05), seven algorithms, 30 distinct data sets, and a computed F_F value of 7.268, simply type

`1 - pf(7.268, 7 - 1, (7 - 1) * (30 - 1))` in the console. The result I get is much less than 0.05 (in fact, the actual value is 0.000000610959), hence we reject the null hypothesis that the classifiers perform the same on average. Therefore, we can use a post-hoc test (or more than one) to find out which algorithms differ significantly, on average, from each other. Examples of these tests are Nemenyi's test, the Bonferroni-Dunn test and Holm's step-down procedure; see (Demsar, 2006; García and Herrera, 2008) for further details on post-hoc tests.

**References**

Demsar J. (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30

García S, Herrera F (2008) An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9:2677–2694