Comparing two (or more) classifiers on a set of problems is not trivial. For that reason, I want to share some fundamental definitions and procedures for doing it. This post is mainly based on the work of Demšar (2006). I will describe the issue informally for the sake of clarity.

First, let's start with some (not-so-) complex and essential definitions.

1) The null hypothesis (or H_0). The hypothesis we try to reject using, in our case, a statistical test; typically it states that all classifiers perform the same on average.

2) The alternative hypothesis (or H_1). The complement of the null hypothesis. In our particular case, it states that not all the classifiers perform the same on average.

3) Type I error, or false positive. Rejecting the null hypothesis when it is actually true.

4) Type II error, or false negative. The converse of a Type I error: failing to reject the null hypothesis when it is actually false.

5) The level of significance. This is an im…
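
To make the Type I error definition concrete, here is a minimal sketch that simulates two classifiers that really do perform the same (so H_0 is true by construction) and counts how often a test rejects H_0 anyway. The choice of an exact two-sided sign test, the function names, the 20 data sets, and the threshold `alpha = 0.05` are all assumptions made for illustration; they are not taken from the post or from Demšar (2006).

```python
import math
import random


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value (ties are assumed to be dropped already)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def estimate_type_i_rate(n_datasets=20, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments that reject H_0 although H_0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Under H_0, either classifier is equally likely to win on each data set.
        wins = sum(rng.random() < 0.5 for _ in range(n_datasets))
        losses = n_datasets - wins
        if sign_test_p_value(wins, losses) < alpha:
            rejections += 1  # a false positive, i.e. a Type I error
    return rejections / n_trials


if __name__ == "__main__":
    print(f"Estimated Type I error rate: {estimate_type_i_rate():.3f}")
```

Every rejection counted here is a false positive, because the simulated classifiers are identical on average. The estimated rate should stay close to or below the chosen `alpha`, since the sign test is exact and therefore somewhat conservative for small numbers of data sets.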
