An important challenge is learning from domains that do not have the same proportion of classes, that is, learning from problems that contain class imbalances (Orriols-Puig, 2008). Figure 1 shows a *toy* example of this issue. It is challenging because (1) in many real-world problems we cannot assume a balanced distribution of classes and (2) traditional machine learning algorithms cannot induce accurate models in such domains. Oftentimes, the key knowledge to solve a problem that previously eluded solution is hidden in patterns that are **rare**. To tackle this issue, practitioners rely on re-sampling techniques, that is, algorithms that pre-process the data sets and either (1) add synthetic instances of the minority pattern to the original data or (2) eliminate instances from the majority class. The first type is called *over-sampling* and the latter, *under-sampling*.
In this post I will present the most successful over-sampling technique, the so-called *Synthetic Minority Over-sampling Technique* (SMOTE), which was introduced by Chawla et al. (2002). It works in a very simple manner: it generates new samples of the minority class by interpolating between each minority instance and its nearest neighbors. Figure 2 shows the result of applying this method to our *toy* problem.

Figure 2. The SMOTEd version of the data set (see Figure 1). The domain is now much more balanced (almost 50% of the instances belong to each class).

In the following I provide the R code. In it one can select the requested number of samples to generate, the number of neighbors k used for the data generation, and the distance metric (one of the following two: the Euclidean distance or the Mahalanobis distance).
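The R code itself is not reproduced in this excerpt. As an illustration only, here is a minimal Python sketch of the same idea, with the same three knobs (number of synthetic samples, number of neighbors k, and Euclidean or Mahalanobis distance); all names and defaults below are my own assumptions, not the original implementation.

```python
import numpy as np

def smote(X_min, n_samples, k=5, metric="euclidean", rng=None):
    """Sketch of SMOTE: create n_samples synthetic minority instances.

    X_min : array of shape (n, d) holding only minority-class instances.
    For each synthetic point, pick a random minority instance, pick one of
    its k nearest minority neighbors, and interpolate between the two.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape

    if metric == "mahalanobis":
        # Mahalanobis distance using the minority class' inverse covariance.
        VI = np.linalg.pinv(np.cov(X_min, rowvar=False))
        dist = lambda a, B: np.sqrt(np.einsum("ij,jk,ik->i", B - a, VI, B - a))
    else:
        dist = lambda a, B: np.linalg.norm(B - a, axis=1)

    synthetic = np.empty((n_samples, d))
    for i in range(n_samples):
        j = rng.integers(n)                # random minority instance
        dists = dist(X_min[j], X_min)
        dists[j] = np.inf                  # exclude the point itself
        neighbors = np.argsort(dists)[:k]  # its k nearest minority neighbors
        nb = X_min[rng.choice(neighbors)]
        gap = rng.random()                 # random point along the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

Because each synthetic point lies on the segment between two existing minority instances, the generated samples always stay inside the convex hull of the minority class.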

**References**

A. Orriols-Puig. *New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models*. PhD thesis, Universitat Ramon Llull, Barcelona, Spain, 2008.

N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. *SMOTE: Synthetic minority over-sampling technique*. Journal of Artificial Intelligence Research, 16:321–357, 2002.