
The challenge of learning from rare cases


An important challenge in machine learning is learning from domains in which the classes are not equally represented, that is, from problems that contain class imbalances (Orriols-Puig, 2008). Figure 1 shows a toy example of this issue. It is a hard problem because (1) in many real-world domains we cannot assume a balanced distribution of classes, and (2) traditional machine learning algorithms struggle to induce accurate models in such domains. Oftentimes the key knowledge needed to solve a problem that previously eluded solution is hidden in patterns that are rare. To tackle this issue, practitioners rely on re-sampling techniques, that is, algorithms that pre-process the data set and either (1) add synthetic instances of the minority class to the original data or (2) remove instances from the majority class. The first strategy is called over-sampling and the latter under-sampling; a naive version of both is sketched right after Figure 1.
Figure 1. Our unbalanced data set. Black dots are the majority class (0) and red dots the minority class (1). It has an imbalance ratio of 11.25: the majority class corresponds to 91.8367% of the data and the minority to 8.1633%.
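As a quick illustration, a naive version of both strategies takes only a few lines of R. This is a minimal sketch, assuming the toy data of Figure 1 live in a data frame d with a factor column class whose levels are 0 (majority) and 1 (minority); d and class are placeholder names, not part of the original data set:

## Naive re-sampling sketch. Assumes a data frame `d` with a factor column `class`
## whose majority level is "0" and minority level is "1" (placeholder names).
set.seed(42)
majority <- d[d$class == "0", ]
minority <- d[d$class == "1", ]

## Random over-sampling: replicate minority rows (with replacement) until both
## classes have the same number of instances.
over  <- rbind(majority,
               minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

## Random under-sampling: keep only as many majority rows as there are minority rows.
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

Note that random over-sampling only duplicates existing points; SMOTE, presented next, goes one step further and creates genuinely new ones.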

In this post I will present the most successful over-sampling technique, the so-called Synthetic Minority Over-sampling Technique (SMOTE), introduced by Chawla et al. (2002). It works in a very simple manner: it generates new synthetic samples of the minority class by interpolating between each minority instance and its nearest minority-class neighbors. Figure 2 shows the result of applying this method to our toy problem.

Figure 2. The SMOTEd version of the data set (see Figure 1). Now we have a much more balanced domain (almost 50% of the instances are in each class).
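At its core, each synthetic point lies on the segment joining a minority instance and one of its k nearest minority neighbors, at a random position. A sketch of that single step, with illustrative variable names:

## Core SMOTE step (illustrative names): `x` is one minority instance and `x_nn`
## one of its k nearest minority-class neighbors, both numeric vectors.
gap   <- runif(1)                  # random gap in [0, 1]
x_new <- x + gap * (x_nn - x)      # synthetic sample between the two points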

In the following I provide the R code. It lets one select the number of synthetic samples to generate, the number of nearest neighbors k used for the data generation, and the distance metric (either the Euclidean or the Mahalanobis distance).
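What follows is a minimal sketch of such a function, not a definitive implementation; the function and argument names are illustrative. x holds the minority-class instances, n_new is the number of synthetic samples to generate, k the number of neighbors, and metric the distance used. The Mahalanobis branch simply whitens the data with the inverse covariance matrix, so that the Euclidean distance in the whitened space equals the Mahalanobis distance in the original one.

## Minimal SMOTE sketch (illustrative names, not a definitive implementation).
## `x`: numeric matrix or data frame with the minority-class instances only.
## `n_new`: number of synthetic samples to generate.
## `k`: number of nearest neighbors to draw from.
## `metric`: "euclidean" or "mahalanobis".
smote_sample <- function(x, n_new, k = 5, metric = c("euclidean", "mahalanobis")) {
  x <- as.matrix(x)
  metric <- match.arg(metric)

  if (metric == "euclidean") {
    d <- as.matrix(dist(x))
  } else {
    ## Whiten the data with the inverse covariance: the Euclidean distance in the
    ## whitened space is the Mahalanobis distance in the original space.
    e <- eigen(solve(cov(x)), symmetric = TRUE)
    w <- x %*% e$vectors %*% diag(sqrt(pmax(e$values, 0)), nrow = ncol(x))
    d <- as.matrix(dist(w))
  }

  n <- nrow(x)
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x))
  for (i in seq_len(n_new)) {
    j   <- sample(n, 1)                         # a random minority instance
    nn  <- order(d[j, ])[2:(k + 1)]             # its k nearest neighbors (skip itself)
    nb  <- x[nn[sample(length(nn), 1)], ]       # one neighbor, chosen at random
    gap <- runif(1)                             # random gap in [0, 1]
    synthetic[i, ] <- x[j, ] + gap * (nb - x[j, ])
  }
  colnames(synthetic) <- colnames(x)
  synthetic
}

## Usage on the toy problem (assuming the placeholder data frame `d` from above):
## new_minority <- smote_sample(d[d$class == "1", setdiff(names(d), "class")],
##                              n_new = 100, k = 5)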
 



References

A. Orriols-Puig. New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models. PhD thesis, Universitat Ramon Llull, Barcelona, Spain, 2008.

N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
