Skip to main content

Posts

Showing posts from May, 2014

The challenge of learning from rare cases

An important challenge is learning from domains that do not have the same proportion of classes, that is, learning from problems that contain class imbalances (Orriols-Puig, 2008). Figure 1 shows a toy example of this issue. Notwithstanding, it is challenging because (1) in many real-world problems we cannot assume a balanced distribution of classes and (2) traditional machine learning algorithms cannot induce accurate models in such domains. Oftentimes it happens that the key knowledge to solve a problem that previously eluded solution is hidden in patterns that are rare. To tackle this issue, practitioners rely on re-sampling techniques, that is, algorithms that pre-process the data sets and either (1) add synthetic instances of the minority pattern to the original data or (2) eliminate instances from the majority class. The first type is called over-sampling and the later, under-sampling. 
In this post I will present the most successful over-sampling technique, the so called Synthe…