The challenge of learning from rare cases
An important challenge in machine learning is learning from domains that do not have the same proportion of classes, that is, from problems that contain class imbalances (Orriols-Puig, 2008). Figure 1 shows a toy example of this issue. The problem is hard because (1) in many real-world domains we cannot assume a balanced distribution of classes and (2) traditional machine learning algorithms cannot induce accurate models from such data. Oftentimes, the key knowledge to solve a problem that previously eluded solution is hidden in patterns that are rare. To tackle this issue, practitioners rely on re-sampling techniques, that is, algorithms that pre-process the data set and either (1) add synthetic instances of the minority class to the original data or (2) remove instances from the majority class. The first approach is called over-sampling and the latter, under-sampling.
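To make the contrast concrete, below is a minimal sketch of both baseline strategies in R. It assumes a data frame named data with a Class column (the same conventions as the SMOTE code later in this post); the two function names are mine, not from any library.

#Naive random over-sampling: duplicate minority instances (sampling with
#replacement) until both classes have the same number of examples.
randomOverSample <- function( data, minLabel ) {
  minIdx <- which( data$Class == minLabel )
  need <- sum( data$Class != minLabel ) - length( minIdx )
  extra <- sample( minIdx, need, replace=TRUE )
  return( rbind( data, data[ extra, ] ) )
}

#Naive random under-sampling: keep only as many majority instances as
#there are minority ones, discarding the rest.
randomUnderSample <- function( data, majLabel ) {
  keep <- sample( which( data$Class == majLabel ), sum( data$Class != majLabel ) )
  return( rbind( data[ keep, ], data[ data$Class != majLabel, ] ) )
}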
In this post I will present the most successful over-sampling technique, the so-called Synthetic Minority Over-sampling Technique (SMOTE), introduced by Chawla et al. (2002). It works in a very simple manner: it generates new samples of the minority class by interpolating between each minority instance and its nearest neighbours. Figure 2 shows the result of applying this method to our toy problem.
Figure 2. The SMOTEd version of the data set (see Figure 1). The domain is now much more balanced: almost 50% of the instances belong to each class.
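At its core, SMOTE creates each synthetic example as a random point on the segment joining a minority instance and one of its k nearest neighbours. As a one-line sketch in R (x_i and x_nn are illustrative names for the feature vectors of the instance and the chosen neighbour, not identifiers from the code below):

#Chawla et al. (2002): synthetic = instance + gap * (neighbour - instance),
#where the gap is drawn uniformly from [0, 1].
x_new <- x_i + runif(1) * (x_nn - x_i)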
In the following I provide the R code. It lets one select the number of synthetic samples to generate per minority instance, the number k of nearest neighbours used for the data generation, and the distance metric (one of the following two: the Euclidean distance or the Mahalanobis distance, which weights attribute differences by the inverse covariance matrix of the features and thus accounts for correlated attributes).
#The classic C-style printf.
printf <- function(...) print( sprintf(...) )

#It computes the Euclidean distance between two instances.
# @param instance_i the ith instance.
# @param instance_j the jth instance.
euclideanDistance <- function( instance_i, instance_j ) {
  diff <- as.matrix( instance_i[, -length( instance_i )] - instance_j[, -length( instance_j )] ) #Drop the class column.
  dist <- sqrt( sum( diff * diff ) )
  return( as.numeric( dist ) )
}

#It computes the Mahalanobis distance between two instances.
# @param dataSet the data set.
# @param instance_i the ith instance.
# @param instance_j the jth instance.
mahalanobisDistance <- function( dataSet, instance_i, instance_j ) {
  #Compute the inverse covariance matrix of the data set features.
  #Note: method="spearman" uses a rank-based covariance; the classical
  #Mahalanobis distance uses the default Pearson covariance.
  S <- solve( cov( as.matrix( dataSet[, -length( dataSet )] ), method="spearman" ) )
  diff <- as.matrix( instance_i[, -length( instance_i )] - instance_j[, -length( instance_j )] )
  dist <- sqrt( diff %*% S %*% t( diff ) ) #t() is the transposed vector/matrix.
  return( as.numeric( dist ) )
}
#It finds the not-yet-used neighbour at minimum distance.
# @param data the data set.
# @param distances the array of distances.
# @param used auxiliary array of boolean flags.
# @param index index of the desired instance.
seekMinimum <- function( data, distances, used, index ) {
  min <- Inf #Use Inf instead of a magic number as the initial minimum.
  foundIn <- -1
  for( j in 1:nrow( data ) ) {
    if( j != index ) {
      if( distances[j] < min && used[j] == FALSE ) {
        min <- distances[j]
        foundIn <- j
      }
    }
  }
  if( foundIn > -1 ) {
    used[ foundIn ] <- TRUE
  }
  return( list( foundIn, used ) )
}
#Get the k nearest neighbours. It returns a data frame with the neighbours,
#ordered by distance.
# @param data the data set.
# @param index the index of the current instance.
# @param k the number of neighbours to seek for.
# @param distance the distance selected (euclidean or mahalanobis).
getNearest <- function( data, index, k, distance = "euclidean" ) {
  neighbours <- list()
  instance_i <- data[ index, ]
  used <- array( FALSE, nrow( data ) ) #Array of boolean flags.
  used[ index ] <- TRUE #Never return the instance itself as a neighbour.
  distances <- vector()
  for( i in 1:nrow( data ) ) {
    if( distance == "mahalanobis" ) { #Mahalanobis distance.
      distances[i] <- mahalanobisDistance( data, instance_i, data[i,] )
    } else { #Euclidean distance.
      distances[i] <- euclideanDistance( instance_i, data[i,] )
    }
  }
  for( found in 1:k ) {
    tmpList <- seekMinimum( data, distances, used, index )
    nearest <- tmpList[[1]]
    neighbours <- c( neighbours, data[ nearest, ] )
    used <- tmpList[[2]]
  }
  return( data.frame( matrix( unlist( neighbours ), nrow=k, byrow=TRUE ) ) )
}
#It implements SMOTE: for each minority instance it generates N synthetic
#examples by interpolating towards its k nearest neighbours. Note that it
#reads the data set from the global variable 'data'.
# @param N the number of synthetic examples to generate per minority instance.
# @param k the k-nearest neighbours.
# @param minLabel the minority label.
# @param majLabel the majority label.
# @param distance the distance selected (euclidean or mahalanobis).
# @return A new data frame with the new examples.
mySMOTE <- function( N, k, minLabel, majLabel, distance = "euclidean" ) {
  numMajority <- sum( data$Class == majLabel )
  numMinority <- sum( data$Class == minLabel )
  #Compute the imbalance ratio.
  ir <- numMajority / numMinority
  printf( "> The imbalance ratio is %f", ir )
  printf( "> The majority class corresponds to %f of the data, and the minority to %f", (numMajority / nrow( data )), (numMinority / nrow( data )) )
  #Obtain a data frame with the minority instances.
  numMinExamples <- 0
  minList <- list()
  exampleArrayIndex <- array()
  for( i in 1:nrow( data ) ) {
    if( data$Class[i] == minLabel ) {
      minList <- c( minList, data[i,] )
      exampleArrayIndex[ (numMinExamples + 1) ] <- i
      numMinExamples <- numMinExamples + 1
    }
  }
  printf( "> I've found %d examples of the minority class (%.0f).", numMinExamples, as.integer( minLabel ) )
  minInstances <- data.frame( matrix( unlist( minList ), nrow=numMinExamples, byrow=TRUE ) )
  newExNum <- 0
  dOut <- data.frame()
  for( i in 1:numMinExamples ) {
    #Note: neighbours are sought in the whole data set, not only among the minority instances.
    nn <- getNearest( data, exampleArrayIndex[i], k, distance )
    numNewNeigh <- N
    while( numNewNeigh != 0 ) {
      currEx <- minInstances[i, ] #Select a minority class instance.
      selected <- sample( k, 1 ) #Pick one of the k neighbours uniformly at random.
      newEx <- array()
      for( att in 1:(length( nn[ selected, ] ) - 1) ) {
        #Interpolate between the instance and its neighbour: synthetic = x + gap * (neighbour - x).
        newEx[ att ] <- currEx[, att ] + (nn[ selected, att ] - currEx[, att ]) * runif(1)
      }
      newEx[ ncol( data ) ] <- minLabel
      dOut <- rbind( dOut, as.data.frame( t( newEx ) ) ) #Append as a one-row data frame.
      numNewNeigh <- numNewNeigh - 1
      newExNum <- newExNum + 1
    }
  }
  printf( "> I've added %d new instances of the minority class.", newExNum )
  names( dOut ) <- names( data )
  return( dOut )
}
#Start here!
set.seed( 10097 )
setwd( dir='/home/andreu/data/' )
data <- read.csv( 'myData.csv' )
#Now call SMOTE using N = 7, k = 3 and the Mahalanobis distance.
dOut <- mySMOTE( 7, 3, 1, 0, "mahalanobis" )
data <- rbind( data, dOut ) #Fuse both data frames for plotting...
plot( data$X, data$Y, col=(data$Class + 1), pch=19, xlab="X", ylab="Y", main="SMOTEd data set" )
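As a quick sanity check, one can inspect the class distribution after fusing both data frames (plain base R, reusing the data object from above):

print( table( data$Class ) ) #Absolute class counts after adding the synthetic examples.
print( prop.table( table( data$Class ) ) ) #Relative class frequencies.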
References
A. Orriols-Puig. New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models. PhD thesis, Universitat Ramon Llull, Barcelona (Spain). 2008.
N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.