Boosting the accuracy rate by means of AdaBoost
Several machine learning techniques obtain an accurate outcome by combining many partial subsolutions, each of which solves only a small piece of the problem. Michigan-style LCSs are one of these families. However, the best-known methods are those that implement Boosting, and AdaBoost is the most successful of them (or, at least, the most studied one and the first to put the ideas of Boosting into practice).
AdaBoost generates accurate predictions by combining several weak classifiers (an idea often referred to as “the wisdom of the crowd”). The most outstanding fact about AdaBoost is that it is a dead simple algorithm. The key to its success lies in the combination of many weak classifiers: each one is very limited and its error rate is only slightly better than random guessing (hence the name). The typical implementation uses decision stumps (i.e., one-level binary decision trees) that minimize the error between the prediction and the ground truth (that is, the desired outcome), so the classifiers have the form "if variable_1 < 3.145 then class = -1; otherwise class = +1". Another characteristic is that standard AdaBoost only handles two classes, namely {-1} and {+1}.
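As a tiny illustration (the threshold 3.145 is just the made-up value from the rule above), such a stump can be written in one line of R:

# A hypothetical decision stump using the made-up threshold from the rule above.
stump <- function( x ) ifelse( x < 3.145, -1, 1 )
stump( c( 1.0, 5.0 ) )  # returns -1  1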
Its scheme is very simple: it trains weak classifiers that correctly predict only a small part of the problem space and then combines all of them through a weighting mechanism. AdaBoost then decides the class of an example by computing the sign (+1 or -1) of the aggregated, weighted predictions. Recall that the weight of each weak classifier is computed from the error it achieves, and that the training examples are reweighted at every round so that the next weak classifier focuses on the examples misclassified so far (see the image below).
Figure 1. The learning process of AdaBoost. The Di's are the weight distributions over the training examples at each boosting round.
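For reference, these are the standard AdaBoost update rules (written in plain notation) that the code below implements. At round t, the weak classifier h_t obtains a weighted error e_t and receives the weight

alpha_t = 0.5 * ln( (1 - e_t) / e_t ),

the example weights are then updated and renormalized as

D_{t+1}(i) = D_t(i) * exp( -alpha_t * y_i * h_t(x_i) ) / Z_t,

where Z_t is chosen so that D_{t+1} sums to one, and the final strong classifier is

H(x) = sign( sum_{t=1..T} alpha_t * h_t(x) ).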
Boosting techniques have attracted my attention lately since they are the ones that provide the best results in many Kaggle competitions. As usual, I wrote a little R script to play a bit with this wonderful technique; it is listed below. Notice that I implemented a univariate version of the AdaBoost algorithm (the stumps always split on a single, user-selected variable), keeping the code very simple and easy to understand and extend.
# @param data the data set.
# @param var the variable index to check.
# @param threshold the deciding threshold used.
# @param inequality the inequality to use: "lt" or "gt".
# @return a result array with the predictions made.
classify <- function( data, var, threshold, inequality ) {
    result <- array( 1, dim = nrow( data ) )
    if( inequality == "lt" ) {
        result[ data[, var] <= threshold ] <- -1
    } else {
        result[ data[, var] > threshold ] <- -1
    }
    return( result )
}
# @param Y the true classes (i.e., ground truth).
# @param prediction the prediction of our classifier.
# @param Di the distribution of probabilities (i.e., the example weights).
# @return the weighted error rate.
computeError <- function( Y, prediction, Di ) {
    e <- sum( Di[ Y != prediction ] )
    return( e )
}
# @param data the data set.
# @param Y the true classes (i.e., ground truth).
# @param Di the distribution of probabilities (i.e., the example weights).
# @param var the variable index to check.
# @param numSteps the number of steps to take in the splitting (numSteps > 0!).
# @return a list with the minimum error found, the best threshold and the inequality used.
generateStump <- function( data, Y, Di, var, numSteps = 10 ) {
    minError <- 0.5 #We demand an error lower than 0.5!
    totalInequalities <- c( "lt", "gt" )
    prediction <- array()
    bestThreshold <- -1
    finalInequality <- "lt"
    maxValue <- max( data[, var] )
    minValue <- min( data[, var] )
    stepSize <- (maxValue - minValue) / numSteps
    for( partition in -1:(numSteps + 1) ) { #Seek the best partition greedily (thresholds below the minimum and above the maximum are tested as well).
        for( ineq in 1:length( totalInequalities ) ) {
            threshold <- minValue + partition * stepSize
            prediction <- classify( data, var, threshold, totalInequalities[ ineq ] )
            error <- computeError( Y, prediction, Di )
            if( error < minError ) {
                minError <- error
                bestThreshold <- threshold
                finalInequality <- totalInequalities[ ineq ]
            }
        }
    }
    return( list( minError, bestThreshold, finalInequality ) )
}
# @param error the weighted error rate of the weak classifier.
# @return the weight alpha_t of the weak classifier.
computeAlpha <- function( error ) {
    #Guard against division by zero: a perfect weak classifier gets a large (but finite) alpha.
    alpha <- 0.5 * log( (1 - error) / max( error, 1e-16 ) )
    return( alpha )
}
# @param data the data set.
# @param Y the true classes (i.e., ground truth).
# @param var the variable index to check.
# @param alpha_t the alpha parameter obtained for error correction.
# @param Di the distribution of probabilities (i.e., the example weights).
# @param predictor the weak classifier used for prediction (list obtained by generateStump).
# @return the normalized weight distribution Di+1.
updateDistribution <- function( data, Y, var, alpha_t, Di, predictor ) {
    prediction <- classify( data, var, predictor[[2]], predictor[[3]] )
    Di <- Di * exp( -alpha_t * Y * prediction )
    #Compute the normalization constant Zt...
    Zt <- sum( Di )
    # ... and renormalize so that Di+1 is a probability distribution.
    Di <- Di / Zt
    return( Di )
}
# @param data the data set.
# @param Y the true classes (i.e., ground truth).
# @param var the variable index to check.
# @param T the number of rounds.
# @param numSteps the number of steps to take in the splitting (numSteps > 0!).
# @return the predictions of the strong classifier H(x) on the training data.
adaBoost <- function( data, Y, var, T = 3, numSteps = 10 ) {
    #Initialize the distribution uniformly.
    Di <- array( 1/length( Y ), length( Y ) )
    bestPredictors <- list() #We'll store the best weak classifier of each round (flattened: error, threshold, inequality).
    alphas <- vector() #We'll store the weights used to combine the weak classifiers.
    print( sprintf( "> Starting the training... " ) )
    for( t in 1:T ) {
        print( sprintf( "> Round %d ... ", t ) )
        predictor <- generateStump( data, Y, Di, var, numSteps )
        bestPredictors <- c( bestPredictors, predictor )
        alpha_t <- computeAlpha( predictor[[1]] )
        alphas <- c( alphas, alpha_t )
        Di <- updateDistribution( data, Y, var, alpha_t, Di, predictor )
    }
    #Finally perform the classification using the strong classifier H(x).
    print( sprintf( "> Evaluating the strong classifier ... " ) )
    prediction <- array( 0, nrow( data ) )
    for( weakCl in 1:T ) {
        weakPred <- classify( data, var, bestPredictors[[ 3 * (weakCl - 1) + 2 ]], bestPredictors[[ 3 * (weakCl - 1) + 3 ]] )
        prediction <- prediction + alphas[ weakCl ] * weakPred
    }
    prediction <- sign( prediction )
    print( sprintf( "> Training error achieved: %f", mean( Y != prediction ) ) )
    return( prediction )
}
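To give the script a quick try, here is a minimal usage sketch on a made-up, one-dimensional toy data set (the variable name, labels and parameter values are invented purely for illustration):

# A made-up toy data set: one numeric variable and labels in {-1, +1}
# that are (mostly) separable by a threshold.
set.seed( 42 )
data <- data.frame( x1 = c( rnorm( 20, mean = 0 ), rnorm( 20, mean = 3 ) ) )
Y <- c( rep( -1, 20 ), rep( +1, 20 ) )

# Run 5 boosting rounds on variable 1, testing 10 candidate thresholds per round.
prediction <- adaBoost( data, Y, var = 1, T = 5, numSteps = 10 )

Since this toy data set is almost separable on a single variable, a few rounds should already drive the reported training error close to zero.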