Sunday, March 1, 2015

A Vectorized Version of the Mighty Logistic Regressor

Neural nets, with their flexible data representation capable of approximating arbitrary functions, are enjoying a new renaissance with the development of Deep Learners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: the automatic detection of important features from the input data. The application of Deep Learning is not restricted to raw classification or regression; this family of techniques is applied to much broader fields such as machine vision and speech recognition. Recent advances went even further and combined a Deep Learning architecture with a Reinforcement Learning algorithm, producing a computer program capable of playing classic Atari video games at the level of a top human player without being explicitly programmed for the task. It is not surprising, therefore, that they appear on the latest cover of Nature (vol. 518, no. 7540, pp. 456-568). But how do they work? First, one must take a close look at the humble beginnings: the mighty logistic regressor. I programmed a vectorized version in R of this classic algorithm, which is the very basis of any neural net architecture. The code is shown below.
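For reference, the code below implements standard L2-regularized logistic regression trained with batch gradient descent. With $m$ examples $x^{(i)}$ (bias included as the first component), labels $y^{(i)} \in \{0, 1\}$ and a weight row vector $W$, the hypothesis, cost and gradient are

$$h_W(x) = \sigma(Wx) = \frac{1}{1 + e^{-Wx}},$$

$$J(W) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_W(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_W(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j \geq 1} W_j^2,$$

$$\frac{\partial J}{\partial W_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_W(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} W_j \quad (j \geq 1),$$

where the bias weight $W_0$ is not regularized and each epoch performs the update $W \leftarrow W - \alpha \, \nabla_W J$.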


rm(list = ls())

#' It computes the sigmoid function of the given input.
#' @param z is the input (a scalar or a matrix; the function is applied element-wise).
#' @return the sigmoid of z.
sigmoid <- function(z) {
  return(1 / (1 + exp(-z)))
}

# Start the process by generating 250 samples.
set.seed(10097)
x1 <- runif(250)
x2 <- runif(250)
y <- ifelse(x1 + x2 > 0.8, 1, 0)
dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)
a1 <- as.matrix(dataset)
features <- ncol(a1) - 1
numExamples <- nrow(a1)
epsilon <- 0.05   # Half-width of the uniform range used for random initialization.
alpha <- 0.8      # The learning rate.
lambda <- 0.001   # The regularization penalty.
epochs <- 5000
frac <- 1 / numExamples

# Let's plot the data set.
plot(x = a1[, 2], y = a1[, 3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")

# Randomly initialize the weights (bias included) in [-epsilon, epsilon].
W <- matrix(runif(features + 1, min = -epsilon, max = epsilon), nrow = 1, ncol = features + 1, byrow = TRUE)

# Train the logistic regressor with batch gradient descent.
for (epoch in 1:epochs) {
  # First compute the hypothesis...
  z2 <- W %*% t(a1)
  # ...next, compute the regularized gradient (the bias term is not penalized)...
  gradW <- frac * ((sigmoid(z2) - y) %*% a1 + lambda * cbind(0, t(W[, -1])))
  # ...and finally take a gradient descent step.
  W <- W - alpha * gradW
}

# Compute the regularized cross-entropy cost of the final model.
h <- sigmoid(W %*% t(a1))
l_x <- log(h)
l_inv <- log(1 - h)
J <- -frac * (sum(y * l_x) + sum((1 - y) * l_inv)) + (lambda / (2 * numExamples)) * sum(W[, -1]^2)

# Plot the separating hyperplane.
abline(a = -W[, 1] / W[, 3], b = -W[, 2] / W[, 3], col = "green", lwd = 3)
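The final abline call draws the decision boundary: since $\sigma(z) = 0.5$ exactly when $z = 0$, the model predicts class 1 whenever $W_0 + W_1 x_1 + W_2 x_2 > 0$, so the boundary in the $(x_1, x_2)$ plane is the line

$$x_2 = -\frac{W_0}{W_2} - \frac{W_1}{W_2} x_1,$$

which is why the intercept passed to abline is $-W_0/W_2$ and the slope is $-W_1/W_2$.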

Figure 1 depicts the problem space and the solution given by the logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. The class is labeled as 1 if x1 + x2 > 0.8 and 0 otherwise.

Figure 1. The data set, consisting of two random variables, and the solution given by the algorithm.

Sunday, November 16, 2014

Data Analytics Plus Big Data: A Win-win Strategy

We live in a dynamic world where new technologies arise, dominate for a brief blink of an eye and then rapidly fall into oblivion. Apart from academia, it is interesting to check what the current trends in industry are. Or, even better, to see the whole picture without discriminating. In this regard, Google Trends is a nice analytics tool that allows us to track our terms of interest. The following chart shows the search interest for Machine Learning and Data Analytics.


On a similar scale, I depicted the search interest for Data Analytics and Big Data. As is clearly noticeable, knowing both disciplines can open many doors in industry.

Finally, among both the top and the rising queries, Hadoop is the most popular term.

Thursday, September 11, 2014

Gradient Boosting: Fostering Accuracy Even Further

As in many real-world situations, union makes algorithms stronger. With this philosophy in mind, ensemble methods combine several weak classifiers into a massive one (in terms of accuracy). In the last post we went through a primer on Random Forests. The next cornerstone, therefore, is Gradient Boosting. I have mentioned Gradient Boosting many times in this blog, but I have only commented on the fundamental ideas, without discussing the details. In this entry I will share my two cents. But first, a little bit of history: recall the Kaggle-Higgs competition. The top scores on the leaderboard have been obtained by using distinct forms of gradient boosting, and XGBoost is directly responsible for many of them. The question, hence, is: how does this algorithm work?

Figure 1. A high-level description of the Gradient Boosting method I programmed.

Informally, Gradient Boosting generates a sequence of classifiers in the form of an additive expansion; that is, at each iteration a new weak classifier is trained to improve the accuracy of the previous ones, in the same fashion as AdaBoost. Unlike AdaBoost, however, the gradient of the loss function (i.e., the function used to check how well our algorithm performs) is used to guide the search toward a very accurate solution (i.e., continual improvement by means of the gradient). Another characteristic is that Gradient Boosting works best with regression trees as base models. When combining several of those, the loss function commonly used is the mean squared error, which is differentiable (and also straightforward to handle). Figure 1 shows the high-level description of the algorithm.
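To make the procedure concrete, here is a minimal sketch in R of Gradient Boosting with the squared-error loss. It is not the implementation behind Figure 1 (which I wrote in C++): it assumes the rpart package provides the base regression trees, the tree size is limited through rpart's maxdepth rather than a leaf count, and the names gradientBoost and predictBoost are just illustrative.

library(rpart)

#' A sketch of Gradient Boosting with squared-error loss and regression trees.
#' @param x a data frame of input features.
#' @param y the numeric target vector.
#' @return a list with the constant base prediction, the trees and the learning rate.
gradientBoost <- function(x, y, nTrees = 300, shrinkage = 0.1, maxDepth = 3) {
  f0 <- mean(y)  # The constant model that minimizes the squared error.
  prediction <- rep(f0, length(y))
  trees <- vector("list", nTrees)
  for (m in 1:nTrees) {
    # For the squared-error loss, the negative gradient is simply the residual.
    residual <- y - prediction
    trees[[m]] <- rpart(residual ~ ., data = data.frame(x, residual = residual),
                        control = rpart.control(maxdepth = maxDepth))
    # Add the new tree to the additive expansion, scaled by the learning rate.
    prediction <- prediction + shrinkage * predict(trees[[m]], newdata = x)
  }
  list(f0 = f0, trees = trees, shrinkage = shrinkage)
}

#' It predicts new examples by summing the contributions of all the trees.
predictBoost <- function(model, x) {
  pred <- rep(model$f0, nrow(x))
  for (tree in model$trees) {
    pred <- pred + model$shrinkage * predict(tree, newdata = x)
  }
  pred
}

A model would then be built with something like model <- gradientBoost(trainX, trainY), where trainX is a data frame of features and trainY a numeric target; both names are, of course, placeholders.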

So far, so good: we have an algorithm that is straightforward to implement and offers a high degree of accuracy. However, the trick lies in the trees. Gradient Boosting exploits the benefits of well-induced trees, and this is probably the most difficult part of the technique. There are many possibilities; in my particular case I implemented the simplest form of a binary regression tree as a base. This tree performs splits by computing the gain in variance (informally, how much variety there is in the output variable), without considering any form of backfitting (i.e., tree pruning). I also simply ignored unknown values, computing the variance with the known values available, as sketched below. Despite this rather simplistic approach, I obtained a score of 3.31 on the Approximate Median Significance (AMS) metric (almost the same as the original Higgs challenge reference study, which reports 3.34!). The configuration was pretty much off-the-shelf: 300 boosted trees, a learning rate of 0.1 and a limit of 10 leaves per tree. For comparison, XGBoost also uses unpruned binary trees, and its amazing accuracy comes from the tricks used for growing them (mainly the handling of unknown values).
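As an illustration of the splitting criterion, the following sketch shows one way to compute the variance gain of a candidate split while simply ignoring unknown (NA) values. It is an assumption of how such a check may look in R, not my actual C++ implementation.

#' It computes the variance gain of splitting a numeric feature at a threshold.
#' Unknown values are ignored, so the variances use only the known values.
varianceGain <- function(feature, target, threshold) {
  known <- !is.na(feature)
  f <- feature[known]
  t <- target[known]
  left <- t[f <= threshold]
  right <- t[f > threshold]
  if (length(left) == 0 || length(right) == 0) return(0)
  # The variance of fewer than two values is taken as zero.
  nodeVar <- function(v) if (length(v) < 2) 0 else var(v)
  # Gain = variance before the split minus the weighted variance after it.
  weighted <- (length(left) * nodeVar(left) + length(right) * nodeVar(right)) / length(t)
  nodeVar(t) - weighted
}

The split actually chosen at a node would be the feature/threshold pair maximizing this gain.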


As concluding remarks, I have to say that I enjoyed the experience. OK, I did not win the challenge (the leader has a score of 3.85 on the aforementioned AMS metric), but I learned a lot discovering, and of course implementing, these marvelous machine learning techniques. Following the rules of the challenge, I will upload and share the source code (it is programmed in user-friendly C++) when it finishes (September 15th).

