A Vectorized Version of the Mighty Logistic Regressor

Neural nets, with their flexible data representation capable of approximate any arbitrary function, are in a new renaissance with the development of Deep Leaners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: self detection of important features from input data. The application of Deep Learning is not restricted to raw classification or regression; these family of techniques are applied to much broader fields such as machine vision and speech recognition. Recent advances went much further and combined a Deep Learning architecture with a Reinforcement Learning algorithm generating a computer program capable of beating classic Atari video games without being explicitly programmed for this task, scoring as a top human player. It is not surprising, therefore, that they appear in the last Nature’s cover (vol. 518, Num. 7540, pp. 456-568). But how do they work? First, one must take a close look to the humble beginnings: the mighty logistic regressor. I programmed a vectorized version in R of this classic algorithm which is the very basis for any neural net architecture. This is depicted in the following. 


1:  rm(list = ls())  
2:  #' It computes the sigmoid function of the given input.  
3:  #' @param z is the input scalar.  
4:  #' @return the sigmoid.  
5:  sigmoid <- function(z) {  
6:    return(1/(1 + exp(-z)))  
7:  }  
8:  # Start the process by generating 250 samples.
9:  set.seed(10097)  
10:  x1 <- runif(250)  
11:  x2 <- runif(250)  
12:  y <- ifelse(x1 + x2 > 0.8, 1, 0)  
13:  dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)  
14:  a1 <- as.matrix(dataset)  
15:  features <- ncol(a1) - 1  
16:  numExamples <- nrow(a1)  
17:  epsilon <- 0.05 # The amount of standard deviation for random initialization.  
18:  alpha <- 0.8 # The learning rate.  
19:  lambda <- 0.001 # The regularization penalty.  
20:  epochs <- 5000  
21:  frac <- 1 / numExamples  
22:  # Let's plot the data set  
23:  plot(x = a1[,2], y = a1[,3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")  
24:  W <- matrix(runif((features + 1), min = -epsilon, max = epsilon), nrow = 1, ncol = features + 1, byrow = T)  
25:  # Train the logistic regressor.  
26:  for(epoch in 1:epochs) {  
27:    # First compute the hypothesis... 
28:    z2 <- W %*% t(a1)  
29:    # Next, compute the gradient  
30:    gradW <- frac * ( (sigmoid(z2) - y) %*% a1 + lambda * cbind(0,t(W[,-1])) )  
31:    W <- W - alpha * gradW  
32:  }  
33:  h <- sigmoid(W %*% t(a1))  
34:  l_x <- log(h)  
35:  l_inv <- log(1 - h)  
36:  J <- -frac * sum(sum(y * l_x) + sum( (1 - y) * l_inv)) + (lambda / (2 * numExamples)) * W[,-1]^2  
37:  # Plot the separating hyperplane.  
38:  abline(a = -W[,1]/W[,3], b = -W[,2]/W[,3], col = "green", lwd = "3")  

Figure 1 depicts the problem space and the solution given by logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. The class is labeled as 1 if x1 + x2 > 0.8 and 0 otherwise.

Figure 1 The data set  consisting of two random variables and the solution given by the algorithm. 

Comments