### A Vectorized Version of the Mighty Logistic Regressor

Neural nets, with their flexible representations capable of approximating arbitrary functions, are enjoying a renaissance with the development of Deep Learning. Deep architectures stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something remarkable: automatic detection of the important features in the input data. The applications of Deep Learning are not restricted to plain classification or regression; this family of techniques is applied to much broader fields such as machine vision and speech recognition. Recent advances went further still and combined a Deep Learning architecture with a Reinforcement Learning algorithm, producing a program capable of playing classic Atari video games at the level of a top human player without being explicitly programmed for the task. It is not surprising, therefore, that the work made the cover of a recent issue of Nature (vol. 518, no. 7540). But how do these systems work? A good place to start is a close look at their humble beginnings: the mighty logistic regressor. Below is a vectorized R implementation of this classic algorithm, which is the very basis of any neural net architecture.

```r
rm(list = ls())

#' Computes the sigmoid function of the given input.
#' @param z the input (scalar, vector, or matrix).
#' @return the element-wise sigmoid.
sigmoid <- function(z) {
  return(1 / (1 + exp(-z)))
}

# Start the process by generating 250 samples.
set.seed(10097)
x1 <- runif(250)
x2 <- runif(250)
y <- ifelse(x1 + x2 > 0.8, 1, 0)
dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)
a1 <- as.matrix(dataset)
features <- ncol(a1) - 1
numExamples <- nrow(a1)
epsilon <- 0.05   # Spread of the random weight initialization.
alpha <- 0.8      # The learning rate.
lambda <- 0.001   # The regularization penalty.
epochs <- 5000
frac <- 1 / numExamples

# Let's plot the data set.
plot(x = a1[, 2], y = a1[, 3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")

# Randomly initialize the weights, bias included.
W <- matrix(runif(features + 1, min = -epsilon, max = epsilon), nrow = 1)

# Train the logistic regressor with batch gradient descent.
for (epoch in 1:epochs) {
  # First compute the hypothesis...
  z2 <- W %*% t(a1)
  # ...then the gradient (the bias weight is not regularized)...
  gradW <- frac * ((sigmoid(z2) - y) %*% a1 + lambda * cbind(0, t(W[, -1])))
  # ...and take a gradient descent step.
  W <- W - alpha * gradW
}

# Regularized cross-entropy cost of the trained model.
h <- sigmoid(W %*% t(a1))
J <- -frac * (sum(y * log(h)) + sum((1 - y) * log(1 - h))) +
  (lambda / (2 * numExamples)) * sum(W[, -1]^2)

# Plot the separating hyperplane.
abline(a = -W[, 1] / W[, 3], b = -W[, 2] / W[, 3], col = "green", lwd = 3)
```
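For reference (this is my own notation, not taken from the listing), the loop above performs batch gradient descent on the regularized cross-entropy cost, where $m$ is the number of examples, $\sigma$ is the sigmoid, and the bias weight $W_1$ is excluded from the penalty:

$$J(W) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h^{(i)} + \big(1 - y^{(i)}\big)\log\big(1 - h^{(i)}\big)\Big] + \frac{\lambda}{2m}\sum_{j=2}^{3} W_j^2, \qquad h^{(i)} = \sigma\big(W a_1^{(i)}\big)$$

with gradient, exactly as computed for `gradW`:

$$\nabla_W J = \frac{1}{m}\Big[\big(\sigma(W a_1^{\top}) - y\big)\,a_1 + \lambda\,(0,\ W_2,\ W_3)\Big].$$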

Figure 1 depicts the problem space and the solution found by the logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. A point is labeled as class 1 if x1 + x2 > 0.8 and as class 0 otherwise.

Figure 1. The data set, consisting of two random variables, and the solution given by the algorithm.
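Once trained, the weight matrix can classify new points by thresholding the sigmoid at 0.5. The sketch below shows how this would look; `predict_class` is a hypothetical helper name, not part of the listing above, and the toy weights simply encode the labeling rule itself, since x1 + x2 > 0.8 is equivalent to -0.8 + x1 + x2 > 0:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical helper: classify the rows of X (no bias column) with weights W.
predict_class <- function(W, X) {
  A <- cbind(1, X)                         # prepend the bias term, as in a1
  as.numeric(sigmoid(W %*% t(A)) >= 0.5)   # threshold the sigmoid at 0.5
}

# Toy weights encoding the labeling rule directly: -0.8 + x1 + x2 > 0.
W_toy <- matrix(c(-0.8, 1, 1), nrow = 1)
X_new <- matrix(c(0.9, 0.5,
                  0.1, 0.1), ncol = 2, byrow = TRUE)
predict_class(W_toy, X_new)   # 1 for the first point, 0 for the second
```

The trained `W` from the listing above plays the same role as `W_toy` here, up to an arbitrary positive scaling of the weights.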