Neural nets, with their flexible data representation capable of approximate any arbitrary function, are in a new renaissance with the development of Deep Leaners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: self detection of important features from input data. The application of Deep Learning is not restricted to raw classification or regression; these family of techniques are applied to much broader fields such as machine vision and speech recognition. Recent advances went much further and combined a Deep Learning architecture with a Reinforcement Learning algorithm generating a computer program capable of beating classic Atari video games without being explicitly programmed for this task, scoring as a top human player. It is not surprising, therefore, that they appear in the last Nature’s cover (vol. 518, Num. 7540, pp. 456-568). But how do they work? First, one must take a close look to the humble beginnings: the mighty logistic regressor. I programmed a vectorized version in R of this classic algorithm which is the very basis for any neural net architecture. This is depicted in the following.

```
1: rm(list = ls())
2: #' It computes the sigmoid function of the given input.
3: #' @param z is the input scalar.
4: #' @return the sigmoid.
5: sigmoid <- function(z) {
6: return(1/(1 + exp(-z)))
7: }
8: # Start the process by generating 250 samples.
9: set.seed(10097)
10: x1 <- runif(250)
11: x2 <- runif(250)
12: y <- ifelse(x1 + x2 > 0.8, 1, 0)
13: dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)
14: a1 <- as.matrix(dataset)
15: features <- ncol(a1) - 1
16: numExamples <- nrow(a1)
17: epsilon <- 0.05 # The amount of standard deviation for random initialization.
18: alpha <- 0.8 # The learning rate.
19: lambda <- 0.001 # The regularization penalty.
20: epochs <- 5000
21: frac <- 1 / numExamples
22: # Let's plot the data set
23: plot(x = a1[,2], y = a1[,3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")
24: W <- matrix(runif((features + 1), min = -epsilon, max = epsilon), nrow = 1, ncol = features + 1, byrow = T)
25: # Train the logistic regressor.
26: for(epoch in 1:epochs) {
27: # First compute the hypothesis...
28: z2 <- W %*% t(a1)
29: # Next, compute the gradient
30: gradW <- frac * ( (sigmoid(z2) - y) %*% a1 + lambda * cbind(0,t(W[,-1])) )
31: W <- W - alpha * gradW
32: }
33: h <- sigmoid(W %*% t(a1))
34: l_x <- log(h)
35: l_inv <- log(1 - h)
36: J <- -frac * sum(sum(y * l_x) + sum( (1 - y) * l_inv)) + (lambda / (2 * numExamples)) * W[,-1]^2
37: # Plot the separating hyperplane.
38: abline(a = -W[,1]/W[,3], b = -W[,2]/W[,3], col = "green", lwd = "3")
```

Figure 1 depicts the problem space and the solution given by logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. The class is labeled as 1 if x1 + x2 > 0.8 and 0 otherwise.

Figure 1 The data set consisting of two random variables and the solution given by the algorithm. |