Tuesday, July 28, 2015

Kagglers: another LHC Challenge!

I am glad to share some very motivating news: after a little more than a year, Kaggle is hosting a new CERN/LHC competition! This time it is about the decay of the tau into three muons. Again, the challenge consists of finding the pattern, in both real and simulated data, that identifies this rare phenomenon. The training data set consists of 49 distinct real-valued features plus the signal label (i.e., background noise or the proper event).

This time I am using Python with Scikit-learn, Pandas and XGBoost (the latter is by far my favorite ensemble package; it is also available for R!). I have to confess that everything is much easier when you do not have to program it from scratch (as I usually do in C/C++ or Java), and that I have more time to think about new ways of solving the problem from a pure data scientist's view. However, I really adore making use of my own programs (I ended up 24th out of 1691 in the Forest Cover Type challenge using my own Extra Trees/Random Forest programmed from scratch in C++). As a clever person I once met told me, these Python libraries give us a lot of potential to explore and test new, crazy ideas. Cool!

Below I share the script I used to pass the preliminary agreement and correlation tests (using Scikit-learn and Pandas) and to score above the benchmark of 0.976709 weighted AUC (this rather simple Python script gave me 0.982596, and at the moment of writing this post the no. 1 has a score of 0.989614; yeah, that is a difference of 0.007018 between the first and the 98th). This is my first serious try with Python and I have to say that it was quite fun. In the coming weeks I hope to have time for some nicer scripts that I will try to upload (here and on Kaggle).

 import pandas  
 import numpy as np  
 from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier  
 import evaluation  
 folder = '../'  # Configure this at your will!
 train = pandas.read_csv(folder + 'training.csv', index_col = 'id')  
 train_features = list(train.columns.values)  
 train_features.remove('SPDhits') # Keeping this attribute does not let us pass the correlation test...
 train_features.remove('min_ANNmuon') # It must be removed as it does not appear in the test set...
 train_features.remove('signal')  # It must be removed as it is the target label, not an input feature...
 train_features.remove('mass') # It must be removed as it does not appear in the test set...  
 train_features.remove('production') # It must be removed as it does not appear in the test set...  
 gbm = GradientBoostingClassifier(n_estimators = 500, learning_rate = 0.1, subsample = 0.7,  
                    min_samples_leaf = 1, max_depth = 4, random_state = 10097)  
 gbm.fit(train[train_features], train['signal']) # Very slow training...  
 rf = RandomForestClassifier(n_estimators = 300, n_jobs = -1, criterion = "entropy", random_state = 10097)  
 rf.fit(train[train_features], train["signal"]) # Very slow training...  
 ert = ExtraTreesClassifier(n_estimators = 300, max_depth = None, n_jobs = -1, min_samples_split = 2, random_state = 10097)  # Recent scikit-learn versions require min_samples_split >= 2.
 ert.fit(train[train_features], train["signal"]) # Very slow training...  
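 # Predict with a shotgun approach: blend the three models.
 test = pandas.read_csv(folder + 'test.csv', index_col='id')
 # NOTE: the end of this expression was cut off in the original listing; the GBM term
 # and the final division by 2 are an assumed reconstruction.
 test_probs = (
            (rf.predict_proba(test[train_features])[:,1] +
            ert.predict_proba(test[train_features])[:,1]) / 2 +
            gbm.predict_proba(test[train_features])[:,1]) / 2
 submission = pandas.DataFrame({'id': test.index, 'prediction': test_probs})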
 submission.to_csv("rf_gbm_ert_submission.csv", index = False)  
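
The blending step at the end of the script can also be written as an explicit weighted average, which makes it easier to try other weightings. The following is only a minimal sketch: the helper name blend_probabilities, the toy probabilities and the weights are assumptions made for illustration, not part of the script above.

```python
import numpy as np

def blend_probabilities(prob_columns, weights):
    """Weighted average of per-model signal probabilities (hypothetical helper)."""
    probs = np.asarray(prob_columns, dtype=float)  # shape: (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    if probs.shape[0] != w.shape[0]:
        raise ValueError("one weight per model is required")
    w = w / w.sum()  # normalize so the blend is still a probability
    return w @ probs

# Toy probabilities standing in for predict_proba(...)[:, 1] of three models.
rf_p = np.array([0.9, 0.2, 0.6])
ert_p = np.array([0.7, 0.4, 0.6])
gbm_p = np.array([0.8, 0.1, 0.2])

# Assumed weights: the two forests share one half, the boosted model gets the other.
blended = blend_probabilities([rf_p, ert_p, gbm_p], weights=[0.25, 0.25, 0.5])
print(blended)
```

Normalizing the weights keeps the result inside [0, 1] whatever weights are tried, so the output can go straight into the submission file.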

Sunday, March 1, 2015

A Vectorized Version of the Mighty Logistic Regressor

Neural nets, with their flexible data representation capable of approximating any arbitrary function, are enjoying a new renaissance with the development of Deep Learners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: the automatic detection of important features in the input data. The application of Deep Learning is not restricted to raw classification or regression; this family of techniques is applied to much broader fields such as machine vision and speech recognition. Recent advances went much further and combined a Deep Learning architecture with a Reinforcement Learning algorithm, producing a computer program capable of beating classic Atari video games without being explicitly programmed for the task, scoring as well as a top human player. It is not surprising, therefore, that they appear on a recent cover of Nature (vol. 518, Num. 7540, pp. 456-568). But how do they work? First, one must take a close look at the humble beginnings: the mighty logistic regressor. I programmed a vectorized version in R of this classic algorithm, which is the very basis of any neural net architecture. It is listed below.

 rm(list = ls())
 #' It computes the sigmoid function of the given input.
 #' @param z is the input (a scalar, vector or matrix).
 #' @return the element-wise sigmoid.
 sigmoid <- function(z) {
   return(1 / (1 + exp(-z)))
 }
 # Start the process by generating 250 samples.
 set.seed(10097)
 x1 <- runif(250)
 x2 <- runif(250)
 y <- ifelse(x1 + x2 > 0.8, 1, 0)
 dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)
 a1 <- as.matrix(dataset)
 features <- ncol(a1) - 1
 numExamples <- nrow(a1)
 epsilon <- 0.05 # Half-width of the uniform range used for random initialization.
 alpha <- 0.8 # The learning rate.
 lambda <- 0.001 # The regularization penalty.
 epochs <- 5000
 frac <- 1 / numExamples
 # Let's plot the data set
 plot(x = a1[,2], y = a1[,3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")
 W <- matrix(runif((features + 1), min = -epsilon, max = epsilon), nrow = 1, ncol = features + 1, byrow = TRUE)
 # Train the logistic regressor.
 for(epoch in 1:epochs) {
   # First compute the hypothesis...
   z2 <- W %*% t(a1)
   # Next, compute the gradient (the bias weight is not regularized).
   gradW <- frac * ( (sigmoid(z2) - y) %*% a1 + lambda * cbind(0, t(W[,-1])) )
   W <- W - alpha * gradW
 }
 # Compute the regularized cost of the trained model.
 h <- sigmoid(W %*% t(a1))
 l_x <- log(h)
 l_inv <- log(1 - h)
 J <- -frac * (sum(y * l_x) + sum((1 - y) * l_inv)) + (lambda / (2 * numExamples)) * sum(W[,-1]^2)
 # Plot the separating hyperplane.
 abline(a = -W[,1]/W[,3], b = -W[,2]/W[,3], col = "green", lwd = 3)

Figure 1 depicts the problem space and the solution given by the logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. The class is labeled as 1 if x1 + x2 > 0.8 and 0 otherwise.

Figure 1. The data set consisting of two random variables and the solution given by the algorithm.
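
For readers who prefer Python, the same idea can be sketched with NumPy. This is only an illustrative port under stated assumptions: the function name train_logistic and its default arguments are mine, and NumPy's random generator produces different samples than R's set.seed, so the learned weights will not match the R run exactly.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.8, lam=0.001, epochs=5000, epsilon=0.05, seed=10097):
    """Batch gradient descent for L2-regularized logistic regression (sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.hstack([np.ones((n, 1)), X])  # prepend the bias column
    w = rng.uniform(-epsilon, epsilon, size=A.shape[1])
    for _ in range(epochs):
        h = sigmoid(A @ w)                          # hypothesis for all examples at once
        reg = lam * np.concatenate(([0.0], w[1:]))  # the bias weight is not regularized
        grad = ((h - y) @ A + reg) / n
        w -= alpha * grad
    return w

# Same toy problem as above: class 1 when x1 + x2 > 0.8.
rng = np.random.default_rng(10097)
X = rng.uniform(size=(250, 2))
y = (X.sum(axis=1) > 0.8).astype(float)

w = train_logistic(X, y)
A = np.hstack([np.ones((X.shape[0], 1)), X])
pred = (sigmoid(A @ w) > 0.5).astype(float)
accuracy = (pred == y).mean()
print(w, accuracy)
```

As in the R version, there is no loop over the examples: one matrix-vector product per epoch computes the hypothesis for the whole data set, and a second one accumulates the gradient.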

Sunday, November 16, 2014

Data Analytics Plus Big Data: A Win-win Strategy

We live in a dynamic world where new technologies arise, dominate for the blink of an eye and then rapidly fall into oblivion. Apart from academia, it is interesting to check what the current trends in industry are. Or even better: to see the whole picture without discriminating. In this regard, a nice analytics tool is Google Trends, which allows us to track our terms of interest. The following chart shows the search interest for Machine Learning and Data Analytics.

On a similar scale I plotted the search interest for Data Analytics and Big Data. As is clearly noticeable, knowing both disciplines can open many doors in industry.

Finally, among both the top and the rising queries, Hadoop is the most popular term.