Wednesday, December 16, 2015

Presenting a Conceptual Machine Capable of Evolving Association Streams

The growing volume of data generated in industrial and scientific applications has fostered practitioners’ interest in mining large amounts of unlabelled data that arrive as continuous, high-speed, time-changing streams of information. An appealing field is association stream mining, which models dynamically complex domains via rules without assuming any a priori structure. Unlike the related field of frequent pattern mining, its goal is to extract interesting associations among the features of such data and to adapt them to the ever-changing dynamics of the environment in a purely online fashion, without the typical offline rule-generation step. These rules provide valuable insight that helps in decision making.

It is a pleasure to present Fuzzy-CSar, an online genetic fuzzy system (GFS) designed to extract interesting, quantitative rules from streams of samples. It evolves its internal model online, so it can quickly adapt its knowledge when concepts drift.

Tuesday, July 28, 2015

Kagglers: another LHC Challenge!

I am glad to share some very motivating news: after a little more than a year, Kaggle hosts a new CERN/LHC competition! This time it concerns the decay of a tau into three muons. Again, the challenge is to learn, from both real and simulated data, the pattern that identifies this rare phenomenon. The training data set consists of 49 distinct real-valued features plus the signal label (i.e., background noise versus the proper event).

This time I am using Python with Scikit-learn, Pandas and XGBoost (the latter is by far my favorite ensemble package; it is also available for R!). I have to confess that it is much easier when you do not have to program everything from scratch (as I usually do in C/C++ or Java), and that I have more time to think of new ways of solving the problem from a pure data scientist's view. However, I really adore using my own programs (I ended up 24th out of 1691 in the Forest Cover Type challenge using my own Extra Trees/Random Forest programmed from scratch in C++). As a clever person I once met told me, these Python libraries give us a lot of potential to explore and test new, crazy ideas. Cool!

Below I share the script I used to pass the preliminary agreement and correlation tests (using Scikit-learn and Pandas) and to score beyond the benchmark of 0.976709 weighted AUC. This rather simple Python script gave me 0.982596; at the moment of writing this post the top entry has a score of 0.989614, so yes, that is a difference of 0.007018 between the first and the 98th. This is my first serious try with Python and I have to say that it was quite fun. In the coming weeks I hope to have time for some nicer scripts that I will try to upload (here and on Kaggle).

 import pandas  
 import numpy as np  
 from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier  
 import evaluation  
 folder = '../'  # Configure this at your will!
 train = pandas.read_csv(folder + 'training.csv', index_col = 'id')  
 train_features = list(train.columns.values)  
 train_features.remove('SPDhits') # With this attribute the model does not pass the correlation test...
 train_features.remove('min_ANNmuon') # It must be removed as it does not appear in the test set...  
 train_features.remove('signal')  # It must be removed as it does not appear in the test set...
 train_features.remove('mass') # It must be removed as it does not appear in the test set...  
 train_features.remove('production') # It must be removed as it does not appear in the test set...  
 # Fit each model on the selected features; 'signal' is the target label.
 gbm = GradientBoostingClassifier(n_estimators = 500, learning_rate = 0.1, subsample = 0.7,
                    min_samples_leaf = 1, max_depth = 4, random_state = 10097).fit(train[train_features], train['signal']) # Very slow training...
 rf = RandomForestClassifier(n_estimators = 300, n_jobs = -1, criterion = "entropy", random_state = 10097).fit(train[train_features], train["signal"]) # Very slow training...
 # min_samples_split must be at least 2 in current scikit-learn releases.
 ert = ExtraTreesClassifier(n_estimators = 300, max_depth = None, n_jobs = -1, min_samples_split = 2, random_state = 10097).fit(train[train_features], train["signal"]) # Very slow training...
 # Predict ala shotgun approach.  
 test = pandas.read_csv(folder + 'test.csv', index_col='id')  
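 # Nested average: RF and ERT are averaged first, then averaged with GBM.
 # Assumption: the GBM term and the final division by 2 complete the truncated expression.
 test_probs = (
     (rf.predict_proba(test[train_features])[:, 1] +
      ert.predict_proba(test[train_features])[:, 1]) / 2 +
     gbm.predict_proba(test[train_features])[:, 1]) / 2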
 submission = pandas.DataFrame({'id': test.index, 'prediction': test_probs})  
 submission.to_csv("rf_gbm_ert_submission.csv", index = False)  
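
The "shotgun" step simply combines the class-1 probabilities of the three models. Here is a minimal sketch of that blending on made-up numbers. It assumes a nested average (Random Forest and Extra Trees averaged first, then averaged with the gradient boosting output). The function name `blend_probabilities` and the sample values are purely illustrative and are not part of the competition script.

```python
import numpy as np

def blend_probabilities(rf_probs, ert_probs, gbm_probs):
    """Average RF and ERT first, then average that result with GBM.

    Assumed weighting: effectively 0.25 (RF), 0.25 (ERT) and 0.5 (GBM).
    """
    rf_probs = np.asarray(rf_probs, dtype=float)
    ert_probs = np.asarray(ert_probs, dtype=float)
    gbm_probs = np.asarray(gbm_probs, dtype=float)
    bagged = (rf_probs + ert_probs) / 2.0
    return (bagged + gbm_probs) / 2.0

# Made-up class-1 probabilities for three hypothetical events.
rf_p = [0.90, 0.20, 0.60]
ert_p = [0.70, 0.40, 0.50]
gbm_p = [0.80, 0.10, 0.95]

blended = blend_probabilities(rf_p, ert_p, gbm_p)
print(blended)
```

Under this weighting the boosted model counts as much as the two bagged forests together; other weights are equally plausible and would need to be validated against held-out data.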

Sunday, March 1, 2015

A Vectorized Version of the Mighty Logistic Regressor

Neural nets, with their flexible data representation capable of approximating any arbitrary function, are enjoying a new renaissance with the development of deep learners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: automatic detection of important features from the input data. The application of Deep Learning is not restricted to raw classification or regression; this family of techniques is applied to much broader fields such as machine vision and speech recognition. Recent advances went much further and combined a Deep Learning architecture with a Reinforcement Learning algorithm, generating a computer program capable of beating classic Atari video games without being explicitly programmed for this task, scoring like a top human player. It is not surprising, therefore, that they appear on the latest cover of Nature (vol. 518, no. 7540, pp. 456-568). But how do they work? First, one must take a close look at the humble beginnings: the mighty logistic regressor. I programmed a vectorized version of this classic algorithm in R; it is the very basis for any neural net architecture. The listing follows.

 rm(list = ls())
 #' Computes the sigmoid function of the given input.
 #' @param z the input; a scalar, vector, or matrix (applied element-wise).
 #' @return the sigmoid of z.
 sigmoid <- function(z) {
   return(1 / (1 + exp(-z)))
 }
 # Start the process by generating 250 samples.
 set.seed(10097)
 x1 <- runif(250)
 x2 <- runif(250)
 y <- ifelse(x1 + x2 > 0.8, 1, 0)
 dataset <- data.frame("BIAS" = rep(1, length(y)), "X1" = x1, "X2" = x2)
 a1 <- as.matrix(dataset)
 features <- ncol(a1) - 1
 numExamples <- nrow(a1)
 epsilon <- 0.05 # Half-width of the uniform range used for random initialization.
 alpha <- 0.8 # The learning rate.
 lambda <- 0.001 # The regularization penalty.
 epochs <- 5000
 frac <- 1 / numExamples
 # Let's plot the data set.
 plot(x = a1[, 2], y = a1[, 3], col = (y + 1), pch = 19, xlab = "X1", ylab = "X2")
 W <- matrix(runif(features + 1, min = -epsilon, max = epsilon), nrow = 1, ncol = features + 1, byrow = TRUE)
 # Train the logistic regressor.
 for (epoch in 1:epochs) {
   # First compute the hypothesis...
   z2 <- W %*% t(a1)
   # Next, compute the gradient (the bias weight is not regularized).
   gradW <- frac * ((sigmoid(z2) - y) %*% a1 + lambda * cbind(0, t(W[, -1])))
   W <- W - alpha * gradW
 }
 # Regularized cross-entropy cost of the trained model.
 h <- sigmoid(W %*% t(a1))
 l_x <- log(h)
 l_inv <- log(1 - h)
 J <- -frac * (sum(y * l_x) + sum((1 - y) * l_inv)) + (lambda / (2 * numExamples)) * sum(W[, -1]^2)
 # Plot the separating hyperplane.
 abline(a = -W[, 1] / W[, 3], b = -W[, 2] / W[, 3], col = "green", lwd = 3)
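
For reference, the training loop and the cost line appear to implement the standard L2-regularized cross-entropy cost and its gradient. In the notation below (my own, not taken from the listing), m is numExamples, n is features, A is the matrix a1 including the bias column, and W_0 is the bias weight, which is left out of the penalty as in the code.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h = \sigma\!\left(W A^{\top}\right)

J(W) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \Big]
       + \frac{\lambda}{2m} \sum_{j=1}^{n} W_j^{2}

\nabla_W J = \frac{1}{m} \Big[ (h - y)\, A + \lambda \, (0, W_1, \dots, W_n) \Big],
\qquad W \leftarrow W - \alpha \, \nabla_W J

W_0 + W_1 x_1 + W_2 x_2 = 0
\;\Longrightarrow\;
x_2 = -\frac{W_0}{W_2} - \frac{W_1}{W_2}\, x_1
```

The last line is the decision boundary (h = 0.5), which gives the intercept and slope passed to abline.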

Figure 1 depicts the problem space and the solution given by the logistic regressor, shown as a green line. In this problem we have two input variables (x1 and x2) that take random values in the range [0, 1]. The class is labeled as 1 if x1 + x2 > 0.8 and 0 otherwise.

Figure 1. The data set, consisting of two random variables, and the solution given by the algorithm.
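
For readers who prefer Python, the same vectorized procedure can be sketched with NumPy. This is a rough port under stated assumptions, not a line-by-line translation: the helper names (`add_bias`, `train_logistic`) are mine, the hyperparameters are copied from the R listing, and NumPy's random generator produces different samples than R's `runif`, so the learned weights will not match the R run exactly.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    """Prepend a column of ones so the first weight acts as the bias."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def train_logistic(X, y, alpha=0.8, lam=0.001, epochs=5000,
                   epsilon=0.05, seed=10097):
    """Batch gradient descent for L2-regularized logistic regression."""
    rng = np.random.default_rng(seed)
    A = add_bias(X)
    m = A.shape[0]
    # Small uniform random initialization in [-epsilon, epsilon].
    w = rng.uniform(-epsilon, epsilon, size=A.shape[1])
    for _ in range(epochs):
        h = sigmoid(A @ w)                        # hypothesis for all samples
        penalty = np.concatenate(([0.0], w[1:]))  # bias is not regularized
        grad = (A.T @ (h - y) + lam * penalty) / m
        w -= alpha * grad
    return w

# Same toy problem: class 1 when x1 + x2 > 0.8, class 0 otherwise.
rng = np.random.default_rng(10097)
X = rng.uniform(size=(250, 2))
y = (X.sum(axis=1) > 0.8).astype(float)

w = train_logistic(X, y)
predictions = sigmoid(add_bias(X) @ w) > 0.5
accuracy = np.mean(predictions == (y == 1.0))
print("weights:", w)
print("training accuracy:", accuracy)
```

The separating line can then be read off the weights as x2 = -w[0]/w[2] - (w[1]/w[2]) * x1, which mirrors the arguments given to abline in the R version.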