Friday, March 11, 2016

Beyond state-of-the-art accuracy by fostering ensemble generalization

Sometimes practitioners are forced to go beyond the standard methods in order to squeeze more accuracy out of their models. If the goal is to boost accuracy, ensembling is a good starting point; however, the trick lies in getting enough generalization from the feature space. In this regard, ensemble generalization--not to be confused with classic or "standard" ensemble methods such as Random Forest or Gradient Boosting--is the right path to follow, however complex. The idea is to combine the predictions of several "base learners" to train a second-stage learner, using these predictions as metafeatures. The key is to use a J-fold cross-validation scheme and to always reuse the same data partitions and seed. This kind of ensemble is often called stacking, as we "stack" layers of classifiers.

Let’s do an example: suppose that we have three base learners--a GBM, an ET, and an RF--and an LM as the level-2 learner. First we divide the training data into J folds, for example J = 4 (recall that these 4 folds are stratified and disjoint). Then we train each base model following the traditional cross-validation scheme, that is, train with 3 folds and predict on the remaining one (this works best if the predictions are in the form of probabilities). These out-of-fold predictions are stored and will be used for training the level-2 model. Figure 1 depicts this process, and a code sketch follows the figure.

Figure 1. Ensemble generalization (also known as Stacking) training scheme. The idea is to "stack" multiple layers for generalizing further (in this example we use two layers), and use a J-fold cross-validation scheme for avoiding bias (in this example J = 4).
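To make this concrete, here is a minimal sketch of this out-of-fold training phase using scikit-learn. The model choices, hyperparameters, seed, and helper names (such as build_metafeatures) are illustrative assumptions, not a fixed recipe; I use a logistic regression as the linear level-2 model since the metafeatures are probabilities.

 import numpy as np
 from sklearn.model_selection import StratifiedKFold
 from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
 from sklearn.linear_model import LogisticRegression

 SEED = 2016  # the same seed (and hence the same folds) must be reused everywhere
 base_learners = [GradientBoostingClassifier(random_state=SEED),
                  ExtraTreesClassifier(random_state=SEED),
                  RandomForestClassifier(random_state=SEED)]

 def build_metafeatures(X, y, n_folds=4):
     # Train each base learner with the J-fold scheme and store its
     # out-of-fold probabilities: these are the level-2 metafeatures.
     skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=SEED)
     meta = np.zeros((len(y), len(base_learners)))
     for j, clf in enumerate(base_learners):
         for train_idx, valid_idx in skf.split(X, y):
             clf.fit(X[train_idx], y[train_idx])
             meta[valid_idx, j] = clf.predict_proba(X[valid_idx])[:, 1]
     return meta

 # meta = build_metafeatures(X_train, y_train)       # X_train, y_train as numpy arrays
 # level2 = LogisticRegression().fit(meta, y_train)  # the "LM" of the example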
After training the level-2 algorithm, we can proceed with the final predictions. To do so, we train the base learners again, but this time using the whole training set; we do this to gain up to a 20% improvement in accuracy. It is important to highlight that we have to make sure the random seeds are the same as in the J-fold training! Afterwards, for each test example we predict with the base learners and collect the predictions. These are the input of the level-2 algorithm, which performs the final prediction.
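Continuing the sketch above (same assumed names and seeds), the prediction phase might look like the following: the base learners are refit on the full training set and their test-set probabilities are fed to the level-2 model.

 def predict_with_stack(X_train, y_train, X_test, level2):
     # Refit every base learner on the whole training set (same random seeds!)
     # and collect its test-set probabilities as the level-2 input.
     test_meta = np.zeros((len(X_test), len(base_learners)))
     for j, clf in enumerate(base_learners):
         clf.fit(X_train, y_train)
         test_meta[:, j] = clf.predict_proba(X_test)[:, 1]
     # The level-2 model delivers the final stacked prediction.
     return level2.predict_proba(test_meta)[:, 1]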

I have used this technique on Kaggle a few times and I have to say that it makes a difference. However, I found it difficult to get working, and it requires a lot of processing power. There is a nice post by Triskelion explaining ensembles that gave me the inspiration to write this one.

Wednesday, December 16, 2015

Presenting a Conceptual Machine Capable of Evolving Association Streams

The increasing bulk of data generated in industrial and scientific applications has fostered practitioners’ interest in mining large amounts of unlabelled data in the form of continuous, high-speed, and time-changing streams of information. An appealing field is association stream mining, which models dynamically complex domains via rules without assuming any a priori structure. Differently from the related field of frequent pattern mining, its goal is to extract interesting associations among the features that form such data, adapting them to the ever-changing dynamics of the environment in a purely online fashion--without the typical offline rule generation step. These rules are well suited for extracting valuable insights that help in decision making.

It is a pleasure to present Fuzzy-CSar, an online genetic fuzzy system (GFS) designed to extract interesting, quantitative rules from streams of samples. It evolves its internal model online, and is thus able to quickly adapt its knowledge in the presence of drifting concepts.

Tuesday, July 28, 2015

Kagglers: another LHC Challenge!

I am glad to share some very motivating news: after a little more than a year, Kaggle hosts a new CERN/LHC competition! This time it deals with the decay of the tau lepton into three muons. Again, the challenge consists of learning from both real and simulated data to identify this rare phenomenon. The training data set consists of 49 distinct real-valued features plus the signal label (i.e., background noise versus the actual event).

This time I am using Python with scikit-learn, Pandas, and XGBoost (the latter being by far my favorite ensemble package; it is also available for R!), and I have to confess that everything is much easier when you do not have to program it from scratch (as I usually do in C/C++ or Java), and that I have more time to think of new ways of solving the problem from a pure data scientist's point of view. However, I really adore using my own programs (I ended up 24th out of 1691 in the Forest Cover Type challenge using my own Extra Trees/Random Forest implementation written from scratch in C++). As a clever person I once met told me, these Python libraries give us a lot of potential to explore and test new, crazy ideas. Cool!

In the following I share the script I used for passing the competition's agreement and correlation tests (using scikit-learn and Pandas) and also for scoring beyond the benchmark of 0.976709 weighted AUC (this rather simple Python script gave me 0.982596, and at the moment of writing this post the no. 1 has a score of 0.989614; yeah, that is a difference of 0.007018 between the first and the 98th). This is my first serious try with Python and I have to say that it was quite fun. In the next weeks I hope to have time for some nicer scripts that I will try to upload (here and on Kaggle).

 import pandas
 import numpy as np
 from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
 import evaluation  # competition-provided module for the local agreement/correlation checks

 folder = '../'  # Configure this at your will!
 train = pandas.read_csv(folder + 'training.csv', index_col='id')
 train_features = list(train.columns.values)
 train_features.remove('SPDhits')      # This attribute does not let us pass the correlation test...
 train_features.remove('min_ANNmuon')  # It must be removed as it does not appear in the test set...
 train_features.remove('signal')       # The target label, so obviously not a feature...
 train_features.remove('mass')         # It must be removed as it does not appear in the test set...
 train_features.remove('production')   # It must be removed as it does not appear in the test set...

 gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1, subsample=0.7,
                                  min_samples_leaf=1, max_depth=4, random_state=10097)
 gbm.fit(train[train_features], train['signal'])  # Very slow training...

 rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion='entropy', random_state=10097)
 rf.fit(train[train_features], train['signal'])  # Very slow training...

 # Note: newer versions of scikit-learn require min_samples_split >= 2.
 ert = ExtraTreesClassifier(n_estimators=300, max_depth=None, n_jobs=-1, min_samples_split=1, random_state=10097)
 ert.fit(train[train_features], train['signal'])  # Very slow training...

 # Predict a la shotgun approach: blend the three models' probabilities.
 test = pandas.read_csv(folder + 'test.csv', index_col='id')
 test_probs = (
     (rf.predict_proba(test[train_features])[:, 1] +
      ert.predict_proba(test[train_features])[:, 1]) / 2 +
     gbm.predict_proba(test[train_features])[:, 1]
 ) / 3

 submission = pandas.DataFrame({'id': test.index, 'prediction': test_probs})
 submission.to_csv('rf_gbm_ert_submission.csv', index=False)