Kagglers: another LHC Challenge!

I am glad to share some very motivating news: after a little more than a year, Kaggle is hosting a new CERN/LHC competition! This time it is about the decay of a tau into three muons. Again, the challenge consists of learning patterns from both real and simulated data in order to identify this rare phenomenon. The training data set consists of 49 distinct real-valued features plus the signal label (i.e., whether an event is background noise or the proper event).

This time I am using Python with scikit-learn, pandas and XGBoost (the latter is by far my favorite ensemble package; it is also available for R!). I have to confess that it is much easier when you do not have to program everything from scratch (as I usually do in C/C++ or Java), and that I have more time to think about new ways of solving the problem from a pure data scientist's view. However, I really adore using my own programs (I ended up 24th out of 1691 in the Forest Cover Type challenge using my own Extra Trees/Random Forest programmed from scratch in C++). As a clever person I once met told me, these Python libraries give us a lot of potential to explore and test new, crazy ideas. Cool!

Below I share the script I used to pass the prerequisite agreement and correlation tests (using scikit-learn and pandas) and to score above the benchmark of 0.976709 weighted AUC. This rather simple Python script gave me 0.982596, and at the time of writing the no. 1 has a score of 0.989614; yeah, that is a difference of only 0.007018 between the first and the 98th. This is my first serious try with Python and I have to say that it was quite fun. In the coming weeks I hope to have time for some nicer scripts, which I will try to upload (here and on Kaggle).

 import pandas  
 import numpy as np  
 from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier  
 import evaluation  
 folder = '../'  # Configure this at your will!
 train = pandas.read_csv(folder + 'training.csv', index_col = 'id')  
 train_features = list(train.columns.values)  
 train_features.remove('SPDhits') # This attribute does not let us pass the correlation test...  
 train_features.remove('min_ANNmuon') # It must be removed as it does not appear in the test set...  
 train_features.remove('signal')  # The target label: it is what we predict, and it does not appear in the test set...  
 train_features.remove('mass') # It must be removed as it does not appear in the test set...  
 train_features.remove('production') # It must be removed as it does not appear in the test set...  
 gbm = GradientBoostingClassifier(n_estimators = 500, learning_rate = 0.1, subsample = 0.7,  
                    min_samples_leaf = 1, max_depth = 4, random_state = 10097)  
 gbm.fit(train[train_features], train['signal']) # Very slow training...  
 rf = RandomForestClassifier(n_estimators = 300, n_jobs = -1, criterion = "entropy", random_state = 10097)  
 rf.fit(train[train_features], train["signal"]) # Very slow training...  
 ert = ExtraTreesClassifier(n_estimators = 300, max_depth = None, n_jobs = -1, min_samples_split = 2, random_state = 10097)  # min_samples_split must be >= 2 in current scikit-learn  
 ert.fit(train[train_features], train["signal"]) # Very slow training...  
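 # Predict à la shotgun approach.  
 test = pandas.read_csv(folder + 'test.csv', index_col='id')  
 # The end of this expression was cut off in the listing; averaging the two  
 # forests and then averaging the result with the GBM is an assumed reconstruction.  
 test_probs = (  
            (rf.predict_proba(test[train_features])[:,1] +  
           ert.predict_proba(test[train_features])[:,1]) /2 +  
           gbm.predict_proba(test[train_features])[:,1]) / 2  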
 submission = pandas.DataFrame({'id': test.index, 'prediction': test_probs})  
 submission.to_csv("rf_gbm_ert_submission.csv", index = False)
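
Before spending a submission, a blend of this kind can be sanity-checked on a hold-out split. The snippet below is only a minimal sketch under several assumptions: it uses synthetic data from `make_classification` instead of `training.csv`, much smaller models so that it runs in seconds, plain ROC AUC from scikit-learn as a stand-in for the competition's weighted AUC, and blend weights (forest average, then averaged with the GBM) that are my own choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption: not the real features).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=10097)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=10097)

# Deliberately small models so the sketch runs quickly.
models = {
    'gbm': GradientBoostingClassifier(n_estimators=50, max_depth=4,
                                      subsample=0.7, random_state=10097),
    'rf': RandomForestClassifier(n_estimators=50, criterion='entropy',
                                 n_jobs=-1, random_state=10097),
    'ert': ExtraTreesClassifier(n_estimators=50, n_jobs=-1,
                                random_state=10097),
}

# Fit each model on the training part and keep its hold-out signal probability.
probs = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    probs[name] = model.predict_proba(X_val)[:, 1]

# Illustrative blend: average the two forests, then average that with the GBM.
blend = ((probs['rf'] + probs['ert']) / 2 + probs['gbm']) / 2

# Compare every single model against the blend on the hold-out part.
scores = {name: roc_auc_score(y_val, p) for name, p in probs.items()}
scores['blend'] = roc_auc_score(y_val, blend)
for name, auc in scores.items():
    print('%-5s hold-out AUC = %.4f' % (name, auc))
```

If the blend does not beat the best single model on the hold-out part, it is probably not worth the extra training time. Note that a good hold-out AUC says nothing about the agreement and correlation tests, which still have to be checked separately.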