Monday, May 30, 2016

High Performance Computing, yet another brief installation tutorial

Today’s mid- and high-end computers come with tremendous hardware, mostly aimed at video games and other media software, that can be exploited for advanced computation, that is: High Performance Computing (HPC). This is a hot topic in Deep Learning, as modern graphics cards offer huge stream-processing power together with large, fast memory. The most successful example is Nvidia’s CUDA platform. In summary, CUDA significantly speeds up the fitting of large neural nets (for instance: from several hours to just a few minutes!).

However, the drawback comes when setting everything up: it is non-trivial to install the requirements and get them running, and personally I had a little trouble the first time, as many packages need to be manually compiled and installed in a specific order. The purpose of this entry is to describe what I did to set up Theano and Keras with HPC support using an Nvidia graphics card (in my case a GT730) on GNU/Linux. I will start from a clean Debian 8 Jessie install and use Anaconda for Python.

The first thing to do is to install the prerequisite packages: gcc, g++, gfortran, build-essential, linux-headers, git, and automake, after updating apt (assuming we are already logged in as root):

# apt-get update
# aptitude install gcc g++ gfortran build-essential linux-headers-$(uname -r) git automake

These are the minimum requirements to proceed: without these packages we cannot build Theano and Keras. Next, we configure git in two easy steps:

$ git config --global user.name "YOUR_USER_NAME"
$ git config --global user.email "YOUR_USER_EMAIL"

Now we start downloading the requisites for Theano. We start with OpenBLAS, an efficient linear algebra library:

$ mkdir git
$ cd git
$ git clone https://github.com/xianyi/OpenBLAS.git

Switch to root, enter the git/OpenBLAS folder and run the following two lines:

# make FC=gfortran  
# make PREFIX=/usr/local install

After this step we can proceed with the installation of the graphics card driver and the CUDA toolkit. This is one of the most critical parts, so be very careful. First we need to download the package from the Nvidia web page, selecting CUDA 7.5, Linux, x86_64, Ubuntu 14.04, runfile (local). Yes, we will use the Ubuntu 14.04 file; no trouble with that. After downloading the file, we have to blacklist the nouveau driver, otherwise Nvidia’s proprietary driver will not work. To do so, as root, we need to do the following:

# gedit /etc/modprobe.d/nvidia.conf

And enter the following:

blacklist nouveau
blacklist lbm-nouveau
blacklist nvidia-173
blacklist nvidia-96
blacklist nvidia-current
blacklist nvidia-173-updates
blacklist nvidia-96-updates
alias nvidia nvidia_current_updates
alias nouveau off
alias lbm-nouveau off

Save and exit. Then, as root run the following:

# update-initramfs -u

Now we are almost ready to install the driver. To proceed with the installation we have to kill the X session. Enter console mode by pressing CTRL + ALT + F1 and log in. After logging in with your user, switch to root. Then do the following:

# telinit 3

This switches to the classic console-only mode (that is: no X session), and we can proceed with the installation. Now enter the directory where the CUDA runfile is. I assume it is in ~/Downloads/ and the file is called something like cuda_7.5.18_linux.run (this may change, you have to check the actual name!):

# cd /home/YOUR_USER_HOME/Downloads
# chmod +x cuda_7.5.18_linux.run
# ./cuda_7.5.18_linux.run

And follow the instructions: mostly accept the license and answer yes to everything. Before proceeding we have to modify the .bashrc configuration of your user (not root’s, but your user’s!):

$ gedit .bashrc

And add the following lines:

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH

Save and exit. After this step, reboot the machine. We now have the driver and the CUDA toolkit installed and ready (you can check that the driver is loaded by running nvidia-smi). Next we have to install Anaconda and the rest of the packages. You can download the installer from the Anaconda download page, selecting the Linux 64-bit installer for Python 2.7 (Python 3.5 is also available, but we stick to 2.7 for this tutorial). After this, install Anaconda (assuming the installer is named something like Anaconda2-x.x.x-Linux-x86_64.sh and that it is stored in your Downloads folder; again, check the actual name):

$ cd Downloads
$ bash Anaconda2-x.x.x-Linux-x86_64.sh

You will see that Anaconda writes into your .bashrc file. At this point I recommend either rebooting (easy solution) or sourcing the updated .bashrc in every open terminal:

$ cd
$ . .bashrc

(yes, type cd and press ENTER, then type . .bashrc and press ENTER).
After this, Anaconda will be set as the default Python. You can check it by typing:

$ python --version

If everything is OK you will see a message indicating that this is a Python build packaged for Anaconda. Now we update the packages:

$ conda update conda
$ conda update anaconda
$ conda install pydot

At this point we can proceed with Theano and Keras. It is crucial to get the latest versions from the git repositories, otherwise they will not work (at least in my case). So we clone the required packages:

$ cd git
$ git clone https://github.com/Theano/Theano.git
$ git clone https://github.com/fchollet/keras.git

We enter the folders and install the packages in the following order: first install Theano, then configure Theano (the .theanorc file) and finally install Keras.

$ cd git/Theano
$ python setup.py install

As your user, in the home folder type:

$ cd
$ gedit .theanorc

And write a configuration along the following lines (a typical minimal setup that enables the GPU, float32 precision and the OpenBLAS library we compiled earlier; adjust the paths to your system):
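
# typical minimal .theanorc (adjust paths to your setup)
[global]
floatX = float32
device = gpu

[blas]
ldflags = -L/usr/local/lib -lopenblas

[cuda]
root = /usr/local/cuda-7.5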


Save and exit. Afterwards we install Keras:

$ cd git/keras
$ python setup.py install
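
Before wrapping up, it is worth checking that Theano actually runs on the GPU. The small script below is adapted from the GPU test in the Theano documentation (save it, for instance, as check_gpu.py and run it with python check_gpu.py); if everything is in place it should report that the gpu was used:

from theano import function, config, shared, tensor
import numpy
import time

# Build a large vector and compile a simple elementwise function on it.
vlen = 10 * 30 * 768
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())

# Time the compiled function.
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))

# If the graph only contains plain (CPU) Elemwise ops, the GPU was not used.
if numpy.any([isinstance(node.op, tensor.Elemwise) for node in f.maker.fgraph.toposort()]):
    print("Used the cpu")
else:
    print("Used the gpu")

If it reports the cpu, double-check your .theanorc and the CUDA paths added to .bashrc.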

And that is all for having HPC on your computer :)

Friday, March 11, 2016

Beyond state-of-the-art accuracy by fostering ensemble generalization

Sometimes practitioners are forced to go beyond the standard methods in order to squeeze more accuracy out of their models. If one analyzes the problem of boosting accuracy, ensembling is a good starting point. However, the trick lies in getting enough generalization from the feature space. In this regard, ensemble generalization (not to be confused with classic or "standard" ensemble methods such as Random Forest or Gradient Boosting) is the right path to follow, however complex. The idea is to combine the predictions of several "base learners" to train a second-stage regressor, using these predictions as meta-features. The trick is to use a J-fold cross-validation scheme and always use the same data partitions and seed. This kind of ensemble is often called stacking, as we "stack" layers of classifiers.

Let’s do an example: suppose that we have three base learners, GBM, ET, and RF, and a linear model (LM) as the level-2 learner. First we divide the training data into J folds, for example 4; recall that these 4 folds are stratified and disjoint. Then we train each model using the traditional cross-validation scheme, that is, train on 3 folds and predict on the remaining one (this works best if the predictions are in the form of probabilities). These out-of-fold predictions are stored and will be used for training the level-2 model. Figure 1 depicts this process.

Figure 1. Ensemble generalization (also known as stacking) training scheme. The idea is to "stack" multiple layers of classifiers to generalize further (in this example we use two layers), and to use a J-fold cross-validation scheme to avoid bias (in this example J = 4).

After training the level-2 algorithm, we can proceed with the final predictions. To do so, we train the base learners again, this time using the whole training set. We do this because it can gain us up to 20% in accuracy. It is important to highlight that the random seeds must be the same as in the J-fold training! Afterwards, for each test example we predict with the base learners and collect the predictions. These are the input of the level-2 algorithm, which performs the final prediction.
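
To make the procedure more concrete, here is a minimal sketch of this scheme in Python with scikit-learn (a toy illustration with made-up names, assuming numpy arrays, a binary classification problem and recent scikit-learn module paths; it is not the exact code I used):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import (GradientBoostingClassifier,
                              ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression

def stacked_predictions(X_train, y_train, X_test, n_folds=4, seed=1):
    # Level-1 base learners: GBM, ET and RF (same seed everywhere!).
    base_learners = [GradientBoostingClassifier(random_state=seed),
                     ExtraTreesClassifier(random_state=seed),
                     RandomForestClassifier(random_state=seed)]
    # Level-2 learner: a simple linear model trained on the meta-features.
    level2 = LogisticRegression()

    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    meta_train = np.zeros((X_train.shape[0], len(base_learners)))
    meta_test = np.zeros((X_test.shape[0], len(base_learners)))

    for j, clf in enumerate(base_learners):
        # Out-of-fold probabilities become the meta-features for level 2.
        for tr_idx, val_idx in skf.split(X_train, y_train):
            clf.fit(X_train[tr_idx], y_train[tr_idx])
            meta_train[val_idx, j] = clf.predict_proba(X_train[val_idx])[:, 1]
        # Retrain on the whole training set to produce the test meta-features.
        clf.fit(X_train, y_train)
        meta_test[:, j] = clf.predict_proba(X_test)[:, 1]

    level2.fit(meta_train, y_train)
    return level2.predict_proba(meta_test)[:, 1]

Note how the same seed is reused for the fold partitioning and for every base learner, both in the out-of-fold stage and in the final retraining on the whole training set.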

I have used this on Kaggle a few times and I have to say that it makes a difference. However, I found it difficult to get working and it requires a lot of processing power. There is a nice post by Triskelion explaining ensembles that gave me the inspiration to write this one.

Wednesday, December 16, 2015

Presenting a Conceptual Machine Capable of Evolving Association Streams

The increasing bulk of data generated in industrial and scientific applications has fostered practitioners’ interest in mining large amounts of unlabelled data in the form of continuous, high-speed, and time-changing streams of information. An appealing field is association stream mining, which models complex, dynamic domains via rules without assuming any a priori structure. Differently from the related frequent pattern mining field, its goal is to extract interesting associations among the features that form such data, adapting them to the ever-changing dynamics of the environment in a purely online fashion, without the typical offline rule generation. These rules are well suited to extracting valuable insight that helps in decision making.

It is a pleasure to present Fuzzy-CSar, an online genetic fuzzy system (GFS) designed to extract interesting, quantitative rules from streams of samples. It evolves its internal model online, and it is able to quickly adapt its knowledge in the presence of drifting concepts.