Andreu Sancho Homepage

Posts

High Performance Computing, yet another brief installation tutorial

- May 30, 2016

Today’s mid-range and high-end computers come with tremendous hardware, mostly used for video games and other media software, that can also be exploited for advanced computation, that is, High Performance Computing (HPC). This is a hot topic in Deep Learning, as modern graphics cards offer huge stream-processing power and large, fast memory. The most successful example is Nvidia’s CUDA platform. In summary, CUDA significantly speeds up the fitting of large neural nets (for instance, from several hours to just a few minutes!). However, the drawbacks come when setting up the environment: it is non-trivial to install the requirements and get everything running, and I personally had a little trouble the first time, as many packages need to be manually compiled and installed in a specific order. The purpose of this entry is to document what I did to set up Theano and Keras with HPC on an Nvidia graphics card (in my case a GT730) under GNU/Linux. To do so, I will start assuming a clean Debi...

Beyond state-of-the-art accuracy by fostering ensemble generalization

- March 11, 2016

Sometimes practitioners are forced to go beyond the standard methods in order to gain more accuracy with their models. If one analyzes the problem of improving accuracy, ensembling is a good starting point. However, the trick lies in getting enough generalization from the feature space. In this regard, ensemble generalization (not to be confused with classic or "standard" ensemble methods such as Random Forest or Gradient Boosting) is the right path to follow, however complex. The idea is to combine predictions from "base learners" to train a second-stage regressor, using these predictions as metafeatures. The key is to use a J-fold cross-validation scheme and always use the same data partitions and seed. This kind of ensemble is often called stacking, as we "stack" layers of classifiers. Let’s work through an example: suppose that we have three base learners: GBM, ET, and RF. Then assume we have an LM as the level 2 learner. First we divide the training data into ...
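As a rough illustration of the stacking idea described in this excerpt, here is a minimal sketch. It is not the code from the full post: the excerpt names GBM, ET, and RF as base learners, whereas this sketch swaps in three simple NumPy stand-ins (a linear fit, a 1-nearest-neighbour regressor, and a one-split stump) so that it runs without extra libraries. All function names, the synthetic data, and the default of five folds are assumptions made for the example. The level 2 learner is a least-squares linear model trained on the out-of-fold predictions.

```python
import numpy as np

def make_folds(n_samples, n_folds, seed):
    """Shuffle the indices once with a fixed seed and split them into J folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

# Each "fit_*" function trains on (X, y) and returns a predict function.
def fit_linear(X, y):
    """Least-squares linear model with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_nearest(X, y):
    """Stand-in base learner: 1-nearest-neighbour regressor."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[d.argmin(axis=1)]
    return predict

def fit_stump(X, y):
    """Stand-in base learner: split feature 0 at its median."""
    t = np.median(X[:, 0])
    left, right = y[X[:, 0] <= t].mean(), y[X[:, 0] > t].mean()
    return lambda Xn: np.where(Xn[:, 0] <= t, left, right)

def out_of_fold(fit, X, y, folds):
    """One metafeature column: every sample is predicted by a model
    that was trained without the fold containing that sample."""
    oof = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        oof[held_out] = fit(X[train], y[train])(X[held_out])
    return oof

def fit_stack(base_fits, X, y, n_folds=5, seed=0):
    # The same partitions (same seed) are reused for every base learner.
    folds = make_folds(len(y), n_folds, seed)
    meta = np.column_stack([out_of_fold(f, X, y, folds) for f in base_fits])
    level2 = fit_linear(meta, y)                 # level 2 learner (LM)
    base_models = [f(X, y) for f in base_fits]   # refit on all training data
    return lambda Xn: level2(np.column_stack([m(Xn) for m in base_models]))

# Synthetic data, purely for demonstration.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

stack = fit_stack([fit_linear, fit_nearest, fit_stump], X[:150], y[:150])
pred = stack(X[150:])
print("held-out MSE:", np.mean((pred - y[150:]) ** 2))
```

Fixing the seed in `make_folds` is what keeps the partitions identical across base learners, so each row of the metafeature matrix comes from models that never saw that sample.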

Presenting a Conceptual Machine Capable of Evolving Association Streams

- December 16, 2015

The increasing volume of data generated in industrial and scientific applications has fostered practitioners’ interest in mining large amounts of unlabelled data in the form of continuous, high-speed, and time-changing streams of information. An appealing field is association stream mining, which models dynamically complex domains via rules without assuming any a priori structure. Unlike the related field of frequent pattern mining, its goal is to extract interesting associations among the features that make up such data, adapting these to the ever-changing dynamics of the environment in a purely online fashion, without the typical offline rule generation. These rules are well suited to extracting valuable insight that helps in decision making. It is a pleasure to detail Fuzzy-CSar, an online genetic fuzzy system (GFS) designed to extract interesting, quantitative rules from streams of samples. It evolves its internal model online, being able to quickly adapt its knowledge in the ...

Kagglers: another LHC Challenge!

- July 28, 2015

I am glad to share some very motivating news: after a little more than a year, Kaggle is hosting a new CERN/LHC competition! This time it concerns the tau disintegration into three muons. Again, the challenge consists of identifying the pattern of this rare phenomenon from both real and simulated data. The training data set consists of 49 distinct real-valued features plus the signal (i.e., background noise and the proper event). This time I am using Python with scikit-learn, Pandas and XGBoost (the latter being by far my favorite ensemble package; also available for R!), and I have to confess that it is much easier when you do not have to program everything from scratch (as I usually do in C/C++ or Java), and that I have more time to think about new ways of solving the problem from a pure data scientist's view. However, I really adore using my own programs (I ended up 24th out of 1691 in the Forest Cover Type challenge using my own Extra Trees/Random Forest progra...

A Vectorized Version of the Mighty Logistic Regressor

- March 01, 2015

Neural nets, with their flexible data representation capable of approximating any arbitrary function, are in a new renaissance with the development of Deep Learners. These stack many layers (typically dozens of quasi-independent layers) of traditional neurons to achieve something very remarkable: self-detection of important features from input data. The application of Deep Learning is not restricted to raw classification or regression; this family of techniques is applied to much broader fields such as machine vision and speech recognition. Recent advances have gone much further and combined a Deep Learning architecture with a Reinforcement Learning algorithm, producing a computer program capable of beating classic Atari video games without being explicitly programmed for this task, scoring like a top human player. It is not surprising, therefore, that they appear on the latest cover of Nature (vol. 518, Num. 7540, pp. 456-568). But how do they work? First, one must take a close look to the hum...

Data Analytics Plus Big Data: A Win-win Strategy

- November 16, 2014

We live in a dynamic world where new technologies arise, dominate for the blink of an eye and then rapidly fall into oblivion. Apart from academia, it is interesting to check what the current trends in industry are. Or even better: to see the whole picture without discriminating. In this regard, a nice analytics tool is Google Trends, which allows us to track our terms of interest. The following chart shows the search interest for Machine Learning and Data Analytics. On a similar scale I depicted the search interest for Data Analytics and Big Data. As is clearly noticeable, knowing both disciplines can open many doors in industry. Finally, among both top and rising queries, Hadoop is the most popular term.
