Thursday, November 11, 2010

Data Streams and VFML

We live in a technological world crowded with information. Nearly every device we can think of produces data, usually as a flow or stream of information in more or less real time. In this setting, classical knowledge discovery mechanisms (like our beloved C4.5, a decision tree algorithm developed by Quinlan) are unable to extract a correct model of the situation. But what is so special about flows of data?

In the words of Gama and Rodrigues: a data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments.
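To make the "read only once, with limited storage" constraint concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from Gama and Rodrigues or from VFML: the simulated sensor (`sensor_stream`) and the `RunningStats` class are hypothetical names, and the running mean and variance are computed with Welford's method, chosen here simply because it needs constant memory.

```python
import random
from typing import Iterator


def sensor_stream(n: int, seed: int = 0) -> Iterator[float]:
    """Hypothetical sensor: yields n noisy readings, one at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.gauss(10.0, 2.0)  # assumed true mean 10, std dev 2


class RunningStats:
    """Mean and variance in a single pass, storing only three numbers."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: each instance is seen once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for reading in sensor_stream(100_000):
    stats.update(reading)  # nothing from the stream is kept in memory

print(f"n={stats.n} mean={stats.mean:.3f} variance={stats.variance:.3f}")
```

The point is that the summary never grows, no matter how long the stream runs. A batch learner, by contrast, would need all 100,000 readings stored before it could start.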

So, to handle this kind of data properly, the learning algorithm has to learn online while processing massive amounts of data, which makes the problem considerably harder. Hold your breath for the following example: the nuclear device controller. The decay of heavy particles inside a nuclear reactor generates a flow of data that the controller must monitor, adjusting physical parameters such as the neutron moderators in order to sustain (or, if necessary, stop) the nuclear fission inside the reaction chamber. Clearly, a classical offline approach could be a very dangerous business.

Usually, data streams come from sensor networks and contain an undesirable amount of noise, which degrades the model of the system. And this is not the only problem. There is a major challenge: the distribution of the problem's categories can change over time. This effect is called concept drift and, combined with noisy input, it can completely destroy the predictions of classical knowledge discovery algorithms.
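A toy experiment in Python shows how badly a frozen model copes with drift. Everything in it is an assumption made for illustration: a synthetic stream (`drifting_stream`) whose labelling rule flips halfway through, 10% label noise, and a deliberately simple learner (`BinMajority`) that predicts the majority label on each side of x = 0.5. The sliding window is only a stand-in for an adaptive learner; it is not how CVFDT or any other particular algorithm works.

```python
import random
from collections import deque


def drifting_stream(n, drift_at, noise=0.1, seed=1):
    """Synthetic stream: the concept flips at step drift_at; labels are noisy."""
    rng = random.Random(seed)
    for t in range(n):
        x = rng.random()
        label = int(x >= 0.5)       # original concept
        if t >= drift_at:
            label = 1 - label       # concept drift: the rule is inverted
        if rng.random() < noise:
            label = 1 - label       # sensor noise corrupts the label
        yield t, x, label


class BinMajority:
    """Predicts the majority label seen so far on each side of x = 0.5."""

    def __init__(self, window=None):
        self.window = window        # None means: never forget anything
        self.memory = deque()
        self.counts = [[0, 0], [0, 0]]  # counts[bin][label]

    def predict(self, x):
        zeros, ones = self.counts[int(x >= 0.5)]
        return int(ones > zeros)

    def learn(self, x, label):
        b = int(x >= 0.5)
        self.counts[b][label] += 1
        if self.window is not None:
            self.memory.append((b, label))
            if len(self.memory) > self.window:
                old_b, old_label = self.memory.popleft()
                self.counts[old_b][old_label] -= 1  # forget the oldest example


N, DRIFT_AT, WARMUP = 10_000, 5_000, 1_000
static_model = BinMajority()             # trained once, then frozen
window_model = BinMajority(window=200)   # keeps only the last 200 examples
static_hits = window_hits = 0

for t, x, label in drifting_stream(N, DRIFT_AT):
    if t >= DRIFT_AT:                    # score only after the drift
        static_hits += static_model.predict(x) == label
        window_hits += window_model.predict(x) == label
    if t < WARMUP:
        static_model.learn(x, label)     # the "classical" model stops learning
    window_model.learn(x, label)         # test first, then train

after = N - DRIFT_AT
static_acc = static_hits / after
window_acc = window_hits / after
print(f"after drift: static={static_acc:.2f} windowed={window_acc:.2f}")
```

After the flip, the frozen model is right mostly by accident (when noise happens to invert a label), while the windowed model recovers once its memory fills with examples of the new concept.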

Several online learning algorithms have been proposed so far, but not all of them can handle concept drift. The current state of the art is the so-called "Concept-adapting Very Fast Decision Tree" (CVFDT), a decision tree learner, like C4.5, designed by Hulten and others to handle serious data stream problems with concept drift. To test its capabilities, Hulten and Domingos developed a toolkit for mining high-speed data streams and very large data sets. This software is called "Very Fast Machine Learning" (VFML) and is available under a BSD license here.

I tested this software, and I have to warn you that you will probably have some trouble compiling it. The makefiles provided have some minor mistakes, and you have to run the build several times (I ran make four times before I got the binaries).