Saturday, March 19, 2011

Schemata, Building Blocks, and Everything Else

Genetic Algorithms (GAs) are a search and optimization method inspired by the way nature works with living entities, using evolutionary operators. These operators exchange genetic information across generations until an ending condition, typically the desired solution, is met. In this entry, the formalism explaining why GAs work is described as proposed by Holland in the mid-seventies and later refined by Goldberg. To do so, we first need to introduce some key concepts, assuming the classical ternary representation {0, 1, *}, where * is the don't care symbol.


A fundamental concept in GA theory is that of the schema. A schema is a particular subset of the set of all possible binary strings, described by a template composed of symbols from the ternary alphabet {0, 1, *}. For instance, the schema 01**1 corresponds to the set of binary strings of length five with a 0 in the first position, a 1 in the second position, and a 1 in the last position. Every string that fits a template is called an instance of the schema. In our particular example, the strings 01001, 01011, and 01111, among others, are instances of the schema 01**1. It is very common to use the term schema to denote both the subset and the template defining it.


Two more concepts are needed in order to understand the intrinsics of GAs: (1) the order of a schema and (2) the length of a schema.


(1) The order of a schema is defined as the number of non-don't care symbols (that is, fixed positions). For example, the order of the schema 01**1 is 3.

(2) The length of a schema (often called the defining length) is defined as the distance between the first and the last non-don't care symbols. For example, the length of the schema 01**1 is 4.
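
These definitions translate directly into code. Here is a minimal Python sketch (the function names are mine, for illustration only) that computes the order and length of a schema and checks whether a string is one of its instances:

def order(schema):
    # Number of non-don't-care (fixed) positions.
    return sum(1 for s in schema if s != '*')

def defining_length(schema):
    # Distance between the first and the last fixed positions.
    fixed = [i for i, s in enumerate(schema) if s != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def is_instance(schema, string):
    # A string is an instance if it agrees with the schema at every fixed position.
    return len(schema) == len(string) and \
        all(s == '*' or s == c for s, c in zip(schema, string))

print(order('01**1'))                 # 3
print(defining_length('01**1'))       # 4
print(is_instance('01**1', '01011'))  # True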


With these definitions in mind we can model how different schemata evolve along a GA run using the schema theorem. This theorem describes how the frequency of schema instances changes under the effects of the evolutionary operators of selection, crossover, and mutation. More precisely, the schema theorem states that short, low-order schemata with above-average fitness increase their frequency in the following generation. One can find the mathematical foundations in the bibliography. However, the schema theorem alone is not enough to comprehend why GAs work; we need another conceptual leap. That is the role of the building block hypothesis, which states that the combination of low-order, highly fit schemata (the building blocks) generates higher-order schemata, and these new schemata form high-performance solutions. In Goldberg's own words, the primary idea of GAs is that they work through a mechanism of decomposition and reassembly.
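
For reference, the usual statement of the theorem (following Goldberg, 1989) bounds the expected number of instances m(H, t) of a schema H at generation t as

\[
\mathbb{E}[m(H, t+1)] \;\geq\; m(H, t)\,\frac{f(H)}{\bar{f}}\left[1 - p_c\,\frac{\delta(H)}{l-1} - o(H)\,p_m\right],
\]

where f(H) is the average fitness of the instances of H, \bar{f} is the average fitness of the population, o(H) and \delta(H) are the order and length defined above, l is the string length, and p_c and p_m are the crossover and mutation probabilities.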


References

O. Cordón, F. Herrera, F. Hoffmann, and L. Magdalena. Genetic fuzzy systems: Evolutionary tuning and learning of fuzzy knowledge bases, volume 19 of Advances in Fuzzy Systems—Applications and Theory. World Scientific, 2001.

D. E. Goldberg. The design of innovation: Lessons from and for competent genetic algorithms. Kluwer Academic Publishers, 1 edition, 2002.

D. E. Goldberg. Genetic algorithms in search, optimization & machine learning. Addison Wesley, 1 edition, 1989.

J. H. Holland. Adaptation in natural and artificial systems. The University of Michigan Press, 1975.

A. Orriols-Puig. New Challenges in Learning Classifier Systems: Mining Rarities and Evolving Fuzzy Models. PhD thesis, Arquitectura i Enginyeria La Salle, Universitat Ramon Llull, Passeig de la Bonanova 8, 08022 - Barcelona, Nov 2008.

Monday, March 7, 2011

From Market Baskets to Databases: Association Rule Mining

What do customers buy? Which products are bought together? With these two short questions the field of association rule (AR) mining makes its appearance. In this field of ML, the original aim was to find associations and correlations between the different items that customers place in their shopping baskets. More generally, the goal of AR mining is to find frequent and interesting patterns, associations, correlations, or causal structures among sets of items in large databases, and to express these relationships as association rules such as {bread, butter} => {milk}. AR mining belongs to the unsupervised learning paradigm, so the algorithm does not have an expert teaching it during the training stage.


Why may AR mining be so important? Many commercial applications generate huge amounts of unlabeled data (just think of Facebook for a moment), so our favorite classifier system will not work in this environment. With AR mining we can exploit such databases and extract useful information from them, and rule-based systems are well known to us...


Note that there are two main types of ARs: (1) binary or qualitative rules, and (2) quantitative rules. For this article I will use qualitative rules due to their simplicity.



Association Rule Mining with Apriori

Apriori is the most famous AR miner out there. With this algorithm we can find the frequent itemsets, that is, those itemsets that appear in the database at least as often as a minimum support threshold. Its simplicity can be summarized by the following pseudocode:


k := 1
generate frequent itemsets of length 1 (those with support count >= min support count)
repeat until no new frequent itemsets are found:
    k := k + 1
    generate candidate itemsets of size k by joining the frequent itemsets of size k - 1
    prune the candidates containing any infrequent subset of size k - 1
    compute the support of each remaining candidate by scanning the database
    keep the candidates with support count >= min support count
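
To make the sketch concrete, here is a minimal Python version of the frequent-itemset phase (the toy database and min_support value are made up for illustration; Agrawal and Srikant's original implementation uses far more efficient candidate generation and support counting):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidates of size 1
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate with one database scan.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: build size k+1 candidates from the frequent size-k itemsets...
        keys = list(survivors)
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        # ...and prune candidates that contain an infrequent size-k subset.
        current = {c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k))}
        k += 1
    return frequent

# Toy transaction database.
db = [{'B', 'C', 'D'}, {'B', 'C'}, {'B', 'C', 'D'}, {'C', 'D'}]
print(apriori(db, min_support=2))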


But this article is focused on rule generation. How do we obtain the association rules from the previously found frequent itemsets? That is definitely tricky. To obtain the desired rules we have to enumerate every way of splitting each frequent itemset into a non-empty antecedent and a non-empty consequent, and keep every rule whose confidence is greater than or equal to a minimum threshold, the minimum confidence, where the confidence of a rule A => B is support(A ∪ B) / support(A). Let's see this with the following example: suppose we have the frequent itemset {B, C, D}, where B, C, and D are items obtained from the transaction database. How many rules can this itemset generate? The answer is a maximum of six: because we work with qualitative rules, each item goes either to the antecedent or to the consequent, giving (2^k) - 2 candidate rules once the two trivial splits are discarded, where k is the size of the itemset, in our case k = 3. The rules are:

BC => D, BD => C, CD => B, B => CD, C => BD, and D => BC. The idea is the following:


for each frequent itemset F with two or more items
    for each non-empty proper subset A of F
        C := F \ A
        ruleConfidence := support(F) / support(A)
        if minConf <= ruleConfidence then
            store rule A => C
        endif
    endfor
endfor


As can be appreciated, getting this right is still tricky enough to keep a wannabe PhD wondering for a couple of days (starting from scratch) how to obtain the desired rules.
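
In Python, though, the whole procedure collapses into a few lines. A minimal sketch, assuming the support dictionary returned by the apriori function sketched above:

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) triples for every rule
    whose confidence is at least min_conf."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # A rule needs at least one item on each side.
        # Every non-empty proper subset can act as the antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                confidence = support / frequent[antecedent]
                if confidence >= min_conf:
                    rules.append((set(antecedent), set(consequent), confidence))
    return rules

# 'apriori' and 'db' come from the previous sketch.
freq = apriori(db, min_support=2)
for a, c, conf in generate_rules(freq, min_conf=0.7):
    print(f"{a} => {c} (confidence {conf:.2f})")

Note that every antecedent is guaranteed to be in the frequent dictionary, because every subset of a frequent itemset is itself frequent (the Apriori property), so no extra database scan is needed to compute the confidences.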


References

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Technical report, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, Sep 1994.