*What do the customers buy? Which products are bought together?* With these two short questions the field of association rule (AR) mining makes its appearance. In this field of ML, the original aim was to find associations and correlations between the different items that customers place in their shopping carts. More generally, the goal of AR mining is to **find frequent and interesting patterns, associations, correlations, or causal structures among sets of items** or elements in large databases, and to express these relationships as association **rules**. AR mining is an important part of the unsupervised learning paradigm, so the algorithm **does not** rely on an expert to teach it during the training stage.

*Why might AR mining be so important?* Many commercial applications generate huge amounts of **unlabeled** data (just think of *Facebook* for a moment), so our favorite classifier system will **not work** in this environment. With AR mining we can exploit such databases and extract useful information, and rule-based systems are already well known to us...

Note that there are two main types of ARs: (1) binary or qualitative rules, and (2) quantitative rules. For this article I will use qualitative rules due to their simplicity.

**Association Rule Mining with Apriori**

Apriori is the most famous AR miner out there. With this algorithm we can find the *frequent itemsets*, that is, those itemsets whose support in the database meets a minimum threshold. Its simplicity can be summarized by the following pseudocode:

```
k := 1
generate frequent itemsets (support count >= min support count) of length 1
repeat until no new frequent itemsets are found:
    k := k + 1
    generate candidate itemsets of size k from the frequent itemsets of size k - 1
    compute the support of each candidate by scanning through the database
```
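The steps above can be sketched in Python. This is a minimal illustration rather than an optimized implementation; the function name `apriori` and its parameters are my own choices, and support is measured as an absolute transaction count:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return every frequent itemset (as a frozenset) with its support count.

    transactions: list of sets of items.
    min_support_count: absolute minimum number of transactions.
    """
    # Length-1 candidates: every distinct item in the database.
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate by scanning the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets.
        k += 1
        prev = list(level)
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent
```

For example, on a toy database of five transactions with a minimum support count of 2, the call `apriori([{'B','C','D'}, {'B','C'}, {'B','D'}, {'C','D'}, {'B','C','D'}], 2)` returns all seven itemsets over B, C, and D, including the triple *{B,C,D}* with support 2.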

But this article is focused on rule generation. *How do we obtain the association rules from the previously found frequent itemsets?* That part is definitely tricky. To obtain the desired rules we have to enumerate all the possible splits (without repetition) of every frequent itemset and check whether the confidence of each rule formed is greater than or equal to a minimum threshold, the minimum confidence. Let's see this with the following example: suppose we have the frequent itemset *{B,C,D}*, where *B*, *C*, and *D* are items obtained from the transaction database. How many rules will this itemset generate? The answer is a maximum of six rules: because we work with qualitative rules, there are (2^k)-2 possible rules, where k is the size of the itemset; in our case k = 3. The rules are:

*BC => D, BD => C, CD => B, B => CD, C => BD*, and *D => BC*. The idea is the following:
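To make the (2^k)-2 count concrete, here is a small Python helper (the name `rules_from_itemset` is my own) that enumerates every antecedent/consequent split of a frequent itemset:

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Enumerate every antecedent => consequent split of a frequent itemset.

    A set of size k yields 2**k - 2 qualitative rules: the antecedent is any
    non-empty proper subset, and the consequent is everything else.
    """
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):            # antecedent sizes 1 .. k-1
        for antecedent in combinations(sorted(items), size):
            consequent = items - frozenset(antecedent)
            rules.append((frozenset(antecedent), consequent))
    return rules
```

Calling it on *{B,C,D}* produces exactly the six splits listed above.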

```
for itemsInAntecedent := 1 to k
    first := getFirstItemFromItemset
    element := first
    do
        do
            setConsequentItemActive( first )
            for j := 1 to itemsInAntecedent
                element := getNextElementFromItemset( element )
                setConsequentItemActive( element )
            endfor
            inverseAntecedentFromConsequentInRule
            setRuleConfidence
            if minConf <= ruleConfidence then
                store rule
            endif
        while first <> lastItemFromItemset and itemsInAntecedent > 1
        first := getNextElementFromItemset( first )
        element := first
    while first <> lastItemFromItemset
    if itemsInAntecedent = 1 and numElements > 2 then
        setConsequentItemActive( lastElement )
        setRuleConfidence
        if minConf <= ruleConfidence then
            store rule
        endif
    endif
endfor
```
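In a high-level language there is a much simpler route than walking the itemset with pointers: enumerate the splits directly. The sketch below is my own simplification, not a transcription of the pseudocode above; it assumes a `frequent` dictionary produced by an Apriori pass (itemset -> support count). Since every non-empty proper subset of a frequent itemset is itself frequent, the antecedent's support is always available for the confidence ratio:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate rules A => C with confidence support(A u C) / support(A).

    frequent: dict mapping frozenset itemset -> support count.
    min_conf: minimum confidence threshold in [0, 1].
    """
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue                              # singletons produce no rules
        for size in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), size):
                ante = frozenset(ante)
                # The antecedent is frequent too, so its count is in the dict.
                conf = support / frequent[ante]
                if conf >= min_conf:
                    rules.append((ante, itemset - ante, conf))
    return rules
```

On the toy counts from the earlier example (pairs with support 3, items with support 4, the triple with support 2) and a minimum confidence of 0.7, only the six rules drawn from the size-2 itemsets survive, each with confidence 0.75.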

As can be appreciated, the algorithm is tricky enough to keep a PhD hopeful wondering for a couple of days how to obtain the desired rules from scratch.

**References**

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Technical Report, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, September 1994.
