<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><!-- =================================================================== File : apriori.html Contents: Description of apriori program Author : Christian Borgelt==================================================================== --><html><head><title>Apriori Documentation</title></head><!-- =============================================================== --><body bgcolor=white><h1><a name="top">Apriori</a></h1><h3>Finding Association Rules/Hyperedges with the Apriori Algorithm</h3><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3>Contents</h3><ul type=disc><li><a href="#intro">Introduction</a></li><li><a href="#terms">Support and Confidence</a> <ul type=circle> <li><a href="#suppset">Support of an Item Set</a></li> <li><a href="#confrule">Confidence of an Association Rule</a></li> <li><a href="#supprule">Support of an Association Rule</a></li> </ul></li><li><a href="#target">Target Types</a> <ul type=circle> <li><a href="#assrules">Association Rules</a></li> <li><a href="#itemsets">Frequent Item Sets</a></li> <li><a href="#closed">Closed Item Sets</a></li> <li><a href="#maximal">Maximal Item Sets</a></li> <li><a href="#hyperedges">Association Hyperedges</a></li> </ul></li><li><a href="#select">Extended Rule Selection</a> <ul type=circle> <li><a href="#diff"> Absolute Confidence Difference to Prior</a></li> <li><a href="#quotient"> Difference of Confidence Quotient to 1</a></li> <li><a href="#improve"> Absolute Difference of Improvement Value to 1</a></li> <li><a href="#info"> Information Difference to Prior</a></li> <li><a href="#chi2"> Normalized chi<sup>2</sup> Measure</a></li> <li><a href="#behavior"> Selection Behavior of the Measures</a></li> <li><a href="#appear">Item Appearances</a></li> </ul></li><li><a href="#select">Extended Item Set Selection</a> <ul type=circle> <li><a href="#logquot"> Binary Logarithm of Support 
Quotient</a></li> <li><a href="#suppquot"> Difference of Support Quotient to 1</a></li> </ul></li><li><a href="#tatree">Transaction Prefix Tree</a></li><li><a href="#options">Program Invocation and Options</a></li><li><a href="#input">Input Format</a> <ul type=circle> <li><a href="#transin">Format of the Transactions File</a></li> <li><a href="#appearin">Format of the Item Appearances File</a></li> </ul></li><li><a href="#output">Output Format</a> <ul type=circle> <li><a href="#ruleout">Output Format for Association Rules</a></li> <li><a href="#setout">Output Format for Frequent Item Sets</a></li> <li><a href="#edgeout">Output Format for Association Hyperedges</a> </li> </ul></li><li><a href="#compopt">Compilation Options</a></li><li><a href="#copying">Copying</a></li><li><a href="#download">Download</a></li><li><a href="#contact">Contact</a></li></ul><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="intro">Introduction</a></h3><p>Association rule induction is a powerful method for so-called <i>market basket analysis</i>, which aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies and the like. With the induction of association rules one tries to find sets of products that are frequently bought together, so that from the presence of certain products in a shopping cart one can infer (with a high probability) that certain other products are present.
Such information, expressed in the form of rules, can often be used to increase the number of items sold, for instance, by appropriately arranging the products on the shelves of a supermarket (they may, for example, be placed adjacent to each other in order to invite even more customers to buy them together) or by directly suggesting items to a customer that may be of interest to him/her.</p><p>An <i>association rule</i> is a rule like "If a customer buys wine and bread, he often buys cheese, too." It expresses an association between (sets of) <i>items</i>, which may be products of a supermarket or a mail-order company, special equipment options of a car, optional services offered by telecommunication companies etc. An association rule states that if we pick a customer at random and find out that he selected certain items (bought certain products, chose certain options etc.), we can be confident, quantified by a percentage, that he also selected certain other items (bought certain other products, chose certain other options etc.).</p><p>Of course, we do not want just any association rules, we want "good" rules, rules that are "expressive" and "reliable". The standard measures to assess association rules are the <i>support</i> and the <i>confidence</i> of a rule, both of which are computed from the <i>support</i> of certain item sets. These notions are discussed <a href="#terms">here</a> in more detail. However, these standard criteria are often not sufficient to restrict the set of rules to the interesting ones. Therefore some additional rule evaluation measures are considered <a href="#select">here</a>.</p><p>The main problem of association rule induction is that there are so many possible rules. For example, for the product range of a supermarket, which may consist of several thousand different products, there are billions of possible association rules. It is obvious that such a vast number of rules cannot be processed by inspecting each one in turn.
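To get a feeling for how quickly the rule space explodes, the following illustrative Python sketch (not part of the apriori program itself) counts the possible rules over n items, using the standard identity that the number of rules X -&gt; Y with disjoint, non-empty antecedent and consequent is 3<sup>n</sup> - 2<sup>n+1</sup> + 1, and checks it against direct enumeration for small n:

```python
from itertools import combinations

def rule_count(n):
    """Number of rules X -> Y with disjoint, non-empty X and Y over n items."""
    return 3**n - 2**(n + 1) + 1

def rule_count_naive(n):
    """Direct enumeration: pick the k items in the rule, then split them."""
    total = 0
    for k in range(2, n + 1):
        sets = len(list(combinations(range(n), k)))
        total += sets * (2**k - 2)   # every non-trivial split into X -> Y
    return total

# the closed formula agrees with brute-force counting
assert all(rule_count(n) == rule_count_naive(n) for n in range(2, 10))
```

Already `rule_count(30)` exceeds 200 trillion, so for thousands of products exhaustive inspection is hopeless, which is exactly why pruned search is needed.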
Therefore efficient algorithms are needed that restrict the search space and check only a subset of all rules, but, if possible, without missing important rules. One such algorithm is the apriori algorithm, which was developed by [Agrawal et al. 1993] and which is implemented in a specific way in my apriori program. A brief description of some implementation aspects can be found in these papers:</p><ul type=disc><li><b>Induction of Association Rules: Apriori Implementation</b><br> Christian Borgelt and Rudolf Kruse<br> <i>15th Conference on Computational Statistics</i> (Compstat 2002, Berlin, Germany)<br> Physica Verlag, Heidelberg, Germany 2002<br> (6 pages) <a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/papers/cstat_02.pdf"> cstat_02.pdf</a> (105 kb) <a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/papers/cstat_02.ps.gz"> cstat_02.ps.gz</a> (91 kb)</li><li><b>Efficient Implementations of Apriori and Eclat</b><br> Christian Borgelt.<br> <i>Workshop of Frequent Item Set Mining Implementations</i> (FIMI 2003, Melbourne, FL, USA).<br> (9 pages) <a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/papers/fimi_03.pdf"> fimi_03.pdf</a> (304 kb) <a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/papers/fimi_03.ps.gz"> fimi_03.ps.gz</a> (197 kb)</li></ul><p>By the way: Earlier versions of my apriori program are incorporated in the well-known data mining tool <a href="http://www.spss.com/Clementine/">Clementine</a> (apriori version 1.8 in Clementine version 5.0, apriori version 2.7 in Clementine version 7.0), available from <a href="http://www.spss.com">SPSS</a>.
Newer versions of Clementine still use my program, but I am not completely sure about the version number of the underlying apriori program.</p><p>Enjoy,<br><a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/">Christian Borgelt</a></p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td> <td width=5></td> <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="terms">Support and Confidence</a></h3><h4><a name="suppset">Support of an Item Set</a></h4><p>Let T be the set of all transactions under consideration, e.g., let T be the set of all "baskets" or "carts" of products bought by the customers of a supermarket - on a given day if you like. The support of an item set S is the percentage of those transactions in T which contain S. In the supermarket example this is the number of "baskets" that contain a given set S of products, for example S = { bread, wine, cheese }. If U is the set of all transactions that contain all items in S, then</p><p>support(S) = (|U| / |T|) * 100%,</p><p>where |U| and |T| are the number of elements in U and T, respectively. For example, if a customer buys the set X = { milk, bread, apples, wine, sausages, cheese, onions, potatoes } then S is obviously a subset of X, hence X is in U. If there are 318 customers and 242 of them buy such a set X or a similar one that contains S, then support(S) = 76.1%.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td> <td width=5></td> <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="confrule">Confidence of an Association Rule</a></h4><p>This is the measure used by [Agrawal et al.
1993], the inventors of the apriori algorithm, to evaluate association rules. The confidence of a rule R = "A and B -> C" is the support of the set of all items that appear in the rule divided by the support of the antecedent of the rule, i.e.</p><p>confidence(R) = (support({A, B, C}) / support({A, B})) * 100%.</p><p>More intuitively, the confidence of a rule is the number of cases in which the rule is correct relative to the number of cases in which it is applicable. For example, let R = "wine and bread -> cheese". If a customer buys wine and bread, then the rule is applicable and it says that he/she can be expected to buy cheese. If he/she does not buy wine or does not buy bread or buys neither, then the rule is not applicable and thus (obviously) does not say anything about this customer.</p><p>If the rule is applicable, it says that the customer can be expected to buy cheese. But he/she may or may not buy cheese, that is, the rule may or may not be correct. Of course, we are interested in how good the rule is, i.e., how often its prediction that the customer buys cheese is correct. The rule confidence measures this: It states the percentage of cases in which the rule is correct. It computes the percentage relative to the number of cases in which the antecedent holds, since these are the cases in which the rule makes a prediction that can be true or false. If the antecedent does not hold, then the rule does not make a prediction, so these cases are excluded.</p><p>With this measure a rule is selected if its confidence exceeds or is equal to a given lower limit. That is, we look for rules that have a high probability of being true, i.e., we look for "good" rules, which make correct (or very often correct) predictions. My apriori program always uses this measure to select association rules. The default value for the confidence limit is 80%.
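The two definitions above can be sketched in a few lines of Python (illustrative only, not the program's actual implementation; the `carts` data is made up for the example):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(s <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent united with consequent) / support(antecedent)."""
    body = set(antecedent) | set(consequent)
    return support(body, transactions) / support(antecedent, transactions)

carts = [
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese", "milk"},
]
# rule "wine and bread -> cheese": applicable in 3 carts, correct in 2,
# so its confidence is 2/3
c = confidence({"wine", "bread"}, {"cheese"}, carts)
```

Note that the confidence divides by the support of the antecedent alone, which is exactly why carts without both wine and bread neither help nor hurt the rule.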
The limit can be changed with the option <tt>-c</tt>.</p><p>In addition to the rule confidence my apriori program lets you select several other rule evaluation measures, which are explained below, but it will also use rule confidence. If you want to rely entirely on some other measure, you can do so by setting the minimal rule confidence to zero. (Attention: If you have a large number of items, setting the minimal rule confidence to zero can result in <i>very</i> high memory consumption.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td> <td width=5></td> <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="supprule">Support of an Association Rule</a></h4><p>The support of rules may cause some confusion, because I use this term in a different way than [Agrawal et al. 1993] do. For them, the support of a rule "A and B -> C" is the support of the set {A, B, C}. This is fine if rule confidence is the only rule evaluation measure,