
📄 apriori.html

📁 The Apriori algorithm is one of the classic data mining algorithms; it is based on the idea of association rules. This is the 2nd algorithm in my collection.
💻 HTML
📖 Page 1 of 4
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!-- ====================================================================
  File    : apriori.html
  Contents: Description of apriori program
  Author  : Christian Borgelt
==================================================================== -->
<html>
<head><title>Apriori Documentation</title></head>
<!-- =============================================================== -->
<body bgcolor=white>
<h1><a name="top">Apriori</a></h1>
<h3>Find Association Rules/Hyperedges with Apriori Algorithm</h3>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3>Contents</h3>
<ul type=disc>
<li><a href="#intro">Introduction</a></li>
<li><a href="#terms">Support and Confidence</a>
    <ul type=circle>
    <li><a href="#suppset">Support of an Item Set</a></li>
    <li><a href="#confrule">Confidence of an Association Rule</a></li>
    <li><a href="#supprule">Support of an Association Rule</a></li>
    </ul></li>
<li><a href="#select">Extended Rule Selection</a>
    <ul type=circle>
    <li><a href="#diff">Absolute Confidence Difference to Prior</a></li>
    <li><a href="#quotient">Difference of Confidence Quotient to 1</a></li>
    <li><a href="#improve">Absolute Difference of Improvement Value to 1</a></li>
    <li><a href="#info">Information Difference to Prior</a></li>
    <li><a href="#chi2">Normalized chi<sup>2</sup> Measure</a></li>
    <li><a href="#behavior">Selection Behavior of the Measures</a></li>
    <li><a href="#appear">Item Appearances</a></li>
    </ul></li>
<li><a href="#target">Other Target Types</a>
    <ul type=circle>
    <li><a href="#itemsets">Frequent Item Sets</a></li>
    <li><a href="#hyperedges">Association Hyperedges</a></li>
    </ul></li>
<li><a href="#options">Program Invocation and Options</a></li>
<li><a href="#input">Input Format</a>
    <ul type=circle>
    <li><a href="#transin">Format of the Transactions File</a></li>
    <li><a href="#appearin">Format of the Item Appearances File</a></li>
    </ul></li>
<li><a href="#output">Output Format</a>
    <ul type=circle>
    <li><a href="#ruleout">Output Format for Association Rules</a></li>
    <li><a href="#setout">Output Format for Frequent Item Sets</a></li>
    <li><a href="#edgeout">Output Format for Association Hyperedges</a></li>
    </ul></li>
<li><a href="#copying">Copying</a></li>
<li><a href="#download">Download</a></li>
<li><a href="#contact">Contact</a></li>
</ul>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="intro">Introduction</a></h3>
<p>Association rule induction is a powerful method for so-called
<i>market basket analysis</i>, which aims at finding regularities in
the shopping behavior of customers of supermarkets, mail-order companies
and the like. With the induction of association rules one tries to find
sets of products that are frequently bought together, so that from the
presence of certain products in a shopping cart one can infer (with a
high probability) that certain other products are present.
Such information, expressed in the form of rules, can often be used to
increase the number of items sold, for instance, by appropriately
arranging the products in the shelves of a supermarket (they may,
for example, be placed adjacent to each other in order to invite even
more customers to buy them together) or by directly suggesting items
to a customer, which may be of interest for him/her.</p>
<p>An <i>association rule</i> is a rule like "If a customer buys wine
and bread, he often buys cheese, too." It expresses an association
between (sets of) <i>items</i>, which may be products of a supermarket
or a mail-order company, special equipment options of a car, optional
services offered by telecommunication companies etc. An association
rule states that if we pick a customer at random and find out that
he selected certain items (bought certain products, chose certain
options etc.), we can be confident, quantified by a percentage, that
he also selected certain other items (bought certain other products,
chose certain other options etc.).</p>
<p>Of course, we do not want just any association rules, we want
"good" rules, rules that are "expressive" and "reliable". The standard
measures to assess association rules are the <i>support</i> and the
<i>confidence</i> of a rule, both of which are computed from the
<i>support</i> of certain item sets. These notions are discussed
<a href="#terms">here</a> in more detail. However, these standard
criteria are often not sufficient to restrict the set of rules to
the interesting ones. Therefore some additional rule evaluation
measures are considered <a href="#select">here</a>.</p>
<p>The main problem of association rule induction is that there are
so many possible rules. For example, for the product range of a
supermarket, which may consist of several thousand different products,
there are billions of possible association rules. It is obvious that
such a vast amount of rules cannot be processed by inspecting each
one in turn. Therefore efficient algorithms are needed that restrict
the search space and check only a subset of all rules, but, if possible,
without missing important rules. One such algorithm is the apriori
algorithm, which was developed by [Agrawal et al. 1993] and which
is implemented in a specific way in my apriori program.</p>
<p>By the way: Version 1.8 of my apriori program is incorporated in
the well-known data mining tool
<a href="http://www.spss.com/Clementine/">Clementine</a> (Version 5.0),
available from <a href="http://www.spss.com">SPSS</a>.</p>
<p>Enjoy,<br>
<a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/">Christian Borgelt</a></p>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right><a href="#top">back to the top</a></td>
    <td width=5></td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table>
<!-- =============================================================== -->
<p><img src="line.gif" alt="" height=7 width=704></p>
<h3><a name="terms">Support and Confidence</a></h3>
<h4><a name="suppset">Support of an Item Set</a></h4>
<p>Let T be the set of all transactions under consideration, e.g.,
let T be the set of all "baskets" or "carts" of products bought by the
customers of a supermarket - on a given day if you like. The support
of an item set S is the percentage of those transactions in T which
contain S. In the supermarket example this is the number of "baskets"
that contain a given set S of products, for example
S = { bread, wine, cheese }. If U is the set of all transactions that
contain all items in S, then</p>
<p>support(S) = (|U| / |T|) * 100%,</p>
<p>where |U| and |T| are the number of elements in U and T,
respectively. For example, if a customer buys the set
X = { milk, bread, apples, wine, sausages, cheese, onions, potatoes },
then S is obviously a subset of X, hence X is in U. If there are 318
customers and 242 of them buy such a set X or a similar one that
contains S, then support(S) = 76.1%.</p>
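<p>To make the definition concrete, here is a small Python sketch (purely
illustrative and not part of the apriori program, which is written in C;
the names <tt>support</tt> and <tt>transactions</tt> are made up for this
example) that computes the support of an item set over a list of toy
shopping carts exactly as defined above:</p>
<pre>
# Illustrative sketch only: support of an item set as defined above.
# (The actual apriori program is a separate C implementation.)

def support(item_set, transactions):
    """Percentage of transactions that contain all items of item_set."""
    s = set(item_set)
    hits = sum(1 for t in transactions if s.issubset(t))   # |U|
    return 100.0 * hits / len(transactions)                # (|U| / |T|) * 100%

transactions = [                       # toy shopping carts (the set T)
    {"bread", "wine", "cheese", "milk"},
    {"bread", "wine", "cheese"},
    {"bread", "butter"},
    {"wine", "cheese", "apples"},
]
print(support({"bread", "wine", "cheese"}, transactions))  # prints 50.0
</pre>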
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right><a href="#top">back to the top</a></td>
    <td width=5></td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table>
<!-- =============================================================== -->
<h4><a name="confrule">Confidence of an Association Rule</a></h4>
<p>This is the measure used by [Agrawal et al. 1993], the inventors of
the apriori algorithm, to evaluate association rules. The confidence
of a rule R = "A and B -&gt; C" is the support of the set of all items
that appear in the rule divided by the support of the antecedent of
the rule, i.e.</p>
<p>confidence(R) = (support({A, B, C}) / support({A, B})) * 100%.</p>
<p>More intuitively, the confidence of a rule is the number of cases in
which the rule is correct relative to the number of cases in which it
is applicable. For example, let R = "wine and bread -&gt; cheese". If a
customer buys wine and bread, then the rule is applicable and it says
that he/she can be expected to buy cheese. If he/she does not buy wine
or does not buy bread or buys neither, then the rule is not applicable
and thus (obviously) does not say anything about this customer.</p>
<p>If the rule is applicable, it says that the customer can be expected
to buy cheese. But he/she may or may not buy cheese, that is, the rule
may or may not be correct. Of course, we are interested in how good the
rule is, i.e., how often its prediction that the customer buys cheese
is correct. The rule confidence measures this: It states the percentage
of cases in which the rule is correct. It computes the percentage
relative to the number of cases in which the antecedent holds, since
these are the cases in which the rule makes a prediction that can be
true or false. If the antecedent does not hold, then the rule does not
make a prediction, so these cases are excluded.</p>
<p>With this measure a rule is selected if its confidence exceeds or
is equal to a given lower limit. That is, we look for rules that have
a high probability of being true, i.e., we look for "good" rules, which
make correct (or very often correct) predictions. My apriori program
always uses this measure to select association rules. The default value
for the confidence limit is 80%. It can be changed with the option
<tt>-c</tt>.</p>
<p>In addition to the rule confidence my apriori program lets you
select several other rule evaluation measures, which are explained
below, but it will also use rule confidence. If you want to rely
entirely on some other measure, you can do so by setting the minimal
rule confidence to zero. (Attention: If you have a large number of
items, setting the minimal rule confidence to zero can result in
<i>very</i> high memory consumption.)</p>
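<p>Continuing the toy sketch from above (again only an illustration of the
definitions, not the program's actual code; <tt>confidence</tt> and the
example carts are made up), the confidence of a rule can be computed by
relating the cases in which the rule is correct to the cases in which it
is applicable, and a rule can then be filtered with a minimal confidence
analogous to the <tt>-c</tt> option:</p>
<pre>
# Illustrative sketch only: confidence of a rule "antecedent -> consequent".

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent in percent."""
    a  = set(antecedent)
    ac = a | set(consequent)
    applicable = sum(1 for t in transactions if a.issubset(t))   # antecedent holds
    correct    = sum(1 for t in transactions if ac.issubset(t))  # all rule items present
    return 100.0 * correct / applicable

transactions = [{"bread", "wine", "cheese", "milk"},
                {"bread", "wine", "cheese"},
                {"bread", "butter"},
                {"wine", "cheese", "apples"}]

# keep the rule only if its confidence reaches the minimal value (default 80%)
if confidence({"wine", "bread"}, {"cheese"}, transactions) >= 80.0:
    print("rule 'wine and bread -> cheese' selected")   # confidence is 100.0 here
</pre>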
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<tr><td width="95%" align=right><a href="#top">back to the top</a></td>
    <td width=5></td>
    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table>
<!-- =============================================================== -->
<h4><a name="supprule">Support of an Association Rule</a></h4>
<p>The support of rules may cause some confusion, because I use this
term in a different way than [Agrawal et al. 1993] do. For them, the
support of a rule "A and B -&gt; C" is the support of the set {A, B, C}.
This is fine if rule confidence is the only rule evaluation measure,
but it causes problems if some other measure is used. For these other
measures it is often much more appropriate to call the support of the
antecedent of the rule, i.e. the support of {A, B} in the example above,
the support of the rule.</p>
<p>The difference can also be stated in the following way: For [Agrawal
et al. 1993], the support of the rule is the (relative) number of cases
in which the rule is correct (i.e., in which the presence of the item C
follows from the presence of the items A and B), whereas for me (and
thus my apriori program) the support of a rule is the (relative) number
of cases in which it is applicable (i.e., in which the antecedent of the
rule holds), although in some of these cases it may be false (because
only the items A and B are present, but the item C is missing).</p>
<p>One reason for this, as already mentioned, is that the definition
of [Agrawal et al. 1993] does not work well for evaluation measures
other than rule confidence. This is explained in more detail below.
Another reason is that I prefer the support of a rule to say something
about the "statistical" support of a rule and its confidence, i.e.,
from how many cases the confidence is computed in order to express
how well founded the assertion about the confidence is.</p>
<p>Maybe an example will make this clearer. Suppose you have a die which
you suspect to be biased. To test this hypothesis, you throw the die,
say, a thousand times. 307 times the 6 turns up. Hence you assume that
the die is actually biased, since the relative frequency is about 30%
although for an unbiased die it should be around 17%. Now, what is the
"statistical" support of this assertion, i.e., on how many experiments
does it rest? Obviously it rests on all 1000 experiments and not only
on the 307 experiments in which the 6 turned up. This is so, simply
because you had to do 1000 experiments to find out that the relative
frequency is around 30% and not only the 307 in which a 6 turned up.</p>
<p>Or suppose you are doing an opinion poll to find out about the
acceptance of a certain political party, maybe with the usual question
"If an election were held next Sunday ...?" You ask 2000 persons, of
which 857 say that they would vote for the party you are interested in.
What is the support of the assertion that this party would get around
43% of all votes? It is the size of your sample, i.e., all 2000 persons,
and not only the 857 that answered in the positive. Again you had to ask
all 2000 people to find out about the percentage of 43%. Of course, you
could have asked fewer people, say, 100, of which, say, 43 said that
they would vote for the party, but then your assertion would be less
reliable, because it is less "supported".
The number of votes for the
party could also be 40% or 50%, because of some random influences. Such
deviations are much less likely if you ask 2000 persons, since then
the random influences can be expected to cancel out.</p>
<p>The rule support can be used to select association rules by stating
a lower bound for the support of a rule. This is equivalent to saying
that you are interested only in such rules that have a large enough
statistical basis (since my apriori program uses the term "support"
in my interpretation and not in the one used by [Agrawal et al. 1993]).
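<p>The following Python sketch (again purely illustrative; the helper names
are made up and the code is not taken from the apriori program) contrasts
the two notions of rule support described above and keeps a rule only if
the support of its antecedent reaches a lower bound:</p>
<pre>
# Illustrative sketch only: two notions of "support" for a rule "A and B -> C".

def itemset_support(items, transactions):
    """Percentage of transactions containing all given items."""
    s = set(items)
    return 100.0 * sum(1 for t in transactions if s.issubset(t)) / len(transactions)

def rule_support_agrawal(antecedent, consequent, transactions):
    # [Agrawal et al. 1993]: support of the set of all items in the rule
    return itemset_support(set(antecedent) | set(consequent), transactions)

def rule_support_applicable(antecedent, consequent, transactions):
    # interpretation used in this document and by the apriori program:
    # support of the antecedent, i.e. how often the rule is applicable
    return itemset_support(antecedent, transactions)

transactions = [{"wine", "bread", "cheese"}, {"wine", "bread"},
                {"bread", "milk"}, {"wine", "cheese", "apples"}]
antecedent, consequent = {"wine", "bread"}, {"cheese"}

print(rule_support_agrawal(antecedent, consequent, transactions))     # 25.0
print(rule_support_applicable(antecedent, consequent, transactions))  # 50.0

min_support = 40.0   # lower bound on the rule (antecedent) support
if rule_support_applicable(antecedent, consequent, transactions) >= min_support:
    print("rule rests on a large enough statistical basis")
</pre>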
