⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 apriori.html

📁 数据挖掘中的关联规则算法
💻 HTML
📖 第 1 页 / 共 5 页
字号:
but it causes problems if some other measure is used. For these othermeasures it is often much more appropriate to call the support of theantecedent of the rule, i.e. the support of {A, B} in the example above,the support of the rule.</p><p>The difference can also be stated in the following way: For [Agrawalet al. 1993], the support of the rule is the (relative) number of casesin which the rule is correct (i.e., in which the presence of the item Cfollows from the presence of the items A and B), whereas for me (andthus my apriori program) the support of a rule is the (relative) numberof cases in which it is applicable (i.e., in which the antecedent of therule holds), although in some of these cases it may be false (becauseonly the items A and B are present, but the item C is missing).</p><p>One reason for this, as already mentioned, is that the definitionof [Agrawal et al. 1993] does not work well for evaluation measuresother than rule confidence. This is explained in more detail below.Another reason is that I prefer the support of a rule to say somethingabout the "statistical" support of a rule and its confidence, i.e.,from how many cases the confidence is computed in order to expresshow well founded the assertion about the confidence is.</p><p>Maybe an example will make this clearer. Suppose you have a die whichyou suspect to be biased. To test this hypothesis, you throw the die,say, a thousand times. 307 times the 6 turns up. Hence you assume thatthe die is actually biased, since the relative frequency is about 30%although for an unbiased die it should be around 17%. Now, what is the"statistical" support of this assertion, i.e., on how many experimentsdoes it rest? Obviously it rests on all 1000 experiments and not onlyon the 307 experiments in which the 6 turned up. This is so, simplybecause you had to do 1000 experiments to find out that the relativefrequency is around 30% and not only the 307 in which a 6 turned up.</p><p>Or suppose you are doing an opinion poll to find out about theacceptance of a certain political party, maybe with the usual question"If an election were held next Sunday ...?" You ask 2000 persons, ofwhich 857 say that they would vote for the party you are interested in.What is the support of the assertion that this party would get around43% of all votes? It is the size of your sample, i.e., all 2000 persons,and not only the 857 that answered in the positive. Again you had to askall 2000 people to find out about the percentage of 43%. Of course, youcould have asked fewer people, say, 100, of which, say, 43 said thatthey would vote for the party, but then your assertion would be lessreliable, because it is less "supported". The number of votes for theparty could also be 40% or 50%, because of some random influences. Suchdeviations are much less likely, if you asked 2000 persons, since thenthe random influences can be expected to cancel out.</p><p>The rule support can be used to select association rules by statinga lower bound for the support of a rule. This is equivalent to sayingthat you are interested only in such rules that have a large enoughstatistical basis (since my apriori program uses the term "support"in my interpretation and not in the one used by [Agrawal et al. 1993]).The default value for the support limit is 10%. It can be changedwith the option <tt>-s</tt>. If the number given is negative, it isinterpreted as an absolute number (number of transactions) rather thana percentage. (Note that in this case the option <tt>-a</tt> reversesits meaning: if it is not given only the absolute support is printed,if it is added, the relative supoort is printed, too.) The lower boundfor the rule support is combined with the lower bound for the ruleconfidence, i.e., my apriori program generates only rules the confidenceof which is greater than or equal to the confidence limit <i>and</i> thesupport of which is greater than or equal to the support limit.</p><p>Despite the above arguments in favor of my definition of the supportof an association rule, a rule support compatibility mode is available.With the option <tt>-o</tt> the original rule support definition can beselected. In this case the support of an association rule is the supportof the set with the items in the antecedent and the consequent of therule, i.e. the support of a rule as defined in [Agrawal et al. 1993].</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="target">Target Types</a></h3><p>The target type, which can be selected via the option <tt>-t</tt>,is either association rules (option <tt>-tr</tt>, default), frequentitem sets (option <tt>-ts</tt>), closed item sets (option <tt>-tc</tt>),maximal item sets (option <tt>-tm</tt>), or association hyperedges(option <tt>-th</tt>).</p><!-- =============================================================== --><h4><a name="assrules">Association Rules (default, option -tr)</a></h4><p>By default my apriori program produces association rules witha single item in the consequent. The restriction to single itemconsequents is due to the following considerations: In the first place,association rule mining usually produces too many rules even if oneconfines oneself to rules with only one item in the consequent. So whyshould one make the situation worse by allowing more than one item inthe consequent? (It merely blows up the output size.)</p><p>Secondly, I do not know any application in which rules with morethan one item in the consequent are of any real use. The reason, inmy opinion, is that such more complex rules add almost nothing to theinsights about the data set. To understand this, consider the simplerrules that correspond to a rule with multiple items in the consequent,that is, rules having the same antecedent and consequents with onlysingle items from the consequent of the complex rule. All of theserules must necessarily be in the output, because neither their supportnor their confidence can be less than that of the more complex rule.That is, if you have a rule c d &lt;- a b, you will necessarily alsohave the rules c &lt;- a b and d &lt;- a b in the output. Of course,these latter two rules together do <i>not</i> say the same as the morecomplex rule. However, what do you gain from the additional informationthe more complex rule gives you? How can you use it? And is this littleextra information worth having to analyze a much bigger rule set?</p><!-- =============================================================== --><h4><a name="itemsets">Frequent Item Sets (option -ts)</a></h4><p>Sometimes one may not want to find association rules, but only thefrequent item sets underlying them. That is, one wants to find allitem sets with a support exceeding a certain threshold. My aprioriprogram supports this search, too: If the option <tt>-ts</tt> isgiven, only frequent item sets are determined.</p><!-- =============================================================== --><h4><a name="closed">Closed Item Sets (option -tc)</a></h4><p>A frequent item set is called <i>closed</i> if no superset has thesame support. If the option <tt>-tc</tt> is given, the found frequentitem sets are subsequently filtered and only the closed item setsare kept.</p><!-- =============================================================== --><h4><a name="maximal">Maximal Item Sets (option -tm)</a></h4><p>A frequent item set is called <i>maximal</i> if no superset isfrequent, i.e., has a support exceeding the minimal support. If theoption <tt>-tm</tt> is given, the found frequent item sets aresubsequently filtered and only the maximal item sets are kept.</p><!-- =============================================================== --><h4><a name="hyperedges">Association Hyperedges (option -th)</a></h4><p>My apriori program can also find association hyperedges, i.e., setsof items that are strongly predictive w.r.t. each other. In this modeno rules are generated, only item sets are selected. The selectioncriterion is as follows: Given an item set with enough support (option<tt>-s</tt>), all rules are checked which can be formed using this setwith all items appearing in the rule. For example, for the item set{a b c}, the rules c &lt;- a b, b &lt;- a c, a &lt;- b c would beconsidered. The confidences of these rules are computed and averaged.If the resulting average confidence is greater than the minimalconfidence (option <tt>-c</tt>), the item set is selected. (I amgrateful to Bastien Duclaux for requesting the possibility to generateassociation hyperedges.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="select">Extended Rule Selection</a></h3><p>If rules are selected using the rule confidence, the followingproblem arises: "Good" rules (rules that are often true) are notalways "interesting" rules (rules that reveal something about theinterdependence of the items). You certainly know the examples thatare usually given to illustrate this fact. For instance, it is easyto find out in a medical database that the rule "pregnant -&gt; female"is true with a confidence of 100%. Hence it is a perfect rule, itnever fails, but, of course, this is not very surprising. Althoughthe measures explained below cannot deal with this problem (which issemantical), they may be able to improve on the results in a relatedcase.</p><p>Let us look at the supermarket example again and let us assumethat 60% of all customers buy some kind of bread. Consider the rule"cheese -&gt; bread", which holds with a confidence of, say, 62%.Is this an important rule? Obviously not, since the fact that thecustomer buys cheese does not have a significant influence on him/herbuying bread: The percentages are almost the same. But if you had seta confidence limit of 60%, you would get both rules "-&gt; bread"(confidence 60%) and "cheese -&gt; bread" (confidence 62%), althoughthe first would suffice (the first, since it is the simpler of thetwo). The idea of all measures that can be used in addition or insteadof rule confidence is to handle such situations and to suppress thesecond rule.</p><p>In addition, consider the following case: Assume that the confidenceof the rule "cheese -&gt; bread" is not 62% but 35%. With a confidencelimit of 60% it would not be selected, but it may be very important toknow about this rule! Together with cheese bread is bought much lessfrequent than it is bought at all. Is cheese some kind of substitutefor bread, so that one does not need any bread, if one has cheese? Ok,maybe this is not a very good example. However, what can be seen isthat a rule with low confidence can be very interesting, since it maycapture an important influence. Furthermore, this is a way to expressnegation (though only in the consequent of a rule), since"cheese -&gt; bread" with confidence 35% is obviously equivalent to"cheese -&gt; no bread" with confidence 65%. This also makes clearwhy the support of the item set that contains all items in the body<i>and</i> the head of the rule is not appropriate for this measure.An important rule may have confidence 0 and thus a support (in theinterpretation of [Agrawal et al. 1993]) of 0. Hence it is notreasonable to set a lower bound for this kind of support.</p><p>I hope that the intention underlying all this is already clear:Potentially interesting rules differ significantly in their confidencefrom the confidence of rules with the same consequent, but a simplerantecedent. Adding an item to the antecedent is informative only if itsignificantly changes the confidence of the rule. Otherwise the simplerrule suffices.</p><p>Unfortunately the measures other than rule confidence do not solvethe rule selection problem in the very general form in which it wasstated above. It is not that easy to deal with all rules that have asimpler antecedent, to keep track of which of these rules were selected(this obviously influences the selection of more complicated rules),to deal with the special type of Poincare paradox that can occur, etc.Hence the measures always compare the confidence of a rule with theconfidence of the rule with empty antecedent, i.e. with the relativefrequency of the consequent.</p><p>I call the confidence of a rule with empty antecedent the priorconfidence, since it is the confidence that the item in the consequentof the rule will be present in an item set prior to any informationabout other items that are present. The confidence of a rule withnon-empty antecedent (and the same consequent) I call the posteriorconfidence, since it is the confidence that the item in the consequentof the rule will be present after it gets known that the items in theantecedent of the rule are present.</p><p>All measures that can be used in addition to rule confidence arecomputed from these two values: the prior confidence and the posteriorconfidence. Only those rules are selected for which the value of thechosen additional evaluation measure exceeds or is equal to a certainlimit. The measures are chosen with the option <tt>-e</tt>, the limitis passed to the program via the option <tt>-d</tt>. The default valuefor the limit is 10%.</p><p>All additional rule evaluation measures are combined with the limitsfor rule confidence and rule support. I.e., my apriori program selectsonly those rules the confidence of which is greater than or equal tothe confidence limit, the support of which is greater than or equal tothe support limit, <i>and</i> for which the additional evaluation valueis greater than or equal to the limit for this measure. The default isto use no additional evaluation measure, i.e., to rely only on ruleconfidence and rule support. Of course you can remove the restrictionthat the rule confidence must exceed a certain limit by simply settingthis limit to zero. In this case rules are selected using only thelimits for the rule support and the additional evaluation measure.

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -