⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 apriori.html

📁 apriori算法是数据挖掘的经典算法之1,其基于关联规则的思想.这是我的第2个收藏算法
💻 HTML
📖 第 1 页 / 共 4 页
字号:
The default value for the support limit is 10%. It can be changedwith the option <tt>-s</tt>. The lower bound for the rule support iscombined with the lower bound for the rule confidence, i.e., my aprioriprogram generates only rules the confidence of which is greater than orequal to the confidence limit <i>and</i> the support of which is greaterthan or equal to the support limit.</p><p>Despite the above arguments in favor of my definition of the supportof an association rule, a rule support compatibility mode is available.With the option <tt>-o</tt> the original rule support definition can beselected. In this case the support of an association rule is the supportof the set with the items in the antecedent and the consequent of therule, i.e. the support of a rule as defined in [Agrawal et al. 1993].</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="select">Extended Rule Selection</a></h3><p>If rules are selected using the rule confidence, the followingproblem arises: "Good" rules (rules that are often true) are notalways "interesting" rules (rules that reveal something about theinterdependence of the items). You certainly know the examples thatare usually given to illustrate this fact. For instance, it is easyto find out in a medical database that the rule "pregnant -&gt; female"is true with a confidence of 100%. Hence it is a perfect rule, itnever fails, but, of course, this is not very surprising. Althoughthe measures explained below cannot deal with this problem (which issemantical), they may be able to improve on the results in a relatedcase.</p><p>Let us look at the supermarket example again and let us assumethat 60% of all customers buy some kind of bread. Consider the rule"cheese -&gt; bread", which holds with a confidence of, say, 62%.Is this an important rule? Obviously not, since the fact that thecustomer buys cheese does not have a significant influence on him/herbuying bread: The percentages are almost the same. But if you had seta confidence limit of 60%, you would get both rules "-&gt; bread"(confidence 60%) and "cheese -&gt; bread" (confidence 62%), althoughthe first would suffice (the first, since it is the simpler of thetwo). The idea of all measures that can be used in addition or insteadof rule confidence is to handle such situations and to suppress thesecond rule.</p><p>In addition, consider the following case: Assume that the confidenceof the rule "cheese -&gt; bread" is not 62% but 35%. With a confidencelimit of 60% it would not be selected, but it may be very important toknow about this rule! Together with cheese bread is bought much lessfrequent than it is bought at all. Is cheese some kind of substitutefor bread, so that one does not need any bread, if one has cheese? Ok,maybe this is not a very good example. However, what can be seen isthat a rule with low confidence can be very interesting, since it maycapture an important influence. Furthermore, this is a way to expressnegation (though only in the consequent of a rule), since"cheese -&gt; bread" with confidence 35% is obviously equivalent to"cheese -&gt; no bread" with confidence 65%. This also makes clearwhy the support of the item set that contains all items in the body<i>and</i> the head of the rule is not appropriate for this measure.An important rule may have confidence 0 and thus a support (in theinterpretation of [Agrawal et al. 1993]) of 0. Hence it is notreasonable to set a lower bound for this kind of support.</p><p>I hope that the intention underlying all this is already clear:Potentially interesting rules differ significantly in their confidencefrom the confidence of rules with the same consequent, but a simplerantecedent. Adding an item to the antecedent is informative only if itsignificantly changes the confidence of the rule. Otherwise the simplerrule suffices.</p><p>Unfortunately the measures other than rule confidence do not solvethe rule selection problem in the very general form in which it wasstated above. It is not that easy to deal with all rules that have asimpler antecedent, to keep track of which of these rules were selected(this obviously influences the selection of more complicated rules),to deal with the special type of Poincare paradox that can occur, etc.Hence the measures always compare the confidence of a rule with theconfidence of the rule with empty antecedent, i.e. with the relativefrequency of the consequent.</p><p>I call the confidence of a rule with empty antecedent the priorconfidence, since it is the confidence that the item in the consequentof the rule will be present in an item set prior to any informationabout other items that are present. The confidence of a rule withnon-empty antecedent (and the same consequent) I call the posteriorconfidence, since it is the confidence that the item in the consequentof the rule will be present after it gets known that the items in theantecedent of the rule are present.</p><p>All measures that can be used in addition to rule confidence arecomputed from these two values: the prior confidence and the posteriorconfidence. Only those rules are selected for which the value of thechosen additional evaluation measure exceeds or is equal to a certainlimit. The measures are chosen with the option <tt>-e</tt>, the limitis passed to the program via the option <tt>-d</tt>. The default valuefor the limit is 10%.</p><p>All additional rule evaluation measures are combined with the limitsfor rule confidence and rule support. I.e., my apriori program selectsonly those rules the confidence of which is greater than or equal tothe confidence limit, the support of which is greater than or equal tothe support limit, <i>and</i> for which the additional evaluation valueis greater than or equal to the limit for this measure. The default isto use no additional evaluation measure, i.e., to rely only on ruleconfidence and rule support. Of course you can remove the restrictionthat the rule confidence must exceed a certain limit by simply settingthis limit to zero. In this case rules are selected using only thelimits for the rule support and the additional evaluation measure.(Attention: If you have a large number of items, setting the minimalrule confidence to zero can result in <i>very</i> high memoryconsumption.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="diff">Absolute Confidence Difference to Prior    (option <tt>-ed</tt> or <tt>-e1</tt>)</a></h4><p>The simplest way to compare the two confidences is to compute theabsolute value of their difference. I.e., if "-&gt; bread" has aconfidence of 60% and "cheese -&gt; bread" has a confidence of 62%,then the value of this measure is 2%. The parameter given via theoption <tt>-d</tt> to the program states a lower bound for thisdifference. It follows that this measure selects rules the confidenceof which differs more than a given threshold from the correspondingprior confidence. For example, with the option <tt>-d20</tt> (and, ofcourse, the option <tt>-ed</tt> to select the measure) for the item"bread" only rules with a confidence less than 40% or greater than 80%would be selected. Of course, for other items, with a different priorconfidence, the upper and lower bounds are different, too.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="quotient">Difference of Confidence Quotient to 1    (option <tt>-eq</tt> or <tt>-e2</tt>)</a></h4><p>An equally simple way to compare the two confidences is to computetheir quotient. Since either the prior or the posterior confidencecan be greater (which was handled by computing the absolute valuefor the previous measure), this quotient or its reciprocal, whicheveris smaller, is then compared to one. A quotient of one says that therule is not interesting, since the prior and the posterior confidenceare identical. The more the quotient differs from one, the more"interesting" the rule is. Hence, just as above, a lower bound forthis difference is given via the option <tt>-d</tt>. For the breadexample, with the option <tt>-d20</tt> rules with a confidence lessthan or equal to (1 -20%) *60% = 0.8 *60% = 48% or a confidence greaterthan or equal to 60% / (1 -20%) = 60% / 0.8 = 75% are selected. Thedifference between this measure and the absolute confidence differenceto the prior is that the deviation that is considered to be significantdepends on the prior confidence. If it is high, then the deviation ofthe posterior confidence must also be high, and if it is low, thenthe deviation need only be low. For example, if "-&gt; bread" had aconfidence of only 30%, then the option <tt>-d20</tt> (just as above)would select rules the confidence of which is less than 0.8 *30% = 24%or greater than 30% /0.8 = 37.5%. As you can see, for a prior confidenceof 60% the deviation has to be at least 12%/15%, for a prior confidenceof 30% it has to be only 6%/7.5% in order to make a rule eligible.The idea is that an increment of the confidence from 30% to 40% is moreimportant than an increment from 60% to 70%, since the relative changeis greater.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="improve">Absolute Difference of Improvement Value to 1    (option <tt>-ea</tt> or <tt>-e3</tt>)</a></h4><p>This measure is very similar to the preceding one. Actually, ifthe confidence of a rule is smaller than the prior confidence, thenit coincides with it. The improvement value is simply the posteriorconfidence divided by the prior confidence. It is greater thanone if the confidence increases due to the antecedent, and it issmaller than one if the confidence decreases due to the antecedent.By computing the absolute value of the difference to one, theimprovement value can easily be made a rule selection measure.The advantage of this measure over the preceding one is that it issymmetric w.r.t. changes of the confidence due to the antecedent ofa rule. For the bread example, with the option <tt>-d20</tt> rules witha confidence less than or equal to (1 -20%) *60% = 0.8 *60% = 48% or aconfidence greater than or equal to (1 +20%) *60% = 1.2 * 60% = 72%are selected. (Note the difference of 72% compared to 75% for thepreceding measure.) Similarly, for the second bread examplediscussed above, the numbers are 0.8 *30% = 24% and 1.2 *30% = 36%.Note that this is the only measure for which a value greater than 100may be specified with the <tt>-d</tt> option, since it can exceed100% if the posterior confidence of a rule exceeds twice the priorconfidence. (I am grateful to Roland Jonscher, who pointed out thismeasure to me.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="info">Information Difference to Prior    (option <tt>-ei</tt> or <tt>-e4</tt>)</a></h4><p>This measure is simply the information gain criterion that can beused in decision tree learners like C4.5 to select the split attributes.Its idea is as follows: Without any further information about otheritems in the set, we have a certain probability (or, to be exact, arelative frequency) distribution for, say "bread" and "no bread".Let us assume it is 60% : 40% (prior confidence of the item "bread",just as above). This distribution has a certain entropy</p><p>H = - P(bread)    log<sub>2</sub> P(bread)       - P(no bread) log<sub>2</sub> P(no bread),</p><p>where P(bread) is equivalent to the support of "bread", which inturn is equivalent to the prior confidence of "bread". The entropy of aprobability distribution is, intuitively, a lower bound on the numberof yes-no-questions you have to ask in order to determine the actualvalue. This cannot be understood very well with only two possiblevalues, but it can be made to work for this case too. I skip thedetails here, they are not that important.</p><p>After we get the information that the items in the antecedent ofthe rule are present (say, cheese), we have a different probabilitydistribution, say 35% : 65%. I.e., P(bread|cheese) = 0.35 andP(no bread|cheese) = 0.65. If we also know the support of the item"cheese" (let it be P(cheese) = 0.4 and P(no cheese) = 0.6), thenwe can also compute the probabilities P(bread|no cheese) = 0.8 andP(no bread|no cheese) = 0.2. Hence we have two posterior probabilitydistributions. The question now is: How much information do we receivefrom observing the antecedent of the rule? Information is measuredas a reduction of entropy. Hence the entropies of the two conditionalprobability distributions (for "cheese" and "no cheese") are computedand summed weighted with the probability of their occurrence (i.e.,the relative frequency of "cheese" and "no cheese", respectively).This gives the expected value of the posterior or conditional entropy.The difference of this value to the prior entropy (see above) is thegain in information from the antecedent of the rule or, as I calledit, the information difference to the prior.</p><p>The value that can be given via the <tt>-d</tt> option is a lowerbound for the information gain, measured in hundreds of a bit. Sinceall items can only be present or absent, the information gain can beat most one bit. Therefore a percent value is still reasonable.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="chi2">Normalized</a> chi<sup>2</sup> Measure    (option <tt>-ec</tt> or <tt>-e5</tt>)</h4><p>The chi<sup>2</sup> measure is well known from statistics. It isoften used to measure the difference between a supposed independent

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -