<p>(Attention: If you have a large number of items, setting the minimal
rule confidence to zero can result in <i>very</i> high memory
consumption.)</p>

<!-- =============================================================== -->
<h4><a name="diff">Absolute Confidence Difference to Prior
    (option <tt>-ed</tt> or <tt>-e1</tt>)</a></h4>

<p>The simplest way to compare the two confidences is to compute the
absolute value of their difference. I.e., if "-&gt; bread" has a
confidence of 60% and "cheese -&gt; bread" has a confidence of 62%,
then the value of this measure is 2%. The parameter given via the
option <tt>-d</tt> states a lower bound for this difference, so this
measure selects rules whose confidence differs by more than the given
threshold from the corresponding prior confidence. For example, with
the option <tt>-d20</tt> (and, of course, the option <tt>-ed</tt> to
select the measure), only rules for the item "bread" with a confidence
of less than 40% or greater than 80% would be selected. Of course, for
other items, with a different prior confidence, the upper and lower
bounds are different, too.</p>
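<p>As a concrete illustration, the following minimal Python sketch
(hypothetical helper code, not part of the apriori program) computes
the selection bounds implied by a given prior confidence and minimal
difference:</p>

<pre>
# Selection bounds for the absolute confidence difference to the
# prior (option -ed); hypothetical helper, not apriori source code.

def abs_diff_bounds(prior_pct, min_diff_pct):
    """A rule is selected if its posterior confidence (in percent)
    lies below the first or above the second returned bound."""
    return prior_pct - min_diff_pct, prior_pct + min_diff_pct

# Bread example: prior confidence 60%, option -d20.
print(abs_diff_bounds(60, 20))   # (40, 80): select rules with a
                                 # confidence below 40% or above 80%
</pre>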
<!-- =============================================================== -->
<h4><a name="quotient">Difference of Confidence Quotient to 1
    (option <tt>-eq</tt> or <tt>-e2</tt>)</a></h4>

<p>An equally simple way to compare the two confidences is to compute
their quotient. Since either the prior or the posterior confidence can
be greater (which was handled by computing the absolute value for the
previous measure), this quotient or its reciprocal, whichever is
smaller, is then compared to one. A quotient of one says that the rule
is not interesting, since the prior and the posterior confidence are
identical; the more the quotient differs from one, the more
"interesting" the rule is. Hence, just as above, a lower bound for
this difference is given via the option <tt>-d</tt>. For the bread
example, with the option <tt>-d20</tt>, rules with a confidence of at
most (1 - 20%) * 60% = 0.8 * 60% = 48% or of at least
60% / (1 - 20%) = 60% / 0.8 = 75% are selected. The difference between
this measure and the absolute confidence difference to the prior is
that the deviation considered significant depends on the prior
confidence: if the prior confidence is high, the deviation of the
posterior confidence must also be high; if it is low, the deviation
need only be low. For example, if "-&gt; bread" had a confidence of
only 30%, then the option <tt>-d20</tt> (just as above) would select
rules whose confidence is less than 0.8 * 30% = 24% or greater than
30% / 0.8 = 37.5%. As you can see, for a prior confidence of 60% the
deviation has to be at least 12% (downward) or 15% (upward), whereas
for a prior confidence of 30% it only has to be 6% or 7.5% in order to
make a rule eligible. The idea is that an increase of the confidence
from 30% to 40% is more important than an increase from 60% to 70%,
since the relative change is greater.</p>

<!-- =============================================================== -->
<h4><a name="improve">Absolute Difference of Improvement Value to 1
    (option <tt>-ea</tt> or <tt>-e3</tt>)</a></h4>

<p>This measure is very similar to the preceding one; in fact, if the
confidence of a rule is smaller than the prior confidence, it
coincides with it. The improvement value is simply the posterior
confidence divided by the prior confidence. It is greater than one if
the confidence increases due to the antecedent, and it is smaller than
one if the confidence decreases due to the antecedent. By computing
the absolute value of its difference to one, the improvement value can
easily be made a rule selection measure. The advantage of this measure
over the preceding one is that it is symmetric w.r.t. changes of the
confidence due to the antecedent of a rule. For the bread example,
with the option <tt>-d20</tt>, rules with a confidence of at most
(1 - 20%) * 60% = 0.8 * 60% = 48% or of at least
(1 + 20%) * 60% = 1.2 * 60% = 72% are selected. (Note the difference:
72% here compared to 75% for the preceding measure.) Similarly, for
the second bread example discussed above, the bounds are
0.8 * 30% = 24% and 1.2 * 30% = 36%. Note that this is the only
measure for which a value greater than 100 may be specified with the
<tt>-d</tt> option, since it can exceed 100% if the posterior
confidence of a rule exceeds twice the prior confidence. (I am
grateful to Roland Jonscher, who pointed out this measure to me.)</p>
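<p>To make the difference between this measure and the preceding one
concrete, here is a small Python sketch (hypothetical helper code, not
part of the apriori program) that evaluates both for the bread
example:</p>

<pre>
# The two quotient-based measures (options -eq and -ea);
# hypothetical helpers, not apriori source code.

def quotient_diff(prior, post):
    """Difference of confidence quotient to 1: 1 - min(q, 1/q)."""
    q = post / prior
    return 1.0 - min(q, 1.0 / q)

def improvement_diff(prior, post):
    """Absolute difference of improvement value to 1: abs(q - 1)."""
    return abs(post / prior - 1.0)

# Bread example with a prior confidence of 60%:
for post in (0.48, 0.72, 0.75):
    print(post, quotient_diff(0.60, post), improvement_diff(0.60, post))
# At 48% both measures reach 0.2; at 75% the quotient measure reaches
# 0.2, while the improvement measure reaches 0.2 at 72% already.
</pre>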
<!-- =============================================================== -->
<h4><a name="info">Information Difference to Prior
    (option <tt>-ei</tt> or <tt>-e4</tt>)</a></h4>

<p>This measure is simply the information gain criterion that can be
used in decision tree learners like C4.5 to select the split
attributes. Its idea is as follows: Without any further information
about other items in the set, we have a certain probability (or, to be
exact, a relative frequency) distribution for, say, "bread" and
"no bread". Let us assume it is 60% : 40% (prior confidence of the
item "bread", just as above). This distribution has a certain
entropy</p>

<p>H = - P(bread) log<sub>2</sub> P(bread)
      - P(no bread) log<sub>2</sub> P(no bread),</p>

<p>where P(bread) is equivalent to the support of "bread", which in
turn is equivalent to the prior confidence of "bread". The entropy of
a probability distribution is, intuitively, a lower bound on the
number of yes-no-questions you have to ask in order to determine the
actual value. This cannot be understood very well with only two
possible values, but it can be made to work for this case, too. I skip
the details here; they are not that important.</p>

<p>After we get the information that the items in the antecedent of
the rule are present (say, cheese), we have a different probability
distribution, say 35% : 65%. I.e., P(bread|cheese) = 0.35 and
P(no bread|cheese) = 0.65. If we also know the support of the item
"cheese" (let it be P(cheese) = 0.4 and P(no cheese) = 0.6), then we
can also compute the probabilities P(bread|no cheese) = 0.77 and
P(no bread|no cheese) = 0.23. Hence we have two posterior probability
distributions. The question now is: How much information do we receive
from observing the antecedent of the rule? Information is measured as
a reduction of entropy: the entropies of the two conditional
probability distributions (for "cheese" and "no cheese") are computed
and summed, weighted with the probability of their occurrence (i.e.,
the relative frequency of "cheese" and "no cheese", respectively).
This gives the expected value of the posterior or conditional entropy.
The difference of this value to the prior entropy (see above) is the
gain in information from the antecedent of the rule or, as I called
it, the information difference to the prior.</p>

<p>The value that can be given via the <tt>-d</tt> option is a lower
bound for the information gain, measured in hundredths of a bit. Since
all items can only be present or absent, the information gain can be
at most one bit, so a percent value is still reasonable.</p>

<!-- =============================================================== -->
<h4><a name="chi2">Normalized</a> chi<sup>2</sup> Measure
    (option <tt>-ec</tt> or <tt>-e5</tt>)</h4>

<p>The chi<sup>2</sup> measure is well known from statistics. It is
often used to measure the difference between a supposed independent
distribution of two discrete variables and their actual joint
distribution, in order to determine how strongly the two variables
depend on each other. As defined in statistics, this measure contains
the number of cases it is computed from as a factor, which is not very
appropriate if one wants to evaluate rules that can have varying
support. Hence this factor is removed by simply dividing the measure
by the number of item sets (the total number, i.e., with the names
used above, the number of sets in X). With this normalization, the
chi<sup>2</sup> measure can assume values between 0 (no dependence)
and 1 (very strong dependence). The value that can be given via the
<tt>-d</tt> option is a lower bound for the strength of the dependence
of the head on the body in percent (0 - no dependence, 100 - perfect
dependence). Only those rules are selected in which the head depends
on the body with a degree of dependence above this bound.</p>
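<p>The following short Python sketch (hypothetical code, not taken
from the apriori program) reproduces the bread/cheese numbers above
for both the information difference and, using one standard
formulation for two binary variables, the normalized
chi<sup>2</sup> measure:</p>

<pre>
# Information difference to prior (option -ei) and normalized chi^2
# (option -ec) for the bread/cheese example; hypothetical sketch.
from math import log2

def entropy(p):
    """Binary entropy of a present/absent distribution."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

p_bread  = 0.60   # prior confidence (= support) of "bread"
p_cheese = 0.40   # support of the antecedent "cheese"
p_b_c    = 0.35   # P(bread|cheese), the posterior confidence
# Derived: P(bread|no cheese)
#   = (P(bread) - P(bread|cheese) * P(cheese)) / P(no cheese)
p_b_nc   = (p_bread - p_b_c * p_cheese) / (1 - p_cheese)  # about 0.77

info_gain = entropy(p_bread) - (p_cheese * entropy(p_b_c)
                                + (1 - p_cheese) * entropy(p_b_nc))
print(round(info_gain, 3))   # about 0.127 bit

# For two binary variables, chi^2 divided by the number of cases
# equals the squared phi coefficient:
p_joint   = p_b_c * p_cheese  # P(bread and cheese) = 0.14
chi2_norm = ((p_joint - p_bread * p_cheese) ** 2
             / (p_bread * (1 - p_bread) * p_cheese * (1 - p_cheese)))
print(round(chi2_norm, 3))   # about 0.174
</pre>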
<!-- =============================================================== -->
<h4><a name="behavior">Selection Behavior of the Measures</a></h4>

<p>In the directory <tt>apriori/doc</tt> you can find a Gnuplot script
named <tt>arem.gp</tt> (<tt>arem</tt> stands for additional rule
evaluation measures) which visualizes the behavior of the additional
rule evaluation measures. This script draws eight 3d graphs: one for
the absolute confidence difference, one for the difference of the
confidence quotient to one, three for the information difference to
the prior confidence, and three for the normalized chi<sup>2</sup>
measure. All graphs show the value of an additional rule evaluation
measure over a plane defined by the prior and the posterior confidence
of a rule. The latter two measures need three graphs each, since they
depend on the antecedent support of a rule as a third parameter.
Setting a minimal value for an additional rule evaluation measure is
like flooding the corresponding graph landscape up to a certain level
(given as a percentage, since all considered measures assume values
between 0 and 1). Only those rules are selected that sit on dry
land.</p>

<p>The first graph shows the behavior of the absolute confidence
difference. On the diagonal, i.e. the line where the prior and the
posterior confidence are identical, its value is zero (as expected).
The more the two confidences differ, the higher the value of this
measure gets, but in a linear way.</p>

<p>The second graph shows the behavior of the difference of the
confidence quotient to one. Again its value is zero on the diagonal
(as the value of all measures is) and becomes greater the more the
prior and the posterior confidence differ. But it is much steeper for
a small prior confidence than for a large one, and it is
non-linear.</p>

<p>The third to fifth graphs show the information difference to the
prior confidence for an antecedent support (which is identical to the
rule support in my interpretation, see above) of 0.2 (20%), 0.3 (30%)
and 0.4 (40%). The regions at the margins, where the measure is zero,
correspond to impossible combinations of prior confidence, posterior
confidence, and antecedent support. As you can see, the valley gets
narrower with increasing antecedent support. I.e., with the same
minimal value for this measure, rules with a low antecedent support
need a higher confidence difference to be selected than rules with a
high antecedent support. This nicely models the statistical
significance of confidence changes. If you only have a few cases to
support your rule, even a large deviation from the prior confidence
can be explained by random fluctuations, since only a few transactions
need to be different to produce a considerable change. However, if the
antecedent support is large, even a small deviation (in percent) has
to be considered significant (non-random), since it takes a lot of
changes to transactions to produce even a small change in the
percentage. This dependence on the antecedent support of the rule, and
the fact that the valley is not pointed at the diagonal (which means
that even a low minimal value can exclude a lot of rules), is the main
difference between the information gain and the normalized
chi<sup>2</sup> measure on the one hand and the absolute confidence
difference and the difference of the confidence quotient to one on the
other.</p>

<p>The sixth to eighth graphs show the normalized chi<sup>2</sup>
measure for an antecedent support of 0.2, 0.3, and 0.4. The valleys
are very similar to those for the information difference to the prior
confidence; they only have slightly steeper flanks, especially for low
antecedent support. So in practice there is no big difference between
the information difference and the normalized chi<sup>2</sup>
measure.</p>
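<p>The narrowing of the valley can also be checked numerically. The
following Python sketch (hypothetical code, not part of the
<tt>arem.gp</tt> script) evaluates the information difference for the
same prior and posterior confidence at two different antecedent
supports:</p>

<pre>
# How the antecedent support affects the information difference to
# the prior; hypothetical sketch, not part of arem.gp.
from math import log2

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def info_diff(prior, post, supp):
    """Information difference for given prior/posterior confidence
    and antecedent support; P(head|no body) follows from the three."""
    post_nb = (prior - post * supp) / (1 - supp)
    return entropy(prior) - (supp * entropy(post)
                             + (1 - supp) * entropy(post_nb))

# Same confidences (prior 60%, posterior 70%), different supports:
print(round(info_diff(0.6, 0.7, 0.2), 4))   # about 0.0077
print(round(info_diff(0.6, 0.7, 0.4), 4))   # about 0.0204
# The same deviation scores higher at larger antecedent support, so a
# fixed minimal value admits it there while excluding it at low support.
</pre>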
<!-- =============================================================== -->
<h4><a name="appear">Item Appearances</a></h4>

<p>My apriori program provides a simple way to restrict the rules to
generate w.r.t. the items that shall appear in them. It accepts a
third, optional input file, in which item appearances can be given.
For each item it can be stated whether it may appear in the body
(antecedent)
