📄 apriori.html

📁 apriori算法是数据挖掘的经典算法之1,其基于关联规则的思想.这是我的第2个收藏算法
💻 HTML
📖 第 1 页 / 共 4 页
字号:
distribution of two discrete variables and the actual joint distributionin order to determine how strongly two variables depend on each other.This measure (as it is defined in statistics) contains the number ofcases it is computed from as a factor. This is not very appropriateif one wants to evaluate rules that can have varying support. Hencethis factor is removed by simply dividing the measure by the numberof items sets (the total number, i.e. with the names used above, thenumber of sets in X). With this normalization, the chi<sup>2</sup>measure can assume values between 0 (no dependence) and 1 (very strongdependence). The value that can be given via the <tt>-d</tt> option isa lower bound for the strength of the dependence of the head on thebody in percent (0 - no dependence, 100 - perfect dependence). Onlythose rules are selected, in which the head depends on the body witha higher degree of dependence.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="behavior">Selection Behavior of the Measures</a></h4><p>In the directory <tt>apriori/doc</tt> you can find a Gnuplot scriptnamed <tt>arem.gp</tt> (<tt>arem</tt> stands for additional ruleevaluation measures) which visualizes the behavior of the additionalrule evaluation measures. This script draws eight 3d graphs, one forthe absolute confidence difference, one for the difference of theconfidence quotient to one, three for the information difference tothe prior confidence and three for the normalized chi<sup>2</sup>measure. All graphs show the value of an additional rule evaluationmeasure over a plane defined by the prior and the posterior confidenceof a rule. The latter two measures need three graphs, since they dependon the antecedent support of a rule as a third parameter. Setting aminimal value for an additional rule evaluation measure is likeflooding the corresponding graph landscape up to a certain level(given as a percentage, since all considered measures assume valuesbetween 0 and 1). Only those rules are selected that sit on dry land.</p><p>The first graph shows the behavior of the absolute confidencedifference. For the diagonal, i.e. the line where the prior and theposterior confidence are identical, its value is zero (as expected).The more the two confidences differ, the higher the value of thismeasure gets, but in a linear way.</p><p>The second graph shows the behavior of the confidence quotientto one. Again its value is zero for the diagonal (as the value ofall measures is) and becomes greater the more the prior and theposterior confidence differ. But it is much steeper for a smallprior confidence than for a large one and it is non-linear.</p><p>The third to fifth graph show the information difference to theprior confidence for an antecedent support (which is identical to therule support in my interpretation, see above) of 0.2 (20%), 0.3 (30%)and 0.4 (40%). The regions at the margins, where the measure is zero,correspond to impossible combinations of prior and posterior confidenceand antecedent support. As you can see, the valley gets narrower withincreasing antecedent support. I.e., with the same minimal value forthis measure, rules with low antecedent support need a higher confidencedifference to be selected than rules with a high antecedent support.This nicely models the statistical significance of confidence changes.If you only have a few cases to support your rule, even a largedeviation from the prior confidence can be explained by randomfluctuations, since only a few transactions need to be different toproduce a considerable change. However, if the antecedent supportis large, even a small deviation (in percent) has to be consideredsignificant (non random), since it takes a lot of changes totransactions to produce even a small change in the percentage.This dependence on the antecedent support of the rule and that thevalley is not pointed at the diagonal (which means that even a lowminimal value can exclude a lot of rules) is the main differencebetween the information gain and the normalized chi<sup>2</sup>measure on the one hand and the absolute confidence difference anddifference of the confidence quotient to one on the other.</p><p>The sixth to eighth graph show the normalized chi<sup>2</sup> measurefor an antecedent support of 0.2, 0.3, and 0.4. The valleys are verysimilar to those for the information difference to the prior confidence,they only have slightly steeper flanks, especially for low antecedentsupport. So in practice there is no big difference between theinformation difference and the normalized chi<sup>2</sup> measure.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="appear">Item Appearances</a></h4><p>My apriori program provides a simple way to restrict the rules togenerate w.r.t. the items that shall appear in them. It accepts a thirdoptional input file, in which item appearances can be given. For eachitem it can be stated whether it may appear in the body (antecedent)of a rule, in the head (consequent), or in both. A description of theformat of this additional input file, including examples, can be found<a href="#trans">here</a>. If no item appearances file is given, therule selection is not restricted. (I am grateful to the people atIntegral Solutions Ltd., who developed the well-known data mining tool<a href="http://www.spss.com/Clementine/">Clementine</a>, but are nowpart of <a href="http://www.spss.com">SPSS</a>, for requesting thepossibility to restrict item appearances.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="target">Other Target Types</a></h3><p>The target type, which can be selected via the option <tt>-t</tt>,is either association rules (option -tr, default), frequent item sets(option -ts), or association hyperedges (option -th).</p><!-- =============================================================== --><h4><a name="itemsets">Frequent Item Sets (option -ts)</a></h4><p>Sometimes one may not want to find association rules, but only thefrequent item sets underlying them. That is, one wants to find allitem sets with a support exceeding a certain threshold. My aprioriprogram supports this search, too: If the option -ts is given, onlythe frequent item sets are determined.</p><!-- =============================================================== --><h4><a name="hyperedges">Association Hyperedges (option -th)</a></h4><p>My apriori program can also find association hyperedges, i.e., setsof items that are strongly predictive w.r.t. each other. In this modeno rules are generated, only item sets are selected. The selectioncriterion is as follows: Given an item set with enough support (option<tt>-s</tt>), all rules are checked which can be formed using this setwith all items appearing in the rule. For example, for the item set{a b c}, the rules c &lt;- a b, b &lt;- a c, a &lt;- b c would beconsidered. The confidences of these rules are computed and averaged.If the resulting average confidence is greater than the minimalconfidence (option <tt>-c</tt>), the item set is selected. (I amgrateful to Bastien Duclaux for requesting the possibility to generateassociation hyperedges.)</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="options">Program Invocation and Options</a></h3><p>My apriori program is invoked as follows:</p><p><tt>apriori [options] infile outfile [appfile]</tt></p><p>The normal arguments are:</p><table border=0 cellpadding=0 cellspacing=0><tr><td>infile</td><td width=10></td>    <td>file to read transactions from</td></tr><tr><td>outfile</td><td></td>    <td>file to write association rules / hyperedges to</td></tr><tr><td>appfile</td><td></td>    <td>file stating item appearances (optional)</td></tr></table><p>The possible options are:</p><table border=0 cellpadding=0 cellspacing=0><tr><td><tt>-t#</tt></td><td width=10></td>    <td>target type        (s: item sets, r: rules (default), h: hyperedges)</td></tr><tr><td><tt>-m#</tt></td><td></td>    <td>minimal number of items per set/rule/hyperedge        (default: 1)</td></tr><tr><td><tt>-n#</tt></td><td></td>    <td>maximal number of items per set/rule/hyperedge        (default: 5)</td></tr><tr><td><tt>-s#</tt></td><td></td>    <td>minimal support    of a     set/rule/hyperedge        (default: 10%)</td></tr><tr><td><tt>-c#</tt></td><td></td>    <td>minimal confidence of a         rule/hyperedge        (default: 80%)</td></tr><tr><td><tt>-o</tt></td><td></td>    <td>use original definition of the support of a rule        (body & head)</td></tr><tr><td><tt>-x</tt></td><td></td>    <td>extended support output (print both rule support types)        </td></tr><tr><td><tt>-a</tt></td><td></td>    <td>print absolute support (number of transactions)</td></tr><tr><td><tt>-p</tt></td><td></td>    <td>print support/confidence with high precision</td></tr><tr><td><tt>-e#</tt></td><td></td>    <td>additional rule evaluation measure (default: none)</td></tr><tr><td><tt>-!</tt></td><td></td>    <td>print a list of additional rule evaluation measures</td></tr><tr><td><tt>-d#</tt></td><td></td>    <td>minimal value of additional evaluation measure        (default: 10%)</td></tr><tr><td><tt>-v</tt></td><td></td>    <td>print value of additional rule evaluation measure</td></tr><tr><td><tt>-g</tt></td><td></td>    <td>write output in scanable form        (quote certain characters)</td></tr><tr><td><tt>-l</tt></td><td></td>    <td>do not load transactions into memory        (work on input file)</td></tr><tr><td><tt>-q#</tt></td><td></td>    <td>sort items w.r.t. their frequency (1: ascending (default),        -1: descending, 0: do not sort)</td></tr><tr><td><tt>-z</tt></td><td></td>    <td>minimize memory usage (default: maximize speed)</td></tr><tr><td><tt>-i#</tt></td><td></td>    <td>ignore records starting with characters in the given        string</td></tr><tr><td><tt>-b/f/r#</tt></td><td></td>    <td>blank characters, field and record separators        (default: "<tt> \t\r</tt>", "<tt> \t</tt>", "<tt>\n</tt>")        </td></tr></table><p>(<tt>#</tt> always means a number, a letter, or a string that   specifies the parameter of the option.)</p><p>Note that the effect of the option <tt>-z</tt> can depend heavily   on how the items are sorted (option <tt>-q</tt>). Highest savings   in memory usually result if items are sorted with descending   frequency (<tt>-q-1</tt>). However, this often worsens the   processing time considerably.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><p><img src="line.gif" alt="" height=7 width=704></p><h3><a name="input">Input Format</a></h3><h4><a name="transin">Format of the Transactions File</a></h4><p>A text file structured by field and record separators and blanks.Record separators, not surprisingly, separate records, usually lines,field separators fields (or columns), usually words. Blanks are usedto fill fields (columns), e.g. to align them. In the transactionsfile each record must contain one transaction, i.e. a list of itemidentifiers, which are separated by field separators. An empty recordis interpreted as an empty transaction.</p><p>Examples can be found in the directory <tt>apriori/ex</tt> in thesource package. Refer to the file <tt>apriori/ex/readme</tt>, whichexplains how to process the different example files in the directory<tt>apriori/ex</tt> in the source package.</p><table width="100%" border=0 cellpadding=0 cellspacing=0><tr><td width="95%" align=right><a href="#top">back to the top</a></td>    <td width=5></td>    <td><a href="#top"><img src="uparrow.gif" border=0></a></td></tr></table><!-- =============================================================== --><h4><a name="appearin">Format of the Item Appearances File</a></h4><p>A text file structured by field and record separators and blanks.(Note: For this file the same field and record separators and blanksare used as for the transactions file.)</p><p>The first record, which must have one field, contains the default
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -