📄 apriorit.html

📁 apriori algorithm using datasets implementation
💻 HTML
📖 第 1 页 / 共 2 页
字号:
12 下一页
<HTML><HEAD><TITLE>Apriori-T Association Rule Mining (ARM) Software</TITLE></HEAD><BODY BGCOLOR="white"><CENTER><TABLE BORDER = 0 cellpadding = 10 WIDTH="100%"><TR><TD BGCOLOR = BB0000><CENTER><TABLE BORDER = 0 cellpadding = 10 WIDTH="90%"><TR><TD BGCOLOR = 006699><FONT COLOR = "white"><CENTER><H1>THE LUCS-KDD APRIORI-T ASSOCIATION RULE MINING ALGORITHM</H1><HR WIDTH=<50%"><BR></CENTER><TABLE ALIGN=right BGCOLOR="white" BORDER=1 CELLPADDING=5><TR><TD><img src="../../../GifFolder/logo115.gif"	alt="Liverpool University"></TD></TABLE><P><B><I>Frans Coenen<I></B></P><P><B>Department of Computer Science</B></P><P><B>The University of Liverpool</B></P><P>13 February 2004</P></FONT></TD></TABLE></CENTER></TD></TABLE></CENTER><BR><h2>CONTENTS</h2><table BORDER=0 CELLPADDING=0 WIDTH=100%><tr><td WIDTH="48%"><dl><dt>1. <a HREF = "#introduction">Introduction</a>.</dt><dt>2. <a HREF = "#downloading">Downloading the software</a>.</dt><dl><dt>2.1. <a HREF = "#compiling">Compiling</A>.</dt><dt>2.2. <a HREF = "#documentation">Documentation</A>.</dt></DL></dl></td><td><pre> </pre></td><td><dl><dt>3. <a HREF = "#running">Running the software</a>.</dt><dt>4. <a HREF = "#applications">More detail on application classes</A>.</dt><dt>5. <a HREF = "#t-tree">The T-tree data structure</A>.</dt><dt>6. <a HREF =" #aprioriT">The Apriori-T algorithm</A>.</dt><dt>7. <a HREF =" #conclusions">Conclusions</a>.</dt></dl></td></table><br><hr><br><a NAME =" introduction"><table BORDER-=0 WIDTH="100%" CELLPADDING=5 BGCOLOR="006699"><tr><td><h2><font color =" white">1. INTRODUCTION</font></h2></td></table><BR><P>Apriori-T (Apriori Total) is an Association Rule Mining (ARM) algorithm,developed by the <A HREF = "http://www.csc.liv.ac.uk/~frans/KDD/">LUCS-KDD research team</A> which makes use of a"reverse" set enumeration tree where each level of the tree is defined in termsof an array (i.e. the T-tree data structure is a form of <I>Trie</I>).The Apriori-T algorithm was actually developed as part of a moresophisticated ARM algorithm --- Apriori-TFP (Apriori Total From Partial). Both algorithms are described in Coenen and Leng (2004).</P><br><hr><br><a NAME ="downloading"><table BORDER-=0 WIDTH="100%" CELLPADDING=5 BGCOLOR="006699"><tr><td><h2><font color =" white">2. DOWNLOADING THE SOFTWARE</font></h2></td></table><BR><P>The Apriori-T ARM software comprises four source files. These are providedfrom this WWW page together with three application classes. The source files areas follows:</P><OL><LI><A HREF = "AssocRuleMining.java"><TT>AssocRuleMining.java</TT></A>:Set of general ARM utility methods to allow: (i) data input and inputerror checking, (ii) data preprocessing, (iii) manipulation of records (e.g.operations such as subset, member, union etc.) and (iv) data and parameter output.<LI><A HREF = "RuleList.java"><TT>RuleList.java</TT></A>: Set of methods thatallow the creationand manipulation (e.g. ordering, etc.) of a list of ARs.<LI><A HREF = "TotalSupportTree.java"><TT>TotalSupportTree.java</TT></A>:Methods to implement the "Apriori-T"algorithm using the "Total support" tree data structure (T-tree).<LI><A HREF = "TtreeNode.java"><TT>TtreeNode.java</TT></A>: Methods concerned with the structure of Ttree nodes. Arrays ofthese structures are used to store nodes at the same level in any sub-branch ofthe T-tree. Note this is a separate class to the other T-tree classes which arearranged in a class hierarchy as illustrated below.</OL><CENTER><TABLE BORDER=1><TR><TD><TABLE BORDER=1 CELLPADDING=10><TR><TD><CENTER><PRE>             AssocRuleMining                    |     +--------------+---------------+     |                              |TotalSupportTree                    RuleList</PRE></CENTER></TD></TABLE></CENTER></TD></TABLE></CENTER><BR><P>The application classes are as follows:</P><OL><LI><A HREF = "AprioriTapp.java"><TT>AprioriTapp.java</TT></A>: FundamentalApriori-T application.<LI><A HREF = "AprioriTsortedApp.java"><TT>AprioriTsortedApp.java</TT></A>:Apriori-T application with input data preprocessed so that it is ordered according to frequency of single items --- this serves to reduce the computationtime.<LI><A HREF = "AprioriTsortedPrunedApp.java"><TT>AprioriTsortedPrunedApp.java</TT></A>:Apriori-T application with data ordered according to frequency of single itemsand columns representing unsupported 1-itemsets removed --- again this servesto enhance computational efficiency.</OL><P>There is also a "tar ball" <A HREF = "aprioriT.tgz">aprioriT.tgz</A> that can be downloaded that includes all of the above source and applicationclass files. It can be unpacked using <TT>tar -zxf aprioriT.tgz</TT>.</P><BR><a NAME = "compiling"><H3>2.1. Compiling</H3><p>The software has been implemented in Java using the Java2 SDK (SoftwareDevelopment Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:</p><pre>javac *.java</pre><BR><a NAME = "documentation"><H3>2.2. Documentation</H3><P>The code can be documented using <I>Java Doc</I>. First create a directory<TT>Documentation</TT> in which to place the resulting HTML pages and then type:</P><PRE>javadoc -d Documentation/ *.java</PRE><P>This will produce a hierarchy of WWW pages contained in the <TT>Document</TT>directory.</P><br><hr><br><a NAME ="running"><table BORDER-=0 WIDTH="100%" CELLPADDING=5 BGCOLOR="006699"><tr><td><h2><font color =" white">3. RUNNING THE SOFTWARE</font></h2></td></table><BR><p>When compiled the software can be invoked in the normal manner using the Java interpreter:</p><pre>java APPLICATION_CLASS_FILE_NAME</pre><p>If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:</p><pre>java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME</pre><P>The input to the software, in all cases is a (space separated)<I>binary valued</I> data set <TT>R</TT>. The set <TT>R</TT>comprises a set of <TT>N</TT> records such that each record (<TT>r</TT>),in turn, comprises a set of <I>attributes</I>. Thus:</P><PRE>R  = {r | r = subset A}</PRE><P>Where <TT>A</TT> is the set of available attributes. The value <TT>D</TT> isthen defined as:<PRE>D = |A|</PRE><P>We then say that a particular data set has <TT>D</TT> <I>columns</I> and<TT>N</TT> <I>rows</I>. A small example data sets might be as follows:</P><PRE>1 2 3 61 4 5 71 3 4 61 2 61 2 3 4 5 7</PRE><P>where, in this case, <TT>A = {1, 2, 3, 4, 5, 6, 7}</TT>. Note that attribute numbers are ordered sequentially commencing with thenumber 1 (the value 0 has a special meaning). The inputfile is included using a <TT>-F</TT> flag.</P><P>The program assumes support and confidence default threshold values of 20% and80% respectively. However the user may enter their own thresholds usingthe <TT>-S</TT> and <TT>-C</TT> flags. Thresholds are expressed as percentages.</P><P>Some example invocations, using a discretized/ normalised Pima Indians data set (also <A HREF = "pimaIndians.D42.N768.C2.num">available</A> from thissite) and each of the three application classes provided by this WWW site, aregiven below:</P><PRE>java AprioriTapp -FpimaIndians.D42.N768.C2.numjava AprioriTsortedApp -FpimaIndians.D42.N768.C2.num -S2.5java AprioriTsortedPrunedApp -FpimaIndians.D42.N768.C2.num -S1 -C50</PRE><P>(note that the ordering of flags is not significant). The output from eachapplication is a set of ARs (ordered according to confidence) plus somediagnosticinformation (run time, number of rules generated etc.).</P><br><hr><br><a NAME ="applications"><table BORDER-=0 WIDTH="100%" CELLPADDING=5 BGCOLOR="006699"><tr><td><h2><font color =" white">4. MORE DETAIL ON APPLICATION CLASSES</font></h2></td></table><BR><P>Apriori-T Applications classes all have the following basic form:</P><CENTER><TABLE BORDER=1><TR><TD><TABLE BORDER=1 CELLPADDING=10><TR><TD><PRE>public class &lt CLASS_NAME &gt  {    /** Main method */    public static void main(String[] args) throws IOException {	// Create instance of class TotalSupportTree	TotalSupportTree newAprioriT = new TotalSupportTree(args);		// Read data to be mined from file	newAprioriT.inputDataSet();			// If desired either: (1) keep data set as it is (do no	// preprocessing), (2) reorder the data sets according to	// frequency of single attributes:	newAprioriT.idInputDataOrdering();	newAprioriT.recastInputData();	// or (3) reorder and prune the input data	newAprioriT.idInputDataOrdering();	newAprioriT.recastInputDataAndPruneUnsupportedAtts();	newAprioriT.setNumOneItemSets();		// Mine data and produce T-tree		double time1 = (double) System.currentTimeMillis();	newAprioriT.createTotalSupportTree();	newAprioriT.outputDuration(time1,(double) System.currentTimeMillis());		// Generate ARS	newAprioriT.generateARs();		// Output as desired using the many output methods that are available	// (see below for further details)		// End	System.exit(0);	}</PRE></TD></TABLE></TD></TABLE></CENTER><BR><P>Some output is always generated such as: (1) the input parameters and startsettings and (2) "mile stones" during processing.Additional output statements can be included in application classes.The available additional output options are as follows:</P><OL><LI><TT>outputDataArray()</TT>: Outputs the input data.<LI><TT>outputDataArraySize()</TT>: Outputs the size of the input data (numberof records and columns, number of elements and overall density).<LI><TT>outputDuration(double time1, double time2)</TT>: The elapsed time,in seconds between time1 and time2. Typically used to measure processing time:<PRE>double time1 = (double) System.currentTimeMillis();// Program statements&LT INSTANCE_NAME &gt .outputDuration(time1,				(double) System.currentTimeMillis());</PRE><LI><TT>outputTtree()</TT>: Lists the T-tree nodes.<LI><TT>outputNumFreqSets()</TT>: The number of identified frequent sets.<LI><TT>outputNumUpdates()</TT>: The number of updates used to build theT-tree (a measure of the amount of "work done").<LI><TT>outputStorage()</TT>: The storage requirements, in Bytes, for theT-tree.      .<LI><TT>outputFrequentSets()</TT>: Lists the identified frequent sets and theirassociated support values.<LI><TT>outputNumRules()</TT>: The number of discovered ARs.<LI><TT>outputRules()</TT>: Lists the ARs and theirassociated confidence values.<LI><TT>outputRulesWithReconversion()</TT>: Lists the ARs but using reconversion(only appropriate where attributes have been reordered).</OL><P>Note that the first nine of the above output methods are <I>instancemethods</I> contained in either the <TT>TotalSupportTree</TT> class or the<TT>AssocRuleMining</TT> class the latter are therefore inherited by the<TT>TotalSupportTree</TT> class). The last three of the above are <I>instancemethods</I> of the <TT>RuleList</TT> class and thus must be called using aninstance of that class. An instance of the  <TT>RuleList</TT> class is createdduring processing and may be accessed using the <TT>getCurrentRuleListObject()</TT>public method found in the <TT>TotalSupportTree</TT> class. Thus, for example,the <TT>outputRules()</TT> method would be invoked as follows:</P><PRE>&LT INSTANCE_NAME &gt .getCurrentRuleListObject().outputRules();</PRE><br><hr><br><a NAME ="t-tree"><table BORDER-=0 WIDTH="100%" CELLPADDING=5 BGCOLOR="006699"><tr><td><h2><font color =" white">5. THE T-TREE DATA STRUCTURE</font></h2></td></table><BR><P>A T-tree is a set enumeration tree structure in which to storefrequent item set information. What distinguishes the T-tree from other setenumeration tree structures is:</P><OL><LI>Levels in each sub-branch of the tree are defined using arrays. This thuspermits "indexing in" at all levels and consequently offers computationaladvantages.<LI>To aid this indexing the tree is built in "reverse". Each branch isfounded on the last element of the frequent sets to be stored. This allowsdirect indexing with attribute number rather than first applying some offset.</OL><P>Thus given a data set of the form:</P><PRE>{1 3 4}{2 4 5}{2 4 6}</PRE><P>and assuming a support count of less than <TT>1</TT>, we can identify thefollowing frequent sets (support counts in parenthesise):</P><CENTER><TABLE BORDER=0 CELLPADDING=5><TR><TD><PRE>1 (1)2 (2)3 (1)4 (3)5 (1)6 (1)</PRE></TD><TD><PRE> </PRE></TD><TD><PRE>1 3 (1)1 4 (1)
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -