📄 arules.tex

📁 本程序是基于linux系统下c++代码
💻 TEX
📖 第 1 页 / 共 5 页
字号:
in these data structures, they are used as item labels or transactionIDs, accordingly.  To import data from a file, the\func{read.transactions} function is provided.  This function readsfiles structured as shown in Figure~\ref{fig:transactionMatrix} and alsothe very common format with one line per transaction and the itemsseparated by a predefined character.  Finally, \func{inspect} can beused to inspect transactions (e.g., ``interesting'' transactionsobtained with subset selection).Another important application of mining association rules has beenproposed by \cite{arules:Piatetsky-Shapiro:1991} and\cite{arules:Srikant+Agrawal:1996} for discovering interestingrelationships between the values of categorical and quantitative(metric) attributes.  For mining associations rules, non-binaryattributes have to be mapped to binary attributes. The straightforwardmapping method is to transform the metric attributes into $k$ ordinalattributes by building categories (e.g., an attribute income might betransformed into a ordinal attribute with the three categories: ``low'',``medium'' and ``high'').  Then, in a second step, each categoricalattribute with $k$ categories is represented by $k$ binary dummyattributes which correspond to the items used for mining.  An exampleapplication using questionnaire data can be found in\cite{arules:Hastie+Tibshirani+Friedman:2001} in the chapterabout association rule mining.The typical representation for data with categorical and quantitativeattributes in \proglang{R} is a \class{data.frame}.  First, a domainexpert has to create useful categories for all metric attributes.  Thistask is supported in \proglang{R} by functions such as \func{cut}.  Thesecond step, the generation of binary dummy items, is automated inpackage \pkg{arules} by coercing from \class{data.frame} to\class{transactions}.  In this process, the original attribute names andcategories are preserved as additional item information and can be usedto select itemsets or rules which contain items referring to a certainoriginal attributes.By default it is assumed that missing values do not carry information and thusall of the corresponding dummy items are set to zero.  If the factthat the value of a specific attribute is missing provides information (e.g., a respondent in an interview refuses to answer a specific question), the domain expert can create for the attribute a category for missing values which then will be included in the transactions as its own dummy item.The resulting \class{transactions} object can bemined and analyzed the same way as market basket data, see the examplein Section~\ref{sec:example-screen}.%% ------------------------------------------------------------------%% ------------------------------------------------------------------\subsection{Associations: itemsets and sets of rules\label{sec:associations}}The result of mining transaction data in \pkg{arules} are\class{associations.}  Conceptually, associations are sets of objectsdescribing the relationship between some items (e.g., as an itemset or arule) which have assigned values for different measures of quality.Such measures can be measures of significance (e.g., support), ormeasures of interestingness (e.g., confidence, lift), or other measures (e.g.,revenue covered by the association).All types of association have a common functionality in \pkg{arules} comprisingthe following methods:\begin{itemize} \item \func{summary} to give a short overview of the set and  \func{inspect} to display individual associations, \item \func{length} for getting the number of elements in the set, \item \func{items} for getting for each association a set of items  involved in the association (e.g., the union of the items in the LHS  and the RHS for each rule), \item sorting the set using the values of different quality measures  (\func{SORT}), \item subset extraction (\code{[} and \func{subset}), \item set operations (\func{union}, \func{intersect} and  \func{setequal}), and \item matching elements from two sets (\func{match}),% \item handling of duplicated elements for a collection%     of associations (\func{duplicated} and \func{unique}).\item \func{WRITE} for writing associations to disk in human readable form.  To save and load associations in compact form,   use \func{save} and \func{load} from the  \pkg{base} package. Package \pkg{pmml}~\citep{arules:Williams:2008}   includes functionality to  save associations in PMML (Predictive Model Markup Language).\end{itemize}The associations currently implemented in package~\pkg{arules} are setsof itemsets (e.g., used for frequent itemsets of their closed or maximalsubset) and sets of rules (e.g., association rules).  Both classes,\class{itemsets} and \class{rules}, directly extend the virtual class\class{associations} and provide the functionality described above.Class \class{itemsets} contains one \class{itemMatrix} object to storethe items as a binary matrix where each row in the matrix represents anitemset. In addition, it may contain transaction ID lists as an objectof class \class{tidLists}.  Note that when representing transactions,\class{tidLists} store for each item a transaction list, but here storefor each itemset a list of transaction IDs in which the itemset appears.Such lists are currently only returned by \func{eclat}.Class \class{rules} consists of two \class{itemMatrix} objectsrepresenting the left-hand-side (LHS) and the right-hand-side (RHS) ofthe rules, respectively.The items in the associations and the quality measures can be accessedand manipulated in a safe way using accessor and replace methods for\code{items}, \code{lhs}, \code{rhs}, and \code{quality}.  In additionthe association classes have built-in validity checking which ensuresthat all elements have compatible dimensions.It is simple to add new quality measures to existing associations.Since the \code{quality} slot holds a \class{data.frame}, additionalcolumns with new quality measures can be added. These new measures canthen be used to sort or select associations using \func{SORT} or\func{subset}.  Adding a new type of associations to \pkg{arules} isstraightforward as well. To do so, a developer has to create a new classextending the virtual \class{associations} class and implement thecommon functionality described above.%% ------------------------------------------------------------------%% ------------------------------------------------------------------\section{Mining algorithm interfaces\label{sec:interfaces}}In package \pkg{arules} we interface free reference implementations ofApriori and Eclat by Christian Borgelt\citep{arules:Borgelt+Kruse:2002,arules:Borgelt:2003}\footnote{ChristianBorgelt provides the source code for Apriori and Eclat at\url{http://www.borgelt.net/fpm.html}.}.  The code iscalled directly from \proglang{R} by the functions \func{apriori} and\func{eclat} and the data objects are directly passed from \proglang{R} tothe \proglang{C} code and back without writing to external files.The implementations can mine frequent itemsets,and closed and maximal frequent itemsets. In addition, \func{apriori} can also mine association rules.  The data given to the \func{apriori} and \func{eclat} functions have tobe \class{transactions} or something which can be coerced to\class{transactions} (e.g., \class{matrix} or \class{list}).  Thealgorithm parameters are divided into two groups represented by thearguments \code{parameter} and \code{control}.  The mining parameters(\code{parameter}) change the characteristics of the mined itemsets orrules (e.g., the minimum support) and the control parameters(\code{control}) influence the performance of the algorithm (e.g.,enable or disable initial sorting of the items with respect to theirfrequency).  These arguments have to be instances of the classes\class{APparameter} and \class{APcontrol} for the function\func{apriori} or \class{ECparameter} and \class{ECcontrol} for thefunction \func{eclat}, respectively.  Alternatively, data which can becoerced to these classes (e.g., \code{NULL} which will give the defaultvalues or a named list with names equal to slot names to change thedefault values) can be passed.  In these classes, each slot specifies adifferent parameter and the values.  The default values are equal to thedefaults of the stand-alone \proglang{C} programs\citep{arules:Borgelt:1996} except that the standard definition of thesupport of a rule \citep{arules:Agrawal+Imielinski+Swami:1993} isemployed for the specified minimum support required (Borgelt defines thesupport of a rule as the support of its antecedent).For \func{apriori} the appearance feature implemented by ChristianBorgelt can also be used.  With argument \code{appearance} of function\func{apriori} one can specify which items have to or must not appear initemsets or rules.  For more information on this feature we refer to theApriori manual~\citep{arules:Borgelt:1996}.The output of the functions \func{apriori} and \func{eclat} is an objectof a class extending \class{associations} which contains the sets of minedassociations and can be further analyzed using the functionality provided forthese classes.There exist many different algorithms which which use anincidence matrix or transaction ID list representation as input andsolve the frequent andclosed frequent itemset problems.  Each algorithm has specific strengthswhich can be important for very large databases.  Such algorithms, e.g.kDCI, LCM, FP-Growth or Patricia, are discussedin~\cite{arules:Goethals+Zaki:2003}.  The source code of most algorithmsis available on the internet and, if a special algorithm is needed, interfacing thealgorithms for \pkg{arules} is straightforward.The necessary steps are:\begin{enumerate} \item Adding interface code to the algorithm, preferably by directly  calling into the native implementation language (rather than using  files for communication), and an \proglang{R} function calling this  interface. \item Implementing extensions for \class{ASparameter} and  \class{AScontrol}.\end{enumerate}\section{Auxiliary functions \label{sec:auxiliary}}In \pkg{arules} several helpful functions are implemented forsupport counting, rule induction, sampling, etc.In the following we will discuss some of these functions.\subsection{Counting support for itemsets\label{sec:counting}}Normally, itemset support is counted during mining the database with agiven minimum support constraint.  During this process all frequentitemsets plus some infrequent candidate itemsets are counted (or supportis determined by other means).  Especially for databases with many itemsand for low minimum support values, this procedure can be extremely timeconsuming since in the worst case, the number of frequent itemsets growsexponentially in the number of items.If only the support information for a single or a few itemsets is needed,we might not want to mine the database for all frequent itemsets. We also do not know in advance how high (or low) to set the minimum support tostill get the support information for the itemsets in question.For this problem, \pkg{arules} contains \func{support}which determines the support for a set of given sets of items(as an \class{itemMatrix}).For counting, we use a prefix tree~\citep{arules:Knuth:1997} to organize thecounters. The used prefix tree is similar to the itemset tree describedby~\cite{arules:Borgelt+Kruse:2002}. However, we do not generate the treelevel-wise, but we first generate a prefix tree which only contains the nodesnecessary to hold the counters for all itemsets which need to be counted.Using the nodes in this tree only, we count the itemsets for eachtransaction recursively. After counting, the support for each itemset iscontained in the node with the prefix equal to the itemset. The exactprocedure is described in~\cite{arules:Hahsler+Buchta+Hornik:2007}.In addition to determining the support of a few itemsets without mining allfrequent itemsets, \func{support} is also useful for finding the support of infrequent itemsets with a support so low that miningis infeasible due to combinatorial explosion.\subsection{Rule induction\label{sec:induction}}For convenience we introduce $\set{X} = \{X_1, X_2, \ldots, X_l\}$ forsets of itemsets with length $l$.  Analogously, we write $\set{R}$ forsets of rules.  A part of the association rule mining problem is thegeneration (or induction) of a set of rules~$\set{R}$ from a set offrequent itemsets~$\set{X}$.  The implementation of the Apriorialgorithm used in \pkg{arules} already contains a rule induction engineand by default returns the set of association rules of the form $X\Rightarrow Y$ which satisfy given minimum support and minimumconfidence.  Following the definition of\cite{arules:Agrawal+Imielinski+Swami:1993} $Y$ is restricted to singleitems.In some cases it is necessary to separate mining itemsets and generatingrules from itemsets. For example, only rules stemming from a subset ofall frequent itemsets might be of interest to the user.  The Aprioriimplementation efficiently generates rules by reusing the datastructures built during mining the frequent itemsets.  However, ifApriori is used to return only itemsets or Eclat or some other algorithmis used to mine itemsets, the data structure needed for rule inductionis no longer available for computing rule confidence.If rules need to be induced from an arbitrary set of itemsets, supportvalues required to calculate confidence are typically missing.  Forexample, if all available information is an itemset containing fiveitems and we want to induce rules, we need the support of the itemset(which we might know), but also the support of all subsets of lengthfour. The missing support information has to be counted from thedatabase. Finally, to induce rules efficiently for a given set ofitemsets, we also have to store support values in a suitable datastructure which allows fast look-ups for calculating rule confidence.Function~\func{ruleInduction} provided in \pkg{arules} uses a prefix tree to induce rules for a givenconfidence from an arbitrary set of itemsets $\set{X}$ in the followingway:\begin{enumerate}    \item Count  the support values for each itemset $X \in \set{X}$ and the        subsets $\{X \setminus \{x\}: x \in X\}$ needed for rule generation        in a single pass over the database        and store them in a suitable data structure.    \item Populate set $\set{R}$ by selectively generating only rules for the        itemsets in $\set{X}$ using the support information        from the data structure created in step 1.\end{enumerate}Efficient support counting is done as described in Section~\ref{sec:counting}above. After counting, all necessary support counts are contained in the prefixtree. We can retrieve the needed support values and generating the rules isstraight forward. The exact procedure is describedin~\cite{arules:Hahsler+Buchta+Hornik:2007}.\subsection{Sampling from transactions\label{sec:sample}}Taking samples from large databases for mining is a powerful technique
💿 文件大小 2073 K
👤 上传用户 epower
📂 所属分类 Linux/Unix编程
🏷️ 相关标签

#linux #程序 #代码
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -