📄 arules.tex
字号:
\documentclass[10pt,a4paper]{article}\usepackage{a4wide}\setlength{\parskip}{0.5ex plus0.1ex minus0.1ex}\setlength{\parindent}{0em}\usepackage[round,longnamesfirst]{natbib}\usepackage{hyperref}%%% for tabulars\usepackage{rotating}\usepackage{multirow}\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}\newcommand{\class}[1]{\mbox{\textsf{#1}}}\newcommand{\func}[1]{\mbox{\texttt{#1()}}}\newcommand{\code}[1]{\mbox{\texttt{#1}}}\newcommand{\pkg}[1]{\strong{#1}}\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}\newcommand{\proglang}[1]{\textsf{#1}}\newcommand{\set}[1]{\mathcal{#1}}\usepackage{Sweave}%% \VignetteIndexEntry{Introduction to arules}\begin{document}%% ------------------------------------------------------------------%% ------------------------------------------------------------------\title{Introduction to \pkg{arules} -- A computational environment for mining association rules and frequent item sets}\author{Michael Hahsler and Bettina Gr{\"u}n and Kurt Hornik and Christian Buchta}\maketitle\sloppy%% ------------------------------------------------------------------%% ------------------------------------------------------------------\begin{abstract} Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The \proglang{R} package \pkg{arules} presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular \proglang{C} implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.\end{abstract}%% ------------------------------------------------------------------%% ------------------------------------------------------------------\clearpage\tableofcontents\clearpage%% ------------------------------------------------------------------%% ------------------------------------------------------------------\section{Introduction}Mining frequent itemsets and association rules is a popular and wellresearched method for discovering interesting relations betweenvariables in large databases. \cite{arules:Piatetsky-Shapiro:1991}describes analyzing and presenting strong rules discovered in databasesusing different measures of interestingness. Based on the concept of strongrules, \cite{arules:Agrawal+Imielinski+Swami:1993} introduced theproblem of mining association rules from transaction data as follows:Let $I=\{i_1, i_2,\ldots,i_n\}$ be a set of $n$ binary attributes called\emph{items}. Let $\set{D} = \{t_1, t_2, \ldots, t_m\}$ be a set oftransactions called the \emph{database}. Each transactionin~$\set{D}$ has an unique transaction ID andcontains a subset of the items in~$I$.A \emph{rule} is defined as an implication of the form $X \Rightarrow Y$where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The sets of items(for short \emph{itemsets}) $X$ and $Y$ are called \emph{antecedent}(left-hand-side or LHS) and \emph{consequent} (right-hand-side or RHS)of the rule.\begin{figure}[b]\centering \begin{tabular}{|c|l|} \hline {\bf transaction} ID & {\bf items}\\ \hline 1 & milk, bread \\ 2 & bread, butter \\ 3 & beer \\ 4 & milk, bread, butter \\ 5 & bread, butter \\ \hline\end{tabular}\caption{An example supermarket database with five transactions.\label{table:supermarket}}\end{figure}To illustrate the concepts, we use a small example from the supermarketdomain. The set of items is $I= \{\mathrm{milk, bread, butter, beer}\}$and a small database containing the items is shown inFigure~\ref{table:supermarket}. An example rule for the supermarketcould be $\{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\}$meaning that if milk and bread is bought, customers also buy butter.To select interesting rules from the set of all possible rules,constraints on various measures of significance and interest can beused. The best-known constraints are minimum thresholds on support andconfidence. The \emph{support}~$\mathrm{supp}(X)$ of an itemset~$X$ isdefined as the proportion of transactions in the data set which containthe itemset. In the example database in Figure~\ref{table:supermarket},the itemset $\{\mathrm{milk, bread}\}$ has a support of $2/5=0.4$ sinceit occurs in 40\% of all transactions (2 out of 5 transactions).Finding frequentitemsets can be seen as a simplification of the unsupervised learningproblem called ``mode finding'' or ``bumphunting''~\citep{arules:Hastie+Tibshirani+Friedman:2001}. For theseproblems each item is seen as a variable. The goal is to find prototypevalues so that the probability density evaluated at these values issufficiently large. However, for practical applications with a largenumber of variables, probability estimation will be unreliable andcomputationally too expensive. This is why in practice frequentitemsets are used instead of probability estimation.The \emph{confidence} of a rule is defined $\mathrm{conf}(X\RightarrowY) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$. For example, the rule$\{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\}$ has aconfidence of $0.2/0.4=0.5$ in the database inFigure~\ref{table:supermarket}, which means that for 50\% of thetransactions containing milk and bread the rule is correct.Confidence can be interpreted as an estimate of the probability$P(Y|X)$, the probability of finding the RHS of the rule in transactionsunder the condition that these transactions also contain the LHS\citep[see e.g.,][]{arules:Hipp+Guentzer+Nakhaeizadeh:2000}.Association rules are required to satisfy both a minimum support and aminimum confidence constraint at the same time. At medium to lowsupport values, often a great number of frequent itemsets are found in adatabase. However, since the definition of support enforces that allsubsets of a frequent itemset have to be also frequent, it is sufficientto only mine all \emph{maximal frequent itemsets}, defined as frequentitemsets which are not proper subsets of any other frequentitemset~\citep{arules:Zaki+Parthasarathy+Ogihara+Li:1997}. Anotherapproach to reduce the number of mined itemsets is to only mine\emph{frequent closed itemsets.} An itemset is closed if no propersuperset of the itemset is contained in each transaction in which theitemset is contained~\citep{arules:Pasquier+Bastide+Taouil+Lakhal:1999, arules:Zaki:2004}. Frequent closed itemsets are a superset of themaximal frequent itemsets. Their advantage over maximal frequentitemsets is that in addition to yielding all frequent itemsets, theyalso preserve the support information for all frequent itemsets whichcan be important for computing additional interest measures after themining process is finished (e.g., confidence for rules generated fromthe found itemsets, or\emph{all-confidence}~\citep{arules:Omiecinski:2003}).A practical solution to the problem of finding too many associationrules satisfying the support and confidence constraints is to furtherfilter or rank found rules using additional interest measures. Apopular measure for this purpose is\emph{lift}~\citep{arules:Brin+Motwani+Ullman+Tsur:1997}. The lift of arule is defined as $\mathrm{lift}(X \Rightarrow Y) = \mathrm{supp}(X\cup Y) / (\mathrm{supp}(X) \mathrm{supp}(Y))$, and can be interpretedas the deviation of the support of the whole rule from the supportexpected under independence given the supports of the LHS and the RHS.Greater lift values indicate stronger associations.In the last decade, research on algorithms to solve the frequent itemsetproblem has been abundant. \cite{arules:Goethals+Zaki:2004} compare thecurrently fastest algorithms. Among these algorithms are theimplementations of the Apriori and Eclat algorithms by\cite{arules:Borgelt:2003}interfaced inthe \pkg{arules} environment.The two algorithms use very different mining strategies. Apriori,developed by \cite{arules:Agrawal+Srikant:1994}, is a level-wise,breadth-first algorithm which counts transactions. In contrast, Eclat\citep{arules:Zaki+Parthasarathy+Ogihara+Li:1997} employs equivalenceclasses, depth-first search and set intersection instead of counting.The algorithms can be used to mine frequent itemsets, maximal frequentitemsets and closed frequent itemsets. The implementation of Apriorican additionally be used to generate association rules.This paper presents \pkg{arules}\footnote{The \pkg{arules} package can be obtained from \url{http://CRAN.R-Project.org} and the maintainer can be contacted at \url{michael@hahsler.net}.}, an extension package for\proglang{R}~\citep{arules:R:2005} which provides the infrastructureneeded to create and manipulate input data sets for the miningalgorithms and for analyzing the resulting itemsets and rules. Since itis common to work with large sets of rules and itemsets, the packageuses sparse matrix representations to minimize memory usage. Theinfrastructure provided by the package was also created to explicitlyfacilitate extensibility, both for interfacing new algorithms and foradding new types of interest measures and associations.The rest of the paper is organized as follows: In the next section, wegive an overview of the data structures implemented in thepackage~\pkg{arules}.%In Sections~\ref{sec:transactions} and \ref{sec:associations} In Section~\ref{sec:overview} we introduce the functionality of theclasses to handle transaction data and associations. InSection~\ref{sec:interfaces} we describe the way mining algorithms areinterfaced in \pkg{arules} using the available interfaces to Apriori andEclat as examples. In Section~\ref{sec:auxiliary} we present someauxiliary methods for support counting, rule induction and samplingavailable in \pkg{arules}. We provide several examples in%Sections~\ref{sec:example-screen} to \ref{sec:example-allconf}.Section~\ref{sec:examples}. The first two examples show typical\proglang{R} sessions for preparing, analyzing and manipulating atransaction data set, and for mining association rules. The thirdexample demonstrates how \pkg{arules} can be extended to integrate a newinterest measure. Finally, the fourth example shows how to use samplingin order to speed up the mining process. We conclude with a summary ofthe features and strengths of the package~\pkg{arules} as acomputational environment for mining association rules and frequentitemsets.A previous version of this manuscript was published in the \emph{Journal of Statistical Software} \citep{arules:Hahsler+Gruen+Hornik:2005b}.%% ------------------------------------------------------------------%% ------------------------------------------------------------------\section{Data structure overview\label{sec:overview}}To enable the user to represent and work with input and output data ofassociation rule mining algorithms in \proglang{R}, a well-designedstructure is necessary which can deal in an efficient way with largeamounts of sparse binary data. The \proglang{S4} class structureimplemented in the package \pkg{arules} is presented inFigure~\ref{fig:arules-classes}.\begin{figure}[tp]\centering\includegraphics[width=12cm]{arules-classes}\caption{UML class diagram \citep[see][]{misc:Fowler:2004} of the \pkg{arules}package.\label{fig:arules-classes}}\end{figure}For input data the classes \class{transactions} and \class{tidLists}(transaction ID lists, an alternative way to represent transaction data)are provided. The output of the mining algorithms comprises the classes\class{itemsets} and \class{rules} representing sets of itemsets orrules, respectively. Both classes directly extend a common virtualclass called \class{associations} which provides a common interface. Inthis structure it is easy to add a new type of associations by adding anew class that extends \class{associations}.Items in \class{associations} and \class{transactions} are implementedby the \class{itemMatrix} class which provides a facade for the sparsematrix implementation \class{ngCMatrix} from the \proglang{R} package\pkg{Matrix}~\citep{arules:Bates+Maechler:2005}.To control the behavior of the mining algorithms, the two classes\class{ASparameter} and \class{AScontrol} are used. Since eachalgorithm can use additional algorithm-specific parameters, weimplemented for each interfaced algorithm its own set of controlclasses. We used the prefix \samp{AP} for Apriori and \samp{EC} forEclat. In this way, it is easy to extend the control classes wheninterfacing a new algorithm.\subsection{Representing collections of itemsets\label{sec:setrepresentation}}From the definition of the association rule mining problem we see thattransaction databases and sets of associations have in common that they contain sets of items (itemsets) together with additional information.For example, a transaction in the database % $\set{D}$ contains a transaction ID and an itemset.A rule in a set of mined association rules % $\set{R}$contains two itemsets, one for the LHS and one for the RHS, and additional quality information, e.g., values for various interest measures.\begin{figure}[tp]\centering%\includegraphics[width=5cm]{itemsetMatrix}\renewcommand{\arraystretch}{1.1}\setlength{\tabcolsep}{2mm}\begin{tabular}{cl|cccc}%\begin{tabular}{cl|ccccc}%& & \multicolumn{5}{c}{items}\\& & \multicolumn{4}{c}{items}\\%%& & $i_1$ & $i_2$ & $i_3$ & . . . & $i_n$ \\& & $i_1$ & $i_2$ & $i_3$ & $i_4$ \\& & milk & bread & butter & beer \\%\hline\cline{2-6}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -