which is especially useful if the original database does not fit into
main memory, but the sample does. However, even if the database fits
into main memory, sampling can provide an enormous speed-up for mining
at the cost of only a small degradation of accuracy.

\cite{arules:Mannila+Toivonen+Verkamo:1994} proposed sampling with
replacement for association rule mining and quantify the estimation
error due to sampling. Using Chernoff bounds on the binomial
distribution (the number of transactions which contain a given itemset
in a sample), the authors argue that in theory even relatively small
samples should provide good estimates for support.

\cite{arules:Zaki+Parthasarathy+Li+Ogihara:1997} built upon the
theoretical work by \cite{arules:Mannila+Toivonen+Verkamo:1994} and show
that for an itemset $X$ with support $\tau = \mathrm{supp}(X)$ and for
an acceptable relative error of support $\epsilon$ (an accuracy of
$1 - \epsilon$) at a given confidence level $1-c$, the needed sample
size~$n$ can be computed by
\begin{equation}
n = \frac{-2\ln(c)}{\tau\epsilon^2}.
\label{equ:samplesize}
\end{equation}

Depending on its support, for each itemset a different sample size is
appropriate. As a heuristic, the authors suggest using the
user-specified minimum support threshold for $\tau$. This means that
for itemsets close to minimum support, the given error and confidence
level hold, while for more frequent itemsets the error rate will be
lower. However, with this heuristic the error rate for itemsets below
minimum support can exceed $\epsilon$ at the given confidence level,
and thus some infrequent itemsets might appear as frequent ones in the
sample.
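To illustrate Equation~\ref{equ:samplesize}, the required sample size
can be computed directly in \proglang{R}. The values used below
(minimum support $\tau = 0.01$ as the heuristic proxy for
$\mathrm{supp}(X)$, relative error $\epsilon = 0.1$ and $c = 0.1$,
i.e., a 90\% confidence level) are purely illustrative and not taken
from the studies cited above.
\begin{Schunk}
\begin{Sinput}
> ## required sample size according to the equation above
> tau <- 0.01      # minimum support used as proxy for supp(X)
> epsilon <- 0.1   # acceptable relative error of support
> c <- 0.1         # 1 - c = 90% confidence level
> n <- -2 * log(c) / (tau * epsilon^2)
> n
\end{Sinput}
\begin{Soutput}
[1] 46051.7
\end{Soutput}
\end{Schunk}
Under these assumptions, a sample of roughly 46,000 transactions would
suffice, independently of the size of the original database.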
\cite{arules:Zaki+Parthasarathy+Li+Ogihara:1997} also evaluated
sampling in practice on several data sets and conclude that sampling
not only speeds up mining considerably, but that the actual errors are
also considerably smaller than those given by the Chernoff bounds;
thus samples smaller than the size obtained by
Equation~\ref{equ:samplesize} are often sufficient.

Another way to obtain the required sample size for association rule
mining is progressive sampling~\citep{arules:Parthasarathy:2002}. This
approach starts with a small sample and uses progressively larger
samples until model accuracy does not improve significantly anymore.
\cite{arules:Parthasarathy:2002} defines a proxy for model accuracy
improvement by using a similarity measure between two sets of
associations. The idea is that, since larger samples produce more
accurate results, the similarity between the sets of associations of
two consecutive samples is low if the accuracy improvement is high, and
increases as the accuracy improvement decreases. Thus, increasing the
sample size can be stopped once the similarity between consecutive
samples reaches a ``plateau.''

\cite{arules:Toivonen:1996} presents an application of sampling to
reduce the I/O overhead for very large databases which do not fit into
main memory. The idea is to use a random sample from the database to
mine frequent itemsets at a support threshold below the set minimum
support. The support of these itemsets is then counted in the whole
database and the infrequent itemsets are discarded. If the support
threshold used to mine the sample is picked low enough, almost all
frequent itemsets and their support will be found in one pass over the
large database.

In \pkg{arules} sampling is implemented by \func{sample} which provides
all capabilities of the standard sampling function in \proglang{R}
(e.g., sampling with or without replacement and probability weights).

\subsection{Generating synthetic transaction data}

Synthetic data can be used to evaluate and compare different mining
algorithms and to study the behavior of measures of interestingness.
In \pkg{arules} the function \func{random.transactions} can be used to
create synthetic transaction data. Currently there are two methods
available. The first method reimplements the well-known generator for
transaction data for mining association rules developed
by~\cite{arules:Agrawal+Srikant:1994}. The second method implements a
simple probabilistic model where each transaction is the result of one
independent Bernoulli trial for each item
(see \cite{arules:Hahsler+Hornik+Reutterer:2005}).

%% ------------------------------------------------------------------
%% ------------------------------------------------------------------
\subsection{Sub-, super-, maximal and closed itemsets}

For some calculations it is necessary to find all sub- or supersets of
a specific itemset in a set of itemsets. This functionality is
implemented as \func{is.subset} and \func{is.superset}. For example,
\code{is.subset(x, y, proper = TRUE)} finds all proper subsets of the
itemsets in \code{x} in the set~\code{y}. The result is a logical
matrix with \code{length(x)} rows and \code{length(y)} columns. Each
logical row vector represents which elements in \code{y} are subsets of
the corresponding element in \code{x}. If \code{y} is omitted, the sub-
or superset structure within the set \code{x} is returned.

Similar methods, \func{is.maximal} and \func{is.closed}, can be used to
find all \emph{maximal itemsets} or \emph{closed itemsets} in a set. An
itemset is maximal in a set if no proper superset of the itemset is
contained in the set~\citep{arules:Zaki+Parthasarathy+Ogihara+Li:1997}.
An itemset is closed if it is its own closure (i.e., no superset of the
itemset exists with the same
support)~\citep{arules:Pasquier+Bastide+Taouil+Lakhal:1999}. Note that
these methods can be extremely slow and have high memory usage if the
set contains many itemsets.
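The following sketch illustrates these methods on the \code{Adult} data
set included in \pkg{arules}. A small set of frequent itemsets is mined
with \func{eclat}; the support threshold is chosen arbitrarily and only
serves to keep the example small.
\begin{Schunk}
\begin{Sinput}
> data("Adult")
> ## mine a small set of frequent itemsets for the illustration
> itemsets <- eclat(Adult, parameter = list(support = 0.4))
> ## matrix of proper subset relationships within the mined set
> sub <- is.subset(itemsets, proper = TRUE)
> dim(sub)
> ## which of the mined itemsets are maximal?
> table(is.maximal(itemsets))
\end{Sinput}
\end{Schunk}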
\subsection{Additional measures of interestingness}

\pkg{arules} provides \func{interestMeasure} which can be used to
calculate a broad variety of interest measures for itemsets and rules.
To speed up the calculation, we try to reuse the quality information
available from the sets of itemsets or rules (i.e., support,
confidence, lift) and, only if necessary, missing information is
obtained from the transactions used to mine the associations. For
example, available measures for itemsets are:
\begin{itemize}
\item All-confidence~\citep{arules:Omiecinski:2003}
\item Cross-support ratio~\citep{arules:Xiong+Tan+Kumar:2003}
\item Support
\end{itemize}
For rules the following measures are implemented:
\begin{itemize}
\item Chi-square measure~\citep{arules:Liu+Hsu+Ma:1999}
\item Conviction~\citep{arules:Brin+Motwani+Ullman+Tsur:1997}
\item Confidence
\item Difference of Confidence (DOC)~\citep{arules:Hofmann+Wilhelm:2001}
\item Hyper-lift and hyper-confidence~\citep{arules:Hahsler+Hornik:2007}
\item Leverage~\citep{arules:Piatetsky-Shapiro:1991}
\item Lift
\item Improvement~\citep{arules:Bayardo+Agrawal+Gunopulos:2000}
\item Support
\item Several measures from \cite{arules:Tan+Kumar+Srivastava2004}
  (e.g., cosine, Gini index, $\phi$-coefficient, odds ratio)
\end{itemize}

\subsection{Distance-based clustering of transactions and associations}

To allow for distance-based
clustering~\citep{arules:Gupta+Strehl+Ghosh:1999}, \pkg{arules}
provides \func{dissimilarity} which can be used to calculate
dissimilarities and cross-dissimilarities between transactions or
associations (i.e., itemsets and rules). Currently, the following
standard measures for binary data are available: Jaccard coefficient,
simple matching coefficient and Dice coefficient. Additionally,
dissimilarity between transactions can be calculated based on
affinities between items~\citep{arules:Aggarwal+Procopiuc+Yu:2002}.

The result of \func{dissimilarity} is either a \code{dist} object,
which can be directly used by most clustering methods in \proglang{R}
(e.g., \code{hclust} for hierarchical clustering), or an object of
class \code{ar\_cross\_dissimilarity}.

Since the number of transactions or associations is often too large to
efficiently calculate a dissimilarity matrix and apply a clustering
algorithm, \func{sample} can be used to cluster only a subset of
transactions (associations). To assign the remaining transactions
(associations) to clusters, \func{predict} implements the nearest
neighbor approach for predicting memberships for new data. A small
example can be found in~\cite{arules:Hahsler+Hornik:2007b}.
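As a minimal sketch of this workflow (using the \code{Adult} data set
included in \pkg{arules}; sample size, dissimilarity measure, linkage
and number of clusters are chosen arbitrarily), a sample of
transactions is clustered hierarchically and the resulting labels are
then extended to the complete data set. The call to \func{predict}
assumes that the cluster labels of the sample are supplied via its
\code{labels} argument.
\begin{Schunk}
\begin{Sinput}
> data("Adult")
> ## cluster a random sample of 500 transactions
> s <- sample(Adult, 500)
> d_jaccard <- dissimilarity(s, method = "jaccard")
> hc <- hclust(d_jaccard, method = "average")
> clusters <- cutree(hc, k = 4)
> ## assign the remaining transactions by nearest neighbor
> allClusters <- predict(s, Adult, labels = clusters)
> table(allClusters)
\end{Sinput}
\end{Schunk}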
\section{Examples\label{sec:examples}}

\subsection{Example 1: Analyzing and preparing a transaction data
set \label{sec:example-screen}}

In this example, we show how a data set can be analyzed and manipulated
before associations are mined. This is important for finding problems
in the data set which could make the mined associations useless or at
least inferior to associations mined on a properly prepared data set.
For the example, we look at the \code{Epub} transaction data contained
in package \pkg{arules}. This data set contains downloads of documents
from the Electronic Publication platform of the Vienna University of
Economics and Business Administration available via
\url{http://epub.wu-wien.ac.at} from January 2003 to August 2005.

First, we load \pkg{arules} and the data set.
\begin{Schunk}
\begin{Sinput}
> library("arules")
\end{Sinput}
\end{Schunk}
\begin{Schunk}
\begin{Sinput}
> data("Epub")
> Epub
\end{Sinput}
\begin{Soutput}
transactions in sparse format with
 3975 transactions (rows) and
 465 items (columns)
\end{Soutput}
\end{Schunk}
We see that the data set consists of 3975 transactions and is
represented as a sparse matrix with 3975 rows and 465 columns which
represent the items.

Next, we use the \func{summary} to get more information about the data
set.
\begin{Schunk}
\begin{Sinput}
> summary(Epub)
\end{Sinput}
\begin{Soutput}
transactions as itemMatrix in sparse format with
 3975 rows (elements/itemsets/transactions) and
 465 columns (items) and a density of 0.003791438 

most frequent items:
doc_11d doc_4c6  doc_71 doc_2cd doc_364 (Other) 
    212     130     120     113     105    6328 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
2852  573  245  116   53   29   23   17   12    9   10    4    4    3    2 
  16   17   18   19   20   22   24   25   28   34   38   74   79 
   3    2    3    5    2    1    1    1    1    1    1    1    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.763   2.000  79.000 

includes extended item information - examples:
   labels
1 doc_154
2 doc_3d6
3 doc_16f

includes extended transaction information - examples:
  transactionID           TimeStamp
1  session_4795 2003-01-01 19:59:00
2  session_4797 2003-01-02 06:46:01
3  session_479a 2003-01-02 09:50:38
\end{Soutput}
\end{Schunk}
\func{summary} displays the most frequent items in the data set,
information about the transaction length distribution and that the data
set contains some extended transaction information. We see that the
data set contains transaction IDs and in addition time stamps (using
class \class{POSIXct}) for the transactions. This additional
information can be used for analyzing the data set.
\begin{Schunk}
\begin{Sinput}
> year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]),
+     "%Y")
> table(year)
\end{Sinput}
\begin{Soutput}
year
2003 2004 2005 
 988 1375 1612 
\end{Soutput}
\end{Schunk}
For 2003, the first year in the data set, we have 988 transactions. We
can select the corresponding transactions and inspect the structure
using a level-plot (see Figure~\ref{fig:imageEpub}).
%%% lattice image returns an object of class "trellis"
\begin{Schunk}
\begin{Sinput}
> Epub2003 <- Epub[year == "2003"]
> length(Epub2003)
\end{Sinput}
\begin{Soutput}
[1] 988
\end{Soutput}
\begin{Sinput}
> image(Epub2003)
\end{Sinput}
\end{Schunk}
\begin{figure}
\centering
\includegraphics[width=10cm]{arules-epub}
\caption{The Epub data set (year 2003).}
\label{fig:imageEpub}
\end{figure}