(which we might know), but also the support of all subsets of length four. The
missing support information has to be counted from the database. Finally, to
induce rules efficiently for a given set of itemsets, we also have to store
support values in a suitable data structure which allows fast look-ups for
calculating rule confidence.

Function~\func{ruleInduction} provided in \pkg{arules} uses a prefix tree to
induce rules for a given confidence from an arbitrary set of
itemsets $\set{X}$ in the following way:
\begin{enumerate}
    \item Count the support values for each itemset $X \in \set{X}$ and the
        subsets $\{X \setminus \{x\}: x \in X\}$ needed for rule generation
        in a single pass over the database
        and store them in a suitable data structure.
    \item Populate set $\set{R}$ by selectively generating only rules for the
        itemsets in $\set{X}$, using the support information
        from the data structure created in step 1.
\end{enumerate}
Efficient support counting is done as described in
Section~\ref{sec:counting} above. After counting, all necessary support
counts are contained in the prefix tree. The needed support values can then
be retrieved, and generating the rules is straightforward. The exact
procedure is described in~\cite{arules:Hahsler+Buchta+Hornik:2007}.
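As an illustration, the following sketch mines frequent itemsets with
\func{eclat} and then induces rules from them (the \code{Adult} data set
ships with \pkg{arules}; the thresholds are arbitrary choices and the chunk
is not evaluated as part of this paper):

<<ruleInduction-sketch, eval=FALSE>>=
data("Adult")
## mine only frequent itemsets, no rules yet
itemsets <- eclat(Adult, parameter = list(support = 0.1))
## count the missing subset supports in a single database pass (step 1)
## and selectively generate the rules (step 2)
rules <- ruleInduction(itemsets, Adult, confidence = 0.8)
summary(rules)
@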
\subsection{Sampling from transactions\label{sec:sample}}

Taking samples from large databases for mining is a powerful technique which
is especially useful if the original database does not fit into main memory,
but the sample does. However, even if the database fits into main memory,
sampling can provide an enormous speed-up for mining at the cost of only a
small degradation of accuracy.

\cite{arules:Mannila+Toivonen+Verkamo:1994} proposed sampling with
replacement for association rule mining and quantified the estimation error
due to sampling. Using Chernoff bounds on the binomial distribution (the
number of transactions which contain a given itemset in a sample), the
authors argue that, in theory, even relatively small samples should provide
good estimates for support.

\cite{arules:Zaki+Parthasarathy+Li+Ogihara:1997} built upon the theoretic
work by \cite{arules:Mannila+Toivonen+Verkamo:1994} and show that for an
itemset $X$ with support $\tau = \mathrm{supp}(X)$ and for an acceptable
relative error of support $\epsilon$ (an accuracy of $1 - \epsilon$) at a
given confidence level $1 - c$, the needed sample size~$n$ can be computed by
\begin{equation}
n = \frac{-2\ln(c)}{\tau\epsilon^2}.
\label{equ:samplesize}
\end{equation}
Depending on its support, a different sample size is appropriate for each
itemset. As a heuristic, the authors suggest using the user-specified
minimum support threshold for $\tau$. This means that for itemsets close to
minimum support, the given error and confidence level hold, while for more
frequent itemsets the error rate will be smaller. However, with this
heuristic the error rate for itemsets below minimum support can exceed
$\epsilon$ at the given confidence level, and thus some infrequent itemsets
might appear as frequent ones in the sample.

\cite{arules:Zaki+Parthasarathy+Li+Ogihara:1997} also evaluated sampling in
practice on several data sets and conclude that sampling not only speeds up
mining considerably, but that the actual errors are also considerably
smaller than those given by the Chernoff bounds; thus samples smaller than
the size obtained by Equation~\ref{equ:samplesize} are often sufficient.

Another way to obtain the required sample size for association rule mining
is progressive sampling~\citep{arules:Parthasarathy:2002}. This approach
starts with a small sample and uses progressively larger samples until model
accuracy does not improve significantly anymore.
\cite{arules:Parthasarathy:2002} defines a proxy for model accuracy
improvement by using a similarity measure between two sets of associations.
The idea is that, since larger samples produce more accurate results, the
similarity between the sets of associations of two consecutive samples is
low while accuracy improvements are high, and it increases as the accuracy
improvements diminish. Thus, increasing the sample size can be stopped when
the similarity between consecutive samples reaches a ``plateau.''

\cite{arules:Toivonen:1996} presents an application of sampling to reduce
the I/O overhead for very large databases which do not fit into main memory.
The idea is to mine frequent itemsets from a random sample of the database
at a support threshold below the set minimum support. The support of these
itemsets is then counted in the whole database and the infrequent itemsets
are discarded. If the support threshold used to mine the sample is picked
low enough, almost all frequent itemsets and their support will be found in
one pass over the large database.

In \pkg{arules}, sampling is implemented by \func{sample}, which provides
all capabilities of the standard sampling function in \proglang{R} (e.g.,
sampling with or without replacement and probability weights).
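For illustration, Equation~\ref{equ:samplesize} can be evaluated directly
and the resulting size passed to \func{sample}; in the following sketch the
minimum support, error and confidence level are arbitrary choices, and the
chunk is not evaluated here:

<<sample-size-sketch, eval=FALSE>>=
data("Adult")
supp <- 0.05     ## minimum support used as the heuristic for tau
epsilon <- 0.1   ## acceptable relative error of support
c <- 0.1         ## 1 - c is the confidence level
## needed sample size according to the equation above
n <- ceiling(-2 * log(c) / (supp * epsilon^2))
n
## draw the sample with replacement
AdultSample <- sample(Adult, n, replace = TRUE)
@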
\subsection{Generating synthetic transaction data}

Synthetic data can be used to evaluate and compare different mining
algorithms and to study the behavior of measures of interestingness.

In \pkg{arules}, the function \func{random.transactions} can be used to
create synthetic transaction data. Currently, two methods are available.
The first method reimplements the well-known generator for transaction data
for mining association rules developed
by~\cite{arules:Agrawal+Srikant:1994}. The second method implements a simple
probabilistic model where each transaction is the result of one independent
Bernoulli trial for each item
(see~\cite{arules:Hahsler+Hornik+Reutterer:2005}).

%% ------------------------------------------------------------------
%% ------------------------------------------------------------------
\subsection{Sub-, super-, maximal and closed itemsets}

For some calculations it is necessary to find all sub- or supersets of a
specific itemset in a set of itemsets. This functionality is implemented as
\func{is.subset} and \func{is.superset}. For example,
\code{is.subset(x, y, proper = TRUE)} finds all proper subsets of the
itemsets in \code{x} in the set~\code{y}. The result is a logical matrix
with \code{length(x)} rows and \code{length(y)} columns. Each logical row
vector represents which elements in \code{y} are subsets of the
corresponding element in \code{x}. If \code{y} is omitted, the sub- or
superset structure within the set \code{x} is returned.

Similar methods, \func{is.maximal} and \func{is.closed}, can be used to find
all \emph{maximal itemsets} or \emph{closed itemsets} in a set. An itemset
is maximal in a set if no proper superset of the itemset is contained in the
set~\citep{arules:Zaki+Parthasarathy+Ogihara+Li:1997}. An itemset is closed
if it is its own closure (i.e., no superset of the itemset has the same
support)~\citep{arules:Pasquier+Bastide+Taouil+Lakhal:1999}. Note that these
methods can be extremely slow and have high memory usage if the set contains
many itemsets.

\subsection{Additional measures of interestingness}

\pkg{arules} provides \func{interestMeasure}, which can be used to calculate
a broad variety of interest measures for itemsets and rules. To speed up the
calculation, we try to reuse the quality information available from the sets
of itemsets or rules (i.e., support, confidence, lift); only if necessary,
missing information is obtained from the transactions used to mine the
associations. For example, available measures for itemsets are:
\begin{itemize}
\item All-confidence~\citep{arules:Omiecinski:2003}
\item Cross-support ratio~\citep{arules:Xiong+Tan+Kumar:2003}
\item Support
\end{itemize}
For rules the following measures are implemented (a short sketch follows
the list):
\begin{itemize}
\item Chi-square measure~\citep{arules:Liu+Hsu+Ma:1999}
\item Conviction~\citep{arules:Brin+Motwani+Ullman+Tsur:1997}
\item Confidence
\item Difference of Confidence (DOC)~\citep{arules:Hofmann+Wilhelm:2001}
\item Hyper-lift and hyper-confidence~\citep{arules:Hahsler+Hornik:2007}
\item Leverage~\citep{arules:Piatetsky-Shapiro:1991}
\item Lift
\item Improvement~\citep{arules:Bayardo+Agrawal+Gunopulos:2000}
\item Support
\item Several measures from \cite{arules:Tan+Kumar+Srivastava2004}
  (e.g., cosine, Gini index, $\phi$-coefficient, odds ratio)
\end{itemize}
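The following sketch computes two of these measures for a set of mined rules
and attaches them to the rules' quality slot (the \code{Adult} data set and
all thresholds are arbitrary choices; the chunk is not evaluated here):

<<interest-sketch, eval=FALSE>>=
data("Adult")
rules <- apriori(Adult,
    parameter = list(support = 0.2, confidence = 0.8))
## support, confidence and lift are reused from the mining result;
## conviction and the odds ratio are counted from the transactions
quality(rules) <- cbind(quality(rules),
    interestMeasure(rules, c("conviction", "oddsRatio"), Adult))
inspect(sort(rules, by = "conviction")[1:3])
@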
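Similarly, the following sketch ties together \func{random.transactions}
and the itemset operations introduced above (the sizes and the support
threshold are again arbitrary choices):

<<synthetic-sketch, eval=FALSE>>=
## 1000 synthetic transactions over 50 items, where each item occurs
## according to an independent Bernoulli trial
trans <- random.transactions(nItems = 50, nTrans = 1000,
    method = "independent")
## mine frequent itemsets and flag the maximal ones
itemsets <- eclat(trans, parameter = list(support = 0.01))
table(is.maximal(itemsets))
## proper subset structure within the mined set (y is omitted)
m <- is.subset(itemsets, proper = TRUE)
@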
\subsection{Distance-based clustering of transactions and associations}

To allow for distance-based
clustering~\citep{arules:Gupta+Strehl+Ghosh:1999}, \pkg{arules} provides
\func{dissimilarity}, which can be used to calculate dissimilarities and
cross-dissimilarities between transactions or associations (i.e., itemsets
and rules). Currently, the following standard measures for binary data are
available: Jaccard coefficient, simple matching coefficient and Dice
coefficient. Additionally, dissimilarity between transactions can be
calculated based on affinities between
items~\citep{arules:Aggarwal+Procopiuc+Yu:2002}.

The result of \func{dissimilarity} is either a \code{dist} object, which can
be directly used by most clustering methods in \proglang{R} (e.g.,
\code{hclust} for hierarchical clustering), or an object of class
\code{ar\_cross\_dissimilarity}.

Since the number of transactions or associations is often too large to
efficiently calculate a dissimilarity matrix and apply a clustering
algorithm, \func{sample} can be used to cluster only a subset of
transactions (associations). To assign the remaining transactions
(associations) to clusters, \func{predict} implements the nearest-neighbor
approach for predicting memberships for new data. A small example can be
found in~\cite{arules:Hahsler+Hornik:2007b}.

\section{Examples\label{sec:examples}}

\subsection{Example 1: Analyzing and preparing a transaction data
set\label{sec:example-screen}}

In this example, we show how a data set can be analyzed and manipulated
before associations are mined. This is important for finding problems in the
data set which could make the mined associations useless, or at least
inferior to associations mined on a properly prepared data set.

For the example, we look at the \code{Epub} transaction data contained in
package \pkg{arules}. This data set contains downloads of documents from the
Electronic Publication platform of the Vienna University of Economics and
Business Administration, available via \url{http://epub.wu-wien.ac.at}, from
January 2003 to August 2005.

First, we load \pkg{arules} and the data set.

<<>>=
library("arules")
@
<<epub1>>=
data("Epub")
Epub
@

We see that the data set consists of \Sexpr{length(Epub)} transactions and
is represented as a sparse matrix with \Sexpr{length(Epub)} rows and
\Sexpr{dim(Epub)[2]} columns which represent the items. Next, we use
\func{summary} to get more information about the data set.

<<epub2>>=
summary(Epub)
@

\func{summary} displays the most frequent items in the data set, information
about the transaction length distribution, and that the data set contains
some extended transaction information. We see that the data set contains
transaction IDs and, in addition, time stamps (using class \class{POSIXct})
for the transactions. This additional information can be used for analyzing
the data set.

<<>>=
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
table(year)
@

For 2003, the first year in the data set, we have \Sexpr{table(year)[1]}
transactions. We can select the corresponding transactions and inspect the
structure using a level plot (see Figure~\ref{fig:imageEpub}).

%%% lattice image returns an object of class "trellis"
<<fig=FALSE>>=
Epub2003 <- Epub[year == "2003"]
length(Epub2003)
image(Epub2003)
@

\begin{figure}
\centering
<<echo=FALSE, fig=TRUE, include=FALSE, label=epub>>=
print(image(Epub2003))
@
\includegraphics[width=10cm]{arules-epub}
\caption{The Epub data set (year 2003).}
\label{fig:imageEpub}
\end{figure}

The plot is a direct visualization of the binary incidence matrix where the
dark dots represent the ones in the matrix. From the plot we see that the
items in the data set are not evenly distributed. In fact, the white area to
the top right suggests that at the beginning of 2003 only very few items
were available (less than 50) and that more items were added during the year
until around 300 items were reached. Also, we can see that there are some
transactions in the data set which contain a very high number of items
(denser horizontal lines). These transactions need further investigation
since they could originate from data collection problems (e.g., a web robot
downloading many documents from the publication site). To find the very long
