📄 arules.tex

📁 本程序是基于linux系统下c++代码
💻 TEX
📖 第 1 页 / 共 5 页
字号:
%\multirow{9}{1.5ex}{ \begin{sideways} itemsets \end{sideways} }%\multirow{8}{1.5ex}{ \begin{sideways} itemsets \end{sideways} }\multirow{4}{*}{\begin{sideways} itemsets \end{sideways}}%& $X_1$ 		& 0 & 1 & 0 & . . . & 1 \\%& $X_2$ 	  	& 0 & 1 & 0 & . . . & 1 \\%& $X_3$ 		& 0 & 1 & 0 & . . . & 0 \\%& $X_4$ 		& 0 & 0 & 1 & . . . & 0 \\%& \phantom{-}.&.&.&.&\hspace{-7mm}.&.\\%& \phantom{-}.&.&.&.&.&.\\%& \phantom{-}.&.&.&.&\hspace{7mm}.&.\\%& $X_{m-1}$ 		& 1 & 0 & 0 & . . . & 1 \\%& $X_m$ 		& 0 & 0 & 1 & . . . & 1 \\& $X_1$ 		& 1 & 1 & 0 &  0 \\& $X_2$ 	  	& 0 & 1 & 0 &  1 \\& $X_3$ 		& 1 & 1 & 1 &  0 \\& $X_4$ 		& 0 & 0 & 1 &  0 \\%& \phantom{-}.&.&.&.&.\\%& \phantom{-}.&.&.&.&.\\%& \phantom{-}.&.&.&.&.\\%& $X_{m-1}$ 		& 1 & 0 & 0 & 1 \\%& $X_m$ 		& 0 & 1 & 1 & 0 \\\end{tabular}\caption{Example of a collection of itemsets represented as a     binary incidence matrix.\label{fig:itemsetMatrix}}\end{figure}Collections of itemsets used for transaction databases and sets ofassociations can be represented as binary incidence matrices withcolumns corresponding to the items and rows corresponding to theitemsets.  The matrix entries represent the presence (1) or absence (0)of an item in a particular itemset.  An example of a binary incidencematrix containing itemsets for the example database inFigure~\ref{table:supermarket} on Page~\pageref{table:supermarket} isshown in Figure~\ref{fig:itemsetMatrix}.  Note that we need to storecollections of itemsets with possibly duplicated elements (identicalrows), i.e, itemsets containing exactly the same items. This isnecessary, since a transaction database can contain differenttransactions with the same items. Such a database is still a set oftransactions since each transaction also contains a uniquetransaction~ID.Since a typical frequent itemset or a typical transaction (e.g., asupermarket transaction) only contains a small number of items comparedto the total number of available items, the binary incidence matrix willin general be very sparse with many items and a very large number ofrows.  A natural representation for such data is a sparse matrix format.For our implementation we chose the \class{ngCMatrix} class defined inpackage~\pkg{Matrix}.  The \class{ngCMatrix} is a compressed, sparse, logical,column-oriented matrix which contains the indices of the \code{TRUE} rows andthe pointers to the initial indices of elements in each column of the matrix.Despite the column orientation of the \class{ngCMatrix}, it is more convenientto work with incidence matrices which are row-oriented.  This makes the mostimportant manipulation, selecting a subset of transactions from a data set formining, more comfortable and efficient.  Therefore, we implemented the class\class{itemMatrix} providing a row-oriented facade to the \class{ngCMatrix}which stores a transposed incidence matrix\footnote{Note that the developers ofpackage~\pkg{Matrix} contains some support for a sparse row-oriented format,but the available functionality currently is rather minimal.  Once all requiredfunctionality is implemented, \pkg{arules} will switch the internalrepresentation to sparse row-oriented logical matrices.}.  In sparse representation the followinginformation needs to be stored for the collection of itemsets inFigure~\ref{fig:itemsetMatrix}: A vector of indices of the non-zeroelements (row-wise starting with the first row) $1, 2, 2, 4, 1, 2, 3, 3$and the pointers $1, 3, 5, 8$ where each row starts in the index vector.The first two pointers indicate that the first row starts with elementone in the index vector and ends with element 2 (since with element 3already the next row starts).  The first two elements of the indexvector represent the items in the first row which are $i_1$ and $i_2$ ormilk and bread, respectively.  The two vectors are stored in the\class{ngCMatrix}. Note that indices for the \class{ngCMatrix} startwith zero rather than with one and thus actually the vectors $0, 1, 1,3, 0, 3, 3$ and $0, 3, 4, 7$ are stored.  However, the data structure ofthe \class{ngCMatrix} class is not intended to be directly accessed bythe end user of \pkg{arules}. The interfaces of \class{itemMatrix} canbe used without knowledge of how the internal representation of the dataworks.  However, if necessary, the \class{ngCMatrix} can be directlyaccessed by developers to add functionality to \pkg{arules} (e.g., todevelop new types of associations or interest measures or to efficientlycompute a distance matrix between itemsets for clustering).  In thiscase, the \class{ngCMatrix} should be accessed using the coercionmechanism from \class{itemMatrix} to \class{ngCMatrix} via \func{as}.In addition to the sparse matrix, \class{itemMatrix} stores item labels(e.g., names of the items) and handles the necessary mapping between theitem label and the corresponding column number in the incidence matrix.Optionally, \class{itemMatrix} can also store additional information onitems.  For example, the category hierarchy in a supermarket setting canbe stored which enables the analyst to select only transactions (or aswe later see also rules and itemsets) which contain items from a certaincategory (e.g., all dairy products).For \class{itemMatrix}, basic matrix operations including \func{dim} andsubset selection (\code{[}) are available.  The first element of\func{dim} and \code{[} corresponds to itemsets or transactions (rows),the second element to items (columns).  For example, on a transactiondata set in variable \code{x} the subset selection \samp{x[1:10, 16:20]}selects a matrix containing the first 10 transactions and items 16 to20.Since \class{itemMatrix} is used to represent sets or collections of itemsets additional functionality is provided.\func{length} can be used to get the number of itemsets in an\class{itemMatrix}.Technically, \func{length} returns the number of rows in the matrixwhich is equal to the first element returned by \func{dim}.%\pkg{arules} also provides set operations including \func{union},%\func{intersect} and \func{setequal}.Identical itemsets can be found with \func{duplicated}, and duplicationscan be removed with \func{unique}.  \func{match} can be used to findmatching elements in two collections of itemsets.With \func{c}, several \class{itemMatrix} objects can be combined bysuccessively appending the rows of the objects, i.e., creating acollection of itemsets which contains the itemsets from all\class{itemMatrix} objects.  This operation is only possible if the\class{itemMatrix} objects employed are ``compatible,'' i.e., if thematrices have the same number of columns and the items are in the sameorder.  If two objects contain the same items (item labels), but theorder in the matrix is different or one object is missing some items,\func{recode} can be used to make them compatible by reordering andinserting columns.To get the actual number of items in the itemsets stored in the\class{itemMatrix}, \func{size} is used.  It returns a vector with thenumber of items (ones) for each element in the set (row sum in thematrix).  Obtaining the sizes from the sparse representations is a veryefficient operation, since it can be calculated directly from the vectorof column pointers in the \class{ngCMatrix}.  For a purchase incidencematrix, \func{size} will produce a vector as long as the number oftransactions in the matrix with each element of the vector containingthe number of items in the corresponding transaction.  This informationcan be used, e.g., to select or filter unusually long or shorttransactions.\func{itemFrequency} calculates the frequency for each item in an\class{itemMatrix}. Conceptually, the item frequencies are the column sums ofthe binary matrix. Technically, column sums can be implemented for sparserepresentation efficiently by just tabulating the vector of row numbers of thenon-zero elements in the \class{ngCMatrix}.  Item frequencies can be used formany purposes.  For example, they are needed to compute interest measures.\func{itemFrequency} is also used by \func{itemFrequencyPlot} to produce a barplot of item count frequencies or support.  Such a plot gives a quick overviewof a set of itemsets and shows which are the most important items in terms ofoccurrence frequency.Coercion from and to \class{matrix} and \class{list} primitives isprovided where names and dimnames are used as item labels.  For thecoercion from \class{itemMatrix} to \class{list} there are twopossibilities.  The usual coercion via \func{as} results in a list ofvectors of character strings, each containing the item labels of theitems in the corresponding row of the \class{itemMatrix}.  The actualconversion is done by \func{LIST} with its default behavior (argument\code{decode} set to \code{TRUE}).  If in turn \func{LIST} is calledwith the argument \code{decode} set to \code{FALSE}, the result is alist of integer vectors with column numbers for items instead of theitem labels. For many computations it is useful to work with such a listand later use the item column numbers to go back to the original\class{itemMatrix} for, e.g., subsetting columns.  For subsequentlydecoding column numbers to item labels, \func{decode} is also available.Finally, \func{image} can be used to produce a level plot of an\class{itemMatrix} which is useful for quick visual inspection.  Fortransaction data sets (e.g., point-of-sale data) such a plot can be veryhelpful for checking whether the data set contains structural changes (e.g.,items were not offered or out-of-stock during part of the observationperiod) or to find abnormal transactions (e.g., transactions whichcontain almost all items may point to recording problems).  Spottingsuch problems in the data can be very helpful for data preparation.%% ------------------------------------------------------------------%% ------------------------------------------------------------------\subsection{Transaction data\label{sec:transactions}}The main application of association rules is for market basket analysiswhere large transaction data sets are mined.  In this setting eachtransaction contains the items which were purchased at one visit to aretail store \citep[see e.g.,][]{arules:Berry+Linoff:1997}.Transaction data are normally recorded by point-of-salescanners and often consists of tuples of the form:\begin{displaymath}<\emph{transaction ID}, \emph{item ID}, \ldots >\end{displaymath}All tuples with the same transaction ID form a single transaction whichcontains all the items given by the item IDs in the tuples.  Additionalinformation denoted by the ellipsis dots might be available.  Forexample, a customer ID might be provided via a loyalty program in asupermarket.  Further information on transactions (e.g., time,location), on the items (e.g., category, price), or on the customers(socio-demographic variables such as age, gender, etc.) might also beavailable.\begin{figure}[tp]\centering%\includegraphics[width=12cm]{transactionMatrix}\begin{minipage}[b]{.4\linewidth}\renewcommand{\arraystretch}{1.1}\setlength{\tabcolsep}{2mm}%\begin{center}\begin{tabular}{cl|cccc}& & \multicolumn{4}{c}{items}\\& & milk & bread & butter & beer \\%\hline\cline{2-6}\multirow{5}{*}{\begin{sideways} transactions \end{sideways}}& 1 		& 1 & 1 & 0 & 0 \\& 2 	  	& 0 & 1 & 1 & 0 \\& 3 		& 0 & 0 & 0 & 1 \\& 4 		& 1 & 1 & 1 & 0 \\& 5 		& 0 & 1 & 1 & 0 \\\end{tabular}\vspace{1cm}(a)\end{center}\end{minipage}%\hspace{.1\linewidth}%\begin{minipage}[b]{.4\linewidth}\renewcommand{\arraystretch}{1.1}\setlength{\tabcolsep}{2mm}%\begin{center}\begin{tabular}{cl|l}& & \multicolumn{1}{c}{transaction ID lists}\\%\hline\cline{2-3}\multirow{4}{*}{\begin{sideways} items \end{sideways}}& milk  		& 1, 4 \\& bread  		& 1, 2, 4, 5 \\& butter 		& 2, 4 \\& beer  		& 3 \\\end{tabular}\vspace{1.5cm}(b)\end{center}\end{minipage}\caption{Example of a set of transactions represented in     (a) horizontal layout and in     (b) vertical layout.\label{fig:transactionMatrix}}\end{figure}For mining, the transaction data is first transformed into a binarypurchase incidence matrix with columns corresponding to the differentitems and rows corresponding to transactions.  The matrix entriesrepresent the presence (1) or absence (0) of an item in a particulartransaction.  This format is often called the \emph{horizontal} databaselayout~\citep{arules:Zaki:2000}.  Alternatively, transaction data can berepresented in a \emph{vertical} database layout in the form of\emph{transaction ID lists}~\citep{arules:Zaki:2000}.  In this formatfor each item a list of IDs of the transactions the item is contained inis stored.  In Figure~\ref{fig:transactionMatrix} the example databasein Figure~\ref{table:supermarket} on Page~\pageref{table:supermarket} isdepicted in horizontal and vertical layouts.  Depending on thealgorithm, one of the layouts is used for mining.  In \pkg{arules} bothlayouts are implemented as the classes~\class{transactions} and\class{tidLists}.  Similar to \class{transactions},class~\class{tidLists} also uses a sparse representation to store itslists efficiently.  Objects of classes~\class{transactions} and\class{tidLists} can be directly converted into each other by coercion.The class \class{transactions} directly extends \class{itemMatrix} andinherits its basic functionality (e.g., subset selection, gettingitemset sizes, plotting item frequencies).  In addition,\class{transactions} has a slot to store further information for eachtransaction in form of a \class{data.frame}.  The slot can holdarbitrary named vectors with length equal to the number of storedtransactions.  In \pkg{arules} the slot is currently used to storetransaction IDs, however, it can also be used to store user IDs, revenueor profit, or other information on each transaction.  With thisinformation subsets of transactions (e.g., only transactions of acertain user or exceeding a specified profit level) can be selected.Objects of class \class{transactions} can be easily created by coercionfrom \class{matrix} or \class{list}.  If names or dimnames are available
💿 文件大小 2073 K
👤 上传用户 epower
📂 所属分类 Linux/Unix编程
🏷️ 相关标签

#linux #程序 #代码
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -