\end{figure}

The plot is a direct visualization of the binary incidence matrix where
the dark dots represent the ones in the matrix. From the plot we
see that the items in the data set are not evenly distributed. In fact,
the white area to the top right suggests that at the beginning of
2003 only very few items were available (fewer than 50) and that during
the year more items were added until the number reached around 300
items. We can also see that some transactions in the data set
contain a very high number of items (denser horizontal lines).
These transactions need further investigation since they could originate
from data collection problems (e.g., a web robot downloading many
documents from the publication site). To find the very long
transactions we can use \func{size} and select the
transactions containing more than 20 items.

\begin{Schunk}
\begin{Sinput}
> transactionInfo(Epub2003[size(Epub2003) > 20])
\end{Sinput}
\begin{Soutput}
    transactionID           TimeStamp
301  session_56e2 2003-04-29 12:30:38
580  session_6308 2003-08-17 17:16:12
896  session_72dc 2003-12-29 19:35:35
\end{Soutput}
\end{Schunk}

We found three long transactions and printed the corresponding
transaction information. Of course, \func{size} can be used in a similar
fashion to remove long or short transactions.

Transactions can be inspected using \func{inspect}. Since the long
transactions identified above would result in a very long printout, we
inspect only the first 5 transactions in the subset for 2003.

\begin{Schunk}
\begin{Sinput}
> inspect(Epub2003[1:5])
\end{Sinput}
\begin{Soutput}
  items                        transactionID TimeStamp
1 {doc_154}                    session_4795  2003-01-01 19:59:00
2 {doc_3d6}                    session_4797  2003-01-02 06:46:01
3 {doc_16f}                    session_479a  2003-01-02 09:50:38
4 {doc_f4, doc_11d, doc_1a7}   session_47b7  2003-01-02 17:55:50
5 {doc_83}                     session_47bb  2003-01-02 20:27:44
\end{Soutput}
\end{Schunk}

Most of these transactions contain one item. Only transaction 4 contains three items.
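
Returning to the remark that \func{size} can also be used to remove long
transactions, the following is a minimal sketch rather than part of the
original analysis. It assumes that \code{Epub2003} was built from
\code{Epub} by selecting the transactions whose time stamp falls in 2003
(the construction shown here is an assumption), and the name
\code{Epub2003short} is hypothetical, introduced only for illustration.

\begin{Schunk}
\begin{Sinput}
> library("arules")
> data("Epub")
> # assumed construction of the 2003 subset (not shown in this section)
> year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
> Epub2003 <- Epub[year == "2003"]
> # keep only transactions with at most 20 items (hypothetical object name)
> Epub2003short <- Epub2003[size(Epub2003) <= 20]
> # the maximum transaction size should now be at most 20
> summary(size(Epub2003short))
\end{Sinput}
\end{Schunk}

The same logical indexing with the comparison reversed would keep only
the long transactions, as in the call to \func{transactionInfo} above.
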
For further inspection, transactions can be converted into a list with:

\begin{Schunk}
\begin{Sinput}
> as(Epub2003[1:5], "list")
\end{Sinput}
\begin{Soutput}
$session_4795
[1] "doc_154"

$session_4797
[1] "doc_3d6"

$session_479a
[1] "doc_16f"

$session_47b7
[1] "doc_f4"  "doc_11d" "doc_1a7"

$session_47bb
[1] "doc_83"
\end{Soutput}
\end{Schunk}

Finally, transaction data in horizontal layout can be converted to
transaction ID lists in vertical layout using coercion.

\begin{Schunk}
\begin{Sinput}
> EpubTidLists <- as(Epub, "tidLists")
> EpubTidLists
\end{Sinput}
\begin{Soutput}
tidLists in sparse format with
 465 items/itemsets (rows) and
 3975 transactions (columns)
\end{Soutput}
\end{Schunk}

For performance reasons the transaction ID list
is also stored in a sparse matrix. To get a list, coercion to \class{list}
can be used.

\begin{Schunk}
\begin{Sinput}
> as(EpubTidLists[1:3], "list")
\end{Sinput}
\begin{Soutput}
$doc_154
 [1] "session_4795" "session_6082" "session_60dd" "session_67db"
 [5] "session_769c" "session_7ee3" "session_bd9d" "session_c591"
 [9] "session_ce9f" "session_cf4b" "session_e019"

$doc_3d6
 [1] "session_4797"    "session_4893"    "session_48f4"   
 [4] "session_4ca3"    "session_wu4450a" "session_52c6"   
 [7] "session_5712"    "session_58e3"    "session_5984"   
[10] "session_5b20"    "session_5c20"    "session_5dc0"   
[13] "session_5eac"    "session_wu4a129" "session_6599"   
[16] "session_673d"    "session_683e"    "session_wu4d25a"
[19] "session_6f2f"    "session_708a"    "session_7a0c"   
[22] "session_7de5"    "session_89db"    "session_9227"   
[25] "session_9941"    "session_a4d7"    "session_a8c0"   
[28] "session_c3c4"    "session_c546"    "session_ca44"   
[31] "session_d328"    "session_d5b4"   

$doc_16f
[1] "session_479a" "session_56e2" "session_630c" "session_72dc"
[5] "session_8b3e" "session_91ab" "session_a202" "session_a7b9"
\end{Soutput}
\end{Schunk}

In this representation each item has an entry
which is a vector of all transactions it occurs in.
\class{tidLists} can be directly used as input for mining algorithms
which use such a vertical database layout to mine
associations.

In the next example, we will see how a data set is created and
rules are mined.

\subsection{Example 2: Preparing and mining a questionnaire data set\label{sec:example-adult}}

As a second example, we prepare and mine questionnaire data. We use the
Adult data set from the UCI machine learning repository
\citep{arules:Blake+Merz:1998} provided by package~\pkg{arules}. This
data set is similar to the marketing data set used by
\cite{arules:Hastie+Tibshirani+Friedman:2001} in their chapter about
association rule mining. The data originates from the U.S. Census
Bureau database and contains 48842 instances with 14 attributes like
age, work class, education, etc. In the original applications of the
data, the attributes were used to predict the income level of
individuals. We added the attribute \code{income} with levels
\code{small} and \code{large}, representing an income of
$\le$~USD~50,000 and $>$~USD~50,000, respectively. This data is
included in \pkg{arules} as the data set \code{AdultUCI}.

\begin{Schunk}
\begin{Sinput}
> data("AdultUCI")
> dim(AdultUCI)
\end{Sinput}
\begin{Soutput}
[1] 48842    15
\end{Soutput}
\begin{Sinput}
> AdultUCI[1:2, ]
\end{Sinput}
\begin{Soutput}
  age        workclass fnlwgt education education-num     marital-status
1  39        State-gov  77516 Bachelors            13      Never-married
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
       occupation  relationship  race  sex capital-gain capital-loss
1    Adm-clerical Not-in-family White Male         2174            0
2 Exec-managerial       Husband White Male            0            0
  hours-per-week native-country income
1             40  United-States  small
2             13  United-States  small
\end{Soutput}
\end{Schunk}

\code{AdultUCI} contains a mixture of categorical and metric attributes and
needs some preparation before it can be transformed into
transaction data suitable for association mining.
First, we remove the two attributes \code{fnlwgt} and
\code{education-num}. The first attribute is a weight calculated
by the creators of the data set from control data provided by
the Population Division of the U.S.
Census Bureau. The second removed attribute is just a numeric representation of the
attribute \code{education}, which is also part of the data set.

\begin{Schunk}
\begin{Sinput}
> AdultUCI[["fnlwgt"]] <- NULL
> AdultUCI[["education-num"]] <- NULL
\end{Sinput}
\end{Schunk}

Next, we need to map the four remaining metric attributes (\code{age},
\code{hours-per-week}, \code{capital-gain} and \code{capital-loss}) to ordinal
attributes by building suitable categories.
We divide the attributes \code{age} and \code{hours-per-week}
into suitable categories using knowledge about typical age groups and
working hours. For the two capital-related attributes,
we create a category called \code{None} for cases which have no
gains/losses. Then we further divide the group with gains/losses
at its median into the two categories \code{Low} and \code{High}.
Note that the code below writes the labels for \code{capital-loss} in
lower case (\code{none}, \code{low}, \code{high}).

\begin{Schunk}
\begin{Sinput}
> AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 
+     25, 45, 65, 100)), labels = c("Young", "Middle-aged", 
+     "Senior", "Old"))
> AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], 
+     c(0, 25, 40, 60, 168)), labels = c("Part-time", "Full-time", 
+     "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]], 
+     c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 
+         0]), Inf)), labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]], 
+     c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 
+         0]), Inf)), labels = c("none", "low", "high"))
\end{Sinput}
\end{Schunk}

Now, the data can be automatically recoded as
a binary incidence matrix by coercing the data set to
\class{transactions}.

\begin{Schunk}
\begin{Sinput}
> Adult <- as(AdultUCI, "transactions")
> Adult
\end{Sinput}
\begin{Soutput}
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
\end{Soutput}
\end{Schunk}

The remaining 13 categorical attributes were
automatically recoded into 115
binary items.
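
To make the median split used for the two capital-related attributes more
concrete, the following sketch applies the same idiom to a small
made-up vector. The values in \code{x} and the names \code{x} and
\code{breaks} are purely illustrative assumptions and are not taken from
\code{AdultUCI}.

\begin{Schunk}
\begin{Sinput}
> # toy data (hypothetical): two cases without gains, three with gains
> x <- c(0, 0, 100, 200, 5000)
> # break points: no gain, up to the median of the positive values, above it
> breaks <- c(-Inf, 0, median(x[x > 0]), Inf)
> # same idiom as above: cut into intervals, then relabel as ordered factor
> ordered(cut(x, breaks), labels = c("None", "Low", "High"))
\end{Sinput}
\end{Schunk}

Here the median of the positive values is 200, so the intervals are
$(-\infty, 0]$, $(0, 200]$ and $(200, \infty)$, and the five values are
mapped to \code{None}, \code{None}, \code{Low}, \code{Low} and
\code{High}. Because \func{cut} uses right-closed intervals by default, a
value equal to the median falls into \code{Low}. Note also that this
idiom relies on every interval containing at least one value, since
otherwise the number of labels would not match the number of levels.
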
During encoding the item labels were generated in the
form of \texttt{<\emph{variable name}>=<\emph{category label}>}.
Note that for cases with missing values all items corresponding to
the attributes with the missing values were set to zero.

\begin{Schunk}
\begin{Sinput}
> summary(Adult)
\end{Sinput}
\begin{Soutput}
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939 

most frequent items:
           capital-loss=none            capital-gain=None 
                       46560                        44807 
native-country=United-States                   race=White 
                       43832                        41762 
           workclass=Private                      (Other) 
                       33906                       401333 

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13 
   19   971  2067 15623 30162 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   12.00   13.00   12.53   13.00   13.00 

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
\end{Soutput}
\end{Schunk}

The summary of the transaction data set gives a rough overview showing
the most frequent items, the length distribution of the transactions and