\end{figure}

The plot is a direct visualization of the binary incidence matrix where
the dark dots represent the ones in the matrix.  From the plot we
see that the items in the data set are not evenly distributed.  In fact,
the white area to the top right side suggests that in the beginning of
2003 only very few items were available (fewer than 50) and then during
the year more items were added until it reached a number of around 300
items.  Also, we can see that there are some transactions in the data set
which contain a very high number of items (denser horizontal lines).
These transactions need further investigation since they could originate
from data collection problems (e.g., a web robot downloading many
documents from the publication site).  To find the very long
transactions we can use \func{size} and select very long
transactions (containing more than 20 items).

\begin{Schunk}
\begin{Sinput}
> transactionInfo(Epub2003[size(Epub2003) > 20])
\end{Sinput}
\begin{Soutput}
    transactionID           TimeStamp
301  session_56e2 2003-04-29 12:30:38
580  session_6308 2003-08-17 17:16:12
896  session_72dc 2003-12-29 19:35:35
\end{Soutput}
\end{Schunk}

We found three long transactions and printed the corresponding
transaction information.  Of course, \func{size} can be used in a similar
fashion to remove long or short transactions.

Transactions can be inspected using \func{inspect}.
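
As a minimal sketch of removing long transactions with \func{size} (the
object name \code{Epub2003short} is hypothetical and the cutoff of 20 items
simply mirrors the selection above), the same logical indexing can be
inverted to keep only transactions with at most 20 items:

\begin{Schunk}
\begin{Sinput}
> Epub2003short <- Epub2003[size(Epub2003) <= 20]
> max(size(Epub2003short))
\end{Sinput}
\end{Schunk}

The second command is only a check that no transaction in the filtered
subset exceeds the cutoff.
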
Since the long transactions identified above would result in
a very long printout, we will inspect the first 5 transactions in the
subset for 2003.

\begin{Schunk}
\begin{Sinput}
> inspect(Epub2003[1:5])
\end{Sinput}
\begin{Soutput}
  items     transactionID           TimeStamp
1 {doc_154}  session_4795 2003-01-01 19:59:00
2 {doc_3d6}  session_4797 2003-01-02 06:46:01
3 {doc_16f}  session_479a 2003-01-02 09:50:38
4 {doc_f4,
   doc_11d,
   doc_1a7}  session_47b7 2003-01-02 17:55:50
5 {doc_83}   session_47bb 2003-01-02 20:27:44
\end{Soutput}
\end{Schunk}

Most transactions contain one item.  Only transaction 4 contains three
items.  For further inspection transactions can be converted into a list
with:

\begin{Schunk}
\begin{Sinput}
> as(Epub2003[1:5], "list")
\end{Sinput}
\begin{Soutput}
$session_4795
[1] "doc_154"

$session_4797
[1] "doc_3d6"

$session_479a
[1] "doc_16f"

$session_47b7
[1] "doc_f4"  "doc_11d" "doc_1a7"

$session_47bb
[1] "doc_83"
\end{Soutput}
\end{Schunk}

Finally, transaction data in horizontal layout can be converted to
transaction ID lists in vertical layout using coercion.

\begin{Schunk}
\begin{Sinput}
> EpubTidLists <- as(Epub, "tidLists")
> EpubTidLists
\end{Sinput}
\begin{Soutput}
tidLists in sparse format with
 465 items/itemsets (rows) and
 3975 transactions (columns)
\end{Soutput}
\end{Schunk}

For performance reasons the transaction ID list
is also stored in a sparse matrix.
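
The relation between the two layouts can be sketched in matrix notation
(the symbols $B$, $m$, $n$ and $\mathrm{tidlist}$ are introduced here only
for illustration).  Let $B = (b_{ij}) \in \{0,1\}^{m \times n}$ be the binary
incidence matrix of \code{Epub} with $m = 3975$ transactions as rows and
$n = 465$ items as columns.  The vertical layout then corresponds to the
transpose
\[
  B^{\top} \in \{0,1\}^{n \times m}, \qquad
  \mathrm{tidlist}(j) = \{\, i : b_{ij} = 1 \,\}, \quad j = 1, \dots, n,
\]
so that row $j$ of $B^{\top}$ marks the transactions which contain item $j$.
This matches the 465 rows and 3975 columns reported for
\code{EpubTidLists}.
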
To get a list, coercion to \class{list}
can be used.

\begin{Schunk}
\begin{Sinput}
> as(EpubTidLists[1:3], "list")
\end{Sinput}
\begin{Soutput}
$doc_154
 [1] "session_4795" "session_6082" "session_60dd" "session_67db"
 [5] "session_769c" "session_7ee3" "session_bd9d" "session_c591"
 [9] "session_ce9f" "session_cf4b" "session_e019"

$doc_3d6
 [1] "session_4797"    "session_4893"    "session_48f4"
 [4] "session_4ca3"    "session_wu4450a" "session_52c6"
 [7] "session_5712"    "session_58e3"    "session_5984"
[10] "session_5b20"    "session_5c20"    "session_5dc0"
[13] "session_5eac"    "session_wu4a129" "session_6599"
[16] "session_673d"    "session_683e"    "session_wu4d25a"
[19] "session_6f2f"    "session_708a"    "session_7a0c"
[22] "session_7de5"    "session_89db"    "session_9227"
[25] "session_9941"    "session_a4d7"    "session_a8c0"
[28] "session_c3c4"    "session_c546"    "session_ca44"
[31] "session_d328"    "session_d5b4"

$doc_16f
[1] "session_479a" "session_56e2" "session_630c" "session_72dc"
[5] "session_8b3e" "session_91ab" "session_a202" "session_a7b9"
\end{Soutput}
\end{Schunk}

In this representation each item has an entry
which is a vector of all transactions it occurs in.
\class{tidLists} can be directly used as input for mining algorithms
which use such a vertical database layout to mine associations.

In the next example, we will see how a data set is created and
rules are mined.

\subsection{Example 2: Preparing and mining a questionnaire data set\label{sec:example-adult}}

As a second example, we prepare and mine questionnaire data.  We use the
Adult data set from the UCI machine learning repository
\citep{arules:Blake+Merz:1998} provided by package~\pkg{arules}.  This
data set is similar to the marketing data set used by
\cite{arules:Hastie+Tibshirani+Friedman:2001} in their chapter about
association rule mining.  The data originates from the U.S. census
bureau database and contains 48842 instances with 14 attributes like
age, work class, education, etc.
In the original applications of the
data, the attributes were used to predict the income level of
individuals.  We added the attribute \code{income} with levels
\code{small} and \code{large}, representing an income of
$\le$~USD~50,000 and $>$~USD~50,000, respectively.  This data is
included in \pkg{arules} as the data set \code{AdultUCI}.

\begin{Schunk}
\begin{Sinput}
> data("AdultUCI")
> dim(AdultUCI)
\end{Sinput}
\begin{Soutput}
[1] 48842    15
\end{Soutput}
\begin{Sinput}
> AdultUCI[1:2, ]
\end{Sinput}
\begin{Soutput}
  age        workclass fnlwgt education education-num     marital-status
1  39        State-gov  77516 Bachelors            13      Never-married
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
       occupation  relationship  race  sex capital-gain capital-loss
1    Adm-clerical Not-in-family White Male         2174            0
2 Exec-managerial       Husband White Male            0            0
  hours-per-week native-country income
1             40  United-States  small
2             13  United-States  small
\end{Soutput}
\end{Schunk}

\code{AdultUCI} contains a mixture of categorical and metric attributes and
needs some preparation before it can be transformed into
transaction data suitable for association mining.
First, we remove the two attributes \code{fnlwgt} and
\code{education-num}.  The first attribute is a weight calculated
by the creators of the data set from control data provided by
the Population Division of the U.S. census bureau.
The second removed attribute is just a numeric representation of the
attribute \code{education} which is also part of the data set.

\begin{Schunk}
\begin{Sinput}
> AdultUCI[["fnlwgt"]] <- NULL
> AdultUCI[["education-num"]] <- NULL
\end{Sinput}
\end{Schunk}

Next, we need to map the four remaining metric attributes (\code{age},
\code{hours-per-week}, \code{capital-gain} and \code{capital-loss}) to ordinal
attributes by building suitable categories.
We divide the attributes \code{age} and \code{hours-per-week}
into suitable categories using knowledge about typical age groups and
working hours.  For the two capital related attributes,
we create a category called \code{None} for cases which have no
gains/losses.  Then we further divide the group with gains/losses
at their median into the two categories \code{Low} and \code{High}
(for \code{capital-loss} the labels are written in lower case:
\code{none}, \code{low} and \code{high}).

\begin{Schunk}
\begin{Sinput}
> AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 
+     25, 45, 65, 100)), labels = c("Young", "Middle-aged", 
+     "Senior", "Old"))
> AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], 
+     c(0, 25, 40, 60, 168)), labels = c("Part-time", "Full-time", 
+     "Over-time", "Workaholic"))
> AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]], 
+     c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 
+         0]), Inf)), labels = c("None", "Low", "High"))
> AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]], 
+     c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 
+         0]), Inf)), labels = c("none", "low", "high"))
\end{Sinput}
\end{Schunk}

Now, the data can be automatically recoded as
a binary incidence matrix by coercing the data set to
\class{transactions}.

\begin{Schunk}
\begin{Sinput}
> Adult <- as(AdultUCI, "transactions")
> Adult
\end{Sinput}
\begin{Soutput}
transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)
\end{Soutput}
\end{Schunk}

The remaining 13 categorical attributes were
automatically
recoded into 115
binary items.  During encoding the item labels were generated in the
form of \texttt{<\emph{variable name}>=<\emph{category label}>}.
Note that for cases with missing values all items corresponding to the
attributes with the missing values were set to zero.

\begin{Schunk}
\begin{Sinput}
> summary(Adult)
\end{Sinput}
\begin{Soutput}
transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939 

most frequent items:
           capital-loss=none            capital-gain=None 
                       46560                        44807 
native-country=United-States                   race=White 
                       43832                        41762 
           workclass=Private                      (Other) 
                       33906                       401333 

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13 
   19   971  2067 15623 30162 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   12.00   13.00   12.53   13.00   13.00 

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3
\end{Soutput}
\end{Schunk}

The summary of the transaction data set gives a rough overview showing
the most frequent items, the length distribution of the transactions and