% arules.rnw
transactions we can use \func{size} and select very long
transactions (containing more than 20 items).
<<>>=
transactionInfo(Epub2003[size(Epub2003) > 20])
@
We found three long transactions and printed the corresponding
transaction information. Of course, size can be used in a similar
fashion to remove long or short transactions.

Transactions can be inspected using \func{inspect}. Since the long
transactions identified above would result in a very long printout,
we will inspect the first 5 transactions in the subset for 2003.
<<>>=
inspect(Epub2003[1:5])
@
Most transactions contain one item. Only transaction 4 contains three
items. For further inspection transactions can be converted into a
list with:
<<>>=
as(Epub2003[1:5], "list")
@
Finally, transaction data in horizontal layout can be converted to
transaction ID lists in vertical layout using coercion.
<<>>=
EpubTidLists <- as(Epub, "tidLists")
EpubTidLists
@
For performance reasons the transaction ID list is also stored in a
sparse matrix. To get a list, coercion to \class{list} can be used.
<<>>=
as(EpubTidLists[1:3], "list")
@
In this representation each item has an entry which is a vector of all
transactions it occurs in. \class{tidLists} can be directly used as
input for mining algorithms which use such a vertical database layout
to mine associations.

In the next example, we will see how a data set is created and rules
are mined.

\subsection{Example 2: Preparing and mining a questionnaire data
set\label{sec:example-adult}}

As a second example, we prepare and mine questionnaire data.  We use the
Adult data set from the UCI machine learning repository
\citep{arules:Blake+Merz:1998} provided by package~\pkg{arules}.  This
data set is similar to the marketing data set used by
\cite{arules:Hastie+Tibshirani+Friedman:2001} in their chapter about
association rule mining.  The data originates from the U.S. census
bureau database and contains 48842 instances with 14 attributes like
age, work class, education, etc.
In the original applications of the data, the attributes were used to
predict the income level of individuals.  We added the attribute
\code{income} with levels \code{small} and \code{large}, representing
an income of $\le$~USD~50,000 and $>$~USD~50,000, respectively.  This
data is included in \pkg{arules} as the data set \code{AdultUCI}.
<<data>>=
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2,]
@
\code{AdultUCI} contains a mixture of categorical and metric
attributes and needs some preparation before it can be transformed
into transaction data suitable for association mining.
First, we remove the two attributes \code{fnlwgt} and
\code{education-num}. The first attribute is a weight calculated
by the creators of the data set from control data provided by
the Population Division of the U.S. census bureau. The second removed
attribute is just a numeric representation of the attribute
\code{education} which is also part of the data set.
<<>>=
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
@
Next, we need to map the four remaining metric attributes (\code{age},
\code{hours-per-week}, \code{capital-gain} and \code{capital-loss}) to
ordinal attributes by building suitable categories.
We divide the attributes \code{age} and \code{hours-per-week}
into suitable categories using knowledge about typical age groups and
working hours. For the two capital-related attributes, we create a
category called \code{None} for cases which have no gains/losses.
Then we further divide the group with gains/losses at their median
into the two categories \code{Low} and \code{High}.
<<>>=
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15,25,45,65,100)),
    labels = c("Young", "Middle-aged", "Senior", "Old"))

AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
      c(0,25,40,60,168)),
    labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
      c(-Inf, 0,
        median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]),
        Inf)),
    labels = c("None", "Low", "High"))

AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
      c(-Inf, 0,
        median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]),
        Inf)),
    labels = c("None", "Low", "High"))
@
Now, the data can be automatically recoded as a binary incidence
matrix by coercing the data set to \class{transactions}.
<<coerce>>=
Adult <- as(AdultUCI, "transactions")
Adult
@
The remaining 13 categorical attributes were automatically recoded
into \Sexpr{dim(Adult)[2]} binary items. During encoding the item
labels were generated in the form of
\texttt{<\emph{variable name}>=<\emph{category label}>}.
Note that for cases with missing values all items corresponding to the
attributes with the missing values were set to zero.
<<summary>>=
summary(Adult)
@
The summary of the transaction data set gives a rough overview showing
the most frequent items, the length distribution of the transactions
and the extended item information which shows which variable and which
value were used to create each binary item. For example, we see that
the item with label \code{age=Middle-aged} was generated by variable
\code{age} and level \code{Middle-aged}.

To see which items are important in the data set we can use
\func{itemFrequencyPlot}.
To reduce the number of items, we only plot the item frequency for
items with a support greater than 10\% (using the parameter
\code{support}).  For better readability of the labels, we reduce the
label size with the parameter \code{cex.names}. The plot is shown in
Figure~\ref{fig:itemFrequencyPlot}.
<<itemFrequencyPlot, eval=FALSE>>=
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)
@
\begin{figure}
\centering
<<echo=FALSE, fig=TRUE, width=8>>=
<<itemFrequencyPlot>>
@ %
\caption{Item frequencies of items in the Adult data set with support
greater than 10\%.}
\label{fig:itemFrequencyPlot}
\end{figure}

Next, we call the function \func{apriori} to find all rules (the
default association type for \func{apriori}) with a minimum support
of 1\% and a confidence of 0.6.
<<apriori>>=
rules <- apriori(Adult,
                 parameter = list(support = 0.01, confidence = 0.6))
rules
@
%The specified parameter values are validated and, for example,
%a support $> 1$ gives:
%
%<<error>>=
%error <- try(apriori(Adult, parameter = list(support = 1.3)))
%error
%@
First, the function prints the used parameters.  Apart from the
specified minimum support and minimum confidence, all parameters have
their default values. It is important to note that the parameter
\code{maxlen}, the maximum size of mined frequent itemsets, is by
default restricted to 5.  Longer association rules are only mined if
\code{maxlen} is set to a higher value.  After the parameter settings,
the output of the \proglang{C} implementation of the algorithm with
timing information is displayed.

The result of the mining algorithm is a set of \Sexpr{length(rules)}
rules.  For an overview of the mined rules \func{summary} can be used.
It shows the number of rules, the most frequent items contained in the
left-hand-side and the right-hand-side and their respective length
distributions and summary statistics for the quality measures returned
by the mining algorithm.
<<summary>>=
summary(rules)
@
As typical for association rule mining, the number of rules found is
huge.  To analyze these rules, for example, \func{subset} can be used
to produce separate subsets of rules for each item which resulted from
the variable \code{income} in the right-hand-side of the rule. At the
same time we require that the \code{lift} measure exceeds $1.2$.
<<rules>>=
rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" & lift > 1.2)
rulesIncomeLarge <- subset(rules, subset = rhs %in% "income=large" & lift > 1.2)
@
We now have a set with rules for persons with a small income and a set
for persons with a large income.  For comparison, we inspect for both
sets the three rules with the highest confidence (using \func{SORT}).
%%
{\samepage\small
<<subset>>=
inspect(head(SORT(rulesIncomeSmall, by = "confidence"), n = 3))
inspect(head(SORT(rulesIncomeLarge, by = "confidence"), n = 3))
@
%%
}
From the rules we see that workers in the private sector working
part-time or in the service industry tend to have a small income,
while persons with high capital gain who are born in the US tend to
have a large income. This example shows that, using subset selection
and sorting, even a huge set of mined associations can be analyzed.

Finally, the found rules can be written to disk to be shared with
other applications. To save rules in plain text format the function
\func{WRITE} is used.
The following command saves a set of rules as the file named
`data.csv' in comma separated value (CSV) format.
<<label = write_rules, eval = FALSE>>=
WRITE(rulesIncomeSmall, file = "data.csv", sep = ",", col.names = NA)
@
Alternatively, with package~\pkg{pmml}~\citep{arules:Williams:2008}
the rules can be saved in PMML (Predictive Modelling Markup Language),
a standardized XML-based representation used by many data mining
tools. Note that \pkg{pmml} requires the package~\pkg{XML} which might
not be available for all operating systems.
<<label=pmml, eval=FALSE>>=
library("pmml")
rules_pmml <- pmml(rulesIncomeSmall)
saveXML(rules_pmml, file = "data.xml")
@
The saved data can now be easily shared and used by other
applications. Itemsets (with \func{WRITE} also transactions) can be
written to a file in the same way.

\subsection{Example 3: Extending arules with a new interest
measure\label{sec:example-allconf}}

In this example, we show how easy it is to add a new interest measure,
using \emph{all-confidence} as introduced by
\cite{arules:Omiecinski:2003}.  The all-confidence of an itemset~$X$
is defined as
\begin{equation}
\mbox{all-confidence}(X) = \frac{\mathrm{supp}(X)}{\max_{I \subset X}
\mathrm{supp}(I)}
\label{equ:all_conf}
\end{equation}
This measure has the property $\mathrm{conf}(I \Rightarrow X \setminus
I) \ge \mbox{all-confidence}(X)$ for all $I \subset X$.  This means
that all possible rules generated from itemset~$X$ must at least have
a confidence given by the itemset's all-confidence value.
\cite{arules:Omiecinski:2003} shows that the support in the
denominator of equation~\ref{equ:all_conf} must stem from a single
item and thus can be simplified to
$\max_{i \in X} \mathrm{supp}(\{i\})$.

To obtain an itemset to calculate all-confidence for,
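Before wiring the measure into the package's data structures, the
simplified definition above can be sketched in a few lines of plain R.
This is a toy illustration with made-up support values, not the
\pkg{arules} implementation; the function name \code{all\_confidence}
and the example supports are hypothetical.

```r
## Toy sketch of all-confidence (not the arules implementation):
## all-confidence(X) = supp(X) / max_{i in X} supp({i})
all_confidence <- function(supp_X, supp_single_items) {
  ## supp_X: support of the whole itemset X
  ## supp_single_items: vector of supports of the single items in X
  supp_X / max(supp_single_items)
}

## Made-up supports: supp({a,b}) = 0.2, supp({a}) = 0.5, supp({b}) = 0.4
all_confidence(0.2, c(a = 0.5, b = 0.4))  # 0.2 / 0.5 = 0.4

## Sanity check of the bound conf(I => X\I) >= all-confidence(X):
## conf({a} => {b}) = supp({a,b}) / supp({a}) = 0.2 / 0.5 = 0.4 >= 0.4
```

Note that only the supports of the single items appear in the
denominator, which is exactly the simplification shown by
\cite{arules:Omiecinski:2003}.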
