@
\section*{Transformations}

Once we have a text document collection, one typically wants to modify
the documents in it, e.g., by stemming or stopword removal. In
\pkg{tm}, all this functionality is subsumed into the concept of
\emph{transformation}s. Transformations are done via the \code{tmMap}
function, which applies a function to all elements of the collection.
Basically, all transformations work on single text documents and
\code{tmMap} just applies them to all documents in a document
collection.

\subsection*{Loading Documents into Memory}

If the source object supports load on demand but the user has not
instructed the package to load the input content into memory right
away, this can be done manually via \code{loadDoc}. Normally it is not
necessary to call it explicitly, as other functions working on text
corpora trigger it for documents that have not been loaded yet (the
corpus is loaded automatically when accessed via \code{[[}).
<<>>=
reut21578TDC <- tmMap(reut21578TDC, loadDoc)
@

\subsection*{Converting to Plaintext Documents}

The text document collection \code{reut21578TDC} contains documents in
XML format. We have no further use for the XML internals and just want
to work with the text content, so we convert the documents to plaintext
documents via the generic \code{asPlain}.
<<>>=
reut21578TDC <- tmMap(reut21578TDC, asPlain)
@

\subsection*{Eliminating Extra Whitespace}

Extra whitespace is eliminated by:
<<>>=
reut21578TDC <- tmMap(reut21578TDC, stripWhitespace)
@

\subsection*{Convert to Lower Case}

Conversion to lower case is done by:
<<>>=
reut21578TDC <- tmMap(reut21578TDC, tmTolower)
@

\subsection*{Remove Stopwords}

Stopwords are removed by:
<<>>=
reut21578TDC <- tmMap(reut21578TDC, removeWords, stopwords("english"))
@

\subsection*{Stemming}

Stemming is done by:
<<>>=
tmMap(reut21578TDC, stemDoc)
@

\section*{Filters}

Often it is of special interest to extract the documents satisfying
given properties. The function \code{tmFilter} is designed for this
purpose. It is possible to write custom filter functions (a sketch is
shown below), but for most cases the default filter does its job: it
provides a minimal query language for filtering on metadata. Statements
in this query language are written as they would be for subsetting data
frames. E.g., the following statement selects those documents having
\code{COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE} as their heading
and an \code{ID} equal to 10 (both are metadata slot variables of the
text document).
<<keep.source=TRUE>>=
query <- "identifier == '10' & heading == 'COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE'"
tmFilter(reut21578TDC, query)
@
There is also a full text search filter available which accepts regular
expressions:
<<>>=
tmFilter(reut21578TDC, FUN = searchFullText, "partnership", doclevel = TRUE)
@
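
As an illustration of such a custom filter, the following sketch keeps
a document whenever its content matches a given pattern. It is a
minimal, unevaluated example under the assumption that a filter
function is called on each individual text document, that the document
content can be coerced to a character vector, and that the function
returns a logical value; the name \code{containsPartnership} and the
searched pattern are purely illustrative.
<<eval=FALSE>>=
## Hypothetical custom filter: TRUE if any part of the document
## content matches "partnership" (assumes the document content can
## be coerced to a character vector).
containsPartnership <- function(doc) length(grep("partnership", as.character(doc))) > 0
tmFilter(reut21578TDC, FUN = containsPartnership, doclevel = TRUE)
@
The same pattern could be used to filter on any property that can be
computed from a document.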
\section*{Adding Data or Metadata}

Text documents or metadata can be added to text document collections
with \code{appendElem} and \code{appendMeta}, respectively. A text
document collection has two kinds of metadata: metadata on the document
collection level (\code{cmeta}) and metadata related to the individual
documents, e.g., clusterings, stored as a data frame (\code{dmeta}).
For the method \code{appendElem} it is possible to supply a row of
values for this data frame along with the added data element.
<<>>=
data(crude)
reut21578TDC <- appendElem(reut21578TDC, crude[[1]], 0)
reut21578TDC <- appendMeta(reut21578TDC, cmeta = list(test = c(1,2,3)), dmeta = list(cl1 = 1:11))
summary(reut21578TDC)
CMetaData(reut21578TDC)
DMetaData(reut21578TDC)
@

\section*{Removing Metadata}

The metadata of text document collections can easily be modified or
removed:
<<>>=
data(crude)
reut21578TDC <- removeMeta(reut21578TDC, cname = "test", dname = "cl1")
CMetaData(reut21578TDC)
DMetaData(reut21578TDC)
@

\section*{Operators}

Many standard operators and functions (\code{[}, \code{[<-},
\code{[[}, \code{[[<-}, \code{c}, \code{length}, \code{lapply},
\code{sapply}) are available for text document collections with
semantics similar to the standard \proglang{R} routines. E.g., \code{c}
concatenates two (or more) text document collections; applied to
several text documents it returns a text document collection. The
metadata is updated automatically when text document collections are
concatenated (i.e., merged). Note also the custom element-of operator
\code{\%IN\%}, which checks whether a text document is already in a
text document collection (only the corpus is checked, not the
metadata):
<<>>=
crude[[1]] %IN% reut21578TDC
crude[[2]] %IN% reut21578TDC
@

\section*{Keeping Track of Text Document Collections}

The class \class{TextRepository} provides a mechanism for managing text
document collections. A typical use is to save different states of a
text document collection. A repository holds metadata in list format,
which can be set either via an additional argument to \code{appendElem}
(e.g., a date when a new element is added) or directly with
\code{appendMeta}.
<<>>=
data(acq)
repo <- TextRepository(reut21578TDC)
repo <- appendElem(repo, acq, list(modified = date()))
repo <- appendMeta(repo, list(moremeta = 5:10))
summary(repo)
RepoMetaData(repo)
summary(repo[[1]])
summary(repo[[2]])
@

\section*{Creating Term-Document Matrices}

A common approach in text mining is to create a term-document matrix
for the given texts. In this package the class \class{TermDocMatrix}
handles sparse matrices for text document collections.
<<>>=
tdm <- TermDocMatrix(reut21578TDC)
Data(tdm)[1:8, 150:155]
@

\section*{Operations on Term-Document Matrices}

Besides the fact that a huge number of \proglang{R} functions (like
clustering, classification, etc.) can be applied to the \code{Data}
part of this matrix, the package provides some shortcuts. Suppose we
want to find those terms that occur at least five times:
<<>>=
findFreqTerms(tdm, 5, Inf)
@
Or we want to find associations (i.e., terms which correlate) with at
least a $0.97$ correlation for the term \code{crop}:
<<>>=
findAssocs(tdm, "crop", 0.97)
@
The function also accepts a matrix as its first argument (one that does
not inherit from a term-document matrix). Such a matrix is interpreted
as a correlation matrix and used directly, so different correlation
measures can be employed.
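
To illustrate this, one could compute a term correlation matrix with an
alternative measure, e.g., Spearman's rank correlation, and pass it to
\code{findAssocs}. The following is a minimal sketch, not evaluated
here; it assumes that \code{Data(tdm)} holds the document-by-term
matrix (documents as rows, terms as columns) and can be coerced to a
dense matrix, and the threshold of $0.7$ is chosen arbitrarily.
<<eval=FALSE>>=
## Hypothetical example: rank-based correlations between terms,
## assuming documents are rows and terms are columns in Data(tdm).
spearmanCor <- cor(as.matrix(Data(tdm)), method = "spearman")
findAssocs(spearmanCor, "crop", 0.7)
@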
Term-document matrices tend to get very big already for datasets of
normal size. Therefore we provide a method to remove \emph{sparse}
terms, i.e., terms occurring only in very few documents. Normally, this
reduces the matrix dramatically without losing significant relations
inherent to the matrix:
<<>>=
removeSparseTerms(tdm, 0.4)
@
This call removes those terms for which at least 40 percent of the
entries are sparse, i.e., for which the term occurs zero times in at
least 40 percent of the documents.

\section*{Dictionary}

A dictionary is a (multi-)set of strings. It is often used to represent
the relevant terms in a text mining task. We provide a class
\class{Dictionary} implementing such a dictionary concept. It can be
created via the \code{Dictionary} constructor, e.g.,
<<>>=
(d <- Dictionary(c("dlrs", "crude", "oil")))
@
and may be passed to the \code{TermDocMatrix} constructor. The created
matrix is then tabulated against the dictionary, i.e., only terms from
the dictionary appear in the matrix. This allows one to restrict the
dimension of the matrix a priori and to focus on specific terms for
distinct text mining contexts, e.g.,
<<>>=
tdmD <- TermDocMatrix(reut21578TDC, list(dictionary = d))
Data(tdmD)
@
You can also create a dictionary holding all terms of an existing
term-document matrix via \code{createDictionary}, e.g.,
<<>>=
createDictionary(tdm)[100:110]
@

\end{document}