
tm.rnw

R is an open-source statistical software project with its own language, similar to the S language. This package is an R implementation of text mining (tm) and contains code and sample data.
\documentclass[a4paper]{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{201C}{"}
\DeclareUnicodeCharacter{201D}{"}
\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
\newcommand{\func}[1]{\mbox{\texttt{#1()}}}
\newcommand{\code}[1]{\mbox{\texttt{#1}}}
\newcommand{\pkg}[1]{\strong{#1}}
\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}
\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\set}[1]{\mathcal{#1}}
\newcommand{\acronym}[1]{\textsc{#1}}
%% \VignetteIndexEntry{Introduction to the tm Package}
\begin{document}
<<echo=FALSE>>=
options(width = 75)
### for sampling
set.seed(1234)
@
\title{Introduction to the \pkg{tm} Package\\Text Mining in \proglang{R}}
\author{Ingo Feinerer}
\maketitle
\sloppy
\begin{abstract}
This vignette gives a short overview of the available features in the
\pkg{tm} package for text mining purposes in \proglang{R}.
\end{abstract}

\section*{Loading the Package}
Before we can actually work with the package we need to load it:
<<>>=
library("tm")
@

\section*{Data Import}
The main structure for managing documents is a so-called text document
collection, denoted as corpus in linguistics (\class{Corpus}).
Its constructor takes the following arguments:
\begin{itemize}
\item \code{object}: a \class{Source} object which abstracts the input
  location.
\item \code{readerControl}: a list with the named components
  \code{reader}, \code{language}, and \code{load}.
  A reader constructs a text document from a single element
  delivered by a source. A reader must have the argument signature
  \code{(elem, load, language, id)}. The first argument is the element
  provided by the source, the second indicates whether the user wants to
  load the document immediately into memory, the third gives the text's
  language, and the fourth is a unique identification string.
  If the passed over \code{reader} object is of
  class~\class{FunctionGenerator}, it is assumed to be a function
  generating a reader. This way custom readers taking various
  parameters (specified in \code{...}) can be built; they
  must produce a valid reader signature but can access the additional
  parameters via lexical scoping (i.e., via the enclosing
  environment).
\item \code{dbControl}: a list with the named components \code{useDb},
  indicating whether database support should be activated, \code{dbName},
  giving the filename holding the sourced out objects (i.e., the database), and
  \code{dbType}, holding a valid database type as supported by the
  \pkg{filehash} package. With activated database
  support the \pkg{tm} package tries to keep as few resources as possible
  in memory by using the database (a sketch of such a constructor call
  appears below).
\item \code{...}: Further arguments passed to the reader.
\end{itemize}

Available sources are \class{DirSource}, \class{CSVSource},
\class{GmaneSource}, and \class{ReutersSource}, which handle a directory, a
mixed CSV file, a Gmane mailing list archive \acronym{Rss} feed, or a
mixed Reuters file, respectively (mixed means that several documents are
contained in a single file). Except for \class{DirSource}, which is designated
solely for directories on a file system, all other implemented sources
can take connections as input (a character string is interpreted as a
file path).

This package ships with several readers (\code{readPlain()}
(default), \code{readRCV1()}, \code{readReut21578XML()},
\code{readGmane()}, \code{readNewsgroup()}, \code{readPDF()},
\code{readDOC()}, and \code{readHTML()}).
Each source has a default reader which can be overridden.
E.g., for \code{DirSource} the default just reads in the whole content of
the input files and interprets it as text.
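To illustrate how the constructor arguments described above fit together,
the following sketch creates a collection of the shipped plain text samples
with an explicit \code{readerControl} and with database support activated
via \code{dbControl}. The chunk is not evaluated; the database file name
\code{corpus.db} is merely an exemplary choice, and \samp{DB1} is one of the
formats provided by the \pkg{filehash} package:
<<eval=FALSE>>=
## Sketch only (not run): a directory source with database support.
## "corpus.db" is a hypothetical file name; "DB1" is a filehash format.
txt <- system.file("texts", "txt", package = "tm")
Corpus(DirSource(txt),
       readerControl = list(reader = readPlain, language = "la", load = FALSE),
       dbControl = list(useDb = TRUE, dbName = "corpus.db", dbType = "DB1"))
@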
Plain text files in a directory:
<<keep.source=TRUE>>=
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
                readerControl = list(reader = readPlain,
                                     language = "la",
                                     load = TRUE)))
@

A single comma separated values file:
<<>>=
# Comma separated values
cars <- system.file("texts", "cars.csv", package = "tm")
Corpus(CSVSource(cars))
@

Reuters21578 files, either in a directory (one document per file) or in a
single file (several documents per file). Note that connections can be used
as input:
<<keep.source=TRUE>>=
# Reuters21578 XML
reut21578 <- system.file("texts", "reut21578", package = "tm")
reut21578XML <- system.file("texts", "reut21578.xml", package = "tm")
reut21578XMLgz <- system.file("texts", "reut21578.xml.gz", package = "tm")
(reut21578TDC <- Corpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XML,
                                             language = "en_US",
                                             load = FALSE)))
Corpus(ReutersSource(reut21578XML),
       readerControl = list(reader = readReut21578XML,
                            language = "en_US", load = FALSE))
Corpus(ReutersSource(gzfile(reut21578XMLgz)),
       readerControl = list(reader = readReut21578XML,
                            language = "en_US", load = FALSE))
@

Depending on your exact input format you might find
\code{preprocessReut21578XML()} useful. For the original downloadable
archive this function can correct invalid \acronym{Utf8} encodings and
can copy each text document into a separate file to enable load on
demand.

Analogously we can construct collections for files in the Reuters
Corpus Volume 1 format:
<<>>=
# Reuters Corpus Volume 1
rcv1 <- system.file("texts", "rcv1", package = "tm")
rcv1XML <- system.file("texts", "rcv1.xml", package = "tm")
Corpus(DirSource(rcv1),
       readerControl = list(reader = readRCV1, language = "en_US", load = TRUE))
Corpus(ReutersSource(rcv1XML),
       readerControl = list(reader = readRCV1, language = "en_US", load = FALSE))
@

Or mails from newsgroups (as found in the \acronym{Uci} \acronym{Kdd}
newsgroup data set):
<<>>=
# UCI KDD Newsgroup Mails
newsgroup <- system.file("texts", "newsgroup", package = "tm")
Corpus(DirSource(newsgroup),
       readerControl = list(reader = readNewsgroup, language = "en_US", load = TRUE))
@

An \acronym{Rss} feed as delivered by Gmane for the \proglang{R} mailing
list archive:
<<>>=
rss <- system.file("texts", "gmane.comp.lang.r.gr.rdf", package = "tm")
Corpus(GmaneSource(rss),
       readerControl = list(reader = readGmane, language = "en_US", load = FALSE))
@

For very simple \acronym{Html} documents:
<<>>=
html <- system.file("texts", "html", package = "tm")
Corpus(DirSource(html),
       readerControl = list(reader = readHTML, load = TRUE))
@

And for \acronym{Pdf} documents:
<<>>=
pdf <- system.file("texts", "pdf", package = "tm")
Corpus(DirSource(pdf),
       readerControl = list(reader = readPDF, language = "en_US", load = TRUE))
@

Note that \code{readPDF()} needs \code{pdftotext} and \code{pdfinfo}
installed on your system to be able to extract the text and meta
information from your \acronym{Pdf}s.

Finally, for \acronym{Ms} Word documents there is the reader function
\code{readDOC()}. You need \code{antiword} installed on your system to
be able to extract the text from your Word documents.
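Analogously to the \acronym{Pdf} example, a collection of Word documents
could be constructed as follows. This is only a sketch and is not evaluated,
since no sample Word files ship with the package; the directory name
\code{doc} is hypothetical:
<<eval=FALSE>>=
## Sketch only (not run): "doc" is a hypothetical directory with .doc files;
## antiword must be available on the system.
Corpus(DirSource("doc"),
       readerControl = list(reader = readDOC, language = "en_US", load = TRUE))
@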
\section*{Data Export}
In case you have created a text document collection by manipulating other
objects in \proglang{R}, and thus do not have the texts already stored on
disk, you can simply use standard \proglang{R} routines for writing out
plain text documents. E.g.,
<<eval=FALSE>>=
lapply(ovid, function(x) writeLines(x, paste(ID(x), ".txt", sep = "")))
@
Alternatively there is the function \code{writeCorpus()} which
encapsulates this functionality.

\section*{Inspecting the Text Document Collection}
Custom \code{show} and \code{summary} methods are available, which
hide the raw amount of information (consider that a collection could
consist of several thousand documents, like a
database). \code{summary} gives more details on metadata than
\code{show}, whereas in order to actually see the content of the text
documents you use the command \code{inspect} on a collection.
<<>>=
show(ovid)
summary(ovid)
inspect(ovid[1:2])
