Bag Of Words Library README
***************************

   `libbow', version 1.0.

   Documentation and updates for `libbow' are available at
http://www.cs.cmu.edu/~mccallum/bow

   Rainbow is a C program that performs document classification using
one of several different methods, including naive Bayes, TFIDF/Rocchio,
K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's
Probabilistic Indexing, and a simple-minded form of shrinkage with
naive Bayes.

   Rainbow's accompanying library, `libbow', is a library of C code
intended for support of statistical text-processing programs.  The
current source distribution includes the library, a text classification
front-end (rainbow), a simple TFIDF-based document retrieval front-end
(arrow), an AltaVista-style document retrieval front-end (archer), and
an unsupported document clustering front-end with hierarchical
clustering and deterministic annealing (crossbow).

   The library provides facilities for:

     * Recursively descending directories, finding text files.

     * Finding `document' boundaries when there are multiple docs per
       file.

     * Tokenizing a text file, according to several different methods.

     * Including N-grams among the tokens.

     * Mapping strings to integers and back again, very efficiently.

     * Building a sparse matrix of document/token counts.

     * Pruning vocabulary by occurrence counts or by information gain.

     * Building and manipulating word vectors.

     * Setting word vector weights according to naive Bayes, TFIDF, and
       a simple form of Probabilistic Indexing.

     * Scoring queries for retrieval or classification.

     * Writing all data structures to disk in a compact format.

     * Reading the document/token matrix from disk in an efficient,
       sparse fashion.

     * Performing test/train splits, and automatic classification tests.

     * Operating in server mode, receiving and answering queries over a
       socket.

   It is known to compile on most UNIX systems, including Linux,
Solaris, SunOS, IRIX and HP-UX.
Six months ago, it compiled on Windows NT (with a GNU build
environment); it would probably work again with little effort.  Patches
to the code are most welcome.

   It is relatively efficient.  Reading, tokenizing and indexing the raw
text of 20,000 Usenet articles takes about 3 minutes.  Building a naive
Bayes classifier from 10,000 articles, and classifying the other 10,000
takes about 1 minute.

   The code conforms to the GNU coding standards.  It is released under
the GNU Library General Public License (LGPL).

   The library does not:

     Have parsing facilities.
     Do smoothing across N-gram models.
     Claim to be finished.
     Have good documentation.
     Claim to be bug-free.
     ...many other things.

Rainbow
=======

   `Rainbow' is a standalone program that does document classification.
Here are some examples:

   * rainbow -i ./training/positive ./training/negative

     Using the text files found under the directories
     `./training/positive' and `./training/negative', tokenize, build
     word vectors, and write the resulting data structures to disk.

   * rainbow --query=./testing/254

     Tokenize the text document `./testing/254', and classify it,
     producing output like:

          /home/mccallum/training/positive 0.72
          /home/mccallum/training/negative 0.28

   * rainbow --test-set=0.5 -t 5

     Perform 5 trials, each consisting of a new random test/train split
     and output of the classification of the test documents.

   Typing `rainbow --help' will give a list of all rainbow options.

   After you have compiled `libbow' and `rainbow', you can run the
shell script `./demo/script' to see an annotated demonstration of the
classifier in action.

   More information and documentation is available at
http://www.cs.cmu.edu/~mccallum/bow

   Rainbow improvements coming eventually:

     Better documentation.
     Incremental model training.

Arrow
=====

   `Arrow' is a standalone program that does document retrieval by
TFIDF.

   Index all the documents in directory `foo' by typing

     arrow --index foo

   Make a single query by typing

     arrow --query

   then typing your query, and pressing Control-D.

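
   On a UNIX terminal, Control-D signals end-of-file on standard input,
which is why it ends a query typed after `arrow --query'.  The sketch
below is not arrow's code.  It only illustrates the general pattern of
reading a stream until end-of-file and then splitting the text into
tokens.  The function names `read_until_eof' and `next_token' are
hypothetical, and the tokenizing rule (runs of letters, lowercased) is an
assumed simplification, not one of libbow's actual methods.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from `in' until end-of-file (Control-D on a terminal).
   Returns a malloc'ed, NUL-terminated buffer, or NULL if memory runs
   out.  The caller frees the buffer.  (Hypothetical helper.) */
char *
read_until_eof (FILE *in)
{
  size_t cap = 256, len = 0;
  char *buf = malloc (cap);
  int c;

  if (buf == NULL)
    return NULL;
  while ((c = fgetc (in)) != EOF)
    {
      if (len + 1 >= cap)
        {
          /* Double the buffer, keeping room for the final NUL. */
          char *bigger = realloc (buf, cap * 2);
          if (bigger == NULL)
            {
              free (buf);
              return NULL;
            }
          buf = bigger;
          cap *= 2;
        }
      buf[len++] = (char) c;
    }
  buf[len] = '\0';
  return buf;
}

/* Copy the next token (a run of letters, lowercased) starting at
   *cursor into `tok', which holds `toksize' bytes, and advance *cursor
   past it.  Returns 1 if a token was found, 0 at the end of the text.
   Tokens longer than toksize - 1 characters are truncated.
   (Hypothetical helper; the letters-only rule is an assumption.) */
int
next_token (const char **cursor, char *tok, size_t toksize)
{
  const char *p = *cursor;
  size_t n = 0;

  assert (toksize >= 2);
  /* Skip anything that is not a letter. */
  while (*p != '\0' && !isalpha ((unsigned char) *p))
    p++;
  /* Collect the run of letters. */
  while (*p != '\0' && isalpha ((unsigned char) *p))
    {
      if (n + 1 < toksize)
        tok[n++] = (char) tolower ((unsigned char) *p);
      p++;
    }
  tok[n] = '\0';
  *cursor = p;
  return n > 0;
}
```

   A caller might pass `stdin' to `read_until_eof', point a cursor at the
returned buffer, and call `next_token' in a loop until it returns 0.
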
   If you want to make many queries, it will be more efficient to run
arrow as a server, and query it multiple times without restarts by
communicating through a socket.  Type, for example,

     arrow --query-server=9876

   And access it through port number 9876.  For example:

     telnet localhost 9876

   In this mode there is no need to press Control-D to end a query.
Simply type your query on one line, and press return.

Crossbow
========

   `Crossbow' is a standalone program that does document clustering.
Sorry, there is no documentation yet.

Archer
======

   `Archer' is a standalone program that does document retrieval with
AltaVista-type queries, using +, -, "", etc.  The commands in the
"arrow" examples above also work for archer.  See "archer -help" for
more information.
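
   To make the +, - and "" notation concrete, here is a minimal sketch of
splitting such a query into terms.  It is not archer's parser.  It
assumes the usual AltaVista convention that a leading + marks a term as
required, a leading - excludes it, and double quotes group words into
one phrase; whether archer follows exactly these rules is an assumption.
The names `term_kind', `query_term' and `parse_query' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum term_kind { TERM_OPTIONAL, TERM_REQUIRED, TERM_EXCLUDED };

struct query_term
{
  enum term_kind kind;
  int is_phrase;   /* nonzero if the term was written in double quotes */
  char text[64];   /* the word or phrase, without +, - or quotes */
};

/* True for the characters treated as term separators. */
static int
is_blank (char c)
{
  return c != '\0' && strchr (" \t\n", c) != NULL;
}

/* Split `query' into at most `max' terms and return how many were
   stored.  Assumed syntax (not taken from archer): blanks separate
   terms, a leading + or - sets the kind, and "..." forms one phrase.
   An unterminated quote runs to the end of the string.  Text longer
   than the `text' field is truncated. */
size_t
parse_query (const char *query, struct query_term *terms, size_t max)
{
  size_t count = 0;
  const char *p = query;

  assert (query != NULL && terms != NULL);
  while (count < max)
    {
      struct query_term *t = &terms[count];
      size_t n = 0;

      while (is_blank (*p))
        p++;
      if (*p == '\0')
        break;

      /* An optional + or - prefix decides the kind of term. */
      t->kind = TERM_OPTIONAL;
      if (*p == '+')
        {
          t->kind = TERM_REQUIRED;
          p++;
        }
      else if (*p == '-')
        {
          t->kind = TERM_EXCLUDED;
          p++;
        }

      t->is_phrase = (*p == '"');
      if (t->is_phrase)
        p++;

      /* A phrase ends at the closing quote, a plain word at a blank. */
      while (*p != '\0' && (t->is_phrase ? *p != '"' : !is_blank (*p)))
        {
          if (n + 1 < sizeof t->text)
            t->text[n++] = *p;
          p++;
        }
      t->text[n] = '\0';
      if (t->is_phrase && *p == '"')
        p++;

      /* Ignore empty terms such as a lone "+" or an empty "". */
      if (n > 0)
        count++;
    }
  return count;
}
```

   Under these assumptions, the query `+text -spam "naive bayes" retrieval'
yields four terms: a required word, an excluded word, an optional phrase,
and an optional word.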