📄 libbow-desc.texi

Documentation and updates for `libbow' are available at
http://www.cs.cmu.edu/~mccallum/bow

Rainbow is a C program that performs document classification using one
of several different methods, including naive Bayes, TFIDF/Rocchio,
K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's
Probabilistic Indexing, and a simple-minded form of shrinkage with
naive Bayes.

Rainbow's accompanying library, `libbow', is a library of C code
intended for support of statistical text-processing programs.  The
current source distribution includes the library, a text classification
front-end (rainbow), a simple TFIDF-based document retrieval front-end
(arrow), an AltaVista-style document retrieval front-end (archer), and an
unsupported document clustering front-end with hierarchical clustering
and deterministic annealing (crossbow).

@format
The library provides facilities for:
 *  Recursively descending directories, finding text files.
 *  Finding `document' boundaries when there are multiple docs per file.
 *  Tokenizing a text file, according to several different methods.
 *  Including N-grams among the tokens.
 *  Mapping strings to integers and back again, very efficiently.
 *  Building a sparse matrix of document/token counts.
 *  Pruning vocabulary by occurrence counts or by information gain.
 *  Building and manipulating word vectors.
 *  Setting word vector weights according to naive Bayes, TFIDF, and a
     simple form of Probabilistic Indexing.
 *  Scoring queries for retrieval or classification.
 *  Writing all data structures to disk in a compact format.
 *  Reading the document/token matrix from disk in an efficient,
     sparse fashion.
 *  Performing test/train splits, and automatic classification tests.
 *  Operating in server mode, receiving and answering queries over a
     socket.
@end format

It is known to compile on most UNIX systems, including Linux, Solaris,
SUNOS, Irix and HPUX.
Six months ago, it compiled on Windows NT (with
a GNU build environment); it would probably work again with little
effort.  Patches to the code are most welcome.

It is relatively efficient.  Reading, tokenizing and indexing the raw
text of 20,000 UseNet articles takes about 3 minutes.  Building a
naive Bayes classifier from 10,000 articles, and classifying the other
10,000 takes about 1 minute.

The code conforms to the GNU coding standards.  It is released under the
GNU Library General Public License (LGPL).

@format
The library does not:
        Have parsing facilities.
        Do smoothing across N-gram models.
        Claim to be finished.
        Have good documentation.
        Claim to be bug-free.
        ...many other things.
@end format
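
To make a few of the facilities listed above more concrete (mapping
strings to integers and back, building sparse document/token counts, and
setting TFIDF weights), here is a minimal sketch in Python.  It is not
libbow's C API and does not reflect how the library is implemented; the
names `Vocabulary', `tokenize', `build_counts' and `tfidf' are
hypothetical, and the weighting shown (raw count times the log of inverse
document frequency) is only one common TFIDF variant, assumed here for
illustration.

```python
import math
import re
from collections import Counter


class Vocabulary:
    """Hypothetical two-way map between token strings and integer ids."""

    def __init__(self):
        self._index = {}   # token string -> integer id
        self._tokens = []  # integer id -> token string

    def id_of(self, token):
        # Assign the next free id the first time a token is seen.
        if token not in self._index:
            self._index[token] = len(self._tokens)
            self._tokens.append(token)
        return self._index[token]

    def token_of(self, token_id):
        return self._tokens[token_id]


def tokenize(text):
    # One simple tokenizing method: lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def build_counts(docs, vocab):
    # One sparse row per document: {token id: occurrence count}.
    return [dict(Counter(vocab.id_of(t) for t in tokenize(d))) for d in docs]


def tfidf(rows):
    # Weight = count * log(N / document frequency); assumed variant.
    n_docs = len(rows)
    doc_freq = Counter()
    for row in rows:
        doc_freq.update(row.keys())
    return [
        {tid: count * math.log(n_docs / doc_freq[tid])
         for tid, count in row.items()}
        for row in rows
    ]
```

Calling `build_counts' on a list of document strings with a fresh
`Vocabulary' yields the sparse rows, and passing those rows to `tfidf'
yields the weighted word vectors.  A token that occurs in every document
receives weight zero under this variant.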