Bag Of Words Library README
***************************

`libbow', version 1.0.

   Documentation and updates for `libbow' are available at
http://www.cs.cmu.edu/~mccallum/bow

   Rainbow is a C program that performs document classification using
one of several different methods, including naive Bayes, TFIDF/Rocchio,
K-nearest neighbor, Maximum Entropy, Support Vector Machines, Fuhr's
Probabilistic Indexing, and a simple-minded form of shrinkage with
naive Bayes.

   Rainbow's accompanying library, `libbow', is a library of C code
intended for support of statistical text-processing programs.  The
current source distribution includes the library, a text classification
front-end (rainbow), a simple TFIDF-based document retrieval front-end
(arrow), an AltaVista-style document retrieval front-end (archer), and
an unsupported document clustering front-end with hierarchical
clustering and deterministic annealing (crossbow).

The library provides facilities for:

 *  Recursively descending directories, finding text files.
 *  Finding `document' boundaries when there are multiple docs per file.
 *  Tokenizing a text file, according to several different methods.
 *  Including N-grams among the tokens.
 *  Mapping strings to integers and back again, very efficiently.
 *  Building a sparse matrix of document/token counts.
 *  Pruning vocabulary by occurrence counts or by information gain.
 *  Building and manipulating word vectors.
 *  Setting word vector weights according to naive Bayes, TFIDF, and a
    simple form of Probabilistic Indexing.
 *  Scoring queries for retrieval or classification.
 *  Writing all data structures to disk in a compact format.
 *  Reading the document/token matrix from disk in an efficient,
    sparse fashion.
 *  Performing test/train splits, and automatic classification tests.
 *  Operating in server mode, receiving and answering queries over a
    socket.

   It is known to compile on most UNIX systems, including Linux,
Solaris, SunOS, Irix and HP-UX.
Six months ago, it compiled on Windows NT (with a GNU build
environment); it would probably work again with little effort.  Patches
to the code are most welcome.

   It is relatively efficient.  Reading, tokenizing and indexing the raw
text of 20,000 Usenet articles takes about 3 minutes.  Building a naive
Bayes classifier from 10,000 articles, and classifying the other 10,000
takes about 1 minute.

   The code conforms to the GNU coding standards.  It is released under
the GNU Library General Public License (LGPL).

The library does not:

        Have parsing facilities.
        Do smoothing across N-gram models.
        Claim to be finished.
        Have good documentation.
        Claim to be bug-free.
        ...many other things.

Rainbow
=======

   `Rainbow' is a standalone program that does document classification.
Here are some examples:

   *      rainbow -i ./training/positive ./training/negative

     Using the text files found under the directories
     `./training/positive' and `./training/negative', tokenize, build
     word vectors, and write the resulting data structures to disk.

   *      rainbow --query=./testing/254

     Tokenize the text document `./testing/254', and classify it,
     producing output like:

          /home/mccallum/training/positive 0.72
          /home/mccallum/training/negative 0.28

   *      rainbow --test-set=0.5 -t 5

     Perform 5 trials, each consisting of a new random test/train
     split, and output the classification of the test documents.

   Typing `rainbow --help' will give a list of all rainbow options.

   After you have compiled `libbow' and `rainbow', you can run the
shell script `./demo/script' to see an annotated demonstration of the
classifier in action.

   More information and documentation are available at
http://www.cs.cmu.edu/~mccallum/bow

Rainbow improvements coming eventually:

   Better documentation.
   Incremental model training.

Arrow
=====

   `Arrow' is a standalone program that does document retrieval by
TFIDF.

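   For readers unfamiliar with TFIDF retrieval, the general idea can be
sketched in a few lines of Python.  This is not arrow's code and makes
no claim about its exact weighting: the whitespace tokenizer, the
tf * log(N/df) weights, the cosine scoring, and every function name
below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Return (vectors, idf) for a dict mapping document name -> text.

    Each vector maps a token to tf * idf, where idf = log(N / df).
    This particular weighting is an assumption, not arrow's formula.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for token_counts in counts.values():
        doc_freq.update(token_counts.keys())
    idf = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}
    vectors = {
        name: {tok: tf * idf[tok] for tok, tf in token_counts.items()}
        for name, token_counts in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, vectors, idf):
    """Return (score, name) pairs, best match first."""
    q_counts = Counter(tokenize(query))
    # Tokens never seen at indexing time carry no weight.
    q_vec = {tok: tf * idf[tok] for tok, tf in q_counts.items() if tok in idf}
    scores = [(cosine(q_vec, vec), name) for name, vec in vectors.items()]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "hockey puck ice goal",
        "b": "rocket orbit launch",
        "c": "ice skating rink",
    }
    vectors, idf = build_index(docs)
    for score, name in rank("hockey goal", vectors, idf):
        print(f"{name} {score:.2f}")
```

   Tokens that occur in every document get a weight of zero under this
scheme, so they do not influence the ranking.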
   Index all the documents in directory `foo' by typing

     arrow --index foo

   Make a single query by typing

     arrow --query

   then typing your query, and pressing Control-D.

   If you want to make many queries, it will be more efficient to run
arrow as a server, and query it multiple times without restarts by
communicating through a socket.  Type, for example,

     arrow --query-server=9876

   And access it through port number 9876.  For example:

     telnet localhost 9876

   In this mode there is no need to press Control-D to end a query.
Simply type your query on one line, and press return.

Crossbow
========

   `Crossbow' is a standalone program that does document clustering.
Sorry, there is no documentation yet.

Archer
======

   `Archer' is a standalone program that does document retrieval with
AltaVista-type queries, using +, -, "", etc.  The commands in the
"arrow" examples above also work for archer.  See "archer -help" for
more information.
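
   The telnet session described in the Arrow section can also be
scripted.  The following is a minimal Python sketch of a client for a
server started with `--query-server=9876'.  Only the one-query-per-line
convention comes from this README; the reply format and the way the
server marks the end of a reply are not documented here, so the function
name `query_server', the timeout, and the read-until-quiet strategy are
all assumptions.

```python
import socket


def query_server(query, host="localhost", port=9876, timeout=2.0):
    """Send one query line and return whatever text the server sends back.

    Assumption: the end-of-reply marker is unknown, so this reads until
    the server closes the connection or stays silent for `timeout` seconds.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # A query is a single line terminated by a newline.
        sock.sendall(query.encode("utf-8") + b"\n")
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break  # server closed the connection
                chunks.append(data)
        except socket.timeout:
            pass  # no more data within the timeout; treat the reply as complete
    return b"".join(chunks).decode("utf-8", errors="replace")
```

   With a server already listening, `print(query_server("your query"))'
would show the raw reply.  This sketch opens a new connection for every
query; a client that issues many queries would more likely keep one
connection open.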
