\" t.TH DBACL 1 "Bayesian Text Classification Tools" "Version 1.3" "".SH NAMEdbacl \- a digramic Bayesian classifier for text recognition..SH SYNOPSIS.HP.B dbacl[-dvnirMND][-T .IR type] -l .I category[-h .IR size ] [-H.IR gsize ][-x .IR decim ] [-w.IR max_order ] [-g .IR regex ]... [FILE]....HP.B dbacl[-vniNR] [-h .IR size ] [-T .IR type]-c .I category[-c .IR category ]... [-f.IR keep ]... [FILE]....HP.B dbacl-V.SH DESCRIPTION.PP.B dbaclis a Bayesian text document classifier, which uses a maximum entropy (minimumdivergence) language model constructed with respect to a digramic reference measure (unknown words are predicted from digrams, i.e. pairs of letters)..PPIf using the .B -lcommand form,.B dbacllearns a category when given one or more FILE names, which should contain readable .SM ASCIItext. If no FILE is given, .B dbacllearns from STDIN. The result is saved in the binary file named .IR "category" ..PPIf using the.B -ccommand form,.B dbaclattempts to classify the text found in FILE, or STDIN if no FILE is given. Each possible .I category must be given separately, and should be the file name of a previously learned text corpus. When several (more than one) categories are specified, the text classification is performed by computing the Bayesianposterior probability for each model, given the input text, and with a uniform prior distribution on categories. .PPWhen only a single categoryis specified, .B dbaclcalculates a score which is the product of the cross entropy and the complexityof the input text (needs the .B -nand .B -voptions). A low score indicates a good fit of the .I categorymodel. The cross entropy measures the average compression rate which is achievable, under the given.IR category ,for the features of the input text..PP By default,.B dbaclwill classify the input text as a whole. However, when using the .B -f option, .B dbaclcan be used to filter each input line separately, printing only thoselines which match one or more models identified by .IR keep ..PPLearning and classifying cannot be mixed on the same command invocation. By default, non-alphabetic characters in the input are always ignored, as is word case,unless one or more regular expressions are supplied..PPIf the .B -woption is given, the default features consist of all n-grams up to .IR max_order ,where each purely alphabetic string, in lower case, is taken as a unit. If .B -wis not given, .B dbaclassumes .I max_orderequals 1. The default feature selection is completely overriden however if the .B -gswitch is encountered..PPWhen supplying one or more regular expressions with the .B -goption, only those features of the input text or texts which match .I regexare analysed (with possible overlap between matches), and only those subexpressions of .I regex which are explicitly tagged are used as features in the model (if severalsubexpressions are tagged simultaneously, the resulting feature is the concatenatedstring of these tagged expressions in order of tag appearance). .PPFor regular exression syntax, see.BR regex (7). When learning with regular expressions, the .B -Doption will print all matched text features..SH EXIT STATUSWhen using the .B -lcommand form,.B dbaclreturns zero. When using the .B -cform, .B dbaclreturns a positive integer corresponding to the .I categorywith the highest posterior probability. In case of a tie, the first most probablecategory is chosen. If an error occurs, .B dbaclreturns zero. .SH OPTIONS.IP -aAppend scores. Every input line is written to STDOUT and the dbacl scores areappended. 
.SH OPTIONS
.IP -a
Append scores. Every input line is written to STDOUT and the dbacl scores are
appended. This is useful for postprocessing with
.BR bayesol (1).
For ease of processing, every original input line is indented by a single
space (to distinguish it from the appended scores), and the line with the
scores (if
.B -n
is used) is prefixed with the string "scores ". If a second copy of
.B dbacl
needs to read this output later, it should be invoked with the
.B -A
switch.
.IP -d
Dump the model parameters to STDOUT. Only useful in conjunction with the
.B -l
option, this produces a human-readable summary of the maximum entropy model.
Suppresses all other output.
.IP -f
Filter each line of input separately, keeping the
.I category
identified as
.IR keep .
This option should be used repeatedly for each
.I category
which must be kept.
.I keep
can be either the
.I category
file name, or a positive integer representing the required
.I category
in the same order it appears on the command line.
.IP -g
Learn only features described by the extended regular expression
.IR regex .
This overrides the default feature selection method (see the
.B -w
option) and learns, for each line of input, only tokens constructed from the
concatenation of strings which match the tagged subexpressions within the
supplied
.IR regex .
All substrings which match
.I regex
within a suffix of each input line are treated as features, even if they
overlap on the input line.
.IP
As an optional convenience,
.I regex
can include the suffix
.I ||xyz
which indicates which parenthesized subexpressions should be tagged. In this
case,
.I xyz
should consist exclusively of digits 1 to 9, numbering exactly those
subexpressions which should be tagged.
.IP -h
Set the size of the hash table to 2^\fIsize\fP elements. When using the
.B -l
option, this refers to the total number of features allowed in the maximum
entropy model being learned. When using the
.B -c
option together with the
.B -M
switch and multinomial type categories, this refers to the maximum number of
features taken into account during classification. Without the
.B -M
switch, this option has no effect.
.IP -i
Fully internationalized mode. Forces the use of wide characters internally,
which is necessary in some locales. This incurs a noticeable performance
penalty.
.IP -j
Make features case sensitive. Only meaningful in conjunction with the
.B -g
option.
.IP -n
Print scores for each
.IR category .
Each score is the product of two numbers, the cross entropy and the
complexity of the input text under each model. Multiplied together, they
represent the log probability that the input resembles the model. To see
these numbers separately, also use the
.B -v
option. In conjunction with the
.B -f
option, stops filtering but prints each input line prepended with a list of
scores for that line.
.IP -r
Learn the digramic reference model only. Skips the learning of extra features
in the text corpus.
.IP -v
Verbose mode. When learning, print out details of the computation; when
classifying, print out the name of the most probable
.IR category .
In conjunction with the
.B -n
option, prints the scores as an explicit product of the cross entropy and the
complexity.
.IP -w
Select default features to be n-grams up to
.IR max_order .
This is incompatible with the
.B -g
option, which always takes precedence. If no
.B -w
or
.B -g
options are given,
.B dbacl
assumes
.B -w 1.
.IP -x
Set the feature decimation probability to 1 - 2^(\fI-decim\fP). To reduce
memory requirements when learning, features are added to the model only with
probability 2^(\fI-decim\fP). Use with caution (best with unigram models).
.IP -A
Expect indented input and scores. With this switch,
.B dbacl
expects input lines to be indented by a single space character (which is then
skipped). Lines starting with any other character are ignored. This is the
counterpart to the
.B -a
switch above. When used together with the
.B -a
switch,
.B dbacl
outputs the skipped lines as they are, and reinserts the space at the front of
each processed input line.
.IP -D
Print debug output. Not intended for normal use, but can be helpful in
conjunction with the
.B -g
option (includes a list of model features as they are discovered).
.IP -H
Allow the hash table to grow up to a maximum of 2^\fIgsize\fP elements during
learning. The initial size is given by the
.B -h
option (see the example at the end of this section).
.IP -M
Force multinomial calculations. When learning, forces the model features to be
treated multinomially. When classifying, corrects entropy scores to reflect
multinomial probabilities (only applicable to multinomial type models, if
present). Scores will always be lower, because the ordering of features is
lost.
.IP -N
Print posterior probabilities for each
.IR category .
This assumes the supplied categories form an exhaustive list of possibilities.
In conjunction with the
.B -f
option, stops filtering but prints each input line prepended with a summary of
the posterior distribution for that line.
.IP -R
Include an extra category for purely random text. The category is called
"random".
.IP -T
Specify a nonstandard text format. By default,
.B dbacl
assumes that the input text is a purely
.SM ASCII
text file. This corresponds to the case when
.I type
is "text". If
.I type
is "email", then
.B dbacl
processes the input text as a
.SM BSD
mbox format file, skipping mail headers and non-textual
.SM MIME
attachments, as well as
.SM HTML
markup. When
.I type
is "xml",
.B dbacl
skips any
.SM XML
markup in the input.
.IP -V
Print the program version number and exit.
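.PP
As an illustration of the
.B -h
and
.B -H
sizing options above, the following sketch (the category name big and the
corpus file large_corpus.txt are placeholders) starts learning with a hash
table of 2^20 features and allows it to grow to at most 2^23 features:
.PP
.na
% dbacl -l big -h 20 -H 23 large_corpus.txt
.ad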
.SH USAGE
.PP
To create two category files in the current directory from two
.SM ASCII
text files named Mark_Twain.txt and William_Shakespeare.txt respectively,
type:
.PP
.na
% dbacl -l twain Mark_Twain.txt
.br
% dbacl -l shake William_Shakespeare.txt
.ad
.PP
Now you can classify input text, for example:
.PP
.na
% echo "howdy" | dbacl -v -c twain -c shake
.br
twain
.br
% echo "to be or not to be" | dbacl -v -c twain -c shake
.br
shake
.ad
.PP
Note that the
.B -v
option is necessary, otherwise
.B dbacl
does not print anything. The return value is 1 in the first case, 2 in the
second.
.PP
Suppose a file document.txt contains English text lines interspersed with
noise lines. To filter out the noise lines from the English lines, assuming
you have an existing category shake, say, type:
.PP
.na
% dbacl -c shake -f shake -R document.txt > document.txt_eng
.br
% dbacl -c shake -f random -R document.txt > document.txt_rnd
.ad
.PP
Note that the quality of the results will vary depending on how well the
categories shake and random represent each input line. It is sometimes useful
to see the posterior probabilities for each line without filtering:
.PP
.na
% dbacl -c shake -f shake -RN document.txt > document.txt_probs
.ad
.PP
You can now postprocess the posterior probabilities for each line of text with
another script, to replicate an arbitrary Bayesian decision rule of your
choice.
.PP
In the special case of exactly two categories, the optimal Bayesian decision
procedure can be implemented for documents as follows: let
.I p1
be the prior probability that the input text is classified as
.IR category1 .
Consequently, the prior probability of classifying as
.I category2
is 1 -
.IR p1 .
Let
.I u12
be the cost of misclassifying a
.I category1
input text as belonging to
.IR category2 ,
and vice versa for
.IR u21 .
We assume there is no cost for classifying correctly. Then the following
command implements the optimal Bayesian decision:
.HP
.na
% dbacl -n -c
.I category1
-c
.I category2
| awk '{ if($2 *
.I p1
*
.I u12
> $4 * (1 -
.IR p1 )
*
.IR u21 )
{ print $1; } else { print $3; } }'
.ad
.PP
.B dbacl
can also be used in conjunction with
.BR procmail (1)
to implement a simple Bayesian email classification system. Assume that
incoming mail should be automatically delivered to one of three mail folders
located in $MAILDIR and named
.IR work ,
.IR personal ,
and
.IR spam .
Initially, these must be created and filled with appropriate sample emails. A
.BR crontab (1)
file can be used to learn the three categories once a day, e.g.
.PP
.na
CATS=$HOME/.dbacl
.br
5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
.br
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
.br
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
.ad
.PP
To automatically deliver each incoming email into the appropriate folder, the
following
.BR procmailrc (5)
recipe fragment could be used:
.PP
.na
CATS=$HOME/.dbacl
.ad
.PP
.na
# run the spam classifier
.br
:0 c
.br
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam
.ad
.PP
.na
# send to the appropriate mailbox
.br
:0:
.br
* ? test -n "$YAY"
.br
$MAILDIR/$YAY
.ad
.PP
.na
:0:
.br
$DEFAULT
.ad
.PP
Sometimes,
.B dbacl
will send the email to the wrong mailbox. In that case, the misclassified
message should be removed from its wrong destination and placed in the
correct mailbox. If it is left in the wrong category,
.B dbacl
will learn the wrong corpus statistics.
.PP
It is possible to override the default feature selection method used to learn
the category model by means of regular expressions. For example, the
following command duplicates the default feature selection method in the C
locale, while being much slower:
.HP
.na
% dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt
.ad
.PP
The category twain which is obtained depends only on single alphabetic words
in the text file Mark_Twain.txt (and computed digram statistics for
prediction). For a second example, the following command builds a smoothed
Markovian (word bigram) model which depends on pairs of consecutive words
within each line (but pairs cannot straddle a line break):
.HP
.na
% dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt
.ad
.PP
More generally, line-based n-gram models of all orders can be built in a
similar way. To construct paragraph-based models, you should reformat the
input corpora with
.BR awk (1)
or
.BR sed (1)
to obtain one paragraph per line, as sketched below. Line size is limited by
available memory, but note that regex performance will degrade quickly for
long lines.
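.PP
As a minimal sketch of such a reformatting step (the category name twain_para
is a placeholder, and an equivalent
.BR sed (1)
script would do just as well), the following pipeline joins each blank-line
separated paragraph of the corpus onto a single line, then learns a word
bigram model from the result:
.PP
.na
% awk 'BEGIN { RS="" } { $1 = $1; print }' Mark_Twain.txt | dbacl -l twain_para -w 2
.ad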
.SH ENVIRONMENT
.PP
.IP DBACL_PATH
When this variable is set, its value is prepended to every
.I category
file name which doesn't start with a '/'.
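.PP
For example, with Bourne-type shell syntax, and assuming the categories were
previously learned into the placeholder directory $HOME/.dbacl (message.txt
is again a placeholder input file), setting the variable lets categories be
named without their full paths:
.PP
.na
% export DBACL_PATH=$HOME/.dbacl
.br
% dbacl -v -c work -c personal -c spam message.txt
.ad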
.SH NOTES
.PP
.B dbacl
does not recognize functionally equivalent regular expressions, and in this
case duplicate features will be counted several times. Specific documentation
about the design of
.B dbacl
and the statistical models it uses can be found in
/usr/share/dbacl/doc/dbacl.ps, or see /usr/share/dbacl/doc/tutorial.html for a
basic overview.
.SH BUGS
.PP
"Ya know, some day scientists are gonna invent something that will outsmart
a rabbit." (Robot Rabbit, 1953)
.SH SOURCE
.PP
The source code for the latest version of this program is available at
http://www.lbreyer.com/gpl.html
.SH AUTHOR
.PP
Laird A. Breyer <laird@lbreyer.com>
.SH SEE ALSO
.PP
.BR awk (1),
.BR bayesol (1),
.BR crontab (1),
.BR mailcross (1),
.BR mailinspect (1),
.BR procmailex (5),
.BR regex (7),
.BR sed (1)
