\" t
.TH MAILCROSS 1 "Bayesian Text Classification Tools" "Version 1.3" ""
.SH NAME
mailcross \- a cross-validation tester for use with dbacl.
.SH SYNOPSIS
.HP
.B mailcross
.I command
[
.I command_arguments
]
.SH DESCRIPTION
.PP
.B mailcross
automates the task of cross-validating email filtering and classification
programs such as
.BR dbacl (1).
Given a set of categorized documents, mailcross initiates test runs to
estimate the classification errors and thereby permit fine tuning of the
parameters of the classifier.
.PP
Cross-validation is a method which is widely used to compare the quality of
classification and learning algorithms, and therefore permits rudimentary
comparisons between those classifiers which make use of
.BR dbacl (1)
and
.BR bayesol (1),
and other competing classifiers.
.PP
The mechanics of cross-validation are simple: A set of pre-classified email
messages is first split into a number of roughly equal-sized subsets.
For each subset, the filter (by default,
.BR dbacl (1))
is used to classify each message within this subset, based upon having
learned the categories from the remaining subsets. The resulting
classification errors are then averaged over all subsets.
.PP
.B mailcross
uses the environment variables MAILCROSS_LEARNER and MAILCROSS_FILTER when
executing, which permits the cross-validation of arbitrary filters, provided
these satisfy the compatibility conditions stated in the ENVIRONMENT section below.
.PP
During preparation,
.B mailcross
builds a subdirectory named mailcross.d in the current working directory.
All needed calculations are performed inside this subdirectory.
.SH EXIT STATUS
.B mailcross
returns 0 on success, 1 if a problem occurred.
.SH COMMANDS
.PP
.IP "\fBprepare\fR \fIsize\fR"
Prepares a subdirectory named mailcross.d in the current working directory, and
populates it with empty subdirectories for exactly
.I size
subsets.
.IP "\fBadd\fR \fIcategory\fR [FILE]..."
Takes a set of emails from either FILE if specified, or STDIN, and associates
them with
.IR category .
All emails are distributed randomly into the subdirectories of mailcross.d
for later use. For each
.IR category ,
this command can be repeated several times, but should be executed at least once.
.IP "\fBclean\fR"
Deletes the directory mailcross.d and all its contents.
.IP "\fBlearn\fR"
For every previously built subset of email messages, pre-learns all the
categories based on the contents of all the subsets except this one.
The
.I command_arguments
are passed to MAILCROSS_LEARNER.
.IP "\fBrun\fR"
For every previously built subset of email messages, performs the classification
based upon the pre-learned categories associated with this subset.
The
.I command_arguments
are passed to MAILCROSS_FILTER.
.IP "\fBsummarize\fR"
Prints statistics for the latest cross-validation run.
.SH USAGE
.PP
The normal usage pattern is the following: first, you should separate your email
collection into several categories (manually or otherwise). Each category should
be associated with one or more folders, but no folder should contain more
than one category. Next, you should decide how many subsets to use, say 10.
Note that too many subsets will slow down the calculations considerably.
Now you can type
.HP
.na
% mailcross prepare 10
.ad
.PP
Next, for every category, you must add every folder associated with this
category. Suppose you have three categories named
.IR spam ,
.IR work ,
and
.IR play ,
which are associated with the
.SM BSD
mbox files
.IR spam.mbox ,
.IR work.mbox ,
and
.IR play.mbox
respectively.
You would type
.PP
.na
% mailcross add spam spam.mbox
.br
% mailcross add work work.mbox
.br
% mailcross add play play.mbox
.ad
.PP
You can now perform the cross-validation. Note that the learning stage can take
some time. Typically, you would learn once, and run several times with
varying risk parameters in case you use
.BR bayesol (1).
.PP
.na
% mailcross learn
.br
% mailcross run
.br
% mailcross summarize
.ad
.PP
Once you are all done cross validating, you can delete the working files, log
files etc. by typing
.PP
.na
% mailcross clean
.ad
.SH ENVIRONMENT
.PP
Right after loading,
.B mailcross
reads the hidden file .mailcrossrc in the $HOME directory, if it exists, so
this would be a good place to define custom values for environment variables.
.IP MAILCROSS_FILTER
This variable contains a shell command to be executed repeatedly
during the running stage.
The command should accept an email message on STDIN and output a resulting
category name. It should also accept a list of category file names
on the command line, each preceded by the string "-c". If undefined,
.B mailcross
uses the default value
MAILCROSS_FILTER="dbacl -T email -v".
.IP MAILCROSS_LEARNER
This variable contains a shell command to be executed repeatedly during the
learning stage. The command should accept a
.SM BSD
mbox type stream of emails
on STDIN for learning, and the file name of the category on the command line.
If undefined,
.B mailcross
uses the default value
MAILCROSS_LEARNER="dbacl -T email -l".
.SH NOTES
.PP
The subdirectory mailcross.d can grow quite large. It contains a full copy of
the training corpora, as well as learning files for
.I size
times all the added categories, and various log files.
.SH WARNING
.PP
Cross-validation is a widely used, but ad-hoc statistical procedure, completely
unrelated to Bayesian theory, and subject to controversy. Use this at your own risk.
.SH SOURCE
.PP
The source code for the latest version of this program is available at
http://www.lbreyer.com/gpl.html
.SH AUTHOR
.PP
Laird A.
Breyer <laird@lbreyer.com>
.SH SEE ALSO
.PP
.BR bayesol (1),
.BR dbacl (1),
.BR mailinspect (1),
.BR regex (7)
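.SH EXAMPLES
.PP
The following is a minimal sketch of a .mailcrossrc file, offered as an
illustration rather than as part of the
.B mailcross
distribution. It assumes the file is read using Bourne shell assignment
syntax, which this manual does not state explicitly, and it only restates the
two documented default commands so that they can be edited in one place.
.PP
.nf
# $HOME/.mailcrossrc (illustrative sketch; shell syntax is an assumption)

# Learning stage: the command receives an mbox stream on STDIN and the
# category file name on the command line. This is the documented default.
MAILCROSS_LEARNER="dbacl -T email -l"

# Running stage: the command receives one message on STDIN and the
# category file names, each preceded by -c. This is the documented default.
MAILCROSS_FILTER="dbacl -T email -v"
.fi
.PP
Any command substituted for these defaults should satisfy the conditions
listed in the ENVIRONMENT section.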