readme.randomtrain
It seems that training bogofilter on its errors _only_ is a very good
way to train, at least with the Robinson-Fisher or Bayes chain rule
calculation methods. The way this works is: messages from the training
corpus are picked at random (without replacement, i.e. no message is
used more than once) and fed to bogofilter for evaluation. If
bogofilter gets the classification right, nothing further is done. If
it's wrong (or uncertain, when ternary mode is in use), the message is
fed to bogofilter again with the -s or -n option, as appropriate.

That's all very well, except that it's not an easy process to execute
with just a couple of shell commands. I've now written a bash script
that does the job [Matthias Andree: this has been changed so it may now
work on a regular POSIX-compliant sh, too; feedback is welcome]; you
give it the directory in which to build the bogofilter database and a
list of files flagged with either -s or -n to indicate spam or nonspam,
and it performs training-on-error using all the messages in all the
files in random order.

My production version of bogofilter returns the following exit codes:

    0 for spam
    1 for nonspam
    2 for uncertain
    3 for error

Normal bogofilter returns (I think):

    0 for spam
    1 for nonspam
    2 for error

This script will work with either.

You can use it to build from scratch; the first message evaluated will
return the error exit code, and randomtrain (as this script is called)
will train with that message, thus creating the databases.

The script needs rather a lot of auxiliary commands (they're listed in
the comments at the top of the file); in particular, perl is called for
the randomization function. (The embedded perl script is "useful" in
its own right: it takes text on standard input and returns the lines in
random sequence.) Known portability issue: on HP-UX (10.20 at least),
grep -b returns a block offset instead of a byte offset, so randomtrain
won't work unless GNU grep is substituted for the HP-UX one.

I rebuilt my training lists with randomtrain. The training corpus
consists of 9878 spams and 7896 nonspams. The message counts from
bogoutil -w bogodir are 1475 and 408. The database sizes from full
training were 10 and 4 Mb; randomtrain produced .db files of 7 and
1.2 Mb. I don't yet have figures comparing discrimination by bogofilter
with these two training sets, but yesterday's smaller-scale test (which
motivated me to write this script) clearly indicated that an
improvement could be expected.

Greg Louis <glouis@dynamicro.on.ca>
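
As an illustration of the procedure described above, here is a minimal
sketch of the training-on-error loop; it is not the randomtrain script
itself. It assumes one message per file under two hypothetical
directories, spam/ and nonspam/, and a writable database directory
./bogodir, whereas the real script works on mbox files; and it uses
perl's List::Util shuffle for the randomization rather than the perl
code embedded in randomtrain.

    #!/bin/sh
    # Sketch only: train-on-error over single-message files.
    BOGODIR=./bogodir          # hypothetical database directory
    mkdir -p "$BOGODIR"

    # Emit one "flag filename" pair per message, then shuffle the
    # pairs into random order (perl supplies the randomization, as
    # in the real script).
    { for f in spam/*;    do echo "-s $f"; done
      for f in nonspam/*; do echo "-n $f"; done
    } | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' |
    while read flag msg; do
        bogofilter -d "$BOGODIR" < "$msg"
        rc=$?
        # -s with exit 0, or -n with exit 1, means bogofilter was
        # right, so nothing further is done. Anything else (wrong,
        # uncertain, or error, under either exit-code convention)
        # registers the message. The error path also covers the very
        # first message, which is what creates the database.
        case "$flag:$rc" in
            -s:0|-n:1) : ;;
            *) bogofilter -d "$BOGODIR" "$flag" < "$msg" ;;
        esac
    done

Note that nothing is registered when bogofilter classifies a message
correctly; that is the whole point of training on error, and it is why
the wordlists reported above come out so much smaller than the ones
full training produces. The shuffle one-liner can also stand alone, as
the author says of his embedded version:
perl -MList::Util=shuffle -e 'print shuffle(<>);' file
prints the file's lines in random sequence.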