📄 bogofilter.1in

.\" ** You probably do not want to edit this file directly **
.\" It was generated using the DocBook XSL Stylesheets (version 1.69.1).
.\" Instead of manually editing it, you probably should edit the DocBook XML
.\" source for it and then use the DocBook XSL Stylesheets to regenerate it.
.TH "BOGOFILTER" "1" "08/09/2006" "" ""
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.SH "NAME"
bogofilter \- fast Bayesian spam filter
.SH "SYNOPSIS"
.HP 11
\fBbogofilter\fR [help\ options classification\ options registration\ options parameter\ options info\ options] [general\ options] [config\ file\ options]
.PP
where
.PP
\fBhelp options\fR
are:
.HP 1
[\-h] [\-\-help] [\-V] [\-Q]
.PP
\fBclassification options\fR
are:
.HP 1
[\-p] [\-e] [\-t] [\-T] [\-u] [\-H] [\-M] [\-b] [\-B\ \fIobject\ ...\fR] [\-R] [general\ options] [parameter\ options] [config\ file\ options]
.PP
\fBregistration options\fR
are:
.HP 1
[\-s \-n] [\-S \-N] [general\ options]
.PP
\fBgeneral options\fR
are:
.HP 1
[\-c\ \fIfilename\fR] [\-C] [\-d\ \fIdir\fR] [\-k\ \fIcachesize\fR] [\-l] [\-L\ \fItag\fR] [\-I\ \fIfilename\fR] [\-O\ \fIfilename\fR]
.PP
\fBparameter options\fR
are:
.HP 1
[\-E\ \fIvalue\fR\fI[,value]\fR] [\-m\ \fIvalue\fR\fI[,value]\fR\fI[,value]\fR] [\-o\ \fIvalue\fR\fI[,value]\fR]
.PP
\fBinfo options\fR
are:
.HP 1
[\-v] [\-y\ \fIdate\fR] [\-D] [\-x\ \fIflags\fR]
.PP
\fBconfig file options\fR
are:
.HP 1
[\-\-\fIoption=value\fR]
.PP
Note: Use
\fBbogofilter \-\-help\fR
to display the complete list of options.
.SH "DESCRIPTION"
.PP
Bogofilter
is a Bayesian spam filter.
In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code indicating whether or not the message is spam.
Bogofilter
is designed with a fast algorithm, uses the Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used for production by sites that process a lot of mail.
.SH "THEORY OF OPERATION"
.PP
Bogofilter
treats its input as a bag of tokens. Each token is checked against a wordlist, which maintains counts of the number of times it has occurred in non\-spam and spam mails. These numbers are used to compute an estimate of the probability that a message in which the token occurs is spam. Those estimates are combined to indicate whether the message is spam or ham.
.PP
While this method sounds crude compared to the more usual pattern\-matching approach, it turns out to be extremely effective. Paul Graham's paper
[1]\&\fI A Plan For Spam\fR
is recommended reading.
.PP
This program substantially improves on Paul's proposal by doing smarter lexical analysis.
Bogofilter
does proper MIME decoding and reasonable HTML parsing. Special kinds of tokens like hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message\-IDs are ignored so as not to bloat the wordlist. Tokens found in various header fields are marked appropriately.
.PP
Another improvement is that this program offers Gary Robinson's suggested modifications to the calculations (see the parameters robx and robs below).
These modifications are described in Robinson's paper
[2]\&\fISpam Detection\fR.
.PP
Since then, Robinson (see his Linux Journal article
[3]\&\fIA Statistical Approach to the Spam Problem\fR) and others have realized that the calculation can be further optimized using Fisher's method.
[4]\&\fIAnother improvement\fR
compensates for token redundancy by applying separate effective size factors (ESF) to spam and nonspam probability calculations.
.PP
In short, this is how it works: The estimates for the spam probabilities of the individual tokens are combined using the "inverse chi\-square function". Its value indicates how badly the null hypothesis fails, namely that the message is just a random collection of independent words with probabilities given by our previous estimates. This function is very sensitive to small probabilities (hammish words), but not to high probabilities (spammish words); so the value only indicates strong hammish signs in a message. Now using inverse probabilities for the tokens, the same computation is done again, giving an indicator that a message looks strongly spammish. Finally, those two indicators are subtracted (and scaled into a 0\-1 interval). This combined indicator (bogosity) is close to 0 if the signs for a hammish message are stronger than for a spammish message and close to 1 if the situation is the other way round. If signs for both are equally strong, the value will be near 0.5. Since those messages don't give a clear indication, there is a tristate mode in
bogofilter
to mark those messages as unsure, while the clear messages are marked as spam or ham, respectively. In two\-state mode, every message is marked as either spam or ham.
.PP
Various parameters influence these calculations; the most important are:
.PP
robx: the score given to a token which has not been seen before.
robx is the probability that the token is spammish.
.PP
robs: a weight on robx which moves the probability of a rarely seen token towards robx.
.PP
min\-dev: a minimum distance from 0.5 for tokens to use in the calculation. Only tokens farther away from 0.5 than this value are used.
.PP
spam\-cutoff: messages with scores greater than or equal to this value will be marked as spam.
.PP
ham\-cutoff: If set to zero or to spam\-cutoff, all messages with values strictly below spam\-cutoff are marked as ham, all others as spam (two\-state). Otherwise, values less than or equal to ham\-cutoff are marked as ham, messages with values strictly between ham\-cutoff and spam\-cutoff are marked as unsure, and the rest as spam (tristate).
.PP
sp\-esf: the effective size factor (ESF) for spam.
.PP
ns\-esf: the ESF for nonspam. These ESF values default to 1.0, which is the same as not using ESF in the calculation. Values suitable to a user's email population can be determined with the aid of the
bogotune
program.
.SH "OPTIONS"
.PP
HELP OPTIONS
.PP
The
\fB\-h\fR
option prints the help message and exits.
.PP
The
\fB\-V\fR
option prints the version number and exits.
.PP
The
\fB\-Q\fR
(query) option prints
bogofilter's configuration, i.e. registration parameters, parsing options,
bogofilter
directory, etc.
.PP
CLASSIFICATION OPTIONS
.PP
The
\fB\-p\fR
(passthrough) option outputs the message with an X\-Bogosity line at the end of the message header. This requires keeping the entire message in memory when it's read from stdin (or from a pipe or socket). If the message is read from a file that can be rewound,
bogofilter
will read it a second time.
.PP
The
\fB\-e\fR
(embed) option tells
bogofilter
to exit with code 0 if the message can be classified, i.e. if there is no error. Normally
bogofilter
uses different codes for spam, ham, and unsure classifications, but this simplifies using
bogofilter
with
procmail
or
maildrop.
.PP
The
\fB\-t\fR
(terse) option tells
bogofilter
to print an abbreviated spamicity message containing 1 letter and the score.
Spam is indicated with "Y", ham with "N", and unsure with "U". Note: the formatting can be customized using the config file.
.PP
The
\fB\-T\fR
option provides an invariant terse mode for scripts to use.
bogofilter
will print an abbreviated spamicity message containing 1 letter and the score. Spam is indicated with "S", ham with "H", and unsure with "U".
.PP
The
\fB\-TT\fR
option provides an invariant terse mode for scripts to use.
Bogofilter
prints only the score and displays it to 16 significant digits.
.PP
The
\fB\-u\fR
option tells
bogofilter
to register the message's text after classifying it as spam or non\-spam. A spam message will be registered on the spamlist and a non\-spam message on the goodlist. If the classification is "unsure", the message will not be registered. Effectively this option runs
bogofilter
with the
\fB\-s\fR
or
\fB\-n\fR
flag, as appropriate. Caution is urged in the use of this capability, as any classification errors
bogofilter
may make will be preserved and will accumulate until manually corrected with the
\fB\-Sn\fR
and
\fB\-Ns\fR
option combinations. Note that this option causes the database to be opened for write access, which can entail massive slowdowns through lock contention and synchronous I/O operations.
.PP
The
\fB\-H\fR
option tells
bogofilter
to not tag tokens from the header. This option is for testing; you should not use it in normal operation.
.PP
The
\fB\-M\fR
option tells
bogofilter
to process its input as an mbox\-formatted file. If the
\fB\-v\fR
or
\fB\-t\fR
option is also given, a spamicity line will be printed for each message.
.PP
The
\fB\-b\fR
(streaming bulk mode) option tells
bogofilter
to classify multiple objects whose names are read from stdin. If the
\fB\-v\fR
or
\fB\-t\fR
option is also given,
bogofilter
will print a line giving file name and classification information for each file.
This is an alternative to
\fB\-B\fR,
which lists objects on the command line.
.PP
An object in this context shall be a maildir (autodetected), or if it's not a maildir, a single mail unless
\fB\-M\fR
is given; in that case it's processed as mbox. (The Content\-Length: header is not taken into account currently.)
.PP
When reading mbox format,
bogofilter
relies on the empty line after a mail. If needed,
\fBformail \-es\fR
will ensure this is the case.
.PP
The
\fB\-B \fR\fB\fIobject ...\fR\fR
(bulk mode) option tells
bogofilter
to classify multiple objects named on the command line. The objects may be filenames (for single messages), mailboxes (files with multiple messages), or directories (of maildir and MH format). If the
\fB\-v\fR
or
\fB\-t\fR
option is also given,
bogofilter
will print a line giving file name and classification information for each file. This is an alternative to
\fB\-b\fR,
which lists objects on stdin.
.PP
The
\fB\-R\fR
option tells
bogofilter
to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail.
.PP
REGISTRATION OPTIONS
.PP
The
\fB\-s\fR
option tells
bogofilter
to register the text presented as spam. The database is created if absent.
.PP
The
\fB\-n\fR
option tells
bogofilter
to register the text presented as non\-spam.
.PP
Bogofilter
doesn't detect if a message is registered twice. If you do this by accident, the token counts will be off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of tokens and messages in the wordlist, this doesn't matter. The problem
\fIcan\fR
be corrected by using the
\fB\-S\fR
option or the
\fB\-N\fR
option.
.PP
The
\fB\-S\fR
option tells
bogofilter
to undo a prior registration of the same message as spam. If a message was incorrectly entered as spam by
\fB\-s\fR
or
\fB\-u\fR
and you want to remove it and enter it as non\-spam, use
\fB\-Sn\fR.
If
\fB\-S\fR
is used for a message that wasn't registered as spam, the counts will still be decremented.
.PP
The
\fB\-N\fR
option tells
bogofilter
to undo a prior registration of the same message as non\-spam. If a message was incorrectly entered as non\-spam by
\fB\-n\fR
or
\fB\-u\fR
and you want to remove it and enter it as spam, then use
\fB\-Ns\fR. If
\fB\-N\fR
is used for a message that wasn't registered as non\-spam, the counts will still be decremented.
.PP
GENERAL OPTIONS
.PP
The
\fB\-c \fR\fB\fIfilename\fR\fR
option tells
bogofilter
to read the named config file.
.PP
The
\fB\-C\fR
option prevents
bogofilter
from reading configuration files.
.PP
The
\fB\-d \fR\fB\fIdir\fR\fR
option allows you to set the directory for the database. See the ENVIRONMENT section for other directory setting options.
.PP
The
\fB\-k \fR\fB\fIcachesize\fR\fR
option sets the cache size for the BerkeleyDB subsystem, in units of 1 MiB (1,048,576 bytes). Properly sizing the cache improves
bogofilter's performance. The recommended size is one third of the size of the database file. You can run the
bogotune
script (in the tuning directory) to determine the recommended size.
.PP
The
\fB\-l\fR
option writes an informational line to the system log each time
bogofilter
is run. The information logged depends on how
bogofilter
is run.
.PP
The
\fB\-L \fR\fB\fItag\fR\fR
option configures a tag which can be included in the information being logged by the
\fB\-l\fR
option, but it requires a custom format that includes the %l string for now. This option implies
\fB\-l\fR.
.PP
The
\fB\-I \fR\fB\fIfilename\fR\fR
option tells
bogofilter
to read its input from the specified file, rather than from
\fBstdin\fR.
.PP
The
\fB\-O \fR\fB\fIfilename\fR\fR
option tells
bogofilter
