<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>bogofilter</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="refentry" lang="en"><a name="bogofilter.1"></a><div class="titlepage"></div><div class="refnamediv"><a name="name"></a><h2>Name</h2><p>bogofilter — fast Bayesian spam filter</p></div><div class="refsynopsisdiv"><a name="synopsis"></a><h2>Synopsis</h2><div class="cmdsynopsis"><p><code class="command">bogofilter</code> [ help options | classification options | registration options | parameter options | info options ] [general options] [config file options]</p></div><p>where</p><p><code class="option">help options</code> are:</p><div class="cmdsynopsis"><p>[-h] [--help] [-V] [-Q]</p></div><p><code class="option">classification options</code> are:</p><div class="cmdsynopsis"><p>[-p] [-e] [-t] [-T] [-u] [-H] [-M] [-b] [-B <em class="replaceable"><code>object ...</code></em>] [-R] [general options] [parameter options] [config file options]</p></div><p><code class="option">registration options</code> are:</p><div class="cmdsynopsis"><p>[ -s | -n ] [ -S | -N ] [general options]</p></div><p><code class="option">general options</code> are:</p><div class="cmdsynopsis"><p>[-c <em class="replaceable"><code>filename</code></em>] [-C] [-d <em class="replaceable"><code>dir</code></em>] [-k <em class="replaceable"><code>cachesize</code></em>] [-l] [-L <em class="replaceable"><code>tag</code></em>] [-I <em class="replaceable"><code>filename</code></em>] [-O <em class="replaceable"><code>filename</code></em>]</p></div><p><code class="option">parameter options</code> are:</p><div class="cmdsynopsis"><p>[-E <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em>] [-m <em class="replaceable"><code>value[<span class="optional">,value</span>][<span 
class="optional">,value</span>]</code></em>] [-o <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em>]</p></div><p><code class="option">info options</code> are:</p><div class="cmdsynopsis"><p>[-v] [-y <em class="replaceable"><code>date</code></em>] [-D] [-x <em class="replaceable"><code>flags</code></em>]</p></div><p><code class="option">config file options</code> are:</p><div class="cmdsynopsis"><p>[--<em class="replaceable"><code>option=value</code></em>]</p></div><p>Note: Use <span><strong class="command">bogofilter --help</strong></span> to display the complete list of options.</p></div><div class="refsect1" lang="en"><a name="description"></a><h2>DESCRIPTION</h2><p><span class="application">Bogofilter</span> is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code indicating whether or not the message is spam. <span class="application">Bogofilter</span> is designed with a fast algorithm, uses the Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used for production by sites that process a lot of mail.</p></div><div class="refsect1" lang="en"><a name="theory"></a><h2>THEORY OF OPERATION</h2><p><span class="application">Bogofilter</span> treats its input as a bag of tokens. Each token is checked against a wordlist, which maintains counts of the number of times it has occurred in non-spam and spam mails. These numbers are used to compute an estimate of the probability that a message in which the token occurs is spam.
Those are combined to indicate whether the message is spam or ham.</p><p>While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper <a href="http://www.paulgraham.com/spam.html" target="_top">A Plan For Spam</a> is recommended reading.</p><p>This program substantially improves on Paul's proposal by doing smarter lexical analysis. <span class="application">Bogofilter</span> does proper MIME decoding and reasonable HTML parsing. Special kinds of tokens like hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are ignored so as not to bloat the wordlist. Tokens found in various header fields are marked appropriately.</p><p>Another improvement is that this program offers Gary Robinson's suggested modifications to the calculations (see the parameters robx and robs below). These modifications are described in Robinson's paper <a href="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html" target="_top">SpamDetection</a>.</p><p>Since then, Robinson (see his Linux Journal article <a href="http://www.linuxjournal.com/article/6467" target="_top">A Statistical Approach to the Spam Problem</a>) and others have realized that the calculation can be further optimized using Fisher's method. <a href="http://www.garyrobinson.net/2004/04/improved_chi.html" target="_top">Another improvement</a> compensates for token redundancy by applying separate effective size factors (ESF) to spam and nonspam probability calculations.</p><p>In short, this is how it works: The estimates for the spam probabilities of the individual tokens are combined using the "inverse chi-square function". Its value indicates how badly the null hypothesis that the message is just a random collection of independent words with probabilities given by our previous estimates fails.
This function is very sensitive to small probabilities (hammish words), but not to high probabilities (spammish words); so the value only indicates strong hammish signs in a message. Now using inverse probabilities for the tokens, the same computation is done again, giving an indicator that a message looks strongly spammish. Finally, those two indicators are subtracted (and scaled into a 0-1 interval). This combined indicator (bogosity) is close to 0 if the signs for a hammish message are stronger than for a spammish message and close to 1 if the situation is the other way round. If signs for both are equally strong, the value will be near 0.5. Since those messages don't give a clear indication, there is a tristate mode in <span class="application">bogofilter</span> to mark those messages as unsure, while the clear messages are marked as spam or ham, respectively. In two-state mode, every message is marked as either spam or ham.</p><p>Various parameters influence these calculations; the most important are:</p><p>robx: the score given to a token which has not been seen before. robx is the probability that the token is spammish.</p><p>robs: a weight on robx which moves the probability of a little-seen token towards robx.</p><p>min-dev: a minimum distance from 0.5 for tokens to use in the calculation. Only tokens farther away from 0.5 than this value are used.</p><p>spam-cutoff: messages with scores greater than or equal to spam-cutoff will be marked as spam.</p><p>ham-cutoff: If zero or spam-cutoff, all messages with values strictly below spam-cutoff are marked as ham, all others as spam (two-state). Else values less than or equal to ham-cutoff are marked as ham, messages with values strictly between ham-cutoff and spam-cutoff are marked as unsure; the rest as spam (tristate).</p><p>sp-esf: the effective size factor (ESF) for spam.</p><p>ns-esf: the ESF for nonspam. These ESF values default to 1.0, which is the same as not using ESF in the calculation.
Values suitable to a user's email population can be determined with the aid of the <span class="application">bogotune</span> program.</p></div><div class="refsect1" lang="en"><a name="options"></a><h2>OPTIONS</h2><p>HELP OPTIONS</p><p>The <code class="option">-h</code> option prints the help message and exits.</p><p>The <code class="option">-V</code> option prints the version number and exits.</p><p>The <code class="option">-Q</code> (query) option prints <span class="application">bogofilter</span>'s configuration, i.e. registration parameters, parsing options, <span class="application">bogofilter</span> directory, etc.</p><p>CLASSIFICATION OPTIONS</p><p>The <code class="option">-p</code> (passthrough) option outputs the message with an X-Bogosity line at the end of the message header. This requires keeping the entire message in memory when it's read from stdin (or from a pipe or socket). If the message is read from a file that can be rewound, <span class="application">bogofilter</span> will read it a second time.</p><p>The <code class="option">-e</code> (embed) option tells <span class="application">bogofilter</span> to exit with code 0 if the message can be classified, i.e. if there is no error. Normally <span class="application">bogofilter</span> uses different codes for spam, ham, and unsure classifications, but this simplifies using <span class="application">bogofilter</span> with <span class="application">procmail</span> or <span class="application">maildrop</span>.</p><p>The <code class="option">-t</code> (terse) option tells <span class="application">bogofilter</span> to print an abbreviated spamicity message containing 1 letter and the score. Spam is indicated with "Y", ham by "N", and unsure by "U". Note: the formatting can be customized using the config file.</p><p>The <code class="option">-T</code> option provides an invariant terse mode for scripts to use. <span class="application">bogofilter</span> will print an abbreviated spamicity message containing 1 letter and the score.
Spam is indicated with "S", ham by "H", and unsure by "U".</p><p>The <code class="option">-TT</code> option provides an invariant terse mode for scripts to use. <span class="application">Bogofilter</span> prints only the score and displays it to 16 significant digits.</p><p>The <code class="option">-u</code> option tells <span class="application">bogofilter</span> to register the message's text after classifying it as spam or non-spam. A spam message will be registered on the spamlist and a non-spam message on the goodlist. If the classification is "unsure", the message will not be registered. Effectively this option runs <span class="application">bogofilter</span> with the <code class="option">-s</code> or <code class="option">-n</code> flag, as appropriate. Caution is urged in the use of this capability, as any classification errors <span class="application">bogofilter</span> may make will be preserved and will accumulate until manually corrected with the <code class="option">-Sn</code> and <code class="option">-Ns</code> option combinations. Note that this option causes the database to be opened for write access, which can entail massive slowdowns through lock contention and synchronous I/O operations.</p><p>The <code class="option">-H</code> option tells <span class="application">bogofilter</span> to not tag tokens from the header. This option is for testing; you should not use it in normal operation.</p><p>The <code class="option">-M</code> option tells <span class="application">bogofilter</span> to process its input as an mbox-formatted file. If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, a spamicity line will be printed for each message.</p><p>The <code class="option">-b</code> (streaming bulk mode) option tells <span class="application">bogofilter</span> to classify multiple objects whose names are read from stdin.
If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, <span class="application">bogofilter</span> will print a line giving filename and classification information for each file. This is an alternative to <code class="option">-B</code> which lists objects on the command line.</p><p>An object in this context shall be a maildir (autodetected), or if it's not a maildir, a single mail unless <code class="option">-M</code> is given; in that case it's processed as mbox. (The Content-Length: header is not taken into account currently.)</p><p>When reading mbox format, <span class="application">bogofilter</span> relies on the empty line after a mail. If needed, <span><strong class="command">formail -es</strong></span> will ensure this is the case.</p><p>The <code class="option">-B <em class="replaceable"><code>object ...</code></em></code> (bulk mode) option tells <span class="application">bogofilter</span> to classify multiple objects named on the command line. The objects may be filenames (for single messages), mailboxes (files with multiple messages), or directories (of maildir and MH format). If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, <span class="application">bogofilter</span> will print a line giving filename and classification information for each file. This is an alternative to <code class="option">-b</code> which lists objects on stdin.</p><p>The <code class="option">-R</code> option tells <span class="application">bogofilter</span> to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail.</p><p>REGISTRATION OPTIONS</p><p>The <code class="option">-s</code> option tells <span class="application">bogofilter</span> to register the text presented as spam.
The database is created if absent.</p><p>The <code class="option">-n</code> option tells <span class="application">bogofilter</span> to register the text presented as non-spam.</p><p><span class="application">Bogofilter</span> doesn't detect if a message is registered twice. If you do this by accident, the token counts will be off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of tokens and messages in the wordlist, this doesn't matter. The problem <span class="emphasis"><em>can</em></span> be corrected by using the <code class="option">-S</code> option or the <code class="option">-N</code> option.</p><p>The <code class="option">-S</code> option tells <span class="application">bogofilter</span> to undo a prior registration of the same message as spam. If a message was incorrectly entered as spam by <code class="option">-s</code> or <code class="option">-u</code> and you want to remove it and enter it as non-spam, use <code class="option">-Sn</code>. If <code class="option">-S</code> is used for a message that wasn't registered as spam, the counts will still be decremented.</p><p>The <code class="option">-N</code> option tells <span class="application">bogofilter</span> to undo a prior registration of the same message as non-spam.
If a message was incorrectly entered as non-spam by <code class="option">-n</code> or <code class="option">-u</code> and you want to remove it and enter it as spam, then use <code class="option">-Ns</code>. If <code class="option">-N</code> is used for a message that wasn't registered as non-spam, the counts will still be decremented.</p><p>GENERAL OPTIONS</p><p>The <code class="option">-c <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> to read the config file named.</p><p>The <code class="option">-C</code> option prevents <span class="application">bogofilter</span> from reading configuration files.</p><p>The <code class="option">-d <em class="replaceable"><code>dir</code></em></code> option allows you to set the directory for the database. See the ENVIRONMENT section for other directory setting options.</p><p>The <code class="option">-k <em class="replaceable"><code>cachesize</code></em></code> option sets the cache size for the BerkeleyDB subsystem, in units of 1 MiB (1,048,576 bytes). Properly sizing the cache improves <span class="application">bogofilter</span>'s performance. The recommended size is one third of the size of the database file. You can run the <span class="application">bogotune</span> script (in the tuning directory) to determine the recommended size.</p><p>The <code class="option">-l</code> option writes an informational line to the system log each time <span class="application">bogofilter</span> is run. The information logged depends on how <span class="application">bogofilter</span> is run.</p><p>The <code class="option">-L <em class="replaceable"><code>tag</code></em></code> option configures a tag which can be included in the information being logged by the <code class="option">-l</code> option, but it requires a custom format that includes the %l string for now.
This option implies <code class="option">-l</code>.</p><p>The <code class="option">-I <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> to read its input from the specified file, rather than from <code class="option">stdin</code>.</p><p>The <code class="option">-O <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> where to write its output in passthrough mode. Note that this only works when -p is explicitly given.</p><p>PARAMETER OPTIONS</p><p>The <code class="option">-E <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em></code> option allows setting the sp-esf value and the ns-esf value. With two values, both sp-esf and ns-esf are set. If only one value is given, parameters are set as described in the note below.</p><p>The <code class="option">-m <em class="replaceable"><code>value[<span class="optional">,value</span>][<span class="optional">,value</span>]</code></em></code> option allows setting the min-dev value and, optionally, the