📄 bogofilter.htmlin

📁 A fast Bayesian spam filtering tool written in C
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>bogofilter</title><meta name="generator" content="DocBook XSL Stylesheets V1.69.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="refentry" lang="en"><a name="bogofilter.1"></a><div class="titlepage"></div><div class="refnamediv"><a name="name"></a><h2>Name</h2><p>bogofilter &#8212; fast Bayesian spam filter</p></div><div class="refsynopsisdiv"><a name="synopsis"></a><h2>Synopsis</h2><div class="cmdsynopsis"><p><code class="command">bogofilter</code>  [ help options  |   classification options  |   registration options  |   parameter options  |   info options ] [general options] [config file options]</p></div><p>where</p><p><code class="option">help options</code> are:</p><div class="cmdsynopsis"><p>[-h] [--help] [-V] [-Q]</p></div><p><code class="option">classification options</code> are:</p><div class="cmdsynopsis"><p>[-p] [-e] [-t] [-T] [-u] [-H] [-M] [-b] [-B <em class="replaceable"><code>object ...</code></em>] [-R] [general options] [parameter options] [config file options]</p></div><p><code class="option">registration options</code> are:</p><div class="cmdsynopsis"><p>[ -s  |   -n ] [ -S  |   -N ] [general options]</p></div><p><code class="option">general options</code> are:</p><div class="cmdsynopsis"><p>[-c <em class="replaceable"><code>filename</code></em>] [-C] [-d <em class="replaceable"><code>dir</code></em>] [-k <em class="replaceable"><code>cachesize</code></em>] [-l] [-L <em class="replaceable"><code>tag</code></em>] [-I <em class="replaceable"><code>filename</code></em>] [-O <em class="replaceable"><code>filename</code></em>]</p></div><p><code class="option">parameter options</code> are:</p><div class="cmdsynopsis"><p>[-E <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em>] [-m <em class="replaceable"><code>value[<span class="optional">,value</span>][<span 
class="optional">,value</span>]</code></em>] [-o <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em>]</p></div><p><code class="option">info options</code> are:</p><div class="cmdsynopsis"><p>[-v] [-y <em class="replaceable"><code>date</code></em>] [-D] [-x <em class="replaceable"><code>flags</code></em>]</p></div><p><code class="option">config file options</code> are:</p><div class="cmdsynopsis"><p>[--<em class="replaceable"><code>option=value</code></em>]</p></div><p>Note: Use <span><strong class="command">bogofilter --help</strong></span> to display the complete list of options.</p></div><div class="refsect1" lang="en"><a name="description"></a><h2>DESCRIPTION</h2><p><span class="application">Bogofilter</span> is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code indicating whether or not the message is spam. <span class="application">Bogofilter</span> is designed with a fast algorithm, uses the Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used for production by sites that process a lot of mail.</p></div><div class="refsect1" lang="en"><a name="theory"></a><h2>THEORY OF OPERATION</h2><p><span class="application">Bogofilter</span> treats its input as a bag of tokens. Each token is checked against a wordlist, which maintains counts of the number of times it has occurred in non-spam and spam mails. These numbers are used to compute an estimate of the probability that a message in which the token occurs is spam. 
Those are combined to indicate whether the message is spam or ham.</p><p>While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper <a href="http://www.paulgraham.com/spam.html" target="_top">A Plan For Spam</a> is recommended reading.</p><p>This program substantially improves on Paul's proposal by doing smarter lexical analysis. <span class="application">Bogofilter</span> does proper MIME decoding and reasonable HTML parsing. Special kinds of tokens like hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are ignored so as not to bloat the wordlist. Tokens found in various header fields are marked appropriately.</p><p>Another improvement is that this program offers Gary Robinson's suggested modifications to the calculations (see the parameters robx and robs below). These modifications are described in Robinson's paper <a href="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html" target="_top">SpamDetection</a>.</p><p>Since then, Robinson (see his Linux Journal article <a href="http://www.linuxjournal.com/article/6467" target="_top">A Statistical Approach to the Spam Problem</a>) and others have realized that the calculation can be further optimized using Fisher's method. <a href="http://www.garyrobinson.net/2004/04/improved_chi.html" target="_top">Another improvement</a> compensates for token redundancy by applying separate effective size factors (ESF) to spam and nonspam probability calculations.</p><p>In short, this is how it works: The estimates for the spam probabilities of the individual tokens are combined using the "inverse chi-square function". Its value indicates how badly the null hypothesis that the message is just a random collection of independent words with probabilities given by our previous estimates fails. 
This function is very sensitive to small probabilities (hammish words), but not to high probabilities (spammish words); so the value only indicates strong hammish signs in a message. Now using inverse probabilities for the tokens, the same computation is done again, giving an indicator that a message looks strongly spammish. Finally, those two indicators are subtracted (and scaled into a 0-1 interval). This combined indicator (bogosity) is close to 0 if the signs for a hammish message are stronger than for a spammish message and close to 1 if the situation is the other way round. If signs for both are equally strong, the value will be near 0.5. Since those messages don't give a clear indication, there is a tristate mode in <span class="application">bogofilter</span> to mark those messages as unsure, while the clear messages are marked as spam or ham, respectively. In two-state mode, every message is marked as either spam or ham.</p><p>Various parameters influence these calculations; the most important are:</p><p>robx: the score given to a token which has not been seen before. robx is the probability that the token is spammish.</p><p>robs: a weight on robx which moves the probability of a rarely seen token towards robx.</p><p>min-dev: a minimum distance from 0.5 for tokens to use in the calculation. Only tokens farther away from 0.5 than this value are used.</p><p>spam-cutoff: messages with scores greater than or equal to this value will be marked as spam.</p><p>ham-cutoff: If zero or equal to spam-cutoff, all messages with values strictly below spam-cutoff are marked as ham, all others as spam (two-state). Otherwise, values less than or equal to ham-cutoff are marked as ham, messages with values strictly between ham-cutoff and spam-cutoff are marked as unsure, and the rest as spam (tristate).</p><p>sp-esf: the effective size factor (ESF) for spam.</p><p>ns-esf: the ESF for nonspam. These ESF values default to 1.0, which is the same as not using ESF in the calculation. 
Values suitable to a user's email population can be determined with the aid of the <span class="application">bogotune</span> program.</p></div><div class="refsect1" lang="en"><a name="options"></a><h2>OPTIONS</h2><p>HELP OPTIONS</p><p>The <code class="option">-h</code> option prints the help message and exits.</p><p>The <code class="option">-V</code> option prints the version number and exits.</p><p>The <code class="option">-Q</code> (query) option prints <span class="application">bogofilter</span>'s configuration, i.e. registration parameters, parsing options, <span class="application">bogofilter</span> directory, etc.</p><p>CLASSIFICATION OPTIONS</p><p>The <code class="option">-p</code> (passthrough) option outputs the message with an X-Bogosity line at the end of the message header. This requires keeping the entire message in memory when it's read from stdin (or from a pipe or socket). If the message is read from a file that can be rewound, <span class="application">bogofilter</span> will read it a second time.</p><p>The <code class="option">-e</code> (embed) option tells <span class="application">bogofilter</span> to exit with code 0 if the message can be classified, i.e. if there is no error. Normally <span class="application">bogofilter</span> uses different codes for spam, ham, and unsure classifications, but this simplifies using <span class="application">bogofilter</span> with <span class="application">procmail</span> or <span class="application">maildrop</span>.</p><p>The <code class="option">-t</code> (terse) option tells <span class="application">bogofilter</span> to print an abbreviated spamicity message containing 1 letter and the score. Spam is indicated with "Y", ham by "N", and unsure by "U". Note: the formatting can be customized using the config file.</p><p>The <code class="option">-T</code> option provides an invariant terse mode for scripts to use. 
<span class="application">bogofilter</span> will print an abbreviated spamicity message containing 1 letter and the score. Spam is indicated with "S", ham by "H", and unsure by "U".</p><p>The <code class="option">-TT</code> option provides an invariant terse mode for scripts to use. <span class="application">Bogofilter</span> prints only the score and displays it to 16 significant digits.</p><p>The <code class="option">-u</code> option tells <span class="application">bogofilter</span> to register the message's text after classifying it as spam or non-spam. A spam message will be registered on the spamlist and a non-spam message on the goodlist. If the classification is "unsure", the message will not be registered. Effectively this option runs <span class="application">bogofilter</span> with the <code class="option">-s</code> or <code class="option">-n</code> flag, as appropriate. Caution is urged in the use of this capability, as any classification errors <span class="application">bogofilter</span> may make will be preserved and will accumulate until manually corrected with the <code class="option">-Sn</code> and <code class="option">-Ns</code> option combinations. Note this option causes the database to be opened for write access, which can entail massive slowdowns through lock contention and synchronous I/O operations.</p><p>The <code class="option">-H</code> option tells <span class="application">bogofilter</span> to not tag tokens from the header. This option is for testing; you should not use it in normal operation.</p><p>The <code class="option">-M</code> option tells <span class="application">bogofilter</span> to process its input as an mbox formatted file. If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, a spamicity line will be printed for each message.</p><p>The <code class="option">-b</code> (streaming bulk mode) option tells <span class="application">bogofilter</span> to classify multiple objects whose names are read from stdin. 
If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, <span class="application">bogofilter</span> will print a line giving file name and classification information for each file. This is an alternative to <code class="option">-B</code> which lists objects on the command line.</p><p>An object in this context shall be a maildir (autodetected), or if it's not a maildir, a single mail unless <code class="option">-M</code> is given, in which case it's processed as mbox. (The Content-Length: header is not taken into account currently.)</p><p>When reading mbox format, <span class="application">bogofilter</span> relies on the empty line after a mail. If needed, <span><strong class="command">formail -es</strong></span> will ensure this is the case.</p><p>The <code class="option">-B <em class="replaceable"><code>object ...</code></em></code> (bulk mode) option tells <span class="application">bogofilter</span> to classify multiple objects named on the command line. The objects may be filenames (for single messages), mailboxes (files with multiple messages), or directories (of maildir and MH format). If the <code class="option">-v</code> or <code class="option">-t</code> option is also given, <span class="application">bogofilter</span> will print a line giving file name and classification information for each file. This is an alternative to <code class="option">-b</code> which lists objects on stdin.</p><p>The <code class="option">-R</code> option tells <span class="application">bogofilter</span> to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail.</p><p>REGISTRATION OPTIONS</p><p>The <code class="option">-s</code> option tells <span class="application">bogofilter</span> to register the text presented as spam. 
The database is created if absent.</p><p>The <code class="option">-n</code> option tells <span class="application">bogofilter</span> to register the text presented as non-spam.</p><p><span class="application">Bogofilter</span> doesn't detect if a message is registered twice. If you do this by accident, the token counts will be off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of tokens and messages in the wordlist, this doesn't matter. The problem <span class="emphasis"><em>can</em></span> be corrected by using the <code class="option">-S</code> option or the <code class="option">-N</code> option.</p><p>The <code class="option">-S</code> option tells <span class="application">bogofilter</span> to undo a prior registration of the same message as spam. If a message was incorrectly entered as spam by <code class="option">-s</code> or <code class="option">-u</code> and you want to remove it and enter it as non-spam, use <code class="option">-Sn</code>. If <code class="option">-S</code> is used for a message that wasn't registered as spam, the counts will still be decremented.</p><p>The <code class="option">-N</code> option tells <span class="application">bogofilter</span> to undo a prior registration of the same message as non-spam. 
If a message was incorrectly entered as non-spam by <code class="option">-n</code> or <code class="option">-u</code> and you want to remove it and enter it as spam, then use <code class="option">-Ns</code>. If <code class="option">-N</code> is used for a message that wasn't registered as non-spam, the counts will still be decremented.</p><p>GENERAL OPTIONS</p><p>The <code class="option">-c <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> to read the config file named.</p><p>The <code class="option">-C</code> option prevents <span class="application">bogofilter</span> from reading configuration files.</p><p>The <code class="option">-d <em class="replaceable"><code>dir</code></em></code> option allows you to set the directory for the database. See the ENVIRONMENT section for other directory setting options.</p><p>The <code class="option">-k <em class="replaceable"><code>cachesize</code></em></code> option sets the cache size for the BerkeleyDB subsystem, in units of 1 MiB (1,048,576 bytes). Properly sizing the cache improves <span class="application">bogofilter</span>'s performance. The recommended size is one third of the size of the database file. You can run the <span class="application">bogotune</span> script (in the tuning directory) to determine the recommended size.</p><p>The <code class="option">-l</code> option writes an informational line to the system log each time <span class="application">bogofilter</span> is run. The information logged depends on how <span class="application">bogofilter</span> is run.</p><p>The <code class="option">-L <em class="replaceable"><code>tag</code></em></code> option configures a tag which can be included in the information being logged by the <code class="option">-l</code> option, but it requires a custom format that includes the %l string for now. 
This option implies <code class="option">-l</code>.</p><p>The <code class="option">-I <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> to read its input from the specified file, rather than from <code class="option">stdin</code>.</p><p>The <code class="option">-O <em class="replaceable"><code>filename</code></em></code> option tells <span class="application">bogofilter</span> where to write its output in passthrough mode. Note that this only works when -p is explicitly given.</p><p>PARAMETER OPTIONS</p><p>The <code class="option">-E <em class="replaceable"><code>value[<span class="optional">,value</span>]</code></em></code> option allows setting the sp-esf value and the ns-esf value. With two values, both sp-esf and ns-esf are set. If only one value is given, parameters are set as described in the note below.</p><p>The <code class="option">-m <em class="replaceable"><code>value[<span class="optional">,value</span>][<span class="optional">,value</span>]</code></em></code> option allows setting the min-dev value and, optionally, the