📄 options.py
("x-cache_expiry_days", _("Number of days to store URLs in cache"), 7, _("""(EXPERIMENTAL) This is the number of days that local cached copies of the text at the URLs will be stored for."""), INTEGER, RESTORE), ("x-cache_directory", _("URL Cache Directory"), "url-cache", _("""(EXPERIMENTAL) So that SpamBayes doesn't need to retrieve the same URL over and over again, it stores local copies of the text at the end of the URL. This is the directory that will be used for those copies."""), PATH, RESTORE), ("x-only_slurp_base", _("Retrieve base url"), False, _("""(EXPERIMENTAL) To try and speed things up, and to avoid following unique URLS, if this option is enabled, SpamBayes will convert the URL to as basic a form it we can. All directory information is removed and the domain is reduced to the two (or three for those with a country TLD) top-most elements. For example, http://www.massey.ac.nz/~tameyer/index.html?you=me would become http://massey.ac.nz and http://id.example.com would become http://example.com This should have two beneficial effects: o It's unlikely that any information could be contained in this 'base' url that could identify the user (unless they have a *lot* of domains). o Many urls (both spam and ham) will strip down into the same 'base' url. Since we have a limited form of caching, this means that a lot fewer urls will have to be retrieved. However, this does mean that if the 'base' url is hammy and the full is spammy, or vice-versa, that the slurp will give back the wrong information. Whether or not this is the case would have to be determined by testing. """), BOOLEAN, RESTORE), ("x-web_prefix", _("Prefix for tokens from web pages"), "", _("""(EXPERIMENTAL) It may be that what is hammy/spammy for you in email isn't from webpages. You can then set this option (to "web:", for example), and effectively create an independent (sub)database for tokens derived from parsing web pages."""), r"[\S]+", RESTORE), ), # These options control how a message is categorized "Categorization" : ( # spam_cutoff and ham_cutoff are used in Python slice sense: # A msg is considered ham if its score is in 0:ham_cutoff # A msg is considered unsure if its score is in ham_cutoff:spam_cutoff # A msg is considered spam if its score is in spam_cutoff: # # So it's unsure iff ham_cutoff <= score < spam_cutoff. # For a binary classifier, make ham_cutoff == spam_cutoff. # ham_cutoff > spam_cutoff doesn't make sense. # # The defaults here (.2 and .9) may be appropriate for the default chi- # combining scheme. Cutoffs for chi-combining typically aren't touchy, # provided you're willing to settle for "really good" instead of "optimal". # Tim found that .3 and .8 worked very well for well-trained systems on # his personal email, and his large comp.lang.python test. If just # beginning training, or extremely fearful of mistakes, 0.05 and 0.95 may # be more appropriate for you. # # Picking good values for gary-combining is much harder, and appears to be # corpus-dependent, and within a single corpus dependent on how much # training has been done. Values from 0.50 thru the low 0.60's have been # reported to work best by various testers on their data. ("ham_cutoff", _("Ham cutoff"), 0.20, _("""Spambayes gives each email message a spam probability between 0 and 1. Emails below the Ham Cutoff probability are classified as Ham. Larger values will result in more messages being classified as ham, but with less certainty that all of them actually are ham. 
This value should be between 0 and 1, and should be smaller than the Spam Cutoff."""), REAL, RESTORE), ("spam_cutoff", _("Spam cutoff"), 0.90, _("""Emails with a spam probability above the Spam Cutoff are classified as Spam - just like the Ham Cutoff but at the other end of the scale. Messages that fall between the two values are classified as Unsure."""), REAL, RESTORE), ), # These control various displays in class TestDriver.Driver, and # Tester.Test. "TestDriver" : ( ("nbuckets", _("Number of buckets"), 200, _("""Number of buckets in histograms."""), INTEGER, RESTORE), ("show_histograms", _("Show histograms"), True, _(""""""), BOOLEAN, RESTORE), ("compute_best_cutoffs_from_histograms", _("Compute best cutoffs from histograms"), True, _("""After the display of a ham+spam histogram pair, you can get a listing of all the cutoff values (coinciding with histogram bucket boundaries) that minimize: best_cutoff_fp_weight * (# false positives) + best_cutoff_fn_weight * (# false negatives) + best_cutoff_unsure_weight * (# unsure msgs) This displays two cutoffs: hamc and spamc, where 0.0 <= hamc <= spamc <= 1.0 The idea is that if something scores < hamc, it's called ham; if something scores >= spamc, it's called spam; and everything else is called 'I am not sure' -- the middle ground. Note: You may wish to increase nbuckets, to give this scheme more cutoff values to analyze."""), BOOLEAN, RESTORE), ("best_cutoff_fp_weight", _("Best cutoff false positive weight"), 10.00, _(""""""), REAL, RESTORE), ("best_cutoff_fn_weight", _("Best cutoff false negative weight"), 1.00, _(""""""), REAL, RESTORE), ("best_cutoff_unsure_weight", _("Best cutoff unsure weight"), 0.20, _(""""""), REAL, RESTORE), ("percentiles", _("Percentiles"), (5, 25, 75, 95), _("""Histogram analysis also displays percentiles. For each percentile p in the list, the score S such that p% of all scores are <= S is given. Note that percentile 50 is the median, and is displayed (along with the min score and max score) independent of this option."""), INTEGER, RESTORE), ("show_spam_lo", _(""), 1.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_spam_hi", _(""), 0.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_ham_lo", _(""), 1.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_ham_hi", _(""), 0.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_false_positives", _("Show false positives"), True, _(""""""), BOOLEAN, RESTORE), ("show_false_negatives", _("Show false negatives"), False, _(""""""), BOOLEAN, RESTORE), ("show_unsure", _("Show unsure"), False, _(""""""), BOOLEAN, RESTORE), ("show_charlimit", _("Show character limit"), 3000, _("""The maximum # of characters to display for a msg displayed due to the show_xyz options above."""), INTEGER, RESTORE), ("save_trained_pickles", _("Save trained pickles"), False, _("""If save_trained_pickles is true, Driver.train() saves a binary pickle of the classifier after training. The file basename is given by pickle_basename, the extension is .pik, and increasing integers are appended to pickle_basename. 
By default (if save_trained_pickles is true), the filenames are class1.pik, class2.pik, ... If a file of that name already exists, it is overwritten. pickle_basename is ignored when save_trained_pickles is false."""), BOOLEAN, RESTORE), ("pickle_basename", _("Pickle basename"), "class", _(""""""), r"[\w]+", RESTORE), ("save_histogram_pickles", _("Save histogram pickles"), False, _("""If save_histogram_pickles is true, Driver.train() saves a binary pickle of the spam and ham histogram for "all test runs". The file basename is given by pickle_basename, the suffix _spamhist.pik or _hamhist.pik is appended to the basename."""), BOOLEAN, RESTORE), ("spam_directories", _("Spam directories"), "Data/Spam/Set%d", _("""default locations for timcv and timtest - these get the set number interpolated."""), VARIABLE_PATH, RESTORE), ("ham_directories", _("Ham directories"), "Data/Ham/Set%d", _("""default locations for timcv and timtest - these get the set number interpolated."""), VARIABLE_PATH, RESTORE), ), "CV Driver": ( ("build_each_classifier_from_scratch", _("Build each classifier from scratch"), False, _("""A cross-validation driver takes N ham+spam sets, and builds N classifiers, training each on N-1 sets, and the predicting against the set not trained on. By default, it does this in a clever way, learning *and* unlearning sets as it goes along, so that it never needs to train on N-1 sets in one gulp after the first time. Setting this option true forces ''one gulp from-scratch'' training every time. There used to be a set of combining schemes that needed this, but now it is just in case you are paranoid <wink>."""), BOOLEAN, RESTORE), ), "Classifier": ( ("max_discriminators", _("Maximum number of extreme words"), 150, _("""The maximum number of extreme words to look at in a message, where "extreme" means with spam probability farthest away from 0.5. 150 appears to work well across all corpora tested."""), INTEGER, RESTORE), ("unknown_word_prob", _("Unknown word probability"), 0.5, _("""These two control the prior assumption about word probabilities. unknown_word_prob is essentially the probability given to a word that has never been seen before. Nobody has reported an improvement via moving it away from 1/2, although Tim has measured a mean spamprob of a bit over 0.5 (0.51-0.55) in 3 well-trained classifiers."""), REAL, RESTORE), ("unknown_word_strength", _("Unknown word strength"), 0.45, _("""This adjusts how much weight to give the prior assumption relative to the probabilities estimated by counting. At 0, the counting estimates are believed 100%, even to the extent of assigning certainty (0 or 1) to a word that has appeared in only ham or only spam. This is a disaster. As unknown_word_strength tends toward infinity, all probabilities tend toward unknown_word_prob. All reports were that a value near 0.4 worked best, so this does not seem to be corpus-dependent."""), REAL, RESTORE), ("minimum_prob_strength", _("Minimum probability strength"), 0.1, _("""When scoring a message, ignore all words with abs(word.spamprob - 0.5) < minimum_prob_strength. This may be a hack, but it has proved to reduce error rates in many tests. 0.1 appeared to work well across all corpora."""), REAL, RESTORE), ("use_chi_squared_combining", _("Use chi-squared combining"), True, _("""For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution with 2*n degrees of freedom. This is the "provably most-sensitive" test the original scheme was monotonic with. 
Getting closer to the theoretical basis appears to give an excellent combining method, usually very extreme in its judgment, yet finding a tiny (in # of msgs, spread across a huge range of scores) middle ground where lots of the mistakes live. This is the best method so far. One systematic benefit is is immunity to "cancellation disease". One systematic drawback is sensitivity to *any* deviation from a uniform distribution, regardless of whether actually evidence of ham or spam. Rob Hooft alleviated that by combining the final S and H measures via (S-H+1)/2 instead of via S/(S+H)). In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95, does well across test sets; while these cutoffs are rarely optimal, they get close to optimal. With more training data, Tim has had good luck with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets (original c.l.p data, his own email, and newer general python.org traffic)."""), BOOLEAN, RESTORE), ("use_bigrams", _("Use mixed uni/bi-grams scheme"), False,
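
# A minimal sketch of the "base url" reduction that the x-only_slurp_base
# docstring describes.  This is illustrative only: base_url() is a
# hypothetical helper written against urllib.parse, not the module's actual
# slurping code, and the country-TLD tuple is a deliberately simplified
# assumption.
from urllib.parse import urlsplit

def base_url(url, country_tlds=("nz", "uk", "au")):
    """Strip the path/query and reduce the host to its top-most labels."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    # Keep the top two labels, or three when the TLD is a country code.
    keep = 3 if labels[-1] in country_tlds else 2
    return "%s://%s" % (parts.scheme, ".".join(labels[-keep:]))

# e.g. base_url("http://www.massey.ac.nz/~tameyer/index.html?you=me")
#      -> "http://massey.ac.nz"
#      base_url("http://id.example.com") -> "http://example.com"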
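
# A minimal sketch of the slice-style cutoff semantics spelled out in the
# "Categorization" comments above; categorize() is a hypothetical helper
# used only for illustration, not part of this module.
def categorize(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:       # score in 0:ham_cutoff
        return "ham"
    if score < spam_cutoff:      # score in ham_cutoff:spam_cutoff
        return "unsure"
    return "spam"                # score in spam_cutoff:

# For a binary classifier, set ham_cutoff == spam_cutoff; the "unsure"
# band then disappears.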
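
# A minimal sketch of the weighted cost that the
# compute_best_cutoffs_from_histograms analysis minimizes over candidate
# (hamc, spamc) pairs; cutoff_cost() and its argument names are
# illustrative only.
def cutoff_cost(n_false_positives, n_false_negatives, n_unsure,
                fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Weighted cost of one cutoff pair, per the docstring above."""
    return (fp_weight * n_false_positives +
            fn_weight * n_false_negatives +
            unsure_weight * n_unsure)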
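
# A minimal sketch of how unknown_word_prob (x) and unknown_word_strength (s)
# blend the prior with the counting estimate, in the Robinson-style shrinkage
# the two docstrings above describe.  adjusted_spamprob() is illustrative
# only; "counted_prob" stands for the probability estimated from the word's
# ham/spam counts.
def adjusted_spamprob(counted_prob, n_occurrences,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    s, x, n = unknown_word_strength, unknown_word_prob, n_occurrences
    # n == 0 gives exactly x; as n grows the counting estimate dominates;
    # as s tends to infinity every word's probability tends to x.
    return (s * x + n * counted_prob) / (s + n)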
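
# A minimal sketch of the chi-squared combining idea described in the
# use_chi_squared_combining docstring.  Illustration only, not the
# classifier's actual scoring code (which also guards against underflow and
# empty clue lists); probs is assumed to be a non-empty sequence of values
# strictly between 0 and 1.
from math import exp, log

def chi2Q(x2, v):
    """Return prob(chi-squared with v (even) degrees of freedom >= x2)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    """Combine per-word spam probabilities into a single score in [0, 1]."""
    n = len(probs)
    # Under the uniform hypothesis, -2*sum(ln(p_i)) is chi-squared
    # distributed with 2*n degrees of freedom.
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)  # spammy evidence
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)        # hammy evidence
    # Rob Hooft's combination of the two one-sided measures.
    return (S - H + 1.0) / 2.0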