📄 options.py
("x-cache_expiry_days", _("Number of days to store URLs in cache"), 7, _("""(EXPERIMENTAL) This is the number of days that local cached copies of the text at the URLs will be stored for."""), INTEGER, RESTORE), ("x-cache_directory", _("URL Cache Directory"), "url-cache", _("""(EXPERIMENTAL) So that SpamBayes doesn't need to retrieve the same URL over and over again, it stores local copies of the text at the end of the URL. This is the directory that will be used for those copies."""), PATH, RESTORE), ("x-only_slurp_base", _("Retrieve base url"), False, _("""(EXPERIMENTAL) To try and speed things up, and to avoid following unique URLS, if this option is enabled, SpamBayes will convert the URL to as basic a form it we can. All directory information is removed and the domain is reduced to the two (or three for those with a country TLD) top-most elements. For example, http://www.massey.ac.nz/~tameyer/index.html?you=me would become http://massey.ac.nz and http://id.example.com would become http://example.com This should have two beneficial effects: o It's unlikely that any information could be contained in this 'base' url that could identify the user (unless they have a *lot* of domains). o Many urls (both spam and ham) will strip down into the same 'base' url. Since we have a limited form of caching, this means that a lot fewer urls will have to be retrieved. However, this does mean that if the 'base' url is hammy and the full is spammy, or vice-versa, that the slurp will give back the wrong information. Whether or not this is the case would have to be determined by testing. """), BOOLEAN, RESTORE), ("x-web_prefix", _("Prefix for tokens from web pages"), "", _("""(EXPERIMENTAL) It may be that what is hammy/spammy for you in email isn't from webpages. You can then set this option (to "web:", for example), and effectively create an independent (sub)database for tokens derived from parsing web pages."""), r"[\S]+", RESTORE), ), # These options control how a message is categorized "Categorization" : ( # spam_cutoff and ham_cutoff are used in Python slice sense: # A msg is considered ham if its score is in 0:ham_cutoff # A msg is considered unsure if its score is in ham_cutoff:spam_cutoff # A msg is considered spam if its score is in spam_cutoff: # # So it's unsure iff ham_cutoff <= score < spam_cutoff. # For a binary classifier, make ham_cutoff == spam_cutoff. # ham_cutoff > spam_cutoff doesn't make sense. # # The defaults here (.2 and .9) may be appropriate for the default chi- # combining scheme. Cutoffs for chi-combining typically aren't touchy, # provided you're willing to settle for "really good" instead of "optimal". # Tim found that .3 and .8 worked very well for well-trained systems on # his personal email, and his large comp.lang.python test. If just # beginning training, or extremely fearful of mistakes, 0.05 and 0.95 may # be more appropriate for you. # # Picking good values for gary-combining is much harder, and appears to be # corpus-dependent, and within a single corpus dependent on how much # training has been done. Values from 0.50 thru the low 0.60's have been # reported to work best by various testers on their data. ("ham_cutoff", _("Ham cutoff"), 0.20, _("""Spambayes gives each email message a spam probability between 0 and 1. Emails below the Ham Cutoff probability are classified as Ham. Larger values will result in more messages being classified as ham, but with less certainty that all of them actually are ham. 
This value should be between 0 and 1, and should be smaller than the Spam Cutoff."""), REAL, RESTORE), ("spam_cutoff", _("Spam cutoff"), 0.90, _("""Emails with a spam probability above the Spam Cutoff are classified as Spam - just like the Ham Cutoff but at the other end of the scale. Messages that fall between the two values are classified as Unsure."""), REAL, RESTORE), ), # These control various displays in class TestDriver.Driver, and # Tester.Test. "TestDriver" : ( ("nbuckets", _("Number of buckets"), 200, _("""Number of buckets in histograms."""), INTEGER, RESTORE), ("show_histograms", _("Show histograms"), True, _(""""""), BOOLEAN, RESTORE), ("compute_best_cutoffs_from_histograms", _("Compute best cutoffs from histograms"), True, _("""After the display of a ham+spam histogram pair, you can get a listing of all the cutoff values (coinciding with histogram bucket boundaries) that minimize: best_cutoff_fp_weight * (# false positives) + best_cutoff_fn_weight * (# false negatives) + best_cutoff_unsure_weight * (# unsure msgs) This displays two cutoffs: hamc and spamc, where 0.0 <= hamc <= spamc <= 1.0 The idea is that if something scores < hamc, it's called ham; if something scores >= spamc, it's called spam; and everything else is called 'I am not sure' -- the middle ground. Note: You may wish to increase nbuckets, to give this scheme more cutoff values to analyze."""), BOOLEAN, RESTORE), ("best_cutoff_fp_weight", _("Best cutoff false positive weight"), 10.00, _(""""""), REAL, RESTORE), ("best_cutoff_fn_weight", _("Best cutoff false negative weight"), 1.00, _(""""""), REAL, RESTORE), ("best_cutoff_unsure_weight", _("Best cutoff unsure weight"), 0.20, _(""""""), REAL, RESTORE), ("percentiles", _("Percentiles"), (5, 25, 75, 95), _("""Histogram analysis also displays percentiles. For each percentile p in the list, the score S such that p% of all scores are <= S is given. Note that percentile 50 is the median, and is displayed (along with the min score and max score) independent of this option."""), INTEGER, RESTORE), ("show_spam_lo", _(""), 1.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_spam_hi", _(""), 0.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_ham_lo", _(""), 1.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_ham_hi", _(""), 0.0, _("""Display spam when show_spam_lo <= spamprob <= show_spam_hi and likewise for ham. The defaults here do not show anything."""), REAL, RESTORE), ("show_false_positives", _("Show false positives"), True, _(""""""), BOOLEAN, RESTORE), ("show_false_negatives", _("Show false negatives"), False, _(""""""), BOOLEAN, RESTORE), ("show_unsure", _("Show unsure"), False, _(""""""), BOOLEAN, RESTORE), ("show_charlimit", _("Show character limit"), 3000, _("""The maximum # of characters to display for a msg displayed due to the show_xyz options above."""), INTEGER, RESTORE), ("save_trained_pickles", _("Save trained pickles"), False, _("""If save_trained_pickles is true, Driver.train() saves a binary pickle of the classifier after training. The file basename is given by pickle_basename, the extension is .pik, and increasing integers are appended to pickle_basename. 
By default (if save_trained_pickles is true), the filenames are class1.pik, class2.pik, ... If a file of that name already exists, it is overwritten. pickle_basename is ignored when save_trained_pickles is false."""), BOOLEAN, RESTORE), ("pickle_basename", _("Pickle basename"), "class", _(""""""), r"[\w]+", RESTORE), ("save_histogram_pickles", _("Save histogram pickles"), False, _("""If save_histogram_pickles is true, Driver.train() saves a binary pickle of the spam and ham histogram for "all test runs". The file basename is given by pickle_basename, the suffix _spamhist.pik or _hamhist.pik is appended to the basename."""), BOOLEAN, RESTORE), ("spam_directories", _("Spam directories"), "Data/Spam/Set%d", _("""default locations for timcv and timtest - these get the set number interpolated."""), VARIABLE_PATH, RESTORE), ("ham_directories", _("Ham directories"), "Data/Ham/Set%d", _("""default locations for timcv and timtest - these get the set number interpolated."""), VARIABLE_PATH, RESTORE), ), "CV Driver": ( ("build_each_classifier_from_scratch", _("Build each classifier from scratch"), False, _("""A cross-validation driver takes N ham+spam sets, and builds N classifiers, training each on N-1 sets, and the predicting against the set not trained on. By default, it does this in a clever way, learning *and* unlearning sets as it goes along, so that it never needs to train on N-1 sets in one gulp after the first time. Setting this option true forces ''one gulp from-scratch'' training every time. There used to be a set of combining schemes that needed this, but now it is just in case you are paranoid <wink>."""), BOOLEAN, RESTORE), ), "Classifier": ( ("max_discriminators", _("Maximum number of extreme words"), 150, _("""The maximum number of extreme words to look at in a message, where "extreme" means with spam probability farthest away from 0.5. 150 appears to work well across all corpora tested."""), INTEGER, RESTORE), ("unknown_word_prob", _("Unknown word probability"), 0.5, _("""These two control the prior assumption about word probabilities. unknown_word_prob is essentially the probability given to a word that has never been seen before. Nobody has reported an improvement via moving it away from 1/2, although Tim has measured a mean spamprob of a bit over 0.5 (0.51-0.55) in 3 well-trained classifiers."""), REAL, RESTORE), ("unknown_word_strength", _("Unknown word strength"), 0.45, _("""This adjusts how much weight to give the prior assumption relative to the probabilities estimated by counting. At 0, the counting estimates are believed 100%, even to the extent of assigning certainty (0 or 1) to a word that has appeared in only ham or only spam. This is a disaster. As unknown_word_strength tends toward infinity, all probabilities tend toward unknown_word_prob. All reports were that a value near 0.4 worked best, so this does not seem to be corpus-dependent."""), REAL, RESTORE), ("minimum_prob_strength", _("Minimum probability strength"), 0.1, _("""When scoring a message, ignore all words with abs(word.spamprob - 0.5) < minimum_prob_strength. This may be a hack, but it has proved to reduce error rates in many tests. 0.1 appeared to work well across all corpora."""), REAL, RESTORE), ("use_chi_squared_combining", _("Use chi-squared combining"), True, _("""For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution with 2*n degrees of freedom. This is the "provably most-sensitive" test the original scheme was monotonic with. 
Getting closer to the theoretical basis appears to give an excellent combining method, usually very extreme in its judgment, yet finding a tiny (in # of msgs, spread across a huge range of scores) middle ground where lots of the mistakes live. This is the best method so far. One systematic benefit is is immunity to "cancellation disease". One systematic drawback is sensitivity to *any* deviation from a uniform distribution, regardless of whether actually evidence of ham or spam. Rob Hooft alleviated that by combining the final S and H measures via (S-H+1)/2 instead of via S/(S+H)). In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95, does well across test sets; while these cutoffs are rarely optimal, they get close to optimal. With more training data, Tim has had good luck with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets (original c.l.p data, his own email, and newer general python.org traffic)."""), BOOLEAN, RESTORE), ("use_bigrams", _("Use mixed uni/bi-grams scheme"), False,
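
# A minimal sketch of the "base url" reduction that the x-only_slurp_base
# docstring describes.  This is illustrative only: base_url() is a
# hypothetical helper written against urllib.parse, not the module's actual
# slurping code, and the country-TLD tuple is a deliberately simplified
# assumption.
from urllib.parse import urlsplit

def base_url(url, country_tlds=("nz", "uk", "au")):
    """Strip the path/query and reduce the host to its top-most labels."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    # Keep the top two labels, or three when the TLD is a country code.
    keep = 3 if labels[-1] in country_tlds else 2
    return "%s://%s" % (parts.scheme, ".".join(labels[-keep:]))

# e.g. base_url("http://www.massey.ac.nz/~tameyer/index.html?you=me")
#      -> "http://massey.ac.nz"
#      base_url("http://id.example.com") -> "http://example.com"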
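
# A minimal sketch of the slice-style cutoff semantics spelled out in the
# "Categorization" comments above; categorize() is a hypothetical helper
# used only for illustration, not part of this module.
def categorize(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:       # score in 0:ham_cutoff
        return "ham"
    if score < spam_cutoff:      # score in ham_cutoff:spam_cutoff
        return "unsure"
    return "spam"                # score in spam_cutoff:

# For a binary classifier, set ham_cutoff == spam_cutoff; the "unsure"
# band then disappears.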
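
# A minimal sketch of the weighted cost that the
# compute_best_cutoffs_from_histograms analysis minimizes over candidate
# (hamc, spamc) pairs; cutoff_cost() and its argument names are
# illustrative only.
def cutoff_cost(n_false_positives, n_false_negatives, n_unsure,
                fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Weighted cost of one cutoff pair, per the docstring above."""
    return (fp_weight * n_false_positives +
            fn_weight * n_false_negatives +
            unsure_weight * n_unsure)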
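
# A minimal sketch of how unknown_word_prob (x) and unknown_word_strength (s)
# blend the prior with the counting estimate, in the Robinson-style shrinkage
# the two docstrings above describe.  adjusted_spamprob() is illustrative
# only; "counted_prob" stands for the probability estimated from the word's
# ham/spam counts.
def adjusted_spamprob(counted_prob, n_occurrences,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    s, x, n = unknown_word_strength, unknown_word_prob, n_occurrences
    # n == 0 gives exactly x; as n grows the counting estimate dominates;
    # as s tends to infinity every word's probability tends to x.
    return (s * x + n * counted_prob) / (s + n)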
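
# A minimal sketch of the chi-squared combining idea described in the
# use_chi_squared_combining docstring.  Illustration only, not the
# classifier's actual scoring code (which also guards against underflow and
# empty clue lists); probs is assumed to be a non-empty sequence of values
# strictly between 0 and 1.
from math import exp, log

def chi2Q(x2, v):
    """Return prob(chi-squared with v (even) degrees of freedom >= x2)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combined_score(probs):
    """Combine per-word spam probabilities into a single score in [0, 1]."""
    n = len(probs)
    # Under the uniform hypothesis, -2*sum(ln(p_i)) is chi-squared
    # distributed with 2*n degrees of freedom.
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)  # spammy evidence
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)        # hammy evidence
    # Rob Hooft's combination of the two one-sided measures.
    return (S - H + 1.0) / 2.0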