📄 tokenizer.py
#! /usr/bin/env python
"""Module to tokenize email messages for spam filtering."""

from __future__ import generators

import email
import email.Message
import email.Header
import email.Utils
import email.Errors
import re
import math
import time
import os
import binascii
import urlparse
import urllib

try:
    # We have three possibilities for Set:
    #  (a) With Python 2.2 and earlier, we use our compatsets class
    #  (b) With Python 2.3, we use the sets.Set class
    #  (c) With Python 2.4 and later, we use the builtin set class
    Set = set
except NameError:
    try:
        from sets import Set
    except ImportError:
        from spambayes.compatsets import Set

from spambayes import classifier
from spambayes.Options import options
from spambayes.mboxutils import get_message

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

try:
    import dnscache
    cache = dnscache.cache(cachefile=options["Tokenizer", "lookup_ip_cache"])
    cache.printStatsAtEnd = True
except (IOError, ImportError):
    cache = None
else:
    import atexit
    atexit.register(cache.close)

# Patch encodings.aliases to recognize 'ansi_x3_4_1968'
from encodings.aliases import aliases  # The aliases dictionary
if not aliases.has_key('ansi_x3_4_1968'):
    aliases['ansi_x3_4_1968'] = 'ascii'
del aliases  # Not needed any more

##############################################################################
# To fold case or not to fold case?  I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now.  Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate.  There is one obvious difference:  after folding
# case, conference announcements no longer got high spam scores.  Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING.  That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place.  It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
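#
# For anyone unsure what folding buys or costs at the token level, here is a
# tiny sketch -- not code from this module; the helper name is made up:
#
#    def tokens(text, fold_case=True):
#        if fold_case:
#            text = text.lower()
#        return text.split()
#
#    tokens("FREE money! Free MONEY!")
#    # ['free', 'money!', 'free', 'money!'] -- every casing pools into one clue
#
#    tokens("FREE money! Free MONEY!", fold_case=False)
#    # ['FREE', 'money!', 'Free', 'MONEY!'] -- 'FREE' keeps its own strong
#    # spamprob, but each casing variant needs its own training examples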
##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w).  The f-n rate was also significantly worse, but within a factor
# of 2.  So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps.  There aren't enough there to say that's
# significantly more with killer-high confidence.  There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size.  It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict.  I rarely noticed disk activity when running s-o-w, so rarely
# bothered to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams:  they often had no obvious "meaning".  In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was
# discussing the speed of tasklets under his new implementation of stackless:
#
#     prob = 0.99999998959
#     prob('ed sw') = 0.01
#     prob('http0:pgp') = 0.01
#     prob('http0:python') = 0.01
#     prob('hlon ') = 0.99
#     prob('http0:wwwkeys') = 0.01
#     prob('http0:starship') = 0.01
#     prob('http0:stackless') = 0.01
#     prob('n xp ') = 0.99
#     prob('on xp') = 0.99
#     prob('p 150') = 0.99
#     prob('lon x') = 0.99
#     prob(' amd ') = 0.99
#     prob(' xp 1') = 0.99
#     prob(' athl') = 0.99
#     prob('1500+') = 0.99
#     prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
#     AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to sell
# <wink>.  In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5.  So one bad
# phrase can kill you!  This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme:  harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
#
# There's one area they won hands-down:  detecting spam in what I assume are
# Asian languages.  The s-o-w scheme sometimes finds only line-ends to split
# on then, and then a "hey, this 'word' is way too big!  let's ignore it"
# gimmick kicks in, and produces no tokens at all.
#
# [Later:  we produce character 5-grams then under the s-o-w scheme, instead
# of ignoring the blob, but only if there are high-bit characters in the
# blob; e.g., there's no point 5-gramming uuencoded lines, and doing so
# would bloat the database size.]
#
# Interesting:  despite that odd example above, the *kinds* of f-p mistakes
# 5-grams made were very much like s-o-w made -- I recognized almost all of
# the 5-gram f-p messages from previous s-o-w runs.  For example, both
# schemes have a particular hatred for conference announcements, although
# s-o-w stopped hating them after folding case.  But 5-grams still hate them.
# Both schemes also hate msgs discussing HTML with examples, with about equal
# passion.  Both schemes hate brief "please subscribe [unsubscribe] me"
# msgs, although 5-grams seems to hate them more.
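#
# To make the 5-gram example above concrete, here's a rough sketch -- not
# code from this module -- of the two token streams being compared.  The
# helper names are invented for illustration:
#
#     def sow_tokens(text):
#         # split-on-whitespace, after case folding
#         return text.lower().split()
#
#     def char_ngrams(text, n=5):
#         # overlapping character n-grams, after case folding
#         text = text.lower()
#         return [text[i:i+n] for i in range(len(text) - n + 1)]
#
#     phrase = "AMD Athlon XP 1500+"
#     sow_tokens(phrase)        # ['amd', 'athlon', 'xp', '1500+'] -- 4 tokens
#     len(char_ngrams(phrase))  # 15 overlapping 5-grams, including ' athl',
#                               # 'lon x', and 'p 150' from the clue list above
#
# A single 4-word phrase becomes 15 correlated 5-grams, which is how one
# phrase can crowd out everything else once Graham's scheme keeps only the
# most extreme clues.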
##############################################################################
# How to tokenize?
#
# I started with string.split() merely for speed.  Over time I realized it
# was making interesting context distinctions qualitatively akin to n-gram
# schemes; e.g., "free!!" is a much stronger spam indicator than "free".  But
# unlike n-grams (whether word- or character-based) under Graham's scoring
# scheme, this mild context dependence never seems to go over the edge in
# giving "too much" credence to an unlucky phrase.
#
# OTOH, compared to "searching for words", it increases the size of the
# database substantially, less than but close to a factor of 2.  This is very
# much less than a word bigram scheme bloats it, but as always an increase
# isn't justified unless the results are better.
#
# Following are stats comparing
#
#     for token in text.split():  # left column
#
# to
#
#     for token in re.findall(r"[\w$\-\x80-\xff]+", text):  # right column
#
# text is case-normalized (text.lower()) in both cases, and the runs were
# identical in all other respects.  The results clearly favor the split()
# gimmick, although they vaguely suggest that some sort of compromise
# may do as well with less database burden; e.g., *perhaps* folding runs of
# "punctuation" characters into a canonical representative could do that.
# But the database size is reasonable without that, and plain split() avoids
# having to worry about how to "fold punctuation" in languages other than
# English.
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.050  lost
#     0.050  0.150  lost
#     0.000  0.025  lost
#     0.025  0.050  lost
#     0.025  0.075  lost
#     0.050  0.150  lost
#     0.025  0.000  won
#     0.025  0.075  lost
#     0.000  0.025  lost
#     0.075  0.150  lost
#     0.050  0.050  tied
#     0.025  0.050  lost
#     0.000  0.025  lost
#     0.050  0.025  won
#     0.025  0.000  won
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.025  0.075  lost
#     0.050  0.175  lost
#
# won   3 times
# tied  3 times
# lost 14 times
#
# total unique fp went from 8 to 20
#
# false negative percentages
#     0.945  1.200  lost
#     0.836  1.018  lost
#     1.200  1.200  tied
#     1.418  1.636  lost
#     1.455  1.418  won
#     1.091  1.309  lost
#     1.091  1.272  lost
#     1.236  1.563  lost
#     1.564  1.855  lost
#     1.236  1.491  lost
#     1.563  1.599  lost
#     1.563  1.781  lost
#     1.236  1.709  lost
#     0.836  0.982  lost
#     0.873  1.382  lost
#     1.236  1.527  lost
#     1.273  1.418  lost
#     1.018  1.273  lost
#     1.091  1.091  tied
#     1.490  1.454  won
#
# won   2 times
# tied  2 times
# lost 16 times
#
# total unique fn went from 292 to 302
#
# Later:  Here's another tokenization scheme with more promise.
#
#     fold case, ignore punctuation, strip a trailing 's' from words (to
#     stop Guido griping about "hotel" and "hotels" getting scored as
#     distinct clues <wink>) and save both word bigrams and word unigrams
#
# This was the code:
#
#     # Tokenize everything in the body.
#     lastw = ''
#     for w in word_re.findall(text):
#         n = len(w)
#         # Make sure this range matches in tokenize_word().
#         if 3 <= n <= 12:
#             if w[-1] == 's':
#                 w = w[:-1]
#             yield w
#             if lastw:
#                 yield lastw + w
#             lastw = w + ' '
#
#         elif n >= 3:
#             lastw = ''
#             for t in tokenize_word(w):
#                 yield t
#
# where
#
#     word_re = re.compile(r"[\w$\-\x80-\xff]+")
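#
# For reference, here's a self-contained sketch of that experiment -- a
# rough reconstruction, not part of this module.  The real code handed
# over-long words to tokenize_word(); the stand-in below simply breaks the
# bigram chain and drops them:
#
#     import re
#
#     word_re = re.compile(r"[\w$\-\x80-\xff]+")
#
#     def unigrams_and_bigrams(text):
#         lastw = ''
#         for w in word_re.findall(text.lower()):
#             n = len(w)
#             if 3 <= n <= 12:
#                 if w[-1] == 's':
#                     w = w[:-1]
#                 yield w                   # word unigram
#                 if lastw:
#                     yield lastw + w       # word bigram: "previous current"
#                 lastw = w + ' '
#             elif n >= 3:
#                 lastw = ''                # over-long word: break the chain
#
#     list(unigrams_and_bigrams("Hotels and hotel rooms"))
#     # ['hotel', 'and', 'hotel and', 'hotel', 'and hotel', 'room', 'hotel room']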
#
# This at least doubled the process size.  It helped the f-n rate
# significantly, but probably hurt the f-p rate (the f-p rate is too low
# with only 4000 hams per run to be confident about changes of such small
# *absolute* magnitude -- 0.025% is a single message in the f-p table):
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.075  lost    +(was 0)
#     0.050  0.125  lost   +150.00%
#     0.025  0.000  won    -100.00%
#     0.075  0.025  won     -66.67%
#     0.000  0.050  lost    +(was 0)
#     0.100  0.175  lost    +75.00%
#     0.050  0.050  tied
#     0.025  0.050  lost   +100.00%
#     0.025  0.000  won    -100.00%
#     0.050  0.125  lost   +150.00%
#     0.050  0.025  won     -50.00%
#     0.050  0.050  tied
#     0.000  0.025  lost    +(was 0)
#     0.000  0.025  lost    +(was 0)
#     0.075  0.050  won     -33.33%
#     0.025  0.050  lost   +100.00%
#     0.000  0.000  tied
#     0.025  0.100  lost   +300.00%
#     0.050  0.150  lost   +200.00%
#
# won   5 times
# tied  4 times
# lost 11 times
#
# total unique fp went from 13 to 21
#
# false negative percentages
#     0.327  0.218  won     -33.33%
#     0.400  0.218  won     -45.50%
#     0.327  0.218  won     -33.33%
#     0.691  0.691  tied
#     0.545  0.327  won     -40.00%
#     0.291  0.218  won     -25.09%
#     0.218  0.291  lost    +33.49%
#     0.654  0.473  won     -27.68%
#     0.364  0.327  won     -10.16%
#     0.291  0.182  won     -37.46%
#     0.327  0.254  won     -22.32%
#     0.691  0.509  won     -26.34%
#     0.582  0.473  won     -18.73%
#     0.291  0.255  won     -12.37%
#     0.364  0.218  won     -40.11%
#     0.436  0.327  won     -25.00%