📄 tokenizer.py
#! /usr/bin/env python
"""Module to tokenize email messages for spam filtering."""

from __future__ import generators

import email
import email.Message
import email.Header
import email.Utils
import email.Errors
import re
import math
import time
import os
import binascii
import urlparse
import urllib

try:
    # We have three possibilities for Set:
    #  (a) With Python 2.2 and earlier, we use our compatsets class
    #  (b) With Python 2.3, we use the sets.Set class
    #  (c) With Python 2.4 and later, we use the builtin set class
    Set = set
except NameError:
    try:
        from sets import Set
    except ImportError:
        from spambayes.compatsets import Set

from spambayes import classifier
from spambayes.Options import options
from spambayes.mboxutils import get_message

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

try:
    import dnscache
    cache = dnscache.cache(cachefile=options["Tokenizer", "lookup_ip_cache"])
    cache.printStatsAtEnd = True
except (IOError, ImportError):
    cache = None
else:
    import atexit
    atexit.register(cache.close)

# Patch encodings.aliases to recognize 'ansi_x3_4_1968'
from encodings.aliases import aliases  # The aliases dictionary
if not aliases.has_key('ansi_x3_4_1968'):
    aliases['ansi_x3_4_1968'] = 'ascii'
del aliases  # Not needed any more

##############################################################################
# To fold case or not to fold case?  I didn't want to fold case, because
# it hides information in English, and I have no idea what .lower() does
# to other languages; and, indeed, 'FREE' (all caps) turned out to be one
# of the strongest spam indicators in my content-only tests (== one with
# prob 0.99 *and* made it into spamprob's nbest list very often).
#
# Against preserving case, it makes the database size larger, and requires
# more training data to get enough "representative" mixed-case examples.
#
# Running my c.l.py tests didn't support my intuition that case was
# valuable, so it's getting folded away now.  Folding or not made no
# significant difference to the false positive rate, and folding made a
# small (but statistically significant all the same) reduction in the
# false negative rate.  There is one obvious difference:  after folding
# case, conference announcements no longer got high spam scores.  Their
# content was usually fine, but they were highly penalized for VISIT OUR
# WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING.  That is
# indeed the language of advertising, and I halfway regret that folding
# away case no longer picks on them.
#
# Since the f-p rate didn't change, but conference announcements escaped
# that category, something else took their place.  It seems to be highly
# off-topic messages, like debates about Microsoft's place in the world.
# Talk about "money" and "lucrative" is indistinguishable now from talk
# about "MONEY" and "LUCRATIVE", and spam mentions MONEY a lot.
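#
# For anyone unsure what folding buys or costs at the token level, here is a
# tiny sketch -- not code from this module; the helper name is made up:
#
#    def tokens(text, fold_case=True):
#        if fold_case:
#            text = text.lower()
#        return text.split()
#
#    tokens("FREE money! Free MONEY!")
#    # ['free', 'money!', 'free', 'money!'] -- every casing pools into one clue
#
#    tokens("FREE money! Free MONEY!", fold_case=False)
#    # ['FREE', 'money!', 'Free', 'MONEY!'] -- 'FREE' keeps its own strong
#    # spamprob, but each casing variant needs its own training examples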
##############################################################################
# Character n-grams or words?
#
# With careful multiple-corpora c.l.py tests sticking to case-folded decoded
# text-only portions, and ignoring headers, and with identical special
# parsing & tagging of embedded URLs:
#
# Character 3-grams gave 5x as many false positives as split-on-whitespace
# (s-o-w).  The f-n rate was also significantly worse, but within a factor
# of 2.  So character 3-grams lost across the board.
#
# Character 5-grams gave 32% more f-ps than split-on-whitespace, but the
# s-o-w fp rate across 20,000 presumed-hams was 0.1%, and this is the
# difference between 23 and 34 f-ps.  There aren't enough there to say that's
# significantly more with killer-high confidence.  There were plenty of f-ns,
# though, and the f-n rate with character 5-grams was substantially *worse*
# than with character 3-grams (which in turn was substantially worse than
# with s-o-w).
#
# Training on character 5-grams creates many more unique tokens than s-o-w:
# a typical run bloated to 150MB process size.  It also ran a lot slower than
# s-o-w, partly related to heavy indexing of a huge out-of-cache wordinfo
# dict.  I rarely noticed disk activity when running s-o-w, so rarely
# bothered to look at process size; it was under 30MB last time I looked.
#
# Figuring out *why* a msg scored as it did proved much more mysterious when
# working with character n-grams:  they often had no obvious "meaning".  In
# contrast, it was always easy to figure out what s-o-w was picking up on.
# 5-grams flagged a msg from Christian Tismer as spam, where he was
# discussing the speed of tasklets under his new implementation of stackless:
#
#     prob = 0.99999998959
#     prob('ed sw') = 0.01
#     prob('http0:pgp') = 0.01
#     prob('http0:python') = 0.01
#     prob('hlon ') = 0.99
#     prob('http0:wwwkeys') = 0.01
#     prob('http0:starship') = 0.01
#     prob('http0:stackless') = 0.01
#     prob('n xp ') = 0.99
#     prob('on xp') = 0.99
#     prob('p 150') = 0.99
#     prob('lon x') = 0.99
#     prob(' amd ') = 0.99
#     prob(' xp 1') = 0.99
#     prob(' athl') = 0.99
#     prob('1500+') = 0.99
#     prob('xp 15') = 0.99
#
# The spam decision was baffling until I realized that *all* the high-
# probability spam 5-grams there came out of a single phrase:
#
#     AMD Athlon XP 1500+
#
# So Christian was punished for using a machine lots of spam tries to sell
# <wink>.  In a classic Bayesian classifier, this probably wouldn't have
# mattered, but Graham's throws away almost all the 5-grams from a msg,
# saving only the about-a-dozen farthest from a neutral 0.5.  So one bad
# phrase can kill you!  This appears to happen very rarely, but happened
# more than once.
#
# The conclusion is that character n-grams have almost nothing to recommend
# them under Graham's scheme:  harder to work with, slower, much larger
# database, worse results, and prone to rare mysterious disasters.
#
# There's one area they won hands-down:  detecting spam in what I assume are
# Asian languages.  The s-o-w scheme sometimes finds only line-ends to split
# on then, and then a "hey, this 'word' is way too big!  let's ignore it"
# gimmick kicks in, and produces no tokens at all.
#
# [Later:  we produce character 5-grams then under the s-o-w scheme, instead
# of ignoring the blob, but only if there are high-bit characters in the
# blob; e.g., there's no point 5-gramming uuencoded lines, and doing so
# would bloat the database size.]
#
# Interesting:  despite that odd example above, the *kinds* of f-p mistakes
# 5-grams made were very much like s-o-w made -- I recognized almost all of
# the 5-gram f-p messages from previous s-o-w runs.  For example, both
# schemes have a particular hatred for conference announcements, although
# s-o-w stopped hating them after folding case.  But 5-grams still hate them.
# Both schemes also hate msgs discussing HTML with examples, with about equal
# passion.  Both schemes hate brief "please subscribe [unsubscribe] me"
# msgs, although 5-grams seems to hate them more.
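#
# To make the 5-gram example above concrete, here's a rough sketch -- not
# code from this module -- of the two token streams being compared.  The
# helper names are invented for illustration:
#
#     def sow_tokens(text):
#         # split-on-whitespace, after case folding
#         return text.lower().split()
#
#     def char_ngrams(text, n=5):
#         # overlapping character n-grams, after case folding
#         text = text.lower()
#         return [text[i:i+n] for i in range(len(text) - n + 1)]
#
#     phrase = "AMD Athlon XP 1500+"
#     sow_tokens(phrase)        # ['amd', 'athlon', 'xp', '1500+'] -- 4 tokens
#     len(char_ngrams(phrase))  # 15 overlapping 5-grams, including ' athl',
#                               # 'lon x', and 'p 150' from the clue list above
#
# A single 4-word phrase becomes 15 correlated 5-grams, which is how one
# phrase can crowd out everything else once Graham's scheme keeps only the
# most extreme clues.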
##############################################################################
# How to tokenize?
#
# I started with string.split() merely for speed.  Over time I realized it
# was making interesting context distinctions qualitatively akin to n-gram
# schemes; e.g., "free!!" is a much stronger spam indicator than "free".  But
# unlike n-grams (whether word- or character-based) under Graham's scoring
# scheme, this mild context dependence never seems to go over the edge in
# giving "too much" credence to an unlucky phrase.
#
# OTOH, compared to "searching for words", it increases the size of the
# database substantially, less than but close to a factor of 2.  This is very
# much less than a word bigram scheme bloats it, but as always an increase
# isn't justified unless the results are better.
#
# Following are stats comparing
#
#     for token in text.split():  # left column
#
# to
#
#     for token in re.findall(r"[\w$\-\x80-\xff]+", text):  # right column
#
# text is case-normalized (text.lower()) in both cases, and the runs were
# identical in all other respects.  The results clearly favor the split()
# gimmick, although they vaguely suggest that some sort of compromise
# may do as well with less database burden; e.g., *perhaps* folding runs of
# "punctuation" characters into a canonical representative could do that.
# But the database size is reasonable without that, and plain split() avoids
# having to worry about how to "fold punctuation" in languages other than
# English.
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.050  lost
#     0.050  0.150  lost
#     0.000  0.025  lost
#     0.025  0.050  lost
#     0.025  0.075  lost
#     0.050  0.150  lost
#     0.025  0.000  won
#     0.025  0.075  lost
#     0.000  0.025  lost
#     0.075  0.150  lost
#     0.050  0.050  tied
#     0.025  0.050  lost
#     0.000  0.025  lost
#     0.050  0.025  won
#     0.025  0.000  won
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.025  0.075  lost
#     0.050  0.175  lost
#
# won   3 times
# tied  3 times
# lost 14 times
#
# total unique fp went from 8 to 20
#
# false negative percentages
#     0.945  1.200  lost
#     0.836  1.018  lost
#     1.200  1.200  tied
#     1.418  1.636  lost
#     1.455  1.418  won
#     1.091  1.309  lost
#     1.091  1.272  lost
#     1.236  1.563  lost
#     1.564  1.855  lost
#     1.236  1.491  lost
#     1.563  1.599  lost
#     1.563  1.781  lost
#     1.236  1.709  lost
#     0.836  0.982  lost
#     0.873  1.382  lost
#     1.236  1.527  lost
#     1.273  1.418  lost
#     1.018  1.273  lost
#     1.091  1.091  tied
#     1.490  1.454  won
#
# won   2 times
# tied  2 times
# lost 16 times
#
# total unique fn went from 292 to 302
#
# Later:  Here's another tokenization scheme with more promise.
#
#     fold case, ignore punctuation, strip a trailing 's' from words (to
#     stop Guido griping about "hotel" and "hotels" getting scored as
#     distinct clues <wink>) and save both word bigrams and word unigrams
#
# This was the code:
#
#     # Tokenize everything in the body.
#     lastw = ''
#     for w in word_re.findall(text):
#         n = len(w)
#         # Make sure this range matches in tokenize_word().
#         if 3 <= n <= 12:
#             if w[-1] == 's':
#                 w = w[:-1]
#             yield w
#             if lastw:
#                 yield lastw + w
#             lastw = w + ' '
#
#         elif n >= 3:
#             lastw = ''
#             for t in tokenize_word(w):
#                 yield t
#
# where
#
#     word_re = re.compile(r"[\w$\-\x80-\xff]+")
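#
# For reference, here's a self-contained sketch of that experiment -- a
# rough reconstruction, not part of this module.  The real code handed
# over-long words to tokenize_word(); the stand-in below simply breaks the
# bigram chain and drops them:
#
#     import re
#
#     word_re = re.compile(r"[\w$\-\x80-\xff]+")
#
#     def unigrams_and_bigrams(text):
#         lastw = ''
#         for w in word_re.findall(text.lower()):
#             n = len(w)
#             if 3 <= n <= 12:
#                 if w[-1] == 's':
#                     w = w[:-1]
#                 yield w                   # word unigram
#                 if lastw:
#                     yield lastw + w       # word bigram: "previous current"
#                 lastw = w + ' '
#             elif n >= 3:
#                 lastw = ''                # over-long word: break the chain
#
#     list(unigrams_and_bigrams("Hotels and hotel rooms"))
#     # ['hotel', 'and', 'hotel and', 'hotel', 'and hotel', 'room', 'hotel room']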
#
# This at least doubled the process size.  It helped the f-n rate
# significantly, but probably hurt the f-p rate (the f-p rate is too low
# with only 4000 hams per run to be confident about changes of such small
# *absolute* magnitude -- 0.025% is a single message in the f-p table):
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.075  lost    +(was 0)
#     0.050  0.125  lost   +150.00%
#     0.025  0.000  won    -100.00%
#     0.075  0.025  won     -66.67%
#     0.000  0.050  lost    +(was 0)
#     0.100  0.175  lost    +75.00%
#     0.050  0.050  tied
#     0.025  0.050  lost   +100.00%
#     0.025  0.000  won    -100.00%
#     0.050  0.125  lost   +150.00%
#     0.050  0.025  won     -50.00%
#     0.050  0.050  tied
#     0.000  0.025  lost    +(was 0)
#     0.000  0.025  lost    +(was 0)
#     0.075  0.050  won     -33.33%
#     0.025  0.050  lost   +100.00%
#     0.000  0.000  tied
#     0.025  0.100  lost   +300.00%
#     0.050  0.150  lost   +200.00%
#
# won   5 times
# tied  4 times
# lost 11 times
#
# total unique fp went from 13 to 21
#
# false negative percentages
#     0.327  0.218  won     -33.33%
#     0.400  0.218  won     -45.50%
#     0.327  0.218  won     -33.33%
#     0.691  0.691  tied
#     0.545  0.327  won     -40.00%
#     0.291  0.218  won     -25.09%
#     0.218  0.291  lost    +33.49%
#     0.654  0.473  won     -27.68%
#     0.364  0.327  won     -10.16%
#     0.291  0.182  won     -37.46%
#     0.327  0.254  won     -22.32%
#     0.691  0.509  won     -26.34%
#     0.582  0.473  won     -18.73%
#     0.291  0.255  won     -12.37%
#     0.364  0.218  won     -40.11%
#     0.436  0.327  won     -25.00%