# tokenizer.py
#     0.436  0.473  lost    +8.49%
#     0.218  0.218  tied
#     0.291  0.255  won    -12.37%
#     0.254  0.364  lost   +43.31%
#
#     won  15 times
#     tied  2 times
#     lost  3 times
#
#     total unique fn went from 106 to 94
#
###########################################################################
# What about HTML?
#
# Computer geeks seem to view use of HTML in mailing lists and newsgroups as
# a mortal sin. Normal people don't, but so it goes: in a technical list/
# group, every HTML decoration has spamprob 0.99, there are lots of unique
# HTML decorations, and lots of them appear at the very start of the message
# so that Graham's scoring scheme latches on to them tight. As a result,
# any plain text message just containing an HTML example is likely to be
# judged spam (every HTML decoration is an extreme).
#
# So if a message is multipart/alternative with both text/plain and text/html
# branches, we ignore the latter, else newbies would never get a message
# through. If a message is just HTML, it has virtually no chance of getting
# through.
#
# In an effort to let normal people use mailing lists too <wink>, and to
# alleviate the woes of messages merely *discussing* HTML practice, I
# added a gimmick to strip HTML tags after case-normalization and after
# special tagging of embedded URLs. This consisted of a regexp sub pattern,
# where instances got replaced by single blanks:
#
#    html_re = re.compile(r"""
#        <
#        [^\s<>]      # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
#        [^>]{0,128}  # search for the end '>', but don't chew up the world
#        >
#    """, re.VERBOSE)
#
# and then
#
#    text = html_re.sub(' ', text)
#
# Alas, little good came of this:
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.050  0.075  lost
#     0.000  0.000  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.050  lost
#     0.075  0.100  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.000  tied
#     0.025  0.050  lost
#     0.050  0.050  tied
#
#     won   0 times
#     tied 15 times
#     lost  5 times
#
#     total unique fp went from 8 to 12
#
# false negative percentages
#     0.945  1.164  lost
#     0.836  1.418  lost
#     1.200  1.272  lost
#     1.418  1.272  won
#     1.455  1.273  won
#     1.091  1.382  lost
#     1.091  1.309  lost
#     1.236  1.381  lost
#     1.564  1.745  lost
#     1.236  1.564  lost
#     1.563  1.781  lost
#     1.563  1.745  lost
#     1.236  1.455  lost
#     0.836  0.982  lost
#     0.873  1.309  lost
#     1.236  1.381  lost
#     1.273  1.273  tied
#     1.018  1.273  lost
#     1.091  1.200  lost
#     1.490  1.599  lost
#
#     won   2 times
#     tied  1 times
#     lost 17 times
#
#     total unique fn went from 292 to 327
#
# The messages merely discussing HTML were no longer fps, so it did what it
# intended there. But the f-n rate nearly doubled on at least one run -- so
# strong a set of spam indicators is the mere presence of HTML. The increase
# in the number of fps despite that the HTML-discussing msgs left that
# category remains mysterious to me, but it wasn't a significant increase
# so I let it drop.
#
# Later: If I simply give up on making mailing lists friendly to my sisters
# (they're not nerds, and create wonderfully attractive HTML msgs), a
# compromise is to strip HTML tags from only text/plain msgs. That's
# principled enough so far as it goes, and eliminates the HTML-discussing
# false positives. It remains disturbing that the f-n rate on pure HTML
# msgs increases significantly when stripping tags, so the code here doesn't
# do that part. However, even after stripping tags, the rates above show
# that at least 98% of spams are still correctly identified as spam.
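#
# To make the effect of that tag-stripping gimmick concrete, here is a rough
# sketch of what it did (illustrative only -- the sample text is invented, and
# the pattern is just a one-line spelling of the VERBOSE pattern quoted above):
#
#     >>> import re
#     >>> html_re = re.compile(r"<[^\s<>][^>]{0,128}>")
#     >>> html_re.sub(' ', "Save <b>50%</b> now! Note a < b and i << 5 stay put.")
#     'Save  50%  now! Note a < b and i << 5 stay put.'
#
# Tags like <b> and </b> collapse to single blanks, while comparisons such as
# 'a < b' and shifts such as 'i << 5' are left untouched.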
#
# So, if another way is found to slash the f-n rate, the decision here not
# to strip HTML from HTML-only msgs should be revisited.
#
# Later, after the f-n rate got slashed via other means:
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.050  0.075  lost    +50.00%
#     0.025  0.025  tied
#     0.075  0.025  won     -66.67%
#     0.000  0.000  tied
#     0.100  0.100  tied
#     0.050  0.075  lost    +50.00%
#     0.025  0.025  tied
#     0.025  0.000  won    -100.00%
#     0.050  0.075  lost    +50.00%
#     0.050  0.050  tied
#     0.050  0.025  won     -50.00%
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.075  0.075  tied
#     0.025  0.025  tied
#     0.000  0.000  tied
#     0.025  0.025  tied
#     0.050  0.050  tied
#
#     won   3 times
#     tied 14 times
#     lost  3 times
#
#     total unique fp went from 13 to 11
#
# false negative percentages
#     0.327  0.400  lost    +22.32%
#     0.400  0.400  tied
#     0.327  0.473  lost    +44.65%
#     0.691  0.654  won      -5.35%
#     0.545  0.473  won     -13.21%
#     0.291  0.364  lost    +25.09%
#     0.218  0.291  lost    +33.49%
#     0.654  0.654  tied
#     0.364  0.473  lost    +29.95%
#     0.291  0.327  lost    +12.37%
#     0.327  0.291  won     -11.01%
#     0.691  0.654  won      -5.35%
#     0.582  0.655  lost    +12.54%
#     0.291  0.400  lost    +37.46%
#     0.364  0.436  lost    +19.78%
#     0.436  0.582  lost    +33.49%
#     0.436  0.364  won     -16.51%
#     0.218  0.291  lost    +33.49%
#     0.291  0.400  lost    +37.46%
#     0.254  0.327  lost    +28.74%
#
#     won   5 times
#     tied  2 times
#     lost 13 times
#
#     total unique fn went from 106 to 122
#
# So HTML decorations are still a significant clue when the ham is composed
# of c.l.py traffic. Again, this should be revisited if the f-n rate is
# slashed again.
#
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
# was introduced to control this, and it defaulted to False. Later, as the
# algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
# and it defaults to False. Later: ignore_redundant_html was also removed.
#
###########################################################################
# How big should "a word" be?
#
# As I write this, words less than 3 chars are ignored completely, and words
# with more than 12 are special-cased, replaced with a summary "I skipped
# about so-and-so many chars starting with such-and-such a letter" token.
# This makes sense for English if most of the info is in "regular size"
# words.
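#
# As a concrete (purely illustrative) sketch of those rules -- the spelling
# of the summary token below is made up for this example, not necessarily
# what the tokenizer actually emits:
#
#     def crack_word(word, skip_max_word_size=12):
#         """Keep "regular size" words; summarize the rest."""
#         n = len(word)
#         if n < 3:
#             return []                      # too short: ignored completely
#         if n <= skip_max_word_size:
#             return [word]                  # regular size: kept as-is
#         # too long: replaced with a summary token recording roughly how
#         # many chars were skipped and what letter the word started with
#         return ["skip:%c %d" % (word[0], n)]
#
#     crack_word("to")                       # -> []
#     crack_word("tokenizer")                # -> ['tokenizer']
#     crack_word("supercalifragilistic")     # -> ['skip:s 20']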
#
# A test run boosting to 13 had no effect on f-p rate, and did a little
# better or worse than 12 across runs -- overall, no significant difference.
# The database size is smaller at 12, so there's nothing in favor of 13.
# A test at 11 showed a slight but consistent bad effect on the f-n rate
# (lost 12 times, won once, tied 7 times).
#
# A test with no lower bound showed a significant increase in the f-n rate.
# Curious, but not worth digging into. Boosting the lower bound to 4 is a
# worse idea: f-p and f-n rates both suffered significantly then. I didn't
# try testing with lower bound 2.
#
# Anthony Baxter found that boosting the option skip_max_word_size to 20
# from its default of 12 produced a quite dramatic decrease in the number
# of 'unsure' messages. However, this was coupled with a large increase
# in the FN rate, and it remains unclear whether simply shifting cutoffs
# would have given the same tradeoff (not enough data was posted to tell).
#
# On Tim's c.l.py test, 10-fold CV, ham_cutoff=0.20 and spam_cutoff=0.80:
#
# -> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
# [ditto]
#
# filename:     max12    max20
# ham:spam:  20000:14000
#                     20000:14000
# fp total:        2        2      the same
# fp %:         0.01     0.01
# fn total:        0        0      the same
# fn %:         0.00     0.00
# unsure t:      103      100      slight decrease
# unsure %:     0.30     0.29
# real cost:  $40.60   $40.00      slight improvement with these cutoffs
# best cost:  $27.00   $27.40      best possible got slightly worse
# h mean:       0.28     0.27
# h sdev:       2.99     2.92
# s mean:      99.94    99.93
# s sdev:       1.41     1.47
# mean diff:   99.66    99.66
# k:           22.65    22.70
#
# "Best possible" in max20 would have been to boost ham_cutoff to 0.50(!),
# and drop spam_cutoff a little to 0.78. This would have traded away most
# of the unsures in return for letting 3 spam through:
#
# -> smallest ham & spam cutoffs 0.5 & 0.78
# -> fp 2; fn 3; unsure ham 11; unsure spam 11
# -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%
#
# Best possible in max12 was much the same:
#
# -> largest ham & spam cutoffs 0.5 & 0.78
# -> fp 2; fn 3; unsure ham 12; unsure spam 8
# -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
#
# The classifier pickle size increased by about 1.5 MB (~8.4% bigger).
#
# Rob Hooft's results were worse:
#
# -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
# [...]
# -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
#
# filename:    skip12   skip20
# ham:spam:  16000:5800
#                     16000:5800
# fp total:       12       13
# fp %:         0.07     0.08
# fn total:        7        7
# fn %:         0.12     0.12
# unsure t:      178      184
# unsure %:     0.82     0.84
# real cost: $162.60  $173.80
# best cost: $106.20  $109.60
# h mean:       0.51     0.52
# h sdev:       4.87     4.92
# s mean:      99.42    99.39
# s sdev:       5.22     5.34
# mean diff:   98.91    98.87
# k:            9.80     9.64

# textparts(msg) returns a set containing all the text components of msg.
# There's no point decoding binary blobs (like images). If a text/plain
# and text/html part happen to have redundant content, it doesn't matter
# to results, since training and scoring are done on the set of all
# words in the msg, without regard to how many times a given word appears.
def textparts(msg):
    """Return a set of all msg parts with content maintype 'text'."""
    return Set(filter(lambda part: part.get_content_maintype() == 'text',
                      msg.walk()))

def octetparts(msg):
    """Return a set of all msg parts with type 'application/octet-stream'."""
    return Set(filter(lambda part:
                      part.get_content_type() == 'application/octet-stream',
                      msg.walk()))

def imageparts(msg):
    """Return a list of all msg parts with type 'image/*'."""
    # Don't want a set here because we want to be able to process them in
    # order.
    return filter(lambda part:
                  part.get_content_type().startswith('image/'),
                  msg.walk())

has_highbit_char = re.compile(r"[\x80-\xff]").search

# Cheap-ass gimmick to probabilistically find HTML/XML tags.
# Note that <style and HTML comments are handled by crack_html_style()
# and crack_html_comment() instead -- they can be very long, and long
# minimal matches have a nasty habit of blowing the C stack.
html_re = re.compile(r"""
    <
    (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    # guessing that other tags are usually "short"
    [^>]{0,256} # search for the end '>', but don't run wild
    >
""", re.VERBOSE | re.DOTALL)
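
# A usage sketch for textparts() and friends above (the sample message is
# invented, and this assumes the module's usual imports -- email and the
# Set type -- are in place):
#
#     >>> import email
#     >>> msg = email.message_from_string(
#     ...     "Content-Type: text/plain\n\nAre you tired of spam?")
#     >>> [part.get_content_type() for part in textparts(msg)]
#     ['text/plain']
#     >>> list(octetparts(msg)), list(imageparts(msg))
#     ([], [])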

# Trailing letter serves to reject "hostnames" which are really ip
# addresses. Some spammers forge their apparent ip addresses, so you get
# Received: headers which look like:
#   Received: from 199.249.165.175 ([218.5.93.116])
#       by manatee.mojam.com (8.12.1-20030917/8.12.1) with SMTP id
#       hBIERsqI018090
#       for <itinerary@musi-cal.com>; Thu, 18 Dec 2003 08:28:11 -0600
# "199.249.165.175" is who the spamhaus said it was. That's really the
# ip address of the receiving host (manatee.mojam.com), which correctly
# identified the sender's ip address as 218.5.93.116.
#
# Similarly, the more complex character set instead of just \S serves to
# reject Received: headers where the message bounces from one user to
# another on the local machine:
#   Received: (from itin@localhost)
#       by manatee.mojam.com (8.12.1-20030917/8.12.1/Submit) id hBIEQFxF018044
#       for skip@manatee.mojam.com; Thu, 18 Dec 2003 08:26:15 -0600
received_host_re = re.compile(r'from ([a-z0-9._-]+[a-z])[)\s]')

# 99% of the time, the receiving host places the sender's ip address in