# tokenizer.py
#     0.436  0.473  lost    +8.49%
#     0.218  0.218  tied
#     0.291  0.255  won    -12.37%
#     0.254  0.364  lost   +43.31%
#
#     won  15 times
#     tied  2 times
#     lost  3 times
#
#     total unique fn went from 106 to 94
#
###########################################################################
# What about HTML?
#
# Computer geeks seem to view use of HTML in mailing lists and newsgroups as
# a mortal sin. Normal people don't, but so it goes: in a technical list/
# group, every HTML decoration has spamprob 0.99, there are lots of unique
# HTML decorations, and lots of them appear at the very start of the message
# so that Graham's scoring scheme latches on to them tight. As a result,
# any plain text message just containing an HTML example is likely to be
# judged spam (every HTML decoration is an extreme).
#
# So if a message is multipart/alternative with both text/plain and text/html
# branches, we ignore the latter, else newbies would never get a message
# through. If a message is just HTML, it has virtually no chance of getting
# through.
#
# In an effort to let normal people use mailing lists too <wink>, and to
# alleviate the woes of messages merely *discussing* HTML practice, I
# added a gimmick to strip HTML tags after case-normalization and after
# special tagging of embedded URLs. This consisted of a regexp sub pattern,
# where instances got replaced by single blanks:
#
#    html_re = re.compile(r"""
#        <
#        [^\s<>]      # e.g., don't match 'a < b' or '<<<' or 'i << 5' or 'a<>b'
#        [^>]{0,128}  # search for the end '>', but don't chew up the world
#        >
#    """, re.VERBOSE)
#
# and then
#
#    text = html_re.sub(' ', text)
#
# Alas, little good came of this:
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.050  0.075  lost
#     0.000  0.000  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.050  lost
#     0.075  0.100  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.000  0.025  lost
#     0.050  0.050  tied
#     0.025  0.025  tied
#     0.025  0.025  tied
#     0.000  0.000  tied
#     0.025  0.050  lost
#     0.050  0.050  tied
#
#     won   0 times
#     tied 15 times
#     lost  5 times
#
#     total unique fp went from 8 to 12
#
# false negative percentages
#     0.945  1.164  lost
#     0.836  1.418  lost
#     1.200  1.272  lost
#     1.418  1.272  won
#     1.455  1.273  won
#     1.091  1.382  lost
#     1.091  1.309  lost
#     1.236  1.381  lost
#     1.564  1.745  lost
#     1.236  1.564  lost
#     1.563  1.781  lost
#     1.563  1.745  lost
#     1.236  1.455  lost
#     0.836  0.982  lost
#     0.873  1.309  lost
#     1.236  1.381  lost
#     1.273  1.273  tied
#     1.018  1.273  lost
#     1.091  1.200  lost
#     1.490  1.599  lost
#
#     won   2 times
#     tied  1 times
#     lost 17 times
#
#     total unique fn went from 292 to 327
#
# The messages merely discussing HTML were no longer fps, so it did what it
# intended there. But the f-n rate nearly doubled on at least one run -- so
# strong a set of spam indicators is the mere presence of HTML. The increase
# in the number of fps despite that the HTML-discussing msgs left that
# category remains mysterious to me, but it wasn't a significant increase
# so I let it drop.
#
# Later: If I simply give up on making mailing lists friendly to my sisters
# (they're not nerds, and create wonderfully attractive HTML msgs), a
# compromise is to strip HTML tags from only text/plain msgs. That's
# principled enough so far as it goes, and eliminates the HTML-discussing
# false positives. It remains disturbing that the f-n rate on pure HTML
# msgs increases significantly when stripping tags, so the code here doesn't
# do that part. However, even after stripping tags, the rates above show
# that at least 98% of spams are still correctly identified as spam.
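#
# To make the effect of that tag-stripping gimmick concrete, here is a rough
# sketch of what it did (illustrative only -- the sample text is invented, and
# the pattern is just a one-line spelling of the VERBOSE pattern quoted above):
#
#     >>> import re
#     >>> html_re = re.compile(r"<[^\s<>][^>]{0,128}>")
#     >>> html_re.sub(' ', "Save <b>50%</b> now! Note a < b and i << 5 stay put.")
#     'Save  50%  now! Note a < b and i << 5 stay put.'
#
# Tags like <b> and </b> collapse to single blanks, while comparisons such as
# 'a < b' and shifts such as 'i << 5' are left untouched.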
#
# So, if another way is found to slash the f-n rate, the decision here not
# to strip HTML from HTML-only msgs should be revisited.
#
# Later, after the f-n rate got slashed via other means:
#
# false positive percentages
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.050  0.075  lost    +50.00%
#     0.025  0.025  tied
#     0.075  0.025  won     -66.67%
#     0.000  0.000  tied
#     0.100  0.100  tied
#     0.050  0.075  lost    +50.00%
#     0.025  0.025  tied
#     0.025  0.000  won    -100.00%
#     0.050  0.075  lost    +50.00%
#     0.050  0.050  tied
#     0.050  0.025  won     -50.00%
#     0.000  0.000  tied
#     0.000  0.000  tied
#     0.075  0.075  tied
#     0.025  0.025  tied
#     0.000  0.000  tied
#     0.025  0.025  tied
#     0.050  0.050  tied
#
#     won   3 times
#     tied 14 times
#     lost  3 times
#
#     total unique fp went from 13 to 11
#
# false negative percentages
#     0.327  0.400  lost    +22.32%
#     0.400  0.400  tied
#     0.327  0.473  lost    +44.65%
#     0.691  0.654  won      -5.35%
#     0.545  0.473  won     -13.21%
#     0.291  0.364  lost    +25.09%
#     0.218  0.291  lost    +33.49%
#     0.654  0.654  tied
#     0.364  0.473  lost    +29.95%
#     0.291  0.327  lost    +12.37%
#     0.327  0.291  won     -11.01%
#     0.691  0.654  won      -5.35%
#     0.582  0.655  lost    +12.54%
#     0.291  0.400  lost    +37.46%
#     0.364  0.436  lost    +19.78%
#     0.436  0.582  lost    +33.49%
#     0.436  0.364  won     -16.51%
#     0.218  0.291  lost    +33.49%
#     0.291  0.400  lost    +37.46%
#     0.254  0.327  lost    +28.74%
#
#     won   5 times
#     tied  2 times
#     lost 13 times
#
#     total unique fn went from 106 to 122
#
# So HTML decorations are still a significant clue when the ham is composed
# of c.l.py traffic. Again, this should be revisited if the f-n rate is
# slashed again.
#
# Later: As the amount of training data increased, the effect of retaining
# HTML tags decreased to insignificance. options.retain_pure_html_tags
# was introduced to control this, and it defaulted to False. Later, as the
# algorithm improved, retain_pure_html_tags was removed.
#
# Later: The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content. options.ignore_redundant_html was introduced to control this,
# and it defaults to False. Later: ignore_redundant_html was also removed.
#
###########################################################################
# How big should "a word" be?
#
# As I write this, words less than 3 chars are ignored completely, and words
# with more than 12 are special-cased, replaced with a summary "I skipped
# about so-and-so many chars starting with such-and-such a letter" token.
# This makes sense for English if most of the info is in "regular size"
# words.
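#
# As a concrete (purely illustrative) sketch of those rules -- the spelling
# of the summary token below is made up for this example, not necessarily
# what the tokenizer actually emits:
#
#     def crack_word(word, skip_max_word_size=12):
#         """Keep "regular size" words; summarize the rest."""
#         n = len(word)
#         if n < 3:
#             return []                      # too short: ignored completely
#         if n <= skip_max_word_size:
#             return [word]                  # regular size: kept as-is
#         # too long: replaced with a summary token recording roughly how
#         # many chars were skipped and what letter the word started with
#         return ["skip:%c %d" % (word[0], n)]
#
#     crack_word("to")                       # -> []
#     crack_word("tokenizer")                # -> ['tokenizer']
#     crack_word("supercalifragilistic")     # -> ['skip:s 20']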
#
# A test run boosting to 13 had no effect on f-p rate, and did a little
# better or worse than 12 across runs -- overall, no significant difference.
# The database size is smaller at 12, so there's nothing in favor of 13.
# A test at 11 showed a slight but consistent bad effect on the f-n rate
# (lost 12 times, won once, tied 7 times).
#
# A test with no lower bound showed a significant increase in the f-n rate.
# Curious, but not worth digging into. Boosting the lower bound to 4 is a
# worse idea: f-p and f-n rates both suffered significantly then. I didn't
# try testing with lower bound 2.
#
# Anthony Baxter found that boosting the option skip_max_word_size to 20
# from its default of 12 produced a quite dramatic decrease in the number
# of 'unsure' messages. However, this was coupled with a large increase
# in the FN rate, and it remains unclear whether simply shifting cutoffs
# would have given the same tradeoff (not enough data was posted to tell).
#
# On Tim's c.l.py test, 10-fold CV, ham_cutoff=0.20 and spam_cutoff=0.80:
#
# -> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
# [ditto]
#
# filename:     max12    max20
# ham:spam:  20000:14000
#                     20000:14000
# fp total:        2        2      the same
# fp %:         0.01     0.01
# fn total:        0        0      the same
# fn %:         0.00     0.00
# unsure t:      103      100      slight decrease
# unsure %:     0.30     0.29
# real cost:  $40.60   $40.00      slight improvement with these cutoffs
# best cost:  $27.00   $27.40      best possible got slightly worse
# h mean:       0.28     0.27
# h sdev:       2.99     2.92
# s mean:      99.94    99.93
# s sdev:       1.41     1.47
# mean diff:   99.66    99.66
# k:           22.65    22.70
#
# "Best possible" in max20 would have been to boost ham_cutoff to 0.50(!),
# and drop spam_cutoff a little to 0.78. This would have traded away most
# of the unsures in return for letting 3 spam through:
#
# -> smallest ham & spam cutoffs 0.5 & 0.78
# -> fp 2; fn 3; unsure ham 11; unsure spam 11
# -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%
#
# Best possible in max12 was much the same:
#
# -> largest ham & spam cutoffs 0.5 & 0.78
# -> fp 2; fn 3; unsure ham 12; unsure spam 8
# -> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
#
# The classifier pickle size increased by about 1.5 MB (~8.4% bigger).
#
# Rob Hooft's results were worse:
#
# -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
# [...]
# -> <stat> tested 1600 hams & 580 spams against 14400 hams & 5220 spams
#
# filename:    skip12   skip20
# ham:spam:  16000:5800
#                     16000:5800
# fp total:       12       13
# fp %:         0.07     0.08
# fn total:        7        7
# fn %:         0.12     0.12
# unsure t:      178      184
# unsure %:     0.82     0.84
# real cost: $162.60  $173.80
# best cost: $106.20  $109.60
# h mean:       0.51     0.52
# h sdev:       4.87     4.92
# s mean:      99.42    99.39
# s sdev:       5.22     5.34
# mean diff:   98.91    98.87
# k:            9.80     9.64

# textparts(msg) returns a set containing all the text components of msg.
# There's no point decoding binary blobs (like images). If a text/plain
# and text/html part happen to have redundant content, it doesn't matter
# to results, since training and scoring are done on the set of all
# words in the msg, without regard to how many times a given word appears.
def textparts(msg):
    """Return a set of all msg parts with content maintype 'text'."""
    return Set(filter(lambda part: part.get_content_maintype() == 'text',
                      msg.walk()))

def octetparts(msg):
    """Return a set of all msg parts with type 'application/octet-stream'."""
    return Set(filter(lambda part:
                      part.get_content_type() == 'application/octet-stream',
                      msg.walk()))

def imageparts(msg):
    """Return a list of all msg parts with type 'image/*'."""
    # Don't want a set here because we want to be able to process them in
    # order.
    return filter(lambda part:
                  part.get_content_type().startswith('image/'),
                  msg.walk())

has_highbit_char = re.compile(r"[\x80-\xff]").search

# Cheap-ass gimmick to probabilistically find HTML/XML tags.
# Note that <style and HTML comments are handled by crack_html_style()
# and crack_html_comment() instead -- they can be very long, and long
# minimal matches have a nasty habit of blowing the C stack.
html_re = re.compile(r"""
    <
    (?![\s<>])  # e.g., don't match 'a < b' or '<<<' or 'i<<5' or 'a<>b'
    # guessing that other tags are usually "short"
    [^>]{0,256} # search for the end '>', but don't run wild
    >
""", re.VERBOSE | re.DOTALL)
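
# A usage sketch for textparts() and friends above (the sample message is
# invented, and this assumes the module's usual imports -- email and the
# Set type -- are in place):
#
#     >>> import email
#     >>> msg = email.message_from_string(
#     ...     "Content-Type: text/plain\n\nAre you tired of spam?")
#     >>> [part.get_content_type() for part in textparts(msg)]
#     ['text/plain']
#     >>> list(octetparts(msg)), list(imageparts(msg))
#     ([], [])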

# Trailing letter serves to reject "hostnames" which are really ip
# addresses. Some spammers forge their apparent ip addresses, so you get
# Received: headers which look like:
#   Received: from 199.249.165.175 ([218.5.93.116])
#       by manatee.mojam.com (8.12.1-20030917/8.12.1) with SMTP id
#       hBIERsqI018090
#       for <itinerary@musi-cal.com>; Thu, 18 Dec 2003 08:28:11 -0600
# "199.249.165.175" is who the spamhaus said it was. That's really the
# ip address of the receiving host (manatee.mojam.com), which correctly
# identified the sender's ip address as 218.5.93.116.
#
# Similarly, the more complex character set instead of just \S serves to
# reject Received: headers where the message bounces from one user to
# another on the local machine:
#   Received: (from itin@localhost)
#       by manatee.mojam.com (8.12.1-20030917/8.12.1/Submit) id hBIEQFxF018044
#       for skip@manatee.mojam.com; Thu, 18 Dec 2003 08:26:15 -0600
received_host_re = re.compile(r'from ([a-z0-9._-]+[a-z])[)\s]')

# 99% of the time, the receiving host places the sender's ip address in