
incremental.howto.txt

There are a few steps to doing incremental training tests:

1. Get your corpora.  It's best if they're contemporaneous and
   single source, because that makes it much easier to sequence and
   group them.  The corpora need to be in the good old familiar
   Data/{Ham,Spam}/{reservoir,Set*} tree.  For my (Alex) purposes, I
   wrote the es2hs.py tool to grab stuff out of my real MH mail
   archive folders; other people may want some other method of
   getting the corpora into the tree.  If you're using Outlook, then
   the Outlook2000/export.py script is what you are after.  (A small
   sketch of the expected tree appears at the end of this note.)

2. Sort and group the corpora.  When testing, messages will be
   processed in sorted order.  The messages should all have unique
   names with a group number and an id number separated by a dash
   (eg. 0123-004556).  I (Alex) wrote sort+group.py for this.
   sort+group.py sorts the messages into chronological order (by
   topmost Received header) and then groups them by 24-hour period.
   The group number (0123) is the number of full 24-hour periods
   that elapsed between the time this msg was received and the time
   the oldest msg found was received.  The id number (004556) is a
   unique 0-based ordinal across all msgs seen, with 000000 given to
   the oldest msg found.  (The naming scheme is sketched at the end
   of this note.)

   With 1.0.x, note that this script will run through *all* the
   files in the Data directory, not just those in Data/Ham and
   Data/Spam.  With 1.1, only those specified in the ham_directories
   and spam_directories will be used, unless the -a option is used.

3. Distribute the corpora into multiple sets so you can do multiple
   similar runs to gauge validity of the results (similar to a
   cross-validation, but not really).  When testing, all but one set
   will be used for a particular run.  I personally use 5 sets.

   Distribution is done with mksets.py.  It will evenly distribute
   the corpora across the sets, keeping the groups evenly
   distributed, too.  You can specify the number of sets, limit the
   number of groups used (to make short runs), and limit the number
   of messages per group*set distributed (to simulate less mail per
   group, and thus get more fine-grained results).  (A toy version
   of the distribution idea is sketched at the end of this note.)

4. Run incremental.py to actually process the messages in a training
   and testing run.  How training is done is determined by what
   regime you specify (regimes are defined in the regimes.py file;
   see the perfect and corrected classes for examples, and the
   conceptual sketch at the end of this note).  For large corpora,
   you may want to do the various set runs separately (by specifying
   the -s option), instead of building nsets classifiers all in
   parallel (memory usage can get high).

   Make sure to save the output of incremental.py into a file... by
   itself it's ugly, but postprocessing can make it useful.

5. Postprocess the incremental.py output.  I made mkgraph.py to do
   this, outputting datasets for plotmtv.  plotmtv is a really neat
   data visualization tool.  Use it.  Love it.

   XXX tools for Excel.

See dotest.sh for a sample of automating steps 4 & 5.
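
Sketch for step 1: a minimal Python snippet that creates the empty
Data tree described above.  Assumes Python 3 and five sets; the tree
is normally populated by es2hs.py, Outlook2000/export.py, or whatever
export method you use.

    import os

    def make_tree(root="Data", nsets=5):
        # Create the Data/{Ham,Spam}/{reservoir,Set1..SetN} skeleton.
        for kind in ("Ham", "Spam"):
            os.makedirs(os.path.join(root, kind, "reservoir"), exist_ok=True)
            for n in range(1, nsets + 1):
                os.makedirs(os.path.join(root, kind, "Set%d" % n), exist_ok=True)

    if __name__ == "__main__":
        make_tree()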
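
Sketch for step 2: a rough illustration of the naming scheme, not the
actual sort+group.py.  It assumes each message is an
email.message.Message whose topmost Received header carries a
parseable date after the final ';'.

    import email.utils

    def received_time(msg):
        # Date in a Received header follows the final ';'.  Returns a Unix
        # timestamp; parsedate_tz() returns None for unparseable dates,
        # which a real tool would have to handle.
        date_part = msg.get("Received", "").rsplit(";", 1)[-1].strip()
        return email.utils.mktime_tz(email.utils.parsedate_tz(date_part))

    def assign_names(msgs):
        # msgs: list of (filename, timestamp) pairs.  The oldest message
        # gets id 000000; the group number is the count of full 24-hour
        # periods elapsed since the oldest message was received.
        msgs = sorted(msgs, key=lambda item: item[1])
        oldest = msgs[0][1]
        return [(fname, "%04d-%06d" % (int((when - oldest) // 86400), ordinal))
                for ordinal, (fname, when) in enumerate(msgs)]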
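
Sketch for step 3: a toy round-robin dealer showing the balancing idea,
not what mksets.py actually does.  Messages within each group are dealt
across the sets so every set gets a similar share of every group.

    from collections import defaultdict

    def distribute(names, nsets=5, max_per_group=None):
        # names: '0123-004556'-style names from step 2.
        # max_per_group optionally caps each group's size first.
        by_group = defaultdict(list)
        for name in sorted(names):
            by_group[name.split("-")[0]].append(name)
        sets = defaultdict(list)
        for group in sorted(by_group):
            members = by_group[group][:max_per_group]
            for i, name in enumerate(members):
                sets[i % nsets].append(name)
        return sets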
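
Sketch for step 4: a purely conceptual picture of what a training
regime decides after each message is scored.  The class and method
names here are hypothetical and do not reproduce the regimes.py
interface -- read the perfect and corrected classes there for the
real thing.

    # Hypothetical interface, for illustration only.
    class TrainOnEverything:
        # After every message is scored, train with the true label.
        def after_score(self, classifier, msg, is_spam, score):
            classifier.learn(msg, is_spam)

    class TrainOnMistakes:
        # Train only when the prediction was wrong or fell in the unsure
        # range, roughly modelling a user who only corrects mistakes.
        def __init__(self, ham_cutoff=0.2, spam_cutoff=0.9):
            self.ham_cutoff, self.spam_cutoff = ham_cutoff, spam_cutoff

        def after_score(self, classifier, msg, is_spam, score):
            unsure = self.ham_cutoff <= score <= self.spam_cutoff
            wrong = (score > self.spam_cutoff) != is_spam
            if unsure or wrong:
                classifier.learn(msg, is_spam)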
