
incremental.howto.txt

There are a few steps to doing incremental training tests:

1. Get your corpora.  It's best if they're contemporaneous and
   single source, because that makes it much easier to sequence and
   group them.  The corpora need to be in the good old familiar
   Data/{Ham,Spam}/{reservoir,Set*} tree.  For my (Alex) purposes, I
   wrote the es2hs.py tool to grab stuff out of my real MH mail
   archive folders; other people may want some other method of
   getting the corpora into the tree.  If you're using Outlook, then
   the Outlook2000/export.py script is what you are after.  (A small
   sketch of the expected tree appears at the end of this note.)

2. Sort and group the corpora.  When testing, messages will be
   processed in sorted order.  The messages should all have unique
   names with a group number and an id number separated by a dash
   (eg. 0123-004556).  I (Alex) wrote sort+group.py for this.
   sort+group.py sorts the messages into chronological order (by
   topmost Received header) and then groups them by 24-hour period.
   The group number (0123) is the number of full 24-hour periods
   that elapsed between the time this msg was received and the time
   the oldest msg found was received.  The id number (004556) is a
   unique 0-based ordinal across all msgs seen, with 000000 given to
   the oldest msg found.  (The naming scheme is sketched at the end
   of this note.)

   With 1.0.x, note that this script will run through *all* the
   files in the Data directory, not just those in Data/Ham and
   Data/Spam.  With 1.1, only those specified in the ham_directories
   and spam_directories will be used, unless the -a option is used.

3. Distribute the corpora into multiple sets so you can do multiple
   similar runs to gauge validity of the results (similar to a
   cross-validation, but not really).  When testing, all but one set
   will be used for a particular run.  I personally use 5 sets.

   Distribution is done with mksets.py.  It will evenly distribute
   the corpora across the sets, keeping the groups evenly
   distributed, too.  You can specify the number of sets, limit the
   number of groups used (to make short runs), and limit the number
   of messages per group*set distributed (to simulate less mail per
   group, and thus get more fine-grained results).  (A toy version
   of the distribution idea is sketched at the end of this note.)

4. Run incremental.py to actually process the messages in a training
   and testing run.  How training is done is determined by what
   regime you specify (regimes are defined in the regimes.py file;
   see the perfect and corrected classes for examples, and the
   conceptual sketch at the end of this note).  For large corpora,
   you may want to do the various set runs separately (by specifying
   the -s option), instead of building nsets classifiers all in
   parallel (memory usage can get high).

   Make sure to save the output of incremental.py into a file... by
   itself it's ugly, but postprocessing can make it useful.

5. Postprocess the incremental.py output.  I made mkgraph.py to do
   this, outputting datasets for plotmtv.  plotmtv is a really neat
   data visualization tool.  Use it.  Love it.

   XXX tools for Excel.

See dotest.sh for a sample of automating steps 4 & 5.
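
Sketch for step 1: a minimal Python snippet that creates the empty
Data tree described above.  Assumes Python 3 and five sets; the tree
is normally populated by es2hs.py, Outlook2000/export.py, or whatever
export method you use.

    import os

    def make_tree(root="Data", nsets=5):
        # Create the Data/{Ham,Spam}/{reservoir,Set1..SetN} skeleton.
        for kind in ("Ham", "Spam"):
            os.makedirs(os.path.join(root, kind, "reservoir"), exist_ok=True)
            for n in range(1, nsets + 1):
                os.makedirs(os.path.join(root, kind, "Set%d" % n), exist_ok=True)

    if __name__ == "__main__":
        make_tree()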
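
Sketch for step 2: a rough illustration of the naming scheme, not the
actual sort+group.py.  It assumes each message is an
email.message.Message whose topmost Received header carries a
parseable date after the final ';'.

    import email.utils

    def received_time(msg):
        # Date in a Received header follows the final ';'.  Returns a Unix
        # timestamp; parsedate_tz() returns None for unparseable dates,
        # which a real tool would have to handle.
        date_part = msg.get("Received", "").rsplit(";", 1)[-1].strip()
        return email.utils.mktime_tz(email.utils.parsedate_tz(date_part))

    def assign_names(msgs):
        # msgs: list of (filename, timestamp) pairs.  The oldest message
        # gets id 000000; the group number is the count of full 24-hour
        # periods elapsed since the oldest message was received.
        msgs = sorted(msgs, key=lambda item: item[1])
        oldest = msgs[0][1]
        return [(fname, "%04d-%06d" % (int((when - oldest) // 86400), ordinal))
                for ordinal, (fname, when) in enumerate(msgs)]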
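
Sketch for step 3: a toy round-robin dealer showing the balancing idea,
not what mksets.py actually does.  Messages within each group are dealt
across the sets so every set gets a similar share of every group.

    from collections import defaultdict

    def distribute(names, nsets=5, max_per_group=None):
        # names: '0123-004556'-style names from step 2.
        # max_per_group optionally caps each group's size first.
        by_group = defaultdict(list)
        for name in sorted(names):
            by_group[name.split("-")[0]].append(name)
        sets = defaultdict(list)
        for group in sorted(by_group):
            members = by_group[group][:max_per_group]
            for i, name in enumerate(members):
                sets[i % nsets].append(name)
        return sets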
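
Sketch for step 4: a purely conceptual picture of what a training
regime decides after each message is scored.  The class and method
names here are hypothetical and do not reproduce the regimes.py
interface -- read the perfect and corrected classes there for the
real thing.

    # Hypothetical interface, for illustration only.
    class TrainOnEverything:
        # After every message is scored, train with the true label.
        def after_score(self, classifier, msg, is_spam, score):
            classifier.learn(msg, is_spam)

    class TrainOnMistakes:
        # Train only when the prediction was wrong or fell in the unsure
        # range, roughly modelling a user who only corrects mistakes.
        def __init__(self, ham_cutoff=0.2, spam_cutoff=0.9):
            self.ham_cutoff, self.spam_cutoff = ham_cutoff, spam_cutoff

        def after_score(self, classifier, msg, is_spam, score):
            unsure = self.ham_cutoff <= score <= self.spam_cutoff
            wrong = (score > self.spam_cutoff) != is_spam
            if unsure or wrong:
                classifier.learn(msg, is_spam)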
