
📄 corpus.py

📁 A mail filter implemented in Python
#! /usr/bin/env python

'''Corpus.py - Spambayes corpus management framework.

Classes:
    Corpus - a collection of Messages
    ExpiryCorpus - a "young" Corpus
    MessageFactory - creates a Message

Abstract:
    A corpus is defined as a set of messages that share some common
    characteristic relative to spamness.  Examples might be spam, ham,
    unsure, or untrained, or "bayes rating between .4 and .6".  A
    corpus is a collection of messages.  Corpus is a dictionary that
    is keyed by the keys of the messages within it.  It is iterable,
    and observable.  Observers are notified when a message is added
    to or removed from the corpus.

    Corpus is designed to cache message objects.  By default, it will
    only engage in lazy creation of message objects, keeping those
    objects in memory until the corpus instance itself is destroyed.
    In large corpora, this could consume a large amount of memory.  A
    cacheSize operand is implemented on the constructor, which is used
    to limit the *number* of messages currently loaded into memory.
    The instance variable that implements this cache is
    Corpus.Corpus.msgs, a dictionary.  Access to this variable should
    be through keys(), [key], or using an iterator.  Direct access
    should not be used, as subclasses that manage their cache may use
    this variable very differently.

    Iterating Corpus objects is potentially very expensive, as each
    message in the corpus will be brought into memory.  For large
    corpora, this could consume a lot of system resources.

    ExpiryCorpus is designed to keep a corpus of file messages that
    are guaranteed to be younger than a given age.  The age is
    specified on the constructor, as a number of seconds in the past.
    If a message file was created before that point in time, the
    message is deemed to be "old" and thus ignored.  Access to a
    message that is deemed to be old will raise KeyError, which should
    be handled by the corpus user as appropriate.  While iterating,
    KeyError is handled by the iterator, and messages that raise
    KeyError are ignored.

    As messages pass their "expiration date," they are eligible for
    removal from the corpus. To remove them properly,
    removeExpiredMessages() should be called.  As messages are removed,
    observers are notified.

    ExpiryCorpus function is included into a concrete Corpus through
    multiple inheritance. It must be inherited before any inheritance
    that derives from Corpus.  For example:

        class RealCorpus(Corpus):
           ...

        class ExpiryRealCorpus(Corpus.ExpiryCorpus, RealCorpus):
           ...

    Messages have substance, which is the textual content of the
    message. They also have a key, which uniquely defines them within
    the corpus.  This framework makes no assumptions about how or if
    messages persist.

    MessageFactory is a required factory class, because Corpus is
    designed to do lazy initialization of messages and, as an abstract
    class, must know how to create concrete instances of the correct
    class.

To Do:
    o Suggestions?

'''

# This module is part of the spambayes project, which is Copyright 2002-3
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

from __future__ import generators

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
    def bool(val):
        return not not val

import sys           # for output of docstring
import time
import types
from spambayes.Options import options

SPAM = True
HAM = False

class Corpus:
    '''An observable dictionary of Messages'''

    def __init__(self, factory, cacheSize=-1):
        '''Constructor(MessageFactory)'''

        self.msgs = {}            # dict of all messages in corpus
                                  # value is None if msg not currently loaded
        self.keysInMemory = []    # keys of messages currently loaded
                                  # this *could* be derived by iterating msgs
        self.cacheSize = cacheSize  # max number of messages in memory
        self.observers = []       # observers of this corpus
        self.factory = factory    # factory for the correct Message subclass

    def addObserver(self, observer):
        '''Register an observer, which should implement
        onAddMessage, onRemoveMessage'''

        self.observers.append(observer)

    def addMessage(self, message, observer_flags=0):
        '''Add a Message to this corpus'''

        if options["globals", "verbose"]:
            print 'adding message %s to corpus' % (message.key())

        self.cacheMessage(message)

        for obs in self.observers:
            # there is no reason that a Corpus observer MUST be a Trainer
            # and so it may very well not be interested in AddMessage events
            # even though right now the only observable events are
            # training related
            if hasattr(obs, "onAddMessage"):
                obs.onAddMessage(message, observer_flags)

    def removeMessage(self, message, observer_flags=0):
        '''Remove a Message from this corpus'''

        key = message.key()
        if options["globals", "verbose"]:
            print 'removing message %s from corpus' % (key,)
        self.unCacheMessage(key)
        del self.msgs[key]

        for obs in self.observers:
            # see comments in event loop in addMessage
            if hasattr(obs, "onRemoveMessage"):
                obs.onRemoveMessage(message, observer_flags)

    def cacheMessage(self, message):
        '''Add a message to the in-memory cache'''
        # This method should probably not be overridden

        key = message.key()

        if options["globals", "verbose"]:
            print 'placing %s in corpus cache' % (key,)

        self.msgs[key] = message

        # Here is where we manage the in-memory cache size...
        self.keysInMemory.append(key)

        if self.cacheSize > 0:       # performance optimization
            if len(self.keysInMemory) > self.cacheSize:
                keyToFlush = self.keysInMemory[0]
                self.unCacheMessage(keyToFlush)

    def unCacheMessage(self, key):
        '''Remove a message from the in-memory cache'''
        # This method should probably not be overridden

        if options["globals", "verbose"]:
            print 'Flushing %s from corpus cache' % (key,)

        try:
            ki = self.keysInMemory.index(key)
        except ValueError:
            pass
        else:
            del self.keysInMemory[ki]

        self.msgs[key] = None

    def takeMessage(self, key, fromcorpus, fromCache=False):
        '''Move a Message from another corpus to this corpus'''

        msg = fromcorpus[key]
        msg.load() # ensure that the substance has been loaded
        # Remove needs to be first, because add changes the directory
        # of the message, and so remove won't work then.
        fromcorpus.removeMessage(msg)
        self.addMessage(msg)

    def get(self, key, default=None):
        if self.msgs.get(key, "") == "":
            return default
        else:
            return self[key]

    def __getitem__(self, key):
        '''Corpus is a dictionary'''

        amsg = self.msgs.get(key, "")

        if amsg == "":
            raise KeyError(key)

        if amsg is None:
            amsg = self.makeMessage(key)     # lazy init, saves memory
            self.cacheMessage(amsg)

        return amsg

    def keys(self):
        '''Message keys in the Corpus'''

        return self.msgs.keys()

    def __contains__(self, other):
        return other in self.msgs.values()

    def __iter__(self):
        '''Corpus is iterable'''

        for key in self.keys():
            yield self[key]

    def __str__(self):
        '''Instance as a printable string'''

        return self.__repr__()

    def __repr__(self):
        '''Instance as a representative string'''

        raise NotImplementedError

    def makeMessage(self, key, content=None):
        '''Call the factory to make a message'''

        # This method will likely be overridden
        msg = self.factory.create(key, content)

        return msg


class ExpiryCorpus:
    '''Mixin Class - Corpus of "young" file system artifacts'''

    def __init__(self, expireBefore):
        self.expireBefore = expireBefore
        # Only check for expiry after this time.
        self.expiry_due = time.time()

    def removeExpiredMessages(self):
        '''Kill expired messages'''

        # Only check for expired messages after this time.  We set this to the
        # closest-to-expiry message's expiry time, so that this method can be
        # called very regularly, and most of the time it will just immediately
        # return.
        if time.time() < self.expiry_due:
            return

        self.expiry_due = time.time() + self.expireBefore
        for key in self.keys()[:]:
            msg = self[key]
            timestamp = msg.createTimestamp()
            if timestamp < time.time() - self.expireBefore:
                if options["globals", "verbose"]:
                    print 'message %s has expired' % (msg.key(),)
                from spambayes.storage import NO_TRAINING_FLAG
                self.removeMessage(msg, observer_flags=NO_TRAINING_FLAG)
            elif timestamp + self.expireBefore < self.expiry_due:
                self.expiry_due = timestamp + self.expireBefore


class MessageFactory(object):
    '''Abstract Message Factory'''

    def create(self, key, content=None):
        '''Create a message instance'''
        raise NotImplementedError


if __name__ == '__main__':
    print >>sys.stderr, __doc__
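
The module docstring describes a dictionary whose values are created lazily and whose in-memory set is capped by `cacheSize`. The following is a minimal, standalone Python 3 sketch of that idea, not the spambayes implementation itself. The names `LazyCorpus`, `register`, `loader`, and `Recorder` are hypothetical and introduced only for illustration, messages are plain strings rather than Message objects, and the observer callback takes a single argument instead of the `(message, observer_flags)` pair used above.

```python
class LazyCorpus:
    """Hypothetical stand-in for Corpus: lazy loading plus a bounded cache."""

    def __init__(self, loader, cache_size=-1):
        self.loader = loader          # callable key -> message (factory role)
        self.msgs = {}                # key -> message, or None if not loaded
        self.keys_in_memory = []      # keys of loaded messages, oldest first
        self.cache_size = cache_size  # values <= 0 mean "no limit"
        self.observers = []

    def register(self, key):
        """Make a key known to the corpus without loading its message."""
        self.msgs.setdefault(key, None)

    def _cache(self, key, msg):
        self.msgs[key] = msg
        self.keys_in_memory.append(key)
        # Evict the oldest loaded message once the cap is exceeded.
        if 0 < self.cache_size < len(self.keys_in_memory):
            oldest = self.keys_in_memory.pop(0)
            self.msgs[oldest] = None  # key stays known, payload is dropped

    def add(self, key, msg):
        self._cache(key, msg)
        for obs in self.observers:
            # Observers are optional listeners, so check before calling.
            if hasattr(obs, "onAddMessage"):
                obs.onAddMessage(msg)

    def __getitem__(self, key):
        if key not in self.msgs:
            raise KeyError(key)
        msg = self.msgs[key]
        if msg is None:               # lazy init: load on first access
            msg = self.loader(key)
            self._cache(key, msg)
        return msg


class Recorder:
    """Hypothetical observer that records every added message."""

    def __init__(self):
        self.seen = []

    def onAddMessage(self, msg):
        self.seen.append(msg)


loads = []                            # records every call to the loader


def load(key):
    loads.append(key)
    return "body of " + key


corpus = LazyCorpus(load, cache_size=2)
recorder = Recorder()
corpus.observers.append(recorder)
for k in ("a", "b", "c"):
    corpus.register(k)

# Three accesses with a cap of two: loading "c" evicts "a".
corpus["a"]
corpus["b"]
corpus["c"]
```

Eviction here is first-in, first-out by load order, matching the `keysInMemory[0]` choice in `cacheMessage` above. An evicted key remains in the dictionary with a `None` value, so a later `corpus["a"]` calls the loader again rather than raising `KeyError`.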
