naivebayes.pm


package AI::Categorizer::Learner::NaiveBayes;

use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use Algorithm::NaiveBayes;

__PACKAGE__->valid_params
  (
   threshold => {type => SCALAR, default => 0.3},
  );

# Build the underlying Algorithm::NaiveBayes model from the knowledge
# set's documents, then train it.
sub create_model {
  my $self = shift;
  my $m = $self->{model} = Algorithm::NaiveBayes->new;

  foreach my $d ($self->knowledge_set->documents) {
    $m->add_instance(attributes => $d->features->as_hash,
                     label      => [ map $_->name, $d->categories ]);
  }
  $m->train;
}

# Return the model's prediction for a document, followed by the
# current membership threshold.
sub get_scores {
  my ($self, $newdoc) = @_;

  return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),
          $self->{threshold});
}

# Get the threshold, or set it when an argument is given.
sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub save_state {
  my $self = shift;
  local $self->{knowledge_set};  # Don't need the knowledge_set to categorize
  $self->SUPER::save_state(@_);
}

# Map the model's labels back to AI::Categorizer::Category objects.
sub categories {
  my $self = shift;
  return map AI::Categorizer::Category->by_name( name => $_ ), $self->{model}->labels;
}

1;

__END__

=head1 NAME

AI::Categorizer::Learner::NaiveBayes - Naive Bayes Algorithm For AI::Categorizer

=head1 SYNOPSIS

  use AI::Categorizer::Learner::NaiveBayes;

  # Here $k is an AI::Categorizer::KnowledgeSet object

  my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');

  ... time passes ...

  $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
  my $c = AI::Categorizer::Collection::Files->new( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This is an implementation of the Naive Bayes decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module).  See L<AI::Categorizer> for a complete
description of the interface.

This module is now a wrapper around the stand-alone
C<Algorithm::NaiveBayes> module.  I moved the discussion of Bayes'
Theorem into that module's documentation.

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.

=head2 new()

Creates a new Naive Bayes Learner and returns it.  In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:

=over 4

=item * threshold

Sets the score threshold for category membership.  The default is
currently 0.3.  Set the threshold lower to assign more categories per
document; set it higher to assign fewer.  This can be an effective way
to trade off between precision and recall.

=back

=head2 threshold()

Returns the current threshold value.  With an optional numeric
argument, you may set the threshold.

=head2 train(knowledge_set => $k)

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 CALCULATIONS

The various probabilities used in the calculations (discussed in the
C<Algorithm::NaiveBayes> documentation) are found directly from the
training documents.  For instance, if there are 5000 total tokens
(words) in the "sports" training documents and 200 of them are the
word "curling", then C<P(curling|sports) = 200/5000 = 0.04>.  If there
are 10,000 total tokens in the training corpus and 5,000 of them are
in documents belonging to the category "sports", then
C<P(sports) = 5,000/10,000 = 0.5>.

Because the probabilities involved are often very small and we
multiply many of them together, the result is often an extremely small
number.  This could pose problems of floating-point underflow, so
instead of working with the actual probabilities we work with the
logarithms of the probabilities.  This also speeds up various
calculations in the C<categorize()> method.

=head1 TO DO

More work on the confidence scores - right now the winning category
tends to dominate the scores overwhelmingly, when the scores should
probably be more evenly distributed.

=head1 AUTHOR

Ken Williams, ken@forum.swarthmore.edu

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Algorithm::NaiveBayes(3)

"A re-examination of text categorization methods" by Yiming Yang
L<http://www.cs.cmu.edu/~yiming/publications.html>

"On the Optimality of the Simple Bayesian Classifier under Zero-One
Loss" by Pedro Domingos
L<http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>

A simple but complete example of Bayes' Theorem from Dr. Math
L<http://www.mathforum.com/dr.math/problems/battisfore.03.22.99.html>

=cut
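
=head1 EXAMPLE: LOG-SPACE ARITHMETIC

The following stand-alone sketch illustrates the underflow argument
from L</CALCULATIONS>, using the figures quoted there.  It is not the
code used by C<Algorithm::NaiveBayes>; the variable names and the
document length of 300 tokens are assumptions chosen only for
illustration.

    use strict;
    use warnings;

    # Figures from the CALCULATIONS section.
    my $p_curling_given_sports = 200 / 5000;       # 0.04
    my $p_sports               = 5_000 / 10_000;   # 0.5

    # Hypothetical document: the same token repeated 300 times.
    my $tokens = 300;

    # Multiplying the raw probabilities drives the product toward zero ...
    my $product = $p_sports;
    $product *= $p_curling_given_sports for 1 .. $tokens;

    # ... whereas adding their logarithms stays comfortably in range.
    my $log_score = log($p_sports);
    $log_score += log($p_curling_given_sports) for 1 .. $tokens;

    printf "product   = %g\n",   $product;
    printf "log score = %.2f\n", $log_score;

With 64-bit floating point the plain product underflows to 0, while
the log score remains an ordinary negative number (roughly -966) that
can still be compared across categories.

=cut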
