featureselector.pm

AI::Categorizer is a framework for automatic text categorization. It consists of a collection of Perl modules …

package AI::Categorizer::FeatureSelector;

use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);

# Default constructor parameters, declared via Class::Container.
__PACKAGE__->valid_params
  (
   features_kept => {
                     type => SCALAR,
                     default => 0.2,
                    },
   verbose => {
               type => SCALAR,
               default => 0,
              },
  );

sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}

sub reduce_features {
  # Takes a feature vector whose weights are "feature scores", and
  # chops to the highest n features.  n is specified by the
  # 'features_kept' parameter.  If it's zero, all features are kept.
  # If it's between 0 and 1, we multiply by the present number of
  # features.  If it's greater than 1, we treat it as the number of
  # features to use.

  my ($self, $f, %args) = @_;
  my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
  return $f unless $kept;

  my $num_kept = ($kept < 1 ?
                  $f->length * $kept :
                  $kept);

  print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;

  # This is algorithmic overkill, but the sort seems fast enough.  Will revisit later.
  my $features = $f->as_hash;
  my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
                     [0 .. $num_kept-1];

  my $result = $f->intersection( \@new_features );
  print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
  return $result;
}

# Abstract methods - concrete subclasses must implement these.
sub rank_features;
sub scan_features;

# Rank every feature in the given KnowledgeSet, then trim the ranking
# down to the requested number of features.
sub select_features {
  my ($self, %args) = @_;

  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};

  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}

1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

=head1 SYNOPSIS

 ...

=head1 DESCRIPTION

The KnowledgeSet class provides an interface to a set of documents, a
set of categories, and a mapping between the two.  Many parameters for
controlling the processing of documents are managed by the
KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

=over 4

=item load

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.

=item categories

An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.

=item documents

An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.
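
For illustration, here is a hypothetical constructor call for each of
the three forms (the variable names are arbitrary, and any other
constructor arguments you would normally pass are omitted):

  use AI::Categorizer::KnowledgeSet;

  # Keep the 2000 highest-scoring features (absolute count)
  my $ks_abs  = AI::Categorizer::KnowledgeSet->new( features_kept => 2000 );

  # Keep the top 20% of all features (fraction between 0 and 1)
  my $ks_frac = AI::Categorizer::KnowledgeSet->new( features_kept => 0.2 );

  # Perform no feature selection; keep every feature
  my $ks_all  = AI::Categorizer::KnowledgeSet->new( features_kept => 0 );
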

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

Specifies how document word counts should be converted to vector
values.  Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval".  The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature.  The default weighting is C<xxx>.

The first character specifies the "term frequency" component, which
can take the following values:

=over 4

=item b

Binary weighting - 1 for terms present in a document, 0 for terms absent.

=item t

Raw term frequency - equal to the number of times a feature occurs in
the document.

=item x

A synonym for 't'.

=item n

Normalized term frequency - 0.5 + 0.5 * t/max(t).  This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.

=back

The second character specifies the "collection frequency" component,
which can take the following values:

=over 4

=item f

Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.

=item p

Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).

=item x

No change - multiply by 1.

=back

The third character specifies the "normalization" component, which
can take the following values:

=over 4

=item c

Apply cosine normalization - multiply by 1/length(document_vector).

=item x

No change - multiply by 1.

=back

The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
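
As an illustration, here are two hypothetical constructor calls (see
the component tables above for what each character means):

  use AI::Categorizer::KnowledgeSet;

  # 't' = raw term frequency, 'f' = inverse document frequency,
  # 'c' = cosine normalization
  my $ks_tfidf = AI::Categorizer::KnowledgeSet->new( tfidf_weighting => 'tfc' );

  # 'b' = binary term weight, no collection weighting, no normalization
  my $ks_bin   = AI::Categorizer::KnowledgeSet->new( tfidf_weighting => 'bxx' );
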

=item verbose

If set to a true value, some status/debugging information will be
output on C<STDOUT>.

=back

=item categories()

In a list context returns a list of all Category objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item documents()

In a list context returns a list of all Document objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item document()

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.

=item features()

Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.

=item verbose()

Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to
describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.

=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

=item load()

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).

=item add_document()

Given a Document object as an argument, this method will add it, and
any categories it belongs to, to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.

=item weigh_features()

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.

=item document_frequency()

Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.

=item partition()

Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should
be passed as arguments.  For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes).  The
partitions will be returned as a list.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut
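
=pod

=head1 EXAMPLE

A minimal sketch of how a concrete feature-selector subclass might be
driven through the C<select_features()> interface defined above.  The
subclass name C<AI::Categorizer::FeatureSelector::DocFrequency>, the
corpus path, and the variable names are illustrative assumptions; the
C<features_kept> and C<load> parameters and the C<select_features()>
call with a C<knowledge_set> argument come from the code and
documentation in this file.

  use AI::Categorizer::KnowledgeSet;
  use AI::Categorizer::FeatureSelector::DocFrequency;  # assumed subclass

  # Load a corpus into a KnowledgeSet (path is illustrative).
  my $ks = AI::Categorizer::KnowledgeSet->new( load => 'corpus/training' );

  # Rank the features, then keep the top 20% by the subclass's measure.
  my $selector = AI::Categorizer::FeatureSelector::DocFrequency->new
    ( features_kept => 0.2 );
  my $best = $selector->select_features( knowledge_set => $ks );

  # $best is an AI::Categorizer::FeatureVector whose weights are the
  # feature scores assigned by rank_features().

=cut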
