standardtokenizer.pm

From "Plucene-1.25.tar.gz" (the Perl version of Lucene) · PM code · 72 lines

package Plucene::Analysis::Standard::StandardTokenizer;

=head1 NAME

Plucene::Analysis::Standard::StandardTokenizer - standard tokenizer

=head1 SYNOPSIS

	# isa Plucene::Analysis::CharTokenizer

=head1 DESCRIPTION

This is the standard tokenizer.

This should be a good tokenizer for most European-language documents.

=head1 METHODS

=cut

use strict;
use warnings;

use base 'Plucene::Analysis::CharTokenizer';

# Don't blame me, blame the Plucene people!
my $alpha      = qr/\p{IsAlpha}+/;
my $apostrophe = qr/$alpha('$alpha)+/;
my $acronym    = qr/$alpha\.($alpha\.)+/;
my $company    = qr/$alpha(&|\@)$alpha/;
my $hostname   = qr/\w+(\.\w+)+/;
my $email      = qr/\w+\@$hostname/;
my $p          = qr/[_\/.,-]/;
my $hasdigit   = qr/\w*\d\w*/;
my $num        = qr/\w+$p$hasdigit|$hasdigit$p\w+
                   |\w+($p$hasdigit$p\w+)+
                   |$hasdigit($p\w+$p$hasdigit)+
                   |\w+$p$hasdigit($p\w+$p$hasdigit)+
                   |$hasdigit$p\w+($p$hasdigit$p\w+)+/x;

=head2 token_re

The regular expression for tokenising.

=cut

sub token_re {
	qr/
        $apostrophe | $acronym | $company | $hostname | $email | $num
        | \w+
    /x;
}

=head2 normalize

Remove 's and .

=cut

sub normalize {
	my $class = shift;

	# These are in the StandardFilter in Java, but Perl is not Java.
	# Thankfully.
	local $_ = shift;
	if (/$apostrophe/) { s/'s//; }
	if (/$company/)    { s/\.//g; }
	return $_;
}

1;
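
The effect of normalize is easier to see on a few sample tokens. The following is a minimal standalone sketch, not part of Plucene: it copies three of the patterns from the listing above so that it runs without Plucene installed, and demo_normalize is a hypothetical stand-in that applies the same two substitutions as normalize. The sample tokens are made up for illustration.

```perl
use strict;
use warnings;

# Patterns copied from the listing above (assumption: copied unchanged).
my $alpha      = qr/\p{IsAlpha}+/;
my $apostrophe = qr/$alpha('$alpha)+/;
my $company    = qr/$alpha(&|\@)$alpha/;

# Hypothetical stand-in for normalize(): same two substitutions,
# but a plain function instead of a class method.
sub demo_normalize {
	local $_ = shift;
	if (/$apostrophe/) { s/'s//; }     # drop the first 's
	if (/$company/)    { s/\.//g; }    # drop every dot, but only for company-style tokens
	return $_;
}

print demo_normalize("O'Reilly's"), "\n";    # O'Reilly
print demo_normalize("A.T&T"),      "\n";    # AT&T
print demo_normalize("I.B.M."),     "\n";    # I.B.M.
```

As written, the dot removal is guarded by the $company pattern, so a plain acronym such as I.B.M. passes through unchanged, while a token containing & or @ between letters loses all of its dots.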
