📄 perlfaq6.pod
       |         ##     OR

         .           ##  Anything other char
         [^/"'\\]*   ##  Chars which doesn't start a comment, string or escape
       )
     }{$2}gxs;

A slight modification also removes C++ comments:

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#$2#gs;

=head2 Can I use Perl regular expressions to match balanced text?

Although Perl regular expressions are more powerful than "mathematical"
regular expressions because they feature conveniences like backreferences
(C<\1> and its ilk), they still aren't powerful enough--with
the possible exception of bizarre and experimental features in the
development-track releases of Perl.  You still need to use non-regex
techniques to parse balanced text, such as the text enclosed between
matching parentheses or braces, for example.

An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
or C<(> and C<)> can be found in
http://www.perl.com/CPAN/authors/id/TOMC/scripts/pull_quotes.gz .

The C::Scan module from CPAN contains such subs for internal use,
but they are undocumented.

=head2 What does it mean that regexes are greedy?  How can I get around it?

Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed.  To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ".
The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.

=head2 How do I process each word on each line?

Use the split function:

    while (<>) {
        foreach $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

To work with only alphanumeric sequences (including underscores), you
might consider

    while (<>) {
        foreach $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream.  We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
            $seen{$1}++;
        }
    }
    while ( ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    while (<>) {
        $seen{$_}++;
    }
    while ( ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see L<perlfaq4>: ``How do I
sort a hash (optionally by value instead of key)?''.

=head2 How can I do approximate matching?

See the module String::Approx available from CPAN.

=head2 How do I efficiently match many regular expressions at once?

The following is extremely inefficient:

    # slow but obvious way
    @popstates = qw(CO ON MI WI MN);
    while (defined($line = <>)) {
        for $state (@popstates) {
            if ($line =~ /\b$state\b/i) {
                print $line;
                last;
            }
        }
    }

That's because Perl has to recompile all those patterns for each of
the lines of the file.
As of the 5.005 release, there's a much better
approach, one which makes use of the new C<qr//> operator:

    # use spiffy new qr// operator, with /i flag even
    use 5.005;
    @popstates = qw(CO ON MI WI MN);
    @poppats   = map { qr/\b$_\b/i } @popstates;
    while (defined($line = <>)) {
        for $patobj (@poppats) {
            if ($line =~ /$patobj/) {
                print $line;
                last;       # print each line at most once, as in the slow version
            }
        }
    }

=head2 Why don't word-boundary searches with C<\b> work for me?

Two common misconceptions are that C<\b> is a synonym for C<\s+> and
that it's the edge between whitespace characters and non-whitespace
characters.  Neither is correct.  C<\b> is the place between a C<\w>
character and a C<\W> character (that is, C<\b> is the edge of a
"word").  It's a zero-width assertion, just like C<^>, C<$>, and all
the other anchors, so it doesn't consume any characters.  L<perlre>
describes the behavior of all the regex metacharacters.

Here are examples of the incorrect application of C<\b>, with fixes:

    "two words" =~ /(\w+)\b(\w+)/;          # WRONG
    "two words" =~ /(\w+)\s+(\w+)/;         # right

    " =matchless= text" =~ /\b=(\w+)=\b/;   # WRONG
    " =matchless= text" =~ /=(\w+)=/;       # right

Although they may not do what you thought they did, C<\b> and C<\B>
can still be quite useful.  For an example of the correct use of
C<\b>, see the example of matching duplicate words over multiple
lines.

An example of using C<\B> is the pattern C<\Bis\B>.  This will find
occurrences of "is" on the insides of words only, as in "thistle", but
not "this" or "island".

=head2 Why does using $&, $`, or $' slow my program down?

Once Perl sees that you need one of these variables anywhere in
the program, it provides them on each and every pattern match.
The same mechanism that handles these provides for the use of $1, $2,
etc., so you pay the same price for each regex that contains capturing
parentheses.  If you never use $&, etc., in your script, then regexes
I<without> capturing parentheses won't be penalized.
So avoid $&, $',
and $` if you can, but if you can't, once you've used them at all, use
them at will because you've already paid the price.  Remember that some
algorithms really appreciate them.  As of the 5.005 release, the $&
variable is no longer "expensive" the way the other two are.

=head2 What good is C<\G> in a regular expression?

The notation C<\G> is used in a match or substitution in conjunction with
the C</g> modifier to anchor the regular expression to the point just past
where the last match occurred, i.e. the pos() point.  A failed match resets
the position of C<\G> unless the C</c> modifier is in effect.  C<\G> can be
used in a match without the C</g> modifier; it acts the same (i.e. still
anchors at the pos() point) but of course only matches once and does not
update pos(), as non-C</g> expressions never do.  C<\G> in an expression
applied to a target string that has never been matched against a C</g>
expression before or has had its pos() reset is functionally equivalent to
C<\A>, which matches at the beginning of the string.

For example, suppose you had a line of text quoted in standard mail
and Usenet notation (that is, with leading C<< > >> characters), and
you want to change each leading C<< > >> into a corresponding C<:>.  You
could do so in this way:

    s/^(>+)/':' x length($1)/gem;

Or, using C<\G>, the much simpler (and faster):

    s/\G>/:/g;

A more sophisticated use might involve a tokenizer.  The following
lex-like example is courtesy of Jeffrey Friedl.  It did not work in
5.003 due to bugs in that release, but does work in 5.004 or better.
(Note the use of C</c>, which prevents a failed match with C</g> from
resetting the search position back to the beginning of the string.)

    while (<>) {
      chomp;
      PARSER: {
           m/ \G( \d+\b    )/gcx    && do { print "number: $1\n";  redo; };
           m/ \G( \w+      )/gcx    && do { print "word: $1\n";    redo; };
           m/ \G( \s+      )/gcx    && do { print "space: $1\n";   redo; };
           m/ \G( [^\w\d]+ )/gcx    && do { print "other: $1\n";   redo; };
      }
    }

Of course, that could have been written as

    while (<>) {
      chomp;
      PARSER: {
           if ( /\G( \d+\b    )/gcx ) {
                print "number: $1\n";
                redo PARSER;
           }
           if ( /\G( \w+      )/gcx ) {
                print "word: $1\n";
                redo PARSER;
           }
           if ( /\G( \s+      )/gcx ) {
                print "space: $1\n";
                redo PARSER;
           }
           if ( /\G( [^\w\d]+ )/gcx ) {
                print "other: $1\n";
                redo PARSER;
           }
      }
    }

but then you lose the vertical alignment of the regular expressions.

=head2 Are Perl regexes DFAs or NFAs?  Are they POSIX compliant?

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing.  And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases.  (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.)  See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

=head2 What's wrong with using grep or map in a void context?

Both grep and map build a return list, regardless of their context.
This means you're making Perl go to the trouble of building up a
return list that you then just ignore.  That's no way to treat a
programming language, you insensitive scoundrel!

=head2 How can I match strings with multibyte characters?

This is hard, and there's no good way.  Perl does not directly support
wide characters.  It pretends that a byte and a character are
synonymous.
The following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about this
very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (i.e. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.).  Other bytes represent single characters, just like
ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>.  Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!"  string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX".  This is a big problem.

Here are a few ways, all painful, to deal with it:

    $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                       # are no longer adjacent.
    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        if ($char eq 'GX') {
            print "found GX!\n";    # print first, then leave the loop
            last;
        }
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') {
            print "found GX!\n";    # print first, then leave the loop
            last;
        }
    }

Or like this:

    die "sorry, Perl doesn't (yet) have Martian support )-:\n";

There are many double- (and multi-) byte encodings commonly used these
days.
Some versions of these have 1-, 2-, 3-, and 4-byte characters,
all mixed.

=head2 How do I match a pattern that is supplied by the user?

Well, if it's really a pattern, then just use

    chomp($pattern = <STDIN>);
    if ($line =~ /$pattern/) { }

Alternatively, since you have no guarantee that your user entered
a valid regular expression, trap the exception this way:

    if (eval { $line =~ /$pattern/ }) { }

If all you really want is to search for a string, not a pattern,
then you should either use the index() function, which is made for
string searching, or if you can't be disabused of using a pattern
match on a non-pattern, then be sure to use C<\Q>...C<\E>, documented
in L<perlre>.

    chomp($pattern = <STDIN>);      # drop the trailing newline first

    open (FILE, $input) or die "Couldn't open input $input: $!; aborting";
    while (<FILE>) {
        print if /\Q$pattern\E/;
    }
    close FILE;

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997-1999 Tom Christiansen and Nathan Torkington.
All rights reserved.

When included as part of the Standard Version of Perl, or as part of
its complete documentation whether printed or otherwise, this work
may be distributed only under the terms of Perl's Artistic License.
Any distribution of this file or derivatives thereof I<outside>
of that package require that special arrangements be made with
copyright holder.

Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain.  You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit.  A simple comment in the code giving
credit would be courteous but is not required.