many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>. The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp. So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g); # matches,
                               # $words[0] = 'cat'
                               # $words[1] = 'dog'
                               # $words[2] = 'house'

Closely associated with the C<//g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C<//g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1; # use metric units
    ...
    $x = <FILE>; # read in measurement
    $x =~ /^([+-]?\d+)\s*/g; # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g; # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.

C<\G> is also invaluable in processing fixed length records with
regexps.
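
As a minimal sketch of the fixed-length idea, the following walks a
string three characters at a time with C<\G> and uses C<//gc> so that
the position survives the final failed match. The variable names
(C<$records>, C<@fields>, C<$leftover>) and the sample data are made up
for illustration, and the C<//> defined-or operator assumes Perl 5.10
or later.

```perl
use strict;
use warnings;

# Hypothetical sample data: three complete 3-character records
# followed by an incomplete tail.
my $records = "AAABBBCCCDD";

my @fields;
# \G pins each match to where the previous one ended, so the regexp
# can never skip ahead and resynchronize in the middle of a record.
# The c modifier keeps pos($records) intact when the loop finally fails.
while ($records =~ /\G(\w{3})/gc) {
    push @fields, $1;
}

# Whatever lies beyond the last successful match is unparsed input.
my $leftover = substr($records, pos($records) // 0);

print "Fields: @fields\n";             # Fields: AAA BBB CCC
print "Unparsed tail: '$leftover'\n";  # Unparsed tail: 'DD'
```

Without the C<c> modifier, the failed match that ends the loop would
reset the position, and C<$leftover> would come back as the whole
string rather than just the tail.
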
Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA> > gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, not what we
want. The solution is to use C<\G> to anchor the match to the codon
alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

B<search and replace>

Regular expressions also play a big role in B<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~> > can be dropped.
If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/; # strip single quotes,
                          # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc. are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/; # doesn't do it all:
                     # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g; # does it all:
                      # $x contains "I batted four for four"

If you prefer 'regex' over 'regexp' in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/go;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line and the C<s///o> modifier to
compile the regexp only once. As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
the replacement string and the evaluated result is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text.
This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used C<s'''>, then the regexp and replacement are treated as single
quoted strings and there are no variable substitutions. C<s///> in list context
returns the same thing as in scalar context, i.e., the number of
matches.

B<The split operator>

The B<C<split> > function can also optionally use a matching operator
C<m//> to split a string. C<split /regexp/, string, limit> splits
C<string> into a list of substrings and returns that list. The regexp
is used to match the character sequence that the C<string> is split
with respect to. The C<limit>, if present, constrains splitting into
no more than C<limit> number of strings. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x; # $words[0] = 'Calvin'
                              # $words[1] = 'and'
                              # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the list produced contains the matched substrings from the
groupings as well.
For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x; # $dirs[0] = ''
                            # $dirs[1] = 'usr'
                            # $dirs[2] = 'bin'
                            # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x; # $parts[0] = ''
                               # $parts[1] = '/'
                               # $parts[2] = 'usr'
                               # $parts[3] = '/'
                               # $parts[4] = 'bin'
                               # $parts[5] = '/'
                               # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while... S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in Part 2
are analogous to flare guns and satellite phones. They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the new features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case.
C<\l> and C<\u> convert the next
character to lower or upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/; # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.'

C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
C<\U> and C<\E>, to lower or upper case:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/; # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/; # matches punch card string

If there is no C<\E>, case is converted until the end of the
string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects most non-alphabetic characters. For
instance,

    $x = "That !^*&%~& cat!"; # plain string, no \Q here
    $x =~ /\Q!^*&%~&\E/; # check for rough language

It does not protect C<$> or C<@>, so that variables can still be
substituted.

With the advent of 5.6.0, perl regexps can handle more than just the
standard ASCII character set. Perl now supports B<Unicode>, a standard
for encoding the character sets from many of the world's written
languages. Unicode does this by allowing characters to be more than
one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters
are still encoded as one byte, but characters greater than C<chr(127)>
may be stored as two or more bytes.

What does this mean for regexps? Well, regexp users don't need to know
much about perl's internal representation of strings. But they do need
to know 1) how to represent Unicode characters in a regexp and 2) when
a matching operation will treat the string to be searched as a
sequence of bytes (the old way) or as a sequence of Unicode characters
(the new way).
The answer to 1) is that Unicode characters greater
than C<chr(127)> may be represented using the C<\x{hex}> notation,
with C<hex> a hexadecimal integer:

    use utf8; # We will be doing Unicode processing
    /\x{263a}/; # match a Unicode smiley face :)

Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>. Note that this is different than C<\xab>,
which is just a hexadecimal byte with no Unicode
significance.

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the S<B<named character> > escape
sequence C<\N{name}>. C<name> is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    use utf8; # We will be doing Unicode processing
    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/; # matches

One can also use short names or restrict names to a certain alphabet:

    use utf8; # We will be doing Unicode processing
    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";
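
To connect the two notations, here is a small sketch showing that a
C<\x{hex}> escape and a C<\N{name}> escape can denote the same
character, and how the C<charnames> module can translate between code
points and names. The variable names are illustrative only, and the
sketch assumes a perl recent enough to provide C<charnames::viacode>
and C<charnames::vianame>.

```perl
use strict;
use warnings;
use charnames ':full';   # full Unicode names for \N{...}

# The same character written two ways: by code point and by name.
my $by_hex  = "\x{263a}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = ($by_hex eq $by_name) ? 1 : 0;

# A string built with a named character can be matched by code point.
my $text  = "abc\N{MERCURY}def";
my $found = ($text =~ /\x{263f}/) ? 1 : 0;

# Look up a name from a code point, and a code point from a name.
my $name = charnames::viacode(0x263a);
my $code = charnames::vianame("MERCURY");

print "same character: $same\n";
print "matched by code point: $found\n";
print "U+263A is named $name\n";
printf "MERCURY is U+%04X\n", $code;
```

Only code points and names are printed here, never the characters
themselves, so no output encoding layer is needed to avoid
wide-character warnings.
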