C<\d+> and the sign can be matched with C<[+-]>.  Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent.  One or more of these
parts is optional, so we need to check out the different
possibilities.  Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72.  As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>.  We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers.  We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number.  So the three cases of floating
point number without exponent are

    /[+-]?\d+\./;     # 1., 321., etc.
    /[+-]?\.\d+/;     # .1, .234, etc.
    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>.  If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents.  The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent.  Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa.  The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer.  So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher.
In complex situations like this, the C<//x> modifier for a
match is invaluable.  It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning.  Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp?  The answer is to backslash it
S<C<'\ '>> or put it in a character class S<C<[ ]>>.  The same thing
goes for pound signs: use C<\#> or C<[#]>.  For instance, Perl allows
a space between the sign and the mantissa or integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation.  Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ([eE][+-]?\d+)?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp.
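
The final compact regexp can be exercised against the sample numbers
mentioned earlier.  The following is only a minimal sketch: the helper
name C<is_number_string> and the list of test strings are assumptions
made for illustration, and the regexp is precompiled with C<qr//>
purely for convenience.

```perl
use strict;
use warnings;

# The final compact regexp from the text, compiled once with qr//.
my $number_re = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# Hypothetical helper: returns 1 if the whole string has number form.
sub is_number_string {
    my ($str) = @_;
    return $str =~ $number_re ? 1 : 0;
}

# Well-formed numbers from the text, followed by a few strings that
# should be rejected (a lone decimal point, a dangling exponent, a word).
for my $candidate ('123.', '0.345', '.34', '-1e6', '25.4E-72', '- 7',
                   '.', '1e', 'abc') {
    printf "%-12s %s\n", "'$candidate'",
        is_number_string($candidate) ? 'matches' : 'no match';
}
```

The first six strings should be reported as matches and the last three
as non-matches, which is a quick check that factoring out C<\d+> did
not change what the regexp accepts.
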
To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program.  This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.

=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs.  Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms.  We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches.  Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers.  There are a few more things you might
want to know about matching operators.

=head3 Optimizing pattern evaluation

We pointed out earlier that variables in regexps are substituted
before the regexp is evaluated:

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

This will print any lines containing the word C<Seuss>.  It is not as
efficient as it could be, however, because Perl has to re-evaluate
(or compile) C<$pattern> each time through the loop.  If C<$pattern> won't be
changing over the lifetime of the script, we can add the C<//o>
modifier, which directs Perl to only perform variable substitutions
once:

    #!/usr/bin/perl
    #    Improved simple_grep
    $regexp = shift;
    while (<>) {
        print if /$regexp/o;  # a good deal faster
    }

=head3 Prohibiting substitution

If you change C<$pattern> after the first substitution happens, Perl
will ignore it.
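
A small sketch may make this behavior concrete.  It runs the same match
with and without C<//o> while the interpolated variable changes between
passes; the variable names and the two sample patterns are assumptions
chosen for illustration.

```perl
use strict;
use warnings;

my (@without_o, @with_o);

for my $pattern ('cat', 'dog') {
    # Without //o the variable is interpolated afresh on every pass,
    # so the second pass really tests 'dog' against /dog/.
    push @without_o, ('dog' =~ /$pattern/  ? 1 : 0);

    # With //o the variable is interpolated on the first pass only,
    # so the second pass still tests 'dog' against /cat/.
    push @with_o,    ('dog' =~ /$pattern/o ? 1 : 0);
}

print "without //o: @without_o\n";   # 0 1
print "with //o:    @with_o\n";      # 0 0
```

The second line of output shows the later value of C<$pattern> being
ignored, which is exactly why C<//o> is only safe when the variable
never changes.
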
If you don't want any substitutions at all, use the
special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

Similar to strings, C<m''> acts like single quotes on a regexp; all other
C<m> delimiters act like double quotes.  If the regexp evaluates to the
empty string, the regexp in the I<last successful match> is used instead.
So we have

    "dog" =~ /d/;      # 'd' matches
    "dogbert" =~ //;   # this matches the 'd' regexp used before

=head3 Global matching

The final two modifiers C<//g> and C<//c> concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along.  You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example.  Suppose we have
a string that consists of words separated by spaces.  If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words?  This is the sort
of task C<//g> was made for.  To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c>, as in C</regexp/gc>.  The current position in the string is
associated with the string, not the regexp.
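
The position bookkeeping described above can be watched directly with
C<pos()>.  The following is a minimal sketch; the two sample strings and
all variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $first  = "cat dog house";
my $second = "red green";

# Each scalar-context //g match advances only the string it ran against.
$first  =~ /\w+/g;                  # matches 'cat'
$first  =~ /\w+/g;                  # matches 'dog'
$second =~ /\w+/g;                  # matches 'red'
my $first_pos  = pos $first;        # 7
my $second_pos = pos $second;       # 3

# A failed //g match resets the position, so pos() returns undef ...
$first =~ /\d+/g;
my $after_plain_fail = pos $first;  # undef

# ... but with //gc the position survives the failed match.
pos($first) = 7;                    # pos() can also be assigned to
$first =~ /\d+/gc;
my $after_gc_fail = pos $first;     # still 7

print "first: $first_pos, second: $second_pos\n";
print "after failed //g:  ",
    (defined $after_plain_fail ? $after_plain_fail : 'undef'), "\n";
print "after failed //gc: $after_gc_fail\n";
```
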
This means that different
strings have different positions and their respective positions can be
set or read independently.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp.  So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

Closely associated with the C<//g> modifier is the C<\G> anchor.  The
C<\G> anchor matches at the point where the previous C<//g> match left
off.  C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
Currently, the C<\G> anchor is only fully supported when used to anchor
to the start of the pattern.

C<\G> is also invaluable in processing fixed length records with
regexps.  Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>.  In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records.  The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA>> gives a match.  A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus.  What happened?

The answer is that our regexp works well until we get past the last
real match.  Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, which is not
what we want.  The solution is to use C<\G> to anchor the match to the
codon alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer.  This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

=head3 Search and replace

Regular expressions also play a big role in I<search and replace>
operations in Perl.  Search and replace is accomplished with the
C<s///> operator.  The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well.  The
C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~>> can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped.  With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc. are immediately available for use
in the replacemen