=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.
The anchor C<^> means match at the beginning of the string and the
anchor C<$> means match at the end of the string, or before a newline
at the end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
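
The alternation and grouping rules above can be checked with a short
script.  This is only a sketch: it relies on the special variable
C<$&>, which holds the text of the last successful match and has not
been introduced in this guide so far, and the array name C<@results>
is made up for the example.

```perl
use strict;
use warnings;

# @results is a hypothetical name used only in this sketch.
my @results;

# The earliest position in the string wins, whatever the order
# of the alternatives.
push @results, $& if "cats and dogs" =~ /dog|cat|bird/;

# At the same position, the first alternative that lets the
# match succeed is the one that is used.
push @results, $& if "cats" =~ /c|ca|cat|cats/;
push @results, $& if "cats" =~ /cats|cat|ca|c/;

# Grouping limits the alternation to one part of the regex.
push @results, $& if "housekeeper" =~ /house(cat|keeper)/;

# 'cb' contains neither 'ab' nor 'bb', so this regex fails.
push @results, ("cb" =~ /(a|b)b/ ? $& : "no match");

print "$_\n" for @results;
```

Run as is, the script should print C<cat>, C<c>, C<cats>,
C<housekeeper> and C<no match>, one per line.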