\&    "That hat is red" =~ /hat/;  # matches \*(Aqhat\*(Aq in \*(AqThat\*(Aq
.Ve
.PP
With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called \fImetacharacters\fR, are reserved
for use in regexp notation. The metacharacters are
.PP
.Vb 1
\&    {}[]()^$.|*+?\e
.Ve
.PP
The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:
.PP
.Vb 5
\&    "2+2=4" =~ /2+2/;  # doesn\*(Aqt match, + is a metacharacter
\&    "2+2=4" =~ /2\e+2/;  # matches, \e+ is treated like an ordinary +
\&    "The interval is [0,1)." =~ /[0,1)./  # is a syntax error!
\&    "The interval is [0,1)." =~ /\e[0,1\e)\e./  # matches
\&    "#!/usr/bin/perl" =~ /#!\e/usr\e/bin\e/perl/;  # matches
.Ve
.PP
In the last regexp, the forward slash \f(CW\*(Aq/\*(Aq\fR is also backslashed,
because it is used to delimit the regexp. This can lead to \s-1LTS\s0
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.
.PP
.Vb 1
\&    "#!/usr/bin/perl" =~ m!#\e!/usr/bin/perl!;  # easier to read
.Ve
.PP
The backslash character \f(CW\*(Aq\e\*(Aq\fR is a metacharacter itself and needs to
be backslashed:
.PP
.Vb 1
\&    \*(AqC:\eWIN32\*(Aq =~ /C:\e\eWIN/;  # matches
.Ve
.PP
In addition to the metacharacters, there are some \s-1ASCII\s0 characters
which don't have printable character equivalents and are instead
represented by \fIescape sequences\fR. Common examples are \f(CW\*(C`\et\*(C'\fR for a
tab, \f(CW\*(C`\en\*(C'\fR for a newline, \f(CW\*(C`\er\*(C'\fR for a carriage return and \f(CW\*(C`\ea\*(C'\fR for a
bell. If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., \f(CW\*(C`\e033\*(C'\fR, or hexadecimal escape
sequence, e.g., \f(CW\*(C`\ex1B\*(C'\fR may be a more natural representation for your
bytes.
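.PP
As a rough illustration of the octal and hexadecimal escapes just
described, the following sketch (not part of the original tutorial)
compares the escapes as double-quoted strings. It uses only the builtin
\&\f(CW\*(C`ord\*(C'\fR and assumes an \s-1ASCII\s0 platform for the numeric codes; under that
assumption all three lines are printed:
.PP
.Vb 10
\&    #!/usr/bin/perl
\&    use strict;
\&    use warnings;
\&
\&    # Octal \e033 and hexadecimal \ex1B name the same byte (ESC, code 27).
\&    print "same byte\en" if "\e033" eq "\ex1B" and ord("\ex1B") == 27;
\&    # Named escapes stand for fixed codes (ASCII assumed: tab 9, newline 10).
\&    print "tab is 9, newline is 10\en" if ord("\et") == 9 and ord("\en") == 10;
\&    # Spelling a string escape by escape yields the ordinary string.
\&    print "spells cat\en" if "\e143\ex61\ex74" eq "cat";
.Ve
.PP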
Here are some examples of escapes:
.PP
.Vb 4
\&    "1000\et2000" =~ m(0\et2)  # matches
\&    "1000\en2000" =~ /0\en20/  # matches
\&    "1000\et2000" =~ /\e000\et2/  # doesn\*(Aqt match, "0" ne "\e000"
\&    "cat" =~ /\e143\ex61\ex74/  # matches, but a weird way to spell cat
.Ve
.PP
If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:
.PP
.Vb 4
\&    $foo = \*(Aqhouse\*(Aq;
\&    \*(Aqhousecat\*(Aq =~ /$foo/;  # matches
\&    \*(Aqcathouse\*(Aq =~ /cat$foo/;  # matches
\&    \*(Aqhousecat\*(Aq =~ /${foo}cat/;  # matches
.Ve
.PP
So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a \fIvery simple\fR emulation of the Unix grep program:
.PP
.Vb 7
\&    % cat > simple_grep
\&    #!/usr/bin/perl
\&    $regexp = shift;
\&    while (<>) {
\&        print if /$regexp/;
\&    }
\&    ^D
\&
\&    % chmod +x simple_grep
\&
\&    % simple_grep abba /usr/dict/words
\&    Babbage
\&    cabbage
\&    cabbages
\&    sabbath
\&    Sabbathize
\&    Sabbathizes
\&    sabbatical
\&    scabbard
\&    scabbards
.Ve
.PP
This program is easy to understand. \f(CW\*(C`#!/usr/bin/perl\*(C'\fR is the standard
way to invoke a perl program from the shell.
\&\f(CW\*(C`$regexp\ =\ shift;\*(C'\fR saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. \f(CW\*(C`while\ (<>)\*(C'\fR loops over all the lines in
all the files. For each line, \f(CW\*(C`print\ if\ /$regexp/;\*(C'\fR prints the
line if the regexp matches the line.
In this line, both \f(CW\*(C`print\*(C'\fR and
\&\f(CW\*(C`/$regexp/\*(C'\fR use the default variable \f(CW$_\fR implicitly.
.PP
With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify \fIwhere\fR in the string the regexp should try to match. To do
this, we would use the \fIanchor\fR metacharacters \f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR. The
anchor \f(CW\*(C`^\*(C'\fR means match at the beginning of the string and the anchor
\&\f(CW\*(C`$\*(C'\fR means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:
.PP
.Vb 4
\&    "housekeeper" =~ /keeper/;  # matches
\&    "housekeeper" =~ /^keeper/;  # doesn\*(Aqt match
\&    "housekeeper" =~ /keeper$/;  # matches
\&    "housekeeper\en" =~ /keeper$/;  # matches
.Ve
.PP
The second regexp doesn't match because \f(CW\*(C`^\*(C'\fR constrains \f(CW\*(C`keeper\*(C'\fR to
match only at the beginning of the string, but \f(CW"housekeeper"\fR has
keeper starting in the middle. The third regexp does match, since the
\&\f(CW\*(C`$\*(C'\fR constrains \f(CW\*(C`keeper\*(C'\fR to match only at the end of the string.
.PP
When both \f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider
.PP
.Vb 3
\&    "keeper" =~ /^keep$/;  # doesn\*(Aqt match
\&    "keeper" =~ /^keeper$/;  # matches
\&    "" =~ /^$/;  # ^$ matches an empty string
.Ve
.PP
The first regexp doesn't match because the string has more to it than
\&\f(CW\*(C`keep\*(C'\fR. Since the second regexp is exactly the string, it
matches. Using both \f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't.
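.PP
To make the "before a newline at the end of the string" rule concrete,
here is a small sketch that tries \f(CW\*(C`/^keeper$/\*(C'\fR against a few strings.
The test strings and the variable names are illustrative choices, not
taken from the original text:
.PP
.Vb 9
\&    #!/usr/bin/perl
\&    use strict;
\&    use warnings;
\&
\&    for my $string ("keeper", "keeper\en", "keeper\en\en", "keepers") {
\&        (my $shown = $string) =~ s/\en/\e\en/g;  # make newlines visible
\&        my $verdict = $string =~ /^keeper$/ ? "matches" : "doesn\*(Aqt match";
\&        printf "%\-12s %s\en", $shown, $verdict;
\&    }
.Ve
.PP
Only the first two strings are reported as matching: a single trailing
newline is tolerated by \f(CW\*(C`$\*(C'\fR, but a second newline or any other extra
character is not.
.PP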
Suppose you are looking for a fellow named
bert, off in a string by himself:
.PP
.Vb 1
\&    "dogbert" =~ /bert/;  # matches, but not what you want
\&
\&    "dilbert" =~ /^bert/;  # doesn\*(Aqt match, but ..
\&    "bertram" =~ /^bert/;  # matches, so still not good enough
\&
\&    "bertram" =~ /^bert$/;  # doesn\*(Aqt match, good
\&    "dilbert" =~ /^bert$/;  # doesn\*(Aqt match, good
\&    "bert" =~ /^bert$/;  # matches, perfect
.Ve
.PP
Of course, in the case of a literal string, one could just as easily
use the string comparison \f(CW\*(C`$string\ eq\ \*(Aqbert\*(Aq\*(C'\fR and it would be
more efficient. The \f(CW\*(C`^...$\*(C'\fR regexp really becomes useful when we
add in the more powerful regexp tools below.
.Sh "Using character classes"
.IX Subsection "Using character classes"
Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a \fIwhole
class\fR of them.
.PP
One such concept is that of a \fIcharacter class\fR. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets \f(CW\*(C`[...]\*(C'\fR, with the set of characters
to be possibly matched inside. Here are some examples:
.PP
.Vb 4
\&    /cat/;  # matches \*(Aqcat\*(Aq
\&    /[bcr]at/;  # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
\&    /item[0123456789]/;  # matches \*(Aqitem0\*(Aq or ... or \*(Aqitem9\*(Aq
\&    "abc" =~ /[cab]/;  # matches \*(Aqa\*(Aq
.Ve
.PP
In the last statement, even though \f(CW\*(Aqc\*(Aq\fR is the first character in
the class, \f(CW\*(Aqa\*(Aq\fR matches because the first character position in the
string is the earliest point at which the regexp can match.
.PP
.Vb 2
\&    /[yY][eE][sS]/;  # match \*(Aqyes\*(Aq in a case\-insensitive way
\&                     # \*(Aqyes\*(Aq, \*(AqYes\*(Aq, \*(AqYES\*(Aq, etc.
.Ve
.PP
This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an \f(CW\*(Aqi\*(Aq\fR to the end of the match. Then \f(CW\*(C`/[yY][eE][sS]/;\*(C'\fR
can be rewritten as \f(CW\*(C`/yes/i;\*(C'\fR. The \f(CW\*(Aqi\*(Aq\fR stands for
case-insensitive and is an example of a \fImodifier\fR of the matching
operation. We will meet other modifiers later in the tutorial.
.PP
We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash \f(CW\*(C`\e\*(C'\fR to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are \f(CW\*(C`\-]\e^$\*(C'\fR (and
the pattern delimiter, whatever it is).
\&\f(CW\*(C`]\*(C'\fR is special because it denotes the end of a character class. \f(CW\*(C`$\*(C'\fR is
special because it denotes a scalar variable. \f(CW\*(C`\e\*(C'\fR is special because
it is used in escape sequences, just like above. Here is how the
special characters \f(CW\*(C`]$\e\*(C'\fR are handled:
.PP
.Vb 5
\&    /[\e]c]def/;  # matches \*(Aq]def\*(Aq or \*(Aqcdef\*(Aq
\&    $x = \*(Aqbcr\*(Aq;
\&    /[$x]at/;  # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
\&    /[\e$x]at/;  # matches \*(Aq$at\*(Aq or \*(Aqxat\*(Aq
\&    /[\e\e$x]at/;  # matches \*(Aq\eat\*(Aq, \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
.Ve
.PP
The last two are a little tricky.
In \f(CW\*(C`[\e$x]\*(C'\fR, the backslash protects
the dollar sign, so the character class has two members \f(CW\*(C`$\*(C'\fR and \f(CW\*(C`x\*(C'\fR.
In \f(CW\*(C`[\e\e$x]\*(C'\fR, the backslash is protected, so \f(CW$x\fR is treated as a
variable and substituted in double quote fashion.
.PP
The special character \f(CW\*(Aq\-\*(Aq\fR acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy \f(CW\*(C`[0123456789]\*(C'\fR and \f(CW\*(C`[abc...xyz]\*(C'\fR
become the svelte \f(CW\*(C`[0\-9]\*(C'\fR and \f(CW\*(C`[a\-z]\*(C'\fR. Some examples are
.PP
.Vb 6
\&    /item[0\-9]/;  # matches \*(Aqitem0\*(Aq or ... or \*(Aqitem9\*(Aq
\&    /[0\-9bx\-z]aa/;  # matches \*(Aq0aa\*(Aq, ..., \*(Aq9aa\*(Aq,
\&                    # \*(Aqbaa\*(Aq, \*(Aqxaa\*(Aq, \*(Aqyaa\*(Aq, or \*(Aqzaa\*(Aq
\&    /[0\-9a\-fA\-F]/;  # matches a hexadecimal digit
\&    /[0\-9a\-zA\-Z_]/;  # matches a "word" character,
\&                     # like those in a Perl variable name
.Ve
.PP
If \f(CW\*(Aq\-\*(Aq\fR is the first or last character in a character class, it is
treated as an ordinary character; \f(CW\*(C`[\-ab]\*(C'\fR, \f(CW\*(C`[ab\-]\*(C'\fR and \f(CW\*(C`[a\e\-b]\*(C'\fR are
all equivalent.
.PP
The special character \f(CW\*(C`^\*(C'\fR in the first position of a character class
denotes a \fInegated character class\fR, which matches any character but
those in the brackets. Both \f(CW\*(C`[...]\*(C'\fR and \f(CW\*(C`[^...]\*(C'\fR must match a
character, or the match fails.
Then
.PP
.Vb 4
\&    /[^a]at/;  # doesn\*(Aqt match \*(Aqaat\*(Aq or \*(Aqat\*(Aq, but matches
\&               # all other \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, \*(Aq0at\*(Aq, \*(Aq%at\*(Aq, etc.
\&    /[^0\-9]/;  # matches a non\-numeric character
\&    /[a^]at/;  # matches \*(Aqaat\*(Aq or \*(Aq^at\*(Aq; here \*(Aq^\*(Aq is ordinary
.Ve
.PP
Now, even \f(CW\*(C`[0\-9]\*(C'\fR can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes, as shown below.
Since the introduction of Unicode, these character classes match more
than just a few characters in the \s-1ISO\s0 8859\-1 range.
.IP "\(bu" 4
\&\ed matches a digit, not just [0\-9] but also digits from non-roman scripts
.IP "\(bu" 4
\&\es matches a whitespace character, the set [\e \et\er\en\ef] and others
.IP "\(bu" 4
\&\ew matches a word character (alphanumeric or _), not just [0\-9a\-zA\-Z_]
but also digits and characters from non-roman scripts
.IP "\(bu" 4
\&\eD is a negated \ed; it represents any other character than a digit, or [^\ed]
.IP "\(bu" 4
\&\eS is a negated \es; it represents any non-whitespace character [^\es]
.IP "\(bu" 4
\&\eW is a negated \ew; it represents any non-word character [^\ew]
.IP "\(bu" 4
The period '.' matches any character but \*(L"\en\*(R" (unless the modifier \f(CW\*(C`//s\*(C'\fR is
in effect, as explained below).
.PP
The \f(CW\*(C`\ed\es\ew\eD\eS\eW\*(C'\fR abbreviations can be used both inside and outside
of character classes. Here are some in use:
.PP
.Vb 7
\&    /\ed\ed:\ed\ed:\ed\ed/;  # matches a hh:mm:ss time format
\&    /[\ed\es]/;  # matches any digit or whitespace character
\&    /\ew\eW\ew/;  # matches a word char, followed by a
\&               # non\-word char, followed by a word char
\&    /..rt/;  # matches any two chars, followed by \*(Aqrt\*(Aq
\&    /end\e./;  # matches \*(Aqend.\*(Aq
\&    /end[.]/;  # same thing, matches \*(Aqend.\*(Aq
.Ve
.PP
Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period.
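.PP
The difference between an escaped and an unescaped period can be checked
with a short sketch such as the following. The sample strings and the
numbered labels are made up for illustration; only lines 1 and 3 are
printed:
.PP
.Vb 8
\&    #!/usr/bin/perl
\&    use strict;
\&    use warnings;
\&
\&    print "1\en" if "end!" =~ /end./;  # printed: a bare . accepts the \*(Aq!\*(Aq
\&    print "2\en" if "end!" =~ /end\e./;  # not printed: \e. needs a real period
\&    print "3\en" if "end." =~ /end[.]/;  # printed: inside [...] the period is ordinary
\&    print "4\en" if "end\en" =~ /end./;  # not printed: . does not match "\en"
.Ve
.PP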
Because, for example, \f(CW\*(C`\ed\*(C'\fR and \f(CW\*(C`\ew\*(C'\fR are sets
of characters, it is incorrect to think of \f(CW\*(C`[^\ed\ew]\*(C'\fR as \f(CW\*(C`[\eD\eW]\*(C'\fR; in
fact \f(CW\*(C`[^\ed\ew]\*(C'\fR is the same as \f(CW\*(C`[^\ew]\*(C'\fR (every \f(CW\*(C`\ed\*(C'\fR character is
also a \f(CW\*(C`\ew\*(C'\fR character), which is the same as
\&\f(CW\*(C`[\eW]\*(C'\fR. Think De Morgan's laws.
.PP
An anchor useful in basic regexps is the \fIword anchor\fR