Escapes. <BR> <BR>Part 2. Regular Expressions in Tcl 8.1 <BR>Tcl 8.1 regular expressions are basically a superset of 8.0 REs. This <BR>howto document has an overview of the new features. Please see the <BR>re_syntax(n) reference page for exact semantics and more details. <BR> <BR>Non-Greedy Quantifiers <BR>A quantifier specifies "how many." For example, the quantifier * in <BR>the RE z* matches zero or more zs. By default, regular expression <BR>quantifiers are greedy: they match as much text as they can. Tcl 8.1 REs <BR> also have non-greedy quantifiers, which match the least text they can. <BR> To make a non-greedy quantifier, add a question mark (?) at the end. <BR>Let's start by storing some HTML text in a variable, then using two <BR>regexp commands to match it. The first RE is greedy, and the second is <BR>non-greedy: <BR> <BR> <BR>% set x {<EM>He</EM> sits, but <EM>she</EM> stands.} <BR><EM>He</EM> sits, but <EM>she</EM> stands. <BR>% regexp {<EM>.*</EM>} $x match; set match <BR><EM>He</EM> sits, but <EM>she</EM> <BR>% regexp {<EM>.*?</EM>} $x match; set match <BR><EM>He</EM> <BR> <BR>The first RE <EM>.*</EM> is "greedy." It matches from the first <EM> <BR>to the last </EM>. The second RE <EM>.*?</EM>, with a question mark <BR>(?) after the * quantifier, is non-greedy: it matches as little text <BR>as possible after the first <EM>. Could you write a greedy RE that works <BR> like the non-greedy version? It isn't easy! A greedy RE like <BR><EM>[^<]*</EM> would do it in this case -- but it wouldn't work if there <BR> were other HTML tags (with a < character) between the pair of <EM> tags <BR> in the $x string. <BR>Here are a new string and another pair of REs to match it: <BR> <BR> <BR>% set y {123zzz456} <BR>123zzz456 <BR>% regexp {3z*} $y match; set match <BR>3zzz <BR>% regexp {3z*?} $y match; set match <BR>3 <BR> <BR>The greedy RE 3z* matches all the zs it can (three) under its "zero or <BR>more" rule. The non-greedy RE 3z*? 
matches just 3 because it matches the <BR> fewest zs it can under its "zero or more" rule. <BR>To review, the greedy quantifiers from Tcl 8.0 are: *, +, and ?. So <BR>the non-greedy quantifiers (added in Tcl 8.1) are: *?, +?, and ??. Tcl <BR>8.1 also has the new quantifiers {m}, {m,}, and {m,n}, as well as the <BR>non-greedy versions {m}?, {m,}?, and {m,n}?. The section on bounds <BR>explains them and has more examples of non-greedy matching. <BR> <BR>Backslash Escapes <BR>A backslash (\) disables the metacharacter after it. For example, a\* <BR>matches the character a followed by a literal asterisk (*) character. In <BR> Tcl 8.0 and before, it was legal to put a backslash before a <BR>non-metacharacter -- for instance, regexp {\p} matched the character p. <BR> (Note that regexp {\n} matched the character n, which was a source of <BR>confusion. To get a newline character into an RE before version 8.1, you <BR> had to write regexp "\n" so Tcl processing inside double quotes would <BR>convert the \n to a newline.) <BR>The Tcl 8.1 regular expression engine interprets backslash escapes <BR>itself. So now regexp {\n} matches a newline, not the character n. REs <BR>are simpler to write in 8.1 because of this. (You can still write regexp <BR> "\n" -- and let Tcl conversion happen inside the double quotes -- so <BR>most old code will still work.) <BR> <BR>One of the most important changes in 8.1 is that a backslash inside a <BR>bracket expression is treated as the start of an escape. In 8.0 and <BR>before, a backslash inside brackets was treated as a literal backslash <BR>character. For example, in 8.0 and before, regexp {[a\n]} would match <BR>the characters a, \, or n. But in 8.1, regexp {[a\n]} would match the <BR>characters a or newline (because \n is the backslash escape for <BR>"newline"). <BR> <BR>Tcl 8.1 has also added many new backslash escapes. For instance, \d <BR>matches a digit. 
Some of these are listed below, and the re_syntax(n) <BR>reference page has the whole list. <BR> <BR>In Tcl 8.1 regular expressions (but not in other parts of the language), <BR> it's illegal to use a backslash before a non-metacharacter unless it <BR>makes a valid escape. So regexp {\p} is now an error. If you have code <BR>that (for some bizarre reason) has regular expressions with a <BR>backslash before a non-metacharacter, like regexp {\p}, you'll need to <BR>fix it. <BR> <BR>As explained above, the Tcl 8.1 regular expression engine now interprets <BR> backslash sequences like \n to mean "newline". It also has four new <BR>kinds of escapes: character entry escapes, class shorthand escapes, <BR>constraint escapes, and back references. Here's an introduction. (The <BR>re_syntax(n) page has full details.) <BR> <BR>A character entry escape is a convenient way to enter a non-printing <BR>or other difficult character. For instance, \n represents a newline <BR>character. \uwxyz (where wxyz is hexadecimal) represents the Unicode <BR>character U+wxyz. <BR>Class shorthand escapes are shorthand for common character classes. <BR>For example, \d stands for [[:digit:]], which means "any single digit." <BR>A constraint escape constrains an RE to match only at a certain place. <BR>For example, the constraint escape \m matches only at the start of a <BR>word -- so the RE \mhi will match the third word in the string he said <BR>hi but won't match he said thigh. <BR>A back reference matches the same string that was matched by a <BR>previous parenthesized subexpression. (This works like subexpressions in <BR> regsub, but it's used for matching instead of extracting.) For example, <BR> (X.*Y)\1 matches any doubled string that starts with X and ends with Y, <BR> such as XYXY, XabcYXabcY, X--YX--Y, etc. 
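<BR> <BR>The four kinds of escapes described above can be tried out with a short script. This is only a minimal sketch, assuming a Tcl 8.1 or later interpreter; the variable names and the sample strings not quoted in the text are made up for illustration.

```tcl
# Character entry escape: inside braces, \n is interpreted by the RE engine
# itself and matches a newline character.
set text "first line\nsecond line"
set hasNewline [regexp {line\nsecond} $text]     ;# 1

# Class shorthand escape: \d matches any single digit.
regexp {\d+} {order 4711 shipped} digits         ;# digits is now 4711

# Constraint escape: \m matches only at the start of a word.
set word1 [regexp {\mhi} {he said hi}]           ;# 1
set word2 [regexp {\mhi} {he said thigh}]        ;# 0

# Back reference: \1 must repeat exactly what the first group matched.
# The ^ and $ anchors force the whole string to be a doubled X...Y string.
set dbl1 [regexp {^(X.*Y)\1$} XabcYXabcY]        ;# 1
set dbl2 [regexp {^(X.*Y)\1$} XabcYXabY]         ;# 0
```

All the REs are written in braces so that Tcl passes the backslashes through untouched and the regular expression engine does the interpreting.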
<BR>Finally, remember that (as in Tcl 8.0 and before) some applications, <BR>such as C compilers, interpret these backslash sequences themselves <BR>before the regular expression engine sees them. You may need to double <BR>(or quadruple, etc.) the number of backslashes for these applications. <BR>Still, in straight Tcl 8.1 code, writing backslash escapes is now both <BR>simpler and more powerful than in 8.0 and before. <BR> <BR>Bounds <BR>You've seen the quantifiers *, +, and ?. They specify "how many" <BR>(respectively, zero or more, one or more, and zero or one). Tcl 8.1 <BR>added new quantifiers that let you choose exactly how many matches: <BR>the bounds operators, {}. <BR>These operators come in three greedy forms: {m}, {m,}, and {m,n}. The <BR>corresponding non-greedy forms are {m}?, {m,}?, and {m,n}?. <BR> <BR>The {m} quantifier matches exactly m occurrences. So does {m}?. For <BR>example, either #{70} or #{70}? matches a string of exactly 70 # <BR>characters. <BR>The {m,} quantifier matches at least m occurrences. Here's a demo of the <BR> greedy and non-greedy versions: <BR> <BR>% set x {a##b#######c} <BR>a##b#######c <BR>% regexp {#{4,}} $x match; set match <BR>####### <BR>% regexp {#{4,}?} $x match; set match <BR>#### <BR> <BR>Notice that the first two number signs (##) in the string are never <BR>matched because there aren't at least four of them. <BR>The {m,n} quantifier matches at least m but no more than n occurrences. <BR> <BR>For example, the RE http://([^/]+/?){1,3} would match Web URLs that have <BR> 3 components (like http://xyz.fr/euro/billets.htm), 2 <BR>components (like http://xyz.fr/euro/), or just 1 component (like <BR>http://xyz.fr). The RE matches a final slash (/) if there is one. 
As <BR>always, a greedy match will match as long a string as possible: it would <BR> try for 3 matches. <BR> <BR>A non-greedy quantifier would try to match the least (1 match). But be <BR>careful: http://([^/]+/?){1,3}? won't match all the way to a possible <BR>slash because it matches the fewest characters possible! (With input <BR>http://xyz.fr/, that RE would match just http://x.) This brings up one <BR>of the many subtleties in these advanced regular expressions: that the <BR>