outer non-greedy quantifier overrides the inner greedy quantifiers and makes all quantifiers non-greedy! There's an explanation in the re_syntax(n) reference page, in the section named Matching.

Character Classes
A character class is a name for one or more characters. For example, punct stands for the "punctuation" characters. A character class is always written as part of a bracket expression, which is a list of characters enclosed in [].

For instance, the character class named digit stands for any of the digits 0-9 (zero through nine). The character class is written with the class name inside a set of brackets and colons, like this: [[:digit:]]. The old familiar expression for digits is written as a range: [0-9]. When you compare the new character class to the old range version, you can see that the outer square brackets are the same in both. So a character class is written [:classname:].

The table below describes the Tcl 8.1 character classes.

alpha    A letter (includes many non-ASCII characters).
upper    An upper-case letter.
lower    A lower-case letter.
digit    A decimal digit.
xdigit   A hexadecimal digit.
alnum    An alphanumeric (letter or digit).
print    An alphanumeric. (Same as alnum.)
blank    A space or tab character.
space    A character producing white space in displayed text. (Includes en-space, hair space, many others.)
punct    A punctuation character.
graph    A character with a visible representation.
cntrl    A control character.

You can use more than one character class in a bracket expression. You can also mix character classes with ranges and single characters. For instance, [[:digit:]a-cx-z] would match a digit (0-9), a, b, c, x, y, or z -- and [^[:digit:]a-cx-z] would match any character except those. This syntax can take some time to get familiar with!
The key is to look for the character class (here, [:digit:]) inside the bracket expression.

The advantage of character classes (like [:alpha:]) over explicit ranges in brackets (like [a-z]) is that character classes include characters that aren't easy to type on ASCII keyboards. For example, the Spanish language includes the character ñ. It doesn't fall into the range [a-z], but it is in the Tcl 8.1 character class [:alpha:]. In the same way, the Spanish punctuation character ¿ isn't in a list of punctuation characters like [.!?,], but it is part of [:punct:].

Tcl 8.1 has a standard set of character classes that are defined in the source code file generic/regc_locale.c. Tcl 8.1 has one locale defined: the Unicode locale. It may support other locales (and other character classes) in the future.

Collating Elements
A collating symbol lets you represent other characters unambiguously. A collating symbol is written surrounded by brackets and dots, like [.number-sign.]. Collating symbols must be written in a bracket expression (inside []). So [[.number-sign.]] will match the character #, as you can see here:

% regexp {[[.number-sign.]]+} {123###456} match
1
% set match
###

Tcl 8.1 has a standard set of collating symbols that are defined in the source code file generic/regc_locale.c. Note: Tcl 8.1 does not implement multi-character collating elements like ch (which is the fourth character in the Spanish alphabet a, b, c, ch, d, e, f, g, h, i...). So the examples below are not supported in Tcl 8.1, but are here for completeness. (Future versions of Tcl may have multi-character collating elements.)

Suppose ch and c sort next to each other in your dialect, and ch is treated as an atomic character. The example bracket expression below uses two collating symbols. It matches one or more of ch and c.
But it doesn't match an h standing alone:

% set input "cchchh"
cchchh
% regexp {[[.ch.][.c.]]+} $input match; set match
cchch

Here's one tricky and surprising thing about collating symbols. In a locale with multi-character collating elements, a bracket expression that starts with a caret ([^...) can match more than one character. For instance, the RE in the example below matches any collating element other than c, followed by the character b. Because ch counts as a single collating element that is not c, the expression matches all of chb:

% set input chb
% regexp {[^[.c.]]b} $input match; set match
chb

Again, the two previous examples are not supported in Tcl 8.1, but are here for completeness.

Equivalence Classes
An equivalence class is written as part of a bracket expression, like [[=c=]]. It's any collating element that has the same relative order in the collating sequence as c.

Note: Tcl 8.1 only implements the Unicode locale. It doesn't define any equivalence classes. So, although the Tcl regular expression engine supports equivalence classes, the examples below are not supported in Tcl 8.1. (Future versions of Tcl may define equivalence classes.)

Let's imagine that both of the characters A and a fall at the same place in the collating sequence; they belong to the same equivalence class. In that case, both of the bracket expressions [[=A=]b] and [[=a=]b] are equivalent to writing [Aab]. As another example, if o and ô are members of an equivalence class, then all of the bracket expressions [[=o=]], [[=ô=]], and [oô] match those same two characters.

Noncapturing Subpatterns
There are two reasons to put parentheses around all or part of an RE. One is to make a quantifier (like * or +) apply to the parenthesized part. For instance, the RE Oh,( no!)+ would match Oh, no! as well as Oh, no! no! and so on.
The other reason to use parentheses is that they capture the matched text. Captured text is used in back references, in "matching" variables in the regexp command, as well as in the regsub command.

If you don't want parentheses to capture text, add ?: after the opening parenthesis. For instance, in the example below, the subexpression (?:http|ftp) matches either http or ftp but doesn't capture it. So the back reference \1 will hold the end of the URL (from the second set of parentheses):

% set x http://www.ajubasolutions.com
http://www.ajubasolutions.com
% regsub {(?:http|ftp)://(.*)} $x {The hostname is \1} answer
1
% set answer
The hostname is www.ajubasolutions.com

Lookahead Assertions
There are times you'd like to be able to test for a pattern without including that text in the match. For instance, you might want to match the protocol in a URL (like http or ftp), but only if that URL ends with .com. Or maybe you want to match the protocol only if the URL does not end with .edu. In cases like those, you'd like to "look ahead" and see how the URL ends. A lookahead assertion is handy here.

A positive lookahead has the form (?=re). It matches at any point where a substring matching re begins, without including that substring in the match. A negative lookahead has the form (?!re). It matches at any point where no substring matching re begins. Let's see some examples:

% set x http://www.ajubasolutions.com
http://www.ajubasolutions.com
% regexp {^[^:]+(?=.*\.com$)} $x match
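
As a separate, minimal sketch of the two lookahead forms described above, the script below wraps each one in a small helper. The proc names protocolIfCom and protocolUnlessEdu and the example.com / example.edu URLs are hypothetical, chosen only for illustration; they are not part of Tcl or of the original examples.

```tcl
# Hypothetical helper: return the protocol only if the URL ends with .com.
# (?=.*\.com$) is a positive lookahead; it tests the rest of the URL
# without adding that text to the match.
proc protocolIfCom {url} {
    if {[regexp {^[^:]+(?=.*\.com$)} $url proto]} {
        return $proto
    }
    return ""
}

# Hypothetical helper: return the protocol unless the URL ends with .edu.
# (?!.*\.edu$) is a negative lookahead; it succeeds only where the
# pattern inside it does not match.
proc protocolUnlessEdu {url} {
    if {[regexp {^[^:]+(?!.*\.edu$)} $url proto]} {
        return $proto
    }
    return ""
}

# Illustrative URLs only.
foreach url {http://www.example.com ftp://ftp.example.edu} {
    puts [list $url [protocolIfCom $url] [protocolUnlessEdu $url]]
}
```

Each helper returns an empty string when the assertion fails, because regexp leaves the match variable untouched in that case and its old value should not be trusted.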