<HTML><HEAD> <TITLE>BBS 水木清华站: Digest Area</TITLE></HEAD><BODY><CENTER><H1>BBS 水木清华站: Digest Area</H1></CENTER>From: starw (化缘道人), Board: Linux <BR>Subject: Python Regular Expression HOWTO 4.3 <BR>Posted at: BBS 水木清华站 (Tue Nov 21 23:45:46 2000) <BR> <BR>Heh, this is only one part... it took me quite a while to figure it out <BR> <BR>4.3 Non-capturing and Named Groups <BR> <BR>Elaborate REs may use many groups, both to capture substrings of interest <BR>and to group and structure the RE itself. In complex REs, it becomes <BR>difficult to keep track of the group numbers. Two features <BR>help with this problem. Both of them use a common syntax for regular <BR>expression extensions, so we'll look at that first. <BR> <BR>Perl 5 added several features to standard regular expressions, <BR>and the Python re module supports most of them. It would have been difficult <BR>to choose new single-keystroke metacharacters or new special sequences <BR>beginning with "\" to represent the new features without making Perl's <BR>regular expressions confusingly different from standard REs. If "&" had been <BR>chosen as a new metacharacter, for example, old expressions would have assumed <BR>that "&" was a regular character and wouldn't have escaped it by writing \& <BR>or [&]. The solution chosen was to use (?...) as the extension syntax. "?" <BR>immediately after a parenthesis was previously a syntax error, because the "?" would <BR>have nothing to repeat, so this doesn't introduce any compatibility problems. <BR>The characters immediately after the "?" indicate what extension is being <BR>used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) <BR>is something else (a non-capturing group containing the subexpression foo). <BR> <BR>Python adds an extension syntax to Perl's extension syntax. If the first <BR>character after the question mark is a "P", you know that it's an extension <BR>that's specific to Python. Currently there are two such extensions: <BR>(?P&lt;name&gt;...) 
defines a named group, and (?P=name) is a backreference to a <BR>named group. If future versions of Perl 5 add similar features using a <BR>different syntax, the re module will be changed to support the new syntax, <BR>while preserving the Python-specific syntax for compatibility's sake. <BR> <BR>Now that we've looked at the general extension syntax, we can return to the <BR>features that simplify working with groups in complex REs. Since groups are <BR>numbered from left to right, and a complex expression may use many groups, <BR>it can become difficult to keep track of the correct numbering, and modifying <BR>such a complex RE is annoying. Insert a new group near the beginning, and you <BR>change the numbers of everything that follows it. <BR> <BR>First, sometimes you'll want to use a group to collect a part of a regular <BR>expression, but you aren't interested in retrieving the group's contents. You can <BR>make this fact explicit by using a non-capturing group: (?:...), where you <BR>can put any other regular expression inside the parentheses. <BR> <BR> <BR>&gt;&gt;&gt; m = re.match("([abc])+", "abc") <BR>&gt;&gt;&gt; m.groups() <BR>('c',) <BR>&gt;&gt;&gt; m = re.match("(?:[abc])+", "abc") <BR>&gt;&gt;&gt; m.groups() <BR>() <BR> <BR>Except for the fact that you can't retrieve the contents of what the group <BR>matched, a non-capturing group behaves exactly the same as a capturing group; <BR>you can put anything inside it, repeat it with a repetition metacharacter <BR>such as "*", and nest it within other groups (capturing or non-capturing). <BR>(?:...) is particularly useful when modifying an existing pattern, since you <BR>can add new groups without changing how all the other groups are numbered. <BR>Note that there's no performance difference in searching <BR>between capturing and non-capturing groups; neither form is any faster than <BR>the other. 
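<BR> <BR>The point about group numbering can be sketched with a small example. This is <BR>only an illustration under assumed inputs: the URL patterns and the sample <BR>string are hypothetical and are not part of the HOWTO. <BR>
<PRE>
```python
import re

# Hypothetical patterns: a scheme, an optional "www." prefix, then a host name.
# The prefix is grouped only so that "?" can apply to all of it.
with_capture = re.compile(r"(https?)://(www\.)?([a-z.]+)")
without_capture = re.compile(r"(https?)://(?:www\.)?([a-z.]+)")

url = "https://www.python.org"

m_capture = with_capture.match(url)
m_plain = without_capture.match(url)

# A capturing prefix takes number 2 and pushes the host to group 3.
print(m_capture.groups())
# With the non-capturing prefix the host stays in group 2.
print(m_plain.groups())
# Both patterns match exactly the same text.
print(m_capture.group(0) == m_plain.group(0))
```
</PRE>
Because the (?:...) prefix takes no number, code that already reads the host from group 2 would keep working after the optional prefix is added. 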
<BR> <BR>The second, and more significant, feature is named groups; instead of <BR>referring to them by numbers, groups can be referenced by a name. <BR> <BR>The syntax for a named group is one of the Python-specific extensions: <BR>(?P&lt;name&gt;...). name is, obviously, the name of the group. Apart from <BR>associating a name with a group, named groups behave identically to <BR>capturing groups. The MatchObject methods that deal with capturing groups <BR>all accept either integers, to refer to groups by number, or a string <BR>containing the group name. Named groups are still given numbers, so you <BR>can retrieve information about a group in two ways: <BR> <BR>&gt;&gt;&gt; p = re.compile(r'(?P&lt;word&gt;\b\w+\b)') <BR>&gt;&gt;&gt; m = p.search( '(((( Lots of punctuation )))' ) <BR>&gt;&gt;&gt; m.group('word') <BR>'Lots' <BR>&gt;&gt;&gt; m.group(1) <BR>'Lots' <BR> <BR>Named groups are handy because they let you use easily remembered names <BR>instead of having to remember numbers. Here's an example RE from the imaplib <BR>module: <BR> <BR>InternalDate = re.compile(r'INTERNALDATE "' <BR> r'(?P&lt;day&gt;[ 123][0-9])-(?P&lt;mon&gt;[A-Z][a-z][a-z])-' <BR> r'(?P&lt;year&gt;[0-9][0-9][0-9][0-9])' <BR> r' (?P&lt;hour&gt;[0-9][0-9]):(?P&lt;min&gt;[0-9][0-9]):(?P&lt;sec&gt;[0-9][0-9])' <BR> r' (?P&lt;zonen&gt;[-+])(?P&lt;zoneh&gt;[0-9][0-9])(?P&lt;zonem&gt;[0-9][0-9])' <BR> r'"') <BR> <BR>It's obviously much easier to retrieve m.group('zonem') than <BR>to remember to retrieve group 9. Since the syntax for backreferences refers <BR>to the number of the group, in an expression like (...)\1, there's naturally <BR>a variant that uses the group name instead of the number. This is also a <BR>Python extension: (?P=name) indicates that the contents of the group called <BR>name should again be found at the current point. 
The regular expression for <BR>finding doubled words, (\b\w+)\s+\1, can also be written as <BR>(?P&lt;word&gt;\b\w+)\s+(?P=word): <BR> <BR> <BR>&gt;&gt;&gt; p = re.compile(r'(?P&lt;word&gt;\b\w+)\s+(?P=word)') <BR>&gt;&gt;&gt; p.search('Paris in the the spring').group() <BR>'the the' <BR> <BR> <BR>-- <BR> <BR> 铜铁投洪冶,蝼蚁上粉墙。 <BR> 阴阳无二义,天地我中央。 <BR> <BR> <BR>※ Source: ·BBS 水木清华站 smth.org·[FROM: 202.117.27.35] <BR></BODY></HTML>