  define these character classes with square brackets, but the
  shortcuts can make regular expressions more compact and more
  readable.

      >>> from re_show import re_show, s
      >>> re_show(r'[a-z]a', s)
      Mary {ha}d a little {la}mb
      And everywhere t{ha}t Mary
      went, the {la}mb {wa}s sure
      to go
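
  The re_show helper and the string s come from a module that this
  section does not show. A minimal stand-in, assuming only what the
  output above implies (each match is wrapped in curly braces, and s
  is the nursery rhyme), might look like this. The name show is
  hypothetical, and the sample text is reconstructed from the output.

```python
import re

# Assumed sample text, reconstructed from the output shown above.
s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''

def show(pat, text):
    """Hypothetical stand-in for re_show: return text with each match in braces."""
    return re.sub(pat, r'{\g<0>}', text, flags=re.M)

# One lowercase letter followed by a literal 'a'.
marked = show(r'[a-z]a', s)
print(marked)
```

  The re.M flag only affects the '^' and '$' anchors; it is included
  here on the assumption that the real helper treats the text line by
  line.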

  -*-

  The caret symbol can actually have two different meanings in regular
  expressions. Most of the time, it means to match the zero-length
  pattern for line beginnings. But if it is used at the beginning of a
  character class, it reverses the meaning of the character class.
  Everything not included in the listed character set is matched.

      >>> from re_show import re_show, s
      >>> re_show(r'[^a-z]a', s)
      {Ma}ry had{ a} little lamb
      And everywhere that {Ma}ry
      went, the lamb was sure
      to go
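
  The two meanings of the caret can be compared side by side with the
  standard re module. This is a sketch; the sample text is assumed to
  be the same nursery rhyme used above.

```python
import re

# Assumed sample text, reconstructed from the output shown above.
s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''

# Outside a character class: a zero-length anchor. With re.M it
# matches at the beginning of every line, not just of the string.
line_starts = re.findall(r'^[A-Za-z]+', s, re.M)
print(line_starts)

# First inside a character class: negation. Any character that is
# NOT a lowercase letter, followed by a literal 'a'.
negated = re.findall(r'[^a-z]a', s)
print(negated)

# Anywhere else inside a character class: just a literal caret.
literal = re.findall(r'[a^]', 'a^b')
print(literal)
```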

  -*-

  Using character classes is a way of indicating that either one
  thing or another thing can occur in a particular spot. But
  what if you want to specify that either of two whole
  subexpressions occur in a position in the regular expression?
  For that, you use the alternation operator, the vertical bar
  ("|"). This is the symbol that is also used to indicate a pipe
  in Unix/DOS shells and is sometimes called the pipe character.

  The pipe character in a regular expression indicates an
  alternation between everything in the group enclosing it. In
  other words, alternation binds more loosely than any other
  operator: even if there are several groups to the left and
  right of a pipe character, each alternative takes in everything
  on its side. To limit the scope of the alternation, you must
  define a group that encloses just the patterns that may
  alternate. The example illustrates this:

      >>> from re_show import re_show
      >>> s2 = 'The pet store sold cats, dogs, and birds.'
      >>> re_show(r'cat|dog|bird', s2)
      The pet store sold {cat}s, {dog}s, and {bird}s.

      >>> s3 = '=first first= # =second second= # =first= # =second='
      >>> re_show(r'=first|second=', s3)
      {=first} first= # =second {second=} # {=first}= # ={second=}

      >>> re_show(r'(=)(first)|(second)(=)', s3)
      {=first} first= # =second {second=} # {=first}= # ={second=}

      >>> re_show(r'=(first|second)=', s3)
      =first first= # =second second= # {=first=} # {=second=}
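
  The same scoping behavior can be checked with the standard re
  module. This sketch uses re.finditer and group(0) rather than
  re.findall, because findall returns the group contents instead of
  the whole match when a pattern contains groups.

```python
import re

s3 = '=first first= # =second second= # =first= # =second='

def whole_matches(pat, text):
    """Return the full text of every match of pat in text."""
    return [m.group(0) for m in re.finditer(pat, text)]

# Ungrouped: the alternatives are '=first' and 'second='.
loose = whole_matches(r'=first|second=', s3)
print(loose)

# Grouped: the alternation is confined between the two '=' signs.
scoped = whole_matches(r'=(first|second)=', s3)
print(scoped)
```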

  -*-

  One of the most powerful and common things you can do with
  regular expressions is to specify how many times an atom occurs
  in a complete regular expression. Sometimes you want to
  specify something about the occurrence of a single character,
  but very often you are interested in specifying the occurrence
  of a character class or a grouped subexpression.

  There is only one quantifier included with "basic" regular
  expression syntax, the asterisk ("*"); in English this has the
  meaning "some or none" or "zero or more."  If you want to
  specify that any number of an atom may occur as part of a
  pattern, follow the atom by an asterisk.

  Without quantifiers, grouping expressions serves less purpose,
  but once we can add a quantifier to a subexpression we can say
  something about the occurrence of the subexpression as a whole.
  Take a look at the example:

      >>> from re_show import re_show
      >>> s = '''Match with zero in the middle: @@
      ... Subexpression occurs, but...: @=!=ABC@
      ... Lots of occurrences: @=!==!==!==!==!=@
      ... Must repeat entire pattern: @=!==!=!==!=@'''
      >>> re_show(r'@(=!=)*@', s)
      Match with zero in the middle: {@@}
      Subexpression occurs, but...: @=!=ABC@
      Lots of occurrences: {@=!==!==!==!==!=@}
      Must repeat entire pattern: @=!==!=!==!=@
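
  To see why the group matters, the following sketch tests each
  target string in full with re.fullmatch, and then drops the
  parentheses to show what the asterisk binds to on its own. The
  string '@=!===@' is an extra illustration, not one of the targets
  above.

```python
import re

targets = ['@@', '@=!=ABC@', '@=!==!==!==!==!=@', '@=!==!=!==!=@']

# Quantifier applied to the whole group (=!=).
grouped = [re.fullmatch(r'@(=!=)*@', t) is not None for t in targets]
print(grouped)

# Without the group, the star binds only to the final '=' character,
# so this pattern means '@=!' followed by zero or more '=' and '@'.
ungrouped = re.fullmatch(r'@=!=*@', '@=!===@') is not None
print(ungrouped)
```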

  TOPIC -- Matching Patterns in Text: Intermediate
  --------------------------------------------------------------------

  In a certain way, the lack of any quantifier symbol after an atom
  quantifies the atom anyway: It says the atom occurs exactly once.
  Extended regular expressions add a few other useful numbers to
  "once exactly" and "zero or more times."  The plus sign ("+")
  means "one or more times" and the question mark ("?") means
  "zero or one times."  These quantifiers are by far the most
  common enumerations you wind up using.

  If you think about it, you can see that the extended regular
  expressions do not actually let you "say" anything the basic
  ones do not. They just let you say it in a shorter and more
  readable way. For example, '(ABC)+' is equivalent to
  '(ABC)(ABC)*', and 'X(ABC)?Y' is equivalent to 'XABCY|XY'. If
  the atoms being quantified are themselves complicated grouped
  subexpressions, the question mark and plus sign can make things
  a lot shorter.

      >>> from re_show import re_show
      >>> s = '''AAAD
      ... ABBBBCD
      ... BBBCD
      ... ABCCD
      ... AAABBBC'''
      >>> re_show(r'A+B*C?D', s)
      {AAAD}
      {ABBBBCD}
      BBBCD
      ABCCD
      AAABBBC
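
  The equivalences claimed above can be spot-checked in Python. The
  helper agree and the sample strings are hypothetical, and a check
  over a handful of strings is evidence rather than a proof.

```python
import re

def agree(pat_a, pat_b, samples):
    """True if both patterns accept exactly the same strings in samples."""
    return all(
        (re.fullmatch(pat_a, t) is None) == (re.fullmatch(pat_b, t) is None)
        for t in samples
    )

samples = ['', 'ABC', 'ABCABC', 'ABCAB', 'XY', 'XABCY', 'XABCABCY']

# '+' means one copy followed by zero or more further copies.
plus_ok = agree(r'(ABC)+', r'(ABC)(ABC)*', samples)

# '?' means the pattern either with or without the optional part.
opt_ok = agree(r'X(ABC)?Y', r'XABCY|XY', samples)

print(plus_ok, opt_ok)
```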

  -*-

  Using extended regular expressions, you can specify arbitrary
  pattern occurrence counts using a more verbose syntax than the
  question mark, plus sign, and asterisk quantifiers. The curly
  braces ("{" and "}") can surround a precise count of how many
  occurrences you are looking for.

  The most general form of the curly-brace quantification uses two
  range arguments (the first must be no larger than the second, and
  both must be non-negative integers). The occurrence count is
  specified this way to fall between the minimum and maximum
  indicated (inclusive). As shorthand, either argument may be left
  empty: an omitted minimum means zero, and an omitted maximum
  means no upper limit. If only one argument is used (with no
  comma), exactly that number of occurrences is matched.

      >>> from re_show import re_show
      >>> s2 = '''aaaaa bbbbb ccccc
      ... aaa bbb ccc
      ... aaaaa bbbbbbbbbbbbbb ccccc'''
      >>> re_show(r'a{5} b{,6} c{4,8}', s2)
      {aaaaa bbbbb ccccc}
      aaa bbb ccc
      aaaaa bbbbbbbbbbbbbb ccccc

      >>> re_show(r'a+ b{3,} c?', s2)
      {aaaaa bbbbb c}cccc
      {aaa bbb c}cc
      {aaaaa bbbbbbbbbbbbbb c}cccc

      >>> re_show(r'a{5} b{6,} c{4,8}', s2)
      aaaaa bbbbb ccccc
      aaa bbb ccc
      {aaaaa bbbbbbbbbbbbbb ccccc}
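
  One way to get a feel for the four forms of the curly-brace
  quantifier is to list which run lengths each one accepts. The
  helper accepted_counts below is hypothetical and only tests runs of
  zero through seven characters.

```python
import re

def accepted_counts(pat, char='b', limit=8):
    """Return the run lengths 0..limit-1 of char that pat matches in full."""
    return [n for n in range(limit) if re.fullmatch(pat, char * n)]

exact = accepted_counts(r'b{3}')      # exactly three
bounded = accepted_counts(r'b{2,4}')  # between two and four, inclusive
at_most = accepted_counts(r'b{,2}')   # minimum omitted: zero to two
at_least = accepted_counts(r'b{5,}')  # maximum omitted: five or more

print(exact, bounded, at_most, at_least)
```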

  -*-

  One powerful option in creating search patterns is specifying
  that a subexpression that was matched earlier in a regular
  expression is matched again later in the expression. We do
  this using backreferences. Backreferences are named by the
  numbers 1 through 99, preceded by the backslash/escape
  character when used in this manner. These backreferences refer
  to each successive group in the match pattern, as in
  '(one)(two)(three) \1\2\3'. Each numbered backreference refers
  to the group that, in this example, has the word corresponding
  to the number.

  It is important to note something the example illustrates. What
  gets matched by a backreference is the same literal string
  matched the first time, even if the pattern that matched the
  string could have matched other strings. Simply repeating the
  same grouped subexpression later in the regular expression does
  not match the same targets as using a backreference (but you have
  to decide what it is you actually want to match in either case).

  Backreferences refer back to whatever occurred in the previous
  grouped expressions, in the order those grouped expressions
  occurred. Up to 99 numbered backreferences may be used. However,
  Python also allows naming backreferences, which can make it much
  clearer what the backreferences are pointing to. The group being
  named must open with '(?P<name>', and the corresponding
  backreference is written '(?P=name)'.

      >>> from re_show import re_show
      >>> s2 = '''jkl abc xyz
      ... jkl xyz abc
      ... jkl abc abc
      ... jkl xyz xyz
      ... '''
      >>> re_show(r'(abc|xyz) \1', s2)
      jkl abc xyz
      jkl xyz abc
      jkl {abc abc}
      jkl {xyz xyz}

      >>> re_show(r'(abc|xyz) (abc|xyz)', s2)
      jkl {abc xyz}
      jkl {xyz abc}
      jkl {abc abc}
      jkl {xyz xyz}

      >>> re_show(r'(?P<let3>abc|xyz) (?P=let3)', s2)
      jkl abc xyz
      jkl xyz abc
      jkl {abc abc}
      jkl {xyz xyz}
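
  The same three patterns can be run with the standard re module. In
  this sketch, group(0) is the whole match and group('let3') retrieves
  the named group; the helper whole_matches is hypothetical.

```python
import re

s2 = 'jkl abc xyz\njkl xyz abc\njkl abc abc\njkl xyz xyz\n'

def whole_matches(pat, text):
    """Return the full text of every match of pat in text."""
    return [m.group(0) for m in re.finditer(pat, text)]

# Backreference: the second word must repeat the first literally.
numbered = whole_matches(r'(abc|xyz) \1', s2)

# Repeated subexpression: the second word may differ from the first.
repeated = whole_matches(r'(abc|xyz) (abc|xyz)', s2)

# Named group and named backreference: same matches as \1.
named = [m.group('let3')
         for m in re.finditer(r'(?P<let3>abc|xyz) (?P=let3)', s2)]

print(numbered, repeated, named)
```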

  -*-

  Quantifiers in regular expressions are greedy. That is, they
  match as much as they possibly can.

  Probably the easiest mistake to make in composing regular
  expressions is to match too much. When you use a quantifier,
  you want it to match everything (of the right sort) up to the
  point where you want to finish your match. But when using the
  '*', '+', or numeric quantifiers, it is easy to forget that the
  last bit you are looking for might occur later in a line than
  the one you are interested in.

      >>> from re_show import re_show
      >>> s2 = '''-- I want to match the words that start
      ... -- with 'th' and end with 's'.
      ... this
      ... thus
      ... thistle
      ... this line matches too much
      ... '''
      >>> re_show(r'th.*s', s2)
      -- I want to match {the words that s}tart
      -- wi{th 'th' and end with 's}'.
      {this}
      {thus}
      {this}tle
      {this line matches} too much
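
  Greediness can be observed directly with re.search. In this sketch
  the match on the longer line ends at the last 's' on the line, not
  at the first one.

```python
import re

line = 'this line matches too much'

m = re.search(r'th.*s', line)
greedy = m.group(0)
print(greedy)

# The greedy '.*' runs to the end of the line and backs up only as
# far as the LAST 's', so the match ends just after that character.
ends_at_last_s = (m.end() == line.rfind('s') + 1)
print(ends_at_last_s)

# On a single word there is only one 's' to stop at.
word = re.search(r'th.*s', 'thistle').group(0)
print(word)
```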

  -*-

  Often if you find that regular expressions are matching too much,
  a useful procedure is to reformulate the problem in your mind.
  Rather than thinking about, "What am I trying to match later in
  the expression?" ask yourself, "What do I need to avoid matching
  in the next part?" This often leads to more parsimonious pattern
  matches. Often the way to avoid a pattern is to use the
  complement operator and a character class. Look at the example,
  and think about how it works.

  The trick here is that there are two different ways of
  formulating almost the same sequence. Either you can think you
  want to keep matching -until- you get to XYZ, or you can think you
  want to keep matching -unless- you get to XYZ. These are subtly
  different.

  For people who have thought about basic probability, the same
  pattern occurs. The chance of rolling a 6 on a die in one roll is
  1/6. What is the chance of rolling at least one 6 in six rolls? A
  naive calculation puts the chance at 1/6+1/6+1/6+1/6+1/6+1/6, or
  100 percent. This is wrong, of course (after all, the chance after
  twelve rolls isn't 200 percent). The correct calculation is, "How
  do I avoid rolling a 6 for six rolls?" (i.e.,
  5/6*5/6*5/6*5/6*5/6*5/6, or about 33.5 percent). The chance of
  getting a 6 is the chance of not avoiding it (about 66.5
  percent). In fact, if you imagine transcribing a series of die
  rolls, you could apply a regular expression to the written
  record, and similar thinking applies.
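
  The die-roll arithmetic can be checked by brute force, applying a
  negated character class to every possible written record of six
  rolls. This is only an illustration; writing each record as a
  string of six digits is an assumption made for the sketch.

```python
import re
from fractions import Fraction
from itertools import product

# Every possible record of six rolls, e.g. '314256' (6**6 records).
records = [''.join(p) for p in product('123456', repeat=6)]

# Records that avoid a 6 entirely: nothing but non-6 characters.
no_six = sum(1 for r in records if re.fullmatch(r'[^6]*', r))

p_avoid = Fraction(no_six, len(records))  # chance of avoiding a 6
p_hit = 1 - p_avoid                       # chance of at least one 6

print(float(p_avoid), float(p_hit))
```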

      >>> from re_show import re_show
      >>> s2 = '''-- I want to match the words that start
      ... -- with 'th' and end with 's'.
      ... this
      ... thus
      ... thistle
      ... this line matches too much
      ... '''
      >>> re_show(r'th[^s]*.', s2)
      -- I want to match {the words} {that s}tart
      -- wi{th 'th' and end with 's}'.
      {this}
      {thus}
      {this}tle
      {this} line matches too much
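
  The effect of the complemented character class can be confirmed
  with re.search. The sketch also tries a variant that ends in a
  literal 's' instead of '.', which is a suggestion and not part of
  the example above: because the final '.' accepts any character,
  the original pattern can still match text that contains no 's' at
  all.

```python
import re

line = 'this line matches too much'

# '[^s]*' cannot run past the first 's', so the match stops there.
first_s = re.search(r'th[^s]*.', line).group(0)
print(first_s)

# With no 's' in the text, the trailing '.' settles for any character.
loose = re.search(r'th[^s]*.', 'the end')
print(loose.group(0))

# Ending the pattern with a literal 's' rejects that text instead.
strict = re.search(r'th[^s]*s', 'the end')
print(strict)
```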

  -*-

  Not all tools that use regular expressions allow you to modify
  target strings. Some simply locate the matched pattern; the
  most widely used regular expression tool is probably grep,