which is a tool for searching only. Text editors, for example,
may or may not allow replacement in their regular expression
search facility.
Python, being a general programming language, allows
sophisticated replacement patterns to accompany matches. Since
Python strings are immutable, [re] functions do not modify string
objects in place, but instead return the modified versions. But
as with functions in the [string] module, one can always rebind a
particular variable to the new string object that results from
[re] modification.
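For instance, a minimal illustration of such rebinding:
>>> import re
>>> s = 'Monty Python'
>>> s = re.sub('Python', 'Perl', s)   # rebind s to the new string
>>> s
'Monty Perl'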
Replacement examples in this tutorial will call a function
're_new()' that is a wrapper for the module function `re.sub()`.
Original strings will be defined above the call, and the modified
results will appear below the call and with the same style of
additional markup of changed areas as 're_show()' used. Be
careful to notice that the curly braces in the results displayed
will not be returned by standard [re] functions, but are only
added here for emphasis (as is the typography). Simply import the
following function in the examples below:
#---------- re_new.py ----------#
import re
def re_new(pat, rep, s):
    # wrap each replacement in curly braces to highlight what changed
    print re.sub(pat, '{'+rep+'}', s)
-*-
Let us take a look at a couple of modification examples that
build on what we have already covered. This one simply
substitutes some literal text for some other literal text. Notice
that `string.replace()` can achieve the same result and will be
faster in doing so.
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat','dog',s)
The zoo had wild dogs, bob{dog}s, lions, and other wild {dog}s.
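For comparison, the same replacement can be done with
`string.replace()` (no braces appear, of course, since no wrapper
function is involved):
>>> import string
>>> string.replace(s, 'cat', 'dog')
'The zoo had wild dogs, bobdogs, lions, and other wild dogs.'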
-*-
Most of the time, if you are using regular expressions to modify a
target text, you will want to match more general patterns than just
literal strings. Whatever is matched is what gets replaced (even if it
is several different strings in the target):
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat|dog','snake',s)
The zoo had wild {snake}s, bob{snake}s, lions, and other wild {snake}s.
>>> re_new(r'[a-z]+i[a-z]*','nice',s)
The zoo had {nice} dogs, bobcats, {nice}, and other {nice} cats.
-*-
It is nice to be able to insert a fixed string everywhere a
pattern occurs in a target text. But frankly, doing that is
not very context sensitive. A lot of times, we do not want
just to insert fixed strings, but rather to insert something
that bears much more relation to the matched patterns.
Fortunately, backreferences come to our rescue here. One can
use backreferences in the pattern matches themselves, but it is
even more useful to be able to use them in replacement
patterns. By using replacement backreferences, one can pick
and choose from the matched patterns to use just the parts of
interest.
As well as backreferencing, the examples below illustrate the
importance of whitespace in regular expressions. In most
programming code, whitespace is merely aesthetic. But the
examples differ solely in an extra space within the arguments
to the second call--and the return value is importantly
different.
>>> from re_new import re_new
>>> s = 'A37 B4 C107 D54112 E1103 XXX'
>>> re_new(r'([A-Z])([0-9]{2,4})',r'\2:\1',s)
{37:A} B4 {107:C} {5411:D}2 {1103:E} XXX
>>> re_new(r'([A-Z])([0-9]{2,4}) ',r'\2:\1 ',s)
{37:A }B4 {107:C }D54112 {1103:E }XXX
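Keep in mind that outside the 're_new()' wrapper, `re.sub()`
simply returns the modified string, without the emphasis braces:
>>> import re
>>> re.sub(r'([A-Z])([0-9]{2,4}) ', r'\2:\1 ', s)
'37:A B4 107:C D54112 1103:E XXX'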
-*-
This tutorial has already warned about the danger of matching
too much with regular expression patterns. But the danger is
so much more serious when one does modifications, that it is
worth repeating. If you replace a pattern that matches a
larger string than you thought of when you composed the
pattern, you have potentially deleted some important data from
your target.
It is always a good idea to try out regular expressions on
diverse target data that is representative of production usage.
Make sure you are matching what you think you are matching. A
stray quantifier or wildcard can make a surprisingly wide
variety of texts match what you thought was a specific pattern.
And sometimes you just have to stare at your pattern for a
while, or find another set of eyes, to figure out what is
really going on even after you see what matches. Familiarity
might breed contempt, but it also instills competence.
TOPIC -- Advanced Regular Expression Extensions
--------------------------------------------------------------------
Some very useful enhancements to basic regular expressions are
included with Python (and with many other tools). Many of
these do not strictly increase the power of Python's regular
expressions, but they -do- manage to make expressing them far
more concise and clear.
Earlier in the tutorial, the problems of matching too much were
discussed, and some workarounds were suggested. Python is nice
enough to make this easier by providing optional "non-greedy"
quantifiers. These quantifiers grab as little as possible
while still matching whatever comes next in the pattern
(instead of as much as possible).
Non-greedy quantifiers have the same syntax as regular greedy
ones, except with the quantifier followed by a question mark.
For example, a non-greedy pattern might look like:
'A[A-Z]*?B'. In English, this means "match an A, followed by
only as many capital letters as are needed to find a B."
One little thing to look out for is the fact that the pattern
'[A-Z]*?.' will always match zero capital letters: a longer
match is never needed to satisfy the following "any character"
pattern. If you use non-greedy quantifiers, watch out for
matching too little, which is a symmetric danger.
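A quick check of this, using a group to expose what the
non-greedy quantifier actually consumed (an empty string for
each match):
>>> import re
>>> re.findall(r'([A-Z]*?).', 'AB')
['', '']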
>>> from re_show import re_show
>>> s = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this line matches just right
... this # thus # thistle'''
>>> re_show(r'th.*s',s)
-- I want to match {the words that s}tart
-- wi{th 'th' and end with 's}'.
{this line matches jus}t right
{this # thus # this}tle
>>> re_show(r'th.*?s',s)
-- I want to match {the words} {that s}tart
-- wi{th 'th' and end with 's}'.
{this} line matches just right
{this} # {thus} # {this}tle
>>> re_show(r'th.*?s ',s)
-- I want to match {the words }that start
-- with 'th' and end with 's'.
{this }line matches just right
{this }# {thus }# thistle
-*-
Modifiers can be used in regular expressions or as arguments to
many of the functions in [re]. A modifier affects, in one way
or another, the interpretation of a regular expression pattern.
A modifier, unlike an atom, is global to the particular
match--in itself, a modifier doesn't match anything, it instead
constrains or directs what the atoms match.
When used directly within a regular expression pattern, one or
more modifiers begin the whole pattern, as in '(?Limsux)'. For
example, to match the word 'cat' without regard to the case of
the letters, one could use '(?i)cat'. The same modifiers may
also be passed in as an optional flags argument, combining the
corresponding [re] constants with the bitwise-or operator '|',
but only to some functions in the [re] module, not to all. For
example, the two calls below are equivalent:
>>> import re
>>> re.search(r'(?Li)cat','The Cat in the Hat').start()
4
>>> re.search(r'cat','The Cat in the Hat',re.L|re.I).start()
4
However, some function calls in [re] have no argument for
modifiers. In such cases, you should either use the modifier
prefix pseudo-group or pre-compile the regular expression
rather than use it in string form. For example:
>>> import re
>>> re.split(r'(?i)th','Brillig and The Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
>>> re.split(re.compile('th',re.I),'Brillig and The Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
See the [re] module documentation for details on which
functions take which arguments.
-*-
The modifiers listed below are used in [re] expressions. Users
of other regular expression tools may be accustomed to a 'g'
option for "global" matching. These other tools take a line of
text as their default unit, and "global" means to match
multiple lines. Python takes the actual passed string as its
unit, so "global" is simply the default. To operate on a
single line, either the regular expressions have to be tailored
to look for appropriate begin-line and end-line characters, or
the strings being operated on should be split first using
`string.split()` or other means.
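For example, to treat each line separately, one might split
first and then apply the pattern per line (a minimal sketch):
>>> import re, string
>>> s = 'the cat\nthe hat'
>>> [re.sub('at', 'AT', line) for line in string.split(s, '\n')]
['the cAT', 'the hAT']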
#*--------- Regular expression modifiers ---------------#
* L (re.L) - Locale customization of \w, \W, \b, \B
* i (re.I) - Case-insensitive match
* m (re.M) - Treat string as multiple lines
* s (re.S) - Treat string as single line
* u (re.U) - Unicode customization of \w, \W, \b, \B
* x (re.X) - Enable verbose regular expressions
The single-line option ("s") allows the wildcard to match a
newline character (it won't otherwise). The multiple-line
option ("m") causes "^" and "$" to match the beginning and end
of each line in the target, not just the begin/end of the
target as a whole (the default). The insensitive option ("i")
ignores differences between the case of letters. The Locale
and Unicode options ("L" and "u") give different
interpretations to the word-boundary ("\b") and alphanumeric
("\w") escaped patterns--and their inverse forms ("\B" and
"\W").
The verbose option ("x") is somewhat different from the others.
Verbose regular expressions may contain nonsignificant
whitespace and inline comments. In a sense, this is also just
a different interpretation of regular expression patterns, but
it allows you to produce far more easily readable complex
patterns. Some examples follow in the sections below.
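As a brief preview of the verbose style, here is a minimal
sketch (the pattern and target are invented for illustration):
>>> import re
>>> pat = r'''(?x)
...     \d{3}        # three digits, e.g., a prefix
...     -            # a literal hyphen
...     \d{4}        # four more digits
... '''
>>> re.search(pat, 'call 555-1212').group()
'555-1212'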
-*-
Let's take a look first at how case-insensitive and single-line
options change the match behavior.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
mississippi # {Missouri }# Minnesota #
>>> re_show(r'(?i)M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> re_show(r'(?si)M.*[ise] ', s)
{MAINE # Massachusetts # Colorado #
mississippi # Missouri }# Minnesota #
Looking back to the definition of 're_show()', we can see it
was defined to explicitly use the multiline option. So
patterns displayed with 're_show()' will always be multiline.
Let us look at a couple of examples that use `re.findall()`
instead.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'(?im)^M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> import re
>>> re.findall(r'(?i)^M.*[ise] ', s)
['MAINE # Massachusetts ']
>>> re.findall(r'(?im)^M.*[ise] ', s)
['MAINE # Massachusetts ', 'mississippi # Missouri ']
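Equivalently, since `re.findall()` accepts a pre-compiled
pattern, the flag constants can be combined with '|' instead of
the inline modifiers:
>>> re.findall(re.compile(r'^M.*[ise] ', re.I|re.M), s)
['MAINE # Massachusetts ', 'mississippi # Missouri ']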
-*-
Matching word characters and word boundaries depends on exactly
what gets counted as alphanumeric. Character encodings for
letters outside the (US-English) ASCII range differ among
national alphabets. A Python installation is configured for a
particular locale, and regular expressions can optionally use
the current one to match words.
Of greater long-term significance is the [re] module's ability
(after Python 2.0) to look at the Unicode categories of
characters and decide whether a character is alphabetic based on
that category. Locale settings work well enough for European
diacritics, but for non-Roman scripts, Unicode is clearer and
less error-prone.
The "u" modifier controls whether Unicode alphabetic characters
are recognized or merely ASCII ones:
>>> import re
>>> alef, omega = unichr(1488), unichr(969)
>>> u = alef +' A b C d '+omega+' X y Z'
>>> u, len(u.split()), len(u)
(u'\u05d0 A b C d \u03c9 X y Z', 9, 17)
>>> ':'.join(re.findall(ur'\b\w\b', u))
u'A:b:C:d:X:y:Z'
>>> ':'.join(re.findall(ur'(?u)\b\w\b', u))
u'\u05d0:A:b:C:d:\u03c9:X:y:Z'