u'A:b:C:d:X:y:Z'
>>> ':'.join(re.findall(ur'(?u)\b\w\b', u))
u'\u05d0:A:b:C:d:\u03c9:X:y:Z'
-*-
Backreferencing in replacement patterns is very powerful, but
a complex regular expression can easily accumulate many groups,
and keeping their numbering straight becomes confusing. It is
often more legible to refer to the parts of a replacement
pattern in sequential order. To handle this issue, Python's
[re] patterns allow "grouping without backreferencing."
A group that should not also be treated as a backreference has
a question mark colon at the beginning of the group, as in
'(?:pattern)'. In fact, you can use this syntax even when your
backreferences are in the search pattern itself:
>>> from re_new import re_new
>>> s = 'A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93'
>>> re_new(r'([A-Z])(?:-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A37} # B:abcd:142 # {C66} # {D93}
>>> # Groups that are not of interest excluded from backref
...
>>> re_new(r'([A-Z])(-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A-xyz-} # B:abcd:142 # {C-wxy-} # {D-qrs-}
>>> # One could lose track of groups in a complex pattern
...
-*-
Python offers a particularly handy syntax for really complex
pattern backreferences. Rather than juggling the numbering of
matched groups, you can give them names. Earlier we pointed out
the syntax for named backreferences within the pattern itself;
for example, '(?P=name)'. However, a slightly different syntax
is necessary in replacement patterns. For that, we use the '\g'
operator along with angle brackets and a name. For example:
>>> from re_new import re_new
>>> s = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
>>> re_new(r'(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)',
... r'\g<prefix>\g<id>', s)
{A37} # B:abcd:142 # {C66} # {D93}
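Incidentally, the '\g<...>' syntax also accepts plain group
numbers. This is handy when a backreference like '\1' would be
followed immediately by a literal digit (a small illustration,
reusing the unnamed-group pattern from earlier):
>>> re_new(r'([A-Z])(-[a-z]{3}-)([0-9]*)', r'\g<1>\g<3>', s)
{A37} # B:abcd:142 # {C66} # {D93}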
-*-
Another trick of advanced regular expression tools is
"lookahead assertions." These are similar to regular grouped
subexpressions, except that they do not actually grab what they
match. There are two advantages to using lookahead assertions.
On the one hand, a lookahead assertion can function in a
similar way to a group that is not backreferenced; that is, you
can match something without counting it in backreferences.
More significantly, however, a lookahead assertion can specify
that the next chunk of a pattern has a certain form, but let a
different (more general) subexpression actually grab it
(usually for purposes of backreferencing that other
subexpression).
There are two kinds of lookahead assertions: positive and
negative. As you would expect, a positive assertion specifies
that something does come next, and a negative one specifies
that something does not come next. Emphasizing their
connection with non-backreferenced groups, the syntax for
lookahead assertions is similar: '(?=pattern)' for positive
assertions, and '(?!pattern)' for negative assertions.
>>> from re_new import re_new
>>> s = 'A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93'
>>> # Assert that three lowercase letters occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?=[a-z]{3})([\w\d]*)', r'\2\1', s)
{xyz37A-} # B-ab6142 # C-Wxy66 # {qrs93D-}
>>> # Assert three lowercase letters do NOT occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?![a-z]{3})([\w\d]*)', r'\2\1', s)
A-xyz37 # {ab6142B-} # {Wxy66C-} # D-qrs93
-*-
Along with lookahead assertions, Python 2.0+ adds "lookbehind
assertions." The idea is similar--a pattern is of interest
only if it is (or is not) preceded by some other pattern.
Lookbehind assertions are somewhat more restricted than
lookahead assertions because they may only look backwards by a
fixed number of character positions. In other words, no
general quantifiers are allowed in lookbehind assertions.
Still, some patterns are most easily expressed using lookbehind
assertions.
As with lookahead assertions, lookbehind assertions come in a
negative and a positive flavor. The former assures that a certain
pattern does -not- precede the match, the latter assures that
the pattern -does- precede the match.
>>> from re_show import re_show
>>> re_show('Man', 'Manhandled by The Man')
{Man}handled by The {Man}
>>> re_show('(?<=The )Man', 'Manhandled by The Man')
Manhandled by The {Man}
>>> re_show('(?<!The )Man', 'Manhandled by The Man')
{Man}handled by The Man
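The fixed-width restriction is enforced when the pattern is
compiled; a quantifier inside a lookbehind simply raises an
error (the pattern below exists only to illustrate the failure):
>>> import re
>>> re.compile('(?<=The+ )Man')
Traceback (most recent call last):
  ...
sre_constants.error: look-behind requires fixed-width pattern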
-*-
In the later examples we have started to see just how
complicated regular expressions can get. These examples are
not the half of it. It is possible to do some almost absurdly
difficult-to-understand things with regular expressions (but
ones that are nonetheless useful).
Python's "verbose" modifier ("(?x)") provides two basic
facilities for clarifying expressions. One is allowing regular
expressions to continue over multiple lines (by ignoring
unescaped whitespace, such as indentation, trailing spaces, and
newlines). The second is allowing comments, introduced by '#',
within regular expressions. When patterns get complicated, do
both!
The example below is fairly typical of a complicated, but
well-structured and well-commented, regular expression:
>>> from re_show import re_show
>>> s = '''The URL for my site is: http://mysite.com/mydoc.html. You
... might also enjoy ftp://yoursite.com/index.html for a good
... place to download files.'''
>>> pat = r''' (?x)( # verbose identify URLs within text
... (http|ftp|gopher) # make sure we find a resource type
... :// # ...needs to be followed by colon-slash-slash
... [^ \n\r]+ # anything but space, newline, CR is in URL
... \w # URL always ends in alphanumeric char
... (?=[\s\.,]) # assert: followed by whitespace/period/comma
... ) # end of match group'''
>>> re_show(pat, s)
The URL for my site is: {http://mysite.com/mydoc.html}. You
might also enjoy {ftp://yoursite.com/index.html} for a good
place to download files.
SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------
PROBLEM: Making a text block flush left
--------------------------------------------------------------------
For visual clarity or to identify the role of text, blocks of
text are often indented--especially in prose-oriented documents
(but log files, configuration files, and the like might also
have unused initial fields). For downstream purposes,
indentation is often irrelevant, or even outright
incorrect, since the indentation is not part of the text itself
but only a decoration of the text. However, performing the most
naive transformation of indented text--simply removing leading
whitespace from every line--often makes matters even worse.
While block indentation may be
decoration, the relative indentations of lines within blocks
may serve important or essential functions (for example, the
blocks of text might be Python source code).
The general procedure for maximally unindenting a block of
text is fairly simple. But it is easy to throw more
code at it than is needed, and arrive at some inelegant and
slow nested loops of `string.find()` and `string.replace()`
operations. A bit of cleverness in the use of regular
expressions--combined with the conciseness of a functional
programming (FP) style--can give you a quick, short, and direct
transformation.
#---------- flush_left.py ----------#
# Remove as many leading spaces as possible from whole block
from re import findall, sub
# What is the minimum line indentation of a block?
indent = lambda s: reduce(min, map(len, findall(r'(?m)^ *(?=\S)', s)))
# Remove the block-minimum indentation from each line
flush_left = lambda s: sub('(?m)^ {%d}' % indent(s), '', s)

if __name__ == '__main__':
    import sys
    print flush_left(sys.stdin.read())
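A quick doctest-style check of the behavior (the two-line block
here is hypothetical; notice that only the four spaces common to
both lines are removed, preserving the relative indentation):
>>> from flush_left import flush_left
>>> print flush_left('    def f():\n        return 1\n'),
def f():
    return 1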
The 'flush_left()' function assumes that blocks are indented
with spaces. If tabs are used--or tabs and spaces are
mixed--an initial pass through the utility 'untabify.py' (which
can be found at '$PYTHONPATH/tools/scripts/') can convert
blocks to space-only indentation.
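If installing 'untabify.py' is inconvenient, the standard string
method 'expandtabs()' can stand in for it (a minimal sketch,
assuming the default tab stop of eight columns suits the source
text; the helper name here is ours):
#---------- untabify sketch ----------#
def untabify(s, tabstop=8):
    # expandtabs() restarts its column count at each newline,
    # so a whole multiline block converts in a single call
    return s.expandtabs(tabstop)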
A helpful adjunct to 'flush_left()' is likely to be the
'reformat_para()' function that was presented in Chapter 2,
Problem 2. Between the two of these, you could get a good part of
the way towards a "batch-oriented word processor." (What other
capabilities would be most useful?)
PROBLEM: Summarizing command-line option documentation
--------------------------------------------------------------------
Documentation of command-line options to programs is usually
in semi-standard formats in places like manpages, docstrings,
READMEs and the like. In general, within documentation you
expect to see command-line options indented a bit, followed by
a bit more indentation, followed by one or more lines of
description, and usually ended by a blank line. This style is
readable for users browsing documentation, but has sufficient
complexity and variability that regular expressions are well
suited to finding the right descriptions (simple string methods
fall short).
A specific scenario where you might want a summary of
command-line options is as an aid to understanding
configuration files that call multiple child commands. The
file '/etc/inetd.conf' on Unix-like systems is a good example
of such a configuration file. Moreover, configuration files
themselves often have enough complexity and variability within
them that simple string methods have difficulty parsing them.
The utility below will look for every service launched by
'/etc/inetd.conf' and present to STDOUT summary documentation
of all the options used when the services are started.
#---------- show_services.py ----------#
import re, os, string, sys

def show_opts(cmdline):
    args = string.split(cmdline)
    cmd = args[0]
    if len(args) > 1:
        opts = args[1:]
        # might want to check error output, so use popen3()
        (in_, out_, err) = os.popen3('man %s | col -b' % cmd)
        manpage = out_.read()
        if len(manpage) > 2:        # found actual documentation
            print '\n%s' % cmd
            for opt in opts:
                pat_opt = r'(?sm)^\s*' + opt + r'.*?(?=\n\n)'
                opt_doc = re.search(pat_opt, manpage)
                if opt_doc is not None:
                    print opt_doc.group()
                else:               # try harder for something relevant
                    mentions = []
                    for para in string.split(manpage, '\n\n'):
                        if re.search(opt, para):
                            mentions.append('\n%s' % para)
                    if not mentions:
                        print '\n ', opt, ' '*9, 'Option docs not found'
                    else:
                        print '\n ', opt, ' '*9, 'Mentioned in below para:'
                        print '\n'.join(mentions)
        else:                       # no manpage available
            print cmdline
            print '  No documentation available'

def services(fname):
    conf = open(fname).read()
    pat_srv = r'''(?xm)(?=^[^#])        # lines that are not commented out
                  (?:(?:[\w/]+\s+){6})  # first six fields ignored
                  (.*$)                 # rest of line is service launch'''
    return re.findall(pat_srv, conf)

if __name__ == '__main__':
    for service in services(sys.argv[1]):
        show_opts(service)
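The utility takes the configuration file as its single argument
(a hypothetical invocation; output depends entirely on which
manpages are installed locally):

  % python show_services.py /etc/inetd.conf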
The particular tasks performed by 'show_opts()' and 'services()'
are somewhat specific to Unix-like systems, but the general
techniques are more broadly applicable. For example, the
particular comment character and number of fields in
'/etc/inetd.conf' might be different for other launch scripts,
but the use of regular expressions to find the launch commands
would apply elsewhere. If the 'man' and 'col' utilities are not
on the relevant system, you might do something equivalent, such
as reading in the docstrings from Python modules with similar
option descriptions (most of the samples in '$PYTHONPATH/tools/'
use compatible documentation, for example).
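For instance, a minimal fallback along those lines (a sketch;
this helper is not part of the utility above) might substitute a
module's docstring for the manpage text:
#---------- docstring fallback sketch ----------#
def module_doc(modname):
    # Return a module's docstring as ersatz "manpage" text,
    # or an empty string if the module cannot be imported
    try:
        return __import__(modname).__doc__ or ''
    except ImportError:
        return ''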
Another thing worth noting is that even where regular expressions
are used in parsing some data, you need not do everything with
regular expressions. The simple `string.split()` operation to
identify paragraphs in 'show_opts()' is still the quickest and
easiest technique, even though `re.split()` could do the same
thing.
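Both spellings below split on the same blank-line boundary and
give identical lists (a small illustration, separate from the
utility above):
>>> import re, string
>>> text = 'para one\n\npara two\n\npara three'
>>> string.split(text, '\n\n')
['para one', 'para two', 'para three']
>>> re.split('\n\n', text)
['para one', 'para two', 'para three']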
Note: Along the lines of paragraph splitting, here is a thought
problem. What is a regular expression that matches every whole
paragraph that contains within it some smaller pattern 'pat'? For
purposes of the puzzle, assume that a paragraph is some text that
both starts and ends with doubled newlines ("\n\n").
PROBLEM: Detecting duplicate words
--------------------------------------------------------------------