CHAPTER III -- REGULAR EXPRESSIONS
-------------------------------------------------------------------

  Regular expressions allow extremely valuable text processing
  techniques, but ones that warrant careful explanation. Python's
  [re] module, in particular, allows numerous enhancements to basic
  regular expressions (such as named backreferences, lookahead
  assertions, backreference skipping, non-greedy quantifiers, and
  others). A solid introduction to the subtleties of regular
  expressions is valuable to programmers engaged in text processing
  tasks.

  The prequel to this chapter contains a tutorial on regular
  expressions that allows a reader unfamiliar with regular
  expressions to move quickly from simple to complex elements of
  regular expression syntax. This tutorial is aimed primarily at
  beginners, but programmers familiar with regular expressions in
  other programming tools can benefit from a quick read of the
  tutorial, which explicates the particular regular expression
  dialect in Python.

  It is important to note up-front that regular expressions,
  while very powerful, also have limitations.  In brief, regular
  expressions cannot match patterns that nest to arbitrary
  depths.  If that statement does not make sense, read Chapter 4,
  which discusses parsers--to a large extent, parsing exists to
  address the limitations of regular expressions.  In general, if
  you have doubts about whether a regular expression is
  sufficient for your task, try to understand the examples in
  Chapter 4, particularly the discussion of how you might spell a
  floating point number.
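
  The nesting limitation can be seen in a small sketch, written here
  for Python 3. The pattern and the function name 'is_flat_group()'
  are illustrative assumptions, not part of the [re] module; the
  pattern accepts exactly one level of parentheses and nothing
  deeper.

```python
import re

# Hypothetical pattern: one pair of parentheses with no
# parentheses of any kind between them.
FLAT_PARENS = re.compile(r'\([^()]*\)')

def is_flat_group(text):
    """Return True if text is a single, non-nested parenthesized group."""
    return FLAT_PARENS.fullmatch(text) is not None

print(is_flat_group('(spam)'))          # one level of parentheses
print(is_flat_group('(spam (eggs))'))   # two levels of parentheses
```

  Handling two levels requires a longer pattern, and three levels a
  longer one still; no single pattern of this kind covers every
  depth.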

  Section 3.1 examines a number of text processing problems that
  are solved most naturally using regular expressions.  As in
  other chapters, the solutions presented to problems can
  generally be adopted directly as little utilities for performing
  tasks.  However, as elsewhere, the larger goal in presenting
  problems and solutions is to address a style of thinking about
  a wider class of problems than those whose solutions are
  presented directly in this book.  Readers who are interested
  in a range of ready utilities and modules will probably want to
  check additional resources on the Web, such as the Vaults of
  Parnassus <http://www.vex.net/parnassus/> and the Python
  Cookbook <http://aspn.activestate.com/ASPN/Python/Cookbook/>.

  Section 3.2 is a "reference with commentary" on the Python
  standard library modules for doing regular expression tasks.
  Several utility modules and backward-compatibility regular
  expression engines are available, but for most readers, the only
  important module will be [re] itself. The discussions
  interspersed with each module try to give some guidance on why
  you would want to use a given module or function, and the
  reference documentation tries to contain more examples of actual
  typical usage than does a plain reference. In many cases, the
  examples and discussion of individual functions address common
  and productive design patterns in Python. The cross-references
  are intended to contextualize a given function (or other thing)
  in terms of related ones (and to help a reader decide which is
  right for her). The actual listings of functions, constants,
  classes, and the like are in alphabetical order within each
  category.


SECTION 0 -- A Regular Expression Tutorial
------------------------------------------------------------------------

    Some people, when confronted with a problem, think "I know,
    I'll use regular expressions." Now they have two problems.
     -- Jamie Zawinski, '<alt.religion.emacs>' (08/12/1997)

  TOPIC -- Just What is a Regular Expression, Anyway?
  --------------------------------------------------------------------

  Many readers will have some background with regular
  expressions, but some will not have any.  Those with
  experience using regular expressions in other languages (or in
  Python) can probably skip this tutorial section.  But readers
  new to regular expressions (affectionately called 'regexes' by
  users) should read this section; even some with experience can
  benefit from a refresher.

  Regular expressions are a compact way of describing complex
  patterns in texts. You can use them to search for patterns
  and, once found, to modify the patterns in complex ways. They
  can also be used to launch programmatic actions that depend on
  patterns.

  Jamie Zawinski's tongue-in-cheek comment in the epigraph is
  worth thinking about. Regular expressions are amazingly
  powerful and deeply expressive. That is the very reason that
  writing them is just as error-prone as writing any other
  complex programming code. It is always better to solve a
  genuinely simple problem in a simple way; when you go beyond
  simple, think about regular expressions.

  A large number of tools other than Python incorporate regular
  expressions as part of their functionality. Unix-oriented
  command-line tools like 'grep', 'sed', and 'awk' are mostly
  wrappers for regular expression processing. Many text editors
  allow search and/or replacement based on regular expressions.
  Many programming languages, especially other scripting languages
  such as Perl and Tcl, build regular expressions into the heart of
  the language. Even most command-line shells, such as Bash or the
  Windows console, allow restricted regular expressions as part of
  their command syntax.

  There are some variations in regular expression syntax between
  different tools that use them, but for the most part regular
  expressions are a "little language" that gets embedded inside
  bigger languages like Python. The examples in this tutorial
  section (and the documentation in the rest of the chapter) will
  focus on Python syntax, but most of this chapter transfers
  easily to working with other programming languages and tools.

  As with most of this book, examples will be illustrated by use of
  Python interactive shell sessions that readers can type
  themselves, so that they can play with variations on the
  examples. However, the [re] module has little reason to include a
  function that simply illustrates matches in the shell. Therefore,
  the examples assume that the small wrapper program below is
  available:

      #---------- re_show.py ----------#
      import re
      def re_show(pat, s):
          # re.M makes ^ and $ match at each line; \g<0> is the whole match
          print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()) + '\n')

      s = '''Mary had a little lamb
      And everywhere that Mary
      went, the lamb was sure
      to go'''

  Place the code in an external module and 'import' it. Those
  new to regular expressions need not worry about what the above
  function does for now. It is enough to know that the first
  argument to 're_show()' will be a regular expression pattern,
  and the second argument will be a string to be matched against.
  Matching treats each line of the string separately for purposes
  of matching beginnings and ends of lines.
  The illustrated matches will be whatever is contained between
  curly braces (and is typographically marked for emphasis).
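
  For readers who want to capture the result rather than print it,
  a variant of the wrapper can be sketched as follows for Python 3.
  The name 're_mark()' is hypothetical; the body uses the same
  're.compile()' call, 're.M' flag, and '\g<0>' replacement as
  're_show()' above.

```python
import re

def re_mark(pat, s):
    """Return s with every match of pat wrapped in curly braces."""
    # re.M lets ^ and $ match at the start and end of each line;
    # \g<0> in the replacement stands for the entire matched text.
    return re.compile(pat, re.M).sub(r'{\g<0>}', s.rstrip())

s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''

print(re_mark('lamb', s))
```

  Returning a string makes it easy to compare the marked-up text
  against an expected value while experimenting.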

  TOPIC -- Matching Patterns in Text: The Basics
  --------------------------------------------------------------------

  The very simplest pattern matched by a regular expression is a
  literal character or a sequence of literal characters. Anything
  in the target text that consists of exactly those characters in
  exactly the order listed will match. A lowercase character is not
  identical with its uppercase version, and vice versa. A space in
  a regular expression, by the way, matches a literal space in the
  target (this is unlike most programming languages or command-line
  tools, where a variable number of spaces separate keywords).

      >>> from re_show import re_show, s
      >>> re_show('a', s)
      M{a}ry h{a}d {a} little l{a}mb
      And everywhere th{a}t M{a}ry
      went, the l{a}mb w{a}s sure
      to go

      >>> re_show('Mary', s)
      {Mary} had a little lamb
      And everywhere that {Mary}
      went, the lamb was sure
      to go
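
  The points about case and spaces can be checked directly with the
  [re] module. The following is a minimal Python 3 sketch; the
  sample text is the first line of the rhyme used above.

```python
import re

text = 'Mary had a little lamb'

# Literal characters match only in the same case and the same order.
print(re.findall('mary', text))    # lowercase 'm' is not uppercase 'M'
print(re.findall('Mary', text))

# A space in the pattern matches exactly one literal space.
print(re.search('a little', text) is not None)
print(re.search('a  little', text) is not None)   # two spaces
```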

  -*-

  A number of characters have special meanings to regular
  expressions. A symbol with a special meaning can be matched,
  but to do so it must be prefixed with the backslash character
  (this includes the backslash character itself:  to match one
  backslash in the target, the regular expression should include
  '\\'). In Python, a special way of quoting a string is
  available that leaves backslash escapes uninterpreted. Since
  regular expressions use many of the same backslash-prefixed
  codes as do Python strings, it is usually easier to compose
  regular expression strings by quoting them as "raw strings"
  with an initial "r".

      >>> from re_show import re_show
      >>> s = '''Special characters must be escaped.*'''
      >>> re_show(r'.*', s)
      {Special characters must be escaped.*}

      >>> re_show(r'\.\*', s)
      Special characters must be escaped{.*}

      >>> re_show('\\\\', r'Python \ escaped \ pattern')
      Python {\} escaped {\} pattern

      >>> re_show(r'\\', r'Regex \ escaped \ pattern')
      Regex {\} escaped {\} pattern
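
  The two spellings of the backslash pattern above are the same
  two-character string, as the following Python 3 sketch suggests.
  It also uses 're.escape()', a standard [re] function not otherwise
  covered in this tutorial, which adds the backslashes for you.

```python
import re

# The same one-backslash regex written two ways.
plain = '\\\\'    # ordinary string: Python reduces it to two characters
raw = r'\\'       # raw string: the two characters are kept as typed
print(plain == raw, len(raw))

target = r'Python \ escaped \ pattern'
print(len(re.findall(raw, target)))   # count of literal backslashes

# re.escape() prefixes regex special characters with backslashes.
pat = re.escape('.*')
print(re.sub(pat, '{.*}', 'must be escaped.*'))
```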

  -*-

  Two special characters are used to mark the beginning and end
  of a line:  caret ("^") and dollar sign ("$"). To match a caret
  or dollar sign as a literal character, it must be escaped (i.e.,
  preceded by a backslash "\").

  An interesting thing about the caret and dollar sign is that
  they match zero-width patterns. That is, the length of the
  string matched by a caret or dollar sign by itself is zero (but
  the rest of the regular expression can still depend on the
  zero-width match). Many regular expression tools provide
  another zero-width pattern for word-boundary ("\b"). Words
  might be divided by whitespace like spaces, tabs, newlines, or
  other characters like nulls; the word-boundary pattern matches
  the actual point where a word starts or ends, not the
  particular whitespace characters.

      >>> from re_show import re_show, s
      >>> re_show(r'^Mary', s)
      {Mary} had a little lamb
      And everywhere that Mary
      went, the lamb was sure
      to go

      >>> re_show(r'Mary$', s)
      Mary had a little lamb
      And everywhere that {Mary}
      went, the lamb was sure
      to go

      >>> re_show(r'$','Mary had a little lamb')
      Mary had a little lamb{}
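
  The zero-width behavior, and the word-boundary pattern that the
  sessions above do not exercise, can be explored with a short
  Python 3 sketch. The sample strings are made up for illustration.

```python
import re

s = 'Mary had a little lamb\nAnd everywhere that Mary'

# Without re.M, ^ anchors only at the start of the whole string;
# with re.M it also anchors at the start of every line.
print(re.findall(r'^\w+', s))
print(re.findall(r'^\w+', s, re.M))

# A zero-width match starts and ends at the same position.
line = 'Mary had a little lamb'
m = re.search(r'$', line)
print(m.start(), m.end(), len(line))

# \b matches the position where a word starts or ends, so 'had'
# inside 'shadow' is left alone.
print(re.sub(r'\bhad\b', '{had}', 'Mary had a shadow'))
```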

  -*-

  In regular expressions, a period can stand for any character.
  Normally, the newline character is not included, but optional
  switches can force inclusion of the newline character also (see
  later documentation of [re] module functions). Using a period
  in a pattern is a way of requiring that "something" occurs
  here, without having to decide what.

  Readers who are familiar with DOS command-line wildcards will
  know the question mark as filling the role of "some character"
  in command masks. But in regular expressions, the
  question mark has a different meaning, and the period is used
  as a wildcard.

      >>> from re_show import re_show, s
      >>> re_show(r'.a', s)
      {Ma}ry {ha}d{ a} little {la}mb
      And everywhere t{ha}t {Ma}ry
      went, the {la}mb {wa}s sure
      to go
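
  The "optional switch" that lets a period match a newline is the
  're.S' flag (also spelled 're.DOTALL'). A minimal Python 3 sketch,
  using a made-up two-line string:

```python
import re

s = 'lamb\nAnd'

# By default, the period matches any character except a newline.
print(re.findall(r'b.A', s))

# With re.S (re.DOTALL), the newline is matched as well.
print(re.findall(r'b.A', s, re.S))

# An escaped period matches only a literal period.
print(re.findall(r'\.', 'to go.'))
```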

  -*-

  A regular expression can have literal characters in it and also
  zero-width positional patterns. Each literal character or positional
  pattern is an atom in a regular expression. One may also group
  several atoms together into a small regular expression that is
  part of a larger regular expression. One might be inclined to
  call such a grouping a "molecule," but normally it is also
  called an atom.

  In older Unix-oriented tools like grep, subexpressions must be
  grouped with escaped parentheses, for example, '\(Mary\)'. In
  Python (as with most more recent tools), grouping is done with
  bare parentheses, but matching a literal parenthesis requires
  escaping it in the pattern.

      >>> from re_show import re_show, s
      >>> re_show(r'(Mary)( )(had)', s)
      {Mary had} a little lamb
      And everywhere that Mary
      went, the lamb was sure
      to go

      >>> re_show(r'\(.*\)', 'spam (and eggs)')
      spam {(and eggs)}
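
  Because 're_show()' marks only the whole match, it does not reveal
  the individual groups. A match object does; the following Python 3
  sketch uses the standard 'group()' and 'groups()' methods on the
  same patterns as above (with one extra pair of parentheses added
  to the second pattern to capture the enclosed text).

```python
import re

s = 'Mary had a little lamb'

m = re.search(r'(Mary)( )(had)', s)
print(m.group(0))    # the whole match
print(m.groups())    # each parenthesized atom separately

# Escaped parentheses are literal characters; the bare pair
# inside them groups the text between the literal ones.
inner = re.search(r'\((.*)\)', 'spam (and eggs)')
print(inner.group(0))
print(inner.group(1))
```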

  -*-

  Rather than name only a single character, a pattern in a
  regular expression can match any of a set of characters.

  A set of characters can be given as a simple list inside square
  brackets, for example, '[aeiou]' will match any single lowercase
  vowel. A set may also contain a range of letters or digits, given
  as the first and last character with a dash in the middle; for example,
  '[A-Ma-m]' will match any lowercase or uppercase letter in the
  first half of the alphabet.
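
  Both kinds of set can be tried out with 're.findall()'. A minimal
  Python 3 sketch, joining the single-character matches back into a
  string for easier reading:

```python
import re

text = 'Mary had a little lamb'

# A simple list: any single lowercase vowel.
vowels = ''.join(re.findall(r'[aeiou]', text))
print(vowels)

# Two ranges in one set: letters in the first half of the
# alphabet, in either case.
first_half = ''.join(re.findall(r'[A-Ma-m]', text))
print(first_half)
```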

  Python (as with many tools) provides escape-style shortcuts to
  the most commonly used character classes, such as '\s' for a
  whitespace character and '\d' for a digit. One could always