📄 chap2.txt

📁 很详细的Python文字处理教程
💻 TXT
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页
CHAPTER II -- BASIC STRING OPERATIONS
-------------------------------------------------------------------

    The cheapest, fastest and most reliable components of a
    computer system are those that aren't there.
      --Gordon Bell, Encore Computer Corporation

  If you are writing programs in Python to accomplish text
  processing tasks, most of what you need to know is in this
  chapter.  Sure, you will probably need to know how to do some
  basic things with pipes, files, and arguments to get your text
  to process (covered in Chapter 1); but for actually
  -processing- the text you have gotten, the [string] module and
  string methods--and Python's basic data structures--do most
  all of what you need done, almost all the time.  To a lesser
  extent, the various custom modules to perform encodings,
  encryptions, and compressions are handy to have around (and you
  certainly do not want the work of implementing them yourself).
  But at the heart of text processing are basic transformations of
  bits of text.  That's what [string] functions and string
  methods do.

  There are a lot of interesting techniques elsewhere in this
  book.  I wouldn't have written about them if I did not find
  them important.  But be cautious before doing interesting
  things.  Specifically, given a fixed task in mind, before
  cracking this book open to any of the other chapters, consider
  very carefully whether your problem can be solved using the
  techniques in this chapter.  If you can answer this question
  affirmatively, you should usually eschew the complications of
  using the higher-level modules and techniques that other
  chapters discuss.  By all means read all of this book for
  the insight and edification that I hope it provides; but still
  focus on the "Zen of Python," and prefer simple to complex when
  simple is enough.

  This chapter does several things. Section 2.1 looks at a number
  of common problems in text processing that can (and should) be
  solved using (predominantly) the techniques documented in this
  chapter. Each of these "Problems" presents working solutions that
  can often be adopted with little change to real-life jobs. But a
  larger goal is to provide readers with a starting point for
  adaptation of the examples. It is not my goal to provide mere
  collections of packaged utilities and modules--plenty of those
  exist on the Web, and resources like the Vaults of Parnassus
  <http://www.vex.net/parnassus/> and the Python Cookbook
  <http://aspn.activestate.com/ASPN/Python/Cookbook/> are worth
  investigating as part of any project/task (and new and better
  utilities will be written between the time I write this and when
  you read it). It is better for readers to receive a solid
  foundation and starting point from which to develop the
  functionality they need for their own projects and tasks. And
  even better than spurring adaptation, these examples aim to
  encourage contemplation. In presenting examples, this book tries
  to embody a way of thinking about problems and an attitude
  towards solving them. More than any individual technique, such
  ideas are what I would most like to share with readers.

  Section 2.2 is a "reference with commentary" on the Python
  standard library modules for doing basic text manipulations. The
  discussions interspersed with each module try to give some
  guidance on why you would want to use a given module or function,
  and the reference documentation tries to contain more examples of
  actual typical usage than does a plain reference. In many cases,
  the examples and discussion of individual functions addresses
  common and productive design patterns in Python. The
  cross-references are intended to contextualize a given function
  (or other thing) in terms of related ones (and to help you decide
  which is right for you). The actual listing of functions,
  constants, classes, and the like is in alphabetical order within
  type of thing.

  Section 2.3 in many ways continues Section 2.1, but also provides
  some aids for using this book in a learning context.  The
  problems and solutions presented in Section 2.3 are somewhat more
  open-ended than those in Section 2.1.  As well, each section
  labeled as "Discussion" is followed by one labeled
  "Questions."  These questions are ones that could be assigned
  by a teacher to students; but they are also intended to be
  issues that general readers will enjoy and benefit from
  contemplating.  In many cases, the questions point to
  limitations of the approaches initially presented, and ask
  readers to think about ways to address or move beyond these
  limitations--exactly what readers need to do when writing their
  own custom code to accomplish outside tasks.  However, each
  Discussion in Section 2.3 should stand on its own, even if the
  Questions are skipped over by the reader.


SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------

  PROBLEM: Quickly sorting lines on custom criteria
  --------------------------------------------------------------------

  Sorting is one of the real meat-and-potatoes algorithms of text
  processing and, in fact, of most programming. Fortunately for
  Python developers, the native `[].sort` method is extraordinarily
  fast. Moreover, Python lists with almost any heterogeneous
  objects as elements can be sorted--Python cannot rely on the
  uniform arrays of a language like C (an unfortunate exception to
  this general power was introduced in recent Python versions where
  comparisons of complex numbers raise a 'TypeError'; and
  '[1+1j,2+2j].sort()' dies for the same reason; Unicode strings in
  lists can cause similar problems).

  SEE ALSO, [complex]

  +++

  The list sort method is wonderful when you want to sort items in
  their "natural" order--or in the order that Python considers
  natural, in the case of items of varying types. Unfortunately, a
  lot of times, you want to sort things in "unnatural" orders. For
  lines of text, in particular, any order that is not simple
  alphabetization of the lines is "unnatural." But often text lines
  contain meaningful bits of information in positions other than
  the first character position: A last name may occur as the second
  word of a list of people (for example, with first name as the
  first word); an IP address may occur several fields into a server
  log file; a money total may occur at position 70 of each line;
  and so on. What if you want to sort lines based on this style of
  meaningful order that Python doesn't quite understand?

  The list sort method `[].sort()` supports an optional custom
  comparison function argument. The job this function has is to
  return -1 if the first thing should come first, return 0 if the
  two things are equal order-wise, and return 1 if the first thing
  should come second. The built-in function `cmp()` does this in a
  manner identical to the default `[].sort()` (except in terms of
  speed, 'lst.sort()' is much faster than 'lst.sort(cmp)'). For
  short lists and quick solutions, a custom comparison function is
  probably the best thing. In a lot of cases, one can even get by
  with an in-line 'lambda' function as the custom comparison
  function, which is a pleasant and handy idiom.

  When it comes to speed, however, use of custom comparison
  functions is fairly awful. Part of the problem is Python's
  function call overhead, but a lot of other factors contribute to
  the slowness. Fortunately, a technique called "Schwartzian
  Transforms" can make for much faster custom sorts. Schwartzian
  Transforms are so named after Randal Schwartz, who proposed the
  technique for working with Perl; but the technique is equally
  applicable to Python.

  The pattern involved in the Schwartzian Transform technique
  consists of three steps (these can more precisely be called the
  Guttman-Rosler Transform, which is based on the Schwartzian
  Transform):

  1.  Transform the list in a reversible way into one that sorts
      "naturally."

  2.  Call Python's native `[].sort()` method.

  3.  Reverse the transformation in (1) to restore the
      original list items (in new sorted order).

  The reason this technique works is that, for a list of size N,
  it only requires O(2N) transformation operations, which is easy
  to amortize over the necessary O(N log N) compare/flip
  operations for large lists.  The sort dominates computational
  time, so anything that makes the sort more efficient is a win
  in the limit case (this limit is reached quickly).

  Below is an example of a simple, but plausible, custom sorting
  algorithm.  The sort is on the fourth and subsequent words of
  a list of input lines.  Lines that are shorter than four words
  sort to the bottom.  Running the test against a file with about
  20,000 lines--about 1 megabyte--performed the Schwartzian
  Transform sort in less than 2 seconds, while taking over 12
  seconds for the custom comparison function sort (outputs were
  verified as identical).  Any number of factors will change the
  exact relative timings, but a better than six times gain can
  generally be expected.

      #---------- schwartzian_sort.py ----------#
      # Timing test for "sort on fourth word"
      # Specifically, two lines >= 4 words will be sorted
      #   lexographically on the 4th, 5th, etc.. words.
      #   Any line with fewer than four words will be sorted to
      #   the end, and will occur in "natural" order.

      import sys, string, time
      wrerr = sys.stderr.write

      # naive custom sort
      def fourth_word(ln1,ln2):
          lst1 = string.split(ln1)
          lst2 = string.split(ln2)
          #-- Compare "long" lines
          if len(lst1) >= 4 and len(lst2) >= 4:
              return cmp(lst1[3:],lst2[3:])
          #-- Long lines before short lines
          elif len(lst1) >= 4 and len(lst2) < 4:
              return -1
          #-- Short lines after long lines
          elif len(lst1) < 4 and len(lst2) >= 4:
              return 1
          else:                   # Natural order
              return cmp(ln1,ln2)

      # Don't count the read itself in the time
      lines = open(sys.argv[1]).readlines()

      # Time the custom comparison sort
      start = time.time()
      lines.sort(fourth_word)

      end = time.time()
      wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
      # open('tmp.custom','w').writelines(lines)

      # Don't count the read itself in the time
      lines = open(sys.argv[1]).readlines()

      # Time the Schwartzian sort
      start = time.time()
      for n in range(len(lines)):       # Create the transform
          lst = string.split(lines[n])
          if len(lst) >= 4:             # Tuple w/ sort info first
              lines[n] = (lst[3:], lines[n])
          else:                         # Short lines to end
              lines[n] = (['\377'], lines[n])

      lines.sort()                      # Native sort

      for n in range(len(lines)):       # Restore original lines
          lines[n] = lines[n][1]

      end = time.time()
      wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
      # open('tmp.schwartzian','w').writelines(lines)

  Only one particular example is presented, but readers should be
  able to generalize this technique to any sort they need to
  perform frequently or on large files.


  PROBLEM: Reformatting paragraphs of text
  --------------------------------------------------------------------

  While I mourn the decline of plaintext ASCII as a communication
  format--and its eclipse by unnecessarily complicated and large
  (and often proprietary) formats--there is still plenty of life
  left in text files full of prose. READMEs, HOWTOs, email,
  Usenet posts, and this book itself are written in plaintext (or
  at least something close enough to plaintext that generic
  processing techniques are valuable). Moreover, many formats like
  HTML and LaTeX are frequently enough hand-edited that their
  plaintext appearance is important.

  One task that is extremely common when working with prose text
  files is reformatting paragraphs to conform to desired margins.
  Python 2.3 adds the module [textwrap], which performs more
  limited reformatting than the code below. Most of the time, this
  task gets done within text editors, which are indeed quite
  capable of performing the task. However, sometimes it would be
  nice to automate the formatting process. The task is simple
  enough that it is slightly surprising that Python has no standard
  module function to do this. There -is- the class
  `formatter.DumbWriter`, or the possibility of inheriting from and
  customizing `formatter.AbstractWriter`. These classes are
  discussed in Chapter 5; but frankly, the amount of customization
  and sophistication needed to use these classes and their many
  methods is way out of proportion for the task at hand.

  Below is a simple solution that can be used either as a
  command-line tool (reading from STDIN and writing to STDOUT) or
  by import to a larger application.

      #---------- reformat_para.py ----------#
      # Simple paragraph reformatter.  Allows specification
      # of left and right margins, and of justification style
      # (using constants defined in module).

      LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'

      def reformat_para(para='',left=0,right=72,just=LEFT):
          words = para.split()
          lines = []
          line  = ''
          word = 0
          end_words = 0
          while not end_words:
              if len(words[word]) > right-left: # Handle very long words
                  line = words[word]
                  word +=1
                  if word >= len(words):
                      end_words = 1
              else:                             # Compose line of words
                  while len(line)+len(words[word]) <= right-left:
                      line += words[word]+' '
                      word += 1
                      if word >= len(words):
                          end_words = 1
                          break
              lines.append(line)
              line = ''
          if just==CENTER:
              r, l = right, left
              return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
          elif just==RIGHT:
              return '\n'.join([line.rjust(right) for line in lines])
          else: # left justify
              return '\n'.join([' '*left+line for line in lines])

      if __name__=='__main__':
          import sys
          if len(sys.argv) <> 4:
              print "Please specify left_margin, right_marg, justification"
12 3 4 5 下一页
💿 文件大小 2227 K
👤 上传用户 stzwsy
📂 所属分类其他
🏷️ 相关标签

#Python #教程
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -