APPENDIX -- A STATE MACHINE FOR ADDING MARKUP TO TEXT
-------------------------------------------------------------------

  This book was written entirely in plaintext editors, using a set
  of conventions I call "smart ASCII." In spirit and appearance,
  smart ASCII resembles the informal markup that has developed on
  email and Usenet. In fact, I have used an evolving version of the
  format for a number of years to produce articles, tutorials, and
  other documents. The book required a few conventions beyond
  those of the earlier smart ASCII format, but only a few. It was a
  toolchain that made almost all the individual typographic and
  layout decisions. Of course, that toolchain only came to exist
  through many hours of programming and debugging by me and by
  other developers.

  The printed version of this book used tools I wrote in Python to
  assemble the chapters, frontmatter, and endmatter, and then to
  add LaTeX markup codes to the text. A moderate number of custom
  LaTeX macros are included in that markup. From there, the work of
  other people lets me convert the LaTeX source into the PDF that
  Addison-Wesley can turn into printed copies.

  For information on the smart ASCII format, see the discussions
  of it in several places in this book, chiefly in Chapter 4.
  You may also download the ASCII text of this book from its Web
  site at <http://gnosis.cx/TPiP/>, along with semiformal
  documentation of the conventions used.  Readers might also be
  interested in a format called "reStructuredText," which is
  similar in spirit, but both somewhat "heavier" and more
  formally specified. reStructuredText has a semiofficial
  status in the Python community since it is now included in the
  [DocUtils] package; for information see:

    <http://docutils.sourceforge.net/rst.html>.

  In this appendix, I include the full source code for an
  application that can convert the original text of this book
  into an HTML document.  I believe that this application is a
  good demonstration of the design and structure of a realistic
  text processing tool.  In general structure, 'book2html.py'
  uses a line-oriented state machine to categorize lines into
  appropriate document elements.  Under this approach, the
  "meaning" of a particular line is, in part, determined by the
  context of the lines that came immediately before it.  After
  making decisions on how to categorize each line with a
  combination of a state machine and a collection of regular
  expression patterns, the blocks of document elements are
  processed into HTML output.  In principle, it would not be
  difficult to substitute a different output format; the steps
  involved are modular.
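
  The context-dependent categorization described above can be
  illustrated on a much smaller scale.  What follows is a minimal
  sketch in Python 3, not the 'book2html.py' source: the four
  states, the three patterns, and the names 'classify()',
  'blocks()', and 'to_html()' are all assumptions made for
  illustration.  In this sketch, a deeply indented line starts a
  code sample only when it follows a blank line or another code
  line; after body text, the same line counts as more body text.

```python
#---------- line_states.py (illustrative sketch) ----------#
import re
from html import escape
from itertools import groupby

# Hypothetical states and patterns, far fewer than a real book needs
BLANK, HEADER, CODE, BODY = "BLANK", "HEADER", "CODE", "BODY"
blankln = re.compile(r"^\s*$")
headln = re.compile(r"^[A-Z][A-Z0-9 ,:'-]+$")  # unindented, all caps
codeln = re.compile(r"^ {6,}\S")               # six or more leading spaces

def classify(lines):
    "Yield (state, line) pairs; each decision looks at the prior state"
    state = BLANK
    for line in lines:
        line = line.rstrip("\n")
        if blankln.match(line):
            state = BLANK
        elif state == BLANK and headln.match(line):
            state = HEADER
        elif state in (BLANK, CODE) and codeln.match(line):
            state = CODE
        else:
            state = BODY
        yield state, line

def blocks(lines):
    "Group consecutive lines that share a state into (state, text) blocks"
    for state, group in groupby(classify(lines), key=lambda pair: pair[0]):
        if state != BLANK:
            yield state, "\n".join(line for _, line in group)

# The output stage is separate; only this table and to_html() know HTML
MARKUP = {HEADER: "h1", CODE: "pre", BODY: "p"}

def to_html(lines):
    "Render each block inside the tag chosen for its state"
    return "\n".join("<%s>%s</%s>" % (MARKUP[state], escape(text), MARKUP[state])
                     for state, text in blocks(lines))

if __name__ == "__main__":
    import sys
    print(to_html(sys.stdin))
```

  Because 'classify()' and 'blocks()' know nothing about HTML,
  swapping in another output format in this sketch would mean
  replacing only 'MARKUP' and 'to_html()'.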

  The Web site for this book has a collection of utilities similar
  to the one presented. Over time, I have adapted the skeleton to
  deal with variations in input and output formats, but the
  variants all overlap. Using this utility is simply a
  matter of typing something like:

      #*------------- Use of book2html.py ---------------------#
      % book2html.py "Text Processing in Python" < TPiP.txt > TPiP.html

  The title is optional, and you may pipe STDIN and STDOUT as
  usual.  Since the target is HTML, I decided it would be nice to
  colorize source code samples.  That capability is in a support
  module:

      #----------------------- colorize.py --------------------#
      #!/usr/bin/python
      import keyword, token, tokenize, sys
      from cStringIO import StringIO

      PLAIN = '%s'
      BOLD  = '<b>%s</b>'
      CBOLD = '<font color="%s"><b>%s</b></font>'
      _KEYWORD = token.NT_OFFSET+1
      _TEXT    = token.NT_OFFSET+2
      COLORS   = { token.NUMBER:     'black',
                   token.OP:         'darkblue',
                   token.STRING:     'green',
                   tokenize.COMMENT: 'darkred',
                   token.NAME:       None,
                   token.ERRORTOKEN: 'red',
                   _KEYWORD:         'blue',
                   _TEXT:            'black'  }

      class ParsePython:
          "Colorize python source"
          def __init__(self, raw):
              self.inp  = StringIO(raw.expandtabs(4).strip())
          def toHTML(self):
              "Parse and send the colored source"
              raw = self.inp.getvalue()
              self.out = StringIO()
              self.lines = [0,0]      # store line offsets in self.lines
              self.lines += [i+1 for i in range(len(raw)) if raw[i]=='\n']
              self.lines += [len(raw)]
              self.pos = 0
              try:
                  tokenize.tokenize(self.inp.readline, self)
                  return self.out.getvalue()
              except tokenize.TokenError, ex:
                  msg,ln = ex[0],ex[1][0]
                  sys.stderr.write("ERROR: %s %s\n" %
                                   (msg, raw[self.lines[ln]:]))
                  return raw
          def __call__(self,toktype,toktext,(srow,scol),(erow,ecol),line):
              "Token handler"
              # calculate new positions
              oldpos = self.pos
              newpos = self.lines[srow] + scol
              self.pos = newpos + len(toktext)
              if toktype in [token.NEWLINE, tokenize.NL]:  # handle newlns
                  self.out.write('\n')
                  return
              if newpos > oldpos:     # send original whitespace, if needed
                  self.out.write(self.inp.getvalue()[oldpos:newpos])
              if toktype in [token.INDENT, token.DEDENT]:
                  self.pos = newpos   # skip indenting tokens
                  return
              if token.LPAR <= toktype and toktype <= token.OP:
                  toktype = token.OP  # map token type to a color group
              elif toktype == token.NAME and keyword.iskeyword(toktext):
                  toktype = _KEYWORD
              color = COLORS.get(toktype, COLORS[_TEXT])
              if toktext:             # send text
                  txt = Detag(toktext)
                  if color is None:    txt = PLAIN % txt
                  elif color=='black': txt = BOLD % txt
                  else:                txt = CBOLD % (color,txt)
                  self.out.write(txt)

      Detag = lambda s: \
          s.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;')

      if __name__=='__main__':
          parsed = ParsePython(sys.stdin.read())
          print '<pre>'
          print parsed.toHTML()
          print '</pre>'
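
  The listing above relies on features of the Python of its day
  ('cStringIO', the 'print' statement, tuple parameters, and the
  callback form of 'tokenize.tokenize()') and will not run under
  Python 3.  The following is a minimal sketch of the same
  technique for Python 3, not a port of the module: the function
  name 'colorize()', the use of '<span>' styles in place of
  '<font>' and '<b>' tags, and the reduced color table are
  assumptions made for illustration.  It keeps the central idea of
  copying the whitespace between tokens unchanged and wrapping
  each token in markup chosen by its type.

```python
#---------- colorize3.py (illustrative sketch) ----------#
import html, io, keyword, sys, token, tokenize

# Reduced color table; token types not listed are left unstyled
COLORS = {token.NUMBER: "black", token.OP: "darkblue",
          token.STRING: "green", tokenize.COMMENT: "darkred",
          token.ERRORTOKEN: "red"}
KEYWORD_COLOR = "blue"
SPAN = '<span style="color:%s">%s</span>'

def colorize(source):
    "Return Python source as HTML with tokens wrapped in colored spans"
    # offsets[row] is the index in source where 1-based row begins
    offsets = [0, 0]
    for line in source.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))

    def offset(row, col):
        # Clamp, since implicit end-of-file tokens may point past the text
        return min(offsets[min(row, len(offsets) - 1)] + col, len(source))

    out, pos = [], 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            start, end = offset(*tok.start), offset(*tok.end)
            if end <= pos:          # zero-width token (DEDENT, ENDMARKER)
                continue
            # Copy the original whitespace between tokens
            out.append(html.escape(source[pos:start], quote=False))
            text = html.escape(source[start:end], quote=False)
            if tok.type == token.NAME and keyword.iskeyword(tok.string):
                color = KEYWORD_COLOR
            else:
                color = COLORS.get(tok.type)
            out.append(SPAN % (color, text) if color else text)
            pos = end
    except (tokenize.TokenError, SyntaxError):
        # Fall back to the escaped, uncolored source on malformed input
        return html.escape(source, quote=False)
    out.append(html.escape(source[pos:], quote=False))
    return "".join(out)

if __name__ == "__main__":
    print("<pre>%s</pre>" % colorize(sys.stdin.read()))
```

  Slicing the source by token position, rather than joining token
  strings, means that stripping the tags from the result should
  give back the escaped input unchanged.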

  The module [colorize] contains its own self-test code and is
  perfectly usable as a utility on its own. The main module
  consists of:

      #--------------------- book2html.py ---------------------#
      #!/usr/bin/python
      """Convert ASCII book source files for HTML presentation

      Usage: python book2html.py [title] < source.txt > target.html
      """
      __author__=["David Mertz (mertz@gnosis.cx)",]
      __version__="November 2002"

      from __future__ import generators
      import sys, re, string, time
      from colorize import ParsePython
      from cgi import escape

      #-- Define some HTML boilerplate
      html_open =\
      """<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
      <html>
      <head>
      <title>%s</title>
      <style>
        .code-sample {background-color:#EEEEEE; text-align:left;
                      width:90%%; margin-left:auto; margin-right:auto;}
        .module      {color : darkblue}
        .libfunc     {color : darkgreen}
      </style>
      </head>
      <body>
      """
      html_title = "Automatically Generated HTML"
      html_close = "</body></html>"
      code_block = \
      """<table class="code-sample"><tr><td><h4>%s</h4></td></tr>
      <tr><td><pre>%s</pre></td></tr>
      </table>"""
      #-- End of boilerplate

      #-- State constants
      for s in ("BLANK CHAPTER SECTION SUBSECT SUBSUB MODLINE "
                "MODNAME PYSHELL CODESAMP NUMLIST BODY QUOTE "
                "SUBBODY TERM DEF RULE VERTSPC").split():
          exec "%s = '%s'" % (s,s)
      markup = {CHAPTER:'h1', SECTION:'h2', SUBSECT:'h3', SUBSUB:'h4',
                BODY:'p', QUOTE:'blockquote', NUMLIST:'blockquote',