APPENDIX -- A STATE MACHINE FOR ADDING MARKUP TO TEXT
-------------------------------------------------------------------
This book was written entirely in plaintext editors, using a set
of conventions I call "smart ASCII." In spirit and appearance,
smart ASCII resembles the informal markup that has developed on
email and Usenet. In fact, I have used an evolving version of the
format for a number of years to produce articles, tutorials, and
other documents. The book required a few conventions beyond those
in the earlier smart ASCII format, but only a few. It was a
toolchain that made almost all the individual typographic and
layout decisions. Of course, that toolchain only came to exist
through many hours of programming and debugging by me and by
other developers.

The printed version of this book used tools I wrote in Python to
assemble the chapters, frontmatter, and endmatter, and then to
add LaTeX markup codes to the text. A moderate number of custom
LaTeX macros are included in that markup. From there, the work of
other people lets me convert LaTeX source into the PDF format
Addison-Wesley can convert into printed copies.

For information on the smart ASCII format, see the discussions
of it in several places in this book, chiefly in Chapter 4.
You may also download the ASCII text of this book from its Web
site at <http://gnosis.cx/TPiP/>, along with semiformal
documentation of the conventions used. Readers might also be
interested in a format called "reStructuredText," which is
similar in spirit, but both somewhat "heavier" and more
formally specified. reStructuredText has a semiofficial
status in the Python community since it is now included in the
[DocUtils] package; for information see:
<http://docutils.sourceforge.net/rst.html>.

In this appendix, I include the full source code for an
application that can convert the original text of this book
into an HTML document. I believe that this application is a
good demonstration of the design and structure of a realistic
text processing tool. In general structure, 'book2html.py'
uses a line-oriented state machine to categorize lines into
appropriate document elements. Under this approach, the
"meaning" of a particular line is, in part, determined by the
context of the lines that came immediately before it. After
making decisions on how to categorize each line with a
combination of a state machine and a collection of regular
expression patterns, the blocks of document elements are
processed into HTML output. In principle, it would not be
difficult to substitute a different output format; the steps
involved are modular.
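
As a rough illustration of the approach just described, and not the
code of 'book2html.py' itself, the following sketch categorizes lines
with a few regular expressions plus the state of the preceding line,
groups them into blocks, and only then emits HTML. The state names,
the three patterns, and the functions classify(), group(), and
to_html() are all assumptions made for this example, which is written
for Python 3.

```python
import html
import re

# Hypothetical states and patterns, loosely inspired by smart ASCII.
# They are NOT the ones book2html.py defines.
BLANK, RULE, CODE, BODY = "BLANK", "RULE", "CODE", "BODY"
blank_line = re.compile(r"^\s*$")
rule_line = re.compile(r"^-{3,}\s*$")
code_line = re.compile(r"^ {6,}\S")

def classify(lines):
    """Yield (state, line) pairs; the previous state supplies context."""
    state = BLANK
    for line in lines:
        if blank_line.match(line):
            state = BLANK
        elif rule_line.match(line):
            state = RULE
        elif code_line.match(line) and state in (BLANK, CODE):
            # An indented line starts or continues a code sample only
            # after a blank line or another code line.
            state = CODE
        else:
            # Anything else, including an indented line that follows
            # body text, is treated as part of a paragraph.
            state = BODY
        yield state, line

def group(pairs):
    """Collect runs of same-state lines into (state, [lines]) blocks."""
    blocks = []
    for state, line in pairs:
        if blocks and blocks[-1][0] == state:
            blocks[-1][1].append(line)
        else:
            blocks.append((state, [line]))
    return blocks

def to_html(blocks):
    """Render blocks; swapping this function changes the output format."""
    out = []
    for state, lines in blocks:
        text = html.escape("\n".join(lines))
        if state == CODE:
            out.append("<pre>%s</pre>" % text)
        elif state == BODY:
            out.append("<p>%s</p>" % text)
        elif state == RULE:
            out.append("<hr>")
    return "\n".join(out)
```

The same indented line is categorized differently depending on what
precedes it, which is the point of keeping a state rather than
matching each line in isolation. Because categorizing, grouping, and
rendering are separate steps, a different output format needs only a
replacement for to_html().
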
The Web site for this book has a collection of utilities similar
to the one presented. Over time, I have adapted the skeleton to
deal with variations in input and output formats, but there is
overlap between all of them. Using this utility is simply a
matter of typing something like:

#*------------- Use of book2html.py ---------------------#
% book2html.py "Text Processing in Python" < TPiP.txt > TPiP.html

The title is optional, and you may pipe STDIN and STDOUT as
usual. Since the target is HTML, I decided it would be nice to
colorize source code samples. That capability is in a support
module:

#----------------------- colorize.py --------------------#
#!/usr/bin/python
import keyword, token, tokenize, sys
from cStringIO import StringIO

PLAIN = '%s'
BOLD  = '<b>%s</b>'
CBOLD = '<font color="%s"><b>%s</b></font>'
_KEYWORD = token.NT_OFFSET+1
_TEXT    = token.NT_OFFSET+2
COLORS   = { token.NUMBER:     'black',
             token.OP:         'darkblue',
             token.STRING:     'green',
             tokenize.COMMENT: 'darkred',
             token.NAME:       None,
             token.ERRORTOKEN: 'red',
             _KEYWORD:         'blue',
             _TEXT:            'black'  }

class ParsePython:
    "Colorize python source"
    def __init__(self, raw):
        self.inp = StringIO(raw.expandtabs(4).strip())
    def toHTML(self):
        "Parse and send the colored source"
        raw = self.inp.getvalue()
        self.out = StringIO()
        self.lines = [0,0]  # store line offsets in self.lines
        self.lines += [i+1 for i in range(len(raw)) if raw[i]=='\n']
        self.lines += [len(raw)]
        self.pos = 0
        try:
            tokenize.tokenize(self.inp.readline, self)
            return self.out.getvalue()
        except tokenize.TokenError, ex:
            msg,ln = ex[0],ex[1][0]
            sys.stderr.write("ERROR: %s %s\n" %
                             (msg, raw[self.lines[ln]:]))
            return raw
    def __call__(self,toktype,toktext,(srow,scol),(erow,ecol),line):
        "Token handler"
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)
        if toktype in [token.NEWLINE, tokenize.NL]:  # handle newlines
            self.out.write('\n')
            return
        if newpos > oldpos:   # send the original whitespace, if needed
            self.out.write(self.inp.getvalue()[oldpos:newpos])
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos  # skip indenting tokens
            return
        if token.LPAR <= toktype and toktype <= token.OP:
            toktype = token.OP  # map token type to a color group
        elif toktype == token.NAME and keyword.iskeyword(toktext):
            toktype = _KEYWORD
        color = COLORS.get(toktype, COLORS[_TEXT])
        if toktext:  # send text
            txt = Detag(toktext)
            if color is None:    txt = PLAIN % txt
            elif color=='black': txt = BOLD % txt
            else:                txt = CBOLD % (color,txt)
            self.out.write(txt)

Detag = lambda s: \
    s.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;')

if __name__=='__main__':
    parsed = ParsePython(sys.stdin.read())
    print '<pre>'
    print parsed.toHTML()
    print '</pre>'
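
The listing above is Python 2 code: 'cStringIO', the callback form of
tokenize.tokenize(), and tuple parameters in a 'def' are not
available in Python 3. For readers on Python 3, the same idea can be
sketched roughly as follows. This is not a port of [colorize]; the
function name colorize(), the use of <span> elements, and the reduced
color table are assumptions for illustration. Only the color names
are taken from the COLORS table above.

```python
import html
import io
import keyword
import token
import tokenize

# Color names borrowed from the listing above; the "keyword" key and
# the <span> markup are choices made for this sketch only.
COLORS = {
    token.NUMBER: "black",
    token.OP: "darkblue",
    token.STRING: "green",
    tokenize.COMMENT: "darkred",
    "keyword": "blue",
}

def colorize(source):
    """Return source as HTML text with some tokens wrapped in spans."""
    # offsets[n-1] is the absolute position where line n begins.
    offsets = [0]
    for line in io.StringIO(source):
        offsets.append(offsets[-1] + len(line))
    offsets.append(len(source))  # guard entry for end-of-file tokens

    out, pos = [], 0
    try:
        reader = io.StringIO(source).readline
        for tok in tokenize.generate_tokens(reader):
            start = offsets[tok.start[0] - 1] + tok.start[1]
            end = offsets[tok.end[0] - 1] + tok.end[1]
            # Copy any whitespace the tokenizer skipped over.
            out.append(html.escape(source[pos:start], quote=False))
            text = html.escape(source[start:end], quote=False)
            if tok.type == token.NAME and keyword.iskeyword(tok.string):
                color = COLORS["keyword"]
            else:
                color = COLORS.get(tok.type)
            if color and text.strip():
                text = '<span style="color:%s">%s</span>' % (color, text)
            out.append(text)
            pos = max(pos, end)
    except (tokenize.TokenError, SyntaxError):
        # Like the original, fall back to the raw text when the
        # source cannot be tokenized; here it is escaped first.
        return html.escape(source, quote=False)
    return "".join(out)

if __name__ == "__main__":
    import sys
    print("<pre>%s</pre>" % colorize(sys.stdin.read()))
```

As in the original, positions reported by the tokenizer are turned
into absolute offsets so that the whitespace between tokens can be
copied through unchanged; stripping the tags from the result should
give back the input text.
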
The module [colorize] contains its own self-test code and is
perfectly usable as a utility on its own. The main module
consists of:

#--------------------- book2html.py ---------------------#
#!/usr/bin/python
"""Convert ASCII book source files for HTML presentation

Usage: python book2html.py [title] < source.txt > target.html
"""
from __future__ import generators  # must precede all other statements
__author__=["David Mertz (mertz@gnosis.cx)",]
__version__="November 2002"
import sys, re, string, time
from colorize import ParsePython
from cgi import escape
#-- Define some HTML boilerplate
html_open =\
"""<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>%s</title>
<style>
.code-sample {background-color:#EEEEEE; text-align:left;
width:90%%; margin-left:auto; margin-right:auto;}
.module {color : darkblue}
.libfunc {color : darkgreen}
</style>
</head>
<body>
"""
html_title = "Automatically Generated HTML"
html_close = "</body></html>"
code_block = \
"""<table class="code-sample"><tr><td><h4>%s</h4></td></tr>
<tr><td><pre>%s</pre></td></tr>
</table>"""
#-- End of boilerplate
#-- State constants
for s in ("BLANK CHAPTER SECTION SUBSECT SUBSUB MODLINE "
          "MODNAME PYSHELL CODESAMP NUMLIST BODY QUOTE "
          "SUBBODY TERM DEF RULE VERTSPC").split():
    exec "%s = '%s'" % (s,s)
markup = {CHAPTER:'h1', SECTION:'h2', SUBSECT:'h3', SUBSUB:'h4',
          BODY:'p', QUOTE:'blockquote', NUMLIST:'blockquote',