SECTION -- Declarations
-------------------------------------------------------------------
We have seen how Unicode characters are actually encoded, at
least briefly, but how do applications know to use a particular
decoding procedure when Unicode is encountered? How applications
are alerted to a Unicode encoding depends upon the type of data
stream in question.
Normal text files do not have any special header information
attached to them to explicitly specify type. However, some
operating systems (like MacOS, OS/2, and BeOS--Windows and Linux
only in a more limited sense) have mechanisms to attach extended
attributes to files; increasingly, MIME header information is
stored in such extended attributes. If this happens to be the
case, it is possible to store MIME header information such as:
#*------------- MIME Header -----------------------------#
Content-Type: text/plain; charset=UTF-8
Nonetheless, having MIME headers attached to files is not a safe,
generic assumption. Fortunately, the actual byte sequences in
Unicode files provide a hint to applications. A Unicode-aware
application, absent contrary indication, is supposed to assume
that a given file is encoded with 'UTF-8'. A non-Unicode-aware
application reading the same file will find a file that contains
a mixture of ASCII characters and high-bit characters (for
multibyte 'UTF-8' encodings). All the ASCII-range bytes will have
the same values as if they were ASCII encoded. If any multibyte
'UTF-8' sequences were used, those will appear as non-ASCII
bytes and should be treated as noncharacter data by the legacy
application. Those extended characters may simply go unprocessed,
but that is about the best we can expect from a legacy
application that, by definition, does not know how to deal with
them.
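A quick interactive session illustrates this pass-through
property (the string chosen is just an example):

>>> utf8 = u'caf\u00e9'.encode('UTF-8')
>>> [hex(ord(c)) for c in utf8]
['0x63', '0x61', '0x66', '0xc3', '0xa9']

The bytes for 'c', 'a', and 'f' are unchanged from ASCII, while
the accented 'e' (codepoint 0xE9) becomes the two high-bit bytes
'0xC3 0xA9'.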
For 'UTF-16' encoded files, a special convention is followed for
the first two bytes of the file. One of the sequences '0xFF 0xFE'
or '0xFE 0xFF' acts as a small header to the file: the byte
order mark (BOM). The choice between them specifies the byte
order (endianness) of the platform that wrote the file; most
common platforms are little-endian and will use '0xFF 0xFE'. It
was decided that the collision risk of a legacy file beginning
with these bytes was small and therefore these could be used as a
reliable indicator for 'UTF-16' encoding. Within a 'UTF-16'
encoded text file, plain ASCII characters will appear every other
byte, interspersed with '0x00' (null) bytes. Of course, extended
characters will produce non-null bytes and in some cases
double-word (4 byte) representations. But a legacy tool that
ignores embedded nulls will wind up doing the right thing with
'UTF-16' encoded files, even without knowing about Unicode.
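An application that -does- know about Unicode can take advantage
of this convention with just a few lines of code. Below is a
minimal sketch (the function name is my own invention) that peeks
at the first two bytes of a file:

#*------------- Detecting a UTF-16 byte header -----------#
def utf16_byte_order(fname):
    # Read just the first two bytes, in binary mode
    bom = open(fname, 'rb').read(2)
    if bom == '\xff\xfe':
        return 'little-endian UTF-16'
    elif bom == '\xfe\xff':
        return 'big-endian UTF-16'
    else:
        return None   # no UTF-16 header found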
Many communications protocols--and more recent document
formats--allow for explicit encoding specification. For
example, an HTTP daemon application (a Web server) can return a
header such as the following to provide explicit instructions to
a client:
#*------------- HTTP Header -----------------------------#
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
An NNTP, SMTP, or POP3 message can likewise carry a
'Content-Type:' header field that makes explicit the encoding of
the text that follows (most likely as 'text/plain' rather than
'text/html', however; or at least we can hope).
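Python's standard [cgi] module can pick such a header value
apart, which saves writing the parsing by hand:

>>> from cgi import parse_header
>>> parse_header('text/html; charset=UTF-8')
('text/html', {'charset': 'UTF-8'})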
HTML and XML documents can contain tags and declarations to
make Unicode encoding explicit. An HTML document can provide a
hint in a 'META' tag, like:
#*------------- Content-Type META tag -------------------#
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
However, a 'META' tag should properly take lower precedence than
an HTTP header, in a situation where both are part of the
communication (but for a local HTML file, such an HTTP header
does not exist).
In XML, the actual document declaration should indicate the
Unicode encoding, as in:
#*------------- XML Encoding Declaration ----------------#
<?xml version="1.0" encoding="UTF-8"?>
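An application that wants to honor such a declaration must peek
at the raw bytes before decoding the document in full. A rough
sketch of doing so (the helper name is illustrative, and a real
XML parser handles more cases than this regular expression):

#*------------- Sniffing an XML declaration --------------#
import re
def xml_declared_encoding(xml_bytes):
    # Look for encoding="..." inside an initial <?xml ...?>
    m = re.match(r'<\?xml[^>]*encoding=["\']([\w.-]+)["\']',
                 xml_bytes)
    if m:
        return m.group(1)
    return None   # absent a declaration, UTF-8/UTF-16 is assumed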
Other formats and protocols may provide explicit encoding
specification by similar means.
SECTION -- Finding Codepoints
-------------------------------------------------------------------
Each Unicode character is identified by a unique codepoint.
You can find information on character codepoints on official
Unicode Web sites, but a quick way to look at visual forms of
characters is by generating an HTML page with charts of Unicode
characters. The script below does this:
#---------- mk_unicode_chart.py ----------#
# Create an HTML chart of Unicode characters by codepoint
import sys
head = '<html><head><title>Unicode Code Points</title>\n' +\
       '<META HTTP-EQUIV="Content-Type" ' +\
       'CONTENT="text/html; charset=UTF-8">\n' +\
       '</head><body>\n<h1>Unicode Code Points</h1>'
foot = '</body></html>'
fp = sys.stdout
fp.write(head)
num_blocks = 32   # Up to 256 in theory, but IE5.5 is flaky
for block in range(0, 256*num_blocks, 256):
    fp.write('\n\n<h2>Range %5d-%5d</h2>' % (block, block+255))
    # Column headers for the 16 columns of the block
    fp.write('\n<pre>     ')
    for col in range(16):
        fp.write(str(col).ljust(3))
    fp.write('</pre>')
    # One row of 16 characters per line, row label in left margin
    for offset in range(0, 256, 16):
        fp.write('\n<pre>')
        fp.write('+'+str(offset).rjust(3)+' ')
        line = '  '.join([unichr(n+block+offset) for n in range(16)])
        fp.write(line.encode('UTF-8'))
        fp.write('</pre>')
fp.write(foot)
fp.close()
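Since the script writes to STDOUT, you can create a chart with
something like 'python mk_unicode_chart.py > chart.html' and open
the resulting file in a Web browser.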
Exactly what you see when looking at the generated HTML page
depends on just what Web browser and OS platform the page is
viewed on--as well as on installed fonts and other factors.
Generally, any character that cannot be rendered on the current
browser will appear as some sort of square, dot, or question
mark. Anything that -is- rendered is generally accurate. Once a
character is visually identified, further information can be
generated with the [unicodedata] module:
>>> import unicodedata
>>> unicodedata.name(unichr(1488))
'HEBREW LETTER ALEF'
>>> unicodedata.category(unichr(1488))
'Lo'
>>> unicodedata.bidirectional(unichr(1488))
'R'
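The same module also works in the other direction, which is handy
for finding a codepoint when you know a character's formal name:

>>> unicodedata.lookup('HEBREW LETTER ALEF')
u'\u05d0'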
A variant here would be to include the information provided by
[unicodedata] within a generated HTML chart, although such a
listing would be far more verbose than the example above.
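A minimal sketch of that variant (plain text rather than HTML;
the function name and default arguments are just for
illustration) might look like:

#*------------- Annotated codepoint listing --------------#
import unicodedata
def describe_block(start=0x5D0, span=16):
    # Print codepoint, UTF-8 encoded character, and formal name
    for n in range(start, start+span):
        c = unichr(n)
        try:
            name = unicodedata.name(c)
        except ValueError:
            name = '(no name)'
        print '%6d  %s  %s' % (n, c.encode('UTF-8'), name)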
SECTION -- Resources
-------------------------------------------------------------------
More-or-less definitive information on all matters Unicode can
be found at:
<http://www.unicode.org/>.
The Unicode Consortium:
<http://www.unicode.org/unicode/consortium/consort.html>.
Unicode Technical Report #17--Character Encoding Model:
<http://www.unicode.org/unicode/reports/tr17/>.
A brief history of ASCII:
<http://www.bobbemer.com/ASCII.HTM>.