APPENDIX -- UNDERSTANDING UNICODE
------------------------------------------------------------------------
SECTION -- Some Background on Characters
-------------------------------------------------------------------
Before we see what Unicode is, it makes sense to step back
slightly to think about just what it means to store "characters"
in digital files. Anyone who uses a tool like a text editor
usually just thinks of what they are doing as entering some
characters--numbers, letters, punctuation, and so on. But behind
the scenes a little more is going on. "Characters" that are
stored on digital media must be stored as sequences of ones and
zeros, and some encoding and decoding must happen to make these
ones and zeros into characters we see on a screen or type in with
a keyboard.
Sometime around the 1960s, a few decisions were made about just
which patterns of ones and zeros (bits) would represent which
characters. One
important choice that most modern computer users give no thought
to was the decision to use 8-bit bytes on nearly all computer
platforms. In other words, bytes have 256 possible values. Within
these 8-bit bytes, a consensus was reached to represent one
character in each byte. So at that point, computers needed a
particular -encoding- of characters into byte values; there were
256 "slots" available, but just which character would go in each
slot? The most popular encoding developed was Bob Bemer's
American Standard Code for Information Interchange (ASCII), which
is now specified in exciting standards like ISO-14962-1997 and
ANSI-X3.4-1986(R1997). But other options, like IBM's mainframe
EBCDIC, linger on, even now.
ASCII itself is of somewhat limited extent. Only the low-order
7 bits of each byte may carry ASCII-encoded characters; the 128
byte values with the high bit set are "reserved" for other uses
(more on this below). So, for example,
a byte that contains "01000001" -might- be an ASCII encoding of
the letter "A", but a byte containing "11000001" cannot be an
ASCII encoding of anything. Of course, a given byte may or may
not -actually- represent a character; if it is part of a text
file, it probably does, but if it is part of object code, a
compressed archive, or other binary data, ASCII decoding is
misleading. It depends on context.
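The high-bit rule is easy to check directly. Here is a minimal sketch in Python 3 (not from the original text), where 'bytes.decode()' performs the decoding:

```python
# The byte 01000001 (0x41) is a valid ASCII encoding of "A";
# the byte 11000001 (0xC1) has its high bit set, so it cannot
# be an ASCII encoding of anything.
assert bytes([0b01000001]).decode('ascii') == 'A'

try:
    bytes([0b11000001]).decode('ascii')
except UnicodeDecodeError as exc:
    print('Not ASCII:', exc.reason)   # the ordinal falls outside range(128)
```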
The 128 reserved high-bit byte values have been used for a
number of things in a character-encoding context. On
traditional textual terminals (and printers, etc.) it has been
common to allow switching between -codepages- on terminals to
allow display of a variety of national-language characters (and
special characters like box-drawing borders), depending on the
needs of a user. In the world of Internet communications,
something very similar to the codepage system exists with the
various ISO-8859-* encodings. What all these systems do is assign
a set of characters to the 128 slots that ASCII reserves for
other uses. These might be accented Roman characters (used in
many Western European languages) or they might be non-Roman
character sets like Greek, Cyrillic, Hebrew, or Arabic (or in the
future, Thai and Hindi). By using the right codepage, 8-bit bytes
can be made quite suitable for encoding reasonably sized
(phonetic) alphabets.
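To see the codepage idea concretely, here is a small Python sketch (the ISO-8859-* names are standard Python codec aliases): the very same byte value decodes to a different character under each encoding.

```python
raw = bytes([0xE9])                # one byte, value 0xE9
print(raw.decode('iso-8859-1'))    # 'é' -- Latin-1, Western European
print(raw.decode('iso-8859-5'))    # 'щ' -- Cyrillic
print(raw.decode('iso-8859-7'))    # 'ι' -- Greek
```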
Codepages and ISO-8859-* encodings, however, have some definite
limitations. For one thing, a terminal can only display one
codepage at a given time, and a document with an ISO-8859-*
encoding can only contain one character set. Documents that
need to contain text in multiple languages cannot be
represented in these encodings. A second issue is equally
important: Many ideographic and pictographic character sets
have far more than 128 or 256 characters (128 is all the
codepage system offers; 256 only if we used the whole byte and
discarded ASCII compatibility). It is simply not
possible to encode languages like Chinese, Japanese, and
Korean in 8-bit bytes. Systems like ISO-2022-JP-1 and codepage
943 allow larger character sets to be represented using two or
more bytes for each character. But even when using these
language-specific multibyte encodings, the problem of mixing
languages is still present.
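A short Python sketch makes the multibyte point visible ('iso2022_jp' is Python's codec name for the ISO-2022-JP family mentioned above):

```python
text = '日本'                        # two ideographic characters
jis = text.encode('iso2022_jp')     # ESC $ B shift sequence, then 2 bytes
print(len(jis), jis[:3])            # per character, then a shift back
print(len(text.encode('utf-8')))    # 6: three bytes per character here
```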
SECTION -- What is Unicode?
-------------------------------------------------------------------
Unicode solves the problems of previous character-encoding
schemes by providing a unique code number for -every- character
needed, worldwide and across languages. Over time, more
characters are being added, but the allocation of available
ranges for future uses has already been planned out, so room
exists for new characters. In Unicode-encoded documents, no
ambiguity exists about how a given character should display (for
example, should byte value '0x89' appear as e-umlaut, as in
codepage 850, or as the per-mil mark, as in codepage 1004?).
Furthermore, by giving each character its own code, there is no
problem or ambiguity in creating multilingual documents that
utilize multiple character sets at the same time. Or rather,
these documents actually utilize the single (very large)
character set of Unicode itself.
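In Python 3, strings are sequences of Unicode code points, so a mixed-script document is just an ordinary string. A small sketch (not from the original text):

```python
s = 'A\u00f1\u03b1\u0434'   # Latin A, Latin n-tilde, Greek alpha, Cyrillic de
print(s)                    # four scripts, one string: Añαд
print([hex(ord(c)) for c in s])   # each character has its own unique code
assert chr(0x00EB) == '\u00eb'    # U+00EB is e-umlaut, unambiguously
```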
Unicode is managed by the Unicode Consortium (see Resources), a
nonprofit group with corporate, institutional, and individual
members. Originally, Unicode was planned as a 16-bit
specification. However, this original plan failed to leave enough
room for national variations on related (but distinct) ideographs
across East Asian languages (Chinese, Japanese, and Korean), nor
for specialized alphabets used in mathematics and the scholarship
of historical languages. As a result, the code space of Unicode
now extends well beyond 16 bits: code points run up to U+10FFFF,
room for more than a million characters (and the space is
anticipated to remain fairly sparsely populated).
SECTION -- Encodings
-------------------------------------------------------------------
A full 32-bits of encoding space leaves plenty of room for
every character we might want to represent, but it has its own
problems. If we need to use 4 bytes for every character we
want to encode, that makes for rather verbose files (or
strings, or streams). Furthermore, these verbose files are
likely to cause a variety of problems for legacy tools. As a
solution to this, Unicode is itself often encoded using
"Unicode Transformation Formats" (abbreviated as 'UTF-*'). The
encodings 'UTF-8' and 'UTF-16' use rather clever techniques to
encode characters in a variable number of bytes, but with the
most common situation being the use of just the number of bits
indicated in the encoding name. In addition, the use of
specific byte value ranges in multibyte characters is designed
in such a way as to be friendly to existing tools. 'UTF-32' is
also an available encoding, one that simply uses all four bytes
in a fixed-width encoding.
The design of 'UTF-8' is such that 'US-ASCII' characters are
simply encoded as themselves. For example, the English letter
"e" is encoded as the single byte '0x65' in both ASCII and in
'UTF-8'. However, the non-English "e-umlaut" diacritic, which
is Unicode character '0x00EB', is encoded with the two bytes
'0xC3 0xAB'. In contrast, the 'UTF-16' representation of
every character is always at least 2 bytes (and sometimes 4
bytes). 'UTF-16' has the rather straightforward little-endian
representations of the letters "e" and "e-umlaut" as '0x65
0x00' and '0xEB 0x00', respectively. So where does the odd
value for the e-umlaut in 'UTF-8' come from? Here is the
trick: No multibyte encoded 'UTF-8' character is allowed to
be in the 7-bit range used by ASCII, to avoid confusion. So
the 'UTF-8' scheme uses some bit shifting and encodes every
Unicode character in up to four bytes (the original design
allowed up to six). But the byte values
allowed in each position are arranged in such a manner as not
to allow confusion of byte positions (for example, if you read
a file nonsequentially).
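The e-umlaut example can be checked directly in Python ('utf-16-le' and 'utf-32-le' name the little-endian, BOM-free variants of those codecs):

```python
ch = '\u00eb'                            # e-umlaut, Unicode 0x00EB
print(ch.encode('utf-8').hex(' '))       # c3 ab -- the shifted two-byte form
print(ch.encode('utf-16-le').hex(' '))   # eb 00
print(ch.encode('utf-32-le').hex(' '))   # eb 00 00 00
```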
Let's look at another example, just to see it laid out. Here is a
simple text string encoded in several ways. The view presented is
similar to what you would see in a hex-mode file viewer. This
way, it is easy to see both a likely on-screen character
representation (on a legacy, non-Unicode terminal) and a
representation of the underlying hexadecimal values each byte
contains:
#---- Hex view of several character string encodings ----#
------------------- Encoding = us-ascii ---------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-8 ------------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-16 -----------------------------
FF FE 55 00 6E 00 69 00 63 00 6F 00 64 00 65 00 | U n i c o d e
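A hex view like the one above can be reproduced in a few lines of Python. Note that the plain 'utf-16' codec prepends the byte-order mark seen in the last row (FF FE on a little-endian machine):

```python
for name in ('us-ascii', 'utf-8', 'utf-16'):
    data = 'Unicode'.encode(name)
    print(f'{name:10} {data.hex(" ")}')
```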