APPENDIX -- UNDERSTANDING UNICODE
------------------------------------------------------------------------
SECTION -- Some Background on Characters
-------------------------------------------------------------------
Before we see what Unicode is, it makes sense to step back
slightly to think about just what it means to store "characters"
in digital files. Anyone who uses a tool like a text editor
usually just thinks of what they are doing as entering some
characters--numbers, letters, punctuation, and so on. But behind
the scenes a little more is going on. "Characters" that are
stored on digital media must be stored as sequences of ones and
zeros, and some encoding and decoding must happen to make these
ones and zeros into characters we see on a screen or type in with
a keyboard.
Sometime around the 1960s, a few decisions were made about just
which patterns of ones and zeros (bits) would represent which
characters. One
important choice that most modern computer users give no thought
to was the decision to use 8-bit bytes on nearly all computer
platforms. In other words, bytes have 256 possible values. Within
these 8-bit bytes, a consensus was reached to represent one
character in each byte. So at that point, computers needed a
particular -encoding- of characters into byte values; there were
256 "slots" available, but just which character would go in each
slot? The most popular encoding developed was Bob Bemer's
American Standard Code for Information Interchange (ASCII), which
is now specified in exciting standards like ISO-14962-1997 and
ANSI-X3.4-1986(R1997). But other options, like IBM's mainframe
EBCDIC, linger on, even now.
ASCII itself is of somewhat limited extent. Only the low-order
7 bits of each byte may carry ASCII-encoded characters; the 128
byte values with the high bit set are "reserved" for other uses
(more on this below). So, for example,
a byte that contains "01000001" -might- be an ASCII encoding of
the letter "A", but a byte containing "11000001" cannot be an
ASCII encoding of anything. Of course, a given byte may or may
not -actually- represent a character; if it is part of a text
file, it probably does, but if it is part of object code, a
compressed archive, or other binary data, ASCII decoding is
misleading. It depends on context.
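The high-bit rule is easy to check directly. Here is a minimal sketch in Python 3 (not from the original text), where 'bytes.decode()' performs the decoding:

```python
# The byte 01000001 (0x41) is a valid ASCII encoding of "A";
# the byte 11000001 (0xC1) has its high bit set, so it cannot
# be an ASCII encoding of anything.
assert bytes([0b01000001]).decode('ascii') == 'A'

try:
    bytes([0b11000001]).decode('ascii')
except UnicodeDecodeError as exc:
    print('Not ASCII:', exc.reason)   # the ordinal falls outside range(128)
```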
The 128 reserved high-bit byte values have been used for a
number of things in a character-encoding context. On
traditional textual terminals (and printers, etc.) it has been
common to allow switching between -codepages- on terminals to
allow display of a variety of national-language characters (and
special characters like box-drawing borders), depending on the
needs of a user. In the world of Internet communications,
something very similar to the codepage system exists with the
various ISO-8859-* encodings. What all these systems do is assign
a set of characters to the 128 slots that ASCII reserves for
other uses. These might be accented Roman characters (used in
many Western European languages) or they might be non-Roman
character sets like Greek, Cyrillic, Hebrew, or Arabic (or in the
future, Thai and Hindi). By using the right codepage, 8-bit bytes
can be made quite suitable for encoding reasonably sized
(phonetic) alphabets.
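To see the codepage idea concretely, here is a small Python sketch (the ISO-8859-* names are standard Python codec aliases): the very same byte value decodes to a different character under each encoding.

```python
raw = bytes([0xE9])                # one byte, value 0xE9
print(raw.decode('iso-8859-1'))    # 'é' -- Latin-1, Western European
print(raw.decode('iso-8859-5'))    # 'щ' -- Cyrillic
print(raw.decode('iso-8859-7'))    # 'ι' -- Greek
```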
Codepages and ISO-8859-* encodings, however, have some definite
limitations. For one thing, a terminal can only display one
codepage at a given time, and a document with an ISO-8859-*
encoding can only contain one character set. Documents that
need to contain text in multiple languages cannot be
represented in these encodings. A second issue is equally
important: Many ideographic and pictographic character sets
have far more than 128 or 256 characters (128 is all the
codepage system offers; 256 only if we used the whole byte and
discarded ASCII compatibility). It is simply not
possible to encode languages like Chinese, Japanese, and
Korean in 8-bit bytes. Systems like ISO-2022-JP-1 and codepage
943 allow larger character sets to be represented using two or
more bytes for each character. But even when using these
language-specific multibyte encodings, the problem of mixing
languages is still present.
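A short Python sketch makes the multibyte point visible ('iso2022_jp' is Python's codec name for the ISO-2022-JP family mentioned above):

```python
text = '日本'                        # two ideographic characters
jis = text.encode('iso2022_jp')     # ESC $ B shift sequence, then 2 bytes
print(len(jis), jis[:3])            # per character, then a shift back
print(len(text.encode('utf-8')))    # 6: three bytes per character here
```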
SECTION -- What is Unicode?
-------------------------------------------------------------------
Unicode solves the problems of previous character-encoding
schemes by providing a unique code number for -every- character
needed, worldwide and across languages. Over time, more
characters are being added, but the allocation of available
ranges for future uses has already been planned out, so room
exists for new characters. In Unicode-encoded documents, no
ambiguity exists about how a given character should display (for
example, should byte value '0x89' appear as e-umlaut, as in
codepage 850, or as the per-mil mark, as in codepage 1004?).
Furthermore, by giving each character its own code, there is no
problem or ambiguity in creating multilingual documents that
utilize multiple character sets at the same time. Or rather,
these documents actually utilize the single (very large)
character set of Unicode itself.
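In Python 3, strings are sequences of Unicode code points, so a mixed-script document is just an ordinary string. A small sketch (not from the original text):

```python
s = 'A\u00f1\u03b1\u0434'   # Latin A, Latin n-tilde, Greek alpha, Cyrillic de
print(s)                    # four scripts, one string: Añαд
print([hex(ord(c)) for c in s])   # each character has its own unique code
assert chr(0x00EB) == '\u00eb'    # U+00EB is e-umlaut, unambiguously
```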
Unicode is managed by the Unicode Consortium (see Resources), a
nonprofit group with corporate, institutional, and individual
members. Originally, Unicode was planned as a 16-bit
specification. However, this original plan failed to leave enough
room for national variations on related (but distinct) ideographs
across East Asian languages (Chinese, Japanese, and Korean), nor
for specialized alphabets used in mathematics and the scholarship
of historical languages. As a result, the code space of Unicode
now extends well beyond 16 bits: code points run up to U+10FFFF,
room for more than a million characters (and the space is
anticipated to remain fairly sparsely populated).
SECTION -- Encodings
-------------------------------------------------------------------
A full 32-bits of encoding space leaves plenty of room for
every character we might want to represent, but it has its own
problems. If we need to use 4 bytes for every character we
want to encode, that makes for rather verbose files (or
strings, or streams). Furthermore, these verbose files are
likely to cause a variety of problems for legacy tools. As a
solution to this, Unicode is itself often encoded using
"Unicode Transformation Formats" (abbreviated as 'UTF-*'). The
encodings 'UTF-8' and 'UTF-16' use rather clever techniques to
encode characters in a variable number of bytes, but with the
most common situation being the use of just the number of bits
indicated in the encoding name. In addition, the use of
specific byte value ranges in multibyte characters is designed
in such a way as to be friendly to existing tools. 'UTF-32' is
also an available encoding, one that simply uses all four bytes
in a fixed-width encoding.
The design of 'UTF-8' is such that 'US-ASCII' characters are
simply encoded as themselves. For example, the English letter
"e" is encoded as the single byte '0x65' in both ASCII and in
'UTF-8'. However, the non-English "e-umlaut" diacritic, which
is Unicode character '0x00EB', is encoded with the two bytes
'0xC3 0xAB'. In contrast, the 'UTF-16' representation of
every character is always at least 2 bytes (and sometimes 4
bytes). 'UTF-16' has the rather straightforward little-endian
representations of the letters "e" and "e-umlaut" as '0x65
0x00' and '0xEB 0x00', respectively. So where does the odd
value for the e-umlaut in 'UTF-8' come from? Here is the
trick: No multibyte encoded 'UTF-8' character is allowed to
be in the 7-bit range used by ASCII, to avoid confusion. So
the 'UTF-8' scheme uses some bit shifting and encodes every
Unicode character in up to four bytes (the original design
allowed up to six). But the byte values
allowed in each position are arranged in such a manner as not
to allow confusion of byte positions (for example, if you read
a file nonsequentially).
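The e-umlaut example can be checked directly in Python ('utf-16-le' and 'utf-32-le' name the little-endian, BOM-free variants of those codecs):

```python
ch = '\u00eb'                            # e-umlaut, Unicode 0x00EB
print(ch.encode('utf-8').hex(' '))       # c3 ab -- the shifted two-byte form
print(ch.encode('utf-16-le').hex(' '))   # eb 00
print(ch.encode('utf-32-le').hex(' '))   # eb 00 00 00
```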
Let's look at another example, just to see it laid out. Here is a
simple text string encoded in several ways. The view presented is
similar to what you would see in a hex-mode file viewer. This
way, it is easy to see both a likely on-screen character
representation (on a legacy, non-Unicode terminal) and a
representation of the underlying hexadecimal values each byte
contains:
#---- Hex view of several character string encodings ----#
------------------- Encoding = us-ascii ---------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-8 ------------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-16 -----------------------------
FF FE 55 00 6E 00 69 00 63 00 6F 00 64 00 65 00 | U n i c o d e
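A hex view like the one above can be reproduced in a few lines of Python. Note that the plain 'utf-16' codec prepends the byte-order mark seen in the last row (FF FE on a little-endian machine):

```python
for name in ('us-ascii', 'utf-8', 'utf-16'):
    data = 'Unicode'.encode(name)
    print(f'{name:10} {data.hex(" ")}')
```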