SECTION -- Declarations
-------------------------------------------------------------------
We have seen how Unicode characters are actually encoded, at
least briefly, but how do applications know to use a particular
decoding procedure when Unicode is encountered? How applications
are alerted to a Unicode encoding depends upon the type of data
stream in question.
Normal text files do not have any special header information
attached to them to explicitly specify type. However, some
operating systems (like MacOS, OS/2, and BeOS--Windows and Linux
only in a more limited sense) have mechanisms to attach extended
attributes to files; increasingly, MIME header information is
stored in such extended attributes. If this happens to be the
case, it is possible to store MIME header information such as:
#*------------- MIME Header -----------------------------#
Content-Type: text/plain; charset=UTF-8
Nonetheless, having MIME headers attached to files is not a safe,
generic assumption. Fortunately, the actual byte sequences in
Unicode files provide a hint to applications. A Unicode-aware
application, absent contrary indication, is supposed to assume
that a given file is encoded with 'UTF-8'. A non-Unicode-aware
application reading the same file will find a file that contains
a mixture of ASCII characters and high-bit characters (for
multibyte 'UTF-8' encodings). All the ASCII-range bytes will have
the same values as if they were ASCII encoded. If any multibyte
'UTF-8' sequences were used, those will appear as non-ASCII
bytes and should be treated as noncharacter data by the legacy
application. Those extended characters may simply go unprocessed,
but that is about the best we can expect from a legacy
application that, by definition, does not know how to deal with
them.
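A quick interactive session illustrates this pass-through
property (the string chosen is just an example):

>>> utf8 = u'caf\u00e9'.encode('UTF-8')
>>> [hex(ord(c)) for c in utf8]
['0x63', '0x61', '0x66', '0xc3', '0xa9']

The bytes for 'c', 'a', and 'f' are unchanged from ASCII, while
the accented 'e' (codepoint 0xE9) becomes the two high-bit bytes
'0xC3 0xA9'.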
For 'UTF-16' encoded files, a special convention is followed for
the first two bytes of the file. One of the sequences '0xFF 0xFE'
or '0xFE 0xFF' acts as a small header to the file: the byte
order mark (BOM). The choice between them specifies the byte
order (endianness) of the platform that wrote the file; most
common platforms are little-endian and will use '0xFF 0xFE'. It
was decided that the collision risk of a legacy file beginning
with these bytes was small and therefore these could be used as a
reliable indicator for 'UTF-16' encoding. Within a 'UTF-16'
encoded text file, plain ASCII characters will appear every other
byte, interspersed with '0x00' (null) bytes. Of course, extended
characters will produce non-null bytes and in some cases
double-word (4 byte) representations. But a legacy tool that
ignores embedded nulls will wind up doing the right thing with
'UTF-16' encoded files, even without knowing about Unicode.
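An application that -does- know about Unicode can take advantage
of this convention with just a few lines of code. Below is a
minimal sketch (the function name is my own invention) that peeks
at the first two bytes of a file:

#*------------- Detecting a UTF-16 byte header -----------#
def utf16_byte_order(fname):
    # Read just the first two bytes, in binary mode
    bom = open(fname, 'rb').read(2)
    if bom == '\xff\xfe':
        return 'little-endian UTF-16'
    elif bom == '\xfe\xff':
        return 'big-endian UTF-16'
    else:
        return None   # no UTF-16 header found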
Many communications protocols--and more recent document
formats--allow for explicit encoding specification. For
example, an HTTP daemon application (a Web server) can return a
header such as the following to provide explicit instructions to
a client:
#*------------- HTTP Header -----------------------------#
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
An NNTP, SMTP, or POP3 message can likewise carry a
'Content-Type:' header field that makes explicit the encoding of
the text that follows (most likely as 'text/plain' rather than
'text/html', however; or at least we can hope).
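Python's standard [cgi] module can pick such a header value
apart, which saves writing the parsing by hand:

>>> from cgi import parse_header
>>> parse_header('text/html; charset=UTF-8')
('text/html', {'charset': 'UTF-8'})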
HTML and XML documents can contain tags and declarations to
make Unicode encoding explicit. An HTML document can provide a
hint in a 'META' tag, like:
#*------------- Content-Type META tag -------------------#
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
However, a 'META' tag should properly take lower precedence than
an HTTP header, in a situation where both are part of the
communication (but for a local HTML file, such an HTTP header
does not exist).
In XML, the actual document declaration should indicate the
Unicode encoding, as in:
#*------------- XML Encoding Declaration ----------------#
<?xml version="1.0" encoding="UTF-8"?>
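An application that wants to honor such a declaration must peek
at the raw bytes before decoding the document in full. A rough
sketch of doing so (the helper name is illustrative, and a real
XML parser handles more cases than this regular expression):

#*------------- Sniffing an XML declaration --------------#
import re
def xml_declared_encoding(xml_bytes):
    # Look for encoding="..." inside an initial <?xml ...?>
    m = re.match(r'<\?xml[^>]*encoding=["\']([\w.-]+)["\']',
                 xml_bytes)
    if m:
        return m.group(1)
    return None   # absent a declaration, UTF-8/UTF-16 is assumed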
Other formats and protocols may provide explicit encoding
specification by similar means.
SECTION -- Finding Codepoints
-------------------------------------------------------------------
Each Unicode character is identified by a unique codepoint.
You can find information on character codepoints on official
Unicode Web sites, but a quick way to look at visual forms of
characters is by generating an HTML page with charts of Unicode
characters. The script below does this:
#---------- mk_unicode_chart.py ----------#
# Create an HTML chart of Unicode characters by codepoint
import sys
head = '<html><head><title>Unicode Code Points</title>\n' +\
       '<META HTTP-EQUIV="Content-Type" ' +\
       'CONTENT="text/html; charset=UTF-8">\n' +\
       '</head><body>\n<h1>Unicode Code Points</h1>'
foot = '</body></html>'
fp = sys.stdout
fp.write(head)
num_blocks = 32   # Up to 256 in theory, but IE5.5 is flaky
for block in range(0, 256*num_blocks, 256):
    fp.write('\n\n<h2>Range %5d-%5d</h2>' % (block, block+255))
    # Column headers for the 16 columns of the block
    fp.write('\n<pre>     ')
    for col in range(16):
        fp.write(str(col).ljust(3))
    fp.write('</pre>')
    # One row of 16 characters per line, row label in left margin
    for offset in range(0, 256, 16):
        fp.write('\n<pre>')
        fp.write('+'+str(offset).rjust(3)+' ')
        line = '  '.join([unichr(n+block+offset) for n in range(16)])
        fp.write(line.encode('UTF-8'))
        fp.write('</pre>')
fp.write(foot)
fp.close()
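Since the script writes to STDOUT, you can create a chart with
something like 'python mk_unicode_chart.py > chart.html' and open
the resulting file in a Web browser.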
Exactly what you see when looking at the generated HTML page
depends on just what Web browser and OS platform the page is
viewed on--as well as on installed fonts and other factors.
Generally, any character that cannot be rendered on the current
browser will appear as some sort of square, dot, or question
mark. Anything that -is- rendered is generally accurate. Once a
character is visually identified, further information can be
generated with the [unicodedata] module:
>>> import unicodedata
>>> unicodedata.name(unichr(1488))
'HEBREW LETTER ALEF'
>>> unicodedata.category(unichr(1488))
'Lo'
>>> unicodedata.bidirectional(unichr(1488))
'R'
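The same module also works in the other direction, which is handy
for finding a codepoint when you know a character's formal name:

>>> unicodedata.lookup('HEBREW LETTER ALEF')
u'\u05d0'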
A variant here would be to include the information provided by
[unicodedata] within a generated HTML chart, although such a
listing would be far more verbose than the example above.
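A minimal sketch of that variant (plain text rather than HTML;
the function name and default arguments are just for
illustration) might look like:

#*------------- Annotated codepoint listing --------------#
import unicodedata
def describe_block(start=0x5D0, span=16):
    # Print codepoint, UTF-8 encoded character, and formal name
    for n in range(start, start+span):
        c = unichr(n)
        try:
            name = unicodedata.name(c)
        except ValueError:
            name = '(no name)'
        print '%6d  %s  %s' % (n, c.encode('UTF-8'), name)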
SECTION -- Resources
-------------------------------------------------------------------
More-or-less definitive information on all matters Unicode can
be found at:
<http://www.unicode.org/>.
The Unicode Consortium:
<http://www.unicode.org/unicode/consortium/consort.html>.
Unicode Technical Report #17--Character Encoding Model:
<http://www.unicode.org/unicode/reports/tr17/>.
A brief history of ASCII:
<http://www.bobbemer.com/ASCII.HTM>.