<?xml version="1.0" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<!-- saved from url=(0017)http://localhost/ -->
<script language="JavaScript" src="../../displayToc.js"></script>
<script language="JavaScript" src="../../tocParas.js"></script>
<script language="JavaScript" src="../../tocTab.js"></script>
<link rel="stylesheet" type="text/css" href="../../scineplex.css">
<title>perluniintro - Perl Unicode introduction</title>
<link rel="stylesheet" href="../../Active.css" type="text/css" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body>
<script>writelinks('__top__',2);</script>
<h1><a>perluniintro - Perl Unicode introduction</a></h1>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<ul>
<li><a href="#unicode">Unicode</a></li>
<li><a href="#perl_s_unicode_support">Perl's Unicode Support</a></li>
<li><a href="#perl_s_unicode_model">Perl's Unicode Model</a></li>
<li><a href="#unicode_and_ebcdic">Unicode and EBCDIC</a></li>
<li><a href="#creating_unicode">Creating Unicode</a></li>
<li><a href="#handling_unicode">Handling Unicode</a></li>
<li><a href="#legacy_encodings">Legacy Encodings</a></li>
<li><a href="#unicode_i_o">Unicode I/O</a></li>
<li><a href="#displaying_unicode_as_text">Displaying Unicode As Text</a></li>
<li><a href="#special_cases">Special Cases</a></li>
<li><a href="#advanced_topics">Advanced Topics</a></li>
<li><a href="#miscellaneous">Miscellaneous</a></li>
<li><a href="#questions_with_answers">Questions With Answers</a></li>
<li><a href="#hexadecimal_notation">Hexadecimal Notation</a></li>
<li><a href="#further_resources">Further Resources</a></li>
</ul>
<li><a href="#unicode_in_older_perls">UNICODE IN OLDER PERLS</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
<li><a href="#acknowledgments">ACKNOWLEDGMENTS</a></li>
<li><a href="#author__copyright__and_license">AUTHOR, COPYRIGHT, AND LICENSE</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>perluniintro - Perl Unicode introduction</p>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>This document gives a general idea of Unicode and how to use Unicode
in Perl.</p>
<p>
</p>
<h2><a name="unicode">Unicode</a></h2>
<p>Unicode is a character set standard that aims to codify all of the
writing systems of the world, plus many other symbols.</p>
<p>Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.</p>
<p>A Unicode <em>character</em> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language <code>char</code>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.</p>
<p>Unicode defines characters like <code>LATIN CAPITAL LETTER A</code> or <code>GREEK
SMALL LETTER ALPHA</code> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
<em>code points</em>.</p>
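<p>As a minimal sketch of the character-to-code-point mapping, the
following uses Perl's built-in <code>ord</code> and <code>chr</code> together with the
standard <code>charnames</code> module. It assumes Perl 5.8 or later; the choice
of characters simply mirrors the two examples above.</p>
<pre>
use strict;
use warnings;
use charnames ':full';   # enables \N{...} and charnames::viacode()

# ord() maps a character to its code point; chr() goes the other way.
printf "U+%04X\n", ord("A");                              # U+0041
printf "U+%04X\n", ord("\N{GREEK SMALL LETTER ALPHA}");   # U+03B1

# charnames::viacode() looks up the name assigned to a code point.
print charnames::viacode(0x03B1), "\n";   # GREEK SMALL LETTER ALPHA
</pre>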
<p>The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like <code>0x0041</code> are unfamiliar to you, take a peek
at a later section, <a href="#hexadecimal_notation">Hexadecimal Notation</a>. The Unicode standard
uses the notation <code>U+0041 LATIN CAPITAL LETTER A</code>, to give the
hexadecimal code point and the normative name of the character.</p>
<p>Unicode also defines various <em>properties</em> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.</p>
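<p>Properties and case mappings can be explored from Perl. The sketch
below assumes Perl 5.8 or later and uses the general-category
properties <code>\p{Ll}</code> (lowercase letter) and <code>\p{Nd}</code> (decimal digit);
the variable name is only illustrative.</p>
<pre>
use strict;
use warnings;

my $alpha = chr(0x03B1);   # GREEK SMALL LETTER ALPHA

# Properties are tested with \p{...} in regular expressions.
print "lowercase letter\n"    if     $alpha =~ /\p{Ll}/;
print "not a decimal digit\n" unless $alpha =~ /\p{Nd}/;

# uc() applies the Unicode uppercase mapping.
printf "U+%04X\n", ord(uc $alpha);   # U+0391, GREEK CAPITAL LETTER ALPHA
</pre>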
<p>A Unicode character consists either of a single code point, or a
<em>base character</em> (like <code>LATIN CAPITAL LETTER A</code>), followed by one or
more <em>modifiers</em> (like <code>COMBINING ACUTE ACCENT</code>). This sequence of
base character and modifiers is called a <em>combining character
sequence</em>.</p>
<p>Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you will
probably tend to see each element in a sequence as one unit,
or "character". From the user's point of view, however, the whole
sequence could be seen as one "character", since that is probably what
it looks like in the context of the user's language.</p>
<p>With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or
a combining character.</p>
<p>For some combinations, there are <em>precomposed</em> characters.
<code>LATIN CAPITAL LETTER A WITH ACUTE</code>, for example, is defined as
a single code point. These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the ISO 8859). In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various <em>normalization
forms</em> to standardize representations are also defined.</p>
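<p>The difference between a combining character sequence and its
precomposed equivalent can be illustrated with the standard
<code>Unicode::Normalize</code> module. This is a sketch assuming Perl 5.8 or
later; <code>NFC</code> composes and <code>NFD</code> decomposes, and the variable names are
only illustrative.</p>
<pre>
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "A\x{0301}";   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
my $precomposed = "\x{00C1}";    # LATIN CAPITAL LETTER A WITH ACUTE

# Perl counts code points, not user-perceived characters.
print length($decomposed),  "\n";   # 2
print length($precomposed), "\n";   # 1

# The two strings differ until they are brought to the same form.
print "different as given\n" if $decomposed ne $precomposed;
print "equal after NFC\n"    if NFC($decomposed)  eq $precomposed;
print "equal after NFD\n"    if NFD($precomposed) eq $decomposed;
</pre>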
<p>Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. Nor does
every code point correspond to a character: some code points do not
have an assigned character. Firstly, there are unallocated code points
within otherwise used blocks. Secondly, there are special Unicode
control characters that do not represent true characters.</p>
<p>A common myth about Unicode is that it is "16-bit", that is, that
Unicode can represent only <code>0x10000</code> (or 65536) characters, from
<code>0x0000</code> to <code>0xFFFF</code>. <strong>This is untrue.</strong> Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (<code>0x10FFFF</code>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond <code>0xFFFF</code>. The first <code>0x10000</code> characters are called
<em>Plane 0</em>, or the <em>Basic Multilingual Plane</em> (BMP). With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters yet.</p>
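<p>A quick way to see that Perl is not limited to 16 bits is to create
a character outside the BMP. The sketch below assumes Perl 5.8 or
later and uses U+10400 (DESERET CAPITAL LETTER LONG I) purely as an
example of a Plane 1 character.</p>
<pre>
use strict;
use warnings;

my $char = chr(0x10400);   # a character in Plane 1, beyond 0xFFFF

print length($char), "\n";                         # 1: still one character
printf "U+%04X\n", ord($char);                     # U+10400
printf "plane %d\n", int(ord($char) / 0x10000);    # plane 1
</pre>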
<p>Another myth is that the 256-character blocks have something to
do with languages--that each block defines the characters used
by a language or a set of languages. <strong>This is also untrue.</strong>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated. Instead, there is a concept called <em>scripts</em>,
which is more useful: there is the <code>Latin</code> script, the <code>Greek</code> script,
and so on. Scripts usually span varied parts of several blocks.
For further information see <a href="../../lib/Unicode/UCD.html">the Unicode::UCD manpage</a>.</p>
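<p>The difference between blocks and scripts can be inspected with
<code>charblock</code> and <code>charscript</code> from <code>Unicode::UCD</code>. This is a sketch only:
the three code points are arbitrary examples, and the exact block and
script names printed depend on the Unicode version shipped with your
Perl.</p>
<pre>
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# U+03B1 and U+1F00 are both Greek letters, yet they live in different blocks.
for my $cp (0x0041, 0x03B1, 0x1F00) {
    printf "U+%04X  block: %s  script: %s\n",
        $cp, charblock($cp), charscript($cp);
}
</pre>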
<p>The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be <em>encoded</em> or
<em>serialised</em> somehow. Unicode defines several <em>character encoding
forms</em>, of which <em>UTF-8</em> is perhaps the most popular. UTF-8 is a
variable-length encoding that encodes Unicode characters as 1 to 6
bytes (only 4 with the currently defined characters). Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). ISO/IEC 10646 also defines the UCS-2
and UCS-4 encoding forms.</p>
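<p>The variable length of UTF-8 can be observed with the standard
<code>Encode</code> module. The following sketch assumes Perl 5.8 or later; the
four code points are arbitrary examples chosen to need one, two,
three, and four bytes respectively.</p>
<pre>
use strict;
use warnings;
use Encode qw(encode);

for my $cp (0x0041, 0x03B1, 0x263A, 0x10400) {
    # encode() turns a string of characters into a string of bytes.
    my $bytes = encode("UTF-8", chr($cp));
    printf "U+%04X: %d byte(s) in UTF-8\n", $cp, length($bytes);
}
# Byte counts printed: 1, 2, 3, 4
</pre>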
<p>For more information about encodings--for instance, to learn what
<em>surrogates</em> and <em>byte order marks</em> (BOMs) are--see <a href="../../lib/Pod/perlunicode.html">the perlunicode manpage</a>.</p>
<p>
</p>
<h2><a name="perl_s_unicode_support">Perl's Unicode Support</a></h2>
<p>Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but, for example,
regular expressions still do not work with Unicode in 5.6.1.</p>
<p><strong>Starting from Perl 5.8.0, the use of <code>use utf8</code> is no longer
necessary.</strong> In earlier releases the <code>utf8</code> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"