constants are laid out. You can import the FB_XX constants via
<code>use Encode qw(:fallbacks)</code>; you can import the generic bitmask
constants via <code>use Encode qw(:fallback_all)</code>.</p>
</dd>
<dd>
<pre>
                     FB_DEFAULT FB_CROAK FB_QUIET FB_WARN FB_PERLQQ
 DIE_ON_ERR    0x0001              X
 WARN_ON_ERR   0x0002                                X
 RETURN_ON_ERR 0x0004                       X        X
 LEAVE_SRC     0x0008                                         X
 PERLQQ        0x0100                                         X
 HTMLCREF      0x0200
 XMLCREF       0x0400</pre>
</dd>
</li>
</dl>
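<p>As a minimal sketch of how the table above reads, the following prints a few FB_XX constants next to the bitmask flags they combine. It assumes nothing beyond the constant names shown in the table, imported via <code>:fallback_all</code>.</p>
<pre>
use strict;
use warnings;
use Encode qw(:fallback_all);

# Each FB_XX constant is the bitwise OR of the flags marked X in its column.
printf "FB_CROAK  = 0x%04X\n", FB_CROAK();    # DIE_ON_ERR
printf "FB_QUIET  = 0x%04X\n", FB_QUIET();    # RETURN_ON_ERR
printf "FB_WARN   = 0x%04X\n", FB_WARN();     # RETURN_ON_ERR | WARN_ON_ERR
printf "FB_PERLQQ = 0x%04X\n", FB_PERLQQ();   # PERLQQ | LEAVE_SRC

print "FB_WARN matches its flags\n"
    if FB_WARN() == (RETURN_ON_ERR() | WARN_ON_ERR());
</pre>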
<p>
</p>
<h2><a name="coderef_for_check">coderef for CHECK</a></h2>
<p>As of Encode 2.12, CHECK can also be a code reference that takes the
ordinal value of the unmapped character as an argument and returns a string
that represents the fallback character. For instance,</p>
<pre>
<span class="variable">$ascii</span> <span class="operator">=</span> <span class="variable">encode</span><span class="operator">(</span><span class="string">"ascii"</span><span class="operator">,</span> <span class="variable">$utf8</span><span class="operator">,</span> <span class="keyword">sub</span><span class="operator">{</span> <span class="keyword">sprintf</span> <span class="string">"&lt;U+%04X&gt;"</span><span class="operator">,</span> <span class="keyword">shift</span> <span class="operator">});</span>
</pre>
<p>This acts like FB_PERLQQ, except that &lt;U+<em>XXXX</em>&gt; is used instead of
\x{<em>XXXX</em>}.</p>
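<p>A self-contained variant of the same idea might look like the sketch below. The sample string and the square-bracket notation are illustrative choices, not anything Encode prescribes.</p>
<pre>
use strict;
use warnings;
use Encode qw(encode);

# Two characters that ASCII cannot represent: U+00E9 and U+263A.
my $text = "caf\x{e9} \x{263A}";

# The callback receives the ordinal of each unmappable character
# and returns the replacement string for it.
my $ascii = encode("ascii", $text, sub { sprintf "[U+%04X]", shift });

print "$ascii\n";    # caf[U+00E9] [U+263A]
</pre>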
<p>
</p>
<hr />
<h1><a name="defining_encodings">Defining Encodings</a></h1>
<p>To define a new encoding, use:</p>
<pre>
<span class="keyword">use</span> <span class="variable">Encode</span> <span class="string">qw(define_encoding)</span><span class="operator">;</span>
<span class="variable">define_encoding</span><span class="operator">(</span><span class="variable">$object</span><span class="operator">,</span> <span class="string">'canonicalName'</span> <span class="operator">[</span><span class="operator">,</span> <span class="variable">alias</span><span class="operator">...</span><span class="operator">]</span><span class="operator">);</span>
</pre>
<p><em>canonicalName</em> will be associated with <em>$object</em>. The object
should provide the interface described in <a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a>.
If more than two arguments are provided, the additional
arguments are taken as aliases for <em>$object</em>.</p>
<p>See <a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a> for more details.</p>
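<p>The sketch below shows one way such a definition could look. The class <code>My::Rot13</code>, the name <code>x-rot13</code>, and the alias <code>rot13</code> are all hypothetical. The object is a hash with a <code>Name</code> key, on the assumption that the inherited <code>name</code> method of Encode::Encoding reads that key. It implements only bare <code>encode</code> and <code>decode</code> methods and ignores CHECK, so it is far from a complete encoding.</p>
<pre>
use strict;
use warnings;
use Encode qw(encode decode find_encoding define_encoding);

{
    # Hypothetical encoding class, for illustration only.
    package My::Rot13;
    use parent qw(Encode::Encoding);

    # ROT13 is its own inverse, so both directions share one helper.
    sub _rot13 {
        my ($str) = @_;
        $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $str;
    }

    # CHECK is accepted but ignored in this sketch.
    sub encode { my ($self, $str, $check) = @_; return _rot13($str) }
    sub decode { my ($self, $str, $check) = @_; return _rot13($str) }
}

# Register the object under a canonical name plus one alias.
my $obj = bless { Name => 'x-rot13' }, 'My::Rot13';
define_encoding($obj, 'x-rot13', 'rot13');

print encode('x-rot13', 'Hello'), "\n";      # Uryyb
print decode('rot13', 'Uryyb'), "\n";        # Hello
print find_encoding('rot13')->name, "\n";    # x-rot13
</pre>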
<p>
</p>
<hr />
<h1><a name="the_utf8_flag">The UTF-8 flag</a></h1>
<p>Before the introduction of utf8 support in perl, the <code>eq</code> operator
just compared the strings represented by two scalars. Beginning with
perl 5.8, <code>eq</code> compares two strings with simultaneous consideration
of <em>the utf8 flag</em>. To explain why we made it so, I will quote page
402 of <code>Programming Perl, 3rd ed.</code></p>
<dl>
<dt><strong><a name="item_goal__231_3a">Goal #1:</a></strong>
<dd>
<p>Old byte-oriented programs should not spontaneously break on the old
byte-oriented data they used to work on.</p>
</dd>
</li>
<dt><strong><a name="item_goal__232_3a">Goal #2:</a></strong>
<dd>
<p>Old byte-oriented programs should magically start working on the new
character-oriented data when appropriate.</p>
</dd>
</li>
<dt><strong><a name="item_goal__233_3a">Goal #3:</a></strong>
<dd>
<p>Programs should run just as fast in the new character-oriented mode
as in the old byte-oriented mode.</p>
</dd>
</li>
<dt><strong><a name="item_goal__234_3a">Goal #4:</a></strong>
<dd>
<p>Perl should remain one language, rather than forking into a
byte-oriented Perl and a character-oriented Perl.</p>
</dd>
</li>
</dl>
<p>Back when <code>Programming Perl, 3rd ed.</code> was written, not even Perl 5.6.0
had been released, and many features documented in the book remained
unimplemented for a long time. Perl 5.8 corrected this, and the introduction
of the UTF-8 flag is one of those corrections. You can think of Perl as having a
byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
flag on).</p>
<p>Here is how Encode takes care of the utf8 flag.</p>
<ul>
<li>
<p>When you encode, the resulting utf8 flag is always off.</p>
</li>
<li>
<p>When you decode, the resulting utf8 flag is on unless the data can be
represented unambiguously without it. Here is what "unambiguously"
means.</p>
<p>After <a href="#item_decode"><code>$utf8 = decode('foo', $octet);</code></a>,</p>
<pre>
 When $octet is...                The utf8 flag in $utf8 is
 ----------------------------------------------------------
 In ASCII only (or EBCDIC only)   OFF
 In ISO-8859-1                    ON
 In any other Encoding            ON
 ----------------------------------------------------------</pre>
<p>As you can see, there is one exception: ASCII-only data. That way
Goal #1 is met. With Encode, Goal #2 is met as well, but you still have to be
careful in the cases mentioned in the <strong>CAVEAT</strong> paragraphs.</p>
<p>This utf8 flag is not visible in perl scripts, for exactly the same
reason you cannot (or you <em>don't have to</em>) see if a scalar contains a
string, integer, or floating point number. But you can still peek
and poke these if you wish. See the section below.</p>
</li>
</ul>
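<p>The two rules above can be observed with a short sketch like the following, which round-trips a non-ASCII string through ISO-8859-1 and inspects the flag with <code>is_utf8</code> (described in the next section). The sample string is an arbitrary choice, and the ASCII-only case is deliberately left out.</p>
<pre>
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

my $chars  = "caf\x{e9}";                     # four characters
my $octets = encode("iso-8859-1", $chars);    # encoding: flag is off
my $again  = decode("iso-8859-1", $octets);   # decoding non-ASCII data: flag is on

printf "encoded: utf8 flag is %s\n", is_utf8($octets) ? "ON" : "OFF";   # OFF
printf "decoded: utf8 flag is %s\n", is_utf8($again)  ? "ON" : "OFF";   # ON
print  "round trip ok\n" if $again eq $chars;
</pre>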
<p>
</p>
<h2><a name="messing_with_perl_s_internals">Messing with Perl's Internals</a></h2>
<p>The following functions use parts of Perl's internals in the current
implementation. As such, they are efficient but may change.</p>
<dl>
<dt><strong><a name="item_is_utf8">is_utf8(STRING [, CHECK])</a></strong>
<dd>
<p>[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being well-formed
UTF-8. Returns true if successful, false otherwise.</p>
</dd>
<dd>
<p>As of perl 5.8.1, <a href="../lib/utf8.html">the utf8 manpage</a> also has utf8::is_utf8().</p>
</dd>
</li>
<dt><strong><a name="item__utf8_on"><code>_utf8_on(STRING)</code></a></strong>
<dd>
<p>[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
<strong>not</strong> checked for being well-formed UTF-8. Do not use unless you
<strong>know</strong> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so please don't treat the return value as
indicating success or failure), or <a href="../lib/Pod/perlfunc.html#item_undef"><code>undef</code></a> if STRING is not a string.</p>
</dd>
</li>
<dt><strong><a name="item__utf8_off"><code>_utf8_off(STRING)</code></a></strong>
<dd>
<p>[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please don't treat the
return value as indicating success or failure), or <a href="../lib/Pod/perlfunc.html#item_undef"><code>undef</code></a> if STRING is
not a string.</p>
</dd>
</li>
</dl>
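<p>To make the behaviour of these three functions concrete, here is a small sketch. It flips the flag on a two-octet string and then checks a deliberately truncated sequence. The sample octets are illustrative only, and code like this should stay out of production programs for the reasons given above.</p>
<pre>
use strict;
use warnings;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $str = "\xC3\xA9";        # the two UTF-8 octets of U+00E9
my $was = _utf8_on($str);    # previous flag state, not a success code
printf "flag was %s, length is now %d\n", $was ? "on" : "off", length $str;   # off, 1

_utf8_off($str);             # the same data seen as two octets again
printf "length is now %d\n", length $str;                                     # 2

my $bad = "\xC3";            # truncated sequence, not well-formed UTF-8
_utf8_on($bad);              # the flag is set without any validation
printf "flag only: %d, with CHECK: %d\n",
    is_utf8($bad) ? 1 : 0, is_utf8($bad, 1) ? 1 : 0;                          # 1, 0
</pre>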
<p>
</p>
<hr />
<h1><a name="utf8_vs__utf8">UTF-8 vs. utf8</a></h1>
<pre>
....We now view strings not as sequences of bytes, but as sequences
of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.</pre>
<p>That has been Perl's notion of UTF-8, but official UTF-8 is stricter:
its range is much narrower (0 .. 0x10FFFF), and some sequences are
not allowed (e.g. those used in surrogate pairs, 0xFFFE, et al.).</p>
<p>Now that is overruled by Larry Wall himself.</p>
<pre>
 From: Larry Wall &lt;larry@wall.org&gt;
 Date: December 04, 2004 11:51:58 JST
 To: perl-unicode@perl.org
 Subject: Re: Make Encode.pm support the real UTF-8
 Message-Id: &lt;20041204025158.GA28754@wall.org&gt;

 On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
 : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
 : but "UTF-8" is the name of the standard and should give the
 : corresponding behaviour.

 For what it's worth, that's how I've always kept them straight in my
 head.

 Also for what it's worth, Perl 6 will mostly default to strict but
 make it easy to switch back to lax.

 Larry</pre>
<p>Do you copy? As of Perl 5.8.7, <strong>UTF-8</strong> means strict, official UTF-8,
while <strong>utf8</strong> means the liberal, lax version thereof. Encode version
2.10 or later thus groks the difference between <code>UTF-8</code> and <code>utf8</code>.</p>
<pre>
<span class="variable">encode</span><span class="operator">(</span><span class="string">"utf8"</span><span class="operator">,</span> <span class="string">"\x{FFFF_FFFF}"</span><span class="operator">,</span> <span class="number">1</span><span class="operator">);</span> <span class="comment"># okay</span>
<span class="variable">encode</span><span class="operator">(</span><span class="string">"UTF-8"</span><span class="operator">,</span> <span class="string">"\x{FFFF_FFFF}"</span><span class="operator">,</span> <span class="number">1</span><span class="operator">);</span> <span class="comment"># croaks</span>
</pre>
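<p>A runnable sketch of the same contrast follows. It substitutes a lone surrogate (U+D800) for the out-of-range value above, since surrogates are among the sequences official UTF-8 disallows. It also passes <code>LEAVE_SRC</code> so the source variable is left intact, and traps the croak with <code>eval</code>. Those choices are assumptions made for the example, not requirements.</p>
<pre>
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

my $surrogate = "\x{D800}";                  # lone surrogate
my $check     = FB_CROAK() | LEAVE_SRC();    # die on error, keep the source

# Lax: perl's internal representation is emitted as-is.
my $lax = encode("utf8", $surrogate, $check);
printf "utf8 (lax): %d octets\n", length $lax;                                 # 3 octets

# Strict: the surrogate is rejected, so encode() croaks inside the eval.
my $strict = eval { encode("UTF-8", $surrogate, $check) };
print "UTF-8 (strict): ", (defined $strict ? "encoded" : "croaked"), "\n";     # croaked
</pre>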
<p><code>UTF-8</code> in Encode is actually a canonical name for <code>utf-8-strict</code>.
Yes, the hyphen between "UTF" and "8" is important. Without it Encode
goes "liberal":</p>
<pre>
 find_encoding("UTF-8")->name # is 'utf-8-strict'
 find_encoding("utf-8")->name # ditto. names are case insensitive
 find_encoding("utf_8")->name # ditto. "_" are treated as "-"
 find_encoding("UTF8")->name  # is 'utf8'.</pre>
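<p>To check how a given spelling resolves on your own installation, a loop along these lines may help. The list of spellings is just a sample.</p>
<pre>
use strict;
use warnings;
use Encode qw(find_encoding);

# Print the canonical name that each spelling resolves to.
for my $spelling ("UTF-8", "utf-8", "utf_8", "UTF8", "utf8") {
    my $enc = find_encoding($spelling);
    printf "%-6s => %s\n", $spelling, $enc ? $enc->name : "(not found)";
}
</pre>
<p>Going by the list above, the spellings with a hyphen or underscore should report <code>utf-8-strict</code> and the ones without should report <code>utf8</code>.</p>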
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a>,
<a href="../lib/Encode/Supported.html">the Encode::Supported manpage</a>,
<a href="../lib/Encode/PerlIO.html">the Encode::PerlIO manpage</a>,
<a href="../lib/encoding.html">the encoding manpage</a>,
<a href="../lib/Pod/perlebcdic.html">the perlebcdic manpage</a>,
<a href="../lib/Pod/perlfunc.html#open">open in the perlfunc manpage</a>,
<a href="../lib/Pod/perlunicode.html">the perlunicode manpage</a>,
<a href="../lib/utf8.html">the utf8 manpage</a>,
the Perl Unicode Mailing List &lt;<a href="mailto:perl-unicode@perl.org">perl-unicode@perl.org</a>&gt;</p>
<p>
</p>
<hr />
<h1><a name="maintainer">MAINTAINER</a></h1>
<p>This project was originated by Nick Ing-Simmons and later maintained
by Dan Kogai &lt;<a href="mailto:dankogai@dan.co.jp">dankogai@dan.co.jp</a>&gt;. See AUTHORS for a full
list of people involved. For any questions, use
&lt;<a href="mailto:perl-unicode@perl.org">perl-unicode@perl.org</a>&gt; so we can all share.</p>
</body>
</html>