constants are laid out.  You can import the FB_XX constants via
<code>use Encode qw(:fallbacks)</code>; you can import the generic bitmask
constants via <code>use Encode qw(:fallback_all)</code>.</p>
</dd>
<dd>
<pre>
                     FB_DEFAULT FB_CROAK FB_QUIET FB_WARN  FB_PERLQQ
 DIE_ON_ERR    0x0001             X
 WARN_ON_ERR   0x0002                               X
 RETURN_ON_ERR 0x0004                      X        X
 LEAVE_SRC     0x0008                                        X
 PERLQQ        0x0100                                        X
 HTMLCREF      0x0200
 XMLCREF       0x0400</pre>
</dd>
</li>
</dl>
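<p>The table can be checked from code.  The following is a minimal sketch,
assuming the installed Encode composes its FB_XX constants exactly as the
table shows; it imports everything with <code>:fallback_all</code> and compares
a few FB_XX values against the generic flags OR-ed together.</p>

```perl
use strict;
use warnings;
use Encode qw(:fallback_all);   # FB_* constants plus the generic bitmask flags

# Print the numeric value of a few fallback constants.
printf "FB_CROAK  = 0x%04X\n", FB_CROAK();
printf "FB_QUIET  = 0x%04X\n", FB_QUIET();
printf "FB_WARN   = 0x%04X\n", FB_WARN();
printf "FB_PERLQQ = 0x%04X\n", FB_PERLQQ();

# Each FB_* value should be the OR of the flags marked in its column.
my $warn_is_composed   = FB_WARN()   == (WARN_ON_ERR() | RETURN_ON_ERR());
my $perlqq_is_composed = FB_PERLQQ() == (PERLQQ()      | LEAVE_SRC());

print "FB_WARN is WARN_ON_ERR|RETURN_ON_ERR\n" if $warn_is_composed;
print "FB_PERLQQ is PERLQQ|LEAVE_SRC\n"        if $perlqq_is_composed;
```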
<p>
</p>
<h2><a name="coderef_for_check">coderef for CHECK</a></h2>
<p>As of Encode 2.12, CHECK can also be a code reference that takes the
ordinal value of the unmapped character as an argument and returns a string
that represents the fallback character.  For instance,</p>
<pre>
  <span class="variable">$ascii</span> <span class="operator">=</span> <span class="variable">encode</span><span class="operator">(</span><span class="string">"ascii"</span><span class="operator">,</span> <span class="variable">$utf8</span><span class="operator">,</span> <span class="keyword">sub</span><span class="operator">{</span> <span class="keyword">sprintf</span> <span class="string">"&lt;U+%04X&gt;"</span><span class="operator">,</span> <span class="keyword">shift</span> <span class="operator">});</span>
</pre>
<p>This acts like FB_PERLQQ, but &lt;U+<em>XXXX</em>&gt; is used instead of
\x{<em>XXXX</em>}.</p>
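<p>A self-contained version of the same idea is sketched below.  The sample
string in <code>$text</code> is an assumption chosen for illustration; it
holds two characters that ASCII cannot represent.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample input (illustrative): U+00E9 and U+263A have no ASCII mapping.
my $text = "caf\x{E9} \x{263A}";

# The code reference receives the ordinal of each unmapped character
# and returns the replacement string.
my $ascii = encode("ascii", $text, sub { sprintf "<U+%04X>", shift });

print "$ascii\n";   # prints: caf<U+00E9> <U+263A>
```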
<p>
</p>
<hr />
<h1><a name="defining_encodings">Defining Encodings</a></h1>
<p>To define a new encoding, use:</p>
<pre>
    <span class="keyword">use</span> <span class="variable">Encode</span> <span class="string">qw(define_encoding)</span><span class="operator">;</span>
    <span class="variable">define_encoding</span><span class="operator">(</span><span class="variable">$object</span><span class="operator">,</span> <span class="string">'canonicalName'</span> <span class="operator">[</span><span class="operator">,</span> <span class="variable">alias</span><span class="operator">...</span><span class="operator">]</span><span class="operator">);</span>
</pre>
<p><em>canonicalName</em> will be associated with <em>$object</em>.  The object
should provide the interface described in <a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a>.
If more than two arguments are provided then additional
arguments are taken as aliases for <em>$object</em>.</p>
<p>See <a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a> for more details.</p>
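<p>As a rough sketch of how this fits together, the example below registers a
toy ROT13 "encoding".  The class name <code>My::Rot13</code>, its helper
<code>_rot13</code>, and the encoding name <code>x-rot13-demo</code> are
hypothetical, and the class implements only the bare minimum
(<code>encode</code> and <code>decode</code>, ignoring CHECK); a real
encoding should follow the full interface in Encode::Encoding.</p>

```perl
use strict;
use warnings;

# Hypothetical toy encoding class, for illustration only.
package My::Rot13;
use parent 'Encode::Encoding';   # inherits name(), mime_name(), etc.

# ROT13 is its own inverse, so one helper serves both directions.
sub _rot13 { my ($s) = @_; $s =~ tr/A-Za-z/N-ZA-Mn-za-m/; return $s }

# CHECK is accepted but ignored in this sketch.
sub encode { my ($self, $string, $check) = @_; return _rot13($string) }
sub decode { my ($self, $octets, $check) = @_; return _rot13($octets) }

package main;
use Encode qw(define_encoding find_encoding encode decode);

# Encode::Encoding's name() method reads the Name slot of the object.
my $object = bless { Name => 'x-rot13-demo' }, 'My::Rot13';
define_encoding($object, 'x-rot13-demo');

my $encoded = encode('x-rot13-demo', 'Hello');
my $decoded = decode('x-rot13-demo', $encoded);

print find_encoding('x-rot13-demo')->name, ": $encoded -> $decoded\n";
```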
<p>
</p>
<hr />
<h1><a name="the_utf8_flag">The UTF-8 flag</a></h1>
<p>Before the introduction of utf8 support in perl, the <code>eq</code> operator
just compared the strings represented by two scalars. Beginning with
perl 5.8, <code>eq</code> compares two strings with simultaneous consideration
of <em>the utf8 flag</em>. To explain why we made it so, I will quote page
402 of <code>Programming Perl, 3rd ed.</code></p>
<dl>
<dt><strong><a name="item_goal__231_3a">Goal #1:</a></strong>

<dd>
<p>Old byte-oriented programs should not spontaneously break on the old
byte-oriented data they used to work on.</p>
</dd>
</li>
<dt><strong><a name="item_goal__232_3a">Goal #2:</a></strong>

<dd>
<p>Old byte-oriented programs should magically start working on the new
character-oriented data when appropriate.</p>
</dd>
</li>
<dt><strong><a name="item_goal__233_3a">Goal #3:</a></strong>

<dd>
<p>Programs should run just as fast in the new character-oriented mode
as in the old byte-oriented mode.</p>
</dd>
</li>
<dt><strong><a name="item_goal__234_3a">Goal #4:</a></strong>

<dd>
<p>Perl should remain one language, rather than forking into a
byte-oriented Perl and a character-oriented Perl.</p>
</dd>
</li>
</dl>
<p>Back when <code>Programming Perl, 3rd ed.</code> was written, not even Perl 5.6.0
had been released, and many features documented in the book remained
unimplemented for a long time.  Perl 5.8 corrected this, and the introduction
of the UTF-8 flag is one of those corrections.  You can think of perl as having a
byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
flag on).</p>
<p>Here is how Encode takes care of the utf8 flag.</p>
<ul>
<li>
<p>When you encode, the resulting utf8 flag is always off.</p>
</li>
<li>
<p>When you decode, the resulting utf8 flag is on unless the data can be
represented unambiguously.  Here is the definition of
unambiguity.</p>
<p>After <a href="#item_decode"><code>$utf8 = decode('foo', $octet);</code></a>,</p>
<pre>
  When $octet is...   The utf8 flag in $utf8 is
  ---------------------------------------------
  In ASCII only (or EBCDIC only)            OFF
  In ISO-8859-1                              ON
  In any other Encoding                      ON
  ---------------------------------------------</pre>
<p>As you see, there is one exception: ASCII.  That way you can assume
Goal #1.  With Encode, Goal #2 is also assumed, but you still have to be
careful in the cases mentioned in the <strong>CAVEAT</strong> paragraphs.</p>
<p>This utf8 flag is not visible in perl scripts, for exactly the same
reason you cannot (or you <em>don't have to</em>) see if a scalar contains a
string, integer, or floating point number.   But you can still peek
and poke these if you wish.  See the section below.</p>
</li>
</ul>
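<p>The two rules can be observed with a short round trip.  This is only a
sketch: it uses <code>is_utf8</code> to peek at the flag and checks a single
non-ASCII character, so it says nothing about the ASCII-only row of the table,
whose behaviour is best verified against the Encode version you have
installed.</p>

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

my $chars  = "\x{263A}";                 # one character, U+263A
my $octets = encode("UTF-8", $chars);    # encoding yields octets
my $back   = decode("UTF-8", $octets);   # decoding yields characters

# After encode: flag off, and length counts the three octets.
printf "encoded: flag %s, length %d\n",
    (is_utf8($octets) ? "on" : "off"), length($octets);

# After decode of non-ASCII data: flag on, and length counts one character.
printf "decoded: flag %s, length %d\n",
    (is_utf8($back) ? "on" : "off"), length($back);
```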
<p>
</p>
<h2><a name="messing_with_perl_s_internals">Messing with Perl's Internals</a></h2>
<p>The following API uses parts of Perl's internals in the current
implementation.  As such, these functions are efficient but may change.</p>
<dl>
<dt><strong><a name="item_is_utf8">is_utf8(STRING [, CHECK])</a></strong>

<dd>
<p>[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being well-formed
UTF-8.  Returns true if successful, false otherwise.</p>
</dd>
<dd>
<p>As of perl 5.8.1, <a href="../lib/utf8.html">the utf8 manpage</a> also has utf8::is_utf8().</p>
</dd>
</li>
<dt><strong><a name="item__utf8_on"><code>_utf8_on(STRING)</code></a></strong>

<dd>
<p>[INTERNAL] Turns on the UTF-8 flag in STRING.  The data in STRING is
<strong>not</strong> checked for being well-formed UTF-8.  Do not use unless you
<strong>know</strong> that the STRING is well-formed UTF-8.  Returns the previous
state of the UTF-8 flag (so please don't treat the return value as
indicating success or failure), or <a href="../lib/Pod/perlfunc.html#item_undef"><code>undef</code></a> if STRING is not a string.</p>
</dd>
</li>
<dt><strong><a name="item__utf8_off"><code>_utf8_off(STRING)</code></a></strong>

<dd>
<p>[INTERNAL] Turns off the UTF-8 flag in STRING.  Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please don't treat the
return value as indicating success or failure), or <a href="../lib/Pod/perlfunc.html#item_undef"><code>undef</code></a> if STRING is
not a string.</p>
</dd>
</li>
</dl>
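<p>To make the effect of these functions concrete, here is a small sketch that
flips the flag on a string already known to hold well-formed UTF-8 octets.
The variable names are illustrative.  Note that the bytes never change; only
perl's interpretation of them does, which is why <code>length</code> reports
a different number.</p>

```perl
use strict;
use warnings;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $s = "\xE2\x98\xBA";          # three octets: well-formed UTF-8 for U+263A

my $len_bytes = length $s;       # counted as octets while the flag is off
my $prev_on   = _utf8_on($s);    # returns the PREVIOUS flag state
my $len_chars = length $s;       # same bytes, now counted as characters
my $prev_off  = _utf8_off($s);   # again returns the previous state
my $len_after = length $s;       # back to counting octets

print "flag off: $len_bytes, flag on: $len_chars, flag off again: $len_after\n";
```

<p>In ordinary code, prefer <code>decode</code> and <code>encode</code>, which
validate the data instead of trusting it.</p>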
<p>
</p>
<hr />
<h1><a name="utf8_vs__utf8">UTF-8 vs. utf8</a></h1>
<pre>
  ....We now view strings not as sequences of bytes, but as sequences
  of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
  computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.</pre>
<p>That has been perl's notion of UTF-8, but official UTF-8 is
stricter: its range is much narrower (0 .. 0x10FFFF), and some sequences are
not allowed (e.g. those used in surrogate pairs, 0xFFFE, et al).</p>
<p>Now that is overruled by Larry Wall himself.</p>
<pre>
  From: Larry Wall &lt;larry@wall.org&gt;
  Date: December 04, 2004 11:51:58 JST
  To: perl-unicode@perl.org
  Subject: Re: Make Encode.pm support the real UTF-8
  Message-Id: &lt;20041204025158.GA28754@wall.org&gt;
  
  On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
  : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
  : but &quot;UTF-8&quot; is the name of the standard and should give the
  : corresponding behaviour.
  
  For what it's worth, that's how I've always kept them straight in my
  head.
  
  Also for what it's worth, Perl 6 will mostly default to strict but
  make it easy to switch back to lax.
  
  Larry</pre>
<p>Do you copy?  As of Perl 5.8.7, <strong>UTF-8</strong> means strict, official UTF-8
while <strong>utf8</strong> means the liberal, lax version thereof.  Encode version
2.10 or later thus groks the difference between <code>UTF-8</code> and <code>&quot;utf8&quot;</code>.</p>
<pre>
  <span class="variable">encode</span><span class="operator">(</span><span class="string">"utf8"</span><span class="operator">,</span>  <span class="string">"\x{FFFF_FFFF}"</span><span class="operator">,</span> <span class="number">1</span><span class="operator">);</span> <span class="comment"># okay</span>
  <span class="variable">encode</span><span class="operator">(</span><span class="string">"UTF-8"</span><span class="operator">,</span> <span class="string">"\x{FFFF_FFFF}"</span><span class="operator">,</span> <span class="number">1</span><span class="operator">);</span> <span class="comment"># croaks</span>
</pre>
<p><code>UTF-8</code> in Encode is actually an alias; its canonical name is <code>utf-8-strict</code>.
Yes, the hyphen between &quot;UTF&quot; and &quot;8&quot; is important.  Without it Encode
goes &quot;liberal&quot;:</p>
<pre>
  find_encoding(&quot;UTF-8&quot;)-&gt;name # is 'utf-8-strict'
  find_encoding(&quot;utf-8&quot;)-&gt;name # ditto. names are case insensitive
  find_encoding(&quot;utf_8&quot;)-&gt;name # ditto. &quot;_&quot; are treated as &quot;-&quot;
  find_encoding(&quot;UTF8&quot;)-&gt;name  # is 'utf8'.</pre>
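<p>The difference can be exercised with a code point that official UTF-8
forbids.  The sketch below uses a lone surrogate (U+D800) as the test value,
which is an illustrative choice; it assumes an Encode recent enough to
distinguish the two names (2.10 or later).</p>

```perl
use strict;
use warnings;
no warnings 'utf8';   # keep perl quiet about the deliberately invalid code point
use Encode qw(encode find_encoding FB_CROAK);

my $surrogate = "\x{D800}";   # a lone surrogate, not allowed in official UTF-8

# Lax "utf8" encodes it anyway.
my $lax = encode("utf8", $surrogate);

# Strict "UTF-8" with FB_CROAK dies, so trap the exception.
my $strict_ok = eval { encode("UTF-8", $surrogate, FB_CROAK); 1 };

printf "utf8 : %d octets\n", length $lax;
printf "UTF-8: %s\n", $strict_ok ? "accepted" : "rejected";

# The name decides which encoder is used.
my %canonical = map { $_ => find_encoding($_)->name } qw(UTF-8 utf-8 UTF8);
print "$_ => $canonical{$_}\n" for sort keys %canonical;
```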
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="../lib/Encode/Encoding.html">the Encode::Encoding manpage</a>,
<a href="../lib/Encode/Supported.html">the Encode::Supported manpage</a>,
<a href="../lib/Encode/PerlIO.html">the Encode::PerlIO manpage</a>,
<a href="../lib/encoding.html">the encoding manpage</a>,
<a href="../lib/Pod/perlebcdic.html">the perlebcdic manpage</a>,
<a href="../lib/Pod/perlfunc.html#open">open in the perlfunc manpage</a>,
<a href="../lib/Pod/perlunicode.html">the perlunicode manpage</a>,
<a href="../lib/utf8.html">the utf8 manpage</a>,
the Perl Unicode Mailing List &lt;<a href="mailto:perl-unicode@perl.org">perl-unicode@perl.org</a>&gt;</p>
<p>
</p>
<hr />
<h1><a name="maintainer">MAINTAINER</a></h1>
<p>This project was originated by Nick Ing-Simmons and later maintained
by Dan Kogai &lt;<a href="mailto:dankogai@dan.co.jp">dankogai@dan.co.jp</a>&gt;.  See AUTHORS for a full
list of people involved.  For any questions, use
&lt;<a href="mailto:perl-unicode@perl.org">perl-unicode@perl.org</a>&gt; so we can all share.</p>

</body>

</html>