📄 perlunicode.1

each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see perluniintro), in which case \f(CW\*(C`\ew\*(C'\fR in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the \f(CW\*(C`strict\*(C'\fR pragma.
.Sh "Unicode in Perl on \s-1EBCDIC\s0"
.IX Subsection "Unicode in Perl on EBCDIC"
The way Unicode is handled on \s-1EBCDIC\s0 platforms is still
experimental.  On such platforms, references to \s-1UTF\-8\s0 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless \s-1ASCII\s0 vs. \s-1EBCDIC\s0 issues
are specifically discussed. There is no \f(CW\*(C`utfebcdic\*(C'\fR pragma or
\&\*(L":utfebcdic\*(R" layer; rather, \*(L"utf8\*(R" and \*(L":utf8\*(R" are reused to mean
the platform's \*(L"natural\*(R" 8\-bit encoding of Unicode. See perlebcdic
for more discussion of the issues.
.Sh "Locales"
.IX Subsection "Locales"
Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:
.IP "\(bu" 4
You can enable automatic UTF\-8\-ification of your standard file
handles, default \f(CW\*(C`open()\*(C'\fR layer, and \f(CW@ARGV\fR by using either
the \f(CW\*(C`\-C\*(C'\fR command line switch or the \f(CW\*(C`PERL_UNICODE\*(C'\fR environment
variable, see perlrun for the documentation of the \f(CW\*(C`\-C\*(C'\fR switch.
.IP "\(bu" 4
Perl tries really hard to work both with Unicode and the old
byte-oriented world.
Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.
.Sh "When Unicode Does Not Happen"
.IX Subsection "When Unicode Does Not Happen"
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like the \f(CW@ARGV\fR which can be interpreted
as Unicode (\s-1UTF\-8\s0), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
.PP
The following are such interfaces.  For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or \s-1UTF\-8\s0 strings if the \f(CW\*(C`encoding\*(C'\fR pragma has been used.
.PP
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for the qx and system: how well will the
\&'command line interface' (and which of them?) handle Unicode?
.IP "\(bu" 4
chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, \-X
.IP "\(bu" 4
\&\f(CW%ENV\fR
.IP "\(bu" 4
glob (aka the <*>)
.IP "\(bu" 4
open, opendir, sysopen
.IP "\(bu" 4
qx (aka the backtick operator), system
.IP "\(bu" 4
readdir, readlink
.Sh "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)"
.IX Subsection "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)"
Sometimes (see \*(L"When Unicode Does Not Happen\*(R") there are
situations where you simply need to force Perl to believe that a byte
string is \s-1UTF\-8\s0, or vice versa.  The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
the answers.
.PP
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.
Be especially careful if you use
\&\fIutf8::upgrade()\fR: an arbitrary byte string is not necessarily valid \s-1UTF\-8\s0.
.Sh "Using Unicode in \s-1XS\s0"
.IX Subsection "Using Unicode in XS"
If you want to handle Perl Unicode in \s-1XS\s0 extensions, you may find the
following C APIs useful.  See also \*(L"Unicode Support\*(R" in perlguts for an
explanation about Unicode at the \s-1XS\s0 level, and perlapi for the \s-1API\s0
details.
.IP "\(bu" 4
\&\f(CW\*(C`DO_UTF8(sv)\*(C'\fR returns true if the \f(CW\*(C`UTF8\*(C'\fR flag is on and the bytes
pragma is not in effect.  \f(CW\*(C`SvUTF8(sv)\*(C'\fR returns true if the \f(CW\*(C`UTF8\*(C'\fR
flag is on; the bytes pragma is ignored.  The \f(CW\*(C`UTF8\*(C'\fR flag being on
does \fBnot\fR mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the \f(CW\*(C`UTF8\*(C'\fR flag means is that the sequence of
octets in the representation of the scalar is the sequence of \s-1UTF\-8\s0
encoded code points of the characters of a string.  The \f(CW\*(C`UTF8\*(C'\fR flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use \s-1UTF\-8\s0 until it is absolutely necessary.
.IP "\(bu" 4
\&\f(CW\*(C`uvuni_to_utf8(buf, chr)\*(C'\fR writes a Unicode character code point into
a buffer encoding the code point as \s-1UTF\-8\s0, and returns a pointer
pointing after the \s-1UTF\-8\s0 bytes.
.IP "\(bu" 4
\&\f(CW\*(C`utf8_to_uvuni(buf, lenp)\*(C'\fR reads \s-1UTF\-8\s0 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the \s-1UTF\-8\s0 byte sequence.
.IP "\(bu" 4
\&\f(CW\*(C`utf8_length(start, end)\*(C'\fR returns the length of the \s-1UTF\-8\s0 encoded buffer
in characters.
\f(CW\*(C`sv_len_utf8(sv)\*(C'\fR returns the length of the \s-1UTF\-8\s0 encoded
scalar.
.IP "\(bu" 4
\&\f(CW\*(C`sv_utf8_upgrade(sv)\*(C'\fR converts the string of the scalar to its \s-1UTF\-8\s0
encoded form.  \f(CW\*(C`sv_utf8_downgrade(sv)\*(C'\fR does the opposite, if
possible.  \f(CW\*(C`sv_utf8_encode(sv)\*(C'\fR is like sv_utf8_upgrade except that
it does not set the \f(CW\*(C`UTF8\*(C'\fR flag.  \f(CW\*(C`sv_utf8_decode()\*(C'\fR does the
opposite of \f(CW\*(C`sv_utf8_encode()\*(C'\fR.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: \f(CW\*(C`use Encode\*(C'\fR
for that.  \f(CW\*(C`sv_utf8_upgrade()\*(C'\fR is affected by the encoding pragma
but \f(CW\*(C`sv_utf8_downgrade()\*(C'\fR is not (since the encoding pragma is
designed to be a one-way street).
.IP "\(bu" 4
\&\f(CWis_utf8_char(s)\fR returns true if the pointer points to a valid \s-1UTF\-8\s0
character.
.IP "\(bu" 4
\&\f(CW\*(C`is_utf8_string(buf, len)\*(C'\fR returns true if \f(CW\*(C`len\*(C'\fR bytes of the buffer
are valid \s-1UTF\-8\s0.
.IP "\(bu" 4
\&\f(CW\*(C`UTF8SKIP(buf)\*(C'\fR will return the number of bytes in the \s-1UTF\-8\s0 encoded
character in the buffer.  \f(CW\*(C`UNISKIP(chr)\*(C'\fR will return the number of bytes
required to UTF\-8\-encode the Unicode character code point.  \f(CW\*(C`UTF8SKIP()\*(C'\fR
is useful, for example, for iterating over the characters of a \s-1UTF\-8\s0
encoded buffer; \f(CW\*(C`UNISKIP()\*(C'\fR is useful, for example, in computing
the size required for a \s-1UTF\-8\s0 encoded buffer.
.IP "\(bu" 4
\&\f(CW\*(C`utf8_distance(a, b)\*(C'\fR will tell the distance in characters between the
two pointers pointing to the same \s-1UTF\-8\s0 encoded buffer.
.IP "\(bu" 4
\&\f(CW\*(C`utf8_hop(s, off)\*(C'\fR will return a pointer to a \s-1UTF\-8\s0 encoded buffer
that is \f(CW\*(C`off\*(C'\fR (positive or negative) Unicode characters displaced
from the \s-1UTF\-8\s0 buffer \f(CW\*(C`s\*(C'\fR.
Be careful not to overstep the buffer:
\&\f(CW\*(C`utf8_hop()\*(C'\fR will merrily run off the end or the beginning of the
buffer if told to do so.
.IP "\(bu" 4
\&\f(CW\*(C`pv_uni_display(dsv, spv, len, pvlim, flags)\*(C'\fR and
\&\f(CW\*(C`sv_uni_display(dsv, ssv, pvlim, flags)\*(C'\fR are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging\*(--they display \fBall\fR characters as hexadecimal code
points\*(--but with the flags \f(CW\*(C`UNI_DISPLAY_ISPRINT\*(C'\fR,
\&\f(CW\*(C`UNI_DISPLAY_BACKSLASH\*(C'\fR, and \f(CW\*(C`UNI_DISPLAY_QQ\*(C'\fR you can make the
output more readable.
.IP "\(bu" 4
\&\f(CW\*(C`ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)\*(C'\fR can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use \f(CW\*(C`memEQ()\*(C'\fR and \f(CW\*(C`memNE()\*(C'\fR as usual.
.PP
For more information, see perlapi, and \fIutf8.c\fR and \fIutf8.h\fR
in the Perl source code distribution.
.SH "BUGS"
.IX Header "BUGS"
.Sh "Interaction with Locales"
.IX Subsection "Interaction with Locales"
Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8\-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.  Use of locales with
Unicode is discouraged.
.Sh "Interaction with Extensions"
.IX Subsection "Interaction with Extensions"
When Perl exchanges data with an extension, the extension should be
able to understand the \s-1UTF8\s0 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.
.PP
So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode data
exchange.
If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.
.PP
For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.
.PP
To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw \s-1UTF\-8\s0 and convert the result back to
Perl's internal representation like so:
.PP
.Vb 5
\&    sub my_escape_html ($) {
\&      my($what) = shift;
\&      return unless defined $what;
\&      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
\&    }
.Ve
.PP
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous \fIEncode::_utf8_on()\fR function.
Let's say the popular
\&\f(CW\*(C`Foo::Bar\*(C'\fR extension, written in C, provides a \f(CW\*(C`param\*(C'\fR method that
lets you store and retrieve data according to these prototypes:
.PP
.Vb 2
\&    $self\->param($name, $value);            # set a scalar
\&    $value = $self\->param($name);           # retrieve a scalar
.Ve
.PP
If it does not yet provide support for any encoding, one could write a
derived class with such a \f(CW\*(C`param\*(C'\fR method:
.PP
.Vb 12
\&    sub param {
\&      my($self,$name,$value) = @_;
\&      utf8::upgrade($name);     # make sure it is UTF\-8 encoded
\&      if (defined $value) {
\&        utf8::upgrade($value);  # make sure it is UTF\-8 encoded
\&        return $self\->SUPER::param($name,$value);
\&      } else {
\&        my $ret = $self\->SUPER::param($name);
\&        Encode::_utf8_on($ret); # we know, it is UTF\-8 encoded
\&        return $ret;
\&      }
\&    }
.Ve
.PP
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions, they can make the transition to
Unicode data much easier.
.Sh "Speed"
.IX Subsection "Speed"
Some functions are slower when working on \s-1UTF\-8\s0 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as \fIlength()\fR, \fIsubstr()\fR or \fIindex()\fR, or matching regular
expressions can work \fBmuch\fR faster when the underlying data are
byte-encoded.
.PP
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with \s-1UTF\-8\s0 encoded strings are still slower.
As an example,
the Unicode properties (character classes) like \f(CW\*(C`\ep{Nd}\*(C'\fR are known to
be quite a bit slower (5\-20 times) than their simpler counterparts
like \f(CW\*(C`\ed\*(C'\fR (then again, there are 268 Unicode characters matching \f(CW\*(C`Nd\*(C'\fR
compared with the 10 \s-1ASCII\s0 characters matching \f(CW\*(C`d\*(C'\fR).
.Sh "Porting code from perl\-5.6.X"
.IX Subsection "Porting code from perl-5.6.X"
Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the \f(CW\*(C`utf8\*(C'\fR pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.
.IP "\(bu" 4
A filehandle that should read or write \s-1UTF\-8\s0
.Sp
.Vb 3
\&  if ($] > 5.007) {
\&    binmode $fh, ":encoding(utf8)";
\&  }
.Ve
.IP "\(bu" 4
A scalar that is going to be passed to some extension
.Sp
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
\&\s-1UTF8\s0 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF\-8\-aware.
Please check the documentation to verify if this is still true.
.Sp
.Vb 4
\&  if ($] > 5.007) {
\&    require Encode;
\&    $val = Encode::encode_utf8($val); # make octets
\&  }
.Ve
.IP "\(bu" 4
A scalar we got back from an extension
.Sp
If you believe the scalar comes back as \s-1UTF\-8\s0, you will most likely
want the \s-1UTF8\s0 flag restored:
.Sp
.Vb 4
\&  if ($] > 5.007) {
\&    require Encode;
\&    $val = Encode::decode_utf8($val);
\&  }
.Ve
.IP "\(bu" 4
Same thing, if you are really sure it is \s-1UTF\-8\s0
.Sp
.Vb 4
\&  if ($] > 5.007) {
\&    require Encode;
\&    Encode::_utf8_on($val);
\&  }
.Ve
.IP "\(bu" 4
A wrapper for fetchrow_array and fetchrow_hashref
.Sp
When the database contains only \s-1UTF\-8\s0, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the \s-1DBI\s0 has no standardized way
to deal with \s-1UTF\-8\s0 data. Please check the documentation to verify if
that is still true.
.Sp
.Vb 10
\&  sub fetchrow {
\&    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
\&    if ($] < 5.007) {
\&      return $sth\->$what;
\&    } else {
\&      require Encode;
\&      if (wantarray) {
\&        my @arr = $sth\->$what;
\&        for (@arr) {
\&          defined && /[^\e000\-\e177]/ && Encode::_utf8_on(
