\&\*(L"EBCDIC-safe\*(R".
.Sh "Creating Unicode"
.IX Subsection "Creating Unicode"
To create Unicode characters in literals for code points above \f(CW0xFF\fR,
use the \f(CW\*(C`\ex{...}\*(C'\fR notation in double-quoted strings:
.PP
.Vb 1
\&    my $smiley = "\ex{263a}";
.Ve
.PP
Similarly, it can be used in regular expression literals:
.PP
.Vb 1
\&    $smiley =~ /\ex{263a}/;
.Ve
.PP
At run-time you can use \f(CW\*(C`chr()\*(C'\fR:
.PP
.Vb 1
\&    my $hebrew_alef = chr(0x05d0);
.Ve
.PP
See \*(L"Further Resources\*(R" for how to find all these numeric codes.
.PP
Naturally, \f(CW\*(C`ord()\*(C'\fR will do the reverse: it turns a character into
a code point.
.PP
Note that \f(CW\*(C`\ex..\*(C'\fR (no \f(CW\*(C`{}\*(C'\fR and only two hexadecimal digits), \f(CW\*(C`\ex{...}\*(C'\fR,
and \f(CW\*(C`chr(...)\*(C'\fR for arguments less than \f(CW0x100\fR (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls.  For arguments of \f(CW0x100\fR or more, Unicode characters are
always produced.  If you want to force the production of Unicode
characters regardless of the numeric value, use \f(CW\*(C`pack("U", ...)\*(C'\fR
instead of \f(CW\*(C`\ex..\*(C'\fR, \f(CW\*(C`\ex{...}\*(C'\fR, or \f(CW\*(C`chr()\*(C'\fR.
.PP
You can also use the \f(CW\*(C`charnames\*(C'\fR pragma to invoke characters
by name in double-quoted strings:
.PP
.Vb 2
\&    use charnames \*(Aq:full\*(Aq;
\&    my $arabic_alef = "\eN{ARABIC LETTER ALEF}";
.Ve
.PP
And, as mentioned above, you can also \f(CW\*(C`pack()\*(C'\fR numbers into Unicode
characters:
.PP
.Vb 1
\&    my $georgian_an = pack("U", 0x10a0);
.Ve
.PP
Note that both \f(CW\*(C`\ex{...}\*(C'\fR and \f(CW\*(C`\eN{...}\*(C'\fR are compile-time string
constants: you cannot use variables in them.  If you want similar
run-time functionality, use \f(CW\*(C`chr()\*(C'\fR and \f(CW\*(C`charnames::vianame()\*(C'\fR.
.PP
If you want to force the result to Unicode characters, use the special
\&\f(CW"U0"\fR prefix.
It consumes no arguments but causes the following bytes
to be interpreted as the \s-1UTF\-8\s0 encoding of Unicode characters:
.PP
.Vb 1
\&    my $chars = pack("U0W*", 0x80, 0x42);
.Ve
.PP
Likewise, you can stop such \s-1UTF\-8\s0 interpretation by using the special
\&\f(CW"C0"\fR prefix.
.Sh "Handling Unicode"
.IX Subsection "Handling Unicode"
Handling Unicode is for the most part transparent: just use the
strings as usual.  Functions like \f(CW\*(C`index()\*(C'\fR, \f(CW\*(C`length()\*(C'\fR, and
\&\f(CW\*(C`substr()\*(C'\fR will work on the Unicode characters; regular expressions
will work on the Unicode characters (see perlunicode and perlretut).
.PP
Note that Perl considers combining character sequences to be
separate characters, so for example
.PP
.Vb 2
\&    use charnames \*(Aq:full\*(Aq;
\&    print length("\eN{LATIN CAPITAL LETTER A}\eN{COMBINING ACUTE ACCENT}"), "\en";
.Ve
.PP
will print 2, not 1.  The only exception is that regular expressions
have \f(CW\*(C`\eX\*(C'\fR for matching a combining character sequence.
.PP
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:
.Sh "Legacy Encodings"
.IX Subsection "Legacy Encodings"
When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode.  Normally \s-1ISO\s0 8859\-1 (or \s-1EBCDIC\s0, if
applicable) is assumed.
.PP
The \f(CW\*(C`Encode\*(C'\fR module knows about many encodings and has interfaces
for doing conversions between those encodings:
.PP
.Vb 2
\&    use Encode \*(Aqdecode\*(Aq;
\&    $data = decode("iso\-8859\-3", $data); # convert from legacy to utf\-8
.Ve
.Sh "Unicode I/O"
.IX Subsection "Unicode I/O"
Normally, writing out Unicode data
.PP
.Vb 1
\&    print FH $some_string_with_unicode, "\en";
.Ve
.PP
produces raw bytes that Perl happens to use to internally encode the
Unicode string.  Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time.
If any of the characters are at code points \f(CW0x100\fR or above, you will get
a warning.  To ensure that the output is explicitly rendered in the
encoding you desire\*(--and to avoid the warning\*(--open the stream with
the desired encoding.  Some examples:
.PP
.Vb 1
\&    open FH, ">:utf8", "file";
\&
\&    open FH, ">:encoding(ucs2)", "file";
\&    open FH, ">:encoding(UTF\-8)", "file";
\&    open FH, ">:encoding(shift_jis)", "file";
.Ve
.PP
and on already open streams, use \f(CW\*(C`binmode()\*(C'\fR:
.PP
.Vb 1
\&    binmode(STDOUT, ":utf8");
\&
\&    binmode(STDOUT, ":encoding(ucs2)");
\&    binmode(STDOUT, ":encoding(UTF\-8)");
\&    binmode(STDOUT, ":encoding(shift_jis)");
.Ve
.PP
The matching of encoding names is loose: case does not matter, and
many encodings have several aliases.  Note that the \f(CW\*(C`:utf8\*(C'\fR layer
must always be specified exactly like that; it is \fInot\fR subject to
the loose matching of encoding names.  Also note that \f(CW\*(C`:utf8\*(C'\fR is unsafe for
input, because it accepts the data without validating that it is indeed valid
\&\s-1UTF\-8\s0.
.PP
See PerlIO for the \f(CW\*(C`:utf8\*(C'\fR layer, PerlIO::encoding and
Encode::PerlIO for the \f(CW\*(C`:encoding()\*(C'\fR layer, and
Encode::Supported for many encodings supported by the \f(CW\*(C`Encode\*(C'\fR
module.
.PP
Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes.  To do that, specify the appropriate
layer when opening files:
.PP
.Vb 2
\&    open(my $fh,\*(Aq<:encoding(utf8)\*(Aq, \*(Aqanything\*(Aq);
\&    my $line_of_unicode = <$fh>;
\&
\&    open(my $fh,\*(Aq<:encoding(Big5)\*(Aq, \*(Aqanything\*(Aq);
\&    my $line_of_unicode = <$fh>;
.Ve
.PP
The I/O layers can also be specified more flexibly with
the \f(CW\*(C`open\*(C'\fR pragma.
See open, or look at the following example.
.PP
.Vb 7
\&    use open \*(Aq:encoding(utf8)\*(Aq; # input/output default encoding will be UTF\-8
\&    open X, ">file";
\&    print X chr(0x100), "\en";
\&    close X;
\&    open Y, "<file";
\&    printf "%#x\en", ord(<Y>); # this should print 0x100
\&    close Y;
.Ve
.PP
With the \f(CW\*(C`open\*(C'\fR pragma you can use the \f(CW\*(C`:locale\*(C'\fR layer
.PP
.Vb 9
\&    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = \*(Aqru_RU.KOI8\-R\*(Aq }
\&    # the :locale will probe the locale environment variables like LC_ALL
\&    use open OUT => \*(Aq:locale\*(Aq; # russki parusski
\&    open(O, ">koi8");
\&    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8\-R 0xc1
\&    close O;
\&    open(I, "<koi8");
\&    printf "%#x\en", ord(<I>); # this should print 0xc1
\&    close I;
.Ve
.PP
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream.  The result is always Unicode.
.PP
The open pragma affects all the \f(CW\*(C`open()\*(C'\fR calls after the pragma by
setting default layers.  If you want to affect only certain
streams, use explicit layers directly in the \f(CW\*(C`open()\*(C'\fR call.
.PP
You can switch encodings on an already opened stream by using
\&\f(CW\*(C`binmode()\*(C'\fR; see \*(L"binmode\*(R" in perlfunc.
.PP
The \f(CW\*(C`:locale\*(C'\fR does not currently (as of Perl 5.8.0) work with
\&\f(CW\*(C`open()\*(C'\fR and \f(CW\*(C`binmode()\*(C'\fR, only with the \f(CW\*(C`open\*(C'\fR pragma.  The
\&\f(CW\*(C`:utf8\*(C'\fR and \f(CW\*(C`:encoding(...)\*(C'\fR methods do work with all of \f(CW\*(C`open()\*(C'\fR,
\&\f(CW\*(C`binmode()\*(C'\fR, and the \f(CW\*(C`open\*(C'\fR pragma.
.PP
Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream.
For example, the following snippet copies the
contents of the file \*(L"text.jis\*(R" (encoded as \s-1ISO\-2022\-JP\s0, aka \s-1JIS\s0) to
the file \*(L"text.utf8\*(R", encoded as \s-1UTF\-8:\s0
.PP
.Vb 3
\&    open(my $nihongo, \*(Aq<:encoding(iso\-2022\-jp)\*(Aq, \*(Aqtext.jis\*(Aq);
\&    open(my $unicode, \*(Aq>:utf8\*(Aq, \*(Aqtext.utf8\*(Aq);
\&    while (<$nihongo>) { print $unicode $_ }
.Ve
.PP
The naming of encodings, both by the \f(CW\*(C`open()\*(C'\fR and by the \f(CW\*(C`open\*(C'\fR
pragma allows for flexible names: \f(CW\*(C`koi8\-r\*(C'\fR and \f(CW\*(C`KOI8R\*(C'\fR will both be
understood.
.PP
Common encodings recognized by \s-1ISO\s0, \s-1MIME\s0, \s-1IANA\s0, and various other
standardisation organisations are recognised; for a more detailed
list see Encode::Supported.
.PP
\&\f(CW\*(C`read()\*(C'\fR reads characters and returns the number of characters.
\&\f(CW\*(C`seek()\*(C'\fR and \f(CW\*(C`tell()\*(C'\fR operate on byte counts, as do \f(CW\*(C`sysread()\*(C'\fR
and \f(CW\*(C`sysseek()\*(C'\fR.
.PP
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:
.PP
.Vb 8
\&    # BAD CODE WARNING
\&    open F, "file";
\&    local $/; ## read in the whole file of 8\-bit characters
\&    $t = <F>;
\&    close F;
\&    open F, ">:encoding(utf8)", "file";
\&    print F $t; ## convert to UTF\-8 on output
\&    close F;
.Ve
.PP
If you run this code twice, the contents of the \fIfile\fR will be twice
\&\s-1UTF\-8\s0 encoded.
A \f(CW\*(C`use open \*(Aq:encoding(utf8)\*(Aq\*(C'\fR would have avoided the
bug, as would explicitly opening the \fIfile\fR for input as \s-1UTF\-8\s0 too.
.PP
\&\fB\s-1NOTE\s0\fR: the \f(CW\*(C`:utf8\*(C'\fR and \f(CW\*(C`:encoding\*(C'\fR features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).
.Sh "Displaying Unicode As Text"
.IX Subsection "Displaying Unicode As Text"
Sometimes you might want to display Perl scalars containing Unicode as
simple \s-1ASCII\s0 (or \s-1EBCDIC\s0) text.  The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as \f(CW\*(C`\ex{...}\*(C'\fR, control characters (like \f(CW\*(C`\en\*(C'\fR) are
displayed as \f(CW\*(C`\ex..\*(C'\fR, and the rest of the characters as themselves:
.PP
.Vb 9
\&    sub nice_string {
\&        join("",
\&          map { $_ > 255 ?                    # if wide character...
\&                sprintf("\e\ex{%04X}", $_) :   # \ex{...}
\&                chr($_) =~ /[[:cntrl:]]/ ?    # else if control character ...
\&                sprintf("\e\ex%02X", $_) :     # \ex..
\&                quotemeta(chr($_))            # else quoted or as themselves
\&          } unpack("W*", $_[0]));             # unpack Unicode characters
\&    }
.Ve
.PP
For example,
.PP
.Vb 1
\&    nice_string("foo\ex{100}bar\en")
.Ve
.PP
returns the string
.PP
.Vb 1
\&    \*(Aqfoo\ex{0100}bar\ex0A\*(Aq
.Ve
.PP
which is ready to be printed.
.Sh "Special Cases"
.IX Subsection "Special Cases"
.IP "\(bu" 4
Bit Complement Operator ~ And \fIvec()\fR
.Sp
The bit complement operator \f(CW\*(C`~\*(C'\fR may produce surprising results if
used on strings containing characters with ordinal values above
255.  In such a case, the results are consistent with the internal
encoding of the characters, but not with much else.  So don't do
that.
Similarly for \f(CW\*(C`vec()\*(C'\fR: you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on
the code point values, which is very probably not what you want.
.IP "\(bu" 4
Peeking At Perl's Internal Encoding
.Sp
Normal users of Perl should never care how Perl encodes any particular
Unicode string (because the normal ways to get at the contents of a
string with Unicode\*(--via input and output\*(--should always be via
explicitly-defined I/O layers).  But if you must, there are two
ways of looking behind the scenes.
.Sp
One way of peeking inside the internal encoding of Unicode characters
is to use \f(CW\*(C`unpack("C*", ...)\*(C'\fR to get the bytes of whatever the string
encoding happens to be, or \f(CW\*(C`unpack("U0..", ...)\*(C'\fR to get the bytes of the
\&\s-1UTF\-8\s0 encoding:
.Sp
.Vb 2
\&    # this prints c4 80 for the UTF\-8 bytes 0xc4 0x80
\&    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\en";
.Ve
.Sp
The other way is to use the Devel::Peek module:
.Sp
.Vb 1
\&    perl \-MDevel::Peek \-e \*(AqDump(chr(0x100))\*(Aq
.Ve
.Sp
That shows the \f(CW\*(C`UTF8\*(C'\fR flag in \s-1FLAGS\s0 and both the \s-1UTF\-8\s0 bytes
and Unicode characters in \f(CW\*(C`PV\*(C'\fR.  See also later in this document
the discussion about the \f(CW\*(C`utf8::is_utf8()\*(C'\fR function.
.Sh "Advanced Topics"
.IX Subsection "Advanced Topics"
.IP "\(bu" 4
String Equivalence
.Sp
The question of string equivalence turns somewhat complicated
in Unicode: what do you mean by \*(L"equal\*(R"?
.Sp
(Is \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR equal to
\&\f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR?)
.Sp
The short answer is that by default Perl compares equivalence (\f(CW\*(C`eq\*(C'\fR,
\&\f(CW\*(C`ne\*(C'\fR) based only on code points of the characters.  In the above
case, the answer is no (because 0x00C1 != 0x0041).
But sometimes, any
\&\s-1CAPITAL\s0 \s-1LETTER\s0 As should be considered equal, or even As of any case.
.Sp
The long answer is that you need to consider character normalization
and casing issues: see Unicode::Normalize, Unicode Technical
Reports #15 and #21, \fIUnicode Normalization Forms\fR and \fICase
Mappings\fR, http://www.unicode.org/unicode/reports/tr15/ and
http://www.unicode.org/unicode/reports/tr21/
.Sp
As of Perl 5.8.0, the \*(L"Full\*(R" case-folding of \fICase