.Ve
.PP
Or if you already have an open filehandle:
.PP
.Vb 1
\& binmode $fh, \*(Aq:encoding(UTF\-8)\*(Aq;
.Ve
.PP
Some database drivers for \s-1DBI\s0 can also automatically encode and decode, but
that is sometimes limited to the \s-1UTF\-8\s0 encoding.
.Sh "What if I don't know which encoding was used?"
.IX Subsection "What if I don't know which encoding was used?"
Do whatever you can to find out, and if you have to: guess. (Don't forget to
document your guess with a comment.)
.PP
You could open the document in a web browser, and change the character set or
character encoding until you can visually confirm that all characters look the
way they should.
.PP
There is no way to reliably detect the encoding automatically, so if people
keep sending you data without charset indication, you may have to educate them.
.Sh "Can I use Unicode in my Perl sources?"
.IX Subsection "Can I use Unicode in my Perl sources?"
Yes, you can! If your sources are \s-1UTF\-8\s0 encoded, you can indicate that with the
\&\f(CW\*(C`use utf8\*(C'\fR pragma.
.PP
.Vb 1
\& use utf8;
.Ve
.PP
This doesn't do anything to your input, or to your output. It only influences
the way your sources are read. You can use Unicode in string literals, in
identifiers (but they still have to be \*(L"word characters\*(R" according to \f(CW\*(C`\ew\*(C'\fR),
and even in custom delimiters.
.Sh "Data::Dumper doesn't restore the \s-1UTF8\s0 flag; is it broken?"
.IX Subsection "Data::Dumper doesn't restore the UTF8 flag; is it broken?"
No, Data::Dumper's Unicode abilities are as they should be. There have been
some complaints that it should restore the \s-1UTF8\s0 flag when the data is read
again with \f(CW\*(C`eval\*(C'\fR. However, you should really not look at the flag, and
nothing indicates that Data::Dumper should break this rule.
.PP
Here's what happens: when Perl reads in a string literal, it sticks to 8 bit
encoding as long as it can. (But perhaps originally it was internally encoded
as \s-1UTF\-8\s0, when you dumped it.)
When it has to give that up because other
characters are added to the text string, it silently upgrades the string to
\&\s-1UTF\-8\s0.
.PP
If you properly encode your strings for output, none of this is of your
concern, and you can just \f(CW\*(C`eval\*(C'\fR dumped data as always.
.Sh "Why do regex character classes sometimes match only in the \s-1ASCII\s0 range?"
.IX Subsection "Why do regex character classes sometimes match only in the ASCII range?"
.Sh "Why do some characters not uppercase or lowercase correctly?"
.IX Subsection "Why do some characters not uppercase or lowercase correctly?"
It seemed like a good idea at the time, to keep the semantics the same for
standard strings, when Perl got Unicode support. While it might be repaired
in the future, we now have to deal with the fact that Perl treats equal
strings differently, depending on the internal state.
.PP
Affected are \f(CW\*(C`uc\*(C'\fR, \f(CW\*(C`lc\*(C'\fR, \f(CW\*(C`ucfirst\*(C'\fR, \f(CW\*(C`lcfirst\*(C'\fR, \f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\eL\*(C'\fR, \f(CW\*(C`\eu\*(C'\fR, \f(CW\*(C`\el\*(C'\fR,
\&\f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\eD\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR, \f(CW\*(C`\eW\*(C'\fR, \f(CW\*(C`/.../i\*(C'\fR, \f(CW\*(C`(?i:...)\*(C'\fR,
\&\f(CW\*(C`/[[:posix:]]/\*(C'\fR.
.PP
To force Unicode semantics, you can upgrade the internal representation
by doing \f(CW\*(C`utf8::upgrade($string)\*(C'\fR. This does not change strings that were
already upgraded.
.PP
For a more detailed discussion, see Unicode::Semantics on \s-1CPAN\s0.
.Sh "How can I determine if a string is a text string or a binary string?"
.IX Subsection "How can I determine if a string is a text string or a binary string?"
You can't. Some use the \s-1UTF8\s0 flag for this, but that's misuse, and makes well
behaved modules like Data::Dumper look bad.
The flag is useless for this
purpose, because it's off when an 8 bit encoding (by default \s-1ISO\-8859\-1\s0) is
used to store the string.
.PP
This is something you, the programmer, have to keep track of; sorry. You could
consider adopting a kind of \*(L"Hungarian notation\*(R" to help with this.
.Sh "How do I convert from encoding \s-1FOO\s0 to encoding \s-1BAR\s0?"
.IX Subsection "How do I convert from encoding FOO to encoding BAR?"
By first converting the FOO-encoded byte string to a text string, and then the
text string to a BAR-encoded byte string:
.PP
.Vb 3
\& use Encode qw(decode encode);
\& my $text_string = decode(\*(AqFOO\*(Aq, $foo_string);
\& my $bar_string = encode(\*(AqBAR\*(Aq, $text_string);
.Ve
.PP
or by skipping the text string part, and going directly from one binary
encoding to the other:
.PP
.Vb 2
\& use Encode qw(from_to);
\& from_to($string, \*(AqFOO\*(Aq, \*(AqBAR\*(Aq); # changes contents of $string
.Ve
.PP
or by letting automatic decoding and encoding do all the work:
.PP
.Vb 3
\& open my $foofh, \*(Aq<:encoding(FOO)\*(Aq, \*(Aqexample.foo.txt\*(Aq;
\& open my $barfh, \*(Aq>:encoding(BAR)\*(Aq, \*(Aqexample.bar.txt\*(Aq;
\& print { $barfh } $_ while <$foofh>;
.Ve
.ie n .Sh "What are ""decode_utf8""\fP and \f(CW""encode_utf8""?"
.el .Sh "What are \f(CWdecode_utf8\fP and \f(CWencode_utf8\fP?"
.IX Subsection "What are decode_utf8 and encode_utf8?"
These are alternate syntaxes for \f(CW\*(C`decode(\*(Aqutf8\*(Aq, ...)\*(C'\fR and \f(CW\*(C`encode(\*(Aqutf8\*(Aq,
\&...)\*(C'\fR.
.ie n .Sh "What is a ""wide character""?"
.el .Sh "What is a ``wide character''?"
.IX Subsection "What is a wide character?"
This is a term used for characters with an ordinal value greater than 127,
characters with an ordinal value greater than 255, or any character occupying
more than one byte, depending on the context.
.PP
The Perl warning \*(L"Wide character in ...\*(R" is caused by a character with an
ordinal value greater than 255.
With no specified encoding layer, Perl tries to
fit things in \s-1ISO\-8859\-1\s0 for backward compatibility reasons. When it can't, it
emits this warning (if warnings are enabled), and outputs \s-1UTF\-8\s0 encoded data
instead.
.PP
To avoid this warning and to avoid having different output encodings in a single
stream, always specify an encoding explicitly, for example with a PerlIO layer:
.PP
.Vb 1
\& binmode STDOUT, ":encoding(UTF\-8)";
.Ve
.SH "INTERNALS"
.IX Header "INTERNALS"
.ie n .Sh "What is ""the \s-1UTF8\s0 flag""?"
.el .Sh "What is ``the \s-1UTF8\s0 flag''?"
.IX Subsection "What is the UTF8 flag?"
Please, unless you're hacking the internals, or debugging weirdness, don't
think about the \s-1UTF8\s0 flag at all. That means that you very probably shouldn't
use \f(CW\*(C`is_utf8\*(C'\fR, \f(CW\*(C`_utf8_on\*(C'\fR or \f(CW\*(C`_utf8_off\*(C'\fR at all.
.PP
The \s-1UTF8\s0 flag, also called SvUTF8, is an internal flag that indicates that the
current internal representation is \s-1UTF\-8\s0. Without the flag, it is assumed to be
\&\s-1ISO\-8859\-1\s0. Perl converts between these automatically.
.PP
One of Perl's internal formats happens to be \s-1UTF\-8\s0. Unfortunately, Perl can't
keep a secret, so everyone knows about this. That is the source of much
confusion. It's better to pretend that the internal format is some unknown
encoding, and that you always have to encode and decode explicitly.
.ie n .Sh "What about the ""use bytes"" pragma?"
.el .Sh "What about the \f(CWuse bytes\fP pragma?"
.IX Subsection "What about the use bytes pragma?"
Don't use it. It makes no sense to deal with bytes in a text string, and it
makes no sense to deal with characters in a byte string. Do the proper
conversions (by decoding/encoding), and things will work out well: you get
character counts for decoded data, and byte counts for encoded data.
.PP
\&\f(CW\*(C`use bytes\*(C'\fR is usually a failed attempt to do something useful.
Just forget
about it.
.ie n .Sh "What about the ""use encoding"" pragma?"
.el .Sh "What about the \f(CWuse encoding\fP pragma?"
.IX Subsection "What about the use encoding pragma?"
Don't use it. Unfortunately, it assumes that the programmer's environment and
that of the user will use the same encoding. It will use the same encoding for
the source code and for \s-1STDIN\s0 and \s-1STDOUT\s0. When a program is copied to another
machine, the source code does not change, but the \s-1STDIO\s0 environment might.
.PP
If you need non-ASCII characters in your source code, make it a \s-1UTF\-8\s0 encoded
file and \f(CW\*(C`use utf8\*(C'\fR.
.PP
If you need to set the encoding for \s-1STDIN\s0, \s-1STDOUT\s0, and \s-1STDERR\s0, for example
based on the user's locale, \f(CW\*(C`use open\*(C'\fR.
.ie n .Sh "What is the difference between "":encoding""\fP and \f(CW"":utf8""?"
.el .Sh "What is the difference between \f(CW:encoding\fP and \f(CW:utf8\fP?"
.IX Subsection "What is the difference between :encoding and :utf8?"
Because \s-1UTF\-8\s0 is one of Perl's internal formats, you can often just skip the
encoding or decoding step, and manipulate the \s-1UTF8\s0 flag directly.
.PP
Instead of \f(CW\*(C`:encoding(UTF\-8)\*(C'\fR, you can simply use \f(CW\*(C`:utf8\*(C'\fR, which skips the
encoding step if the data was already represented as \s-1UTF8\s0 internally. This is
widely accepted as good behavior when you're writing, but it can be dangerous
when reading, because it causes internal inconsistency when you have invalid
byte sequences. Using \f(CW\*(C`:utf8\*(C'\fR for input can sometimes result in security
breaches, so please use \f(CW\*(C`:encoding(UTF\-8)\*(C'\fR instead.
.PP
Instead of \f(CW\*(C`decode\*(C'\fR and \f(CW\*(C`encode\*(C'\fR, you could use \f(CW\*(C`_utf8_on\*(C'\fR and \f(CW\*(C`_utf8_off\*(C'\fR,
but this is considered bad style.
Especially \f(CW\*(C`_utf8_on\*(C'\fR can be dangerous, for
the same reason that \f(CW\*(C`:utf8\*(C'\fR can.
.PP
There are some shortcuts for oneliners; see \f(CW\*(C`\-C\*(C'\fR in perlrun.
.ie n .Sh "What's the difference between ""UTF\-8""\fP and \f(CW""utf8""?"
.el .Sh "What's the difference between \f(CWUTF\-8\fP and \f(CWutf8\fP?"
.IX Subsection "What's the difference between UTF-8 and utf8?"
\&\f(CW\*(C`UTF\-8\*(C'\fR is the official standard. \f(CW\*(C`utf8\*(C'\fR is Perl's way of being liberal in
what it accepts. If you have to communicate with things that aren't so liberal,
you may want to consider using \f(CW\*(C`UTF\-8\*(C'\fR. If you have to communicate with things
that are too liberal, you may have to use \f(CW\*(C`utf8\*(C'\fR. The full explanation is in
Encode.
.PP
\&\f(CW\*(C`UTF\-8\*(C'\fR is internally known as \f(CW\*(C`utf\-8\-strict\*(C'\fR. The tutorial uses \s-1UTF\-8\s0
consistently, even where utf8 is actually used internally, because the
distinction can be hard to make, and is mostly irrelevant.
.PP
For example, utf8 can be used for code points that don't exist in Unicode, like
9999999, but if you encode that to \s-1UTF\-8\s0, you get a substitution character (by
default; see \*(L"Handling Malformed Data\*(R" in Encode for more ways of dealing with
this.)
.PP
Okay, if you insist: the \*(L"internal format\*(R" is utf8, not \s-1UTF\-8\s0. (When it's not
some other encoding.)
.Sh "I lost track; what encoding is the internal format really?"
.IX Subsection "I lost track; what encoding is the internal format really?"
It's good that you lost track, because you shouldn't depend on the internal
format being any specific encoding. But since you asked: by default, the
internal format is either \s-1ISO\-8859\-1\s0 (latin\-1), or utf8, depending on the
history of the string. On \s-1EBCDIC\s0 platforms, this may even be different.
.PP
Perl knows how it stored the string internally, and will use that knowledge
when you \f(CW\*(C`encode\*(C'\fR.
In other words: don't try to find out what the internal
encoding for a certain string is, but instead just encode it into the encoding
that you want.
.SH "AUTHOR"
.IX Header "AUTHOR"
Juerd Waalboer <#####@juerd.nl>
.SH "SEE ALSO"
.IX Header "SEE ALSO"
perlunicode, perluniintro, Encode