perlunicode.pod
=head1 NAME

perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)

=head1 DESCRIPTION

=head2 Important Caveat

WARNING: As of the 5.6.1 release, the implementation of Unicode support
in Perl is incomplete, and continues to be highly experimental.

The following areas need further work.  They are being rapidly addressed
in the 5.7.x development branch.

=over 4

=item Input and Output Disciplines

There is currently no easy way to mark data read from a file or other
external source as being utf8.  This will be one of the major areas of
focus in the near future.

=item Regular Expressions

The existing regular expression compiler does not produce polymorphic
opcodes.  This means that the determination on whether to match Unicode
characters is made when the pattern is compiled, based on whether the
pattern contains Unicode characters, and not when the matching happens
at run time.  This needs to be changed to adaptively match Unicode if
the string to be matched is Unicode.

=item C<use utf8> still needed to enable a few features

The C<utf8> pragma implements the tables used for Unicode support.  These
tables are automatically loaded on demand, so the C<utf8> pragma need not
normally be used.

However, as a compatibility measure, this pragma must be explicitly used
to enable recognition of UTF-8 encoded literals and identifiers in the
source text.

=back

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally.  This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters
rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to
provide a safe migration path from byte semantics to character semantics
for programs.
For operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character semantics.
For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility, and
chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the ${^WIDE_SYSTEM_CALLS}
global flag is set to C<1>), all system calls will use the
corresponding wide character APIs.  This is currently only implemented
on Windows.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-8 in literals encountered by the parser.  It may also
be used for enabling some of the more experimental Unicode support features.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default.  This pragma may
then become a no-op.  See L<utf8>.

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect.
To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters.  For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user.  A character in Perl is logically just a number
ranging from 0 to 2**32 or so.  Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation.  UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>.  For instance,
a Unicode smiley face is C<\x{263A}>.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.
So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single-letter properties may omit the braces, so
the latter can also be written C<\pM>.  Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake.  For similar functionality see pack('U0', ...) and
pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when provided character input.  Note that C<uc()> translates to
uppercase, while C<ucfirst> translates to titlecase (for languages
that make the distinction).  Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>.  Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.
(Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one cannot
mix C<~> (the bit complement) and characters both less than 256 and
equal or greater than 256.  Most importantly, the DeMorgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement, and the full character
wide bit complement.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Character encodings for input and output

[XXX: This feature is not yet implemented.]

=head1 CAVEATS

As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8.  This is planned in the near
future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  It will also
tend to run slower.  Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut
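=head1 EXAMPLES

The following is a minimal sketch of several effects of character
semantics described in this document (C<length()>, C<scalar reverse()>,
the C<bytes> pragma, C<ord()>/C<chr()>, the C<pack()> "C<U>" specifier,
and C<\p{Lu}>).  It assumes a later perl (5.8 or newer) in which this
behavior has stabilized; results on 5.6.1 itself may differ.  The sample
string and all variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample string: "A", U+263A (smiley face), "b".
my $str = "A\x{263A}b";

# Character semantics: length() and scalar reverse() count characters.
my $chars = length($str);               # three characters
my $rev   = scalar reverse($str);       # "b", smiley, "A"

# Byte semantics, forced only inside this lexical scope by "use bytes".
# The count is the number of bytes in the internal UTF-8 form.
my $bytes = do { use bytes; length($str) };

# ord()/chr() work on characters, like pack("U")/unpack("U").
my $code  = ord(substr($str, 1, 1));
my $same  = chr($code) eq pack("U", 0x263A);
my @codes = unpack("U*", $str);

# \p{Lu} matches any character with the Unicode uppercase property.
my @upper = $str =~ /(\p{Lu})/g;

printf "chars=%d bytes=%d code=U+%04X upper=%s\n",
    $chars, $bytes, $code, join(",", @upper);
# prints "chars=3 bytes=5 code=U+263A upper=A"
```

The byte count exceeds the character count because the smiley face is
stored internally as a three-byte UTF-8 sequence, which is exactly the
detail that character semantics hide at the Perl level.

=cut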