perlunicode.pod

=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
L<the Perl Unicode tutorial|perlunitut> before reading this reference
document.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer.  See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes.  That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.
B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In the future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics.  For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.
Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently.  If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect.  The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.  This translation is done without
regard to the system's native 8-bit encoding.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so.
Larger characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\x{...}>
notation.  The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
C<\x{263A}>.  This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.

Additionally, if you

   use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.
C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.

See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character.  C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see pack('U0', ...) and pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when character input is provided.  Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  An operator that
specifically does not switch is C<vec()>.  Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats.  Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.
There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256.  Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold.  The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).

See L</"User-Defined Case Mappings"> for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Unicode Character Properties

Named Unicode properties, scripts, and block ranges may be used like
character
classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underscores, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.>

=over 4

=item General Category

Here are the basic Unicode General Category properties, followed by their
long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates.
C<Cs> is therefore not
supported.

=item Bidirectional Character Types

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.
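
The property constructs described so far can be tried out with a short
script.  What follows is only a minimal sketch: the sample strings and
variable names are made up for illustration and are not part of Perl or
of this document, and it assumes a Perl recent enough to accept the
C<\p{BidiClass:R}> spelling shown above.  It exercises C<\p{Lu}>, the
brace-less C<\pM>, the two spellings of negation, and a Bidirectional
Character Type.

```perl
use strict;
use warnings;

# Hypothetical sample strings; \x{...} escapes keep the source ASCII-only,
# so no "use utf8" is needed here.
my $greek  = "\x{0391}\x{03B2}";   # GREEK CAPITAL LETTER ALPHA, GREEK SMALL LETTER BETA
my $hebrew = "\x{05D0}";           # HEBREW LETTER ALEF
my $accent = "e\x{0301}";          # "e" followed by COMBINING ACUTE ACCENT

# \p{Lu} matches uppercase letters; count the matches in list context.
my $upper = () = $greek =~ /\p{Lu}/g;

# \pM is the brace-less spelling of \p{M} (any mark character).
my $marks = () = $accent =~ /\pM/g;

# \p{^Lu} and \P{Lu} are two spellings of the same negation.
my $same_negation =
    ($greek =~ /^\p{Lu}\p{^Lu}$/ && $greek =~ /^\p{Lu}\P{Lu}$/) ? 1 : 0;

# \p{BidiClass:R} matches characters normally written right to left.
my $is_rtl = $hebrew =~ /\p{BidiClass:R}/ ? 1 : 0;

printf "upper=%d marks=%d negation=%d rtl=%d\n",
    $upper, $marks, $same_negation, $is_rtl;
```

Because every sample string contains a character above 255, Perl applies
character semantics to these matches without any pragma.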
