📄 perlunicode.1

\&\*(L"c\*(R" to be mapped to \*(L"A\*(R", \*(L"B\*(R", \*(L"C\*(R"; all other characters will remain
unchanged.
.PP
If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tabulator characters are still needed.
For example:
.PP
.Vb 5
\&    sub ToLower {
\&        return <<END;
\&    0041\et\et0061
\&    END
\&    }
.Ve
.PP
defines a \fIlc()\fR mapping that causes only \*(L"A\*(R" to be mapped to \*(L"a\*(R"; all
other characters will remain unchanged.
.PP
(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
\&\f(CW$Config{privlib}\fR/\fIunicore/To/\fR.  The mapping data is returned as
the here-document, and the \f(CW\*(C`utf8::ToSpecFoo\*(C'\fR are special exception
mappings derived from \f(CW$Config{privlib}\fR/\fIunicore/SpecialCasing.txt\fR.
The \f(CW\*(C`Digit\*(C'\fR and \f(CW\*(C`Fold\*(C'\fR mappings that one can see in the directory
are not directly user-accessible.  To reach them, either use the
\&\f(CW\*(C`Unicode::UCD\*(C'\fR module or just match case-insensitively (that is when
the \f(CW\*(C`Fold\*(C'\fR mapping is used).
.PP
A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.
.Sh "Character Encodings for Input and Output"
.IX Subsection "Character Encodings for Input and Output"
See Encode.
.Sh "Unicode Regular Expression Support Level"
.IX Subsection "Unicode Regular Expression Support Level"
The following list of Unicode support for regular expressions describes
all the features currently supported.
The references to \*(L"Level N\*(R"
and the section numbers refer to the Unicode Technical Standard #18,
\&\*(L"Unicode Regular Expressions\*(R", version 11 (May 2005).
.IP "\(bu" 4
Level 1 \- Basic Unicode Support
.Sp
.Vb 8
\&        RL1.1   Hex Notation                        \- done          [1]
\&        RL1.2   Properties                          \- done          [2][3]
\&        RL1.2a  Compatibility Properties            \- done          [4]
\&        RL1.3   Subtraction and Intersection        \- MISSING       [5]
\&        RL1.4   Simple Word Boundaries              \- done          [6]
\&        RL1.5   Simple Loose Matches                \- done          [7]
\&        RL1.6   Line Boundaries                     \- MISSING       [8]
\&        RL1.7   Supplementary Code Points           \- done          [9]
\&
\&        [1]  \ex{...}
\&        [2]  \ep{...} \eP{...}
\&        [3]  supports not only minimal list (general category, scripts,
\&             Alphabetic, Lowercase, Uppercase, WhiteSpace,
\&             NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
\&             ASCII, Assigned), but also bidirectional types, blocks, etc.
\&             (see "Unicode Character Properties")
\&        [4]  \ed \eD \es \eS \ew \eW \eX [:prop:] [:^prop:]
\&        [5]  can use regular expression look\-ahead [a] or
\&             user\-defined character properties [b] to emulate set operations
\&        [6]  \eb \eB
\&        [7]  note that Perl does Full case\-folding in matching, not Simple:
\&             for example U+1F88 is equivalent with U+1F00 U+03B9,
\&             not with U+1F80.  This difference matters for certain Greek
\&             capital letters with certain modifiers: the Full case\-folding
\&             decomposes the letter, while the Simple case\-folding would map
\&             it to a single character.
\&        [8]  should do ^ and $ also on U+000B (\ev in C), FF (\ef), CR (\er),
\&             CRLF (\er\en), NEL (U+0085), LS (U+2028), and PS (U+2029);
\&             should also affect <>, $., and script line numbers;
\&             should not split lines within CRLF [c] (i.e. there is no empty
\&             line between \er and \en)
\&        [9]  UTF\-8/UTF\-EBCDIC used in perl allows not only U+10000 to U+10FFFF
\&             but also beyond U+10FFFF [d]
.Ve
.Sp
[a] You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as
.Sp
.Vb 1
\&    [{Greek}\-[{UNASSIGNED}]]
.Ve
.Sp
in Perl can be written as:
.Sp
.Vb 2
\&    (?!\ep{Unassigned})\ep{InGreekAndCoptic}
\&    (?=\ep{Assigned})\ep{InGreekAndCoptic}
.Ve
.Sp
But in this particular example, you probably really want
.Sp
.Vb 1
\&    \ep{GreekAndCoptic}
.Ve
.Sp
which will match assigned characters known to be part of the Greek script.
.Sp
Also see the Unicode::Regex::Set module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
.Sp
[b] '+' for union, '\-' for removal (set-difference), '&' for intersection
(see \*(L"User-Defined Character Properties\*(R")
.Sp
[c] Try the \f(CW\*(C`:crlf\*(C'\fR layer (see PerlIO).
.Sp
[d] Avoid \f(CW\*(C`use warnings \*(Aqutf8\*(Aq;\*(C'\fR (or say \f(CW\*(C`no warnings \*(Aqutf8\*(Aq;\*(C'\fR) to allow
U+FFFF (\f(CW\*(C`\ex{FFFF}\*(C'\fR).
.IP "\(bu" 4
Level 2 \- Extended Unicode Support
.Sp
.Vb 6
\&        RL2.1   Canonical Equivalents           \- MISSING       [10][11]
\&        RL2.2   Default Grapheme Clusters       \- MISSING       [12][13]
\&        RL2.3   Default Word Boundaries         \- MISSING       [14]
\&        RL2.4   Default Loose Matches           \- MISSING       [15]
\&        RL2.5   Name Properties                 \- MISSING       [16]
\&        RL2.6   Wildcard Properties             \- MISSING
\&
\&        [10] see UAX#15 "Unicode Normalization Forms"
\&        [11] have Unicode::Normalize but not integrated into regexes
\&        [12] have \eX but at this level . should equal that
\&        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
\&             clusters as a single grapheme cluster.
\&        [14] see UAX#29, Word Boundaries
\&        [15] see UAX#21 "Case Mappings"
\&        [16] have \eN{...} but neither compute names of CJK Ideographs
\&             and Hangul Syllables nor use a loose match [e]
.Ve
.Sp
[e] \f(CW\*(C`\eN{...}\*(C'\fR allows namespaces (see charnames).
.IP "\(bu" 4
Level 3 \- Tailored Support
.Sp
.Vb 11
\&        RL3.1   Tailored Punctuation            \- MISSING
\&        RL3.2   Tailored Grapheme Clusters      \- MISSING       [17][18]
\&        RL3.3   Tailored Word Boundaries        \- MISSING
\&        RL3.4   Tailored Loose Matches          \- MISSING
\&        RL3.5   Tailored Ranges                 \- MISSING
\&        RL3.6   Context Matching                \- MISSING       [19]
\&        RL3.7   Incremental Matches             \- MISSING
\&      ( RL3.8   Unicode Set Sharing )
\&        RL3.9   Possible Match Sets             \- MISSING
\&        RL3.10  Folded Matching                 \- MISSING       [20]
\&        RL3.11  Submatchers                     \- MISSING
\&
\&        [17] see UTS#10 "Unicode Collation Algorithm"
\&        [18] have Unicode::Collate but not integrated into regexes
\&        [19] have (?<=x) and (?=x), but look\-aheads or look\-behinds should see
\&             outside of the target substring
\&        [20] need insensitive matching for linguistic features other than case;
\&             for example, hiragana to katakana, wide and narrow, simplified Han
\&             to traditional Han (see UTR#30 "Character Foldings")
.Ve
.Sh "Unicode Encodings"
.IX Subsection "Unicode Encodings"
Unicode characters are
assigned to \fIcode points\fR, which are abstract
numbers.  To use these numbers, various encodings are needed.
.IP "\(bu" 4
\&\s-1UTF\-8\s0
.Sp
\&\s-1UTF\-8\s0 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding.  For \s-1ASCII\s0 (and we
really do mean 7\-bit \s-1ASCII\s0, not another 8\-bit encoding), \s-1UTF\-8\s0 is
transparent.
.Sp
The following table is from Unicode 3.2.
.Sp
.Vb 1
\& Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
\&
\&   U+0000..U+007F       00..7F
\&   U+0080..U+07FF       C2..DF    80..BF
\&   U+0800..U+0FFF       E0        A0..BF    80..BF
\&   U+1000..U+CFFF       E1..EC    80..BF    80..BF
\&   U+D000..U+D7FF       ED        80..9F    80..BF
\&   U+D800..U+DFFF       ******* ill\-formed *******
\&   U+E000..U+FFFF       EE..EF    80..BF    80..BF
\&  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
\&  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
\& U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
.Ve
.Sp
Note the \f(CW\*(C`A0..BF\*(C'\fR in \f(CW\*(C`U+0800..U+0FFF\*(C'\fR, the \f(CW\*(C`80..9F\*(C'\fR in
\&\f(CW\*(C`U+D000..U+D7FF\*(C'\fR, the \f(CW\*(C`90..BF\*(C'\fR in \f(CW\*(C`U+10000..U+3FFFF\*(C'\fR, and the
\&\f(CW\*(C`80..8F\*(C'\fR in \f(CW\*(C`U+100000..U+10FFFF\*(C'\fR.  The \*(L"gaps\*(R" are caused by legal
\&\s-1UTF\-8\s0 avoiding non-shortest encodings: it is technically possible to
UTF\-8\-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.
So that's what Perl does.
.Sp
Another way to look at it is via bits:
.Sp
.Vb 1
\& Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
\&
\&                    0aaaaaaa     0aaaaaaa
\&            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
\&            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
\&  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
.Ve
.Sp
As you can see, the continuation bytes all begin with \f(CW10\fR, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
.IP "\(bu" 4
UTF-EBCDIC
.Sp
Like \s-1UTF\-8\s0 but EBCDIC-safe, in the way that \s-1UTF\-8\s0 is ASCII-safe.
.IP "\(bu" 4
\&\s-1UTF\-16\s0, \s-1UTF\-16BE\s0, \s-1UTF\-16LE\s0, Surrogates, and BOMs (Byte Order Marks)
.Sp
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
.Sp
\&\s-1UTF\-16\s0 is a 2 or 4 byte encoding.  The Unicode code points
\&\f(CW\*(C`U+0000..U+FFFF\*(C'\fR are stored in a single 16\-bit unit, and the code
points \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR in two 16\-bit units.  The latter case
uses \fIsurrogates\fR, the first 16\-bit unit being the \fIhigh
surrogate\fR, and the second being the \fIlow surrogate\fR.
.Sp
Surrogates are code points set aside to encode the \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR
range of Unicode code points in pairs of 16\-bit units.  The \fIhigh
surrogates\fR are the range \f(CW\*(C`U+D800..U+DBFF\*(C'\fR, and the \fIlow surrogates\fR
are the range \f(CW\*(C`U+DC00..U+DFFF\*(C'\fR.
The surrogate encoding is
.Sp
.Vb 2
\&        $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
\&        $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
.Ve
.Sp
and the decoding is
.Sp
.Vb 1
\&        $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
.Ve
.Sp
If you try to generate surrogates (for example by using \fIchr()\fR), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.
.Sp
Because of the 16\-bitness, \s-1UTF\-16\s0 is byte-order dependent.  \s-1UTF\-16\s0
itself can be used for in-memory computations, but if storage or
transfer is required, either the \s-1UTF\-16BE\s0 (big-endian) or the \s-1UTF\-16LE\s0
(little-endian) encoding must be chosen.
.Sp
This introduces another problem: what if you just know that your data
is \s-1UTF\-16\s0, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point \f(CW\*(C`U+FEFF\*(C'\fR is the \s-1BOM\s0.
.Sp
The trick is that if you read a \s-1BOM\s0, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes \f(CW\*(C`0xFE 0xFF\*(C'\fR, but if it was written on a little-endian platform,
you will read the bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR.
(And if the originating platform
was writing in \s-1UTF\-8\s0, you will read the bytes \f(CW\*(C`0xEF 0xBB 0xBF\*(C'\fR.)
.Sp
The way this trick works is that the character with the code point
\&\f(CW\*(C`U+FFFE\*(C'\fR is guaranteed not to be a valid Unicode character, so the
sequence of bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR is unambiguously \*(L"\s-1BOM\s0, represented in
little-endian format\*(R" and cannot be \*(L"\f(CW\*(C`U+FFFE\*(C'\fR, represented in big-endian
format\*(R".
.IP "\(bu" 4
\&\s-1UTF\-32\s0, \s-1UTF\-32BE\s0, \s-1UTF\-32LE\s0
.Sp
The \s-1UTF\-32\s0 family is pretty much like the \s-1UTF\-16\s0 family, except that
the units are 32\-bit, and therefore the surrogate scheme is not
needed.  The \s-1BOM\s0 signatures will be \f(CW\*(C`0x00 0x00 0xFE 0xFF\*(C'\fR for \s-1BE\s0 and
\&\f(CW\*(C`0xFF 0xFE 0x00 0x00\*(C'\fR for \s-1LE\s0.
.IP "\(bu" 4
\&\s-1UCS\-2\s0, \s-1UCS\-4\s0
.Sp
Encodings defined by the \s-1ISO\s0 10646 standard.  \s-1UCS\-2\s0 is a 16\-bit
encoding.  Unlike \s-1UTF\-16\s0, \s-1UCS\-2\s0 is not extensible beyond \f(CW\*(C`U+FFFF\*(C'\fR,
because it does not use surrogates.  \s-1UCS\-4\s0 is a 32\-bit encoding,
functionally identical to \s-1UTF\-32\s0.
.IP "\(bu" 4
\&\s-1UTF\-7\s0
.Sp
A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by \s-1RFC\s0 2152.
.Sh "Security Implications of Unicode"
.IX Subsection "Security Implications of Unicode"
.IP "\(bu" 4
Malformed \s-1UTF\-8\s0
.Sp
Unfortunately, the specification of \s-1UTF\-8\s0 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of \s-1UTF\-8\s0 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a \s-1UTF\-8\s0 connection.
Perl always generates the
shortest length \s-1UTF\-8\s0, and with warnings on Perl will warn about
non-shortest length \s-1UTF\-8\s0 along with other malformations, such as the
surrogates, which are not valid Unicode characters.
.IP "\(bu" 4
Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the \*(L"word character\*(R" character
class \f(CW\*(C`\ew\*(C'\fR will work differently depending on whether the data is
eight-bit bytes or Unicode.
.Sp
In the first case, the set of \f(CW\*(C`\ew\*(C'\fR characters is either small (the
default set of alphabetic characters, digits, and the \*(L"_\*(R") or, if you
are using a locale (see perllocale), the \f(CW\*(C`\ew\*(C'\fR might contain a few
more letters according to your language and country.
.Sp
In the second case, the \f(CW\*(C`\ew\*(C'\fR set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters \fIsomewhere\fR as \f(CW\*(C`\ew\*(C'\fR.  For example, your
locale might not think that \s-1LATIN\s0 \s-1SMALL\s0 \s-1LETTER\s0 \s-1ETH\s0 is a letter (unless
you happen to speak Icelandic), but Unicode does.
.Sp
As discussed elsewhere, Perl has one foot (two hooves?) planted in
