.IP "\(bu" 4
The \f(CW\*(C`pack()\*(C'\fR/\f(CW\*(C`unpack()\*(C'\fR letter \f(CW\*(C`C\*(C'\fR does \fInot\fR change,
since it is often used for byte-oriented formats. Again, think \f(CW\*(C`char\*(C'\fR
in the C language.
.Sp
There is a new \f(CW\*(C`U\*(C'\fR specifier that converts between Unicode characters
and code points. There is also a \f(CW\*(C`W\*(C'\fR specifier that is the equivalent of
\&\f(CW\*(C`chr\*(C'\fR/\f(CW\*(C`ord\*(C'\fR and properly handles character values even if they are above 255.
.IP "\(bu" 4
The \f(CW\*(C`chr()\*(C'\fR and \f(CW\*(C`ord()\*(C'\fR functions work on characters, similar to
\&\f(CW\*(C`pack("W")\*(C'\fR and \f(CW\*(C`unpack("W")\*(C'\fR, \fInot\fR \f(CW\*(C`pack("C")\*(C'\fR and
\&\f(CW\*(C`unpack("C")\*(C'\fR. \f(CW\*(C`pack("C")\*(C'\fR and \f(CW\*(C`unpack("C")\*(C'\fR are methods for
emulating byte-oriented \f(CW\*(C`chr()\*(C'\fR and \f(CW\*(C`ord()\*(C'\fR on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.
.IP "\(bu" 4
The bit string operators, \f(CW\*(C`& | ^ ~\*(C'\fR, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use \f(CW\*(C`~\*(C'\fR (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (\f(CW\*(C`~($x|$y) eq ~$x&~$y\*(C'\fR and \f(CW\*(C`~($x&$y) eq ~$x|~$y\*(C'\fR)
will not hold.
The reason for this mathematical \fIfaux pas\fR is that
the complement cannot return \fBboth\fR the 8\-bit (byte-wide) bit
complement \fBand\fR the full character-wide bit complement.
.IP "\(bu" 4
\&\fIlc()\fR, \fIuc()\fR, \fIlcfirst()\fR, and \fIucfirst()\fR work for the following cases:
.RS 4
.IP "\(bu" 8
the case mapping is from a single Unicode character to another
single Unicode character, or
.IP "\(bu" 8
the case mapping is from a single Unicode character to more
than one Unicode character.
.RE
.RS 4
.Sp
Things to do with locales (Lithuanian, Turkish, Azeri) do \fBnot\fR work
since Perl does not understand the concept of Unicode locales.
.Sp
See the Unicode Technical Report #21, Case Mappings, for more details.
.Sp
But you can also define your own mappings to be used in the \fIlc()\fR,
\&\fIlcfirst()\fR, \fIuc()\fR, and \fIucfirst()\fR (or their string-inlined versions).
.Sp
See \*(L"User-Defined Case Mappings\*(R" for more details.
.RE
.IP "\(bu" 4
And finally, \f(CW\*(C`scalar reverse()\*(C'\fR reverses by character rather than by byte.
.Sh "Unicode Character Properties"
.IX Subsection "Unicode Character Properties"
Named Unicode properties, scripts, and block ranges may be used like
character classes via the \f(CW\*(C`\ep{}\*(C'\fR \*(L"matches property\*(R" construct and
the \f(CW\*(C`\eP{}\*(C'\fR negation, \*(L"doesn't match property\*(R".
.PP
For instance, \f(CW\*(C`\ep{Lu}\*(C'\fR matches any character with the Unicode \*(L"Lu\*(R"
(Letter, uppercase) property, while \f(CW\*(C`\ep{M}\*(C'\fR matches any character
with an \*(L"M\*(R" (mark\*(--accents and such) property. Brackets are not
required for single letter properties, so \f(CW\*(C`\ep{M}\*(C'\fR is equivalent to
\&\f(CW\*(C`\epM\*(C'\fR. Many predefined properties are available, such as
\&\f(CW\*(C`\ep{Mirrored}\*(C'\fR and \f(CW\*(C`\ep{Tibetan}\*(C'\fR.
.PP
The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant.
It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words \*(L"uppercase-first-lowercase-rest\*(R". \f(CW\*(C`Latin\-1 Supplement\*(C'\fR thus
becomes \f(CW\*(C`Latin1Supplement\*(C'\fR.
.PP
You can also use negation in both \f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR by introducing a caret
(^) between the first brace and the property name: \f(CW\*(C`\ep{^Tamil}\*(C'\fR is
equal to \f(CW\*(C`\eP{Tamil}\*(C'\fR.
.PP
\&\fB\s-1NOTE:\s0 the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.\fR
.IP "General Category" 4
.IX Item "General Category"
Here are the basic Unicode General Category properties, followed by their
long form. You can use either; \f(CW\*(C`\ep{Lu}\*(C'\fR and \f(CW\*(C`\ep{UppercaseLetter}\*(C'\fR,
for instance, are identical.
.Sp
.Vb 1
\&    Short       Long
\&
\&    L           Letter
\&    LC          CasedLetter
\&    Lu          UppercaseLetter
\&    Ll          LowercaseLetter
\&    Lt          TitlecaseLetter
\&    Lm          ModifierLetter
\&    Lo          OtherLetter
\&
\&    M           Mark
\&    Mn          NonspacingMark
\&    Mc          SpacingMark
\&    Me          EnclosingMark
\&
\&    N           Number
\&    Nd          DecimalNumber
\&    Nl          LetterNumber
\&    No          OtherNumber
\&
\&    P           Punctuation
\&    Pc          ConnectorPunctuation
\&    Pd          DashPunctuation
\&    Ps          OpenPunctuation
\&    Pe          ClosePunctuation
\&    Pi          InitialPunctuation
\&                (may behave like Ps or Pe depending on usage)
\&    Pf          FinalPunctuation
\&                (may behave like Ps or Pe depending on usage)
\&    Po          OtherPunctuation
\&
\&    S           Symbol
\&    Sm          MathSymbol
\&    Sc          CurrencySymbol
\&    Sk          ModifierSymbol
\&    So          OtherSymbol
\&
\&    Z           Separator
\&    Zs          SpaceSeparator
\&    Zl          LineSeparator
\&    Zp          ParagraphSeparator
\&
\&    C           Other
\&    Cc          Control
\&    Cf          Format
\&    Cs          Surrogate   (not usable)
\&    Co          PrivateUse
\&    Cn          Unassigned
.Ve
.Sp
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
\&\f(CW\*(C`LC\*(C'\fR and \f(CW\*(C`L&\*(C'\fR are special cases, which are aliases for the set
of
\&\f(CW\*(C`Ll\*(C'\fR, \f(CW\*(C`Lu\*(C'\fR, and \f(CW\*(C`Lt\*(C'\fR.
.Sp
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. \f(CW\*(C`Cs\*(C'\fR is therefore not
supported.
.IP "Bidirectional Character Types" 4
.IX Item "Bidirectional Character Types"
Because scripts differ in their directionality\*(--Hebrew is
written right to left, for example\*(--Unicode supplies these properties in
the BidiClass class:
.Sp
.Vb 1
\&    Property    Meaning
\&
\&    L           Left\-to\-Right
\&    LRE         Left\-to\-Right Embedding
\&    LRO         Left\-to\-Right Override
\&    R           Right\-to\-Left
\&    AL          Right\-to\-Left Arabic
\&    RLE         Right\-to\-Left Embedding
\&    RLO         Right\-to\-Left Override
\&    PDF         Pop Directional Format
\&    EN          European Number
\&    ES          European Number Separator
\&    ET          European Number Terminator
\&    AN          Arabic Number
\&    CS          Common Number Separator
\&    NSM         Non\-Spacing Mark
\&    BN          Boundary Neutral
\&    B           Paragraph Separator
\&    S           Segment Separator
\&    WS          Whitespace
\&    ON          Other Neutrals
.Ve
.Sp
For example, \f(CW\*(C`\ep{BidiClass:R}\*(C'\fR matches characters that are normally
written right to left.
.IP "Scripts" 4
.IX Item "Scripts"
The script names which can be used by \f(CW\*(C`\ep{...}\*(C'\fR and \f(CW\*(C`\eP{...}\*(C'\fR,
such as in \f(CW\*(C`\ep{Latin}\*(C'\fR or \f(CW\*(C`\ep{Cyrillic}\*(C'\fR, are as follows:
.Sp
.Vb 10
\&    Arabic
\&    Armenian
\&    Balinese
\&    Bengali
\&    Bopomofo
\&    Braille
\&    Buginese
\&    Buhid
\&    CanadianAboriginal
\&    Cherokee
\&    Coptic
\&    Cuneiform
\&    Cypriot
\&    Cyrillic
\&    Deseret
\&    Devanagari
\&    Ethiopic
\&    Georgian
\&    Glagolitic
\&    Gothic
\&    Greek
\&    Gujarati
\&    Gurmukhi
\&    Han
\&    Hangul
\&    Hanunoo
\&    Hebrew
\&    Hiragana
\&    Inherited
\&    Kannada
\&    Katakana
\&    Kharoshthi
\&    Khmer
\&    Lao
\&    Latin
\&    Limbu
\&    LinearB
\&    Malayalam
\&    Mongolian
\&    Myanmar
\&    NewTaiLue
\&    Nko
\&    Ogham
\&    OldItalic
\&    OldPersian
\&    Oriya
\&    Osmanya
\&    PhagsPa
\&    Phoenician
\&    Runic
\&    Shavian
\&    Sinhala
\&    SylotiNagri
\&    Syriac
\&    Tagalog
\&    Tagbanwa
\&    TaiLe
\&    Tamil
\&    Telugu
\&    Thaana
\&    Thai
\&    Tibetan
\&    Tifinagh
\&    Ugaritic
\&    Yi
.Ve
.IP "Extended property classes" 4
.IX Item "Extended property classes"
Extended property classes can supplement the basic
properties, defined by the \fIPropList\fR Unicode database:
.Sp
.Vb 10
\&    ASCIIHexDigit
\&    BidiControl
\&    Dash
\&    Deprecated
\&    Diacritic
\&    Extender
\&    HexDigit
\&    Hyphen
\&    Ideographic
\&    IDSBinaryOperator
\&    IDSTrinaryOperator
\&    JoinControl
\&    LogicalOrderException
\&    NoncharacterCodePoint
\&    OtherAlphabetic
\&    OtherDefaultIgnorableCodePoint
\&    OtherGraphemeExtend
\&    OtherIDStart
\&    OtherIDContinue
\&    OtherLowercase
\&    OtherMath
\&    OtherUppercase
\&    PatternSyntax
\&    PatternWhiteSpace
\&    QuotationMark
\&    Radical
\&    SoftDotted
\&    STerm
\&    TerminalPunctuation
\&    UnifiedIdeograph
\&    VariationSelector
\&    WhiteSpace
.Ve
.Sp
and there are further derived properties:
.Sp
.Vb 4
\&    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
\&    Lowercase   =  Ll + OtherLowercase
\&    Uppercase   =  Lu + OtherUppercase
\&    Math        =  Sm + OtherMath
\&
\&    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
\&    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
\&
\&    DefaultIgnorableCodePoint
\&                =  OtherDefaultIgnorableCodePoint
\&                   + Cf + Cc + Cs + Noncharacters + VariationSelector
\&                   \- WhiteSpace \- FFF9..FFFB (Annotation Characters)
\&
\&    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
\&    Assigned    =  Any non\-Cn code points (i.e. synonym for \eP{Cn})
\&    Unassigned  =  Synonym for \ep{Cn}
\&    ASCII       =  ASCII (i.e. U+0000 to U+007F)
\&
\&    Common      =  Any character (or unassigned code point)
\&                   not explicitly assigned to a script
.Ve
.ie n .IP "Use of ""Is"" Prefix" 4
.el .IP "Use of ``Is'' Prefix" 4
.IX Item "Use of Is Prefix"
For backward compatibility (with Perl 5.6), all properties mentioned
so far may have \f(CW\*(C`Is\*(C'\fR prepended to their name, so \f(CW\*(C`\eP{IsLu}\*(C'\fR, for
example, is equal to \f(CW\*(C`\eP{Lu}\*(C'\fR.
.IP "Blocks" 4
.IX Item "Blocks"
In addition to \fBscripts\fR, Unicode also defines \fBblocks\fR of
characters.
The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept