.\" Automatically generated by Pod::Man 2.16 (Pod::Simple 3.05)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.el \{\
.    de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "PERLUNIINTRO 1"
.TH PERLUNIINTRO 1 "2007-12-18" "perl v5.10.0" "Perl Programmers Reference Guide"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
perluniintro \- Perl Unicode introduction
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This document gives a general idea of Unicode and how to use Unicode
in Perl.
.Sh "Unicode"
.IX Subsection "Unicode"
Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.
.PP
Unicode and \s-1ISO/IEC\s0 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
.PP
A Unicode \fIcharacter\fR is an abstract entity. It is not bound to any
particular integer width, especially not to the C language \f(CW\*(C`char\*(C'\fR.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
.PP
Unicode defines characters like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR or
\&\f(CW\*(C`GREEK SMALL LETTER ALPHA\*(C'\fR and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
\&\fIcode points\fR.
.PP
The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like \f(CW0x0041\fR are unfamiliar to you, take a peek
at a later section, \*(L"Hexadecimal Notation\*(R".
The Unicode standard
uses the notation \f(CW\*(C`U+0041 LATIN CAPITAL LETTER A\*(C'\fR, to give the
hexadecimal code point and the normative name of the character.
.PP
Unicode also defines various \fIproperties\fR for the characters, like
\&\*(L"uppercase\*(R" or \*(L"lowercase\*(R", \*(L"decimal digit\*(R", or \*(L"punctuation\*(R";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
.PP
A Unicode character consists either of a single code point, or a
\&\fIbase character\fR (like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR), followed by one or
more \fImodifiers\fR (like \f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR). This sequence of
base character and modifiers is called a \fIcombining character sequence\fR.
.PP
Whether to call these combining character sequences \*(L"characters\*(R"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or \*(L"character\*(R". The whole sequence could be seen as one \*(L"character\*(R",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.
.PP
With this \*(L"whole sequence\*(R" view of characters, the total number of
characters is open-ended. But in the programmer's \*(L"one unit is one
character\*(R" point of view, the concept of \*(L"characters\*(R" is more
deterministic. In this document, we take that second point of view:
one \*(L"character\*(R" is one Unicode code point, be it a base character or
a combining character.
.PP
For some combinations, there are \fIprecomposed\fR characters.
\&\f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR, for example, is defined as
a single code point. These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the \s-1ISO\s0 8859).
In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various
\&\fInormalization forms\fR to standardize representations are also defined.
.PP
Because of backward compatibility with legacy encodings, the \*(L"a unique
number for every character\*(R" idea breaks down a bit: instead, there is
\&\*(L"at least one number for every character\*(R". The same character could
be represented differently in several legacy encodings. The
converse is also not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
.PP
A common myth about Unicode is that it would be \*(L"16\-bit\*(R", that is,
Unicode is only represented as \f(CW0x10000\fR (or 65536) characters from
\&\f(CW0x0000\fR to \f(CW0xFFFF\fR. \fBThis is untrue.\fR Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (\f(CW0x10FFFF\fR),
and since Unicode 3.1 (March 2001), characters have been defined
beyond \f(CW0xFFFF\fR. The first \f(CW0x10000\fR characters are called the
\&\fIPlane 0\fR, or the \fIBasic Multilingual Plane\fR (\s-1BMP\s0). With Unicode
3.1, 17 (yes, seventeen) planes in all were defined\*(--but they are
nowhere near full of defined characters, yet.
.PP
Another myth is that the 256\-character blocks have something to
do with languages\*(--that each block would define the characters used
by a language or a set of languages. \fBThis is also untrue.\fR
The division into blocks exists, but it is almost completely
accidental\*(--an artifact of how the characters have been and
still are allocated. Instead, there is a concept called \fIscripts\fR,
which is more useful: there is \f(CW\*(C`Latin\*(C'\fR script, \f(CW\*(C`Greek\*(C'\fR script, and
so on.
Scripts usually span varied parts of several blocks.
For further information see Unicode::UCD.
.PP
The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be \fIencoded\fR or
\&\fIserialised\fR somehow. Unicode defines several
\&\fIcharacter encoding forms\fR, of which \fI\s-1UTF\-8\s0\fR is perhaps the most popular. \s-1UTF\-8\s0 is a
variable length encoding that encodes Unicode characters as 1 to 6
bytes (only 4 with the currently defined characters). Other encodings
include \s-1UTF\-16\s0 and \s-1UTF\-32\s0 and their big\- and little-endian variants
(\s-1UTF\-8\s0 is byte-order independent). The \s-1ISO/IEC\s0 10646 standard defines the \s-1UCS\-2\s0
and \s-1UCS\-4\s0 encoding forms.
.PP
For more information about encodings\*(--for instance, to learn what
\&\fIsurrogates\fR and \fIbyte order marks\fR (BOMs) are\*(--see perlunicode.
.Sh "Perl's Unicode Support"
.IX Subsection "Perl's Unicode Support"
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
.PP
\&\fBStarting from Perl 5.8.0, the use of \f(CB\*(C`use utf8\*(C'\fB is no longer
necessary.\fR In earlier releases the \f(CW\*(C`utf8\*(C'\fR pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the \*(L"Unicodeness\*(R"
is now carried with the data, instead of being attached to the
operations. Only one case remains where an explicit \f(CW\*(C`use utf8\*(C'\fR is
needed: if your Perl script itself is encoded in \s-1UTF\-8\s0, you can use
\&\s-1UTF\-8\s0 in your identifier names, and in string and regular expression
literals, by saying \f(CW\*(C`use utf8\*(C'\fR.
This is not the default because
scripts with legacy 8\-bit data in them would break. See utf8.
.Sh "Perl's Unicode Model"
.IX Subsection "Perl's Unicode Model"
Perl supports both pre\-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode.
.PP
Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin\-1) is, defaulting to
\&\s-1UTF\-8\s0, to encode Unicode strings. Specifically, if all code points in
the string are \f(CW0xFF\fR or less, Perl uses the native eight-bit
character set. Otherwise, it uses \s-1UTF\-8\s0.
.PP
A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer \*(-- one with
the \*(L"default\*(R" encoding. In such a case, the raw bytes used internally
(the native character set or \s-1UTF\-8\s0, as appropriate for each string)
will be used, and a \*(L"Wide character\*(R" warning will be issued if those
strings contain a character beyond 0x00FF.
.PP
For example,
.PP
.Vb 1
\& perl \-e \*(Aqprint "\ex{DF}\en", "\ex{0100}\ex{DF}\en"\*(Aq
.Ve
.PP
produces a fairly useless mixture of native bytes and \s-1UTF\-8\s0, as well
as a warning:
.PP
.Vb 1
\& Wide character in print at ...
.Ve
.PP
To output \s-1UTF\-8\s0, use the \f(CW\*(C`:encoding\*(C'\fR or \f(CW\*(C`:utf8\*(C'\fR output layer. Prepending
Prepending.PP.Vb 1\& binmode(STDOUT, ":utf8");.Ve.PPto this sample program ensures that the output is completely \s-1UTF\-8\s0,and removes the program's warning..PPYou can enable automatic UTF\-8\-ification of your standard filehandles, default \f(CW\*(C`open()\*(C'\fR layer, and \f(CW@ARGV\fR by using eitherthe \f(CW\*(C`\-C\*(C'\fR command line switch or the \f(CW\*(C`PERL_UNICODE\*(C'\fR environmentvariable, see perlrun for the documentation of the \f(CW\*(C`\-C\*(C'\fR switch..PPNote that this means that Perl expects other software to work, too:if Perl has been led to believe that \s-1STDIN\s0 should be \s-1UTF\-8\s0, but then\&\s-1STDIN\s0 coming in from another command is not \s-1UTF\-8\s0, Perl will complainabout the malformed \s-1UTF\-8\s0..PPAll features that combine Unicode and I/O also require using the newPerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:you can see whether yours is by running \*(L"perl \-V\*(R" and looking for\&\f(CW\*(C`useperlio=define\*(C'\fR..Sh "Unicode and \s-1EBCDIC\s0".IX Subsection "Unicode and EBCDIC"Perl 5.8.0 also supports Unicode on \s-1EBCDIC\s0 platforms. There,Unicode support is somewhat more complex to implement sinceadditional conversions are needed at every step. Some problemsremain, see perlebcdic for details..PPIn any case, the Unicode support on \s-1EBCDIC\s0 platforms is better thanin the 5.6 series, which didn't work much at all for \s-1EBCDIC\s0 platform.On \s-1EBCDIC\s0 platforms, the internal Unicode encoding form is UTF-EBCDICinstead of \s-1UTF\-8\s0. The difference is that as \s-1UTF\-8\s0 is \*(L"ASCII-safe\*(R" inthat \s-1ASCII\s0 characters encode to \s-1UTF\-8\s0 as-is, while UTF-EBCDIC is