<H2><A NAME="Trigraphs">Trigraphs</A></H2>
<P>A <B>trigraph</B> is a sequence of three characters that begins with two question marks (<CODE>??</CODE>). You use trigraphs to write C source files with a character set that does not contain convenient graphic representations for some punctuation characters. (The resultant C source file is not necessarily more readable, but it is unambiguous.)</P>
<P>The list of all <B><A NAME="defined trigraphs">defined trigraphs</A></B> is:</P>
<PRE>
<B>Character</B>   <B>Trigraph</B>
    [          ??(
    \          ??/
    ]          ??)
    ^          ??'
    {          ??&lt;
    |          ??!
    }          ??&gt;
    ~          ??-
    #          ??=
</PRE>
<P>These are the only trigraphs. The translator does not alter any other sequence that begins with two question marks.</P>
<P>For example, the expression statements:</P>
<PRE>
    printf("Case ??=3 is done??/n");
    printf("You said what????/n");
</PRE>
<P>are equivalent to:</P>
<PRE>
    printf("Case #3 is done\n");
    printf("You said what??\n");
</PRE>
<P>The translator replaces each trigraph with its equivalent single-character representation in an early <A HREF="preproc.html#Phases of Translation">phase of translation</A>. You can always treat a trigraph as a single source character.</P>
<H2><A NAME="Multibyte Characters">Multibyte Characters</A></H2>
<P>A source character set or target character set can also contain <B>multibyte characters</B> (sequences of one or more bytes). Each sequence represents a single character in the <B><A NAME="extended character set">extended character set</A></B>. You use multibyte characters to represent large sets of characters, such as Kanji.
A multibyte character can be a one-byte sequence that is a character from the <A HREF="#basic C character set">basic C character set</A>, an additional one-byte sequence that is implementation defined, or an additional sequence of two or more bytes that is implementation defined.</P>
<P>Any multibyte encoding that contains sequences of two or more bytes depends, for its interpretation between bytes, on a <B><A NAME="conversion state">conversion state</A></B> determined by bytes earlier in the sequence of characters. In the <B><A NAME="initial conversion state">initial conversion state</A></B>, if the byte immediately following matches one of the characters in the basic C character set, the byte must represent that character.</P>
<P>For example, the <B><A NAME="EUC encoding">EUC encoding</A></B> is a superset of ASCII. A byte value in the interval [0xA1, 0xFE] is the first of a two-byte sequence (whose second byte value is in the interval [0x80, 0xFF]). All other byte values are one-byte sequences. Since all members of the <A HREF="#basic C character set">basic C character set</A> have byte values in the range [0x00, 0x7F] in ASCII, EUC meets the requirements for a multibyte encoding in Standard C. Such a sequence is <I>not</I> in the initial conversion state immediately after a byte value in the interval [0xA1, 0xFE]. It is ill-formed if a second byte value is not in the interval [0x80, 0xFF].</P>
<P>Multibyte characters can also have a <B><A NAME="state-dependent encoding">state-dependent encoding</A></B>. How you interpret a byte in such an encoding depends on a conversion state that involves both a <B><A NAME="parse state">parse state</A></B>, as before, and a <B><A NAME="shift state">shift state</A></B>, determined by bytes earlier in the sequence of characters. The <B><A NAME="initial shift state">initial shift state</A></B>, at the beginning of a new multibyte character, is also the initial conversion state.
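</P>
<P>As a rough sketch of the parse state mentioned here, the following C function applies the simplified EUC rules given above to a byte string and counts the characters it contains. The name <CODE>euc_count</CODE>, its signature, and its return convention are assumptions made for this illustration. They are not part of Standard C or of any library, and the sketch assumes 8-bit bytes.</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): count the characters in
   the n bytes at s, using only the simplified EUC rules from the text.
   Returns -1 if the bytes are ill-formed or end partway through a
   two-byte sequence (that is, not in the initial conversion state). */
long euc_count(const unsigned char *s, size_t n)
{
    long count = 0;
    int need_second = 0;    /* parse state: 0 is the initial conversion state */
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; ++i) {
        unsigned char b = s[i];

        if (need_second) {
            if (b < 0x80)
                return -1;  /* second byte must lie in [0x80, 0xFF] */
            need_second = 0;
            ++count;        /* two-byte sequence is complete */
        } else if (0xA1 <= b && b <= 0xFE) {
            need_second = 1;    /* first byte of a two-byte sequence */
        } else {
            ++count;        /* every other byte is a one-byte sequence */
        }
    }
    return need_second ? -1 : count;
}
```

<P>Under these assumptions, a call such as <CODE>euc_count((const unsigned char *)"abc", 3)</CODE> reports three characters, while a string that stops right after a byte in [0xA1, 0xFE] is rejected because it does not end in the initial conversion state.</P>
<P>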
A subsequent <B><A NAME="shift sequence">shift sequence</A></B> can determine an <B><A NAME="alternate shift state">alternate shift state</A></B>, after which all byte sequences (including one-byte sequences) can have a different interpretation. A byte containing the value zero, however, always represents the <A HREF="#null character">null character</A>. It cannot occur as any of the bytes of another multibyte character.</P>
<P>For example, the <B><A NAME="JIS encoding">JIS encoding</A></B> is another superset of ASCII. In the initial shift state, each byte represents a single character, except for two three-byte shift sequences:</P>
<UL>
<LI>The three-byte sequence <CODE>"\x1B$B"</CODE> shifts to two-byte mode. Subsequently, two successive bytes (both with values in the range [0x21, 0x7E]) constitute a single multibyte character.</LI>
<LI>The three-byte sequence <CODE>"\x1B(B"</CODE> shifts back to the initial shift state.</LI>
</UL>
<P>JIS also meets the requirements for a multibyte encoding in Standard C. Such a sequence is <I>not</I> in the initial conversion state when partway through a three-byte shift sequence or when in two-byte mode.</P>
<P>(<A HREF="lib_over.html#Amendment 1">Amendment 1</A> adds the type <A HREF="wchar.html#mbstate_t"><CODE>mbstate_t</CODE></A>, which describes an object that can store a conversion state. It also relaxes the above rules for <A HREF="lib_file.html#generalized multibyte characters">generalized multibyte characters</A>, which describe the encoding rules for a broad range of <A HREF="lib_file.html#wide stream">wide streams</A>.)</P>
<P>You can write multibyte characters in C source text as part of a comment, a character constant, a string literal, or a file name in an <A HREF="preproc.html#include directive"><I>include</I> directive</A>. How such characters print is implementation defined.
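</P>
<P>The shift states of the JIS encoding described earlier can be sketched in the same spirit. The C function below tracks the shift state across a byte string and accepts it only if it ends in the initial shift state. The names <CODE>jis_count</CODE> and <CODE>in_jis_range</CODE> are hypothetical, and the function implements only the two shift sequences named in the text. As a further assumption, an escape byte that does not start one of those sequences is treated as an ordinary one-byte character in the initial shift state.</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is c a valid byte of a two-byte JIS character? */
static int in_jis_range(char c)
{
    unsigned char u = (unsigned char)c;
    return 0x21 <= u && u <= 0x7E;
}

/* Hypothetical helper (not a library function): count the characters in
   the n bytes at s, starting in the initial shift state. Returns -1 if
   a two-byte character is malformed or if the string does not end in
   the initial shift state. */
long jis_count(const char *s, size_t n)
{
    int two_byte = 0;   /* shift state: 0 is the initial shift state */
    long count = 0;
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        if (n - i >= 3 && memcmp(s + i, "\x1B$B", 3) == 0) {
            two_byte = 1;           /* shift to two-byte mode */
            i += 3;
        } else if (n - i >= 3 && memcmp(s + i, "\x1B(B", 3) == 0) {
            two_byte = 0;           /* shift back to the initial state */
            i += 3;
        } else if (two_byte) {
            if (n - i < 2 || !in_jis_range(s[i]) || !in_jis_range(s[i + 1]))
                return -1;          /* incomplete or out-of-range pair */
            ++count;
            i += 2;
        } else {
            ++count;                /* one byte, one character */
            i += 1;
        }
    }
    return two_byte ? -1 : count;
}
```

<P>With these assumptions, the ten bytes of <CODE>"a\x1B$B0!\x1B(Bz"</CODE> hold three characters, while a string that is still in two-byte mode at its end is rejected.</P>
<P>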
Each sequence of multibyte characters that you write must begin and end in the initial shift state. The program can also include multibyte characters in <A HREF="#null-terminated string">null-terminated</A> <A HREF="lib_over.html#C string">C strings</A> used by several library functions, including the <A HREF="lib_prin.html#format string">format strings</A> for <A HREF="stdio.html#printf"><CODE>printf</CODE></A> and <A HREF="stdio.html#scanf"><CODE>scanf</CODE></A>. Each such character string must begin and end in the initial shift state.</P>
<H3><A NAME="Wide-Character Encoding">Wide-Character Encoding</A></H3>
<P>Each character in the extended character set also has an integer representation, called a <B>wide-character encoding</B>. Each extended character has a unique wide-character value. The value zero always corresponds to the <B><A NAME="null wide character">null wide character</A></B>. The type definition <A HREF="stddef.html#wchar_t"><CODE>wchar_t</CODE></A> specifies the integer type that represents wide characters.</P>
<P>You write a <B><A NAME="wide-character constant">wide-character constant</A></B> as <CODE>L'mbc'</CODE>, where <CODE>mbc</CODE> represents a single multibyte character. You write a <B><A NAME="wide-character string literal">wide-character string literal</A></B> as <CODE>L"mbs"</CODE>, where <CODE>mbs</CODE> represents a sequence of zero or more multibyte characters. The wide-character string literal <CODE>L"xyz"</CODE> becomes a sequence of wide-character constants stored in successive array elements in memory, followed by a null wide character:<BR><CODE>{L'x', L'y', L'z', L'\0'}</CODE></P>
<P>The following library functions help you convert between the multibyte and wide-character representations of extended characters: <A HREF="wchar.html#btowc"><CODE>btowc</CODE></A>, <A HREF="stdlib.html#mblen"><CODE>mblen</CODE></A>, <A HREF="wchar.html#mbrlen"><CODE>mbrlen</CODE></A>, <A HREF="wchar.html#mbrtowc"><CODE>mbrtowc</CODE></A>, <A HREF="wchar.html#mbsrtowcs"><CODE>mbsrtowcs</CODE></A>, <A
HREF="stdlib.html#mbstowcs"><CODE>mbstowcs</CODE></A>, <A HREF="stdlib.html#mbtowc"><CODE>mbtowc</CODE></A>, <A HREF="wchar.html#wcrtomb"><CODE>wcrtomb</CODE></A>, <A HREF="wchar.html#wcsrtombs"><CODE>wcsrtombs</CODE></A>, <A HREF="stdlib.html#wcstombs"><CODE>wcstombs</CODE></A>, <A HREF="wchar.html#wctob"><CODE>wctob</CODE></A>, and <A HREF="stdlib.html#wctomb"><CODE>wctomb</CODE></A>.</P>
<P>The macro <A HREF="limits.html#MB_LEN_MAX"><CODE>MB_LEN_MAX</CODE></A> specifies the length of the longest possible multibyte sequence required to represent a single character defined by the implementation across supported locales. And the macro <A HREF="stdlib.html#MB_CUR_MAX"><CODE>MB_CUR_MAX</CODE></A> specifies the length of the longest possible multibyte sequence required to represent a single character defined for the current <A HREF="locale.html#locale">locale</A>.</P>
<P>For example, the <A HREF="#string literal">string literal</A> <CODE>"hello"</CODE> becomes an array of six <I>char:</I></P>
<PRE>
    {'h', 'e', 'l', 'l', 'o', 0}
</PRE>
<P>while the wide-character string literal <CODE>L"hello"</CODE> becomes an array of six integers of type <A HREF="stddef.html#wchar_t"><CODE>wchar_t</CODE></A>:</P>
<PRE>
    {L'h', L'e', L'l', L'l', L'o', 0}
</PRE>
<HR>
<P>See also the <B><A HREF="index.html#Table of Contents">Table of Contents</A></B> and the <B><A HREF="_index.html">Index</A></B>.</P>
<P><I><A HREF="crit_pb.html">Copyright</A> © 1989-2002 by P.J. Plauger and Jim Brodie. All rights reserved.</I></P>
<!--V4.01:1125--></BODY></HTML>