<HTML><HEAD><TITLE>2.13 Defining A Code Conversion Facet</TITLE></HEAD><BODY><A HREF="ug2.htm"><IMG SRC="images/banner.gif"></A><BR><A HREF="cre_2288.htm"><IMG SRC="images/prev.gif"></A><A HREF="booktoc2.htm"><IMG SRC="images/toc.gif"></A><A HREF="dif_2395.htm"><IMG SRC="images/next.gif"></A><BR><STRONG>Click on the banner to return to the user guide home page.</STRONG><H2>2.13 Defining A Code Conversion Facet</H2><P>File stream buffers are responsible for the transport of characters to and from an external device. In many cases, the character encoding used internally inside your program and externally on the device will differ. Hence the file stream buffer will have to convert characters from one encoding to another each time it reads from or writes to the external device. (This <I>User's Guide</I> section on internationalization gives a detailed discussion of character encodings and explains a couple of typical code conversions. If you are not familiar with code conversions, we recommend you read about them before delving into the details of implementing one, which will be explained in this section.)</P><P>A code conversion is not performed by the file stream buffer itself. This task is encapsulated in a code conversion facet. Each time the file stream buffer has to convert characters, it consults its locale's code conversion facet for the actual conversion. For this reason, file stream buffers and code conversion facets have to work together closely, and the file stream buffer depends on its locale's code conversion facet.</P><P>This clear separation of responsibilities enables you to change a file stream's behavior substantially, without touching the file stream class itself. All you have to do is provide a special code conversion facet. 
In doing so, you turn an ordinary file stream into one that converts, say, EBCDIC files on a mainframe's file system into a stream of ASCII characters for internal processing.</P><P>However, the task of implementing a code conversion facet requires a thorough understanding of the way file stream buffers and code conversion facets interact. In this section, we will use two examples to explain the principles of this interaction. </P><P>Before we move on to the examples, let's go through an overview of the different kinds of code conversions. As we will see later on, different types of code conversions require different kinds of implementations.</P><A NAME="2.13.1"><H3>2.13.1 Categories of Code Conversions</H3></A><P>Code conversions fall into various categories depending on the properties of the character encodings involved. There are:</P><UL><LI><P>Constant-size conversions, and</P></LI><LI><P>Multibyte conversions, which again fall into the categories of:</P><UL><LI><P>State-independent conversions, and</P></LI><LI><P>State-dependent conversions.</P></LI></UL></LI></UL><P><B>Constant-size conversions</B> are between character encodings where all characters are of equal size. All single- or wide-character encodings are examples of such character encodings. Each single character stands for itself and can be recognized and translated independently of its context. Conversions between ASCII and EBCDIC, or Unicode and ISO10646, are examples of constant-size conversions.</P><P><B>Multibyte conversions</B> involve multibyte encodings. In multibyte encodings, characters have varying size. Some multibyte characters consist of two or more bytes, while others are represented by just one byte.</P><P>There is a substantial difference between code conversions involving state-dependent character encodings, and conversions between state-independent encodings. 
(Again, see this <I>User's Guide</I> section on internationalization for further details.)</P>
<P><B>State-dependent multibyte conversions</B> involve one character encoding that is state-dependent. In state-dependent character encodings, character sequences can have different meanings depending on the current context. State-dependent encodings typically have <I>modes</I> and escape sequences that allow switching between modes. An example of a state-dependent character conversion is the conversion between the state-dependent JIS encoding for Japanese characters and the Unicode wide-character encoding.</P>
<P><B>State-independent multibyte conversions</B> involve encodings that do not have modes. A sequence of characters can always be interpreted independently of its context. An example of a state-independent multibyte conversion is the conversion between EUC, which is a state-independent multibyte encoding, and Unicode.</P></LI></UL>
<A NAME="2.13.2"><H3>2.13.2 Example 1 -- Defining a Tiny Character Code Conversion (ASCII and EBCDIC)</H3></A>
<P>As an example of how file stream buffers and code conversion facets collaborate, we will now implement a code conversion facet that translates text files encoded in EBCDIC into character streams encoded in ASCII. The conversion between ASCII characters and EBCDIC characters is a constant-size code conversion where each character is represented by one byte. Hence the conversion can be done on a character-by-character basis.</P>
<P>To implement and use an ASCII-EBCDIC code conversion facet, we will:</P>
<OL><LI><P>Derive a new facet type from the standard code conversion facet type <SAMP>codecvt</SAMP>.</P></LI><LI><P>Specialize the new facet type for the character type <SAMP>char</SAMP>.</P></LI><LI><P>Implement the member functions that are used by the file buffer.</P></LI><LI><P>Imbue a file stream's buffer with a locale that carries an ASCII-EBCDIC code conversion facet. 
</P></LI></OL><P>The following sections will explain these steps in detail.</P></LI></OL>
<A NAME="2.13.2.1"><H4>2.13.2.1 Derive a New Facet Type</H4></A>
<P>Here is the new code conversion facet type <SAMP>AsciiEbcdicConversion</SAMP>:</P>
<PRE>
template <class internT, class externT, class stateT>
class AsciiEbcdicConversion
  : public codecvt<internT, externT, stateT>
{
};
</PRE>
<P>It is empty because we will specialize the class template for the character type <SAMP>char</SAMP>.</P>
<A NAME="2.13.2.2"><H4>2.13.2.2 Specialize the New Facet Type and Implement the Member Functions</H4></A>
<P>Each code conversion facet has two main member functions, <SAMP>in()</SAMP> and <SAMP>out()</SAMP>:</P>
<UL><LI><P>Function <SAMP>in()</SAMP> is responsible for the conversion done on reading from the external device; and</P></LI><LI><P>Function <SAMP>out()</SAMP> is responsible for the conversion necessary for writing to the external device.</P></LI></UL>
<P>The other member functions of a code conversion facet used by a file stream buffer are:</P>
<UL><LI><P>The function <SAMP>always_noconv()</SAMP>, which returns <SAMP>true</SAMP> if no conversion is performed by the facet. This is because file stream buffers might want to bypass the code conversion facet when no conversion is necessary; e.g., when the external encoding is identical to the internal. Our facet obviously will perform a conversion and does not want to be bypassed, so <SAMP>always_noconv()</SAMP> will return <SAMP>false</SAMP> in our example.</P></LI><LI><P>The function <SAMP>encoding()</SAMP>, which provides information about the type of conversion; i.e., whether it is state-dependent or constant-size, etc. In our example, the conversion is constant-size. 
For a constant-size conversion, the function <SAMP>encoding()</SAMP> is supposed to return the number of external characters needed to produce one internal character, which is 1 because each EBCDIC byte in the file corresponds to exactly one ASCII character in the buffer.</P></LI></UL>
<P>All public member functions of a facet call the respective protected virtual member function, named <SAMP>do_...()</SAMP>. Here is the declaration of the specialized facet type:</P>
<PRE>
template <>
class AsciiEbcdicConversion<char, char, mbstate_t>
  : public codecvt<char, char, mbstate_t>
{
protected:
  result do_in(mbstate_t& state,
               const char* from, const char* from_end, const char*& from_next,
               char* to, char* to_limit, char*& to_next) const;

  result do_out(mbstate_t& state,
                const char* from, const char* from_end, const char*& from_next,
                char* to, char* to_limit, char*& to_next) const;

  bool do_always_noconv() const throw() { return false; }

  int do_encoding() const throw() { return 1; }
};
</PRE>
<P>For the sake of brevity, we implement only those functions used by Rogue Wave's implementation of file stream buffers. If you want to provide a code conversion facet that is more widely usable, you would also have to implement the functions <SAMP>do_length()</SAMP> and <SAMP>do_max_length()</SAMP>.</P>
<P>The implementation of the functions <SAMP>do_in()</SAMP> and <SAMP>do_out()</SAMP> is straightforward. Each of the functions translates a sequence of characters in the range <SAMP>[from,from_end)</SAMP> and stores the corresponding converted sequence in the output range <SAMP>[to,to_limit)</SAMP>. On return, the pointers <SAMP>from_next</SAMP> and <SAMP>to_next</SAMP> point one beyond the last character successfully converted in the input and output ranges, respectively. In principle, you can do whatever you want, or whatever it takes, in these functions. 
However, for effective communication with the file stream buffer, it is important to indicate success or failure properly.</P>
<A NAME="2.13.2.3"><H4>2.13.2.3 Use the New Code Conversion Facet</H4></A>
<P>Here is an example of how the new code conversion facet can be used:</P>
<PRE>
fstream inout("/tmp/fil");                                  //1
locale cvtloc(locale(),
              new AsciiEbcdicConversion<char,char,mbstate_t>);
inout.rdbuf()->pubimbue(cvtloc);                            //2
cout << inout.rdbuf();                                      //3
</PRE>
<TABLE CELLPADDING="3"><TR VALIGN="top"><TD>//1</TD><TD>When a file stream is created, a snapshot of the current global locale is attached as the default locale. Remember that a stream has two locale objects: one used for formatting numeric items, and a second used by the stream's buffer for code conversions.</TD></TR><TR VALIGN="top"><TD>//2</TD><TD>Here the stream buffer's locale is replaced by a copy of the global locale that has an ASCII-EBCDIC code conversion facet. The facet is allocated with <SAMP>new</SAMP> because the locale takes ownership of it and deletes it once no locale refers to it any longer.</TD></TR><TR VALIGN="top"><TD>//3</TD><TD>The content of the EBCDIC file <SAMP>"/tmp/fil"</SAMP> is read, automatically converted to ASCII, and written to <SAMP>cout</SAMP>.</TD></TR></TABLE>
<A NAME="2.13.3"><H3>2.13.3 Error Indication in Code Conversion Facets</H3></A>
<P>Since file stream buffers depend on their locale's code conversion facet, it is important to understand how they communicate. On writing to the external device, the file stream buffer hands over the content of its internal character buffer, partially or entirely, to the code conversion facet; i.e., to its <SAMP>out()</SAMP> function. It expects to receive a converted character sequence that it can write to the external device. The reverse takes place, using the <SAMP>in()</SAMP> function, on reading from the external file. 
</P><P>For the file stream buffer and the code conversion facet to work together effectively, the two main functions <SAMP>in()</SAMP> and <SAMP>out()</SAMP> must indicate error situations in the way the file stream buffer expects.</P>
<P>There are four possible return codes for the functions <SAMP>in()</SAMP> and <SAMP>out()</SAMP>:</P>
<UL><LI><P><SAMP>ok</SAMP>, which is returned when the conversion completed successfully.</P></LI><LI><P><SAMP>partial</SAMP>, which should be returned when the code conversion reaches the end of the input sequence <SAMP>[from,from_end)</SAMP> before a new character can be created. The file stream buffer's reaction to <SAMP>partial</SAMP> is to provide more characters and call the code conversion facet again, in order to successfully complete the conversion.<A HREF="endnote2.htm#fn44">[44]</A></P></LI><LI><P><SAMP>error</SAMP>, which indicates a violation of the conversion rules; i.e., the character sequence to be converted does not obey the expected rules and thus cannot be recognized and converted. In this situation, the file stream buffer stops processing, and the file stream eventually sets its state to <SAMP>badbit</SAMP> and throws an exception if appropriate.</P></LI><LI><P><SAMP>noconv</SAMP>, which is returned if no conversion was needed.</P></LI></UL>
<A NAME="2.13.4"><H3>2.13.4 Example 2 -- Defining a Multibyte Character Code Conversion (JIS and Unicode)</H3></A>
<P>Let us consider the example of a state-dependent code conversion. As mentioned previously, this type of conversion would occur between JIS, which is a state-dependent multibyte encoding for Japanese characters, and Unicode, which is a wide-character encoding. 
As usual, we assume that the external device uses multibyte encoding, and the internal processing uses wide-character encoding.</P><P>Here is what you have to do to implement and use a state-dependent code conversion facet:</P><OL><LI><P>Define a new conversion state type if necessary.</P></LI><LI><P>Define a new character traits type if necessary, or instantiate the character traits template with the new state type.</P></LI>