    _symbol.erase ();
    while (!isspace (_look))
    {
        _symbol += _look;
        _look = _in.get ();
    }
}</pre>
</td></tr></table><!-- End Code -->


<p>As usual, we should provide simple stubs for the <var>Load</var> and <var>Save</var> methods and test our program before proceeding any further.

<h3>Serialization and Deserialization</h3>

<p>We often imagine data structures as two- or even three-dimensional creatures (just think of a parsing tree, a hash table, or a multi-dimensional array). A disk file, on the other hand, has a one-dimensional structure--it's linear. When you write to a file, you write one thing after another--serially. Hence the name <i>serialization</i>. Saving a data structure means transforming a multi-dimensional idea into its one-dimensional representation. Of course, in reality computer memory is also one-dimensional. Our data structures are already, in some manner, serialized in memory. Some of them, like multi-dimensional arrays, are serialized by the compiler, others are fit into linear memory with the use of pointers. Unfortunately, pointers have no meaning outside the context of the currently running instance of the program. You can't save pointers to a file, close the program, start it again, read the file and expect the newly read pointers to point to the same data structures as before.

<p>In order to serialize a data structure, you have to come up with a well-defined procedure for walking it, i.e., visiting every single element of it, one after another. For instance, you can walk a simple linked list by following the <i>next</i> pointers until you hit the end of the list. If the list is circular, you have to remember the initial pointer and, with every step, compare it with the <i>next</i> pointer. A binary tree can be walked by walking the left child first and the right child next (notice that it's a recursive prescription). For every data structure there is at least one deterministic procedure for walking it, but the procedure might be arbitrarily complicated. 
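<p>The two walking procedures described above can be sketched roughly as follows. The types <var>TreeNode</var> and <var>Link</var> and the functions <var>Walk</var> and <var>WalkCircular</var> are hypothetical names invented for this illustration, and a vector of longs stands in for the serialized output.

```cpp
#include <vector>

// Hypothetical node types, used only for this illustration.
struct TreeNode
{
    long value;
    TreeNode * left;
    TreeNode * right;
};

struct Link
{
    long value;
    Link * next;
};

// Recursive walk: the node itself, then the left child, then the right child.
// Every visited value is appended to one flat sequence.
void Walk (TreeNode const * node, std::vector<long> & out)
{
    if (node == nullptr)
        return;
    out.push_back (node->value);
    Walk (node->left, out);
    Walk (node->right, out);
}

// Circular list: remember the initial pointer and stop
// as soon as the next pointer leads back to it.
void WalkCircular (Link const * first, std::vector<long> & out)
{
    if (first == nullptr)
        return;
    Link const * cur = first;
    do
    {
        out.push_back (cur->value);
        cur = cur->next;
    } while (cur != first);
}
```

<p>Each walk visits every element exactly once and in a fixed order, which is all that serialization needs.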

<p>Once you know how to walk a data structure, you know how to serialize it. You have a prescription for how to visit every element of the structure, one after another--a serial way of scanning it. At the bottom level of every data structure you find simple, built-in types, like int, char, long, etc. They can be written to a file following a set of simple rules--we'll come back to this point in a moment. If you know how to serialize each basic element, you're done.

<p>Serializing a data structure makes sense only if we know how to restore it--<i>deserialize</i> it from file to memory. Knowing the original serialization procedure helps--we can follow the same steps when we deserialize it; only now we'll <i>read</i> from file and <i>write</i> to memory, rather than the other way around. We have to make sure, however, that the procedure is unambiguous. For instance, we have to know when to stop reading elements of a given data structure. We must know where the end of a data structure is. The clues that were present during serialization might not be present on disk. For instance, a linked list had a null pointer as <i>next</i> in its last element. But if we decide not to store pointers, how are we to know when we have reached the end of the list? Of course, we may decide to store the pointers anyway, just to have a clue when to stop. Or, even better, we could store the count of elements in front of the list. 
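<p>Here is a minimal sketch of the count-in-front idea, using a <var>std::list&lt;long&gt;</var> in place of a hand-written linked list. The free functions <var>PutLong</var>, <var>GetLong</var>, <var>SerializeList</var>, <var>DeSerializeList</var> and <var>RoundTrip</var> are assumptions made for this example only, and an in-memory string stream stands in for the disk file.

```cpp
#include <istream>
#include <list>
#include <ostream>
#include <sstream>

// Hypothetical helpers: write or read one long as raw bytes.
// This is only meaningful when reader and writer run on the same kind of machine.
void PutLong (std::ostream & out, long value)
{
    out.write (reinterpret_cast<char const *> (&value), sizeof (long));
}

long GetLong (std::istream & in)
{
    long value = 0;
    in.read (reinterpret_cast<char *> (&value), sizeof (long));
    if (!in)
        throw "file read failed";
    return value;
}

// The count goes first, the elements follow. No pointers are stored.
void SerializeList (std::list<long> const & lst, std::ostream & out)
{
    PutLong (out, static_cast<long> (lst.size ()));
    for (std::list<long>::const_iterator it = lst.begin (); it != lst.end (); ++it)
        PutLong (out, *it);
}

// The count read up front tells us exactly when to stop.
void DeSerializeList (std::list<long> & lst, std::istream & in)
{
    lst.clear ();
    long count = GetLong (in);
    for (long i = 0; i < count; ++i)
        lst.push_back (GetLong (in));
}

// Round trip through an in-memory stream standing in for a file.
std::list<long> RoundTrip (std::list<long> const & original)
{
    std::stringstream file (std::ios_base::in | std::ios_base::out | std::ios_base::binary);
    SerializeList (original, file);
    std::list<long> restored;
    DeSerializeList (restored, file);
    return restored;
}
```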
<p>The need to know sizes of data structures before we can deserialize them imposes additional constraints on the order of serialization. When we serialize one part of the program's data, all  other parts are present in memory. We can often infer the size of a given data structure by looking into other data structures. When deserializing, we don't have this comfort. We either have to make sure that these other data structures are deserialized first, or add some redundancy to the serialized image, e.g., store the counts multiple times. A good example is a class that contains a pointer to a dynamically allocated array and the current size of the array. It really doesn't matter which member comes first, the pointer or the count. However, when serializing an object we must store the count first and the contents of the array next. Otherwise we won't be able to allocate the appropriate amount of memory or read the correct number of entries.
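<p>The array example might look something like the sketch below. The class <var>DynArray</var>, its methods, and the helper <var>CopyThroughStream</var> are hypothetical and exist only to show the ordering: in the class, the pointer happens to be declared before the count, but on disk the count must come first.

```cpp
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical class: owns a dynamically allocated array and its size.
class DynArray
{
public:
    DynArray () : _arr (nullptr), _count (0) {}
    DynArray (DynArray const &) = delete;
    DynArray & operator= (DynArray const &) = delete;
    ~DynArray () { delete [] _arr; }

    void Assign (double const * src, long count)
    {
        double * fresh = new double [count];
        for (long i = 0; i < count; ++i)
            fresh [i] = src [i];
        delete [] _arr;
        _arr = fresh;
        _count = count;
    }
    long Count () const { return _count; }
    double At (long i) const { return _arr [i]; }

    void Serialize (std::ostream & out) const
    {
        // Count first, contents next.
        out.write (reinterpret_cast<char const *> (&_count), sizeof (long));
        out.write (reinterpret_cast<char const *> (_arr),
                   static_cast<std::streamsize> (_count * sizeof (double)));
    }
    void DeSerialize (std::istream & in)
    {
        long count = 0;
        in.read (reinterpret_cast<char *> (&count), sizeof (long));
        if (!in || count < 0)
            throw "file read failed";
        // Only now do we know how much memory to allocate and how much to read.
        double * fresh = new double [count];
        in.read (reinterpret_cast<char *> (fresh),
                 static_cast<std::streamsize> (count * sizeof (double)));
        if (!in)
        {
            delete [] fresh;
            throw "file read failed";
        }
        delete [] _arr;
        _arr = fresh;
        _count = count;
    }
private:
    double * _arr;   // declared before the count; the order in the class doesn't matter
    long     _count;
};

// Usage: copy one array into another through an in-memory "file".
void CopyThroughStream (DynArray const & from, DynArray & to)
{
    std::stringstream file (std::ios_base::in | std::ios_base::out | std::ios_base::binary);
    from.Serialize (file);
    to.DeSerialize (file);
}
```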
<p>Another kind of ambiguity might arise when storing polymorphic data structures. For instance, a binary node contains two pointers to <var>Node</var>. That's not a problem when we serialize it--we can tell the two children to serialize themselves by calling the appropriate virtual functions. But when the time comes to deserialize the node, how do we know what the real type of each child was? We have to know that before we can even start deserializing them. That's why the serialized image of any polymorphic data structure has to start with some kind of code that identifies the class of the data structure. Based on this code, the deserializer will be able to call the appropriate constructor. 
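<p>To make the idea of a class code concrete, here is a small sketch under several assumptions: the enumeration <var>NodeCode</var>, the classes <var>NumberNode</var> and <var>AddNode</var>, and the static method <var>Node::DeSerialize</var> are all invented for this example, and a whitespace-separated text format is used instead of a binary one to keep it short.

```cpp
#include <istream>
#include <memory>
#include <ostream>
#include <sstream>
#include <utility>

// Hypothetical class codes written at the start of every serialized node.
enum NodeCode { codeNumber = 1, codeAdd = 2 };

class Node
{
public:
    virtual ~Node () {}
    virtual double Calc () const = 0;
    virtual void Serialize (std::ostream & out) const = 0;
    // Reads the class code and calls the matching constructor.
    static std::unique_ptr<Node> DeSerialize (std::istream & in);
};

class NumberNode : public Node
{
public:
    explicit NumberNode (double value) : _value (value) {}
    double Calc () const { return _value; }
    void Serialize (std::ostream & out) const
    {
        out << codeNumber << ' ' << _value << ' ';
    }
private:
    double _value;
};

class AddNode : public Node
{
public:
    AddNode (std::unique_ptr<Node> left, std::unique_ptr<Node> right)
        : _left (std::move (left)), _right (std::move (right))
    {}
    double Calc () const { return _left->Calc () + _right->Calc (); }
    void Serialize (std::ostream & out) const
    {
        out << codeAdd << ' ';
        _left->Serialize (out);   // virtual calls: each child writes its own code
        _right->Serialize (out);
    }
private:
    std::unique_ptr<Node> _left;
    std::unique_ptr<Node> _right;
};

std::unique_ptr<Node> Node::DeSerialize (std::istream & in)
{
    int code = 0;
    if (!(in >> code))
        throw "file read failed";
    switch (code)
    {
    case codeNumber:
    {
        double value = 0;
        if (!(in >> value))
            throw "file read failed";
        return std::unique_ptr<Node> (new NumberNode (value));
    }
    case codeAdd:
    {
        // Children are read in the same order in which they were written.
        std::unique_ptr<Node> left = DeSerialize (in);
        std::unique_ptr<Node> right = DeSerialize (in);
        return std::unique_ptr<Node> (new AddNode (std::move (left), std::move (right)));
    }
    default:
        throw "unknown node code";
    }
}
```

<p>Serialization relies on virtual dispatch, but deserialization cannot, because no object exists yet. The code read from the stream is the only clue as to which constructor to call.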
<p>Let's now go back to our project and implement the (de-) serialization of the Calculator's data structures. First we have to create an output file. This file will be encapsulated inside a serial stream. The stream can accept a number of basic data types, such as <var>long</var> and <var>double</var>, as well as some simple aggregates, like strings, and write them to the file. 
<p>Notice that I didn't mention the most common type--the integer. That's because the size of the integer is system dependent. Suppose that you serialize a data structure that contains integers and send it on a diskette or through e-mail to somebody who has a version of the same program running on a different processor. Your program might write an integer as two bytes and their program might expect a four-byte or even eight-byte integer. That's why, when serializing, we convert the system-dependent types, like integers, to system-independent types like longs. In fact, it's not only the size that matters--the order of bytes is important as well. 
<!-- Sidebar -->
<table width=100% border=0 cellpadding=5><tr>
<td width=10>
<td bgcolor="#cccccc" class=sidebar>

<p>There are essentially two kinds of processors, the ones that use the Big Endian and the ones that use the Little Endian order (some can use either). For instance, a <var>short</var> or a <var>long</var> can be stored most-significant-byte-first or least-significant-byte-first. The Intel(tm) family of processors stores the least significant byte first--the Little Endian style--whereas the Motorola(tm) family does the opposite. So if you want your program to inter-operate between Wintel and Macintosh (tm), you'll have to take the order of bytes into account when you serialize. Of course, if you're <i>not</i> planning on porting your program between the two camps, you may safely ignore one of them. 
<p>Anyway, in most cases you <i>should</i> take precautions against variable size types and convert integers or enumerations to fixed-size types.
</table>
<!-- End Sidebar -->
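<p>If you do need to take both the size and the byte order into account, one possible approach is sketched below. It is not what our serial stream does. The functions <var>PutInt32</var> and <var>GetInt32</var> are hypothetical, and the sketch swaps in the fixed-width type <var>std::int32_t</var> from <var>&lt;cstdint&gt;</var>, because the size of a <var>long</var> itself differs between some platforms. The bytes are always written least significant first, whatever the processor's native order.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Write a 32-bit value as exactly four bytes, least significant byte first,
// regardless of the byte order of the processor running the program.
void PutInt32 (std::ostream & out, std::int32_t value)
{
    std::uint32_t bits = static_cast<std::uint32_t> (value);
    char bytes [4];
    for (int i = 0; i < 4; ++i)
        bytes [i] = static_cast<char> ((bits >> (8 * i)) & 0xff);
    out.write (bytes, 4);
}

// Read four bytes and reassemble them in the same fixed order.
std::int32_t GetInt32 (std::istream & in)
{
    char bytes [4];
    in.read (bytes, 4);
    if (!in)
        throw "file read failed";
    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= static_cast<std::uint32_t> (static_cast<unsigned char> (bytes [i])) << (8 * i);
    return static_cast<std::int32_t> (bits);
}
```

<p>Because the shifts operate on values rather than on memory layout, the same source code produces the same file on both kinds of processors.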
<p>It would be great to be able to assume that once you come up with the on-disk format for your program, it will never change. In real life it would be very na&iuml;ve. Formats change and the least you can do to acknowledge it is to refuse to load a format you don't understand. 
<!-- Definition -->
<p>
<table border=4 cellpadding=10><tr>
    <td bgcolor="#ffffff" class=defTable>
Always store a version number in your on-disk data structures.</td></tr>
</table>
<!-- End Definition -->

<p>In order to implement serialization, all we have to do is to create a stream, write the version number into it and tell the calculator to serialize itself. By the way, we are now reaping the benefits of having earlier combined several objects into the <var>Calculator</var> object.

<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>const long Version = 1;

Status CommandParser::Save (std::string const &amp; nameFile)
{
    cerr &lt;&lt; "Save to: \"" &lt;&lt; nameFile &lt;&lt; "\"\n";
    Status status = stOk;
    try
    {
        Serializer out (nameFile);
        out.PutLong ( Version );
        _calc.Serialize (out);
    }
    catch (char const * msg)
    {
        cerr &lt;&lt; "Error: Save failed: " &lt;&lt; msg &lt;&lt; endl;
        status = stError;
    }
    catch (...)
    {
        cerr &lt;&lt; "Error: Save failed\n";
        status = stError;
    }
    return status;
}
</pre>
    </td></tr>
</table>
<!-- End Code -->
<p>When deserializing, we follow exactly the same steps, except that now we read instead of writing and deserialize instead of serializing. And, if the version number doesn't match, we refuse to load.
<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>Status CommandParser::Load (std::string const &amp; nameFile)
{
    cerr &lt;&lt; "Load from: \"" &lt;&lt; nameFile &lt;&lt; "\"\n";
    Status status = stOk;
    try
    {
        DeSerializer in (nameFile);
        long ver = in.GetLong ();
        if (ver != Version)
            throw "Version number mismatch";
        _calc.DeSerialize (in);
    }
    catch (char const * msg)
    {
        cerr &lt;&lt; "Error: Load failed: " &lt;&lt; msg &lt;&lt; endl;
        status = stError;
    }
    catch (...)
    {
        cerr &lt;&lt; "Error: Load failed\n";
        // data structures may be corrupt
        throw;
    }
    return status;
}</pre>
    </td></tr>
</table>
<!-- End Code -->

<p>There are two objects inside the Calculator that we'd like to save to the disk--the symbol table and the store--the names of the variables and their values. So that's what we'll do.

<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>void Calculator::Serialize (Serializer &amp; out)
{
    _symTab.Serialize (out);
    _store.Serialize (out);
}</pre>
    </td></tr>
</table>
<!-- End Code -->

<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>void Calculator::DeSerialize (DeSerializer &amp; in)
{
    _symTab.DeSerialize (in);
    _store.DeSerialize (in);
}</pre>
    </td></tr>
</table>
<!-- End Code -->

<p>The symbol table consists of a dictionary that maps strings to integers plus a variable that contains the current id. And the simplest way to walk the symbol table is indeed in this order. To walk the standard map we will use its iterator. First we have to store the count of elements, so that we know how many to read during deserialization. Then we will iterate over the whole map and store pairs: string, id. Notice that the iterator for <var>std::map</var> points to a <var>std::pair</var> which has <var>first</var> and <var>second</var> data members. According to our previous discussion, we store the integer id as a <var>long</var>.
<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>void SymbolTable::Serialize (Serializer &amp; out) const
{
    out.PutLong (_dictionary.size ());
    std::map&lt;std::string, int&gt;::const_iterator it;
    for (it = _dictionary.begin (); it != _dictionary.end (); ++it)
    {
        out.PutString (it-&gt;first);
        out.PutLong (it-&gt;second);
    }
    out.PutLong (_id);
}</pre>
    </td></tr>
</table>
<!-- End Code -->

<p>The deserializer must read the data in the same order as they were serialized: first the dictionary, then the current id. When deserializing the map, we first read its size. Then we simply read pairs of strings and longs and add them to the map. Here we treat the map as an associative array. Notice that we first clear the existing dictionary. We have to do it, otherwise we could get into conflicts, with the same id corresponding to different strings.
<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>void SymbolTable::DeSerialize (DeSerializer &amp; in)
{
    _dictionary.clear ();
    int len = in.GetLong ();
    for (int i = 0; i &lt; len; ++i)
    {
        std::string str = in.GetString ();
        int id = in.GetLong ();
        _dictionary [str] = id;
    }
    _id = in.GetLong ();
}</pre>
    </td></tr>
</table>
<!-- End Code -->

<p>Notice that for every serialization procedure we immediately write its counterpart--the deserialization procedure. This way we make sure that the two match.

<p>The serialization of the store is also very simple. First the size and then a series of pairs (double, bool). 
<!-- Code --><table width=100% cellspacing=10><tr>    <td class=codetable>
<pre>void Store::Serialize (Serializer &amp; out) const
{
    int len = _aCell.size ();
    out.PutLong (len);
    for (int i = 0; i &lt; len; ++i)
    {
        out.PutDouble (_aCell [i]);
        out.PutBool (_aIsInit [i]);
    }
}</pre>
</td></tr></table><!-- End Code -->

<p>When deserializing the store, we first clear the previous values, read the size and then read the pairs (double, bool) one by one. We have a few options when filling the two vectors with new values. One is to push them back, one by one. Since we know the number of entries up front, we could reserve space in the vectors beforehand, by calling the method <var>reserve</var>. Here I decided to <var>resize</var> the vectors instead and then treat them as arrays. The resizing fills the vector of doubles with zeroes and the vector of <var>bool</var> with <var>false</var> (these are the default values for these types).

<!-- Sidebar -->
<table width=100% border=0 cellpadding=5><tr>
<td width=10>
<td bgcolor="#cccccc" class=sidebar>
<p>There is an important difference between <var>reserve</var> and <var>resize</var>. Most standard containers have either one or both of these methods. <var>Reserve</var> makes sure that there will be no re-allocation when elements are added, e.g., using <var>push_back</var>, up to the reserved <i>capacity</i>. This is a good optimization, in case we know the required capacity up front. In the case of a vector, the absence of re-allocation also means that iterators, pointers or references to the elements of the vector won't be suddenly invalidated by internal reallocation. 
<p><var>Reserve</var>, however, does not change the <i>size</i> of the container. <var>Resize</var> does. When you resize a container new elements are added to it. (Consequently, you can't <var>resize</var> containers that store objects with no default constructors or default values.)
<ul>
<li>reserve--changes capacity but not size
<li>resize--changes size
</ul>
<p>You can enquire about the current capacity of the container by calling its <var>capacity</var> method. And, of course, you get its size by calling <var>size</var>.
</table>
<!-- End Sidebar -->
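<p>The difference is easy to see side by side. In the sketch below, the functions <var>FillWithReserve</var> and <var>FillWithResize</var> are hypothetical names, and the values stored are arbitrary; both functions end up with the same vector contents.

```cpp
#include <vector>

// reserve changes the capacity only, so elements still have to be pushed back.
void FillWithReserve (std::vector<double> & v, int len)
{
    v.clear ();
    v.reserve (len);            // capacity is at least len, size is still 0
    for (int i = 0; i < len; ++i)
        v.push_back (i * 0.5);  // no re-allocation inside this loop
}

// resize changes the size, so the vector can be treated as an array.
void FillWithResize (std::vector<double> & v, int len)
{
    v.clear ();
    v.resize (len);             // size is len, every element is 0.0
    for (int i = 0; i < len; ++i)
        v [i] = i * 0.5;
}
```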


<!-- Code --><table width=100% cellspacing=10><tr>    <td class=codetable>
<pre>void Store::DeSerialize (DeSerializer &amp; in)
{
    _aCell.clear ();
    _aIsInit.clear ();
    int len = in.GetLong ();
    _aCell.resize (len);
    _aIsInit.resize (len);
    for (int i = 0; i &lt; len; ++i)
    {
        _aCell [i] = in.GetDouble ();
        _aIsInit [i] = in.GetBool ();
    }
}</pre>
</td></tr></table><!-- End Code -->

<p>Finally, let's have a look at the implementation of the deserializer stream. It is a pretty thin layer on top of the input stream. 

<!-- Code -->
<table width="100%" cellspacing=10><tr>
    <td class=codeTable>
<pre>#include &lt;fstream&gt;
using std::ios_base;

const long TruePattern = 0xfab1fab2;
const long FalsePattern = 0xbad1bad2;

class DeSerializer
{
public:
    DeSerializer (std::string const &amp; nameFile)
        : _stream (nameFile.c_str (), ios_base::in | ios_base::binary)
    {
        if (!_stream.is_open ())
            throw "couldn't open file";
    }
    long GetLong ()
    {
        if (_stream.eof())
            throw "unexpected end of file";
        long l;
        _stream.read (reinterpret_cast&lt;char *&gt; (&amp;l), sizeof (long));
        if (_stream.bad())
            throw "file read failed";
        return l;
    }
    double GetDouble ()
    {
        double d;
        if (_stream.eof())