📄 rfc3284.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 5 页
字号:
   caches, so that decoder's caches remain synchronized with the
   encoder's caches.

   a. A "near" cache is an array with "s_near" slots, each containing an
      address used for encoding addresses nearby to previously encoded
      addresses (in the positive direction only).  The near cache also
      maintains a "next_slot" index to the near cache.  New entries to
      the near cache are always inserted in the next_slot index, which
      maintains a circular buffer of the s_near most recent addresses.

   b. A "same" cache is an array with "s_same", with a multiple of 256
      slots, each containing an address.  The same cache maintains a
      hash table of recent addresses used for repeated encoding of the
      exact same address.

   By default, the parameters s_near and s_same are respectively set to
   4 and 3.  An encoder MAY modify these values, but then it MUST encode
   the new values in the encoding itself, as discussed in Section 7, so
   that the decoder can properly set up its own caches.

   At the start of processing a target window, an implementation
   (encoder or decoder) initializes all of the slots in both caches to
   zero.  The next_slot pointer of the near cache is set to point to
   slot zero.





Korn, et. al.               Standards Track                    [Page 12]

RFC 3284                         VCDIFF                        June 2002


   Each time a COPY instruction is processed by the encoder or decoder,
   the implementation's caches are updated as follows, where "addr" is
   the address in the COPY instruction.

   a. The slot in the near cache referenced by the next_slot index is
      set to addr.  The next_slot index is then incremented modulo
      s_near.

   b. The slot in the same cache whose index is addr%(s_same*256) is set
      to addr.  [We use the C notations of % for modulo and * for
      multiplication.]

5.2 Example code for maintaining caches

   To make clear the above description, below are examples of cache data
   structures and algorithms to initialize and update them:

   typedef struct _cache_s
   {
       int*  near;      /* array of size s_near        */
       int   s_near;
       int   next_slot; /* the circular index for near */
       int*  same;      /* array of size s_same*256    */
       int   s_same;
   } Cache_t;

   cache_init(Cache_t* ka)
   {
       int   i;

       ka->next_slot = 0;
       for(i = 0; i < ka->s_near; ++i)
            ka->near[i] = 0;

       for(i = 0; i < ka->s_same*256; ++i)
            ka->same[i] = 0;
   }

   cache_update(Cache_t* ka, int addr)
   {
       if(ka->s_near > 0)
       {   ka->near[ka->next_slot] = addr;
           ka->next_slot = (ka->next_slot + 1) % ka->s_near;
       }

       if(ka->s_same > 0)
           ka->same[addr % (ka->s_same*256)] = addr;
   }



Korn, et. al.               Standards Track                    [Page 13]

RFC 3284                         VCDIFF                        June 2002


5.3 Encoding of COPY instruction addresses

   The address of a COPY instruction is encoded using different modes,
   depending on the type of cached address used, if any.

   Let "addr" be the address of a COPY instruction to be decoded and
   "here" be the current location in the target data (i.e., the start of
   the data about to be encoded or decoded).  Let near[j] be the jth
   element in the near cache, and same[k] be the kth element in the same
   cache.  Below are the possible address modes:

      VCD_SELF: This mode has value 0.  The address was encoded by
         itself as an integer.

      VCD_HERE: This mode has value 1.  The address was encoded as the
         integer value "here - addr".

      Near modes: The "near modes" are in the range [2,s_near+1].  Let m
         be the mode of the address encoding.  The address was encoded
         as the integer value "addr - near[m-2]".

      Same modes: The "same modes" are in the range
         [s_near+2,s_near+s_same+1].  Let m be the mode of the encoding.
         The address was encoded as a single byte b such that "addr ==
         same[(m - (s_near+2))*256 + b]".

5.4 Example code for encoding and decoding of COPY instruction addresses

   We show example algorithms below to demonstrate the use of address
   modes more clearly.  The encoder has the freedom to choose address
   modes, the sample addr_encode() algorithm merely shows one way of
   picking the address mode.  The decoding algorithm addr_decode() will
   uniquely decode addresses, regardless of the encoder's algorithm
   choice.

   Note that the address caches are updated immediately after an address
   is encoded or decoded.  In this way, the decoder is always
   synchronized with the encoder.













Korn, et. al.               Standards Track                    [Page 14]

RFC 3284                         VCDIFF                        June 2002


   int addr_encode(Cache_t* ka, int addr, int here, int* mode)
   {
       int  i, d, bestd, bestm;

       /* Attempt to find the address mode that yields the
        * smallest integer value for "d", the encoded address
        * value, thereby minimizing the encoded size of the
        * address. */

       bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */

       if((d = here-addr) < bestd)
           { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */

       for(i = 0; i < ka->s_near; ++i)
           if((d = addr - ka->near[i]) >= 0 && d < bestd)
               { bestd = d; bestm = i+2; }

       if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
           { bestd = d%256; bestm = ka->s_near + 2 + d/256; }

       cache_update(ka,addr);

       *mode = bestm; /* this returns the address encoding mode */
       return  bestd; /* this returns the encoded address       */
   }

   Note that the addr_encode() algorithm chooses the best address mode
   using a local optimization, but that may not lead to the best
   encoding efficiency because different modes lead to different
   instruction encodings, as described below.

   The functions addrint() and addrbyte() used in addr_decode(), obtain
   from the "Addresses section for COPYs" (Section 4.3), an integer or a
   byte, respectively.  These utilities will not be described here.  We
   simply recall that an integer is represented as a compact variable-
   sized string of bytes, as described in Section 2 (i.e., base 128).














Korn, et. al.               Standards Track                    [Page 15]

RFC 3284                         VCDIFF                        June 2002


   int addr_decode(Cache_t* ka, int here, int mode)
   {   int  addr, m;

       if(mode == VCD_SELF)
            addr = addrint();
       else if(mode == VCD_HERE)
            addr = here - addrint();
       else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
            addr = ka->near[m] + addrint();
       else /* same cache */
       {    m = mode - (2 + ka->s_near);
            addr = ka->same[m*256 + addrbyte()];
       }

       cache_update(ka, addr);

       return addr;
   }

5.4 Instruction Codes

   Matches are often short in lengths and separated by small amounts of
   unmatched data.  That is, the lengths of COPY and ADD instructions
   are often small.  This is particularly true of binary data such as
   executable files or structured data, such as HTML or XML.  In such
   cases, compression can be improved by combining the encoding of the
   sizes and the instruction types, as well as combining the encoding of
   adjacent delta instructions with sufficiently small data sizes.
   Effective choices of when to perform such combinations depend on many
   factors including the data being processed and the string matching
   algorithm in use.  For example, if many COPY instructions have the
   same data sizes, it may be worthwhile to encode these instructions
   more compactly than others.

   The Vcdiff data format is designed so that a decoder does not need to
   be aware of the choices made in encoding algorithms.  This is
   achieved with the notion of an "instruction code table", containing
   256 entries.  Each entry defines, either a single delta instruction
   or a pair of instructions that have been combined.  Note that the
   code table itself only exists in main memory, not in the delta file
   (unless using an application-defined code table, described in Section
   7).  The encoded data simply includes the index of each instruction
   and, since there are only 256 indices, each index can be represented
   as a single byte.







Korn, et. al.               Standards Track                    [Page 16]

RFC 3284                         VCDIFF                        June 2002


   Each instruction code entry contains six fields, each of which is a
   single byte with an unsigned value:

          +-----------------------------------------------+
          | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
          +-----------------------------------------------+

   Each triple (inst,size,mode) defines a delta instruction.  The
   meanings of these fields are as follows:

      inst: An "inst" field can have one of the four values: NOOP (0),
            ADD (1), RUN (2) or COPY (3) to indicate the instruction
            types.  NOOP means that no instruction is specified.  In
            this case, both the corresponding size and mode fields will
            be zero.

      size: A "size" field is zero or positive.  A value zero means that
            the size associated with the instruction is encoded
            separately as an integer in the "Instructions and sizes
            section" (Section 6).  A positive value for "size" defines
            the actual data size.  Note that since the size is
            restricted to a byte, the maximum value for any instruction
            with size implicitly defined in the code table is 255.

      mode: A "mode" field is significant only when the associated delta
            instruction is a COPY.  It defines the mode used to encode
            the associated addresses.  For other instructions, this is
            always zero.

5.6 The Code Table

   Following the discussions on address modes and instruction code
   tables, we define a "Code Table" to have the data below:

         s_near: the size of the near cache,
         s_same: the size of the same cache,
         i_code: the 256-entry instruction code table.

   Vcdiff itself defines a "default code table" in which s_near is 4 and
   s_same is 3.  Thus, there are 9 address modes for a COPY instruction.
   The first two are VCD_SELF (0) and VCD_HERE (1).  Modes 2, 3, 4 and 5
   are for addresses coded against the near cache.  And modes 6, 7  and
   8, are for addresses coded against the same cache.








Korn, et. al.               Standards Track                    [Page 17]

RFC 3284                         VCDIFF                        June 2002


        TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
       ---------------------------------------------------------------
    1.  RUN         0        0     NOOP       0        0        0
    2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
    3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
    4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
    5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
    6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -