📄 rfc3284.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 5 页
字号:
    7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
    8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
    9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
   10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
   11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
   12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
   13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
   14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
   15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
   16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
   17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
   18.  ADD       [1,4]      0     COPY       4        6    [235,238]
   19.  ADD       [1,4]      0     COPY       4        7    [239,242]
   20.  ADD       [1,4]      0     COPY       4        8    [243,246]
   21.  COPY        4      [0,8]   ADD        1        0    [247,255]
       ---------------------------------------------------------------

   The default instruction code table is depicted above, in a compact
   representation that we use only for descriptive purposes.  See
   section 7 for the specification of how an instruction code table is
   represented in the Vcdiff encoding format.  In the depiction, a zero
   value for size indicates that the size is separately coded.  The mode
   of non-COPY instructions is represented as 0, even though they are
   not used.

   In the depiction, each numbered line represents one or more entries
   in the actual instruction code table (recall that an entry in the
   instruction code table may represent up to two combined delta
   instructions.)  The last column ("INDEX") shows which index value, or
   range of index values, of the entries are covered by that line.  (The
   notation [i,j] means values from i through j, inclusively.)  The
   first 6 columns of a line in the depiction, describe the pairs of
   instructions used for the corresponding index value(s).

   If a line in the depiction includes a column entry using the [i,j]
   notation, this means that the line is instantiated for each value in
   the range from i to j, inclusively.  The notation "0, [i,j]" means
   that the line is instantiated for the value 0 and for each value in
   the range from i to j, inclusively.




Korn, et. al.               Standards Track                    [Page 18]

RFC 3284                         VCDIFF                        June 2002


   If a line in the depiction includes more than one entry using the
   [i,j] notation, implying a "nested loop" to convert the line to a
   range of table entries, the first such [i,j] range specifies the
   outer loop, and the second specifies the inner loop.

   The below examples should make clear the above description:

   Line 1 shows the single RUN instruction with index 0.  As the size
   field is 0, this RUN instruction always has its actual size encoded
   separately.

   Line 2 shows the 18 single ADD instructions.  The ADD instruction
   with size field 0 (i.e., the actual size is coded separately) has
   index 1.  ADD instructions with sizes from 1 to 17 use code indices 2
   to 18 and their sizes are as given (so they will not be separately
   encoded.)

   Following the single ADD instructions are the single COPY
   instructions ordered by their address encoding modes.  For example,
   line 11 shows the COPY instructions with mode 8, i.e., the last of
   the same cache.  In this case, the COPY instruction with size field 0
   has index 147.  Again, the actual size of this instruction will be
   coded separately.

   Lines 12 to 21 show the pairs of instructions that are combined
   together.  For example, line 12 depicts the 12 entries in which an
   ADD instruction is combined with an immediately following COPY
   instruction.  The entries with indices 163, 164, 165 represent the
   pairs in which the ADD instructions all have size 1, while the COPY
   instructions have mode 0 (VCD_SELF) and sizes 4, 5 and 6
   respectively.

   The last line, line 21, shows the eight instruction pairs, where the
   first instruction is a COPY and the second is an ADD.  In this case,
   all COPY instructions have size 4 with mode ranging from 0 to 8 and
   all the ADD instructions have size 1.  Thus, the entry with the
   largest index 255 combines a COPY instruction of size 4 and mode 8
   with an ADD instruction of size 1.

   The choice of the minimum size 4 for COPY instructions in the default
   code table was made from experiments that showed that excluding small
   matches (less then 4 bytes long) improved the compression rates.









Korn, et. al.               Standards Track                    [Page 19]

RFC 3284                         VCDIFF                        June 2002


6. Decoding a Target Window

   Section 4.3 discusses that the delta instructions and associated data
   are encoded in three arrays of bytes:

         Data section for ADDs and RUNs,
         Instructions and sizes section, and
         Addresses section for COPYs.

   Further, these data sections may have been further compressed by some
   secondary compressor.  Assuming that any such compressed data has
   been decompressed so that we now have three arrays:

         inst: bytes coding the instructions and sizes.
         data: unmatched data associated with ADDs and RUNs.
         addr: bytes coding the addresses of COPYs.

   These arrays are organized as follows:

      inst: a sequence of (index, [size1], [size2]) tuples, where
            "index" is an index into the instruction code table, and
            size1 and size2 are integers that MAY or MAY NOT be included
            in the tuple as follows.  The entry with the given "index"
            in the instruction code table potentially defines two delta
            instructions.  If the first delta instruction is not a
            VCD_NOOP and its size is zero, then size1 MUST be present.
            Otherwise, size1 MUST be omitted and the size of the
            instruction (if it is not VCD_NOOP) is as defined in the
            table.  The presence or absence of size2 is defined
            similarly with respect to the second delta instruction.

      data: a sequence of data values, encoded as bytes.

      addr: a sequence of address values.  Addresses are normally
            encoded as integers as described in Section 2 (i.e., base
            128).  However, since the same cache emits addresses in the
            range [0,255], same cache addresses are always encoded as a
            single byte.

   To summarize, each tuple in the "inst" array includes an index to
   some entry in the instruction code table that determines:

   a. Whether one or two instructions were encoded and their types.

   b. If the instructions have their sizes encoded separately, these
      sizes will follow, in order, in the tuple.





Korn, et. al.               Standards Track                    [Page 20]

RFC 3284                         VCDIFF                        June 2002


   c. If the instructions have accompanying data, i.e., ADDs or RUNs,
      their data will be in the array "data".

   d. Similarly, if the instructions are COPYs, the coded addresses are
      found in the array "addr".

   The decoding procedure simply processes the arrays by reading one
   code index at a time, looking up the corresponding instruction code
   entry, then consuming the respective sizes, data and addresses
   following the directions in this entry.  In other words, the decoder
   maintains an implicit next-element pointer for each array;
   "consuming" an instruction tuple, data, or address value implies
   incrementing the associated pointer.

   For example, if during the processing of the target window, the next
   unconsumed tuple in the inst array has an index value of 19, then the
   first instruction is a COPY, whose size is found as the immediately
   following integer in the inst array.  Since the mode of this COPY
   instruction is VCD_SELF, the corresponding address is found by
   consuming the next integer in the addr array.  The data array is left
   intact.  As the second instruction for code index 19 is a NOOP, this
   tuple is finished.

7. APPLICATION-DEFINED CODE TABLES

   Although the default code table used in Vcdiff is good for general
   purpose encoders, there are times when other code tables may perform
   better.  For example, to code a file with many identical segments of
   data, it may be advantageous to have a COPY instruction with the
   specific size of these data segments, so that the instruction can be
   encoded in a single byte.  Such a special code table MUST then be
   encoded in the delta file so that the decoder can reconstruct it
   before decoding the data.

   Vcdiff allows an application-defined code table to be specified in a
   delta file with the following data:

         Size of near cache            - byte
         Size of same cache            - byte
         Compressed code table data

   The "compressed code table data" encodes the delta between the
   default code table (source) and the new code table (target) in the
   same manner as described in Section 4.3 for encoding a target window
   in terms of a source window.  This delta is computed using the
   following steps:





Korn, et. al.               Standards Track                    [Page 21]

RFC 3284                         VCDIFF                        June 2002


   a. Convert the new instruction code table into a string, "code", of
      1536 bytes using the below steps in order:

       i. Add in order the 256 bytes representing the types of the first
          instructions in the instruction pairs.
      ii. Add in order the 256 bytes representing the types of the
          second instructions in the instruction pairs.
     iii. Add in order the 256 bytes representing the sizes of the first
          instructions in the instruction pairs.
      iv. Add in order the 256 bytes representing the sizes of the
          second instructions in the instruction pairs.
       v. Add in order the 256 bytes representing the modes of the first
          instructions in the instruction pairs.
      vi. Add in order the 256 bytes representing the modes of the
          second instructions in the instruction pairs.

   b. Similarly, convert the default code table into a string "dflt".

   c. Treat the string "code" as a target window and "dflt" as the
      corresponding source data and apply an encoding algorithm to
      compute the delta encoding of "code" in terms of "dflt".  This
      computation MUST use the default code table for encoding the delta
      instructions.

   The decoder can then reverse the above steps to decode the compressed
   table data using the method of Section 6, employing the default code
   table, to generate the new code table.  Note that the decoder does
   not need to know about the details of the encoding algorithm used in
   step (c).  It is able to decode the new code table because the Vcdiff
   format is independent from the choice of encoding algorithm, and
   because the encoder in step (c) uses the known, default code table.

8. Performance

   The encoding format is compact.  For compression only, using the LZ-
   77 string parsing strategy and without any secondary compressors, the
   typical compression rate is better than Unix compress and close to
   gzip.  For differencing, the data format is better than all known
   methods in terms of its stated goal, which is primarily decoding
   speed and encoding efficiency.

   We compare the performance of compress, gzip and Vcdiff using the
   archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
   gcc-2.95.2.tar and gcc-2.95.3.tar.  Gzip was used at its default
   compression level.  The Vcdiff data were obtained using the
   Vcodex/Vcdiff software (Section 13).





Korn, et. al.               Standards Track                    [Page 22]

RFC 3284                         VCDIFF                        June 2002


   Below are the different Vcdiff runs:

      Vcdiff: vcdiff is used as a compressor only.

      Vcdiff-d: vcdiff is used as a differencer only.  That is, it only
         compares target data against source data.  Since the files
         involved are large, they are broken into windows.  In this
         case, each target window, starting at some file offset in the
         target file, is compared against a source window with the same
         file offset (in the source file).  The source window is also
         slightly larger than the target window to increase matching
         opportunities.

      Vcdiff-dc: This is similar to Vcdiff-d, but vcdiff can also
         compare target data against target data as applicable.  Thus,
         vcdiff both computes differences and compresses data.  The
         windowing algorithm is the same as above.  However, the above
         hint is recinded in this case.

      Vcdiff-dcw: This is similar to Vcdiff-dc but the windowing
         algorithm uses a content-based heuristic to select a source
         window that is more likely to match with a given target window.
         Thus, the source data segment selected for a target window
         often will not be aligned with the file offsets of this target
         window.

                       gcc-2.95.1     gcc-2.95.2     gcc-2.95.3
      ---------------------------------------------------------
      1. raw size      55,746,560     55,797,760     55,787,520
      2. compress         -           19,939,390     19,939,453
      3. gzip             -           12,973,443     12,998,097
      4. Vcdiff           -           15,358,786     15,371,737
      5. Vcdiff-d         -              100,971     26,383,849
      6. Vcdiff-dc        -               97,246     14,461,203
      7. Vcdiff-dcw       -              256,445      1,248,543

   The above table shows the raw sizes of the tar files and the sizes of
   the compressed results.  The differencing results in the gcc-2.95.2
   column were obtained by compressing gcc-2.95.2, given gcc-2.95.1.
   The same results for the column gcc-2.95.3 were obtained by
   compressing gcc-2.95.3, given gcc-2.95.2.

   Rows 2, 3 and 4 show that, for compression only, the compression rate
   from Vcdiff is worse than gzip and better than compress.
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -