📄 rfc3284.txt
字号:
7. COPY 0, [4,18] 4 NOOP 0 0 [83,98]
8. COPY 0, [4,18] 5 NOOP 0 0 [99,114]
9. COPY 0, [4,18] 6 NOOP 0 0 [115,130]
10. COPY 0, [4,18] 7 NOOP 0 0 [131,146]
11. COPY 0, [4,18] 8 NOOP 0 0 [147,162]
12. ADD [1,4] 0 COPY [4,6] 0 [163,174]
13. ADD [1,4] 0 COPY [4,6] 1 [175,186]
14. ADD [1,4] 0 COPY [4,6] 2 [187,198]
15. ADD [1,4] 0 COPY [4,6] 3 [199,210]
16. ADD [1,4] 0 COPY [4,6] 4 [211,222]
17. ADD [1,4] 0 COPY [4,6] 5 [223,234]
18. ADD [1,4] 0 COPY 4 6 [235,238]
19. ADD [1,4] 0 COPY 4 7 [239,242]
20. ADD [1,4] 0 COPY 4 8 [243,246]
21. COPY 4 [0,8] ADD 1 0 [247,255]
---------------------------------------------------------------
The default instruction code table is depicted above, in a compact
representation that we use only for descriptive purposes. See
section 7 for the specification of how an instruction code table is
represented in the Vcdiff encoding format. In the depiction, a zero
value for size indicates that the size is separately coded. The mode
of non-COPY instructions is represented as 0, even though they are
not used.
In the depiction, each numbered line represents one or more entries
in the actual instruction code table (recall that an entry in the
instruction code table may represent up to two combined delta
instructions.) The last column ("INDEX") shows which index value, or
range of index values, of the entries are covered by that line. (The
notation [i,j] means values from i through j, inclusively.) The
first 6 columns of a line in the depiction, describe the pairs of
instructions used for the corresponding index value(s).
If a line in the depiction includes a column entry using the [i,j]
notation, this means that the line is instantiated for each value in
the range from i to j, inclusively. The notation "0, [i,j]" means
that the line is instantiated for the value 0 and for each value in
the range from i to j, inclusively.
Korn, et. al. Standards Track [Page 18]
RFC 3284 VCDIFF June 2002
If a line in the depiction includes more than one entry using the
[i,j] notation, implying a "nested loop" to convert the line to a
range of table entries, the first such [i,j] range specifies the
outer loop, and the second specifies the inner loop.
The below examples should make clear the above description:
Line 1 shows the single RUN instruction with index 0. As the size
field is 0, this RUN instruction always has its actual size encoded
separately.
Line 2 shows the 18 single ADD instructions. The ADD instruction
with size field 0 (i.e., the actual size is coded separately) has
index 1. ADD instructions with sizes from 1 to 17 use code indices 2
to 18 and their sizes are as given (so they will not be separately
encoded.)
Following the single ADD instructions are the single COPY
instructions ordered by their address encoding modes. For example,
line 11 shows the COPY instructions with mode 8, i.e., the last of
the same cache. In this case, the COPY instruction with size field 0
has index 147. Again, the actual size of this instruction will be
coded separately.
Lines 12 to 21 show the pairs of instructions that are combined
together. For example, line 12 depicts the 12 entries in which an
ADD instruction is combined with an immediately following COPY
instruction. The entries with indices 163, 164, 165 represent the
pairs in which the ADD instructions all have size 1, while the COPY
instructions have mode 0 (VCD_SELF) and sizes 4, 5 and 6
respectively.
The last line, line 21, shows the eight instruction pairs, where the
first instruction is a COPY and the second is an ADD. In this case,
all COPY instructions have size 4 with mode ranging from 0 to 8 and
all the ADD instructions have size 1. Thus, the entry with the
largest index 255 combines a COPY instruction of size 4 and mode 8
with an ADD instruction of size 1.
The choice of the minimum size 4 for COPY instructions in the default
code table was made from experiments that showed that excluding small
matches (less then 4 bytes long) improved the compression rates.
Korn, et. al. Standards Track [Page 19]
RFC 3284 VCDIFF June 2002
6. Decoding a Target Window
Section 4.3 discusses that the delta instructions and associated data
are encoded in three arrays of bytes:
Data section for ADDs and RUNs,
Instructions and sizes section, and
Addresses section for COPYs.
Further, these data sections may have been further compressed by some
secondary compressor. Assuming that any such compressed data has
been decompressed so that we now have three arrays:
inst: bytes coding the instructions and sizes.
data: unmatched data associated with ADDs and RUNs.
addr: bytes coding the addresses of COPYs.
These arrays are organized as follows:
inst: a sequence of (index, [size1], [size2]) tuples, where
"index" is an index into the instruction code table, and
size1 and size2 are integers that MAY or MAY NOT be included
in the tuple as follows. The entry with the given "index"
in the instruction code table potentially defines two delta
instructions. If the first delta instruction is not a
VCD_NOOP and its size is zero, then size1 MUST be present.
Otherwise, size1 MUST be omitted and the size of the
instruction (if it is not VCD_NOOP) is as defined in the
table. The presence or absence of size2 is defined
similarly with respect to the second delta instruction.
data: a sequence of data values, encoded as bytes.
addr: a sequence of address values. Addresses are normally
encoded as integers as described in Section 2 (i.e., base
128). However, since the same cache emits addresses in the
range [0,255], same cache addresses are always encoded as a
single byte.
To summarize, each tuple in the "inst" array includes an index to
some entry in the instruction code table that determines:
a. Whether one or two instructions were encoded and their types.
b. If the instructions have their sizes encoded separately, these
sizes will follow, in order, in the tuple.
Korn, et. al. Standards Track [Page 20]
RFC 3284 VCDIFF June 2002
c. If the instructions have accompanying data, i.e., ADDs or RUNs,
their data will be in the array "data".
d. Similarly, if the instructions are COPYs, the coded addresses are
found in the array "addr".
The decoding procedure simply processes the arrays by reading one
code index at a time, looking up the corresponding instruction code
entry, then consuming the respective sizes, data and addresses
following the directions in this entry. In other words, the decoder
maintains an implicit next-element pointer for each array;
"consuming" an instruction tuple, data, or address value implies
incrementing the associated pointer.
For example, if during the processing of the target window, the next
unconsumed tuple in the inst array has an index value of 19, then the
first instruction is a COPY, whose size is found as the immediately
following integer in the inst array. Since the mode of this COPY
instruction is VCD_SELF, the corresponding address is found by
consuming the next integer in the addr array. The data array is left
intact. As the second instruction for code index 19 is a NOOP, this
tuple is finished.
7. APPLICATION-DEFINED CODE TABLES
Although the default code table used in Vcdiff is good for general
purpose encoders, there are times when other code tables may perform
better. For example, to code a file with many identical segments of
data, it may be advantageous to have a COPY instruction with the
specific size of these data segments, so that the instruction can be
encoded in a single byte. Such a special code table MUST then be
encoded in the delta file so that the decoder can reconstruct it
before decoding the data.
Vcdiff allows an application-defined code table to be specified in a
delta file with the following data:
Size of near cache - byte
Size of same cache - byte
Compressed code table data
The "compressed code table data" encodes the delta between the
default code table (source) and the new code table (target) in the
same manner as described in Section 4.3 for encoding a target window
in terms of a source window. This delta is computed using the
following steps:
Korn, et. al. Standards Track [Page 21]
RFC 3284 VCDIFF June 2002
a. Convert the new instruction code table into a string, "code", of
1536 bytes using the below steps in order:
i. Add in order the 256 bytes representing the types of the first
instructions in the instruction pairs.
ii. Add in order the 256 bytes representing the types of the
second instructions in the instruction pairs.
iii. Add in order the 256 bytes representing the sizes of the first
instructions in the instruction pairs.
iv. Add in order the 256 bytes representing the sizes of the
second instructions in the instruction pairs.
v. Add in order the 256 bytes representing the modes of the first
instructions in the instruction pairs.
vi. Add in order the 256 bytes representing the modes of the
second instructions in the instruction pairs.
b. Similarly, convert the default code table into a string "dflt".
c. Treat the string "code" as a target window and "dflt" as the
corresponding source data and apply an encoding algorithm to
compute the delta encoding of "code" in terms of "dflt". This
computation MUST use the default code table for encoding the delta
instructions.
The decoder can then reverse the above steps to decode the compressed
table data using the method of Section 6, employing the default code
table, to generate the new code table. Note that the decoder does
not need to know about the details of the encoding algorithm used in
step (c). It is able to decode the new code table because the Vcdiff
format is independent from the choice of encoding algorithm, and
because the encoder in step (c) uses the known, default code table.
8. Performance
The encoding format is compact. For compression only, using the LZ-
77 string parsing strategy and without any secondary compressors, the
typical compression rate is better than Unix compress and close to
gzip. For differencing, the data format is better than all known
methods in terms of its stated goal, which is primarily decoding
speed and encoding efficiency.
We compare the performance of compress, gzip and Vcdiff using the
archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
gcc-2.95.2.tar and gcc-2.95.3.tar. Gzip was used at its default
compression level. The Vcdiff data were obtained using the
Vcodex/Vcdiff software (Section 13).
Korn, et. al. Standards Track [Page 22]
RFC 3284 VCDIFF June 2002
Below are the different Vcdiff runs:
Vcdiff: vcdiff is used as a compressor only.
Vcdiff-d: vcdiff is used as a differencer only. That is, it only
compares target data against source data. Since the files
involved are large, they are broken into windows. In this
case, each target window, starting at some file offset in the
target file, is compared against a source window with the same
file offset (in the source file). The source window is also
slightly larger than the target window to increase matching
opportunities.
Vcdiff-dc: This is similar to Vcdiff-d, but vcdiff can also
compare target data against target data as applicable. Thus,
vcdiff both computes differences and compresses data. The
windowing algorithm is the same as above. However, the above
hint is recinded in this case.
Vcdiff-dcw: This is similar to Vcdiff-dc but the windowing
algorithm uses a content-based heuristic to select a source
window that is more likely to match with a given target window.
Thus, the source data segment selected for a target window
often will not be aligned with the file offsets of this target
window.
gcc-2.95.1 gcc-2.95.2 gcc-2.95.3
---------------------------------------------------------
1. raw size 55,746,560 55,797,760 55,787,520
2. compress - 19,939,390 19,939,453
3. gzip - 12,973,443 12,998,097
4. Vcdiff - 15,358,786 15,371,737
5. Vcdiff-d - 100,971 26,383,849
6. Vcdiff-dc - 97,246 14,461,203
7. Vcdiff-dcw - 256,445 1,248,543
The above table shows the raw sizes of the tar files and the sizes of
the compressed results. The differencing results in the gcc-2.95.2
column were obtained by compressing gcc-2.95.2, given gcc-2.95.1.
The same results for the column gcc-2.95.3 were obtained by
compressing gcc-2.95.3, given gcc-2.95.2.
Rows 2, 3 and 4 show that, for compression only, the compression rate
from Vcdiff is worse than gzip and better than compress.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -