📄 draft-korn-vcdiff.txt
字号:
Header3 - byte = 0xD4HMMM0xD60xC30xC4 Header4 - byte Hdr_Indicator - byte [Secondary compressor ID] - byte [Length of code table data] - integer [Code table data] The first three Header bytes are the ASCII characters 'V', 'C' and 'D' with their most significant bits turned on (in hexadecimal, the values are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to zero. In the future, it might be used to indicate the version of Vcdiff. The Hdr_Indicator byte shows if there are any initialization data required to aid in the reconstruction of data in the Window sections. This byte MAY have non-zero values for either, both, or neither of the two bits VCD_DECOMPRESS and VCD_CODETABLE below: 7 6 5 4 3 2 1 0 +-+-+-+-+-+-+-+-+ | | | | | | | | | +-+-+-+-+-+-+-+-+ ^ ^ | | | +-- VCD_DECOMPRESS +---- VCD_CODETABLE If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary compressor may have been used to further compress certain parts of the delta encoding data as described in Sections 4.3 and 6. In that case, the ID of the secondary compressor is given next. If this bit is zero, the compressor ID byte is not included.[@@@ If we aren't defining any secondary compressors yet, then it seemsthis bit has no real value yet..] If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an application-defined code table is to be used for decoding the delta instructions. This table itself is compressed. The length of the data comprising this compressed code table and the data follow next. Section 7 discusses application-defined code tables. If this bit is zero, the code table data length and the code table data are not included. If both bits are set, then the compressor ID byte is included before the code table data length and the code table data.4.2 The Format of a Window Section Each Window section is organized as follows: Win_Indicator - byte [Source segment length] - integer [Source segment position] - integer The delta encoding of the target window Below are the detail of the various items:[@@@ Here, I want to replace the Win_Indicator with a source-count,followed by source-count length/position pairs?] Win_Indicator: This byte is a set of bits, as shown: 7 6 5 4 3 2 1 0 +-+-+-+-+-+-+-+-+ | | | | | | | | | +-+-+-+-+-+-+-+-+ ^ ^ | | | +-- VCD_SOURCE +---- VCD_TARGET If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment of data from the "source" file was used as the corresponding source window of data to encode the target window. The decoder will use this same source data segment to decode the target window. If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment of data from the "target" file was used as the corresponding source window of data to encode the target window. As above, this same source data segment is used to decode the target window. The Win_Indicator byte MUST NOT have more than one of the bits set (non-zero). It MAY have none of these bits set. If one of these bits is set, the byte is followed by two integers to indicate respectively the length and position of the source data segment in the relevant file. If the indicator byte is zero, the target window was compressed by itself without comparing against another data segment, and these two integers are not included. The delta encoding of the target window: This contains the delta encoding of the target window either in terms of the source data segment (i.e., VCD_SOURCE or VCD_TARGET was set) or by itself if no source window is specified. This data format is discussed next.4.3 The Delta Encoding of a Target Window The delta encoding of a target window is organized as follows: Length of the delta encoding - integer The delta encoding Length of the target window - integer Delta_Indicator - byte Length of data for ADDs and RUNs - integer Length of instructions section - integer Length of addresses for COPYs - integer Data section for ADDs and RUNs - array of bytes Instructions and sizes section - array of bytes Addresses section for COPYs - array of bytes Length of the delta encoding: This integer gives the total number of remaining bytes that comprise data of the delta encoding for this target window. The delta encoding: This contains the data representing the delta encoding which is described next. Length of the target window: This integer indicates the actual size of the target window after decompression. A decoder can use this value to allocate memory to store the uncompressed data. Delta_Indicator: This byte is a set of bits, as shown: 7 6 5 4 3 2 1 0 +-+-+-+-+-+-+-+-+ | | | | | | | | | +-+-+-+-+-+-+-+-+ ^ ^ ^ | | | | | +-- VCD_DATACOMP | +---- VCD_INSTCOMP +------ VCD_ADDRCOMP VCD_DATACOMP: bit value 1. VCD_INSTCOMP: bit value 2. VCD_ADDRCOMP: bit value 4. As discussed, the delta encoding consists of COPY, ADD and RUN instructions. The ADD and RUN instructions have accompanying unmatched data (that is, data that does not specifically match any data in the source window or in some earlier part of the target window) and the COPY instructions have addresses of where the matches occur. OPTIONALLY, these types of data MAY be further compressed using a secondary compressor. Thus, Vcdiff separates the encoding of the delta instructions into three parts: a. The unmatched data in the ADD and RUN instructions, b. The delta instructions and accompanying sizes, and c. The addresses of the COPY instructions. If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these sections may have been compressed using the specified secondary compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that the corresponding parts are compressed. Then, these parts MUST be decompressed before decoding the delta instructions. Length of data for ADDs and RUNs: This is the length (in bytes) of the section of data storing the unmatched data accompanying the ADD and RUN instructions. Length of instructions section: This is the length (in bytes) of the delta instructions and accompanying sizes. Length of addresses for COPYs: This is the length (in bytes) of the section storing the addresses of the COPY instructions. Data section for ADDs and RUNs: This sequence of bytes encodes the unmatched data for the ADD and RUN instructions. Instructions and sizes section: This sequence of bytes encodes the instructions and their sizes. Addresses section for COPYs: This sequence of bytes encodes the addresses of the COPY instructions.5. DELTA INSTRUCTION ENCODING The delta instructions described in Section 3 represent the results of string matching. For many data differencing applications in which the changes between source and target data are small, any straightforward representation of these instructions would be adequate. However, for applications including data compression, it is important to encode these instructions well to achieve good compression rates. From our experience, the following observations can be made: a. The addresses in COPY instructions are locations of matches and often occur close by or even exactly equal to one another. This is because data in local regions are often replicated with minor changes. In turn, this means that coding a newly matched address against some set of recently matched addresses can be beneficial. b. The matches are often short in length and separated by small amounts of unmatched data. That is, the lengths of COPY and ADD instructions are often small. This is particularly true of binary data such as executable files or structured data such as HTML or XML. In such cases, compression can be improved by combining the encoding of the sizes and the instruction types as well as combining the encoding of adjacent delta instructions with sufficiently small data sizes. The below subsections discuss how the Vcdiff data format provides mechanisms enabling encoders to use the above observations to improve compression rates.5.1 Address Encoding Modes of COPY Instructions As mentioned earlier, addresses of COPY instructions often occur close to one another or are exactly equal. To take advantage of this phenomenon and encode addresses of COPY instructions more efficiently, the Vcdiff data format supports the use of two different types of address caches. Both the encoder and decoder maintain these caches, so that decoder's caches remain synchronized with the encoder's caches. a. A "near" cache is an array with "s_near" slots, each containing an address used for encoding addresses nearby to previously encoded addresses (in the positive direction only). The near cache also maintains a "next_slot" index to the near cache. New entries to the near cache are always inserted in the next_slot index, which maintains a circular buffer of the s_near most recent addresses. b. A "same" cache is an array with "s_same" multiple of 256 slots, each containing an address. The same cache maintains a hash table of recent addresses used for repeated encoding of the exact same address. By default, the parameters s_near and s_same are respectively set to 4 and 3. An encoder MAY modify these values, but then it MUST encode the new values in the encoding itself, as discussed in Section 7, so that the decoder can properly set up its own caches. At the start of processing a target window, an implementation (encoder or decoder) initializes all of the slots in both caches to zero. The next_slot pointer of the near cache is set to point to slot zero. Each time a COPY instruction is processed by the encoder or decoder, the implementation's caches are updated as follows, where "addr" is the address in the COPY instruction. a. The slot in the near cache referenced by the next_slot index is set to addr. The next_slot index is then incremented modulo s_near. b. The slot in the same cache whose index is addr%(s_same*256) is set to addr. [We use the C notations of % for modulo and * for multiplication.]5.2 Example code for maintaining caches To make clear the above description, below are example cache data structures and algorithms to initialize and update them: typedef struct _cache_s { int* near; /* array of size s_near */ int s_near; int next_slot; /* the circular index for near */ int* same; /* array of size s_same*256 */ int s_same; } Cache_t; cache_init(Cache_t* ka) { int i; ka->next_slot = 0; for(i = 0; i < ka->s_near; ++i) ka->near[i] = 0; for(i = 0; i < ka->s_same*256; ++i) ka->same[i] = 0; } cache_update(Cache_t* ka, int addr) { if(ka->s_near > 0) { ka->near[ka->next_slot] = addr; ka->next_slot = (ka->next_slot + 1) % ka->s_near; } if(ka->s_same > 0) ka->same[addr % (ka->s_same*256)] = addr; }5.3 Encoding of COPY instruction addresses The address of a COPY instruction is encoded using different modes depending on the type of cached address used, if any. Let "addr" be the address of a COPY instruction to be decoded and "here" be the current location in the target data (i.e., the start of the data about to be encoded or decoded). Let near[j] be the jth element in the near cache, and same[k] be the kth element in the same cache. Below are the possible address modes: VCD_SELF: This mode has value 0. The address was encoded by itself as an integer. VCD_HERE: This mode has value 1. The address was encoded as the integer value "here - addr". Near modes: The "near modes" are in the range [2,s_near+1]. Let m be the mode of the address encoding. The address was encoded as the integer value "addr - near[m-2]". Same modes: The "same modes" are in the range
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -