📄 draft-korn-vcdiff.txt
字号:
[s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. The address was encoded as a single byte b such that "addr == same[(m - (s_near+2))*256 + b]".5.3 Example code for encoding and decoding of COPY instruction addresses We show example algorithms below to demonstrate use of address modes more clearly. The encoder has freedom to choose address modes, the sample addr_encode() algorithm merely shows one way of picking the address mode. The decoding algorithm addr_decode() will uniquely decode addresses regardless of the encoder's algorithm choice. Note that the address caches are updated immediately after an address is encoded or decoded. In this way, the decoder is always synchronized with the encoder. int addr_encode(Cache_t* ka, int addr, int here, int* mode) { int i, d, bestd, bestm; /* Attempt to find the address mode that yields the * smallest integer value for "d", the encoded address * value, thereby minimizing the encoded size of the * address. */ bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */ if((d = here-addr) < bestd) { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */ for(i = 0; i < ka->s_near; ++i) if((d = addr - ka->near[i]) >= 0 && d < bestd) { bestd = d; bestm = i+2; } if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr) { bestd = d%256; bestm = ka->s_near + 2 + d/256; } cache_update(ka,addr); *mode = bestm; /* this returns the address encoding mode */ return bestd; /* this returns the encoded address */ } Note that the addr_encode() algorithm chooses the best address mode using a local optimization, but that may not lead to the best encoding efficiency because different modes lead to different instruction encodings, as described below. The functions addrint() and addrbyte() used in addr_decode() obtain from the "Addresses section for COPYs" (Section 4.3) an integer or a byte, respectively. These utilities will not be described here. We simply recall that an integer is represented as a compact variable-sized string of bytes as described in Section 2 (i.e., base 128). int addr_decode(Cache_t* ka, int here, int mode) { int addr, m; if(mode == VCD_SELF) addr = addrint(); else if(mode == VCD_HERE) addr = here - addrint(); else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */ addr = ka->near[m] + addrint(); else /* same cache */ { m = mode - (2 + ka->s_near); addr = ka->same[m*256 + addrbyte()]; } cache_update(ka, addr); return addr; }5.4 Instruction Codes As noted, the data sizes associated with delta instructions are often small. Thus, compression efficiency can be improved by combining the sizes and instruction types in a single encoding, as well by combining certain pairs of adjacent delta instructions. Effective choices of when to perform such combinations depend on many factors including the data being processed and the string matching algorithm in use. For example, if many COPY instructions have the same data sizes, it may be worth to encode these instructions more compactly than others. The Vcdiff data format is designed so that a decoder does not need to be aware of the choices made in encoding algorithms. This is achieved with the notion of an "instruction code table" containing 256 entries. Each entry defines either a single delta instruction or a pair of instructions that have been combined. Note that the code table itself only exists in main memory, not in the delta file (unless using an application-defined code table, described in Section 7). The encoded data simply includes the index of each instruction and, since there are only 256 indices, each index can be represented as a single byte. Each instruction code entry contains six fields, each of which is a single byte with unsigned value: +-----------------------------------------------+ | inst1 | size1 | mode1 | inst2 | size2 | mode2 | +-----------------------------------------------+@@@ could be more compact Each triple (inst,size,mode) defines a delta instruction. The meanings of these fields are as follows: inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), RUN (2) or COPY (3) to indicate the instruction types. NOOP means that no instruction is specified. In this case, both the corresponding size and mode fields will be zero. size: A "size" field is zero or positive. A value zero means that the size associated with the instruction is encoded separately as an integer in the "Instructions and sizes section" (Section 6). A positive value for "size" defines the actual data size. Note that since the size is restricted to a byte, the maximum value for any instruction with size implicitly defined in the code table is 255. mode: A "mode" field is significant only when the associated delta instruction is a COPY. It defines the mode used to encode the associated addresses. For other instructions, this is always zero.5.5 The Code Table Following the discussions on address modes and instruction code tables, we define a "Code Table" to have the data below: s_near: the size of the near cache, s_same: the size of the same cache, i_code: the 256-entry instruction code table. Vcdiff itself defines a "default code table" in which s_near is 4 and s_same is 3. Thus, there are 9 address modes for a COPY instruction. The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 are for addresses coded against the near cache. And, modes 6, 7 and 8 are for addresses coded against the same cache. The default instruction code table is depicted below, in a compact representation that we use only for descriptive purposes. See section 7 for the specification of how an instruction code table is represented in the Vcdiff encoding format. In the depiction, a zero value for size indicates that the size is separately coded. The mode of non-COPY instructions is represented as 0 even though they are not used. TYPE SIZE MODE TYPE SIZE MODE INDEX --------------------------------------------------------------- 1. RUN 0 0 NOOP 0 0 0 2. ADD 0, [1,17] 0 NOOP 0 0 [1,18] 3. COPY 0, [4,18] 0 NOOP 0 0 [19,34] 4. COPY 0, [4,18] 1 NOOP 0 0 [35,50] 5. COPY 0, [4,18] 2 NOOP 0 0 [51,66] 6. COPY 0, [4,18] 3 NOOP 0 0 [67,82] 7. COPY 0, [4,18] 4 NOOP 0 0 [83,98] 8. COPY 0, [4,18] 5 NOOP 0 0 [99,114] 9. COPY 0, [4,18] 6 NOOP 0 0 [115,130] 10. COPY 0, [4,18] 7 NOOP 0 0 [131,146] 11. COPY 0, [4,18] 8 NOOP 0 0 [147,162] 12. ADD [1,4] 0 COPY [4,6] 0 [163,174] 13. ADD [1,4] 0 COPY [4,6] 1 [175,186] 14. ADD [1,4] 0 COPY [4,6] 2 [187,198] 15. ADD [1,4] 0 COPY [4,6] 3 [199,210] 16. ADD [1,4] 0 COPY [4,6] 4 [211,222] 17. ADD [1,4] 0 COPY [4,6] 5 [223,234] 18. ADD [1,4] 0 COPY 4 6 [235,238] 19. ADD [1,4] 0 COPY 4 7 [239,242] 20. ADD [1,4] 0 COPY 4 8 [243,246] 21. COPY 4 [0,8] ADD 1 0 [247,255] --------------------------------------------------------------- In the above depiction, each numbered line represents one or more entries in the actual instruction code table (recall that an entry in the instruction code table may represent up to two combined delta instructions.) The last column ("INDEX") shows which index value or range of index values of the entries covered by that line. The notation [i,j] means values from i through j, inclusive. The first 6 columns of a line in the depiction describe the pairs of instructions used for the corresponding index value(s). If a line in the depiction includes a column entry using the [i,j] notation, this means that the line is instantiated for each value in the range from i to j, inclusive. The notation "0, [i,j]" means that the line is instantiated for the value 0 and for each value in the range from i to j, inclusive. If a line in the depiction includes more than one entry using the [i,j] notation, implying a "nested loop" to convert the line to a range of table entries, the first such [i,j] range specifies the outer loop, and the second specifies the inner loop. The below examples should make clear the above description: Line 1 shows the single RUN instruction with index 0. As the size field is 0, this RUN instruction always has its actual size encoded separately. Line 2 shows the 18 single ADD instructions. The ADD instruction with size field 0 (i.e., the actual size is coded separately) has index 1. ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and their sizes are as given (so they will not be separately encoded.) Following the single ADD instructions are the single COPY instructions ordered by their address encoding modes. For example, line 11 shows the COPY instructions with mode 8, i.e., the last of the same cache. In this case, the COPY instruction with size field 0 has index 147. Again, the actual size of this instruction will be coded separately. Lines 12 to 21 show the pairs of instructions that are combined together. For example, line 12 depicts the 12 entries in which an ADD instruction is combined with an immediately following COPY instruction. The entries with indices 163, 164, 165 represent the pairs in which the ADD instructions all have size 1 while the COPY instructions has mode 0 (VCD_SELF) and sizes 4, 5 and 6 respectively. The last line, line 21, shows the eight instruction pairs where the first instruction is a COPY and the second is an ADD. In this case, all COPY instructions have size 4 with mode ranging from 0 to 8 and all the ADD instructions have size 1. Thus, the entry with largest index 255 combines a COPY instruction of size 4 and mode 8 with an ADD instruction of size 1. The choice of the minimum size 4 for COPY instructions in the default code table was made from experiments that showed that excluding small matches (less then 4 bytes long) improved the compression rates.6. DECODING A TARGET WINDOW Section 4.3 discusses that the delta instructions and associated data are encoded in three arrays of bytes: Data section for ADDs and RUNs, Instructions and sizes section, and Addresses section for COPYs. Further, these data sections may have been further compressed by some secondary compressor. Assuming that any such compressed data has been decompressed so that we now have three arrays: inst: bytes coding the instructions and sizes. data: unmatched data associated with ADDs and RUNs. addr: bytes coding the addresses of COPYs. These arrays are organized as follows: inst: a sequence of (index, [size1], [size2]) tuples, where "index" is an index into the instruction code table, and size1 and size2 are integers that MAY or MAY NOT be included in the tuple as follows. The entry with the given "index" in the instruction code table potentially defines two delta instructions. If the first delta instruction is not a VCD_NOOP and its size is zero, then size1 MUST be present. Otherwise, size1 MUST be omitted and the size of the instruction (if it is not VCD_NOOP) is as defined in the table. The presence or absence of size2 is defined similarly with respect to the second delta instruction. data: a sequence of data values, encoded as bytes. addr: a sequence of address values. Addresses are normally encoded as integers as described in Section 2 (i.e., base 128). Since the same cache emits addresses in the range [0,255], however, same cache addresses are always encoded as a single byte. To summarize, each tuple in the "inst" array includes an index to some entry in the instruction code table that determines: a. Whether one or two instructions were encoded and their types. b. If the instructions have their sizes encoded separately, these sizes will follow, in order, in the tuple. c. If the instructions have accompanying data, i.e., ADDs or RUNs, their data will be in the array "data". d. Similarly, if the instructions are COPYs, the coded addresses are found in the array "addr". The decoding procedure simply processes the arrays by reading one code index at a time, looking up the corresponding instruction code entry, then consuming the respective sizes, data and addresses following the directions in this entry. In other words, the decoder maintains an implicit next-element pointer for each array; "consuming" an instruction tuple, data, or address value implies incrementing the associated pointer. For example, if during the processing of the target window, the next unconsumed tuple in the inst array has index value 19, then the first instruction is a COPY, whose size is found as the immediately following integer in the inst array. Since the mode of this COPY instruction is VCD_SELF, the corresponding address is found by consuming the next integer in the addr array. The data array is left intact. As the second instruction for code index 19 is a NOOP, this tuple is finished.7. APPLICATION-DEFINED CODE TABLES Although the default code table used in Vcdiff is good for general purpose encoders, there are times when other code tables may perform better. For example, to code a file with many identical segments of data, it may be advantageous to have a COPY instruction with the specific size of these data segments so that the instruction can be encoded in a single byte. Such a special code table MUST then be encoded in the delta file so that the decoder can reconstruct it before decoding the data. Vcdiff allows an application-defined code table to be specified in a delta file with the following data: Size of near cache - byte Size of same cache - byte Compressed code table data The "compressed code table data" encodes the delta between the default code table (source) and the new code table (target) in the same manner as described in Section 4.3 for encoding a target window in terms of a source window. This delta is computed using the following steps: a. Convert the new instruction code table into a string, "code", of 1536 bytes using the below steps in order: i. Add in order the 256 bytes representing the types of the first instructions in the instruction pairs. ii. Add in order the 256 bytes representing the types of the second instructions in the instruction pairs. iii. Add in order the 256 bytes representing the sizes of the first instructions in the instruction pairs.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -