📄 draft-korn-vcdiff.txt

📁 Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。
💻 TXT
📖 第 1 页 / 共 4 页
字号:
	    [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.	    The address was encoded as a single byte b such that	    "addr == same[(m - (s_near+2))*256 + b]".5.3 Example code for encoding and decoding of COPY instruction addresses    We show example algorithms below to demonstrate use of address modes more    clearly. The encoder has freedom to choose address modes, the sample    addr_encode() algorithm merely shows one way of picking the address    mode. The decoding algorithm addr_decode() will uniquely decode addresses    regardless of the encoder's algorithm choice.    Note that the address caches are updated immediately after an address is    encoded or decoded. In this way, the decoder is always synchronized with    the encoder.        int addr_encode(Cache_t* ka, int addr, int here, int* mode)        {	    int  i, d, bestd, bestm;	    /* Attempt to find the address mode that yields the	     * smallest integer value for "d", the encoded address	     * value, thereby minimizing the encoded size of the	     * address. */            bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */            if((d = here-addr) < bestd)                { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */            for(i = 0; i < ka->s_near; ++i)                if((d = addr - ka->near[i]) >= 0 && d < bestd)                    { bestd = d; bestm = i+2; }            if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)                { bestd = d%256; bestm = ka->s_near + 2 + d/256; }            cache_update(ka,addr);            *mode = bestm; /* this returns the address encoding mode */            return  bestd; /* this returns the encoded address       */        }    Note that the addr_encode() algorithm chooses the best address mode using a    local optimization, but that may not lead to the best encoding efficiency    because different modes lead to different instruction encodings, as    described below.    The functions addrint() and addrbyte() used in addr_decode() obtain from    the "Addresses section for COPYs" (Section 4.3) an integer or a byte,    respectively. These utilities will not be described here.  We simply    recall that an integer is represented as a compact variable-sized string    of bytes as described in Section 2 (i.e., base 128).        int addr_decode(Cache_t* ka, int here, int mode)        {   int  addr, m;            if(mode == VCD_SELF)                 addr = addrint();            else if(mode == VCD_HERE)                 addr = here - addrint();            else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */                 addr = ka->near[m] + addrint();            else /* same cache */            {    m = mode - (2 + ka->s_near);                 addr = ka->same[m*256 + addrbyte()];            }            cache_update(ka, addr);            return addr;        }5.4 Instruction Codes    As noted, the data sizes associated with delta instructions are often    small. Thus, compression efficiency can be improved by combining the sizes    and instruction types in a single encoding, as well by combining certain    pairs of adjacent delta instructions. Effective choices of when to perform    such combinations depend on many factors including the data being processed    and the string matching algorithm in use. For example, if many COPY    instructions have the same data sizes, it may be worth to encode these    instructions more compactly than others.    The Vcdiff data format is designed so that a decoder does not need to be    aware of the choices made in encoding algorithms. This is achieved with the    notion of an "instruction code table" containing 256 entries. Each entry    defines either a single delta instruction or a pair of instructions that    have been combined.  Note that the code table itself only exists in main    memory, not in the delta file (unless using an application-defined code    table, described in Section 7). The encoded data simply includes the index    of each instruction and, since there are only 256 indices, each index    can be represented as a single byte.    Each instruction code entry contains six fields, each of which    is a single byte with unsigned value:            +-----------------------------------------------+	    | inst1 | size1 | mode1 | inst2 | size2 | mode2 |	    +-----------------------------------------------+@@@ could be more compact    Each triple (inst,size,mode) defines a delta instruction. The meanings    of these fields are as follows:    inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),	RUN (2) or COPY (3) to indicate the instruction types. NOOP means	that no instruction is specified. In this case, both the corresponding	size and mode fields will be zero.    size: A "size" field is zero or positive. A value zero means that the	size associated with the instruction is encoded separately as	an integer in the "Instructions and sizes section" (Section 6).	A positive value for "size" defines the actual data size.	Note that since the size is restricted to a byte, the maximum	value for any instruction with size implicitly defined in the code	table is 255.    mode: A "mode" field is significant only when the associated delta	instruction is a COPY. It defines the mode used to encode the	associated addresses. For other instructions, this is always zero.5.5 The Code Table    Following the discussions on address modes and instruction code tables,    we define a "Code Table" to have the data below:	s_near: the size of the near cache,	s_same: the size of the same cache,	i_code: the 256-entry instruction code table.    Vcdiff itself defines a "default code table" in which s_near is 4    and s_same is 3. Thus, there are 9 address modes for a COPY instruction.    The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5    are for addresses coded against the near cache. And, modes 6, 7  and 8    are for addresses coded against the same cache.    The default instruction code table is depicted below, in a compact    representation that we use only for descriptive purposes.  See section 7    for the specification of how an instruction code table is represented    in the Vcdiff encoding format.  In the depiction, a zero value for    size indicates that the size is separately coded. The mode of non-COPY    instructions is represented as 0 even though they are not used.         TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX        ---------------------------------------------------------------     1.  RUN         0        0     NOOP       0        0        0     2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]     3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]     4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]     5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]     6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]     7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]     8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]     9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]    10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]    11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]    12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]    13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]    14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]    15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]    16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]    17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]    18.  ADD       [1,4]      0     COPY       4        6    [235,238]    19.  ADD       [1,4]      0     COPY       4        7    [239,242]    20.  ADD       [1,4]      0     COPY       4        8    [243,246]    21.  COPY        4      [0,8]   ADD        1        0    [247,255]        ---------------------------------------------------------------    In the above depiction, each numbered line represents one or more    entries in the actual instruction code table (recall that an entry in    the instruction code table may represent up to two combined delta    instructions.) The last column ("INDEX") shows which index value or    range of index values of the entries covered by that line. The notation    [i,j] means values from i through j, inclusive. The first 6 columns of    a line in the depiction describe the pairs of instructions used for    the corresponding index value(s).    If a line in the depiction includes a column entry using the [i,j]    notation, this means that the line is instantiated for each value    in the range from i to j, inclusive.  The notation "0, [i,j]" means    that the line is instantiated for the value 0 and for each value    in the range from i to j, inclusive.    If a line in the depiction includes more than one entry using the [i,j]    notation, implying a "nested loop" to convert the line to a range of    table entries, the first such [i,j] range specifies the outer loop,    and the second specifies the inner loop.    The below examples should make clear the above description:    Line 1 shows the single RUN instruction with index 0. As the size field    is 0, this RUN instruction always has its actual size encoded separately.    Line 2 shows the 18 single ADD instructions. The ADD instruction with    size field 0 (i.e., the actual size is coded separately) has index 1.    ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and    their sizes are as given (so they will not be separately encoded.)    Following the single ADD instructions are the single COPY instructions    ordered by their address encoding modes. For example, line 11 shows the    COPY instructions with mode 8, i.e., the last of the same cache.    In this case, the COPY instruction with size field 0 has index 147.    Again, the actual size of this instruction will be coded separately.    Lines 12 to 21 show the pairs of instructions that are combined together.    For example, line 12 depicts the 12 entries in which an ADD instruction    is combined with an immediately following COPY instruction. The entries    with indices 163, 164, 165 represent the pairs in which the ADD    instructions all have size 1 while the COPY instructions has mode    0 (VCD_SELF) and sizes 4, 5 and 6 respectively.    The last line, line 21, shows the eight instruction pairs where the first    instruction is a COPY and the second is an ADD. In this case, all COPY    instructions have size 4 with mode ranging from 0 to 8 and all the ADD    instructions have size 1. Thus, the entry with largest index 255    combines a COPY instruction of size 4 and mode 8 with an ADD instruction    of size 1.    The choice of the minimum size 4 for COPY instructions in the default code    table was made from experiments that showed that excluding small matches    (less then 4 bytes long) improved the compression rates.6. DECODING A TARGET WINDOW    Section 4.3 discusses that the delta instructions and associated data    are encoded in three arrays of bytes:        Data section for ADDs and RUNs,        Instructions and sizes section, and        Addresses section for COPYs.    Further, these data sections may have been further compressed by some    secondary compressor. Assuming that any such compressed data has been    decompressed so that we now have three arrays:	inst: bytes coding the instructions and sizes.        data: unmatched data associated with ADDs and RUNs.	addr: bytes coding the addresses of COPYs.    These arrays are organized as follows:	inst:	    a sequence of (index, [size1], [size2]) tuples, where "index"            is an index into the instruction code table, and size1 and size2            are integers that MAY or MAY NOT be included in the tuple as            follows. The entry with the given "index" in the instruction            code table potentially defines two delta instructions. If the            first delta instruction is not a VCD_NOOP and its size is zero,            then size1 MUST be present. Otherwise, size1 MUST be omitted and            the size of the instruction (if it is not VCD_NOOP) is as defined            in the table. The presence or absence of size2 is defined            similarly with respect to the second delta instruction.	data:	    a sequence of data values, encoded as bytes.	addr:	    a sequence of address values. Addresses are normally encoded as            integers as described in Section 2 (i.e., base 128).	    Since the same cache emits addresses in the range [0,255],	    however, same cache addresses are always encoded as a	    single byte.    To summarize, each tuple in the "inst" array includes an index to some    entry in the instruction code table that determines:    a. Whether one or two instructions were encoded and their types.    b. If the instructions have their sizes encoded separately, these       sizes will follow, in order, in the tuple.    c. If the instructions have accompanying data, i.e., ADDs or RUNs,       their data will be in the array "data".    d. Similarly, if the instructions are COPYs, the coded addresses are       found in the array "addr".    The decoding procedure simply processes the arrays by reading one code    index at a time, looking up the corresponding instruction code entry,    then consuming the respective sizes, data and addresses following the    directions in this entry. In other words, the decoder maintains an implicit    next-element pointer for each array; "consuming" an instruction tuple,    data, or address value implies incrementing the associated pointer.    For example, if during the processing of the target window, the next    unconsumed tuple in the inst array has index value 19, then the first    instruction is a COPY, whose size is found as the immediately following    integer in the inst array.  Since the mode of this COPY instruction is    VCD_SELF, the corresponding address is found by consuming the next    integer in the addr array.  The data array is left intact. As the second    instruction for code index 19 is a NOOP, this tuple is finished.7. APPLICATION-DEFINED CODE TABLES    Although the default code table used in Vcdiff is good for general    purpose encoders, there are times when other code tables may perform    better. For example, to code a file with many identical segments of data,    it may be advantageous to have a COPY instruction with the specific size    of these data segments so that the instruction can be encoded in a single    byte. Such a special code table MUST then be encoded in the delta file    so that the decoder can reconstruct it before decoding the data.    Vcdiff allows an application-defined code table to be specified    in a delta file with the following data:	Size of near cache            - byte	Size of same cache            - byte	Compressed code table data    The "compressed code table data" encodes the delta between the default    code table (source) and the new code table (target) in the same manner as    described in Section 4.3 for encoding a target window in terms of a    source window. This delta is computed using the following steps:    a.  Convert the new instruction code table into a string, "code", of	1536 bytes using the below steps in order:        i. Add in order the 256 bytes representing the types of the first	   instructions in the instruction pairs.       ii. Add in order the 256 bytes representing the types of the second	   instructions in the instruction pairs.      iii. Add in order the 256 bytes representing the sizes of the first	   instructions in the instruction pairs.
💿 文件大小 215 K
👤 上传用户 shangranc
📂 所属分类 Linux/Unix编程
🏷️ 相关标签

#xdelta #Linux #3.0 #比较
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -