📄 draft-korn-vcdiff.txt

📁 Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。
💻 TXT
📖 第 1 页 / 共 4 页
字号:
	    Header3                                  - byte = 0xD4HMMM0xD60xC30xC4	    Header4                                  - byte	    Hdr_Indicator                            - byte	    [Secondary compressor ID]                - byte	    [Length of code table data]              - integer	    [Code table data]    The first three Header bytes are the ASCII characters 'V', 'C' and 'D'    with their most significant bits turned on (in hexadecimal, the values    are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to    zero. In the future, it might be used to indicate the version of Vcdiff.    The Hdr_Indicator byte shows if there are any initialization data    required to aid in the reconstruction of data in the Window sections.    This byte MAY have non-zero values for either, both, or neither of    the two bits VCD_DECOMPRESS and VCD_CODETABLE below:	    7 6 5 4 3 2 1 0	   +-+-+-+-+-+-+-+-+	   | | | | | | | | |	   +-+-+-+-+-+-+-+-+	                ^ ^	                | |	                | +-- VCD_DECOMPRESS	                +---- VCD_CODETABLE    If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary    compressor may have been used to further compress certain parts of the    delta encoding data as described in Sections 4.3 and 6. In that case,    the ID of the secondary compressor is given next. If this bit is zero,    the compressor ID byte is not included.[@@@ If we aren't defining any secondary compressors yet, then it seemsthis bit has no real value yet..]    If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an    application-defined code table is to be used for decoding the delta    instructions. This table itself is compressed.  The length of the data    comprising this compressed code table and the data follow next. Section 7    discusses application-defined code tables.  If this bit is zero, the code    table data length and the code table data are not included.    If both bits are set, then the compressor ID byte is included    before the code table data length and the code table data.4.2 The Format of a Window Section    Each Window section is organized as follows:	    Win_Indicator                            - byte	    [Source segment length]                  - integer	    [Source segment position]                - integer            The delta encoding of the target window    Below are the detail of the various items:[@@@ Here, I want to replace the Win_Indicator with a source-count,followed by source-count length/position pairs?]        Win_Indicator:	    This byte is a set of bits, as shown:	    7 6 5 4 3 2 1 0	   +-+-+-+-+-+-+-+-+	   | | | | | | | | |	   +-+-+-+-+-+-+-+-+	                ^ ^	                | |	                | +-- VCD_SOURCE	                +---- VCD_TARGET	    If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment            of data from the "source" file was used as the corresponding            source window of data to encode the target window. The decoder	    will use this same source data segment to decode the target window.	    If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment            of data from the "target" file was used as the corresponding	    source window of data to encode the target window. As above, this	    same source data segment is used to decode the target window.	    The Win_Indicator byte MUST NOT have more than one of the bits	    set (non-zero).  It MAY have none of these bits set.	    If one of these bits is set, the byte is followed by two            integers to indicate respectively the length and position of            the source data segment in the relevant file.  If the            indicator byte is zero, the target window was compressed            by itself without comparing against another data segment,            and these two integers are not included.        The delta encoding of the target window:            This contains the delta encoding of the target window either            in terms of the source data segment (i.e., VCD_SOURCE            or VCD_TARGET was set) or by itself if no source window            is specified. This data format is discussed next.4.3 The Delta Encoding of a Target Window    The delta encoding of a target window is organized as follows:	Length of the delta encoding            - integer	The delta encoding	    Length of the target window         - integer	    Delta_Indicator                     - byte	    Length of data for ADDs and RUNs    - integer	    Length of instructions section      - integer	    Length of addresses for COPYs       - integer	    Data section for ADDs and RUNs      - array of bytes	    Instructions and sizes section      - array of bytes	    Addresses section for COPYs         - array of bytes	Length of the delta encoding:	    This integer gives the total number of remaining bytes that	    comprise data of the delta encoding for this target window.        The delta encoding:	    This contains the data representing the delta encoding which	    is described next.    	Length of the target window:	    This integer indicates the actual size of the target window            after decompression. A decoder can use this value to allocate            memory to store the uncompressed data.	Delta_Indicator:	    This byte is a set of bits, as shown:	    7 6 5 4 3 2 1 0	   +-+-+-+-+-+-+-+-+	   | | | | | | | | |	   +-+-+-+-+-+-+-+-+	              ^ ^ ^	              | | |	              | | +-- VCD_DATACOMP	              | +---- VCD_INSTCOMP	              +------ VCD_ADDRCOMP		VCD_DATACOMP:	bit value 1.		VCD_INSTCOMP:	bit value 2.		VCD_ADDRCOMP:	bit value 4.            As discussed, the delta encoding consists of COPY, ADD and RUN            instructions. The ADD and RUN instructions have accompanying            unmatched data (that is, data that does not specifically match            any data in the source window or in some earlier part of the            target window) and the COPY instructions have addresses of where	    the matches occur. OPTIONALLY, these types of data MAY be further	    compressed using a secondary compressor. Thus, Vcdiff separates            the encoding of the delta instructions into three parts:	        a. The unmatched data in the ADD and RUN instructions,	        b. The delta instructions and accompanying sizes, and                c. The addresses of the COPY instructions.            If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these            sections may have been compressed using the specified secondary            compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),            and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that            the corresponding parts are compressed. Then, these parts MUST	    be decompressed before decoding the delta instructions.	Length of data for ADDs and RUNs:	    This is the length (in bytes) of the section of data storing            the unmatched data accompanying the ADD and RUN instructions.	Length of instructions section:	    This is the length (in bytes) of the delta instructions and            accompanying sizes.	Length of addresses for COPYs:	    This is the length (in bytes) of the section storing            the addresses of the COPY instructions.    	Data section for ADDs and RUNs:	    This sequence of bytes encodes the unmatched data for the ADD            and RUN instructions.	Instructions and sizes section:	    This sequence of bytes encodes the instructions and their sizes.	Addresses section for COPYs:	    This sequence of bytes encodes the addresses of the COPY	    instructions.5. DELTA INSTRUCTION ENCODING    The delta instructions described in Section 3 represent the results of    string matching. For many data differencing applications in which the    changes between source and target data are small, any straightforward    representation of these instructions would be adequate.  However, for    applications including data compression, it is important to encode    these instructions well to achieve good compression rates.  From our    experience, the following observations can be made:    a. The addresses in COPY instructions are locations of matches and       often occur close by or even exactly equal to one another. This is       because data in local regions are often replicated with minor changes.       In turn, this means that coding a newly matched address against some       set of recently matched addresses can be beneficial.    b. The matches are often short in length and separated by small amounts       of unmatched data. That is, the lengths of COPY and ADD instructions       are often small. This is particularly true of binary data such as       executable files or structured data such as HTML or XML. In such cases,       compression can be improved by combining the encoding of the sizes       and the instruction types as well as combining the encoding of adjacent       delta instructions with sufficiently small data sizes.    The below subsections discuss how the Vcdiff data format provides    mechanisms enabling encoders to use the above observations to improve    compression rates.5.1 Address Encoding Modes of COPY Instructions    As mentioned earlier, addresses of COPY instructions often occur close    to one another or are exactly equal. To take advantage of this phenomenon    and encode addresses of COPY instructions more efficiently, the Vcdiff    data format supports the use of two different types of address caches.    Both the encoder and decoder maintain these caches, so that decoder's    caches remain synchronized with the encoder's caches.    a. A "near" cache is an array with "s_near" slots, each containing an       address used for encoding addresses nearby to previously encoded       addresses (in the positive direction only).  The near cache also       maintains a "next_slot" index to the near cache.  New entries to the       near cache are always inserted in the next_slot index, which maintains       a circular buffer of the s_near most recent addresses.    b. A "same" cache is an array with "s_same" multiple of 256 slots, each       containing an address.  The same cache maintains a hash table of recent       addresses used for repeated encoding of the exact same address.    By default, the parameters s_near and s_same are respectively set to 4    and 3. An encoder MAY modify these values, but then it MUST encode the    new values in the encoding itself, as discussed in Section 7, so that    the decoder can properly set up its own caches.    At the start of processing a target window, an implementation    (encoder or decoder) initializes all of the slots in both caches    to zero.  The next_slot pointer of the near cache is set    to point to slot zero.    Each time a COPY instruction is processed by the encoder or    decoder, the implementation's caches are updated as follows, where    "addr" is the address in the COPY instruction.    a. The slot in the near cache referenced by the next_slot       index is set to addr.  The next_slot index is then incremented       modulo s_near.    b. The slot in the same cache whose index is addr%(s_same*256)       is set to addr. [We use the C notations of % for modulo and       * for multiplication.]5.2 Example code for maintaining caches    To make clear the above description, below are example cache data    structures and algorithms to initialize and update them:        typedef struct _cache_s        {	    int*  near;      /* array of size s_near        */            int   s_near;            int   next_slot; /* the circular index for near */            int*  same;      /* array of size s_same*256    */            int   s_same;        } Cache_t;        cache_init(Cache_t* ka)        {	    int   i;            ka->next_slot = 0;            for(i = 0; i < ka->s_near; ++i)                 ka->near[i] = 0;            for(i = 0; i < ka->s_same*256; ++i)                 ka->same[i] = 0;        }        cache_update(Cache_t* ka, int addr)        {	    if(ka->s_near > 0)            {   ka->near[ka->next_slot] = addr;                ka->next_slot = (ka->next_slot + 1) % ka->s_near;            }            if(ka->s_same > 0)                ka->same[addr % (ka->s_same*256)] = addr;        }5.3 Encoding of COPY instruction addresses    The address of a COPY instruction is encoded using different modes    depending on the type of cached address used, if any.    Let "addr" be the address of a COPY instruction to be decoded and "here"    be the current location in the target data (i.e., the start of the data    about to be encoded or decoded).  Let near[j] be the jth element in    the near cache, and same[k] be the kth element in the same cache.    Below are the possible address modes:	VCD_SELF: This mode has value 0. The address was encoded by itself            as an integer.	VCD_HERE: This mode has value 1. The address was encoded as	    the integer value "here - addr".	Near modes: The "near modes" are in the range [2,s_near+1]. Let m	    be the mode of the address encoding. The address was encoded	    as the integer value "addr - near[m-2]".	Same modes: The "same modes" are in the range
💿 文件大小 215 K
👤 上传用户 shangranc
📂 所属分类 Linux/Unix编程
🏷️ 相关标签

#xdelta #Linux #3.0 #比较
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -