📄 draft-korn-vcdiff-01.txt

📁 linux subdivision ying gai ke yi le ba
💻 TXT
📖 第 1 页 / 共 4 页
字号:

                                   -6-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000



COMMENTS.

   1-4: These lines read the 4-byte header.
   7-8: These lines read the indicator byte and terminate if reached end-of-file.
  9-10: These lines define a new code table.
 12-16: These lines decode the target window and output it to the target file.


    Next is the function to recompute a target window:

     1. win_inflate(int indi, uchar_t* tar, int n_tar, uchar_t* src, int n_src)
     2. {   int      n_inst, n_data;
     3.     uchar_t  *inst, *data;
     4.     int      s_inst, s_data;
     5.     int      p_src, i_code, f_inst, f_data;
     6.     Sfio_t   *sf, *instf, *dataf;

     7.     n_inst = sfgetu(Delta);
     8.     n_data = sfgetu(Delta);

     9.     if(indi & (1<<6))
    10.          i_code = sfgetc(Delta);

    11.     if(indi & (1<<5))
    12.     {    f_inst = sfgetc(Delta);
    13.          s_inst = sfgetu(Delta);
    14.     }

    15.     if(indi & (1<<4))
    16.     {    f_data = sfgetc(Delta);
    17.          s_data = sfgetu(Delta);
    18.     }

    19.     if(indi & ((1<<3)|(1<<2)) )
    20.     {    n_src = sfgetu(Delta);
    21.          p_src = sfgetu(Delta);
    22.     }

    23.     inst = malloc(n_inst);
    24.     if(indi & (1<<5))
    25.     {    sfread(Delta, inst, s_inst);
    26.          decompress(inst, n_inst, s_inst, f_inst);
    27.     }
    28.     else sfread(Delta, inst, n_inst);
    29.     instf = sfstropen(inst, n_inst);

    30.     data = malloc(n_data);
    31.     if(indi & (1<<4))
    32.     {    sfread(Delta, data, s_data);
    33.          decompress(data, n_data, s_data, f_data);
    34.     }
    35.     else sfread(Delta, data, n_data);
    36.     dataf = sfstropen(data, n_data);

    37.     if(indi & ((1<<3)|(1<<2)) )
    38.     {    src = malloc(n_src);
    39.          sf = (indi & (1<<3)) ? Target : Source;
    40.          sfseek(sf, p_src, SEEK_SET);
    41.          sfread(sf, src, n_src);

                                   -7-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    42.     }

    43.     win_decode(tar, n_tar, src, n_src, i_code, instf, dataf);
    44.     sfclose(instf); sfclose(dataf); free(inst); free(data);
    45. }

COMMENTS.

   7-8: These lines read the sizes of the instruction and raw datasets.
  9-10: These lines read the index of the instruction code table if needed.
 11-14: These lines read the secondary instruction compressor indicator and
	the size of the compressed instruction dataset.
 15-18: These lines read the secondary data compressor indicator and the
	size of the compressed raw dataset.
 19-22: These lines read the size and position of the source data segment.
	Note that win_inflate() may also be called to decode an instruction
	code table (Section 5). In that case, "src" will be given and
	the WT and WS bits should be zero.
 23-36: These lines read the instruction and raw datasets.  If these were
	compressed, decompress() is called to reconstruct them. The function
	decompress() uses the method index f_inst or f_data to decide on
	the proper operations. Note that these datasets are then made into
	streams via the sfstropen() calls to ease data accessing later.
 37-42: These lines construct the source data segment if any.
    43: This line calls win_decode() (Section 4.4) to decode the delta
	instructions.
    44: This line frees resources allocated for processing.


4.2 Integer Encoding Formats

    Vcdiff compactly encodes integer values using the same formats introduced
    in the Sfio library [6]. The code presented below is not quite correct with
    respect to type handling as in Sfio but they show the ideas.


4.2.1 Encoding Integers Using a Variable-Sized Format

    The variable-sized encoding of integers follows the same algorithm
    used in the Sfio library. This treats an integer as a number in
    base 128 so that each digit can be coded in one byte. The eighth bit
    of such a byte is the continuation bit. It is 1 if there are more
    digits to come or 0 if it is the last digit.

                        +---------------------+
                        | 11100000 | 00111001 |
                        +---------------------+
			  byte 0       byte 1 

				Table 1

    Table 1 shows the variable sized encoding of 12345.  The bytes in
    the encoding are presented in binary to make clear the use of
    the continuation bit. Omitting the continuation bit, the encoding
    of 12345 uses two bytes with values 96 and 57.

    Below are the algorithms:

     1. int sfputu(Sfio_t* f, int_t v)
     2. {   uchar_t c[2*sizeof(int_t)], *s;

                                   -8-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000



     3.     s = &c[sizeof(c)-1];
     4.     *s = v & 127;
     5.     while((v >>= 7) != 0)
     6.          *--s = (v & 127) | (1<<7);

     7.     sfwrite(f, s, (&c[sizeof(c)]) - s);
     8.     return  &c[sizeof(c)]-s;
     9. }

    10. int_t sfgetu(Sfio_t* f)
    11. {   int_t b, v;

    12.     for(v = 0;; )
    13.     {    b = sfgetc(f);
    14.          v = (v << 7) | (b & 127);
    15.          if(!(b & 128) )
    16.               return v;
    17.     }
    18. }

COMMENTS.

     1: This line declares the formal arguments to sfputu(), a stream f
	to store the encoding and the value v to be encoded.
     2: This line declares an array large enough to store the encoding.
   4-6: These lines extract digits in base 128 and store them in
	the array. Note that the right-most digit is extracted first and
	does not carry a continuation bit.
   7-8: These lines write the encoding out to the given stream f and
	return the length of that encoding.
 10-18: These lines define the decoding algorithm.


4.2.2 Encoding Integers with a Given Range

    The address of a COPY instruction is an unsigned integer with a known
    range, i.e., it cannot be larger than the current position.  Such a
    value can be encoded more compactly than as done in the Sfio scheme.
    For example, if a value is in the range 0-255, it can be encoded with
    a single byte. On the other hand, if the same value happens to be larger
    than 127, it would require two bytes to encode in the variable-sized
    scheme. Below are the algorithms to encode integers with known ranges:


     1. int sfputm(Sfio_t* f, int_t v, int_t max)
     2. {   uchar_t c[sizeof(int_t)], *s;
     3.     s = &c[sizeof(c)-1];
     4.     *s = v & 255;
     5.     while((max >>= 8) != 0)
     6.     {   v >>= 8;
     7.         *--s = v & 255;
     8.     }
     9.     sfwrite(f, s, (&c[sizeof(c)]) - s);
    10.     return (&c[sizeof(c)]) - s;
    11. }

    12. int_t sfgetm(Sfio_t* f, int_t max)
    13. {   int_t v;
    14.     for(v = 0;;)

                                   -9-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    15.     {   v = (v << 8) | sfgetc(f);
    16.         if((max >>= 8) == 0)
    17.             return v;
    18.     }
    19. }


4.3 Delta Instruction Coding Format

    Delta instructions are encoded as control bytes with associated data.
    Each control byte is an index into an instruction code table of 256
    entries. Each code entry is of the type Vcdcode_t as defined below:

     1. typedef struct _vcdinst_s
     2. {   unsigned char type;  /* instruction type         */
     3.     unsigned char size;  /* >0 if size is coded here */
     4.     unsigned char mode;  /* address mode for COPYs   */
     5. } Vcdinst_t;

     6. typedef struct _vcdcode_s
     7. {   Vcdinst_t     inst1; /* first instruction        */
     8.     Vcdinst_t     inst2; /* second instruction       */
     9. } Vcdcode_t;

    Thus, each control byte identifies up to two instructions.  Below are
    the different instruction types:

     1. #define VCD_NOOP  0      /* not an instruction       */
     2. #define VCD_ADD   1      /* an ADD instruction       */
     3. #define VCD_RUN   2      /* a  RUN instruction       */
     4. #define VCD_COPY  3      /* a COPY instruction       */


    Each instruction has a size n of the data involved.  If the field
    Vcdinst_t.size is non-zero, it is the value for n.  Otherwise,
    n is encoded next in the instruction dataset as a variable-sized
    integer.  If the instruction is a COPY, the copy address will follow
    next in the instruction dataset. Its encoding depends on some
    addressing scheme to be discussed next.

    A COPY address can be encoded in different ways. The field
    Vcdinst_t.mode has values in the range [0-7] and defines the below
    address encoding modes:

     1. #define VCD_SELF  0      /* coded as itself          */
     2. #define VCD_HERE  1      /* from current position    */
     3. #define VCD_SAME  2      /* index into "same" cache  */
     4. #define VCD_NEAR  3      /* index into "near" cache  */

    VCD_SAME and VCD_NEAR indicate two address caching methods designed
    to take advantage of the heuristic that successive copying addresses
    tend to be the same or fairly close to one another.  Note that as
    discussed below, there are 4 VCD_NEAR addresses corresponding to
    Vcdinst_t.mode values in the range [VCD_NEAR,VCD_NEAR+3].

    Let A be the COPY address and H the current location in the target
    data. Below are the encodings:

    VCD_SELF: A is the next integer in the instruction dataset that is
	encoded in the range [0-H]. 

                                   -10-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 
    VCD_HERE: Let B be the next byte in the instruction dataset.
	Then A is H-B. That is, the distance from A to H, H-A, must have
	been in the range [0-255].

    VCD_SAME: The "same" cache keeps 256 addresses. Let B be the next byte
	in the instruction dataset. Then A is same[B].

    VCD_NEAR: The "near" cache keeps 4 addresses.  Let B be the next byte
	in the instruction dataset and t be Vcdinst_t.mode-VCD_NEAR.
	Then, A is near[t]+B-127.
	

    Below are the algorithms to maintain address caches.

     1. typedef struct _cache_s
     2. {   int   n;
     3.     int   near[4];
     4.     int   same[256];
     5. } Cache_t;

     6. cache_init(Cache_t* ka)
     7. {    int i;
     8.      for(i = 0; i < 4; ++i)
     9.          ka->near[i] = 0;
    10.      ka->n = 0;
    11.      for(i = 0; i < 256; ++i)
    12.          ka->same[i] = 0;
    13. }



    14. cache_update(Cache_t* ka, int_t addr)
    15. {    ka->near[ka->n] = addr;
    16.      if((ka->n += 1) >= 4)
    17.          ka->n = 0;
    18.      ka->same[addr&255] = addr;
    19. }

COMMENTS.

   1-5: These lines define the caches. As discussed, the "near" cache
	has 4 addresses and the "same" cache has 256 addresses.
  6-13: These lines initialize addresses in the caches to zero.
 14-19: These lines update the caches with a given address.  Note that
	the "near" cache is updated in a round-robin manner and the lower
	eight bits of an address is its index in the "same" cache.


    Below is the function to encode a COPY address:

     1. int addr_encode(Cache_t* ka, int addr, int here, int* best)
     2. {   int  i, d, mode = -1;

     3.     if((d = (here-addr)) < 256)
     4.     {    *best = d;
     5.          mode = VCD_HERE;
     6.     }
     7.     else if(ka->same[d = addr&255] == addr)
     8.     {    *best = d;

                                   -11-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     9.          mode = VCD_SAME;
    10.     }
    11.     else
    12.     {    for(i = 0; i < 4; ++i)
    13.          {    if((d = addr - ka->k_near[i]) >= -127 && d <= 128) 
    14.               {    *best = d + 127;
    15.                    mode = VCD_NEAR+i;
    16.               }
    17.          }
    18.     }
    19.     if(mode < 0)
    20.     {    *best = addr;
    21.          mode = VCD_SELF;
    22.     }

    23.     cache_update(ka,addr);
    24.     return mode;
    25. }

COMMENTS.

     1: This lines declare the formal arguments. "addr" is the address to be
	encoded. "here" is the current location in the target data.  "best" is
	used to return the value to be used to encode "addr".
   3-6: If "addr" is within the range [0-255] away from the current location
	"here", then the addressing mode is VCD_HERE and the address is encoded
	as a single byte showing this distance.
  7-10: Using the lower eight bits of "addr" to index the "same" cache, if the
	address in the cache exactly matches "addr", then VCD_SAME is used
	and "addr" is encoded as this index.

 12-18: Check each address in the "near" cache to see if "addr" is within
	[-127,128] away from such an address.  If there is one, the addressing
	mode is VCD_NEAR plus the index of the address and "addr" is encoded
	as the distance plus 127 (so the encoded value is in the range [0-255]).
 19-22: If none of the above addressing modes applies, then VCD_SELF is used
	and "addr" is encoded as a value in the range [0-here].
    23: This line updates the address caches.


    Below is the function to decode a COPY address:

     1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* instf)
     2. {   int  addr;

     3.     if(type == VCD_SELF)
     4.          addr = sfgetm(instf, here);
     5.     else if(type == VCD_HERE)
     6.          addr = here - sfgetc(instf);
     7.     else if(type == VCD_SAME)
     8.          addr = ka->same[sfgetc(instf)];
     9.     else addr = ka->near[type] + sfgetc(instf) - 127;

    10.     cache_update(ka, addr);
    11.     return addr;
    12. }


4.4 Decoding A Target Window
💿 文件大小 10313 K
👤 上传用户 zhangyuntong
📂 所属分类网络
🏷️ 相关标签

#subdivision #linux #ying #gai
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -