📄 draft-korn-vcdiff-01.txt

📁 linux subdivision ying gai ke yi le ba
💻 TXT
📖 第 1 页 / 共 4 页
字号:
                                   -12-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    The algorithm to decode a target window is as follows:

     1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
		   int i_code, Sfio_t* instf, Sfio_t* dataf)
     2. {    int_t     here, size, addr, i;
     3.      Cache_t   ka;
     4.      Vcdinst_t *inst;
     5.      Vcdcode_t *code, *table = Code[i_code];

     6.      cache_init(&ka);
     7.      for(here = 0; here < n_tar; )
     8.      {    code = &table[sfgetc(instf)];
     9.           for(i = 1; i <= 2; ++i)
    10.           {    inst = i == 1 ? &code->inst1 : &code->inst2;
    11.                if(inst->type == VCD_NOOP)
    12.                     continue;
    13.                if((size = inst->size) == 0)
    14.                     size = sfgetu(instf);
    15.                if(inst->type == VCD_ADD)
    16.                {    for(; size > 0; --size)
    17.                          tar[here++] = sfgetc(dataf);
    18.                }
    19.                else if(inst->type == VCD_RUN)
    20.                {    int  c = sfgetc(dataf);
    21.                     for(; size > 0; --size)
    22.                          tar[here++] = c;
    23.                }
    24.                else if(inst->type == VCD_COPY)
    25.                {    uchar_t* from;
    26.                     addr = addr_decode(&ka,here,inst->mode,instf);
    27.                     from = addr < nsrc ? src+addr : tar+addr-nsrc;
    28.                     for(; size > 0; --size)
    29.                          tar[here++] = *from++;
    30.                }
    31.           }
    32.      }
    33. }

COMMENTS.

     5: "table" is initialized to be the instruction code table to be used.
	We assume that "Code" is a global variable pointing to the list of 256
	possible code tables.
     6: This line initializes the address caches.
     8: This line reads a control byte from the instruction dataset and gets
	the corresponding code table to decode instructions.
  9-32: These lines process the given pair of delta instructions. Note that
	the data for VCD_ADD and VCD_RUN are read from the raw dataset.


5.  Instruction Code Tables

    The delta instructions are encoded based on some instruction code table.
    The Vcdiff format allows applications to tailor such code tables to the
    particular data characteristics to enhance output compactness. Up to 256
    instruction tables with indices in the range [0,255] are allowed.
    However, the first 8 indices [0,7] are reserved by Vcdiff.


5.1 The Encoding of a Code Table in a Delta File

                                   -13-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000



    It is desirable to compactly encode a code table in a delta file.
    To accomplish this, we first represent a table to be output as a string.
    Since each table entry encodes two instructions and each is 3 bytes,
    the total string size is 6*256 or 1536 bytes.  Such a string can then
    be compared against some other code table string (e.g., that of table 0)
    using the window encoding algorithm win_deflate() to reduce the output
    data. Thus, the format for a code table in a delta file looks like this:

		Code table to compare with
		Encoded data

    	Code table to compare with:
	    This is a single byte indicating the code table used to compare
	    the current table with.

    	Encoded data:
	    The data uses the same format as that of a window (Section 4.1.1)
	    where the target data is the code string of the table being encoded
	    and the source data segment is the code string of the code table
	    with the above index.


    Two functions tab2str() and str2tab() are used for converting between
    a code table and its code string.  Below is the description of tab2str().
    str2tab() is just as straightforward so its description will be omitted.
    Note that bytes of the same type are grouped together to induce more
    matching when code strings are compared.

     1. tab2str(Vcdcode_t* tab, uchar_t data[6*256])
     2. {    int  i, n;
     3.      n = 0;
     4.      for(i = 0; i < 256; ++i)
     5.           data[n++] = tab[i].inst1.type;
     6.      for(i = 0; i < 256; ++i)
     7.           data[n++] = tab[i].inst2.type;
     8.      for(i = 0; i < 256; ++i)
     9.           data[n++] = tab[i].inst1.size;
    10.      for(i = 0; i < 256; ++i)
    11.           data[n++] = tab[i].inst2.size;
    12.      for(i = 0; i < 256; ++i)
    13.           data[n++] = tab[i].inst1.mode;
    14.      for(i = 0; i < 256; ++i)
    15.           data[n++] = tab[i].inst2.mode;
    16. }

    
    Below is the function code_define() to retrieve an encoded instruction
    code table from the delta file.

     1. code_define()
     2. {   uchar_t src[6*256], tar[6*256];
     3.     int     index;

     4.     index = sfgetc(Delta);
     5.     tab2str(Code[sfgetc(Delta)], src);
     8.     win_inflate(0, tar, 1536, src, 1536);
     9.     Code[index] = str2tab(data);
    10. }


                                   -14-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


COMMENTS.

   4:	This line reads the index to install the new code table.
   5:	This line constructs the source string to be compared with.
   8-9: These lines construct the new table and install it.


5.2 Reserved Instruction Code Tables

    A COPY instruction with small data size may use more encoding space than
    the data itself. Thus, it is good to restrict COPY instructions to some
    minimum sizes. In turn, a code table can take advantage of such
    a restriction to enhance compactness. Vcdiff provides two reserved
    instruction code tables with indices 0 and 1 for minimium COPY sizes of
    4 and 3 respectively. As noted, the starting code table has index 0
    so it is optimized for minimum COPY size of 4. Note also that even when
    a table optimized for a particular minimum COPY size, this does not mean
    that a COPY instruction with a smaller size cannot be encoded. It simply
    means that the size of such an instruction must be coded separately.
    

           TYPE    SIZE      MODE    TYPE    SIZE     MODE    INDEX
        -------------------------------------------------------------
        a.  RUN     0                NOOP                       0
        b.  ADD     0                NOOP                       1
        c.  ADD   [1,14]             NOOP                     [2,15]
        d. COPY     0       [0,6]    NOOP                    [16,22]
        e. COPY   [M,M+10]  [0,6]    NOOP                    [23,99]
        f. COPY     0       [0,6]     ADD   [1,4]           [100,127]
        g. COPY   [M,M+3]   [0,6]     ADD   [1,4]           [128,239]
        h. COPY   [M,M+3]    SELF    COPY   [M,M+3]   SELF  [240,255]
        -------------------------------------------------------------

                            	Table 2


    For a given minimum COPY size M, Table 2 shows how the 256 code values
    are partitioned into different categories of indices:

    	a. Index 0 defines a RUN instruction whose size is always coded
	   separately as an unsigned integer. This is signified by having
	   the "size" field being 0.
    	b. Index 1 defines an ADD instruction with size coded separately.
    	c. Indices 2-15 define 14 ADD instructions with sizes in
	   the range [1,14].
    	d. Indices 16-22 define 7 COPY instructions for the 7 different
	   address encoding modes discusses in Section 4.3. The sizes of
	   these instructions are coded separately.
    	e. Indices 23-99 define 77 COPY instructions using all 7 addressing
	   modes and having sizes in the range [M,M+10].
    	f. Indices 100-127 defines 28 pairs of COPY and ADD instructions.
	   The COPY sizes are coded separately while the ADD sizes are
	   in the range [1,4].
    	g. Indices 128-239 define 112 pairs of COPY and ADD instructions
	   whose sizes are in the ranges [M,M+3] and [1,4] respectively.
    	h. Finally, indices 240-255 defines 16 pairs of COPY and COPY
	   instructions with sizes in the range [M,M+3] and both using
	   the same VCD_SELF addressing mode.

    Next we present the algorithms to construct an instruction code table.

                                   -15-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    Where appropriate, the algorithms also construct the inverted tables
    usable for data encoding. Any part of a table not explicitly set will
    be assumed to be initialized to be either zero or the null pointer as
    the case warranted.



     1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7],
                    int copyadd[8][7][5], int copycopy[8][8], int M)
     2. {
     3.     tab[0].inst1.type = VCD_RUN;
     4.     tab[0].inst1.size = 0;
     5.     tab[0].inst2.type = VCD_NOOP;

     6.     mk_add(tab, add);
     7.     mk_copy(tab, copy, M);
     8.     mk_copyadd(tab, copyadd, M);
     9.     mk_copycopy(tab, copycopy, M);
    10. }

COMMENTS.

     1: This line declares the arguments for mk_codetab().  The inverted
	tables add[], copy[][], copyadd[][][], and copycopy[][] are designed
	to be indexed by sizes and address modes. The argument M is the minimum
	COPY size used for the table.
   3-5: These lines construct the RUN instruction.
   6-9: These lines call various subroutines to construct different parts of
	the instruction table.


    Below is the algorithm to construct the single ADD instructions:

     1. mk_add(Vcdcode_t tab[256], int add[16])
     2. {   int s, n;

     3.     tab[1].inst1.type = VCD_ADD;
     4.     tab[1].inst1.size = 0;
     5.     tab[1].inst2.type = VCD_NOOP;
     6.     add[0] = 1;

     7.     for(n = 2, s = 1; s <= 14; ++s, ++n)
     8.     {   tab[n].inst1.type = VCD_ADD;
     9.         tab[n].inst1.size = s;
    10.         tab[n].inst2.type = VCD_NOOP;
    11.         add[s] = n;
    12.     }
    13. }

COMMENTS.

   3-6: These lines construct the ADD instruction whose size is to be coded
	separately.  Location 0 in the inverted table add[] points to this code.
  7-12: These lines construct the ADD instructions with size in
	the range [1-14] as discussed earlier.


    Below is the function to construct the single COPY instructions:

     1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M)

                                   -16-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     2. {   int s, m, n;

     3.     for(n = 16, m = 0; m < 7; ++m, ++n)
     4.     {   tab[n].inst1.type = VCD_COPY;
     5.         tab[n].inst1.size = 0;
     6.         tab[n].inst1.mode = m;
     7.         tab[n].inst2.type = VCD_NOOP;
     8.         copy[0][m] = n;
     9.     }

    10.     for(s = M; s <= M+10; ++s)
    11.     {   for(m = 0; m < 7; ++m, ++n)
    12.         {   tab[n].inst1.type = VCD_COPY;
    13.             tab[n].inst1.size = s;
    14.             tab[n].inst1.mode = m;
    15.             tab[n].inst2.type = VCD_NOOP;
    16.             copy[s][m] = n;
    17.         }
    18.     }
    20. }

COMMENTS.

   3-9: These lines construct the 7 COPY instructions with whose data sizes
	are coded separately.
 10-18: These lines construct the 77 COPY instructions whose data sizes
	are in the range [M,M+10].


    The below function constructs the COPY/ADD instruction pairs:

     1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M)
     2. {   int s, m, a, n;
     3.     for(n = 100, m = 0; m < 7; ++m)
     4.     {   for(a = 1; a <= 4; ++a, ++n)
     5.         {   tab[n].inst1.type = VCD_COPY;
     6.             tab[n].inst1.size = 0;
     7.             tab[n].inst1.mode = m;
     8.             tab[n].inst2.type = VCD_ADD;
     9.             tab[n].inst2.size = a;
    10.             copyadd[0][m][a] = n;
    11.         }
    12.     }
    13.     for(s = M; s <= M+3; ++s)
    14.     {   for(m = 0; m < 7; ++m)
    15.         {   for(a = 1; a <= 4; ++a, ++n)
    16.             {   tab[n].inst1.type = VCD_COPY;
    17.                 tab[n].inst1.size = s;
    18.                 tab[n].inst1.mode = m;
    19.                 tab[n].inst2.type = VCD_ADD;
    20.                 tab[n].inst2.size = a;
    21.                 copyadd[s][m][a] = n;
    22.             }
    23.         }
    24.     }
    25. }


    Finally, the algorithm to construct the 16 COPY/COPY pairs:


                                   -17-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M)
     2. {   int s, c, n;
     3.     for(n = 240, s = M; s <= M+3; ++s)
     4.     {   for(c = M; c <= M+3; ++c, ++n)
     5.         {   tab[n].inst1.type = VCD_COPY;
     6.             tab[n].inst1.size = s;
     7.             tab[n].inst1.mode = VCD_SELF;
     8.             tab[n].inst2.type = VCD_COPY;
     9.             tab[n].inst2.size = c;
    10.             tab[n].inst2.mode = VCD_SELF;
    11.             copycopy[s][c] = n;
    12.         }
    13.     }
    14. }


6.  Compression and Differencing Algorithms

    Sections 4 and 5 describe how to reconstruct a target file from data in
    a delta file. In this section, we discuss briefly general architectures
    for compressors and differencers. A few abstract functions are assumed:

    int win_target(uchar_t** tar);

        This function may be called many times to get sequential windows of
	data from the target file to be compressed. It returns the length of
	the window while the window itself is returned in "*tar".

    int win_match(uchar_t* tar, int n_tar, int p_tar,
		  uchar_t** src, int* n_src, int_t* p_src);

	This function computes a source data segment, if any, to be compared
	with a target window. It returns 0, 1 or 2 for no source segment,
	source segment from the source file, or source segment from
	the target file. The arguments "src", "n_src" and "p_src" are used
	to return the source data segment, its size, and its position in
	either the source file or target file.

    int str_match(uchar_t* tar, int n_tar, int p_tar,
		  uchar_t* src, int n_src, int* match);

	This function computes a string in either the source data segment or
	the target window that matches with a prefix of the target data being
	processed. Such a matched string can be encoded as a COPY instruction.
	The argument "p_tar" indicates the current position in the target
	window to be processed. The function returns the length of the match
	and, if that is positive, "*match" is set to the match position.

    int run_check(uchar_t* tar, int n_tar, int p_tar);

	This function checks to see if there is a run of the same bytes 
	starting at the current location in the target string.



    The performance of an encoder depends on what algorithms are used to
    implement the above functions. However, as shown earlier, decoding
    performance is completely independent of such choices.

    Below is the algorithm for encoding a target file.  We shall assume

                                   -18-
💿 文件大小 10313 K
👤 上传用户 zhangyuntong
📂 所属分类网络
🏷️ 相关标签

#subdivision #linux #ying #gai
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -