⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 draft-korn-vcdiff-01.txt

📁 linux subdivision ying gai ke yi le ba
💻 TXT
📖 第 1 页 / 共 4 页
字号:

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    that there is no secondary encoding for the resulting instruction and
    raw datasets.

     1. sfputc(Delta,0xd6);
     2. sfputc(Delta,0xc3);
     3. sfputc(Delta,0xc4);
     4. sfputc(Delta,0);

     5. for(p_tar = 0; ; p_tar += n_tar)
     6. {   if((n_tar = win_target(&tar)) <= 0)
     7.         break;
     9.     win_deflate(tar, n_tar);
    10. }

COMMENTS.

   1-4: These lines output the 4-byte header for a delta file.
     6: This line calls win_target() to get a target window. If the target file
	has been completely processed, the return value is non-positive and 
	the loop terminates.
     8: This line computes a source data segment to be compared with the given
	target window.
     9: This line calls win_deflate() to compute the delta instructions and
	output the data to the delta file.


    Below is the function win_deflate():

     1. win_deflate(uchar_t* tar, int n_tar)
     2. {
     3.     int     type, indi, n_src;
     4.     int_t   p_src;
     5.     uchar_t *src;
     6.     Sfio_t  *instf, *dataf;

     7.     type = win_match(tar, n_tar, &src, &n_src, &p_src);
     8.     instf = sfstropen(NULL,0);
     9.     dataf = sfstropen(NULL,0);
    10.     win_encode(tar, n_tar, src, n_src, instf, dataf);

    11.     indi = type == 0 ? (1<<2) : type == 1 ? (1<<3) : 0;
    12.     sfputc(Delta, indi);

    13.     sfputu(Delta, n_tar);
    14.     sfputu(Delta, sfsize(instf));
    15.     sfputu(Delta, sfsize(dataf));

    16.     sfmove(instf,Delta,-1,-1);
    17.     sfmove(dataf,Delta,-1,-1);

    18.     sfclose(instf); sfclose(dataf);
    18. }


COMMENTS.

     7: This line calls win_match() to obtain a source data segment.
   8-9: These lines create two streams to output instructions and raw data.
    10: This line calls win_encode() to compute the delta instructions.
 11-12: These lines output the indicator byte.

                                   -19-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


 13-15: These lines output the sizes of the target window and instruction
	and raw datasets.
 16-17: These lines use the Sfio function sfmove() to output the instruction
	and raw datasets to the delta file.


    Below is the function to compute encoding of a target window:

     1. int win_encode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
                       Sfio_t* instf, Sfio_t* dataf)
     2. {   int      add, here, addr, s, size;
     3.     Cache_t  ka;

     4.     cache_init(&ka);

     5.     for(here = 0, size = 0, add = -1; here < n_tar; )
     6.     {   if( (s = run_check(tar, n_tar, here)) > 0)
     7.              addr = -1;
     8.         else s = str_match(tar,n_tar,here,src,n_src,&addr);
     9.         if(s > 0)
    10.         {    if(add >= 0)
    11.                   size += add_encode(here-add,tar+here-add,instf,dataf);
    12.              if(addr < 0)
    13.                   size += run_encode(s,tar[here],instf,dataf);
    14.              else size += copy_encode(&ka,s,addr,here,instf);

    15.              add = -1;
    16.              here += s;
    17.         }
    18.         else
    19.         {    if(add < 0)
    20.                   add = here;
    21.              here += 1;
    22.         }
    23.     }
    24.     if(add >= 0)
    25.          size += add_encode(here-add,tar+here-add,instf,dataf);
    26.     else size += copy_encode(0,0,0,0);

    27.     return size;
    28. }

COMMENTS.

     4: This line initializes the address caches.
   6-8: These lines check to see if there is a RUN or a COPY instruction.
	The function run_check() is straightforward to implement so its
	description will be omitted.
  9-17: These lines output the appropriate delta instruction.
 19-22: These lines are executed if there is no RUN or COPY instruction.
	The "add" variable is set to collect the data for an ADD instruction.
 24-25: These lines output the last unmatched part of the target window
	in an ADD instruction.
    26: This line outputs the last saved COPY instruction, if any. See the
	function copy_encode() below.
    27: This line returns the number of bytes used for encoding the data.


    To finish up the encoding algorithm, we need to describe the functions
    add_encode(), run_encode() and copy_encode(). We shall assume that

                                   -20-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    the default code instruction table 0 is used. The below code declares
    the tables and define a few convenience macro functions.

     1. Vcdcode_t  Tab[256];
     2. Vcdcode_t* Add[16];
     3. Vcdcode_t* Copy[15][7];
     4. Vcdcode_t* Copyadd[8][7][5];
     5. Vcdcode_t* Copycopy[8][8];

     6. #define    INDEXADD(a)     ((a >= 1 && a <= 14) ? a : 0)
     7. #define    INDEXCOPY(c)    ((c >= 4 && c <= 14) ? c : 0)
     8. #define    INDEXCOPYADD(c) ((c >= 4 && c <=  7) ? c : 0)
     9. #define    ISTINYADD(a)    ((a >= 1 && a <= 4) ? 1 : 0)
    10. #define    ISTINYCOPY(c)   ((c >= 4 && c <= 7) ? 1 : 0)

COMMENTS.

   6-8: These lines define functions to compute the index into the tables
        Add, Copy, and Copyadd for any given ADD or COPY size.
  9-10: These lines define functions to test if a given size is small.
	These are used to determine if certain pair of instructions could
	be coded using a single byte code.


    Below is the function to encode a COPY instruction.

     1. int Size, Addr, Here, Mode;
 
     2. int copy_encode(Cache_t* ka,int size,int addr,int here,Sfio_t* instf)
     3. {   int code, mode = 0, best = 0, ndel = 0;

     4.     if(size > 0)
     5.         mode = addr_encode(ka, addr, here, &best);

     6.     if(Size > 0)
     7.     {   if(Mode == mode && mode == VCD_SELF &&
     8.            ISTINYCOPY(Size) && ISTINYCOPY(size) )
     9.         {    code = Copycopy[Size][size];
    10.              ndel += sfputc(instf, code);
    11.              ndel += sfputm(instf, Addr, Here);
    12.              ndel += sfputm(instf, addr, here);
    13.              Size = -1;
    14.              return ndel;
    15.         }
    16.         code = Copy[INDEXCOPY(Size)][Mode];
    17.         ndel += sfputc(instf, code);
    18.         if(Tab[code].inst1.size == 0)
    19.              ndel += sfputu(instf, Size);
    20.         if(Mode == VDC_SELF)
    21.              ndel += sfputm(instf, Addr, Here);
    22.         else ndel += sfputc(instf, Addr);
    23.     }

    24.     Size = size;
    25.     Addr = best;
    26.     Mode = mode;
    27.     Here = here;

    28.     return ndel;
    29. }

                                   -21-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000



COMMENTS.

     1: This line defines variables to save a COPY instruction so that it may
        be coded jointly with another instruction later.
   4-5: These lines compute the mode and value to encode the given COPY
        address if there is one (i.e., "size" is positive).
  6-23: These lines process a previously saved COPY instruction.
  7-15: These lines output a code that encode both COPY instructions.
	Note that this is done only when the data sizes are small and both
	instructions are using the VCD_SELF addressing mode (Table 2).
 16-22: These lines output the previously saved COPY instruction singly.
 24-27: These lines save the current COPY instruction if it has not
        been merged to a previous COPY instruction as discussed.


    Below is the function to output an ADD instruction:

     1. int add_encode(int size, uchar_t* data, Sfio_t* instf, Sfio_t* dataf)
     2. {   int code, ndel = 0;

     3.     if(Size > 0)
     4.     {   if(ISTINYADD(size))
     5.         {    code = Copyadd[INDEXCOPYADD(Size)][Mode][size];
     6.              ndel += sfputc(instf, code);
     7.              if(Tab[code].inst1.size == 0)
     8.                   ndel += sfputu(instf, Size);
     9.              if(Mode == VCD_SELF)
    10.                   ndel += sfputm(instf, Addr, Here);
    11.              else ndel += sfputc(instf, Addr);
    12.              for(; size > 0; --size, ++data)
    13.                   ndel += sfputc(dataf, *data);
    14.              return ndel;
    15.         }
    16.         else ndel += copy_encode(0,0,0,0);
    17.     }

    18.     code = Add[INDEXADD(size)];
    19.     ndel += sfputc(instf, code);
    20.     if(Tab[code].inst1.size == 0)
    21.         ndel += sfputu(instf, size);
    22.     ndel += size;
    23.     for(; size > 0; --size, ++data)
    24.         sfputc(dataf, *data);

    25.     return ndel;
    26. }


COMMENTS.

  3-17: These lines consider the case when there is a previously saved COPY
	instruction. Then, if the ADD size is small, a single code is output
	for both COPY and ADD instructions.  Otherwise, the COPY instruction
	is output by itself.
 18-24: These lines output an ADD instruction by itself.


    Below is the function to output a RUN instruction:


                                   -22-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


     1. int run_encode(int size, int byte, Sfio_t* instf, Sfio_t* dataf)
     2. {   int ndel = 0;
     3.     if(Size > 0)
     4.         ndel += copy_encode(0,0,0,0);
     5.     ndel += sfputc(instf,0);
     6.     ndel += sfputu(instf,size);
     7.     ndel += sfputc(dataf,byte);
     8.     return ndel;
     9. }

COMMENTS.

   3-4: These lines output a previously saved COPY instruction, if any.
   5-7: These lines output the RUN instruction.


7.  Summary

    We have described a general and portable encoding format for compression
    and differencing. The format is compact. For compression only, using
    the same LZ-77 string parsing strategy and without any secondary encoders,
    the typical compression rate is better than Unix compress and close to gzip.
    For differencing, the same data format is better than all known methods.
    Ignoring application-specific secondary encoder issues, the decoding
    algorithms run in linear time and require working space proportional
    to window sizes.


ACKNOWLEDGEMENTS

    Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
    who provided much encouragement to publicize Vcdiff.


REFERENCES

    [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
        Practical Reusable Unix Software, Editor B. Krishnamurthy,
        John Wiley & Sons, Inc., 1995.

    [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
        Compression, IEEE Transactions on Information Theory,
        23(3):337-343, May 1977.

    [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
        ACM Transactions on Computer Systems, 2(4):309-321, November 1984.

    [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
        Algorithm, Journal of the ACM, 23:262-272, 1976.

    [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
        Algorithms, IEEE Software Configuration and Maintenance Workshop,
        1996.

    [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
        Accepted for publication in Software Practice & Experience, 1999.



AUTHOR'S ADDRESS

                                   -23-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000



    Kiem-Phong Vo (main contact)
    AT&T Labs, Room D223
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8630
    Email: kpv@research.att.com


    David G. Korn
    AT&T Labs, Room D237
    180 Park Avenue
    Florham Park, NJ 07932

    Phone: 973-360-8602
    Email: dgk@research.att.com


EXPIRATION DATE

    September 09, 2000







































                                   -24-

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -