📄 draft-korn-vcdiff-01.txt
字号:
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
that there is no secondary encoding for the resulting instruction and
raw datasets.
1. sfputc(Delta,0xd6);
2. sfputc(Delta,0xc3);
3. sfputc(Delta,0xc4);
4. sfputc(Delta,0);
5. for(p_tar = 0; ; p_tar += n_tar)
6. { if((n_tar = win_target(&tar)) <= 0)
7. break;
9. win_deflate(tar, n_tar);
10. }
COMMENTS.
1-4: These lines output the 4-byte header for a delta file.
6: This line calls win_target() to get a target window. If the target file
has been completely processed, the return value is non-positive and
the loop terminates.
8: This line computes a source data segment to be compared with the given
target window.
9: This line calls win_deflate() to compute the delta instructions and
output the data to the delta file.
Below is the function win_deflate():
1. win_deflate(uchar_t* tar, int n_tar)
2. {
3. int type, indi, n_src;
4. int_t p_src;
5. uchar_t *src;
6. Sfio_t *instf, *dataf;
7. type = win_match(tar, n_tar, &src, &n_src, &p_src);
8. instf = sfstropen(NULL,0);
9. dataf = sfstropen(NULL,0);
10. win_encode(tar, n_tar, src, n_src, instf, dataf);
11. indi = type == 0 ? (1<<2) : type == 1 ? (1<<3) : 0;
12. sfputc(Delta, indi);
13. sfputu(Delta, n_tar);
14. sfputu(Delta, sfsize(instf));
15. sfputu(Delta, sfsize(dataf));
16. sfmove(instf,Delta,-1,-1);
17. sfmove(dataf,Delta,-1,-1);
18. sfclose(instf); sfclose(dataf);
18. }
COMMENTS.
7: This line calls win_match() to obtain a source data segment.
8-9: These lines create two streams to output instructions and raw data.
10: This line calls win_encode() to compute the delta instructions.
11-12: These lines output the indicator byte.
-19-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
13-15: These lines output the sizes of the target window and instruction
and raw datasets.
16-17: These lines use the Sfio function sfmove() to output the instruction
and raw datasets to the delta file.
Below is the function to compute encoding of a target window:
1. int win_encode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
Sfio_t* instf, Sfio_t* dataf)
2. { int add, here, addr, s, size;
3. Cache_t ka;
4. cache_init(&ka);
5. for(here = 0, size = 0, add = -1; here < n_tar; )
6. { if( (s = run_check(tar, n_tar, here)) > 0)
7. addr = -1;
8. else s = str_match(tar,n_tar,here,src,n_src,&addr);
9. if(s > 0)
10. { if(add >= 0)
11. size += add_encode(here-add,tar+here-add,instf,dataf);
12. if(addr < 0)
13. size += run_encode(s,tar[here],instf,dataf);
14. else size += copy_encode(&ka,s,addr,here,instf);
15. add = -1;
16. here += s;
17. }
18. else
19. { if(add < 0)
20. add = here;
21. here += 1;
22. }
23. }
24. if(add >= 0)
25. size += add_encode(here-add,tar+here-add,instf,dataf);
26. else size += copy_encode(0,0,0,0);
27. return size;
28. }
COMMENTS.
4: This line initializes the address caches.
6-8: These lines check to see if there is a RUN or a COPY instruction.
The function run_check() is straightforward to implement so its
description will be omitted.
9-17: These lines output the appropriate delta instruction.
19-22: These lines are executed if there is no RUN or COPY instruction.
The "add" variable is set to collect the data for an ADD instruction.
24-25: These lines output the last unmatched part of the target window
in an ADD instruction.
26: This line outputs the last saved COPY instruction, if any. See the
function copy_encode() below.
27: This line returns the number of bytes used for encoding the data.
To finish up the encoding algorithm, we need to describe the functions
add_encode(), run_encode() and copy_encode(). We shall assume that
-20-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
the default code instruction table 0 is used. The below code declares
the tables and define a few convenience macro functions.
1. Vcdcode_t Tab[256];
2. Vcdcode_t* Add[16];
3. Vcdcode_t* Copy[15][7];
4. Vcdcode_t* Copyadd[8][7][5];
5. Vcdcode_t* Copycopy[8][8];
6. #define INDEXADD(a) ((a >= 1 && a <= 14) ? a : 0)
7. #define INDEXCOPY(c) ((c >= 4 && c <= 14) ? c : 0)
8. #define INDEXCOPYADD(c) ((c >= 4 && c <= 7) ? c : 0)
9. #define ISTINYADD(a) ((a >= 1 && a <= 4) ? 1 : 0)
10. #define ISTINYCOPY(c) ((c >= 4 && c <= 7) ? 1 : 0)
COMMENTS.
6-8: These lines define functions to compute the index into the tables
Add, Copy, and Copyadd for any given ADD or COPY size.
9-10: These lines define functions to test if a given size is small.
These are used to determine if certain pair of instructions could
be coded using a single byte code.
Below is the function to encode a COPY instruction.
1. int Size, Addr, Here, Mode;
2. int copy_encode(Cache_t* ka,int size,int addr,int here,Sfio_t* instf)
3. { int code, mode = 0, best = 0, ndel = 0;
4. if(size > 0)
5. mode = addr_encode(ka, addr, here, &best);
6. if(Size > 0)
7. { if(Mode == mode && mode == VCD_SELF &&
8. ISTINYCOPY(Size) && ISTINYCOPY(size) )
9. { code = Copycopy[Size][size];
10. ndel += sfputc(instf, code);
11. ndel += sfputm(instf, Addr, Here);
12. ndel += sfputm(instf, addr, here);
13. Size = -1;
14. return ndel;
15. }
16. code = Copy[INDEXCOPY(Size)][Mode];
17. ndel += sfputc(instf, code);
18. if(Tab[code].inst1.size == 0)
19. ndel += sfputu(instf, Size);
20. if(Mode == VDC_SELF)
21. ndel += sfputm(instf, Addr, Here);
22. else ndel += sfputc(instf, Addr);
23. }
24. Size = size;
25. Addr = best;
26. Mode = mode;
27. Here = here;
28. return ndel;
29. }
-21-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
COMMENTS.
1: This line defines variables to save a COPY instruction so that it may
be coded jointly with another instruction later.
4-5: These lines compute the mode and value to encode the given COPY
address if there is one (i.e., "size" is positive).
6-23: These lines process a previously saved COPY instruction.
7-15: These lines output a code that encode both COPY instructions.
Note that this is done only when the data sizes are small and both
instructions are using the VCD_SELF addressing mode (Table 2).
16-22: These lines output the previously saved COPY instruction singly.
24-27: These lines save the current COPY instruction if it has not
been merged to a previous COPY instruction as discussed.
Below is the function to output an ADD instruction:
1. int add_encode(int size, uchar_t* data, Sfio_t* instf, Sfio_t* dataf)
2. { int code, ndel = 0;
3. if(Size > 0)
4. { if(ISTINYADD(size))
5. { code = Copyadd[INDEXCOPYADD(Size)][Mode][size];
6. ndel += sfputc(instf, code);
7. if(Tab[code].inst1.size == 0)
8. ndel += sfputu(instf, Size);
9. if(Mode == VCD_SELF)
10. ndel += sfputm(instf, Addr, Here);
11. else ndel += sfputc(instf, Addr);
12. for(; size > 0; --size, ++data)
13. ndel += sfputc(dataf, *data);
14. return ndel;
15. }
16. else ndel += copy_encode(0,0,0,0);
17. }
18. code = Add[INDEXADD(size)];
19. ndel += sfputc(instf, code);
20. if(Tab[code].inst1.size == 0)
21. ndel += sfputu(instf, size);
22. ndel += size;
23. for(; size > 0; --size, ++data)
24. sfputc(dataf, *data);
25. return ndel;
26. }
COMMENTS.
3-17: These lines consider the case when there is a previously saved COPY
instruction. Then, if the ADD size is small, a single code is output
for both COPY and ADD instructions. Otherwise, the COPY instruction
is output by itself.
18-24: These lines output an ADD instruction by itself.
Below is the function to output a RUN instruction:
-22-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
1. int run_encode(int size, int byte, Sfio_t* instf, Sfio_t* dataf)
2. { int ndel = 0;
3. if(Size > 0)
4. ndel += copy_encode(0,0,0,0);
5. ndel += sfputc(instf,0);
6. ndel += sfputu(instf,size);
7. ndel += sfputc(dataf,byte);
8. return ndel;
9. }
COMMENTS.
3-4: These lines output a previously saved COPY instruction, if any.
5-7: These lines output the RUN instruction.
7. Summary
We have described a general and portable encoding format for compression
and differencing. The format is compact. For compression only, using
the same LZ-77 string parsing strategy and without any secondary encoders,
the typical compression rate is better than Unix compress and close to gzip.
For differencing, the same data format is better than all known methods.
Ignoring application-specific secondary encoder issues, the decoding
algorithms run in linear time and require working space proportional
to window sizes.
ACKNOWLEDGEMENTS
Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
who provided much encouragement to publicize Vcdiff.
REFERENCES
[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
Practical Reusable Unix Software, Editor B. Krishnamurthy,
John Wiley & Sons, Inc., 1995.
[2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
Compression, IEEE Transactions on Information Theory,
23(3):337-343, May 1977.
[3] W. Tichy, The String-to-String Correction Problem with Block Moves,
ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
[4] E.M. McCreight, A Space-Economical Suffix Tree Construction
Algorithm, Journal of the ACM, 23:262-272, 1976.
[5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
Algorithms, IEEE Software Configuration and Maintenance Workshop,
1996.
[6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
Accepted for publication in Software Practice & Experience, 1999.
AUTHOR'S ADDRESS
-23-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
Kiem-Phong Vo (main contact)
AT&T Labs, Room D223
180 Park Avenue
Florham Park, NJ 07932
Phone: 973-360-8630
Email: kpv@research.att.com
David G. Korn
AT&T Labs, Room D237
180 Park Avenue
Florham Park, NJ 07932
Phone: 973-360-8602
Email: dgk@research.att.com
EXPIRATION DATE
September 09, 2000
-24-
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -