📄 draft-korn-vcdiff-01.txt
字号:
-6-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
COMMENTS.
1-4: These lines read the 4-byte header.
7-8: These lines read the indicator byte and terminate if reached end-of-file.
9-10: These lines define a new code table.
12-16: These lines decode the target window and output it to the target file.
Next is the function to recompute a target window:
1. win_inflate(int indi, uchar_t* tar, int n_tar, uchar_t* src, int n_src)
2. { int n_inst, n_data;
3. uchar_t *inst, *data;
4. int s_inst, s_data;
5. int p_src, i_code, f_inst, f_data;
6. Sfio_t *sf, *instf, *dataf;
7. n_inst = sfgetu(Delta);
8. n_data = sfgetu(Delta);
9. if(indi & (1<<6))
10. i_code = sfgetc(Delta);
11. if(indi & (1<<5))
12. { f_inst = sfgetc(Delta);
13. s_inst = sfgetu(Delta);
14. }
15. if(indi & (1<<4))
16. { f_data = sfgetc(Delta);
17. s_data = sfgetu(Delta);
18. }
19. if(indi & ((1<<3)|(1<<2)) )
20. { n_src = sfgetu(Delta);
21. p_src = sfgetu(Delta);
22. }
23. inst = malloc(n_inst);
24. if(indi & (1<<5))
25. { sfread(Delta, inst, s_inst);
26. decompress(inst, n_inst, s_inst, f_inst);
27. }
28. else sfread(Delta, inst, n_inst);
29. instf = sfstropen(inst, n_inst);
30. data = malloc(n_data);
31. if(indi & (1<<4))
32. { sfread(Delta, data, s_data);
33. decompress(data, n_data, s_data, f_data);
34. }
35. else sfread(Delta, data, n_data);
36. dataf = sfstropen(data, n_data);
37. if(indi & ((1<<3)|(1<<2)) )
38. { src = malloc(n_src);
39. sf = (indi & (1<<3)) ? Target : Source;
40. sfseek(sf, p_src, SEEK_SET);
41. sfread(sf, src, n_src);
-7-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
42. }
43. win_decode(tar, n_tar, src, n_src, i_code, instf, dataf);
44. sfclose(instf); sfclose(dataf); free(inst); free(data);
45. }
COMMENTS.
7-8: These lines read the sizes of the instruction and raw datasets.
9-10: These lines read the index of the instruction code table if needed.
11-14: These lines read the secondary instruction compressor indicator and
the size of the compressed instruction dataset.
15-18: These lines read the secondary data compressor indicator and the
size of the compressed raw dataset.
19-22: These lines read the size and position of the source data segment.
Note that win_inflate() may also be called to decode an instruction
code table (Section 5). In that case, "src" will be given and
the WT and WS bits should be zero.
23-36: These lines read the instruction and raw datasets. If these were
compressed, decompress() is called to reconstruct them. The function
decompress() uses the method index f_inst or f_data to decide on
the proper operations. Note that these datasets are then made into
streams via the sfstropen() calls to ease data accessing later.
37-42: These lines construct the source data segment if any.
43: This line calls win_decode() (Section 4.4) to decode the delta
instructions.
44: This line frees resources allocated for processing.
4.2 Integer Encoding Formats
Vcdiff compactly encodes integer values using the same formats introduced
in the Sfio library [6]. The code presented below is not quite correct with
respect to type handling as in Sfio but they show the ideas.
4.2.1 Encoding Integers Using a Variable-Sized Format
The variable-sized encoding of integers follows the same algorithm
used in the Sfio library. This treats an integer as a number in
base 128 so that each digit can be coded in one byte. The eighth bit
of such a byte is the continuation bit. It is 1 if there are more
digits to come or 0 if it is the last digit.
+---------------------+
| 11100000 | 00111001 |
+---------------------+
byte 0 byte 1
Table 1
Table 1 shows the variable sized encoding of 12345. The bytes in
the encoding are presented in binary to make clear the use of
the continuation bit. Omitting the continuation bit, the encoding
of 12345 uses two bytes with values 96 and 57.
Below are the algorithms:
1. int sfputu(Sfio_t* f, int_t v)
2. { uchar_t c[2*sizeof(int_t)], *s;
-8-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
3. s = &c[sizeof(c)-1];
4. *s = v & 127;
5. while((v >>= 7) != 0)
6. *--s = (v & 127) | (1<<7);
7. sfwrite(f, s, (&c[sizeof(c)]) - s);
8. return &c[sizeof(c)]-s;
9. }
10. int_t sfgetu(Sfio_t* f)
11. { int_t b, v;
12. for(v = 0;; )
13. { b = sfgetc(f);
14. v = (v << 7) | (b & 127);
15. if(!(b & 128) )
16. return v;
17. }
18. }
COMMENTS.
1: This line declares the formal arguments to sfputu(), a stream f
to store the encoding and the value v to be encoded.
2: This line declares an array large enough to store the encoding.
4-6: These lines extract digits in base 128 and store them in
the array. Note that the right-most digit is extracted first and
does not carry a continuation bit.
7-8: These lines write the encoding out to the given stream f and
return the length of that encoding.
10-18: These lines define the decoding algorithm.
4.2.2 Encoding Integers with a Given Range
The address of a COPY instruction is an unsigned integer with a known
range, i.e., it cannot be larger than the current position. Such a
value can be encoded more compactly than as done in the Sfio scheme.
For example, if a value is in the range 0-255, it can be encoded with
a single byte. On the other hand, if the same value happens to be larger
than 127, it would require two bytes to encode in the variable-sized
scheme. Below are the algorithms to encode integers with known ranges:
1. int sfputm(Sfio_t* f, int_t v, int_t max)
2. { uchar_t c[sizeof(int_t)], *s;
3. s = &c[sizeof(c)-1];
4. *s = v & 255;
5. while((max >>= 8) != 0)
6. { v >>= 8;
7. *--s = v & 255;
8. }
9. sfwrite(f, s, (&c[sizeof(c)]) - s);
10. return (&c[sizeof(c)]) - s;
11. }
12. int_t sfgetm(Sfio_t* f, int_t max)
13. { int_t v;
14. for(v = 0;;)
-9-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
15. { v = (v << 8) | sfgetc(f);
16. if((max >>= 8) == 0)
17. return v;
18. }
19. }
4.3 Delta Instruction Coding Format
Delta instructions are encoded as control bytes with associated data.
Each control byte is an index into an instruction code table of 256
entries. Each code entry is of the type Vcdcode_t as defined below:
1. typedef struct _vcdinst_s
2. { unsigned char type; /* instruction type */
3. unsigned char size; /* >0 if size is coded here */
4. unsigned char mode; /* address mode for COPYs */
5. } Vcdinst_t;
6. typedef struct _vcdcode_s
7. { Vcdinst_t inst1; /* first instruction */
8. Vcdinst_t inst2; /* second instruction */
9. } Vcdcode_t;
Thus, each control byte identifies up to two instructions. Below are
the different instruction types:
1. #define VCD_NOOP 0 /* not an instruction */
2. #define VCD_ADD 1 /* an ADD instruction */
3. #define VCD_RUN 2 /* a RUN instruction */
4. #define VCD_COPY 3 /* a COPY instruction */
Each instruction has a size n of the data involved. If the field
Vcdinst_t.size is non-zero, it is the value for n. Otherwise,
n is encoded next in the instruction dataset as a variable-sized
integer. If the instruction is a COPY, the copy address will follow
next in the instruction dataset. Its encoding depends on some
addressing scheme to be discussed next.
A COPY address can be encoded in different ways. The field
Vcdinst_t.mode has values in the range [0-7] and defines the below
address encoding modes:
1. #define VCD_SELF 0 /* coded as itself */
2. #define VCD_HERE 1 /* from current position */
3. #define VCD_SAME 2 /* index into "same" cache */
4. #define VCD_NEAR 3 /* index into "near" cache */
VCD_SAME and VCD_NEAR indicate two address caching methods designed
to take advantage of the heuristic that successive copying addresses
tend to be the same or fairly close to one another. Note that as
discussed below, there are 4 VCD_NEAR addresses corresponding to
Vcdinst_t.mode values in the range [VCD_NEAR,VCD_NEAR+3].
Let A be the COPY address and H the current location in the target
data. Below are the encodings:
VCD_SELF: A is the next integer in the instruction dataset that is
encoded in the range [0-H].
-10-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
VCD_HERE: Let B be the next byte in the instruction dataset.
Then A is H-B. That is, the distance from A to H, H-A, must have
been in the range [0-255].
VCD_SAME: The "same" cache keeps 256 addresses. Let B be the next byte
in the instruction dataset. Then A is same[B].
VCD_NEAR: The "near" cache keeps 4 addresses. Let B be the next byte
in the instruction dataset and t be Vcdinst_t.mode-VCD_NEAR.
Then, A is near[t]+B-127.
Below are the algorithms to maintain address caches.
1. typedef struct _cache_s
2. { int n;
3. int near[4];
4. int same[256];
5. } Cache_t;
6. cache_init(Cache_t* ka)
7. { int i;
8. for(i = 0; i < 4; ++i)
9. ka->near[i] = 0;
10. ka->n = 0;
11. for(i = 0; i < 256; ++i)
12. ka->same[i] = 0;
13. }
14. cache_update(Cache_t* ka, int_t addr)
15. { ka->near[ka->n] = addr;
16. if((ka->n += 1) >= 4)
17. ka->n = 0;
18. ka->same[addr&255] = addr;
19. }
COMMENTS.
1-5: These lines define the caches. As discussed, the "near" cache
has 4 addresses and the "same" cache has 256 addresses.
6-13: These lines initialize addresses in the caches to zero.
14-19: These lines update the caches with a given address. Note that
the "near" cache is updated in a round-robin manner and the lower
eight bits of an address is its index in the "same" cache.
Below is the function to encode a COPY address:
1. int addr_encode(Cache_t* ka, int addr, int here, int* best)
2. { int i, d, mode = -1;
3. if((d = (here-addr)) < 256)
4. { *best = d;
5. mode = VCD_HERE;
6. }
7. else if(ka->same[d = addr&255] == addr)
8. { *best = d;
-11-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
9. mode = VCD_SAME;
10. }
11. else
12. { for(i = 0; i < 4; ++i)
13. { if((d = addr - ka->k_near[i]) >= -127 && d <= 128)
14. { *best = d + 127;
15. mode = VCD_NEAR+i;
16. }
17. }
18. }
19. if(mode < 0)
20. { *best = addr;
21. mode = VCD_SELF;
22. }
23. cache_update(ka,addr);
24. return mode;
25. }
COMMENTS.
1: This lines declare the formal arguments. "addr" is the address to be
encoded. "here" is the current location in the target data. "best" is
used to return the value to be used to encode "addr".
3-6: If "addr" is within the range [0-255] away from the current location
"here", then the addressing mode is VCD_HERE and the address is encoded
as a single byte showing this distance.
7-10: Using the lower eight bits of "addr" to index the "same" cache, if the
address in the cache exactly matches "addr", then VCD_SAME is used
and "addr" is encoded as this index.
12-18: Check each address in the "near" cache to see if "addr" is within
[-127,128] away from such an address. If there is one, the addressing
mode is VCD_NEAR plus the index of the address and "addr" is encoded
as the distance plus 127 (so the encoded value is in the range [0-255]).
19-22: If none of the above addressing modes applies, then VCD_SELF is used
and "addr" is encoded as a value in the range [0-here].
23: This line updates the address caches.
Below is the function to decode a COPY address:
1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* instf)
2. { int addr;
3. if(type == VCD_SELF)
4. addr = sfgetm(instf, here);
5. else if(type == VCD_HERE)
6. addr = here - sfgetc(instf);
7. else if(type == VCD_SAME)
8. addr = ka->same[sfgetc(instf)];
9. else addr = ka->near[type] + sfgetc(instf) - 127;
10. cache_update(ka, addr);
11. return addr;
12. }
4.4 Decoding A Target Window
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -