📄 draft-korn-vcdiff-01.txt
字号:
-12-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
The algorithm to decode a target window is as follows:
1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src,
int i_code, Sfio_t* instf, Sfio_t* dataf)
2. { int_t here, size, addr, i;
3. Cache_t ka;
4. Vcdinst_t *inst;
5. Vcdcode_t *code, *table = Code[i_code];
6. cache_init(&ka);
7. for(here = 0; here < n_tar; )
8. { code = &table[sfgetc(instf)];
9. for(i = 1; i <= 2; ++i)
10. { inst = i == 1 ? &code->inst1 : &code->inst2;
11. if(inst->type == VCD_NOOP)
12. continue;
13. if((size = inst->size) == 0)
14. size = sfgetu(instf);
15. if(inst->type == VCD_ADD)
16. { for(; size > 0; --size)
17. tar[here++] = sfgetc(dataf);
18. }
19. else if(inst->type == VCD_RUN)
20. { int c = sfgetc(dataf);
21. for(; size > 0; --size)
22. tar[here++] = c;
23. }
24. else if(inst->type == VCD_COPY)
25. { uchar_t* from;
26. addr = addr_decode(&ka,here,inst->mode,instf);
27. from = addr < nsrc ? src+addr : tar+addr-nsrc;
28. for(; size > 0; --size)
29. tar[here++] = *from++;
30. }
31. }
32. }
33. }
COMMENTS.
5: "table" is initialized to be the instruction code table to be used.
We assume that "Code" is a global variable pointing to the list of 256
possible code tables.
6: This line initializes the address caches.
8: This line reads a control byte from the instruction dataset and gets
the corresponding code table to decode instructions.
9-32: These lines process the given pair of delta instructions. Note that
the data for VCD_ADD and VCD_RUN are read from the raw dataset.
5. Instruction Code Tables
The delta instructions are encoded based on some instruction code table.
The Vcdiff format allows applications to tailor such code tables to the
particular data characteristics to enhance output compactness. Up to 256
instruction tables with indices in the range [0,255] are allowed.
However, the first 8 indices [0,7] are reserved by Vcdiff.
5.1 The Encoding of a Code Table in a Delta File
-13-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
It is desirable to compactly encode a code table in a delta file.
To accomplish this, we first represent a table to be output as a string.
Since each table entry encodes two instructions and each is 3 bytes,
the total string size is 6*256 or 1536 bytes. Such a string can then
be compared against some other code table string (e.g., that of table 0)
using the window encoding algorithm win_deflate() to reduce the output
data. Thus, the format for a code table in a delta file looks like this:
Code table to compare with
Encoded data
Code table to compare with:
This is a single byte indicating the code table used to compare
the current table with.
Encoded data:
The data uses the same format as that of a window (Section 4.1.1)
where the target data is the code string of the table being encoded
and the source data segment is the code string of the code table
with the above index.
Two functions tab2str() and str2tab() are used for converting between
a code table and its code string. Below is the description of tab2str().
str2tab() is just as straightforward so its description will be omitted.
Note that bytes of the same type are grouped together to induce more
matching when code strings are compared.
1. tab2str(Vcdcode_t* tab, uchar_t data[6*256])
2. { int i, n;
3. n = 0;
4. for(i = 0; i < 256; ++i)
5. data[n++] = tab[i].inst1.type;
6. for(i = 0; i < 256; ++i)
7. data[n++] = tab[i].inst2.type;
8. for(i = 0; i < 256; ++i)
9. data[n++] = tab[i].inst1.size;
10. for(i = 0; i < 256; ++i)
11. data[n++] = tab[i].inst2.size;
12. for(i = 0; i < 256; ++i)
13. data[n++] = tab[i].inst1.mode;
14. for(i = 0; i < 256; ++i)
15. data[n++] = tab[i].inst2.mode;
16. }
Below is the function code_define() to retrieve an encoded instruction
code table from the delta file.
1. code_define()
2. { uchar_t src[6*256], tar[6*256];
3. int index;
4. index = sfgetc(Delta);
5. tab2str(Code[sfgetc(Delta)], src);
8. win_inflate(0, tar, 1536, src, 1536);
9. Code[index] = str2tab(data);
10. }
-14-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
COMMENTS.
4: This line reads the index to install the new code table.
5: This line constructs the source string to be compared with.
8-9: These lines construct the new table and install it.
5.2 Reserved Instruction Code Tables
A COPY instruction with small data size may use more encoding space than
the data itself. Thus, it is good to restrict COPY instructions to some
minimum sizes. In turn, a code table can take advantage of such
a restriction to enhance compactness. Vcdiff provides two reserved
instruction code tables with indices 0 and 1 for minimium COPY sizes of
4 and 3 respectively. As noted, the starting code table has index 0
so it is optimized for minimum COPY size of 4. Note also that even when
a table optimized for a particular minimum COPY size, this does not mean
that a COPY instruction with a smaller size cannot be encoded. It simply
means that the size of such an instruction must be coded separately.
TYPE SIZE MODE TYPE SIZE MODE INDEX
-------------------------------------------------------------
a. RUN 0 NOOP 0
b. ADD 0 NOOP 1
c. ADD [1,14] NOOP [2,15]
d. COPY 0 [0,6] NOOP [16,22]
e. COPY [M,M+10] [0,6] NOOP [23,99]
f. COPY 0 [0,6] ADD [1,4] [100,127]
g. COPY [M,M+3] [0,6] ADD [1,4] [128,239]
h. COPY [M,M+3] SELF COPY [M,M+3] SELF [240,255]
-------------------------------------------------------------
Table 2
For a given minimum COPY size M, Table 2 shows how the 256 code values
are partitioned into different categories of indices:
a. Index 0 defines a RUN instruction whose size is always coded
separately as an unsigned integer. This is signified by having
the "size" field being 0.
b. Index 1 defines an ADD instruction with size coded separately.
c. Indices 2-15 define 14 ADD instructions with sizes in
the range [1,14].
d. Indices 16-22 define 7 COPY instructions for the 7 different
address encoding modes discusses in Section 4.3. The sizes of
these instructions are coded separately.
e. Indices 23-99 define 77 COPY instructions using all 7 addressing
modes and having sizes in the range [M,M+10].
f. Indices 100-127 defines 28 pairs of COPY and ADD instructions.
The COPY sizes are coded separately while the ADD sizes are
in the range [1,4].
g. Indices 128-239 define 112 pairs of COPY and ADD instructions
whose sizes are in the ranges [M,M+3] and [1,4] respectively.
h. Finally, indices 240-255 defines 16 pairs of COPY and COPY
instructions with sizes in the range [M,M+3] and both using
the same VCD_SELF addressing mode.
Next we present the algorithms to construct an instruction code table.
-15-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
Where appropriate, the algorithms also construct the inverted tables
usable for data encoding. Any part of a table not explicitly set will
be assumed to be initialized to be either zero or the null pointer as
the case warranted.
1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7],
int copyadd[8][7][5], int copycopy[8][8], int M)
2. {
3. tab[0].inst1.type = VCD_RUN;
4. tab[0].inst1.size = 0;
5. tab[0].inst2.type = VCD_NOOP;
6. mk_add(tab, add);
7. mk_copy(tab, copy, M);
8. mk_copyadd(tab, copyadd, M);
9. mk_copycopy(tab, copycopy, M);
10. }
COMMENTS.
1: This line declares the arguments for mk_codetab(). The inverted
tables add[], copy[][], copyadd[][][], and copycopy[][] are designed
to be indexed by sizes and address modes. The argument M is the minimum
COPY size used for the table.
3-5: These lines construct the RUN instruction.
6-9: These lines call various subroutines to construct different parts of
the instruction table.
Below is the algorithm to construct the single ADD instructions:
1. mk_add(Vcdcode_t tab[256], int add[16])
2. { int s, n;
3. tab[1].inst1.type = VCD_ADD;
4. tab[1].inst1.size = 0;
5. tab[1].inst2.type = VCD_NOOP;
6. add[0] = 1;
7. for(n = 2, s = 1; s <= 14; ++s, ++n)
8. { tab[n].inst1.type = VCD_ADD;
9. tab[n].inst1.size = s;
10. tab[n].inst2.type = VCD_NOOP;
11. add[s] = n;
12. }
13. }
COMMENTS.
3-6: These lines construct the ADD instruction whose size is to be coded
separately. Location 0 in the inverted table add[] points to this code.
7-12: These lines construct the ADD instructions with size in
the range [1-14] as discussed earlier.
Below is the function to construct the single COPY instructions:
1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M)
-16-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
2. { int s, m, n;
3. for(n = 16, m = 0; m < 7; ++m, ++n)
4. { tab[n].inst1.type = VCD_COPY;
5. tab[n].inst1.size = 0;
6. tab[n].inst1.mode = m;
7. tab[n].inst2.type = VCD_NOOP;
8. copy[0][m] = n;
9. }
10. for(s = M; s <= M+10; ++s)
11. { for(m = 0; m < 7; ++m, ++n)
12. { tab[n].inst1.type = VCD_COPY;
13. tab[n].inst1.size = s;
14. tab[n].inst1.mode = m;
15. tab[n].inst2.type = VCD_NOOP;
16. copy[s][m] = n;
17. }
18. }
20. }
COMMENTS.
3-9: These lines construct the 7 COPY instructions with whose data sizes
are coded separately.
10-18: These lines construct the 77 COPY instructions whose data sizes
are in the range [M,M+10].
The below function constructs the COPY/ADD instruction pairs:
1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M)
2. { int s, m, a, n;
3. for(n = 100, m = 0; m < 7; ++m)
4. { for(a = 1; a <= 4; ++a, ++n)
5. { tab[n].inst1.type = VCD_COPY;
6. tab[n].inst1.size = 0;
7. tab[n].inst1.mode = m;
8. tab[n].inst2.type = VCD_ADD;
9. tab[n].inst2.size = a;
10. copyadd[0][m][a] = n;
11. }
12. }
13. for(s = M; s <= M+3; ++s)
14. { for(m = 0; m < 7; ++m)
15. { for(a = 1; a <= 4; ++a, ++n)
16. { tab[n].inst1.type = VCD_COPY;
17. tab[n].inst1.size = s;
18. tab[n].inst1.mode = m;
19. tab[n].inst2.type = VCD_ADD;
20. tab[n].inst2.size = a;
21. copyadd[s][m][a] = n;
22. }
23. }
24. }
25. }
Finally, the algorithm to construct the 16 COPY/COPY pairs:
-17-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M)
2. { int s, c, n;
3. for(n = 240, s = M; s <= M+3; ++s)
4. { for(c = M; c <= M+3; ++c, ++n)
5. { tab[n].inst1.type = VCD_COPY;
6. tab[n].inst1.size = s;
7. tab[n].inst1.mode = VCD_SELF;
8. tab[n].inst2.type = VCD_COPY;
9. tab[n].inst2.size = c;
10. tab[n].inst2.mode = VCD_SELF;
11. copycopy[s][c] = n;
12. }
13. }
14. }
6. Compression and Differencing Algorithms
Sections 4 and 5 describe how to reconstruct a target file from data in
a delta file. In this section, we discuss briefly general architectures
for compressors and differencers. A few abstract functions are assumed:
int win_target(uchar_t** tar);
This function may be called many times to get sequential windows of
data from the target file to be compressed. It returns the length of
the window while the window itself is returned in "*tar".
int win_match(uchar_t* tar, int n_tar, int p_tar,
uchar_t** src, int* n_src, int_t* p_src);
This function computes a source data segment, if any, to be compared
with a target window. It returns 0, 1 or 2 for no source segment,
source segment from the source file, or source segment from
the target file. The arguments "src", "n_src" and "p_src" are used
to return the source data segment, its size, and its position in
either the source file or target file.
int str_match(uchar_t* tar, int n_tar, int p_tar,
uchar_t* src, int n_src, int* match);
This function computes a string in either the source data segment or
the target window that matches with a prefix of the target data being
processed. Such a matched string can be encoded as a COPY instruction.
The argument "p_tar" indicates the current position in the target
window to be processed. The function returns the length of the match
and, if that is positive, "*match" is set to the match position.
int run_check(uchar_t* tar, int n_tar, int p_tar);
This function checks to see if there is a run of the same bytes
starting at the current location in the target string.
The performance of an encoder depends on what algorithms are used to
implement the above functions. However, as shown earlier, decoding
performance is completely independent of such choices.
Below is the algorithm for encoding a target file. We shall assume
-18-
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -