📄 draft-korn-vcdiff-01.txt
字号:
[ This paper makes reference to the "Sfio" library. See
http://www.research.att.com/sw/tools/sfio/ for more. -kff]
Network Services Research Lab David Korn and Kiem-Phong Vo
AT&T Labs
Submission: March 09, 2000
Expiration: September 09, 2000
The VCDIFF Generic Differencing and Compression Data Format
<draft-korn-vcdiff-01.txt>
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This memo describes a general and efficient data format suitable
for encoding compressed and/or differencing data so that they can
be easily transported among computers.
Table of Contents:
1. Executive Summary ............................................ 1
2. Algorithm Conventions ........................................ 3
3. Delta Instructions ........................................... 3
4. Vcdiff Encoding Format ....................................... 4
5. Instruction Code Tables ...................................... 13
6. Compression and Differencing Algorithms ...................... 18
7. Summary ...................................................... 23
ACKNOWLEDGEMENTS ............................................. 23
REFERENCES ................................................... 23
AUTHOR'S ADDRESS ............................................. 24
1. Executive Summary
Compression and differencing techniques can greatly improve storage
and transmission of files and file versions. Since files are often
transported across machines with distinct architectures and performance
characteristics, such data should be encoded in a form that is portable
and can be decoded with little or no knowledge of the encoders.
This document describes Vcdiff, a new compact portable encoding format
that is independent of encoding algorithms and also efficient to decode.
Data differencing is the process of computing a compact and invertible
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
encoding of a "target file" given a "source file". Data compression
is similar but without the use of source data. The UNIX utilities diff,
compress, and gzip are well-known examples of data differencing and
compression tools. For data differencing, the computed encoding is
called a "delta file", and, for data compression, it is called
a "compressed file". Delta and compressed files are good for storage
and transmission because they are often smaller than the originals.
Data differencing and data compression are traditionally treated
as distinct types of data processing. However, compression can be
thought of as a special case of differencing in which the source
data is empty. This blending of differencing and compression was first
introduced in the Vdelta technique [1] by Korn and Vo. The basic idea
is to unify the string parsing scheme used in the Lempel-Ziv'77 style
compressors [2], and the block-move technique of Tichy [3]. Loosely
speaking, this works as follows:
a. Concatenate source and target data.
b. Parse the data from left to right just like LZ'77 but
make sure that a parsed segment starts target data.
c. Start to output when reaching target data.
Parsing is based on string matching algorithms such as suffix trees [4]
or hashing with different time and space performance characteristics.
Vdelta uses a new string matching algorithm that performs competitive
to other techniques [5]. However, even with a memory-efficient and fast
string matching algorithm, the computing resource requirements can be
prohibitive for processing large files. The standard way to deal with
this is to partition input files into "windows" to be processed
separately. Here, except for some unpublished work by Vo, little has
been done on designing effective windowing schemes. Current techniques,
including Vdelta, simply use windows of the same size with corresponding
addresses across source and target files.
String matching and windowing algorithms have large influence on the
compression rate of delta and compressed files. However, it is desirable
to have a portable encoding format that is independent of such algorithms.
This enables construction of client-server applications in which a server
may serve clients with unknown computing characteristics. Unfortunately,
all current differencing and compressing tools, including Vdelta, fall
short in this resspect. Their storage formats are closely intertwined
with the implemented algorithms.
The encoding format Vcdiff proposed here addresses the above issues.
Vcdiff achieves the below characteristics:
Output compactness:
The basic encoding format compactly represents compressed or delta
files. In specific cases, applications can further extend it with
"secondary encoders" (e.g., a Huffman or arithmetic encoder) to
achieve more compression.
Data portability:
The basic encoding format is free from byte order and word size
issues for integer representations.
Algorithm genericity:
The decoding algorithm for the basic encoding format is independent
from string matching and windowing algorithms.
Decoding efficiency:
The decoding algorithm for the basic encoding format runs in time
proportional to the size of the target file and uses space
-2-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
proportional to the maximal window size.
The Vcdiff data format and the algorithms for decoding data shall be
described next. Since Vcdiff treats compression as a special case of
differencing, we shall use the term "delta file" to indicate the
compressed output for both cases.
2. Algorithm Conventions
Algorithms to encode and decode the Vcdiff data format will be presented
in the ANSI-C language. To ease the presentation, we shall generally omit
error checking. The below conventions will be observed:
a. Tokens with all upper cases letters will be C macros.
b. Variables with capitalized names are global.
c. Code fragments will be presented with line numbers to be used
in the subsequent COMMENTS sections.
Data will be encoded in byte units. For portability, control data
generated by Vcdiff shall be limited to the lower eight bits of a byte
even on machines with larger bytes. The bits in a byte are named from
right to left so that the first bit has the lowest value, 1<<0 or 1,
and the eighth bit has value 1<<7 or 128.
To facilitate the algorithms, we shall assume a few types and functions
for I/O on streams. To be definite, we shall use interfaces similar to
that provided in the Sfio library. Below are descriptions of some of
these types and functions. Others can be found in reference [6].
uchar_t:
This is the type "unsigned char".
int_t:
This is the largest integer type on the platform in use.
Sfio_t:
This is the type of a stream.
Sfio_t* sfstropen(uchar_t* buf, int n):
This is not an Sfio function but it can be easily implemented
on top of the Sfio primitive sfnew(). sfstropen() creates a stream
from a given buffer with a given size. We shall assume that such
a stream is both readable and writable. As with Sfio, a write stream
will extend the buffer as necessary to accomodate output data.
3. Delta Instructions
A target file is partitioned into non-overlapping sections or windows
to be processed separately. A target window T of length n may be
compared against some source data segment S of length m. Such a source
data segment may come from some earlier part of the target file or
it may come from the source file, if there is one. It is assumed that
there is sufficient memory so that both T and S can be processed
in main memory.
For string processing, we treat S and T as substrings of a superstring U
formed by concatenating T and S like this:
-3-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
s[0]s[1]...s[m-1]t[0]t[1]...t[n-1]
The index of a byte in S or T is referred to by its location in U.
For example, T[i] is referred to as U[m+i].
The instructions to encode and direct the reconstruction of a target
window are called delta instructions. There are three types:
ADD:
This instruction has two arguments, a size s and a sequence of
s bytes to be copied to reconstruct the target window.
COPY:
This instruction has two arguments, a size s and an address p
in the string U. The arguments specify the substring of U that
must be copied. We shall assert that such a substring must be
entirely contained in either S or T.
RUN:
This instruction has two arguments, a size s and a byte c that
will be repeated s times.
Below are example source and target strings and the delta instructions
that encode the target string in terms of the source string.
a b c d e f g h i j k l m n o p
a b c d w x y z e f g h e f g h e f g h e f g h z z z z
COPY 4, 0
ADD 4, w x y z
COPY 4, 4
COPY 12, 24
RUN 4, z
Note that the third COPY instruction copies data from T itself since
address 24 is position 8 in T. In addition, parts of the data to be
copied by this instruction will be reconstructed during its execution.
This allows efficient encoding of periodic sequences, i.e., sequences
with regularly repeated subsequences. The RUN instruction is a compact
way to encode a sequence repeating the same byte even though such
a sequence can be thought of as a periodic sequence with period 1.
4. Vcdiff Encoding Format
A large target file is partitioned into non-overlapping sections called
"target windows". In practice, window sizes should be chosen so that
each window can be completely processed in memory during both encoding
and decoding. A target window may be compared against some source data
segment or none. In the compression case, such a source segment would
be from some earlier part of the target file while, for differencing,
it may also be from the source file.
4.1 Delta File Layout
A Vcdiff delta file is organized into sections as follows:
4-byte header
-4-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
Section 1
Section 2
...
Header:
The header of a delta file consists of 4 bytes to identify it as
a Vcdiff delta file. The first three bytes are: 0xd6, 0xc3, 0xc4,
i.e., the ASCII letters 'V', 'C' and 'D' or-ed with the eighth bit.
The fourth byte indicates a version number which is currently 0.
Sections:
Following the 4-byte header are a number of sections. Each section
encodes either an instruction code table (Section 5) or a window.
Each section starts with an indicator byte. If the 8th bit of this
byte is on, the section encodes an instruction table; otherwise,
it encodes a window. The remaining bits of this byte are used to
encode other information as necessary.
4.1.1 The Format of a Window
Each window is organized as below. Note that items bracketed are present
only when certain corresponding bits in the indicator byte are on.
Indicator byte
Length of target window
Length of instruction dataset
Length of raw dataset
[Index of code table]
[Secondary instruction compressor]
[Length of compressed instruction dataset]
[Secondary data compressor]
[Length of compressed raw dataset]
[Source segment size]
[Source segment position]
Instruction dataset
Raw dataset
Indicator byte:
The bits of the indicator byte for a window are as follows:
0 T I D WT WS X X
0: This is the 0-bit indicating that this section is a window
of data to be decoded.
T: This bit, if on, indicates that a non-default instruction
code table should be used for decoding. In this case, the
item [Index of code table] is a single byte indicating the
index of this code table.
I: This bit, if on, indicates that the instruction dataset
has been compressed with a secondary compressor. In this
case, the item [Secondary instruction compressor] is a single
byte indicating the secondary compressor while the item
[Length of compressed instruction dataset] has the length
of the compressed instruction dataset encoded as a
variable-sized integer.
D: This bit is similar to the I-bit but it is for the raw
-5-
RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000
dataset.
WT: This bit, if on, indicates that a source data segment was
used to compare the target window with. Further, this source
segment is from an earlier part of the target file. In this
case, the size and position of the source segment are given
in [Source segment size] and [Source segment position].
Both of these items are encoded as variable-sized integers.
WS: This bit is similar to WT but, if on, indicates that the
source data segment is from the source file.
X: These bits are currently unused.
Length of target window:
This is a variable-sized integer indicating the size of the
target window.
Lengths of instruction and raw datasets:
The delta instructions ADD and RUN have associated raw data
(unmatched bytes for ADD and the repeating byte for RUN).
The Vcdiff encoding format maintains two separate datasets,
one for the raw data and one for the instructions. The lengths
of the "instruction dataset" and the "raw dataset" are stored
respectively as two variable-sized integers.
Instruction dataset:
This is the set of bytes that define the delta instructions.
If I was 1, the dataset has been encoded by the indicated
secondary encoder.
Raw dataset:
This is the set of raw data accompanying the ADD and RUN
instructions. If D was 1, the dataset has been encoded
by the indicated secondary encoder.
4.1.2 Processing a Delta File
Below is the basic algorithm to process a delta file:
1. sfgetc(Delta);
2. sfgetc(Delta);
3. sfgetc(Delta);
4. sfgetc(Delta);
6. for(;;)
7. { if((indi = sfgetc(Delta)) < 0)
8. break;
9. if((indi & (1<<7)) != 0)
10. code_define();
11. else
12. { n_tar = sfgetu(Delta);
13. tar = malloc(n_tar);
14. win_inflate(indi, tar, n_tar, NULL, 0);
15. sfwrite(Target, tar, n_tar);
16. free(tar);
17. }
18. }
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -