📄 rfc3284.txt
字号:
Network Working Group D. Korn
Request for Comments: 3284 AT&T Labs
Category: Standards Track J. MacDonald
UC Berkeley
J. Mogul
Hewlett-Packard Company
K. Vo
AT&T Labs
June 2002
The VCDIFF Generic Differencing and Compression Data Format
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2002). All Rights Reserved.
Abstract
This memo describes VCDIFF, a general, efficient and portable data
format suitable for encoding compressed and/or differencing data so
that they can be easily transported among computers.
Korn, et. al. Standards Track [Page 1]
RFC 3284 VCDIFF June 2002
Table of Contents
1. Executive Summary ........................................... 2
2. Conventions ................................................. 4
3. Delta Instructions .......................................... 5
4. Delta File Organization ..................................... 6
5. Delta Instruction Encoding .................................. 12
6. Decoding a Target Window .................................... 20
7. Application-Defined Code Tables ............................. 21
8. Performance ................................................. 22
9. Further Issues .............................................. 24
10. Summary ..................................................... 25
11. Acknowledgements ............................................ 25
12. Security Considerations ..................................... 25
13. Source Code Availability .................................... 25
14. Intellectual Property Rights ................................ 26
15. IANA Considerations ......................................... 26
16. References .................................................. 26
17. Authors' Addresses .......................................... 28
18. Full Copyright Statement .................................... 29
1. Executive Summary
Compression and differencing techniques can greatly improve storage
and transmission of files and file versions. Since files are often
transported across machines with distinct architectures and
performance characteristics, such data should be encoded in a form
that is portable and can be decoded with little or no knowledge of
the encoders. This document describes Vcdiff, a compact portable
encoding format designed for these purposes.
Data differencing is the process of computing a compact and
invertible encoding of a "target file" given a "source file". Data
compression is similar, but without the use of source data. The UNIX
utilities diff, compress, and gzip are well-known examples of data
differencing and compression tools. For data differencing, the
computed encoding is called a "delta file", and for data compression,
it is called a "compressed file". Delta and compressed files are
good for storage and transmission as they are often smaller than the
originals.
Data differencing and data compression are traditionally treated as
distinct types of data processing. However, as shown in the Vdelta
technique by Korn and Vo [1], compression can be thought of as a
special case of differencing in which the source data is empty. The
basic idea is to unify the string parsing scheme used in the Lempel-
Ziv'77 (LZ'77) style compressors [2] and the block-move technique of
Tichy [3]. Loosely speaking, this works as follows:
Korn, et. al. Standards Track [Page 2]
RFC 3284 VCDIFF June 2002
a. Concatenate source and target data.
b. Parse the data from left to right as in LZ'77 but make sure
that a parsed segment starts the target data.
c. Start to output when reaching target data.
Parsing is based on string matching algorithms, such as suffix trees
[4] or hashing with different time and space performance
characteristics. Vdelta uses a fast string matching algorithm that
requires less memory than other techniques [5,6]. However, even with
this algorithm, the memory requirement can still be prohibitive for
large files. A common way to deal with memory limitation is to
partition an input file into chunks called "windows" and process them
separately. Here, except for unpublished work by Vo, little has been
done on designing effective windowing schemes. Current techniques,
including Vdelta, simply use source and target windows with
corresponding addresses across source and target files.
String matching and windowing algorithms have great influence on the
compression rate of delta and compressed files. However, it is
desirable to have a portable encoding format that is independent of
such algorithms. This enables the construction of client-server
applications in which a server may serve clients with unknown
computing characteristics. Unfortunately, all current differencing
and compressing tools, including Vdelta, fall short in this respect.
Their storage formats are closely intertwined with the implemented
string matching and/or windowing algorithms.
The encoding format Vcdiff proposed here addresses the above issues.
Vcdiff achieves the characteristics below:
Output compactness:
The basic encoding format compactly represents compressed or
delta files. Applications can further extend the basic
encoding format with "secondary encoders" to achieve more
compression.
Data portability:
The basic encoding format is free from machine byte order and
word size issues. This allows data to be encoded on one
machine and decoded on a different machine with different
architecture.
Algorithm genericity:
The decoding algorithm is independent from string matching and
windowing algorithms. This allows competition among
implementations of the encoder while keeping the same decoder.
Korn, et. al. Standards Track [Page 3]
RFC 3284 VCDIFF June 2002
Decoding efficiency:
Except for secondary encoder issues, the decoding algorithm
runs in time proportionate to the size of the target file and
uses space proportionate to the maximal window size. Vcdiff
differs from more conventional compressors in that it uses only
byte-aligned data, thus avoiding bit-level operations, which
improves decoding speed at the slight cost of compression
efficiency.
The combined differencing and compression method is called "delta
compression" [14]. As this way of data processing treats compression
as a special case of differencing, we shall use the term "delta file"
to indicate the compressed output for both cases.
2. Conventions
The basic data unit is a byte. For portability, Vcdiff shall limit a
byte to its lower eight bits even on machines with larger bytes. The
bits in a byte are ordered from right to left so that the least
significant bit (LSB) has value 1, and the most significant bit
(MSB), has value 128.
For purposes of exposition in this document, we adopt the convention
that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers
never appear in the encoded format itself.
Vcdiff encodes unsigned integer values using a portable, variable-
sized format (originally introduced in the Sfio library [7]). This
encoding treats an integer as a number in base 128. Then, each digit
in this representation is encoded in the lower seven bits of a byte.
Except for the least significant byte, other bytes have their most
significant bit turned on to indicate that there are still more
digits in the encoding. The two key properties of this integer
encoding that are beneficial to a data compression format are:
a. The encoding is portable among systems using 8-bit bytes, and
b. Small values are encoded compactly.
For example, consider the value 123456789, which can be represented
with four 7-bit digits whose values are 58, 111, 26, 21 in order from
most to least significant. Below is the 8-bit byte encoding of these
digits. Note that the MSBs of 58, 111 and 26 are on.
+-------------------------------------------+
| 10111010 | 11101111 | 10011010 | 00010101 |
+-------------------------------------------+
MSB+58 MSB+111 MSB+26 0+21
Korn, et. al. Standards Track [Page 4]
RFC 3284 VCDIFF June 2002
Henceforth, the terms "byte" and "integer" will refer to a byte and
an unsigned integer as described.
Algorithms in the C language are occasionally exhibited to clarify
the descriptions. Such C code is meant for clarification only, and
is not part of the actual specification of the Vcdiff format.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119 [12].
3. Delta Instructions
A large target file is partitioned into non-overlapping sections
called "target windows". These target windows are processed
separately and sequentially based on their order in the target file.
A target window T, of length t, may be compared against some source
data segment S, of length s. By construction, this source data
segment S comes either from the source file, if one is used, or from
a part of the target file earlier than T. In this way, during
decoding, S is completely known when T is being decoded.
The choices of T, t, S and s are made by some window selection
algorithm, which can greatly affect the size of the encoding.
However, as seen later, these choices are encoded so that no
knowledge of the window selection algorithm is needed during
decoding.
Assume that S[j] represents the jth byte in S, and T[k] represents
the kth byte in T. Then, for the delta instructions, we treat the
data windows S and T as substrings of a superstring U, formed by
concatenating them like this:
S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
The "address" of a byte in S or T is referred to by its location in
U. For example, the address of T[k] is s+k.
The instructions to encode and direct the reconstruction of a target
window are called delta instructions. There are three types:
ADD: This instruction has two arguments, a size x and a sequence
of x bytes to be copied.
COPY: This instruction has two arguments, a size x and an address
p in the string U. The arguments specify the substring of U
that must be copied. We shall assert that such a substring
must be entirely contained in either S or T.
Korn, et. al. Standards Track [Page 5]
RFC 3284 VCDIFF June 2002
RUN: This instruction has two arguments, a size x and a byte b,
that will be repeated x times.
Below are example source and target windows and the delta
instructions that encode the target window in terms of the source
window.
a b c d e f g h i j k l m n o p
a b c d w x y z e f g h e f g h e f g h e f g h z z z z
COPY 4, 0
ADD 4, w x y z
COPY 4, 4
COPY 12, 24
RUN 4, z
Thus, the first letter 'a' in the target window is at location 16 in
the superstring. Note that the fourth instruction, "COPY 12, 24",
copies data from T itself since address 24 is position 8 in T. This
instruction also shows that it is fine to overlap the data to be
copied with the data being copied from, as long as the latter starts
earlier. This enables efficient encoding of periodic sequences,
i.e., sequences with regularly repeated subsequences. The RUN
instruction is a compact way to encode a sequence repeating the same
byte even though such a sequence can be thought of as a periodic
sequence with period 1.
To reconstruct the target window, one simply processes one delta
instruction at a time and copies the data, either from the source
window or the target window being reconstructed, based on the type of
the instruction and the associated address, if any.
4. Delta File Organization
A Vcdiff delta file starts with a Header section followed by a
sequence of Window sections. The Header section includes magic bytes
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -