📄 draft-korn-vcdiff.txt
字号:
David G. Korn, AT&T Labs Joshua P. MacDonald, UC Berkeley Jeffrey C. Mogul, Compaq WRLInternet-Draft Kiem-Phong Vo, AT&T LabsExpires: 09 November 2002 09 November 2001 The VCDIFF Generic Differencing and Compression Data Format draft-korn-vcdiff-06.txtStatus of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.Abstract This memo describes a general, efficient and portable data format suitable for encoding compressed and/or differencing data so that they can be easily transported among computers.Table of Contents: 1. EXECUTIVE SUMMARY ............................................ 2 2. CONVENTIONS .................................................. 3 3. DELTA INSTRUCTIONS ........................................... 4 4. DELTA FILE ORGANIZATION ...................................... 5 5. DELTA INSTRUCTION ENCODING ................................... 9 6. DECODING A TARGET WINDOW ..................................... 14 7. APPLICATION-DEFINED CODE TABLES .............................. 16 8. PERFORMANCE .................................................. 16 9. FURTHER ISSUES ............................................... 17 10. SUMMARY ...................................................... 18 11. ACKNOWLEDGEMENTS ............................................. 18 12. SECURITY CONSIDERATIONS ...................................... 18 13. SOURCE CODE AVAILABILITY ..................................... 18 14. INTELLECTUAL PROPERTY RIGHTS ................................. 18 15. IANA CONSIDERATIONS .......................................... 19 16. REFERENCES ................................................... 19 17. AUTHOR'S ADDRESS ............................................. 201. EXECUTIVE SUMMARY Compression and differencing techniques can greatly improve storage and transmission of files and file versions. Since files are often transported across machines with distinct architectures and performance characteristics, such data should be encoded in a form that is portable and can be decoded with little or no knowledge of the encoders. This document describes Vcdiff, a compact portable encoding format designed for these purposes. Data differencing is the process of computing a compact and invertible encoding of a "target file" given a "source file". Data compression is similar but without the use of source data. The UNIX utilities diff, compress, and gzip are well-known examples of data differencing and compression tools. For data differencing, the computed encoding is called a "delta file", and, for data compression, it is called a "compressed file". Delta and compressed files are good for storage and transmission as they are often smaller than the originals. Data differencing and data compression are traditionally treated as distinct types of data processing. However, as shown in the Vdelta technique by Korn and Vo [1], compression can be thought of as a special case of differencing in which the source data is empty. The basic idea is to unify the string parsing scheme used in the Lempel-Ziv'77 style compressors [2], and the block-move technique of Tichy [3]. Loosely speaking, this works as follows: a. Concatenate source and target data. b. Parse the data from left to right as in LZ'77 but make sure that a parsed segment starts the target data. c. Start to output when reaching target data. Parsing is based on string matching algorithms such as suffix trees [4] or hashing with different time and space performance characteristics. Vdelta uses a fast string matching algorithm that requires less memory than other techniques [5,6]. However, even with this algorithm, the memory requirement can still be prohibitive for large files. A common way to deal with memory limitation is to partition an input file into chunks called "windows" and process them separately. Here, except for unpublished work by Vo, little has been done on designing effective windowing schemes. Current techniques, including Vdelta, simply use source and target windows with corresponding addresses across source and target files. String matching and windowing algorithms have large influence on the compression rate of delta and compressed files. However, it is desirable to have a portable encoding format that is independent of such algorithms. This enables construction of client-server applications in which a server may serve clients with unknown computing characteristics. Unfortunately, all current differencing and compressing tools, including Vdelta, fall short in this respect. Their storage formats are closely intertwined with the implemented string matching and/or windowing algorithms. The encoding format Vcdiff proposed here addresses the above issues. Vcdiff achieves the below characteristics: Output compactness: The basic encoding format compactly represents compressed or delta files. Applications can further extend the basic encoding format with "secondary encoders" to achieve more compression. Data portability: The basic encoding format is free from machine byte order and word size issues. This allows data to be encoded on one machine and decoded on a different machine with different architecture. Algorithm genericity: The decoding algorithm is independent from string matching and windowing algorithms. This allows competition among implementations of the encoder while keeping the same decoder. Decoding efficiency: Except for secondary encoder issues, the decoding algorithm runs in time proportional to the size of the target file and uses space proportional to the maximal window size. Vcdiff differs from more conventional compressors in that it uses only byte-aligned data, thus avoiding bit-level operations, which improves decoding speed at the slight cost of compression efficiency. The Vcdiff data format and the algorithms for decoding data shall be described next. Since Vcdiff treats compression as a special case of differencing, we shall use the term "delta file" to indicate the compressed output for both cases.2. CONVENTIONS The basic data unit is a byte. For portability, Vcdiff shall limit a byte to its lower eight bits even on machines with larger bytes. The bits in a byte are ordered from right to left so that the least significant bit (LSB) has value 1, and the most significant bit (MSB), has value 128. For purposes of exposition in this document, we adopt the convention that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers never appear in the encoded format itself. Vcdiff encodes unsigned integer values using a portable variable-sized format (originally introduced in the Sfio library [7]). This encoding treats an integer as a number in base 128. Then, each digit in this representation is encoded in the lower seven bits of a byte. Except for the least significant byte, other bytes have their most significant bit turned on to indicate that there are still more digits in the encoding. The two key properties of this integer encoding that are beneficial to a data compression format are: a. The encoding is portable among systems using 8-bit bytes, and b. Small values are encoded compactly. For example, consider the value 123456789 which can be represented with four 7-bit digits whose values are 58, 111, 26, 21 in order from most to least significant. Below is the 8-bit byte encoding of these digits. Note that the MSBs of 58, 111 and 26 are on. +-------------------------------------------+ | 10111010 | 11101111 | 10011010 | 00010101 | +-------------------------------------------+ MSB+58 MSB+111 MSB+26 0+21 Henceforth, the terms "byte" and "integer" will refer to a byte and an unsigned integer as described. From time to time, algorithms are exhibited to clarify the descriptions of parts of the Vcdiff format. On such occasions, the C language will be used to make precise the algorithms. The C code shown in this document is meant for clarification only, and is not part of the actual specification of the Vcdiff format. In this specification, the key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as described in RFC2119 [12].3. DELTA INSTRUCTIONS A large target file is partitioned into non-overlapping sections called "target windows". These target windows are processed separately and sequentially based on their order in the target file. A target window T of length t may be compared against some source data segment S of length s. By construction, this source data segment S comes either from the source file, if one is used, or from a part of the target file earlier than T. In this way, during decoding, S is completely known when T is being decoded. The choices of T, t, S and s are made by some window selection algorithm which can greatly affect the size of the encoding. However, as seen later, these choices are encoded so that no knowledge of the window selection algorithm is needed during decoding. Assume that S[j] represents the jth byte in S, and T[k] represents the kth byte in T. Then, for the delta instructions, we treat the data windows S and T as substrings of a superstring U formed by concatenating them like this: S[0]S[1]...S[s-1]T[0]T[1]...T[t-1] The "address" of a byte in S or T is referred to by its location in U. For example, the address of T[k] is s+k. The instructions to encode and direct the reconstruction of a target window are called delta instructions. There are three types: ADD: This instruction has two arguments, a size x and a sequence of x bytes to be copied. COPY: This instruction has two arguments, a size x and an address p in the string U. The arguments specify the substring of U that must be copied. We shall assert that such a substring must be entirely contained in either S or T. RUN: This instruction has two arguments, a size x and a byte b that will be repeated x times. Below are example source and target windows and the delta instructions that encode the target window in terms of the source window. a b c d e f g h i j k l m n o p a b c d w x y z e f g h e f g h e f g h e f g h z z z z COPY 4, 0 ADD 4, w x y z COPY 4, 4 COPY 12, 24 RUN 4, z Thus, the first letter 'a' in the target window is at location 16 in the superstring. Note that the fourth instruction, "COPY 12, 24", copies data from T itself since address 24 is position 8 in T. This instruction also shows that it is fine to overlap the data to be copied with the data being copied from as long as the latter starts earlier. This enables efficient encoding of periodic sequences, i.e., sequences with regularly repeated subsequences. The RUN instruction is a compact way to encode a sequence repeating the same byte even though such a sequence can be thought of as a periodic sequence with period 1. To reconstruct the target window, one simply processes one delta instruction at a time and copy the data either from the source window or the being reconstructed target window based on the type of the instruction and the associated address, if any.4. DELTA FILE ORGANIZATION A Vcdiff delta file starts with a Header section followed by a sequence of Window sections. The Header section includes magic bytes to identify the file type, and information concerning data processing beyond the basic encoding format. The Window sections encode the target windows. Below is the overall organization of a delta file. The indented items refine the ones immediately above them. An item in square brackets may or may not be present in the file depending on the information encoded in the Indicator byte above it. Header Header1 - byte Header2 - byte Header3 - byte Header4 - byte Hdr_Indicator - byte [Secondary compressor ID] - byte[@@@ Why is compressor ID not an integer? ][@@@ If we aren't defining any secondary compressors yet, then it seemsthat defining the [Secondary compressor ID] and the correspondingVCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. Animplementation of this specification won't be able to decode a VCDIFFencoded with this option if it doesn't know about any secondarycompressors. It seems that you should specify the bits related tosecondary compressors once you have defined the first a secondarycompressor. I can imagine a secondary-compressor might want to supplyextra information, such as a dictionary of some kind, in which casethis speculative treatment wouldn't go far enough.] [Length of code table data] - integer [Code table data] Size of near cache - byte Size of same cache - byte Compressed code table data Window1 Win_Indicator - byte [Source segment size] - integer [Source segment position] - integer The delta encoding of the target window Length of the delta encoding - integer The delta encoding Size of the target window - integer Delta_Indicator - byte Length of data for ADDs and RUNs - integer Length of instructions and sizes - integer Length of addresses for COPYs - integer Data section for ADDs and RUNs - array of bytes Instructions and sizes section - array of bytes Addresses section for COPYs - array of bytes Window2 ...4.1 The Header Section Each delta file starts with a header section organized as below. Note the convention that square-brackets enclose optional items. Header1 - byte = 0xE6 Header2 - byte = 0xD3
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -