draft-korn-vcdiff.txt

来自「Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。」· 文本代码 · 共 1,323 行 · 第 1/4 页
TXT
1,323 行
                                                     David G. Korn, AT&T Labs				             Joshua P. MacDonald, UC Berkeley                                                 Jeffrey C. Mogul, Compaq WRLInternet-Draft                                       Kiem-Phong Vo, AT&T LabsExpires: 09 November 2002                                    09 November 2001        The VCDIFF Generic Differencing and Compression Data Format                         draft-korn-vcdiff-06.txtStatus of this Memo    This document is an Internet-Draft and is in full conformance    with all provisions of Section 10 of RFC2026.    Internet-Drafts are working documents of the Internet Engineering    Task Force (IETF), its areas, and its working groups.  Note that    other groups may also distribute working documents as    Internet-Drafts.    Internet-Drafts are draft documents valid for a maximum of six    months and may be updated, replaced, or obsoleted by other    documents at any time.  It is inappropriate to use Internet-    Drafts as reference material or to cite them other than as    "work in progress."    The list of current Internet-Drafts can be accessed at    http://www.ietf.org/ietf/1id-abstracts.txt    The list of Internet-Draft Shadow Directories can be accessed at    http://www.ietf.org/shadow.html.Abstract    This memo describes a general, efficient and portable data format    suitable for encoding compressed and/or differencing data so that    they can be easily transported among computers.Table of Contents:    1.  EXECUTIVE SUMMARY ............................................  2    2.  CONVENTIONS ..................................................  3    3.  DELTA INSTRUCTIONS ...........................................  4    4.  DELTA FILE ORGANIZATION ......................................  5    5.  DELTA INSTRUCTION ENCODING ...................................  9    6.  DECODING A TARGET WINDOW ..................................... 14    7.  APPLICATION-DEFINED CODE TABLES .............................. 16    8.  PERFORMANCE .................................................. 16    9.  FURTHER ISSUES ............................................... 17   10.  SUMMARY ...................................................... 18   11.  ACKNOWLEDGEMENTS ............................................. 18   12.  SECURITY CONSIDERATIONS ...................................... 18   13.  SOURCE CODE AVAILABILITY ..................................... 18   14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18   15.  IANA CONSIDERATIONS .......................................... 19   16.  REFERENCES ................................................... 19   17.  AUTHOR'S ADDRESS ............................................. 201.  EXECUTIVE SUMMARY    Compression and differencing techniques can greatly improve storage    and transmission of files and file versions.  Since files are often    transported across machines with distinct architectures and performance    characteristics, such data should be encoded in a form that is portable    and can be decoded with little or no knowledge of the encoders.    This document describes Vcdiff, a compact portable encoding format    designed for these purposes.    Data differencing is the process of computing a compact and invertible    encoding of a "target file" given a "source file".  Data compression    is similar but without the use of source data.  The UNIX utilities diff,    compress, and gzip are well-known examples of data differencing and    compression tools.  For data differencing, the computed encoding is    called a "delta file", and, for data compression, it is called    a "compressed file".  Delta and compressed files are good for storage    and transmission as they are often smaller than the originals.    Data differencing and data compression are traditionally treated    as distinct types of data processing.  However, as shown in the Vdelta    technique by Korn and Vo [1], compression can be thought of as a special    case of differencing in which the source data is empty. The basic idea    is to unify the string parsing scheme used in the Lempel-Ziv'77 style    compressors [2], and the block-move technique of Tichy [3].  Loosely    speaking, this works as follows:        a. Concatenate source and target data.        b. Parse the data from left to right as in LZ'77 but	   make sure that a parsed segment starts the target data.        c. Start to output when reaching target data.    Parsing is based on string matching algorithms such as suffix trees [4]    or hashing with different time and space performance characteristics.    Vdelta uses a fast string matching algorithm that requires less memory    than other techniques [5,6].  However, even with this algorithm, the    memory requirement can still be prohibitive for large files.  A common    way to deal with memory limitation is to partition an input file into    chunks called "windows" and process them separately. Here, except for    unpublished work by Vo, little has been done on designing effective    windowing schemes. Current techniques, including Vdelta, simply use    source and target windows with corresponding addresses across source    and target files.    String matching and windowing algorithms have large influence on the    compression rate of delta and compressed files. However, it is desirable    to have a portable encoding format that is independent of such algorithms.    This enables construction of client-server applications in which a server    may serve clients with unknown computing characteristics.  Unfortunately,    all current differencing and compressing tools, including Vdelta, fall    short in this respect. Their storage formats are closely intertwined    with the implemented string matching and/or windowing algorithms.    The encoding format Vcdiff proposed here addresses the above issues.    Vcdiff achieves the below characteristics:	Output compactness:            The basic encoding format compactly represents compressed or delta	    files. Applications can further extend the basic encoding format	    with "secondary encoders" to achieve more compression.	Data portability:	    The basic encoding format is free from machine byte order and	    word size issues. This allows data to be encoded on one machine	    and decoded on a different machine with different architecture.    	Algorithm genericity:	    The decoding algorithm is independent from string matching and	    windowing algorithms. This allows competition among implementations	    of the encoder while keeping the same decoder.    	Decoding efficiency:	    Except for secondary encoder issues, the decoding algorithm runs	    in time proportional to the size of the target file and uses space	    proportional to the maximal window size.  Vcdiff differs from more	    conventional compressors in that it uses only byte-aligned	    data, thus avoiding bit-level operations, which improves	    decoding speed at the slight cost of compression efficiency.    The Vcdiff data format and the algorithms for decoding data shall be    described next.  Since Vcdiff treats compression as a special case of    differencing, we shall use the term "delta file" to indicate the    compressed output for both cases.2. CONVENTIONS    The basic data unit is a byte.  For portability, Vcdiff shall limit    a byte to its lower eight bits even on machines with larger bytes.    The bits in a byte are ordered from right to left so that the least    significant bit (LSB) has value 1, and the most significant bit (MSB),    has value 128.    For purposes of exposition in this document, we adopt the convention    that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers    never appear in the encoded format itself.    Vcdiff encodes unsigned integer values using a portable variable-sized    format (originally introduced in the Sfio library [7]). This encoding    treats an integer as a number in base 128. Then, each digit in this    representation is encoded in the lower seven bits of a byte. Except for    the least significant byte, other bytes have their most significant bit    turned on to indicate that there are still more digits in the encoding.    The two key properties of this integer encoding that are beneficial    to a data compression format are:	a. The encoding is portable among systems using 8-bit bytes, and        b. Small values are encoded compactly.    For example, consider the value 123456789 which can be represented with    four 7-bit digits whose values are 58, 111, 26, 21 in order from most    to least significant. Below is the 8-bit byte encoding of these digits.    Note that the MSBs of 58, 111 and 26 are on.                 +-------------------------------------------+                 | 10111010 | 11101111 | 10011010 | 00010101 |                 +-------------------------------------------+                   MSB+58     MSB+111    MSB+26     0+21    Henceforth, the terms "byte" and "integer" will refer to a byte and an    unsigned integer as described.    From time to time, algorithms are exhibited to clarify the descriptions    of parts of the Vcdiff format. On such occasions, the C language will be    used to make precise the algorithms.  The C code shown in this    document is meant for clarification only, and is not part of the    actual specification of the Vcdiff format.    In this specification, the key words "MUST", "MUST NOT",    "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as    described in RFC2119 [12].3.  DELTA INSTRUCTIONS    A large target file is partitioned into non-overlapping sections    called "target windows". These target windows are processed separately    and sequentially based on their order in the target file.    A target window T of length t may be compared against some source data    segment S of length s. By construction, this source data segment S    comes either from the source file, if one is used, or from a part of    the target file earlier than T.  In this way, during decoding, S is    completely known when T is being decoded.    The choices of T, t, S and s are made by some window selection algorithm    which can greatly affect the size of the encoding. However, as seen later,    these choices are encoded so that no knowledge of the window selection    algorithm is needed during decoding.    Assume that S[j] represents the jth byte in S, and T[k] represents    the kth byte in T.  Then, for the delta instructions, we treat the data    windows S and T as substrings of a superstring U formed by concatenating    them like this:        S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]    The "address" of a byte in S or T is referred to by its location in U.    For example, the address of T[k] is s+k.    The instructions to encode and direct the reconstruction of a target    window are called delta instructions. There are three types:	ADD: This instruction has two arguments, a size x and a sequence of	    x bytes to be copied.	COPY: This instruction has two arguments, a size x and an address p	    in the string U. The arguments specify the substring of U that	    must be copied. We shall assert that such a substring must be	    entirely contained in either S or T.	RUN: This instruction has two arguments, a size x and a byte b that	    will be repeated x times.    Below are example source and target windows and the delta instructions    that encode the target window in terms of the source window.        a b c d e f g h i j k l m n o p        a b c d w x y z e f g h e f g h e f g h e f g h z z z z        COPY  4, 0        ADD   4, w x y z        COPY  4, 4        COPY 12, 24	RUN   4, z    Thus, the first letter 'a' in the target window is at location 16    in the superstring. Note that the fourth instruction, "COPY 12, 24",    copies data from T itself since address 24 is position 8 in T.    This instruction also shows that it is fine to overlap the data to be    copied with the data being copied from as long as the latter starts    earlier. This enables efficient encoding of periodic sequences,    i.e., sequences with regularly repeated subsequences. The RUN instruction    is a compact way to encode a sequence repeating the same byte even though    such a sequence can be thought of as a periodic sequence with period 1.    To reconstruct the target window, one simply processes one delta    instruction at a time and copy the data either from the source window    or the being reconstructed target window based on the type of the    instruction and the associated address, if any.4.  DELTA FILE ORGANIZATION    A Vcdiff delta file starts with a Header section followed by a sequence    of Window sections. The Header section includes magic bytes to identify    the file type, and information concerning data processing beyond the    basic encoding format. The Window sections encode the target windows.    Below is the overall organization of a delta file. The indented items    refine the ones immediately above them. An item in square brackets may    or may not be present in the file depending on the information encoded    in the Indicator byte above it.        Header	    Header1                                  - byte	    Header2                                  - byte	    Header3                                  - byte	    Header4                                  - byte	    Hdr_Indicator                            - byte	    [Secondary compressor ID]                - byte[@@@ Why is compressor ID not an integer? ][@@@ If we aren't defining any secondary compressors yet, then it seemsthat defining the [Secondary compressor ID] and the correspondingVCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  Animplementation of this specification won't be able to decode a VCDIFFencoded with this option if it doesn't know about any secondarycompressors.  It seems that you should specify the bits related tosecondary compressors once you have defined the first a secondarycompressor.  I can imagine a secondary-compressor might want to supplyextra information, such as a dictionary of some kind, in which casethis speculative treatment wouldn't go far enough.]	    [Length of code table data]              - integer	    [Code table data]	      	Size of near cache                   - byte	        Size of same cache                   - byte	        Compressed code table data	Window1	    Win_Indicator                            - byte	    [Source segment size]                    - integer	    [Source segment position]                - integer            The delta encoding of the target window	        Length of the delta encoding         - integer	        The delta encoding	            Size of the target window        - integer	            Delta_Indicator                  - byte	            Length of data for ADDs and RUNs - integer	            Length of instructions and sizes - integer	            Length of addresses for COPYs    - integer	            Data section for ADDs and RUNs   - array of bytes	            Instructions and sizes section   - array of bytes	            Addresses section for COPYs      - array of bytes	Window2	...4.1 The Header Section    Each delta file starts with a header section organized as below.    Note the convention that square-brackets enclose optional items.	    Header1                                  - byte = 0xE6	    Header2                                  - byte = 0xD3
draft-korn-vcdiff.txt - 源码说明

本页面展示了「Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。」中的 draft-korn-vcdiff.txt 源码文件，采用文本编程语言编写，共 1,323 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与xdelta相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?