📄 rfc3284.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页






Network Working Group                                            D. Korn
Request for Comments: 3284                                     AT&T Labs
Category: Standards Track                                   J. MacDonald
                                                             UC Berkeley
                                                                J. Mogul
                                                 Hewlett-Packard Company
                                                                   K. Vo
                                                               AT&T Labs
                                                               June 2002


      The VCDIFF Generic Differencing and Compression Data Format

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This memo describes VCDIFF, a general, efficient and portable data
   format suitable for encoding compressed and/or differencing data so
   that they can be easily transported among computers.





















Korn, et. al.               Standards Track                     [Page 1]

RFC 3284                         VCDIFF                        June 2002


Table of Contents

    1.  Executive Summary ...........................................  2
    2.  Conventions .................................................  4
    3.  Delta Instructions ..........................................  5
    4.  Delta File Organization .....................................  6
    5.  Delta Instruction Encoding .................................. 12
    6.  Decoding a Target Window .................................... 20
    7.  Application-Defined Code Tables ............................. 21
    8.  Performance ................................................. 22
    9.  Further Issues .............................................. 24
   10.  Summary ..................................................... 25
   11.  Acknowledgements ............................................ 25
   12.  Security Considerations ..................................... 25
   13.  Source Code Availability .................................... 25
   14.  Intellectual Property Rights ................................ 26
   15.  IANA Considerations ......................................... 26
   16.  References .................................................. 26
   17.  Authors' Addresses .......................................... 28
   18.  Full Copyright Statement .................................... 29

1.  Executive Summary

   Compression and differencing techniques can greatly improve storage
   and transmission of files and file versions.  Since files are often
   transported across machines with distinct architectures and
   performance characteristics, such data should be encoded in a form
   that is portable and can be decoded with little or no knowledge of
   the encoders.  This document describes Vcdiff, a compact portable
   encoding format designed for these purposes.

   Data differencing is the process of computing a compact and
   invertible encoding of a "target file" given a "source file".  Data
   compression is similar, but without the use of source data.  The UNIX
   utilities diff, compress, and gzip are well-known examples of data
   differencing and compression tools.  For data differencing, the
   computed encoding is called a "delta file", and for data compression,
   it is called a "compressed file".  Delta and compressed files are
   good for storage and transmission as they are often smaller than the
   originals.

   Data differencing and data compression are traditionally treated as
   distinct types of data processing.  However, as shown in the Vdelta
   technique by Korn and Vo [1], compression can be thought of as a
   special case of differencing in which the source data is empty.  The
   basic idea is to unify the string parsing scheme used in the Lempel-
   Ziv'77 (LZ'77) style compressors [2] and the block-move technique of
   Tichy [3].  Loosely speaking, this works as follows:



Korn, et. al.               Standards Track                     [Page 2]

RFC 3284                         VCDIFF                        June 2002


      a. Concatenate source and target data.
      b. Parse the data from left to right as in LZ'77 but make sure
         that a parsed segment starts the target data.
      c. Start to output when reaching target data.

   Parsing is based on string matching algorithms, such as suffix trees
   [4] or hashing with different time and space performance
   characteristics.  Vdelta uses a fast string matching algorithm that
   requires less memory than other techniques [5,6].  However, even with
   this algorithm, the memory requirement can still be prohibitive for
   large files.  A common way to deal with memory limitation is to
   partition an input file into chunks called "windows" and process them
   separately.  Here, except for unpublished work by Vo, little has been
   done on designing effective windowing schemes.  Current techniques,
   including Vdelta, simply use source and target windows with
   corresponding addresses across source and target files.

   String matching and windowing algorithms have great influence on the
   compression rate of delta and compressed files.  However, it is
   desirable to have a portable encoding format that is independent of
   such algorithms.  This enables the construction of client-server
   applications in which a server may serve clients with unknown
   computing characteristics.  Unfortunately, all current differencing
   and compressing tools, including Vdelta, fall short in this respect.
   Their storage formats are closely intertwined with the implemented
   string matching and/or windowing algorithms.

   The encoding format Vcdiff proposed here addresses the above issues.
   Vcdiff achieves the characteristics below:

      Output compactness:
         The basic encoding format compactly represents compressed or
         delta files.  Applications can further extend the basic
         encoding format with "secondary encoders" to achieve more
         compression.

      Data portability:
         The basic encoding format is free from machine byte order and
         word size issues.  This allows data to be encoded on one
         machine and decoded on a different machine with different
         architecture.

      Algorithm genericity:
         The decoding algorithm is independent from string matching and
         windowing algorithms.  This allows competition among
         implementations of the encoder while keeping the same decoder.





Korn, et. al.               Standards Track                     [Page 3]

RFC 3284                         VCDIFF                        June 2002


      Decoding efficiency:
         Except for secondary encoder issues, the decoding algorithm
         runs in time proportionate to the size of the target file and
         uses space proportionate to the maximal window size.  Vcdiff
         differs from more conventional compressors in that it uses only
         byte-aligned data, thus avoiding bit-level operations, which
         improves decoding speed at the slight cost of compression
         efficiency.

   The combined differencing and compression method is called "delta
   compression" [14].  As this way of data processing treats compression
   as a special case of differencing, we shall use the term "delta file"
   to indicate the compressed output for both cases.

2. Conventions

   The basic data unit is a byte.  For portability, Vcdiff shall limit a
   byte to its lower eight bits even on machines with larger bytes.  The
   bits in a byte are ordered from right to left so that the least
   significant bit (LSB) has value 1, and the most significant bit
   (MSB), has value 128.

   For purposes of exposition in this document, we adopt the convention
   that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers
   never appear in the encoded format itself.

   Vcdiff encodes unsigned integer values using a portable, variable-
   sized format (originally introduced in the Sfio library [7]).  This
   encoding treats an integer as a number in base 128.  Then, each digit
   in this representation is encoded in the lower seven bits of a byte.
   Except for the least significant byte, other bytes have their most
   significant bit turned on to indicate that there are still more
   digits in the encoding.  The two key properties of this integer
   encoding that are beneficial to a data compression format are:

      a. The encoding is portable among systems using 8-bit bytes, and
      b. Small values are encoded compactly.

   For example, consider the value 123456789, which can be represented
   with four 7-bit digits whose values are 58, 111, 26, 21 in order from
   most to least significant.  Below is the 8-bit byte encoding of these
   digits.  Note that the MSBs of 58, 111 and 26 are on.

              +-------------------------------------------+
              | 10111010 | 11101111 | 10011010 | 00010101 |
              +-------------------------------------------+
                MSB+58     MSB+111    MSB+26     0+21




Korn, et. al.               Standards Track                     [Page 4]

RFC 3284                         VCDIFF                        June 2002


   Henceforth, the terms "byte" and "integer" will refer to a byte and
   an unsigned integer as described.

   Algorithms in the C language are occasionally exhibited to clarify
   the descriptions.  Such C code is meant for clarification only, and
   is not part of the actual specification of the Vcdiff format.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14, RFC 2119 [12].

3.  Delta Instructions

   A large target file is partitioned into non-overlapping sections
   called "target windows".  These target windows are processed
   separately and sequentially based on their order in the target file.

   A target window T, of length t, may be compared against some source
   data segment S, of length s.  By construction, this source data
   segment S comes either from the source file, if one is used, or from
   a part of the target file earlier than T.  In this way, during
   decoding, S is completely known when T is being decoded.

   The choices of T, t, S and s are made by some window selection
   algorithm, which can greatly affect the size of the encoding.
   However, as seen later, these choices are encoded so that no
   knowledge of the window selection algorithm is needed during
   decoding.

   Assume that S[j] represents the jth byte in S, and T[k] represents
   the kth byte in T.  Then, for the delta instructions, we treat the
   data windows S and T as substrings of a superstring U, formed by
   concatenating them like this:

         S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]

   The "address" of a byte in S or T is referred to by its location in
   U.  For example, the address of T[k] is s+k.

   The instructions to encode and direct the reconstruction of a target
   window are called delta instructions.  There are three types:

      ADD:  This instruction has two arguments, a size x and a sequence
            of x bytes to be copied.
      COPY: This instruction has two arguments, a size x and an address
            p in the string U.  The arguments specify the substring of U
            that must be copied.  We shall assert that such a substring
            must be entirely contained in either S or T.



Korn, et. al.               Standards Track                     [Page 5]

RFC 3284                         VCDIFF                        June 2002


      RUN:  This instruction has two arguments, a size x and a byte b,
            that will be repeated x times.

   Below are example source and target windows and the delta
   instructions that encode the target window in terms of the source
   window.

         a b c d e f g h i j k l m n o p
         a b c d w x y z e f g h e f g h e f g h e f g h z z z z

         COPY  4, 0
         ADD   4, w x y z
         COPY  4, 4
         COPY 12, 24
         RUN   4, z

   Thus, the first letter 'a' in the target window is at location 16 in
   the superstring.  Note that the fourth instruction, "COPY 12, 24",
   copies data from T itself since address 24 is position 8 in T.  This
   instruction also shows that it is fine to overlap the data to be
   copied with the data being copied from, as long as the latter starts
   earlier.  This enables efficient encoding of periodic sequences,
   i.e., sequences with regularly repeated subsequences.  The RUN
   instruction is a compact way to encode a sequence repeating the same
   byte even though such a sequence can be thought of as a periodic
   sequence with period 1.

   To reconstruct the target window, one simply processes one delta
   instruction at a time and copies the data, either from the source
   window or the target window being reconstructed, based on the type of
   the instruction and the associated address, if any.

4.  Delta File Organization

   A Vcdiff delta file starts with a Header section followed by a
   sequence of Window sections.  The Header section includes magic bytes
12 3 4 5 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -