📄 draft-korn-vcdiff-01.txt

📁 linux subdivision ying gai ke yi le ba
💻 TXT
📖 第 1 页 / 共 4 页
字号:
12 3 4 下一页
[ This paper makes reference to the "Sfio" library.  See
  http://www.research.att.com/sw/tools/sfio/ for more.  -kff]


Network Services Research Lab                  David Korn  and  Kiem-Phong Vo
                                                                    AT&T Labs
                                               Submission:     March 09, 2000
                                               Expiration: September 09, 2000


        The VCDIFF Generic Differencing and Compression Data Format
                       <draft-korn-vcdiff-01.txt>


Status of this Memo

    This document is an Internet-Draft and is in full conformance
    with all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time.  It is inappropriate to use Internet-
    Drafts as reference material or to cite them other than as
    "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.


Abstract

    This memo describes a general and efficient data format suitable
    for encoding compressed and/or differencing data so that they can
    be easily transported among computers.

Table of Contents:

    1.  Executive Summary ............................................  1
    2.  Algorithm Conventions ........................................  3
    3.  Delta Instructions ...........................................  3
    4.  Vcdiff Encoding Format .......................................  4
    5.  Instruction Code Tables ...................................... 13
    6.  Compression and Differencing Algorithms ...................... 18
    7.  Summary ...................................................... 23
        ACKNOWLEDGEMENTS ............................................. 23
        REFERENCES ................................................... 23
        AUTHOR'S ADDRESS ............................................. 24



1.  Executive Summary

    Compression and differencing techniques can greatly improve storage
    and transmission of files and file versions.  Since files are often
    transported across machines with distinct architectures and performance
    characteristics, such data should be encoded in a form that is portable
    and can be decoded with little or no knowledge of the encoders.
    This document describes Vcdiff, a new compact portable encoding format
    that is independent of encoding algorithms and also efficient to decode.

    Data differencing is the process of computing a compact and invertible

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


    encoding of a "target file" given a "source file".  Data compression
    is similar but without the use of source data.  The UNIX utilities diff,
    compress, and gzip are well-known examples of data differencing and
    compression tools.  For data differencing, the computed encoding is
    called a "delta file", and, for data compression, it is called
    a "compressed file".  Delta and compressed files are good for storage
    and transmission because they are often smaller than the originals.

    Data differencing and data compression are traditionally treated
    as distinct types of data processing.  However, compression can be
    thought of as a special case of differencing in which the source
    data is empty. This blending of differencing and compression was first
    introduced in the Vdelta technique [1] by Korn and Vo.  The basic idea
    is to unify the string parsing scheme used in the Lempel-Ziv'77 style
    compressors [2], and the block-move technique of Tichy [3].  Loosely
    speaking, this works as follows:

        a. Concatenate source and target data.
        b. Parse the data from left to right just like LZ'77 but
	   make sure that a parsed segment starts target data.
        c. Start to output when reaching target data.

    Parsing is based on string matching algorithms such as suffix trees [4]
    or hashing with different time and space performance characteristics.
    Vdelta uses a new string matching algorithm that performs competitive
    to other techniques [5].  However, even with a memory-efficient and fast
    string matching algorithm, the computing resource requirements can be
    prohibitive for processing large files. The standard way to deal with
    this is to partition input files into "windows" to be processed
    separately. Here, except for some unpublished work by Vo, little has
    been done on designing effective windowing schemes.  Current techniques,
    including Vdelta, simply use windows of the same size with corresponding
    addresses across source and target files.

    String matching and windowing algorithms have large influence on the
    compression rate of delta and compressed files. However, it is desirable
    to have a portable encoding format that is independent of such algorithms.
    This enables construction of client-server applications in which a server
    may serve clients with unknown computing characteristics.  Unfortunately,
    all current differencing and compressing tools, including Vdelta, fall
    short in this resspect. Their storage formats are closely intertwined
    with the implemented algorithms.

    The encoding format Vcdiff proposed here addresses the above issues.
    Vcdiff achieves the below characteristics:

	Output compactness:
            The basic encoding format compactly represents compressed or delta
	    files. In specific cases, applications can further extend it with
	    "secondary encoders" (e.g., a Huffman or arithmetic encoder) to
	    achieve more compression.
	Data portability:
	    The basic encoding format is free from byte order and word size
	    issues for integer representations.
    	Algorithm genericity:
	    The decoding algorithm for the basic encoding format is independent
	    from string matching and windowing algorithms.
    	Decoding efficiency:
	    The decoding algorithm for the basic encoding format runs in time
	    proportional to the size of the target file and uses space

                                   -2-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


	    proportional to the maximal window size.

    The Vcdiff data format and the algorithms for decoding data shall be
    described next.  Since Vcdiff treats compression as a special case of
    differencing, we shall use the term "delta file" to indicate the
    compressed output for both cases.


2. Algorithm Conventions

    Algorithms to encode and decode the Vcdiff data format will be presented
    in the ANSI-C language. To ease the presentation, we shall generally omit
    error checking. The below conventions will be observed:
    
	a. Tokens with all upper cases letters will be C macros.
	b. Variables with capitalized names are global.
        c. Code fragments will be presented with line numbers to be used
	   in the subsequent COMMENTS sections.

    Data will be encoded in byte units.  For portability, control data
    generated by Vcdiff shall be limited to the lower eight bits of a byte
    even on machines with larger bytes. The bits in a byte are named from
    right to left so that the first bit has the lowest value, 1<<0 or 1,
    and the eighth bit has value 1<<7 or 128.

    To facilitate the algorithms, we shall assume a few types and functions
    for I/O on streams.  To be definite, we shall use interfaces similar to
    that provided in the Sfio library. Below are descriptions of some of
    these types and functions. Others can be found in reference [6].

	uchar_t:
	    This is the type "unsigned char".

	int_t:
	    This is the largest integer type on the platform in use.

	Sfio_t:
	    This is the type of a stream.

	Sfio_t* sfstropen(uchar_t* buf, int n):
	    This is not an Sfio function but it can be easily implemented
	    on top of the Sfio primitive sfnew(). sfstropen() creates a stream
	    from a given buffer with a given size. We shall assume that such
	    a stream is both readable and writable. As with Sfio, a write stream
	    will extend the buffer as necessary to accomodate output data.


3.  Delta Instructions

    A target file is partitioned into non-overlapping sections or windows
    to be processed separately. A target window T of length n may be
    compared against some source data segment S of length m.  Such a source
    data segment may come from some earlier part of the target file or
    it may come from the source file, if there is one. It is assumed that
    there is sufficient memory so that both T and S can be processed
    in main memory.

    For string processing, we treat S and T as substrings of a superstring U
    formed by concatenating T and S like this:


                                   -3-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


        s[0]s[1]...s[m-1]t[0]t[1]...t[n-1]

    The index of a byte in S or T is referred to by its location in U.
    For example, T[i] is referred to as U[m+i].

    The instructions to encode and direct the reconstruction of a target
    window are called delta instructions.  There are three types:

	ADD:
            This instruction has two arguments, a size s and a sequence of
	    s bytes to be copied to reconstruct the target window.
	COPY:
            This instruction has two arguments, a size s and an address p
	    in the string U. The arguments specify the substring of U that
	    must be copied. We shall assert that such a substring must be
	    entirely contained in either S or T.
	RUN:
            This instruction has two arguments, a size s and a byte c that
	    will be repeated s times.


    Below are example source and target strings and the delta instructions
    that encode the target string in terms of the source string.

        a b c d e f g h i j k l m n o p
        a b c d w x y z e f g h e f g h e f g h e f g h z z z z

        COPY  4, 0
        ADD   4, w x y z
        COPY  4, 4
        COPY 12, 24
	RUN   4, z

    Note that the third COPY instruction copies data from T itself since
    address 24 is position 8 in T.  In addition, parts of the data to be
    copied by this instruction will be reconstructed during its execution.
    This allows efficient encoding of periodic sequences, i.e., sequences
    with regularly repeated subsequences. The RUN instruction is a compact
    way to encode a sequence repeating the same byte even though such
    a sequence can be thought of as a periodic sequence with period 1.


4.  Vcdiff Encoding Format

    A large target file is partitioned into non-overlapping sections called
    "target windows". In practice, window sizes should be chosen so that
    each window can be completely processed in memory during both encoding
    and decoding.  A target window may be compared against some source data
    segment or none. In the compression case, such a source segment would
    be from some earlier part of the target file while, for differencing,
    it may also be from the source file.




4.1 Delta File Layout

    A Vcdiff delta file is organized into sections as follows:

                4-byte header

                                   -4-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


                Section 1
                Section 2
		...

    	Header:
	    The header of a delta file consists of 4 bytes to identify it as
	    a Vcdiff delta file. The first three bytes are: 0xd6, 0xc3, 0xc4,
	    i.e., the ASCII letters 'V', 'C' and 'D' or-ed with the eighth bit.
	    The fourth byte indicates a version number which is currently 0.

    	Sections:
	    Following the 4-byte header are a number of sections. Each section
	    encodes either an instruction code table (Section 5) or a window.
	    Each section starts with an indicator byte. If the 8th bit of this
	    byte is on, the section encodes an instruction table; otherwise,
	    it encodes a window. The remaining bits of this byte are used to
	    encode other information as necessary.


4.1.1 The Format of a Window

    Each window is organized as below. Note that items bracketed are present
    only when certain corresponding bits in the indicator byte are on.

		Indicator byte
		Length of target window
		Length of instruction dataset
		Length of raw dataset
		[Index of code table]
		[Secondary instruction compressor]
		[Length of compressed instruction dataset]
		[Secondary data compressor]
		[Length of compressed raw dataset]
		[Source segment size]
		[Source segment position]
		Instruction dataset
		Raw dataset

    	Indicator byte:
	    The bits of the indicator byte for a window are as follows:

		0  T  I  D  WT  WS  X  X

	    0:	This is the 0-bit indicating that this section is a window
		of data to be decoded.

	    T:	This bit, if on, indicates that a non-default instruction
		code table should be used for decoding. In this case, the
		item [Index of code table] is a single byte indicating the
		index of this code table.

	    I:	This bit, if on, indicates that the instruction dataset
		has been compressed with a secondary compressor. In this
		case, the item [Secondary instruction compressor] is a single
		byte indicating the secondary compressor while the item
		[Length of compressed instruction dataset] has the length
		of the compressed instruction dataset encoded as a
		variable-sized integer.

	    D:	This bit is similar to the I-bit but it is for the raw

                                   -5-

RFC 1   VCDIFF Generic Differencing and Compression Data Format   March 2000


		dataset.

	    WT:	This bit, if on, indicates that a source data segment was
		used to compare the target window with. Further, this source
		segment is from an earlier part of the target file. In this
		case, the size and position of the source segment are given
		in [Source segment size] and [Source segment position].
		Both of these items are encoded as variable-sized integers.

	    WS: This bit is similar to WT but, if on, indicates that the
		source data segment is from the source file.

	    X:	These bits are currently unused.

    	Length of target window:
	    This is a variable-sized integer indicating the size of the
	    target window.

	Lengths of instruction and raw datasets:
	    The delta instructions ADD and RUN have associated raw data
	    (unmatched bytes for ADD and the repeating byte for RUN).
	    The Vcdiff encoding format maintains two separate datasets,
	    one for the raw data and one for the instructions. The lengths
	    of the "instruction dataset" and the "raw dataset" are stored
	    respectively as two variable-sized integers.

    	Instruction dataset:
	    This is the set of bytes that define the delta instructions.
	    If I was 1, the dataset has been encoded by the indicated
	    secondary encoder.

        Raw dataset:
	    This is the set of raw data accompanying the ADD and RUN
	    instructions.  If D was 1, the dataset has been encoded
	    by the indicated secondary encoder.



4.1.2 Processing a Delta File

    Below is the basic algorithm to process a delta file:

     1. sfgetc(Delta);
     2. sfgetc(Delta);
     3. sfgetc(Delta);
     4. sfgetc(Delta);

     6. for(;;)
     7. {   if((indi = sfgetc(Delta)) < 0)
     8.          break;
     9.     if((indi & (1<<7)) != 0)
    10.          code_define();
    11.     else
    12.     {    n_tar = sfgetu(Delta);
    13.          tar = malloc(n_tar);
    14.          win_inflate(indi, tar, n_tar, NULL, 0);
    15.          sfwrite(Target, tar, n_tar);
    16.          free(tar);
    17.     }
    18. }
12 3 4 下一页
💿 文件大小 10313 K
👤 上传用户 zhangyuntong
📂 所属分类网络
🏷️ 相关标签

#subdivision #linux #ying #gai
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -