📄 xdelta3.c
字号:
/* xdelta 3 - delta compression tools and library * Copyright (C) 2001, 2003, 2004, 2005, 2006, 2007. Joshua P. MacDonald * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA ------------------------------------------------------------------- Xdelta 3 The goal of this library is to to implement both the (stand-alone) data-compression and delta-compression aspects of VCDIFF encoding, and to support a programming interface that works like Zlib (http://www.gzip.org/zlib.html). See RFC3284: The VCDIFF Generic Differencing and Compression Data Format. VCDIFF is a unified encoding that combines data-compression and delta-encoding ("differencing"). VCDIFF has a detailed byte-code instruction set with many features. The instruction format supports an immediate size operand for small COPYs and ADDs (e.g., under 18 bytes). There are also instruction "modes", which are used to compress COPY addresses by using two address caches. An instruction mode refers to slots in the NEAR and SAME caches for recent addresses. NEAR remembers the previous 4 (by default) COPY addresses, and SAME catches frequent re-uses of the same address using a 3-way (by default) 256-entry associative cache of [ADDR mod 256], the encoded byte. A hit in the NEAR/SAME cache requires 0/1 ADDR bytes. VCDIFF has a default instruction table, but an alternate instruction tables may themselves be be delta-compressed and included in the encoding header. This allows even more freedom. There are 9 instruction modes in the default code table, 4 near, 3 same, VCD_SELF (absolute encoding) and VCD_HERE (relative to the current position). ---------------------------------------------------------------------- Algorithms Aside from the details of encoding and decoding, there are a bunch of algorithms needed. 1. STRING-MATCH. A two-level fingerprinting approach is used. A single loop computes the two checksums -- small and large -- at successive offsets in the TARGET file. The large checksum is more accurate and is used to discover SOURCE matches, which are potentially very long. The small checksum is used to discover copies within the TARGET. Small matching, which is more expensive, usually dominates the large STRING-MATCH costs in this code - the more exhaustive the search, the better the results. Either of the two string-matching mechanisms may be disabled. 2. INSTRUCTION SELECTION. The IOPT buffer here represents a queue used to store overlapping copy instructions. There are two possible optimizations that go beyond a greedy search. Both of these fall into the category of "non-greedy matching" optimizations. The first optimization stems from backward SOURCE-COPY matching. When a new SOURCE-COPY instruction covers a previous instruction in the target completely, it is erased from the queue. Randal Burns originally analyzed these algorithms and did a lot of related work (\cite the 1.5-pass algorithm). The second optimization comes by the encoding of common very-small COPY and ADD instructions, for which there are special DOUBLE-code instructions, which code two instructions in a single byte. The cost of bad instruction-selection overhead is relatively high for data-compression, relative to delta-compression, so this second optimization is fairly important. With "lazy" matching (the name used in Zlib for a similar optimization), the string-match algorithm searches after a match for potential overlapping copy instructions. In Xdelta and by default, VCDIFF, the minimum match size is 4 bytes, whereas Zlib searches with a 3-byte minimum. This feature, combined with double instructions, provides a nice challenge. Search in this file for "black magic", a heuristic. 3. STREAM ALIGNMENT. Stream alignment is needed to compress large inputs in constant space. See xd3_srcwin_move_point(). 4. WINDOW SELECTION. When the IOPT buffer flushes, in the first call to xd3_iopt_finish_encoding containing any kind of copy instruction, the parameters of the source window must be decided: the offset into the source and the length of the window. Since the IOPT buffer is finite, the program may be forced to fix these values before knowing the best offset/length. 5. SECONDARY COMPRESSION. VCDIFF supports a secondary encoding to be applied to the individual sections of the data format, which are ADDRess, INSTruction, and DATA. Several secondary compressor variations are implemented here, although none is standardized yet. One is an adaptive huffman algorithm -- the FGK algorithm (Faller, Gallager, and Knuth, 1985). This compressor is extremely slow. The other is a simple static Huffman routine, which is the base case of a semi-adaptive scheme published by D.J. Wheeler and first widely used in bzip2 (by Julian Seward). This is a very interesting algorithm, originally published in nearly cryptic form by D.J. Wheeler. !!!NOTE!!! Because these are not standardized, secondary compression remains off by default. ftp://ftp.cl.cam.ac.uk/users/djw3/bred3.{c,ps} -------------------------------------------------------------------- Other Features 1. USER CONVENIENCE For user convenience, it is essential to recognize Gzip-compressed files and automatically Gzip-decompress them prior to delta-compression (or else no delta-compression will be achieved unless the user manually decompresses the inputs). The compressed represention competes with Xdelta, and this must be hidden from the command-line user interface. The Xdelta-1.x encoding was simple, not compressed itself, so Xdelta-1.x uses Zlib internally to compress the representation. This implementation supports external compression, which implements the necessary fork() and pipe() mechanics. There is a tricky step involved to support automatic detection of a compressed input in a non-seekable input. First you read a bit of the input to detect magic headers. When a compressed format is recognized, exec() the external compression program and create a second child process to copy the original input stream. [Footnote: There is a difficulty related to using Gzip externally. It is not possible to decompress and recompress a Gzip file transparently. If FILE.GZ had a cryptographic signature, then, after: (1) Gzip-decompression, (2) Xdelta-encoding, (3) Gzip-compression the signature could be broken. The only way to solve this problem is to guess at Gzip's compression level or control it by other means. I recommend that specific implementations of any compression scheme store information needed to exactly re-compress the input, that way external compression is transparent - however, this won't happen here until it has stabilized.] 2. APPLICATION-HEADER This feature was introduced in RFC3284. It allows any application to include a header within the VCDIFF file format. This allows general inter-application data exchange with support for application-specific extensions to communicate metadata. 3. VCDIFF CHECKSUM An optional checksum value is included with each window, which can be used to validate the final result. This verifies the correct source file was used for decompression as well as the obvious advantage: checking the implementation (and underlying) correctness. 4. LIGHT WEIGHT The code makes efforts to avoid copying data more than necessary. The code delays many initialization tasks until the first use, it optimizes for identical (perfectly matching) inputs. It does not compute any checksums until the first lookup misses. Memory usage is reduced. String-matching is templatized (by slightly gross use of CPP) to hard-code alternative compile-time defaults. The code has few outside dependencies. ---------------------------------------------------------------------- The default rfc3284 instruction table: (see RFC for the explanation) TYPE SIZE MODE TYPE SIZE MODE INDEX -------------------------------------------------------------------- 1. Run 0 0 Noop 0 0 0 2. Add 0, [1,17] 0 Noop 0 0 [1,18] 3. Copy 0, [4,18] 0 Noop 0 0 [19,34] 4. Copy 0, [4,18] 1 Noop 0 0 [35,50] 5. Copy 0, [4,18] 2 Noop 0 0 [51,66] 6. Copy 0, [4,18] 3 Noop 0 0 [67,82] 7. Copy 0, [4,18] 4 Noop 0 0 [83,98] 8. Copy 0, [4,18] 5 Noop 0 0 [99,114] 9. Copy 0, [4,18] 6 Noop 0 0 [115,130] 10. Copy 0, [4,18] 7 Noop 0 0 [131,146] 11. Copy 0, [4,18] 8 Noop 0 0 [147,162] 12. Add [1,4] 0 Copy [4,6] 0 [163,174] 13. Add [1,4] 0 Copy [4,6] 1 [175,186] 14. Add [1,4] 0 Copy [4,6] 2 [187,198] 15. Add [1,4] 0 Copy [4,6] 3 [199,210] 16. Add [1,4] 0 Copy [4,6] 4 [211,222] 17. Add [1,4] 0 Copy [4,6] 5 [223,234] 18. Add [1,4] 0 Copy 4 6 [235,238] 19. Add [1,4] 0 Copy 4 7 [239,242] 20. Add [1,4] 0 Copy 4 8 [243,246] 21. Copy 4 [0,8] Add 1 0 [247,255] -------------------------------------------------------------------- Reading the source: Overview This file includes itself in several passes to macro-expand certain sections with variable forms. Just read ahead, there's only a little confusion. I know this sounds ugly, but hard-coding some of the string-matching parameters results in a 10-15% increase in string-match performance. The only time this hurts is when you have unbalanced #if/endifs. A single compilation unit tames the Makefile. In short, this is to allow the above-described hack without an explodingMakefile. The single compilation unit includes the core library features, configurable string-match templates, optional main() command-line tool, misc optional features, and a regression test. Features are controled with CPP #defines, see Makefile.am. The initial __XDELTA3_C_HEADER_PASS__ starts first, the _INLINE_ and _TEMPLATE_ sections follow. Easy stuff first, hard stuff last. Optional features include: xdelta3-main.h The command-line interface, external compression support, POSIX-specific, info & VCDIFF-debug tools. xdelta3-second.h The common secondary compression routines. xdelta3-decoder.h All decoding routines. xdelta3-djw.h The semi-adaptive huffman secondary encoder. xdelta3-fgk.h The adaptive huffman secondary encoder. xdelta3-test.h The unit test covers major algorithms, encoding and decoding. There are single-bit error decoding tests. There are 32/64-bit file size boundary tests. There are command-line tests. There are compression tests. There are external compression tests. There are string-matching tests. There should be more tests... Additional headers include: xdelta3.h The public header file. xdelta3-cfgs.h The default settings for default, built-in encoders. These are hard-coded at compile-time. There is also a single soft-coded string matcher for experimenting with arbitrary values. xdelta3-list.h A cyclic list template Misc little debug utilities: badcopy.c Randomly modifies an input file based on two parameters: (1) the probability that a byte in the file is replaced with a pseudo-random value, and (2) the mean change size. Changes are generated using an expoential distribution which approximates the expected error_prob distribution. -------------------------------------------------------------------- This file itself is unusually large. I hope to defend this layout with lots of comments. Everything in this file is related to encoding and decoding. I like it all together - the template stuff is just a hack. */#ifndef __XDELTA3_C_HEADER_PASS__#define __XDELTA3_C_HEADER_PASS__#include <errno.h>#include <string.h>#include "xdelta3.h"/*********************************************************************** STATIC CONFIGURATION ***********************************************************************/#ifndef XD3_MAIN /* the main application */#define XD3_MAIN 0#endif#ifndef VCDIFF_TOOLS#define VCDIFF_TOOLS XD3_MAIN#endif#ifndef SECONDARY_FGK /* one from the algorithm preservation department: */#define SECONDARY_FGK 0 /* adaptive Huffman routines */#endif#ifndef SECONDARY_DJW /* semi-adaptive/static Huffman for the eventual */#define SECONDARY_DJW 0 /* standardization, off by default until such time. */#endif#ifndef GENERIC_ENCODE_TABLES /* These three are the RFC-spec'd app-specific */#define GENERIC_ENCODE_TABLES 0 /* code features. This is tested but not recommended */#endif /* unless there's a real application. */#ifndef GENERIC_ENCODE_TABLES_COMPUTE#define GENERIC_ENCODE_TABLES_COMPUTE 0#endif#ifndef GENERIC_ENCODE_TABLES_COMPUTE_PRINT#define GENERIC_ENCODE_TABLES_COMPUTE_PRINT 0#endif#if XD3_ENCODER#define IF_ENCODER(x) x#else#define IF_ENCODER(x)#endif/***********************************************************************/typedef enum { /* header indicator bits */ VCD_SECONDARY = (1 << 0), /* uses secondary compressor */ VCD_CODETABLE = (1 << 1), /* supplies code table data */ VCD_APPHEADER = (1 << 2), /* supplies application data */ VCD_INVHDR = ~7U, /* window indicator bits */ VCD_SOURCE = (1 << 0), /* copy window in source file */ VCD_TARGET = (1 << 1), /* copy window in target file */ VCD_ADLER32 = (1 << 2), /* has adler32 checksum */ VCD_INVWIN = ~7U, VCD_SRCORTGT = VCD_SOURCE | VCD_TARGET, /* delta indicator bits */ VCD_DATACOMP = (1 << 0), VCD_INSTCOMP = (1 << 1), VCD_ADDRCOMP = (1 << 2), VCD_INVDEL = ~0x7U,} xd3_indicator;typedef enum { VCD_DJW_ID = 1, VCD_FGK_ID = 16, /* Note: these are not standard IANA-allocated IDs! */} xd3_secondary_ids;typedef enum { SEC_NOFLAGS = 0, /* Note: SEC_COUNT_FREQS Not implemented (to eliminate 1st Huffman pass) */ SEC_COUNT_FREQS = (1 << 0),} xd3_secondary_flags;typedef enum { DATA_SECTION, /* These indicate which section to the secondary * compressor. */ INST_SECTION, /* The header section is not compressed, therefore not * listed here. */ ADDR_SECTION,} xd3_section_type;typedef enum{ XD3_NOOP = 0, XD3_ADD = 1, XD3_RUN = 2, XD3_CPY = 3, /* XD3_CPY rtypes are represented as (XD3_CPY + * copy-mode value) */} xd3_rtype;/***********************************************************************/#include "xdelta3-list.h"XD3_MAKELIST(xd3_rlist, xd3_rinst, link);/***********************************************************************/#define SECONDARY_MIN_SAVINGS 2 /* Secondary compression has to save at least this many bytes. */#define SECONDARY_MIN_INPUT 10 /* Secondary compression needs at least this many bytes. */#define VCDIFF_MAGIC1 0xd6 /* 1st file byte */#define VCDIFF_MAGIC2 0xc3 /* 2nd file byte */#define VCDIFF_MAGIC3 0xc4 /* 3rd file byte */#define VCDIFF_VERSION 0x00 /* 4th file byte */#define VCD_SELF 0 /* 1st address mode */#define VCD_HERE 1 /* 2nd address mode */#define CODE_TABLE_STRING_SIZE (6 * 256) /* Should fit a code table string. */#define CODE_TABLE_VCDIFF_SIZE (6 * 256) /* Should fit a compressed code * table string */#define SECONDARY_ANY (SECONDARY_DJW || SECONDARY_FGK)#define ALPHABET_SIZE 256 /* Used in test code--size of the secondary * compressor alphabet. */#define HASH_PERMUTE 1 /* The input is permuted by random nums */#define ADLER_LARGE_CKSUM 1 /* Adler checksum vs. RK checksum */#define HASH_CKOFFSET 1U /* Table entries distinguish "no-entry" from * offset 0 using this offset. */
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -