⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 xdelta3.c

📁 Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。
💻 C
📖 第 1 页 / 共 5 页
字号:
/* xdelta 3 - delta compression tools and library * Copyright (C) 2001, 2003, 2004, 2005, 2006, 2007.  Joshua P. MacDonald * *  This program is free software; you can redistribute it and/or modify *  it under the terms of the GNU General Public License as published by *  the Free Software Foundation; either version 2 of the License, or *  (at your option) any later version. * *  This program is distributed in the hope that it will be useful, *  but WITHOUT ANY WARRANTY; without even the implied warranty of *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the *  GNU General Public License for more details. * *  You should have received a copy of the GNU General Public License *  along with this program; if not, write to the Free Software *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA   -------------------------------------------------------------------			       Xdelta 3   The goal of this library is to to implement both the (stand-alone)   data-compression and delta-compression aspects of VCDIFF encoding, and   to support a programming interface that works like Zlib   (http://www.gzip.org/zlib.html). See RFC3284: The VCDIFF Generic   Differencing and Compression Data Format.   VCDIFF is a unified encoding that combines data-compression and   delta-encoding ("differencing").   VCDIFF has a detailed byte-code instruction set with many features.   The instruction format supports an immediate size operand for small   COPYs and ADDs (e.g., under 18 bytes).  There are also instruction   "modes", which are used to compress COPY addresses by using two   address caches.  An instruction mode refers to slots in the NEAR   and SAME caches for recent addresses.  NEAR remembers the   previous 4 (by default) COPY addresses, and SAME catches   frequent re-uses of the same address using a 3-way (by default)   256-entry associative cache of [ADDR mod 256], the encoded byte.   A hit in the NEAR/SAME cache requires 0/1 ADDR bytes.   VCDIFF has a default instruction table, but an alternate   instruction tables may themselves be be delta-compressed and   included in the encoding header.  This allows even more freedom.   There are 9 instruction modes in the default code table, 4 near, 3   same, VCD_SELF (absolute encoding) and VCD_HERE (relative to the   current position).   ----------------------------------------------------------------------  			      Algorithms   Aside from the details of encoding and decoding, there are a bunch   of algorithms needed.   1. STRING-MATCH.  A two-level fingerprinting approach is used.  A   single loop computes the two checksums -- small and large -- at   successive offsets in the TARGET file.  The large checksum is more   accurate and is used to discover SOURCE matches, which are   potentially very long.  The small checksum is used to discover   copies within the TARGET.  Small matching, which is more expensive,   usually dominates the large STRING-MATCH costs in this code - the   more exhaustive the search, the better the results.  Either of the   two string-matching mechanisms may be disabled.   2. INSTRUCTION SELECTION.  The IOPT buffer here represents a queue   used to store overlapping copy instructions.  There are two possible   optimizations that go beyond a greedy search.  Both of these fall   into the category of "non-greedy matching" optimizations.   The first optimization stems from backward SOURCE-COPY matching.   When a new SOURCE-COPY instruction covers a previous instruction in   the target completely, it is erased from the queue.  Randal Burns   originally analyzed these algorithms and did a lot of related work   (\cite the 1.5-pass algorithm).   The second optimization comes by the encoding of common very-small   COPY and ADD instructions, for which there are special DOUBLE-code   instructions, which code two instructions in a single byte.   The cost of bad instruction-selection overhead is relatively high   for data-compression, relative to delta-compression, so this second   optimization is fairly important.  With "lazy" matching (the name   used in Zlib for a similar optimization), the string-match   algorithm searches after a match for potential overlapping copy   instructions.  In Xdelta and by default, VCDIFF, the minimum match   size is 4 bytes, whereas Zlib searches with a 3-byte minimum.  This   feature, combined with double instructions, provides a nice   challenge.  Search in this file for "black magic", a heuristic.   3. STREAM ALIGNMENT.  Stream alignment is needed to compress large   inputs in constant space.  See xd3_srcwin_move_point().   4. WINDOW SELECTION.  When the IOPT buffer flushes, in the first call   to xd3_iopt_finish_encoding containing any kind of copy instruction,   the parameters of the source window must be decided: the offset into   the source and the length of the window.  Since the IOPT buffer is   finite, the program may be forced to fix these values before knowing   the best offset/length.   5. SECONDARY COMPRESSION.  VCDIFF supports a secondary encoding to   be applied to the individual sections of the data format, which are   ADDRess, INSTruction, and DATA.  Several secondary compressor   variations are implemented here, although none is standardized yet.   One is an adaptive huffman algorithm -- the FGK algorithm (Faller,   Gallager, and Knuth, 1985).  This compressor is extremely slow.   The other is a simple static Huffman routine, which is the base   case of a semi-adaptive scheme published by D.J. Wheeler and first   widely used in bzip2 (by Julian Seward).  This is a very   interesting algorithm, originally published in nearly cryptic form   by D.J. Wheeler. !!!NOTE!!! Because these are not standardized,   secondary compression remains off by default.   ftp://ftp.cl.cam.ac.uk/users/djw3/bred3.{c,ps}   --------------------------------------------------------------------			    Other Features   1. USER CONVENIENCE   For user convenience, it is essential to recognize Gzip-compressed   files and automatically Gzip-decompress them prior to   delta-compression (or else no delta-compression will be achieved   unless the user manually decompresses the inputs).  The compressed   represention competes with Xdelta, and this must be hidden from the   command-line user interface.  The Xdelta-1.x encoding was simple, not   compressed itself, so Xdelta-1.x uses Zlib internally to compress the   representation.   This implementation supports external compression, which implements   the necessary fork() and pipe() mechanics.  There is a tricky step   involved to support automatic detection of a compressed input in a   non-seekable input.  First you read a bit of the input to detect   magic headers.  When a compressed format is recognized, exec() the   external compression program and create a second child process to   copy the original input stream. [Footnote: There is a difficulty   related to using Gzip externally. It is not possible to decompress   and recompress a Gzip file transparently.  If FILE.GZ had a   cryptographic signature, then, after: (1) Gzip-decompression, (2)   Xdelta-encoding, (3) Gzip-compression the signature could be   broken.  The only way to solve this problem is to guess at Gzip's   compression level or control it by other means.  I recommend that   specific implementations of any compression scheme store   information needed to exactly re-compress the input, that way   external compression is transparent - however, this won't happen   here until it has stabilized.]   2. APPLICATION-HEADER   This feature was introduced in RFC3284.  It allows any application   to include a header within the VCDIFF file format.  This allows   general inter-application data exchange with support for   application-specific extensions to communicate metadata.   3. VCDIFF CHECKSUM   An optional checksum value is included with each window, which can   be used to validate the final result.  This verifies the correct source   file was used for decompression as well as the obvious advantage:   checking the implementation (and underlying) correctness.   4. LIGHT WEIGHT   The code makes efforts to avoid copying data more than necessary.   The code delays many initialization tasks until the first use, it   optimizes for identical (perfectly matching) inputs.  It does not   compute any checksums until the first lookup misses.  Memory usage   is reduced.  String-matching is templatized (by slightly gross use   of CPP) to hard-code alternative compile-time defaults.  The code   has few outside dependencies.   ----------------------------------------------------------------------		The default rfc3284 instruction table:		    (see RFC for the explanation)           TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX   --------------------------------------------------------------------       1.  Run         0        0     Noop       0        0        0       2.  Add    0, [1,17]     0     Noop       0        0      [1,18]       3.  Copy   0, [4,18]     0     Noop       0        0     [19,34]       4.  Copy   0, [4,18]     1     Noop       0        0     [35,50]       5.  Copy   0, [4,18]     2     Noop       0        0     [51,66]       6.  Copy   0, [4,18]     3     Noop       0        0     [67,82]       7.  Copy   0, [4,18]     4     Noop       0        0     [83,98]       8.  Copy   0, [4,18]     5     Noop       0        0     [99,114]       9.  Copy   0, [4,18]     6     Noop       0        0    [115,130]      10.  Copy   0, [4,18]     7     Noop       0        0    [131,146]      11.  Copy   0, [4,18]     8     Noop       0        0    [147,162]      12.  Add       [1,4]      0     Copy     [4,6]      0    [163,174]      13.  Add       [1,4]      0     Copy     [4,6]      1    [175,186]      14.  Add       [1,4]      0     Copy     [4,6]      2    [187,198]      15.  Add       [1,4]      0     Copy     [4,6]      3    [199,210]      16.  Add       [1,4]      0     Copy     [4,6]      4    [211,222]      17.  Add       [1,4]      0     Copy     [4,6]      5    [223,234]      18.  Add       [1,4]      0     Copy       4        6    [235,238]      19.  Add       [1,4]      0     Copy       4        7    [239,242]      20.  Add       [1,4]      0     Copy       4        8    [243,246]      21.  Copy        4      [0,8]   Add        1        0    [247,255]   --------------------------------------------------------------------		     Reading the source: Overview   This file includes itself in several passes to macro-expand certain   sections with variable forms.  Just read ahead, there's only a   little confusion.  I know this sounds ugly, but hard-coding some of   the string-matching parameters results in a 10-15% increase in   string-match performance.  The only time this hurts is when you have   unbalanced #if/endifs.   A single compilation unit tames the Makefile.  In short, this is to   allow the above-described hack without an explodingMakefile.  The   single compilation unit includes the core library features,   configurable string-match templates, optional main() command-line   tool, misc optional features, and a regression test.  Features are   controled with CPP #defines, see Makefile.am.   The initial __XDELTA3_C_HEADER_PASS__ starts first, the _INLINE_ and   _TEMPLATE_ sections follow.  Easy stuff first, hard stuff last.   Optional features include:     xdelta3-main.h     The command-line interface, external compression                        support, POSIX-specific, info & VCDIFF-debug tools.     xdelta3-second.h   The common secondary compression routines.     xdelta3-decoder.h  All decoding routines.     xdelta3-djw.h      The semi-adaptive huffman secondary encoder.     xdelta3-fgk.h      The adaptive huffman secondary encoder.     xdelta3-test.h     The unit test covers major algorithms,                        encoding and decoding.  There are single-bit                        error decoding tests.  There are 32/64-bit file size                        boundary tests.  There are command-line tests.                        There are compression tests.  There are external                        compression tests.  There are string-matching tests.			There should be more tests...   Additional headers include:     xdelta3.h          The public header file.     xdelta3-cfgs.h     The default settings for default, built-in                        encoders.  These are hard-coded at                        compile-time.  There is also a single                        soft-coded string matcher for experimenting                        with arbitrary values.     xdelta3-list.h     A cyclic list template   Misc little debug utilities:     badcopy.c          Randomly modifies an input file based on two                        parameters: (1) the probability that a byte in                        the file is replaced with a pseudo-random value,                        and (2) the mean change size.  Changes are                        generated using an expoential distribution                        which approximates the expected error_prob			distribution.   --------------------------------------------------------------------   This file itself is unusually large.  I hope to defend this layout   with lots of comments.  Everything in this file is related to   encoding and decoding.  I like it all together - the template stuff   is just a hack. */#ifndef __XDELTA3_C_HEADER_PASS__#define __XDELTA3_C_HEADER_PASS__#include <errno.h>#include <string.h>#include "xdelta3.h"/*********************************************************************** STATIC CONFIGURATION ***********************************************************************/#ifndef XD3_MAIN                  /* the main application */#define XD3_MAIN 0#endif#ifndef VCDIFF_TOOLS#define VCDIFF_TOOLS XD3_MAIN#endif#ifndef SECONDARY_FGK             /* one from the algorithm preservation department: */#define SECONDARY_FGK 0           /* adaptive Huffman routines */#endif#ifndef SECONDARY_DJW             /* semi-adaptive/static Huffman for the eventual */#define SECONDARY_DJW 0           /* standardization, off by default until such time. */#endif#ifndef GENERIC_ENCODE_TABLES    /* These three are the RFC-spec'd app-specific */#define GENERIC_ENCODE_TABLES 0  /* code features.  This is tested but not recommended */#endif  			 /* unless there's a real application. */#ifndef GENERIC_ENCODE_TABLES_COMPUTE#define GENERIC_ENCODE_TABLES_COMPUTE 0#endif#ifndef GENERIC_ENCODE_TABLES_COMPUTE_PRINT#define GENERIC_ENCODE_TABLES_COMPUTE_PRINT 0#endif#if XD3_ENCODER#define IF_ENCODER(x) x#else#define IF_ENCODER(x)#endif/***********************************************************************/typedef enum {  /* header indicator bits */  VCD_SECONDARY  = (1 << 0),  /* uses secondary compressor */  VCD_CODETABLE  = (1 << 1),  /* supplies code table data */  VCD_APPHEADER  = (1 << 2),  /* supplies application data */  VCD_INVHDR     = ~7U,  /* window indicator bits */  VCD_SOURCE     = (1 << 0),  /* copy window in source file */  VCD_TARGET     = (1 << 1),  /* copy window in target file */  VCD_ADLER32    = (1 << 2),  /* has adler32 checksum */  VCD_INVWIN     = ~7U,  VCD_SRCORTGT   = VCD_SOURCE | VCD_TARGET,  /* delta indicator bits */  VCD_DATACOMP   = (1 << 0),  VCD_INSTCOMP   = (1 << 1),  VCD_ADDRCOMP   = (1 << 2),  VCD_INVDEL     = ~0x7U,} xd3_indicator;typedef enum {  VCD_DJW_ID    = 1,  VCD_FGK_ID    = 16, /* Note: these are not standard IANA-allocated IDs! */} xd3_secondary_ids;typedef enum {  SEC_NOFLAGS     = 0,  /* Note: SEC_COUNT_FREQS Not implemented (to eliminate 1st Huffman pass) */  SEC_COUNT_FREQS = (1 << 0),} xd3_secondary_flags;typedef enum {  DATA_SECTION, /* These indicate which section to the secondary                 * compressor. */  INST_SECTION, /* The header section is not compressed, therefore not                 * listed here. */  ADDR_SECTION,} xd3_section_type;typedef enum{  XD3_NOOP = 0,  XD3_ADD  = 1,  XD3_RUN  = 2,  XD3_CPY  = 3, /* XD3_CPY rtypes are represented as (XD3_CPY +                 * copy-mode value) */} xd3_rtype;/***********************************************************************/#include "xdelta3-list.h"XD3_MAKELIST(xd3_rlist, xd3_rinst, link);/***********************************************************************/#define SECONDARY_MIN_SAVINGS 2  /* Secondary compression has to save				    at least this many bytes. */#define SECONDARY_MIN_INPUT   10 /* Secondary compression needs at				    least this many bytes. */#define VCDIFF_MAGIC1  0xd6  /* 1st file byte */#define VCDIFF_MAGIC2  0xc3  /* 2nd file byte */#define VCDIFF_MAGIC3  0xc4  /* 3rd file byte */#define VCDIFF_VERSION 0x00  /* 4th file byte */#define VCD_SELF       0     /* 1st address mode */#define VCD_HERE       1     /* 2nd address mode */#define CODE_TABLE_STRING_SIZE (6 * 256) /* Should fit a code table string. */#define CODE_TABLE_VCDIFF_SIZE (6 * 256) /* Should fit a compressed code					  * table string */#define SECONDARY_ANY (SECONDARY_DJW || SECONDARY_FGK)#define ALPHABET_SIZE      256  /* Used in test code--size of the secondary				 * compressor alphabet. */#define HASH_PERMUTE       1    /* The input is permuted by random nums */#define ADLER_LARGE_CKSUM  1    /* Adler checksum vs. RK checksum */#define HASH_CKOFFSET      1U   /* Table entries distinguish "no-entry" from				 * offset 0 using this offset. */

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -