📄 draft-korn-vcdiff.txt

📁 Linux下一个可以比较二进制文件的工具xdelta3.0u的源码。
💻 TXT
📖 第 1 页 / 共 4 页
字号:
上一页 1 2 34
       iv. Add in order the 256 bytes representing the sizes of the second	   instructions in the instruction pairs.        v. Add in order the 256 bytes representing the modes of the first	   instructions in the instruction pairs.       vi. Add in order the 256 bytes representing the modes of the second	   instructions in the instruction pairs.    b.  Similarly, convert the default instruction code table into	a string "dflt".    c.  Treat the string "code" as a target window and "dflt" as the	corresponding source data and apply an encoding algorithm to	compute the delta encoding of "code" in terms of "dflt".	This computation MUST use the default code table for encoding	the delta instructions.    The decoder can then reverse the above steps to decode the compressed    table data using the method of Section 6, employing the default code    table, to generate the new code table.  Note that the decoder does not    need to know anything about the details of the encoding algorithm used    in step (c). The decoder is still able to decode the new code table    because the Vcdiff format is independent from the choice of encoding    algorithm, and because the encoder in step (c) uses the known, default    code table.8. PERFORMANCE    The encoding format is compact. For compression only, using the LZ-77    string parsing strategy and without any secondary compressors, the typical    compression rate is better than Unix compress and close to gzip.  For    differencing, the data format is better than all known methods in    terms of its stated goal, which is primarily decoding speed and    encoding efficiency.    We compare the performance of compress, gzip and Vcdiff using the    archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,    gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an    SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.    Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).    As string and window matching typically dominates the computation during    compression, the Vcdiff compression times were directly due to the    algorithms used in the Vcodex/Vcdiff software. However, the decompression    times should be generic and representative of any good implementation    of the Vcdiff data format. Timing was done by running each program    three times and taking the average of the total cpu+system times.    Below are the different Vcdiff runs:	Vcdiff: vcdiff is used as compressor only.	Vcdiff-d: vcdiff is used as a differencer only. That is, it only		compares target data against source data.  Since the files		involved are large, they are broken into windows. In this		case, each target window starting at some file offset in		the target file is compared against a source window with		the same file offset (in the source file). The source		window is also slightly larger than the target window		to increase matching opportunities. The -d option also gives		a hint to the string matching algorithm of Vcdiff that		the two files are very similar with long stretches of matches.		The algorithm takes advantage of this to minimize its		processing of source data and save time.	Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare		target data against target data as applicable. Thus, vcdiff		both computes differences and compresses data. The windowing		algorithm is the same as above. However, the above hint is		recinded in this case.	Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm		uses a content-based heuristic to select source data segments		that are more likely to match with a given target window.		Thus, the source data segment selected for a target window		often will not be aligned with the file offsets of this		target window.                gcc-2.95.1    gcc-2.95.2    compression   decompression    raw size      55746560      55797760    compress         -          19939390       13.85s	      7.09s    gzip             -          12973443       42.99s         5.35s    Vcdiff           -          15358786       20.04s         4.65s    Vcdiff-d         -            100971       10.93s         1.92s    Vcdiff-dc        -             97246       20.03s         1.84s    Vcdiff-dcs       -            256445       44.81s         1.84s		TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1    TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the    sizes of the compressed results. As a pure compressor, the compression    rate for Vcdiff is worse than gzip and better than compress. The last    three rows shows that when two file versions are very similar, differencing    can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use    the same simple window selection method but Vcdiff-dc also does compression    so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on    data content to search for source data that likely will match a given target    window. Although it does a good job, the heuristic did not always find the    best matches which are given by the simple algorithm of Vcdiff-d.  As a    result, the output size is slightly larger. Note also that there is a large    cost in computing matching windows this way. Finally, the compression times    of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude    that the compression feature causes the additional time in Vcdiff-dc    relative to Vcdiff-d.  However, this is not the case. The hint given to    the Vcdiff string matching algorithm that the two files are likely to    have very long stretches of matches helps the algorithm to minimize    processing of the "source data", thus saving half the time. However, as we    shall see below when this hint is wrong, the result is even longer time.                gcc-2.95.2    gcc-2.95.3    compression   decompression    raw size      55797760      55787520    compress         -          19939453       13.54s	      7.00s    gzip             -          12998097       42.63s         5.62s    Vcdiff           -          15371737       20.09s         4.74s    Vcdiff-d         -          26383849       71.41s         6.41s    Vcdiff-dc        -          14461203       42.48s         4.82s    Vcdiff-dcs       -           1248543       61.18s         1.99s		TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2    TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and    the sizes of the compressed results. In this case, the tar file of    gcc-2.95.3 is rearranged in a way that makes the straightforward method    of matching file offsets for source and target windows fail. As a    result, Vcdiff-d performs rather dismally both in time and output size.    The large time for Vcdiff-d is directly due to fact that the string    matching algorithm has to work much harder to find matches when the hint    that two files have long matching stretches fails to hold. On the other    hand, Vcdiff-dc does both differencing and compression resulting in good    output size. Finally, the window searching heuristic used in Vcdiff-dcs is    effective in finding the right matching source windows for target windows    resulting a small output size. This shows why the data format needs to    have a way to specify matching windows to gain performance. Finally,    we note that the decoding times are always good regardless of how    the string matching or window searching algorithms perform.9. FURTHER ISSUES    This document does not address a few issues:    Secondary compressors:        As discussed in Section 4.3, certain sections in the delta encoding	of a window may be further compressed by a secondary compressor.	In our experience, the basic Vcdiff format is adequate for most	purposes so that secondary compressors are seldom needed. In        particular, for normal use of data differencing where the files to	be compared have long stretches of matches, much of the gain in	compression rate is already achieved by normal string matching.	Thus, the use of secondary compressors is seldom needed in this case.	However, for applications beyond differencing of such nearly identical	files, secondary compressors may be needed to achieve maximal	compressed results.        Therefore, we recommend to leave the Vcdiff data format defined	as in this document so that the use of secondary compressors 	can be implemented when they become needed in the future.        The formats of the compressed data via such compressors or any	compressors that may be defined in the future are left open to	their implementations.  These could include Huffman encoding,	arithmetic encoding, and splay tree encoding [8,9].    Large file system vs. small file system:	As discussed in Section 4, a target window in a large file may be	compared against some source window in another file or in the same	file (from some earlier part). In that case, the file offset of the	source window is specified as a variable-sized integer in the delta	encoding. There is a possibility that the encoding was computed on	a system supporting much larger files than in a system where	the data may be decoded (e.g., 64-bit file systems vs. 32-bit file	systems). In that case, some target data may not be recoverable.	This problem could afflict any compression format, and ought	to be resolved with a generic negotiation mechanism in the	appropriate protocol(s).10.  SUMMARY    We have described Vcdiff, a general and portable encoding format for    compression and differencing. The format is good in that it allows    implementing a decoder without knowledge of the encoders. Further,    ignoring the use of secondary compressors not defined within the format,    the decoding algorithms runs in linear time and requires working space    proportional to window sizes.11. ACKNOWLEDGEMENTS    Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff    who provided much encouragement to publicize Vcdiff. In particular, Jeff    helped clarifying the description of the data format presented here.12. SECURITY CONSIDERATIONS    Vcdiff only provides a format to encode compressed and differenced data.    It does not address any issues concerning how such data are, in fact,    stored in a given file system or the run-time memory of a computer system.    Therefore, we do not anticipate any security issues with respect to Vcdiff.13. SOURCE CODE AVAILABILITY    Vcdiff is implemented as a data transforming method in Phong Vo's    Vcodex library. AT&T Corp. has made the source code for Vcodex available    for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].    The source code and according license is accessible at the below URL:          http://www.research.att.com/sw/tools14. INTELLECTUAL PROPERTY RIGHTS   The IETF has been notified of intellectual property rights claimed in   regard to some or all of the specification contained in this   document.  For more information consult the online list of claimed   rights, at <http://www.ietf.org/ipr.html>.   The IETF takes no position regarding the validity or scope of any   intellectual property or other rights that might be claimed to   pertain to the implementation or use of the technology described in   this document or the extent to which any license under such rights   might or might not be available; neither does it represent that it   has made any effort to identify any such rights.  Information on the   IETF's procedures with respect to rights in standards-track and   standards-related documentation can be found in BCP-11.  Copies of   claims of rights made available for publication and any assurances of   licenses to be made available, or the result of an attempt made to   obtain a general license or permission for the use of such   proprietary rights by implementors or users of this specification can   be obtained from the IETF Secretariat.15. IANA CONSIDERATIONS   The Internet Assigned Numbers Authority (IANA) administers the number   space for Secondary Compressor ID values.  Values and their meaning   must be documented in an RFC or other peer-reviewed, permanent, and   readily available reference, in sufficient detail so that   interoperability between independent implementations is possible.   Subject to these constraints, name assignments are First Come, First   Served - see RFC2434 [13].  Legal ID values are in the range 1..255.   This document does not define any values in this number space.16. REFERENCES    [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,        Practical Reusable Unix Software, Editor B. Krishnamurthy,        John Wiley & Sons, Inc., 1995.    [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data        Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.    [3] W. Tichy, The String-to-String Correction Problem with Block Moves,        ACM Transactions on Computer Systems, 2(4):309-321, November 1984.    [4] E.M. McCreight, A Space-Economical Suffix Tree Construction        Algorithm, Journal of the ACM, 23:262-272, 1976.    [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,        IEEE Software Configuration and Maintenance Workshop, 1996.    [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,        ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.    [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,        Proc. of the Summer '91 Usenix Conference, 1991.    [8] D. W. Jones, Application of Splay Trees to Data Compression,        CACM, 31(8):996:1007.    [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,        M&T Books, New York, NY, 1995.   [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,        Potential benefits of delta encoding and data compression for HTTP,        SIGCOMM '97, Cannes, France, 1997.   [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,        Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,        IETF, draft-mogul-http-delta-10, 2001.   [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,        RFC 2119, March 1997.   [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA        Considerations Section in RFCs, RFC2434, October 1998.17. AUTHOR'S ADDRESS    Kiem-Phong Vo (main contact)    AT&T Labs, Room D223    180 Park Avenue    Florham Park, NJ 07932    Email: kpv@research.att.com    Phone: 1 973 360 8630    David G. Korn    AT&T Labs, Room D237    180 Park Avenue    Florham Park, NJ 07932    Email: dgk@research.att.com    Phone: 1 973 360 8602    Jeffrey C. Mogul    Western Research Laboratory    Compaq Computer Corporation    250 University Avenue    Palo Alto, California, 94305, U.S.A.    Email: JeffMogul@acm.org    Phone: 1 650 617 3304 (email preferred)    Joshua P. MacDonald    Computer Science Division    University of California, Berkeley    345 Soda Hall    Berkeley, CA 94720    Email: jmacd@cs.berkeley.edu
上一页 1 2 34
💿 文件大小 215 K
👤 上传用户 shangranc
📂 所属分类 Linux/Unix编程
🏷️ 相关标签

#xdelta #Linux #3.0 #比较
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -