📄 draft-korn-vcdiff.txt
字号:
iv. Add in order the 256 bytes representing the sizes of the second instructions in the instruction pairs. v. Add in order the 256 bytes representing the modes of the first instructions in the instruction pairs. vi. Add in order the 256 bytes representing the modes of the second instructions in the instruction pairs. b. Similarly, convert the default instruction code table into a string "dflt". c. Treat the string "code" as a target window and "dflt" as the corresponding source data and apply an encoding algorithm to compute the delta encoding of "code" in terms of "dflt". This computation MUST use the default code table for encoding the delta instructions. The decoder can then reverse the above steps to decode the compressed table data using the method of Section 6, employing the default code table, to generate the new code table. Note that the decoder does not need to know anything about the details of the encoding algorithm used in step (c). The decoder is still able to decode the new code table because the Vcdiff format is independent from the choice of encoding algorithm, and because the encoder in step (c) uses the known, default code table.8. PERFORMANCE The encoding format is compact. For compression only, using the LZ-77 string parsing strategy and without any secondary compressors, the typical compression rate is better than Unix compress and close to gzip. For differencing, the data format is better than all known methods in terms of its stated goal, which is primarily decoding speed and encoding efficiency. We compare the performance of compress, gzip and Vcdiff using the archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an SGI-MIPS3, 400MHZ. Gzip was used at its default compression level. Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13). As string and window matching typically dominates the computation during compression, the Vcdiff compression times were directly due to the algorithms used in the Vcodex/Vcdiff software. However, the decompression times should be generic and representative of any good implementation of the Vcdiff data format. Timing was done by running each program three times and taking the average of the total cpu+system times. Below are the different Vcdiff runs: Vcdiff: vcdiff is used as compressor only. Vcdiff-d: vcdiff is used as a differencer only. That is, it only compares target data against source data. Since the files involved are large, they are broken into windows. In this case, each target window starting at some file offset in the target file is compared against a source window with the same file offset (in the source file). The source window is also slightly larger than the target window to increase matching opportunities. The -d option also gives a hint to the string matching algorithm of Vcdiff that the two files are very similar with long stretches of matches. The algorithm takes advantage of this to minimize its processing of source data and save time. Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare target data against target data as applicable. Thus, vcdiff both computes differences and compresses data. The windowing algorithm is the same as above. However, the above hint is recinded in this case. Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm uses a content-based heuristic to select source data segments that are more likely to match with a given target window. Thus, the source data segment selected for a target window often will not be aligned with the file offsets of this target window. gcc-2.95.1 gcc-2.95.2 compression decompression raw size 55746560 55797760 compress - 19939390 13.85s 7.09s gzip - 12973443 42.99s 5.35s Vcdiff - 15358786 20.04s 4.65s Vcdiff-d - 100971 10.93s 1.92s Vcdiff-dc - 97246 20.03s 1.84s Vcdiff-dcs - 256445 44.81s 1.84s TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1 TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the sizes of the compressed results. As a pure compressor, the compression rate for Vcdiff is worse than gzip and better than compress. The last three rows shows that when two file versions are very similar, differencing can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use the same simple window selection method but Vcdiff-dc also does compression so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on data content to search for source data that likely will match a given target window. Although it does a good job, the heuristic did not always find the best matches which are given by the simple algorithm of Vcdiff-d. As a result, the output size is slightly larger. Note also that there is a large cost in computing matching windows this way. Finally, the compression times of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude that the compression feature causes the additional time in Vcdiff-dc relative to Vcdiff-d. However, this is not the case. The hint given to the Vcdiff string matching algorithm that the two files are likely to have very long stretches of matches helps the algorithm to minimize processing of the "source data", thus saving half the time. However, as we shall see below when this hint is wrong, the result is even longer time. gcc-2.95.2 gcc-2.95.3 compression decompression raw size 55797760 55787520 compress - 19939453 13.54s 7.00s gzip - 12998097 42.63s 5.62s Vcdiff - 15371737 20.09s 4.74s Vcdiff-d - 26383849 71.41s 6.41s Vcdiff-dc - 14461203 42.48s 4.82s Vcdiff-dcs - 1248543 61.18s 1.99s TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2 TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and the sizes of the compressed results. In this case, the tar file of gcc-2.95.3 is rearranged in a way that makes the straightforward method of matching file offsets for source and target windows fail. As a result, Vcdiff-d performs rather dismally both in time and output size. The large time for Vcdiff-d is directly due to fact that the string matching algorithm has to work much harder to find matches when the hint that two files have long matching stretches fails to hold. On the other hand, Vcdiff-dc does both differencing and compression resulting in good output size. Finally, the window searching heuristic used in Vcdiff-dcs is effective in finding the right matching source windows for target windows resulting a small output size. This shows why the data format needs to have a way to specify matching windows to gain performance. Finally, we note that the decoding times are always good regardless of how the string matching or window searching algorithms perform.9. FURTHER ISSUES This document does not address a few issues: Secondary compressors: As discussed in Section 4.3, certain sections in the delta encoding of a window may be further compressed by a secondary compressor. In our experience, the basic Vcdiff format is adequate for most purposes so that secondary compressors are seldom needed. In particular, for normal use of data differencing where the files to be compared have long stretches of matches, much of the gain in compression rate is already achieved by normal string matching. Thus, the use of secondary compressors is seldom needed in this case. However, for applications beyond differencing of such nearly identical files, secondary compressors may be needed to achieve maximal compressed results. Therefore, we recommend to leave the Vcdiff data format defined as in this document so that the use of secondary compressors can be implemented when they become needed in the future. The formats of the compressed data via such compressors or any compressors that may be defined in the future are left open to their implementations. These could include Huffman encoding, arithmetic encoding, and splay tree encoding [8,9]. Large file system vs. small file system: As discussed in Section 4, a target window in a large file may be compared against some source window in another file or in the same file (from some earlier part). In that case, the file offset of the source window is specified as a variable-sized integer in the delta encoding. There is a possibility that the encoding was computed on a system supporting much larger files than in a system where the data may be decoded (e.g., 64-bit file systems vs. 32-bit file systems). In that case, some target data may not be recoverable. This problem could afflict any compression format, and ought to be resolved with a generic negotiation mechanism in the appropriate protocol(s).10. SUMMARY We have described Vcdiff, a general and portable encoding format for compression and differencing. The format is good in that it allows implementing a decoder without knowledge of the encoders. Further, ignoring the use of secondary compressors not defined within the format, the decoding algorithms runs in linear time and requires working space proportional to window sizes.11. ACKNOWLEDGEMENTS Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. In particular, Jeff helped clarifying the description of the data format presented here.12. SECURITY CONSIDERATIONS Vcdiff only provides a format to encode compressed and differenced data. It does not address any issues concerning how such data are, in fact, stored in a given file system or the run-time memory of a computer system. Therefore, we do not anticipate any security issues with respect to Vcdiff.13. SOURCE CODE AVAILABILITY Vcdiff is implemented as a data transforming method in Phong Vo's Vcodex library. AT&T Corp. has made the source code for Vcodex available for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. The source code and according license is accessible at the below URL: http://www.research.att.com/sw/tools14. INTELLECTUAL PROPERTY RIGHTS The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights, at <http://www.ietf.org/ipr.html>. The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.15. IANA CONSIDERATIONS The Internet Assigned Numbers Authority (IANA) administers the number space for Secondary Compressor ID values. Values and their meaning must be documented in an RFC or other peer-reviewed, permanent, and readily available reference, in sufficient detail so that interoperability between independent implementations is possible. Subject to these constraints, name assignments are First Come, First Served - see RFC2434 [13]. Legal ID values are in the range 1..255. This document does not define any values in this number space.16. REFERENCES [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995. [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977. [3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984. [4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976. [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996. [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998. [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the Summer '91 Usenix Conference, 1991. [8] D. W. Jones, Application of Splay Trees to Data Compression, CACM, 31(8):996:1007. [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1, M&T Books, New York, NY, 1995. [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, Potential benefits of delta encoding and data compression for HTTP, SIGCOMM '97, Cannes, France, 1997. [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, and A. Van Hoff, Delta Encoding in HTTP, IETF, draft-mogul-http-delta-10, 2001. [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, RFC 2119, March 1997. [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA Considerations Section in RFCs, RFC2434, October 1998.17. AUTHOR'S ADDRESS Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932 Email: kpv@research.att.com Phone: 1 973 360 8630 David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932 Email: dgk@research.att.com Phone: 1 973 360 8602 Jeffrey C. Mogul Western Research Laboratory Compaq Computer Corporation 250 University Avenue Palo Alto, California, 94305, U.S.A. Email: JeffMogul@acm.org Phone: 1 650 617 3304 (email preferred) Joshua P. MacDonald Computer Science Division University of California, Berkeley 345 Soda Hall Berkeley, CA 94720 Email: jmacd@cs.berkeley.edu
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -