📄 rfc3284.txt
字号:
Korn, et. al. Standards Track [Page 23]
RFC 3284 VCDIFF June 2002
The last three rows in the column gcc-2.95.2 show that when two file
versions are very similar, differencing can give dramatically good
compression rates. Vcdiff-d and Vcdiff-dc use the same simple window
selection method of aligning by file offsets, but Vcdiff-dc also does
compression so its output is slightly smaller. Vcdiff-dcw uses a
content-based algorithm to search for source data that likely will
match a given target window. Although it does a good job, the
algorithm does not always find the best matches, which in this case,
are given by the simple algorithm of Vcdiff-d. As a result, the
output size for Vcdiff-dcw is slightly larger.
The situation is reversed in the gcc-2.95.3 column. Here, the files
and their contents were sufficiently rearranged or changed between
the making of the gcc-2.95.3.tar archive and the gcc-2.95.2 archive
so that the simple method of aligning windows by file offsets no
longer works. As a result, Vcdiff-d and Vcdiff-dc do not perform
well. By allowing compression, along with differencing, Vcdiff-dc
manages to beat Vcdiff-c, which does compression only. The content-
based window matching algorithm in Vcdiff-dcw is effective in
matching the right source and target windows so that Vcdiff-dcw is
the overall winner.
9. Further Issues
This document does not address a few issues:
Secondary compressors:
As discussed in Section 4.3, certain sections in the delta
encoding of a window may be further compressed by a secondary
compressor. In our experience, the basic Vcdiff format is
adequate for most purposes so that secondary compressors are
seldom needed. In particular, for normal use of data
differencing, where the files to be compared have long stretches
of matches, much of the gain in compression rate is already
achieved by normal string matching. Thus, the use of secondary
compressors is seldom needed in this case. However, for
applications beyond differencing of such nearly identical files,
secondary compressors may be needed to achieve maximal compressed
results.
Therefore, we recommend leaving the Vcdiff data format defined as
in this document so that the use of secondary compressors can be
implemented when they become needed in the future. The formats of
the compressed data via such compressors or any compressors that
may be defined in the future are left open to their
implementations. These could include Huffman encoding, arithmetic
encoding, and splay tree encoding [8,9].
Korn, et. al. Standards Track [Page 24]
RFC 3284 VCDIFF June 2002
Large file system vs. small file system:
As discussed in Section 4, a target window in a large file may be
compared against some source window in another file or in the same
file (from some earlier part). In that case, the file offset of
the source window is specified as a variable-sized integer in the
delta encoding. There is a possibility that the encoding was
computed on a system supporting much larger files than in a system
where the data may be decoded (e.g., 64-bit file systems vs. 32-
bit file systems). In that case, some target data may not be
recoverable. This problem could afflict any compression format,
and ought to be resolved with a generic negotiation mechanism in
the appropriate protocol(s).
10. Summary
We have described Vcdiff, a general and portable encoding format for
compression and differencing. The format is good in that it allows
implementing a decoder without knowledge of the encoders. Further,
ignoring the use of secondary compressors not defined within the
format, the decoding algorithms run in linear time and requires
working space proportional to window size.
11. Acknowledgements
Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur
Van Hoff who provided much encouragement to publicize Vcdiff. In
particular, Jeff helped in clarifying the description of the data
format presented here.
12. Security Considerations
Vcdiff only provides a format to encode compressed and differenced
data. It does not address any issues concerning how such data are,
in fact, stored in a given file system or the run-time memory of a
computer system. Therefore, we do not anticipate any security issues
with respect to Vcdiff.
13. Source Code Availability
Vcdiff is implemented as a data transforming method in Phong Vo's
Vcodex library. AT&T Corp. has made the source code for Vcodex
available for anyone to use to transmit data via HTTP/1.1 Delta
Encoding [10,11]. The source code and according license is
accessible at the below URL:
http://www.research.att.com/sw/tools
Korn, et. al. Standards Track [Page 25]
RFC 3284 VCDIFF June 2002
14. Intellectual Property Rights
The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information consult the online list of claimed
rights, at <http://www.ietf.org/ipr.html>.
The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP 11. Copies of
claims of rights made available for publication and any assurances of
licenses to be made available, or the result of an attempt made to
obtain a general license or permission for the use of such
proprietary rights by implementors or users of this specification can
be obtained from the IETF Secretariat.
15. IANA Considerations
The Internet Assigned Numbers Authority (IANA) administers the number
space for Secondary Compressor ID values. Values and their meaning
must be documented in an RFC or other peer-reviewed, permanent, and
readily available reference, in sufficient detail so that
interoperability between independent implementations is possible.
Subject to these constraints, name assignments are First Come, First
Served - see RFC 2434 [13]. Legal ID values are in the range 1..255.
This document does not define any values in this number space.
16. References
[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
Practical Reusable Unix Software, Editor B. Krishnamurthy, John
Wiley & Sons, Inc., 1995.
[2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
Compression, IEEE Trans. on Information Theory, 23(3):337-343,
1977.
[3] W. Tichy, The String-to-String Correction Problem with Block
Moves, ACM Transactions on Computer Systems, 2(4):309-321,
November 1984.
Korn, et. al. Standards Track [Page 26]
RFC 3284 VCDIFF June 2002
[4] E.M. McCreight, A Space-Economical Suffix Tree Construction
Algorithm, Journal of the ACM, 23:262-272, 1976.
[5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta
Algorithms, IEEE Software Configuration and Maintenance
Workshop, 1996.
[6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical
Analysis, ACM Trans. on Software Engineering and Methodology,
7:192-214, 1998.
[7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the
Summer '91 Usenix Conference, 1991.
[8] D. W. Jones, Application of Splay Trees to Data Compression,
CACM, 31(8):996:1007.
[9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-
434-1, M&T Books, New York, NY, 1995.
[10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
Potential benefits of delta encoding and data compression for
HTTP, SIGCOMM '97, Cannes, France, 1997.
[11] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland,
Y. and A. Van Hoff, "Delta Encoding in HTTP", RFC 3229, January
2002.
[12] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[13] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 2434, October 1998.
[14] D.G. Korn and K.P. Vo, Engineering a Differencing and
Compression Data Format, Submitted to Usenix'2002, 2001.
Korn, et. al. Standards Track [Page 27]
RFC 3284 VCDIFF June 2002
17. Authors' Addresses
Kiem-Phong Vo (main contact)
AT&T Labs, Room D223
180 Park Avenue
Florham Park, NJ 07932
Phone: 1 973 360 8630
EMail: kpv@research.att.com
David G. Korn
AT&T Labs, Room D237
180 Park Avenue
Florham Park, NJ 07932
Phone: 1 973 360 8602
EMail: dgk@research.att.com
Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, California, 94304, U.S.A.
Phone: 1 650 857 2206 (email preferred)
EMail: JeffMogul@acm.org
Joshua P. MacDonald
Computer Science Division
University of California, Berkeley
345 Soda Hall
Berkeley, CA 94720
EMail: jmacd@cs.berkeley.edu
Korn, et. al. Standards Track [Page 28]
RFC 3284 VCDIFF June 2002
18. Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Interne
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -