📄 rfc3284.txt
字号:
caches, so that decoder's caches remain synchronized with the
encoder's caches.
a. A "near" cache is an array with "s_near" slots, each containing an
address used for encoding addresses nearby to previously encoded
addresses (in the positive direction only). The near cache also
maintains a "next_slot" index to the near cache. New entries to
the near cache are always inserted in the next_slot index, which
maintains a circular buffer of the s_near most recent addresses.
b. A "same" cache is an array with "s_same", with a multiple of 256
slots, each containing an address. The same cache maintains a
hash table of recent addresses used for repeated encoding of the
exact same address.
By default, the parameters s_near and s_same are respectively set to
4 and 3. An encoder MAY modify these values, but then it MUST encode
the new values in the encoding itself, as discussed in Section 7, so
that the decoder can properly set up its own caches.
At the start of processing a target window, an implementation
(encoder or decoder) initializes all of the slots in both caches to
zero. The next_slot pointer of the near cache is set to point to
slot zero.
Korn, et. al. Standards Track [Page 12]
RFC 3284 VCDIFF June 2002
Each time a COPY instruction is processed by the encoder or decoder,
the implementation's caches are updated as follows, where "addr" is
the address in the COPY instruction.
a. The slot in the near cache referenced by the next_slot index is
set to addr. The next_slot index is then incremented modulo
s_near.
b. The slot in the same cache whose index is addr%(s_same*256) is set
to addr. [We use the C notations of % for modulo and * for
multiplication.]
5.2 Example code for maintaining caches
To make clear the above description, below are examples of cache data
structures and algorithms to initialize and update them:
typedef struct _cache_s
{
int* near; /* array of size s_near */
int s_near;
int next_slot; /* the circular index for near */
int* same; /* array of size s_same*256 */
int s_same;
} Cache_t;
cache_init(Cache_t* ka)
{
int i;
ka->next_slot = 0;
for(i = 0; i < ka->s_near; ++i)
ka->near[i] = 0;
for(i = 0; i < ka->s_same*256; ++i)
ka->same[i] = 0;
}
cache_update(Cache_t* ka, int addr)
{
if(ka->s_near > 0)
{ ka->near[ka->next_slot] = addr;
ka->next_slot = (ka->next_slot + 1) % ka->s_near;
}
if(ka->s_same > 0)
ka->same[addr % (ka->s_same*256)] = addr;
}
Korn, et. al. Standards Track [Page 13]
RFC 3284 VCDIFF June 2002
5.3 Encoding of COPY instruction addresses
The address of a COPY instruction is encoded using different modes,
depending on the type of cached address used, if any.
Let "addr" be the address of a COPY instruction to be decoded and
"here" be the current location in the target data (i.e., the start of
the data about to be encoded or decoded). Let near[j] be the jth
element in the near cache, and same[k] be the kth element in the same
cache. Below are the possible address modes:
VCD_SELF: This mode has value 0. The address was encoded by
itself as an integer.
VCD_HERE: This mode has value 1. The address was encoded as the
integer value "here - addr".
Near modes: The "near modes" are in the range [2,s_near+1]. Let m
be the mode of the address encoding. The address was encoded
as the integer value "addr - near[m-2]".
Same modes: The "same modes" are in the range
[s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
The address was encoded as a single byte b such that "addr ==
same[(m - (s_near+2))*256 + b]".
5.4 Example code for encoding and decoding of COPY instruction addresses
We show example algorithms below to demonstrate the use of address
modes more clearly. The encoder has the freedom to choose address
modes, the sample addr_encode() algorithm merely shows one way of
picking the address mode. The decoding algorithm addr_decode() will
uniquely decode addresses, regardless of the encoder's algorithm
choice.
Note that the address caches are updated immediately after an address
is encoded or decoded. In this way, the decoder is always
synchronized with the encoder.
Korn, et. al. Standards Track [Page 14]
RFC 3284 VCDIFF June 2002
int addr_encode(Cache_t* ka, int addr, int here, int* mode)
{
int i, d, bestd, bestm;
/* Attempt to find the address mode that yields the
* smallest integer value for "d", the encoded address
* value, thereby minimizing the encoded size of the
* address. */
bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */
if((d = here-addr) < bestd)
{ bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
for(i = 0; i < ka->s_near; ++i)
if((d = addr - ka->near[i]) >= 0 && d < bestd)
{ bestd = d; bestm = i+2; }
if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
{ bestd = d%256; bestm = ka->s_near + 2 + d/256; }
cache_update(ka,addr);
*mode = bestm; /* this returns the address encoding mode */
return bestd; /* this returns the encoded address */
}
Note that the addr_encode() algorithm chooses the best address mode
using a local optimization, but that may not lead to the best
encoding efficiency because different modes lead to different
instruction encodings, as described below.
The functions addrint() and addrbyte() used in addr_decode(), obtain
from the "Addresses section for COPYs" (Section 4.3), an integer or a
byte, respectively. These utilities will not be described here. We
simply recall that an integer is represented as a compact variable-
sized string of bytes, as described in Section 2 (i.e., base 128).
Korn, et. al. Standards Track [Page 15]
RFC 3284 VCDIFF June 2002
int addr_decode(Cache_t* ka, int here, int mode)
{ int addr, m;
if(mode == VCD_SELF)
addr = addrint();
else if(mode == VCD_HERE)
addr = here - addrint();
else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
addr = ka->near[m] + addrint();
else /* same cache */
{ m = mode - (2 + ka->s_near);
addr = ka->same[m*256 + addrbyte()];
}
cache_update(ka, addr);
return addr;
}
5.4 Instruction Codes
Matches are often short in lengths and separated by small amounts of
unmatched data. That is, the lengths of COPY and ADD instructions
are often small. This is particularly true of binary data such as
executable files or structured data, such as HTML or XML. In such
cases, compression can be improved by combining the encoding of the
sizes and the instruction types, as well as combining the encoding of
adjacent delta instructions with sufficiently small data sizes.
Effective choices of when to perform such combinations depend on many
factors including the data being processed and the string matching
algorithm in use. For example, if many COPY instructions have the
same data sizes, it may be worthwhile to encode these instructions
more compactly than others.
The Vcdiff data format is designed so that a decoder does not need to
be aware of the choices made in encoding algorithms. This is
achieved with the notion of an "instruction code table", containing
256 entries. Each entry defines, either a single delta instruction
or a pair of instructions that have been combined. Note that the
code table itself only exists in main memory, not in the delta file
(unless using an application-defined code table, described in Section
7). The encoded data simply includes the index of each instruction
and, since there are only 256 indices, each index can be represented
as a single byte.
Korn, et. al. Standards Track [Page 16]
RFC 3284 VCDIFF June 2002
Each instruction code entry contains six fields, each of which is a
single byte with an unsigned value:
+-----------------------------------------------+
| inst1 | size1 | mode1 | inst2 | size2 | mode2 |
+-----------------------------------------------+
Each triple (inst,size,mode) defines a delta instruction. The
meanings of these fields are as follows:
inst: An "inst" field can have one of the four values: NOOP (0),
ADD (1), RUN (2) or COPY (3) to indicate the instruction
types. NOOP means that no instruction is specified. In
this case, both the corresponding size and mode fields will
be zero.
size: A "size" field is zero or positive. A value zero means that
the size associated with the instruction is encoded
separately as an integer in the "Instructions and sizes
section" (Section 6). A positive value for "size" defines
the actual data size. Note that since the size is
restricted to a byte, the maximum value for any instruction
with size implicitly defined in the code table is 255.
mode: A "mode" field is significant only when the associated delta
instruction is a COPY. It defines the mode used to encode
the associated addresses. For other instructions, this is
always zero.
5.6 The Code Table
Following the discussions on address modes and instruction code
tables, we define a "Code Table" to have the data below:
s_near: the size of the near cache,
s_same: the size of the same cache,
i_code: the 256-entry instruction code table.
Vcdiff itself defines a "default code table" in which s_near is 4 and
s_same is 3. Thus, there are 9 address modes for a COPY instruction.
The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
are for addresses coded against the near cache. And modes 6, 7 and
8, are for addresses coded against the same cache.
Korn, et. al. Standards Track [Page 17]
RFC 3284 VCDIFF June 2002
TYPE SIZE MODE TYPE SIZE MODE INDEX
---------------------------------------------------------------
1. RUN 0 0 NOOP 0 0 0
2. ADD 0, [1,17] 0 NOOP 0 0 [1,18]
3. COPY 0, [4,18] 0 NOOP 0 0 [19,34]
4. COPY 0, [4,18] 1 NOOP 0 0 [35,50]
5. COPY 0, [4,18] 2 NOOP 0 0 [51,66]
6. COPY 0, [4,18] 3 NOOP 0 0 [67,82]
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -