appnote1.txt

来自「PKZIP 压缩文件的格式结构」· 文本代码 · 共 1,064 行 · 第 1/3 页
TXT
1,064 行
	  (var.)        Size1           Attribute #1 data
	  .
	  .
	  .
	  TagN          Short           VMS attribute tage value #N
	  SizeN         Short           Size of attribute #N, in bytes
	  (var.)        SizeN           Attribute #N data

	  Rules:

	  1. There will be one or more of attributes present, which will
	     each be preceded by the above TagX & SizeX values.  These
	     values are identical to the ATR$C_XXXX and ATR$S_XXXX constants
	     which are defined in ATR.H under VMS C.  Neither of these values
	     will ever be zero.

	  2. No word alignment or padding is performed.

	  3. A well-behaved PKZIP/VMS program should never produce more than
	     one sub-block with the same TagX value.  Also, there will never
	     be more than one "extra" block of type 0x000c in a particular
	     directory record.

          - FWKCS MD5 Extra Field:
 
          The FWKCS Contents_Signature System, used in
          automatically identifying files independent of filename,
          optionally adds and uses an extra field to support the
          rapid creation of an enhanced contents_signature:
 
              Header ID = 0x4b46
              Data Size = 0x0013
              Preface   = 'M','D','5'
              followed by 16 bytes containing the uncompressed
                  file's 128_bit MD5 hash(1), low byte first.
 
          When FWKCS revises a zipfile central directory to add
          this extra field for a file, it also replaces the
          central directory entry for that file's uncompressed
          filelength with a measured value.
 
          FWKCS provides an option to strip this extra field, if
          present, from a zipfile central directory. In adding
          this extra field, FWKCS preserves Zipfile Authenticity
          Verification; if stripping this extra field, FWKCS
          preserves all versions of AV through PKZIP version 2.04g.

          FWKCS, and FWKCS Contents_Signature System, are
          trademarks of Frederick W. Kantor.
 
          (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer
              Science and RSA Data Security, Inc., April 1992.
              ll.76-77: "The MD5 algorithm is being placed in the
              public domain for review and possible adoption as a
              standard."

      file comment: (Variable)

	  The comment for this file.

      number of this disk: (2 bytes)

	  The number of this disk, which contains central
	  directory end record.

      number of the disk with the start of the central directory: (2 bytes)

	  The number of the disk on which the central
	  directory starts.

      total number of entries in the central dir on this disk: (2 bytes)

	  The number of central directory entries on this disk.

      total number of entries in the central dir: (2 bytes)

	  The total number of files in the zipfile.


      size of the central directory: (4 bytes)

	  The size (in bytes) of the entire central directory.

      offset of start of central directory with respect to
      the starting disk number:  (4 bytes)

	  Offset of the start of the central direcory on the
	  disk on which the central directory starts.

      zipfile comment length: (2 bytes)

	  The length of the comment for this zipfile.

      zipfile comment: (Variable)

	  The comment for this zipfile.


  D.  General notes:

      1)  All fields unless otherwise noted are unsigned and stored
	  in Intel low-byte:high-byte, low-word:high-word order.

      2)  String fields are not null terminated, since the
	  length is given explicitly.

      3)  Local headers should not span disk boundries.  Also, even
	  though the central directory can span disk boundries, no
	  single record in the central directory should be split
	  across disks.

      4)  The entries in the central directory may not necessarily
	  be in the same order that files appear in the zipfile.

UnShrinking - Method 1
----------------------

Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
with partial clearing.  The initial code size is 9 bits, and
the maximum code size is 13 bits.  Shrinking differs from
conventional Dynamic Ziv-Lempel-Welch implementations in several
respects:

1)  The code size is controlled by the compressor, and is not
    automatically increased when codes larger than the current
    code size are created (but not necessarily used).  When
    the decompressor encounters the code sequence 256
    (decimal) followed by 1, it should increase the code size
    read from the input stream to the next bit size.  No
    blocking of the codes is performed, so the next code at
    the increased size should be read from the input stream
    immediately after where the previous code at the smaller
    bit size was read.  Again, the decompressor should not
    increase the code size used until the sequence 256,1 is
    encountered.

2)  When the table becomes full, total clearing is not
    performed.  Rather, when the compresser emits the code
    sequence 256,2 (decimal), the decompressor should clear
    all leaf nodes from the Ziv-Lempel tree, and continue to
    use the current code size.  The nodes that are cleared
    from the Ziv-Lempel tree are then re-used, with the lowest
    code value re-used first, and the highest code value
    re-used last.  The compressor can emit the sequence 256,2
    at any time.



Expanding - Methods 2-5
-----------------------

The Reducing algorithm is actually a combination of two
distinct algorithms.  The first algorithm compresses repeated
byte sequences, and the second algorithm takes the compressed
stream from the first algorithm and applies a probabilistic
compression method.

The probabilistic compression stores an array of 'follower
sets' S(j), for j=0 to 255, corresponding to each possible
ASCII character.  Each set contains between 0 and 32
characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
The sets are stored at the beginning of the data area for a
Reduced file, in reverse order, with S(255) first, and S(0)
last.

The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
where N(j) is the size of set S(j).  N(j) can be 0, in which
case the follower set for S(j) is empty.  Each N(j) value is
encoded in 6 bits, followed by N(j) eight bit character values
corresponding to S(j)[0] to S(j)[N(j)-1] respectively.  If
N(j) is 0, then no values for S(j) are stored, and the value
for N(j-1) immediately follows.

Immediately after the follower sets, is the compressed data
stream.  The compressed data stream can be interpreted for the
probabilistic decompression as follows:


let Last-Character <- 0.
loop until done
    if the follower set S(Last-Character) is empty then
	read 8 bits from the input stream, and copy this
	value to the output stream.
    otherwise if the follower set S(Last-Character) is non-empty then
	read 1 bit from the input stream.
	if this bit is not zero then
	    read 8 bits from the input stream, and copy this
	    value to the output stream.
	otherwise if this bit is zero then
	    read B(N(Last-Character)) bits from the input
	    stream, and assign this value to I.
	    Copy the value of S(Last-Character)[I] to the
	    output stream.

    assign the last value placed on the output stream to
    Last-Character.
end loop


B(N(j)) is defined as the minimal number of bits required to
encode the value N(j)-1.


The decompressed stream from above can then be expanded to
re-create the original file as follows:


let State <- 0.

loop until done
    read 8 bits from the input stream into C.
    case State of
	0:  if C is not equal to DLE (144 decimal) then
		copy C to the output stream.
	    otherwise if C is equal to DLE then
		let State <- 1.

	1:  if C is non-zero then
		let V <- C.
		let Len <- L(V)
		let State <- F(Len).
	    otherwise if C is zero then
		copy the value 144 (decimal) to the output stream.
		let State <- 0

	2:  let Len <- Len + C
	    let State <- 3.

	3:  move backwards D(V,C) bytes in the output stream
	    (if this position is before the start of the output
	    stream, then assume that all the data before the
	    start of the output stream is filled with zeros).
	    copy Len+3 bytes from this position to the output stream.
	    let State <- 0.
    end case
end loop


The functions F,L, and D are dependent on the 'compression
factor', 1 through 4, and are defined as follows:

For compression factor 1:
    L(X) equals the lower 7 bits of X.
    F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
For compression factor 2:
    L(X) equals the lower 6 bits of X.
    F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
For compression factor 3:
    L(X) equals the lower 5 bits of X.
    F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
For compression factor 4:
    L(X) equals the lower 4 bits of X.
    F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.


Imploding - Method 6
--------------------

The Imploding algorithm is actually a combination of two distinct
algorithms.  The first algorithm compresses repeated byte
sequences using a sliding dictionary.  The second algorithm is
used to compress the encoding of the sliding dictionary ouput,
using multiple Shannon-Fano trees.

The Imploding algorithm can use a 4K or 8K sliding dictionary
size. The dictionary size used can be determined by bit 1 in the
general purpose flag word; a 0 bit indicates a 4K dictionary
while a 1 bit indicates an 8K dictionary.

The Shannon-Fano trees are stored at the start of the compressed
file. The number of trees stored is defined by bit 2 in the
general purpose flag word; a 0 bit indicates two trees stored, a
1 bit indicates three trees are stored.  If 3 trees are stored,
the first Shannon-Fano tree represents the encoding of the
Literal characters, the second tree represents the encoding of
the Length information, the third represents the encoding of the
Distance information.  When 2 Shannon-Fano trees are stored, the
Length tree is stored first, followed by the Distance tree.

The Literal Shannon-Fano tree, if present is used to represent
the entire ASCII character set, and contains 256 values.  This
tree is used to compress any data not compressed by the sliding
dictionary algorithm.  When this tree is present, the Minimum
Match Length for the sliding dictionary is 3.  If this tree is
not present, the Minimum Match Length is 2.

The Length Shannon-Fano tree is used to compress the Length part
of the (length,distance) pairs from the sliding dictionary
output.  The Length tree contains 64 values, ranging from the
Minimum Match Length, to 63 plus the Minimum Match Length.

The Distance Shannon-Fano tree is used to compress the Distance
part of the (length,distance) pairs from the sliding dictionary
output. The Distance tree contains 64 values, ranging from 0 to
63, representing the upper 6 bits of the distance value.  The
distance values themselves will be between 0 and the sliding
dictionary size, either 4K or 8K.

The Shannon-Fano trees themselves are stored in a compressed
format. The first byte of the tree data represents the number of
bytes of data representing the (compressed) Shannon-Fano tree
minus 1.  The remaining bytes represent the Shannon-Fano tree
data encoded as:

    High 4 bits: Number of values at this bit length + 1. (1 - 16)
    Low  4 bits: Bit Length needed to represent value + 1. (1 - 16)

The Shannon-Fano codes can be constructed from the bit lengths
using the following algorithm:

1)  Sort the Bit Lengths in ascending order, while retaining the
    order of the original lengths stored in the file.

2)  Generate the Shannon-Fano trees:

    Code <- 0
    CodeIncrement <- 0
    LastBitLength <- 0
    i <- number of Shannon-Fano codes - 1   (either 255 or 63)

    loop while i >= 0
	Code = Code + CodeIncrement
	if BitLength(i) <> LastBitLength then
	    LastBitLength=BitLength(i)
	    CodeIncrement = 1 shifted left (16 - LastBitLength)
	ShannonCode(i) = Code
	i <- i - 1
    end loop


3)  Reverse the order of all the bits in the above ShannonCode()
    vector, so that the most significant bit becomes the least
    significant bit.  For example, the value 0x1234 (hex) would
    become 0x2C48 (hex).

4)  Restore the order of Shannon-Fano codes as originally stored
    within the file.

Example:

    This example will show the encoding of a Shannon-Fano tree
    of size 8.  Notice that the actual Shannon-Fano trees used
    for Imploding are either 64 or 256 entries in size.

Example:   0x02, 0x42, 0x01, 0x13

    The first byte indicates 3 values in this table.  Decoding the
    bytes:
	    0x42 = 5 codes of 3 bits long
	    0x01 = 1 code  of 2 bits long
	    0x13 = 2 codes of 4 bits long
appnote1.txt - 源码说明

本页面展示了「PKZIP 压缩文件的格式结构」中的 appnote1.txt 源码文件，采用文本编程语言编写，共 1,064 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与PKZIP相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?