<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>LZRW4: ZIV AND LEMPEL MEET MARKOV</TITLE>
<META content="text/html; charset=gb2312" http-equiv=Content-Type>
<META content="MSHTML 5.00.2920.0" name=GENERATOR></HEAD>
<BODY>
<H1 align=center>LZRW4: ZIV AND LEMPEL MEET MARKOV</H1>Author : Ross
Williams<BR>Date : 23-Jul-1991.
<P>This is a public domain document and may be copied freely so long as it is
not modified (though text formatting transformations are permitted).
<P>Note: This paper has not been polished to publication-standard. It has been
prepared under considerable time pressure and is released mainly to make the
ideas it contains public so as to avoid others re-inventing and patenting them.
<P>
<H2>Abstract</H2>
<P>In 1977 and 1978, Ziv and Lempel[Ziv77][Ziv78] introduced the LZ class of
adaptive dictionary data compression techniques that have become so very popular
today. In competition with these techniques are the Markov algorithms that give
better compression but run much more slowly. This paper compares the two classes
of algorithm and shows how algorithms that combine techniques from both classes
have the potential to provide higher performance (i.e. a better
speed/compression trade-off) than we have seen so far. An algorithm called LZRW4 that combines Markov
and LZ77 ideas is sketched.
<P>
<H2>Comparison of Ziv/Lempel and Markov Algorithms</H2>
<P>To the casual observer, the Ziv/Lempel (LZ) data compression algorithms may
appear to be worlds away from the lesser known Markov algorithms. LZ algorithms
construct trees of strings and transmit strings as codewords. Markov algorithms
construct large numbers of contexts and fiddle with arithmetic ranges and
probabilities. The two kinds of algorithms seem to operate with totally
different objects.
<P>However, deep down, both classes of algorithm are using the same
probabilistic engine. Deep down, both classes must be representing more common
events with short codes and less common events with longer codes. This is the
law of data compression. But, without some careful thought, it's hard to see how
the two classes relate.
<P>In the early eighties, Glen Langdon put in some careful thought and a lot of
insight and came up with two papers, [Langdon83] and [Langdon84], which are
pivotal to my understanding of the field of data compression. I cannot recommend
these papers highly enough to anyone who wishes to gain an understanding of what
REALLY makes LZW tick or to anyone who wants a broader perspective of the field.
The papers seem highly focused and obscure, but in fact they cast a broad soft
light over the whole field of text compression, illuminating areas that for me
were very dark (but might not be so dark for people who read IEEE Transactions
on Information Theory like Dickens). The two papers have to be read carefully
though. An earlier paper [Rissanen81] by Rissanen and Langdon gives an even more
general view, but does not deal directly with LZ algorithms as the later papers
do and so I have focussed on the later papers.
<P>The idea behind the two papers is that all dictionary algorithms, including
the LZ classes (LZ77 and LZ78), are subsets of (i.e. can be emulated by) Markov
algorithms, but the converse is not true. Langdon shows us a clear view of LZ
algorithms through Markov eyes. This view is highly illuminating.
<P>In [Langdon83], Langdon examines the LZ78 class of algorithm closely and
explains the underlying engine in probabilistic terms. In [Langdon84], he has a
more thorough byte at the problem, formalizing the whole case and proving that
dictionary algorithms can always be emulated by Markov algorithms, but not the
converse.
<P>I will attempt to explain the idea of [Langdon83] briefly. I will review LZW
but not give a detailed explanation of how it works.
<P>LZ78/LZW builds its dictionary by parsing the input into phrases, each being
the longest match currently in the dictionary. At the end of each phrase parse, the
algorithm adds a new phrase to the dictionary consisting of the current phrase
plus the next input character. The dictionary grows forever, adding one new
phrase per phrase parsed.
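<P>The following sketch in C is my own illustration (it is not from any
published LZW implementation, and the names node, new_node and lzw_parse are
invented). It is only meant to make the parse-and-extend loop concrete; a real
coder would also pack each phrase's codeword into the output and cope with the
dictionary filling up. <PRE>
/* Sketch only: LZW's greedy parse-and-extend loop.                     */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int code;                     /* codeword allocated to this phrase   */
    struct node *child[256];      /* one possible arc per next character */
} node;

static int next_code = 0;

static node *new_node(void)
{
    node *n = calloc(1, sizeof *n);
    n->code = next_code++;
    return n;
}

/* Each iteration follows the longest match from the root, "sends" that  */
/* phrase's codeword, then adds longest-match-plus-next-character to the */
/* dictionary: one new phrase per phrase parsed.                         */
static void lzw_parse(node *root, const unsigned char *in, size_t len)
{
    size_t i = 0;
    while (i < len) {
        node *n = root;
        while (i < len && n->child[in[i]] != NULL)
            n = n->child[in[i++]];              /* longest match so far  */
        printf("phrase codeword %d\n", n->code);
        if (i < len)
            n->child[in[i++]] = new_node();     /* phrase + next char    */
    }
}

int main(void)
{
    node *root = calloc(1, sizeof(node));   /* root = the empty string;  */
    const unsigned char text[] = "abababcabab"; /* its code is never sent */
    int c;
    for (c = 0; c < 256; c++)               /* initial dictionary holds  */
        root->child[c] = new_node();        /* every single character    */
    lzw_parse(root, text, sizeof text - 1);
    return 0;
}
</PRE>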
<P>The dictionary may be viewed in many ways. We choose to view it as a
(forwards) trie structure whose leaves are to the right and whose root is at the
left. The following diagram illustrates such a tree. Each arc carries a single
character label. Each node represents a single string. In the diagram, arcs are
labelled in lower case and nodes in upper case. <PRE>
          AA
         /
       a/
       A
     a/ \b
     /   \
Root      AB
   b\
     \
      B
</PRE>All dictionaries can be viewed in this way.
<P>In LZW, each node is allocated a code value. Suppose we have two-bit code
words and the codes are allocated as follows:
<P>0: A<BR>1: B<BR>2: AA<BR>3: AB<BR>
<P>This code allocation implies that each event has an equal probability of one
quarter. We can thus label our tree with node occurrence probabilities. <PRE>
          AA 1/4
         /
       a/
       A 1/4
     a/ \b
     /   \
Root      AB 1/4
   b\
     \
      B 1/4
</PRE>These node probabilities can then be converted into probabilities for the
ARCS: <PRE>
          AA
         /
        /a 1/3
       A
  3/4a/ \b 1/3
     /   \
Root      AB
   b\
  1/4\
      B
</PRE>(The 1/3 and 1/3 don't add up to one because the A node itself (which
represents an imaginary arc labelled by the empty string) carries a probability
of 1/3 too.)
<P>These probabilities are exactly the sort of fodder that we need to feed a
Markov algorithm. From the above, it can be seen that the act of LZW parsing
the longest match and assigning a new codeword can be viewed as the execution
of a series of single-byte prediction decisions each with its own separate
probabilities and each capable of being coded independently and with identical
efficiency using arithmetic coding. You can't see this normally because LZW does
the parse all in one go and outputs (for each phrase parsed) a codeword which looks,
for all the world, to be indivisible. But deep down, we all know that it isn't.
Instead, it's made up of little pieces of entropy accumulated during the
traversal and cleverly bundled into 16-bit codewords at the last minute.
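<P>A small worked check of this (my own, using the arc probabilities in the
tree above): transmitting the phrase "AA" involves two predictions, Root to A
with probability 3/4 and A to AA with probability 1/3. Arithmetic coding those
two events would cost <PRE>
    -log2(3/4) + -log2(1/3)  =  0.415 + 1.585  =  2.000 bits
</PRE>which is exactly the two-bit codeword that LZW allocated to AA.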
<P>So far, we have converted our LZW tree to a Markov-like prediction tree. Now
let's analyse it.
<P>Consider a single node in the LZW tree. Associated with the node is a string
(being the characters on the arc leading from the root). Each time that the
parse reaches the node, it means that the node's string has occurred at the
start of a phrase. LZW then goes on further to the node's child nodes
and eventually reaches a leaf where it adds a new node AND ALLOCATES A NEW
CODEWORD. The result is that each node accumulates a subtree whose cardinality
(which in this paper means the number of subnodes including the node itself,
i.e. the number of codewords allocated to the subtree) is proportional to the probability of the node's
string arising as a prefix of a parse. As the number of subnodes is also the
node's share of the probability space (because there is one codeword allocated
to each node), we see that the amount of code space allocated to a node is
proportional to the number of times it has occurred. Very neat.
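<P>Here is a tiny sketch in C of that relationship (my own illustration, not
anything taken from [Langdon83]): treating each codeword as an equal slice of
the code space, the probability implicitly assigned to an arc is just the
child's subtree cardinality divided by the parent's, which reproduces the arc
probabilities in the tree above. <PRE>
/* Sketch only: recover LZW's implicit arc probabilities from the     */
/* subtree codeword counts (cardinalities) of the example tree:       */
/*   Root = {A, B, AA, AB} = 4,  A = {A, AA, AB} = 3,                 */
/*   B, AA, AB = 1 each (leaves count only themselves).               */
#include <stdio.h>

static double arc_probability(int child_subtree, int parent_subtree)
{
    /* Each codeword is an equal slice of the code space, so the      */
    /* arc's share is the child's slice of the parent's slice.        */
    return (double)child_subtree / (double)parent_subtree;
}

int main(void)
{
    printf("Root->A : %.3f\n", arc_probability(3, 4)); /* 0.750 */
    printf("Root->B : %.3f\n", arc_probability(1, 4)); /* 0.250 */
    printf("A ->AA  : %.3f\n", arc_probability(1, 3)); /* 0.333 */
    printf("A ->AB  : %.3f\n", arc_probability(1, 3)); /* 0.333 */
    printf("A itself: %.3f\n", arc_probability(1, 3)); /* 0.333 */
    return 0;
}
</PRE>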
<P>All this stuff is in [Langdon83] which quantifies it.
<P>The above explains why LZW compresses so well. Why then does it compress so
badly (compared to Markov models)? Without going into details, here are some
reasons:
<UL>
<LI>Approximately half of the code space is wasted at any point in time in
unallocated codewords.
<LI>Probability estimates are inaccurate near the leaves. </LI></UL>
<P>Other such reasons eat away at the compression. In particular, there is one
inefficiency of LZW-type compression that stands out.
<P>In our analysis, each LZW parse of length d consists of d single-character
predictions. Each prediction is made in the context of previous predictions.
Thus, after having parsed part of a phrase "ab", the LZW algorithm has two
characters of Markov context with which to "predict" the next character. After
having parsed "compressi", it would have nine characters of context. Thus we see
that, to the extent that we can view LZ78/LZW algorithms as making predictions,
they make predictions using a sawtooth context-length. <PRE>Order
  ^
 4|              *.
 3|            *  .            *.          *.
 2|    *.    *    .    *.    *  .        *  .
 1|  *  .  *      .  *  .  *    .  *.  *    .
 0|*    .*        .*    .*      .*  .*      .
  +-----|---------|-----|-------|---|-------|-----> Characters
   phrase1  phr2    phr3   phr4  phr5  phr6
</PRE>In contrast, Markov algorithms make predictions using a fairly flat (or
sometimes totally flat) context line. Thus, while a Markov algorithm may make
all of its predictions based on a 3-character context, an LZW algorithm will make
only some of its predictions from that depth.
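<P>To make the contrast concrete, here is a trivial sketch in C (my own; the
phrase lengths are simply the ones drawn in the figure above, and the fixed
Markov order of 3 is arbitrary) that prints the context order available for
each character prediction under the two schemes. <PRE>
/* Sketch only: context order used to predict each character under     */
/* LZW-style phrasing (the sawtooth) versus a fixed-order Markov model. */
#include <stdio.h>

int main(void)
{
    int phrase_len[] = {3, 5, 3, 4, 2, 4};   /* lengths from the figure  */
    int nphrases = sizeof phrase_len / sizeof phrase_len[0];
    int markov_order = 3;                    /* an arbitrary fixed order */
    int p, i;

    for (p = 0; p < nphrases; p++)
        for (i = 0; i < phrase_len[p]; i++)
            printf("phrase %d, char %d: LZW context order %d, Markov order %d\n",
                   p + 1, i + 1, i, markov_order);
    return 0;
}
</PRE>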
<P>The thing that really kills LZW's compression is the first one or two
characters of each phrase, the first of which is made based on no context at all
and the second of which is made based on a context of a single character. After
that the context is sufficient not to introduce too much inefficiency. See
Figure 64 of [Williams91] to get an idea of how much impact this has on
compression.
<P>To compensate for its poor performance in the first one or two characters,
LZW grows an enormous tree. This has the effect of increasing the average length
of a phrase so that the beginning of phrases occurs less often. This is like
Pollyanna adding extra days to the week to avoid Sundays. As the length of the
input increases to infinity, so does the average phrase length, with the
startling result that by the time you get to infinity, LZW has converged on the
entropy of the source (as Ziv and Lempel proved in their founding paper [Ziv78]).
The rate of convergence depends on the average phrase length which depends on
the entropy of the source. In practice, for ordinary sources (e.g. text files),
the average phrase length does not rise fast enough to properly offset the
effect of the low-context-length predictions at the start of each phrase.
<P>An additional reason for the relatively poor compression performance of LZW
can be found in its phrasing. Markov algorithms predict each character of the
input one at a time and record contextual information at each step. In contrast,
LZW divides the input up into phrases, each of which causes a single parse
through its tree. Thus, the "zero-order" model that LZW builds up for the first
character of each phrase is based only on the first character of all the
previous phrases it has seen, whereas a zero-order Markov model builds its model
based on ALL the characters it has seen. This applies for all orders.
<P>To summarize:
<UL>
<LI>All dictionary algorithms can be cast into Markov form.
<LI>The compression grunt (order) that LZW applies is in the shape of an
irregular sawtooth.
<LI>LZW collects statistics for each context only on phrase boundaries.
<LI>A little thought reveals that all this applies to the LZ77 class as well.
</LI></UL>
<P>A brief discussion of all this can be found in Chapter one of my book
[Williams91], in particular section 1.13: "Dictionary Methods vs Markov
Methods". However, for a full treatment, the reader is referred to [Langdon83]
and [Langdon84].
<P>
<H2>Ziv and Lempel meet Markov</H2>
<P>Currently, the field of text data compression has two separate classes of
method: statistical (Markov) and dictionary (Ziv and Lempel). Markov algorithms inch
through the input character by character, recording and using context
information at each step. This yields the best compression, but makes Markov
algorithms slow. Ziv and Lempel algorithms gallop through the input one phrase
at a time, losing lots of non-phrase-aligned context information and using
zero-order models at the start of each phrase. This yields higher speed but poorer
compression.
<P>Obviously it would be highly desirable to combine the speed of the Ziv and
Lempel algorithms with the compression power of the Markov algorithms. As time
goes by this seems less and less likely (without special (e.g. parallel)
hardware). However, there does seem to be one way of combining the two classes
of algorithm and, to my surprise, I have not yet seen it used.
<P>The idea is to create a group of contexts, but to parse phrases from each
context. For example, we might create 256 1-character contexts and grow an LZW
tree in each of them. Compression would proceed exactly as with LZW except that
the most recent character transmitted (the last character of the previous
phrase) would be used to select one of the 256 contexts, each of which would
contain an LZW tree which would then be used to parse the phrase.
<P>The result would be to take the low-order edge off LZW without losing any of
its speed. Each character of the phrase would be "predicted" using a model one
order higher than would have been used in LZW. At high orders this would have
negligible effect, but at low orders it would have a significant effect, for the
first character of each phrase, formerly predicted by a zero-order model, would
now be predicted by a first-order model. Each phrase would be coded using only
the range required by the size of the tree in its context.
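<P>To make the idea concrete, here is a minimal sketch in C of the scheme just
described. It is my own invention, not code from this document: all names are
made up, no decompressor is shown, memory management is ignored, and the bit
count merely stands in for real codeword output. It keeps 256 one-character
contexts, each growing its own LZW tree; the last character of the previous
phrase selects the tree used to parse the next phrase, and each phrase is
charged only as many bits as that tree's current size requires. <PRE>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int            code;        /* codeword inside this context's tree  */
    unsigned char  ch;          /* arc label from the parent            */
    struct node   *child;       /* first child                          */
    struct node   *sibling;     /* next sibling                         */
} node;

typedef struct context {
    node *root;                 /* root = the empty string              */
    int   size;                 /* codewords allocated so far           */
} context;

static context ctx[256];        /* selected by the previous character   */

static node *add_child(context *c, node *parent, unsigned char ch)
{
    node *n = calloc(1, sizeof *n);
    n->code    = c->size++;
    n->ch      = ch;
    n->sibling = parent->child;
    parent->child = n;
    return n;
}

static node *find_child(node *parent, unsigned char ch)
{
    node *n;
    for (n = parent->child; n != NULL; n = n->sibling)
        if (n->ch == ch)
            return n;
    return NULL;
}

static int bits_needed(int size)            /* ceil(log2(size)) */
{
    int b = 0;
    while ((1 << b) < size)
        b++;
    return b;
}

static void compress(const unsigned char *in, size_t len)
{
    unsigned char prev = 0;     /* context used for the very first phrase */
    size_t i = 0;
    long total_bits = 0;
    int c, k;

    for (c = 0; c < 256; c++) { /* every context tree starts out holding  */
        ctx[c].root = calloc(1, sizeof(node)); /* the 256 single chars    */
        for (k = 0; k < 256; k++)
            add_child(&ctx[c], ctx[c].root, (unsigned char)k);
    }

    while (i < len) {
        context *cx = &ctx[prev];            /* previous char picks the tree */
        node *n = cx->root, *next;

        while (i < len && (next = find_child(n, in[i])) != NULL) {
            n = next;                        /* follow the longest match     */
            i++;
        }
        total_bits += bits_needed(cx->size); /* phrase coded in the range    */
                                             /* required by this tree alone  */
        if (i < len)
            add_child(cx, n, in[i]);         /* longest match + next char    */
        prev = n->ch;                        /* last char of phrase = next   */
    }                                        /* phrase's context             */
    printf("%lu bytes in -> roughly %ld bits out\n",
           (unsigned long)len, total_bits);
}

int main(void)
{
    const unsigned char text[] = "the quick brown fox jumps over the lazy dog";
    compress(text, sizeof text - 1);
    return 0;
}
</PRE>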