<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>LZRW4: ZIV AND LEMPEL MEET MARKOV</TITLE>
<META content="text/html; charset=gb2312" http-equiv=Content-Type>
<META content="MSHTML 5.00.2920.0" name=GENERATOR></HEAD>
<BODY>
<H1 align=center>LZRW4: ZIV AND LEMPEL MEET MARKOV</H1>Author : Ross
Williams<BR>Date : 23-Jul-1991.
<P>This is a public domain document and may be copied freely so long as it is
not modified (though text formatting transformations are permitted).
<P>Note: This paper has not been polished to publication-standard. It has been
prepared under considerable time pressure and is released mainly to make the
ideas it contains public so as to avoid others re-inventing and patenting them.
<P>
<H2>Abstract</H2>
<P>In 1977 and 1978, Ziv and Lempel[Ziv77][Ziv78] introduced the LZ class of
adaptive dictionary data compression techniques that have become so very popular
today. In competition with these techniques are the Markov algorithms that give
better compression but run much more slowly. This paper compares the two classes
of algorithm and shows how algorithms that combine techniques from both classes
have the potential to provide higher performance (i.e. a better
speed/compression trade-off) than we have seen so far. An algorithm called LZRW4 that combines Markov
and LZ77 ideas is sketched.
<P>
<H2>Comparison of Ziv/Lempel and Markov Algorithms</H2>
<P>To the casual observer, the Ziv/Lempel (LZ) data compression algorithms may
appear to be worlds away from the lesser known Markov algorithms. LZ algorithms
construct trees of strings and transmit strings as codewords. Markov algorithms
construct large numbers of contexts and fiddle with arithmetic ranges and
probabilities. The two kinds of algorithms seem to operate with totally
different objects.
<P>However, deep down, both classes of algorithm are using the same
probabilistic engine. Deep down, both classes must be representing more common
events with short codes and less common events with longer codes. This is the
law of data compression. But, without some careful thought, it's hard to see how
the two classes relate.
<P>In the early eighties, Glen Langdon put in some careful thought and a lot of
insight and came up with two papers, [Langdon83] and [Langdon84], which are
pivotal to my understanding of the field of data compression. I cannot recommend
these papers highly enough to anyone who wishes to gain an understanding of what
REALLY makes LZW tick or to anyone who wants a broader perspective of the field.
The papers seem highly focused and obscure, but in fact they cast a broad soft
light over the whole field of text compression, illuminating areas that for me
were very dark (but might not be so dark for people who read IEEE Transactions
on Information Theory like Dickens). The two papers have to be read carefully
though. An earlier paper [Rissanen81] by Rissanen and Langdon gives an even more
general view, but does not deal directly with LZ algorithms as the later papers
do and so I have focussed on the later papers.
<P>The idea behind the two papers is that all dictionary algorithms, including
the LZ classes (LZ77 and LZ78), are subsets of (i.e. can be emulated by) Markov
algorithms, but the converse is not true. Langdon shows us a clear view of LZ
algorithms through Markov eyes. This view is highly illuminating.
<P>In [Langdon83], Langdon examines the LZ78 class of algorithm closely and
explains the underlying engine in probabilistic terms. In [Langdon84], he has a
more thorough byte at the problem, formalizing the whole case and proving that
dictionary algorithms can always be emulated by Markov algorithms, but not the
converse.
<P>I will attempt to explain the idea of [Langdon83] briefly. I will review LZW
but not give a detailed explanation of how it works.
<P>LZ78/LZW builds its dictionary by parsing the input into phrases, each being
the longest match currently in the dictionary. At the end of each phrase parse, the
algorithm adds a new phrase to the dictionary consisting of the current phrase
plus the next input character. The dictionary grows forever, adding one new
phrase per phrase parsed.
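<P>The following sketch in C is my own illustration (it is not from any
published LZW implementation, and the names node, new_node and lzw_parse are
invented). It is only meant to make the parse-and-extend loop concrete; a real
coder would also pack each phrase's codeword into the output and cope with the
dictionary filling up. <PRE>
/* Sketch only: LZW's greedy parse-and-extend loop.                     */
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int code;                     /* codeword allocated to this phrase   */
    struct node *child[256];      /* one possible arc per next character */
} node;

static int next_code = 0;

static node *new_node(void)
{
    node *n = calloc(1, sizeof *n);
    n->code = next_code++;
    return n;
}

/* Each iteration follows the longest match from the root, "sends" that  */
/* phrase's codeword, then adds longest-match-plus-next-character to the */
/* dictionary: one new phrase per phrase parsed.                         */
static void lzw_parse(node *root, const unsigned char *in, size_t len)
{
    size_t i = 0;
    while (i < len) {
        node *n = root;
        while (i < len && n->child[in[i]] != NULL)
            n = n->child[in[i++]];              /* longest match so far  */
        printf("phrase codeword %d\n", n->code);
        if (i < len)
            n->child[in[i++]] = new_node();     /* phrase + next char    */
    }
}

int main(void)
{
    node *root = calloc(1, sizeof(node));   /* root = the empty string;  */
    const unsigned char text[] = "abababcabab"; /* its code is never sent */
    int c;
    for (c = 0; c < 256; c++)               /* initial dictionary holds  */
        root->child[c] = new_node();        /* every single character    */
    lzw_parse(root, text, sizeof text - 1);
    return 0;
}
</PRE>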
<P>The dictionary may be viewed in many ways. We choose to view it as a
(forwards) trie structure whose leaves are to the right and whose root is at the
left. The following diagram illustrates such a tree. Each arc carries a single
character label. Each node represents a single string. In the diagram, arcs are
labelled in lower case and nodes in upper case. <PRE>
          AA
         /
       a/
       A
     a/ \b
     /   \
Root      AB
   b\
     \
      B
</PRE>All dictionaries can be viewed in this way.
<P>In LZW, each node is allocated a code value. Suppose we have two-bit code
words and the codes are allocated as follows:
<P>0: A<BR>1: B<BR>2: AA<BR>3: AB<BR>
<P>This code allocation implies that each event has an equal probability of one
quarter. We can thus label our tree with node occurrence probabilities. <PRE>
          AA 1/4
         /
       a/
       A 1/4
     a/ \b
     /   \
Root      AB 1/4
   b\
     \
      B 1/4
</PRE>These node probabilities can then be converted into probabilities for the
ARCS: <PRE>
          AA
         /
        /a 1/3
       A
  3/4a/ \b 1/3
     /   \
Root      AB
   b\
  1/4\
      B
</PRE>(The 1/3 and 1/3 don't add up to one because the A node itself (which
represents an imaginary arc labelled by the empty string) carries a probability
of 1/3 too.)
<P>These probabilities are exactly the sort of fodder that we need to feed a
Markov algorithm. From the above, it can be seen that the act of LZW parsing
the longest match and assigning a new codeword can be viewed as the execution
of a series of single-byte prediction decisions each with its own separate
probabilities and each capable of being coded independently and with identical
efficiency using arithmetic coding. You can't see this normally because LZW does
the parse all in one go and outputs (for each phrase parsed) a codeword which looks,
for all the world, to be indivisible. But deep down, we all know that it isn't.
Instead, it's made up of little pieces of entropy accumulated during the
traversal and cleverly bundled into 16-bit codewords at the last minute.
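<P>A small worked check of this (my own, using the arc probabilities in the
tree above): transmitting the phrase "AA" involves two predictions, Root to A
with probability 3/4 and A to AA with probability 1/3. Arithmetic coding those
two events would cost <PRE>
    -log2(3/4) + -log2(1/3)  =  0.415 + 1.585  =  2.000 bits
</PRE>which is exactly the two-bit codeword that LZW allocated to AA.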
<P>So far, we have converted our LZW tree to a Markov-like prediction tree. Now
let's analyse it.
<P>Consider a single node in the LZW tree. Associated with the node is a string
(being the characters on the arc leading from the root). Each time that the
parse reaches the node, it means that the node's string has occurred at the
start of a phrase. LZW then goes on further to the node's child nodes
and eventually reaches a leaf where it adds a new node AND ALLOCATES A NEW
CODEWORD. The result is that each node accumulates a subtree whose cardinality
(which in this paper means the number of subnodes including the node itself,
i.e. the number of codewords allocated to the subtree) is proportional to the probability of the node's
string arising as a prefix of a parse. As the number of subnodes is also the
node's share of the probability space (because there is one codeword allocated
to each node), we see that the amount of code space allocated to a node is
proportional to the number of times it has occurred. Very neat.
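<P>Here is a tiny sketch in C of that relationship (my own illustration, not
anything taken from [Langdon83]): treating each codeword as an equal slice of
the code space, the probability implicitly assigned to an arc is just the
child's subtree cardinality divided by the parent's, which reproduces the arc
probabilities in the tree above. <PRE>
/* Sketch only: recover LZW's implicit arc probabilities from the     */
/* subtree codeword counts (cardinalities) of the example tree:       */
/*   Root = {A, B, AA, AB} = 4,  A = {A, AA, AB} = 3,                 */
/*   B, AA, AB = 1 each (leaves count only themselves).               */
#include <stdio.h>

static double arc_probability(int child_subtree, int parent_subtree)
{
    /* Each codeword is an equal slice of the code space, so the      */
    /* arc's share is the child's slice of the parent's slice.        */
    return (double)child_subtree / (double)parent_subtree;
}

int main(void)
{
    printf("Root->A : %.3f\n", arc_probability(3, 4)); /* 0.750 */
    printf("Root->B : %.3f\n", arc_probability(1, 4)); /* 0.250 */
    printf("A ->AA  : %.3f\n", arc_probability(1, 3)); /* 0.333 */
    printf("A ->AB  : %.3f\n", arc_probability(1, 3)); /* 0.333 */
    printf("A itself: %.3f\n", arc_probability(1, 3)); /* 0.333 */
    return 0;
}
</PRE>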
<P>All this stuff is in [Langdon83] which quantifies it.
<P>The above explains why LZW compresses so well. Why then does it compress so
badly (compared to Markov models)? Without going into details, here are some
reasons:
<UL>
<LI>Approximately half of the code space is wasted at any point in time in
unallocated codewords.
<LI>Probability estimates are inaccurate near the leaves. </LI></UL>
<P>Other such reasons eat away at the compression. In particular, there is one
inefficiency of LZW-type compression that stands out.
<P>In our analysis, each LZW parse of length d consists of d single-character
predictions. Each prediction is made in the context of previous predictions.
Thus, after having parsed part of a phrase "ab", the LZW algorithm has two
characters of Markov context with which to "predict" the next character. After
having parsed "compressi", it would have nine characters of context. Thus we see
that, to the extent that we can view LZ78/LZW algorithms as making predictions,
they make predictions using a sawtooth context-length. <PRE>Order
  ^
 4|              *.
 3|            *  .            *.          *.
 2|    *.    *    .    *.    *  .        *  .
 1|  *  .  *      .  *  .  *    .  *.  *    .
 0|*    .*        .*    .*      .*  .*      .
  +-----|---------|-----|-------|---|-------|-----> Characters
   phrase1  phr2    phr3   phr4  phr5  phr6
</PRE>In contrast, Markov algorithms make predictions using a fairly flat (or
sometimes totally flat) context line. Thus, while a Markov algorithm may make
all of its predictions based on a 3-character context, an LZW algorithm will make
only some of its predictions from that depth.
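<P>To make the contrast concrete, here is a trivial sketch in C (my own; the
phrase lengths are simply the ones drawn in the figure above, and the fixed
Markov order of 3 is arbitrary) that prints the context order available for
each character prediction under the two schemes. <PRE>
/* Sketch only: context order used to predict each character under     */
/* LZW-style phrasing (the sawtooth) versus a fixed-order Markov model. */
#include <stdio.h>

int main(void)
{
    int phrase_len[] = {3, 5, 3, 4, 2, 4};   /* lengths from the figure  */
    int nphrases = sizeof phrase_len / sizeof phrase_len[0];
    int markov_order = 3;                    /* an arbitrary fixed order */
    int p, i;

    for (p = 0; p < nphrases; p++)
        for (i = 0; i < phrase_len[p]; i++)
            printf("phrase %d, char %d: LZW context order %d, Markov order %d\n",
                   p + 1, i + 1, i, markov_order);
    return 0;
}
</PRE>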
<P>The thing that really kills LZW's compression is the first one or two
characters of each phrase, the first of which is made based on no context at all
and the second of which is made based on a context of a single character. After
that the context is sufficient not to introduce too much inefficiency. See
Figure 64 of [Williams91] to get an idea of how much impact this has on
compression.
<P>To compensate for its poor performance in the first one or two characters,
LZW grows an enormous tree. This has the effect of increasing the average length
of a phrase so that the beginning of phrases occurs less often. This is like
Pollyanna adding extra days to the week to avoid Sundays. As the length of the
input increases to infinity, so does the average phrase length, with the
startling result that by the time you get to infinity, LZW has converged on the
entropy of the source (as Ziv and Lempel proved in their founding paper [Ziv78]).
The rate of convergence depends on the average phrase length which depends on
the entropy of the source. In practice, for ordinary sources (e.g. text files),
the average phrase length does not rise fast enough to properly offset the
effect of the low-context-length predictions at the start of each phrase.
<P>An additional reason for the relatively poor compression performance of LZW
can be found in its phrasing. Markov algorithms predict each character of the
input one at a time and record contextual information at each step. In contrast,
LZW divides the input up into phrases, each of which causes a single parse
through its tree. Thus, the "zero-order" model that LZW builds up for the first
character of each phrase is based only on the first character of all the
previous phrases it has seen, whereas a zero-order Markov model builds its model
based on ALL the characters it has seen. This applies for all orders.
<P>To summarize:
<UL>
<LI>All dictionary algorithms can be cast into Markov form.
<LI>The compression grunt (order) that LZW applies is in the shape of an
irregular sawtooth.
<LI>LZW collects statistics for each context only on phrase boundaries.
<LI>A little thought reveals that all this applies to the LZ77 class as well.
</LI></UL>
<P>A brief discussion of all this can be found in Chapter one of my book
[Williams91], in particular section 1.13: "Dictionary Methods vs Markov
Methods". However, for a full treatment, the reader is referred to [Langdon83]
and [Langdon84].
<P>
<H2>Ziv and Lempel meet Markov</H2>
<P>Currently, the field of text data compression has two separate classes of
method: statistical (Markov) and dictionary (Ziv and Lempel). Markov algorithms inch
through the input character by character, recording and using context
information at each step. This yields the best compression, but makes Markov
algorithms slow. Ziv and Lempel algorithms gallop through the input one phrase
at a time, losing lots of non-phrase-aligned context information and using
zero-order models at the start of each phrase. This yields higher speed but poorer
compression.
<P>Obviously it would be highly desirable to combine the speed of the Ziv and
Lempel algorithms with the compression power of the Markov algorithms. As time
goes by this seems less and less likely (without special (e.g. parallel)
hardware). However, there does seem to be one way of combining the two classes
of algorithm and, to my surprise, I have not yet seen it used.
<P>The idea is to create a group of contexts, but to parse phrases from each
context. For example, we might create 256 1-character contexts and grow an LZW
tree in each of them. Compression would proceed exactly as with LZW except that
the most recent character transmitted (the last character of the previous
phrase) would be used to select one of the 256 contexts, each of which would
contain an LZW tree which would then be used to parse the phrase.
<P>The result would be to take the low-order edge off LZW without losing any of
its speed. Each character of the phrase would be "predicted" using a model one
order higher than would have been used in LZW. At high orders this would have
negligible effect, but at low orders it would have a significant effect, for the
first character of each phrase, formerly predicted by a zero-order model, would
now be predicted by a first-order model. Each phrase would be coded using only
the range required by the size of the tree in its context.
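<P>To make the idea concrete, here is a minimal sketch in C of the scheme just
described. It is my own invention, not code from this document: all names are
made up, no decompressor is shown, memory management is ignored, and the bit
count merely stands in for real codeword output. It keeps 256 one-character
contexts, each growing its own LZW tree; the last character of the previous
phrase selects the tree used to parse the next phrase, and each phrase is
charged only as many bits as that tree's current size requires. <PRE>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    int            code;        /* codeword inside this context's tree  */
    unsigned char  ch;          /* arc label from the parent            */
    struct node   *child;       /* first child                          */
    struct node   *sibling;     /* next sibling                         */
} node;

typedef struct context {
    node *root;                 /* root = the empty string              */
    int   size;                 /* codewords allocated so far           */
} context;

static context ctx[256];        /* selected by the previous character   */

static node *add_child(context *c, node *parent, unsigned char ch)
{
    node *n = calloc(1, sizeof *n);
    n->code    = c->size++;
    n->ch      = ch;
    n->sibling = parent->child;
    parent->child = n;
    return n;
}

static node *find_child(node *parent, unsigned char ch)
{
    node *n;
    for (n = parent->child; n != NULL; n = n->sibling)
        if (n->ch == ch)
            return n;
    return NULL;
}

static int bits_needed(int size)            /* ceil(log2(size)) */
{
    int b = 0;
    while ((1 << b) < size)
        b++;
    return b;
}

static void compress(const unsigned char *in, size_t len)
{
    unsigned char prev = 0;     /* context used for the very first phrase */
    size_t i = 0;
    long total_bits = 0;
    int c, k;

    for (c = 0; c < 256; c++) { /* every context tree starts out holding  */
        ctx[c].root = calloc(1, sizeof(node)); /* the 256 single chars    */
        for (k = 0; k < 256; k++)
            add_child(&ctx[c], ctx[c].root, (unsigned char)k);
    }

    while (i < len) {
        context *cx = &ctx[prev];            /* previous char picks the tree */
        node *n = cx->root, *next;

        while (i < len && (next = find_child(n, in[i])) != NULL) {
            n = next;                        /* follow the longest match     */
            i++;
        }
        total_bits += bits_needed(cx->size); /* phrase coded in the range    */
                                             /* required by this tree alone  */
        if (i < len)
            add_child(cx, n, in[i]);         /* longest match + next char    */
        prev = n->ch;                        /* last char of phrase = next   */
    }                                        /* phrase's context             */
    printf("%lu bytes in -> roughly %ld bits out\n",
           (unsigned long)len, total_bits);
}

int main(void)
{
    const unsigned char text[] = "the quick brown fox jumps over the lazy dog";
    compress(text, sizeof text - 1);
    return 0;
}
</PRE>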