<P>By using contexts at the start of the LZW tree, we raise the bottom of the
sawtooth a few notches. <PRE>Order
^
6| *.
5| * . *. *.
4| *. * . *. * . * .
3| * . * . * . * . *. * .
2|* .* .* .* .* .* .
1|
0|
+-----|---------|-----|-------|---|-------|-----> Characters
phrase1 phr2 phr3 phr4 phr5 phr6
</PRE>This idea is not at all specific to LZW, but could be applied in dozens of
ways to many different fast parsing adaptive algorithms. In fact my example
algorithm, LZRW4 below, uses an LZ77-like algorithm. Here are some ideas:
<UL>
<LI>Use a two-byte context to hash into a table of LZW trees.
<LI>Grow backwards context trees as in DHPC, but grow LZW parsing trees out of
each node.
<LI>Somehow use a single LZW tree BUT START PARSING ONE OR MORE CHARACTERS
BACK! This avoids having two different kinds of data structure.
<LI>Set up a set of contexts and bolt an LZ77 compressor on the back. Each
context stores a list of pointers to strings that occurred in that context
(see LZRW4 below).
<LI>Use context with any other dynamic dictionary scheme.
<LI>Use a rolling hash function in hardware to avoid hash delays at the start
of phrases. </LI></UL>
<P>...combinations and permutations.
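<P>The first idea above, hashing a two-byte context to select a per-context
dictionary, can be sketched concretely. This is only an illustration: the table
size matches LZRW4's 4096 partitions, but the multiplier 40543 is an assumed
hash constant, not one mandated by the text.

```c
#define TABLE_SIZE 4096  /* one context-specific dictionary per slot */

/* Hash a two-byte context into [0, TABLE_SIZE). The multiplier 40543 is
   an illustrative choice; any reasonable mixing function would do. */
static unsigned context_hash(unsigned char c1, unsigned char c2)
{
    return ((((unsigned)c1 << 8) ^ c2) * 40543u) & (TABLE_SIZE - 1u);
}
```

At the start of each phrase, the two preceding characters select which
context-specific dictionary (LZW tree, pointer list, etc.) is used to parse
the next phrase.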
<P>There are two drawbacks to all this. First of all we are up for a lot more
memory. Second, we run up against the context-depth/sample-size tradeoff
encountered by Markov algorithms. The deeper the contexts we establish, the more
specific they become, but the smaller the amount of information that each
context has on which to base its mini-model and hence its predictions. The first
problem can be overcome with more memory. The second problem requires careful
tuning.
<P>
<H2>LZRW4</H2>
<P>To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I did
write a program to measure the compression that it would yield. This program is
provided in Appendix A on an "as is" basis.
<P>Note: The name LZRW4 is used here to describe an algorithm. However, the
eventual official code release of LZRW4 may differ in some details from the
algorithm presented here.
<P>With a huge range of LZ algorithms from which to choose, I chose the LZ77
class with which I am familiar. This avoided patent problems.
<P>LZRW4 uses a hash table of 4096 partitions each containing 32 pointers to the
input which is held in a 1 Meg buffer. At the start of each phrase, LZRW4 hashes
the previous two characters to obtain a partition number in the range 0..4095.
The 32 pointers in that partition are searched for the longest match, and the
length (3 bits = [2,3,4,5,6,7,8,16]) and pointer number (5 bits = [0..31]) are
coded and transmitted. LZRW4 uses a single control bit for each phrase to say whether a
literal byte or a copy byte is coming next. These control bits are buffered as
in the other LZRW* algorithms. The equal length of literal and copy items is an
interesting feature of the algorithm which seems bound to be useful to someone
someday!
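<P>The item coding just described can be summarized in a short sketch. The
function names are illustrative, but the bit budget follows the text: one
control bit plus eight data bits for both literals and copies.

```c
/* LZRW4 copy lengths: the 3-bit code expresses 2..8 plus 16; lengths
   9..15 are reduced to 8 by the algorithm before coding. */
#define MEDMATCH 8
#define MAXMATCH 16

static unsigned length_code(unsigned len)
{
    if (len == MAXMATCH) return 7;     /* code 7 stands for length 16 */
    if (len > MEDMATCH) len = MEDMATCH;
    return len - 2;                    /* codes 0..6 stand for lengths 2..8 */
}

/* A copy item is 1 control bit + 3 length bits + 5 pointer bits;
   a literal item is 1 control bit + 8 literal bits: both are 9 bits. */
static unsigned copy_item_bits(void)    { return 1 + 3 + 5; }
static unsigned literal_item_bits(void) { return 1 + 8; }
```

The equal item sizes fall out directly: 1+3+5 = 1+8 = 9 bits.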
<P>To update the partition, LZRW4 swaps the winning pointer with the pointer
halfway towards the front of the partition. If the winning pointer matched less
than 4 bytes then a pointer to the current position (start of the Ziv) is
shifted into the partition.
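<P>This update policy can be sketched in isolation. The routine below is an
illustrative restatement of the two rules (promote the winner halfway to the
front; shift in the current position after a short match), with assumed names;
the real logic is embedded in the Appendix code.

```c
#define HASHDEPTH 32       /* pointers per partition */
#define NOSHIFTTHRESH 4    /* this long a match or longer and we don't shift */

typedef unsigned char UBYTE;

/* Update one partition after a match of match_len bytes against the
   pointer at index winner. current_pos is the start of the Ziv. */
static void update_partition(UBYTE **queue, unsigned winner,
                             unsigned match_len, UBYTE *current_pos)
{
    unsigned j;
    if (match_len > 1)               /* promote winner halfway to the front */
    {
        UBYTE *t = queue[winner / 2];
        queue[winner / 2] = queue[winner];
        queue[winner] = t;
    }
    if (match_len < NOSHIFTTHRESH)   /* short match: shift in new position */
    {
        for (j = HASHDEPTH - 1; j > 0; j--)
            queue[j] = queue[j - 1];
        queue[0] = current_pos;
    }
}
```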
<P>See the code in Appendix A for the details. An understanding of the LZRW3-A
algorithm may also assist.
<P>The results of LZRW4 are not startling, but demonstrate that the idea is
working. By using contexts, LZRW4 gets away with using 3-bit lengths and 5-bit
"indexes" whereas its ancestor algorithm LZRW3-A uses 4-bit lengths and 12-bit
"indexes", yielding comparable compression. Here are some rough results on some
Calgary corpus files (percentage remaining): <PRE>         LZRW4   LZRW3-A   Unix Compress
bib       39.5     44.1        41.8
book1     51.4     54.7        43.2
book2     41.2     45.5        41.1
obj1      65.5     58.8        65.3
obj2      43.4     43.8        52.1
paper1    46.5     46.1        47.2
progc     46.4     45.2        48.3
trans     29.1     32.3        40.8
</PRE>I have experimented with lots of variants of LZRW4 but have not formalized
them nor written them up. One particularly interesting idea is to subdivide each
partition into subpartitions based on the first character of the phrase. LZRW3-A
partition management techniques can then be used.
<P>
<H2>Closing Thoughts</H2>
<P>I don't know if this idea is original or whether it will yield any benefit.
However, I have not heard of any Ziv/Lempel/Markov hybrids to date, and the
experiments above indicate at least that the idea works, if not that it has real
competitive benefit. With the right tuning, the use of contexts could provide
extra compression with little impact on speed (but usually a lot of impact on
memory!). At the very least, it is another weapon in the data compression
armoury.
<P>
<H2>References</H2>
<P>[Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for Compressing
Individual Sequences", IEEE Transactions on Information Theory, Vol. 29, No. 2,
pp. 284-287, 1983.
<P>[Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model Structures for
Data Compression", IBM Research Report RJ-4163 (46091) 1/18/84, IBM Research
Laboratory, San Jose, CA 95193, 1984.
<P>[Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and Coding", IEEE
Transactions on Information Theory, Vol. 27, No. 1, pp. 12-23, 1981.
<P>[Williams91] Williams R.N., "Adaptive Data Compression", Kluwer Academic
Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061, 1991.
<P>[Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3, pp.
337-343, 1977.
<P>[Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via
Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5,
pp. 530-536, 1978.
<P>
<H2>Appendix: LZRW4: Experimental Code</H2>
<P>The following code is public domain and is provided "as is". Like most
experimental code, it is poorly designed, structured, coded, and commented. <PRE>#include <stdio.h>
/* #include <unix.h> */
#include <fcntl.h>
#include "port.h"
#define MY_BUFFER_SIZE (1024*1024)
#define HASHLENGTH 4096
#define HASHDEPTH 32
#define CONTEXT 2
#define MEDMATCH 8 /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
#define MAXMATCH 16
#define NOSHIFTTHRESH 4 /* This or longer match and we don't shift */
main(argc,argv)
int argc;
char **argv;
{
typedef UBYTE *queue[HASHDEPTH];
/* Element zero is most recent element. */
UBYTE buf[MY_BUFFER_SIZE+2000];
queue hash[HASHLENGTH];
char *input_file_name;
int infile;
ULONG bits_in =0;
ULONG bits_out =0;
if (argc != 2)
{
printf("To run lzcontext alg: x <filename>\n");
exit(0);
}
printf("Welcome to LZCONTEXT\n");
printf("--------------------\n");
printf("Buffer size = %u\n",MY_BUFFER_SIZE);
printf("Hash table length = %u\n",HASHLENGTH );
printf("Hash table depth = %u\n",HASHDEPTH );
printf("Context = %u\n",CONTEXT );
printf("Med match = %u\n",MEDMATCH );
printf("Max match = %u\n",MAXMATCH );
printf("No shift threshold = %u\n",NOSHIFTTHRESH );
input_file_name=argv[1];
infile =open(input_file_name ,O_RDONLY); /* O_BINARY*/
while(TRUE)
{
ULONG i,j,k;
UBYTE *p_next_byte;
ULONG bytes_in;
/* Read in a block of bytes to compress. */
bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
if (bytes_in == 0)
break;
/* Initialize the hash table so all entries point to a defined string. */
for (i=0;i<HASHLENGTH;i++)
for (j=0;j<HASHDEPTH;j++)
hash[i][j]=(UBYTE *) "0123456789ABCDEF";
/* Get past the first few bytes. */
bits_in += CONTEXT*8;
bits_out += CONTEXT*9;
p_next_byte=buf+CONTEXT;
/* Loop once for each item. The end is sloppy but this is only
approximate so who cares? */
while (p_next_byte < (buf+bytes_in))
{
UWORD index;
ULONG maxlength;
UBYTE **this_queue;
UBYTE *old_p_next_byte=p_next_byte;
ULONG longest_match_index;
/* Hash the context of the previous two characters. */
/* index=(*(p_next_byte-2)<<4) ^ (*(p_next_byte-1)); */
index=((*(p_next_byte-2)<<8) ^ (*(p_next_byte-1))) >> 4;
/* index&=0x3FFF; */
index&=0xFFF;
this_queue=&hash[index][0];
/* Find the longest match in the 32 recent positions. */
/* Note: Because we are only calculating entropy here, we don't */
/* actually need to know the number of the best position. */
maxlength=0;
for (j=0;j<HASHDEPTH;j++)
{
UBYTE *p=this_queue[j];
for (k=0;k<MAXMATCH;k++)
if (p_next_byte[k] != p[k])
break;
if (k>maxlength)
{
maxlength=k;
longest_match_index=j;
}
}
/* Both literals and copies output one control bit and eight
data bits. They differ on how much input they consume. */
if (maxlength == 0) maxlength=1;
if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
bits_in += 8*maxlength;
bits_out += 9;
p_next_byte+=maxlength;
/* If there was a match of 2 or greater, swap the matching element
with the one half the distance to the head. */
if (maxlength > 1)
{
UBYTE *t;
ULONG half=longest_match_index/2;
t=this_queue[half];
this_queue[half]=this_queue[longest_match_index];
this_queue[longest_match_index]=t;
}
/* Shift the queue and put the new value in the recent end. */
if (maxlength < NOSHIFTTHRESH)
{
for (j=HASHDEPTH-1;j>0;j--)
this_queue[j]=this_queue[j-1];
this_queue[0]=old_p_next_byte;
}
}
}
close(infile);
printf("Bytes in = %lu.\n",bits_in /8);
printf("Bytes out = %lu.\n",bits_out/8);
printf("Percentage remaining=%5.1f\n",100.0*((float) bits_out)/((float)
bits_in));
}
void error(mess)
/* Prints out the string and terminates. */
char mess[];
{
printf("%s\n",mess);
exit(0);
}
</PRE>
<HR>
Return to the <A href="http://www.cs.pdx.edu/~idr/index.html">Epsilon
Compression Page</A> </BODY></HTML>