<P>By using contexts at the start of the LZW tree, we raise the bottom of the
sawtooth a few notches. <PRE>Order
^
6| *.
5| * . *. *.
4| *. * . *. * . * .
3| * . * . * . * . *. * .
2|* .* .* .* .* .* .
1|
0|
+-----|---------|-----|-------|---|-------|-----> Characters
phrase1 phr2 phr3 phr4 phr5 phr6
</PRE>This idea is not at all specific to LZW, but could be applied in dozens of
ways to many different fast parsing adaptive algorithms. In fact my example
algorithm, LZRW4 below, uses an LZ77-like algorithm. Here are some ideas:
<UL>
<LI>Use a two-byte context to hash into a table of LZW trees.
<LI>Grow backwards context trees as in DHPC, but grow LZW parsing trees out of
each node.
<LI>Somehow use a single LZW tree BUT START PARSING ONE OR MORE CHARACTERS
BACK! This avoids having two different kinds of data structure.
<LI>Set up a set of contexts and bolt an LZ77 compressor on the back. Each
context stores a list of pointers to strings that occurred in that context
(see LZRW4 below).
<LI>Use context with any other dynamic dictionary scheme.
<LI>Use a rolling hash function in hardware to avoid hash delays at the start
of phrases. </LI></UL>
<P>...combinations and permutations.
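<P>The first idea above, hashing a two-byte context to select a per-context
dictionary, can be sketched concretely. This is only an illustration: the table
size matches LZRW4's 4096 partitions, but the multiplier 40543 is an assumed
hash constant, not one mandated by the text.

```c
#define TABLE_SIZE 4096  /* one context-specific dictionary per slot */

/* Hash a two-byte context into [0, TABLE_SIZE). The multiplier 40543 is
   an illustrative choice; any reasonable mixing function would do. */
static unsigned context_hash(unsigned char c1, unsigned char c2)
{
    return ((((unsigned)c1 << 8) ^ c2) * 40543u) & (TABLE_SIZE - 1u);
}
```

At the start of each phrase, the two preceding characters select which
context-specific dictionary (LZW tree, pointer list, etc.) is used to parse
the next phrase.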
<P>There are two drawbacks to all this. First of all we are up for a lot more
memory. Second, we run up against the context-depth/sample-size tradeoff
encountered by Markov algorithms. The deeper the contexts we establish, the more
specific they become, but the smaller the amount of information that each
context has on which to base its mini-model and hence its predictions. The first
problem can be overcome with more memory. The second problem requires careful
tuning.
<P>
<H2>LZRW4</H2>
<P>To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I did
write a program to measure the compression that it would yield. This program is
provided in Appendix A on an "as is" basis.
<P>Note: The name LZRW4 is used here to describe an algorithm. However, the
eventual official code release of LZRW4 may differ in some details from the
algorithm presented here.
<P>With a huge range of LZ algorithms from which to choose, I chose the LZ77
class with which I am familiar. This avoided patent problems.
<P>LZRW4 uses a hash table of 4096 partitions each containing 32 pointers to the
input which is held in a 1 Meg buffer. At the start of each phrase, LZRW4 hashes
the previous two characters to obtain a partition number in the range 0..4095.
The 32 pointers in that partition are searched for the longest match, and the
length (3 bits = [2,3,4,5,6,7,8,16]) and pointer number (5 bits = [0..31]) are
coded and transmitted. LZRW4 uses a single control bit for each phrase to say whether a
literal byte or a copy byte is coming next. These control bits are buffered as
in the other LZRW* algorithms. The equal length of literal and copy items is an
interesting feature of the algorithm which seems bound to be useful to someone
someday!
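<P>The item coding just described can be summarized in a short sketch. The
function names are illustrative, but the bit budget follows the text: one
control bit plus eight data bits for both literals and copies.

```c
/* LZRW4 copy lengths: the 3-bit code expresses 2..8 plus 16; lengths
   9..15 are reduced to 8 by the algorithm before coding. */
#define MEDMATCH 8
#define MAXMATCH 16

static unsigned length_code(unsigned len)
{
    if (len == MAXMATCH) return 7;     /* code 7 stands for length 16 */
    if (len > MEDMATCH) len = MEDMATCH;
    return len - 2;                    /* codes 0..6 stand for lengths 2..8 */
}

/* A copy item is 1 control bit + 3 length bits + 5 pointer bits;
   a literal item is 1 control bit + 8 literal bits: both are 9 bits. */
static unsigned copy_item_bits(void)    { return 1 + 3 + 5; }
static unsigned literal_item_bits(void) { return 1 + 8; }
```

The equal item sizes fall out directly: 1+3+5 = 1+8 = 9 bits.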
<P>To update the partition, LZRW4 swaps the winning pointer with the pointer
halfway towards the front of the partition. If the winning pointer matched less
than 4 bytes then a pointer to the current position (start of the Ziv) is
shifted into the partition.
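<P>This update policy can be sketched in isolation. The routine below is an
illustrative restatement of the two rules (promote the winner halfway to the
front; shift in the current position after a short match), with assumed names;
the real logic is embedded in the Appendix code.

```c
#define HASHDEPTH 32       /* pointers per partition */
#define NOSHIFTTHRESH 4    /* this long a match or longer and we don't shift */

typedef unsigned char UBYTE;

/* Update one partition after a match of match_len bytes against the
   pointer at index winner. current_pos is the start of the Ziv. */
static void update_partition(UBYTE **queue, unsigned winner,
                             unsigned match_len, UBYTE *current_pos)
{
    unsigned j;
    if (match_len > 1)               /* promote winner halfway to the front */
    {
        UBYTE *t = queue[winner / 2];
        queue[winner / 2] = queue[winner];
        queue[winner] = t;
    }
    if (match_len < NOSHIFTTHRESH)   /* short match: shift in new position */
    {
        for (j = HASHDEPTH - 1; j > 0; j--)
            queue[j] = queue[j - 1];
        queue[0] = current_pos;
    }
}
```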
<P>See the code in Appendix A for the details. An understanding of the LZRW3-A
algorithm may also assist.
<P>The results of LZRW4 are not startling, but demonstrate that the idea is
working. By using contexts, LZRW4 gets away with using 3-bit lengths and 5-bit
"indexes" whereas its ancestor algorithm LZRW3-A uses 4-bit lengths and 12-bit
"indexes", yielding comparable compression. Here are some rough results on some
Calgary corpus files (percentage remaining): <PRE>         LZRW4   LZRW3-A   Unix Compress
bib       39.5     44.1        41.8
book1     51.4     54.7        43.2
book2     41.2     45.5        41.1
obj1      65.5     58.8        65.3
obj2      43.4     43.8        52.1
paper1    46.5     46.1        47.2
progc     46.4     45.2        48.3
trans     29.1     32.3        40.8
</PRE>I have experimented with lots of variants of LZRW4 but have not formalized
them nor written them up. One particularly interesting idea is to subdivide each
partition into subpartitions based on the first character of the phrase. LZRW3-A
partition management techniques can then be used.
<P>
<H2>Closing Thoughts</H2>
<P>I don't know if this idea is original or whether it will yield any benefit.
However, I have not heard of any Ziv/Lempel/Markov hybrids to date, and the
experiments above indicate at least that the idea works, if not that it has real
competitive benefit. With the right tuning, the use of contexts could provide
extra compression with little impact on speed (but usually a lot of impact on
memory!). At the very least, it is another weapon in the data compression
armoury.
<P>
<H2>References</H2>
<P>[Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for Compressing
Individual Sequences", IEEE Transactions on Information Theory, Vol. 29, No. 2,
pp. 284-287, 1983.
<P>[Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model Structures for
Data Compression", IBM Research Report RJ-4163 (46091) 1/18/84, IBM Research
Laboratory, San Jose, CA 95193, 1984.
<P>[Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and Coding", IEEE
Transactions on Information Theory, Vol. 27, No. 1, pp. 12-23, 1981.
<P>[Williams91] Williams R.N., "Adaptive Data Compression", Kluwer Academic
Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061, 1991.
<P>[Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3, pp.
337-343, 1977.
<P>[Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via
Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5,
pp. 530-536, 1978.
<P>
<H2>Appendix: LZRW4: Experimental Code</H2>
<P>The following code is public domain and is provided "as is". Like most
experimental code, it is poorly designed, structured, coded, and commented. <PRE>#include <stdio.h>
/* #include <unix.h> */
#include <fcntl.h>
#include "port.h"
#define MY_BUFFER_SIZE (1024*1024)
#define HASHLENGTH 4096
#define HASHDEPTH 32
#define CONTEXT 2
#define MEDMATCH 8 /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
#define MAXMATCH 16
#define NOSHIFTTHRESH 4 /* This or longer match and we don't shift */
main(argc,argv)
int argc;
char **argv;
{
typedef UBYTE *queue[HASHDEPTH];
/* Element zero is most recent element. */
UBYTE buf[MY_BUFFER_SIZE+2000];
queue hash[HASHLENGTH];
char *input_file_name;
int infile;
ULONG bits_in =0;
ULONG bits_out =0;
if (argc != 2)
{
printf("To run lzcontext alg: x <filename>\n");
exit(0);
}
printf("Welcome to LZCONTEXT\n");
printf("--------------------\n");
printf("Buffer size = %u\n",MY_BUFFER_SIZE);
printf("Hash table length = %u\n",HASHLENGTH );
printf("Hash table depth = %u\n",HASHDEPTH );
printf("Context = %u\n",CONTEXT );
printf("Med match = %u\n",MEDMATCH );
printf("Max match = %u\n",MAXMATCH );
printf("No shift threshold = %u\n",NOSHIFTTHRESH );
input_file_name=argv[1];
infile =open(input_file_name ,O_RDONLY); /* O_BINARY*/
while(TRUE)
{
ULONG i,j,k;
UBYTE *p_next_byte;
ULONG bytes_in;
/* Read in a block of bytes to compress. */
bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
if (bytes_in == 0)
break;
/* Initialize the hash table so all entries point to a defined string. */
for (i=0;i<HASHLENGTH;i++)
for (j=0;j<HASHDEPTH;j++)
hash[i][j]=(UBYTE *) "0123456789ABCDEF";
/* Get past the first few bytes. */
bits_in += CONTEXT*8;
bits_out += CONTEXT*9;
p_next_byte=buf+CONTEXT;
/* Loop once for each item. The end is sloppy but this is only
approximate so who cares? */
while (p_next_byte < (buf+bytes_in))
{
UWORD index;
ULONG maxlength;
UBYTE **this_queue;
UBYTE *old_p_next_byte=p_next_byte;
ULONG longest_match_index;
/* Hash the context of the previous two characters. */
/* index=(*(p_next_byte-2)<<4) ^ (*(p_next_byte-1)); */
index=((*(p_next_byte-2)<<8) ^ (*(p_next_byte-1))) >> 4;
/* index&=0x3FFF; */
index&=0xFFF;
this_queue=&hash[index][0];
/* Find the longest match in the 32 recent positions. */
/* Note: Because we are only calculating entropy here, we don't */
/* actually need to know the number of the best position. */
maxlength=0;
for (j=0;j<HASHDEPTH;j++)
{
UBYTE *p=this_queue[j];
for (k=0;k<MAXMATCH;k++)
if (p_next_byte[k] != p[k])
break;
if (k>maxlength)
{
maxlength=k;
longest_match_index=j;
}
}
/* Both literals and copies output one control bit and eight
data bits. They differ on how much input they consume. */
if (maxlength == 0) maxlength=1;
if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
bits_in += 8*maxlength;
bits_out += 9;
p_next_byte+=maxlength;
/* If there was a match of 2 or greater, swap the matching element
with the one half the distance to the head. */
if (maxlength > 1)
{
UBYTE *t;
ULONG half=longest_match_index/2;
t=this_queue[half];
this_queue[half]=this_queue[longest_match_index];
this_queue[longest_match_index]=t;
}
/* Shift the queue and put the new value in the recent end. */
if (maxlength < NOSHIFTTHRESH)
{
for (j=HASHDEPTH-1;j>0;j--)
this_queue[j]=this_queue[j-1];
this_queue[0]=old_p_next_byte;
}
}
}
close(infile);
printf("Bytes in = %lu.\n",bits_in /8);
printf("Bytes out = %lu.\n",bits_out/8);
printf("Percentage remaining=%5.1f\n",100.0*((float) bits_out)/((float)
bits_in));
}
void error(mess)
/* Prints out the string and terminates. */
char mess[];
{
printf("%s\n",mess);
exit(0);
}
</PRE>
<HR>
Return to the <A href="http://www.cs.pdx.edu/~idr/index.html">Epsilon
Compression Page</A> </BODY></HTML>