lzrw4.txt

Order
  ^
 6|              *.
 5|            *  .            *.          *.
 4|    *.    *    .    *.    *  .        *  .
 3|  *  .  *      .  *  .  *    .  *.  *    .
 2|*    .*        .*    .*      .*  .*      .
 1|
 0|
  +-----|---------|-----|-------|---|-------|-----> Characters
   phrase1    phr2   phr3   phr4  phr5   phr6

This idea is not at all specific to LZW, but could be applied in dozens of ways to many different fast-parsing adaptive algorithms. In fact, my example algorithm LZRW4 below uses an LZ77-like algorithm. Here are some ideas:

   * Use a two-byte context to hash into a table of LZW trees.
   * Grow backwards context trees as in DHPC, but grow LZW parsing trees
     out of each node.
   * Somehow use a single LZW tree BUT START PARSING ONE OR MORE
     CHARACTERS BACK! This avoids having two different kinds of data
     structure.
   * Set up a set of contexts and bolt an LZ77 compressor on the back.
     Each context stores a list of pointers to strings that occurred in
     that context (see LZRW4 below).
   * Use contexts with any other dynamic dictionary scheme.
   * Use a rolling hash function in hardware to avoid hash delays at the
     start of phrases.

...combinations and permutations.

There are two drawbacks to all this. First of all, we are up for a lot more memory. Second, we run up against the context-depth/sample-size tradeoff encountered by Markov algorithms: the deeper the contexts we establish, the more specific they become, but the smaller the amount of information each context has on which to base its mini-model and hence its predictions. The first problem can be overcome with more memory. The second problem requires careful tuning.

LZRW4
-----
To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I did write a program to measure the compression that it would yield. This program is provided in the Appendix on an "as is" basis.

Note: The name LZRW4 is used here to describe an algorithm. However, the eventual official code release of LZRW4 may differ in some details from the algorithm presented here.

With a huge range of LZ algorithms from which to choose, I chose the LZ77 class, with which I am familiar. This avoided patent problems. LZRW4 uses a hash table of 4096 partitions, each containing 32 pointers into the input, which is held in a 1 Meg buffer. At the start of each phrase, LZRW4 hashes the previous two characters to obtain a partition number in the range 0..4095. The 32 pointers in that partition are searched for the longest match, and the length (3 bits, representing lengths [2,3,4,5,6,7,8,16]) and pointer number (5 bits, [0..31]) are coded and transmitted. LZRW4 uses a single control bit for each phrase to say whether a literal byte or a copy byte is coming next. These control bits are buffered as in the other LZRW* algorithms. The equal length of literal and copy items is an interesting feature of the algorithm which seems bound to be useful to someone someday!

To update the partition, LZRW4 swaps the winning pointer with the pointer halfway towards the front of the partition. If the winning pointer matched fewer than 4 bytes, then a pointer to the current position (the start of the Ziv) is shifted into the partition.

See the code in the Appendix for the details. An understanding of the LZRW3-A algorithm may also assist.

The results of LZRW4 are not startling, but they demonstrate that the idea works. By using contexts, LZRW4 gets away with using 3-bit lengths and 5-bit "indexes" whereas its ancestor algorithm LZRW3-A uses 4-bit lengths and 12-bit "indexes", yielding comparable compression.
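
[Editor's sketch, not from the original paper.] To make the item format concrete, here is a small C sketch of the partition hash and the copy-byte packing. The hash is the one that appears in the Appendix code; the helper names and the bit layout inside the copy byte (length code high, pointer number low) are my assumptions, since the text above does not fix a layout.

/* Sketch only: LZRW4-style partition selection and item packing. */
typedef unsigned char UBYTE;

/* Hash the two bytes preceding the phrase into a partition number in
   0..4095 (the same multiplicative hash used in the Appendix code).  */
unsigned partition_number(UBYTE prev2, UBYTE prev1)
{
 unsigned index = ((unsigned) prev2 << 8) ^ (unsigned) prev1;
 return ((40543U * index) >> 4) & 0xFFF;
}

/* Map a match length (2..16) to its 3-bit code.  The eight expressible
   lengths are 2,3,4,5,6,7,8 and 16; lengths 9..15 are truncated to 8,
   exactly as the Appendix code does with MEDMATCH/MAXMATCH.           */
UBYTE length_code(unsigned len)
{
 if (len == 16) return 7;
 if (len >   8) len = 8;            /* 9..15 are not expressible */
 return (UBYTE) (len - 2);          /* 2..8 -> codes 0..6        */
}

/* Pack a copy item: 3-bit length code plus 5-bit pointer number (0..31,
   the position within the partition).  Layout is an assumption.       */
UBYTE copy_item(unsigned len, unsigned pointer_number)
{
 return (UBYTE) ((length_code(len) << 5) | (pointer_number & 0x1F));
}

A literal item is simply the raw byte itself; the decoder distinguishes the two cases using the separately buffered control bits, which is why literal and copy items can be the same length.
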
Here are some rough results on some Calgary corpus files (percentage remaining):

           LZRW4  LZRW3-A  Unix Compress
   bib      39.5     44.1      41.8
   book1    51.4     54.7      43.2
   book2    41.2     45.5      41.1
   obj1     65.5     58.8      65.3
   obj2     43.4     43.8      52.1
   paper1   46.5     46.1      47.2
   progc    46.4     45.2      48.3
   trans    29.1     32.3      40.8

I have experimented with lots of variants of LZRW4 but have not formalized them nor written them up. One particularly interesting idea is to subdivide each partition into subpartitions based on the first character of the phrase. LZRW3-A partition-management techniques can then be used.

Closing Thoughts
----------------
I don't know if this idea is original or whether it will yield any benefit. However, I have not heard of any Ziv/Lempel/Markov hybrids to date, and the experiments above indicate at least that the idea works, if not that it has real competitive benefit. With the right tuning, the use of contexts could provide extra compression with little impact on speed (but usually a lot of impact on memory!). At the very least, it is another weapon in the data compression armoury.

References
----------
[Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for Compressing Individual Sequences", IEEE Transactions on Information Theory, Vol. 29, No. 2, pp. 284-287.

[Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model Structures for Data Compression", IBM Research Report RJ-4163 (46091) 1/18/84, IBM Research Laboratory, San Jose, CA 95193, 1984.

[Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and Coding", IEEE Transactions on Information Theory, Vol. 27, No. 1, pp. 12-23.

[Williams91] Williams R.N., "Adaptive Data Compression", Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.

[Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337-343.

[Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5, pp. 530-536.

Appendix: LZRW4: Experimental Code
----------------------------------
The following code is public domain and is provided "as is". As with most experimental code, it is poorly designed, structured, coded, and commented.
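
[Editor's note.] The listing includes the author's "port.h", which is not reproduced here. A minimal stand-in that is sufficient for this program (an assumption about that header's contents, not the original file) would be:

/* Hypothetical stand-in for "port.h": just enough for this listing. */
typedef unsigned char  UBYTE;   /* Unsigned  8-bit value. */
typedef unsigned short UWORD;   /* Unsigned 16-bit value. */
typedef unsigned long  ULONG;   /* Unsigned 32-bit value. */
#define TRUE 1
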
#include <stdio.h>
#include <stdlib.h>     /* exit()           */
#include <fcntl.h>      /* open()           */
#include <unistd.h>     /* read(), close()  */
#include "port.h"       /* UBYTE, UWORD, ULONG, TRUE */

#define MY_BUFFER_SIZE (1024*1024)
#define HASHLENGTH  4096
#define HASHDEPTH     32
#define CONTEXT        2
#define MEDMATCH       8 /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
#define MAXMATCH      16
#define NOSHIFTTHRESH  4 /* This or longer match and we don't shift.        */

void error(char mess[]);

int main(int argc, char **argv)
{
 typedef UBYTE *queue[HASHDEPTH];       /* Element zero is most recent element.   */
 static UBYTE buf[MY_BUFFER_SIZE+2000]; /* +2000 slack: matching may read past    */
                                        /* the end of the block.                  */
 static queue hash[HASHLENGTH];         /* Static: too large for most stacks.     */
 char *input_file_name;
 int infile;
 ULONG bits_in  = 0;
 ULONG bits_out = 0;

 if (argc != 2)
   {
    printf("To run lzcontext alg: x <filename>\n");
    exit(0);
   }
 printf("Welcome to LZCONTEXT\n");
 printf("--------------------\n");
 printf("Buffer size        = %u\n",MY_BUFFER_SIZE);
 printf("Hash table length  = %u\n",HASHLENGTH);
 printf("Hash table depth   = %u\n",HASHDEPTH);
 printf("Context            = %u\n",CONTEXT);
 printf("Med match          = %u\n",MEDMATCH);
 printf("Max match          = %u\n",MAXMATCH);
 printf("No shift threshold = %u\n",NOSHIFTTHRESH);
 input_file_name=argv[1];
 infile=open(input_file_name,O_RDONLY); /* Add O_BINARY on systems that need it. */
 if (infile < 0)
    error("Cannot open input file.");

 while (TRUE)
   {
    ULONG  i,j,k;
    UBYTE *p_next_byte;
    ULONG  bytes_in;

    /* Read in a block of bytes to compress. */
    bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
    if (bytes_in == 0)
       break;

    /* Initialize the hash table so all entries point to a defined string. */
    for (i=0;i<HASHLENGTH;i++)
       for (j=0;j<HASHDEPTH;j++)
          hash[i][j]=(UBYTE *) "0123456789ABCDEF";

    /* Get past the first few context bytes. */
    bits_in +=CONTEXT*8;
    bits_out+=CONTEXT*9;
    p_next_byte=buf+CONTEXT;

    /* Loop once for each item. */
    /* The end is sloppy but this is only approximate so who cares? */
    while (p_next_byte < (buf+bytes_in))
      {
       ULONG  index;
       UWORD  maxlength;
       UBYTE **this_queue;
       UBYTE *old_p_next_byte=p_next_byte;
       ULONG  longest_match_index=0;

       /* Hash the previous two bytes to a partition number in [0,4095]. */
       index=(*(p_next_byte-2)<<8) ^ (*(p_next_byte-1));
       /* Context of 3:
       index=(*(p_next_byte-3)<<8) ^
             (*(p_next_byte-2)<<4) ^
             (*(p_next_byte-1)   );
       */
       index=(40543*index) >> 4;
       /* index&=0x3FFF; */
       index&=0xFFF;
       this_queue=&hash[index][0];

       /* Find the longest match among the partition's HASHDEPTH recent  */
       /* positions.                                                     */
       /* Note: Because we are only calculating entropy here, we don't   */
       /* actually need to know the number of the best position.         */
       maxlength=0;
       for (j=0;j<HASHDEPTH;j++)
         {
          UBYTE *p=this_queue[j];
          for (k=0;k<MAXMATCH;k++)
             if (p_next_byte[k] != p[k])
                break;
          if (k>maxlength)
            {
             maxlength=k;
             longest_match_index=j;
            }
         }

       /* Both literals and copies output one control bit and eight
          data bits. They differ in how much input they consume. */
       if (maxlength == 0) maxlength=1;
       if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
       p_next_byte+=maxlength;
       bits_in  += 8*maxlength;
       bits_out += 9;

       /* If there was a match of 2 or greater, swap the matching
          element with the one half the distance to the head. */
       if (maxlength > 1)
         {
          UBYTE *t;
          ULONG half=longest_match_index/2;
          t=this_queue[half];
          this_queue[half]=this_queue[longest_match_index];
          this_queue[longest_match_index]=t;
         }

       /* Shift the queue and put the new value in the recent end. */
       if (maxlength < NOSHIFTTHRESH)
         {
          for (j=HASHDEPTH-1;j>0;j--)
             this_queue[j]=this_queue[j-1];
          this_queue[0]=old_p_next_byte;
         }
      }
   }
 close(infile);
 printf("Bytes in  = %lu.\n",bits_in /8);
 printf("Bytes out = %lu.\n",bits_out/8);
 printf("Percentage remaining = %5.1f\n",
        100.0*((float) bits_out)/((float) bits_in));
 return 0;
}

void error(char mess[])
/* Prints out the string and terminates. */
{
 printf("%s\n",mess);
 exit(0);
}

--<End of LZRW4 paper>--
