well.) However, the decoder will face the problem of detecting the end of the message, to determine when to stop decoding. After all, the single number 0.0 could represent any of a, aa, aaa, aaaa, .... To resolve the ambiguity we ensure that each message ends with a special EOF symbol known to both encoder and decoder. For the alphabet of Table 1, ! will be used to terminate messages, and only to terminate messages. When the decoder sees this symbol it stops decoding.

Relative to the fixed model of Table 1, the entropy of the five-symbol message eaii! is

    -log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1  =  -log 0.00006  ~  4.22

(using base 10, since the above encoding was performed in decimal). This explains why it takes five decimal digits to encode the message. In fact, the size of the final range is 0.2336 - 0.23354 = 0.00006, and the entropy is the negative logarithm of this figure. Of course, we normally work in binary, transmitting binary digits and measuring entropy in bits.

Five decimal digits seems a lot to encode a message comprising four vowels! It is perhaps unfortunate that our example ended up expanding rather than compressing. Needless to say, however, different models give different entropies. The best single-character model of the message eaii! is the set of symbol frequencies {e (0.2), a (0.2), i (0.4), ! (0.2)}, which gives an entropy for eaii! of 2.89 decimal digits. Using this model the encoding would be only three digits long. Moreover, as noted earlier, more sophisticated models give much better performance in general.

A program for arithmetic coding

Figure 2 shows a pseudo-code fragment which summarizes the encoding and decoding procedures developed in the last section. Symbols are numbered 1, 2, 3, .... The frequency range for the ith symbol runs from cum_freq[i] to cum_freq[i-1]; cum_freq[i] increases as i decreases, and cum_freq[0] = 1.
(The reason for this "backwards" convention is that cum_freq[0] will later hold a normalizing factor, and it is convenient to have it begin the array.) The "current interval" is [low, high); for both encoding and decoding it should be initialized to [0, 1).

Unfortunately, Figure 2 is overly simplistic. In practice, several factors complicate both encoding and decoding:

1. Incremental transmission and reception. The encode algorithm as described transmits nothing until the entire message has been encoded; neither does the decode algorithm begin decoding until it has received the complete transmission. In most applications an incremental mode of operation is necessary.

2. The desire to use integer arithmetic. The precision required to represent the [low, high) interval grows with the length of the message. Incremental operation helps overcome this, but the potential for overflow and underflow must still be examined carefully.

3. Representing the model so that it can be consulted efficiently. The representation used for the model should minimize the time required for the decode algorithm to identify the next symbol. Moreover, an adaptive model should be organized to minimize the time-consuming task of maintaining cumulative frequencies.

Figure 3 shows working code, in C, for arithmetic encoding and decoding. It is considerably more detailed than the bare-bones sketch of Figure 2! Implementations of two different models are given in Figure 4; the Figure 3 code can use either one.

The remainder of this section examines the code of Figure 3 more closely, including a proof that decoding remains correct in the integer implementation and a review of the constraints on word lengths in the program.

Representing the model. Implementations of models are discussed in the next section; here we are concerned only with the interface to the model (lines 20-38). In C, a byte is represented as an integer between 0 and 255 (call this a char).
Internally, we represent a byte as an integer between 1 and 257 inclusive (call this an index), EOF being treated as a 257th symbol. It is advantageous to sort the model into frequency order, to minimize the number of executions of the decoding loop (line 189). To permit such reordering, the char/index translation is implemented as a pair of tables, index_to_char[] and char_to_index[]. In one of our models these tables simply form the index by adding 1 to the char, but the other implements a more complex translation that assigns small indexes to frequently used symbols.

The probabilities in the model are represented as integer frequency counts, and cumulative counts are stored in the array cum_freq[]. As previously, this array is "backwards," and the total frequency count, which is used to normalize all frequencies, appears in cum_freq[0]. Cumulative counts must not exceed a predetermined maximum, Max_frequency, and the model implementation must prevent overflow by scaling appropriately. It must also ensure that neighboring values in the cum_freq[] array differ by at least 1; otherwise the affected symbol could not be transmitted.

Incremental transmission and reception. Unlike Figure 2, the program of Figure 3 represents low and high as integers. A special data type, code_value, is defined for these quantities, together with some useful constants: Top_value, representing the largest possible code_value, and First_qtr, Half, and Third_qtr, representing parts of the range (lines 6-16). Whereas in Figure 2 the current interval is represented by [low, high), in Figure 3 it is [low, high]; that is, the range now includes the value of high. Actually, it is more accurate (though more confusing) to say that in the program of Figure 3 the interval represented is [low, high + 0.11111...). This is because when the bounds are scaled up to increase the precision, 0's are shifted into the low-order bits of low but 1's are shifted into high.
While it is possible to write the program to use a different convention, this one has some advantages in simplifying the code.

As the code range narrows, the top bits of low and high become the same. Any bits that are the same can be transmitted immediately, since they cannot be affected by future narrowing. For encoding, since we know that low <= high, this requires code like

    for (;;) {
        if (high < Half) {
            output_bit(0);  low = 2*low;  high = 2*high+1;
        }
        else if (low >= Half) {
            output_bit(1);  low = 2*(low-Half);  high = 2*(high-Half)+1;
        }
        else break;
    }

which ensures that, upon completion, low < Half <= high. This can be found in lines 95-113 of encode_symbol(), although there are some extra complications caused by underflow possibilities (see the next subsection). Care is taken to shift 1's in at the bottom when high is scaled, as noted above.

Incremental reception is done using a number called value, as in Figure 2, in which processed bits flow out at the top (high-significance) end and newly received ones flow in at the bottom. start_decoding() (lines 168-176) fills value with received bits initially. Once decode_symbol() has identified the next input symbol, it shifts out the now-useless high-order bits that are the same in low and high, shifting value by the same amount (and replacing lost bits by fresh input bits at the bottom end):

    for (;;) {
        if (high < Half) {
            value = 2*value + input_bit();
            low = 2*low;  high = 2*high+1;
        }
        else if (low >= Half) {
            value = 2*(value-Half) + input_bit();
            low = 2*(low-Half);  high = 2*(high-Half)+1;
        }
        else break;
    }

(see lines 194-213, again complicated by precautions against underflow, discussed below).
Proof of decoding correctness. At this point it is worth checking that the identification of the next symbol by decode_symbol() works properly. Recall from Figure 2 that decode_symbol() must use value to find the symbol that, when encoded, reduces the range to one that still includes value. Lines 186-188 in decode_symbol() identify the symbol for which

    cum_freq[symbol] <= floor( ((value-low+1)*cum_freq[0] - 1) / (high-low+1) ) < cum_freq[symbol-1],

where floor() denotes the "integer part" function that comes from integer division with truncation. It is shown in the Appendix that this implies

    low + floor( (high-low+1)*cum_freq[symbol] / cum_freq[0] )
        <= value <=
    low + floor( (high-low+1)*cum_freq[symbol-1] / cum_freq[0] ) - 1,

so value lies within the new interval that decode_symbol() calculates in lines 190-193. This is sufficient to guarantee that the decoding operation identifies each symbol correctly.

Underflow. As Figure 1 shows, arithmetic coding works by scaling the cumulative probabilities given by the model into the interval [low, high] for each character transmitted. Suppose low and high become so close together that this scaling operation maps different symbols of the model onto the same integer in the [low, high] interval. This would be disastrous, because if such a symbol actually occurred it would not be possible to continue encoding. Consequently, the encoder must guarantee that the interval [low, high] is always large enough to prevent this. The simplest way is to ensure that this interval is at least as large as Max_frequency, the maximum allowed cumulative frequency count (line 36).

How could this condition be violated? The bit-shifting operation explained above ensures that low and high can only become close together when they straddle Half.
Suppose in fact they become as close as

    First_qtr <= low < Half <= high < Third_qtr.

Then the next two bits sent will have opposite polarity: either 01 or 10. For example, if the next bit turns out to be 0 (i.e., high descends below Half and [0, Half] is expanded to the full interval), the bit after that will be 1, since the range has to be above the midpoint of the expanded interval. Conversely, if the next bit happens to be 1, the one after that will be 0. Therefore the interval can safely be expanded right now, provided only that we remember that whatever bit actually comes next, its opposite must be transmitted afterwards as well. Thus lines 104-109 expand [First_qtr, Third_qtr] into the whole interval, recording in bits_to_follow that the bit output next must be followed by an opposite bit. This explains why all output is done via bit_plus_follow() (lines 128-135) instead of directly with output_bit().

But what if, after this operation, it is still true that

    First_qtr <= low < Half <= high < Third_qtr?

Figure 5 illustrates this situation, where the current [low, high] range (shown as a thick line) has been expanded a total of three times. Suppose the next bit turns out to be 0, as indicated by the arrow in Figure 5(a) being below the halfway point. Then the next three bits will be 1's, since the arrow is not only in the top half of the bottom half of the original range but in the top quarter, and moreover the top eighth, of that half; that is why the expansion can occur three times. Similarly, as Figure 5(b) shows, if the next bit turns out to be a 1, it will be followed by three 0's. Consequently we need only count the number of expansions and follow the next bit by that number of opposite bits (lines 106 and 131-134).

Using this technique, the encoder can guarantee that, after the shifting operations, either

    low < First_qtr < Half <= high        (1a)

or

    low < Half < Third_qtr <= high.       (1b)
Therefore, as long as the integer range spanned by the cumulative frequencies fits into a quarter of that provided by code_values, the underflow problem cannot occur. This corresponds to the condition

    Max_frequency <= (Top_value + 1)/4 + 1,

which is satisfied by Figure 3, since Max_frequency = 2^14 - 1 and Top_value = 2^16 - 1 (lines 36, 9). More than 14 bits cannot be used to represent cumulative frequency counts without increasing the number of bits allocated to code_values.

We have discussed underflow in the encoder only. Since the decoder's job, once each symbol has been decoded, is to track the operation of the encoder, underflow will be avoided if it performs the same expansion operation under the same conditions.

Overflow. Now consider the possibility of overflow in the integer multiplications corresponding to those of Figure 2, which occur in lines 91-94 and 190-193 of Figure 3. Overflow cannot occur provided the product

    range * Max_frequency

fits within the integer word length available, since cumulative frequencies cannot exceed Max_frequency. Range might be as large as Top_value + 1, so the largest possible product in Figure 3 is 2^16 (2^14 - 1), which is less than 2^30. Long declarations are used for code_value (line 7) and range (lines 89, 183) to ensure that arithmetic is done to 32-bit precision.

Constraints on the implementation. The constraints on word length imposed by underflow and overflow can be simplified by assuming that frequency counts are represented in f bits and code_values in c bits. The implementation will work correctly provided

    f <= c - 2
    f + c <= p, the precision to which arithmetic is performed.

In most C implementations, p = 31 if long integers are used, and p = 32 if they are unsigned long. In Figure 3, f = 14 and c = 16. With appropriately modified declarations, unsigned long arithmetic with f = 15, c = 17 could be used.
In assembly language, c = 16 is a natural choice because it expedites some comparisons and bit manipulations (e.g., those of lines 95-113 and 194-213).

If p is restricted to 16 bits, the best values possible are c = 9 and f = 7, making it impossible to encode a full alphabet of 256 symbols, as each symbol must have a count of at least 1. A smaller alphabet (e.g., the letters, or 4-bit nibbles) could still be handled.

Termination. To finish the transmission, it is necessary to send a unique terminating symbol (EOF_symbol, line 56) and then follow it by enough bits to ensure
