.pn 0
.ls1
.EQ
delim $$
.EN
.ev1
.ps-2
.vs-2
.ev
\&
.sp 5
.ps+4
.ce
ARITHMETIC CODING FOR DATA COMPRESSION
.ps-4
.sp4
.ce
Ian H. Witten, Radford M. Neal, and John G. Cleary
.sp2
.ce4
Department of Computer Science
The University of Calgary
2500 University Drive NW
Calgary, Canada T2N 1N4
.sp2
.ce
August, 1986, revised January 1987
.sp 8
.in+1i
.ll-1i
The state of the art in data compression is arithmetic coding, not
the better-known Huffman method.
Arithmetic coding gives greater compression, is faster for adaptive models,
and clearly separates the model from the channel encoding.
This paper presents a practical implementation of the technique.
.sp 3
.in +0.5i
.ti -0.5i
\fICR Categories and subject descriptors:\fR
.br
E.4 [DATA] Coding and Information Theory \(em Data Compaction and Compression
.br
H.1.1 [Models and Principles] Systems and Information Theory \(em Information Theory
.sp
.ti -0.5i
\fIGeneral terms:\fR algorithms, performance
.sp
.ti -0.5i
\fIAdditional key words and phrases:\fR arithmetic coding, Huffman coding, adaptive modeling
.ll+1i
.in 0
.bp
.sh "Introduction"
.pp
Arithmetic coding is superior in most respects to the better-known
Huffman (1952) method.
.[
Huffman 1952 method construction minimum-redundancy codes
.]
It represents information at least as compactly, sometimes considerably more so.
Its performance is optimal without the need for blocking of input data.
It encourages a clear separation between the model for representing data and
the encoding of information with respect to that model.
It accommodates adaptive models easily.
It is computationally efficient.
Yet many authors and practitioners seem unaware of the technique.
Indeed there is a widespread belief that Huffman coding cannot be improved upon.
.pp
This paper aims to rectify the situation by presenting an accessible
implementation of arithmetic coding, and detailing its performance characteristics.
The next section briefly reviews basic concepts of data compression and
introduces the model-based approach which underlies most modern techniques.
We then outline the idea of arithmetic coding using a simple example.
Following that, programs are presented for both encoding and decoding, in which
the model occupies a separate module so that different ones can easily be used.
Next we discuss the construction of fixed and adaptive models.
The subsequent section details the compression efficiency and execution time
of the programs, including the effect of different arithmetic word lengths on
compression efficiency.
Finally, we outline a few applications where arithmetic coding is appropriate.
.sh "Data compression"
.pp
To many, data compression conjures up an assortment of \fIad hoc\fR
techniques such as converting spaces in text to tabs, creating special codes
for common words, or run-length coding of picture data (e.g. see Held, 1984).
.[
Held 1984 data compression techniques applications
.]
This contrasts with the more modern model-based paradigm for coding,
where from an \fIinput string\fR of symbols and a \fImodel\fR, an
\fIencoded string\fR is produced which is (usually) a compressed version of the input.
The decoder, which must have access to the same model,
regenerates the exact input string from the encoded string.
Input symbols are drawn from some well-defined set such as the ASCII or
binary alphabets;
the encoded string is a plain sequence of bits.
The model is a way of calculating, in any given context, the distribution of
probabilities for the next input symbol.
It must be possible for the decoder to produce exactly the same probability
distribution in the same context.
Compression is achieved by transmitting the more probable symbols in fewer
bits than the less probable ones.
.pp
For example, the model may assign a predetermined probability to each symbol
in the ASCII alphabet.
No context is involved.
These probabilities may be determined by counting frequencies in
representative samples of text to be transmitted.
Such a \fIfixed\fR model is communicated in advance to both encoder and
decoder, after which it is used for many messages.
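.pp
As a concrete illustration of how such a fixed model might be prepared
(the sketch is ours, not one of the programs presented later), the C fragment
below counts symbol frequencies in a representative sample file and
normalizes them to probabilities.
The 256-symbol byte alphabet and the initial count of one per symbol are
illustrative assumptions only.
.LB
.nf
#include <stdio.h>

#define NO_OF_SYMBOLS 256     /* assume a byte-oriented alphabet */

/* Hypothetical helper: derive a fixed model from a sample file.
   The resulting table would be given to both encoder and decoder
   in advance, and then used unchanged for many messages.         */
void build_fixed_model(FILE *sample, double prob[NO_OF_SYMBOLS])
{   long freq[NO_OF_SYMBOLS];
    long total;
    int ch, i;
    for (i = 0; i < NO_OF_SYMBOLS; i++) freq[i] = 1;
    total = NO_OF_SYMBOLS;    /* counts start at one, see note below */
    while ((ch = getc(sample)) != EOF) { freq[ch] += 1; total += 1; }
    for (i = 0; i < NO_OF_SYMBOLS; i++) prob[i] = (double) freq[i] / total;
}
.fi
.LE
Starting every count at one is merely one simple way of guaranteeing that
no symbol is assigned probability zero;
a symbol with probability zero could not be coded at all if it did occur
in a message.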
.pp
Alternatively, the probabilities the model assigns may change as each symbol
is transmitted, based on the symbol frequencies seen \fIso far\fR in this message.
This is an \fIadaptive\fR model.
There is no need for a representative sample of text, because each message
is treated as an independent unit, starting from scratch.
The encoder's model changes with each symbol transmitted, and the decoder's
changes with each symbol received, in sympathy.
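.pp
The discipline that keeps the two models in step is simply that the counts are
updated only \fIafter\fR a symbol has been coded or decoded.
The sketch below illustrates this;
it is our own schematic outline, under the same byte-alphabet assumption as
before, not the model module developed later in the paper, and the function
names are hypothetical.
.LB
.nf
#define NO_OF_SYMBOLS 256          /* assume a byte-oriented alphabet */

static long freq[NO_OF_SYMBOLS];   /* counts of symbols seen so far in this message */
static long total;

/* Each message starts from scratch with a uniform model. */
void start_adaptive_model(void)
{   int i;
    for (i = 0; i < NO_OF_SYMBOLS; i++) freq[i] = 1;
    total = NO_OF_SYMBOLS;
}

/* Probability the model currently assigns to a symbol.
   Both encoder and decoder consult this before coding the symbol... */
double symbol_probability(int symbol)
{   return (double) freq[symbol] / total;
}

/* ...and both call this afterwards, so the two models change in sympathy. */
void update_counts(int symbol)
{   freq[symbol] += 1;
    total += 1;
}
.fi
.LE
Because the encoder updates its counts only after coding a symbol, and the
decoder only after decoding one, the decoder always sees exactly the
probability distribution the encoder used for that symbol.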
.pp
More complex models can provide more accurate probabilistic predictions and
hence achieve greater compression.
For example, several characters of previous context could condition the
next-symbol probability.
Such methods have enabled mixed-case English text to be encoded in around
2.2\ bit/char with two quite different kinds of model.
(Cleary & Witten, 1984b; Cormack & Horspool, 1985).
.[
Cleary Witten 1984 data compression
%D 1984b
.]
.[
Cormack Horspool 1985 dynamic Markov
%O April
.]
Techniques which do not separate modeling from coding so distinctly, like
that of Ziv & Lempel (1978), do not seem to show such great potential for
compression, although they may be appropriate when the aim is raw speed
rather than compression performance (Welch, 1984).
.[
Ziv Lempel 1978 compression of individual sequences
.]
.[
Welch 1984 data compression
.]
.pp
The effectiveness of any model can be measured by the \fIentropy\fR of the
message with respect to it, usually expressed in bits/symbol.
Shannon's fundamental theorem of coding states that given messages randomly
generated from a model, it is impossible to encode them into fewer bits
(on average) than the entropy of that model (Shannon & Weaver, 1949).
.[
Shannon Weaver 1949
.]
.pp
A message can be coded with respect to a model using either Huffman or
arithmetic coding.
The former method is frequently advocated as the best possible technique for
reducing the encoded data rate.
But it is not.
Given that each symbol in the alphabet must translate into an integral number
of bits in the encoding, Huffman coding indeed achieves ``minimum redundancy''.
In other words, it performs optimally if all symbol probabilities are
integral powers of 1/2.
But this is not normally the case in practice;
indeed, Huffman coding can take up to one extra bit per symbol.
The worst case is realized by a source in which one symbol has probability
approaching unity.
Symbols emanating from such a source convey negligible information on average,
but require at least one bit to transmit (Gallagher, 1978).
.[
Gallagher 1978 variations on a theme by Huffman
.]
Arithmetic coding dispenses with the restriction that each symbol translates
into an integral number of bits, thereby coding more efficiently.
It actually achieves the theoretical entropy bound to compression efficiency
for any source, including the one just mentioned.
.pp
In general, sophisticated models expose the deficiencies of Huffman coding
more starkly than simple ones.
This is because they more often predict symbols with probabilities close to
one, the worst case for Huffman coding.
For example, the techniques mentioned above which code English text in
2.2\ bit/char both use arithmetic coding as the final step, and performance
would be impacted severely if Huffman coding were substituted.
Nevertheless, since our topic is coding and not modeling, the illustrations in
this paper all employ simple models.
Even then, as we shall see, Huffman coding is inferior to arithmetic coding.
.pp
The basic concept of arithmetic coding can be traced back to Elias in the
early 1960s (see Abramson, 1963, pp 61-62).
.[
Abramson 1963
.]
Practical techniques were first introduced by Rissanen (1976) and
Pasco (1976), and developed further in Rissanen (1979).
.[
Rissanen 1976 Generalized Kraft Inequality
.]
.[
Pasco 1976
.]
.[
Rissanen 1979 number representations
.]
.[
Langdon 1981 tutorial arithmetic coding
.]
Details of the implementation presented here have not appeared in the
literature before; Rubin (1979) is closest to our approach.
.[
Rubin 1979 arithmetic stream coding
.]
The reader interested in the broader class of arithmetic codes is referred
to Rissanen & Langdon (1979);
.[
Rissanen Langdon 1979 Arithmetic coding
.]
a tutorial is available in Langdon (1981).
.[
Langdon 1981 tutorial arithmetic coding
.]
Despite these publications, the method is not widely known.
A number of recent books and papers on data compression mention it only in
passing, or not at all.
.sh "The idea of arithmetic coding"
.pp
In arithmetic coding a message is represented by an
interval of real numbers between 0 and 1.
As the message becomes longer, the interval needed to represent it becomes
smaller, and the number of bits needed to specify that interval grows.
Successive symbols of the message reduce the size of the
interval in accordance with the symbol probabilities generated by the model.
The more likely symbols reduce the range by less than the unlikely symbols,
and hence add fewer bits to the message.
.pp
Before anything is transmitted, the range for the message is the entire
interval [0,\ 1)\(dg.
.FN
\(dg [0,\ 1) denotes the half-open interval 0\(<=\fIx\fR<1.
.EF
As each symbol is processed, the range is narrowed to that portion of it
allocated to the symbol.
For example, suppose the alphabet is {\fIa,\ e,\ i,\ o,\ u,\ !\fR}, and a
fixed model is used with probabilities shown in Table\ 1.
Imagine transmitting the message \fIeaii!\fR.
Initially, both encoder and decoder know that the range is [0,\ 1).
After seeing the first symbol, \fIe\fR, the encoder narrows it to
[0.2,\ 0.5), the range the model allocates to this symbol.
The second symbol, \fIa\fR, will narrow this new range to the first 1/5 of it,
since \fIa\fR has been allocated [0,\ 0.2).
This produces [0.2,\ 0.26) since the previous range was 0.3 units long and
1/5 of that is 0.06.
The next symbol, \fIi\fR, is allocated [0.5,\ 0.6), which when applied to
[0.2,\ 0.26) gives the smaller range [0.23,\ 0.236).
Proceeding in this way, the encoded message builds up as follows:
.LB
.nf
.ta \w'after seeing 'u +0.5i +\w'[0.23354, 'u
initially		[0,	1)
after seeing	\fIe\fR	[0.2,	0.5)
	\fIa\fR	[0.2,	0.26)
	\fIi\fR	[0.23,	0.236)
	\fIi\fR	[0.233,	0.2336)
	\fI!\fR	[0.23354,	0.2336)
.fi
.LE
Figure\ 1 shows another representation of the encoding process.
The vertical bars with ticks represent the symbol probabilities stipulated
by the model.
After the first symbol, \fIe\fR, has been processed, the model is scaled
into the range [0.2,\ 0.5), as shown in part (a).
The second symbol, \fIa\fR, scales it again into the range [0.2,\ 0.26).
But the picture cannot be continued in this way without a magnifying glass!
Consequently Figure\ 1(b) shows the ranges expanded to full height at every
stage, and marked with a scale which gives the endpoints as numbers.
.pp
Suppose all the decoder knows about the message is the final range,
[0.23354,\ 0.2336).
It can immediately deduce that the first character was \fIe\fR, since the
range lies entirely within the space the model of Table\ 1 allocates for \fIe\fR.
Now it can simulate the operation of the \fIen\fR\^coder:
.LB
.nf
.ta \w'after seeing 'u +0.5i +\w'[0.2, 'u
initially		[0,	1)
after seeing	\fIe\fR	[0.2,	0.5)
.fi
.LE
This makes it clear that the second character of the message is \fIa\fR,
since this will produce the range
.LB
.nf
.ta \w'after seeing 'u +0.5i +\w'[0.2, 'u
after seeing	\fIa\fR	[0.2,	0.26)
.fi
.LE
which entirely encloses the given range [0.23354,\ 0.2336).
Proceeding like this, the decoder can identify the whole message.
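.pp
The narrowing and the decoding just described can be imitated directly in a
few lines of code.
The sketch below is purely illustrative and is not the implementation
presented later:
it works on ordinary floating-point numbers, which would run out of precision
on messages of any length, whereas a practical coder must use finite-precision
arithmetic and deliver its output incrementally.
The ranges for \fIa\fR, \fIe\fR and \fIi\fR are those quoted above;
the entries assumed here for \fIo\fR, \fIu\fR and \fI!\fR merely complete a
model of the form of Table\ 1.
.LB
.nf
#include <stdio.h>

/* Cumulative ranges of the fixed model.  The entries for a, e and i are
   as quoted in the text; those for o, u and ! are assumed here so that
   the six ranges cover [0, 1).                                          */
static const char   symbols[6]  = { 'a', 'e', 'i', 'o', 'u', '!' };
static const double low_end[6]  = { 0.0, 0.2, 0.5, 0.6, 0.8, 0.9 };
static const double high_end[6] = { 0.2, 0.5, 0.6, 0.8, 0.9, 1.0 };

static int index_of(int c)
{   int i;
    for (i = 0; i < 6; i++) if (symbols[i] == c) return i;
    return -1;
}

/* Encode: narrow [low, high) to the part of it allocated to each symbol.
   Assumes the message contains only the six symbols of the alphabet.     */
static void encode(const char *message, double *low, double *high)
{   double range;
    int i;
    *low = 0.0;  *high = 1.0;
    for ( ; *message; message++) {
        i = index_of(*message);
        range = *high - *low;
        *high = *low + range * high_end[i];   /* high first, so the old  */
        *low  = *low + range * low_end[i];    /* value of low is used    */
        printf("after seeing %c  [%g, %g)\n", *message, *low, *high);
    }
}

/* Decode: find the symbol whose range contains the value, then rescale
   the value into that range and repeat.                                 */
static void decode(double value, int length, char *out)
{   int i, k = 0;
    while (length-- > 0) {
        for (i = 0; i < 6; i++)
            if (low_end[i] <= value && value < high_end[i]) break;
        out[k++] = symbols[i];
        value = (value - low_end[i]) / (high_end[i] - low_end[i]);
    }
    out[k] = 0;
}

int main(void)
{   double low, high;
    char decoded[16];
    encode("eaii!", &low, &high);   /* trace ends with [0.23354, 0.2336)     */
    decode(0.23355, 5, decoded);    /* any number inside that range will do  */
    puts(decoded);                  /* prints   eaii!                        */
    return 0;
}
.fi
.LE
Here the decoder is simply told how many symbols to produce;
a real coder needs some convention for recognizing the end of the message.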
.pp
It is not really necessary for the decoder to know both ends of the range
produced by the encoder.
Instead, a single number within the range \(em for example, 0.23355 \(em will suffice.
(Other numbers, like 0.23354, 0.23357, or even 0.23354321, would do just as