\documentstyle[psfig,12pt,a4wide]{article}

\begin{document}

\def\baselinestretch{0.95}

\title{{\Large SHORTEN:} \\
  Simple lossless and near-lossless waveform compression}
\author{Tony Robinson \\\\
  Technical report {\sc CUED/F-INFENG/TR.156} \\\\
  Cambridge University Engineering Department, \\
  Trumpington Street, Cambridge, CB2 1PZ, UK}
\date{December 1994}
\maketitle

\begin{abstract}
This report describes a program that performs compression of waveform
files such as audio data.  A simple predictive model of the waveform is
used, followed by Huffman coding of the prediction residuals.  This is
both fast and near optimal for many commonly occurring waveform signals.
The framework is then extended to lossy coding under two conditions:
maximising the segmental signal to noise ratio on a per frame basis, and
coding to a fixed acceptable signal to noise ratio.
\end{abstract}

\section{Introduction}

It is common to store digitised waveforms on computers, and the
resulting files can consume significant amounts of storage space.
General compression algorithms do not perform well on these files as
they fail to take into account the structure of the data and the nature
of the signal contained therein.  Typically a waveform file consists of
signed 16 bit numbers with significant sample to sample correlation.  A
compression utility for these files must be reasonably fast, portable,
accept data in the most popular formats and give significant
compression.  This report describes ``shorten'', a program for the UNIX
and DOS environments which aims to meet these requirements.

A significant application of this program is the compression of speech
files for distribution on CDROM.  This report starts with a description
of this domain, then discusses the two main problems associated with
general waveform compression, namely predictive modelling and residual
coding.  This framework is then extended to lossy coding.  Finally, the
shorten implementation is described and an appendix details the command
line options.

\section{Compression for speech corpora}

One important use for lossless waveform compression is to compress
speech corpora for distribution on CDROM.  State of the art speech
recognition systems require gigabytes of acoustic data for model
estimation, which takes many CDROMs to store.  Use of compression
software both reduces the distribution cost and the number of CDROM
changes required to read the complete data set.

The key factors in the design of compression software for speech corpora
are that there must be no perceptual degradation in the speech signal
and that the decompression routine must be fast and portable.

There has been much research into efficient speech coding techniques and
many standards have been established.  However, most of this work has
been for telephony applications, where dedicated hardware can be used to
perform the coding and where it is important that the resulting system
operates at a well defined bit rate.  In such applications lossy coding
is acceptable, and indeed necessary in order to guarantee that the
system operates at the fixed bit rate.

Similarly, there has been much work on the design of general purpose
lossless compressors for workstation use.  Such systems do not guarantee
any compression for an arbitrary file, but in general achieve worthwhile
compression in reasonable time on general purpose computers.

Speech corpora compression needs some features of both systems.
Lossless compression is an advantage as it guarantees there is no
perceptual degradation in the speech signal.  However, the established
compression utilities do not exploit the known structure of the speech
signal.  Hence {\tt shorten} was written to fill this gap and is now in
use in the distribution of CDROMs containing speech
databases~\cite{GarofoloRobinsonFiscus94}.

The recordings used as examples in section~\ref{ss:model} and
section~\ref{ss:perf} are from the TIMIT corpus, which is distributed as
16 bit, 16kHz linear PCM samples.  This format is in common use for
continuous speech recognition research corpora.  The recordings were
collected using a Sennheiser HMD 414 noise-cancelling head-mounted
microphone in low noise conditions.  All ten utterances from speaker
{\tt fcjf0} are used, which amount to a total of 24 seconds or about
384,000 samples.

\section{Waveform Modelling\label{ss:model}}

Compression is achieved by building a predictive model of the waveform
(a good introduction for speech is Jayant and Noll~\cite{JayantNoll84}).
An established model for a wide variety of waveforms is the
autoregressive model, also known as linear predictive coding (LPC).
Here the predicted waveform is a linear combination of past samples:
\begin{eqnarray}
\hat{s}(t) & = & \sum_{i = 1}^{p} a_i s(t - i) \label{eq:lpc}
\end{eqnarray}
The coded signal, $e(t)$, is the difference between the estimate of the
linear predictor, $\hat{s}(t)$, and the speech signal, $s(t)$:
\begin{eqnarray}
e(t) & = & s(t) - \hat{s}(t) \label{eq:error}
\end{eqnarray}
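In integer terms, equations~\ref{eq:lpc} and~\ref{eq:error} amount to
the following C fragment.  This is a minimal sketch rather than the
shorten source itself: the function name and the convention that the
$p$ samples preceding the block are available as history are
assumptions made for the illustration, and the coefficients are taken
to be quantised to fixed point with {\tt shift} fractional bits.

\begin{verbatim}
/* Compute the prediction residual e(t) = s(t) - s_hat(t) for one
 * block.  s points at the current block; s[-1] ... s[-p] hold the
 * trailing samples of the previous block.  The coefficients a[1..p]
 * are fixed point integers with `shift' fractional bits.
 */
static void lpc_residual(const long *s, long *e, int blocksize,
                         const long *a, int p, int shift)
{
    int t, i;

    for (t = 0; t < blocksize; t++) {
        long sum = 0;
        for (i = 1; i <= p; i++)
            sum += a[i] * s[t - i];          /* s_hat(t), eq. (1) */
        e[t] = s[t] - (sum >> shift);        /* e(t), eq. (2) */
    }
}
\end{verbatim}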
However, many waveforms of interest are not stationary; that is, the
best values for the coefficients of the predictor, $a_i$, vary from one
section of the waveform to another.  It is often reasonable to assume
that the signal is pseudo-stationary, i.e.\ that there exists a
time-span over which reasonable values for the linear predictor can be
found.  Thus the three main stages in the coding process are blocking,
predictive modelling, and residual coding.

\subsection{Blocking}

The time frame over which samples are blocked depends to some extent on
the nature of the signal.  It is inefficient to block on too short a
time scale, as this incurs an overhead in the computation and
transmission of the prediction parameters.  It is also inefficient to
use a time scale over which the signal characteristics change
appreciably, as this results in a poorer model of the signal.  However,
in the implementation described below the linear predictor parameters
typically take much less information to transmit than the residual
signal, so the choice of window length is not critical.  The default
value in the shorten implementation is 256 samples, which results in
16ms frames for a signal sampled at 16 kHz.

Sample interleaved signals are handled by treating each data stream as
independent.  Even in cases where there is a known correlation between
the streams, such as in stereo audio, the within-channel correlations
are often significantly greater than the cross-channel correlations, so
for lossless or near-lossless coding the exploitation of this additional
correlation results in only small additional gains.

A rectangular window is used in preference to any tapering window as the
aim is to model just those samples within the block, not the spectral
characteristics of the segment surrounding the block.  The window length
is longer than the block size by the prediction order, which is
typically three samples.
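The blocking scheme might be sketched as follows.  The names here are
hypothetical and the buffer layout is an assumption made for the
illustration: the signal is taken to start with $p$ zero history
samples, and {\tt encode\_block()} stands for the modelling and
residual coding stages described in the following sections.

\begin{verbatim}
/* Split a signal into fixed-size blocks for coding.  The buffer is
 * assumed to begin with p zero samples of history so that a pth order
 * predictor can run from the first block; thereafter the last p
 * samples of each block serve as the history of the next.
 */
extern void encode_block(const long *block, int blocksize, int p);

static void encode_signal(const long *buf, int nsamples,
                          int blocksize, int p)
{
    int start, n;

    for (start = p; start < p + nsamples; start += blocksize) {
        n = p + nsamples - start;
        if (n > blocksize)
            n = blocksize;
        encode_block(buf + start, n, p);  /* history at buf[start-p] */
    }
}
\end{verbatim}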
\subsection{Linear Prediction\label{sect:lpc}}

Shorten supports two forms of linear prediction: the standard $p$th
order LPC analysis of equation~\ref{eq:lpc}, and a restricted form
whereby the coefficients are selected from one of four fixed polynomial
predictors.

In the case of the general LPC algorithm, the prediction coefficients,
$a_i$, are quantised in accordance with the same Laplacian distribution
used for the residual signal and described in section~\ref{sect:resid}.
The expected number of bits per coefficient is 7, as this was found to
be a good tradeoff between modelling accuracy and model storage.  The
standard Durbin's algorithm for computing the LPC coefficients from the
autocorrelation coefficients is used in an incremental way.  On each
iteration the mean squared value of the prediction residual is
calculated, and this is used to compute the expected number of bits
needed to code the residual signal.  This is added to the number of bits
needed to code the prediction coefficients, and the LPC order is
selected to minimise the total.  As the computation of the
autocorrelation coefficients is the most expensive step in this process,
the search for the optimal model order is terminated when the last two
models have resulted in a higher bit rate.  Whilst it is possible to
construct signals that defeat this search procedure, in practice for
speech signals it has been found that the occasional use of a lower
prediction order results in an insignificant increase in the bit rate,
and has the additional side effect of requiring less compute to decode.

A restricted form of the linear predictor has been found to be useful.
In this case the prediction coefficients are those specified by fitting
a polynomial through the last $p$ data points, e.g.\ a straight line
through the last two points:
\begin{eqnarray}
\hat{s}_0(t) & = & 0 \\
\hat{s}_1(t) & = & s(t-1) \\
\hat{s}_2(t) & = & 2 s(t-1) - s(t-2) \\
\hat{s}_3(t) & = & 3 s(t-1) - 3 s(t-2) + s(t-3)
\end{eqnarray}
Writing $e_i(t)$ for the error signal from the $i$th polynomial
predictor:
\begin{eqnarray}
e_0(t) & = & s(t) \label{eq:polyinit} \\
e_1(t) & = & e_0(t) - e_0(t - 1) \\
e_2(t) & = & e_1(t) - e_1(t - 1) \\
e_3(t) & = & e_2(t) - e_2(t - 1) \label{eq:polyquit}
\end{eqnarray}
As can be seen from equations~\ref{eq:polyinit}--\ref{eq:polyquit},
there is an efficient recursive algorithm for computing the set of
polynomial prediction residuals: each residual term is formed from the
difference of the previous order residuals.  As each term involves only
a few integer additions and subtractions, it is possible to compute all
predictors and select the best.  Moreover, as the sum of absolute values
is linearly related to the standard deviation, it may be used as the
basis of predictor selection, so the whole process is cheap to compute
and involves no multiplications.
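A minimal sketch of this selection procedure in C is given below.  It
follows the recursion of equations~\ref{eq:polyinit}--\ref{eq:polyquit}
directly; the function name is hypothetical and the code is an
illustration of the technique rather than the shorten source.

\begin{verbatim}
#include <stdlib.h>   /* labs() */

/* Compute the residuals of all four polynomial predictors in a single
 * pass using e_i(t) = e_{i-1}(t) - e_{i-1}(t-1), and return the order
 * with the smallest sum of absolute residuals.  Only additions and
 * subtractions are used.  s[-3..-1] must hold the last three samples
 * of the previous block (taken as zero for the first block).
 */
static int best_poly_order(const long *s, int blocksize)
{
    unsigned long sum[4] = { 0, 0, 0, 0 };
    long p0 = s[-1];                    /* e_0(t-1) */
    long p1 = s[-1] - s[-2];            /* e_1(t-1) */
    long p2 = p1 - (s[-2] - s[-3]);     /* e_2(t-1) */
    int t, i, best = 0;

    for (t = 0; t < blocksize; t++) {
        long e0 = s[t];
        long e1 = e0 - p0;
        long e2 = e1 - p1;
        long e3 = e2 - p2;
        sum[0] += labs(e0);
        sum[1] += labs(e1);
        sum[2] += labs(e2);
        sum[3] += labs(e3);
        p0 = e0; p1 = e1; p2 = e2;
    }
    for (i = 1; i < 4; i++)
        if (sum[i] < sum[best])
            best = i;
    return best;
}
\end{verbatim}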
Figure~\ref{fig:rate} shows both forms of prediction for a range of
maximum predictor orders.  The figure shows that first and second order
prediction provide a substantial increase in compression and that higher
order predictors provide relatively little improvement.  The figure also
shows that for this example most of the total compression can be
obtained using no prediction: a zeroth order coder achieved about 48\%
compression and the best predictor 58\%.  Hence, for lossless
compression it is important not to waste too much compute on the
predictor and to perform the residual coding efficiently.

\begin{figure}[hbtp]
\center
\mbox{\psfig{file=rate.eps,width=0.7\columnwidth}}
\caption[nop]{Compression against maximum prediction order}
\label{fig:rate}
\end{figure}

\subsection{Residual Coding\label{sect:resid}}

The samples in the prediction residual are now assumed to be
uncorrelated and may therefore be coded independently.  The problem of
residual coding is thus to find an appropriate form for the probability
density function (p.d.f.)\ of the distribution of residual values so
that they can be efficiently modelled.  Figures~\ref{fig:pdf}
and~\ref{fig:logpdf} show the p.d.f.\ for the segmentally normalised
residual of the polynomial predictor (the full linear predictor shows a
similar p.d.f.).  The observed values are shown as open circles, the
Gaussian p.d.f.\ is shown as a dot-dash line, and the Laplacian, or
double-sided exponential, distribution is shown as a dashed line.

\begin{figure}[hbtp]
\center
\mbox{\psfig{file=hist.eps,width=0.7\columnwidth}}
\caption[nop]{Observed, Gaussian and quantised Laplacian p.d.f.}
\label{fig:pdf}
\end{figure}

\begin{figure}[hbtp]
\center
\mbox{\psfig{file=lnhist.eps,width=0.7\columnwidth}}
\caption[nop]{Observed, Gaussian, Laplacian and quantised Laplacian
p.d.f.\ and log$_2$ p.d.f.}
\label{fig:logpdf}
\end{figure}

These figures demonstrate that the Laplacian p.d.f.\ fits the observed
distribution very well.  This is convenient as there is a simple Huffman
code for this distribution~\cite{Rice71,YehRiceMiller91,Rice91}.  To
form this code, a number is divided into a sign bit, the $n$ low order
bits and the remaining high order bits.  The high order bits are treated
as an integer and this number of 0s is transmitted, followed by a
terminating 1.  The $n$ low order bits then follow, as in the sketch
below.
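A minimal C sketch of this coding scheme follows.  The routine
{\tt put\_bit()} is a hypothetical single-bit output primitive assumed
for the illustration, and the treatment of the sign is simplified
relative to any particular implementation.

\begin{verbatim}
/* Write one residual with the Rice-style code described above: a sign
 * bit, then the high order part (v >> n) in unary as that many 0s
 * followed by a terminating 1, then the n low order bits.
 */
extern void put_bit(int bit);   /* append one bit to the output */

static void rice_encode(long e, int n)
{
    unsigned long v, high;
    int i;

    put_bit(e < 0);                     /* sign bit */
    v = (unsigned long) (e < 0 ? -e : e);

    for (high = v >> n; high > 0; high--)
        put_bit(0);                     /* high order part in unary... */
    put_bit(1);                         /* ...terminated by a 1 */

    for (i = n - 1; i >= 0; i--)
        put_bit((int) ((v >> i) & 1)); /* the n low order bits */
}
\end{verbatim}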