<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.51
     from manual.texi on 22 October 1998 -->
<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE>
</HEAD>
<BODY>
Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
<P><HR><P>
<H1><A NAME="SEC33" HREF="manual_toc.html#TOC33">Miscellanea</A></H1>
<P>
These are just some random thoughts of mine.  Your mileage may
vary.
</P>
<H2><A NAME="SEC34" HREF="manual_toc.html#TOC34">Limitations of the compressed file format</A></H2>
<P>
<CODE>bzip2-0.9.0</CODE> uses exactly the same file format as the previous
version, <CODE>bzip2-0.1</CODE>.  This decision was made in the interests of
stability.  Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
</P>
<P>
Nevertheless, this is not a painless decision.  Development
work since the release of <CODE>bzip2-0.1</CODE> in August 1997
has shown complexities in the file format which slow down
decompression and, in retrospect, are unnecessary.  These are:
<UL>
<LI>The run-length encoder, which is the first of the
      compression transformations, is entirely irrelevant.
      The original purpose was to protect the sorting algorithm
      from the very worst case input: a string of repeated
      symbols.  But algorithm steps Q6a and Q6b in the original
      Burrows-Wheeler technical report (SRC-124) show how
      repeats can be handled without difficulty in block
      sorting.
<LI>The randomisation mechanism doesn't really need to be
      there.  Udi Manber and Gene Myers published a suffix
      array construction algorithm a few years back, which
      can be employed to sort any block, no matter how
      repetitive, in O(N log N) time.  Subsequent work by
      Kunihiko Sadakane has produced a derivative O(N (log N)^2)
      algorithm which usually outperforms the Manber-Myers
      algorithm.

      I could have changed to Sadakane's algorithm, but I find
      it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
      most inputs, and the randomisation mechanism protects
      adequately against bad cases.  I didn't think it was
      a good tradeoff to make.  Partly this is due to the fact
      that I was not flooded with email complaints about
      <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
      perhaps it isn't a problem for real inputs.

      Probably the best long-term solution
      is to use the existing sorting
      algorithm initially, and fall back to an O(N (log N)^2)
      algorithm if the standard algorithm gets into difficulties.
      This can be done without much difficulty; I made
      a prototype implementation of it some months ago.
<LI>The compressed file format was never designed to be
      handled by a library, and I have had to jump through
      some hoops to produce an efficient implementation of
      decompression.  It's a bit hairy.  Try passing
      <CODE>decompress.c</CODE> through the C preprocessor
      and you'll see what I mean.  Much of this complexity
      could have been avoided if the compressed size of
      each block of data was recorded in the data stream.
<LI>An Adler-32 checksum, rather than a CRC32 checksum,
      would be faster to compute.
</UL>
<P>
It would be fair to say that the <CODE>bzip2</CODE> format was frozen
before I properly and fully understood the performance
consequences of doing so.
</P>
<P>
Improvements which I have been able to incorporate into
0.9.0, despite using the same file format, are:
<UL>
<LI>Single array implementation of the inverse BWT.  This
      significantly speeds up decompression, presumably
      because it reduces the number of cache misses.
<LI>Faster inverse MTF transform for large MTF values.
      The new implementation is based on the notion of sliding blocks
      of values.
<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>
      and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
      Duh! I'm embarrassed at my own moronicness (moronicity?) on this
      one.
</UL>
<P>
Further ahead, it would be nice
to be able to do random access into files.  This will
require some careful design of compressed file formats.
</P>
<H2><A NAME="SEC35" HREF="manual_toc.html#TOC35">Portability issues</A></H2>
<P>
After some consideration, I have decided not to use
GNU <CODE>autoconf</CODE> to configure 0.9.0.
</P>
<P>
<CODE>autoconf</CODE>, admirable and wonderful though it is,
mainly assists with portability problems between Unix-like
platforms.  But <CODE>bzip2</CODE> doesn't have much in the way
of portability problems on Unix; most of the difficulties appear
when porting to the Mac, or to Microsoft's operating systems.
<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
whole load of new complexity.
</P>
<P>
Most people should be able to compile the library and program
under Unix straight out-of-the-box, so to speak, especially
if you have a version of GNU C available.
</P>
<P>
There are a couple of <CODE>__inline__</CODE> directives in the code.  GNU C
(<CODE>gcc</CODE>) should be able to handle them.  If your compiler doesn't
like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be null.  One
easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
which should be understood by most Unix compilers.
</P>
<P>
If you still have difficulties, try compiling with the macro
<CODE>BZ_STRICT_ANSI</CODE> defined.  This should enable you to build the
library in a strictly ANSI compliant environment.  Building the program
itself like this is dangerous and not supported, since you remove
<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
devices, and other not-really-a-file entities.
This could cause
filesystem corruption!
</P>
<P>
One other thing: if you create a <CODE>bzip2</CODE> binary for public
distribution, please try to link it statically (<CODE>gcc -static</CODE>).  This
avoids all sorts of library-version issues that others may encounter
later on.
</P>
<H2><A NAME="SEC36" HREF="manual_toc.html#TOC36">Reporting bugs</A></H2>
<P>
I tried pretty hard to make sure <CODE>bzip2</CODE> is
bug-free, both by design and by testing.  Hopefully
you'll never need to read this section for real.
</P>
<P>
Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
fault, a bus error or an internal assertion failure, it
will ask you to email me a bug report.  Experience with
version 0.1 shows that almost all these problems can
be traced to either compiler bugs or hardware problems.
<UL>
<LI>Recompile the program with no optimisation, and see if it
works.  And/or try a different compiler.
I heard all sorts of stories about various flavours
of GNU C (and other compilers) generating bad code for
<CODE>bzip2</CODE>, and I've run across two such examples myself.

2.7.X versions of GNU C are known to generate bad code from
time to time, at high optimisation levels.
If you get problems, try using the flags
<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.

You may notice that the Makefile runs four tests as part of
the build process.  If the program passes all of these, it's
a pretty good (but not 100%) indication that the compiler has
done its job correctly.
<LI>If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
repeatable, you may have a flaky memory subsystem.  <CODE>bzip2</CODE>
really hammers your memory hierarchy, and if it's a bit marginal,
you may get these problems.  Ditto if your disk or I/O subsystem
is slowly failing.  Yup, this really does happen.

Try using a different machine of the same type, and see if
you can repeat the problem.
<LI>This isn't really a bug, but ...
If <CODE>bzip2</CODE> tells
you your file is corrupted on decompression, and you
obtained the file via FTP, there is a possibility that you
forgot to tell FTP to do a binary mode transfer.  That absolutely
will cause the file to be non-decompressible.  You'll have to transfer
it again.
</UL>
<P>
If you've incorporated <CODE>libbzip2</CODE> into your own program
and are getting problems, please, please, please check that the
parameters you are passing in calls to the library are
correct and in accordance with what the documentation says
is allowable.  I have tried to make the library robust against
such problems, but I'm sure I haven't succeeded.
</P>
<P>
Finally, if the above comments don't help, you'll have to send
me a bug report.  Now, it's just amazing how many people will
send me a bug report saying something like
<PRE>   bzip2 crashed with segmentation fault on my machine</PRE>
<P>
and absolutely nothing else.  Needless to say, such a report
is <EM>totally, utterly, completely and comprehensively 100% useless;
a waste of your time, my time, and net bandwidth</EM>.
With no details at all, there's no way I can possibly begin
to figure out what the problem is.
</P>
<P>
The rules of the game are: facts, facts, facts.  Don't omit
them because "oh, they won't be relevant".  At the bare