bzip2.1

.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v0.9.0
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block-sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably
better than that achieved by more conventional LZ77/LZ78-based compressors,
and approaches the performance of the PPM family of statistical
compressors.

The command-line options are deliberately very similar to
those of
.I GNU Gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of itself,
with the name "original_name.bz2".
Each compressed file has the same modification date and permissions
as the corresponding original, so that these properties can be
correctly restored at decompression time.  File name handling is
naive in the sense that there is no mechanism for preserving
original file names, permissions and dates in filesystems
which lack these concepts, or have serious file name length
restrictions, such as MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing files;
if you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard input to standard output.
In this case,
.I bzip2
will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d
) decompresses and restores all specified files whose names
end in ".bz2".
Files without this suffix are ignored.
Again, supplying no filenames
causes decompression from standard input to standard output.

.I bunzip2
will correctly decompress a file which is the concatenation
of two or more compressed files.  The result is the concatenation
of the corresponding uncompressed files.  Integrity testing
(\-t) of concatenated compressed files is also supported.

You can also compress or decompress files to
the standard output by giving the \-c flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates
a stream containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.I bzip2
version 0.9.0 or later.  Earlier versions of
.I bzip2
will stop after decompressing the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc
) decompresses all specified files to the standard output.

Compression is always performed, even if the compressed file is
slightly larger than the original.  Files of less than about
one hundred bytes tend to get larger, since the compression
mechanism has a constant overhead in the region of 50 bytes.
Random data (including the output of most file compressors)
is coded at about 8.05 bits per byte, giving an expansion of
around 0.5%.

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to make sure that the decompressed
version of a file is identical to the original.
This guards against corruption of the compressed data,
and against undetected bugs in
.I bzip2
(hopefully very unlikely).
The chance of data corruption going undetected is
microscopic, about one in four billion
for each file processed.  Be aware, though, that the check
occurs upon decompression, so it can only tell you that
something is wrong.
It can't help you recover the
original uncompressed data.
You can use
.I bzip2recover
to try to recover data from damaged files.

Return values: 0 for a normal exit, 1 for environmental
problems (file not found, invalid flags, I/O errors, &c),
2 to indicate a corrupt compressed file,
3 for an internal consistency error (eg, bug) which caused
.I bzip2
to panic.

.SH MEMORY MANAGEMENT
.I Bzip2
compresses large files in blocks.  The block size affects both the
compression ratio achieved, and the amount of memory needed both for
compression and decompression.  The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively.  At decompression-time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the flags
\-1 to \-9
are irrelevant to and so ignored during decompression.

Compression and decompression requirements, in bytes, can be estimated as:

      Compression:   400k + ( 7 x block size )

      Decompression: 100k + ( 4 x block size ), or
.br
                     100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns; most
of the compression comes from the first two or three hundred k of block size,
a fact worth bearing in mind when using
.I bzip2
on small machines.  It is also important to appreciate that the
decompression memory requirement is set at compression-time by the
choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.I bunzip2
has an option to decompress using approximately half this
amount of memory, about 2300 kbytes.
Decompression speed is
also halved, so you should use this option only where necessary.
The relevant flag is \-s.

In general, try to use the largest block size
memory constraints allow, since that maximises the compression
achieved.  Compression and decompression
speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large
block size.  The amount of real memory touched is proportional
to the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
\-9
will cause the compressor to allocate around
6700k of memory, but only touch 400k + 20000 * 7 = 540
kbytes of it.  Similarly, the decompressor will allocate 3700k but
only touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for
different block sizes.  Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes.  This column gives some feel for how
compression varies with block size.  These figures tend to understate
the advantage of larger block sizes for larger files, since the
Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1100k       500k         350k      914704
     -2      1800k       900k         600k      877703
     -3      2500k      1300k         850k      860338
     -4      3200k      1700k        1100k      846899
     -5      3900k      2100k        1350k      845160
     -6      4600k      2500k        1600k      838626
     -7      5400k      2900k        1850k      834096
     -8      6000k      3300k        2100k      828642
     -9      6700k      3700k        2350k      828642

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
\-c will decompress
multiple files to stdout, but will only compress a single file to
stdout.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are really the same program, and the decision about what actions
to take is done on the basis of which name is
used.  This flag overrides that mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the invocation
name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files.  Normally,
.I bzip2
will not overwrite existing output files.
.TP
.B \-k --keep
Keep (don't delete) input files during compression or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and
testing.
Files are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte.  This means any file can be
decompressed in 2300k of memory, albeit at about half the normal
speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your
compression ratio.  In short, if your machine is low on memory
(8 megabytes or less), use \-s for everything.  See
MEMORY MANAGEMENT above.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 to \-9
Set the block size to 100 k, 200 k .. 900 k when
compressing.
Has no effect when decompressing.
See MEMORY MANAGEMENT above.
.TP
.B \--repetitive-fast
.I bzip2
injects some small pseudo-random variations
into very repetitive blocks to limit
worst-case performance during compression.
If sorting runs into difficulties, the block
is randomised, and sorting is restarted.
Very roughly,
.I bzip2
persists for three times as long as a well-behaved input
would take before resorting to randomisation.
This flag makes it give up much sooner.
.TP
.B \--repetitive-best
Opposite of \--repetitive-fast; try a lot harder before
resorting to randomisation.

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long.
Each block is handled independently.  If a media or
transmission error causes a multi-block .bz2 file to become damaged,
it may be possible to recover data from the undamaged blocks
in the file.

The compressed representation of each block is delimited by
a 48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty.  Each block also carries
its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into
its own .bz2 file.  You can then use
.I bzip2 -t
to test the integrity of the resulting files,
and decompress those which are undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec0001file.bz2", "rec0002file.bz2",
etc, containing the extracted blocks.  The output filenames
are designed so that the use of wildcards in subsequent processing
-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
lists the files in the "right" order.

.I bzip2recover
should be of most use dealing with large .bz2 files, as
these will contain many blocks.  It is clearly futile to
use it on damaged single-block files, since a damaged
block cannot be recovered.
If you wish to minimise
any potential data loss through media or transmission
errors, you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings
in the file.  Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress extraordinarily slowly.
You can use the
\-vvvvv
option to monitor progress in great detail, if you want.
Decompression speed is unaffected.

Such pathological cases
seem rare in practice, appearing mostly in artificially-constructed
test files, and in low-level disk images.  It may be inadvisable to
use
.I bzip2
to compress the latter.
If you do get a file which causes severe slowness in compression,
try making the block size as small as possible, with flag \-1.

.I bzip2
usually allocates several megabytes of memory to operate in,
and then charges all over it in a fairly random fashion.  This
means that performance, both for compressing and decompressing,
is largely determined by the speed
at which your machine can service cache misses.
Because of this, small changes
to the code to reduce the miss rate have been observed to give
disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I Bzip2
tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 0.9.0 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public release, version 0.1pl2,
but with the following exception: 0.9.0 can correctly decompress
multiple concatenated compressed files.
0.1pl2 cannot do this; it
will stop after decompressing just the first file in the stream.

Wildcard expansion for Windows 95 and NT
is flaky.

.I bzip2recover
uses 32-bit integers to represent bit positions in
compressed files, so it cannot handle compressed files
more than 512 megabytes long.  This could easily be fixed.

.SH AUTHOR
Julian Seward, jseward@acm.org.

http://www.muraroa.demon.co.uk

The ideas embodied in
.I bzip2
are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder),
Peter Fenwick (for the structured coding model in the original
.I bzip,
and many refinements),
and
Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
coder in the original
.I bzip).
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to
sources of documentation.
Christian von Roques encouraged me to look for faster
sorting algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case
compression performance.
Many people sent patches, helped with portability problems,
lent machines, gave advice and were generally helpful.
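.SH MEMORY ESTIMATE EXAMPLE
The estimates given under MEMORY MANAGEMENT can be written out as a small
calculation.  The Python sketch below is illustrative only and is not part of
the bzip2 distribution.  The function names are hypothetical, and the sketch
assumes that "k" means 1000 bytes, which is how the 20,000-byte worked
example in that section treats it.  It models only the published formulas,
not what any particular build actually allocates.
.nf
```python
def estimate_kbytes(level, small=False):
    """Estimate allocated memory, in kbytes, for block-size flag -1 .. -9.

    Returns (compression, decompression).  `small` models the -s flag for
    decompression only; it does not model -s during compression.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level                 # -1 .. -9 select 100k .. 900k blocks
    compress_k = 400 + 7 * block_k        # 400k + ( 7 x block size )
    factor = 2.5 if small else 4          # -s needs 2.5 bytes per block byte
    decompress_k = 100 + factor * block_k # 100k + ( 4 or 2.5 x block size )
    return compress_k, decompress_k


def touched_kbytes(file_bytes, level=9):
    """Estimate memory actually touched, in kbytes, for a given input size.

    For a file smaller than one block, only memory proportional to the
    file size is touched, so the file size replaces the block size.
    """
    block_bytes = 100000 * level
    n = min(file_bytes, block_bytes)
    return (400000 + 7 * n) / 1000, (100000 + 4 * n) / 1000


if __name__ == "__main__":
    print(estimate_kbytes(9))              # default block size
    print(estimate_kbytes(9, small=True))  # decompressing with -s
    print(touched_kbytes(20000))           # the 20,000-byte example
```
.fi
Under these assumptions the three calls reproduce the figures quoted in
MEMORY MANAGEMENT: 6700k and 3700k for the default block size, 2350k for
decompression with \-s, and 540 and 180 kbytes touched for a 20,000-byte file.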