Note:
Code has only been tested with MS Visual C++ 6 (although it should work under MinGW too).
==============================================================================
Microsoft Compiler Type Sizes
-----------------------------
char, unsigned char, signed char    1 byte
    typedef unsigned char  UCHAR;
    typedef unsigned char  BYTE;
short, unsigned short               2 bytes
    typedef unsigned short USHORT;
    typedef unsigned short WORD;
int, unsigned int                   4 bytes
    typedef unsigned int   UINT;
long, unsigned long                 4 bytes
    typedef unsigned long  ULONG;
    typedef unsigned long  DWORD;
float                               4 bytes
double                              8 bytes
long double                         8 bytes
Notes: UINT and ULONG appear to be the same size and as "fast" as each other; this
makes sense as the CPU is 32-bit.
LZSS Implementation
===================
Jonathan Bennett (c)2002-2003, jon@hiddensoft.com
All code contained is free to use. Just let me know if you have found it useful.
Check the "lzss.txt" file for information on the algorithm; this file just details the
technical changes.
v0.10
=====
First implementation of the LZSS routine. Coded to be as readable as something
this complicated can be.
The way matches are searched for is called "greedy parsing" and is very, very slow.
A 2 meg file takes around 1m45s to compress on a 366MHz CPU :(
v0.11
=====
(NOTE: It was found that this is slightly incorrect -- see v1.00)
First optimization is in the compression. We use a number of bits to code the
(offset, len) pair. For example, 0-4095 (12 bits) for the window size and 0-63 (6 bits)
for the match size. Each literal byte encoded is 9 bits. In this example we are using
18 bits for an (offset, len) pair, so it is only worth encoding matches of 3 bytes or
more.
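As a quick worked version of that arithmetic (the constants below are just the example
values above, not taken from the source code):

    // Bit cost of an (offset, len) pair vs. plain literals, using the example sizes.
    const int offsetBits  = 12;                       // offsets coded in 12 bits (0-4095)
    const int lengthBits  = 6;                        // lengths coded in 6 bits (0-63)
    const int pairBits    = offsetBits + lengthBits;  // 18 bits per (offset, len) pair
    const int literalBits = 9;                        // 8 data bits + 1 flag bit per literal

    // Two literals cost 18 bits (break-even with a pair), so the smallest match
    // worth coding as a pair is 3 bytes.
    const int minMatchLen = pairBits / literalBits + 1;  // = 3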
So, if our minimum match length is 3, then the (offset, len) pair will never use the
values 0,1 and 2. So, why don't we use this to increase the size of the numbers we
can encode into (offset, len)? Instead of 0-4095 we have 2-4097, and instead of
0-63 we have 2-65. It's effectively free bits! On a 1MB file this improved the compression
by a couple of percent. Not much, but hey!
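Roughly, the bias looks like this when writing and reading the length field (illustrative
function names, using the corrected figures from v1.00 below):

    const unsigned int MIN_MATCH = 3;

    // Writing: a real match length of 3..66 is stored in the 6-bit field as 0..63.
    unsigned int EncodeLength(unsigned int realLen)  { return realLen - MIN_MATCH; }

    // Reading: add the bias back to recover the real length.
    unsigned int DecodeLength(unsigned int codedLen) { return codedLen + MIN_MATCH; }

    // The changelog applies the same idea to the offset field, giving 3-4098 in 12 bits.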
The window size and match lengths are now definable at run-time.
v0.20
=====
Speed optimizations were done by using a chaining hash table rather than greedy parsing.
Compression time for a 2 meg file was reduced from 1m45s to 5s with only a 1% loss of compression!
Almost usable now. At this point I think the "hash table" related patents may kick in...
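In outline, the idea is something like this (the 3-byte hash and the container are
illustrative guesses, not the actual implementation):

    #include <vector>

    const int HASH_BITS = 13;
    const int HASH_SIZE = 1 << HASH_BITS;

    // Hash the next three bytes; only earlier positions that share this hash
    // need to be searched for a match, instead of the whole window.
    unsigned int Hash3(const unsigned char *p)
    {
        return ((p[0] << 6) ^ (p[1] << 3) ^ p[2]) & (HASH_SIZE - 1);
    }

    // One chain of previous positions per hash value.
    std::vector<unsigned int> g_chains[HASH_SIZE];

    void AddPosition(const unsigned char *data, unsigned int pos)
    {
        g_chains[Hash3(data + pos)].push_back(pos);
    }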
v0.21
=====
General changes.
- Compression and decompression code split into two classes so that you can include
just one of them in your programs to save a couple of kilobytes.
- When a value is added to the hash table and there is no more room in the chain,
the oldest entry is replaced (see the sketch after this list). This will not improve
compression much (0.5%), but it does mean that matches are closer to the position
being searched, which may help in the future.
- Hash table converted into a one-way linked list (rather than two-way) to reduce the
memory needed.
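A rough sketch of the bounded chain with oldest-entry replacement (the container here is
illustrative; the source keeps its own one-way linked list):

    #include <deque>

    // Keep at most 'chainLimit' positions per hash; when the chain is full the
    // oldest entry is dropped, so the chain always holds the nearest candidates.
    void AddToChain(std::deque<unsigned int> &chain, unsigned int pos, unsigned int chainLimit)
    {
        if (chain.size() >= chainLimit)
            chain.pop_front();              // discard the oldest (most distant) position
        chain.push_back(pos);               // newest position goes on the end
    }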
v0.30
=====
"lazy evaluation" implemented. Strangely enough, this speeded things up a little, and of
course the compression ratio was increased a little.
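In outline, the lazy evaluation step is something like this (FindLongestMatch, EmitLiteral
and EmitPair are stand-ins, not names from the source):

    struct Match { unsigned int offset; unsigned int len; };

    const unsigned int MIN_MATCH_LEN = 3;

    Match FindLongestMatch(const unsigned char *data, unsigned int pos, unsigned int dataLen);
    void  EmitLiteral(unsigned char c);
    void  EmitPair(unsigned int offset, unsigned int len);

    // Before coding the match at 'pos', peek at 'pos + 1'; if that match is longer,
    // emit one literal now and let the longer match be coded on the next step.
    unsigned int CompressStep(const unsigned char *data, unsigned int pos, unsigned int dataLen)
    {
        Match cur  = FindLongestMatch(data, pos, dataLen);
        Match next = FindLongestMatch(data, pos + 1, dataLen);

        if (cur.len >= MIN_MATCH_LEN && next.len <= cur.len)
        {
            EmitPair(cur.offset, cur.len);
            return pos + cur.len;           // continue after the matched bytes
        }
        EmitLiteral(data[pos]);             // no usable match, or a better one follows
        return pos + 1;
    }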
It seems that when using small hash tables (chain limit ~2), the most efficient window
and match bit sizes are 12-13 and 4. This is because with a small chain limit (although fast)
you don't get the extra compression that a larger window gives. For good compression
(rather than best speed) a good value for the chain limit is window size / 256.
Here are my results for a 2 meg binary .exe file:
Test 1
------
A hash chain limit of 2 was used.
Original file size is 2345984 bytes
Window Bits  Match Bits  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         12           4          1315650          74488      3 secs
         13           4          1321970          81032      3 secs
         12           5          1326140          74632      3 secs
         13           5          1330782          81032      3 secs
         11           4          1331880          69552      3 secs
         14           4          1337788          90304      3 secs
         12           3          1344038          74176      3 secs
         14           5          1344550          90328      3 secs
         11           5          1345478          69596      3 secs
         11           3          1353626          68912      3 secs
         13           3          1354798          80968      3 secs
         10           4          1359530          65112      3 secs
         15           4          1366410          95872      3 secs
         15           5          1371068          95864      3 secs
         10           5          1374808          65272      3 secs
         10           3          1376084          64504      3 secs
         14           3          1375580          90264      3 secs
         15           3          1408496          95856      3 secs
The first thing that is obvious from these results is that in every case
the best "match bits" size for a given window size is 4 - we will
concentrate on this size from now on.
Test 2
------
A hash chain limit of 16 was used.
Original file size is 2345984 bytes
Window Bits  Match Bits  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         13           4          1211858         169312      4 secs
         14           4          1212150         221680      5 secs
         12           4          1218880         145928      4 secs
         11           4          1247888         122264      4 secs
         15           4          1223296         305760      5 secs
         10           4          1284376         113680      4 secs
Interesting: again, 13 puts up a good performance, along with 11, 12 and 14.
Test 3
------
A hash chain limit of 32 was used.
Original file size is 2345984 bytes
Window Bits  Match Bits  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         14           4          1194912         249712      6 secs
         13           4          1197928         196312      5 secs
         15           4          1202494         358056      6 secs
         12           4          1207556         160728      4 secs
         11           4          1238556         139168      4 secs
         10           4          1276936         132920      4 secs
Again 12, 13 and 14 are giving the best compression (13 is my fave at this
point due to it being quicker than the others). I'll now drop 10 and 11 from
the testing.
Test 4
------
A hash chain limit of 64 was used.
Original file size is 2345984 bytes
Window Bits  Match Bits  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         14           4          1185580         280816      7 secs
         15           4          1190544         391360      8 secs
         13           4          1190618         212616      5 secs
         12           4          1202214         170272      5 secs
The pattern seems to be that 13 is a good general window size, giving both good
compression and speed, but a window size of 14 starts giving better compression
(at the expense of speed) as the hash chain limit reaches 32 and above. Let's do
a very large hash table to confirm the trend.
Test 5
------
A hash chain limit of 1024 was used.
Original file size is 2345984 bytes
Window Bits  Match Bits  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         14           4          1172642         357056     25 secs
         13           4          1181234         261240     19 secs
Yes, once over a hash chain limit of 32, a window size of 14 bits seems
to be the way to go. Let's see at what point raising the hash limit
becomes a waste.
Test 6
------
Window Bits  Hash Limit  Compressed Size  Hash Mem Used  Time Taken
-----------  ----------  ---------------  -------------  ----------
         14         128          1179972         305960     12 secs
         14         256          1176368         320112     15 secs
         14         512          1174192         339536     21 secs
         14        1024          1172642         357056     25 secs
         14        2048          1171360         370504     46 secs
Based on these tests, it looks like having fully variable window and match bit
sizes is a bit of a waste of time; a few preset options would seem to be the
way to go:
Compression Level   Window Bits   Match Bits   Hash Limit
-----------------   -----------   ----------   ----------
Fast        (0)          12            4             2
Normal      (1)          13            4             8
Slow        (2)          13            4            16
Very slow   (3)          14            4            32
Ultra slow  (4)          14            4           128
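In code, those presets might look something like this (the struct and array names are
illustrative):

    struct SLevel { int nWindowBits; int nMatchBits; int nHashLimit; };

    static const SLevel g_levels[] =
    {
        { 12, 4,   2 },     // 0: Fast
        { 13, 4,   8 },     // 1: Normal
        { 13, 4,  16 },     // 2: Slow
        { 14, 4,  32 },     // 3: Very slow
        { 14, 4, 128 }      // 4: Ultra slow
    };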
v1.00
=====
For release.
Compression options as detailed above implemented.
Bug fixed in the minimum length and "adjust" matching:
"So, if our minimum match length is 3, then the (offset, len) pair will never use the
values 0,1 and 2. So, why don't we use this to increase the size of the numbers we
can encode into (offset, len)? Instead of 0-4095 we have 2-4097, and instead of
0-63 we have 2-65. It's effectively free bits! On a 1MB file this improved the compression
by a couple of percent. Not much, but hey!"
This is not true; the ranges should be 3-4098 and 3-66.
v1.01
=====
Redid hashing functions to increase speed. Also added a "malloc" buffer so that
the constant malloc/free calls don't take up as much time.
Changed the algorithm ID to "JB01".
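The reusable buffer idea is roughly this (an illustrative sketch, not the actual class
from the source):

    #include <cstddef>
    #include <cstdlib>

    // Grab one block up front and only grow it when a larger request arrives,
    // instead of calling malloc/free on every compression call.
    class CWorkBuffer
    {
    public:
        CWorkBuffer() : m_pBuf(0), m_nSize(0) { }
        ~CWorkBuffer() { free(m_pBuf); }

        unsigned char *Get(size_t nNeeded)
        {
            if (nNeeded > m_nSize)
            {
                unsigned char *p = (unsigned char *)realloc(m_pBuf, nNeeded);
                if (!p)
                    return 0;               // caller handles the failure
                m_pBuf  = p;
                m_nSize = nNeeded;
            }
            return m_pBuf;
        }

    private:
        unsigned char *m_pBuf;
        size_t         m_nSize;
    };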