paq8f.cpp
- Huffman code the RS symbol, followed by S literal bits.
The most useful contexts are the current partially coded Huffman code
(including S following bits) combined with the coefficient position
(0-63), color (0-2), and last few RS codes.
- Match. When a context match of 400 bytes or longer is detected,
the next bit of the match is predicted and other models are turned
off to improve speed.
- Exe. When an x86 file (.exe, .obj, .dll) is detected, sparse contexts
with gaps of 1-12 selecting only the prefix, opcode, and the bits
of the modR/M byte that are relevant to parsing are selected.
This model is turned off otherwise.
- Indirect. The history of the last 1-3 bytes in the context of the
last 1-2 bytes is combined with this 1-2 byte context.
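The indirect model can be sketched as follows (a simplified illustration with a 1-byte context and a 3-byte history; the names, table size, and packing here are assumptions, not paq8f's exact code):

```cpp
#include <cstdint>
#include <vector>

// Indirect context sketch: a table records, for each 1-byte context, the
// last few bytes that followed it. The model's context is the pair
// (current byte, that recorded history), which captures repeating
// structure that a plain order-1 context misses.
struct IndirectContext {
  std::vector<uint32_t> history;  // per 1-byte context: packed last 3 bytes seen after it
  uint32_t context;               // value to hash into the model's context map
  IndirectContext() : history(256, 0), context(0) {}

  void update(uint8_t byte, uint8_t prev) {
    // append 'byte' to the history of the previous 1-byte context
    history[prev] = (history[prev] << 8 | byte) & 0xffffff;
    // combine the current 1-byte context with its own recorded history
    context = (uint32_t)byte << 24 | history[byte];
  }
};
```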
ARCHITECTURE
The context models are mixed by up to 4 of several hundred neural networks
selected by a low-order context. The outputs of these networks are
combined using a second neural network, then fed through two stages of
adaptive probability maps (APM) before arithmetic coding. The four neural
networks in the first layer are selected by the following contexts:
- NN1: Order 0: the current (partially coded) byte, and the number of
bits coded (0-7).
- NN2: The previous whole byte (does not include current byte).
- NN3: The second from last byte.
- NN4: The file type (image, long match, or normal), order (0-7), 3 high bits
of the previous byte, and whether the last 2 bytes are equal.
For images and long matches, only one neural network is used and its
context is fixed.
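The mixing step can be sketched in floating point (paq8f itself uses 16-bit fixed-point weights; the function names and learning rate here are illustrative): each model's probability is stretched into the logistic domain, combined by a weighted sum, and squashed back.

```cpp
#include <cmath>
#include <vector>

// stretch and squash are inverses: squash(stretch(p)) == p.
double stretch(double p) { return std::log(p / (1 - p)); }
double squash(double x) { return 1 / (1 + std::exp(-x)); }

// Predict: weighted sum of stretched inputs, squashed back to (0,1).
double mix(const std::vector<double>& p, const std::vector<double>& w) {
  double x = 0;
  for (std::size_t i = 0; i < p.size(); ++i) x += w[i] * stretch(p[i]);
  return squash(x);
}

// Train: move each weight along the gradient that reduces the coding
// cost of the observed bit y (0 or 1).
void train(const std::vector<double>& p, std::vector<double>& w,
           double pr, int y, double lr = 0.002) {
  double err = y - pr;
  for (std::size_t i = 0; i < p.size(); ++i) w[i] += lr * err * stretch(p[i]);
}
```

Selecting a different weight vector w per context (the NN1-NN4 contexts above) is what makes this a set of gated networks rather than one network.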
An APM is a stationary map combining a context and an input probability.
The input probability is stretched and divided into 32 segments to
combine with other contexts. The output is interpolated between two
adjacent quantized values of stretch(p1). There are 2 APM stages in series:
p1 := (p1 + 3 APM(order 0, p1)) / 4.
p1 := (APM(order 1, p1) + 2 APM(order 2, p1) + APM(order 3, p1)) / 4.
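An APM stage can be sketched in floating point as below (paq8f works in 12-bit fixed point, but the structure is the same; the clamp range and update rate here are illustrative):

```cpp
#include <cmath>
#include <vector>

// APM sketch: each context owns 33 table entries spaced at equal steps of
// stretch(p). pp() interpolates between the two entries bracketing
// stretch(p); update() moves the entry nearer the input toward the coded bit.
class APM {
  std::vector<double> t;  // n contexts x 33 probabilities
  int idx;                // entry chosen by the last pp(), trained by update()
public:
  explicit APM(int n) : t(n * 33), idx(0) {
    for (int i = 0; i < n * 33; ++i)  // start as the identity map p -> p
      t[i] = 1 / (1 + std::exp(-(i % 33 - 16) / 2.0));
  }
  double pp(double p, int cx) {       // refine p using context cx
    double s = std::log(p / (1 - p)); // stretch(p)
    if (s < -8) s = -8;               // clamp, like paq8f's +-2047/256
    if (s > 8) s = 8;
    double w = (s + 8) * 2;           // position among the 33 entries, 0..32
    int i = (int)w;
    if (i > 31) i = 31;
    double f = w - i;                 // interpolation weight
    idx = cx * 33 + i + (f > 0.5);    // remember the nearer entry
    return t[cx * 33 + i] * (1 - f) + t[cx * 33 + i + 1] * f;
  }
  void update(int y, double rate = 0.02) {  // y is the bit actually coded
    t[idx] += (y - t[idx]) * rate;
  }
};
```

In this sketch's terms the two stages above read: p1 = (p1 + 3*a0.pp(p1, cx0)) / 4, then p1 = (a1.pp(p1, cx1) + 2*a2.pp(p1, cx2) + a3.pp(p1, cx3)) / 4.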
PREPROCESSING
paq8f uses preprocessing transforms on certain data types to improve
compression. To improve reliability, the decoding transform is
tested during compression to ensure that the input file can be
restored. If the decoder output is not identical to the input file
due to a bug, then the transform is abandoned and the data is compressed
without a transform so that it will still decompress correctly.
The input is split into blocks with the format <type> <decoded size> <data>
where <type> is 1 byte (0 = no transform), <decoded size> is the size
of the data after decoding, which may differ from the size of <data>.
Blocks do not span file boundaries, and have a maximum size of 4MB to
2GB depending on compression level. Large files are split into blocks
of this size. The preprocessor has 3 parts:
- Detector. Splits the input into smaller blocks depending on data type.
- Coder. Input is a block to be compressed. Output is a temporary
file. The coder determines whether a transform is to be applied
based on file type, and if so, which one. A coder may use lots
of resources (memory, time) and make multiple passes through the
input file. The file type is stored (as one byte) during compression.
- Decoder. Performs the inverse transform of the coder. It uses few
resources (fast, low memory) and runs in a single pass (stream oriented).
It takes input either from a file or the arithmetic decoder. Each call
to the decoder returns a single decoded byte.
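The <type> <decoded size> <data> framing can be sketched like this (the 4-byte MSB-first size field and function names are assumptions for illustration, not paq8f's exact header encoding):

```cpp
#include <cstdint>
#include <cstdio>

// Write a block header: 1 type byte, then the decoded (not stored) size.
void writeBlockHeader(FILE* f, uint8_t type, uint32_t decodedSize) {
  putc(type, f);                         // 0 = no transform
  for (int i = 24; i >= 0; i -= 8)       // 4 bytes, MSB first
    putc(decodedSize >> i & 255, f);
}

// Read a block header back; returns false at end of input.
// (No EOF handling mid-header in this sketch.)
bool readBlockHeader(FILE* f, uint8_t& type, uint32_t& decodedSize) {
  int c = getc(f);
  if (c == EOF) return false;
  type = (uint8_t)c;
  decodedSize = 0;
  for (int i = 0; i < 4; ++i) decodedSize = decodedSize << 8 | getc(f);
  return true;
}
```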
The following transforms are used:
- EXE: CALL (0xE8) and JMP (0xE9) address operands are converted from
relative to absolute address. The transform rewrites the operand in each
sequence E8/E9 xx xx xx 00/FF by adding the file offset modulo 2^25 (signed range,
little-endian format). Data to transform is identified by trying the
transform and applying a crude compression test: testing whether the
byte following the E8/E9 (LSB of the address) occurred more recently
in the transformed data than the original and within 4KB 4 times in
a row. The block ends when this does not happen for 4KB.
- JPEG: detected by SOI and SOF and ending with EOI or any nondecodable
data. No transform is applied. The purpose is to separate images
embedded in executables to block the EXE transform, and for a future
place to insert a transform.
IMPLEMENTATION
Hash tables are designed to minimize cache misses, which consume most
of the CPU time.
Most of the memory is used by the nonstationary context models.
Contexts are represented by 32 bits, possibly a hash. These are
mapped to a bit history, represented by 1 byte. The hash table is
organized into 64-byte buckets on cache line boundaries. Each bucket
contains 7 x 7 bit histories, 7 16-bit checksums, and a 2 element LRU
queue packed into one byte. Each 7 byte element represents 7 histories
for a context ending on a 3-bit boundary plus 0-2 more bits. One
element (for bits 0-1, which have 4 unused bytes) also contains a run model
consisting of the last byte seen and a count (as 1 byte each).
Run models use 4 byte hash elements consisting of a 2 byte checksum, a
repeat count (0-255) and the byte value. The count also serves as
a priority.
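The 64-byte bucket described above can be pictured as the struct below (the field order is illustrative; paq8f packs the same 14 + 49 + 1 bytes but lays them out differently within the cache line):

```cpp
#include <cstdint>

// One hash bucket, sized to a 64-byte cache line: 7 slots of 7 bit-history
// bytes each (49), 7 16-bit checksums (14), and a 1-byte 2-entry LRU queue.
struct HashBucket {
  uint16_t checksum[7];   // identifies which context occupies each slot
  uint8_t  history[7][7]; // 7 bit-history states per slot
  uint8_t  lru;           // two slot indices packed into one byte
};
static_assert(sizeof(HashBucket) == 64, "bucket must fit one cache line");
```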
Stationary models are most appropriate for small contexts, so the
context is used as a direct table lookup without hashing.
The match model maintains a pointer to the last match until a mismatching
bit is found. At the start of the next byte, the hash table is referenced
to find another match. The hash table of pointers is updated after each
whole byte. There is no checksum. Collisions are detected by comparing
the current and matched context in a rotating buffer.
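The checksum-free lookup can be sketched as follows (a small illustration: the 8-byte context hash, table size, and class names are assumptions; paq8f uses a rotating buffer rather than a growing one, and the 400-byte threshold above governs when other models are switched off):

```cpp
#include <cstdint>
#include <vector>

// A hash of the last MINLEN bytes indexes a table of buffer positions.
// There is no checksum: a candidate position is verified by comparing the
// context bytes in the history buffer itself, as the text describes.
struct MatchFinder {
  enum { MINLEN = 8, TBITS = 16 };
  std::vector<uint8_t> buf;      // history of all bytes seen
  std::vector<uint32_t> table;   // hash of last MINLEN bytes -> position
  MatchFinder() : table(1 << TBITS, 0) {}

  uint32_t hash() const {        // hash of the last MINLEN bytes
    uint32_t h = 0;
    for (std::size_t i = buf.size() - MINLEN; i < buf.size(); ++i)
      h = h * 2654435761u + buf[i];
    return h >> (32 - TBITS);
  }
  // Append c; return the position of a verified context match, or 0.
  uint32_t add(uint8_t c) {
    buf.push_back(c);
    if (buf.size() < MINLEN) return 0;
    uint32_t h = hash(), match = table[h];
    table[h] = (uint32_t)buf.size();
    if (match >= MINLEN) {       // no checksum: compare the actual bytes
      for (int i = 1; i <= MINLEN; ++i)
        if (buf[match - i] != buf[buf.size() - i]) return 0;
      return match;              // predicted next byte is buf[match]
    }
    return 0;
  }
};
```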
The inner loops of the neural network prediction (1) and training (2)
algorithms are implemented in MMX assembler, which computes 4 elements
at a time. Using assembler is 8 times faster than C++ for this code
and about 1/3 faster overall. (However, I found that SSE2 code on an
AMD-64, which computes 8 elements at a time, is no faster.)
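The scalar C++ fallback that the MMX code replaces looks roughly like this (adapted sketch; the paq8-family sources also pad n to a multiple of 8 and accumulate in pairs):

```cpp
// Inner loop (1): dot product of stretched inputs t and weights w, both
// 16-bit fixed point, with each product scaled down to keep the sum in range.
int dot_product(const short* t, const short* w, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += t[i] * w[i] >> 8;     // 16x16 multiply, scaled 32-bit accumulate
  return sum;
}

// Inner loop (2): training step, moving each weight along the error
// gradient with rounding, clamped to the 16-bit range.
void train_weights(const short* t, short* w, int n, int err) {
  for (int i = 0; i < n; ++i) {
    int wt = w[i] + (((t[i] * err * 2 >> 16) + 1) >> 1);  // rounded update
    if (wt < -32768) wt = -32768;
    if (wt > 32767) wt = 32767;
    w[i] = (short)wt;
  }
}
```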
DIFFERENCES FROM PAQ7
An .exe model and filter are added. Context maps are improved using 16-bit
checksums to reduce collisions. The state table uses probabilistic updates
for large counts, more states that remember the last bit, and decreased
discounting of the opposite count. It is implemented as a fixed table.
There are also many minor changes.
DIFFERENCES FROM PAQ8A
The user interface supports directory compression and drag and drop.
The preprocessor segments the input into blocks and uses more robust
EXE detection. An indirect context model was added. There is no
dictionary preprocessor like PAQ8B/C/D/E.
*/
#define PROGNAME "paq8f" // Please change this if you change the program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <ctype.h>
#define NDEBUG // remove for debugging (turns on Array bound checks)
#include <assert.h>
#ifdef UNIX
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <errno.h>
#endif
#ifdef WINDOWS
#include <windows.h>
#endif
#ifndef DEFAULT_OPTION
#define DEFAULT_OPTION 5
#endif
// 8, 16, 32 bit unsigned types (adjust as appropriate)
typedef unsigned char U8;
typedef unsigned short U16;
typedef unsigned int U32;
// min, max functions
#ifndef WINDOWS
inline int min(int a, int b) {return a<b?a:b;}
inline int max(int a, int b) {return a<b?b:a;}
#endif
// Error handler: print message if any, and exit
void quit(const char* message=0) {
throw message;
}
// strings are equal ignoring case?
int equals(const char* a, const char* b) {
assert(a && b);
while (*a && *b) {
int c1=*a;
if (c1>='A'&&c1<='Z') c1+='a'-'A';
int c2=*b;
if (c2>='A'&&c2<='Z') c2+='a'-'A';
if (c1!=c2) return 0;
++a;
++b;
}
return *a==*b;
}
//////////////////////// Program Checker /////////////////////
// Track time and memory used
class ProgramChecker {
int memused; // bytes allocated by Array<T> now
int maxmem; // most bytes allocated ever
clock_t start_time; // in ticks
public:
void alloc(int n) { // report memory allocated, may be negative
memused+=n;
if (memused>maxmem) maxmem=memused;
}
ProgramChecker(): memused(0), maxmem(0) {
start_time=clock();
assert(sizeof(U8)==1);
assert(sizeof(U16)==2);
assert(sizeof(U32)==4);
assert(sizeof(short)==2);
assert(sizeof(int)==4);
}
void print() const { // print time and memory used
printf("Time %1.2f sec, used %d bytes of memory\n",
double(clock()-start_time)/CLOCKS_PER_SEC, maxmem);
}
} programChecker;
//////////////////////////// Array ////////////////////////////
// Array<T, ALIGN> a(n); creates n elements of T initialized to 0 bits.
// Constructors for T are not called.
// Indexing is bounds checked if assertions are on.
// a.size() returns n.
// a.resize(n) changes size to n, padding with 0 bits or truncating.
// a.push_back(x) appends x and increases size by 1, reserving up to size*2.
// a.pop_back() decreases size by 1, does not free memory.
// Copy and assignment are not supported.
// Memory is aligned on a ALIGN byte boundary (power of 2), default is none.
template <class T, int ALIGN=0> class Array {
private:
int n; // user size
int reserved; // actual size
char *ptr; // allocated memory, zeroed
T* data; // start of n elements of aligned data
void create(int i); // create with size i
public:
explicit Array(int i=0) {create(i);}
~Array();
T& operator[](int i) {
#ifndef NDEBUG
if (i<0 || i>=n) fprintf(stderr, "%d out of bounds %d\n", i, n), quit();
#endif
return data[i];
}
const T& operator[](int i) const {
#ifndef NDEBUG
if (i<0 || i>=n) fprintf(stderr, "%d out of bounds %d\n", i, n), quit();
#endif
return data[i];
}
int size() const {return n;}
void resize(int i); // change size to i
void pop_back() {if (n>0) --n;} // decrement size
void push_back(const T& x); // increment size, append x
private:
Array(const Array&); // no copy or assignment
Array& operator=(const Array&);
};
template<class T, int ALIGN> void Array<T, ALIGN>::resize(int i) {
if (i<=reserved) {
n=i;
return;
}
char *saveptr=ptr;
T *savedata=data;
int saven=n;
create(i);
if (savedata && saveptr) {
memcpy(data, savedata, sizeof(T)*min(i, saven));
programChecker.alloc(-ALIGN-saven*sizeof(T)); // account for the old buffer's size, not the new n
free(saveptr);
}
}
template<class T, int ALIGN> void Array<T, ALIGN>::create(int i) {
n=reserved=i;
if (i<=0) {
data=0;
ptr=0;
return;
}
const int sz=ALIGN+n*sizeof(T);
programChecker.alloc(sz);
ptr = (char*)calloc(sz, 1);
if (!ptr) quit("Out of memory");
data = (ALIGN ? (T*)(ptr+ALIGN-(((long)ptr)&(ALIGN-1))) : (T*)ptr);
assert((char*)data>=ptr && (char*)data<=ptr+ALIGN);
}
template<class T, int ALIGN> Array<T, ALIGN>::~Array() {
programChecker.alloc(-ALIGN-n*sizeof(T));
free(ptr);
}
template<class T, int ALIGN> void Array<T, ALIGN>::push_back(const T& x) {
if (n==reserved) {
int saven=n;
resize(max(1, n*2));
n=saven;
}
data[n++]=x;
}
/////////////////////////// String /////////////////////////////
// A tiny subset of std::string
// size() includes NUL terminator.
class String: public Array<char> {
public:
const char* c_str() const {return &(*this)[0];}
void operator=(const char* s) {
resize(strlen(s)+1);
strcpy(&(*this)[0], s);
}
void operator+=(const char* s) {
assert(s);
pop_back();
while (*s) push_back(*s++);
push_back(0);
}
String(const char* s=""): Array<char>(1) {
(*this)+=s;
}
};
//////////////////////////// rnd ///////////////////////////////
// 32-bit pseudo random number generator
class Random{
Array<U32> table;
int i;
public:
Random(): table(64) {
table[0]=123456789;
table[1]=987654321;
for(int j=0; j<62; j++) table[j+2]=table[j+1]*11+table[j]*23/16;
i=0;
}
U32 operator()() {
return ++i, table[i&63]=table[i-24&63]^table[i-55&63];
}
} rnd;
////////////////////////////// Buf /////////////////////////////
// Buf(n) buf; creates an array of n bytes (must be a power of 2).
// buf[i] returns a reference to the i'th byte with wrap (no out of bounds).
// buf(i) returns i'th byte back from pos (i > 0)
// buf.size() returns n.
int pos; // Number of input bytes in buf (not wrapped)