📄 compressionchapter.txt

📁 是内存受限系统设计的代码。
💻 TXT
📖 第 1 页 / 共 2 页
字号:
上一页 12
Compression is used widely through the industry.  Operating systems use compression to store more information on secondary storage, communications protocols use compression to transmit information more quickly, and programming languages use compression to reduce the memory requirements of programs.
______________________________
String Compression Pattern
Also know as: Nibble Coding, Character Coding.
How do you reduce the memory taken up by static text strings in a program?
* You have lots of small-to-medium sized strings in your program - all different
* You need to reduce your program's memory requirements.
* You need random access to individual strings.
* You don't want to expend too much extra programmer effort, memory space, or processing time on managing the strings.
Many programs need lots of strings - to display information or messages to the user, to supply keys or query expressions for databases, to describe fields or attributes in self-describing data formats, or to retrieve resources for window systems. All these strings can take up significant amounts of memory, increasing the program's memory requirements.
Programs need to be able to perform typical string operations such as determining their length and initial characters, concatenating strings, or substituting parameter strings into constant format strings.  Each string in a collection of strings needs to be individually accessible.
Strings can cause problems even when not stored in main memory.  Strings stored in READ-ONLY MEMORY or SECONDARY STORAGE can also occupy too much space, especially when the ROM or Secondary Storage capacity is limited, or is needed to store other parts of the program or data.
Finally, although storing strings is important, it is seldom the most significant memory use in the system.  Typically, you don't want to put to much programmer effort into the problem. 
One tempting approach to this problem is to apply a large-scale FILE COMPRESSION algorithm to all the strings in the program; you could use adaptive text compression algorithms, which are well known, available, and provide good compression ratios.  Unfortunately file compression algorithms are seldom appropriate for program strings.  First, they can only access the compressed data sequentially, so if all the strings are compressed together, all must be decompressed together before a particular string can be retrieved.  Second, few file compression algorithms work well on short to medium length strings. File compression algorithms usually need to store auxiliary information to guide decoding along with the compressed data, so it is infeasible to use them to compress many small strings individually.  Third, file compression algorithms typically require a reasonable amount of processing time and memory buffer space to execute. 
For example, the Strap-It-On PC needs to store and display a large number of information and error messages to the user.  The messages need to be stored in scarce main memory or read-only memory - and there isn't really enough space to store all the strings directly.  The programs must access each string individually, to show its message to the user.  Given that many of the strings describe exceptional situations such a memory shortage, they need to be able to be retrieved and displayed quickly, efficiently, and without requiring extra memory.
Therefore: store the strings using a compact encoding for each character.
They key to string compression is that there is a lot of redundancy implicit in the design of "standard" character set encodings.  First, each character is always stored in the same number of bits (typically six, seven, or eight).  This represents an underlying assumption that each character is equally likely to appear in a string. This is not the case: in most Roman languages, for example, some letters (such as "e" and "a" and "t") appear much more frequently than others (such as "z" or "q").  Figure 1 below, for example, shows the frequency of lowercase letters used in this chapter.  Second, there is an even greater difference in frequency between characters and letters - capitals are used much less frequently than lower-case letters, and symbols such as "|" or "~" appear even less frequently than letters in most written texts.  Finally, all characters are often stored in more bits than they need - for example, large amounts of text are coded in standard ASCII, which supports only 127 characters and control codes and was designed to be stored in seven bits. These seven-bit characters are routinely padded by 15% to fit into the eight-bit bytes that can be processed conveniently by modern CPUs.

Figure 2 Distribution of Characters in this Chapter
We can reduce the memory requirements for strings by removing the redundancy from the string encoding.  The most important technique is to reduce the number of bits stored per character, typically down to six, five, or even four bits per character.  Since the reduced bit sizes only allow you to store 64, 32, or 16 distinct characters, some character codes are represented using special escape codes that change the interpretation of the following codes.
 Figure 3 Character distribution, sorted by frequency
For example, figure 3 shows what a significant proportion of characters are one of the most common 15 characters (those to the right of character 'u').  So if we encode the characters such that these most common 15 characters take less bits, and the least common take more, that will give us a significant compression.
Consequences
Typically you get a reasonable compression for the strings themselves (a compression ratio of 70-75% of the original size), reducing the program's memory requirements.
Most operations on compressed strings execute almost as fast as operations on native strings, preserving time performance. 
String compression is quite easy to implement, so it does not take much programmer effort.
Each string in a collection of compressed strings can be accessed individually, without decompressing all proceeding strings.
However: the total compression of the program data isn't high, so the program's memory requirements are not greatly reduced.
String operations that rely on random access to the characters in the string may execute up to an order of magnitude slower than the same operations on decompressed strings.  Because characters may have variable lengths (see below), you can only access a specific character by scanning from the start of the string.  It's even more complicated to implement operations that change the characters in the string; in practice the only realistic option is to uncompress the string and do the operations on that.
The compressed strings are more difficult to compile than strings in standard encodings, especially when they are part of the programs literal text. Compressed strings require either manual encoding or a string pre-processing pass, either of which increases complexity.
You have to test the compressed string operations, but these tests are quite straightforward.
Implementation 
There are three main techniques used to compress strings - reducing the number of bits used to store each character, using escape codes to switch between different character encodings, and storing variable length strings. 
Reducing the number of bits per character.  
If the underlying character set has only 128 characters, it certainly makes sense to store each character in seven bits, rather than eight, sixteen, or thirty-two bits.  But most text strings in programs use far less than 127 characters - for a start, the first 32 ASCII characters are unprintable control codes.  Many text strings in programs require only upper and lower-case letters, numbers, and a few punctuation characters - less that seventy characters - just slightly over what you can store in six bits.
Escape Codes
Escape code changes character encodings.  You can reduce the number of bits per character still further by using more than one encoding from stored codes to characters, and then allocating one or more escape sequence codes to change encodings.  For example, you could store uppercase and lowercase letters and numbers in five bits per character (five bits gives 32 codes for characters) by first coding the 26 lowercase letters (using codes 0-25) and the most important punctuation characters (say space, period, comma, return in codes  26-29). The remaining two codes are used as escape sequences - meaning that the following code is to be interpreted either as an uppercase letter (code 30), or as a number or graphical symbol (code 31).
Code Character        Code Character        Code Character
0        a            30
上一页 12
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -