charset.txt
来自「汇编编程艺术」· 文本 代码 · 共 485 行 · 第 1/2 页
TXT
485 行
Character Set Routines
----------------------
The character set routines let you deal with groups of characters as a set
rather than a string. A set is an unordered collection of objects where
membership (presence or absence) is the only important quality. The stdlib
set routines were designed to let you quickly check if an ASCII character is
in a set, to quickly add characters to a set or remove characters from a set.
These operations are the ones most commonly used on character sets. The
other operations (like union, intersection, difference, etc.) are useful, but
are not as popular as the former routines. Therefore, the data structure
has been optimized for sets to handle the membership and add/delete operations
at the slight expense of the others.
Character sets are implemented via bit vectors. A "1" bit means that an item
is present in the set and a "0" bit means that the item is absent from the
set. The most common implementation of a character set is to use thirty-two
consecutive bytes, eight bytes per, giving 256 bits (one bit for each char-
acter in the character set). While this makes certain operations (like
assignment, union, intersection, etc.) fast and convenient, other operations
(membership, add/remove items) run much slower. Since these are the more
important operations, a different data structure is used to represent sets.
A faster approach is to simply use a byte value for each item in the set.
This offers a major advantage over the thirty-two bit scheme: for operations
like membership it is very fast (since all you have got to do is index into
an array and test the resulting value). It has two drawbacks: first, oper-
ations like set assignment, union, difference, etc., require 256 operations
rather than thirty-two; second, it takes eight times as much memory.
The first drawback, speed, is of little consequence. You will rarely use the
the operations so affected, so the fact that they run a little slower will be
of little consequence. Wasting 224 bytes is a problem, however. Especially
if you have a lot of character sets.
The approach used here is to allocate 272 bytes. The first eight bytes con-
tain bit masks, 1, 2, 4, 8, 16, 32, 64, 128. These masks tell you which bit
in the following 264 bytes is associated with the set. This facilitates
putting eight sets into 272 bytes (34 bytes per character set). This provides
almost the speed of the 256-byte set with only a two byte overhead. In the
stdlib.a file there is a macro that lets you define a group of character
sets: set. The macro is used as follows:
set set1, set2, set3, ... , set8
You must supply between one and eight labels in the operand field. These are
the names of the sets you want to create. The set macro automatically
attaches these labels to the appropriate mask bytes in the set. The actual
bit patterns for the set begin eight bytes later (from each label). There-
fore, the byte corresponding to chr(0) is staggered by one byte for each
set (which explains the other eight bytes needed above and beyond the 256
required for the set). When using the set manipulation routines, you should
always pass the address of the mask byte (i.e., the seg/offset of one of the
labels above) to the particular set manipulation routine you are using.
Passing the address of the structure created with the macro above will
reference only the first set in the group.
Note that you can use the set operations for fast pattern matching appli-
cations. The set membership operation for example, is much faster that the
strspan routine found in the string package. Proper use of character sets
can produce a program which runs much faster than some of the equivalent
string operations.
Note: there is a special include file in the INCLUDE directory, STDSETS.A,
which contains the bit definitions for eight commonly-used character sets:
Alpha (upper and lower case alphabetics), lower (lower case alphabetics),
upper (upper case alphabetics), digits ("0".."9"), xdigits (hexadecimal
digits: "0"-"9", 'a'-'z', and 'A'-'Z'), alphanum (upper/lower case alpha
and digits), whitespace (spaces, tabs, carriage returns, and linefeeds),
and delimiters (whitespace plus ",", ";", "<", ">", and "|").
If you want to use this standard character set in your program you must
include the STDSETS.A file in an appropriate (data) segment. Note that
including STDLIB.A or CHARSETS.A will not give the standard sets. You must
explicitly place an include STDSETS.A in your program to have access to
these sets.
Routine: Createsets
--------------------
Category: Character Set Routine
Registers on Entry: no parameters passed
Registers on return: ES:DI - pointer to eight sets
Flags affected: Carry = 0 if no error. Carry = 1 if insufficient
memory to allocate storage for sets.
Example of Usage:
Createsets
jc NoMemory
mov word ptr SetPtr, di
mov word ptr SetPtr+2, es
Description: Createsets allocates 272 bytes on the heap. This is sufficient
room for eight character sets. It then initializes the first
eight bytes of this storage with the proper mask values for
each set. Location es:0[di] gets set to 1, location es:1[di]
gets 2, location es:2[di] gets 4, etc. The Createsets routine
also initializes all of the sets to the empty set by clearing
all the bits to zero.
Include: stdlib.a or charsets.a
Routine: EmptySet
------------------
Category: Character Set Routine
Registers on Entry: ES:DI - pointer to first byte of desired set
Registers on return: None
Flags affected: None
Example of Usage:
les di, SetPtr
add di, 3 ; Point at 4th set in group.
Emptyset
Description: Emptyset clears out the bits in a character set to zero
(thereby setting it to the empty set). Upon entry, es:di must
point at the first byte of the character set you want to clear.
Note that this is not the address returned by Createsets. The
first eight bytes of a character set structure are the
addresses of eight different sets. ES:DI must point at one of
these bytes upon entry into Emptyset.
Include: stdlib.a or charsets.a
Routine: Rangeset
------------------
Category: Character Set Routine
Registers on entry: ES:DI (contains the address of the first byte of the set)
AL (contains the lower bound of the items)
AH (contains the upper bound of the items)
Registers on return: None
Flags affected: None
Example of Usage:
lea di, SetPtr
add di, 4
mov al, 'A'
mov ah, 'Z'
rangeset
Description: This routine adds a range of values to a set with ES:DI as the
pointer to the set, AL as the lower bound of the set, and
AH as the upper bound of the set (AH has to be greater than
AL, otherwise, there will an error).
Include: stdlib.a or charsets.a
Routine: Addstr (l)
--------------------
Category: Character Set Routine
Registers on Entry: ES:DI- pointer to first byte of desired set
DX:SI- pointer to string to add to set (Addstr only)
CS:RET-pointer to string to add to set (Addstrl only)
Registers on Return: None
Flags Affected: None
Example of Usage:
les di, SetPtr
add di, 1 ;Point at 2nd set in group.
mov dx, seg CharStr ;Pointer to string
lea si, CharStr ; chars to add to set.
addstr ;Union in these characters.
;
les di, SetPtr ;Point at first set in group.
addstrl
db "AaBbCcDdEeFf0123456789",0
;
Description: Addstr lets you add a group of characters to a set by
specifying a string containing the characters you want in
the set. To Addstr you pass a pointer to a zero-terminated
string in dx:si. Addstr will add (union) each character
from this string into the set.
Addstrl works the same way except you pass the string as
a literal string constant in the code stream rather than
via ES:DI.
Include: stdlib.a or charsets.a
Routine: Rmvstr (l)
--------------------
Category: Character Set Routine
Registers on entry: ES:DI contains the address of first byte of a set
DX:SI contains the address of string to be removed
from a set (Rmvstr only)
CS:RET pointer to string to add to set (Rmvstrl only)
Registers on return: None
Flags affected: None
Example of Usage:
les di, SetPtr
mov dx, seg CharStr
lea si, CharStr
rmvstr
mov dx, seg CharStr
lea si, CharStr
rmvstrl
db "ABCDEFG",0
Description: This routine is to remove a string from a set with ES:DI
pointing to its first byte, and DX:SI pointing to the
string to be removed from the set.
For Rmvstrl, the string of characters to remove from the
set follows the call in the code stream.
Include: stdlib.a or charsets.a
⌨️ 快捷键说明
复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?