📄 pattern.txt
字号:
Pattern Matching Routines
-------------------------
The UCR Standard Library contains a very rich set of (character string)
pattern matching routines. These routines were designed to mimic the
pattern matching primitives found in the SNOBOL4 programming language, so
they are very powerful indeed.
These routines are actually quite simple. They derive their power through
the use of a recursive, backtracking pattern matching algorithm and a
properly specified data structure: the "pattern". The data type for a
pattern is the following:
Pattern struc
MatchFunction dd ?
MatchParm dd 0
MatchAlternate dd 0
NextPattern dd 0
EndPattern dw ?
StartPattern dw ?
StrSeg dw ?
Pattern ends
The "MatchFunction" field is a pointer to a (far) procedure which tests the
current characters in the string. You could write your own functions for
this purpose if you choose, however, several important routines are already
provided in this package.
"MatchParm" represents a four-byte value which the pattern matching algorithm
passes to the MatchFunction routine. Typically it is a pointer to a string
or a character set though it could be any four-byte value.
"MatchAlternate" is a pointer to an alternate pattern to try if the current
pattern fails to match the string.
"NextPattern" is a pointer to another pattern in the current pattern list.
This lets you concatenate patterns to form more complex patterns.
"EndPattern", "StartPattern", and "StrSeg" are words filled in by the pattern
matching routine so other code can locate the characters matched by this
particular pattern. In general, you should not modify these values.
The MatchFunction is where most of the work actually takes place. Currently
the standard library provides 18 different MatchFunctions:
Spancset
Brkcset
MatchStr
MatchiStr
MatchToStr
MatchChar
MatchToChar
MatchChars
MatchToPat
Anycset
NotAnycset
EOS
ARB
ARBNUM
Skip
POS
RPOS
GOTOpos
and RGOTOpos
Note: in order to gain access to these match functions you must place the
statement:
matchfuncs
somewhere in your program after including "pattern.a" or "stdlib.a". This
macro defines the externals for the match functions. Since these match
functions do not get called in the same manner as other standard library
routines, they do not get automatically linked with your programs. There
is a comment in the "shell.asm" file which activates this macro. Uncomment
this line and you'll be in great shape.
A brief description of the matching functions:
Spancset will match any number of characters belonging to a character set.
(This includes zero characters.) You specify the character set via a pointer
to a UCR Stdlib CSET. The pointer to this character set goes in the
"MatchParm" field of the above structure.
Brkcset will match any number of characters which are *not* in a character
set. That is, it will match up to a character in the specified character
set or until the end of the string, whichever comes first. Once again,
"MatchParm" contains a pointer to the character set to use.
MatchStr matches a specified string (something like the strcmp routine).
The "MatchParm" field points at the zero terminated string to match.
MatchiStr is like MatchStr except it converts the input string to uppercase
before comparing against the specified string (the specified string should
contain all upper case characters or it will not match).
MatchToStr matches all characters in a string up to and including the
string specified by the "MatchParm" field.
MatchChar matches a single character. This character must appear in the
L.O. byte of the "MatchParm" field. Note that this routine must match
exactly one character.
MatchChars matches zero or more occurrences of the same character. Again,
the character appears in the L.O. byte of the "MatchParm" field.
MatchToChar matches all characters up to, and including, the character
specified in the L.O. byte of the "MatchParm" field.
MatchToPat matches all characters up to, and including, the characters
matched by the pattern specified in the "MatchParm" field.
Anycset matches a single character from a character set. As usual, the
"MatchParm" field points at the character set to test.
NotAnycset matches a single character which is *not* in the specified
character set. Once again, "MatchParm" points at the character set to use.
EOS matches the end of the string (that is, the zero terminating byte).
ARB matches an arbitrary number of characters.
ARBNUM matches an arbitrary (zero or more) number of strings matching the
pattern specified in the "MatchParm" field.
Skip matches "n" arbitrary characters. The number of characters to skip is
specified in the L.O. word of the "MatchParm" field.
POS matches if the matching routine is currently at position "n" in the
string. "n" is given by the L.O. word of the "MatchParm" field.
RPOS matches if the matching routine is currently at position "n" from the
end of the string. Again, "n" is the L.O. word of the "MatchParm" field.
GOTOpos moves to the position in the string specified by the L.O. word of
the "MatchParm" field. This routine fails if it tries to move backwards in
the string or it attempts to move beyond the end of the string.
RGOTOpos moves to the position in the string "n" chars from the end of the
string ("n" being the L.O. word of "MatchParm"). Fails if this moves you
backwards in the string.
To understand how to use these routines, a little pattern matching theory
is in order. Pattern matching is quite similar to string comparison except
you don't need to match an exact string. Instead, you can specify arbitrary
*patterns* of characters to match. For example, an indentifier in a high
level language like Pascal consists of an alphabetic character following by
zero or more alphanumeric characters. You cannot perform a single string
comparison which will accept all possible Pascal identifiers (indeed, that
single comparison would only accept a single Pascal identifier). However,
you can easily create a pattern which will accept all Pascal ids:
Anycset(A-Za-z) Spancset(A-Za-z0-9)
Anycset above matches a single character from the specified set (alphabetic)
and the Spancset function matches zero or more characters from the alpha-
numeric character set. If you were to take a character string and *match*
it against the above pattern, the match would __succeed__ if the string is a
valid Pascal identifier, it would __fail__ if the string is not a valid
Pascal identifier.
Note, by the way, that the above pattern actually matches any string which
*begins* with a valid Pascal identifier. If the match routine exhausts
the pattern before exhausting characters in the string, it considers the
match successful. If you wanted to match *only* a Pascal identifier you
would use a pattern like the following:
Anycset(A-Za-z) Spancset(A-Za-z0-9) EOS
The "EOS" pattern matches the end of the string, so this pattern matches Pascal
identifiers only.
Of course, you can use pattern matching to perform string comparisons using
the MatchStr function:
MatchStr("Hello world") EOS
However, this is a frightfully expensive way to do a simple string comparison.
(Okay, it really isn't that bad, but it does take several times longer than,
say, strcmp.)
Although you wouldn't match character strings this way, using strings within
patterns is quite useful. Consider the following which matches
"value=<fp number>"
where "<fp number>" denotes a decimal value:
MatchStr("value=") Spancset(0-9) MatchChar('.') Spancset(0-9)
This pattern requires that the string begin with "value=<fp number>" though
anything could follow the decimal value in the string. If you wanted the
string to exactly match the above pattern, you could put an EOS at the end
of the pattern.
What if you wanted to test for the presence of the above pattern *anywhere*
in the string (not necessarily at the beginning)? This is easily accomplished
by the pattern:
ARB MatchStr("value=") Spancset(0-9) MatchChar('.') Spancset(0-9)
ARB matches an arbitrary number of characters. At first you might think
"gee, if it matches any number of characters, what's to prevent it from
matching everything to the end of the string and causing the rest of the
pattern to fail?" Well, simply put, the pattern matching algorithm tries
its absolute best to succeed. So ARB will match must enough characters
so that the string "value=<fp number>" will match the rest of the pattern,
if it is present in the string. To achieve this, the matching algorithm
usings *backtracking*. In a nutshell, backtracking works as follows:
the current pattern matches as many characters as it can (all of them in the
case of ARB). It then tries to match the remaining characters against the
rest of the pattern. If that fails, then it backs up and tries again
(logically you can think of ARB giving up one character at a time from the
end of the string until the remaining patterns match). If there is no match
after backing up back to the starting point, the whole pattern fails.
If this sounds expensive (slow), well, it is. That's why you would never
try and use the pattern matching primitives for simple string comparisons.
That's also why you should try to avoid two adjacent patterns which match
the same set of characters (since ARB matches anything, it will match the
same characters as any adjacent pattern, hence it may be slow).
Another important feature to this pattern matching system is the ability
to do *alternation*. Consider the following:
MatchStr("black") | MatchStr("blue")
The "|" symbol above denotes alternation, and is read as "or". This pattern
matches the string "black" *or* the string "blue". Okay, now consider the
following (very typical example in pattern matching):
[MatchStr("black") | MatchStr("blue")] [MatchStr("bird") | MatchStr("berry")]
The above pattern matches the strings "blackbird", "bluebird", "blackberry",
or "blueberry". The alternation operator lets you choose "black" or "blue"
from the first pattern and "bird" or "berry" from the second pattern.
This description of pattern matching theory could go on for quite some time,
but this is not the place for it. This discussion is intended to serve as
only an appetizer for you. If you want additional information on pattern
matching theory, pick up any SNOBOL4 manual, especially "Algorithms in
SNOBOL4" (by Gimpel, if I remember correctly). Also, copies of Vanilla
SNOBOL are floating around with an electronic manual. Among other things,
this contains the phone number and all of Catspaw which sells SNOBOL4 and
ICON products for various systems. They sell several texts on SNOBOL4 and
ICON (which are pattern matching languages). Since these library routines
are based on SNOBOL4, taking a look at SNOBOL4 will provide some insight
into these routines.
Now let's talk about how to actually specify a pattern in assembly language
using the pattern matching routines. As you probably figured out already,
you don't get to use the nice (SNOBOL4-like) syntax used the preceding
examples. Indeed, if there is any pain associated with pattern matching
in this package, it's setting up the patterns in the first place.
A pattern is a linked list of objects all of type "PATTERN" (see the
structure definition earlier). You must fill in all but the last three
fields of this structure (or live with the default value of zero).
The Pascal identifier pattern mentioned above would look something
like the following:
pasid pattern <Anycset,alphabetic,0,pasid2>
pasid2 pattern <Spanscet,alphanum>
(note: alphabetic and alphanum are standard character sets available in
UCR standard library. See the section on character sets for details).
The first pattern matches a single character from the "alphabetic" character
set. The second pattern matches zero or more characters from the "alphanum"
character set. In both cases the alternate field is zero (NULL/NIL) because
there is no alternate to match. In the first pattern above, the NextPattern
field contains the link to the "pasid" pattern which concatenates the two
patterns. Pasid2 has no link field since it is the last pattern in the
list (if the field is not present, it defaults to zero/NIL/NULL).
To actually match a string against a pattern you load ES:DI with the address
of the string to test and DX:SI with the address of the first pattern in
your pattern list. CX contains the offset of the last character you wish
to check in the string (set CX to zero if you want to match all characters
in the string). Then you execute the "match" procedure. On return, the
carry flag denotes success (C=1) or failure (C=0), AX contains the offset
of the character in the string immediately after the match.
lesi MyString
ldxi MyPattern
mov cx, 0
match
jc TheyMatched
By default, all patterns match characters at the beginning of the string.
As you've seen earlier in the document, you can use ARB to allow the
pattern to match at some other point in the string. For example, the following
matches any string which contains one or more alphabetic characters followed
by one or more decimal digits:
HasAnAlpha pattern <ARB,0,0,HAA2>
HAA2 pattern <Anycset,alphabetic,0,HAA3>
HAA3 pattern <Spancset,alphabetic,0,HAA4>
HAA4 pattern <Anycset,digits>
Since ARB does not require any parameters, you can use any value for the
second parameter to "pattern". Zero is as good a value as any other.
Note that "Spancset" by itself would not be sufficient for the alphabetic
matches. Spancset matches zero or more characters. We need to match one
or more characters. That is why the pattern above needs the Anycset and
Spancset patterns.
The above pattern data type matches its pattern *anywhere* in the target
string. If you wanted to force a match at the end of the string, you could
use the following pattern:
HasAnAlpha pattern <ARB,0,0,HAA2>
HAA2 pattern <Anycset,alphabetic,0,HAA3>
HAA3 pattern <Spancset,alphabetic,0,HAA4>
HAA4 pattern <Anycset,digits,0,HAA5>
HAA5 pattern <Spancset,digits,0,HAA6>
HAA6 pattern <EOS>
EOS doesn't require a parameter, so the above lets all the fields except the
function name default to zero.
The MatchAlternate field contains the address of a pattern to match if the
current pattern fails (and *only* if the current pattern fails). Consider
the blue/blackbird/berry pattern described earlier. You can easily implement
that pattern with the following statements:
BB pattern <MatchString,black,bluepat,BB2>
BB2 pattern <MatchString,berry,birdpat>
bluepat pattern <MatchString,blue,0,BB2>
birdpat pattern <MatchString,bird>
black db "black",0
blue db "blue",0
bird db "bird",0
berry db "berry",0
If you match BB against the string "blackberry" BB will match the black,
then it will go to BB2 which matches the berry. This string doesn't use
either alternate.
If you match BB against the string "blueberry" BB immediately fails, so
it tries the alternate pattern, bluepat. Bluepat matches blue and then
goes on to the BB2 pattern which matches the berry.
If you match BB against the string "blackbird", BB matches black and then
tries to match BB2 against bird. BB2 fails and tries its alternate (birdpat)
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -