📄 pattern.txt

📁 汇编编程艺术
💻 TXT
📖 第 1 页 / 共 4 页
字号:
with matches the characters "bird".

If you match BB against the string "bluebird", BB fails and tries its alter-
nate, bluepat.  Bluepat matches "blue" and passes control to its next pattern,
which is BB2.  BB2 tries to match "bird" and fails, so it passes control to
its alternate, birdpat, which matches bird.

The example above is pretty straightforward as far as alternation is concerned.
You can create some very sophisticated patterns with the alternation field.
For example, consider the following generic pattern:

	[pat1 | ] [pat2 | pat3]

"[pat1 | ]" is an easy way of saying that pat1 is optional.  You can easily
create this pattern as follows:

Pat1		pattern	<pat1func,pat1parm,pat2,pat2>
Pat2		pattern	<pat2func,pat2parm,pat3>
Pat3		pattern	<pat3func,pat3parm>

If you match a string against Pat1 and the beginning of the string does not
match Pat1, then it tries Pat2 instead (and if that fails, it tries Pat3
before failing).  If the string begins with a pattern matched by Pat1, the
matching algorithm then looks to see if the characters matching Pat1 are
followed by some character matching Pat2 or, alternately, Pat3.

There are all kinds of tricky ways you can use the alternation field to
create complex patterns and control the precendence of the pattern matching
algorithm.  This short document cannot even begin to describe the
possibilities.  You will need to experiment with this capability to discover
its true potential.


Creating Your Own Pattern Functions
-----------------------------------

Although the UCR Stdlib pattern matching routines include many of the
functions you'll typically want to use for pattern matching, it's quite
possible you'll want to write your own pattern matching functions.  This
is actually quite easy to do.  The matching functions are all far procedures
which the Match procedure calls with the following parameters:

ES:DI- Points at the first character of the character string the function
       should match against.  The match function should never look at
       characters before this string and it should not look beyond the end
       of the string (which is marked by a zero terminating byte).

DS:SI- Contains the four-byte parameter found the the MatchParm field.

CX-    Contains the last position plus one in the string you're allowed
       to compare.  Note that this may or may not point at the zero term-
       inating byte.  You must not scan beyond this character.  Generally,
       you can assume the zero terminating byte is at or after this location
       in the string.

On return, AX must contain the address (offset into the string) of the last
character matched *plus one*.  After your pattern matches, additional patterns
following in the pattern list will begin their matching at location ES:AX.
You must also return the carry set if your match succeeded, you must return
the carry clear if your match failed.

Note that the MATCH procedure is fully recursive and rentrant.  So you can
call MATCH recursively from inside your match function.  This helps make
writing your own match routines much easier.  (note: actually, you need to
call MATCH2, which is the reentrant version, from inside your match functions.)

As an example, let's consider the example above where we wanted to match
a string of one or more alphabetic characters following by one or more
digits anywhere in a string.  Consider the following pattern:

HAA		pattern	<ARB,0,0,HAA2>
HAA2		pattern	<MatchAlpha,0,0,HAA3>
HAA2		pattern	<MatchDigits>

The MatchAlpha and MatchDigits pattern functions are not provided in the
standard library, we will have to write them.  MatchAlpha matches one or
more alphabetic characters, MatchDigits matches one or more decimal digits.
Here's the routines that implement these two functions:


; Note that ES:DI & CX are already set up for these routines by the
; Match procedure.

MatchAlpha	proc	far		;Must be a far proc!
		push	dx
		push	si		;Preserve modified registers.
		ldxi	Alpha1		;Get pointer to "Match one or more
		match2			; alpha" pattern and match it.
		pop	si
		pop	dx
		ret
MatchAlpha	endp

MatchDigits	proc	far		;Must be a far proc!
		push	dx
		push	si		;Preserve modified registers.
		ldxi	Digits1		;Get pointer to "Match one or more
		match2			; digits" pattern and match it.
		pop	si
		pop	dx
		ret
MatchDigits	endp

Alpha1		pattern	<Anycset,alpha,0,Alpha2>
Alpha2		pattern	<Spancset,alpha>

Digits1		pattern	<Anycset,digits,0,Digits2>
Digits2		pattern	<Spancset,digits>


Note that the MatchAlpha and MatchDigits patterns do not require any
parameters from the MatchParm field, they intrinsically know what they
need to use.

Another way to accomplish the above is to write a generic "one or more
occurrences of a pattern" type of pattern.  The following code implements
this:

; Assume the "MatchParm" field contains a pointer to the pattern we
; want to repeat one or more times:

OneOrMore	proc	far
		push	dx
		push	di
		mov	dx, ds			;Point DX:SI at pattern.
		match2				;Make sure we get at least 1.
		jnc	Fails
MatchMore:      mov	di, ax			;Move on in string.
		match2
		jc	MatchMore
		pop	di
		pop	dx
		stc				;Return success
		ret

Fails:		pop	di
		pop	dx
		clc				;Return failure
		ret
OneOrMore	endp


A pattern which would match one or more alphabetics with this would be:

Alpha1ormore	pattern	<OneOrMore,alphaset>
AlphaSet	pattern	<Anycset,alpha>

You would specify the "Alpha1ormore" pattern to match one or more alphabetic
characters.


Of course, you can write any arbitrary function you choose for your match
function, you do not need to call MATCH2 from within your match function.
For example, a simple routine which matches one or more alphabetics followed
by one or more digits could be written as follows:

AlphaDigits	proc	far
		push	di

		cmp	di, cx
		jae	Failure
		mov	al, es:[di]
		and	al, 5fh			;Convert l.c. -> U.C.
		cmp	al, 'A'
		jb	Failure
		cmp	al, 'Z'
		ja	Failure
DoTheMore0:	inc	di
		cmp	di, cx
		jae	Failure
		mov	al, es:[di]
		and	al, 5fh
		cmp	al, 'A'
		jb	TryDigits
		cmp	al, 'Z'
		jbe	DoTheMore0

TryDigits:	mov	al, es:[di]
		xor	al, '0'			;See if in range '0'..'9'
		cmp	al, 10
		jae	Failure
DoTheMore1:	inc	di
		cmp	di, cx
		jae	Success
		mov	al, es:[di]
		xor	al, '0'
		cmp	al, 10
		jb	DoTheMore1
Success:	mov	ax, di			;Return ending posn in AX.
		pop	di
		stc				;Success!
		ret

Failure:	mov	ax, di			;Return failure position.
		pop	di
		clc				;Return failure.
		ret
AlphaDigits	endp


Note that the pattern matching function must return the failure position in
AX.  Also note that the routine must *not* search beyond the point specified
in the CX register.  These points did not appear in the previous code because
all of that was handled automatically by the MATCH2 routine.  Of course,
the matching function must set or clear the carry flag depending upon the
success of the operation.

There is a really sneaky way to simulate the use of parentheses in a pattern
to override the normal left-to-right evaluation of a pattern.  The SL_MATCH2
routine (which is what MATCH2 winds up calling) works quite well as a pattern
matching function.  Consider the following pattern:

	ParenPat	pattern	<sl_match2,HAA>

Now the ParenPat pattern will match anything the HAA pattern matches, however,
the system treats all of HAA as a single pattern (inside ParenPat) rather
than as a list of concatenated patterns.  This is real important when using
PATGRAB to extract portions of a pattern.  Patgrab can only extract the
characters belonging to a single pattern data structure (not a list).  However,
ParenPat above is a single pattern data structure which maintains the infor-
mation for the entire string matched by HAA.  Therefore, using patgrab on
ParenPat will extract the entire string matched by HAA.

Parenthetical operations can also simplify other patterns.  Keep in mind,
however, that the Alternate pointer field in the pattern structure can
also be used to simulate parenthetical patterns without the expense of
generating new patterns (see the previous examples).


A Note About Performance:

There are two aspects to efficiency programmers worry about: speed and memory
usage.  In the worst case, this pattern matching package does not fare well
in either category.  It is possible that the routines in this package could
consume several kilobytes of stack space while matching a string;  it is also
possible that the matching would take so long that it's impractical to use
this package.  Fortunately, these worst case scenerios are pathological
cases which rarely (if ever except in synthetic programs) occur.

Space is the first issue to address.  Each call to Match/Match2 can push
upwards of fifty bytes onto the stack.  This is in addition to the stack
space required by the low-level matching function.  Few of the built-in
matching functions push more than six or eight bytes, but you could write
your own matching function which pushes more.  It is very easy to design
a (synthetic) pattern which forces a nested, recursive, call of MATCH2 for
each character in the string you are going to match.  In such a case, this
package could require upwards of n*50 bytes on the stack where "n" is the
length of the string.  For a string with 100 characters, you'd probably
need 5K of stack space for the pattern matching routines alone.

In general, patterns would rarely exhibit this type of behavior.  Most low-
level pattern matching functions match several characters at once.  Further-
more, you almost never encounter patterns in real life which require a
recursive call for each character in the string.  Instead, most complex
patterns consist of simpler patterns concatenated together in a list.  This
does not require a nested recursive call for each character in the string,
rather, the package makes a call, matches some characters, then returns;  next,
the routines call the next sub-pattern in a similar fashion.  Note however,
that the state (stack space) for the previous sub-pattern has been reclaimed
from the stack at that point.

In practice, a typical pattern might require as much as 1K free stack space.
However, keep in mind that this is a "typical worst case value" not an
absolute worst case value.  Of course, you can control the amount of stack
space the pattern matching algorithms use by avoiding recursive pattern
definitions and avoiding parenthetical patterns (which stack up machine
state recursively) in complex patterns.  Of course, limiting the size of
the strings you're matching the pattern against will help as well.

Generally, the stack space used by the pattern matching algorithm is of
little concern.  Setting aside an extra kilobyte of memory, even five
kilobytes of memory, isn't a big problem for most programmers.  Speed,
on the other hand, can be a problem.  This pattern matching package uses
a generalized backtracking pattern matching algorithm.  You can easily
devise a pattern which runs in O(x**n) time where "x" is an arbitrary value
you get to pick (basically the number of possible alternations on each
sub-pattern in the pattern) and "n" is the length of the string you match.
The "O()" function notation simply means that if it takes "m" units of time
with a string of length "n", it will take "m*x" units of time for a string
of length "x+1".  Not very good.  For some patterns it could easily take
longer than your lifetime to match a string whose length is 100 characters.

Once again, we're fortunate in the sense that this terrible performance occurs
so rarely you need not be too concerned about it.  To achieve such bad per-
formance requires a specially prepared pattern matching a specially prepared
string.  Such combinations do not normally exist in nature!  However, while
100 year matching times may not occur much, most programmers are interested
in having their patterns match in milliseconds.  It is very easy to devise
a pattern which takes seconds, minutes, or possibly even hours, to match
some types of strings (second timing is very likely for some common strings
and patterns, minutes is rather rare, hours is very rare, but still much
more possible than the 100 year problem).  The really sad part is that slow
matching times is almost always due to a poor choice of pattern rather than
an intrinsic problem with the pattern matching algorithm.

Typically, all performance problems are directly related to the amount of
backtracking which must occur to match a pattern.  Backtracking almost always
occurs which you have two adjacent sub-patterns in a pattern where the
end of the first sub-pattern matches strings from the same set as the front
of the second sub-pattern.  Consider the following pattern:

		 spancset(a-z " " ",") matchstr("hello")

If you match this pattern against the string "hi there, hello" the first
spancset pattern matches the entire string.  The matchstr function would then
fail.  At this point, backtracking kicks in and backs up one character in the
string.  To the spancset function matches "hi there, hell" and the matchstr
function fails on "o".  So the algorithm backs up one more character and
tries again.  It keeps doing this until it backs up all the way to the
point where spancset matches "hi there, " and matchstr finally matches
"hello".  To get around this problem, a better choice of pattern is probably
in order.  Consider the following which generally does the same thing:

		matchtostr("hello")

Matchtostr skips over all characters in a string up to the string "hello"
and leaves the "match cursor" pointing just beyond the "o" in "hello".
This matches almost what the previous pattern will match but it does it a
whole lot faster.  It doesn't exactly match the previous pattern because
matchtostr matches any characters, not just (a-z " " ",") but for most
purposes this is exactly what you want anyway.

ARB is probably one of the worst offenders.  Anytime you use ARB (with
a sub-pattern following it in the pattern list), you are almost guaranteed
that backtracking will occur.  Therefore, you should attempt to avoid the
use of ARB in patterns where performance is a consideration.

In general, you can often redesign a pattern data structure to avoid
overlaps in adjacent sub-patterns.  Patterns which do not have such conflicts
will generally have reasonable performance.  Of course, designing such patterns
takes more effort and testing;  so it may not be worth the effort for quick
and dirty projects.  On the other hand, if you execute the pattern match more
than a few times (or the pattern matching starts to take minutes rather than
seconds) its probably worthwhile to redo the pattern.

Of course, this pattern matching package is not suitable for all pattern
matching tasks.  The whole purpose of this package was to make pattern
matching in assembly language easy, not fast.  You would never, for example,
want to write a lexical analyzer for a compiler using this package.  It
would be too slow and using languages like LEX and FLEX produce faster
(much faster) lexical analyzers and it's easier to create them with LEX/FLEX
as well.  Ditto for parsers using BISON or YACC.  Likewise, there are many
times when pattern matching languages like AWK, ICON, SNOBOL4, etc., are
more appropriate than using the routines in this package.  This package is
really intended for small pattern matching tasks inside a larger assembly
language program.  For example, parsing command line parameters or
parsing input lines typed by the user to request some activity in your
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -