📄 csregex.pas

       <p>A regular expression (or RE) specifies a set of strings that match
       it.
       The functions in this component let you check if a particular string
       matches a given regular expression (or if a given regular expression
       matches a particular string, which comes down to the same thing).</p>

       <p>Regular expressions can be concatenated to form new regular
       expressions.
       If A and B are both regular expressions, then AB is also a regular
       expression. If a string p matches A and another string q matches B, the
       string pq will match AB. Thus, complex expressions can easily be
       constructed from simpler ones like the primitives described here.
       For details of the theory and implementation of regular expressions,
       consult almost any textbook about compiler construction.</p>

       <p>A brief explanation of the format of regular expressions follows.</p>

       <p>Regular expressions can contain both special and ordinary characters.
       Ordinary characters, like 'A', 'a', or '0', are the simplest regular
       expressions; they simply match themselves. You can concatenate ordinary
       characters, so 'last' matches the characters 'last'. (In the rest of this
       section, we'll write RE's in this special font, usually without quotes,
       and strings to be matched 'in single quotes'.)</p>

       <p>Special characters either stand for classes of ordinary characters, or
       affect how the regular expressions around them are interpreted.</p>

       <p>The special characters are:</p>

       <p><b>.</b> (Dot.) Matches any character except a newline.</p>
       <p><b>^</b> (Caret.) Matches the start of the string.</p>
       <p><b>$</b> Matches the end of the string. foo matches both 'foo' and
       'foobar', while the regular expression 'foo$' matches only 'foo'.</p>
       <p><b>*</b> Causes the resulting RE to match 0 or more repetitions of the
       preceding RE. ab* will match 'a', 'ab', or 'a' followed by any number of
       'b's.</p>
       <p><b>+</b> Causes the resulting RE to match 1 or more repetitions of the
       preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's;
       <br>it will not match just 'a'.</p>
       <p><b>?</b> Causes the resulting RE to match 0 or 1 repetitions of the
       preceding RE. ab? will match either 'a' or 'ab'.</p>

       <p><b>\</b> Either escapes special characters (permitting you to match
       characters like '*?+&$'), or signals a special sequence; special
       sequences are discussed below.</p>

       <p><b>[ ]</b> Used to indicate a set of characters. Characters can be
       listed individually, or a range is indicated by giving two characters and
       separating them by a '-'. Special characters are not active inside sets.
       For example, [akm$] will match any of the characters 'a', 'k', 'm', or
       '$'; [a-z] will match any lowercase letter. If you want to include a ]
       inside a set, it must be the first character of the set.
       To include a -, place it as the first or last character.</p>

       <p>Characters not within a range can be matched by including a ^ as the
       first character of the set; ^ elsewhere will simply match the '^'
       character.</p>

       <p>The standard style is EGrep compatible, which means that (, ), and |
       do not have to be quoted. If you change the style setting, take the
       differences into account.</p>
       <p>See also <See Class="TmkreExpr" property="SyntaxStyles"></p>
       <p>The special sequences consist of '\' and a character from the list
       below. If the ordinary character is not on the list, then the resulting
       RE will match the second character. For example, \$ matches the character
       '$'.</p>

       <p><b>|</b> A|B, where A and B can be arbitrary REs, creates a regular
       expression that will match either A or B. This can be used inside groups
       (see below) as well.</p>
       <p><b>( )</b> Indicates the start and end of a group; the contents of a
       group can be matched later in the string with the \[1-9] special
       sequence, described next.</p>
       <p><b>\1</b> .. <b>\9</b> Matches the contents of the group of the same
       number. For example, (.+) \1 matches 'the the' or '55 55', but not 'the
       end' (note the space after the group). This special sequence can only be
       used to match one of the first 9 groups; groups with higher numbers can
       be matched using the \v sequence.</p>
       <p><b>\b</b> Matches the empty string, but only at the beginning or end
       of a word. A word is defined as a sequence of alphanumeric characters, so
       the end of a word is indicated by whitespace or a non-alphanumeric
       character.</p>
       <p><b>\B</b> Matches the empty string, but only when it is not at the
       beginning or end of a word.</p>
       <p><b>\v</b> Must be followed by a two digit decimal number, and matches
       the contents of the group of the same number. The group number must be
       between 1 and 99, inclusive.</p>
       <p><b>\d</b> Matches any digit; this is equivalent to the set [0-9].</p>
       <p><b>\D</b> Matches any non-digit; this is equivalent to the set [^0-9].
       </p>
       <p><b>\w</b> Matches any alphanumeric character; this is equivalent to
       the set [a-zA-Z0-9].</p>
       <p><b>\W</b> Matches any non-alphanumeric character; this is equivalent
       to the set [^a-zA-Z0-9].</p>
       <p><b>\<</b> Matches the empty string, but only at the beginning of a
       word. A word is defined as a sequence of alphanumeric characters, so the
       end of a word is indicated by whitespace or a non-alphanumeric character.
       </p>
       <p><b>\></b> Matches the empty string, but only at the end of a word.</p>
       <p><b>\\</b> (2 backslashes) Matches a literal backslash.</p>
        <p><b>\`</b> Like ^, this only matches at the start of the string.</p>
       <p><b>\'</b> Like $, this only matches at the end of the string.</p> }
    property Pattern: string read FPattern write Setpattern;

    {: Syntax style to use.
       <p>The following syntax rules are defined:</p>
       <UL>
         <LI><b>mkre_No_Bk_Parens</b> no quoting for parentheses  ( )
         <LI><b>mkre_No_Bk_Vbar</b> no quoting for vertical bar |
         <LI><b>mkre_Bk_Plus_Qm</b> quoting needed for + and ?
         <LI><b>mkre_Tight_Vbar</b> | binds tighter than ^ and $
         <LI><b>mkre_Newline_Or</b> treat newline (in expression) as or
         <LI><b>mkre_Context_Indep_Ops</b> ^$?*+ are special in all contexts
         <LI><b>mkre_Ansi_Hex</b> ansi sequences (\n etc) and \xhh
         <LI><b>mkre_No_Gnu_Extensions</b> no gnu extensions
         <LI><b>mkre_HighCharsWhitespace</b> all characters above 127 are treated as whitespace
       </UL>
       <p>Common styles:</p>
       <UL>
         <LI><b>AWK</b> mkre_No_Bk_Parens, mkre_No_Bk_Vbar and mkre_Context_Indep_Ops
         <LI><b>EGREP</b> mkre_No_Bk_Parens, mkre_No_Bk_Vbar, mkre_Context_Indep_Ops and
                 mkre_Newline_Or
         <LI><b>GREP</b> mkre_Bk_Plus_Qm and mkre_Newline_Or
         <LI><b>EMACS</b> none
       </UL>
       <p>Default setting when created:</p>
       <UL>
         <LI><b>EGREP</b> mkre_No_Bk_Parens, mkre_No_Bk_Vbar,
         mkre_Context_Indep_Ops and mkre_Newline_Or
       </UL>}
    property SyntaxStyles: TmkreSyntaxStyles read FSyntaxStyles write SetSyntaxStyles;

    {: This string will be matched by the pattern.

    <p>Matching can be started by:</p>
    <OL>
      <LI>Reading Matches
      <LI>calling the DoMatch method
      <LI>Setting Active to True
    </OL>
    <p>When assigning the contents of a text file to this property, be aware
    that the regular expression engine works on Unix text files, in which lines
    are separated by LF rather than CR/LF. The CRs must therefore be removed
    from the text first; the function mkConvertDos2Ux is provided for that
    purpose.</p>

    <p>See also: <See Class="TmkreExpr" Property ="Pattern">,
    <See Class="TmkreExpr" Property ="Matches">,
    <See Class="TmkreExpr" Method ="DoMatch">,
    <See Class="TmkreExpr" Property ="Active">,
    <See Routine="mkConvertDos2Ux"></p>}
    property Str: string read FStr write SetStr;

    {: This event is called every time a match is found in the string.
    <p></p>
    @Param Sender Is the TmkreRegexpr component that caused the event
    @Param str The match that was found
    @Param pos The start position of the match in Str
    @Param ret The end position of the match in Str
    @Param re_registers The groups that were found }
    property OnMatch: TOnMatchEvent read FOnMatch write FOnMatch;

    {: This event is triggered after the whole string is searched for matches,
    and before the Thread for matching is destroyed. }
    property OnEndMatch: TNotifyEvent read FOnEndMatch write FOnEndMatch;

    {:This event is called when a match is found for the search pattern.
    <p></p>
    @param Sender The TmkRegexpr component that caused the event
    @param str The match that was found
    @param pos The start position of the match in Str
    @param re_registers The groups that were found }
    property OnSearch: TOnSearchEvent read FOnSearch write FOnSearch;

    {: This event is called before any match is done. }
    property OnStartMatch: TNotifyEvent read FOnStartMatch write FOnStartMatch;
  end;

procedure csRegex_Install(ProcManager: PProcedureManager);
{This will install:

Function ReMatch (Str, Pattrn : String) : Boolean;
Function ReSearch(Str, Pattrn : String) : Integer;
}

implementation

const
  STACK_PAGE_SIZE = 256;
  NUM_REGISTERS = 256;
  NUM_LEVELS = 5;
  MAX_NESTING = 100;

{  The stack implementation is taken from an idea by Andrew Kuchling.
 * It's a doubly linked list of arrays. The advantage of this over a
 * simple linked list is that the number of mallocs required is
 * reduced. It also makes it possible to statically allocate enough
 * space so that small patterns don't ever need to call malloc.
 *
 * The advantage over a single array that is periodically
 * realloced when more space is needed is that we avoid ever copying
 * the stack. }
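
(* Illustrative sketch only, not part of this unit: a stand-alone program
   showing the paged-stack idea described above with plain Integer items.
   All names here (PagedStackSketch, TPage, TPagedStack, InitStack, Push,
   Pop, FreeStack, PAGE_SIZE) are hypothetical and chosen for the example;
   the real matcher stacks Titem_t records and uses STACK_PAGE_SIZE.

program PagedStackSketch;

const
  PAGE_SIZE = 4; // small on purpose so the demo crosses page boundaries

type
  PPage = ^TPage;
  TPage = record
    items: array[0..PAGE_SIZE - 1] of Integer;
    prev, next: PPage;
  end;

  TPagedStack = record
    index: Integer;  // next free slot in the current page
    current: PPage;  // page that index refers to
    first: TPage;    // first page lives inside the record: no allocation
  end;

procedure InitStack(var S: TPagedStack);
begin
  S.index := 0;
  S.first.prev := nil;
  S.first.next := nil;
  S.current := @S.first;
end;

procedure Push(var S: TPagedStack; Value: Integer);
var
  P: PPage;
begin
  if S.index = PAGE_SIZE then
  begin
    // Current page is full: reuse the next page, or allocate and link one.
    if S.current^.next = nil then
    begin
      New(P);
      P^.prev := S.current;
      P^.next := nil;
      S.current^.next := P;
    end;
    S.current := S.current^.next;
    S.index := 0;
  end;
  S.current^.items[S.index] := Value;
  Inc(S.index);
end;

function Pop(var S: TPagedStack; var Value: Integer): Boolean;
begin
  if S.index = 0 then
  begin
    if S.current^.prev = nil then
    begin
      Pop := False; // stack is empty
      Exit;
    end;
    // Step back to the previous page; existing items are never copied.
    S.current := S.current^.prev;
    S.index := PAGE_SIZE;
  end;
  Dec(S.index);
  Value := S.current^.items[S.index];
  Pop := True;
end;

procedure FreeStack(var S: TPagedStack);
var
  P, N: PPage;
begin
  P := S.first.next; // only pages after the first were heap-allocated
  while P <> nil do
  begin
    N := P^.next;
    Dispose(P);
    P := N;
  end;
  InitStack(S);
end;

var
  S: TPagedStack;
  I, V: Integer;
begin
  InitStack(S);
  for I := 1 to 10 do
    Push(S, I);           // spills over into two extra pages
  while Pop(S, V) do
    Write(V, ' ');        // items come back in reverse order of pushing
  WriteLn;
  FreeStack(S);
end.
*)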


type
  Pitem_t = ^Titem_t;
{: item_t is the basic stack element.  Defined as a union of
   structures so that both registers, failure points, and counters can
   be pushed/popped from the stack.  There's nothing built into the
   item to keep track of whether a certain stack item is a register, a
   failure point, or a counter. }
  Titem_t = record
    reg_num: integer;
    reg_level: integer;
    reg_start: PChar;
    reg_end: PChar;
    fail_count: integer;
    fail_level: integer;
    fail_phantom: integer;
    fail_code: PChar;
    fail_text: PChar;

   //MK: union causes error
{  case integer of
    0: (reg_num: integer;
        reg_level: integer;
        reg_start: PChar;
        reg_end: PChar);
    1: (fail_count: integer;
        fail_level: integer;
        fail_phantom: integer;
        fail_code: PChar;
        fail_text: PChar); }
  end;

  Pitem_page_t = ^Titem_page_t;
{: A 'page' of stack items. }
  Titem_page_t = record
    items: array[0..NUM_REGISTERS] of Titem_t;
    prev: Pitem_page_t;
    next: Pitem_page_t;
  end;

{: Structure to encapsulate the stack. }
  Tstack = record
    index: integer;
    current: Pitem_page_t; // Pointer to the current page.
    first: Titem_page_t; // First page is statically allocated.
  end;

  Tmatch_state = record
{:  The number of registers that have been pushed onto the stack
    since the last failure point. }
    count: integer;

{:  Used to control when registers need to be pushed onto the stack.}
    level: integer;

{:  The number of failure points on the stack.}
    point: integer;

{:  Storage for the registers.  Each register consists of two
    pointers to characters.  So register N is represented as
    start[N] and end[N].  The pointers must be converted to
    offsets from the beginning of the string before returning the
    registers to the calling program. }
    _start: array[0..NUM_REGISTERS] of Pchar;
    _end: array[0..NUM_REGISTERS] of Pchar;

{:  Keeps track of whether a register has changed recently.}
    changed: array[0..NUM_REGISTERS] of integer;

{   The stack. Its index field is the index into the current page.
    If index == 0 and you need to pop an item, move to the previous
    page and set index = STACK_PAGE_SIZE - 1.  Otherwise decrement
    index to pop an item. If index == STACK_PAGE_SIZE and you need
    to push an item, move to the next page and set index = 0. If
    there is no next page, allocate a new page and link it in.
    Otherwise, increment index to push an item.}
    stack: Tstack;
  end;

function rematchProc(ID: Pointer; Name: string; params: PVariableManager; Res: PCajVariant): Word; far;
var
  vPattern: PCajVariant;
  vString: PCajVariant;
  re: TcsReExpr;
  regs: Tmkre_registers;
begin
  rematchProc := ENoError;
  re := TcsReExpr.Create;
  try
    vString := GetVarLink(VM_Get(Params, 0));
    vPattern := GetVarLink(VM_Get(Params, 1));
    if Assigned(vString) and Assigned(vPattern) then begin
      re.Str := vString.Cv_Str;
      re.Pattern := vPattern.Cv_Str;
      Res^.cv_Bool := re.re_Match(1, @regs) > 0;
    end else
      Res^.cv_Bool := True;
  finally
    re.Free;
  end;
end;

function reSearchProc(ID: Pointer; Name: string; params: PVariableManager; Res: PCajVariant): Word; far;
var
  vPattern: PCajVariant;
  vString: PCajVariant;
  re: TcsReExpr;
begin
  reSearchProc := ENoError;
  re := TcsReExpr.Create;
  try
    vString := GetVarLink(VM_Get(Params, 0));
    vPattern := GetVarLink(VM_Get(Params, 1));
    if Assigned(vString) and Assigned(vPattern) then begin
      re.Str := vString.Cv_Str;
      re.Pattern := vPattern.Cv_Str;
      SetInteger(Res, re.DoSearch(1));
    end else
      SetInteger(Res, 0);
  finally
    re.Free;
  end;
end;

procedure csRegex_Install(ProcManager: PProcedureManager);
begin
{$IFDEF DELPHI}
  PM_Add(ProcManager, '14 REMATCH STRING 8 PATTERN 8', @rematchProc); //bool REMATCH (string string)
  PM_Add(ProcManager, '6 RESEARCH STRING 8 PATTERN 8', @reSearchProc); //integer RESEARCH (string string)
{$ELSE}
  PM_Add(ProcManager, '14 REMATCH STRING 8 PATTERN 8', rematchProc); //bool REMATCH (string string)
  PM_Add(ProcManager, '6 RESEARCH STRING 8 PATTERN 8', reSearchProc); //integer RESEARCH (string string)
{$ENDIF}
end;
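
(* Usage sketch, not part of the original unit: driving TcsReExpr directly
   from Pascal, mirroring what reSearchProc above does for the script
   engine. The procedure name DemoSearch, the sample text, and the pattern
   are hypothetical. The meaning of the value returned by DoSearch, and of
   its argument 1 (copied from reSearchProc), is not documented in this part
   of the file, so the sketch only reports the raw number.

procedure DemoSearch;
var
  re: TcsReExpr;
  found: Integer;
begin
  re := TcsReExpr.Create;
  try
    re.Str := 'order 66 shipped';  // text to search
    re.Pattern := '\d+';           // one or more digits, default EGREP style
    found := re.DoSearch(1);
    WriteLn('DoSearch returned ', found);
  finally
    re.Free;                       // always release the expression object
  end;
end;
*)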

{: Initializes the component.
   <P>By default the component is not active and is EGREP compatible, and the
   fastmap is used.</P>}

constructor TcsReExpr.Create;
begin
  inherited Create;
  FActive := False;
  FNoChange := False;
  FStyleChange := True;
  FMatches := TStringList.Create;
  SetSyntaxStyles([mkre_No_Bk_Parens, mkre_No_Bk_Vbar, mkre_Context_Indep_Ops,
    mkre_Newline_Or]);
  SetLength(regexp_t.fastmap, 256);
  regexp_t.translate := '';
  regexp_t.fastmap_accurate := False;
  regexp_t.can_be_null := #0;
  regexp_t.uses_registers := True;
  regexp_t.anchor := 0;
  FUseFastmap := True;
end;

{: Deinitialize the component. Memory and threads are freed. }