re.java

From "jakarta-regexp-1.5 regular expression source code" · Java code · 1,748 lines total · page 1 of 5
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.regexp;

import java.io.Serializable;
import java.util.Vector;

/**
 * RE is an efficient, lightweight regular expression evaluator/matcher
 * class. Regular expressions are pattern descriptions which enable
 * sophisticated matching of strings.  In addition to being able to
 * match a string against a pattern, you can also extract parts of the
 * match.  This is especially useful in text parsing! Details on the
 * syntax of regular expression patterns are given below.
 *
 * <p>
 * To compile a regular expression (RE), you can simply construct an RE
 * matcher object from the string specification of the pattern, like this:
 *
 * <pre>
 *  RE r = new RE("a*b");
 * </pre>
 *
 * <p>
 * Once you have done this, you can call either of the RE.match methods to
 * perform matching on a String.  For example:
 *
 * <pre>
 *  boolean matched = r.match("aaaab");
 * </pre>
 *
 * will cause the boolean matched to be set to true because the
 * pattern "a*b" matches the string "aaaab".
 *
 * <p>
 * If you were interested in the <i>number</i> of a's which matched the
 * first part of our example expression, you could change the expression to
 * "(a*)b".  Then when you compiled the expression and matched it against
 * something like "xaaaab", you would get results like this:
 *
 * <pre>
 *  RE r = new RE("(a*)b");                  // Compile expression
 *  boolean matched = r.match("xaaaab");     // Match against "xaaaab"
 *
 *  String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
 *  String insideParens = r.getParen(1);     // insideParens will be 'aaaa'
 *
 *  int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
 *  int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
 *  int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5
 *
 *  int startInside = r.getParenStart(1);    // startInside will be index 1
 *  int endInside = r.getParenEnd(1);        // endInside will be index 5
 *  int lenInside = r.getParenLength(1);     // lenInside will be 4
 * </pre>
 *
 * You can also refer to the contents of a parenthesized expression
 * within a regular expression itself.  This is called a
 * 'backreference'.  The first backreference in a regular expression is
 * denoted by \1, the second by \2 and so on.  So the expression:
 *
 * <pre>
 *  ([0-9]+)=\1
 * </pre>
 *
 * will match any string of the form n=n (like 0=0 or 2=2).
 *
 * <p>
 * The full regular expression syntax accepted by RE is described here:
 *
 * <pre>
 *
 *  <b><font face=times roman>Characters</font></b>
 *
 *    <i>unicodeChar</i>   Matches any identical unicode character
 *    \                    Used to quote a meta-character (like '*')
 *    \\                   Matches a single '\' character
 *    \0nnn                Matches a given octal character
 *    \xhh                 Matches a given 8-bit hexadecimal character
 *    \\uhhhh              Matches a given 16-bit hexadecimal character
 *    \t                   Matches an ASCII tab character
 *    \n                   Matches an ASCII newline character
 *    \r                   Matches an ASCII return character
 *    \f                   Matches an ASCII form feed character
 *
 *
 *  <b><font face=times roman>Character Classes</font></b>
 *
 *    [abc]                Simple character class
 *    [a-zA-Z]             Character class with ranges
 *    [^abc]               Negated character class
 * </pre>
 *
 * <b>NOTE:</b> Incomplete ranges will be interpreted as &quot;starts
 * from zero&quot; or &quot;ends with last character&quot;.
 * <br>
 * I.e. [-a] is the same as [\\u0000-a], and [a-] is the same as [a-\\uFFFF],
 * [-] means &quot;all characters&quot;.
 *
 * <pre>
 *
 *  <b><font face=times roman>Standard POSIX Character Classes</font></b>
 *
 *    [:alnum:]            Alphanumeric characters.
 *    [:alpha:]            Alphabetic characters.
 *    [:blank:]            Space and tab characters.
 *    [:cntrl:]            Control characters.
 *    [:digit:]            Numeric characters.
 *    [:graph:]            Characters that are printable and are also visible.
 *                         (A space is printable, but not visible, while an
 *                         `a' is both.)
 *    [:lower:]            Lower-case alphabetic characters.
 *    [:print:]            Printable characters (characters that are not
 *                         control characters.)
 *    [:punct:]            Punctuation characters (characters that are not letters,
 *                         digits, control characters, or space characters).
 *    [:space:]            Space characters (such as space, tab, and formfeed,
 *                         to name a few).
 *    [:upper:]            Upper-case alphabetic characters.
 *    [:xdigit:]           Characters that are hexadecimal digits.
 *
 *
 *  <b><font face=times roman>Non-standard POSIX-style Character Classes</font></b>
 *
 *    [:javastart:]        Start of a Java identifier
 *    [:javapart:]         Part of a Java identifier
 *
 *
 *  <b><font face=times roman>Predefined Classes</font></b>
 *
 *    .         Matches any character other than newline
 *    \w        Matches a "word" character (alphanumeric plus "_")
 *    \W        Matches a non-word character
 *    \s        Matches a whitespace character
 *    \S        Matches a non-whitespace character
 *    \d        Matches a digit character
 *    \D        Matches a non-digit character
 *
 *
 *  <b><font face=times roman>Boundary Matchers</font></b>
 *
 *    ^         Matches only at the beginning of a line
 *    $         Matches only at the end of a line
 *    \b        Matches only at a word boundary
 *    \B        Matches only at a non-word boundary
 *
 *
 *  <b><font face=times roman>Greedy Closures</font></b>
 *
 *    A*        Matches A 0 or more times (greedy)
 *    A+        Matches A 1 or more times (greedy)
 *    A?        Matches A 1 or 0 times (greedy)
 *    A{n}      Matches A exactly n times (greedy)
 *    A{n,}     Matches A at least n times (greedy)
 *    A{n,m}    Matches A at least n but not more than m times (greedy)
 *
 *
 *  <b><font face=times roman>Reluctant Closures</font></b>
 *
 *    A*?       Matches A 0 or more times (reluctant)
 *    A+?       Matches A 1 or more times (reluctant)
 *    A??       Matches A 0 or 1 times (reluctant)
 *
 *
 *  <b><font face=times roman>Logical Operators</font></b>
 *
 *    AB        Matches A followed by B
 *    A|B       Matches either A or B
 *    (A)       Used for subexpression grouping
 *   (?:A)      Used for subexpression clustering (just like grouping but
 *              no backrefs)
 *
 *
 *  <b><font face=times roman>Backreferences</font></b>
 *
 *    \1    Backreference to 1st parenthesized subexpression
 *    \2    Backreference to 2nd parenthesized subexpression
 *    \3    Backreference to 3rd parenthesized subexpression
 *    \4    Backreference to 4th parenthesized subexpression
 *    \5    Backreference to 5th parenthesized subexpression
 *    \6    Backreference to 6th parenthesized subexpression
 *    \7    Backreference to 7th parenthesized subexpression
 *    \8    Backreference to 8th parenthesized subexpression
 *    \9    Backreference to 9th parenthesized subexpression
 * </pre>
 *
 * <p>
 * All closure operators (+, *, ?, {m,n}) are greedy by default, meaning
 * that they match as many elements of the string as possible without
 * causing the overall match to fail.  If you want a closure to be
 * reluctant (non-greedy), you can simply follow it with a '?'.  A
 * reluctant closure will match as few elements of the string as
 * possible when finding matches.  {m,n} closures don't currently
 * support reluctancy.
 *
 * <p>
 * <b><font face="times roman">Line terminators</font></b>
 * <br>
 * A line terminator is a one- or two-character sequence that marks
 * the end of a line of the input character sequence. The following
 * are recognized as line terminators:
 * <ul>
 * <li>A newline (line feed) character ('\n'),</li>
 * <li>A carriage-return character followed immediately by a newline character ("\r\n"),</li>
 * <li>A standalone carriage-return character ('\r'),</li>
 * <li>A next-line character ('\u0085'),</li>
 * <li>A line-separator character ('\u2028'), or</li>
 * <li>A paragraph-separator character ('\u2029').</li>
 * </ul>
 *
 * <p>
 * RE runs programs compiled by the RECompiler class.  But the RE
 * matcher class does not include the actual regular expression compiler
 * for reasons of efficiency.  In fact, if you want to pre-compile one
 * or more regular expressions, the 'recompile' class can be invoked
 * from the command line to produce compiled output like this:
 *
 * <pre>
 *    // Pre-compiled regular expression "a*b"
 *    char[] re1Instructions =
 *    {
 *        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
 *        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
 *        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
 *        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
 *        0x0000,
 *    };
 *
 *
 *    REProgram re1 = new REProgram(re1Instructions);
 * </pre>
 *
 * You can then construct a regular expression matcher (RE) object from
 * the pre-compiled expression re1 and thus avoid the overhead of
 * compiling the expression at runtime. If you require more dynamic
 * regular expressions, you can construct a single RECompiler object and
 * re-use it to compile each expression. Similarly, you can change the
 * program run by a given matcher object at any time. However, RE and
 * RECompiler are not threadsafe (for efficiency reasons, and because
 * requiring thread safety in this class is deemed to be a rare
 * requirement), so you will need to construct a separate compiler or
 * matcher object for each thread (unless you do thread synchronization
 * yourself). Once an expression is compiled into an REProgram object, that
 * REProgram can be safely shared across multiple threads and RE objects.
 *
 * <br><p><br>
 *
 * <font color="red">
 * <i>ISSUES:</i>
 *
 * <ul>
 *  <li>com.weusours.util.re is not currently compatible with all
 *      standard POSIX regcomp flags</li>
 *  <li>com.weusours.util.re does not support POSIX equivalence classes
 *      ([=foo=] syntax) (I18N/locale issue)</li>
 *  <li>com.weusours.util.re does not support nested POSIX character
 *      classes (definitely should, but not completely trivial)</li>
 *  <li>com.weusours.util.re does not support POSIX character collation
 *      concepts ([.foo.] syntax) (I18N/locale issue)</li>
 *  <li>Should there be different matching styles (simple, POSIX, Perl etc?)</li>
 *  <li>Should RE support character iterators (for backwards RE matching!)?</li>
 *  <li>Should RE support reluctant {m,n} closures (does anyone care)?</li>
 *  <li>Not *all* possibilities are considered for greediness when backreferences
 *      are involved (as POSIX suggests should be the case).  The POSIX RE
 *      "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
 *      of acdacaa where \1 is "a".  This is not the case in this RE package,
 *      and actually Perl doesn't go to this extent either!  Until someone
 *      actually complains about this, I'm not sure it's worth "fixing".
 *      If it ever is fixed, test #137 in RETest.txt should be updated.</li>
 * </ul>
 *
 * </font>
 *
 * @see recompile
 * @see RECompiler
 *
 * @author <a href="mailto:jonl@muppetlabs.com">Jonathan Locke</a>
 * @author <a href="mailto:ts@sch-fer.de">Tobias Sch&auml;fer</a>
 * @version $Id: RE.java 518156 2007-03-14 14:31:26Z vgritsenko $
 */
public class RE implements Serializable
{
    /**
     * Specifies normal, case-sensitive matching behaviour.
     */
    public static final int MATCH_NORMAL          = 0x0000;

    /**
     * Flag to indicate that matching should be case-independent (folded)
     */
    public static final int MATCH_CASEINDEPENDENT = 0x0001;

    /**
     * Newlines should match as BOL/EOL (^ and $)
     */
    public static final int MATCH_MULTILINE       = 0x0002;

    /**
     * Consider all input a single body of text - newlines are matched by .
     */
    public static final int MATCH_SINGLELINE      = 0x0004;

    /************************************************
     *                                              *
     * The format of a node in a program is:        *
     *                                              *
     * [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ] *
     *                                              *
     * char OPCODE - instruction                    *
     * char OPDATA - modifying data                 *
     * char OPNEXT - next node (relative offset)    *
     *                                              *
     ************************************************/

                 //   Opcode              Char       Opdata/Operand  Meaning
                 //   ----------          ---------- --------------- --------------------------------------------------
    static final char OP_END              = 'E';  //                 end of program
    static final char OP_BOL              = '^';  //                 match only if at beginning of line
    static final char OP_EOL              = '$';  //                 match only if at end of line
    static final char OP_ANY              = '.';  //                 match any single character except newline
    static final char OP_ANYOF            = '[';  // count/ranges    match any char in the list of ranges
    static final char OP_BRANCH           = '|';  // node            match this alternative or the next one
    static final char OP_ATOM             = 'A';  // length/string   length of string followed by string itself
    static final char OP_STAR             = '*';  // node            kleene closure
    static final char OP_PLUS             = '+';  // node            positive closure
    static final char OP_MAYBE            = '?';  // node            optional closure
    static final char OP_ESCAPE           = '\\'; // escape          special escape code char class (escape is E_* code)
    static final char OP_OPEN             = '(';  // number          nth opening paren
    static final char OP_OPEN_CLUSTER     = '<';  //                 opening cluster
    static final char OP_CLOSE            = ')';  // number          nth closing paren
    static final char OP_CLOSE_CLUSTER    = '>';  //                 closing cluster
    static final char OP_BACKREF          = '#';  // number          reference nth already matched parenthesized string
    static final char OP_GOTO             = 'G';  //                 nothing but a (back-)pointer
    static final char OP_NOTHING          = 'N';  //                 match null string such as in '(a|)'
    static final char OP_CONTINUE         = 'C';  //                 continue to the following command (ignore next)
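The node layout described in the comment block above ([ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ]) can be checked against the pre-compiled "a*b" program quoted in the class Javadoc. The following is a minimal sketch, not part of the library: `ProgramDump`, `SAMPLE`, `NODE_SIZE` and `decode` are hypothetical names introduced here for illustration. It assumes that OPNEXT is a signed 16-bit offset relative to the start of its node (which is how 0xfff6 reads as -10), and it only knows the operand length of OP_ATOM nodes, which is enough for this sample; opcodes such as OP_ANYOF that also carry operands are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: ProgramDump is a hypothetical helper, not part of org.apache.regexp.
class ProgramDump {
    // Pre-compiled program for "a*b", copied from the RE class Javadoc.
    static final char[] SAMPLE = {
        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
        0x0000,
    };

    static final int NODE_SIZE = 3;   // OPCODE, OPDATA, OPNEXT
    static final char OP_ATOM = 'A';  // same value as RE.OP_ATOM in the listing

    // Walks the program node by node and returns one description per node.
    static List<String> decode(char[] program) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i + NODE_SIZE <= program.length) {
            char opcode = program[i];
            int opdata = program[i + 1];
            // Assumption: OPNEXT is a signed offset relative to this node's index.
            int next = i + (short) program[i + 2];

            StringBuilder sb = new StringBuilder();
            sb.append(i).append(": ").append(opcode)
              .append(" opdata=").append(opdata)
              .append(" next=").append(next);

            int operandLen = 0;
            if (opcode == OP_ATOM) {
                // For OP_ATOM, OPDATA is the string length and the string follows the header.
                operandLen = opdata;
                sb.append(" \"")
                  .append(new String(program, i + NODE_SIZE, operandLen))
                  .append('"');
            }
            out.add(sb.toString());
            i += NODE_SIZE + operandLen;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : decode(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Under these assumptions the sample decodes into nine nodes. The OP_GOTO node at index 13 points back to index 3, which appears to be the loop that implements `a*`, and the final OP_ATOM node at index 22 carries the literal "b" before OP_END at index 26.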