📄 re.java
package org.apache.regexp;
/*
* ====================================================================
*
* The Apache Software License, Version 1.1
*
* Copyright (c) 1999 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution, if
* any, must include the following acknowlegement:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowlegement may appear in the software itself,
* if and wherever such third-party acknowlegements normally appear.
*
* 4. The names "The Jakarta Project", "Jakarta-Regexp", and "Apache Software
* Foundation" must not be used to endorse or promote products derived
* from this software without prior written permission. For written
* permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache"
* nor may "Apache" appear in their names without prior written
* permission of the Apache Group.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*
*/
import java.util.Vector;
/**
* RE is an efficient, lightweight regular expression evaluator/matcher class.
* Regular expressions are pattern descriptions which enable sophisticated matching of
* strings. In addition to being able to match a string against a pattern, you
* can also extract parts of the match. This is especially useful in text parsing!
* Details on the syntax of regular expression patterns are given below.
*
* <p>
*
* To compile a regular expression (RE), you can simply construct an RE matcher
* object from the string specification of the pattern, like this:
*
* <pre>
*
* RE r = new RE("a*b");
*
* </pre>
*
* <p>
*
* Once you have done this, you can call either of the RE.match methods to
* perform matching on a String. For example:
*
* <pre>
*
* boolean matched = r.match("aaaab");
*
* </pre>
*
* will cause the boolean matched to be set to true because the
* pattern "a*b" matches the string "aaaab".
*
* <p>
* If you were interested in the <i>number</i> of a's which matched the first
* part of our example expression, you could change the expression to
* "(a*)b". Then when you compiled the expression and matched it against
* something like "xaaaab", you would get results like this:
*
* <pre>
*
* RE r = new RE("(a*)b"); // Compile expression
* boolean matched = r.match("xaaaab"); // Match against "xaaaab"
*
* <br>
*
* String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
* String insideParens = r.getParen(1); // insideParens will be 'aaaa'
*
* <br>
*
* int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
* int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
* int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
*
* <br>
*
* int startInside = r.getParenStart(1); // startInside will be index 1
* int endInside = r.getParenEnd(1); // endInside will be index 5
* int lenInside = r.getParenLength(1); // lenInside will be 4
*
* </pre>
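*
* The offsets above can be cross-checked with the JDK's java.util.regex
* classes. The sketch below is only an illustration: it uses Pattern and
* Matcher instead of RE (an assumption, so that it runs without this package
* on the classpath), and the class name ParenOffsetsSketch is hypothetical.
* Here Matcher.find() stands in for r.match(), and start(n)/end(n) stand in
* for getParenStart(n)/getParenEnd(n).
*
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; java.util.regex is used as a stand-in for RE.
final class ParenOffsetsSketch {
    // Returns {start, end, length} of the given group for the first match
    // of pattern anywhere in input, or null when nothing matches.
    static int[] offsets(String pattern, String input, int group) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        int start = m.start(group); // index of the first matched character
        int end = m.end(group);     // index just past the last matched character
        return new int[] { start, end, end - start };
    }

    public static void main(String[] args) {
        int[] whole = offsets("(a*)b", "xaaaab", 0);
        int[] inside = offsets("(a*)b", "xaaaab", 1);
        System.out.println(whole[0] + " " + whole[1] + " " + whole[2]);    // prints "1 6 5"
        System.out.println(inside[0] + " " + inside[1] + " " + inside[2]); // prints "1 5 4"
    }
}
```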
*
* You can also refer to the contents of a parenthesized expression within
* a regular expression itself. This is called a 'backreference'. The first
* backreference in a regular expression is denoted by \1, the second by \2
* and so on. So the expression:
*
* <pre>
*
* ([0-9]+)=\1
*
* </pre>
*
* will match any string of the form n=n (like 0=0 or 2=2).
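*
* A minimal sketch of that backreference in action, written against the
* JDK's java.util.regex rather than RE (an assumption made so the example is
* self-contained; the class and method names are hypothetical). Unlike a
* substring search, Matcher.matches() requires the whole input to have the
* form n=n.
*
```java
import java.util.regex.Pattern;

// Hypothetical helper; java.util.regex is used as a stand-in for RE.
final class BackreferenceSketch {
    // In a Java string literal the backreference \1 is written as "\\1".
    private static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    // True when the entire input is a number, '=', and the same number again.
    static boolean isSameOnBothSides(String input) {
        return SAME_NUMBER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSameOnBothSides("0=0"));   // prints "true"
        System.out.println(isSameOnBothSides("12=12")); // prints "true"
        System.out.println(isSameOnBothSides("1=2"));   // prints "false"
    }
}
```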
*
* <p>
*
* The full regular expression syntax accepted by RE is described here:
*
* <pre>
*
* <br>
*
* <b><font face=times roman>Characters</font></b>
*
* <br>
*
*    <i>unicodeChar</i>     Matches any identical unicode character
*    \               Used to quote a meta-character (like '*')
*    \\              Matches a single '\' character
*    \0nnn           Matches a given octal character
*    \xhh            Matches a given 8-bit hexadecimal character
*    \\uhhhh         Matches a given 16-bit hexadecimal character
*    \t              Matches an ASCII tab character
*    \n              Matches an ASCII newline character
*    \r              Matches an ASCII return character
*    \f              Matches an ASCII form feed character
*
* <br>
*
* <b><font face=times roman>Character Classes</font></b>
*
* <br>
*
*    [abc]           Simple character class
*    [a-zA-Z]        Character class with ranges
*    [^abc]          Negated character class
*
* <br>
*
* <b><font face=times roman>Standard POSIX Character Classes</font></b>
*
* <br>
*
*    [:alnum:]       Alphanumeric characters.
*    [:alpha:]       Alphabetic characters.
*    [:blank:]       Space and tab characters.
*    [:cntrl:]       Control characters.
*    [:digit:]       Numeric characters.
*    [:graph:]       Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
*    [:lower:]       Lower-case alphabetic characters.
*    [:print:]       Printable characters (characters that are not control characters).
*    [:punct:]       Punctuation characters (characters that are not letters, digits, control characters, or space characters).
*    [:space:]       Space characters (such as space, tab, and formfeed, to name a few).
*    [:upper:]       Upper-case alphabetic characters.
*    [:xdigit:]      Characters that are hexadecimal digits.
*
* <br>
*
* <b><font face=times roman>Non-standard POSIX-style Character Classes</font></b>
*
* <br>
*
*    [:javastart:]   Start of a Java identifier
*    [:javapart:]    Part of a Java identifier
*
* <br>
*
* <b><font face=times roman>Predefined Classes</font></b>
*
* <br>
*
*    .               Matches any character other than newline
*    \w              Matches a "word" character (alphanumeric plus "_")
*    \W              Matches a non-word character
*    \s              Matches a whitespace character
*    \S              Matches a non-whitespace character
*    \d              Matches a digit character
*    \D              Matches a non-digit character
*
* <br>
*
* <b><font face=times roman>Boundary Matchers</font></b>
*
* <br>
*
*    ^               Matches only at the beginning of a line
*    $               Matches only at the end of a line
*    \b              Matches only at a word boundary
*    \B              Matches only at a non-word boundary
*
* <br>
*
* <b><font face=times roman>Greedy Closures</font></b>
*
* <br>
*
*    A*              Matches A 0 or more times (greedy)
*    A+              Matches A 1 or more times (greedy)
*    A?              Matches A 0 or 1 times (greedy)
*    A{n}            Matches A exactly n times (greedy)
*    A{n,}           Matches A at least n times (greedy)
*    A{n,m}          Matches A at least n but not more than m times (greedy)
*
* <br>
*
* <b><font face=times roman>Reluctant Closures</font></b>
*
* <br>
*
*    A*?             Matches A 0 or more times (reluctant)
*    A+?             Matches A 1 or more times (reluctant)
*    A??             Matches A 0 or 1 times (reluctant)
*
* <br>
*
* <b><font face=times roman>Logical Operators</font></b>
*
* <br>
*
*    AB              Matches A followed by B
*    A|B             Matches either A or B
*    (A)             Used for subexpression grouping
*
* <br>
*
* <b><font face=times roman>Backreferences</font></b>
*
* <br>
*
*    \1              Backreference to 1st parenthesized subexpression
*    \2              Backreference to 2nd parenthesized subexpression
*    \3              Backreference to 3rd parenthesized subexpression
*    \4              Backreference to 4th parenthesized subexpression
*    \5              Backreference to 5th parenthesized subexpression
*    \6              Backreference to 6th parenthesized subexpression
*    \7              Backreference to 7th parenthesized subexpression
*    \8              Backreference to 8th parenthesized subexpression
*    \9              Backreference to 9th parenthesized subexpression
*
* <br>
*
* </pre>
*
* <p>
*
* All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they
* match as many elements of the string as possible without causing the overall
* match to fail. If you want a closure to be reluctant (non-greedy), you can
* simply follow it with a '?'. A reluctant closure will match as few elements
* of the string as possible when finding matches. {m,n} closures don't currently
* support reluctancy.
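*
* The difference between a greedy and a reluctant closure can be sketched as
* follows. This is an illustration only: it uses the JDK's java.util.regex
* instead of RE (an assumption made so it runs on its own), and the class
* name ClosureSketch is hypothetical. The same pattern text differs only in
* the trailing '?' after the closure.
*
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; java.util.regex is used as a stand-in for RE.
final class ClosureSketch {
    // Returns the text of the first match of pattern in input, or null.
    static String firstMatch(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: ".+" takes as much as it can while still letting '>' match.
        System.out.println(firstMatch("<.+>", "<a><b>"));  // prints "<a><b>"
        // Reluctant: ".+?" stops at the first position where '>' can match.
        System.out.println(firstMatch("<.+?>", "<a><b>")); // prints "<a>"
    }
}
```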
*
* <p>
*
* RE runs programs compiled by the RECompiler class. For reasons of
* efficiency, however, the RE matcher class does not include the regular
* expression compiler itself. If you want to pre-compile one or more regular
* expressions, the 'recompile' class can be invoked from the command line to
* produce compiled output like this:
*
* <pre>
*
* // Pre-compiled regular expression "a*b"
* char[] re1Instructions =
* {
* 0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
* 0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
* 0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
* 0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
* 0x0000,
* };
*
* <br>
*
* REProgram re1 = new REProgram(re1Instructions);
*
* </pre>
*
* You can then construct a regular expression matcher (RE) object from the pre-compiled
* expression re1 and thus avoid the overhead of compiling the expression at runtime.
* If you require more dynamic regular expressions, you can construct a single RECompiler
* object and re-use it to compile each expression. Similarly, you can change the
* program run by a given matcher object at any time. However, RE and RECompiler are
* not threadsafe (for efficiency reasons, and because requiring thread safety in this
* class is deemed to be a rare requirement), so you will need to construct a separate
* compiler or matcher object for each thread (unless you do thread synchronization
* yourself).
*
* <br><p><br>
*
* <font color=red>
* <i>ISSUES:</i>
*
* <ul>
* <li>com.weusours.util.re is not currently compatible with all standard POSIX regcomp flags
* <li>com.weusours.util.re does not support POSIX equivalence classes ([=foo=] syntax) (I18N/locale issue)
* <li>com.weusours.util.re does not support nested POSIX character classes (definitely should, but not completely trivial)
* <li>com.weusours.util.re does not support POSIX character collation concepts ([.foo.] syntax) (I18N/locale issue)
* <li>Should there be different matching styles (simple, POSIX, Perl etc?)
* <li>Should RE support character iterators (for backwards RE matching!)?
* <li>Should RE support reluctant {m,n} closures (does anyone care)?
* <li>Not *all* possibilities are considered for greediness when backreferences
* are involved (as POSIX suggests should be the case). The POSIX RE
* "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
* of acdacaa where \1 is "a". This is not the case in this RE package,
* and actually Perl doesn't go to this extent either! Until someone
* actually complains about this, I'm not sure it's worth "fixing".