package org.apache.regexp;

/*
 * ====================================================================
 * 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 1999 The Apache Software Foundation.  All rights 
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer. 
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution, if
 *    any, must include the following acknowledgment:  
 *       "This product includes software developed by the 
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "The Jakarta Project", "Jakarta-Regexp", and "Apache Software
 *    Foundation" must not be used to endorse or promote products derived
 *    from this software without prior written permission. For written 
 *    permission, please contact apache@apache.org.
 *
 * 5. Products derived from this software may not be called "Apache"
 *    nor may "Apache" appear in their names without prior written
 *    permission of the Apache Group.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 *
 */ 
 
import java.util.Vector;

/**
 * RE is an efficient, lightweight regular expression evaluator/matcher class.
 * Regular expressions are pattern descriptions which enable sophisticated matching of
 * strings.  In addition to being able to match a string against a pattern, you
 * can also extract parts of the match.  This is especially useful in text parsing!
 * Details on the syntax of regular expression patterns are given below.
 *
 * <p>
 *
 * To compile a regular expression (RE), you can simply construct an RE matcher
 * object from the string specification of the pattern, like this:
 *
 * <pre>
 *
 *     RE r = new RE("a*b");
 *
 * </pre>
 *
 * <p>
 *
 * Once you have done this, you can call either of the RE.match methods to
 * perform matching on a String.  For example:
 *
 * <pre>
 *
 *     boolean matched = r.match("aaaab");
 *
 * </pre>
 *
 * will cause the boolean matched to be set to true because the
 * pattern "a*b" matches the string "aaaab".
 *
 * <p>
 * If you were interested in the <i>number</i> of a's which matched the first
 * part of our example expression, you could change the expression to
 * "(a*)b".  Then when you compiled the expression and matched it against
 * something like "xaaaab", you would get results like this:
 *
 * <pre>
 *
 *     RE r = new RE("(a*)b");                  // Compile expression
 *     boolean matched = r.match("xaaaab");     // Match against "xaaaab"
 *
 * <br>
 *
 *     String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
 *     String insideParens = r.getParen(1);     // insideParens will be 'aaaa'
 *
 * <br>
 *
 *     int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
 *     int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
 *     int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5
 *
 * <br>
 *
 *     int startInside = r.getParenStart(1);    // startInside will be index 1
 *     int endInside = r.getParenEnd(1);        // endInside will be index 5
 *     int lenInside = r.getParenLength(1);     // lenInside will be 4
 *
 * </pre>
 *
 * You can also refer to the contents of a parenthesized expression within
 * a regular expression itself.  This is called a 'backreference'.  The first
 * backreference in a regular expression is denoted by \1, the second by \2
 * and so on.  So the expression:
 *
 * <pre>
 *
 *     ([0-9]+)=\1
 *
 * </pre>
 *
 * will match any string of the form n=n (like 0=0 or 2=2).
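 *
 * <p>
 *
 * As a minimal sketch (the variable names below are illustrative only), remember
 * that when such a pattern is written as a Java string literal, the backslash
 * must itself be escaped:
 *
 * <pre>
 *
 *     RE r = new RE("([0-9]+)=\\1");           // Pattern text is ([0-9]+)=\1
 *     boolean same = r.match("42=42");         // true: \1 repeats the captured "42"
 *     boolean differ = r.match("1=2");         // false: "2" is not the captured "1"
 *
 * </pre>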
 *
 * <p>
 *
 * The full regular expression syntax accepted by RE is described here:
 *
 * <pre>
 *
 * <br>
 *
 *  <b><font face=times roman>Characters</font></b>
 *
 * <br>
 *
 *    <i>unicodeChar</i>          Matches any identical unicode character
 *    \                    Used to quote a meta-character (like '*')
 *    \\                   Matches a single '\' character
 *    \0nnn                Matches a given octal character
 *    \xhh                 Matches a given 8-bit hexadecimal character
 *    \\uhhhh               Matches a given 16-bit hexadecimal character
 *    \t                   Matches an ASCII tab character
 *    \n                   Matches an ASCII newline character
 *    \r                   Matches an ASCII return character
 *    \f                   Matches an ASCII form feed character
 *
 * <br>
 *
 *  <b><font face=times roman>Character Classes</font></b>
 *
 * <br>
 *
 *    [abc]                Simple character class
 *    [a-zA-Z]             Character class with ranges
 *    [^abc]               Negated character class
 *
 * <br>
 *
 *  <b><font face=times roman>Standard POSIX Character Classes</font></b>
 *
 * <br>
 *
 *    [:alnum:]            Alphanumeric characters. 
 *    [:alpha:]            Alphabetic characters. 
 *    [:blank:]            Space and tab characters. 
 *    [:cntrl:]            Control characters. 
 *    [:digit:]            Numeric characters. 
 *    [:graph:]            Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.) 
 *    [:lower:]            Lower-case alphabetic characters. 
 *    [:print:]            Printable characters (characters that are not control characters.) 
 *    [:punct:]            Punctuation characters (characters that are not letters, digits, control characters, or space characters). 
 *    [:space:]            Space characters (such as space, tab, and formfeed, to name a few). 
 *    [:upper:]            Upper-case alphabetic characters. 
 *    [:xdigit:]           Characters that are hexadecimal digits.
 *         
 * <br>
 *
 *  <b><font face=times roman>Non-standard POSIX-style Character Classes</font></b>
 *
 * <br>
 *
 *    [:javastart:]        Start of a Java identifier
 *    [:javapart:]         Part of a Java identifier
 *
 * <br>
 *         
 *  <b><font face=times roman>Predefined Classes</font></b>
 *
 * <br>
 *
 *    .                    Matches any character other than newline
 *    \w                   Matches a "word" character (alphanumeric plus "_")
 *    \W                   Matches a non-word character
 *    \s                   Matches a whitespace character
 *    \S                   Matches a non-whitespace character
 *    \d                   Matches a digit character
 *    \D                   Matches a non-digit character
 *
 * <br>
 *
 *  <b><font face=times roman>Boundary Matchers</font></b>
 *
 * <br>
 *
 *    ^                    Matches only at the beginning of a line
 *    $                    Matches only at the end of a line
 *    \b                   Matches only at a word boundary
 *    \B                   Matches only at a non-word boundary
 *
 * <br>
 *
 *  <b><font face=times roman>Greedy Closures</font></b>
 *
 * <br>
 *
 *    A*                   Matches A 0 or more times (greedy)
 *    A+                   Matches A 1 or more times (greedy)
 *    A?                   Matches A 1 or 0 times (greedy)
 *    A{n}                 Matches A exactly n times (greedy)
 *    A{n,}                Matches A at least n times (greedy)
 *    A{n,m}               Matches A at least n but not more than m times (greedy)
 *
 * <br>
 *
 *  <b><font face=times roman>Reluctant Closures</font></b>
 *
 * <br>
 *
 *    A*?                  Matches A 0 or more times (reluctant)
 *    A+?                  Matches A 1 or more times (reluctant)
 *    A??                  Matches A 0 or 1 times (reluctant)
 *
 * <br>
 *
 *  <b><font face=times roman>Logical Operators</font></b>
 *
 * <br>
 *
 *    AB                   Matches A followed by B
 *    A|B                  Matches either A or B
 *    (A)                  Used for subexpression grouping
 *
 * <br>
 *
 *  <b><font face=times roman>Backreferences</font></b>
 *
 * <br>
 *
 *    \1                   Backreference to 1st parenthesized subexpression
 *    \2                   Backreference to 2nd parenthesized subexpression
 *    \3                   Backreference to 3rd parenthesized subexpression
 *    \4                   Backreference to 4th parenthesized subexpression
 *    \5                   Backreference to 5th parenthesized subexpression
 *    \6                   Backreference to 6th parenthesized subexpression
 *    \7                   Backreference to 7th parenthesized subexpression
 *    \8                   Backreference to 8th parenthesized subexpression
 *    \9                   Backreference to 9th parenthesized subexpression
 *
 * <br>
 *
 * </pre>
 *
 * <p>
 *
 * All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they
 * match as many elements of the string as possible without causing the overall
 * match to fail.  If you want a closure to be reluctant (non-greedy), you can
 * simply follow it with a '?'.  A reluctant closure will match as few elements
 * of the string as possible when finding matches.  {m,n} closures don't currently
 * support reluctancy.
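 *
 * <p>
 *
 * As a minimal sketch of the difference (the variable names are illustrative
 * only), compare a greedy closure with its reluctant form on the same input:
 *
 * <pre>
 *
 *     RE greedy = new RE("a+");                // Greedy closure
 *     greedy.match("aaa");
 *     String most = greedy.getParen(0);        // most will be 'aaa'
 *
 * <br>
 *
 *     RE reluctant = new RE("a+?");            // Reluctant closure
 *     reluctant.match("aaa");
 *     String least = reluctant.getParen(0);    // least will be 'a'
 *
 * </pre>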
 *
 * <p>
 *
 * RE runs programs compiled by the RECompiler class.  But the RE matcher class
 * does not include the actual regular expression compiler for reasons of
 * efficiency.  In fact, if you want to pre-compile one or more regular expressions,
 * the 'recompile' class can be invoked from the command line to produce compiled
 * output like this:
 *
 * <pre>
 *
 *    // Pre-compiled regular expression "a*b"
 *    char[] re1Instructions =
 *    {
 *        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
 *        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
 *        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
 *        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
 *        0x0000,
 *    };
 *
 *    <br>
 *
 *    REProgram re1 = new REProgram(re1Instructions);
 *
 * </pre>
 *
 * You can then construct a regular expression matcher (RE) object from the pre-compiled
 * expression re1 and thus avoid the overhead of compiling the expression at runtime.
 * If you require more dynamic regular expressions, you can construct a single RECompiler
 * object and re-use it to compile each expression.  Similarly, you can change the
 * program run by a given matcher object at any time.  However, RE and RECompiler are
 * not threadsafe (for efficiency reasons, and because requiring thread safety in this
 * class is deemed to be a rare requirement), so you will need to construct a separate
 * compiler or matcher object for each thread (unless you do thread synchronization
 * yourself).
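 *
 * <p>
 *
 * A minimal sketch of this re-use, assuming the RECompiler.compile, RE(REProgram)
 * and RE.setProgram methods of this package (the variable names are illustrative
 * only).  In a multi-threaded design, each thread would hold its own compiler and
 * matcher objects:
 *
 * <pre>
 *
 *     RECompiler compiler = new RECompiler();          // One compiler, re-used
 *     REProgram digits = compiler.compile("[0-9]+");
 *     REProgram letters = compiler.compile("[a-z]+");
 *
 * <br>
 *
 *     RE matcher = new RE(digits);                     // Matcher for a compiled program
 *     boolean hasDigits = matcher.match("abc123");     // true
 *     matcher.setProgram(letters);                     // Change the program being run
 *     boolean hasLetters = matcher.match("abc123");    // true
 *
 * </pre>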
 *
 * <br><p><br>
 *
 * <font color=red>
 * <i>ISSUES:</i>
 *
 * <ul>
 *  <li>com.weusours.util.re is not currently compatible with all standard POSIX regcomp flags
 *  <li>com.weusours.util.re does not support POSIX equivalence classes ([=foo=] syntax) (I18N/locale issue)
 *  <li>com.weusours.util.re does not support nested POSIX character classes (definitely should, but not completely trivial)
 *  <li>com.weusours.util.re does not support POSIX character collation concepts ([.foo.] syntax) (I18N/locale issue)
 *  <li>Should there be different matching styles (simple, POSIX, Perl, etc.)?
 *  <li>Should RE support character iterators (for backwards RE matching!)?
 *  <li>Should RE support reluctant {m,n} closures (does anyone care)?
 *  <li>Not *all* possibilities are considered for greediness when backreferences
 *      are involved (as POSIX suggests should be the case).  The POSIX RE
 *      "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
 *      of acdacaa where \1 is "a".  This is not the case in this RE package,
 *      and actually Perl doesn't go to this extent either!  Until someone
 *      actually complains about this, I'm not sure it's worth "fixing".