📄 pattern.java
 * literals that represent regular expressions to protect them from
 * interpretation by the Java bytecode compiler. The string literal
 * <tt>"\b"</tt>, for example, matches a single backspace character when
 * interpreted as a regular expression, while <tt>"\\b"</tt> matches a
 * word boundary. The string literal <tt>"\(hello\)"</tt> is illegal
 * and leads to a compile-time error; in order to match the string
 * <tt>(hello)</tt> the string literal <tt>"\\(hello\\)"</tt>
 * must be used.
 *
 * <a name="cc">
 * <h4> Character Classes </h4>
 *
 * <p> Character classes may appear within other character classes, and
 * may be composed by the union operator (implicit) and the intersection
 * operator (<tt>&&</tt>).
 * The union operator denotes a class that contains every character that is
 * in at least one of its operand classes. The intersection operator
 * denotes a class that contains every character that is in both of its
 * operand classes.
 *
 * <p> The precedence of character-class operators is as follows, from
 * highest to lowest:
 *
 * <blockquote><table border="0" cellpadding="1" cellspacing="0"
 *                 summary="Precedence of character class operators.">
 *   <tr><th>1 </th>
 *     <td>Literal escape </td>
 *     <td><tt>\x</tt></td></tr>
 *   <tr><th>2 </th>
 *     <td>Grouping</td>
 *     <td><tt>[...]</tt></td></tr>
 *   <tr><th>3 </th>
 *     <td>Range</td>
 *     <td><tt>a-z</tt></td></tr>
 *   <tr><th>4 </th>
 *     <td>Union</td>
 *     <td><tt>[a-e][i-u]</tt></td></tr>
 *   <tr><th>5 </th>
 *     <td>Intersection</td>
 *     <td><tt>[a-z&&[aeiou]]</tt></td></tr>
 * </table></blockquote>
 *
 * <p> Note that a different set of metacharacters is in effect inside
 * a character class than outside a character class. For instance, the
 * regular expression <tt>.</tt> loses its special meaning inside a
 * character class, while the expression <tt>-</tt> becomes a range
 * forming metacharacter.
 *
 * <a name="lt">
 * <h4> Line terminators </h4>
 *
 * <p> A <i>line terminator</i> is a one- or two-character sequence that marks
 * the end of a line of the input character sequence. The following are
 * recognized as line terminators:
 *
 * <ul>
 *
 *   <li> A newline (line feed) character (<tt>'\n'</tt>),
 *
 *   <li> A carriage-return character followed immediately by a newline
 *   character (<tt>"\r\n"</tt>),
 *
 *   <li> A standalone carriage-return character (<tt>'\r'</tt>),
 *
 *   <li> A next-line character (<tt>'\u0085'</tt>),
 *
 *   <li> A line-separator character (<tt>'\u2028'</tt>), or
 *
 *   <li> A paragraph-separator character (<tt>'\u2029'</tt>).
 *
 * </ul>
 * <p>If {@link #UNIX_LINES} mode is activated, then the only line terminators
 * recognized are newline characters.
 *
 * <p> The regular expression <tt>.</tt> matches any character except a line
 * terminator unless the {@link #DOTALL} flag is specified.
 *
 * <p> By default, the regular expressions <tt>^</tt> and <tt>$</tt> ignore
 * line terminators and only match at the beginning and the end, respectively,
 * of the entire input sequence. If {@link #MULTILINE} mode is activated then
 * <tt>^</tt> matches at the beginning of input and after any line terminator
 * except at the end of input. When in {@link #MULTILINE} mode <tt>$</tt>
 * matches just before a line terminator or the end of the input sequence.
 *
 * <a name="cg">
 * <h4> Groups and capturing </h4>
 *
 * <p> Capturing groups are numbered by counting their opening parentheses from
 * left to right. In the expression <tt>((A)(B(C)))</tt>, for example, there
 * are four such groups: </p>
 *
 * <blockquote><table cellpadding=1 cellspacing=0 summary="Capturing group numberings">
 *   <tr><th>1 </th>
 *     <td><tt>((A)(B(C)))</tt></td></tr>
 *   <tr><th>2 </th>
 *     <td><tt>(A)</tt></td></tr>
 *   <tr><th>3 </th>
 *     <td><tt>(B(C))</tt></td></tr>
 *   <tr><th>4 </th>
 *     <td><tt>(C)</tt></td></tr>
 * </table></blockquote>
 *
 * <p> Group zero always stands for the entire expression.
 *
 * <p> Capturing groups are so named because, during a match, each subsequence
 * of the input sequence that matches such a group is saved. The captured
 * subsequence may be used later in the expression, via a back reference, and
 * may also be retrieved from the matcher once the match operation is complete.
 *
 * <p> The captured input associated with a group is always the subsequence
 * that the group most recently matched. If a group is evaluated a second time
 * because of quantification then its previously-captured value, if any, will
 * be retained if the second evaluation fails. Matching the string
 * <tt>"aba"</tt> against the expression <tt>(a(b)?)+</tt>, for example, leaves
 * group two set to <tt>"b"</tt>. All captured input is discarded at the
 * beginning of each match.
 *
 * <p> Groups beginning with <tt>(?</tt> are pure, <i>non-capturing</i> groups
 * that do not capture text and do not count towards the group total.
 *
 *
 * <h4> Unicode support </h4>
 *
 * <p> This class follows <a
 * href="http://www.unicode.org/unicode/reports/tr18/"><i>Unicode Technical
 * Report #18: Unicode Regular Expression Guidelines</i></a>, implementing its
 * second level of support though with a slightly different concrete syntax.
 *
 * <p> Unicode escape sequences such as <tt>\u2014</tt> in Java source code
 * are processed as described in <a
 * href="http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#100850">\u00A73.3</a>
 * of the Java Language Specification. Such escape sequences are also
 * implemented directly by the regular-expression parser so that Unicode
 * escapes can be used in expressions that are read from files or from the
 * keyboard. Thus the strings <tt>"\u2014"</tt> and <tt>"\\u2014"</tt>,
 * while not equal, compile into the same pattern, which matches the character
 * with hexadecimal value <tt>0x2014</tt>.
 *
 * <a name="ubc"> <p>Unicode blocks and categories are written with the
 * <tt>\p</tt> and <tt>\P</tt> constructs as in
 * Perl.
 * <tt>\p{</tt><i>prop</i><tt>}</tt> matches if the input has the
 * property <i>prop</i>, while <tt>\P{</tt><i>prop</i><tt>}</tt> does not match if
 * the input has that property. Blocks are specified with the prefix
 * <tt>In</tt>, as in <tt>InMongolian</tt>. Categories may be specified with
 * the optional prefix <tt>Is</tt>: Both <tt>\p{L}</tt> and <tt>\p{IsL}</tt>
 * denote the category of Unicode letters. Blocks and categories can be used
 * both inside and outside of a character class.
 *
 * <p> The supported blocks and categories are those of <a
 * href="http://www.unicode.org/unicode/standard/standard.html"><i>The Unicode
 * Standard, Version 3.0</i></a>. The block names are those defined in
 * Chapter 14 and in the file <a
 * href="http://www.unicode.org/Public/3.0-Update/Blocks-3.txt">Blocks-3.txt
 * </a> of the <a
 * href="http://www.unicode.org/Public/3.0-Update/UnicodeCharacterDatabase-3.0.0.html">Unicode
 * Character Database</a> except that the spaces are removed; <tt>"Basic
 * Latin"</tt>, for example, becomes <tt>"BasicLatin"</tt>. The category names
 * are those defined in table 4-5 of the Standard (p. 88), both normative
 * and informative.
 *
 *
 * <h4> Comparison to Perl 5 </h4>
 *
 * <p> Perl constructs not supported by this class: </p>
 *
 * <ul>
 *
 *   <li><p> The conditional constructs
 *   <tt>(?(</tt><i>condition</i><tt>)</tt><i>X</i><tt>)</tt> and
 *   <tt>(?(</tt><i>condition</i><tt>)</tt><i>X</i><tt>|</tt><i>Y</i><tt>)</tt>,
 *   </p></li>
 *
 *   <li><p> The embedded code constructs <tt>(?{</tt><i>code</i><tt>})</tt>
 *   and <tt>(??{</tt><i>code</i><tt>})</tt>,</p></li>
 *
 *   <li><p> The embedded comment syntax <tt>(?#comment)</tt>, and </p></li>
 *
 *   <li><p> The preprocessing operations <tt>\l</tt>, <tt>&#92;u</tt>,
 *   <tt>\L</tt>, and <tt>\U</tt>. </p></li>
 *
 * </ul>
 *
 * <p> Constructs supported by this class but not by Perl: </p>
 *
 * <ul>
 *
 *   <li><p> Possessive quantifiers, which greedily match as much as they can
 *   and do not back off, even when doing so would allow the overall match to
 *   succeed.
 *   </p></li>
 *
 *   <li><p> Character-class union and intersection as described
 *   <a href="#cc">above</a>.</p></li>
 *
 * </ul>
 *
 * <p> Notable differences from Perl: </p>
 *
 * <ul>
 *
 *   <li><p> In Perl, <tt>\1</tt> through <tt>\9</tt> are always interpreted
 *   as back references; a backslash-escaped number greater than <tt>9</tt> is
 *   treated as a back reference if at least that many subexpressions exist,
 *   otherwise it is interpreted, if possible, as an octal escape. In this
 *   class octal escapes must always begin with a zero. In this class,
 *   <tt>\1</tt> through <tt>\9</tt> are always interpreted as back
 *   references, and a larger number is accepted as a back reference if at
 *   least that many subexpressions exist at that point in the regular
 *   expression, otherwise the parser will drop digits until the number is
 *   smaller than or equal to the existing number of groups or it is one digit.
 *   </p></li>
 *
 *   <li><p> Perl uses the <tt>g</tt> flag to request a match that resumes
 *   where the last match left off. This functionality is provided implicitly
 *   by the {@link Matcher} class: Repeated invocations of the {@link
 *   Matcher#find find} method will resume where the last match left off,
 *   unless the matcher is reset. </p></li>
 *
 *   <li><p> In Perl, embedded flags at the top level of an expression affect
 *   the whole expression. In this class, embedded flags always take effect
 *   at the point at which they appear, whether they are at the top level or
 *   within a group; in the latter case, flags are restored at the end of the
 *   group just as in Perl. </p></li>
 *
 *   <li><p> Perl is forgiving about malformed matching constructs, as in the
 *   expression <tt>*a</tt>, as well as dangling brackets, as in the
 *   expression <tt>abc]</tt>, and treats them as literals. This
 *   class also accepts dangling brackets but is strict about dangling
 *   metacharacters like +, ? and *, and will throw a
 *   {@link PatternSyntaxException} if it encounters them.
 *   </p></li>
 *
 * </ul>
 *
 *
 * <p> For a more precise description of the behavior of regular expression
 * constructs, please see <a href="http://www.oreilly.com/catalog/regex2/">
 * <i>Mastering Regular Expressions, 2nd Edition</i>, Jeffrey E. F. Friedl,
 * O'Reilly and Associates, 2002.</a>
 * </p>
 *
 * @see java.lang.String#split(String, int)
 * @see java.lang.String#split(String)
 *
 * @author      Mike McCloskey
 * @author      Mark Reinhold
 * @author      JSR-51 Expert Group
 * @version     1.97, 04/01/13
 * @since       1.4
 * @spec        JSR-51
 */
public final class Pattern implements java.io.Serializable {

    /**
     * Regular expression modifier values. Instead of being passed as
     * arguments, they can also be passed as inline modifiers.
     * For example, the following statements have the same effect.
     * <pre>
     * Pattern r1 = Pattern.compile("abc", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);
     * Pattern r2 = Pattern.compile("(?im)abc", 0);
     * </pre>
     *
     * The flags are duplicated so that the familiar Perl match flag
     * names are available.
     */

    /**
     * Enables Unix lines mode.
     *
     * <p> In this mode, only the <tt>'\n'</tt> line terminator is recognized
     * in the behavior of <tt>.</tt>, <tt>^</tt>, and <tt>$</tt>.
     *
     * <p> Unix lines mode can also be enabled via the embedded flag
     * expression <tt>(?d)</tt>.
     */
    public static final int UNIX_LINES = 0x01;

    /**
     * Enables case-insensitive matching.
     *
     * <p> By default, case-insensitive matching assumes that only characters
     * in the US-ASCII charset are being matched. Unicode-aware
     * case-insensitive matching can be enabled by specifying the {@link
     * #UNICODE_CASE} flag in conjunction with this flag.
     *
     * <p> Case-insensitive matching can also be enabled via the embedded flag
     * expression <tt>(?i)</tt>.
     *
     * <p> Specifying this flag may impose a slight performance penalty. </p>
     */
    public static final int CASE_INSENSITIVE = 0x02;

    /**
     * Permits whitespace and comments in pattern.
     *
     * <p> In this mode, whitespace is ignored, and embedded comments starting
     * with <tt>#</tt> are ignored until the end of a line.
     *
     * <p> Comments mode can also be enabled via the embedded flag
     * expression <tt>(?x)</tt>.
     */
    public static final int COMMENTS = 0x04;

    /**
     * Enables multiline mode.
     *
     * <p> In multiline mode the expressions <tt>^</tt> and <tt>$</tt> match
     * just after or just before, respectively, a line terminator or the end of
     * the input sequence. By default these expressions only match at the
     * beginning and the end of the entire input sequence.
     *
     * <p> Multiline mode can also be enabled via the embedded flag
     * expression <tt>(?m)</tt>. </p>
     */
    public static final int MULTILINE = 0x08;

    /**
     * Enables dotall mode.
     *
     * <p> In dotall mode, the expression <tt>.</tt> matches any character,
     * including a line terminator. By default this expression does not match
     * line terminators.
     *
     * <p> Dotall mode can also be enabled via the embedded flag
     * expression <tt>(?s)</tt>. (The <tt>s</tt> is a mnemonic for
     * "single-line" mode, which is what this is called in Perl.) </p>
     */
    public static final int DOTALL = 0x20;
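
The flag constants and the capturing rules documented in this listing can be exercised with a short client-side sketch. It is not part of `Pattern.java`: the class name `PatternFlagsSketch` and its helper methods are hypothetical, chosen only for illustration. It uses the public `Pattern` and `Matcher` API to check that a flag passed as an argument behaves like its embedded form, that `DOTALL`, `MULTILINE` and `UNIX_LINES` change the behavior of `.` and `^` as described, and that group two of `(a(b)?)+` keeps its earlier capture on `"aba"`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class PatternFlagsSketch {

    // Compiles the same expression with flag arguments and with the embedded
    // (?im) form, and reports whether both find a match in the input.
    static boolean flagFormsAgree(String input) {
        Pattern byArgument =
            Pattern.compile("^abc$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Pattern byEmbedded = Pattern.compile("(?im)^abc$");
        boolean a = byArgument.matcher(input).find();
        boolean b = byEmbedded.matcher(input).find();
        return a && b;
    }

    // Tests whether "." matches a newline under the given flags.
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    // Counts the positions at which "^" matches under the given flags.
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Matches "aba" against (a(b)?)+ and returns what group two holds afterwards.
    static String groupTwoOfAba() {
        Matcher m = Pattern.compile("(a(b)?)+").matcher("aba");
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(flagFormsAgree("x\nABC\ny"));
        System.out.println(dotMatchesNewline(0));
        System.out.println(dotMatchesNewline(Pattern.DOTALL));
        System.out.println(countLineStarts("one\ntwo\n", Pattern.MULTILINE));
        System.out.println(groupTwoOfAba());
    }
}
```

The helpers return values instead of printing so that each documented rule can be checked in isolation. For instance, `countLineStarts("one\rtwo", Pattern.MULTILINE | Pattern.UNIX_LINES)` treats the lone carriage return as an ordinary character, as the `UNIX_LINES` description states.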