📄 regularexpression.java
/*
 * Copyright 1999-2002,2004,2005 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.sun.org.apache.xerces.internal.impl.xpath.regex;

import java.text.CharacterIterator;

/**
 * A regular expression matching engine using a Non-deterministic Finite Automaton (NFA).
 * This engine does not conform to the POSIX regular expression.
 *
 * <hr width="50%">
 * <h3>How to use</h3>
 *
 * <dl>
 *   <dt>A. Standard way
 *   <dd>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>);
 * if (re.matches(text)) { ... }
 * </pre>
 *
 *   <dt>B. Capturing groups
 *   <dd>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>);
 * Match match = new Match();
 * if (re.matches(text, match)) {
 *     ... // You can refer to captured text with the methods of the <code>Match</code> class.
 * }
 * </pre>
 *
 * </dl>
 *
 * <h4>Case-insensitive matching</h4>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>, "i");
 * if (re.matches(text)) { ... }
 * </pre>
 *
 * <h4>Options</h4>
 * <p>You can specify options to <a href="#RegularExpression(java.lang.String, java.lang.String)"><code>RegularExpression(</code><var>regex</var><code>, </code><var>options</var><code>)</code></a>
 *    or <a href="#setPattern(java.lang.String, java.lang.String)"><code>setPattern(</code><var>regex</var><code>, </code><var>options</var><code>)</code></a>.
 *    This <var>options</var> parameter consists of the following characters.
 * </p>
 * <dl>
 *   <dt><a name="I_OPTION"><code>"i"</code></a>
 *   <dd>This option indicates case-insensitive matching.
 *   <dt><a name="M_OPTION"><code>"m"</code></a>
 *   <dd class="REGEX"><kbd>^</kbd> and <kbd>$</kbd> consider the EOL characters within the text.
 *   <dt><a name="S_OPTION"><code>"s"</code></a>
 *   <dd class="REGEX"><kbd>.</kbd> matches any one character.
 *   <dt><a name="U_OPTION"><code>"u"</code></a>
 *   <dd class="REGEX">Redefines <Kbd>\d \D \w \W \s \S \b \B \< \></kbd> to conform to Unicode.
 *   <dt><a name="W_OPTION"><code>"w"</code></a>
 *   <dd class="REGEX">With this option, <kbd>\b \B \< \></kbd> are processed with the method of
 *      'Unicode Regular Expression Guidelines' Revision 4.
 *      When "w" and "u" are specified at the same time,
 *      <kbd>\b \B \< \></kbd> are processed for the "w" option.
 *   <dt><a name="COMMA_OPTION"><code>","</code></a>
 *   <dd>The parser treats a comma in a character class as a range separator.
 *      <kbd class="REGEX">[a,b]</kbd> matches <kbd>a</kbd> or <kbd>,</kbd> or <kbd>b</kbd> without this option.
 *      <kbd class="REGEX">[a,b]</kbd> matches <kbd>a</kbd> or <kbd>b</kbd> with this option.
 *
 *   <dt><a name="X_OPTION"><code>"X"</code></a>
 *   <dd class="REGEX">
 *      With this option, the engine conforms to <a href="http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#regexs">XML Schema: Regular Expression</a>.
 *      The <code>matches()</code> method does not do substring matching
 *      but entire string matching.
 *
 * </dl>
 *
 * <hr width="50%">
 * <h3>Syntax</h3>
 * <table border="1" bgcolor="#ddeeff">
 *   <tr>
 *     <td>
 *       <h4>Differences from the Perl 5 regular expression</h4>
 *       <ul>
 *         <li>There is a 6-digit hexadecimal character representation (<kbd>\u005cv</kbd><var>HHHHHH</var>).
 *         <li>Supports subtraction, union, and intersection operations for character classes.
 *         <li>Not supported: <kbd>\</kbd><var>ooo</var> (Octal character representations),
 *           <Kbd>\G</kbd>, <kbd>\C</kbd>, <kbd>\l</kbd><var>c</var>,
 *           <kbd>\u005c u</kbd><var>c</var>, <kbd>\L</kbd>, <kbd>\U</kbd>,
 *           <kbd>\E</kbd>, <kbd>\Q</kbd>, <kbd>\N{</kbd><var>name</var><kbd>}</kbd>,
 *           <Kbd>(?{</kbd><var>code</var><kbd>})</kbd>, <Kbd>(??{</kbd><var>code</var><kbd>})</kbd>
 *       </ul>
 *     </td>
 *   </tr>
 * </table>
 *
 * <P>Meta characters are `<KBD>. * + ? { [ ( ) | \ ^ $</KBD>'.</P>
 * <ul>
 *   <li>Character
 *     <dl>
 *       <dt class="REGEX"><kbd>.</kbd> (A period)
 *       <dd>Matches any one character except the following characters.
 *       <dd>LINE FEED (U+000A), CARRIAGE RETURN (U+000D),
 *           PARAGRAPH SEPARATOR (U+2029), LINE SEPARATOR (U+2028)
 *       <dd>This expression matches one code point in Unicode. It can match a pair of surrogates.
 *       <dd>When <a href="#S_OPTION">the "s" option</a> is specified,
 *           it matches any character including the above four characters.
 *
 *       <dt class="REGEX"><Kbd>\e \f \n \r \t</kbd>
 *       <dd>Matches ESCAPE (U+001B), FORM FEED (U+000C), LINE FEED (U+000A),
 *           CARRIAGE RETURN (U+000D), HORIZONTAL TABULATION (U+0009)
 *
 *       <dt class="REGEX"><kbd>\c</kbd><var>C</var>
 *       <dd>Matches a control character.
 *           The <var>C</var> must be one of '<kbd>@</kbd>', '<kbd>A</kbd>'-'<kbd>Z</kbd>',
 *           '<kbd>[</kbd>', '<kbd>\u005c</kbd>', '<kbd>]</kbd>', '<kbd>^</kbd>', '<kbd>_</kbd>'.
 *           It matches the control character whose character code is 0x0040 less than
 *           the character code of <var>C</var>.
 *       <dd class="REGEX">For example, <kbd>\cJ</kbd> matches a LINE FEED (U+000A),
 *           and <kbd>\c[</kbd> matches an ESCAPE (U+001B).
 *
 *       <dt class="REGEX">a non-meta character
 *       <dd>Matches the character.
 *
 *       <dt class="REGEX"><KBD>\</KBD> + a meta character
 *       <dd>Matches the meta character.
 *
 *       <dt class="REGEX"><kbd>\u005cx</kbd><var>HH</var> <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd>
 *       <dd>Matches a character whose Unicode code point is <var>HH</var> (hexadecimal).
 *           You can write just 2 digits for <kbd>\u005cx</kbd><var>HH</var>, and
 *           variable length digits for <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd>.
 *
 *       <!--
 *       <dt class="REGEX"><kbd>\u005c u</kbd><var>HHHH</var>
 *       <dd>Matches a character of which code point is <var>HHHH</var> (Hexadecimal) in Unicode.
 *       -->
 *
 *       <dt class="REGEX"><kbd>\u005cv</kbd><var>HHHHHH</var>
 *       <dd>Matches a character whose Unicode code point is <var>HHHHHH</var> (hexadecimal).
 *
 *       <dt class="REGEX"><kbd>\g</kbd>
 *       <dd>Matches a grapheme.
 *       <dd class="REGEX">It is equivalent to <kbd>(?[\p{ASSIGNED}]-[\p{M}\p{C}])?(?:\p{M}|[\x{094D}\x{09CD}\x{0A4D}\x{0ACD}\x{0B3D}\x{0BCD}\x{0C4D}\x{0CCD}\x{0D4D}\x{0E3A}\x{0F84}]\p{L}|[\x{1160}-\x{11A7}]|[\x{11A8}-\x{11FF}]|[\x{FF9E}\x{FF9F}])*</kbd>
 *
 *       <dt class="REGEX"><kbd>\X</kbd>
 *       <dd class="REGEX">Matches a combining character sequence.
 *           It is equivalent to <kbd>(?:\PM\pM*)</kbd>
 *     </dl>
 *   </li>
 *
 *   <li>Character class
 *     <dl>
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><kbd>]</kbd> (without <a href="#COMMA_OPTION">"," option</a>)
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><kbd>,</kbd><var>R<sub>2</sub></var><kbd>,</kbd><var>...</var><kbd>,</kbd><var>R<sub>n</sub></var><kbd>]</kbd> (with <a href="#COMMA_OPTION">"," option</a>)
 *       <dd>Positive character class. It matches a character in the ranges.
 *       <dd><var>R<sub>n</sub></var>:
 *       <ul>
 *         <li class="REGEX">A character (including <Kbd>\e \f \n \r \t</kbd> <kbd>\u005cx</kbd><var>HH</var> <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd> <!--kbd>\u005c u</kbd><var>HHHH</var--> <kbd>\u005cv</kbd><var>HHHHHH</var>)
 *           <p>This range matches the character.
 *         <li class="REGEX"><var>C<sub>1</sub></var><kbd>-</kbd><var>C<sub>2</sub></var>
 *           <p>This range matches a character which has a code point that is >= <var>C<sub>1</sub></var>'s code point and <= <var>C<sub>2</sub></var>'s code point.
 *         <li class="REGEX">A POSIX character class: <Kbd>[:alpha:] [:alnum:] [:ascii:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]</kbd>,
 *           and negative POSIX character classes in Perl like <kbd>[:^alpha:]</kbd>
 *           <p>...
 *         <li class="REGEX"><kbd>\d \D \s \S \w \W \p{</kbd><var>name</var><kbd>} \P{</kbd><var>name</var><kbd>}</kbd>
 *           <p>These expressions specify the same ranges as the corresponding expressions described below.
 *       </ul>
 *       <p class="REGEX">Enumerated ranges are merged (union operation).
 *          <kbd>[a-ec-z]</kbd> is equivalent to <kbd>[a-z]</kbd>
 *
 *       <dt class="REGEX"><kbd>[^</kbd><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><kbd>]</kbd> (without a <a href="#COMMA_OPTION">"," option</a>)
 *       <dt class="REGEX"><kbd>[^</kbd><var>R<sub>1</sub></var><kbd>,</kbd><var>R<sub>2</sub></var><kbd>,</kbd><var>...</var><kbd>,</kbd><var>R<sub>n</sub></var><kbd>]</kbd> (with a <a href="#COMMA_OPTION">"," option</a>)
 *       <dd>Negative character class. It matches a character not in the ranges.
 *
 *       <dt class="REGEX"><kbd>(?[</kbd><var>ranges</var><kbd>]</kbd><var>op</var><kbd>[</kbd><var>ranges</var><kbd>]</kbd><var>op</var><kbd>[</kbd><var>ranges</var><kbd>]</kbd> ... <Kbd>)</kbd>
 *           (<var>op</var> is <kbd>-</kbd> or <kbd>+</kbd> or <kbd>&</kbd>.)
 *       <dd>Subtraction or union or intersection for character classes.
 *       <dd class="REGEX">For example, <kbd>(?[A-Z]-[CF])</kbd> is equivalent to <kbd>[A-BD-EG-Z]</kbd>, and <kbd>(?[\x00-\x7f]-[K]&[\p{Lu}])</kbd> is equivalent to <kbd>[A-JL-Z]</kbd>.
 *       <dd>The result of these operations is a <u>positive character class</u>
 *           even if an expression includes any negative character classes.
 *           Take care with this in case-insensitive matching.
 *           For instance, <kbd>(?[^b])</kbd> is equivalent to <kbd>[\x00-ac-\x{10ffff}]</kbd>,
 *           which is equivalent to <kbd>[^b]</kbd> in case-sensitive matching.
 *           In case-insensitive matching, however, <kbd>(?[^b])</kbd> matches any character because
 *           it includes '<kbd>B</kbd>' and '<kbd>B</kbd>' matches '<kbd>b</kbd>',
 *           whereas <kbd>[^b]</kbd> is processed as <kbd>[^Bb]</kbd>.
 *
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub>R<sub>2</sub>...</var><kbd>-[</kbd><var>R<sub>n</sub>R<sub>n+1</sub>...</var><kbd>]]</kbd> (with an <a href="#X_OPTION">"X" option</a>)</dt>
 *       <dd>Character class subtraction for the XML Schema.
 *           You can use this syntax when you specify an <a href="#X_OPTION">"X" option</a>.
 *
 *       <dt class="REGEX"><kbd>\d</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[0-9]</kbd>.
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>\p{Nd}</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\D</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^0-9]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>\P{Nd}</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\s</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[ \f\n\r\t]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[ \f\n\r\t\p{Z}]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\S</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^ \f\n\r\t]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[^ \f\n\r\t\p{Z}]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\w</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[a-zA-Z0-9_]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[\p{Lu}\p{Ll}\p{Lo}\p{Nd}_]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\W</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^a-zA-Z0-9_]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[^\p{Lu}\p{Ll}\p{Lo}\p{Nd}_]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\p{</kbd><var>name</var><kbd>}</kbd>
 *       <dd>Matches one character in the specified General Category (the second field in <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt"><kbd>UnicodeData.txt</kbd></a>) or the specified <a href="ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt">Block</a>.
 *           The following names are available:
 *         <dl>
 *           <dt>Unicode General Categories:
 *           <dd><kbd>
 *             L, M, N, Z, C, P, S, Lu, Ll, Lt, Lm, Lo, Mn, Me, Mc, Nd, Nl, No, Zs, Zl, Zp,
 *             Cc, Cf, Cn, Co, Cs, Pd, Ps, Pe, Pc, Po, Sm, Sc, Sk, So,
 *             </kbd>
 *           <dd>(Currently the Cn category includes U+10000-U+10FFFF characters)
 *           <dt>Unicode Blocks:
 *           <dd><kbd>
 *             Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
 *             IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek,
 *             Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati,
 *             Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, Georgian,
 *             Hangul Jamo, Latin Extended Additional, Greek Extended, General Punctuation,
 *             Superscripts and Subscripts, Currency Symbols, Combining Marks for Symbols,
 *             Letterlike Symbols, Number Forms, Arrows, Mathematical Operators,
 *             Miscellaneous Technical, Control Pictures, Optical Character Recognition,
 *             Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes,
 *             Miscellaneous Symbols, Dingbats, CJK Symbols and Punctuation, Hiragana,
 *             Katakana, Bopomofo, Hangul Compatibility Jamo, Kanbun,
 *             Enclosed CJK Letters and Months, CJK Compatibility, CJK Unified Ideographs,
 *             Hangul Syllables, High Surrogates, High Private Use Surrogates, Low Surrogates,
 *             Private Use, CJK Compatibility Ideographs, Alphabetic Presentation Forms,
 *             Arabic Presentation Forms-A, Combining
 *             Half Marks, CJK Compatibility Forms,
 *             Small Form Variants, Arabic Presentation Forms-B, Specials,
 *             Halfwidth and Fullwidth Forms
 *             </kbd>
 *           <dt>Others:
 *           <dd><kbd>ALL</kbd> (Equivalent to <kbd>[\u005cu0000-\u005cv10FFFF]</kbd>)
 *           <dd><kbd>ASSIGNED</kbd> (<kbd>\p{ASSIGNED}</kbd> is equivalent to <kbd>\P{Cn}</kbd>)
 *           <dd><kbd>UNASSIGNED</kbd>
 *               (<kbd>\p{UNASSIGNED}</kbd> is equivalent to <kbd>\p{Cn}</kbd>)
 *         </dl>
 *
 *       <dt class="REGEX"><kbd>\P{</kbd><var>name</var><kbd>}</kbd>
 *       <dd>Matches one character not in the specified General Category or the specified Block.
 *     </dl>
 *   </li>
 *
 *   <li>Selection and Quantifier
 *     <dl>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>|</kbd><VAR>Y</VAR>
 *       <dd>...
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>*</KBD>
 *       <dd>Matches 0 or more <var>X</var>.
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>+</KBD>
 *       <dd>Matches 1 or more <var>X</var>.
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>?</KBD>
 *       <dd>Matches 0 or 1 <var>X</var>.
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>number</var><kbd>}</kbd>
 *       <dd>Matches <var>X</var> exactly <var>number</var> times.
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,}</kbd>
 *       <dd>...
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,</kbd><var>max</var><kbd>}</kbd>
 *       <dd>...
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>*?</kbd>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>+?</kbd>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>??</kbd>
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,}?</kbd>
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,</kbd><var>max</var><kbd>}?</kbd>
 *       <dd>Non-greedy matching.
 *     </dl>
 *   </li>
 *
 *   <li>Grouping, Capturing, and Back-reference
 *     <dl>