📄 regularexpression.java

/*
 * Copyright 1999-2002,2004,2005 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.sun.org.apache.xerces.internal.impl.xpath.regex;

import java.text.CharacterIterator;

/**
 * A regular expression matching engine using a Non-deterministic Finite Automaton (NFA).
 * This engine does not conform to POSIX regular expressions.
 *
 * <hr width="50%">
 * <h3>How to use</h3>
 *
 * <dl>
 *   <dt>A. Standard way
 *   <dd>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>);
 * if (re.matches(text)) { ... }
 * </pre>
 *
 *   <dt>B. Capturing groups
 *   <dd>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>);
 * Match match = new Match();
 * if (re.matches(text, match)) {
 *     ... // You can refer to the captured text with methods of the <code>Match</code> class.
 * }
 * </pre>
 *
 * </dl>
 *
 * <h4>Case-insensitive matching</h4>
 * <pre>
 * RegularExpression re = new RegularExpression(<var>regex</var>, "i");
 * if (re.matches(text)) { ... }
 * </pre>
 *
 * <h4>Options</h4>
 * <p>You can specify options to <a href="#RegularExpression(java.lang.String, java.lang.String)"><code>RegularExpression(</code><var>regex</var><code>, </code><var>options</var><code>)</code></a>
 *    or <a href="#setPattern(java.lang.String, java.lang.String)"><code>setPattern(</code><var>regex</var><code>, </code><var>options</var><code>)</code></a>.
 *    This <var>options</var> parameter consists of the following characters.
 * </p>
 * <dl>
 *   <dt><a name="I_OPTION"><code>"i"</code></a>
 *   <dd>This option indicates case-insensitive matching.
 *   <dt><a name="M_OPTION"><code>"m"</code></a>
 *   <dd class="REGEX"><kbd>^</kbd> and <kbd>$</kbd> consider the EOL characters within the text.
 *   <dt><a name="S_OPTION"><code>"s"</code></a>
 *   <dd class="REGEX"><kbd>.</kbd> matches any one character, including line terminators.
 *   <dt><a name="U_OPTION"><code>"u"</code></a>
 *   <dd class="REGEX">Redefines <Kbd>\d \D \w \W \s \S \b \B \&lt; \></kbd> according to Unicode.
 *   <dt><a name="W_OPTION"><code>"w"</code></a>
 *   <dd class="REGEX">With this option, <kbd>\b \B \&lt; \></kbd> are processed using the method of
 *      'Unicode Regular Expression Guidelines' Revision 4.
 *      When "w" and "u" are specified at the same time,
 *      <kbd>\b \B \&lt; \></kbd> are processed as for the "w" option.
 *   <dt><a name="COMMA_OPTION"><code>","</code></a>
 *   <dd>The parser treats a comma in a character class as a range separator.
 *      <kbd class="REGEX">[a,b]</kbd> matches <kbd>a</kbd> or <kbd>,</kbd> or <kbd>b</kbd> without this option.
 *      <kbd class="REGEX">[a,b]</kbd> matches <kbd>a</kbd> or <kbd>b</kbd> with this option.
 *
 *   <dt><a name="X_OPTION"><code>"X"</code></a>
 *   <dd class="REGEX">
 *       With this option, the engine conforms to <a href="http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#regexs">XML Schema: Regular Expression</a>.
 *       The <code>matches()</code> method does not do substring matching
 *       but entire string matching.
 *
 * </dl>
 *
 * <hr width="50%">
 * <h3>Syntax</h3>
 * <table border="1" bgcolor="#ddeeff">
 *   <tr>
 *    <td>
 *     <h4>Differences from the Perl 5 regular expression</h4>
 *     <ul>
 *      <li>There is a 6-digit hexadecimal character representation (<kbd>\u005cv</kbd><var>HHHHHH</var>.)
 *      <li>Supports subtraction, union, and intersection operations for character classes.
 *      <li>Not supported: <kbd>\</kbd><var>ooo</var> (Octal character representations),
 *          <Kbd>\G</kbd>, <kbd>\C</kbd>, <kbd>\l</kbd><var>c</var>,
 *          <kbd>\u005c u</kbd><var>c</var>, <kbd>\L</kbd>, <kbd>\U</kbd>,
 *          <kbd>\E</kbd>, <kbd>\Q</kbd>, <kbd>\N{</kbd><var>name</var><kbd>}</kbd>,
 *          <kbd>(?{</kbd><var>code</var><kbd>})</kbd>, <kbd>(??{</kbd><var>code</var><kbd>})</kbd>
 *     </ul>
 *    </td>
 *   </tr>
 * </table>
 *
 * <P>Meta characters are `<KBD>. * + ? { [ ( ) | \ ^ $</KBD>'.</P>
 * <ul>
 *   <li>Character
 *     <dl>
 *       <dt class="REGEX"><kbd>.</kbd> (A period)
 *       <dd>Matches any one character except the following characters.
 *       <dd>LINE FEED (U+000A), CARRIAGE RETURN (U+000D),
 *           PARAGRAPH SEPARATOR (U+2029), LINE SEPARATOR (U+2028)
 *       <dd>This expression matches one code point in Unicode. It can match a pair of surrogates.
 *       <dd>When <a href="#S_OPTION">the "s" option</a> is specified,
 *           it matches any character including the above four characters.
 *
 *       <dt class="REGEX"><Kbd>\e \f \n \r \t</kbd>
 *       <dd>Matches ESCAPE (U+001B), FORM FEED (U+000C), LINE FEED (U+000A),
 *           CARRIAGE RETURN (U+000D), HORIZONTAL TABULATION (U+0009)
 *
 *       <dt class="REGEX"><kbd>\c</kbd><var>C</var>
 *       <dd>Matches a control character.
 *           The <var>C</var> must be one of '<kbd>@</kbd>', '<kbd>A</kbd>'-'<kbd>Z</kbd>',
 *           '<kbd>[</kbd>', '<kbd>\u005c</kbd>', '<kbd>]</kbd>', '<kbd>^</kbd>', '<kbd>_</kbd>'.
 *           It matches the control character whose character code is 0x0040 less than
 *           the character code of <var>C</var>.
 *       <dd class="REGEX">For example, a <kbd>\cJ</kbd> matches a LINE FEED (U+000A),
 *           and a <kbd>\c[</kbd> matches an ESCAPE (U+001B).
 *
 *       <dt class="REGEX">a non-meta character
 *       <dd>Matches the character.
 *
 *       <dt class="REGEX"><KBD>\</KBD> + a meta character
 *       <dd>Matches the meta character.
 *
 *       <dt class="REGEX"><kbd>\u005cx</kbd><var>HH</var> <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd>
 *       <dd>Matches the character whose Unicode code point is <var>HH</var> (hexadecimal).
 *           You must write exactly 2 digits for <kbd>\u005cx</kbd><var>HH</var>, and
 *           may write a variable number of digits for <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd>.
 *
 *       <!--
 *       <dt class="REGEX"><kbd>\u005c u</kbd><var>HHHH</var>
 *       <dd>Matches a character of which code point is <var>HHHH</var> (Hexadecimal) in Unicode.
 *       -->
 *
 *       <dt class="REGEX"><kbd>\u005cv</kbd><var>HHHHHH</var>
 *       <dd>Matches the character whose Unicode code point is <var>HHHHHH</var> (hexadecimal).
 *
 *       <dt class="REGEX"><kbd>\g</kbd>
 *       <dd>Matches a grapheme.
 *       <dd class="REGEX">It is equivalent to <kbd>(?[\p{ASSIGNED}]-[\p{M}\p{C}])?(?:\p{M}|[\x{094D}\x{09CD}\x{0A4D}\x{0ACD}\x{0B3D}\x{0BCD}\x{0C4D}\x{0CCD}\x{0D4D}\x{0E3A}\x{0F84}]\p{L}|[\x{1160}-\x{11A7}]|[\x{11A8}-\x{11FF}]|[\x{FF9E}\x{FF9F}])*</kbd>
 *
 *       <dt class="REGEX"><kbd>\X</kbd>
 *       <dd class="REGEX">Matches a combining character sequence.
 *       It is equivalent to <kbd>(?:\PM\pM*)</kbd>
 *     </dl>
 *   </li>
 *
 *   <li>Character class
 *     <dl>
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><kbd>]</kbd> (without <a href="#COMMA_OPTION">"," option</a>)
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><kbd>,</kbd><var>R<sub>2</sub></var><kbd>,</kbd><var>...</var><kbd>,</kbd><var>R<sub>n</sub></var><kbd>]</kbd> (with <a href="#COMMA_OPTION">"," option</a>)
 *       <dd>Positive character class.  It matches a character in any of the ranges.
 *       <dd><var>R<sub>n</sub></var>:
 *       <ul>
 *         <li class="REGEX">A character (including <Kbd>\e \f \n \r \t</kbd> <kbd>\u005cx</kbd><var>HH</var> <kbd>\u005cx{</kbd><var>HHHH</var><kbd>}</kbd> <!--kbd>\u005c u</kbd><var>HHHH</var--> <kbd>\u005cv</kbd><var>HHHHHH</var>)
 *             <p>This range matches the character.
 *         <li class="REGEX"><var>C<sub>1</sub></var><kbd>-</kbd><var>C<sub>2</sub></var>
 *             <p>This range matches a character which has a code point that is >= <var>C<sub>1</sub></var>'s code point and &lt;= <var>C<sub>2</sub></var>'s code point.
 *         <li class="REGEX">A POSIX character class: <Kbd>[:alpha:] [:alnum:] [:ascii:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]</kbd>,
 *             and negative POSIX character classes in Perl like <kbd>[:^alpha:]</kbd>
 *             <p>...
 *         <li class="REGEX"><kbd>\d \D \s \S \w \W \p{</kbd><var>name</var><kbd>} \P{</kbd><var>name</var><kbd>}</kbd>
 *             <p>These expressions specify the same ranges as the corresponding expressions defined below.
 *       </ul>
 *       <p class="REGEX">Enumerated ranges are merged (union operation).
 *          <kbd>[a-ec-z]</kbd> is equivalent to <kbd>[a-z]</kbd>
 *
 *       <dt class="REGEX"><kbd>[^</kbd><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><kbd>]</kbd> (without a <a href="#COMMA_OPTION">"," option</a>)
 *       <dt class="REGEX"><kbd>[^</kbd><var>R<sub>1</sub></var><kbd>,</kbd><var>R<sub>2</sub></var><kbd>,</kbd><var>...</var><kbd>,</kbd><var>R<sub>n</sub></var><kbd>]</kbd> (with a <a href="#COMMA_OPTION">"," option</a>)
 *       <dd>Negative character class.  It matches a character not in any of the ranges.
 *
 *       <dt class="REGEX"><kbd>(?[</kbd><var>ranges</var><kbd>]</kbd><var>op</var><kbd>[</kbd><var>ranges</var><kbd>]</kbd><var>op</var><kbd>[</kbd><var>ranges</var><kbd>]</kbd> ... <Kbd>)</kbd>
 *       (<var>op</var> is <kbd>-</kbd> or <kbd>+</kbd> or <kbd>&</kbd>.)
 *       <dd>Subtraction or union or intersection for character classes.
 *       <dd class="REGEX">For example, <kbd>(?[A-Z]-[CF])</kbd> is equivalent to <kbd>[A-BD-EG-Z]</kbd>, and <kbd>(?[\x00-\x7f]-[K]&[\p{Lu}])</kbd> is equivalent to <kbd>[A-JL-Z]</kbd>.
 *       <dd>The result of these operations is a <u>positive character class</u>
 *           even if an expression includes any negative character classes.
 *           You have to take care with this in case-insensitive matching.
 *           For instance, <kbd>(?[^b])</kbd> is equivalent to <kbd>[\x00-ac-\x{10ffff}]</kbd>,
 *           which is equivalent to <kbd>[^b]</kbd> in case-sensitive matching.
 *           But, in case-insensitive matching, <kbd>(?[^b])</kbd> matches any character because
 *           it includes '<kbd>B</kbd>' and '<kbd>B</kbd>' matches '<kbd>b</kbd>',
 *           whereas <kbd>[^b]</kbd> is processed as <kbd>[^Bb]</kbd>.
 *
 *       <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub>R<sub>2</sub>...</var><kbd>-[</kbd><var>R<sub>n</sub>R<sub>n+1</sub>...</var><kbd>]]</kbd> (with an <a href="#X_OPTION">"X" option</a>)</dt>
 *       <dd>Character class subtraction for the XML Schema.
 *           You can use this syntax when you specify an <a href="#X_OPTION">"X" option</a>.
 *
 *       <dt class="REGEX"><kbd>\d</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[0-9]</kbd>.
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>\p{Nd}</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\D</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^0-9]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>\P{Nd}</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\s</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[ \f\n\r\t]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[ \f\n\r\t\p{Z}]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\S</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^ \f\n\r\t]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[^ \f\n\r\t\p{Z}]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\w</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[a-zA-Z0-9_]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[\p{Lu}\p{Ll}\p{Lo}\p{Nd}_]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\W</kbd>
 *       <dd class="REGEX">Equivalent to <kbd>[^a-zA-Z0-9_]</kbd>
 *       <dd>When <a href="#U_OPTION">a "u" option</a> is set, it is equivalent to
 *           <span class="REGEX"><kbd>[^\p{Lu}\p{Ll}\p{Lo}\p{Nd}_]</kbd></span>.
 *
 *       <dt class="REGEX"><kbd>\p{</kbd><var>name</var><kbd>}</kbd>
 *       <dd>Matches one character in the specified General Category (the second field in <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt"><kbd>UnicodeData.txt</kbd></a>) or the specified <a href="ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt">Block</a>.
 *       The following names are available:
 *       <dl>
 *         <dt>Unicode General Categories:
 *         <dd><kbd>
 *       L, M, N, Z, C, P, S, Lu, Ll, Lt, Lm, Lo, Mn, Me, Mc, Nd, Nl, No, Zs, Zl, Zp,
 *       Cc, Cf, Cn, Co, Cs, Pd, Ps, Pe, Pc, Po, Sm, Sc, Sk, So
 *         </kbd>
 *         <dd>(Currently the Cn category includes U+10000-U+10FFFF characters)
 *         <dt>Unicode Blocks:
 *         <dd><kbd>
 *       Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
 *       IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek,
 *       Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati,
 *       Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, Georgian,
 *       Hangul Jamo, Latin Extended Additional, Greek Extended, General Punctuation,
 *       Superscripts and Subscripts, Currency Symbols, Combining Marks for Symbols,
 *       Letterlike Symbols, Number Forms, Arrows, Mathematical Operators,
 *       Miscellaneous Technical, Control Pictures, Optical Character Recognition,
 *       Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes,
 *       Miscellaneous Symbols, Dingbats, CJK Symbols and Punctuation, Hiragana,
 *       Katakana, Bopomofo, Hangul Compatibility Jamo, Kanbun,
 *       Enclosed CJK Letters and Months, CJK Compatibility, CJK Unified Ideographs,
 *       Hangul Syllables, High Surrogates, High Private Use Surrogates, Low Surrogates,
 *       Private Use, CJK Compatibility Ideographs, Alphabetic Presentation Forms,
 *       Arabic Presentation Forms-A, Combining Half Marks, CJK Compatibility Forms,
 *       Small Form Variants, Arabic Presentation Forms-B, Specials,
 *       Halfwidth and Fullwidth Forms
 *         </kbd>
 *         <dt>Others:
 *         <dd><kbd>ALL</kbd> (Equivalent to <kbd>[\u005cu0000-\u005cv10FFFF]</kbd>)
 *         <dd><kbd>ASSIGNED</kbd> (<kbd>\p{ASSIGNED}</kbd> is equivalent to <kbd>\P{Cn}</kbd>)
 *         <dd><kbd>UNASSIGNED</kbd>
 *             (<kbd>\p{UNASSIGNED}</kbd> is equivalent to <kbd>\p{Cn}</kbd>)
 *       </dl>
 *
 *       <dt class="REGEX"><kbd>\P{</kbd><var>name</var><kbd>}</kbd>
 *       <dd>Matches one character not in the specified General Category or the specified Block.
 *     </dl>
 *   </li>
 *
 *   <li>Selection and Quantifier
 *     <dl>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>|</kbd><VAR>Y</VAR>
 *       <dd>...
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>*</KBD>
 *       <dd>Matches 0 or more <var>X</var>.
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>+</KBD>
 *       <dd>Matches 1 or more <var>X</var>.
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>?</KBD>
 *       <dd>Matches 0 or 1 <var>X</var>.
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>number</var><kbd>}</kbd>
 *       <dd>Matches <var>X</var> exactly <var>number</var> times.
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,}</kbd>
 *       <dd>...
 *
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,</kbd><var>max</var><kbd>}</kbd>
 *       <dd>...
 *
 *       <dt class="REGEX"><VAR>X</VAR><kbd>*?</kbd>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>+?</kbd>
 *       <dt class="REGEX"><VAR>X</VAR><kbd>??</kbd>
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,}?</kbd>
 *       <dt class="REGEX"><var>X</var><kbd>{</kbd><var>min</var><kbd>,</kbd><var>max</var><kbd>}?</kbd>
 *       <dd>Non-greedy matching.
 *     </dl>
 *   </li>
 *
 *   <li>Grouping, Capturing, and Back-reference
 *     <dl>