📄 regularexpression.java
 *       <dt class="REGEX"><KBD>(?:</kbd><VAR>X</VAR><kbd>)</KBD>
 *       <dd>Grouping. "<KBD>foo+</KBD>" matches "<KBD>foo</KBD>" or "<KBD>foooo</KBD>".
 *         If you want it to match "<KBD>foofoo</KBD>" or "<KBD>foofoofoo</KBD>",
 *         you have to write "<KBD>(?:foo)+</KBD>".
 *
 *       <dt class="REGEX"><KBD>(</kbd><VAR>X</VAR><kbd>)</KBD>
 *       <dd>Grouping with capturing.
 *         It makes a group, and applications can find out
 *         where in the target text a group matched by using the methods of a <code>Match</code> instance
 *         after <code><a href="#matches(java.lang.String, com.sun.org.apache.xerces.internal.utils.regex.Match)">matches(String,Match)</a></code>.
 *         The 0th group is the whole regular expression.
 *         The <VAR>N</VAR>th group is the inside of the <VAR>N</VAR>th left parenthesis.
 *
 *         <p>For instance, when the regular expression is
 *         "<FONT color=blue><KBD> *([^&lt;:]*) +&lt;([^>]*)> *</KBD></FONT>"
 *         and the target text is
 *         "<FONT color=red><KBD>From: TAMURA Kent &lt;kent@trl.ibm.co.jp></KBD></FONT>":
 *         <ul>
 *           <li><code>Match.getCapturedText(0)</code>:
 *           "<FONT color=red><KBD> TAMURA Kent &lt;kent@trl.ibm.co.jp></KBD></FONT>"
 *           <li><code>Match.getCapturedText(1)</code>: "<FONT color=red><KBD>TAMURA Kent</KBD></FONT>"
 *           <li><code>Match.getCapturedText(2)</code>: "<FONT color=red><KBD>kent@trl.ibm.co.jp</KBD></FONT>"
 *         </ul>
 *
 *       <dt class="REGEX"><kbd>\1 \2 \3 \4 \5 \6 \7 \8 \9</kbd>
 *       <dd>
 *
 *       <dt class="REGEX"><kbd>(?></kbd><var>X</var><kbd>)</kbd>
 *       <dd>Independent expression group. ................
 *
 *       <dt class="REGEX"><kbd>(?</kbd><var>options</var><kbd>:</kbd><var>X</var><kbd>)</kbd>
 *       <dt class="REGEX"><kbd>(?</kbd><var>options</var><kbd>-</kbd><var>options2</var><kbd>:</kbd><var>X</var><kbd>)</kbd>
 *       <dd>............................
 *       <dd>The <var>options</var> or the <var>options2</var> consists of 'i' 'm' 's' 'w'.
 *         Note that it cannot contain 'u'.
 *
 *       <dt class="REGEX"><kbd>(?</kbd><var>options</var><kbd>)</kbd>
 *       <dt class="REGEX"><kbd>(?</kbd><var>options</var><kbd>-</kbd><var>options2</var><kbd>)</kbd>
 *       <dd>......
 *       <dd>These expressions must be at the beginning of a group.
 *     </dl>
 *   </li>
 *
 *   <li>Anchor
 *     <dl>
 *       <dt class="REGEX"><kbd>\A</kbd>
 *       <dd>Matches the beginning of the text.
 *
 *       <dt class="REGEX"><kbd>\Z</kbd>
 *       <dd>Matches the end of the text, or before an EOL character at the end of the text,
 *         or CARRIAGE RETURN + LINE FEED at the end of the text.
 *
 *       <dt class="REGEX"><kbd>\z</kbd>
 *       <dd>Matches the end of the text.
 *
 *       <dt class="REGEX"><kbd>^</kbd>
 *       <dd>Matches the beginning of the text. It is equivalent to <span class="REGEX"><Kbd>\A</kbd></span>.
 *       <dd>When <a href="#M_OPTION">the "m" option</a> is set,
 *         it matches the beginning of the text, or after one of the EOL characters
 *         (LINE FEED (U+000A), CARRIAGE RETURN (U+000D), LINE SEPARATOR (U+2028),
 *         PARAGRAPH SEPARATOR (U+2029)).
 *
 *       <dt class="REGEX"><kbd>$</kbd>
 *       <dd>Matches the end of the text, or before an EOL character at the end of the text,
 *         or CARRIAGE RETURN + LINE FEED at the end of the text.
 *       <dd>When <a href="#M_OPTION">the "m" option</a> is set,
 *         it matches the end of the text, or before an EOL character.
 *
 *       <dt class="REGEX"><kbd>\b</kbd>
 *       <dd>Matches a word boundary.
 *         (See <a href="#W_OPTION">the "w" option</a>)
 *
 *       <dt class="REGEX"><kbd>\B</kbd>
 *       <dd>Matches a non-word boundary.
 *         (See <a href="#W_OPTION">the "w" option</a>)
 *
 *       <dt class="REGEX"><kbd>\&lt;</kbd>
 *       <dd>Matches the beginning of a word.
 *         (See <a href="#W_OPTION">the "w" option</a>)
 *
 *       <dt class="REGEX"><kbd>\></kbd>
 *       <dd>Matches the end of a word.
 *         (See <a href="#W_OPTION">the "w" option</a>)
 *     </dl>
 *   </li>
 *   <li>Lookahead and lookbehind
 *     <dl>
 *       <dt class="REGEX"><kbd>(?=</kbd><var>X</var><kbd>)</kbd>
 *       <dd>Lookahead.
 *
 *       <dt class="REGEX"><kbd>(?!</kbd><var>X</var><kbd>)</kbd>
 *       <dd>Negative lookahead.
 *
 *       <dt class="REGEX"><kbd>(?&lt;=</kbd><var>X</var><kbd>)</kbd>
 *       <dd>Lookbehind.
 *       <dd>(Note for text capturing......)
 *
 *       <dt class="REGEX"><kbd>(?&lt;!</kbd><var>X</var><kbd>)</kbd>
 *       <dd>Negative lookbehind.
 *     </dl>
 *   </li>
 *
 *   <li>Misc.
 *     <dl>
 *       <dt class="REGEX"><kbd>(?(</Kbd><var>condition</var><Kbd>)</kbd><var>yes-pattern</var><kbd>|</kbd><var>no-pattern</var><kbd>)</kbd>,
 *       <dt class="REGEX"><kbd>(?(</kbd><var>condition</var><kbd>)</kbd><var>yes-pattern</var><kbd>)</kbd>
 *       <dd>......
 *       <dt class="REGEX"><kbd>(?#</kbd><var>comment</var><kbd>)</kbd>
 *       <dd>Comment. A comment string consists of characters except '<kbd>)</kbd>'.
 *         You cannot write comments in character classes or before quantifiers.
 *     </dl>
 *   </li>
 * </ul>
 *
 *
 * <hr width="50%">
 * <h3>BNF for the regular expression</h3>
 * <pre>
 * regex ::= ('(?' options ')')? term ('|' term)*
 * term ::= factor+
 * factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
 *            | '(?#' [^)]* ')'
 * minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
 * atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
 *          | '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
 *          | '(?>' regex ')' | '(?' options ':' regex ')'
 *          | '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
 * options ::= [imsw]* ('-' [imsw]+)?
 * anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\&lt;' | '\>'
 * looks ::= '(?=' regex ')' | '(?!' regex ')'
 *           | '(?&lt;=' regex ')' | '(?&lt;!' regex ')'
 * char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
 * category-block ::= '\' [pP] category-symbol-1
 *                    | ('\p{' | '\P{') (category-symbol | block-name
 *                                       | other-properties) '}'
 * category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
 * category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
 *                     | 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
 *                     | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
 *                     | 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
 *                     | 'Sm' | 'Sc' | 'Sk' | 'So'
 * block-name ::= (See above)
 * other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
 * character-1 ::= (any character except meta-characters)
 *
 * char-class ::= '[' ranges ']'
 *                | '(?[' ranges ']' ([-+&amp;] '[' ranges ']')? ')'
 * ranges ::= '^'? (range <a href="#COMMA_OPTION">','?</a>)+
 * range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
 *           | range-char | range-char '-' range-char
 * range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
 * code-point ::= '\x' hex-char hex-char
 *                | '\x{' hex-char+ '}'
 * <!--               | '\u005c u' hex-char hex-char hex-char hex-char
 * -->               | '\v' hex-char hex-char hex-char hex-char hex-char hex-char
 * hex-char ::= [0-9a-fA-F]
 * character-2 ::= (any character except \[]-,)
 * </pre>
 *
 * <hr width="50%">
 * <h3>TODO</h3>
 * <ul>
 *   <li><a href="http://www.unicode.org/unicode/reports/tr18/">Unicode Regular Expression Guidelines</a>
 *     <ul>
 *       <li>2.4 Canonical Equivalents
 *       <li>Level 3
 *     </ul>
 *   <li>Parsing performance
 * </ul>
 *
 * <hr width="50%">
 *
 * @xerces.internal
 *
 * @author TAMURA Kent &lt;kent@trl.ibm.co.jp&gt;
 * @version $Id: RegularExpression.java,v 1.2.6.1 2005/09/06 11:46:34 neerajbj Exp $
 */
public class RegularExpression implements java.io.Serializable {

    private static final long serialVersionUID = 3905241217112815923L;

    static final boolean DEBUG = false;

    /**
     * Compiles a token tree into an operation flow.
     */
    private synchronized void compile(Token tok) {
        if (this.operations != null)
            return;
        this.numberOfClosures = 0;
        this.operations = this.compile(tok, null, false);
    }

    /**
     * Converts a token to an operation.
     */
    private Op compile(Token tok, Op next, boolean reverse) {
        Op ret;
        switch (tok.type) {
        case Token.DOT:
            ret = Op.createDot();
            ret.next = next;
            break;

        case Token.CHAR:
            ret = Op.createChar(tok.getChar());
            ret.next = next;
            break;

        case Token.ANCHOR:
            ret = Op.createAnchor(tok.getChar());
            ret.next = next;
            break;

        case Token.RANGE:
        case Token.NRANGE:
            ret = Op.createRange(tok);
            ret.next = next;
            break;

        case Token.CONCAT:
            ret = next;
            if (!reverse) {
                for (int i = tok.size() - 1; i >= 0; i--) {
                    ret = compile(tok.getChild(i), ret, false);
                }
            } else {
                for (int i = 0; i < tok.size(); i++) {
                    ret = compile(tok.getChild(i), ret, true);
                }
            }
            break;

        case Token.UNION:
            Op.UnionOp uni = Op.createUnion(tok.size());
            for (int i = 0; i < tok.size(); i++) {
                uni.addElement(compile(tok.getChild(i), next, reverse));
            }
            ret = uni;                          // ret.next is null.
            break;

        case Token.CLOSURE:
        case Token.NONGREEDYCLOSURE:
            Token child = tok.getChild(0);
            int min = tok.getMin();
            int max = tok.getMax();
            if (min >= 0 && min == max) { // {n}
                ret = next;
                for (int i = 0; i < min; i++) {
                    ret = compile(child, ret, reverse);
                }
                break;
            }
            if (min > 0 && max > 0)
                max -= min;
            if (max > 0) {
                // X{2,6} -> XX(X(X(XX?)?)?)?
                ret = next;
                for (int i = 0; i < max; i++) {
                    Op.ChildOp q = Op.createQuestion(tok.type == Token.NONGREEDYCLOSURE);
                    q.next = next;
                    q.setChild(compile(child, ret, reverse));
                    ret = q;
                }
            } else {
                Op.ChildOp op;
                if (tok.type == Token.NONGREEDYCLOSURE) {
                    op = Op.createNonGreedyClosure();
                } else {                        // Token.CLOSURE
                    if (child.getMinLength() == 0)
                        op = Op.createClosure(this.numberOfClosures++);
                    else
                        op = Op.createClosure(-1);
                }
                op.next = next;
                op.setChild(compile(child, op, reverse));
                ret = op;
            }
            if (min > 0) {
                for (int i = 0; i < min; i++) {
                    ret = compile(child, ret, reverse);
                }
            }
            break;

        case Token.EMPTY:
            ret = next;
            break;

        case Token.STRING:
            ret = Op.createString(tok.getString());
            ret.next = next;
            break;

        case Token.BACKREFERENCE:
            ret = Op.createBackReference(tok.getReferenceNumber());
            ret.next = next;
            break;

        case Token.PAREN:
            if (tok.getParenNumber() == 0) {
                ret = compile(tok.getChild(0), next, reverse);
            } else if (reverse) {
                next = Op.createCapture(tok.getParenNumber(), next);
                next = compile(tok.getChild(0), next, reverse);
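
The `CLOSURE` branch of `compile` rewrites a bounded repetition as a run of mandatory copies followed by nested optional copies, which is what its comment `X{2,6} -> XX(X(X(XX?)?)?)?` describes. The sketch below reproduces that rewriting on pattern strings instead of `Token`/`Op` objects and compares the result against the original quantifier. It uses `java.util.regex` as a stand-in for the Xerces engine, which is an assumption made for illustration. The class `BoundedRepeatSketch` and the method `expand` are hypothetical names, and the atom is assumed to be a single self-contained unit such as one character or a parenthesized group.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of RegularExpression. */
public class BoundedRepeatSketch {

    /**
     * Rewrites atom{min,max} as min mandatory copies of the atom followed by
     * (max - min) nested optional copies, for example
     * X{2,6} -> XX(?:X(?:X(?:X(?:X)?)?)?)?
     * Assumes 0 <= min <= max and that atom is a single unit.
     */
    static String expand(String atom, int min, int max) {
        // Build the optional tail from the inside out, similar to the loop
        // that wraps each compiled child in a "question" operation.
        String tail = "";
        for (int i = 0; i < max - min; i++) {
            tail = "(?:" + atom + tail + ")?";
        }
        // Prepend the mandatory copies.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < min; i++) {
            sb.append(atom);
        }
        return sb.append(tail).toString();
    }

    public static void main(String[] args) {
        String expanded = expand("X", 2, 6);
        System.out.println(expanded);
        // Compare the rewritten pattern with the original on 0..8 repetitions.
        StringBuilder input = new StringBuilder();
        for (int n = 0; n <= 8; n++) {
            boolean original = Pattern.matches("X{2,6}", input);
            boolean rewritten = Pattern.matches(expanded, input);
            System.out.println(n + " X's: " + original + " / " + rewritten);
            input.append('X');
        }
    }
}
```

Building the tail inside out keeps each optional copy dependent on the one before it, so the rewritten pattern can never match more than `max` copies.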
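
The grouping and capturing entries in the class Javadoc can be tried out directly. The sketch below runs the Javadoc's own patterns through `java.util.regex`, which is a stand-in and not the Xerces `RegularExpression`/`Match` API documented there. It is assumed here only that the two engines agree on these basic constructs. `GroupingSketch` and `capture` are hypothetical names, and `Matcher.find()` is used because the documented group 0 starts in the middle of the target text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration using java.util.regex, not the Xerces classes. */
public class GroupingSketch {

    /**
     * Searches text with the Javadoc's example pattern and returns
     * groups 0, 1 and 2 of the first match, or null if nothing matches.
     */
    static String[] capture(String text) {
        Matcher m = Pattern.compile(" *([^<:]*) +<([^>]*)> *").matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Without a group, '+' applies only to the final 'o'.
        System.out.println("foooo".matches("foo+"));      // true
        System.out.println("foofoo".matches("foo+"));     // false
        // A non-capturing group makes '+' apply to the whole "foo".
        System.out.println("foofoo".matches("(?:foo)+")); // true

        // Group 0 is the whole match; groups 1 and 2 are the parenthesized parts.
        for (String g : capture("From: TAMURA Kent <kent@trl.ibm.co.jp>")) {
            System.out.println("[" + g + "]");
        }
    }
}
```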