📄 awkcompiler.java

📁 Java regular expressions: simple and easy to use.
💻 JAVA
📖 Page 1 of 2
package org.apache.oro.text.awk;

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2000 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation", "Jakarta-Oro"
 *    must not be used to endorse or promote products derived from this
 *    software without prior written permission. For written
 *    permission, please contact apache@apache.org.
 *
 * 5. Products derived from this software may not be called "Apache"
 *    or "Jakarta-Oro", nor may "Apache" or "Jakarta-Oro" appear in their
 *    name, without prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 *
 * Portions of this software are based upon software originally written
 * by Daniel F. Savarese. We appreciate his contributions.
 */

import org.apache.oro.text.regex.*;

/**
 * The AwkCompiler class is used to create compiled regular expressions
 * conforming to the Awk regular expression syntax.  It generates
 * AwkPattern instances upon compilation to be used in conjunction
 * with an AwkMatcher instance.  AwkMatcher finds true leftmost-longest
 * matches, so you must take care with how you formulate your regular
 * expression to avoid matching more than you really want.
 * <p>
 * The supported regular expression syntax is a superset of traditional AWK,
 * but NOT to be confused with GNU AWK or other AWK variants.  Additionally,
 * this AWK implementation is DFA-based and only supports 8-bit ASCII.
 * Consequently, these classes can perform very fast pattern matches in
 * most cases.
 * <p>
 * This is the traditional Awk syntax that is supported:
 * <ul>
 * <li> Alternatives separated by |
 * <li> Quantified atoms
 * <dl compact>
 *      <dt> *     <dd> Match 0 or more times.
 *      <dt> +     <dd> Match 1 or more times.
 *      <dt> ?     <dd> Match 0 or 1 times.
 * </dl>
 * <li> Atoms
 * <ul>
 *     <li> regular expression within parentheses
 *     <li> a . matches everything including newline
 *     <li> a ^ is a null token matching the beginning of a string
 *          but has no relation to newlines (and is only valid at the
 *          beginning of a regex; this differs from traditional awk
 *          for the sake of efficiency in Java).
 *     <li> a $ is a null token matching the end of a string but has
 *          no relation to newlines (and is only valid at the
 *          end of a regex; this differs from traditional awk for the
 *          sake of efficiency in Java).
 *     <li> Character classes (e.g., [abcd]) and ranges (e.g. [a-z])
 *     <ul>
 *         <li> Special backslashed characters work within a character class
 *     </ul>
 *     <li> Special backslashed characters
 *     <dl compact>
 *         <dt> \b <dd> backspace
 *         <dt> \n <dd> newline
 *         <dt> \r <dd> carriage return
 *         <dt> \t <dd> tab
 *         <dt> \f <dd> formfeed
 *         <dt> \xnn <dd> hexadecimal representation of character
 *         <dt> \nn or \nnn <dd> octal representation of character
 *         <dt> Any other backslashed character matches itself
 *     </dl>
 * </ul></ul>
 * <p>
 * This is the extended syntax that is supported:
 * <ul>
 * <li> Quantified atoms
 * <dl compact>
 *      <dt> {n,m} <dd> Match at least n but not more than m times.
 *      <dt> {n,}  <dd> Match at least n times.
 *      <dt> {n}   <dd> Match exactly n times.
 * </dl>
 * <li> Atoms
 * <ul>
 *     <li> Special backslashed characters
 *     <dl compact>
 *         <dt> \d <dd> digit [0-9]
 *         <dt> \D <dd> non-digit [^0-9]
 *         <dt> \w <dd> word character [0-9a-z_A-Z]
 *         <dt> \W <dd> a non-word character [^0-9a-z_A-Z]
 *         <dt> \s <dd> a whitespace character [ \t\n\r\f]
 *         <dt> \S <dd> a non-whitespace character [^ \t\n\r\f]
 *         <dt> \cD <dd> matches the corresponding control character
 *         <dt> \0 <dd> matches null character
 *     </dl>
 * </ul></ul>

 @author <a href="mailto:dfs@savarese.org">Daniel F. Savarese</a>
 @version $Id: AwkCompiler.java,v 1.2 2000/07/23 23:25:18 jon Exp $

 * @see org.apache.oro.text.regex.PatternCompiler
 * @see org.apache.oro.text.regex.MalformedPatternException
 * @see AwkPattern
 * @see AwkMatcher
 */
public final class AwkCompiler implements PatternCompiler {

  public static final int DEFAULT_MASK          = 0;
  public static final int CASE_INSENSITIVE_MASK = 0x0001;

  static final char _END_OF_INPUT = '\uFFFF';

  private boolean __inCharacterClass, __caseSensitive;
  private boolean __beginAnchor, __endAnchor;
  private char __lookahead;
  private int __position, __bytesRead, __expressionLength;
  private char[] __regularExpression;
  private int __openParen, __closeParen;

  public AwkCompiler() { }

  private static boolean __isMetachar(char token) {
    return (token == '*' || token == '?' || token == '+' ||
            token == '[' || token == ']' || token == '(' ||
            token == ')' || token == '|' || /* token == '^' ||
            token == '$' || */ token == '.');
  }

  static boolean _isWordCharacter(char token) {
    return ((token >= 'a' && token <= 'z') ||
            (token >= 'A' && token <= 'Z') ||
            (token >= '0' && token <= '9') ||
            (token == '_'));
  }

  static boolean _isLowerCase(char token){
    return (token >= 'a' && token <= 'z');
  }

  static boolean _isUpperCase(char token){
    return (token >= 'A' && token <= 'Z');
  }

  static char _toggleCase(char token){
    if(_isUpperCase(token))
      return (char)(token + 32);
    else if(_isLowerCase(token))
      return (char)(token - 32);

    return token;
  }

  private void __match(char token) throws MalformedPatternException {
    if(token == __lookahead){
      if(__bytesRead < __expressionLength)
        __lookahead = __regularExpression[__bytesRead++];
      else
        __lookahead = _END_OF_INPUT;
    }
    else
      throw new MalformedPatternException("token: " + token +
                                    " does not match lookahead: " +
                                    __lookahead + " at position: " +
                                             __bytesRead);
  }

  private void __putback() {
    if(__lookahead != _END_OF_INPUT)
      --__bytesRead;
    __lookahead = __regularExpression[__bytesRead - 1];
  }

  private SyntaxNode __regex() throws MalformedPatternException {
    SyntaxNode left;

    left = __branch();

    if(__lookahead == '|') {
      __match('|');
      return (new OrNode(left, __regex()));
    }

    return left;
  }

  private SyntaxNode __branch() throws MalformedPatternException {
    CatNode current;
    SyntaxNode left, root;

    left = __piece();

    if(__lookahead == ')'){
      if(__openParen > __closeParen)
        return left;
      else
          throw
            new MalformedPatternException("Parse error: close parenthesis"
             + " without matching open parenthesis at position " + __bytesRead);
    } else if(__lookahead == '|' || __lookahead == _END_OF_INPUT)
      return left;

    root = current = new CatNode();
    current._left = left;

    while(true) {
      left = __piece();

      if(__lookahead == ')'){
        if(__openParen > __closeParen){
          current._right = left;
          break;
        }
        else
          throw
            new MalformedPatternException("Parse error: close parenthesis"
             + " without matching open parenthesis at position " + __bytesRead);
      } else  if(__lookahead == '|' || __lookahead == _END_OF_INPUT){
        current._right = left;
        break;
      }

      current._right = new CatNode();
      current = (CatNode)current._right;
      current._left   = left;
    }

    return root;
  }

  private SyntaxNode __piece() throws MalformedPatternException {
    SyntaxNode left;

    left = __atom();

    switch(__lookahead){
    case '+' : __match('+'); return (new PlusNode(left));
    case '?' : __match('?'); return (new QuestionNode(left));
    case '*' : __match('*'); return (new StarNode(left));
    case '{' : return __repetition(left);
    }

    return left;
  }

  // if numChars is 0, this means match as many as you want
  private int __parseUnsignedInteger(int radix, int minDigits, int maxDigits)
    throws MalformedPatternException {
    int num, digits = 0;
    StringBuffer buf;

    // We don't expect huge numbers, so an initial buffer of 4 is fine.
    buf = new StringBuffer(4);

    while(Character.digit(__lookahead, radix) != -1 && digits < maxDigits){
      buf.append((char)__lookahead);
      __match(__lookahead);
      ++digits;
    }

    if(digits < minDigits || digits > maxDigits)
      throw
        new MalformedPatternException(
        "Parse error: unexpected number of digits at position " + __bytesRead);

    try {
      num = Integer.parseInt(buf.toString(), radix);
    } catch(NumberFormatException e) {
      throw
        new MalformedPatternException("Parse error: numeric value at " +
                                "position " + __bytesRead + " is invalid");
    }

    return num;
  }

  private SyntaxNode __repetition(SyntaxNode atom)
    throws MalformedPatternException {
    int min, max, startPosition[];
    StringBuffer minBuffer, maxBuffer;
    SyntaxNode root = null;
    CatNode catNode;

    __match('{');

    min = __parseUnsignedInteger(10, 1, Integer.MAX_VALUE);
    startPosition = new int[1];
    startPosition[0] = __position;

    if(__lookahead == '}'){
      // Match exactly min times.  Concatenate the atom min times.
      __match('}');

      if(min == 0)
        throw
          new MalformedPatternException(
              "Parse error: Superfluous interval specified at position " +
              __bytesRead + ".  Number of occurences was set to zero.");

      if(min == 1)
        return atom;

      root = catNode = new CatNode();
      catNode._left = atom;

      while(--min > 1) {
        atom = atom._clone(startPosition);

        catNode._right = new CatNode();
        catNode       = (CatNode)catNode._right;
        catNode._left  = atom;
      }

      catNode._right = atom._clone(startPosition);
    } else if(__lookahead == ','){
      __match(',');

      if(__lookahead == '}') {
        // match at least min times
        __match('}');

        if(min == 0)
          return new StarNode(atom);

        if(min == 1)
          return new PlusNode(atom);

        root = catNode = new CatNode();
        catNode._left = atom;

        while(--min > 0) {
          atom = atom._clone(startPosition);

          catNode._right = new CatNode();
          catNode       = (CatNode)catNode._right;
          catNode._left  = atom;
        }

        catNode._right = new StarNode(atom._clone(startPosition));
      } else {
        // match at least min times and at most max times
        max = __parseUnsignedInteger(10, 1, Integer.MAX_VALUE);
        __match('}');

        if(max < min)
          throw
            new MalformedPatternException("Parse error: invalid interval; "
             +  max + " is less than " + min + " at position " + __bytesRead);
        if(max == 0)
          throw
            new MalformedPatternException(
            "Parse error: Superfluous interval specified at position " +
            __bytesRead + ".  Number of occurences was set to zero.");

        if(min == 0) {
          if(max == 1)
            return new QuestionNode(atom);

          root = catNode = new CatNode();
          atom = new QuestionNode(atom);
          catNode._left = atom;

          while(--max > 1) {
            atom =  atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode       = (CatNode)catNode._right;
            catNode._left  = atom;
          }

          catNode._right = atom._clone(startPosition);
        } else if(min == max) {
          if(min == 1)
            return atom;

          root = catNode = new CatNode();
          catNode._left = atom;

          while(--min > 1) {
            atom = atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode       = (CatNode)catNode._right;
            catNode._left  = atom;
          }

          catNode._right = atom._clone(startPosition);
        } else {
          int count;

          root = catNode = new CatNode();
          catNode._left = atom;

          for(count=1; count < min; count++) {
            atom = atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode       = (CatNode)catNode._right;
            catNode._left  = atom;
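
The interval branch of `__repetition` shown on this page never introduces a dedicated interval node: it rewrites `{n}`, `{n,}` and `{n,m}` as concatenations of cloned atoms combined with `*`, `+` and `?`. The following is a minimal sketch of that rewriting, working on plain regex strings instead of `SyntaxNode` trees. The class `IntervalExpansionSketch`, the method `expand`, and the `-1` sentinel for an open upper bound are hypothetical names chosen for illustration and are not part of Jakarta ORO. The listing above is cut off in the middle of the general `min < max` case, so the handling of the trailing `max - min` copies as optional atoms is an assumption, not something this page confirms.

```java
// Illustrative sketch only; not part of org.apache.oro.
class IntervalExpansionSketch {

    // Rewrites atom{min,max} without interval syntax.
    // max == -1 stands for an open upper bound, i.e. atom{min,}.
    // The atom is assumed to be a single character or already parenthesized.
    static String expand(String atom, int min, int max) {
        if (max != -1 && max < min)
            throw new IllegalArgumentException(
                "invalid interval; " + max + " is less than " + min);
        if (max == 0)
            throw new IllegalArgumentException(
                "superfluous interval; number of occurrences set to zero");

        StringBuilder out = new StringBuilder();

        if (max == -1) {
            // {0,} becomes *, {1,} becomes +, as in the listing above.
            if (min == 0) return atom + "*";
            if (min == 1) return atom + "+";
            // Otherwise: min mandatory copies followed by a starred copy.
            for (int i = 0; i < min; i++) out.append(atom);
            return out.append(atom).append('*').toString();
        }

        // min mandatory copies ...
        for (int i = 0; i < min; i++) out.append(atom);
        // ... followed by (max - min) optional copies.
        // ASSUMPTION for 0 < min < max: that part of the source is not shown.
        for (int i = min; i < max; i++) out.append(atom).append('?');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("a", 3, 3));     // aaa
        System.out.println(expand("a", 2, -1));    // aaa*
        System.out.println(expand("a", 0, 2));     // a?a?
        System.out.println(expand("(ab)", 1, -1)); // (ab)+
    }
}
```

Because every repetition is unrolled into copies of the atom, the size of the resulting expression grows with the interval bounds, which is worth keeping in mind when writing large intervals for a DFA-based matcher.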
💿 File size: 701 K
👤 Uploaded by: zhuxiaobei123
📂 Category: Java programming
🏷️ Related tags

#java #regular #expression