📄 awkcompiler.java
package org.apache.oro.text.awk;

/* ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2000 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation", "Jakarta-Oro"
 *    must not be used to endorse or promote products derived from this
 *    software without prior written permission. For written
 *    permission, please contact apache@apache.org.
 *
 * 5. Products derived from this software may not be called "Apache"
 *    or "Jakarta-Oro", nor may "Apache" or "Jakarta-Oro" appear in their
 *    name, without prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 *
 * Portions of this software are based upon software originally written
 * by Daniel F. Savarese. We appreciate his contributions.
 */

import org.apache.oro.text.regex.*;

/**
 * The AwkCompiler class is used to create compiled regular expressions
 * conforming to the Awk regular expression syntax.  It generates
 * AwkPattern instances upon compilation to be used in conjunction
 * with an AwkMatcher instance.  AwkMatcher finds true leftmost-longest
 * matches, so you must take care with how you formulate your regular
 * expression to avoid matching more than you really want.
 * <p>
 * The supported regular expression syntax is a superset of traditional AWK,
 * but NOT to be confused with GNU AWK or other AWK variants.  Additionally,
 * this AWK implementation is DFA-based and only supports 8-bit ASCII.
 * Consequently, these classes can perform very fast pattern matches in
 * most cases.
 * <p>
 * This is the traditional Awk syntax that is supported:
 * <ul>
 * <li> Alternatives separated by |
 * <li> Quantified atoms
 * <dl compact>
 *      <dt> *     <dd> Match 0 or more times.
 *      <dt> +     <dd> Match 1 or more times.
 *      <dt> ?     <dd> Match 0 or 1 times.
 * </dl>
 * <li> Atoms
 * <ul>
 *     <li> regular expression within parentheses
 *     <li> a . matches everything including newline
 *     <li> a ^ is a null token matching the beginning of a string
 *          but has no relation to newlines (and is only valid at the
 *          beginning of a regex; this differs from traditional awk
 *          for the sake of efficiency in Java).
 *     <li> a $ is a null token matching the end of a string but has
 *          no relation to newlines (and is only valid at the
 *          end of a regex; this differs from traditional awk for the
 *          sake of efficiency in Java).
 *     <li> Character classes (e.g., [abcd]) and ranges (e.g. [a-z])
 *     <ul>
 *         <li> Special backslashed characters work within a character class
 *     </ul>
 *     <li> Special backslashed characters
 *     <dl compact>
 *         <dt> \b <dd> backspace
 *         <dt> \n <dd> newline
 *         <dt> \r <dd> carriage return
 *         <dt> \t <dd> tab
 *         <dt> \f <dd> formfeed
 *         <dt> \xnn <dd> hexadecimal representation of character
 *         <dt> \nn or \nnn <dd> octal representation of character
 *         <dt> Any other backslashed character matches itself
 *     </dl>
 * </ul></ul>
 * <p>
 * This is the extended syntax that is supported:
 * <ul>
 * <li> Quantified atoms
 * <dl compact>
 *      <dt> {n,m} <dd> Match at least n but not more than m times.
 *      <dt> {n,}  <dd> Match at least n times.
 *      <dt> {n}   <dd> Match exactly n times.
 * </dl>
 * <li> Atoms
 * <ul>
 *     <li> Special backslashed characters
 *     <dl compact>
 *         <dt> \d <dd> digit [0-9]
 *         <dt> \D <dd> non-digit [^0-9]
 *         <dt> \w <dd> word character [0-9a-z_A-Z]
 *         <dt> \W <dd> a non-word character [^0-9a-z_A-Z]
 *         <dt> \s <dd> a whitespace character [ \t\n\r\f]
 *         <dt> \S <dd> a non-whitespace character [^ \t\n\r\f]
 *         <dt> \cD <dd> matches the corresponding control character
 *         <dt> \0 <dd> matches null character
 *     </dl>
 * </ul></ul>

 @author <a href="mailto:dfs@savarese.org">Daniel F. Savarese</a>
 @version $Id: AwkCompiler.java,v 1.2 2000/07/23 23:25:18 jon Exp $

 * @see org.apache.oro.text.regex.PatternCompiler
 * @see org.apache.oro.text.regex.MalformedPatternException
 * @see AwkPattern
 * @see AwkMatcher
 */
public final class AwkCompiler implements PatternCompiler {

  public static final int DEFAULT_MASK          = 0;
  public static final int CASE_INSENSITIVE_MASK = 0x0001;

  static final char _END_OF_INPUT = '\uFFFF';

  private boolean __inCharacterClass, __caseSensitive;
  private boolean __beginAnchor, __endAnchor;
  private char __lookahead;
  private int __position, __bytesRead, __expressionLength;
  private char[] __regularExpression;
  private int __openParen, __closeParen;

  public AwkCompiler() { }

  private static boolean __isMetachar(char token) {
    return (token == '*' || token == '?' || token == '+' ||
            token == '[' || token == ']' || token == '(' ||
            token == ')' || token == '|' || /* token == '^' ||
            token == '$' || */ token == '.');
  }

  static boolean _isWordCharacter(char token) {
    return ((token >= 'a' && token <= 'z') ||
            (token >= 'A' && token <= 'Z') ||
            (token >= '0' && token <= '9') ||
            (token == '_'));
  }

  static boolean _isLowerCase(char token){
    return (token >= 'a' && token <= 'z');
  }

  static boolean _isUpperCase(char token){
    return (token >= 'A' && token <= 'Z');
  }

  static char _toggleCase(char token){
    if(_isUpperCase(token))
      return (char)(token + 32);
    else if(_isLowerCase(token))
      return (char)(token - 32);

    return token;
  }

  private void __match(char token) throws MalformedPatternException {
    if(token == __lookahead){
      if(__bytesRead < __expressionLength)
        __lookahead = __regularExpression[__bytesRead++];
      else
        __lookahead = _END_OF_INPUT;
    }
    else
      throw new MalformedPatternException("token: " + token +
          " does not match lookahead: " + __lookahead +
          " at position: " + __bytesRead);
  }

  private void __putback() {
    if(__lookahead != _END_OF_INPUT)
      --__bytesRead;
    __lookahead = __regularExpression[__bytesRead - 1];
  }

  private SyntaxNode __regex() throws MalformedPatternException
  {
    SyntaxNode left;

    left = __branch();

    if(__lookahead == '|') {
      __match('|');
      return (new OrNode(left, __regex()));
    }

    return left;
  }

  private SyntaxNode __branch() throws MalformedPatternException {
    CatNode current;
    SyntaxNode left, root;

    left = __piece();

    if(__lookahead == ')'){
      if(__openParen > __closeParen)
        return left;
      else
        throw new MalformedPatternException("Parse error: close parenthesis"
            + " without matching open parenthesis at position " + __bytesRead);
    } else if(__lookahead == '|' || __lookahead == _END_OF_INPUT)
      return left;

    root = current = new CatNode();
    current._left = left;

    while(true) {
      left = __piece();

      if(__lookahead == ')'){
        if(__openParen > __closeParen){
          current._right = left;
          break;
        } else
          throw new MalformedPatternException("Parse error: close parenthesis"
              + " without matching open parenthesis at position " + __bytesRead);
      } else if(__lookahead == '|' || __lookahead == _END_OF_INPUT){
        current._right = left;
        break;
      }

      current._right = new CatNode();
      current = (CatNode)current._right;
      current._left = left;
    }

    return root;
  }

  private SyntaxNode __piece() throws MalformedPatternException {
    SyntaxNode left;

    left = __atom();

    switch(__lookahead){
    case '+' : __match('+'); return (new PlusNode(left));
    case '?' : __match('?'); return (new QuestionNode(left));
    case '*' : __match('*'); return (new StarNode(left));
    case '{' : return __repetition(left);
    }

    return left;
  }

  // if numChars is 0, this means match as many as you want
  private int __parseUnsignedInteger(int radix, int minDigits, int maxDigits)
    throws MalformedPatternException {
    int num, digits = 0;
    StringBuffer buf;

    // We don't expect huge numbers, so an initial buffer of 4 is fine.
    buf = new StringBuffer(4);

    while(Character.digit(__lookahead, radix) != -1 && digits < maxDigits){
      buf.append((char)__lookahead);
      __match(__lookahead);
      ++digits;
    }

    if(digits < minDigits || digits > maxDigits)
      throw new MalformedPatternException(
          "Parse error: unexpected number of digits at position " + __bytesRead);

    try {
      num = Integer.parseInt(buf.toString(), radix);
    } catch(NumberFormatException e) {
      throw new MalformedPatternException("Parse error: numeric value at " +
          "position " + __bytesRead + " is invalid");
    }

    return num;
  }

  private SyntaxNode __repetition(SyntaxNode atom)
    throws MalformedPatternException {
    int min, max, startPosition[];
    StringBuffer minBuffer, maxBuffer;
    SyntaxNode root = null;
    CatNode catNode;

    __match('{');

    min = __parseUnsignedInteger(10, 1, Integer.MAX_VALUE);
    startPosition = new int[1];
    startPosition[0] = __position;

    if(__lookahead == '}'){
      // Match exactly min times.  Concatenate the atom min times.
      __match('}');

      if(min == 0)
        throw new MalformedPatternException(
            "Parse error: Superfluous interval specified at position " +
            __bytesRead + ". Number of occurrences was set to zero.");

      if(min == 1)
        return atom;

      root = catNode = new CatNode();
      catNode._left = atom;

      while(--min > 1) {
        atom = atom._clone(startPosition);

        catNode._right = new CatNode();
        catNode = (CatNode)catNode._right;
        catNode._left = atom;
      }

      catNode._right = atom._clone(startPosition);
    } else if(__lookahead == ','){
      __match(',');

      if(__lookahead == '}') {
        // match at least min times
        __match('}');

        if(min == 0)
          return new StarNode(atom);

        if(min == 1)
          return new PlusNode(atom);

        root = catNode = new CatNode();
        catNode._left = atom;

        while(--min > 0) {
          atom = atom._clone(startPosition);

          catNode._right = new CatNode();
          catNode = (CatNode)catNode._right;
          catNode._left = atom;
        }

        catNode._right = new StarNode(atom._clone(startPosition));
      } else {
        // match at least min times and at most max times
        max = __parseUnsignedInteger(10, 1, Integer.MAX_VALUE);
        __match('}');

        if(max < min)
          throw new MalformedPatternException("Parse error: invalid interval; "
              + max + " is less than " + min + " at position " + __bytesRead);

        if(max == 0)
          throw new MalformedPatternException(
              "Parse error: Superfluous interval specified at position " +
              __bytesRead + ". Number of occurrences was set to zero.");

        if(min == 0) {
          if(max == 1)
            return new QuestionNode(atom);

          root = catNode = new CatNode();
          atom = new QuestionNode(atom);
          catNode._left = atom;

          while(--max > 1) {
            atom = atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode = (CatNode)catNode._right;
            catNode._left = atom;
          }

          catNode._right = atom._clone(startPosition);
        } else if(min == max) {
          if(min == 1)
            return atom;

          root = catNode = new CatNode();
          catNode._left = atom;

          while(--min > 1) {
            atom = atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode = (CatNode)catNode._right;
            catNode._left = atom;
          }

          catNode._right = atom._clone(startPosition);
        } else {
          int count;

          root = catNode = new CatNode();
          catNode._left = atom;

          for(count=1; count < min; count++) {
            atom = atom._clone(startPosition);

            catNode._right = new CatNode();
            catNode = (CatNode)catNode._right;
            catNode._left = atom;
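
The listing is cut off partway through `__repetition`. As a rough illustration of the idea behind that method (rewriting an interval such as `{n,m}` into plain concatenation, `?` and `*`, which a DFA-based matcher can handle), the following standalone sketch performs the same kind of expansion on strings instead of `SyntaxNode` trees. The class `IntervalExpansionSketch` and its `expand` helper are hypothetical names and are not part of Jakarta ORO; the sketch uses `java.util.regex` non-capturing groups, which are not Awk syntax, only to check the result. The handling of the general `min < max` case (`min` mandatory copies followed by `max - min` optional copies) is an assumption, because that branch is truncated in the listing.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Jakarta ORO: expands atom{min,max}
// into an equivalent pattern that uses only concatenation, '?' and '*'.
public class IntervalExpansionSketch {

    // max < 0 stands for "no upper bound", i.e. the {min,} form.
    static String expand(String atom, int min, int max) {
        if (min < 0)
            throw new IllegalArgumentException("min must not be negative");
        if (max >= 0) {
            // Mirrors the two error cases visible in the listing.
            if (max < min)
                throw new IllegalArgumentException(max + " is less than " + min);
            if (max == 0)
                throw new IllegalArgumentException("zero occurrences requested");
        }

        // Non-capturing group so a multi-character atom repeats as a unit.
        String group = "(?:" + atom + ")";
        StringBuilder out = new StringBuilder();

        // min mandatory copies of the atom.
        for (int i = 0; i < min; i++)
            out.append(group);

        if (max < 0) {
            // {min,}: mandatory copies followed by a starred copy.
            out.append(group).append('*');
        } else {
            // {min,max}: assumed expansion, max - min optional copies.
            for (int i = min; i < max; i++)
                out.append(group).append('?');
        }

        return out.toString();
    }

    public static void main(String[] args) {
        String expanded = expand("ab", 2, 4);
        System.out.println(expanded);   // prints (?:ab)(?:ab)(?:ab)?(?:ab)?
        System.out.println(Pattern.matches(expanded, "ababab"));
        System.out.println(Pattern.matches(expanded, "ab"));
    }
}
```

The expanded pattern grows linearly with `max`, which is one reason to keep interval bounds small when a pattern is compiled this way.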