📄 qregexp.cpp

/****************************************************************************
**
** Copyright (C) 1992-2007 Trolltech ASA. All rights reserved.
**
** This file is part of the QtCore module of the Qt Toolkit.
**
** This file may be used under the terms of the GNU General Public
** License version 2.0 as published by the Free Software Foundation
** and appearing in the file LICENSE.GPL included in the packaging of
** this file.  Please review the following information to ensure GNU
** General Public Licensing requirements will be met:
** http://trolltech.com/products/qt/licenses/licensing/opensource/
**
** If you are unsure which license is appropriate for your use, please
** review the following information:
** http://trolltech.com/products/qt/licenses/licensing/licensingoverview
** or contact the sales department at sales@trolltech.com.
**
** In addition, as a special exception, Trolltech gives you certain
** additional rights. These rights are described in the Trolltech GPL
** Exception version 1.0, which can be found at
** http://www.trolltech.com/products/qt/gplexception/ and in the file
** GPL_EXCEPTION.txt in this package.
**
** In addition, as a special exception, Trolltech, as the sole copyright
** holder for Qt Designer, grants users of the Qt/Eclipse Integration
** plug-in the right for the Qt/Eclipse Integration to link to
** functionality provided by Qt Designer and its related libraries.
**
** Trolltech reserves all rights not expressly granted herein.
**
** This file is provided AS IS with NO WARRANTY OF ANY KIND, INCLUDING THE
** WARRANTY OF DESIGN, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
**
****************************************************************************/

#include "qregexp.h"

#include "qalgorithms.h"
#include "qbitarray.h"
#include "qcache.h"
#include "qdatastream.h"
#include "qlist.h"
#include "qmap.h"
#include "qmutex.h"
#include "qstring.h"
#include "qstringlist.h"
#include "qvector.h"

#include <limits.h>

// error strings for the regexp parser
#define RXERR_OK         QT_TRANSLATE_NOOP("QRegExp", "no error occurred")
#define RXERR_DISABLED   QT_TRANSLATE_NOOP("QRegExp", "disabled feature used")
#define RXERR_CHARCLASS  QT_TRANSLATE_NOOP("QRegExp", "bad char class syntax")
#define RXERR_LOOKAHEAD  QT_TRANSLATE_NOOP("QRegExp", "bad lookahead syntax")
#define RXERR_REPETITION QT_TRANSLATE_NOOP("QRegExp", "bad repetition syntax")
#define RXERR_OCTAL      QT_TRANSLATE_NOOP("QRegExp", "invalid octal value")
#define RXERR_LEFTDELIM  QT_TRANSLATE_NOOP("QRegExp", "missing left delim")
#define RXERR_END        QT_TRANSLATE_NOOP("QRegExp", "unexpected end")
#define RXERR_LIMIT      QT_TRANSLATE_NOOP("QRegExp", "met internal limit")

/*
  WARNING! Be sure to read qregexp.tex before modifying this file.
*/

/*!
    \class QRegExp
    \reentrant
    \brief The QRegExp class provides pattern matching using regular expressions.

    \ingroup tools
    \ingroup misc
    \ingroup shared
    \mainclass
    \keyword regular expression

    Regular expressions, or "regexps", provide a way to find patterns
    within text. This is useful in many contexts, for example:

    \table
    \row \i Validation
         \i A regexp can be used to check whether a piece of text
         meets some criteria, e.g. is an integer or contains no
         whitespace.
    \row \i Searching
         \i Regexps provide a much more powerful means of searching
         text than simple string matching does. For example we can
         create a regexp which says "find one of the words 'mail',
         'letter' or 'correspondence' but not any of the words
         'email', 'mailman', 'mailer', 'letterbox', etc."
    \row \i Search and Replace
         \i A regexp can be used to replace a pattern with a piece of
         text, for example replace all occurrences of '&' with
         '\&amp;' except where the '&' is already followed by 'amp;'.
    \row \i String Splitting
         \i A regexp can be used to identify where a string should be
         split into its component fields, e.g. splitting tab-delimited
         strings.
    \endtable

    We present a very brief introduction to regexps, a description of
    Qt's regexp language, some code examples, and finally the
    function documentation itself. QRegExp is modeled on Perl's
    regexp language, and also fully supports Unicode. QRegExp can
    also be used in the weaker wildcard mode that works in a
    similar way to command shells. It can even be fed with fixed
    strings (see setPatternSyntax()). A good text on regexps is \e
    {Mastering Regular Expressions} (Third Edition) by Jeffrey E. F.
    Friedl, ISBN 0-596-52812-4.

    \tableofcontents

    \section1 Introduction

    Regexps are built up from expressions, quantifiers, and assertions.
    The simplest form of expression is simply a character, e.g.
    \bold{x} or \bold{5}. An expression can also be a set of
    characters. For example, \bold{[ABCD]} will match an \bold{A} or
    a \bold{B} or a \bold{C} or a \bold{D}. As a shorthand we could
    write this as \bold{[A-D]}. If we want to match any of the
    capital letters in the English alphabet we can write
    \bold{[A-Z]}. A quantifier tells the regexp engine how many
    occurrences of the expression we want, e.g. \bold{x{1,1}} means
    match an \bold{x} which occurs at least once and at most once.
    We'll look at assertions and more complex expressions later.

    Note that in general regexps cannot be used to check for balanced
    brackets or tags. For example if you want to match an opening HTML
    \c{<b>} and its closing \c{</b>}, you can only use a regexp if you
    know that these tags are not nested; the HTML fragment \c{<b>bold
    <b>bolder</b></b>} will not match as expected. If you know the
    maximum level of nesting it is possible to create a regexp that
    will match correctly, but for an unknown level of nesting, regexps
    will fail.

    We'll start by writing a regexp to match integers in the range 0
    to 99.
    We will require at least one digit so we will start with
    \bold{[0-9]{1,1}} which means match a digit exactly once. This
    regexp alone will match integers in the range 0 to 9. To match one
    or two digits we can increase the maximum number of occurrences so
    the regexp becomes \bold{[0-9]{1,2}} meaning match a digit at
    least once and at most twice. However, this regexp as it stands
    will not match correctly. This regexp will match one or two digits
    \e within a string. To ensure that we match against the whole
    string we must use the anchor assertions. We need \bold{^} (caret)
    which when it is the first character in the regexp means that the
    regexp must match from the beginning of the string. And we also
    need \bold{$} (dollar) which when it is the last character in the
    regexp means that the regexp must match until the end of the
    string. So now our regexp is \bold{^[0-9]{1,2}$}. Note that
    assertions, such as \bold{^} and \bold{$}, do not match any
    characters.

    If you've seen regexps elsewhere, they may have looked different from
    the ones above. This is because some sets of characters and some
    quantifiers are so common that they have special symbols to
    represent them. \bold{[0-9]} can be replaced with the symbol
    \bold{\\d}. The quantifier to match exactly one occurrence,
    \bold{{1,1}}, can be replaced with the expression itself. This means
    that \bold{x{1,1}} is exactly the same as \bold{x} alone. So our 0
    to 99 matcher could be written \bold{^\\d{1,2}$}. Another way of
    writing it would be \bold{^\\d\\d{0,1}$}, i.e. from the start of the
    string match a digit followed by zero or one digits. In practice
    most people would write it \bold{^\\d\\d?$}. The \bold{?} is a
    shorthand for the quantifier \bold{{0,1}}, i.e. a minimum of no
    occurrences and a maximum of one occurrence. This is used to make an
    expression optional.
    The regexp \bold{^\\d\\d?$} means "from the
    beginning of the string match one digit followed by zero or one
    digits and then the end of the string".

    Our second example is matching the words 'mail', 'letter' or
    'correspondence' but without matching 'email', 'mailman',
    'mailer', 'letterbox', etc. We'll start by just matching 'mail'. In
    full the regexp is \bold{m{1,1}a{1,1}i{1,1}l{1,1}}, but since
    each expression itself is automatically quantified by \bold{{1,1}}
    we can simply write this as \bold{mail}; an 'm' followed by an 'a'
    followed by an 'i' followed by an 'l'. The symbol '|' (bar) is
    used for \e alternation, so our regexp now becomes
    \bold{mail|letter|correspondence} which means match 'mail' \e or
    'letter' \e or 'correspondence'. Whilst this regexp will find the
    words we want it will also find words we don't want such as
    'email'. We will start by putting our regexp in parentheses,
    \bold{(mail|letter|correspondence)}. Parentheses have two effects:
    firstly they group expressions together, and secondly they identify
    parts of the regexp that we wish to \l{capturing text}{capture}.
    Our regexp still matches any of the three words but now
    they are grouped together as a unit. This is useful for building
    up more complex regexps. It is also useful because it allows us to
    examine which of the words actually matched. We need to use
    another assertion, this time \bold{\\b} "word boundary":
    \bold{\\b(mail|letter|correspondence)\\b}. This regexp means "match
    a word boundary followed by the expression in parentheses followed
    by another word boundary". The \bold{\\b} assertion matches at a \e
    position in the regexp, not a \e character in the regexp. A word
    boundary is any non-word character such as a space, a newline, or
    the beginning or end of the string.

    For our third example we want to replace ampersands with the HTML
    entity '\&amp;'. The regexp to match is simple: \bold{\&}, i.e.
    match one ampersand. Unfortunately this will mess up our text if
    some of the ampersands have already been turned into HTML
    entities. So what we really want to say is replace an ampersand
    providing it is not followed by 'amp;'. For this we need the
    negative lookahead assertion and our regexp becomes:
    \bold{\&(?!amp;)}. The negative lookahead assertion is introduced
    with '(?!' and finishes at the ')'. It means that the text it
    contains, 'amp;' in our example, must \e not follow the expression
    that precedes it.

    Regexps provide a rich language that can be used in a variety of
    ways. For example suppose we want to count all the occurrences of
    'Eric' and 'Eirik' in a string. Two valid regexps to match these
    are \bold{\\b(Eric|Eirik)\\b} and \bold{\\bEi?ri[ck]\\b}. We need
    the word boundary '\\b' so we don't get 'Ericsson' etc. The second
    regexp actually matches more than we want: 'Eric', 'Erik', 'Eiric'
    and 'Eirik'.

    We will implement some of the examples above in the
    \link #code-examples code examples \endlink section.

    \target characters-and-abbreviations-for-sets-of-characters
    \section1 Characters and Abbreviations for Sets of Characters

    \table
    \header \i Element \i Meaning
    \row \i \bold{c}
         \i Any character represents itself unless it has a special
         regexp meaning. Thus \bold{c} matches the character \e c.
    \row \i \bold{\\c}
         \i A character that follows a backslash matches the character
         itself except where mentioned below. For example if you
         wished to match a literal caret at the beginning of a string
         you would write \bold{\\^}.
    \row \i \bold{\\a}
         \i This matches the ASCII bell character (BEL, 0x07).
    \row \i \bold{\\f}
         \i This matches the ASCII form feed character (FF, 0x0C).
    \row \i \bold{\\n}
         \i This matches the ASCII line feed character (LF, 0x0A, Unix newline).
    \row \i \bold{\\r}
         \i This matches the ASCII carriage return character (CR, 0x0D).
    \row \i \bold{\\t}
         \i This matches the ASCII horizontal tab character (HT, 0x09).
    \row \i \bold{\\v}
         \i This matches the ASCII vertical tab character (VT, 0x0B).
    \row \i \bold{\\x\e{hhhh}}
         \i This matches the Unicode character corresponding to the
         hexadecimal number \e{hhhh} (between 0x0000 and 0xFFFF).
    \row \i \bold{\\0\e{ooo}} (i.e., \\zero \e{ooo})
         \i matches the ASCII/Latin1 character corresponding to the
         octal number \e{ooo} (between 0 and 0377).
    \row \i \bold{. (dot)}
         \i This matches any character (including newline).
    \row \i \bold{\\d}
         \i This matches a digit (QChar::isDigit()).
    \row \i \bold{\\D}
         \i This matches a non-digit.
    \row \i \bold{\\s}
         \i This matches a whitespace (QChar::isSpace()).
    \row \i \bold{\\S}
         \i This matches a non-whitespace.
    \row \i \bold{\\w}
         \i This matches a word character (QChar::isLetterOrNumber(), QChar::isMark(), or '_').
    \row \i \bold{\\W}
         \i This matches a non-word character.
    \row \i \bold{\\\e{n}}
         \i The \e{n}-th \l backreference, e.g. \\1, \\2, etc.
    \endtable

    \bold{Note:} The C++ compiler transforms backslashes in strings,
    so to include a \bold{\\} in a regexp, you will need to enter it
    twice, i.e. \c{\\}. To match the backslash character itself, you
    will need four: \c{\\\\}.

    \target sets-of-characters
    \section1 Sets of Characters

    Square brackets are used to match any character in the set of
    characters contained within the square brackets. All the character
    set abbreviations described above can be used within square
    brackets. Apart from the character set abbreviations and the
    following two exceptions no characters have special meanings in
    square brackets.

    \table
    \row \i \bold{^}
         \i The caret negates the character set if it occurs as the
         first character, i.e. immediately after the opening square
         bracket. For example, \bold{[abc]} matches 'a' or 'b' or 'c',
         but \bold{[^abc]} matches anything \e except 'a' or 'b' or
         'c'.
    \row \i \bold{-}
         \i The dash is used to indicate a range of characters, for
         example \bold{[W-Z]} matches 'W' or 'X' or 'Y' or 'Z'.
    \endtable

    Using the predefined character set abbreviations is more portable
    than using character ranges across platforms and languages. For
    example, \bold{[0-9]} matches a digit in Western alphabets but
    \bold{\\d} matches a digit in \e any alphabet.

    Note that in most regexp literature sets of characters are called
    "character classes".

    \target quantifiers
    \section1 Quantifiers

    By default an expression is automatically quantified by
    \bold{{1,1}}, i.e. it should occur exactly once. In the following
    list \bold{\e {E}} stands for any expression. An expression is a
    character or an abbreviation for a set of characters or a set of
    characters in square brackets or any parenthesised expression.

    \table
    \row \i \bold{\e {E}?}
         \i Matches zero or one occurrence of \e E. This quantifier
         means "the previous expression is optional" since it will
         match whether or not the expression occurs in the string. It
         is the same as \bold{\e {E}{0,1}}. For example \bold{dents?}
         will match 'dent' and 'dents'.
    \row \i \bold{\e {E}+}
         \i Matches one or more occurrences of \e E. This is the same
         as \bold{\e {E}{1,}}. For example, \bold{0+} will match
         '0', '00', '000', etc.
    \row \i \bold{\e {E}*}
         \i Matches zero or more occurrences of \e E. This is the same
         as \bold{\e {E}{0,}}. The \bold{*} quantifier is often
         used by mistake.
         Since it matches \e zero or more
         occurrences it will match no occurrences at all. For example
         if we want to match strings that end in whitespace and use
         the regexp \bold{\\s*$} we would get a match on every string.
         This is because we have said find zero or more whitespace
         followed by the end of string, so even strings that don't end
         in whitespace will match. The regexp we want in this case is
         \bold{\\s+$} to match strings that have at least one
         whitespace at the end.
    \row \i \bold{\e {E}{n}}
         \i Matches exactly \e n occurrences of the expression. This
         is the same as repeating the expression \e n times. For
         example, \bold{x{5}} is the same as \bold{xxxxx}. It is also
         the same as \bold{\e {E}{n,n}}, e.g. \bold{x{5,5}}.
    \row \i \bold{\e {E}{n,}}
         \i Matches at least \e n occurrences of the expression.
    \row \i \bold{\e {E}{,m}}
         \i Matches at most \e m occurrences of the expression. This
         is the same as \bold{\e {E}{0,m}}.
    \row \i \bold{\e {E}{n,m}}
         \i Matches at least \e n occurrences of the expression and at
         most \e m occurrences of the expression.
    \endtable

    If we wish to apply a quantifier to more than just the preceding
    character we can use parentheses to group characters together in
    an expression. For example, \bold{tag+} matches a 't' followed by
    an 'a' followed by at least one 'g', whereas \bold{(tag)+} matches
    at least one occurrence of 'tag'.

    Note that quantifiers are "greedy". They will match as much text
    as they can. For example, \bold{0+} will match as many zeros as it
    can from the first zero it finds, e.g. '2.\underline{000}5'.
    Quantifiers can be made non-greedy, see setMinimal().
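
    The quantifier behaviour described in this section can be tried
    out with a minimal sketch. It is not QRegExp code: it assumes the
    standard library's std::regex (ECMAScript grammar) as a stand-in,
    on the assumption that these particular patterns should behave the
    same way there. The helper names found() and firstMatch() are
    hypothetical and are not part of Qt. Each regexp backslash is
    doubled because the patterns are written as C++ string literals.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not a Qt function): returns true if `pattern`
// matches anywhere inside `text`.
static bool found(const std::string &pattern, const std::string &text)
{
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper (not a Qt function): returns the first substring
// of `text` matched by `pattern`, or an empty string if nothing matches.
static std::string firstMatch(const std::string &pattern, const std::string &text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        return m.str(0);
    return std::string();
}
```

    With these helpers, found("\\s*$", "abc") is true even though
    "abc" has no trailing whitespace, whereas found("\\s+$", "abc")
    is false. firstMatch("0+", "2.0005") yields the greedy match
    "000", and firstMatch("(tag)+", "tagtag!") yields "tagtag".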