qregexp.cpp

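A graphical user interface environment released by Trolltech. It can be compiled on the qt-embedded-2.3.10 platform as an embedded graphical user interface environment.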
/**********************************************************************
** Copyright (C) 2000-2005 Trolltech AS.  All rights reserved.
**
** This file is part of the Qtopia Environment.
**
** This program is free software; you can redistribute it and/or modify it
** under the terms of the GNU General Public License as published by the
** Free Software Foundation; either version 2 of the License, or (at your
** option) any later version.
**
** A copy of the GNU GPL license version 2 is included in this package as
** LICENSE.GPL.
**
** This program is distributed in the hope that it will be useful, but
** WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
** See the GNU General Public License for more details.
**
** In addition, as a special exception Trolltech gives permission to link
** the code of this program with Qtopia applications copyrighted, developed
** and distributed by Trolltech under the terms of the Qtopia Personal Use
** License Agreement. You must comply with the GNU General Public License
** in all respects for all of the code used other than the applications
** licensed under the Qtopia Personal Use License Agreement. If you modify
** this file, you may extend this exception to your version of the file,
** but you are not obligated to do so. If you do not wish to do so, delete
** this exception statement from your version.
**
** See http://www.trolltech.com/gpl/ for GPL licensing information.
**
** Contact info@trolltech.com if any conditions of this licensing are
** not clear to you.
**
**********************************************************************/

#include "qregexp.h"

#ifndef QT_NO_REGEXP

#include "qmemarray.h"
#include "qbitarray.h"
#include "qcache.h"
#include "qcleanuphandler.h"
#include "qintdict.h"
#include "qmap.h"
#include "qptrvector.h"
#include "qstring.h"
#include "qtl.h"

#ifdef QT_THREAD_SUPPORT
#include "qthreadstorage.h"
#endif // QT_THREAD_SUPPORT

#undef QT_TRANSLATE_NOOP
#define QT_TRANSLATE_NOOP( context, sourceText ) sourceText

#include <limits.h>

// error strings for the regexp parser
#define RXERR_OK         QT_TRANSLATE_NOOP( "QRegExp", "no error occurred" )
#define RXERR_DISABLED   QT_TRANSLATE_NOOP( "QRegExp", "disabled feature used" )
#define RXERR_CHARCLASS  QT_TRANSLATE_NOOP( "QRegExp", "bad char class syntax" )
#define RXERR_LOOKAHEAD  QT_TRANSLATE_NOOP( "QRegExp", "bad lookahead syntax" )
#define RXERR_REPETITION QT_TRANSLATE_NOOP( "QRegExp", "bad repetition syntax" )
#define RXERR_OCTAL      QT_TRANSLATE_NOOP( "QRegExp", "invalid octal value" )
#define RXERR_LEFTDELIM  QT_TRANSLATE_NOOP( "QRegExp", "missing left delim" )
#define RXERR_END        QT_TRANSLATE_NOOP( "QRegExp", "unexpected end" )
#define RXERR_LIMIT      QT_TRANSLATE_NOOP( "QRegExp", "met internal limit" )

/*
  WARNING! Be sure to read qregexp.tex before modifying this file.
*/

/*!
    \class QRegExp qregexp.h
    \reentrant
    \brief The QRegExp class provides pattern matching using regular expressions.

    \ingroup tools
    \ingroup misc
    \ingroup shared
    \mainclass
    \keyword regular expression

    Regular expressions, or "regexps", provide a way to find patterns
    within text.
    This is useful in many contexts, for example:

    \table
    \row \i Validation
         \i A regexp can be used to check whether a piece of text
         meets some criteria, e.g. is an integer or contains no
         whitespace.
    \row \i Searching
         \i Regexps provide a much more powerful means of searching
         text than simple string matching does. For example we can
         create a regexp which says "find one of the words 'mail',
         'letter' or 'correspondence' but not any of the words
         'email', 'mailman', 'mailer', 'letterbox' etc."
    \row \i Search and Replace
         \i A regexp can be used to replace a pattern with a piece of
         text, for example replace all occurrences of '&' with
         '\&amp;' except where the '&' is already followed by 'amp;'.
    \row \i String Splitting
         \i A regexp can be used to identify where a string should be
         split into its component fields, e.g. splitting tab-delimited
         strings.
    \endtable

    We present a very brief introduction to regexps, a description of
    Qt's regexp language, some code examples, and finally the function
    documentation itself. QRegExp is modeled on Perl's regexp
    language, and also fully supports Unicode. QRegExp can also be
    used in the weaker 'wildcard' (globbing) mode which works in a
    similar way to command shells. A good text on regexps is \e
    {Mastering Regular Expressions: Powerful Techniques for Perl and
    Other Tools} by Jeffrey E. Friedl, ISBN 1565922573.

    Experienced regexp users may prefer to skip the introduction and
    go directly to the relevant information.

    \tableofcontents

    \section1 Introduction

    Regexps are built up from expressions, quantifiers, and assertions.
    The simplest form of expression is simply a character, e.g.
    <b>x</b> or <b>5</b>. An expression can also be a set of
    characters. For example, <b>[ABCD]</b> will match an <b>A</b> or
    a <b>B</b> or a <b>C</b> or a <b>D</b>. As a shorthand we could
    write this as <b>[A-D]</b>.
    If we want to match any of the
    capital letters in the English alphabet we can write
    <b>[A-Z]</b>. A quantifier tells the regexp engine how many
    occurrences of the expression we want, e.g. <b>x{1,1}</b> means
    match an <b>x</b> which occurs at least once and at most once.
    We'll look at assertions and more complex expressions later.

    Note that in general regexps cannot be used to check for balanced
    brackets or tags. For example if you want to match an opening HTML
    \c <b> and its closing \c </b> you can only use a regexp if you
    know that these tags are not nested; the HTML fragment \c{<b>bold
    <b>bolder</b></b>} will not match as expected. If you know the
    maximum level of nesting it is possible to create a regexp that
    will match correctly, but for an unknown level of nesting, regexps
    will fail.

    We'll start by writing a regexp to match integers in the range 0
    to 99. We will require at least one digit so we will start with
    <b>[0-9]{1,1}</b> which means match a digit exactly once. This
    regexp alone will match integers in the range 0 to 9. To match one
    or two digits we can increase the maximum number of occurrences so
    the regexp becomes <b>[0-9]{1,2}</b> meaning match a digit at
    least once and at most twice. However, this regexp as it stands
    will not match correctly. This regexp will match one or two digits
    \e within a string. To ensure that we match against the whole
    string we must use the anchor assertions. We need <b>^</b> (caret)
    which when it is the first character in the regexp means that the
    regexp must match from the beginning of the string. And we also
    need <b>$</b> (dollar) which when it is the last character in the
    regexp means that the regexp must match until the end of the
    string. So now our regexp is <b>^[0-9]{1,2}$</b>. Note that
    assertions, such as <b>^</b> and <b>$</b>, do not match any
    characters.
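
The effect of the anchors can be tried out with a minimal sketch. It is written against the C++ standard library's std::regex (ECMAScript grammar) instead of QRegExp, purely so that it compiles without Qt. The function names isZeroTo99() and containsOneOrTwoDigits() are hypothetical and are not part of this file or of Qt.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the whole string consists of one or
// two digits, i.e. an integer in the range 0 to 99. The ^ and $ anchors
// force the match to span the entire input.
bool isZeroTo99(const std::string &text)
{
    static const std::regex pattern("^[0-9]{1,2}$");
    return std::regex_search(text, pattern);
}

// Hypothetical helper: the same expression without anchors. It succeeds
// as soon as one or two digits occur anywhere inside the string.
bool containsOneOrTwoDigits(const std::string &text)
{
    static const std::regex pattern("[0-9]{1,2}");
    return std::regex_search(text, pattern);
}
```

With these helpers, "42" passes both checks, while "abc123" passes only the unanchored one, which is the difference the paragraph above describes.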

    If you've seen regexps elsewhere they may have looked different from
    the ones above. This is because some sets of characters and some
    quantifiers are so common that they have special symbols to
    represent them. <b>[0-9]</b> can be replaced with the symbol
    <b>\d</b>. The quantifier to match exactly one occurrence,
    <b>{1,1}</b>, can be replaced with the expression itself. This means
    that <b>x{1,1}</b> is exactly the same as <b>x</b> alone. So our 0
    to 99 matcher could be written <b>^\d{1,2}$</b>. Another way of
    writing it would be <b>^\d\d{0,1}$</b>, i.e. from the start of the
    string match a digit followed by zero or one digits. In practice
    most people would write it <b>^\d\d?$</b>. The <b>?</b> is a
    shorthand for the quantifier <b>{0,1}</b>, i.e. a minimum of no
    occurrences and a maximum of one occurrence. This is used to make an
    expression optional. The regexp <b>^\d\d?$</b> means "from the
    beginning of the string match one digit followed by zero or one
    digits and then the end of the string".

    Our second example is matching the words 'mail', 'letter' or
    'correspondence' but without matching 'email', 'mailman',
    'mailer', 'letterbox' etc. We'll start by just matching 'mail'. In
    full the regexp is <b>m{1,1}a{1,1}i{1,1}l{1,1}</b>, but since
    each expression itself is automatically quantified by <b>{1,1}</b>
    we can simply write this as <b>mail</b>; an 'm' followed by an 'a'
    followed by an 'i' followed by an 'l'. The symbol '|' (bar) is
    used for \e alternation, so our regexp now becomes
    <b>mail|letter|correspondence</b> which means match 'mail' \e or
    'letter' \e or 'correspondence'. Whilst this regexp will find the
    words we want it will also find words we don't want such as
    'email'. We will start by putting our regexp in parentheses,
    <b>(mail|letter|correspondence)</b>.
    Parentheses have two effects:
    firstly they group expressions together and secondly they identify
    parts of the regexp that we wish to \link #capturing-text capture
    \endlink. Our regexp still matches any of the three words but now
    they are grouped together as a unit. This is useful for building
    up more complex regexps. It is also useful because it allows us to
    examine which of the words actually matched. We need to use
    another assertion, this time <b>\b</b> "word boundary":
    <b>\b(mail|letter|correspondence)\b</b>. This regexp means "match
    a word boundary followed by the expression in parentheses followed
    by another word boundary". The <b>\b</b> assertion matches at a \e
    position in the regexp not a \e character in the regexp. A word
    boundary is any non-word character such as a space, a newline, or
    the beginning or end of the string.

    For our third example we want to replace ampersands with the HTML
    entity '\&amp;'. The regexp to match is simple: <b>\&</b>, i.e.
    match one ampersand. Unfortunately this will mess up our text if
    some of the ampersands have already been turned into HTML
    entities. So what we really want to say is replace an ampersand
    providing it is not followed by 'amp;'. For this we need the
    negative lookahead assertion and our regexp becomes:
    <b>\&(?!amp;)</b>. The negative lookahead assertion is introduced
    with '(?!' and finishes at the ')'. It means that the text it
    contains, 'amp;' in our example, must \e not follow the expression
    that precedes it.

    Regexps provide a rich language that can be used in a variety of
    ways. For example suppose we want to count all the occurrences of
    'Eric' and 'Eirik' in a string. Two valid regexps to match these
    are <b>\\b(Eric|Eirik)\\b</b> and <b>\\bEi?ri[ck]\\b</b>. We need
    the word boundary '\b' so we don't get 'Ericsson' etc. The second
    regexp actually matches more than we want, 'Eric', 'Erik', 'Eiric'
    and 'Eirik'.
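
The three introductory examples above (whole-word alternation, the negative lookahead replacement and counting names) can be sketched as follows. As an assumption made only to keep the sketch free of Qt, it uses the C++ standard library's std::regex in its default ECMAScript grammar, which also understands \b, (?!...) and capturing parentheses. The function names findWholeWord(), escapeAmpersands() and countNames() are hypothetical and do not appear in this file.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: returns whichever of 'mail', 'letter' or
// 'correspondence' occurs as a whole word, or an empty string if none
// does. The parentheses capture the word that actually matched.
std::string findWholeWord(const std::string &text)
{
    static const std::regex pattern("\\b(mail|letter|correspondence)\\b");
    std::smatch match;
    if (std::regex_search(text, match, pattern))
        return match[1].str();
    return std::string();
}

// Hypothetical helper: replaces each '&' with "&amp;" unless the '&' is
// already followed by "amp;" (negative lookahead).
std::string escapeAmpersands(const std::string &text)
{
    static const std::regex pattern("&(?!amp;)");
    return std::regex_replace(text, pattern, "&amp;");
}

// Hypothetical helper: counts whole-word matches of the second pattern
// from the text, which accepts 'Eric', 'Erik', 'Eiric' and 'Eirik'.
int countNames(const std::string &text)
{
    static const std::regex pattern("\\bEi?ri[ck]\\b");
    std::sregex_iterator first(text.begin(), text.end(), pattern);
    std::sregex_iterator last;
    return static_cast<int>(std::distance(first, last));
}
```

Note that the backslashes are doubled in the C++ string literals, so that the regexp engine receives a single backslash before each 'b'.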

    We will implement some of the examples above in the
    \link #code-examples code examples \endlink section.

    \target characters-and-abbreviations-for-sets-of-characters
    \section1 Characters and Abbreviations for Sets of Characters

    \table
    \header \i Element \i Meaning
    \row \i <b>c</b>
         \i Any character represents itself unless it has a special
         regexp meaning. Thus <b>c</b> matches the character \e c.
    \row \i <b>\\c</b>
         \i A character that follows a backslash matches the character
         itself except where mentioned below. For example if you
         wished to match a literal caret at the beginning of a string
         you would write <b>\^</b>.
    \row \i <b>\\a</b>
         \i This matches the ASCII bell character (BEL, 0x07).
    \row \i <b>\\f</b>
         \i This matches the ASCII form feed character (FF, 0x0C).
    \row \i <b>\\n</b>
         \i This matches the ASCII line feed character (LF, 0x0A, Unix newline).
    \row \i <b>\\r</b>
         \i This matches the ASCII carriage return character (CR, 0x0D).
    \row \i <b>\\t</b>
         \i This matches the ASCII horizontal tab character (HT, 0x09).
    \row \i <b>\\v</b>
         \i This matches the ASCII vertical tab character (VT, 0x0B).
    \row \i <b>\\xhhhh</b>
         \i This matches the Unicode character corresponding to the
         hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo
         (i.e., \zero ooo) matches the ASCII/Latin-1 character
         corresponding to the octal number ooo (between 0 and 0377).
    \row \i <b>. (dot)</b>
         \i This matches any character (including newline).
    \row \i <b>\\d</b>
         \i This matches a digit (QChar::isDigit()).
    \row \i <b>\\D</b>
         \i This matches a non-digit.
    \row \i <b>\\s</b>
         \i This matches a whitespace (QChar::isSpace()).
    \row \i <b>\\S</b>
         \i This matches a non-whitespace.
    \row \i <b>\\w</b>
         \i This matches a word character (QChar::isLetterOrNumber() or '_').
    \row \i <b>\\W</b>
         \i This matches a non-word character.
    \row \i <b>\\<i>n</i></b>
         \i The n-th \link #capturing-text backreference \endlink,
         e.g. \1, \2, etc.
    \endtable

    \e {Note that the C++ compiler transforms backslashes in strings,
    so to include a <b>\\</b> in a regexp you will need to enter it
    twice, i.e. <b>\\\\</b>.}

    \target sets-of-characters
    \section1 Sets of Characters

    Square brackets are used to match any character in the set of
    characters contained within the square brackets. All the character
    set abbreviations described above can be used within square
    brackets. Apart from the character set abbreviations and the
    following two exceptions no characters have special meanings in
    square brackets.

    \table
    \row \i <b>^</b>
         \i The caret negates the character set if it occurs as the
         first character, i.e. immediately after the opening square
         bracket. For example, <b>[abc]</b> matches 'a' or 'b' or 'c',
         but <b>[^abc]</b> matches anything \e except 'a' or 'b' or
         'c'.
    \row \i <b>-</b>
         \i The dash is used to indicate a range of characters, for
         example <b>[W-Z]</b> matches 'W' or 'X' or 'Y' or 'Z'.
    \endtable

    Using the predefined character set abbreviations is more portable
    than using character ranges across platforms and languages. For
    example, <b>[0-9]</b> matches a digit in Western alphabets but
    <b>\d</b> matches a digit in \e any alphabet.

    Note that in most regexp literature sets of characters are called
    "character classes".

    \target quantifiers
    \section1 Quantifiers

    By default an expression is automatically quantified by
    <b>{1,1}</b>, i.e. it should occur exactly once. In the following
    list <b>\e {E}</b> stands for any expression. An expression is a
    character or an abbreviation for a set of characters or a set of
    characters in square brackets or any parenthesised expression.

    \table
    \row \i <b>\e {E}?</b>
         \i Matches zero or one occurrence of \e E. This quantifier
         means "the previous expression is optional" since it will
         match whether or not the expression occurs in the string. It
         is the same as <b>\e {E}{0,1}</b>.
         For example <b>dents?</b>
         will match 'dent' and 'dents'.
    \row \i <b>\e {E}+</b>
         \i Matches one or more occurrences of \e E. This is the same
         as <b>\e {E}{1,MAXINT}</b>. For example, <b>0+</b> will match
         '0', '00', '000', etc.
    \row \i <b>\e {E}*</b>
         \i Matches zero or more occurrences of \e E. This is the same
         as <b>\e {E}{0,MAXINT}</b>. The <b>*</b> quantifier is often
         used by mistake. Since it matches \e zero or more
         occurrences it will match no occurrences at all. For example
         if we want to match strings that end in whitespace and use
         the regexp <b>\s*$</b> we would get a match on every string.
         This is because we have said find zero or more whitespace
         followed by the end of string, so even strings that don't end
         in whitespace will match. The regexp we want in this case is
         <b>\s+$</b> to match strings that have at least one
         whitespace at the end.
    \row \i <b>\e {E}{n}</b>
         \i Matches exactly \e n occurrences of the expression. This
         is the same as repeating the expression \e n times. For
         example, <b>x{5}</b> is the same as <b>xxxxx</b>. It is also
         the same as <b>\e {E}{n,n}</b>, e.g. <b>x{5,5}</b>.
    \row \i <b>\e {E}{n,}</b>
         \i Matches at least \e n occurrences of the expression. This
         is the same as <b>\e {E}{n,MAXINT}</b>.
    \row \i <b>\e {E}{,m}</b>
         \i Matches at most \e m occurrences of the expression. This
         is the same as <b>\e {E}{0,m}</b>.
    \row \i <b>\e {E}{n,m}</b>
         \i Matches at least \e n occurrences of the expression and at
         most \e m occurrences of the expression.
    \endtable

    (MAXINT is implementation dependent but will not be smaller than
    1024.)

    If we wish to apply a quantifier to more than just the preceding
    character we can use parentheses to group characters together in
    an expression. For example, <b>tag+</b> matches a 't' followed by
    an 'a' followed by at least one 'g', whereas <b>(tag)+</b> matches
    at least one occurrence of 'tag'.

    Note that quantifiers are "greedy".
    They will match as much text
    as they can. For example, <b>0+</b> will match as many zeros as it
    can from the first zero it finds, e.g. '2.<u>000</u>5'.
    Quantifiers can be made non-greedy, see setMinimal().

    \target capturing-text
    \section1 Capturing Text

    Parentheses allow us to group elements together so that we can
    quantify and capture them. For example if we have the expression
    <b>mail|letter|correspondence</b> that matches a string we know
    that \e one of the words matched but not which one. Using
    parentheses allows us to "capture" whatever is matched within
    their bounds, so if we used <b>(mail|letter|correspondence)</b>
    and matched this regexp against the string "I sent you some email"
    we can use the cap() or capturedTexts() functions to extract the
    matched characters, in this case 'mail'.

    We can use captured text within the regexp itself. To refer to the
    captured text we use \e backreferences which are indexed from 1,
    the same as for cap(). For example we could search for duplicate
    words in a string using <b>\b(\w+)\W+\1\b</b> which means match a
    word boundary followed by one or more word characters followed by
    one or more non-word characters followed by the same text as the
    first parenthesised expression followed by a word boundary.

    If we want to use parentheses purely for grouping and not for
    capturing we can use the non-capturing syntax, e.g.
    <b>(?:green|blue)</b>. Non-capturing parentheses begin '(?:' and
    end ')'. In this example we match either 'green' or 'blue' but we
    do not capture the match so we only know whether or not we matched
    but not which color we actually found. Using non-capturing
