
[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it simple to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2]. In C++,
parentheses only group \-\- there is no way to give them side\-effects. To
get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using [^\1] and [^\2] in Perl. For example,
consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale. You
can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[def _s1_       [globalref boost::xpressive::s1 s1]]
[def _bos_      [globalref boost::xpressive::bos bos]]
[def _eos_      [globalref boost::xpressive::eos eos]]
[def _b_        [globalref boost::xpressive::_b _b]]
[def _n_        [globalref boost::xpressive::_n _n]]
[def _ln_       [globalref boost::xpressive::_ln _ln]]
[def _d_        [globalref boost::xpressive::_d _d]]
[def _w_        [globalref boost::xpressive::_w _w]]
[def _s_        [globalref boost::xpressive::_s _s]]
[def _alnum_    [globalref boost::xpressive::alnum alnum]]
[def _alpha_    [globalref boost::xpressive::alpha alpha]]
[def _blank_    [globalref boost::xpressive::blank blank]]
[def _cntrl_    [globalref boost::xpressive::cntrl cntrl]]
[def _digit_    [globalref boost::xpressive::digit digit]]
[def _graph_    [globalref boost::xpressive::graph graph]]
[def _lower_    [globalref boost::xpressive::lower lower]]
[def _print_    [globalref boost::xpressive::print print]]
[def _punct_    [globalref boost::xpressive::punct punct]]
[def _space_    [globalref boost::xpressive::space space]]
[def _upper_    [globalref boost::xpressive::upper upper]]
[def _xdigit_   [globalref boost::xpressive::xdigit xdigit]]
[def _set_      [globalref boost::xpressive::set set]]
[def _repeat_   [funcref boost::xpressive::repeat repeat]]
[def _range_    [funcref boost::xpressive::range range]]
[def _icase_    [funcref boost::xpressive::icase icase]]
[def _before_   [funcref boost::xpressive::before before]]
[def _after_    [funcref boost::xpressive::after after]]
[def _keep_     [funcref boost::xpressive::keep keep]]

[table Perl syntax vs. Static xpressive syntax
    [[Perl]               [Static xpressive]                              [Meaning]]
    [[[^.]]               [[globalref boost::xpressive::_ `_`]]           [any character (assuming Perl's /s modifier).]]
    [[[^ab]]              [`a >> b`]                                      [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]]             [`a | b`]                                       [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]]             [`(_s1_= a)`]                                   [group and capture a back-reference.]]
    [[[^(?:a)]]           [`(a)`]                                         [group and do not capture a back-reference.]]
    [[[^\1]]              [`_s1_`]                                        [a previously captured back-reference.]]
    [[[^a*]]              [`*a`]                                          [zero or more times, greedy.]]
    [[[^a+]]              [`+a`]                                          [one or more times, greedy.]]
    [[[^a?]]              [`!a`]                                          [zero or one time, greedy.]]
    [[[^a{n,m}]]          [`_repeat_<n,m>(a)`]                            [between [^n] and [^m] times, greedy.]]
    [[[^a*?]]             [`-*a`]                                         [zero or more times, non-greedy.]]
    [[[^a+?]]             [`-+a`]                                         [one or more times, non-greedy.]]
    [[[^a??]]             [`-!a`]                                         [zero or one time, non-greedy.]]
    [[[^a{n,m}?]]         [`-_repeat_<n,m>(a)`]                           [between [^n] and [^m] times, non-greedy.]]
    [[[^^]]               [`_bos_`]                                       [beginning of sequence assertion.]]
    [[[^$]]               [`_eos_`]                                       [end of sequence assertion.]]
    [[[^\b]]              [`_b_`]                                         [word boundary assertion.]]
    [[[^\B]]              [`~_b_`]                                        [not word boundary assertion.]]
    [[[^\\n]]             [`_n_`]                                         [literal newline.]]
    [[[^.]]               [`~_n_`]                                        [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]]     [`_ln_`]                                        [logical newline.]]
    [[[^\[^\\r\\n\]]]     [`~_ln_`]                                       [any single character not a logical newline.]]
    [[[^\w]]              [`_w_`]                                         [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]]              [`~_w_`]                                        [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]]              [`_d_`]                                         [a digit character.]]
    [[[^\D]]              [`~_d_`]                                        [not a digit character.]]
    [[[^\s]]              [`_s_`]                                         [a space character.]]
    [[[^\S]]              [`~_s_`]                                        [not a space character.]]
    [[[^\[:alnum:\]]]     [`_alnum_`]                                     [an alpha-numeric character.]]
    [[[^\[:alpha:\]]]     [`_alpha_`]                                     [an alphabetic character.]]
    [[[^\[:blank:\]]]     [`_blank_`]                                     [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]]     [`_cntrl_`]                                     [a control character.]]
    [[[^\[:digit:\]]]     [`_digit_`]                                     [a digit character.]]
    [[[^\[:graph:\]]]     [`_graph_`]                                     [a graphable character.]]
    [[[^\[:lower:\]]]     [`_lower_`]                                     [a lower-case character.]]
    [[[^\[:print:\]]]     [`_print_`]                                     [a printing character.]]
    [[[^\[:punct:\]]]     [`_punct_`]                                     [a punctuation character.]]
    [[[^\[:space:\]]]     [`_space_`]                                     [a white-space character.]]
    [[[^\[:upper:\]]]     [`_upper_`]                                     [an upper-case character.]]
    [[[^\[:xdigit:\]]]    [`_xdigit_`]                                    [a hexadecimal digit character.]]
    [[[^\[0-9\]]]         [`_range_('0','9')`]                            [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]]         [`as_xpr('a') | 'b' |'c'`]                      [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]]         [`(_set_= 'a','b','c')`]                        [['same as above]]]
    [[[^\[0-9abc\]]]      [`_set_[ _range_('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]]      [`_set_[ _range_('0','9') | (_set_= 'a','b','c') ]`]  [['same as above]]]
    [[[^\[^abc\]]]        [`~(_set_= 'a','b','c')`]                       [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]]   [`_icase_(`[^['stuff]]`)`]                      [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]]    [`_keep_(`[^['stuff]]`)`]                       [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]]    [`_before_(`[^['stuff]]`)`]                     [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]]    [`~_before_(`[^['stuff]]`)`]                    [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]]   [`_after_(`[^['stuff]]`)`]                      [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?<!['stuff])]]   [`~_after_(`[^['stuff]]`)`]                     [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
]
\n

[endsect]
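
The literal-operand rule from the Character and String Literals section can be illustrated without xpressive at all. The following is a toy sketch in plain C++, not xpressive's implementation: `toy_xpr`, `as_toy_xpr` and `toy_demo` are hypothetical names invented for this illustration. It shows that operators applied to two `char` operands resolve to the built-in ones and yield `int`, while wrapping one operand in a user-defined type lets the overloaded operators take over.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a regex expression node (not an xpressive type).
struct toy_xpr {
    std::string pattern;  // textual record of the expression built so far
};

// Hypothetical analogue of as_xpr(): wraps a literal in the user-defined type.
inline toy_xpr as_toy_xpr(char c) { return toy_xpr{std::string(1, c)}; }

// Sequencing: considered by overload resolution only when the left
// operand is already a toy_xpr.
inline toy_xpr operator>>(toy_xpr lhs, char rhs) {
    lhs.pattern += rhs;
    return lhs;
}

// "One or more": likewise requires a toy_xpr operand.
inline toy_xpr operator+(toy_xpr operand) {
    operand.pattern = "(" + operand.pattern + ")+";
    return operand;
}

// With only char operands, the built-in operators are chosen and yield int.
static_assert(std::is_same<decltype(std::declval<char>() >> std::declval<char>()),
                           int>::value,
              "char >> char is the built-in right shift");
static_assert(std::is_same<decltype(+std::declval<char>()), int>::value,
              "+char is the built-in unary plus");

// Once one operand is wrapped, the overloaded operator is selected instead.
static_assert(std::is_same<decltype(as_toy_xpr('a') >> 'b'), toy_xpr>::value,
              "a wrapped operand selects the overloaded operator>>");

// Small run-time check of the expressions the overloads build.
inline void toy_demo() {
    assert((as_toy_xpr('a') >> 'b').pattern == "ab");
    assert((+as_toy_xpr('a')).pattern == "(a)+");
}
```

Calling `toy_demo()` exercises both overloads. The toy records a string only to make the effect visible; the static regexes described in this section build typed expression objects rather than strings.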
