grammars.qbk

来自「Boost provides free peer-reviewed portab」· QBK 代码 · 共 314 行

QBK
314
字号
[/ / Copyright (c) 2008 Eric Niebler / / Distributed under the Boost Software License, Version 1.0. (See accompanying / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) /][section Grammars and Nested Matches][h2 Overview]One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++code and data from within the regex. This enables programming idioms that are not possible with other regularexpression libraries. Of particular note is the ability for one regex to refer to another regex, allowing youto build grammars out of regular expressions. This section describes how to embed one regex in another by valueand by reference, how regex objects behave when they refer to other regexes, and how to access the tree of resultsafter a successful parse.[h2 Embedding a Regex by Value]The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definitionof another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored bythe enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regexparticipates fully in the match, back-tracking as needed to make the match succeed.Consider a text editor that has a regex-find feature with a whole-word option. You can implement this withxpressive as follows:    find_dialog dlg;    if( dialog_ok == dlg.do_modal() )    {        std::string pattern = dlg.get_text();          // the pattern the user entered        bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option?        sregex re = sregex::compile( pattern );        // try to compile the pattern        if( whole_word )        {            // wrap the regex in begin-word / end-word assertions            re = bow >> re >> eow;        }        // ... use re ...    }Look closely at this line:    // wrap the regex in begin-word / end-word assertions    re = bow >> re >> eow;This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back tothe original regex. Since a copy of the old regex was made on the right-hand side, this works as you mightexpect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.[note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regexobjects embed by value by default. The next section shows how to define a recursive regular expression byembedding a regex by reference.][h2 Embedding a Regex by Reference]If you want to be able to build recursive regular expressions and context-free grammars, embedding a regexby value is not enough. You need to be able to make your regular expressions self-referential. Most regularexpression engines don't give you that power, but xpressive does.[tip The theoretical computer scientists out there will correctly point out that a self-referentialregular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engineat all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of ourpattern matching engines, so I'm not going to try to fight linguistic necessity here."]Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression thatmatches balanced, nested parentheses:    sregex parentheses;    parentheses                          // A balanced set of parentheses ...        = '('                            // is an opening parenthesis ...            >>                           // followed by ...             *(                          // zero or more ...                keep( +~(set='(',')') )  // of a bunch of things that are not parentheses ...              |                          // or ...                by_ref(parentheses)      // a balanced set of parentheses              )                          //   (ooh, recursion!) ...            >>                           // followed by ...          ')'                            // a closing parenthesis        ;Matching balanced, nested tags is an important text processing task, and it is one that "classic" regularexpressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embeddedin another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-handside back to `parentheses` creates a cycle, which will execute recursively.[h2 Building a Grammar]Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner offun things are possible. In particular, we can now build grammars out of regular expressions. Let's havea look at the text-book grammar example: the humble calculator.    sregex group, factor, term, expression;    group       = '(' >> by_ref(expression) >> ')';    factor      = +_d | group;    term        = factor >> *(('*' >> factor) | ('/' >> factor));    expression  = term >> *(('+' >> term) | ('-' >> term));The regex `expression` defined above does something rather remarkable for a regular expression: it matchesmathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern wouldmatch `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses arebalanced and the infix operators have two arguments each. Don't try this with just any regular expressionengine!Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` isimplemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in termsof `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to definea cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressionsthat have not yet been initialized. In the above grammar, there is only one place where we need to referencea regex object that has not yet been initialized: the definition of `group`. In that place, we use`by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the otherregex objects by value, since they have already been initialized and their values will not change.[tip [*Embed by value if possible]\n\nIn general, prefer embedding regular expressions by value rather than by reference. Itinvolves one less indirection, making your patterns match a little faster. Besides, value semanticsare simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying"a regex. Each regex object shares its implementation with all of its copies.][h2 Dynamic Regex Grammars]Using _regex_compiler_, you can also build grammars out of dynamic regular expressions.You do that by creating named regexes, and referring to other regexes by name. Each_regex_compiler_ instance keeps a mapping from names to regexes that have been createdwith it.You can create a named dynamic regex by prefacing your regex with `"(?$name=)"`, where/name/ is the name of the regex. You can refer to a named regex from another regex with`"(?$name)"`. The named regex does not need to exist yet at the time it is referencedin another regex, but it must exist by the time you use the regex.Below is a code fragment that uses dynamic regex grammars to implement the calculatorexample from above.    using namespace boost::xpressive;    using namespace regex_constants;    sregex expr;    {         sregex_compiler compiler;         syntax_option_type x = ignore_white_space;                compiler.compile("(? $group  = ) \\( (? $expr ) \\) ", x);                compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x);                compiler.compile("(? $term   = ) (? $factor )"                                 " ( \\* (? $factor ) | / (? $factor ) )* ", x);         expr = compiler.compile("(? $expr   = ) (? $term )"                                 "   ( \\+ (? $term ) | - (? $term )   )* ", x);    }    std::string str("foo 9*(10+3) bar");    smatch what;    if(regex_search(str, what, expr))    {         // This prints "9*(10+3)":         std::cout << what[0] << std::endl;    }As with static regex grammars, nested regex invocations create nestedmatch results (see /Nested Results/ below). The result is a complete parse treefor string that matched. Unlike static regexes, dynamic regexes are alwaysembedded by reference, not by value.[h2 Cyclic Patterns, Copying and Memory Management, Oh My!]The calculator examples above raises a number of very complicated memory-management issues. Each of thefour regex objects refer to each other, some directly and some indirectly, some by value and some byreference. What if we were to return one of them from a function and let the others go out of scope?What becomes of the references? The answer is that the regex objects are internally reference counted,such that they keep their referenced regex objects alive as long as they need them. So passing a regexobject by value is never a problem, even if it refers to other regex objects that have gone out of scope.Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclicreferences. If regex objects are reference counted, what happens to cycles like the one created in thecalculator examples? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some trickyreference tracking code that ensures that even cyclic regex grammars are cleaned up when the last externalreference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around andcopy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references.[h2 Nested Regexes and Sub-Match Scoping]Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex writeto and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on thesub-matches written by the outer regex. For example, what does this do?    sregex inner = sregex::compile( "(.)\\1" );    sregex outer = (s1= _) >> inner >> s1;The author probably didn't intend for the inner regex to overwrite the sub-match written by the outerregex. The problem is particularly acute when the inner regex is accepted from the user as input. Theauthor has no way of knowing whether the inner regex will stomp the sub-match vector or not. This isclearly not acceptable.Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matchesbelong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector toplay with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, forexample, the regex `outer` defined above would match `"ABBA"`, as it should.[h2 Nested Results]If nested regexes have their own sub-matches, there should be a way to access them after a successfulmatch. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaveslike the head of a tree of nested results. The _match_results_ class provides a `nested_results()`member function that returns an ordered sequence of _match_results_ structures, representing theresults of the nested regexes. The order of the nested results is the same as the order in whichthe nested regex objects matched.Take as an example the regex for balanced, nested parentheses we saw earlier:    sregex parentheses;    parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')';    smatch what;    std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" );    if( regex_search( str, what, parentheses ) )    {        // display the whole match        std::cout << what[0] << '\n';        // display the nested results        std::for_each(            what.nested_results().begin(),            what.nested_results().end(),            output_nested_results() );    }This program displays the following:[pre( a(b)c (c(e)f (g)h )i (j)6 )    (b)    (c(e)f (g)h )        (e)        (g)    (j)]Here you can see how the results are nested and that they are stored in the order in which theyare found.[tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the[link boost_xpressive.user_s_guide.examples Examples] section.][h2 Filtering Nested Results]Sometimes a regex will have several nested regex objects, and you want to know which result correspondsto which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()`come in handy. When iterating over the nested results, you can compare the regex id from the results tothe id of the regex object you're interested in.To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just theresults that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it isintended to be used with _iterator_. You can use it as follows:    sregex name = +alpha;    sregex integer = +_d;    sregex re = *( *_s >> ( name | integer ) );    smatch what;    std::string str( "marsha 123 jan 456 cindy 789" );    if( regex_match( str, what, re ) )    {        smatch::nested_results_type::const_iterator begin = what.nested_results().begin();        smatch::nested_results_type::const_iterator end   = what.nested_results().end();        // declare filter predicates to select just the names or the integers        sregex_id_filter_predicate name_id( name.regex_id() );        sregex_id_filter_predicate integer_id( integer.regex_id() );        // iterate over only the results from the name regex        std::for_each(            boost::make_filter_iterator( name_id, begin, end ),            boost::make_filter_iterator( name_id, end, end ),            output_result            );        std::cout << '\n';        // iterate over only the results from the integer regex        std::for_each(            boost::make_filter_iterator( integer_id, begin, end ),            boost::make_filter_iterator( integer_id, end, end ),            output_result            );    }where `output_results` is a simple function that takes a `smatch` and displays the full match.Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and`boost::make_filter_iterator()` from the _iterator_ to select only those resultscorresponding to a particular nested regex. This program displays the following:[premarshajancindy123456789][endsect]

⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?