varieties: character entry, class shorthands, constraint escapes, and back references.
A {\bf $\backslash$} followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs. In EREs, there are no escapes: outside
a bracket expression, a {\bf $\backslash$} followed by an alphanumeric character merely stands
for that character as an ordinary character, and inside a bracket expression,
{\bf $\backslash$} is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

Character-entry escapes (AREs only) exist to make
it easier to specify non-printing and otherwise inconvenient characters
in REs:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$a}}{alert (bell) character, as in C}
\twocolitem{{\bf $\backslash$b}}{backspace, as in C}
\twocolitem{{\bf $\backslash$B}}{synonym for {\bf $\backslash$} to help reduce backslash doubling in some applications where there are multiple levels of backslash processing}
\twocolitem{{\bf $\backslash$c{\it X}}}{(where {\it X} is any character) the character whose low-order 5 bits are the same as those of {\it X}, and whose other bits are all zero}
\twocolitem{{\bf $\backslash$e}}{the character whose collating-sequence name is `{\bf ESC}', or failing that, the character with octal value 033}
\twocolitem{{\bf $\backslash$f}}{formfeed, as in C}
\twocolitem{{\bf $\backslash$n}}{newline, as in C}
\twocolitem{{\bf $\backslash$r}}{carriage return, as in C}
\twocolitem{{\bf $\backslash$t}}{horizontal tab, as in C}
\twocolitem{{\bf $\backslash$u{\it wxyz}}}{(where {\it wxyz} is exactly four hexadecimal digits) the Unicode character {\bf U+{\it wxyz}} in the local byte ordering}
\twocolitem{{\bf $\backslash$U{\it stuvwxyz}}}{(where {\it stuvwxyz} is exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode extension to 32 bits}
\twocolitem{{\bf $\backslash$v}}{vertical tab, as in C}
\twocolitem{{\bf $\backslash$x{\it hhh}}}{(where {\it hhh} is any sequence of hexadecimal digits) the character whose hexadecimal value is {\bf 0x{\it hhh}} (a single character no matter how many hexadecimal digits are used).}
\twocolitem{{\bf $\backslash$0}}{the character whose value is {\bf 0}}
\twocolitem{{\bf $\backslash${\it xy}}}{(where {\it xy} is exactly two octal digits, and is not a {\it back reference} (see below)) the character whose octal value is {\bf 0{\it xy}}}
\twocolitem{{\bf $\backslash${\it xyz}}}{(where {\it xyz} is exactly three octal digits, and is not a back reference (see below)) the character whose octal value is {\bf 0{\it xyz}}}
\end{twocollist}

Hexadecimal digits are `{\bf 0}'-`{\bf 9}', `{\bf a}'-`{\bf f}', and `{\bf A}'-`{\bf F}'. Octal
digits are `{\bf 0}'-`{\bf 7}'.

The character-entry escapes are always taken as ordinary characters. For example,
{\bf $\backslash$135} is {\bf ]} in ASCII, but {\bf $\backslash$135} does not terminate a bracket expression.
Beware, however, that some applications (e.g., C compilers) interpret such sequences themselves
before the regular-expression package gets to see them, which may require
doubling (quadrupling, etc.) the `{\bf $\backslash$}'.

Class-shorthand escapes (AREs only) provide
shorthands for certain commonly-used character classes:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$d}}{{\bf $[[:digit:]]$}}
\twocolitem{{\bf $\backslash$s}}{{\bf $[[:space:]]$}}
\twocolitem{{\bf $\backslash$w}}{{\bf $[[:alnum:]\_]$} (note underscore)}
\twocolitem{{\bf $\backslash$D}}{{\bf $[^[:digit:]]$}}
\twocolitem{{\bf $\backslash$S}}{{\bf $[^[:space:]]$}}
\twocolitem{{\bf $\backslash$W}}{{\bf $[^[:alnum:]\_]$} (note underscore)}
\end{twocollist}

Within bracket expressions, `{\bf $\backslash$d}', `{\bf $\backslash$s}', and
`{\bf $\backslash$w}' lose their outer brackets, and `{\bf $\backslash$D}',
`{\bf $\backslash$S}', and `{\bf $\backslash$W}' are illegal.
(So, for example, {\bf $[$a-c$\backslash$d$]$} is equivalent to {\bf $[a-c[:digit:]]$}.
Also, {\bf $[$a-c$\backslash$D$]$}, which is equivalent to {\bf $[a-c^[:digit:]]$}, is illegal.)

A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met, written as an
escape:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$A}}{matches only at the beginning of the string (see \helpref{Matching}{wxresynmatching}, below, for how this differs from `{\bf \caret}')}
\twocolitem{{\bf $\backslash$m}}{matches only at the beginning of a word}
\twocolitem{{\bf $\backslash$M}}{matches only at the end of a word}
\twocolitem{{\bf $\backslash$y}}{matches only at the beginning or end of a word}
\twocolitem{{\bf $\backslash$Y}}{matches only at a point that is not the beginning or end of a word}
\twocolitem{{\bf $\backslash$Z}}{matches only at the end of the string (see \helpref{Matching}{wxresynmatching}, below, for how this differs from `{\bf \$}')}
\twocolitem{{\bf $\backslash${\it m}}}{(where {\it m} is a nonzero digit) a {\it back reference}, see below}
\twocolitem{{\bf $\backslash${\it mnn}}}{(where {\it m} is a nonzero digit, and {\it nn} is some more digits, and the decimal value {\it mnn} is not greater than the number of closing capturing parentheses seen so far) a {\it back reference}, see below}
\end{twocollist}

A word is defined
as in the specification of {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} above. Constraint escapes are
illegal within bracket expressions.

A back reference (AREs only) matches
the same string matched by the parenthesized subexpression specified by
the number, so that (e.g.) {\bf ($[bc]$)$\backslash$1} matches {\bf bb} or {\bf cc} but not `{\bf bc}'.
The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered
in the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.
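
As an illustrative sketch of the numbering rule (the RE and strings here are
made-up examples, worked out by hand from the rules above rather than taken from
the library's test suite), consider the ARE {\bf ((a)(?:b)(c))$\backslash$3}.
Counting leading parentheses from the left, and skipping the non-capturing
{\bf (?:b)}, gives:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$1}}{refers to {\bf ((a)(?:b)(c))}, which matches {\bf abc}}
\twocolitem{{\bf $\backslash$2}}{refers to {\bf (a)}, which matches {\bf a}}
\twocolitem{{\bf $\backslash$3}}{refers to {\bf (c)}, which matches {\bf c}}
\end{twocollist}

The whole RE should therefore match {\bf abcc} but not {\bf abcb}, because
{\bf $\backslash$3} names {\bf (c)} and not the non-capturing {\bf (?:b)}.
By the condition given in the table above for multi-digit references, a
{\bf $\backslash$12} written in this same RE would not be a back reference
(fewer than twelve capturing parentheses have been closed), so it would instead
be read as the two-octal-digit character-entry escape for the character with
octal value {\bf 012}.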
There is an inherent historical ambiguity between
octal character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an octal
escape. A single non-zero digit, not followed by another digit, is always
taken as a back reference. A multi-digit sequence not starting with a zero
is taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and otherwise
is taken as octal.

\subsection{Metasyntax}\label{remetasyntax}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

In addition to the main syntax described above,
there are some special forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a {\it director}. If an RE of any flavor
begins with `{\bf ***:}', the rest of the RE is an ARE. If an RE of any flavor begins
with `{\bf ***=}', the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.

An ARE may begin with {\it embedded options}: a sequence {\bf (?xyz)}
(where {\it xyz} is one or more alphabetic characters)
specifies options affecting the rest of the RE. These supplement, and can
override, any options specified by the application.
The available option
letters are:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf b}}{rest of RE is a BRE}
\twocolitem{{\bf c}}{case-sensitive matching (usual default)}
\twocolitem{{\bf e}}{rest of RE is an ERE}
\twocolitem{{\bf i}}{case-insensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf m}}{historical synonym for {\bf n}}
\twocolitem{{\bf n}}{newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf p}}{partial newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf q}}{rest of RE is a literal (``quoted'') string, all ordinary characters}
\twocolitem{{\bf s}}{non-newline-sensitive matching (usual default)}
\twocolitem{{\bf t}}{tight syntax (usual default; see below)}
\twocolitem{{\bf w}}{inverse partial newline-sensitive (``weird'') matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf x}}{expanded syntax (see below)}
\end{twocollist}

Embedded options take effect at the {\bf )} terminating the
sequence. They are available only at the start of an ARE, and may not be
used later within it.

In addition to the usual ({\it tight}) RE syntax, in which
all characters are significant, there is an {\it expanded} syntax, available
%in all flavors of RE with the {\bf -expanded} switch, or
in AREs with the embedded {\bf x} option. In the expanded syntax, white-space characters are ignored and
all characters between a {\bf \#} and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE.
There
are three exceptions to that basic rule:

{\itemize
\item a white-space character or `{\bf \#}' preceded
by `{\bf $\backslash$}' is retained
\item white space or `{\bf \#}' within a bracket expression is retained
\item white space and comments are illegal within multi-character symbols like
the ARE `{\bf (?:}' or the BRE `{\bf $\backslash$(}'
}

Expanded-syntax white-space characters are blank,
tab, newline, and any character that belongs to the {\it space} character class.

Finally, in an ARE, outside bracket expressions, the sequence `{\bf (?\#ttt)}'
(where {\it ttt} is any text not containing a `{\bf )}') is a comment, completely ignored. Again,
this is not allowed between the characters of multi-character symbols like
`{\bf (?:}'. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.

{\it None} of these
metasyntax extensions is available if the application (or an initial {\bf ***=}
director) has specified that the user's input be treated as a literal string
rather than as an RE.

\subsection{Matching}\label{wxresynmatching}

\helpref{Syntax of the builtin regular expression library}{wxresyn}

In the event that an RE could match more than
one substring of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring starting at
that point, its choice is determined by its {\it preference}: either the longest
substring, or the shortest.

Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE. A
quantified atom with quantifier {\bf \{m\}} or {\bf \{m\}?} has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including {\bf \{m,n\}} with {\it m} equal to {\it n}) prefers longest match. A quantified
atom with other non-greedy quantifiers (including {\bf \{m,n\}?} with {\it m} equal to {\it n})
prefers shortest match.
A branch has the same preference as the first
quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the {\bf $|$} operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
also match the longest or shortest possible substrings, based on their
preferences, with subexpressions starting earlier in the RE taking priority
over ones starting later. Note that outer subexpressions thus take priority
over their component subexpressions.

Note that the quantifiers {\bf \{1,1\}} and {\bf \{1,1\}?}
can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
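
The following is a worked illustration of these rules. The REs and the subject
string {\bf XY1234Z} are hypothetical examples chosen for this sketch, and the
outcomes are derived by applying the rules above by hand, not by running the
library:

\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf Y*($[0-9]$\{1,3\})}}{The first quantified atom with a preference is the greedy {\bf Y*}, so the branch prefers the longest match. No match can start at {\bf X}, so the earliest start is at {\bf Y}; the longest match from there is {\bf Y123}, and the subexpression matches {\bf 123}.}
\twocolitem{{\bf Y*?($[0-9]$\{1,3\})}}{The first quantified atom is now the non-greedy {\bf Y*?}, so the branch prefers the shortest match. The match still starts at {\bf Y}, but is now {\bf Y1}, and the subexpression matches only {\bf 1}, even though {\bf \{1,3\}} on its own prefers the longest match.}
\end{twocollist}

The second row shows the preference of the RE as a whole constraining its
subexpressions: the parenthesized part can take only what the shortest overall
match leaves available.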