varieties: character entry, class shorthands, constraint escapes, and back
references. A {\bf $\backslash$} followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs. In EREs, there are no escapes: outside
a bracket expression, a {\bf $\backslash$} followed by an alphanumeric character merely stands
for that character as an ordinary character, and inside a bracket expression,
{\bf $\backslash$} is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)
Character-entry escapes (AREs only) exist to make
it easier to specify non-printing and otherwise inconvenient characters
in REs:
\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$a}}{alert (bell) character, as in C}
\twocolitem{{\bf $\backslash$b}}{backspace, as in C}
\twocolitem{{\bf $\backslash$B}}{synonym
for {\bf $\backslash$} to help reduce backslash doubling in some applications where there
are multiple levels of backslash processing}
\twocolitem{{\bf $\backslash$c{\it X}}}{(where X is any character)
the character whose low-order 5 bits are the same as those of {\it X}, and whose
other bits are all zero}
\twocolitem{{\bf $\backslash$e}}{the character whose collating-sequence name is
`{\bf ESC}', or failing that, the character with octal value 033}
\twocolitem{{\bf $\backslash$f}}{formfeed, as in C}
\twocolitem{{\bf $\backslash$n}}{newline, as in C}
\twocolitem{{\bf $\backslash$r}}{carriage return, as in C}
\twocolitem{{\bf $\backslash$t}}{horizontal tab, as in C}
\twocolitem{{\bf $\backslash$u{\it wxyz}}}{(where {\it wxyz} is exactly four hexadecimal digits)
the Unicode
character {\bf U+{\it wxyz}} in the local byte ordering}
\twocolitem{{\bf $\backslash$U{\it stuvwxyz}}}{(where {\it stuvwxyz} is
exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
extension to 32 bits}
\twocolitem{{\bf $\backslash$v}}{vertical tab, as in C}
\twocolitem{{\bf $\backslash$x{\it hhh}}}{(where
{\it hhh} is any sequence of hexadecimal digits) the character whose hexadecimal
value is {\bf 0x{\it hhh}} (a single character no matter how many hexadecimal digits
are used).}
\twocolitem{{\bf $\backslash$0}}{the character whose value is {\bf 0}}
\twocolitem{{\bf $\backslash${\it xy}}}{(where {\it xy} is exactly two
octal digits, and is not a {\it back reference} (see below)) the character whose
octal value is {\bf 0{\it xy}}}
\twocolitem{{\bf $\backslash${\it xyz}}}{(where {\it xyz} is exactly three octal digits, and is
not a back reference (see below))
the character whose octal value is {\bf 0{\it xyz}}}
\end{twocollist}
Hexadecimal digits are `{\bf 0}'-`{\bf 9}', `{\bf a}'-`{\bf f}', and `{\bf A}'-`{\bf F}'. Octal
digits are `{\bf 0}'-`{\bf 7}'.
The character-entry
escapes are always taken as ordinary characters. For example, {\bf $\backslash$135} is {\bf ]} in
ASCII, but {\bf $\backslash$135} does not terminate a bracket expression. Beware, however,
that some applications (e.g., C compilers) interpret such sequences themselves
before the regular-expression package gets to see them, which may require
doubling (quadrupling, etc.) the `{\bf $\backslash$}'.
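
As a rough illustration of two points above, the following Python sketch computes the character denoted by {\bf $\backslash$c{\it X}} and shows why the backslash may need doubling. The helper name {\tt control\_char} is hypothetical, and Python string literals merely stand in for any application that processes backslashes before the regular-expression package sees them.

```python
def control_char(x: str) -> str:
    """Character denoted by \\cX: the low-order 5 bits of X, other bits zero."""
    return chr(ord(x) & 0x1F)


print(repr(control_char("A")))  # '\x01'
print(repr(control_char("[")))  # '\x1b', the ESC character

# Host-language processing: an ordinary Python literal interprets \135 itself,
# so the RE package would receive "]" rather than the four-character escape.
processed = "\135"
doubled = "\\135"  # doubling the backslash passes the escape through intact
print(processed, doubled)
```
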
Class-shorthand escapes (AREs only) provide
shorthands for certain commonly-used character classes:
\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$d}}{{\bf $[[:digit:]]$}}
\twocolitem{{\bf $\backslash$s}}{{\bf $[[:space:]]$}}
\twocolitem{{\bf $\backslash$w}}{{\bf $[[:alnum:]\_]$} (note underscore)}
\twocolitem{{\bf $\backslash$D}}{{\bf $[$\caret$[:digit:]]$}}
\twocolitem{{\bf $\backslash$S}}{{\bf $[$\caret$[:space:]]$}}
\twocolitem{{\bf $\backslash$W}}{{\bf $[$\caret$[:alnum:]\_]$} (note underscore)}
\end{twocollist}
Within bracket expressions, `{\bf $\backslash$d}', `{\bf $\backslash$s}', and
`{\bf $\backslash$w}' lose their outer brackets, and `{\bf $\backslash$D}',
`{\bf $\backslash$S}', and `{\bf $\backslash$W}' are illegal. (So, for example,
{\bf $[$a-c$\backslash$d$]$} is equivalent to {\bf $[a-c[:digit:]]$}.
Also, {\bf $[$a-c$\backslash$D$]$}, which is equivalent to
{\bf $[a-c$\caret$[:digit:]]$}, is illegal.)
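
The equivalence inside bracket expressions can be tried out with Python's {\tt re} module. This is only a sketch: {\tt re} is a different engine from the one documented here, it has no POSIX class names, and it accepts {\bf $\backslash$D} inside brackets where AREs reject it. The digit class is therefore spelled out as {\tt 0-9}, and the {\tt re.ASCII} flag is an assumption made to keep {\bf $\backslash$d} to those ten digits.

```python
import re

# The shorthand inside a bracket expression versus the spelled-out class.
with_shorthand = re.compile(r"[a-c\d]+", re.ASCII)
spelled_out = re.compile(r"[a-c0-9]+", re.ASCII)

sample = "ab12cz9"
print(with_shorthand.findall(sample))  # ['ab12c', '9']
print(spelled_out.findall(sample))     # ['ab12c', '9']
```
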
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met, written as an
escape:
\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf $\backslash$A}}{matches only at the beginning of the string
(see \helpref{Matching}{wxresynmatching}, below,
for how this differs from `{\bf \caret}')}
\twocolitem{{\bf $\backslash$m}}{matches only at the beginning of a word}
\twocolitem{{\bf $\backslash$M}}{matches only at the end of a word}
\twocolitem{{\bf $\backslash$y}}{matches only at the beginning or end of a word}
\twocolitem{{\bf $\backslash$Y}}{matches only at a point that is not the beginning or end of
a word}
\twocolitem{{\bf $\backslash$Z}}{matches only at the end of the string
(see \helpref{Matching}{wxresynmatching}, below, for
how this differs from `{\bf \$}')}
\twocolitem{{\bf $\backslash${\it m}}}{(where {\it m} is a nonzero digit) a {\it back reference},
see below}
\twocolitem{{\bf $\backslash${\it mnn}}}{(where {\it m} is a nonzero digit, and {\it nn} is some more digits,
and the decimal value {\it mnn} is not greater than the number of closing capturing
parentheses seen so far) a {\it back reference}, see below}
\end{twocollist}
A word is defined
as in the specification of {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} above. Constraint escapes are
illegal within bracket expressions.
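
The word constraints can be approximated in engines that lack them. The sketch below uses Python's {\tt re} module, which has no {\bf $\backslash$m}, {\bf $\backslash$M}, {\bf $\backslash$y} or {\bf $\backslash$Y}. The translation table and the helper names are hypothetical, and they assume that Python's notion of a word character (alphanumerics and underscore) agrees with the definition of a word used here.

```python
import re

# Hypothetical translations of ARE constraint escapes into Python's re syntax.
ARE_TO_PYTHON = {
    r"\m": r"\b(?=\w)",   # beginning of a word
    r"\M": r"\b(?<=\w)",  # end of a word
    r"\y": r"\b",         # beginning or end of a word
    r"\Y": r"\B",         # neither the beginning nor the end of a word
}


def positions(are_escape: str, text: str) -> list:
    """Offsets in text at which the given constraint matches the empty string."""
    return [m.start() for m in re.finditer(ARE_TO_PYTHON[are_escape], text)]


print(positions(r"\m", "foo bar"))  # [0, 4]
print(positions(r"\M", "foo bar"))  # [3, 7]
print(positions(r"\y", "foo bar"))  # [0, 3, 4, 7]
```
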
A back reference (AREs only) matches
the same string matched by the parenthesized subexpression specified by
the number, so that (e.g.) {\bf ($[bc]$)$\backslash$1} matches {\bf bb} or {\bf cc} but not {\bf bc}.
The subexpression
must entirely precede the back reference in the RE. Subexpressions are numbered
in the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.
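
The example above behaves the same way in Python's {\tt re} module, which is used here only as a convenient stand-in for an ARE engine. The second pattern is an added illustration of the numbering rule: the non-capturing parentheses do not count, so {\bf $\backslash$1} refers to the second parenthesized group.

```python
import re

doubled = re.compile(r"([bc])\1")
for candidate in ("bb", "cc", "bc"):
    print(candidate, bool(doubled.fullmatch(candidate)))
# bb True
# cc True
# bc False

# (?:a) is non-capturing, so subexpression 1 is (b).
numbered = re.compile(r"(?:a)(b)\1")
print(bool(numbered.fullmatch("abb")))  # True
```
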
There is an inherent historical ambiguity between
octal character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an octal
escape. A single non-zero digit, not followed by another digit, is always
taken as a back reference. A multi-digit sequence not starting with a zero
is taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and otherwise
is taken as octal.
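
The heuristic can be written out as a small decision function. The following Python sketch is not taken from the library; the function name and its arguments are hypothetical. It assumes the caller has already collected the run of digits after the backslash and knows how many capturing parentheses have been closed so far.

```python
def classify_digit_escape(digits: str, groups_closed: int) -> str:
    """Classify the digits following a backslash as 'octal' or 'backref'."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] == "0":
        return "octal"      # a leading zero always indicates an octal escape
    if len(digits) == 1:
        return "backref"    # a single non-zero digit is always a back reference
    if int(digits) <= groups_closed:
        return "backref"    # number is in the legal range for a back reference
    return "octal"          # otherwise taken as octal


print(classify_digit_escape("7", 0))     # backref
print(classify_digit_escape("12", 12))   # backref
print(classify_digit_escape("12", 3))    # octal
print(classify_digit_escape("135", 0))   # octal
```
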
\subsection{Metasyntax}\label{remetasyntax}
\helpref{Syntax of the builtin regular expression library}{wxresyn}
In addition to the main syntax described above,
there are some special forms and miscellaneous syntactic facilities available.
Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a {\it director}. If an RE of any flavor
begins with `{\bf ***:}', the rest of the RE is an ARE. If an RE of any flavor begins
with `{\bf ***=}', the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.
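
A minimal Python sketch of how an application might recognise a director before handing the pattern on. The function name and the flavor labels are hypothetical; only the two prefixes come from the description above.

```python
def split_director(pattern: str, default_flavor: str):
    """Return (flavor, rest) after removing a leading director, if any."""
    if pattern.startswith("***:"):
        return "ARE", pattern[4:]        # rest of the RE is an ARE
    if pattern.startswith("***="):
        return "literal", pattern[4:]    # rest is a literal string
    return default_flavor, pattern       # no director: application's choice stands


print(split_director("***:a+b", "ERE"))   # ('ARE', 'a+b')
print(split_director("***=a+b", "ARE"))   # ('literal', 'a+b')
print(split_director("a+b", "ERE"))       # ('ERE', 'a+b')
```
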
An ARE may begin with {\it embedded options}: a sequence {\bf (?xyz)}
(where {\it xyz} is one or more alphabetic characters)
specifies options affecting the rest of the RE. These supplement, and can
override, any options specified by the application. The available option
letters are:
\begin{twocollist}\twocolwidtha{4cm}
\twocolitem{{\bf b}}{rest of RE is a BRE}
\twocolitem{{\bf c}}{case-sensitive matching (usual default)}
\twocolitem{{\bf e}}{rest of RE is an ERE}
\twocolitem{{\bf i}}{case-insensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf m}}{historical synonym for {\bf n}}
\twocolitem{{\bf n}}{newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf p}}{partial newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf q}}{rest of RE
is a literal (``quoted'') string, all ordinary characters}
\twocolitem{{\bf s}}{non-newline-sensitive matching (usual default)}
\twocolitem{{\bf t}}{tight syntax (usual default; see below)}
\twocolitem{{\bf w}}{inverse
partial newline-sensitive (``weird'') matching (see \helpref{Matching}{wxresynmatching}, below)}
\twocolitem{{\bf x}}{expanded syntax (see below)}
\end{twocollist}
Embedded options take effect at the {\bf )} terminating the
sequence. They are available only at the start of an ARE, and may not be
used later within it.
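
The shape of an embedded-option sequence can be sketched in Python as follows. The function name is hypothetical and the code only separates the option letters from the rest of the ARE; what the library does with letters that are not in the table above is not covered by this description, so the sketch merely reports them.

```python
import re

# Option letters and meanings as listed in the table above.
OPTION_MEANINGS = {
    "b": "rest of RE is a BRE",
    "c": "case-sensitive matching",
    "e": "rest of RE is an ERE",
    "i": "case-insensitive matching",
    "m": "historical synonym for n",
    "n": "newline-sensitive matching",
    "p": "partial newline-sensitive matching",
    "q": "rest of RE is a literal string",
    "s": "non-newline-sensitive matching",
    "t": "tight syntax",
    "w": "inverse partial newline-sensitive matching",
    "x": "expanded syntax",
}


def split_embedded_options(are: str):
    """Split a leading (?xyz) sequence from an ARE; returns (letters, rest)."""
    m = re.match(r"\(\?([A-Za-z]+)\)", are)
    if m is None:
        return "", are               # no option sequence at the start
    return m.group(1), are[m.end():]


letters, rest = split_embedded_options("(?ix)foo bar")
print(letters, repr(rest))           # ix 'foo bar'
for letter in letters:
    print(letter, "=", OPTION_MEANINGS.get(letter, "not in the table"))
```

Note that {\bf (?:} is not mistaken for an option sequence, because {\bf :} is not an alphabetic character.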
In addition to the usual ({\it tight}) RE syntax, in which
all characters are significant, there is an {\it expanded} syntax, available
%in all flavors of RE with the {\bf -expanded} switch, or
in AREs with the embedded
x option. In the expanded syntax, white-space characters are ignored and
all characters between a {\bf \#} and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:
{\itemize
\item%
a white-space character or `{\bf \#}' preceded
by `{\bf $\backslash$}' is retained
\item%
white space or `{\bf \#}' within a bracket expression is retained
\item%
white space and comments are illegal within multi-character symbols like
the ARE `{\bf (?:}' or the BRE `{\bf $\backslash$(}'
}
Expanded-syntax white-space characters are blank,
tab, newline, and any character that belongs to the {\it space} character class.
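
Python's {\tt re} module has a comparable verbose mode, selected with the same embedded {\bf x} option, and it serves here as an approximate illustration of the expanded syntax and of the first two exceptions. It is a different engine, so details may differ from the library described here. The date-like pattern is made up for the example.

```python
import re

expanded = re.compile(r"""(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
    \ \#      # an escaped blank and an escaped '#' are retained
    [ #]      # blank or '#' within a bracket expression is retained
""")

# The same RE in the usual tight syntax.
tight = re.compile(r"(\d{4})-(\d{2}) #[ #]")

for candidate in ("2024-05 ##", "2024-05 # ", "2024-05"):
    print(repr(candidate),
          bool(expanded.fullmatch(candidate)),
          bool(tight.fullmatch(candidate)))
```
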
Finally, in an ARE, outside bracket expressions, the sequence `{\bf (?\#ttt)}' (where
{\it ttt} is any text not containing a `{\bf )}') is a comment, completely ignored. Again,
this is not allowed between the characters of multi-character symbols like
`{\bf (?:}'. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.
{\it None} of these
metasyntax extensions is available if the application (or an initial {\bf ***=}
director) has specified that the user's input be treated as a literal string
rather than as an RE.
\subsection{Matching}\label{wxresynmatching}
\helpref{Syntax of the builtin regular expression library}{wxresyn}
In the event that an RE could match more than
one substring of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring starting at
that point, its choice is determined by its {\it preference}: either the longest
substring, or the shortest.
Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE. A
quantified atom with quantifier {\bf \{m\}} or {\bf \{m\}?} has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including {\bf \{m,n\}} with {\it m} equal to {\it n}) prefers longest match. A quantified
atom with other non-greedy quantifiers (including {\bf \{m,n\}?} with {\it m} equal to
{\it n}) prefers shortest match. A branch has the same preference as the first
quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the {\bf $|$} operator prefers longest match.
Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
also match the longest or shortest possible substrings, based on their
preferences, with subexpressions starting earlier in the RE taking priority
over ones starting later. Note that outer subexpressions thus take priority
over their component subexpressions.
Note that the quantifiers {\bf \{1,1\}} and
{\bf \{1,1\}?} can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
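
The earliest-start rule and the effect of greedy versus non-greedy quantifiers can be observed in Python's {\tt re} module, used here only as an illustration. Python does not implement the preference rules described above: in particular it tries the branches of {\bf $|$} from left to right instead of preferring the longest match, as the last line shows.

```python
import re

text = "baaa"
greedy = re.search(r"a+", text)    # normal quantifier: longest match
lazy = re.search(r"a+?", text)     # non-greedy quantifier: shortest match

# Both matches start at the earliest possible point, offset 1.
print(greedy.start(), greedy.group())  # 1 aaa
print(lazy.start(), lazy.group())      # 1 a

# Python-specific behaviour: the first branch that matches wins, so this
# yields "a", whereas the rule above would prefer the longer "ab".
print(re.search(r"a|ab", "ab").group())  # a
```
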