within the RE; member <I>i</I> reports subexpression <I>i</I>, with subexpressions counted
(starting at 1) by the order of their opening parentheses in the RE, left
to right. Unused entries in the array--corresponding either to subexpressions
that did not participate in the match at all, or to subexpressions that
do not exist in the RE (that is, <I>i</I>&nbsp;&gt; <I>preg</I>-&gt;<I>re_nsub</I>)--have both <I>rm_so</I> and <I>rm_eo</I>
set to -1. If a subexpression participated in the match several times, the
reported substring is the last one it matched. (Note in particular that
when the RE `(b*)+' matches `bbb', the parenthesized subexpression matches
the three `b's and then an infinite number of empty strings following the
last `b', so the reported substring is one of the empties.) <P>
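The reporting rules above can be sketched in C as follows. This is a minimal illustration, not part of the library: <I>match_groups</I> and <I>group_is_unused</I> are hypothetical helper names, and the use of REG_EXTENDED is an assumption made for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as an extended RE, match it against
 * `text`, and fill pmatch[0..nmatch-1]. Returns 0 on a match, otherwise the
 * non-zero code from regcomp() or regexec(). */
int match_groups(const char *pattern, const char *text,
                 size_t nmatch, regmatch_t pmatch[])
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;              /* regcomp failed: no compiled RE to use */

    rc = regexec(&re, text, nmatch, pmatch, 0);
    regfree(&re);
    return rc;
}

/* An entry is unused when both offsets are -1. */
int group_is_unused(const regmatch_t *m)
{
    return m->rm_so == -1 && m->rm_eo == -1;
}
```

Matching the RE `(a)|(b)' against `b' with <I>nmatch</I> set to 4 leaves entry 1 unused (its subexpression did not participate), entry 2 covering the `b', and entry 3 unused (no such subexpression exists). <P>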
If REG_STARTEND
is specified, <I>pmatch</I> must point to at least one <I>regmatch_t</I> (even if <I>nmatch</I>
is 0 or REG_NOSUB was specified), to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by <I>nmatch</I>; if <I>nmatch</I> is 0 or
REG_NOSUB was specified, the value of <I>pmatch</I>[0] will not be changed by
a successful <I>regexec</I>. <P>
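A possible use of REG_STARTEND is sketched below. <I>search_range</I> is a hypothetical helper, the -1 return is a sentinel invented for this sketch, and the code assumes that reported offsets are measured from the beginning of the string rather than from the start of the range. REG_STARTEND is an extension, so the sketch is guarded with a preprocessor test.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search only bytes [start, end) of `buf` for `pattern`.
 * On success (return 0), out->rm_so and out->rm_eo hold the match offsets,
 * assumed to be measured from the beginning of `buf`. Returns -1 (a sentinel
 * chosen for this sketch) where REG_STARTEND is not available. */
int search_range(const char *buf, size_t start, size_t end,
                 const char *pattern, regmatch_t *out)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(buf != NULL && pattern != NULL && out != NULL && start <= end);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* Input: pm[0] carries the range to search. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;

    /* Output: nmatch is 1, so pm[0] is overwritten with the match. */
    rc = regexec(&re, buf, 1, pm, REG_STARTEND);
    regfree(&re);
    if (rc == 0)
        *out = pm[0];
    return rc;
#else
    (void)buf; (void)start; (void)end; (void)pattern; (void)out;
    return -1;
#endif
}
```

Because <I>nmatch</I> is 1 here, the same array element serves as input and output. With <I>nmatch</I> of 0 the element would still be read but never written. <P>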
<I>Regerror</I> maps a non-zero <I>errcode</I> from either <I>regcomp</I>
or <I>regexec</I> to a human-readable, printable message. If <I>preg</I> is non-NULL, the
error code should have arisen from use of the <I>regex_t</I> pointed to by <I>preg</I>,
and if the error code came from <I>regcomp</I>, it should have been the result
from the most recent <I>regcomp</I> using that <I>regex_t</I>. (<I>Regerror</I> may be able to
supply a more detailed message using information from the <I>regex_t</I>.) <I>Regerror</I>
places the NUL-terminated message into the buffer pointed to by <I>errbuf</I>,
limiting the length (including the NUL) to at most <I>errbuf_size</I> bytes. If
the whole message won't fit, as much of it as will fit before the terminating
NUL is supplied. In any case, the returned value is the size of buffer needed
to hold the whole message (including terminating NUL). If <I>errbuf_size</I> is
0, <I>errbuf</I> is ignored but the return value is still correct. <P>
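Since the return value is the buffer size needed, a caller can ask for the size first and then fetch the whole message. The sketch below shows that two-call pattern; <I>error_text</I> is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed copy of the full message for
 * `errcode`, or NULL if the size is reported as 0 or allocation fails.
 * The caller frees the result. `preg` may be NULL. */
char *error_text(int errcode, const regex_t *preg)
{
    size_t need;
    char *buf;

    assert(errcode != 0);

    /* First call: errbuf_size is 0, so errbuf is ignored and only the
     * required size (including the terminating NUL) is returned. */
    need = regerror(errcode, preg, NULL, 0);
    if (need == 0)
        return NULL;

    buf = malloc(need);
    if (buf == NULL)
        return NULL;

    /* Second call: the buffer is large enough for the whole message. */
    (void)regerror(errcode, preg, buf, need);
    return buf;
}
```

A fixed-size buffer also works when truncation is acceptable, since the message is always NUL-terminated within <I>errbuf_size</I> bytes. <P>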
If the <I>errcode</I>
given to <I>regerror</I> is first ORed with REG_ITOA, the ``message'' that results
is the printable name of the error code, e.g. ``REG_NOMATCH'', rather than an
explanation thereof. If <I>errcode</I> is REG_ATOI, then <I>preg</I> shall be non-NULL
and the <I>re_endp</I> member of the structure it points to must point to the
printable name of an error code; in this case, the result in <I>errbuf</I> is
the decimal digits of the numeric value of the error code (0 if the name
is not recognized). REG_ITOA and REG_ATOI are intended primarily as debugging
facilities; they are extensions, compatible with but not specified by POSIX
1003.2, and should be used with caution in software intended to be portable
to other systems. Be warned also that they are considered experimental and
changes are possible. <P>
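Because REG_ITOA is an extension, portable code may want to test for it at compile time. The following is a minimal sketch under that assumption; <I>error_label</I> is a hypothetical helper, and the fallback to the ordinary message is a choice made for this example only.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: write a short label for `errcode` into `buf`.
 * Where the REG_ITOA extension exists, the label is the printable name of
 * the code; elsewhere this sketch falls back to the ordinary message.
 * Returns the buffer size needed for the whole label. */
size_t error_label(int errcode, const regex_t *preg, char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
#ifdef REG_ITOA
    return regerror(errcode | REG_ITOA, preg, buf, size);
#else
    return regerror(errcode, preg, buf, size);
#endif
}
```
<P>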
<I>Regfree</I> frees any dynamically-allocated storage associated
with the compiled RE pointed to by <I>preg</I>. The remaining <I>regex_t</I> is no longer
a valid compiled RE and the effect of supplying it to <I>regexec</I> or <I>regerror</I>
is undefined. <P>
None of these functions references global variables except
for tables of constants; all are safe for use from multiple threads if
the arguments are safe. 
<H2><A NAME="sect3" HREF="#toc3">Implementation Choices</A></H2>
There are a number of decisions
that 1003.2 leaves up to the implementor, either by explicitly saying ``undefined''
or because the RE grammar forbids the construct. This implementation
treats them as follows. <P>
See <A HREF="regex.7.html"><I>regex</I>(7)</A>
  for a discussion of the definition
of case-independent matching. <P>
There is no particular limit on the length
of REs, except insofar as memory is limited. Memory usage is approximately
linear in RE size, and largely insensitive to RE complexity, except for
bounded repetitions. See BUGS for one short RE using them that will run
almost any system out of memory. <P>
A backslashed character other than one
specifically given a magic meaning by 1003.2 (such magic meanings occur
only in obsolete [``basic''] REs) is taken as an ordinary character. <P>
Any unmatched
[ is a REG_EBRACK error. <P>
Equivalence classes cannot begin or end bracket-expression
ranges. The endpoint of one range cannot begin another. <P>
RE_DUP_MAX, the limit
on repetition counts in bounded repetitions, is 255. <P>
A repetition operator
(?, *, +, or bounds) cannot follow another repetition operator. A repetition
operator cannot begin an expression or subexpression or follow `^' or `|'. <P>
`|' cannot
appear first or last in a (sub)expression or after another `|', i.e. an operand
of `|' cannot be an empty subexpression. An empty parenthesized subexpression,
`()', is legal and matches an empty (sub)string. An empty string is not a
legal RE. <P>
A `{' followed by a digit is considered the beginning of bounds
for a bounded repetition, which must then follow the syntax for bounds.
A `{' <I>not</I> followed by a digit is considered an ordinary character. <P>
`^' and `$'
beginning and ending subexpressions in obsolete (``basic'') REs are anchors,
not ordinary characters. 
<H2><A NAME="sect4" HREF="#toc4">See Also</A></H2>
<A HREF="grep.1.html">grep(1)</A>
, <A HREF="regex.7.html">regex(7)</A>
 <P>
POSIX 1003.2, sections
2.8 (Regular Expression Notation) and B.5 (C Binding for Regular Expression
Matching). 
<H2><A NAME="sect5" HREF="#toc5">Diagnostics</A></H2>
Non-zero error codes from <I>regcomp</I> and <I>regexec</I> include
the following: <P>
<BR>
<PRE>REG_NOMATCH     regexec() failed to match
REG_BADPAT      invalid regular expression
REG_ECOLLATE    invalid collating element
REG_ECTYPE      invalid character class
REG_EESCAPE     \ applied to unescapable character
REG_ESUBREG     invalid backreference number
REG_EBRACK      brackets [ ] not balanced
REG_EPAREN      parentheses ( ) not balanced
REG_EBRACE      braces { } not balanced
REG_BADBR       invalid repetition count(s) in { }
REG_ERANGE      invalid character range in [ ]
REG_ESPACE      ran out of memory
REG_BADRPT      ?, *, or + operand invalid
REG_EMPTY       empty (sub)expression
REG_ASSERT      ``can't happen''--you found a bug
REG_INVARG      invalid argument, e.g. negative-length string
</PRE>
</PRE>
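Callers often need to tell apart only a few of these codes. The sketch below groups them into coarse outcomes; the <I>outcome</I> enumeration and <I>classify</I> are hypothetical names, and only codes specified by POSIX are named explicitly so that the code compiles against other implementations too.

```c
#include <regex.h>

/* Hypothetical coarse grouping of regcomp()/regexec() results. */
enum outcome {
    OUTCOME_MATCHED,      /* return value 0 */
    OUTCOME_NO_MATCH,     /* REG_NOMATCH */
    OUTCOME_NO_MEMORY,    /* REG_ESPACE */
    OUTCOME_OTHER_ERROR   /* any other non-zero code */
};

enum outcome classify(int code)
{
    switch (code) {
    case 0:           return OUTCOME_MATCHED;
    case REG_NOMATCH: return OUTCOME_NO_MATCH;
    case REG_ESPACE:  return OUTCOME_NO_MEMORY;
    default:          return OUTCOME_OTHER_ERROR;
    }
}
```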
<H2><A NAME="sect6" HREF="#toc6">History</A></H2>
Written by Henry Spencer, henry@zoo.toronto.edu. 
<H2><A NAME="sect7" HREF="#toc7">Bugs</A></H2>
This is an alpha
release with known defects. Please report problems. <P>
There is one known functionality
bug. The implementation of internationalization is incomplete: the locale
is always assumed to be the default one of 1003.2, and only the collating
elements etc. of that locale are available. <P>
The back-reference code is subtle
and doubts linger about its correctness in complex cases. <P>
<I>Regexec</I> performance
is poor. This will improve with later releases. <I>Nmatch</I> exceeding 0 is expensive;
<I>nmatch</I> exceeding 1 is worse. <I>Regexec</I> is largely insensitive to RE complexity
<I>except</I> that back references are massively expensive. RE length does matter;
in particular, there is a strong speed bonus for keeping RE length under
about 30 characters, with most special characters counting roughly double.
<P>
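When only a yes-or-no answer is needed, the cost of a non-zero <I>nmatch</I> can be avoided altogether. A minimal sketch, assuming extended REs; <I>contains_match</I> is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `text` contains a match for `pattern`, 0 if not,
 * -1 if the RE does not compile or regexec() reports some other error. */
int contains_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB tells regcomp() that no substring offsets are wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* nmatch is 0, so no regmatch_t array is needed. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
<P>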
<I>Regcomp</I> implements bounded repetitions by macro expansion, which is costly
in time and space if counts are large or bounded repetitions are nested.
An RE like, say, `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will (eventually)
run almost any existing machine out of swap space. <P>
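The scale of the problem can be estimated with simple arithmetic. The sketch below assumes, as a rough model only, that each bounded repetition multiplies the number of copies of its operand by its upper bound; the function name is hypothetical and the model is not a description of the actual compiled form.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Rough estimate of how many copies of the innermost operand a chain of
 * nested bounded repetitions expands to: the product of the upper bounds.
 * Saturates at ULLONG_MAX instead of overflowing. */
unsigned long long expansion_estimate(const unsigned bounds[], size_t n)
{
    unsigned long long total = 1;
    size_t i;

    assert(bounds != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (bounds[i] != 0 && total > ULLONG_MAX / bounds[i])
            return ULLONG_MAX;
        total *= bounds[i];
    }
    return total;
}
```

For the five nested {1,100} bounds in the RE above, the estimate is 100 to the fifth power, or ten billion copies of `a'. <P>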
There are suspected problems
with response to obscure error conditions. Notably, certain kinds of internal
overflow, produced only by truly enormous REs or by multiply nested bounded
repetitions, are probably not handled well. <P>
Due to a mistake in 1003.2, things
like `a)b' are legal REs because `)' is a special character only in the presence
of a previous unmatched `('. This can't be fixed until the spec is fixed. <P>
The
standard's definition of back references is vague. For example, does `a\(\(b\)*\2\)*d'
match `abbbd'? Until the standard is clarified, behavior in such cases should
not be relied on. <P>
The implementation of word-boundary matching is a bit of
a kludge, and bugs may lurk in combinations of word-boundary matching and
anchoring. <P>

<HR><P>
<A NAME="toc"><B>Table of Contents</B></A><P>
<UL>
<LI><A NAME="toc0" HREF="#sect0">Name</A></LI>
<LI><A NAME="toc1" HREF="#sect1">Synopsis</A></LI>
<LI><A NAME="toc2" HREF="#sect2">Description</A></LI>
<LI><A NAME="toc3" HREF="#sect3">Implementation Choices</A></LI>
<LI><A NAME="toc4" HREF="#sect4">See Also</A></LI>
<LI><A NAME="toc5" HREF="#sect5">Diagnostics</A></LI>
<LI><A NAME="toc6" HREF="#sect6">History</A></LI>
<LI><A NAME="toc7" HREF="#sect7">Bugs</A></LI>
</UL>
</BODY></HTML>