\fB\eS\fR
\fB[^[:space:]]\fR
.TP
\fB\eW\fR
\fB[^[:alnum:]_]\fR
(note underscore)
.RE
.PP
Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
and `\fB\ew\fR'\&
lose their outer brackets,
and `\fB\eD\fR', `\fB\eS\fR',
and `\fB\eW\fR'\&
are illegal.
.VS 8.2
(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
.VE 8.2
.PP
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met,
written as an escape:
.RS 2
.TP 6
\fB\eA\fR
matches only at the beginning of the string
(see MATCHING, below, for how this differs from `\fB^\fR')
.TP
\fB\em\fR
matches only at the beginning of a word
.TP
\fB\eM\fR
matches only at the end of a word
.TP
\fB\ey\fR
matches only at the beginning or end of a word
.TP
\fB\eY\fR
matches only at a point that is not the beginning or end of a word
.TP
\fB\eZ\fR
matches only at the end of the string
(see MATCHING, below, for how this differs from `\fB$\fR')
.TP
\fB\e\fIm\fR
(where
\fIm\fR
is a nonzero digit) a \fIback reference\fR, see below
.TP
\fB\e\fImnn\fR
(where
\fIm\fR
is a nonzero digit, and
\fInn\fR
is some more digits,
and the decimal value
\fImnn\fR
is not greater than the number of closing capturing parentheses seen so far)
a \fIback reference\fR, see below
.RE
.PP
A word is defined as in the specification of
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
above.
Constraint escapes are illegal within bracket expressions.
.PP
A back reference (AREs only) matches the same string matched by the parenthesized
subexpression specified by the number,
so that (e.g.)
\fB([bc])\e1\fR
matches
\fBbb\fR
or
\fBcc\fR
but not `\fBbc\fR'.
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
.PP
There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics,
as hinted at above.
A leading zero always
indicates an octal escape.
A single non-zero digit, not followed by another digit,
is always taken as a back reference.
A multi-digit sequence not starting with a zero is taken as a back
reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference),
and otherwise is taken as octal.
.SH "METASYNTAX"
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.
.PP
Normally the flavor of RE being used is specified by
application-dependent means.
However, this can be overridden by a \fIdirector\fR.
If an RE of any flavor begins with `\fB***:\fR',
the rest of the RE is an ARE.
If an RE of any flavor begins with `\fB***=\fR',
the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.
.PP
An ARE may begin with \fIembedded options\fR:
a sequence
\fB(?\fIxyz\fB)\fR
(where
\fIxyz\fR
is one or more alphabetic characters)
specifies options affecting the rest of the RE.
These supplement, and can override,
any options specified by the application.
The available option letters are:
.RS 2
.TP 3
\fBb\fR
rest of RE is a BRE
.TP 3
\fBc\fR
case-sensitive matching (usual default)
.TP 3
\fBe\fR
rest of RE is an ERE
.TP 3
\fBi\fR
case-insensitive matching (see MATCHING, below)
.TP 3
\fBm\fR
historical synonym for
\fBn\fR
.TP 3
\fBn\fR
newline-sensitive matching (see MATCHING, below)
.TP 3
\fBp\fR
partial newline-sensitive matching (see MATCHING, below)
.TP 3
\fBq\fR
rest of RE is a literal (``quoted'') string, all ordinary characters
.TP 3
\fBs\fR
non-newline-sensitive matching (usual default)
.TP 3
\fBt\fR
tight syntax (usual default; see below)
.TP 3
\fBw\fR
inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
.TP 3
\fBx\fR
expanded syntax (see below)
.RE
.PP
Embedded options take effect at the
\fB)\fR
terminating the sequence.
They are available only at the start of an ARE,
and may not be used later within it.
.PP
In addition to the usual (\fItight\fR) RE syntax, in which
all characters are
significant, there is an \fIexpanded\fR syntax,
available in all flavors of RE
with the \fB-expanded\fR switch, or in AREs with the embedded x option.
In the expanded syntax,
white-space characters are ignored
and all characters between a
\fB#\fR
and the following newline (or the end of the RE) are ignored,
permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:
.RS 2
.PP
a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained
.PP
white space or `\fB#\fR' within a bracket expression is retained
.PP
white space and comments are illegal within multi-character symbols
like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR'
.RE
.PP
Expanded-syntax white-space characters are blank, tab, newline, and
.VS 8.2
any character that belongs to the \fIspace\fR character class.
.VE 8.2
.PP
Finally, in an ARE,
outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR'
(where
\fIttt\fR
is any text not containing a `\fB)\fR')
is a comment,
completely ignored.
Again, this is not allowed between the characters of
multi-character symbols like `\fB(?:\fR'.
Such comments are more a historical artifact than a useful facility,
and their use is deprecated;
use the expanded syntax instead.
.PP
\fINone\fR of these metasyntax extensions is available if the application
(or an initial
\fB***=\fR
director)
has specified that the user's input be treated as a literal string
rather than as an RE.
.SH MATCHING
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
its choice is determined by its \fIpreference\fR:
either the longest substring, or the shortest.
.PP
Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE.
A quantified atom with quantifier
\fB{\fIm\fB}\fR
or
\fB{\fIm\fB}?\fR
has the same preference (possibly none) as the atom itself.
A quantified atom with
other normal quantifiers (including
\fB{\fIm\fB,\fIn\fB}\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers longest match.
A quantified atom with other non-greedy quantifiers (including
\fB{\fIm\fB,\fIn\fB}?\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers shortest match.
A branch has the same preference as the first quantified atom in it
which has a preference.
An RE consisting of two or more branches connected by the
\fB|\fR
operator prefers longest match.
.PP
Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings,
based on their preferences,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that outer subexpressions thus take priority over
their component subexpressions.
.PP
Note that the quantifiers
\fB{1,1}\fR
and
\fB{1,1}?\fR
can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
.PP
Match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all.
For example,
\fBbb*\fR
matches the three middle characters of `\fBabbbc\fR',
\fB(week|wee)(night|knights)\fR
matches all ten characters of `\fBweeknights\fR',
when
\fB(.*).*\fR
is matched against
\fBabc\fR
the parenthesized subexpression
matches all three characters, and
when
\fB(a*)*\fR
is matched against
\fBbc\fR
both the whole RE and the parenthesized
subexpression match an empty string.
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
so that
\fBx\fR
becomes `\fB[xX]\fR'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that
\fB[x]\fR
becomes
\fB[xX]\fR
and
\fB[^x]\fR
becomes `\fB[^xX]\fR'.
.PP
If newline-sensitive matching is specified,
\fB.\fR
and bracket expressions using
\fB^\fR
will never match the newline character
(so that matches will never cross newlines unless the RE
explicitly arranges it)
and
\fB^\fR
and
\fB$\fR
will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively.
ARE
\fB\eA\fR
and
\fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
.PP
If partial newline-sensitive matching is specified,
this affects \fB.\fR
and bracket expressions
as with newline-sensitive matching, but not
\fB^\fR
and `\fB$\fR'.
.PP
If inverse partial newline-sensitive matching is specified,
this affects
\fB^\fR
and
\fB$\fR
as with
newline-sensitive matching,
but not \fB.\fR
and bracket expressions.
This isn't very useful but is provided for symmetry.
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs.
Programs intended to be highly portable should not employ REs longer
than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.
.PP
The only feature of AREs that is actually incompatible with
POSIX EREs is that
\fB\e\fR
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the
\fB***\fR
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead constraints,
and the longest/shortest-match (rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this
package.
(The new rules are much simpler and cleaner,
but don't work as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package,
still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs.
There are four incompatibilities between \fIregexp\fR's near-EREs
(`RREs' for short) and AREs.
In roughly increasing order of significance:
.PP
.RS
In AREs,
\fB\e\fR
followed by an alphanumeric character is either an
escape or an error,
while in RREs, it was just another way of writing the
alphanumeric.
This should not be a problem because there was no reason to write
such a sequence in RREs.
.PP
\fB{\fR
followed by a digit in an ARE is the beginning of a bound,
while in RREs,
\fB{\fR
was always an ordinary character.
Such sequences should be rare,
and will often result in an error because following characters
will not look like a valid bound.
.PP
In AREs,
\fB\e\fR
remains a special character within `\fB[\|]\fR',
so a literal
\fB\e\fR
within
\fB[\|]\fR
must be written `\fB\e\e\fR'.
\fB\e\e\fR
also gives a literal
\fB\e\fR
within
\fB[\|]\fR
in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.PP
AREs report the longest/shortest match for the RE,
rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that
the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast
matching is obsolete (AREs examine all possible matches
in parallel, and their performance is largely insensitive to their
complexity) but cases where the search order was exploited to
deliberately find a match which was \fInot\fR the longest/shortest will
need rewriting.)
.RE
.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects.
`\fB|\fR', `\fB+\fR',
and
\fB?\fR
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
\fB\e{\fR
and `\fB\e}\fR',
with
\fB{\fR
and
\fB}\fR
by themselves ordinary characters.
The parentheses for nested subexpressions are
\fB\e(\fR
and `\fB\e)\fR',
with
\fB(\fR
and
\fB)\fR
by themselves ordinary characters.
\fB^\fR
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
\fB$\fR
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and
\fB*\fR
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `\fB^\fR').
Finally,
single-digit back references are available,
and
\fB\e<\fR
and
\fB\e>\fR
are synonyms for
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
respectively;
no other escapes are available.
.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
.SH KEYWORDS
match, regular expression, string