<HTML><!--Distributed by F --><HEAD>
<TITLE>[Chapter 26] 26.4 Using Metacharacters in Regular Expressions </TITLE>
<META NAME="DC.title" CONTENT="UNIX Power Tools">
<META NAME="DC.creator" CONTENT="Jerry Peek, Tim O'Reilly & Mike Loukides">
<META NAME="DC.publisher" CONTENT="O'Reilly & Associates, Inc.">
<META NAME="DC.date" CONTENT="1998-08-04T21:44:01Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-260-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch26_01.htm" TITLE="26. Regular Expressions (Pattern Matching)">
<LINK REL="prev" HREF="ch26_03.htm" TITLE="26.3 Understanding Expressions ">
<LINK REL="next" HREF="ch26_05.htm" TITLE="26.5 Getting Regular Expressions Right ">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<DIV CLASS="htmlnav">
<H1><IMG SRC="gifs/smbanner.gif" ALT="UNIX Power Tools" USEMAP="#srchmap" BORDER="0"></H1>
<MAP NAME="srchmap"><AREA SHAPE="RECT" COORDS="0,0,466,58" HREF="index.htm" ALT="UNIX Power Tools"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></MAP>
<TABLE WIDTH="515" BORDER="0" CELLSPACING="0" CELLPADDING="0">
<TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="172"><A CLASS="SECT1" HREF="ch26_03.htm" TITLE="26.3 Understanding Expressions "><IMG SRC="gifs/txtpreva.gif" SRC="gifs/txtpreva.gif" ALT="Previous: 26.3 Understanding Expressions " BORDER="0"></A></TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="171"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1">Chapter 26<BR>Regular Expressions (Pattern Matching)</FONT></B></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="172"><A CLASS="SECT1" HREF="ch26_05.htm" TITLE="26.5 Getting Regular Expressions Right "><IMG SRC="gifs/txtnexta.gif" SRC="gifs/txtnexta.gif" ALT="Next: 26.5 Getting Regular Expressions Right " BORDER="0"></A></TD>
</TR>
</TABLE>
<HR ALIGN="LEFT" WIDTH="515" TITLE="footer"></DIV>
<DIV CLASS="SECT1">
<H2 CLASS="sect1"><A CLASS="title" NAME="UPT-ART-0427">26.4 Using Metacharacters in Regular Expressions </A></H2>
<P CLASS="para"><A CLASS="indexterm" NAME="UPT-ART-427-IX-REGULAR-EXPRESSIONS-METACHARACTERS-IN"></A>There are three important parts to a regular expression: </P>
<OL CLASS="orderedlist">
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Anchors</EM><A CLASS="indexterm" NAME="AUTOID-28439"></A> are used to specify the position of the pattern in relation to a line of text.</P></LI>
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Character sets</EM><A CLASS="indexterm" NAME="AUTOID-28444"></A> match one or more characters in a single position.</P></LI>
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Modifiers</EM><A CLASS="indexterm" NAME="AUTOID-28449"></A> specify how many times the previous character set is repeated.</P></LI>
</OL>
<P CLASS="para">A simple example that demonstrates all three parts is the regular expression: </P>
<P CLASS="para"><BLOCKQUOTE CLASS="screen"><PRE CLASS="screen">^#*</PRE></BLOCKQUOTE></P>
<P CLASS="para">The caret (<CODE CLASS="literal">^</CODE>) is an anchor that indicates the beginning of the line.
The hash mark is a simple character set that matches the single character <CODE CLASS="literal">#</CODE>.
The asterisk (<CODE CLASS="literal">*</CODE>) is a modifier.
In a regular expression it specifies that the previous character set can appear any number of times, including zero.
As you will see shortly, this is a useless regular expression (except for demonstrating the syntax!).</P>
<P CLASS="para">There are two main types of regular expressions: <EM CLASS="emphasis">simple</EM> regular expressions and <EM CLASS="emphasis">extended</EM> regular expressions.
(As we'll see later in the article, the boundaries between the two types have become blurred as regular expressions have evolved.)
A few utilities like <EM CLASS="emphasis">awk</EM> and <EM CLASS="emphasis">egrep</EM> use extended regular expressions.
Most use simple regular expressions.
From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types.</P>
<P CLASS="para">The commands that understand just simple regular expressions are: <EM CLASS="emphasis">vi</EM>, <EM CLASS="emphasis">sed</EM>, <EM CLASS="emphasis">grep</EM>, <EM CLASS="emphasis">csplit</EM>, <EM CLASS="emphasis">dbx</EM>, <EM CLASS="emphasis">more</EM>, <EM CLASS="emphasis">ed</EM>, <EM CLASS="emphasis">expr</EM>, <EM CLASS="emphasis">lex</EM>, and <EM CLASS="emphasis">pg</EM>.
The utilities <EM CLASS="emphasis">awk</EM>, <EM CLASS="emphasis">nawk</EM>, and <EM CLASS="emphasis">egrep</EM> understand extended regular expressions.</P>
<P CLASS="para">[The situation is complicated by the fact that simple regular expressions have evolved over time, and so there are versions of "simple regular expressions" that support extensions missing from extended regular expressions!
Bruce explains the incompatibility at the end of his article.
-<EM CLASS="emphasis">TOR</EM> ]</P>
<DIV CLASS="sect2">
<H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.1">26.4.1 The Anchor Characters: ^ and $ </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28482"></A><A CLASS="indexterm" NAME="AUTOID-28485"></A><A CLASS="indexterm" NAME="AUTOID-28488"></A>Most UNIX text facilities are line-oriented.
Searching for patterns that span several lines is not easy to do.
You see, the end-of-line character is not included in the block of text that is searched.
It is a separator.
Regular expressions examine the text between the separators.
If you want to search for a pattern that is at one end or the other, you use <EM CLASS="emphasis">anchors</EM>.
The caret (<CODE CLASS="literal">^</CODE>) is the starting anchor, and the dollar sign (<CODE CLASS="literal">$</CODE>) is the end anchor.
The regular expression <CODE CLASS="literal">^A</CODE> will match all lines that start with an uppercase A.
The expression <CODE CLASS="literal">A$</CODE> will match all lines that end with an uppercase A.
If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors.
That is, the <CODE CLASS="literal">^</CODE> is only an anchor if it is the first character in a regular expression.
The <CODE CLASS="literal">$</CODE> is only an anchor if it is the last character.
The expression <CODE CLASS="literal">$1</CODE> does not have an anchor.
Neither does <CODE CLASS="literal">1^</CODE>.
If you need to match a <CODE CLASS="literal">^</CODE> at the beginning of the line or a <CODE CLASS="literal">$</CODE> at the end of a line, you must <EM CLASS="emphasis">escape</EM> the special character by typing a backslash (<CODE CLASS="literal">\</CODE>) before it.
<A CLASS="xref" HREF="ch26_04.htm#UPT-ART-427-TAB-0" TITLE="Regular Expression Anchor Character Examples">Table 26.1</A> has a summary.</P>
<TABLE CLASS="table">
<CAPTION CLASS="table"><A CLASS="title" NAME="UPT-ART-427-TAB-0">Table 26.1: Regular Expression Anchor Character 
Examples</A></CAPTION>
<THEAD CLASS="thead">
<TR CLASS="row" VALIGN="TOP"><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1">Pattern</TH><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1">Matches</TH></TR>
</THEAD>
<TBODY CLASS="tbody">
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">^A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A at the beginning of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">A$</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A at the end of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A anywhere on a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">$A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">$A</CODE> anywhere on a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">^\^</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">^</CODE> at the beginning of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">^^</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">Same as <CODE CLASS="literal">^\^</CODE></TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">\$$</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">$</CODE> at the end of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">$$</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">Same as <CODE CLASS="literal">\$$</CODE></TD></TR>
</TBODY>
</TABLE>
<P CLASS="para">The use of <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> as indicators of the beginning or end of a line is a convention other utilities use.
The <EM CLASS="emphasis">vi</EM> editor uses these two characters as commands to go to the beginning or end of a line.
The C shell uses <CODE CLASS="literal">!^</CODE> to specify the 
first argument of the previous line, and <CODE CLASS="literal">!$</CODE> is the last argument on the previous line (article <A CLASS="xref" HREF="ch11_07.htm" TITLE="History Substitutions ">11.7</A> explains).</P>
<P CLASS="para">It is one of those choices that other utilities go along with to maintain consistency.
For instance, <CODE CLASS="literal">$</CODE> can refer to the last line of a file when using <EM CLASS="emphasis">ed</EM> and <EM CLASS="emphasis">sed</EM>.
<SPAN CLASS="link"><EM CLASS="emphasis">cat -v -e</EM> (<A CLASS="linkend" HREF="ch25_06.htm" TITLE="What's in That White Space? ">25.6</A>, <A CLASS="linkend" HREF="ch25_07.htm" TITLE="Show Non-Printing Characters with cat -v or od -c ">25.7</A>)</SPAN> marks ends of lines with a <CODE CLASS="literal">$</CODE>.
You might see it in other programs as well.</P>
</DIV>
<DIV CLASS="sect2">
<H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.2">26.4.2 Matching a Character with a Character Set </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28562"></A><A CLASS="indexterm" NAME="AUTOID-28564"></A>The simplest character set is a character.
The regular expression <CODE CLASS="literal">the</CODE> contains three character sets: <CODE CLASS="literal">t</CODE>, <CODE CLASS="literal">h</CODE>, and <CODE CLASS="literal">e</CODE>.
It will match any line that contains the string <CODE CLASS="literal">the</CODE>, including the word <CODE CLASS="literal">other</CODE>.
To prevent this, put spaces (<IMG SRC="../chars/squ.gif" ALT=" ">) before and after the pattern: <IMG SRC="../chars/squ.gif" ALT=" "><CODE CLASS="literal">the</CODE><IMG SRC="../chars/squ.gif" ALT=" ">.
You can combine the string with an anchor.
The pattern <CODE CLASS="literal">^From:</CODE><IMG SRC="../chars/squ.gif" ALT=" "> will match the lines of a <SPAN CLASS="link">mail message (<A CLASS="linkend" HREF="ch01_33.htm" TITLE="UNIX Networking and Communications ">1.33</A>)</SPAN> that identify the sender.
Use this pattern with <EM CLASS="emphasis">grep</EM> to print every address in your incoming 
mailbox:</P>
<P CLASS="para"><TABLE CLASS="screen.co" BORDER="1"><TR><TH VALIGN="TOP"><PRE CLASS="calloutlist"><A CLASS="co" HREF="ch06_03.htm" TITLE="6.3 Predefined Environment Variables ">$USER</A> </PRE></TH><TD VALIGN="TOP"><PRE CLASS="screen">% <CODE CLASS="userinput"><B>grep '^From: ' /usr/spool/mail/$USER</B></CODE></PRE></TD></TR></TABLE></P>
<P CLASS="para">Some characters have a special meaning in regular expressions.
If you want to search for such a character as itself, escape it with a backslash (<CODE CLASS="literal">\</CODE>).</P>
</DIV>
<DIV CLASS="sect2">
<H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.3">26.4.3 Match any Character with . (Dot) </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28586"></A>The dot (<CODE CLASS="literal">.</CODE>) is one of those special metacharacters.
By itself it will match any character, except the end-of-line character.
The pattern that will match a line consisting of exactly one character is: <CODE CLASS="literal">^.$</CODE>.</P>
</DIV>
<DIV CLASS="sect2">
<H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.4">26.4.4 Specifying a Range of Characters with [...] </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28594"></A>If you want to match specific characters, you can use square brackets, <CODE CLASS="literal">[]</CODE>, to identify the exact characters you are searching for.
The pattern that will match any line of text that consists of exactly one digit is: <CODE CLASS="literal">^[0123456789]$</CODE>.
This is longer than it has to be.
You can use the hyphen between two characters to specify a range: <CODE CLASS="literal">^[0-9]$</CODE>.
You can intermix explicit characters with character ranges.
This pattern will match a single character that is a letter, digit, or underscore: <CODE CLASS="literal">[A-Za-z0-9_]</CODE>.
Character sets can be combined by placing them next to one another.
If you wanted to search for a word that:</P>
<UL CLASS="itemizedlist">
<LI CLASS="listitem"><P CLASS="para">started with an uppercase T,</P></LI>
<LI CLASS="listitem"