<HTML><!--Distributed by F --><HEAD><TITLE>[Chapter 26] 26.4 Using Metacharacters in Regular Expressions </TITLE>
<META NAME="DC.title" CONTENT="UNIX Power Tools">
<META NAME="DC.creator" CONTENT="Jerry Peek, Tim O'Reilly &amp; Mike Loukides">
<META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="DC.date" CONTENT="1998-08-04T21:44:01Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-260-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch26_01.htm" TITLE="26. Regular Expressions (Pattern Matching)">
<LINK REL="prev" HREF="ch26_03.htm" TITLE="26.3 Understanding Expressions ">
<LINK REL="next" HREF="ch26_05.htm" TITLE="26.5 Getting Regular Expressions Right ">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<DIV CLASS="htmlnav"><H1><IMG SRC="gifs/smbanner.gif" ALT="UNIX Power Tools" USEMAP="#srchmap" BORDER="0"></H1>
<MAP NAME="srchmap"><AREA SHAPE="RECT" COORDS="0,0,466,58" HREF="index.htm" ALT="UNIX Power Tools"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></MAP>
<TABLE WIDTH="515" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="172"><A CLASS="SECT1" HREF="ch26_03.htm" TITLE="26.3 Understanding Expressions "><IMG SRC="gifs/txtpreva.gif" SRC="gifs/txtpreva.gif" ALT="Previous: 26.3 Understanding Expressions " BORDER="0"></A></TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="171"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1">Chapter 26<BR>Regular Expressions (Pattern Matching)</FONT></B></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="172"><A CLASS="SECT1" HREF="ch26_05.htm" TITLE="26.5 Getting Regular Expressions Right "><IMG SRC="gifs/txtnexta.gif" SRC="gifs/txtnexta.gif" ALT="Next: 26.5 Getting Regular Expressions Right " BORDER="0"></A></TD>
</TR></TABLE>&nbsp;<HR ALIGN="LEFT" WIDTH="515" TITLE="footer"></DIV>
<DIV CLASS="SECT1"><H2 CLASS="sect1"><A CLASS="title" NAME="UPT-ART-0427">26.4 Using Metacharacters in Regular Expressions </A></H2>
<P CLASS="para"><A CLASS="indexterm" NAME="UPT-ART-427-IX-REGULAR-EXPRESSIONS-METACHARACTERS-IN"></A>There are three important parts to a regular expression: </P>
<OL CLASS="orderedlist">
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Anchors</EM><A CLASS="indexterm" NAME="AUTOID-28439"></A> are used to specify the position of the pattern in relation to a line of text.</P></LI>
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Character sets</EM><A CLASS="indexterm" NAME="AUTOID-28444"></A> match one or more characters in a single position.</P></LI>
<LI CLASS="listitem"><P CLASS="para"><EM CLASS="emphasis">Modifiers</EM><A CLASS="indexterm" NAME="AUTOID-28449"></A> specify how many times the previous character set is repeated.</P></LI>
</OL>
<P CLASS="para">A simple example that demonstrates all three parts is the regular expression: </P>
<P CLASS="para"><BLOCKQUOTE CLASS="screen"><PRE CLASS="screen">^#*</PRE></BLOCKQUOTE></P>
<P CLASS="para">The caret (<CODE CLASS="literal">^</CODE>) is an anchor that indicates the beginning of the line.
The hash mark is a simple character set that matches the single character <CODE CLASS="literal">#</CODE>. The asterisk (<CODE CLASS="literal">*</CODE>) is a modifier. In a regular expression it specifies that the previous character set can appear any number of times, including zero. As you will see shortly, this is a useless regular expression (except for demonstrating the syntax!).</P>
<P CLASS="para">There are two main types of regular expressions: <EM CLASS="emphasis">simple</EM> regular expressions and <EM CLASS="emphasis">extended</EM> regular expressions. (As we'll see later in the article, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like <EM CLASS="emphasis">awk</EM> and <EM CLASS="emphasis">egrep</EM> use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a &quot;regular expression&quot; (without specifying simple or extended), I am describing a feature common to both types.</P>
<P CLASS="para">The commands that understand just simple regular expressions are: <EM CLASS="emphasis">vi</EM>, <EM CLASS="emphasis">sed</EM>, <EM CLASS="emphasis">grep</EM>, <EM CLASS="emphasis">csplit</EM>, <EM CLASS="emphasis">dbx</EM>, <EM CLASS="emphasis">more</EM>, <EM CLASS="emphasis">ed</EM>, <EM CLASS="emphasis">expr</EM>, <EM CLASS="emphasis">lex</EM>, and <EM CLASS="emphasis">pg</EM>. The utilities <EM CLASS="emphasis">awk</EM>, <EM CLASS="emphasis">nawk</EM>, and <EM CLASS="emphasis">egrep</EM> understand extended regular expressions.</P>
<P CLASS="para">[The situation is complicated by the fact that simple regular expressions have evolved over time, and so there are versions of &quot;simple regular expressions&quot; that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of his article.
-<EM CLASS="emphasis">TOR</EM>&nbsp;]</P>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.1">26.4.1 The Anchor Characters: ^ and $ </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28482"></A><A CLASS="indexterm" NAME="AUTOID-28485"></A><A CLASS="indexterm" NAME="AUTOID-28488"></A>Most UNIX text facilities are line-oriented. Searching for patterns that span several lines is not easy to do. You see, the end-of-line character is not included in the block of text that is searched. It is a separator. Regular expressions examine the text between the separators. If you want to search for a pattern that is at one end or the other, you use <EM CLASS="emphasis">anchors</EM>. The caret (<CODE CLASS="literal">^</CODE>) is the starting anchor, and the dollar sign (<CODE CLASS="literal">$</CODE>) is the end anchor. The regular expression <CODE CLASS="literal">^A</CODE> will match all lines that start with an uppercase A. The expression <CODE CLASS="literal">A$</CODE> will match all lines that end with an uppercase A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. That is, the <CODE CLASS="literal">^</CODE> is only an anchor if it is the first character in a regular expression. The <CODE CLASS="literal">$</CODE> is only an anchor if it is the last character. The expression <CODE CLASS="literal">$1</CODE> does not have an anchor. Neither does <CODE CLASS="literal">1^</CODE>. If you need to match a <CODE CLASS="literal">^</CODE> at the beginning of the line or a <CODE CLASS="literal">$</CODE> at the end of a line, you must <EM CLASS="emphasis">escape</EM> the special character by typing a backslash (<CODE CLASS="literal">\</CODE>) before it. <A CLASS="xref" HREF="ch26_04.htm#UPT-ART-427-TAB-0" TITLE="Regular Expression Anchor Character Examples">Table 26.1</A> has a summary.</P>
<TABLE CLASS="table"><CAPTION CLASS="table"><A CLASS="title" NAME="UPT-ART-427-TAB-0">Table 26.1: Regular Expression Anchor Character
Examples</A></CAPTION>
<THEAD CLASS="thead"><TR CLASS="row" VALIGN="TOP"><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1">Pattern</TH><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1">Matches</TH></TR></THEAD>
<TBODY CLASS="tbody">
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">^A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A at the beginning of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">A$</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A at the end of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">An A anywhere on a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><CODE CLASS="literal">$A</CODE></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">$A</CODE> anywhere on a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">^\^</TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">^</CODE> at the beginning of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">^^</TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">Same as <CODE CLASS="literal">^\^</CODE></TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">\$$</TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">A <CODE CLASS="literal">$</CODE> at the end of a line</TD></TR>
<TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">$$</TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1">Same as <CODE CLASS="literal">\$$</CODE></TD></TR>
</TBODY></TABLE>
<P CLASS="para">The use of <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> as indicators of the beginning or end of a line is a convention other utilities use. The <EM CLASS="emphasis">vi</EM> editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses <CODE CLASS="literal">!^</CODE> to specify the
first argument of the previous line, and <CODE CLASS="literal">!$</CODE> is the last argument on the previous line (article <A CLASS="xref" HREF="ch11_07.htm" TITLE="History Substitutions ">11.7</A> explains).</P>
<P CLASS="para">It is one of those choices that other utilities go along with to maintain consistency. For instance, <CODE CLASS="literal">$</CODE> can refer to the last line of a file when using <EM CLASS="emphasis">ed</EM> and <EM CLASS="emphasis">sed</EM>. <SPAN CLASS="link"><EM CLASS="emphasis">cat -v -e</EM> (<A CLASS="linkend" HREF="ch25_06.htm" TITLE="What's in That White Space? ">25.6</A>, <A CLASS="linkend" HREF="ch25_07.htm" TITLE="Show Non-Printing Characters with cat -v or od -c ">25.7</A>)</SPAN> marks ends of lines with a <CODE CLASS="literal">$</CODE>. You might see it in other programs as well.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.2">26.4.2 Matching a Character with a Character Set </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28562"></A><A CLASS="indexterm" NAME="AUTOID-28564"></A>The simplest character set is a character. The regular expression <CODE CLASS="literal">the</CODE> contains three character sets: <CODE CLASS="literal">t</CODE>, <CODE CLASS="literal">h</CODE>, and <CODE CLASS="literal">e</CODE>. It will match any line that contains the string <CODE CLASS="literal">the</CODE>, including the word <CODE CLASS="literal">other</CODE>. To prevent this, put spaces (<IMG SRC="../chars/squ.gif" ALT=" ">) before and after the pattern: <IMG SRC="../chars/squ.gif" ALT=" "><CODE CLASS="literal">the</CODE><IMG SRC="../chars/squ.gif" ALT=" ">. You can combine the string with an anchor. The pattern <CODE CLASS="literal">^From:</CODE><IMG SRC="../chars/squ.gif" ALT=" "> will match the lines of a <SPAN CLASS="link">mail message (<A CLASS="linkend" HREF="ch01_33.htm" TITLE="UNIX Networking and Communications ">1.33</A>)</SPAN> that identify the sender. Use this pattern with <EM CLASS="emphasis">grep</EM> to print every address in your incoming
mailbox:</P>
<P CLASS="para"><TABLE CLASS="screen.co" BORDER="1"><TR><TH VALIGN="TOP"><PRE CLASS="calloutlist"><A CLASS="co" HREF="ch06_03.htm" TITLE="6.3 Predefined Environment Variables ">$USER</A> </PRE></TH><TD VALIGN="TOP"><PRE CLASS="screen">% <CODE CLASS="userinput"><B>grep '^From: ' /usr/spool/mail/$USER</B></CODE></PRE></TD></TR></TABLE></P>
<P CLASS="para">Some characters have a special meaning in regular expressions. If you want to search for such a character as itself, escape it with a backslash (<CODE CLASS="literal">\</CODE>).</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.3">26.4.3 Match any Character with . (Dot) </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28586"></A>The dot (<CODE CLASS="literal">.</CODE>) is one of those special metacharacters. By itself it will match any character, except the end-of-line character. The pattern that will match a line that consists of any single character is: <CODE CLASS="literal">^.$</CODE>.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.4">26.4.4 Specifying a Range of Characters with [...] </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28594"></A>If you want to match specific characters, you can use square brackets, <CODE CLASS="literal">[]</CODE>, to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit is: <CODE CLASS="literal">^[0123456789]$</CODE>. This is longer than it has to be. You can use the hyphen between two characters to specify a range: <CODE CLASS="literal">^[0-9]$</CODE>. You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore: <CODE CLASS="literal">[A-Za-z0-9_]</CODE>. Character sets can be combined by placing them next to one another. If you wanted to search for a word that:</P>
<UL CLASS="itemizedlist">
<LI CLASS="listitem"><P CLASS="para">started with an uppercase T,</P></LI>
<LI CLASS="listitem"