><A CLASS="indexterm" NAME="AUTOID-28807"></A>Searching for a word isn't quite as simple as it first appears. The string <CODE CLASS="literal">the</CODE> will match the word <CODE CLASS="literal">other</CODE>. You can put spaces before and after the letters and use this regular expression: <IMG SRC="../chars/squ.gif" ALT=" "><CODE CLASS="literal">the</CODE><IMG SRC="../chars/squ.gif" ALT=" ">. However, this does not match words at the beginning or the end of the line. Nor does it match when a punctuation mark follows the word.</P>
<P CLASS="para">There is an easy solution, at least in many versions of <EM CLASS="emphasis">ed</EM>, <EM CLASS="emphasis">ex</EM>, and <EM CLASS="emphasis">vi</EM>. The characters <CODE CLASS="literal">\&lt;</CODE> and <CODE CLASS="literal">\&gt;</CODE> are similar to the <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> anchors, as they don't occupy a character position. They <EM CLASS="emphasis">anchor</EM> the expression between them so that it matches only on a word boundary. The pattern to search for the words <CODE CLASS="literal">the</CODE> and <CODE CLASS="literal">The</CODE> would be: <CODE CLASS="literal">\&lt;[tT]he\&gt;</CODE>.</P>
<P CLASS="para">Let's define a &quot;word boundary.&quot; The character before the <CODE CLASS="literal">t</CODE> or <CODE CLASS="literal">T</CODE> must be either a newline character or anything except a letter, digit, or underscore (&nbsp;<CODE CLASS="literal">_</CODE>&nbsp;). The character after the <CODE CLASS="literal">e</CODE> must also be a character other than a digit, letter, or underscore, or it could be the end-of-line character.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.9">26.4.9 Remembering Patterns with \&nbsp;(, \&nbsp;), and \1 </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28833"></A><A CLASS="indexterm" NAME="AUTOID-28836"></A><A CLASS="indexterm" NAME="AUTOID-28839"></A>Another pattern that requires a special mechanism is searching for repeated 
words. The expression <CODE CLASS="literal">[a-z][a-z]</CODE> will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using <CODE CLASS="literal">\(</CODE> and <CODE CLASS="literal">\)</CODE>. You can recall the remembered pattern with <CODE CLASS="literal">\</CODE> followed by a single digit. Therefore, to search for two identical letters, use: <CODE CLASS="literal">\([a-z]\)\1</CODE>. You can have nine different remembered patterns. Each occurrence of <CODE CLASS="literal">\(</CODE> starts a new pattern. The regular expression to match a five-letter palindrome (e.g., &quot;radar&quot;) is: <CODE CLASS="literal">\([a-z]\)\([a-z]\)[a-z]\2\1</CODE>. [Some versions of some programs can't handle <CODE CLASS="literal">\(&nbsp;\)</CODE> in the same regular expression as <CODE CLASS="literal">\</CODE><CODE CLASS="replaceable"><I>1</I></CODE>, etc. In all versions of <EM CLASS="emphasis">sed</EM>, you're safe if you use <SPAN CLASS="link"><CODE CLASS="literal">\( \)</CODE> on the pattern side of an <EM CLASS="emphasis">s</EM> command, and <CODE CLASS="literal">\</CODE><CODE CLASS="replaceable"><I>1</I></CODE>, etc., on the replacement side (<A CLASS="linkend" HREF="ch34_10.htm" TITLE="Referencing Portions of a Search String ">34.10</A>)</SPAN>. <EM CLASS="emphasis">-JP</EM> ]</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.10">26.4.10 Potential Problems </A></H3>
<P CLASS="para">That completes a discussion of simple regular expressions. Before I discuss the extensions that extended expressions offer, I want to mention two potential problem areas.</P>
<P CLASS="para">The <CODE CLASS="literal">\&lt;</CODE> and <CODE CLASS="literal">\&gt;</CODE> characters were introduced in the <EM CLASS="emphasis">vi</EM> editor. 
The other programs didn't have this ability at that time. Also, the <CODE CLASS="literal">\{</CODE><CODE CLASS="replaceable"><I>min</I></CODE><CODE CLASS="literal">,</CODE><CODE CLASS="replaceable"><I>max</I></CODE><CODE CLASS="literal">\}</CODE> modifier is new, and earlier utilities didn't have this ability. This makes it difficult for the novice user of regular expressions, because it seems as if each utility has a different convention. Sun has retrofitted the newest regular expression library to all of their programs, so they all have the same ability. If you try to use these newer features on other vendors' machines, you might find they don't work the same way.</P>
<P CLASS="para">The other potential point of confusion is the <SPAN CLASS="link">extent of the pattern matches (<A CLASS="linkend" HREF="ch26_06.htm" TITLE="Just What Does a Regular Expression Match? ">26.6</A>)</SPAN>. Regular expressions match the longest possible string. That is, the regular expression <CODE CLASS="literal">A.*B</CODE> matches <CODE CLASS="literal">AAB</CODE> as well as <CODE CLASS="literal">AAAABBBBABCCCCBBBAAAB</CODE>. This doesn't cause many problems using <EM CLASS="emphasis">grep</EM>, because an oversight in a regular expression will just match more lines than desired. If you use <EM CLASS="emphasis">sed</EM>, and your patterns get carried away, you may end up deleting or changing more than you want to.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="UPT-ART-427-SECT-1.11">26.4.11 Extended Regular Expressions </A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28881"></A><A CLASS="indexterm" NAME="AUTOID-28883"></A><A CLASS="indexterm" NAME="AUTOID-28886"></A><A CLASS="indexterm" NAME="AUTOID-28888"></A><A CLASS="indexterm" NAME="AUTOID-28891"></A>Two programs use extended regular expressions: <EM CLASS="emphasis">egrep</EM> and <EM CLASS="emphasis">awk</EM>. [<EM CLASS="emphasis">perl</EM> uses expressions that are even more extended. 
<EM CLASS="emphasis">-JP</EM> ] With these extensions, those special characters preceded by a backslash no longer have special meaning: <CODE CLASS="literal">\{</CODE>, <CODE CLASS="literal">\}</CODE>, <CODE CLASS="literal">\&lt;</CODE>, <CODE CLASS="literal">\&gt;</CODE>, <CODE CLASS="literal">\(</CODE>, <CODE CLASS="literal">\)</CODE>, as well as <CODE CLASS="literal">\</CODE><CODE CLASS="replaceable"><I>digit</I></CODE>. There is a very good reason for this, which I will delay explaining to build up suspense.</P>
<P CLASS="para"><A CLASS="indexterm" NAME="AUTOID-28906"></A>The question mark (<CODE CLASS="literal">?</CODE>) matches zero or one instance of the character set before it, and the <A CLASS="indexterm" NAME="AUTOID-28910"></A>plus sign (<CODE CLASS="literal">+</CODE>) matches one or more copies of the character set. You can't use <CODE CLASS="literal">\{</CODE> and <CODE CLASS="literal">\}</CODE> in extended regular expressions, but if you could, you might consider <CODE CLASS="literal">?</CODE> to be the same as <CODE CLASS="literal">\{0,1\}</CODE> and <CODE CLASS="literal">+</CODE> to be the same as <CODE CLASS="literal">\{1,\}</CODE>.</P>
<P CLASS="para">By now, you are wondering why the extended regular expressions are even worth using. 
Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.</P>
<P CLASS="para">The three important characters in the extended regular expressions are <CODE CLASS="literal">(</CODE>, <CODE CLASS="literal">|</CODE>, and <CODE CLASS="literal">)</CODE>. <A CLASS="indexterm" NAME="AUTOID-28925"></A><A CLASS="indexterm" NAME="AUTOID-28928"></A>Parentheses are used to group expressions; the vertical bar acts as an OR operator. Together, they let you match a <EM CLASS="emphasis">choice</EM> of patterns. As an example, you can use <EM CLASS="emphasis">egrep</EM> to print all <CODE CLASS="literal">From:</CODE> and <CODE CLASS="literal">Subject:</CODE> lines from your incoming mail:</P>
<P CLASS="para"><BLOCKQUOTE CLASS="screen"><PRE CLASS="screen">% <CODE CLASS="userinput"><B>egrep '^(From|Subject): ' /usr/spool/mail/$USER</B></CODE></PRE></BLOCKQUOTE></P>
<P CLASS="para">All lines starting with <CODE CLASS="literal">From:</CODE> or <CODE CLASS="literal">Subject:</CODE> will be printed. There is no easy way to do this with simple regular expressions. 
You could try something like <CODE CLASS="literal">^[FS][ru][ob][mj]e*c*t*:</CODE> and hope you don't have any lines that start with <CODE CLASS="literal">Sromeet:</CODE>. Extended expressions don't have the <CODE CLASS="literal">\&lt;</CODE> and <CODE CLASS="literal">\&gt;</CODE> characters. You can compensate by using the alternation mechanism. Matching the word &quot;the&quot; in the beginning, middle, or end of a sentence or at the end of a line can be done with the extended regular expression: <CODE CLASS="literal">(^| )the([^a-z]|$)</CODE>. There are two choices before the word: a space or the beginning of a line. Following the word, there must be something besides a lowercase letter or else the end of the line. One extra bonus with extended regular expressions is the ability to use the <CODE CLASS="literal">*</CODE>, <CODE CLASS="literal">+</CODE>, and <CODE CLASS="literal">?</CODE> modifiers after a <CODE CLASS="literal">(...)</CODE> grouping. Here are two ways to match &quot;a simple problem&quot;, &quot;an easy problem&quot;, as well as &quot;a problem&quot;; the second expression is more exact:</P>
<P CLASS="para"><BLOCKQUOTE CLASS="screen"><PRE CLASS="screen">% <CODE CLASS="userinput"><B>egrep &quot;a[n]? (simple|easy)? ?problem&quot; data</B></CODE>
% <CODE CLASS="userinput"><B>egrep &quot;a[n]? ((simple|easy) )?problem&quot; data</B></CODE></PRE></BLOCKQUOTE></P>
<P CLASS="para">I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the <CODE CLASS="literal">\{...\}</CODE> and <A CLASS="indexterm" NAME="AUTOID-28956"></A><A CLASS="indexterm" NAME="AUTOID-28959"></A><A CLASS="indexterm" NAME="AUTOID-28962"></A><A CLASS="indexterm" NAME="AUTOID-28965"></A><CODE CLASS="literal">\&lt;...\&gt;</CODE> could be added to the extended expressions, but it might confuse people if those characters are added and the <CODE CLASS="literal">\(...\)</CODE> are not. And there is no way to add that functionality to the extended expressions without changing the current usage. 
Do you see why? It's quite simple. If <CODE CLASS="literal">(</CODE> has a special meaning, then <CODE CLASS="literal">\(</CODE> must be the ordinary character. This is the opposite of the simple regular expressions, where <CODE CLASS="literal">(</CODE> is ordinary and <CODE CLASS="literal">\(</CODE> is special. The usage of the parentheses is incompatible, and any change could break old programs.</P>
<P CLASS="para">If the extended expression used <CODE CLASS="literal">(...|...)</CODE> as regular characters, and <CODE CLASS="literal">\(...\|...\)</CODE> for specifying alternate patterns, then it would be possible to have one set of regular expressions that has full functionality. This is exactly what <SPAN CLASS="link">GNU Emacs (<A CLASS="linkend" HREF="ch32_01.htm#UPT-ART-5540" TITLE="Emacs: The Other Editor ">32.1</A>)</SPAN> does, by the way; it combines all of the features of regular and extended expressions with one syntax.<A CLASS="indexterm" NAME="AUTOID-28978"></A></P></DIV>
<DIV CLASS="sect1info"><P CLASS="SECT1INFO">- <SPAN CLASS="authorinitials">BB</SPAN></P></DIV></DIV>
</BODY></HTML>