ch06_04.htm

From "By Tom Christiansen and Nathan Torkington" · HTM code · 218 lines

<HTML><HEAD><TITLE>Recipe 6.3. Matching Words (Perl Cookbook)</TITLE><META NAME="DC.title" CONTENT="Perl Cookbook"><META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington"><META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc."><META NAME="DC.date" CONTENT="1999-07-02T01:33:40Z"><META NAME="DC.type" CONTENT="Text.Monograph"><META NAME="DC.format" CONTENT="text/html" SCHEME="MIME"><META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN"><META NAME="DC.language" CONTENT="en-US"><META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0"><LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments"><LINK REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching"><LINK REL="prev" HREF="ch06_03.htm" TITLE="6.2. Matching Letters"><LINK REL="next" HREF="ch06_05.htm" TITLE="6.4.  Commenting Regular Expressions"></HEAD><BODY BGCOLOR="#FFFFFF"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_03.htm" TITLE="6.2. Matching Letters"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 6.2. Matching Letters" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching"></A></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_05.htm" TITLE="6.4.  Commenting Regular Expressions"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 6.4.  Commenting Regular Expressions" BORDER="0"></A></TD></TR></TABLE></DIV><DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch06-20126">6.3. 
Matching Words</A></H2><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-403">Problem<A CLASS="indexterm" NAME="ch06-idx-1000007529-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000007529-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000007529-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000007529-3"></A></A></H3><P CLASS="para">You want to pick out words from a string.</P></DIV><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-409">Solution</A></H3><P CLASS="para">Think long and hard about what you want a word to be and what separates one word from the next, then write a regular expression that embodies your decisions. For example:</P><PRE CLASS="programlisting">/\S+/               # as many non-whitespace bytes as possible
/[A-Za-z'-]+/       # as many letters, apostrophes, and hyphens</PRE></DIV><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-419">Discussion</A></H3><P CLASS="para">Because words vary between applications, languages, and input streams, Perl does not have built-in definitions of words. You must make them from character classes and quantifiers yourself, as we did previously. The second pattern is an attempt to recognize <CODE CLASS="literal">&quot;shepherd's&quot;</CODE> and <CODE CLASS="literal">&quot;sheep-shearing&quot;</CODE> each as single words.</P><P CLASS="para">Most approaches will have limitations because of the vagaries of written human languages. For instance, although the second pattern successfully identifies <CODE CLASS="literal">&quot;spank'd&quot;</CODE> and <CODE CLASS="literal">&quot;counter-clockwise&quot;</CODE> as words, it will also pull the <CODE CLASS="literal">&quot;rd&quot;</CODE> out of <CODE CLASS="literal">&quot;23rd</CODE> <CODE CLASS="literal">Psalm&quot;</CODE>. If you want to be more precise when you pull words out from a string, you can specify the stuff surrounding the word. 
Normally, this should be a word-boundary, not whitespace:</P><PRE CLASS="programlisting">/\b([A-Za-z]+)\b/            # usually best
/\s([A-Za-z]+)\s/            # fails at ends or w/ punctuation</PRE><P CLASS="para">Although Perl provides <CODE CLASS="literal">\w</CODE>, which matches a character that is part of a valid Perl identifier, Perl identifiers are rarely what you think of as words, since we really mean a string of alphanumerics and underscores, but not colons or quotes. Because it's defined in terms of <CODE CLASS="literal">\w</CODE>, <CODE CLASS="literal">\b</CODE> may surprise you if you expect to match an English word boundary (or, even worse, a Swahili word boundary).</P><P CLASS="para"><CODE CLASS="literal">\b</CODE> and <CODE CLASS="literal">\B</CODE> can still be useful. For example, <CODE CLASS="literal">/\Bis\B/</CODE> matches the string <CODE CLASS="literal">&quot;is&quot;</CODE> only within a word, not at the edges. And while <CODE CLASS="literal">&quot;thistle&quot;</CODE> would be found, <CODE CLASS="literal">&quot;vis-

ch06_04.htm - Source description

This page shows the source file ch06_04.htm from "By Tom Christiansen and Nathan Torkington ISBN 1-56592-243-3 First Edition, published August 1998". It is written in HTM and runs to 218 lines.
