<HTML><HEAD><TITLE>Recipe 6.6. Matching Multiple Lines (Perl Cookbook)</TITLE>
<META NAME="DC.title" CONTENT="Perl Cookbook">
<META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington">
<META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="DC.date" CONTENT="1999-07-02T01:33:48Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching">
<LINK REL="prev" HREF="ch06_06.htm" TITLE="6.5. Finding the Nth Occurrence of a Match">
<LINK REL="next" HREF="ch06_08.htm" TITLE="6.7. Reading Records with a Pattern Separator"></HEAD>
<BODY BGCOLOR="#FFFFFF"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_06.htm" TITLE="6.5. Finding the Nth Occurrence of a Match"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 6.5. Finding the Nth Occurrence of a Match" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching"></A></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_08.htm" TITLE="6.7. Reading Records with a Pattern Separator"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 6.7. Reading Records with a Pattern Separator" BORDER="0"></A></TD></TR></TABLE></DIV>
<DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch06-14503">6.6. Matching Multiple Lines</A></H2>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-755">Problem<A CLASS="indexterm" NAME="ch06-idx-1000007578-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000007578-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000007578-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000007578-3"></A><A CLASS="indexterm" NAME="ch06-idx-1000007578-4"></A></A></H3>
<P CLASS="para">You want to use regular expressions on a string containing more than one line, but the special characters <CODE CLASS="literal">.</CODE> (any character but newline), <CODE CLASS="literal">^</CODE> (start of string), and <CODE CLASS="literal">$</CODE> (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-761">Solution</A></H3>
<P CLASS="para">Use <CODE CLASS="literal">/m</CODE><A CLASS="indexterm" NAME="ch06-idx-1000007582-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000007582-1"></A>, <CODE CLASS="literal">/s</CODE>, or both as pattern modifiers. <CODE CLASS="literal">/s</CODE> lets <CODE CLASS="literal">.</CODE> match newline (normally it doesn't). If the string had more than one line in it, then <CODE CLASS="literal">/foo.*bar/s</CODE> could match a <CODE CLASS="literal">&quot;foo&quot;</CODE> on one line and a <CODE CLASS="literal">&quot;bar&quot;</CODE> on a following line. This doesn't affect dots in character classes like <CODE CLASS="literal">[#%.]</CODE>, since they are regular periods anyway.</P>
<P CLASS="para">The <CODE CLASS="literal">/m</CODE> modifier lets <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> match next to a newline.
<CODE CLASS="literal">/^=head[1-7]$/m</CODE> would match that pattern not just at the beginning of the record, but anywhere right after a newline as well.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-769">Discussion</A></H3>
<P CLASS="para">A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. To match across newlines, you need to make <CODE CLASS="literal">.</CODE> match a newline; it ordinarily does not. In cases where newlines are important and you've read more than one line into a string, you'll probably prefer to have <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> match beginning- and end-of-line, not just beginning- and end-of-string.</P>
<P CLASS="para">The difference between <CODE CLASS="literal">/m</CODE> and <CODE CLASS="literal">/s</CODE> is important: <CODE CLASS="literal">/m</CODE> makes <CODE CLASS="literal">^</CODE> and <CODE CLASS="literal">$</CODE> match next to a newline, while <CODE CLASS="literal">/s</CODE> makes <CODE CLASS="literal">.</CODE> match newlines. You can even use them together &nbsp;- they're not mutually exclusive options.</P>
<P CLASS="para"><A CLASS="xref" HREF="ch06_07.htm#ch06-10877" TITLE="killtags">Example 6.2</A> creates a filter to strip HTML tags out of each file in <CODE CLASS="literal">@ARGV</CODE> and send the results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because <CODE CLASS="literal">@ARGV</CODE> has several arguments in it. In this case, each read would get a whole file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just <CODE CLASS="literal">.*</CODE> for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries.
Using <CODE CLASS="literal">.*?</CODE> in conjunction with <CODE CLASS="literal">/s</CODE> solves these problems &nbsp;- at least in this case.</P>
<DIV CLASS="example"><H4 CLASS="example"><A CLASS="title" NAME="ch06-10877">Example 6.2: killtags</A></H4>
<PRE CLASS="programlisting">#!/usr/bin/perl
# <A CLASS="indexterm" NAME="ch06-idx-1000007791-0"></A>killtags - very bad html tag killer
undef $/;           # each read is whole file
while (&lt;&gt;) {        # get one whole file at a time
    s/&lt;.*?&gt;//gs;    # strip tags (terribly)
    print;          # print file to STDOUT
}</PRE></DIV>
<P CLASS="para">Because this is just a single character, it would be much faster to use <CODE CLASS="literal">s/&lt;[^&gt;]*&gt;//gs,</CODE> but that's still a na&iuml;ve approach: It doesn't correctly handle tags inside HTML comments or angle brackets in quotes (&lt;<CODE CLASS="literal">IMG</CODE> <CODE CLASS="literal">SRC=&quot;here.gif&quot;</CODE> <CODE CLASS="literal">ALT=&quot;&lt;&lt;Ooh</CODE> <CODE CLASS="literal">la</CODE> <CODE CLASS="literal">la!&gt;&gt;&quot;&gt;</CODE>). <A CLASS="xref" HREF="ch20_07.htm" TITLE="Extracting or Removing HTML Tags">Recipe 20.6</A> explains how to avoid these problems.</P>
<P CLASS="para"><A CLASS="xref" HREF="ch06_07.htm#ch06-31611" TITLE="headerfy">Example 6.3</A> takes a plain text document and looks for lines at the start of paragraphs that look like <CODE CLASS="literal">&quot;Chapter</CODE> <CODE CLASS="literal">20:</CODE> <CODE CLASS="literal">Better</CODE> <CODE CLASS="literal">Living</CODE> <CODE CLASS="literal">Through</CODE> <CODE CLASS="literal"
Source file ch06_07.htm, from Perl Cookbook by Tom Christiansen and Nathan Torkington, ISBN 1-56592-243-3, First Edition, published August 1998.