<HTML><HEAD><META NAME="DC.title" CONTENT="Perl Cookbook"><META NAME="DC.creator" CONTENT="Tom Christiansen & Nathan Torkington"><META NAME="DC.publisher" CONTENT="O'Reilly & Associates, Inc."><META NAME="DC.date" CONTENT="1999-07-02T01:45:50Z"><META NAME="DC.type" CONTENT="Text.Monograph"><META NAME="DC.format" CONTENT="text/html" SCHEME="MIME"><META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN"><META NAME="DC.language" CONTENT="en-US"><META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0"><LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments"><LINK REL="up" HREF="index.htm" TITLE="Perl Cookbook"><LINK REL="prev" HREF="ch19_15.htm" TITLE="19.14. Program: chemiserie"><LINK REL="next" HREF="ch20_02.htm" TITLE="20.1. Fetching a URL from a Perl Script"></HEAD><BODY BGCOLOR="#FFFFFF"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch19_15.htm" TITLE="19.14. Program: chemiserie"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 19.14. Program: chemiserie" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_02.htm" TITLE="20.1. Fetching a URL from a Perl Script"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.1. Fetching a URL from a Perl Script" BORDER="0"></A></TD></TR></TABLE></DIV><DIV CLASS="chapter"><H1 CLASS="chapter"><A CLASS="title" NAME="ch20-70642">20. Web Automation</A></H1><DIV CLASS="htmltoc"><P><B>Contents:</B><BR><A CLASS="sect1" HREF="#ch20-chap20_introduction_0" TITLE="20.0.
Introduction">Introduction</A><BR><A CLASS="sect1" HREF="ch20_02.htm" TITLE="20.1. Fetching a URL from a Perl Script">Fetching a URL from a Perl Script</A><BR><A CLASS="sect1" HREF="ch20_03.htm" TITLE="20.2. Automating Form Submission">Automating Form Submission</A><BR><A CLASS="sect1" HREF="ch20_04.htm" TITLE="20.3. Extracting URLs">Extracting URLs</A><BR><A CLASS="sect1" HREF="ch20_05.htm" TITLE="20.4. Converting ASCII to HTML">Converting ASCII to HTML</A><BR><A CLASS="sect1" HREF="ch20_06.htm" TITLE="20.5. Converting HTML to ASCII">Converting HTML to ASCII</A><BR><A CLASS="sect1" HREF="ch20_07.htm" TITLE="20.6. Extracting or Removing HTML Tags">Extracting or Removing HTML Tags</A><BR><A CLASS="sect1" HREF="ch20_08.htm" TITLE="20.7. Finding Stale Links">Finding Stale Links</A><BR><A CLASS="sect1" HREF="ch20_09.htm" TITLE="20.8. Finding Fresh Links">Finding Fresh Links</A><BR><A CLASS="sect1" HREF="ch20_10.htm" TITLE="20.9. Creating HTML Templates">Creating HTML Templates</A><BR><A CLASS="sect1" HREF="ch20_11.htm" TITLE="20.10. Mirroring Web Pages">Mirroring Web Pages</A><BR><A CLASS="sect1" HREF="ch20_12.htm" TITLE="20.11. Creating a Robot">Creating a Robot</A><BR><A CLASS="sect1" HREF="ch20_13.htm" TITLE="20.12. Parsing a Web Server Log File">Parsing a Web Server Log File</A><BR><A CLASS="sect1" HREF="ch20_14.htm" TITLE="20.13. Processing Server Logs">Processing Server Logs</A><BR><A CLASS="sect1" HREF="ch20_15.htm" TITLE="20.14. Program: htmlsub">Program: htmlsub</A><BR><A CLASS="sect1" HREF="ch20_16.htm" TITLE="20.15.
Program: hrefsub">Program: hrefsub</A></P><P></P></DIV><DIV CLASS="epigraph" ALIGN="right"><P CLASS="para" ALIGN="right"><I>The web, then, or the pattern, a web at once sensuous and logical, an elegant and pregnant texture: that is style, that is the foundation of the art of literature.</I></P><P CLASS="attribution" ALIGN="right">- Robert Louis Stevenson, <CITE CLASS="citetitle">On some technical Elements of Style in Literature (1885)</CITE></P></DIV><DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch20-chap20_introduction_0">20.0. Introduction</A></H2><P CLASS="para"><A CLASS="xref" HREF="ch19_01.htm" TITLE="CGI Programming">Chapter 19, <CITE CLASS="chapter">CGI Programming</CITE></A>, concentrated on responding to browser requests and producing documents using CGI. This one approaches the Web from the other side: instead of responding to a browser, you pretend to be one, generating requests and processing returned documents. We make extensive use of modules to simplify this process, because the intricate network protocols and document formats are tricky to get right. By letting existing modules handle the hard parts, you can concentrate on the interesting part - your own program.<A CLASS="indexterm" NAME="ch20-idx-1000003760-0"></A></P><P CLASS="para">The relevant modules can all be found under the following URL:</P><PRE CLASS="programlisting">http://www.perl.com/CPAN/modules/by-category/15_World_Wide_Web_HTML_HTTP_CGI/</PRE><P CLASS="para">There are modules for computing credit card checksums, interacting with Netscape or Apache server APIs, processing image maps, validating HTML, and manipulating MIME. The largest and most important modules for this chapter, though, are found in the <A CLASS="indexterm" NAME="ch20-idx-1000002558-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002558-1"></A>libwww-perl suite of modules, referred to collectively as LWP.
Here are just a few of the modules included in LWP:</P><TABLE CLASS="informaltable" BORDER="1" CELLPADDING="3"><THEAD CLASS="thead"><TR CLASS="row" VALIGN="TOP"><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Module Name</P></TH><TH CLASS="entry" ALIGN="LEFT" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Purpose</P></TH></TR></THEAD><TBODY CLASS="tbody"><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::UserAgent</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">WWW user agent class</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::RobotUA</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Develop robot applications</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::Protocol</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Interface to various protocol schemes</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::Authen::Basic</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Handle 401 and 407 responses</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::MediaTypes</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">MIME types configuration (text/html, etc.)</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::Debug</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Debug logging module</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">LWP::Simple</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Simple procedural interface for common functions</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Headers</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">MIME/RFC822 style
headers</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Message</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP style message</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Request</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP request</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Response</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP response</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Daemon</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">An HTTP server class</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Status</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP status codes (200 OK, etc.)</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Date</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Date parsing module for HTTP date formats</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP::Negotiate</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">HTTP content negotiation calculation</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">WWW::RobotRules</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Parse <I CLASS="filename">robots.txt</I> files</P></TD></TR><TR CLASS="row" VALIGN="TOP"><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">File::Listing</P></TD><TD CLASS="entry" ROWSPAN="1" COLSPAN="1"><P CLASS="para">Parse directory listings</P></TD></TR></TBODY></TABLE><P CLASS="para">The HTTP:: and LWP:: modules let you request documents from a server. The LWP::Simple module, in particular, offers a very basic way to fetch a document.
LWP::Simple, however, cannot access individual components of the HTTP response. To access these, use HTTP::Request, HTTP::Response, and LWP::UserAgent. We show both sets of modules in <A CLASS="xref" HREF="ch20_02.htm" TITLE="Fetching a URL from a Perl Script">Recipe 20.1</A>, <A CLASS="xref" HREF="ch20_03.htm" TITLE="Automating Form Submission">Recipe 20.2</A>, and <A CLASS="xref" HREF="ch20_11.htm" TITLE="Mirroring Web Pages">Recipe 20.10</A>.</P><P CLASS="para">Closely allied with LWP, but not distributed in the LWP bundle, are the HTML:: modules. These let you parse HTML. They provide the basis for <A CLASS="xref" HREF="ch20_06.htm" TITLE="Converting HTML to ASCII">Recipe 20.5</A>, <A CLASS="xref" HREF="ch20_05.htm" TITLE="Converting ASCII to HTML">Recipe 20.4</A>, <A CLASS="xref" HREF="ch20_07.htm" TITLE="Extracting or Removing HTML Tags">Recipe 20.6</A>, <A CLASS="xref" HREF="ch20_04.htm" TITLE="Extracting URLs">Recipe 20.3</A>, <A CLASS="xref" HREF="ch20_08.htm" TITLE="Finding Stale Links">Recipe 20.7</A>, and the programs htmlsub and hrefsub.</P><P CLASS="para"><A CLASS="xref" HREF="ch20_13.htm" TITLE="Parsing a Web Server Log File">Recipe 20.12</A> gives a regular expression to decode the fields in your web server's log files and shows how to interpret them. We use this regular expression and the Logfile::Apache module in <A CLASS="xref" HREF="ch20_14.htm" TITLE="Processing Server Logs">Recipe 20.13</A> to show two ways of summarizing the data in web server log files.</P></DIV></DIV><DIV CLASS="htmlnav"><P></P><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch19_15.htm" TITLE="19.14. Program: chemiserie"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 19.14.
Program: chemiserie" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="book" HREF="index.htm" TITLE="Perl Cookbook"><IMG SRC="../gifs/txthome.gif" ALT="Perl Cookbook" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_02.htm" TITLE="20.1. Fetching a URL from a Perl Script"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.1. Fetching a URL from a Perl Script" BORDER="0"></A></TD></TR><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228">19.14. Program: chemiserie</TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="index" HREF="index/index.htm" TITLE="Book Index"><IMG SRC="../gifs/index.gif" ALT="Book Index" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228">20.1. Fetching a URL from a Perl Script</TD></TR></TABLE><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"></DIV><!-- LIBRARY NAV BAR --> <img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p><font size="-1"> <a href="copyrght.htm">Copyright © 2002</a> O'Reilly & Associates. All rights reserved.</font> </p> <map name="library-map"> <area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map> </BODY></HTML>