Background
==========

An SGML document is of some _doctype_.  This should appear at the top
of the file.  For example:

    <!DOCTYPE HTML SYSTEM>
    <HTML>
    ...
    </HTML>

The SGML summarizer
===================

A Perl script named SGML.sum is used to parse generic SGML files.  SGML.sum
reads the output of the 'sgmls' program by James Clark <jjc@jclark.com>.

The 'sgmls' program needs a few files to do its work:

    The source SGML document
    A Document Type Definition (DTD)
    An SGML declaration (.decl) file

The 'sgmls' program takes three command line arguments:

    The name of a Catalog file
    The declaration file
    The source document file

The Catalog is used to map doctypes to DTD files on disk.  For Harvest
the Catalog is $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog.

The SGML.sum script takes two command line arguments:

    The doctype
    The source file to summarize

By default it looks for the following support files for $doctype:

    $HARVEST_HOME/lib/gatherer/sgmls-lib/$doctype/$doctype.decl
    $HARVEST_HOME/lib/gatherer/sgmls-lib/$doctype/$doctype.sum.tbl

The file $doctype.dtd should be kept here also, but is specified in the
Catalog.  The default 'decl' and 'tbl' pathnames can be overridden
by using -d and -t options to SGML.sum.  The 'tbl' file is discussed later.

Many files to be used with the SGML summarizer may not have <!DOCTYPE..
on the first line.  This will be especially true of HTML.  For this
reason, SGML.sum writes the input to a tmpfile and looks for the
<!DOCTYPE string.  If not found, it inserts

    <!DOCTYPE $doctype SYSTEM>

as the first line of the source document before feeding it to 'sgmls'.

Creating a summarizer for a new doctype
=======================================

Assume you have a new doctype named FOO.

   *) As outlined in the Harvest users manual, edit the Essence config
      files (eg lib/byurl.cf) so that your FOO documents get typed
      as FOO.

           FOO             ^http://.*\.foo$

   *) Create these files

          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.dtd
          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.decl
          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.sum.tbl

      The 'decl' and 'tbl' files could possibly live in the gatherer
      lib directory.  Not sure yet about the DTD.  Edit the Catalog
      file to reflect the pathname of the DTD.

   *) Write a shell script named FOO.sum.  The simplest way is:

           #!/bin/sh
           exec SGML.sum FOO "$@"

      Or possibly

           #!/bin/sh
           dcl="$HARVEST_HOME/gatherers/foo/lib/foo.decl"
           tbl="$HARVEST_HOME/gatherers/foo/lib/foo.tbl"
           exec SGML.sum -d "$dcl" -t "$tbl" FOO "$@"

The SGML to SOIF translation
============================

There are two types of SGML data that can be extracted by the SGML
summarizer.  The first is ``content'' which appears between two
tags.  (Note that SGML allows some ending tags to be implied.)  Example:

    <B>This phrase is in bold</B>
    <PARA TYPE="title">The title of this paper is....</PARA>

The second type is data that appears in SGML attributes, inside the
tag delimiters.  Examples:

    <A HREF="http://harvest.cs.colorado.edu/">
    <META NAME="author"  CONTENT="Duane Wessels">

The SGML summarizer uses a translation table to decide which SGML data
goes into which SOIF attributes.  For the examples above we might use:

    # SGML-to-SOIF mappings
    #
    <B>                     keywords,parent
    <PARA,TYPE=title>       title
    <PARA>                  body
    <A:HREF>                url-references
    <META:CONTENT>          $NAME
    <PRE>                   ignore

The first field is the SGML tag, enclosed in angle brackets.  The second
field is a comma-separated list of SOIF attributes.

There are no special or reserved SGML tag values.  There are two special
characters (comma and colon) which are assumed to not appear in any valid
tag names.

If content appears for a tag not listed in the table, that content is
passed up to the parent and becomes a part of the parent's content.
This continues until a tag is found with an output SOIF attribute.

There are two special SOIF attributes: 'parent' and 'ignore'.  The 'parent'
attribute means to pass the content for this tag up to the parent tag.
This is only needed when you want the content to appear in an attribute
in addition to the parent's.  You would never need to list just 'parent' as
the only attribute for a tag, since content for an unlisted tag is passed
up to the parent anyway.  The 'ignore' attribute means that the
content for this tag should be discarded.

Another special case is the example '$NAME' above.  This means that the
value of the CONTENT attribute in the META tag should be output in
the SOIF attribute given by the value of NAME in the META tag.  Example:

    <META CONTENT="Dirk Niblick" NAME="owner">

results in

    owner{12}:	Dirk Niblick

(There is nothing special about the word 'NAME'; it can be any valid
attribute for the tag.)

Note that order in the translation table is important.  For a given tag,
the first match is taken.  For example, this is NOT what you want:

    <PARA>                  body
    <PARA,TYPE=title>       title

The second line would never be checked because the first line would
match before it.
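
The first-match rule can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions, not the code in SGML.sum
(which is a Perl script).  The names parse_table and lookup are
hypothetical, only content rules of the form <TAG> and <TAG,ATTR=value>
are handled, and tag names and attribute values are assumed to compare
exactly as written.

```python
import re

# Hypothetical sketch of the table syntax described above; not SGML.sum itself.
RULE = re.compile(r"^<([^>]+)>\s+(\S+)$")


def parse_table(text):
    """Return content rules as (tag, required_attrs, soif_attrs), in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = RULE.match(line)
        if m is None:
            continue  # not a rule line
        selector, targets = m.groups()
        if ":" in selector:
            continue  # <TAG:ATTR> rules are left out of this sketch
        tag, *conds = selector.split(",")
        # Each condition is assumed to look like ATTR=value.
        required = dict(c.split("=", 1) for c in conds)
        rules.append((tag, required, targets.split(",")))
    return rules


def lookup(rules, tag, attrs):
    """Return the SOIF attributes of the first matching rule, or None."""
    for rule_tag, required, targets in rules:
        if rule_tag == tag and all(attrs.get(k) == v for k, v in required.items()):
            return targets
    return None


GOOD = """
<PARA,TYPE=title>       title
<PARA>                  body
"""

BAD = """
<PARA>                  body
<PARA,TYPE=title>       title
"""

title_para = {"TYPE": "title"}
good_result = lookup(parse_table(GOOD), "PARA", title_para)
bad_result = lookup(parse_table(BAD), "PARA", title_para)
print(good_result)  # ['title']
print(bad_result)   # ['body']
```

With the specific rule listed first, a PARA with TYPE=title maps to
'title'.  With the order reversed, the generic <PARA> rule matches first
and the title text ends up in 'body'.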