  and then apply a type-specific extraction algorithm (called a
  summarizer) to the data to generate a content summary.  Users can
  customize each of these aspects, but often this is not necessary.
  Harvest is distributed with a ``stock'' set of type recognizers,
  presentation unnesters, candidate selectors, and summarizers that work
  well for many applications.

  Below we describe the stock summarizer set, the current components
  distribution, and how users can customize summarizers to change how
  they operate and add summarizers for new types of data.  If you
  develop a summarizer that is likely to be useful to other users,
  please notify us via email at lee@arco.de so we may include it in our
  Harvest distribution.

  Type            Summarizer Function
  --------------------------------------------------------------------
  Bibliographic   Extract author and titles
  Binary          Extract meaningful strings and manual page summary
  C, CHeader      Extract procedure names, included file names, and
                  comments
  Dvi             Invoke the Text summarizer on extracted ASCII text
  FAQ, FullText, README
                  Extract all words in file
  Font            Extract comments
  HTML            Extract anchors, hypertext links, and selected fields
  LaTex           Parse selected LaTex fields (author, title, etc.)
  Mail            Extract certain header fields
  Makefile        Extract comments and target names
  ManPage         Extract synopsis, author, title, etc., based on
                  ``-man'' macros
  News            Extract certain header fields
  Object          Extract symbol table
  Patch           Extract patched file names
  Perl            Extract procedure names and comments
  PostScript      Extract text in word processor-specific fashion, and
                  pass through Text summarizer.
  RCS, SCCS       Extract revision control summary
  RTF             Up-convert to HTML and pass through HTML summarizer
  SGML            Extract fields named in extraction table
  ShellScript     Extract comments
  SourceDistribution
                  Extract full text of README file and comments from
                  Makefile and source code files, and summarize any
                  manual pages
  SymbolicLink    Extract file name, owner, and date created
  TeX             Invoke the Text summarizer on extracted ASCII text
  Text            Extract first 100 lines plus first sentence of each
                  remaining paragraph
  Troff           Extract author, title, etc., based on ``-man'',
                  ``-ms'', ``-me'' macro packages, or extract section
                  headers and topic sentences.
  Unrecognized    Extract file name, owner, and date created.

  4.5.1.  Default actions of ``stock'' summarizers

  The table in Section ``Extracting data for indexing: The Essence
  summarizing subsystem'' provides a brief reference for how documents
  are summarized depending on their type.  These actions can be
  customized, as discussed in Section ``Customizing the type
  recognition, candidate selection, presentation unnesting, and
  summarizing steps''.  Some summarizers are implemented as UNIX
  programs while others are expressed as regular expressions; see
  Section ``Customizing the summarizing step'' or Section ``Example 4''
  for more information about how to write a summarizer.

  4.5.2.  Summarizing SGML data

  It is possible to summarize documents that conform to the Standard
  Generalized Markup Language (SGML), for which you have a Document Type
  Definition (DTD).  The World Wide Web's Hypertext Mark-up Language
  (HTML) is actually a particular application of SGML, with a
  corresponding DTD.
  (In fact, the Harvest HTML summarizer can use the HTML DTD and our
  SGML summarizing mechanism, which provides various advantages; see
  Section ``The SGML-based HTML summarizer''.)  SGML is being used in an
  increasingly broad variety of applications, for example as a format
  for storing data for a number of physical sciences.  Because SGML
  allows documents to contain a good deal of structure, Harvest can
  summarize SGML documents very effectively.

  The SGML summarizer (SGML.sum) uses the sgmls program by James Clark
  to parse the SGML document.  The parser needs both a DTD for the
  document and a Declaration file that describes the allowed character
  set.  The SGML.sum program uses a table that maps SGML tags to SOIF
  attributes.

  4.5.2.1.  Location of support files

  SGML support files can be found in
  $HARVEST_HOME/lib/gatherer/sgmls-lib/.  For example, these are the
  default pathnames for HTML summarizing using the SGML summarizing
  mechanism:

               $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
               $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
               $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

  The location of the DTD file must be specified in the sgmls catalog
  ($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog).  For example:

               DOCTYPE   HTML   HTML/html.dtd

  The SGML.sum program looks for the .decl file in the default location.
  An alternate pathname can be specified with the -d option to SGML.sum.

  The summarizer looks for the .sum.tbl file first in the Gatherer's lib
  directory and then in the default location.  Both of these can be
  overridden with the -t option to SGML.sum.

  4.5.2.2.  The SGML to SOIF table

  The translation table provides a simple yet powerful way to specify
  how an SGML document is to be summarized.
  There are four ways to map SGML data into SOIF.  The first two are
  concerned with placing the content of an SGML tag into a SOIF
  attribute.

  A simple SGML-to-SOIF mapping looks like this:

               <TAG>               soif1,soif2,...

  This places the content that occurs inside the tag ``TAG'' into the
  SOIF attributes ``soif1'' and ``soif2''.  It is possible to select
  different SOIF attributes based on SGML attribute values.  For
  example, if ``ATT'' is an attribute of ``TAG'', then it would be
  written like this:

               <TAG,ATT=x>         x-stuff
               <TAG,ATT=y>         y-stuff
               <TAG>               stuff

  The second two mappings place values of SGML attributes into SOIF
  attributes.  To place the value of the ``ATT'' attribute of the
  ``TAG'' tag into the ``att-stuff'' SOIF attribute you would write:

               <TAG:ATT>           att-stuff

  It is also possible to place the value of an SGML attribute into a
  SOIF attribute whose name is taken from the value of a different SGML
  attribute of the same tag:

               <TAG:ATT1>          $ATT2

  When the summarizer encounters an SGML tag not listed in the table,
  the content is passed to the parent tag and becomes a part of the
  parent's content.  To force the content of some tag not to be passed
  up, specify the SOIF attribute as ``ignore''.  To force the content of
  some tag to be passed to the parent in addition to being placed into a
  SOIF attribute, list an additional SOIF attribute named ``parent''.

  Please see Section ``The SGML-based HTML summarizer'' for examples of
  these mappings.

  4.5.2.3.  Errors and warnings from the SGML Parser

  The sgmls parser can generate an overwhelming volume of error and
  warning messages.  This will be especially true for HTML documents
  found on the Internet, which often do not conform to the strict HTML
  DTD.  By default, errors and warnings are redirected to /dev/null so
  that they do not clutter the Gatherer's log files.
  To enable logging of these messages, edit the SGML.sum Perl script and
  set $syntax_check = 1.

  4.5.2.4.  Creating a summarizer for a new SGML-tagged data type

  To create an SGML summarizer for a new SGML-tagged data type with an
  associated DTD, you need to do the following:

  1. Write a shell script named FOO.sum which simply contains

               #!/bin/sh
               exec SGML.sum FOO "$@"

  2. Modify the essence configuration files (as described in Section
     ``Customizing the type recognition step'') so that your documents
     get typed as FOO.

  3. Create the directory $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/ and
     copy your DTD and Declaration there as FOO.dtd and FOO.decl.  Edit
     $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog and add FOO.dtd to it.

  4. Create the translation table FOO.sum.tbl and place it with the DTD
     in $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/.

  At this point you can test everything from the command line as
  follows:

               % FOO.sum myfile.foo

  4.5.2.5.  The SGML-based HTML summarizer

  Harvest can summarize HTML using the generic SGML summarizer described
  in Section ``Summarizing SGML data''.  The advantage of this approach
  is that the summarizer is more easily customizable, and fits with the
  well-conceived SGML model (where you define DTDs for individual
  document types and build interpretation software to understand DTDs
  rather than individual document types).  The downside is that the
  summarizer is now pickier about syntax, and many Web documents are not
  syntactically correct.  Because of this pickiness, the default is for
  the HTML summarizer to run with syntax-checking output disabled.
  If your documents are so badly formed that they confuse the parser,
  this may mean the summarizing process dies unceremoniously.  If you
  find that some of your HTML documents do not get summarized or only
  get summarized in part, you can turn syntax-checking output on by
  setting $syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum.
  That will allow you to see which documents are invalid and where.

  Note that part of the reason for this problem is that Web browsers do
  not insist on well-formed documents.  So, users can easily create
  documents that are not completely valid, yet display fine.

  Below is the default SGML-to-SOIF table used by the HTML summarizer:

       HTML ELEMENT    SOIF ATTRIBUTES
       ------------    -----------------------
       <A>             keywords,parent
       <A:HREF>        url-references
       <ADDRESS>       address
       <B>             keywords,parent
       <BODY>          body
       <CITE>          references
       <CODE>          ignore
       <EM>            keywords,parent
       <H1>            headings
       <H2>            headings
       <H3>            headings
       <H4>            headings
       <H5>            headings
       <H6>            headings
       <HEAD>          head
       <I>             keywords,parent
       <IMG:SRC>       images
       <META:CONTENT>  $NAME
       <STRONG>        keywords,parent
       <TITLE>         title
       <TT>            keywords,parent
       <UL>            keywords,parent

  The pathname to this file is $HARVEST_HOME
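
  As an illustration only, the lookup rules from Section ``The SGML to
  SOIF table'' can be sketched in Python using a few rows of the default
  HTML table.  This is not the SGML.sum implementation (which is a Perl
  script); the function names parse_table, content_targets, and
  attribute_targets are hypothetical.  The sketch also assumes that a
  conditional row such as <TAG,ATT=x> takes precedence over a plain
  <TAG> row, which the text does not state explicitly.

```python
# Hypothetical sketch of the SGML-to-SOIF table rules; not Harvest's code.

def parse_table(text):
    """Parse table rows into (tag, kind, att, value, targets) tuples."""
    rules = []
    for line in text.splitlines():
        parts = line.split()
        # Keep only "<SPEC>  targets" rows; skip headers, rulers, blanks.
        if len(parts) != 2 or not (parts[0].startswith("<") and parts[0].endswith(">")):
            continue
        spec, targets = parts[0][1:-1], parts[1].split(",")
        if ":" in spec:            # <TAG:ATT>    maps an attribute value
            tag, att = spec.split(":", 1)
            rules.append((tag.upper(), "attr", att.upper(), None, targets))
        elif "," in spec:          # <TAG,ATT=x>  maps content, conditionally
            tag, cond = spec.split(",", 1)
            att, value = cond.split("=", 1)
            rules.append((tag.upper(), "content", att.upper(), value, targets))
        else:                      # <TAG>        maps content
            rules.append((spec.upper(), "content", None, None, targets))
    return rules


def content_targets(rules, tag, attrs):
    """Return the SOIF attributes that receive the content of a tag.

    attrs maps upper-case SGML attribute names to their values.
    Assumption: a matching conditional row wins over the plain row.
    """
    fallback = None
    for rtag, kind, att, value, targets in rules:
        if rtag != tag.upper() or kind != "content":
            continue
        if att is None:
            if fallback is None:
                fallback = targets
        elif attrs.get(att) == value:
            return targets
    # A tag with no row passes its content up to the parent tag.
    return fallback if fallback is not None else ["parent"]


def attribute_targets(rules, tag, attrs):
    """Return {soif_attribute: value} for the attribute-value rows of a tag."""
    out = {}
    for rtag, kind, att, _value, targets in rules:
        if rtag != tag.upper() or kind != "attr" or att not in attrs:
            continue
        for target in targets:
            # "$NAME" means: the SOIF name is the value of the NAME attribute.
            name = attrs.get(target[1:].upper()) if target.startswith("$") else target
            if name:
                out[name] = attrs[att]
    return out


TABLE = """
<A>             keywords,parent
<A:HREF>        url-references
<CODE>          ignore
<META:CONTENT>  $NAME
<TITLE>         title
"""

rules = parse_table(TABLE)
print(content_targets(rules, "A", {}))
print(attribute_targets(rules, "META", {"NAME": "author", "CONTENT": "J. Doe"}))
```

  With these rows, the content of <A> goes to ``keywords'' and is also
  passed to the parent, while a META tag whose NAME is ``author'' places
  its CONTENT value into a SOIF attribute called ``author''.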
