and then apply a type-specific extraction algorithm (called a
summarizer) to the data to generate a content summary. Users can
customize each of these aspects, but often this is not necessary.
Harvest is distributed with a ``stock'' set of type recognizers,
presentation unnesters, candidate selectors, and summarizers that work
well for many applications.

Below we describe the stock summarizer set, the current components
distribution, and how users can customize summarizers to change how
they operate and add summarizers for new types of data. If you develop
a summarizer that is likely to be useful to other users, please notify
us via email at lee@arco.de so we may include it in our Harvest
distribution.

Type                   Summarizer Function
--------------------------------------------------------------------
Bibliographic          Extract author and titles
Binary                 Extract meaningful strings and manual page
                       summary
C, CHeader             Extract procedure names, included file names,
                       and comments
Dvi                    Invoke the Text summarizer on extracted ASCII
                       text
FAQ, FullText, README  Extract all words in file
Font                   Extract comments
HTML                   Extract anchors, hypertext links, and selected
                       fields
LaTex                  Parse selected LaTeX fields (author, title,
                       etc.)
Mail                   Extract certain header fields
Makefile               Extract comments and target names
ManPage                Extract synopsis, author, title, etc., based
                       on ``-man'' macros
News                   Extract certain header fields
Object                 Extract symbol table
Patch                  Extract patched file names
Perl                   Extract procedure names and comments
PostScript             Extract text in word processor-specific
                       fashion, and pass through Text summarizer
RCS, SCCS              Extract revision control summary
RTF                    Up-convert to HTML and pass through HTML
                       summarizer
SGML                   Extract fields named in extraction table
ShellScript            Extract comments
SourceDistribution     Extract full text of README file and comments
                       from Makefile and source code files, and
                       summarize any manual pages
SymbolicLink           Extract file name, owner, and date created
TeX                    Invoke the Text summarizer on extracted ASCII
                       text
Text                   Extract first 100 lines plus first sentence of
                       each remaining paragraph
Troff                  Extract author, title, etc., based on ``-man'',
                       ``-ms'', ``-me'' macro packages, or extract
                       section headers and topic sentences
Unrecognized           Extract file name, owner, and date created

4.5.1. Default actions of ``stock'' summarizers

The table in Section ``Extracting data for indexing: The Essence
summarizing subsystem'' provides a brief reference for how documents
are summarized depending on their type. These actions can be
customized, as discussed in Section ``Customizing the type
recognition, candidate selection, presentation unnesting, and
summarizing steps''. Some summarizers are implemented as UNIX programs
while others are expressed as regular expressions; see Section
``Customizing the summarizing step'' or Section ``Example 4'' for more
information about how to write a summarizer.

4.5.2. Summarizing SGML data

It is possible to summarize documents that conform to the Standard
Generalized Markup Language (SGML), for which you have a Document Type
Definition (DTD). The World Wide Web's Hypertext Mark-up Language
(HTML) is actually a particular application of SGML, with a
corresponding DTD. (In fact, the Harvest HTML summarizer can use the
HTML DTD and our SGML summarizing mechanism, which provides various
advantages; see Section ``The SGML-based HTML summarizer''.)

SGML is being used in an increasingly broad variety of applications,
for example as a format for storing data for a number of physical
sciences. Because SGML allows documents to contain a good deal of
structure, Harvest can summarize SGML documents very effectively.

The SGML summarizer (SGML.sum) uses the sgmls program by James Clark
to parse the SGML document. The parser needs both a DTD for the
document and a Declaration file that describes the allowed character
set. The SGML.sum program uses a table that maps SGML tags to SOIF
attributes.

4.5.2.1. Location of support files

SGML support files can be found in
$HARVEST_HOME/lib/gatherer/sgmls-lib/. For example, these are the
default pathnames for HTML summarizing using the SGML summarizing
mechanism:

    $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
    $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
    $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

The location of the DTD file must be specified in the sgmls catalog
($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog). For example:

    DOCTYPE   HTML   HTML/html.dtd

The SGML.sum program looks for the .decl file in the default location.
An alternate pathname can be specified with the -d option to SGML.sum.

The summarizer looks for the .sum.tbl file first in the Gatherer's lib
directory and then in the default location. Both of these can be
overridden with the -t option to SGML.sum.

4.5.2.2. The SGML to SOIF table

The translation table provides a simple yet powerful way to specify
how an SGML document is to be summarized. There are four ways to map
SGML data into SOIF. The first two are concerned with placing the
content of an SGML tag into a SOIF attribute.

A simple SGML-to-SOIF mapping looks like this:

    <TAG>   soif1,soif2,...

This places the content that occurs inside the tag ``TAG'' into the
SOIF attributes ``soif1'' and ``soif2''. It is possible to select
different SOIF attributes based on SGML attribute values. For example,
if ``ATT'' is an attribute of ``TAG'', then it would be written like
this:

    <TAG,ATT=x>   x-stuff
    <TAG,ATT=y>   y-stuff
    <TAG>         stuff

The second two mappings place values of SGML attributes into SOIF
attributes. To place the value of the ``ATT'' attribute of the ``TAG''
tag into the ``att-stuff'' SOIF attribute you would write:

    <TAG:ATT>   att-stuff

It is also possible to place the value of an SGML attribute into a
SOIF attribute whose name is given by the value of a different SGML
attribute:

    <TAG:ATT1>   $ATT2

When the summarizer encounters an SGML tag not listed in the table,
the content is passed to the parent tag and becomes a part of the
parent's content. To force the content of some tag not to be passed
up, specify the SOIF attribute as ``ignore''. To force the content of
some tag to be passed to the parent in addition to being placed into a
SOIF attribute, list an additional SOIF attribute named ``parent''.

Please see Section ``The SGML-based HTML summarizer'' for examples of
these mappings.

4.5.2.3. Errors and warnings from the SGML Parser

The sgmls parser can generate an overwhelming volume of error and
warning messages. This will be especially true for HTML documents
found on the Internet, which often do not conform to the strict HTML
DTD. By default, errors and warnings are redirected to /dev/null so
that they do not clutter the Gatherer's log files. To enable logging
of these messages, edit the SGML.sum Perl script and set
$syntax_check = 1.

4.5.2.4. Creating a summarizer for a new SGML-tagged data type

To create an SGML summarizer for a new SGML-tagged data type with an
associated DTD, you need to do the following:

1. Write a shell script named FOO.sum which simply contains

       #!/bin/sh
       exec SGML.sum FOO $*

2. Modify the essence configuration files (as described in Section
   ``Customizing the type recognition step'') so that your documents
   get typed as FOO.

3. Create the directory $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/ and
   copy your DTD and Declaration there as FOO.dtd and FOO.decl. Edit
   $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog and add FOO.dtd to it.

4. Create the translation table FOO.sum.tbl and place it with the DTD
   in $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/.

At this point you can test everything from the command line as
follows:

    % FOO.sum myfile.foo

4.5.2.5. The SGML-based HTML summarizer

Harvest can summarize HTML using the generic SGML summarizer described
in Section ``Summarizing SGML data''. The advantage of this approach
is that the summarizer is more easily customizable, and fits with the
well-conceived SGML model (where you define DTDs for individual
document types and build interpretation software to understand DTDs
rather than individual document types). The downside is that the
summarizer is now pickier about syntax, and many Web documents are not
syntactically correct. Because of this pickiness, the default is for
the HTML summarizer to run with syntax-checking output disabled. If
your documents are so badly formed that they confuse the parser, this
may mean the summarizing process dies unceremoniously. If you find
that some of your HTML documents do not get summarized or only get
summarized in part, you can turn syntax-checking output on by setting
$syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will
allow you to see which documents are invalid and where.

Note that part of the reason for this problem is that Web browsers do
not insist on well-formed documents.
So, users can easily create documents that are not completely valid,
yet display fine.

Below is the default SGML-to-SOIF table used by the HTML summarizer:

    HTML ELEMENT    SOIF ATTRIBUTES
    ------------    -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <ADDRESS>       address
    <B>             keywords,parent
    <BODY>          body
    <CITE>          references
    <CODE>          ignore
    <EM>            keywords,parent
    <H1>            headings
    <H2>            headings
    <H3>            headings
    <H4>            headings
    <H5>            headings
    <H6>            headings
    <HEAD>          head
    <I>             keywords,parent
    <IMG:SRC>       images
    <META:CONTENT>  $NAME
    <STRONG>        keywords,parent
    <TITLE>         title
    <TT>            keywords,parent
    <UL>            keywords,parent

The pathname to this file is $HARVEST_HOME
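
As an illustration of the four mapping forms described in Section
``The SGML to SOIF table'', the following is a minimal sketch that
reads table entries and labels each one by form. It is written in
Python purely for illustration; SGML.sum itself is a Perl script, and
this is not its implementation. The names Mapping, ENTRY, parse_table,
and the kind labels are hypothetical. The sketch assumes one entry per
line, with whitespace separating the tag specification from a
comma-separated list of SOIF attribute names that contains no spaces.

```python
import re
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    """One table entry (hypothetical structure, for illustration only)."""
    kind: str                  # which of the four mapping forms this is
    tag: str                   # SGML tag name
    sgml_attr: Optional[str]   # SGML attribute involved, if any
    value: Optional[str]       # required attribute value (conditional form)
    targets: List[str]         # SOIF attribute names, or a "$ATT" reference


# Assumed entry syntax: <TAG>, <TAG,ATT=value>, or <TAG:ATT>, then the
# SOIF attribute list.
ENTRY = re.compile(
    r"^<([A-Za-z][A-Za-z0-9]*)"        # tag name
    r"(?:,([A-Za-z][\w-]*)=([^>]+)"    # ,ATT=value (content, conditional)
    r"|:([A-Za-z][\w-]*))?"            # :ATT (attribute value)
    r">\s+(\S+)$"                      # SOIF attribute list
)


def parse_table(text: str) -> List[Mapping]:
    """Return the entries found in a translation table."""
    mappings = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # column headers, rules, and blank lines
        tag, cond_attr, cond_value, value_attr, rhs = m.groups()
        if cond_attr is not None:
            kind = "content-if-attribute"
        elif value_attr is None:
            kind = "content"
        elif rhs.startswith("$"):
            kind = "attribute-value-named-by-attribute"
        else:
            kind = "attribute-value"
        mappings.append(
            Mapping(kind, tag, cond_attr or value_attr, cond_value,
                    rhs.split(","))
        )
    return mappings


# Rows taken from the default HTML table, plus the conditional example
# from the manual text.
SAMPLE = """\
HTML ELEMENT    SOIF ATTRIBUTES
------------    -----------------------
<A>             keywords,parent
<A:HREF>        url-references
<CODE>          ignore
<META:CONTENT>  $NAME
<TAG,ATT=x>     x-stuff
"""

for entry in parse_table(SAMPLE):
    print(entry.kind, entry.tag, entry.sgml_attr, entry.targets)
```

The special names ``ignore'' and ``parent'' are left in the target
list as ordinary strings; a fuller sketch would treat them as
directives rather than as SOIF attribute names.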