
📄 sgml.sum

Harvest is a robot that downloads HTML web pages
: # *-*-perl-*-*
    eval 'exec perl -S $0 "$@"'
    if $running_under_some_shell;
#
#  SGML.sum - SGML summarizer for Harvest
#
#  Usage: SGML.sum [-d file.decl] [-t file.tbl] DOCTYPE [file]
#
#  $Id: SGML.sum,v 2.7 2000/01/21 17:37:32 sxw Exp $
#
#######################################################################
#
#  Harvest Indexer http://harvest.sourceforge.net/
#  -----------------------------------------------
#
#  The Harvest Indexer is a continued development of code developed by
#  the Harvest Project. Development is carried out by numerous individuals
#  in the Internet community, and is not officially connected with the
#  original Harvest Project or its funding sources.
#
#  Please mail lee@arco.de if you are interested in participating
#  in the development effort.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software
#  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#######################################################################
#  Copyright (c) 1994, 1995.  All rights reserved.
#
#    The Harvest software was developed by the Internet Research Task
#    Force Research Group on Resource Discovery (IRTF-RD):
#
#          Mic Bowman of Transarc Corporation.
#          Peter Danzig of the University of Southern California.
#          Darren R. Hardy of the University of Colorado at Boulder.
#          Udi Manber of the University of Arizona.
#          Michael F. Schwartz of the University of Colorado at Boulder.
#          Duane Wessels of the University of Colorado at Boulder.
#
#    This copyright notice applies to software in the Harvest
#    ``src/'' directory only.  Users should consult the individual
#    copyright notices in the ``components/'' subdirectories for
#    copyright information about other software bundled with the
#    Harvest source code distribution.
#
#  TERMS OF USE
#
#    The Harvest software may be used and re-distributed without
#    charge, provided that the software origin and research team are
#    cited in any use of the system.  Most commonly this is
#    accomplished by including a link to the Harvest Home Page
#    (http://harvest.cs.colorado.edu/) from the query page of any
#    Broker you deploy, as well as in the query result pages.  These
#    links are generated automatically by the standard Broker
#    software distribution.
#
#    The Harvest software is provided ``as is'', without express or
#    implied warranty, and with no support nor obligation to assist
#    in its use, correction, modification or enhancement.  We assume
#    no liability with respect to the infringement of copyrights,
#    trade secrets, or any patents, and are not responsible for
#    consequential damages.  Proper use of the Harvest software is
#    entirely the responsibility of the user.
#
#  DERIVATIVE WORKS
#
#    Users may make derivative works from the Harvest software, subject
#    to the following constraints:
#
#      - You must include the above copyright notice and these
#        accompanying paragraphs in all forms of derivative works,
#        and any documentation and other materials related to such
#        distribution and use acknowledge that the software was
#        developed at the above institutions.
#
#      - You must notify IRTF-RD regarding your distribution of
#        the derivative work.
#
#      - You must clearly notify users that your are distributing
#        a modified version and not the original Harvest software.
#
#      - Any derivative product is also subject to these copyright
#        and use restrictions.
#
#    Note that the Harvest software is NOT in the public domain.  We
#    retain copyright, as specified above.
#
#  HISTORY OF FREE SOFTWARE STATUS
#
#    Originally we required sites to license the software in cases
#    where they were going to build commercial products/services
#    around Harvest.  In June 1995 we changed this policy.  We now
#    allow people to use the core Harvest software (the code found in
#    the Harvest ``src/'' directory) for free.  We made this change
#    in the interest of encouraging the widest possible deployment of
#    the technology.  The Harvest software is really a reference
#    implementation of a set of protocols and formats, some of which
#    we intend to standardize.  We encourage commercial
#    re-implementations of code complying to this set of standards.
#

$start_t	= time;
$syntax_check	= 0;	# set to 1 if you want sgmls to print error messages
$debug		= 0;
$print_times    = 0;	# set to 1 if you want SGML.sum to print time info
$print_times    = 1 if ($debug);

# Don't perform "word wrap" on strings larger than this length.  The word
# wrap function can be very slow for large buffers
$MAXWRAPSIZE	= 16384;

$usage		= "usage: $0 DOCTYPE [file]\n";
&Getopts('d:t:');
$doctype	= shift || die $usage;
die $usage unless ($#ARGV <= $[);
$| = 1;

$sgmls_cmd	= "$ENV{'HARVEST_HOME'}/lib/gatherer/sgmls";
$sgmls_lib	= "$ENV{'HARVEST_HOME'}/lib/gatherer/sgmls-lib";
$catalog	= "$sgmls_lib/catalog";
$decl		= "$sgmls_lib/$doctype/$doctype.decl";
$table		= "$sgmls_lib/$doctype/$doctype.sum.tbl";	# default

# Set this to get prefixed META NAMEs (and HTTP-EQUIVs). (HS)
#$meta_prefix = 'meta-';
$meta_prefix = '';

# Set this to 1 to change all non-alphanumeric characters in SOIF attribute
# names to '_'.  The only exception is '-' that is not changed. (HS)
$fix_attrib_names = 0;

# Search for the .sum.tbl file; allows per-Gatherer customizations
#
$LIBPATH = $ENV{'SUMMARIZER_LIBPATH'};
foreach $d (split(':', $LIBPATH)) {
	if ( -r "$d/$doctype.sum.tbl" ) {
		$table = "$d/$doctype.sum.tbl";
		last;
	}
}

# Command-line options override all defaults
#
$decl		= $opt_d if defined $opt_d;
$table		= $opt_t if defined $opt_t;

unless ( -f $decl && -f $table ) {
	print STDERR "No support for doctype $doctype\n";
	print STDERR "Missing: $decl\n" unless ( -f $decl );
	print STDERR "Missing: $table\n" unless ( -f $table );
	exit (1);
}

# Load the TAG->ATTR table
#
open (table) || die "$table: $!\n";
@soifkeys = ();		# need a list to preserve the order
while (<table>) {
	s/#.*$//;
	next unless (/\S/);
	chop;
	next unless (/^<([^>]+)>\s*(.*)\s*$/);
	($tag = $1)	=~ tr/a-z/A-Z/;
	$SOIF{$tag}	= $2;
	print STDERR "table> $tag --> $SOIF{$tag}\n" if ($debug);
	push (@soifkeys, $tag);
}
close table;
undef ($tag);

# Read the SGML data into a tmpfile.  Tack on the DOCTYPE if missing
#
undef $/;
$sgml = <> || die "SGML.sum: No input.\n";
$T = &tempnam;
open (T, ">$T")  || die "$T: $!\n";

# Ack! this SGML crap junk is ignorant sometimes.  Sometimes the authoring
# tool writes
#
#  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Level 1//EN//2.0" "html.dtd">
#
# and 'sgmls' then looks for 'html.dtd' in the current directory.  This
# fails in a bad way.  So now we check for this type of DOCTYPE spec.
# If found, stat the "html.dtd" part and if it's not there, change that
# whole thing to
#
#    <!DOCTYPE HTML SYSTEM>
#
if ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+(\"[^\"]+\")\s+\"([^\"]+)\"\s*>/i) {
	$xdoctype = $1;
	$dtdfile = $4;
	$sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM>/i
		unless ( -f $dtdfile );
	undef $xdoctype;
	undef $dtdfile;
}

# Sheesh.  We need to do the same for:
#  <!DOCTYPE HTML SYSTEM "html.dtd">
#
#  but NOT:
#  <!DOCTYPE HTML PUBLIC "-//Sun Microsystems Corp.//DTD HotJava HTML//EN">
#
# or  <!DOCTYPE HTML SYSTEM "http:...">
if ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+\"(http:\/\/[^\"]+)\"\s*>/i) {
	$xdoctype = $1;
	$dtdfile = $3;
	$sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM "$dtdfile">/i
		unless ( -f $dtdfile );
	undef $xdoctype;
	undef $dtdfile;
} elsif ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+\"([^-][^\"]+)\"\s*>/i) {
	$xdoctype = $1;
	$dtdfile = $3;
	$sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM>/i
		unless ( -f $dtdfile );
	undef $xdoctype;
	undef $dtdfile;
}

print T "<!DOCTYPE $doctype SYSTEM>\n"
	unless $sgml =~ /^\s*<!DOCTYPE/i;
print T $sgml;
close T;
$/="\n";

# Open the pipe from 'sgmls' for reading...
#
if ($syntax_check) {
	$Terr = &temperrnam;
	$cmd = "$sgmls_cmd -f $Terr -m $catalog $decl $T";
} else {
	$cmd = "$sgmls_cmd -f /dev/null -m $catalog $decl $T";
}
open (SGMLS, "$cmd|") || (unlink ($T), unlink ($Terr), die "$cmd: $!\n");
$foo = &parse_sgml ('NIL', ());
close SGMLS;

if (-s $Terr) {
	if (open(Terr,$Terr)) {
		print STDERR "\n";
		foreach (<Terr>) {
			chop;
			s/^[^:]*:[^:]*://;
			($line_no, $eno, $e, $error_message) = split(':');
			print STDERR "SGML error in $ENV{SUMMARIZER_URL} at line $line_no: $error_message\n";
		}
	}
}
unlink ($Terr) unless ($debug);
unlink ($T) unless ($debug);

# CLEANUP ATT VALUES
#
foreach $k (keys %ATT) {
	$ATT{$k} =~ s/\\n/\n/g;         # change backslash-N to newline
	$ATT{$k} =~ s/\\011/\t/g;       # change escaped tabs
	$ATT{$k} =~ s/\n*\n/\n/g;       # remove blank lines
}
