📄 sgml.sum
: # *-*-perl-*-*
    eval 'exec perl -S $0 "$@"'
    if $running_under_some_shell;
#
#  SGML.sum - SGML summarizer for Harvest
#
#  Usage: SGML.sum [-d file.decl] [-t file.tbl] DOCTYPE [file]
#
#  $Id: SGML.sum,v 2.7 2000/01/21 17:37:32 sxw Exp $
#
#######################################################################
#  Harvest Indexer http://harvest.sourceforge.net/
#  -----------------------------------------------
#
#  The Harvest Indexer is a continued development of code developed by
#  the Harvest Project. Development is carried out by numerous individuals
#  in the Internet community, and is not officially connected with the
#  original Harvest Project or its funding sources.
#
#  Please mail lee@arco.de if you are interested in participating
#  in the development effort.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software
#  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#######################################################################
#  Copyright (c) 1994, 1995. All rights reserved.
#
#  The Harvest software was developed by the Internet Research Task
#  Force Research Group on Resource Discovery (IRTF-RD):
#
#    Mic Bowman of Transarc Corporation.
#    Peter Danzig of the University of Southern California.
#    Darren R. Hardy of the University of Colorado at Boulder.
#    Udi Manber of the University of Arizona.
#    Michael F. Schwartz of the University of Colorado at Boulder.
#    Duane Wessels of the University of Colorado at Boulder.
#
#  This copyright notice applies to software in the Harvest
#  ``src/'' directory only. Users should consult the individual
#  copyright notices in the ``components/'' subdirectories for
#  copyright information about other software bundled with the
#  Harvest source code distribution.
#
#  TERMS OF USE
#
#  The Harvest software may be used and re-distributed without
#  charge, provided that the software origin and research team are
#  cited in any use of the system. Most commonly this is
#  accomplished by including a link to the Harvest Home Page
#  (http://harvest.cs.colorado.edu/) from the query page of any
#  Broker you deploy, as well as in the query result pages. These
#  links are generated automatically by the standard Broker
#  software distribution.
#
#  The Harvest software is provided ``as is'', without express or
#  implied warranty, and with no support nor obligation to assist
#  in its use, correction, modification or enhancement. We assume
#  no liability with respect to the infringement of copyrights,
#  trade secrets, or any patents, and are not responsible for
#  consequential damages. Proper use of the Harvest software is
#  entirely the responsibility of the user.
#
#  DERIVATIVE WORKS
#
#  Users may make derivative works from the Harvest software, subject
#  to the following constraints:
#
#    - You must include the above copyright notice and these
#      accompanying paragraphs in all forms of derivative works,
#      and any documentation and other materials related to such
#      distribution and use acknowledge that the software was
#      developed at the above institutions.
#
#    - You must notify IRTF-RD regarding your distribution of
#      the derivative work.
#
#    - You must clearly notify users that your are distributing
#      a modified version and not the original Harvest software.
#
#    - Any derivative product is also subject to these copyright
#      and use restrictions.
#
#  Note that the Harvest software is NOT in the public domain. We
#  retain copyright, as specified above.
#
#  HISTORY OF FREE SOFTWARE STATUS
#
#  Originally we required sites to license the software in cases
#  where they were going to build commercial products/services
#  around Harvest. In June 1995 we changed this policy. We now
#  allow people to use the core Harvest software (the code found in
#  the Harvest ``src/'' directory) for free. We made this change
#  in the interest of encouraging the widest possible deployment of
#  the technology. The Harvest software is really a reference
#  implementation of a set of protocols and formats, some of which
#  we intend to standardize. We encourage commercial
#  re-implementations of code complying to this set of standards.
#
#

$start_t = time;

$syntax_check = 0;      # set to 1 if you want sgmls to print error messages
$debug = 0;
$print_times = 0;       # set to 1 if you want SGML.sum to print time info
$print_times = 1 if ($debug);

# Don't perform "word wrap" on strings larger than this length. The word
# wrap function can be very slow for large buffers
$MAXWRAPSIZE = 16384;

$usage = "usage: $0 [-d file.decl] [-t file.tbl] DOCTYPE [file]\n";
&Getopts('d:t:');
$doctype = shift || die $usage;
die $usage unless ($#ARGV <= $[);

$| = 1;

$sgmls_cmd = "$ENV{'HARVEST_HOME'}/lib/gatherer/sgmls";
$sgmls_lib = "$ENV{'HARVEST_HOME'}/lib/gatherer/sgmls-lib";
$catalog = "$sgmls_lib/catalog";
$decl = "$sgmls_lib/$doctype/$doctype.decl";
$table = "$sgmls_lib/$doctype/$doctype.sum.tbl";    # default

# Set this to get prefixed META NAMEs (and HTTP-EQUIVs). (HS)
#$meta_prefix = 'meta-';
$meta_prefix = '';

# Set this to 1 to change all non-alphanumeric characters in SOIF attribute
# names to '_'. The only exception is '-' that is not changed. (HS)
$fix_attrib_names = 0;

# search for the .sum.tbl file; allows per-Gatherer customizations
#
$LIBPATH = $ENV{'SUMMARIZER_LIBPATH'};
foreach $d (split(':', $LIBPATH)) {
    if ( -r "$d/$doctype.sum.tbl" ) {
        $table = "$d/$doctype.sum.tbl";
        last;
    }
}

# Command-line options override all defaults
#
$decl = $opt_d if defined $opt_d;
$table = $opt_t if defined $opt_t;

unless ( -f $decl && -f $table ) {
    print STDERR "No support for doctype $doctype\n";
    print STDERR "Missing: $decl\n" unless ( -f $decl );
    print STDERR "Missing: $table\n" unless ( -f $table );
    exit (1);
}

# Load the TAG->ATTR table
#
open (table) || die "$table: $!\n";
@soifkeys = ();     # need a list to preserve the order
while (<table>) {
    s/#.*$//;
    next unless (/\S/);
    chop;
    next unless (/^<([^>]+)>\s*(.*)\s*$/);
    ($tag = $1) =~ tr/a-z/A-Z/;
    $SOIF{$tag} = $2;
    print STDERR "table> $tag --> $SOIF{$tag}\n" if ($debug);
    push (@soifkeys, $tag);
}
close table;
undef ($tag);

# Read the SGML data into a tmpfile. Tack on the DOCTYPE if missing
#
undef $/;
$sgml = <> || die "SGML.sum: No input.\n";
$T = &tempnam;
open (T, ">$T") || die "$T: $!\n";

# Ack! this SGML crap junk is ignorant sometimes. Sometimes the authoring
# tool writes
#
#   <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Level 1//EN//2.0" "html.dtd">
#
# and 'sgmls' then looks for 'html.dtd' in the current directory. This
# fails in a bad way. So now we check for this type of DOCTYPE spec.
# If found, stat the "html.dtd" part and if it's not there, change that
# whole thing to
#
#   <!DOCTYPE HTML SYSTEM>
#
if ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+(\"[^\"]+\")\s+\"([^\"]+)\"\s*>/i) {
    $xdoctype = $1;
    $dtdfile = $4;
    $sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM>/i
        unless ( -f $dtdfile );
    undef $xdoctype;
    undef $dtdfile;
}

# Sheesh. We need to do the same for:
#   <!DOCTYPE HTML SYSTEM "html.dtd">
#
# but NOT:
#   <!DOCTYPE HTML PUBLIC "-//Sun Micorsystems Corp.//DTD HotJava HTML//EN">
#
# or <!DOCTYPE HTML SYSTEM "http:...">
if ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+\"(http:\/\/[^\"]+)\"\s*>/i) {
    $xdoctype = $1;
    $dtdfile = $3;
    $sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM "$dtdfile">/i
        unless ( -f $dtdfile );
    undef $xdoctype;
    undef $dtdfile;
} elsif ($sgml =~ /^\s*<!DOCTYPE\s+(\S+)\s+(\S+)\s+\"([^-][^\"]+)\"\s*>/i) {
    $xdoctype = $1;
    $dtdfile = $3;
    $sgml =~ s/^\s*<!DOCTYPE[^>]+>/<!DOCTYPE $xdoctype SYSTEM>/i
        unless ( -f $dtdfile );
    undef $xdoctype;
    undef $dtdfile;
}

print T "<!DOCTYPE $doctype SYSTEM>\n" unless $sgml =~ /^\s*<!DOCTYPE/i;
print T $sgml;
close T;
$/="\n";

# Open the pipe from 'sgmls' for reading...
#
if ($syntax_check) {
    $Terr = &temperrnam;
    $cmd = "$sgmls_cmd -f $Terr -m $catalog $decl $T";
} else {
    $cmd = "$sgmls_cmd -f /dev/null -m $catalog $decl $T";
}
open (SGMLS, "$cmd|") || (unlink ($T), unlink ($Terr), die "$cmd: $!\n");
$foo = &parse_sgml ('NIL', ());
close SGMLS;

if (-s $Terr) {
    if (open(Terr,$Terr)) {
        print STDERR "\n";
        foreach (<Terr>) {
            chop;
            s/^[^:]*:[^:]*://;
            ($line_no, $eno, $e, $error_message) = split(':');
            print STDERR "SGML error in $ENV{SUMMARIZER_URL} at line $line_no: $error_message\n";
        }
    }
}

unlink ($Terr) unless ($debug);
unlink ($T) unless ($debug);

# CLEANUP ATT VALUES
#
foreach $k (keys %ATT) {
    $ATT{$k} =~ s/\\n/\n/g;     # change backslash-N to newline
    $ATT{$k} =~ s/\\011/\t/g;   # change escaped tabs
    $ATT{$k} =~ s/\n*\n/\n/g;   # remove blank lines
}