📄 collect.c

📁 harvest is a robot that downloads HTML web pages
💻 C
📖 Page 1 of 2
static char rcsid[] = "$Id: collect.c,v 2.1 1997/03/21 19:21:53 sxw Exp $";
/*
 *  collect.c - Process the requests of a collector client
 *
 *  DEBUG: none
 *  AUTHOR: Harvest derived
 *
 *  Harvest Indexer http://www.tardis.ed.ac.uk/harvest/
 *  ---------------------------------------------------
 *
 *  The Harvest Indexer is a continued development of code developed by
 *  the Harvest Project. Development is carried out by numerous individuals
 *  in the Internet community, and is not officially connected with the
 *  original Harvest Project or its funding sources.
 *
 *  Please mail harvest@tardis.ed.ac.uk if you are interested in participating
 *  in the development effort.
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */
/*  ----------------------------------------------------------------------
 *  Copyright (c) 1994, 1995.  All rights reserved.
 *
 *    The Harvest software was developed by the Internet Research Task
 *    Force Research Group on Resource Discovery (IRTF-RD):
 *
 *          Mic Bowman of Transarc Corporation.
 *          Peter Danzig of the University of Southern California.
 *          Darren R. Hardy of the University of Colorado at Boulder.
 *          Udi Manber of the University of Arizona.
 *          Michael F. Schwartz of the University of Colorado at Boulder.
 *          Duane Wessels of the University of Colorado at Boulder.
 *
 *    This copyright notice applies to software in the Harvest
 *    ``src/'' directory only.  Users should consult the individual
 *    copyright notices in the ``components/'' subdirectories for
 *    copyright information about other software bundled with the
 *    Harvest source code distribution.
 *
 *  TERMS OF USE
 *
 *    The Harvest software may be used and re-distributed without
 *    charge, provided that the software origin and research team are
 *    cited in any use of the system.  Most commonly this is
 *    accomplished by including a link to the Harvest Home Page
 *    (http://harvest.cs.colorado.edu/) from the query page of any
 *    Broker you deploy, as well as in the query result pages.  These
 *    links are generated automatically by the standard Broker
 *    software distribution.
 *
 *    The Harvest software is provided ``as is'', without express or
 *    implied warranty, and with no support nor obligation to assist
 *    in its use, correction, modification or enhancement.  We assume
 *    no liability with respect to the infringement of copyrights,
 *    trade secrets, or any patents, and are not responsible for
 *    consequential damages.  Proper use of the Harvest software is
 *    entirely the responsibility of the user.
 *
 *  DERIVATIVE WORKS
 *
 *    Users may make derivative works from the Harvest software, subject
 *    to the following constraints:
 *
 *      - You must include the above copyright notice and these
 *        accompanying paragraphs in all forms of derivative works,
 *        and any documentation and other materials related to such
 *        distribution and use acknowledge that the software was
 *        developed at the above institutions.
 *
 *      - You must notify IRTF-RD regarding your distribution of
 *        the derivative work.
 *
 *      - You must clearly notify users that your are distributing
 *        a modified version and not the original Harvest software.
 *
 *      - Any derivative product is also subject to these copyright
 *        and use restrictions.
 *
 *    Note that the Harvest software is NOT in the public domain.  We
 *    retain copyright, as specified above.
 *
 *  HISTORY OF FREE SOFTWARE STATUS
 *
 *    Originally we required sites to license the software in cases
 *    where they were going to build commercial products/services
 *    around Harvest.  In June 1995 we changed this policy.  We now
 *    allow people to use the core Harvest software (the code found in
 *    the Harvest ``src/'' directory) for free.  We made this change
 *    in the interest of encouraging the widest possible deployment of
 *    the technology.  The Harvest software is really a reference
 *    implementation of a set of protocols and formats, some of which
 *    we intend to standardize.  We encourage commercial
 *    re-implementations of code complying to this set of standards.
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <sys/types.h>
#include <ctype.h>
#include <time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <gdbm.h>
#include "util.h"
#include "template.h"

/* Global variables */
extern char *dbfile;
extern char *indexfile;
extern char *allzipped;
extern char *cmd_gzip;
extern char *allow_hosts[];
extern char *deny_hosts[];
extern int allow_all;
extern int deny_all;

/* number of seconds allowed for an idle client */
#ifndef MAX_TIMEOUT
#define MAX_TIMEOUT             300
#endif

/* Protocol Messages */
#define WELCOME_OK              "000 - HELLO 0.2.3 %s [port %d] - are you %s?\n"
#define WELCOME_UNKNOWN_CMD     "001 - Unknown Command: %s\n"
#define WELCOME_UNIMPL_CMD      "002 - Unimplemented Command\n"
#define WELCOME_ACCESS_DENIED   "003 - Access Denied for %s\n"
#define WELCOME_NOREVIP         "004 - Warning: %s has no reverse DNS pointer.\n"
#define WELCOME_INTERR          "005 - Sorry, this Gatherer has a fatal internal error.\n"

#define HELLO_OK                "100 - Pleased to meet you %s\n"
#define HELLO_INVALID           "101 - Invalid Usage - HELLO <hostname>\n"
#define HELLO_MISMATCH          "102 - Warning: DNS told me %s, not %s\n"

#define OBJ_OK                  "300 - Sending Object %s\n"
#define OBJ_INVALID             "301 - Invalid Object %s\n"
#define OBJ_DONE                "399 - Sent Object %s (%d bytes)\n"

#define UPDATE_OK               "400 - Sending all Object Descriptions since %d\n"
#define UPDATE_INVALID          "401 - Invalid Usage - SEND-UPDATE <timestamp>\n"
#define UPDATE_DONE             "499 - Sent %d Object Descriptions (%d bytes)\n"

#define SET_OK                  "500 - Set mode: %s\n"
#define SET_INVALID             "501 - Invalid Usage - SET <mode>\n"

#define INFO_OK                 "600 - Finished sending INFO\n"
#define INFO_INVALID            "601 - Invalid Usage - INFO\n"
#define INFO_UNAVAIL            "602 - INFO is unavailable for this Gatherer.\n"

#define GOODBYE_OK              "999 - Later, %s. %d bytes transmitted.\n"

#define HELP_OK "\
200 - List of Available Commands:\n\
200 - HELLO <hostname>         - Friendly Greeting\n\
200 - HELP                     - This message\n\
200 - INFO                     - Information about Gatherer\n\
200 - SEND-OBJECT <oid>        - Send an Object Description\n\
200 - SEND-UPDATE <timestamp>  - Send all Object Descriptions that\n\
200 -                            have been changed/created since timestamp\n\
200 - SET compression          - Enable GNU zip compressed transfers\n\
200 - QUIT                     - Close session\n"

/* Local Functions */
void Tolower _PARAMS((char *));
static void die _PARAMS((int));
static void die_msg _PARAMS((int));
static void print_welcome _PARAMS((int));
static void send_msg_to_client _PARAMS((int, char *, int));
static void send_data_to_client _PARAMS((int, char *, int));
static void init_compression _PARAMS((int));
static void send_allzipped _PARAMS((int));
static int finish_compression _PARAMS((int));
static int access_denied _PARAMS((char *));
static int process_command _PARAMS((int, char *));
static int process_hello _PARAMS((int, char *));
static int process_info _PARAMS((int));
static int process_update _PARAMS((int, char *));
static int process_set _PARAMS((int, char *));
static int send_all _PARAMS((int));
static int send_selected_index _PARAMS((int, time_t));
static int send_update _PARAMS((int, time_t));
static int strend_match _PARAMS((char *, char *));
static int send_object _PARAMS((int, char *));

/* Local macros */
#define cmdcmp(s)       strncasecmp(cmd, (s), strlen(s))

/* Local variables */
static int dead = 0;                /* is the client dead? */
static char *remote_host = NULL;    /* full DNS name of remote host */
static char *this_host = NULL;      /* full DNS name of local host */
static int do_compress = 0;         /* mode == compression? */
static int topipe[2];               /* pipe for compression */
static int nxmit = 0;               /* # of bytes transmitted */
static int nobjs = 0;               /* # of objects transmitted */

/*
 *  serve_client() - Processes all of the client's requests.  Returns
 *  the exit code of the UNIX process that's serving the client.
 *  Remember that if gatherd is run from inetd then fd is stdin/stdout.
 */
int serve_client(fd)
     int fd;
{
    static char buf[BUFSIZ];
    char *s = NULL;
    int nread;
    FILE *fp = NULL;

    if ((s = getfullhostname()) == NULL) {
        errorlog("getfullhostname returned NULL\n");
        exit(1);
    }
    this_host = strdup(s);
    dead = 0;
    nxmit = 0;
    print_welcome(fd);

    if ((fp = fdopen(fd, "r")) == NULL) {
        log_errno("fdopen");
        exit(1);
    }
    signal(SIGALRM, die);
    signal(SIGINT, die_msg);
    signal(SIGTERM, die_msg);

    while (1) {
        alarm(MAX_TIMEOUT);
        buf[0] = '\0';
        fgets(buf, BUFSIZ, fp);
        nread = strlen(buf);
        alarm(0);
        if (dead) {
            errorlog("Client died and never said goodbye...\n");
            close(fd);
            return (1);
        } else if (nread < 0) {
            log_errno("read");
            close(fd);
            return (1);
        } else if (nread == 0) {
            break;
        }
        buf[nread] = '\0';
        if (nread > 1) {        /* strip trailing \r or \n */
            if (isspace(buf[nread - 2]))
                buf[nread - 2] = '\0';
        }
        if (isspace(buf[nread - 1]))
            buf[nread - 1] = '\0';
        if (process_command(fd, buf)) {
            close(fd);
            return (0);
        }
    }
    fclose(fp);
    close(fd);
    return (0);
}

/*
 *  process_command() - Executes one of the supported commands.  Returns
 *  non-zero on error (which causes the server to terminate the session);
 *  otherwise, returns zero.
 */
static int process_command(s, cmd)
     int s;
     char *cmd;
{
    static char buf[BUFSIZ];
    char *p = NULL;

    if (!cmdcmp("hello")) {
        return (process_hello(s, cmd));
    } else if (!cmdcmp("set")) {
        return (process_set(s, cmd));
    } else if (!cmdcmp("send-update")) {
        return (process_update(s, cmd));
    } else if (!cmdcmp("info")) {
        return (process_info(s));
    } else if (!cmdcmp("help")) {
        sprintf(buf, HELP_OK);
    } else if (!cmdcmp("send-object")) {
        return (send_object(s, cmd));
    } else if (!cmdcmp("quit") ||
               !cmdcmp("exit") ||
               !cmdcmp("bye")) {
        sprintf(buf, GOODBYE_OK, remote_host, nxmit);
        send_msg_to_client(s, buf, 1);
        return (1);
    } else {                    /* Unknown command */
        if ((p = strrchr(cmd, '\n')) != NULL)
            *p = '\0';
        sprintf(buf, WELCOME_UNKNOWN_CMD, cmd);
    }
    send_msg_to_client(s, buf, 0);
    return (0);
}

static void die(sig_unused)
     int sig_unused;
{
    dead = 1;
}

static void die_msg(x)
     int x;
{
    dead = 1;
    errorlog("Dying with signal %d...\n", x);
    exit(x);
}

/*
 *  print_welcome() - Says hello to the client.  Checks to make sure that
 *  the client has access.
 */
static void print_welcome(s)
     int s;
{
    struct sockaddr_in sin;
    int slen;
    int getpeername();
    struct hostent *hp = NULL;
    static char buf[BUFSIZ];

    slen = sizeof(sin);
    if (getpeername(s, (struct sockaddr *) &sin, &slen) < 0) {
        log_errno("getpeername");
        sprintf(buf, WELCOME_INTERR);
        send_msg_to_client(s, buf, 1);
        exit(1);
    }
    if ((hp = gethostbyaddr((char *) &sin.sin_addr,
                sizeof(struct in_addr), AF_INET)) == NULL) {
        /* Do not write anything to the client until WELCOME_OK */
        Log(WELCOME_NOREVIP, inet_ntoa(sin.sin_addr));
        remote_host = strdup(inet_ntoa(sin.sin_addr));
    } else {
        remote_host = strdup(hp->h_name);
        Tolower(remote_host);
    }
    if (access_denied(remote_host)) {
        sprintf(buf, WELCOME_ACCESS_DENIED, remote_host);
        send_msg_to_client(s, buf, 1);
        exit(1);
    }
    slen = sizeof(sin);
    if (getsockname(s, (struct sockaddr *) &sin, &slen) < 0) {
        log_errno("getsockname");
        sprintf(buf, WELCOME_INTERR);
        send_msg_to_client(s, buf, 1);
        exit(1);
    }
    sprintf(buf, WELCOME_OK, this_host, ntohs(sin.sin_port), remote_host);
    send_msg_to_client(s, buf, 1);
}

/*
 *  process_hello() - Processes the HELLO command
 */
static int process_hello(s, cmd)
     int s;
     char *cmd;
{
    static char buf[BUFSIZ];
    char *p = NULL;

    (void) strtok(cmd, " \t\n");        /* ignore HELLO */
    p = strtok(NULL, " \t\n");          /* grab hostname */
    if (p == NULL) {
        sprintf(buf, HELLO_INVALID);
        send_msg_to_client(s, buf, 0);
    } else if (!strncasecmp(remote_host, "localhost", 9)) {
        sprintf(buf, HELLO_OK, remote_host);
        send_msg_to_client(s, buf, 1);
#ifdef SEND_IP_HOSTNAME_MISMATCH_WARNING
    } else if (strcasecmp(p, remote_host)) {
        /* Warning message suppressed--some people don't know the
         * difference between a warning and an error.
         */
        sprintf(buf, HELLO_MISMATCH, remote_host, p);
        send_msg_to_client(s, buf, 1);
#endif
    } else {
        sprintf(buf, HELLO_OK, remote_host);
        send_msg_to_client(s, buf, 1);
    }
    return (0);
}

/*
 *  process_info() - Processes the INFO command
 */
static int process_info(s)
     int s;
{
    FILE *fp = NULL;
    static char buf[BUFSIZ];
    int n;
    extern char *infofile;

    /* If no info file is configured then send error */
    if (infofile == NULL) {
        sprintf(buf, INFO_UNAVAIL);
        send_msg_to_client(s, buf, 0);
        return (0);
    }
    /* If the info file cannot be opened then send error */
    if ((fp = fopen(infofile, "r")) == NULL) {
        sprintf(buf, INFO_UNAVAIL);
        send_msg_to_client(s, buf, 0);
        return (0);
    }
    /* send the INFO.soif file to the client */
    while ((n = fread(buf, 1, BUFSIZ - 1, fp)) > 0)
        send_data_to_client(s, buf, n);
    fclose(fp);

    /* Everything went ok */
    sprintf(buf, INFO_OK);
    send_msg_to_client(s, buf, 1);
    return (0);
}

/*
 *  process_update() - Processes the SEND-UPDATE command
 */
static int process_update(s, cmd)
     int s;
     char *cmd;
{
    static char buf[BUFSIZ];
    char *p = NULL;
    char *q = NULL;
    time_t timestamp;

    (void) strtok(cmd, " \t\n");        /* ignore SEND-UPDATE */
    p = strtok(NULL, " \t\n");
    if (p == NULL) {
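
Note on the listing: the read loop in serve_client() strips the line ending by testing buf[nread - 2] and buf[nread - 1] with isspace(). As written, that appears to cut the command at buf[nread - 2] whenever that byte is whitespace, even if the line did not end in a newline (for example a final "a b" before EOF). The sketch below shows one possible stricter way to drop only the line terminator. The helper name strip_line_ending() is hypothetical and does not exist in collect.c or elsewhere in Harvest; it is an illustration under that assumption, not the project's own fix.

```c
#include <stddef.h>
#include <string.h>

/*
 * strip_line_ending() - hypothetical helper, NOT part of collect.c.
 * Removes at most one trailing "\n", "\r\n", or "\r" from s in place
 * and returns the new length.  Any other trailing whitespace is kept.
 */
static size_t strip_line_ending(char *s)
{
    size_t n = strlen(s);

    /* Drop the newline first, if there is one. */
    if (n > 0 && s[n - 1] == '\n')
        s[--n] = '\0';
    /* Then drop a carriage return left over from a CRLF pair. */
    if (n > 0 && s[n - 1] == '\r')
        s[--n] = '\0';
    return n;
}
```

In a loop like the one in serve_client(), such a helper would be called on buf right after a successful fgets(), and its return value could stand in for nread. Because every index is checked against n > 0 first, an empty buffer is handled without reading before the start of the array.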
