📄 hlink.l

/*
 * hlink.l -- Library for detecting hyper links in html file.
 *
 * Created: Xie Han, OS lab of Peking University. <me@pku.edu>
 *
 * Created: Oct 5 6:03am 2003. version 0.1.1
 *		# A framework was given.
 *
 * Updated: Oct 7 5:14am 2003. version 0.1.2
 *		# The original framework was completely discarded. New framework
 *		  came out.
 *		# All the links in a html file are merged with the base URI and
 *		  returned through a link list.
 *		# The link detecting is error-tolerant. When unexpected characters
 *		  occur the detector won't terminate but turns to some other state
 *		  and goes on scanning.
 *
 * Updated: Oct 8 5:15am 2003. version 0.1.3
 *		# The module has been compiled OK. Function "link_detect" is
 *		  expected to return a URI list. But I am so frustrated at the
 *		  fact that function "uri_rally_parse" does not work! And I even
 *		  suppose that all the good will with which "uri_rally_parse" is
 *		  designed is indeed not realistic, i.e., "uri_rally_parse" can
 *		  NOT be implemented. It seems that I should come up with some
 *		  other way :(
 *
 * Updated: Oct 9 4:50am 2003. version 0.2.1
 *		# Fix incorrect usages of "yy_push_state".
 *		# Because "uri_rally_parse" no longer exists, we call
 *		  "uri_parse_string" instead. And this makes the program structure
 *		  a little simpler.
 *		# Test program showed that "link_detect" works well, but the store
 *		  structure will definitely be changed greatly in the next version.
 *
 * Updated: Oct 10 4:52am 2003. version 1.0.0
 *		# "uri_parse_string" is substituted by "uri_parse_buffer", because
 *		  the latter need not copy the string.
 *		# As planned, from this version we use the super list to store
 *		  hyper links we detected in html files. Function "link_detect"
 *		  is renamed "hlink_detect", which takes 3 arguments: the first
 *		  argument is a list, to which the hyper links we found will be
 *		  appended; the second argument is a FILE pointer which points to
 *		  the html file; and the third is a struct uri pointer which
 *		  points to the URL of the html file.
 *		# "link_uri_list_destroy" is renamed "hlink_destroy".
 *
 * Updated: Oct 10 5:25am 2003. version 1.0.1
 *		# Fix a bug introduced by version 0.2.1: an illegal URI could be
 *		  truncated and its prefix accepted as a legal URI. Two states
 *		  are added.
 *
 * Updated: Oct 10 6:03am 2003. version 1.0.2
 *		# Slight bug found, fixed.
 *		# "yy_stack_push(URI)" in state "WAIT_HREF" is substituted by
 *		  "BEGIN URI", because state "URI" does not need to know what
 *		  the previous state is.
 *
 * Updated: Oct 11 3:43am 2003. version 1.0.3
 *		# I changed the way of ignoring an attribute value, which affects
 *		  nothing but makes the program clearer.
 *		# Kernel filter will be added. It will make the program amazing!
 *
 * Updated: Oct 11 4:30am 2003. version 1.2.0
 *		# Kernel filter is added. See the comment before "hlink_detect"
 *		  where the concept is explained in detail.
 *
 * Updated: Oct 11 6:23am 2003. version 1.2.1
 *		# The program structure is much clearer. States "ILLEGAL_xxx"
 *		  and state "PARSE_URI" are removed.
 *
 * Updated: Oct 12 3:58am 2003. version 1.2.2
 *		# Filter function from now on can return a negative number
 *		  indicating that an error occurred in it, and in this case the
 *		  detecting process will terminate with a failure.
 *		# Meaningless blanks preceding and following a URI in quotes are
 *		  allowed (like <a href="  http://www.pku.edu  ">).
 *		# "LINK" element in html files will be processed and treated
 *		  the same as "A" element.
 *
 * Updated: Oct 13 5:25am 2003. version 1.2.3
 *		# Attribute names are now case-insensitive. Thus, "HreF=" is OK.
 *		# A "cross boundary" bug found: when checking whether an
 *		  attribute name is "href" I wrote such code:
 *		  if (memcmp(yytext, "href") == 0 && ...)
 *		  It's dangerous because I did not check whether yytext has 4
 *		  characters. This bug has been fixed and I will keep it in mind.
 *
 * Updated: Oct 15 8:21am 2003. version 1.2.4
 *		# A slight change in "HLINK_LIST_DESTROY_LAST_N" macro. Version
 *		  1.2.3 has proved to be very stable.
 *
 * Updated: Oct 17 4:42am 2003. version 1.3.0
 *		# The change is slight but incompatible. One more argument is
 *		  added to the "hlink_detect" function. See the comment before
 *		  "hlink_detect" for detail.
 *
 * Updated: Oct 17 9:11am 2003. version 1.3.1
 *		# I changed the way of ignoring an attribute value to make the
 *		  program shorter and clearer.
 *		# The change from 1.2.x to 1.3.x is so slight that 1.3.x is not
 *		  likely to be unstable. It's expected to be updated into 1.4.0
 *		  in a short time.
 *		# "HLINK_LIST_DESTROY_LAST_N" macro is changed to a function named
 *		  "__hlink_destroy_last_n", static, intended for internal usage
 *		  only.
 *
 * Updated: Oct 18 1:37pm 2003. version 1.4.0
 *		# We needn't turn all the blanks that follow a quoted URI into
 *		  '\0'. Just the last 2 are needed. So we do.
 *		# Version 1.3.x is tested OK and we update it into 1.4.x.
 */

blank				[ \t\n]|\r\n
name				[A-Za-z][A-Za-z0-9\-_:.]*

%option stack

%s ELEMENT ELEMENT_BASE ELEMENT_A WAIT_HREF HREF_OK NO_HREF
%s IGNORE IGNORE_QUOTED IGNORE_NOT_QUOTED URI URI_QUOTED URI_NOT_QUOTED

%{
#include <stdio.h>
#include <stdlib.h>		/* malloc, free */
#include <string.h>
#include <ctype.h>
#include <list.h>
#include <uri.h>
#include "hlink.h"

#define HLINK_ISBLANK(c) \
	((c) == ' ' || (c) == '\r' || (c) == '\n' || (c) == '\t')

static void __hlink_destroy_last_n(struct list_head *head, int n);

static struct list_head *__head;
static struct uri __base_uri;
static struct uri __uri;
static int __is_our_base;
static int __nhlinks;
static int (*__filter)(const struct uri *, void *);
static void *__context;
%}

%%

<INITIAL><			BEGIN ELEMENT;
<INITIAL>.|\n
<INITIAL><<EOF>>	{
	if (__is_our_base)
		uri_destroy(&__base_uri);
	return __nhlinks;
}

<ELEMENT>{name}		{
	int i;

	/* Element names are case-insensitive. */
	for (i = 0; i < yyleng; i++)
		yytext[i] = toupper(yytext[i]);

	if (strcmp(yytext, "A") == 0 || strcmp(yytext, "LINK") == 0)
	{
		BEGIN ELEMENT_A;
		yy_push_state(WAIT_HREF);
	}
	else if (strcmp(yytext, "BASE") == 0)
	{
		BEGIN ELEMENT_BASE;
		yy_push_state(WAIT_HREF);
	}
	else
		BEGIN INITIAL;
}
<ELEMENT>{blank}+
<ELEMENT>.|\n		|
<ELEMENT><<EOF>>	{
	yyless(0);
	BEGIN INITIAL;
}

<WAIT_HREF>{name}{blank}*={blank}*	{
	int i;

	/* Attribute names are case-insensitive. Only the first 4 characters
	 * matter here, and we never read past the matched text. */
	for (i = 0; i < yyleng && i < 4; i++)
		yytext[i] = toupper(yytext[i]);

	if (i == 4 && memcmp(yytext, "HREF", 4) == 0 &&
							(yytext[4] == '=' || HLINK_ISBLANK(yytext[4])))
		BEGIN URI;
	else
		yy_push_state(IGNORE);
}
	/* An "href" attribute has been found. All the following attributes will
	   be ignored. */
<HREF_OK>{name}{blank}*={blank}*	yy_push_state(IGNORE);
<WAIT_HREF,HREF_OK>{blank}+
<WAIT_HREF>.|\n		|
<WAIT_HREF><<EOF>>	{
	yyless(0);
	yy_pop_state();
	BEGIN NO_HREF;
}
<HREF_OK>.|\n		|
<HREF_OK><<EOF>>	{
	yyless(0);
	yy_pop_state();
}

<NO_HREF>>			BEGIN INITIAL;
<NO_HREF>.|\n		|
<NO_HREF><<EOF>>	{
	yyless(0);
	BEGIN INITIAL;
}

<ELEMENT_A>>		{
	struct hlink *entry;
	int n = -1;

	if ((entry = (struct hlink *)malloc(sizeof (struct hlink))) != NULL)
		n = uri_merge(&entry->uri, &__uri, &__base_uri);
	uri_destroy(&__uri);

	if (entry)
	{
		if (n >= 0)
		{
			/* Filter function is called. */
			if ((n = __filter(&entry->uri, __context)) > 0)
			{
				list_add_tail(&entry->list, __head);
				__nhlinks++;
			}
			else
			{
				uri_destroy(&entry->uri);
				free(entry);
			}

			if (n >= 0)
			{
				BEGIN INITIAL;
				YY_BREAK;
			}
		}
		else
			free(entry);
	}

	/* Failed! We should clean up what we've added. Possibilities of
	 * failure: failed to allocate memory for "entry"; failed to
	 * merge the relative URI with the base URI; filter function
	 * returned a negative number. The first 2 are not likely to happen. */
	__hlink_destroy_last_n(__head, __nhlinks);
	if (__is_our_base)
		uri_destroy(&__base_uri);
	return -1;
}

<ELEMENT_BASE>>		{
	if (__is_our_base)
		uri_destroy(&__base_uri);
	else
		__is_our_base = 1;

	__base_uri = __uri;
	BEGIN INITIAL;
}

<ELEMENT_A,ELEMENT_BASE>.|\n	|
<ELEMENT_A,ELEMENT_BASE><<EOF>>	{
	uri_destroy(&__uri);
	yyless(0);
	BEGIN INITIAL;
}

<IGNORE>\"{blank}*	BEGIN IGNORE_QUOTED;
<IGNORE>.|\n		|
<IGNORE><<EOF>>		{
	yyless(0);
	BEGIN IGNORE_NOT_QUOTED;
}

<IGNORE_QUOTED>[^"<>]
<IGNORE_QUOTED>\"	yy_pop_state();
<IGNORE_NOT_QUOTED>[^"<> \t\r\n]
<IGNORE_QUOTED,IGNORE_NOT_QUOTED>.|\n		|
<IGNORE_QUOTED,IGNORE_NOT_QUOTED><<EOF>>	{
	yyless(0);
	yy_pop_state();
}

<URI>\"{blank}*		BEGIN URI_QUOTED;
<URI>.|\n			|
<URI><<EOF>>		{
	yyless(0);
	BEGIN URI_NOT_QUOTED;
}

<URI_QUOTED>[^"<>]*\"						|
<URI_NOT_QUOTED>[^"<> \t\r\n]*["<> \t\r\n]	{
	int len;
	char c;

	/* Preserve the last character. If the current state is URI_NOT_QUOTED,
	 * this character will be put back into the input. */
	c = yytext[yyleng - 1];

	/* If the URI is in quotes, we trim the blanks following the URI
	 * in order to check whether it's a legal URI. */
	while (yyleng > 1 && HLINK_ISBLANK(yytext[yyleng - 2]))
		yyleng--;

	/* Last two characters must be "\0". */
	yytext[yyleng] = yytext[yyleng - 1] = YY_END_OF_BUFFER_CHAR;

	/* Calling "uri_parse_buffer" instead of "uri_parse_string" because
	 * the former doesn't copy memory. */
	if ((len = uri_parse_buffer(&__uri, yytext, yyleng + 1)) >= 0)
	{
		if (YY_START == URI_NOT_QUOTED)
			unput(c);

		if (len == yyleng - 1)
			BEGIN HREF_OK;
		else
		{
			/* Illegal URI! */
			uri_destroy(&__uri);
			BEGIN WAIT_HREF;
		}
	}
	else
	{
		yy_pop_state();
		__hlink_destroy_last_n(__head, __nhlinks);
		if (__is_our_base)
			uri_destroy(&__base_uri);
		return -1;
	}
}

<URI_QUOTED>[<>]					|
<URI_QUOTED,URI_NOT_QUOTED><<EOF>>	{
	yyless(0);
	BEGIN WAIT_HREF;
}
<URI_QUOTED,URI_NOT_QUOTED>.|\n

%%

int yywrap(void)
{
	return 1;
}

/* Remove and free the last n entries of the list. */
static void __hlink_destroy_last_n(struct list_head *head, int n)
{
	struct hlink *entry;
	struct list_head *last;

	for (; n > 0; n--)
	{
		last = head->prev;
		entry = list_entry(last, struct hlink, list);
		list_del(last);
		uri_destroy(&entry->uri);
		free(entry);
	}
}

/*
 * hlink_detect -- Function detecting hyper links in a html file.
 *
 * This is the most important function of this module. It takes 5 arguments:
 * @struct list_head *head
 *		The list where we will store hyper links we find, which needn't be
 *		initially empty. All the hyper links we find will be added to the
 *		tail of it.
 * @FILE *page_file
 *		A FILE pointer that points to the page file to be scanned.
 * @const struct uri *page_uri
 *		A struct uri pointer that points to the URI of the page file to be
 *		scanned. It can NOT be a NULL pointer. Every page file has a URI;
 *		even an empty URI can be parsed into a URI struct. Note we use URI
 *		but not URL. URI is a superset of URL.
 * @int (*filter)(const struct uri *, void *)
 *		A function pointer that points to the custom filter function. Only
 *		URIs that pass the filter function will be added into the list.
 *		The second argument is a pointer that points to the context of the
 *		filter function.
 *		For example, we need only ftp URIs with host "ftp.pku.edu.cn" and
 *		mailto URIs. The filter function can be:
 *		int my_filter(const struct uri *uri, void *context)
 *		{
 *			if (strcasecmp(uri->scheme, "ftp") == 0)
 *			{
 *				if (uri->authority_type == AT_SERVER &&
 *						strcasecmp(uri->host, "ftp.pku.edu.cn") == 0)
 *					return 1;
 *			}
 *			else if (strcasecmp(uri->scheme, "mailto") == 0)
 *				return 1;
 *
 *			return 0;
 *		}
 *
 *		...
 *		hlink_detect(&list, file, &uri, my_filter, NULL);
 *		...
 *
 *		Then, only URIs such as "ftp://name:pass@ftp.pku.edu.cn:21/pub/",
 *		"mailto:webmaster@pku.edu.cn" will be added into the list. All
 *		the http URIs will be filtered out.
 *
 *		Another example: we need only URIs whose scheme name is the same
 *		as the string that "context" points to. The following code may
 *		meet your need:
 *		int my_filter(const struct uri *uri, void *context)
 *		{
 *			if (strcasecmp(uri->scheme, (char *)context) == 0)
 *				return 1;
 *			else
 *				return 0;
 *		}
 *
 *		A filter function can be very large, and if an error occurs in the
 *		filter function, it can return a negative number to signal the
 *		detecting function, which in this case will stop the scanning
 *		process, clean up what it has added and end with -1.
 *
 *		An extreme case: the filter function always returns 0 or a
 *		negative number. Then it becomes an "event driven" programming
 *		model. The "event" is that a hyper link is detected (recall
 *		windows programming: "button clicked" is an event), and the filter
 *		function is the code to be executed when the event occurs. In this
 *		case, the "head" argument can be NULL.
 *
 *		"filter" is introduced in version 1.1.0. Its original design
 *		does not have a "context".
 *
 * @void *context
 *		The context that you will pass to the filter function. If the
 *		filter function does not need a context, this argument can be NULL.
 *
 *		"context" is introduced in version 1.3.0.
 *
 * "hlink_detect" returns the number of hyper links we found and
 * added (not filtered out) into the list. Returning -1 indicates a failure,
 * the possible causes of which are a failure to allocate memory, or
 * the filter function returning a negative number.
 */
int hlink_detect(struct list_head *head, FILE *page_file,
				 const struct uri *page_uri,
				 int (*filter)(const struct uri *, void *), void *context)
{
	yyin = page_file;
	__base_uri = *page_uri;
	__is_our_base = 0;
	__head = head;
	__filter = filter;
	__nhlinks = 0;
	__context = context;
	BEGIN INITIAL;
	return yylex();
}

/* Remove and free every entry of the list, leaving it empty. */
void hlink_destroy(struct list_head *head)
{
	struct hlink *entry;
	struct list_head *first;

	while ((first = head->next) != head)
	{
		entry = list_entry(first, struct hlink, list);
		list_del(first);
		uri_destroy(&entry->uri);
		free(entry);
	}
}
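
The filter contract described in the comment before "hlink_detect" (positive result keeps a link, 0 drops it, negative aborts the scan with -1) can be tried out in isolation. The following is only a minimal sketch under stated assumptions: "struct demo_uri", "equals_ignore_case", "scheme_filter" and "count_kept" are hypothetical stand-ins invented for illustration. They are not part of uri.h or hlink.h, and the real "struct uri" layout may differ.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the library's struct uri (real one is in uri.h). */
struct demo_uri {
	const char *scheme;
	const char *host;
};

/* Case-insensitive equality for ASCII strings, without relying on
 * the POSIX-only strcasecmp. Returns 1 if equal, 0 otherwise. */
static int equals_ignore_case(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;	/* equal only if both strings ended together */
}

/* Filter in the style described above: "context" names the wanted scheme.
 * Returns 1 to keep the URI, 0 to drop it, -1 to signal an error. */
static int scheme_filter(const struct demo_uri *uri, void *context)
{
	const char *wanted = context;

	if (uri == NULL || uri->scheme == NULL || wanted == NULL)
		return -1;
	return equals_ignore_case(uri->scheme, wanted) ? 1 : 0;
}

/* Mimics how the scanner is described to react to the filter result:
 * count URIs with a positive result, skip zeros, and give up with -1
 * as soon as the filter reports an error. */
static int count_kept(const struct demo_uri *uris, size_t n,
		int (*filter)(const struct demo_uri *, void *), void *context)
{
	int kept = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int r = filter(&uris[i], context);

		if (r < 0)
			return -1;
		if (r > 0)
			kept++;
	}
	return kept;
}
```

Passing the wanted scheme through "context" keeps the filter free of global state, which is the reason the comment gives for adding that argument in version 1.3.0.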