/*
 * hlink.l -- Library for detecting hyper links in an html file.
 *
 * Created: Xie Han, OS lab of Peking University. <me@pku.edu>
 *
 * Created: Oct 5 6:03am 2003. version 0.1.1
 *     # A framework was given.
 *
 * Updated: Oct 7 5:14am 2003. version 0.1.2
 *     # The original framework was completely discarded. A new framework
 *       came out.
 *     # All the links in an html file are merged with the base URI and
 *       returned through a link list.
 *     # The link detecting is error-tolerant. When unexpected characters
 *       occur the detector won't terminate but turns to some other state
 *       and goes on scanning.
 *
 * Updated: Oct 8 5:15am 2003. version 0.1.3
 *     # The module has been compiled OK. Function "link_detect" is
 *       expected to return a URI list. But I am so frustrated at the
 *       fact that function "uri_rally_parse" does not work! And I even
 *       suppose that all the good intentions behind the design of
 *       "uri_rally_parse" are indeed not realistic, i.e.,
 *       "uri_rally_parse" can NOT be implemented. It seems that I should
 *       come up with some other way :(
 *
 * Updated: Oct 9 4:50am 2003. version 0.2.1
 *     # Fix incorrect usages of "yy_push_state".
 *     # Because "uri_rally_parse" no longer exists, we call
 *       "uri_parse_string" instead. And this makes the program structure
 *       a little simpler.
 *     # Test program showed that "link_detect" works well, but the storage
 *       structure will definitely be changed greatly in the next version.
 *
 * Updated: Oct 10 4:52am 2003. version 1.0.0
 *     # "uri_parse_string" is substituted by "uri_parse_buffer", because
 *       the latter need not copy the string.
 *     # As planned, from this version we use the super list to store
 *       hyper links we detected in html files. Function "link_detect"
 *       is renamed "hlink_detect", which takes 3 arguments: the first
 *       argument is a list, to which the hyper links we found will be
 *       appended; the second argument is a FILE pointer which points to
 *       the html file; and the third is a struct uri pointer which
 *       points to the URL of the html file.
 *     # "link_uri_list_destroy" is renamed "hlink_destroy".
 *
 * Updated: Oct 10 5:25am 2003. version 1.0.1
 *     # Fix a bug introduced by version 0.2.1: an illegal URI can be
 *       truncated and its prefix accepted as a legal URI. Two states
 *       are added.
 *
 * Updated: Oct 10 6:03am 2003. version 1.0.2
 *     # Slight bug found, fixed.
 *     # "yy_stack_push(URI)" in state "WAIT_HREF" is substituted by
 *       "BEGIN URI", because state "URI" does not need to know what
 *       the previous state is.
 *
 * Updated: Oct 11 3:43am 2003. version 1.0.3
 *     # I changed the way of ignoring an attribute value, which affects
 *       nothing but makes the program clearer.
 *     # Kernel filter will be added. It will make the program amazing!
 *
 * Updated: Oct 11 4:30am 2003. version 1.2.0
 *     # Kernel filter is added. See the comment before "hlink_detect"
 *       where the concept is explained in detail.
 *
 * Updated: Oct 11 6:23am 2003. version 1.2.1
 *     # The program structure is much clearer. States "ILLEGAL_xxx"
 *       and state "PARSE_URI" are removed.
 *
 * Updated: Oct 12 3:58am 2003. version 1.2.2
 *     # The filter function can from now on return a negative number to
 *       indicate that an error occurred in it, and in this case the
 *       detecting process will terminate with a failure.
 *     # Meaningless blanks preceding and following a URI in quotes are
 *       allowed (like <a href=" http://www.pku.edu ">).
 *     # "LINK" elements in html files will be processed and treated
 *       the same as "A" elements.
 *
 * Updated: Oct 13 5:25am 2003. version 1.2.3
 *     # Attribute names are now case-insensitive. Thus, "HreF=" is OK.
 *     # A "cross boundary" bug found: when checking whether an
 *       attribute name is "href" I wrote such code:
 *           if (memcmp(yytext, "href") == 0 && ...)
 *       It's dangerous because I did not check whether yytext has 4
 *       characters. This bug has been fixed and I will keep it in mind.
 *
 * Updated: Oct 15 8:21am 2003. version 1.2.4
 *     # A slight change in the "HLINK_LIST_DESTROY_LAST_N" macro. Version
 *       1.2.3 has proved to be very stable.
 *
 * Updated: Oct 17 4:42am 2003. version 1.3.0
 *     # The change is slight but incompatible. One more argument is
 *       added to the "hlink_detect" function. See the comment before
 *       "hlink_detect" for details.
 *
 * Updated: Oct 17 9:11am 2003. version 1.3.1
 *     # I changed the way an attribute value is ignored to make the
 *       program shorter and clearer.
 *     # The change from 1.2.x to 1.3.x is so slight that 1.3.x is not
 *       likely to be unstable. It is expected to be updated to 1.4.0
 *       in a short time.
 *     # The "HLINK_LIST_DESTROY_LAST_N" macro is changed to a static
 *       function named "__hlink_destroy_last_n", intended for internal
 *       usage only.
 *
 * Updated: Oct 18 1:37pm 2003. version 1.4.0
 *     # We needn't turn all the blanks that follow a quoted URI into
 *       '\0'. Just the last 2 are needed. So we do.
 *     # Version 1.3.x is tested OK and we update it to 1.4.x.
 */

blank       [ \t\n]|\r\n
name        [A-Za-z][A-Za-z0-9\-_:.]*

%option stack

%s ELEMENT ELEMENT_BASE ELEMENT_A WAIT_HREF HREF_OK NO_HREF
%s IGNORE IGNORE_QUOTED IGNORE_NOT_QUOTED URI URI_QUOTED URI_NOT_QUOTED

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <list.h>
#include <uri.h>
#include "hlink.h"

#define HLINK_ISBLANK(c) \
    ((c) == ' ' || (c) == '\r' || (c) == '\n' || (c) == '\t')

static void __hlink_destroy_last_n(struct list_head *head, int n);

static struct list_head *__head;
static struct uri __base_uri;
static struct uri __uri;
static int __is_our_base;
static int __nhlinks;
static int (*__filter)(const struct uri *, void *);
static void *__context;
%}

%%

<INITIAL><                      BEGIN ELEMENT;
<INITIAL>.|\n
<INITIAL><<EOF>>                {
    if (__is_our_base)
        uri_destroy(&__base_uri);
    return __nhlinks;
}

<ELEMENT>{name}                 {
    int i;

    /* Element names are case-insensitive. */
    for (i = 0; i < yyleng; i++)
        yytext[i] = toupper(yytext[i]);

    if (strcmp(yytext, "A") == 0 || strcmp(yytext, "LINK") == 0)
    {
        BEGIN ELEMENT_A;
        yy_push_state(WAIT_HREF);
    }
    else if (strcmp(yytext, "BASE") == 0)
    {
        BEGIN ELEMENT_BASE;
        yy_push_state(WAIT_HREF);
    }
    else
        BEGIN INITIAL;
}
<ELEMENT>{blank}+
<ELEMENT>.|\n                   |
<ELEMENT><<EOF>>                {
    yyless(0);
    BEGIN INITIAL;
}

<WAIT_HREF>{name}{blank}*={blank}*      {
    int i;

    /* Attribute names are case-insensitive. Only the first 4 characters
     * matter here, and we never look past the end of yytext. */
    for (i = 0; i < yyleng && i < 4; i++)
        yytext[i] = toupper(yytext[i]);

    if (i == 4 && memcmp(yytext, "HREF", 4) == 0 &&
        (yytext[4] == '=' || HLINK_ISBLANK(yytext[4])))
        BEGIN URI;
    else
        yy_push_state(IGNORE);
}

    /* An "href" attribute has been found. All the following attributes
       will be ignored. */
<HREF_OK>{name}{blank}*={blank}*        yy_push_state(IGNORE);
<WAIT_HREF,HREF_OK>{blank}+
<WAIT_HREF>.|\n                 |
<WAIT_HREF><<EOF>>              {
    yyless(0);
    yy_pop_state();
    BEGIN NO_HREF;
}
<HREF_OK>.|\n                   |
<HREF_OK><<EOF>>                {
    yyless(0);
    yy_pop_state();
}

<NO_HREF>>                      BEGIN INITIAL;
<NO_HREF>.|\n                   |
<NO_HREF><<EOF>>                {
    yyless(0);
    BEGIN INITIAL;
}

<ELEMENT_A>>                    {
    struct hlink *entry;
    int n;

    if ((entry = (struct hlink *)malloc(sizeof (struct hlink))) != NULL)
        n = uri_merge(&entry->uri, &__uri, &__base_uri);
    uri_destroy(&__uri);

    if (entry)
    {
        if (n >= 0)
        {
            /* The filter function is called. */
            if ((n = __filter(&entry->uri, __context)) > 0)
            {
                list_add_tail(&entry->list, __head);
                __nhlinks++;
            }
            else
            {
                uri_destroy(&entry->uri);
                free(entry);
            }

            if (n >= 0)
            {
                BEGIN INITIAL;
                YY_BREAK;
            }
        }
        else
            free(entry);
    }

    /* Failed! We should clean up what we've added. Possibilities of
     * failure: failed to allocate memory for "entry"; failed to
     * merge the relative URI with the base URI; the filter function
     * returned a negative number. The first 2 are not likely to happen. */
    __hlink_destroy_last_n(__head, __nhlinks);
    if (__is_our_base)
        uri_destroy(&__base_uri);
    return -1;
}

<ELEMENT_BASE>>                 {
    if (__is_our_base)
        uri_destroy(&__base_uri);
    else
        __is_our_base = 1;
    __base_uri = __uri;
    BEGIN INITIAL;
}

<ELEMENT_A,ELEMENT_BASE>.|\n            |
<ELEMENT_A,ELEMENT_BASE><<EOF>>         {
    uri_destroy(&__uri);
    yyless(0);
    BEGIN INITIAL;
}

<IGNORE>\"{blank}*              BEGIN IGNORE_QUOTED;
<IGNORE>.|\n                    |
<IGNORE><<EOF>>                 {
    yyless(0);
    BEGIN IGNORE_NOT_QUOTED;
}
<IGNORE_QUOTED>[^"<>]
<IGNORE_QUOTED>\"               yy_pop_state();
<IGNORE_NOT_QUOTED>[^"<> \t\r\n]
<IGNORE_QUOTED,IGNORE_NOT_QUOTED>.|\n           |
<IGNORE_QUOTED,IGNORE_NOT_QUOTED><<EOF>>        {
    yyless(0);
    yy_pop_state();
}

<URI>\"{blank}*                 BEGIN URI_QUOTED;
<URI>.|\n                       |
<URI><<EOF>>                    {
    yyless(0);
    BEGIN URI_NOT_QUOTED;
}
<URI_QUOTED>[^"<>]*\"           |
<URI_NOT_QUOTED>[^"<> \t\r\n]*["<> \t\r\n]      {
    int len;
    char c;

    /* Preserve the last character. If the current state is URI_NOT_QUOTED,
     * this character will be put back into the input. */
    c = yytext[yyleng - 1];

    /* If the URI is in quotes, we skip the blanks following the URI
     * in order to check whether it's a legal URI. */
    while (yyleng > 1 && HLINK_ISBLANK(yytext[yyleng - 2]))
        yyleng--;

    /* Last two characters must be "\0". */
    yytext[yyleng] = yytext[yyleng - 1] = YY_END_OF_BUFFER_CHAR;

    /* Calling "uri_parse_buffer" instead of "uri_parse_string" because
     * the former doesn't copy memory. */
    if ((len = uri_parse_buffer(&__uri, yytext, yyleng + 1)) >= 0)
    {
        if (YY_START == URI_NOT_QUOTED)
            unput(c);

        if (len == yyleng - 1)
            BEGIN HREF_OK;
        else
        {
            /* Illegal URI! */
            uri_destroy(&__uri);
            BEGIN WAIT_HREF;
        }
    }
    else
    {
        yy_pop_state();
        __hlink_destroy_last_n(__head, __nhlinks);
        if (__is_our_base)
            uri_destroy(&__base_uri);
        return -1;
    }
}
<URI_QUOTED>[<>]                |
<URI_QUOTED,URI_NOT_QUOTED><<EOF>>      {
    yyless(0);
    BEGIN WAIT_HREF;
}
<URI_QUOTED,URI_NOT_QUOTED>.|\n

%%

int yywrap(void)
{
    return 1;
}

/* Remove the last n entries from the list and free them. */
static void __hlink_destroy_last_n(struct list_head *head, int n)
{
    struct hlink *entry;
    struct list_head *last;

    for (; n > 0; n--)
    {
        last = head->prev;
        entry = list_entry(last, struct hlink, list);
        list_del(last);
        uri_destroy(&entry->uri);
        free(entry);
    }
}

/*
 * hlink_detect -- Function detecting hyper links in an html file.
 *
 * This is the most important function of this module. It takes 5 arguments:
 * @struct list_head *head
 *    The list where we will store the hyper links we find, which needn't
 *    be initially empty. All the hyper links we find will be added to the
 *    tail of it.
 * @FILE *page_file
 *    A FILE pointer that points to the page file to be scanned.
 * @const struct uri *page_uri
 *    A struct uri pointer that points to the URI of the page file to be
 *    scanned. It can NOT be a NULL pointer. Every page file has a URI;
 *    even an empty URI can be parsed into a URI struct. Note we use URI
 *    but not URL. URI is a superset of URL.
 * @int (*filter)(const struct uri *, void *)
 *    A function pointer that points to the custom filter function. Only
 *    URIs that pass the filter function will be added into the list.
 *    The second argument is a pointer that points to the context of the
 *    filter function.
 *    For example, we need only ftp URIs with host "ftp.pku.edu.cn" and
 *    mailto URIs. The filter function can be:
 *    int my_filter(const struct uri *uri, void *context)
 *    {
 *        if (strcasecmp(uri->scheme, "ftp") == 0)
 *        {
 *            if (uri->authority_type == AT_SERVER &&
 *                strcmp(uri->host, "ftp.pku.edu.cn") == 0)
 *                return 1;
 *        }
 *        else if (strcasecmp(uri->scheme, "mailto") == 0)
 *            return 1;
 *
 *        return 0;
 *    }
 *
 *    ...
 *    hlink_detect(&list, file, &uri, my_filter, NULL);
 *    ...
 *
 *    Then, only URIs such as "ftp://name:pass@ftp.pku.edu.cn:21/pub/" and
 *    "mailto:webmaster@pku.edu.cn" will be added into the list. All
 *    the http URIs will be filtered out.
 *
 *    Another example: we need only URIs that have a scheme name the same
 *    as the string that "context" points to. The following code may meet
 *    your need:
 *    int my_filter(const struct uri *uri, void *context)
 *    {
 *        if (strcasecmp(uri->scheme, (char *)context) == 0)
 *            return 1;
 *        else
 *            return 0;
 *    }
 *
 *    The filter function can be very large, and if an error occurs in the
 *    filter function, it can return a negative number to signal the
 *    detecting function, which in this case will stop the scanning
 *    process, clean up what it has added and end with -1.
 *
 *    An extreme case: the filter function always returns 0 or a negative
 *    number. Then it becomes an "event driven" programming model. The
 *    "event" is that a hyper link is detected (recall windows
 *    programming: "button clicked" is an event), and the filter function
 *    is the code to be executed when the event occurs. In this case, the
 *    "head" argument can be NULL.
 *
 *    "filter" is introduced in version 1.2.0. Its original design
 *    does not have a "context".
 *
 * @void *context
 *    The context that you will pass to the filter function. If the
 *    filter function does not need a context, this argument can be NULL.
 *
 *    "context" is introduced in version 1.3.0.
 *
 * "hlink_detect" returns the number of hyper links we found and
 * added (not filtered out) into the list. Returning -1 indicates a
 * failure, the possible causes of which are a failure to allocate memory,
 * a failure to parse or merge a URI, or the filter function returning a
 * negative number.
 */
int hlink_detect(struct list_head *head, FILE *page_file,
                 const struct uri *page_uri,
                 int (*filter)(const struct uri *, void *),
                 void *context)
{
    yyin = page_file;
    __base_uri = *page_uri;
    __is_our_base = 0;
    __head = head;
    __filter = filter;
    __nhlinks = 0;
    __context = context;
    BEGIN INITIAL;
    return yylex();
}

/* Remove every entry from the list and free it. */
void hlink_destroy(struct list_head *head)
{
    struct hlink *entry;
    struct list_head *first;

    while ((first = head->next) != head)
    {
        entry = list_entry(first, struct hlink, list);
        list_del(first);
        uri_destroy(&entry->uri);
        free(entry);
    }
}