📄 parser.pm
For C<end> events, this contains the original tag name (always one token).

For C<process> events, this contains the process instructions (always one
token).

This passes C<undef> for C<text> events.

=item C<text>

Text causes the source text (including markup element delimiters) to be
passed.

=item C<undef>

Pass an undefined value.  Useful as padding where the same handler
routine is registered for multiple events.

=item C<'...'>

A literal string of 0 to 255 characters enclosed
in single (') or double (") quotes is passed as entered.

=back

The whole argspec string can be wrapped up in C<'@{...}'> to signal
that the resulting event array should be flattened.  This only makes a
difference if an array reference is used as the handler target.
Consider this example:

   $p->handler(text => [], 'text');
   $p->handler(text => [], '@{text}');

With two text events, C<"foo"> and C<"bar">, the first example will end
up with [["foo"], ["bar"]] and the second with ["foo", "bar"] in
the handler target array.

=head2 Events

Handlers for the following events can be registered:

=over

=item C<comment>

This event is triggered when a markup comment is recognized.

Example:

  <!-- This is a comment -- -- So is this -->

=item C<declaration>

This event is triggered when a I<markup declaration> is recognized.

For typical HTML documents, the only declaration you are
likely to find is <!DOCTYPE ...>.

Example:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      "http://www.w3.org/TR/html40/strict.dtd">

DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.

=item C<default>

This event is triggered for events that do not have a specific
handler.  You can set up a handler for this event to catch stuff you
did not want to catch explicitly.

=item C<end>

This event is triggered when an end tag is recognized.

Example:

  </A>

=item C<end_document>

This event is triggered when $p->eof is called and after any remaining
text is flushed.
There is no document text associated with this event.

=item C<process>

This event is triggered when processing instruction markup is
recognized.

The format and content of processing instructions are system and
application dependent.

Examples:

  <? HTML processing instructions >
  <? XML processing instructions ?>

=item C<start>

This event is triggered when a start tag is recognized.

Example:

  <A HREF="http://www.perl.com/">

=item C<start_document>

This event is triggered before any other events for a new document.  A
handler for it can be used to initialize stuff.  There is no document
text associated with this event.

=item C<text>

This event is triggered when plain text (characters) is recognized.
The text may contain multiple lines.  A sequence of text may be broken
between several text events unless $p->unbroken_text is enabled.

The parser will make sure that it does not break a word or a sequence
of whitespace between two text events.

=back

=head2 Unicode

The C<HTML::Parser> can parse Unicode strings when running under
perl-5.8 or better.  If Unicode is passed to $p->parse() then chunks
of Unicode will be reported to the handlers.  The offset and length
argspecs will also report their position in terms of characters.

It is safe to parse raw undecoded UTF-8 if you either avoid decoding
entities and make sure to not use I<argspecs> that do, or enable the
C<utf8_mode> for the parser.  Parsing of undecoded UTF-8 might be
useful when parsing from a file where you need the reported offsets
and lengths to match the byte offsets in the file.

If a filename is passed to $p->parse_file() then the file will be read
in binary mode.  This will be fine if the file contains only ASCII or
Latin-1 characters.
If the file contains UTF-8 encoded text then care
must be taken when decoding entities as described in the previous
paragraph, but better is to open the file with the UTF-8 layer so that
it is decoded properly:

   open(my $fh, "<:utf8", "index.html") || die "...: $!";
   $p->parse_file($fh);

If the file contains text encoded in a charset besides ASCII, Latin-1
or UTF-8 then decoding will always be needed.

=head1 VERSION 2 COMPATIBILITY

When an C<HTML::Parser> object is constructed with no arguments, a set
of handlers is automatically provided that is compatible with the old
HTML::Parser version 2 callback methods.

This is equivalent to the following method calls:

   $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
   $p->handler(end     => "end",     "self, tagname, text");
   $p->handler(text    => "text",    "self, text, is_cdata");
   $p->handler(process => "process", "self, token0, text");
   $p->handler(comment =>
             sub {
                 my($self, $tokens) = @_;
                 for (@$tokens) {$self->comment($_);}},
             "self, tokens");
   $p->handler(declaration =>
             sub {
                 my $self = shift;
                 $self->declaration(substr($_[0], 2, -1));},
             "self, text");

Setting up these handlers can also be requested with the "api_version =>
2" constructor option.

=head1 SUBCLASSING

The C<HTML::Parser> class is subclassable.  Parser objects are plain
hashes and C<HTML::Parser> reserves only hash keys that start with
"_hparser".  The parser state can be set up by invoking the init()
method, which takes the same arguments as new().

=head1 EXAMPLES

The first simple example shows how you might strip out comments from
an HTML document.
We achieve this by setting up a comment handler that
does nothing and a default handler that will print out anything else:

  use HTML::Parser;
  HTML::Parser->new(default_h => [sub { print shift }, 'text'],
                    comment_h => [""],
                   )->parse_file(shift || die) || die $!;

An alternative implementation is:

  use HTML::Parser;
  HTML::Parser->new(end_document_h => [sub { print shift },
                                       'skipped_text'],
                    comment_h      => [""],
                   )->parse_file(shift || die) || die $!;

This will in most cases be much more efficient since only a single
callback will be made.

The next example prints out the text that is inside the <title>
element of an HTML document.  Here we start by setting up a start
handler.  When it sees the title start tag it enables a text handler
that prints any text found and an end handler that will terminate
parsing as soon as the title end tag is seen:

  use HTML::Parser ();

  sub start_handler
  {
    return if shift ne "title";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "title"; },
                           "tagname,self");
  }

  my $p = HTML::Parser->new(api_version => 3);
  $p->handler( start => \&start_handler, "tagname,self");
  $p->parse_file(shift || die) || die $!;
  print "\n";

More examples are found in the F<eg/> directory of the C<HTML-Parser>
distribution: the program C<hrefsub> shows how you can edit all links
found in a document; the program C<htextsub> shows how to edit the text only; the
program C<hstrip> shows how you can strip out certain tags/elements
and/or attributes; and the program C<htext> shows how to obtain the
plain text, but not any script/style content.

You can browse the F<eg/> directory online from the I<[Browse]> link on
the http://search.cpan.org/~gaas/HTML-Parser/ page.

=head1 BUGS

The <style> and <script> sections do not end with the first "</", but
need the complete corresponding end tag.
The standard behaviour is
not really practical.

When the I<strict_comment> option is enabled, we still recognize
comments where there is something other than whitespace between even
and odd "--" markers.

Once $p->boolean_attribute_value has been set, there is no way to
restore the default behaviour.

There is currently no way to get both quote characters
into the same literal argspec.

Empty tags, e.g. "<>" and "</>", are not recognized.  SGML allows them
to repeat the previous start tag or close the previous start tag
respectively.

NET tags, e.g. "code/.../" are not recognized.  This is SGML
shorthand for "<code>...</code>".

Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not
recognized.

=head1 DIAGNOSTICS

The following messages may be produced by HTML::Parser.  The notation
in this listing is the same as used in L<perldiag>:

=over

=item Not a reference to a hash

(F) The object blessed into or subclassed from HTML::Parser is not a
hash as required by the HTML::Parser methods.

=item Bad signature in parser state object at %p

(F) The _hparser_xs_state element does not refer to a valid state structure.
Something must have changed the internal value
stored in this hash element, or the memory has been overwritten.

=item _hparser_xs_state element is not a reference

(F) The _hparser_xs_state element has been destroyed.

=item Can't find '_hparser_xs_state' element in HTML::Parser hash

(F) The _hparser_xs_state element is missing from the parser hash.
It was either deleted, or not created when the object was created.

=item API version %s not supported by HTML::Parser %s

(F) The constructor option 'api_version' with an argument greater than
or equal to 4 is reserved for future extensions.

=item Bad constructor option '%s'

(F) An unknown constructor option key was passed to the new() or
init() methods.

=item Parse loop not allowed

(F) A handler invoked the parse() or parse_file() method.
This is not permitted.

=item marked sections not supported

(F) The $p->marked_sections() method was invoked in an
HTML::Parser module that was compiled without support for marked sections.

=item Unknown boolean attribute (%d)

(F) Something is wrong with the internal logic that set up aliases for
boolean attributes.

=item Only code or array references allowed as handler

(F) The second argument for $p->handler must be either a subroutine
reference, the name of a subroutine or method, or a reference to an
array.

=item No handler for %s events

(F) The first argument to $p->handler must be a valid event name; i.e. one
of "start", "end", "text", "process", "declaration" or "comment".

=item Unrecognized identifier %s in argspec

(F) The identifier is not a known argspec name.
Use one of the names mentioned in the argspec section above.

=item Literal string is longer than 255 chars in argspec

(F) The current implementation limits the length of literals in
an argspec to 255 characters.  Make the literal shorter.

=item Backslash reserved for literal string in argspec

(F) The backslash character "\" is not allowed in argspec literals.
It is reserved to permit quoting inside a literal in a later version.

=item Unterminated literal string in argspec

(F) The terminating quote character for a literal was not found.

=item Bad argspec (%s)

(F) Only identifier names, literals, spaces and commas
are allowed in argspecs.

=item Missing comma separator in argspec

(F) Identifiers in an argspec must be separated with ",".

=item Parsing of undecoded UTF-8 will give garbage when decoding entities

(W) The first chunk parsed appears to contain undecoded UTF-8 and one
or more argspecs that decode entities are used for the callback
handlers.

The result of decoding will be a mix of encoded and decoded characters
for any entities that expand to characters with code above 127.  This
is not a good thing.

The solution is to use Encode::decode_utf8() on the data before
feeding it to $p->parse().
For $p->parse_file() pass a file that
has been opened in ":utf8" mode.

The parser can process raw undecoded UTF-8 sanely if the C<utf8_mode>
is enabled or if the "attr", "@attr" or "dtext" argspecs are avoided.

=item Parsing string decoded with wrong endianess

(W) The first character in the document is U+FFFE.  This is not a
legal Unicode character but a byte swapped BOM.  The result of parsing
will likely be garbage.

=item Parsing of undecoded UTF-32

(W) The parser found the Unicode UTF-32 BOM signature at the start
of the document.  The result of parsing will likely be garbage.

=item Parsing of undecoded UTF-16

(W) The parser found the Unicode UTF-16 BOM signature at the start of
the document.  The result of parsing will likely be garbage.

=back

=head1 SEE ALSO

L<HTML::Entities>, L<HTML::PullParser>, L<HTML::TokeParser>, L<HTML::HeadParser>,
L<HTML::LinkExtor>, L<HTML::Form>

L<HTML::TreeBuilder> (part of the I<HTML-Tree> distribution)

http://www.w3.org/TR/html4

More information about marked sections and processing instructions may
be found at C<http://www.sgml.u-net.com/book/sgml-8.htm>.

=head1 COPYRIGHT

 Copyright 1996-2007 Gisle Aas. All rights reserved.
 Copyright 1999-2000 Michael A. Chase.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=cut