.\" Automatically generated by Pod::Man 2.16 (Pod::Simple 3.05)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.el \{\
.    de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "Parser 3"
.TH Parser 3 "2007-01-12" "perl v5.10.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
HTML::Parser \- HTML parser class
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\& use HTML::Parser ();
\&
\& # Create parser object
\& $p = HTML::Parser\->new( api_version => 3,
\&                         start_h => [\e&start, "tagname, attr"],
\&                         end_h   => [\e&end,   "tagname"],
\&                         marked_sections => 1,
\&                       );
\&
\& # Parse document text chunk by chunk
\& $p\->parse($chunk1);
\& $p\->parse($chunk2);
\& #...
\& $p\->eof;                 # signal end of document
\&
\& # Parse directly from file
\& $p\->parse_file("foo.html");
\& # or
\& open(my $fh, "<:utf8", "foo.html") || die;
\& $p\->parse_file($fh);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Objects of the \f(CW\*(C`HTML::Parser\*(C'\fR class will recognize markup and
separate it from plain text (alias data content) in \s-1HTML\s0
documents.  As different kinds of markup and text are recognized, the
corresponding event handlers are invoked.
.PP
\&\f(CW\*(C`HTML::Parser\*(C'\fR is not a generic \s-1SGML\s0 parser.  We have tried to
make it able to deal with the \s-1HTML\s0 that is actually \*(L"out there\*(R", and
it normally parses as closely as possible to the way the popular web
browsers do it instead of strictly following one of the many \s-1HTML\s0
specifications from W3C.  Where there is disagreement, there is often
an option that you can enable to get the official behaviour.
.PP
The document to be parsed may be supplied in arbitrary chunks.  This
makes on-the-fly parsing as documents are received from the network
possible.
.PP
If event driven parsing does not feel right for your application, you
might want to use \f(CW\*(C`HTML::PullParser\*(C'\fR.
This is an \f(CW\*(C`HTML::Parser\*(C'\fR
subclass that allows a more conventional program structure.
.SH "METHODS"
.IX Header "METHODS"
The following method is used to construct a new \f(CW\*(C`HTML::Parser\*(C'\fR object:
.ie n .IP "$p\fR = HTML::Parser\->new( \f(CW%options_and_handlers )" 4
.el .IP "\f(CW$p\fR = HTML::Parser\->new( \f(CW%options_and_handlers\fR )" 4
.IX Item "$p = HTML::Parser->new( %options_and_handlers )"
This class method creates a new \f(CW\*(C`HTML::Parser\*(C'\fR object and
returns it.  Key/value argument pairs may be provided to assign event
handlers or initialize parser options.  The handlers and parser
options can also be set or modified later by the method calls described below.
.Sp
If a top level key is in the form \*(L"<event>_h\*(R" (e.g., \*(L"text_h\*(R") then it
assigns a handler to that event, otherwise it initializes a parser
option.  The event handler specification value must be an array
reference.  Multiple handlers may also be assigned with the 'handlers
=> [%handlers]' option.  See examples below.
.Sp
If \fInew()\fR is called without any arguments, it will create a parser that
uses callback methods compatible with version 2 of \f(CW\*(C`HTML::Parser\*(C'\fR.
See the section on \*(L"version 2 compatibility\*(R" below for details.
.Sp
The special constructor option 'api_version => 2' can be used to
initialize version 2 callbacks while still setting other options and
handlers.
The 'api_version => 3' option can be used if you don't want
to set any options and don't want to fall back to v2 compatible
mode.
.Sp
Examples:
.Sp
.Vb 2
\& $p = HTML::Parser\->new(api_version => 3,
\&                        text_h => [ sub {...}, "dtext" ]);
.Ve
.Sp
This creates a new parser object with a text event handler subroutine
that receives the original text with general entities decoded.
.Sp
.Vb 2
\& $p = HTML::Parser\->new(api_version => 3,
\&                        start_h => [ \*(Aqmy_start\*(Aq, "self,tokens" ]);
.Ve
.Sp
This creates a new parser object with a start event handler method
that receives the \f(CW$p\fR and the tokens array.
.Sp
.Vb 4
\& $p = HTML::Parser\->new(api_version => 3,
\&                        handlers => { text => [\e@array, "event,text"],
\&                                      comment => [\e@array, "event,text"],
\&                                    });
.Ve
.Sp
This creates a new parser object that stores the event type and the
original text in \f(CW@array\fR for text and comment events.
.PP
The following methods feed the \s-1HTML\s0 document
to the \f(CW\*(C`HTML::Parser\*(C'\fR object:
.ie n .IP "$p\fR\->parse( \f(CW$string )" 4
.el .IP "\f(CW$p\fR\->parse( \f(CW$string\fR )" 4
.IX Item "$p->parse( $string )"
Parse \f(CW$string\fR as the next chunk of the \s-1HTML\s0 document.  The return
value is normally a reference to the parser object (i.e. \f(CW$p\fR).
Handlers invoked should not attempt to modify the \f(CW$string\fR in-place until
\&\f(CW$p\fR\->parse returns.
.Sp
If an invoked event handler aborts parsing by calling \f(CW$p\fR\->eof, then
\&\f(CW$p\fR\->\fIparse()\fR will return a \s-1FALSE\s0 value.
.ie n .IP "$p\fR\->parse( \f(CW$code_ref )" 4
.el .IP "\f(CW$p\fR\->parse( \f(CW$code_ref\fR )" 4
.IX Item "$p->parse( $code_ref )"
If a code reference is passed as the argument to be parsed, then the
chunks to be parsed are obtained by invoking this function repeatedly.
Parsing continues until the function returns an empty (or undefined)
result.
When this happens \f(CW$p\fR\->eof is automatically signaled.
.Sp
Parsing will also abort if one of the event handlers calls \f(CW$p\fR\->eof.
.Sp
The effect of this is the same as:
.Sp
.Vb 8
\&    while (1) {
\&        my $chunk = &$code_ref();
\&        if (!defined($chunk) || !length($chunk)) {
\&            $p\->eof;
\&            return $p;
\&        }
\&        $p\->parse($chunk) || return undef;
\&    }
.Ve
.Sp
But it is more efficient as this loop runs internally in \s-1XS\s0 code.
.ie n .IP "$p\fR\->parse_file( \f(CW$file )" 4
.el .IP "\f(CW$p\fR\->parse_file( \f(CW$file\fR )" 4
.IX Item "$p->parse_file( $file )"
Parse text directly from a file.  The \f(CW$file\fR argument can be a
filename, an open file handle, or a reference to an open file
handle.
.Sp
If \f(CW$file\fR contains a filename and the file can't be opened, then the
method returns an undefined value and $! tells why it failed.
Otherwise the return value is a reference to the parser object.
.Sp
If a file handle is passed as the \f(CW$file\fR argument, then the file will
normally be read until \s-1EOF\s0, but not closed.
.Sp
If an invoked event handler aborts parsing by calling \f(CW$p\fR\->eof,
then \f(CW$p\fR\->\fIparse_file()\fR may not have read the entire file.
.Sp
On systems with multi-byte line terminators, the values passed for the
offset and length argspecs may be too low if \fIparse_file()\fR is called on
a file handle that is not in binary mode.
.Sp
If a filename is passed in, then \fIparse_file()\fR will open the file in
binary mode.
.ie n .IP "$p\->eof" 4
.el .IP "\f(CW$p\fR\->eof" 4
.IX Item "$p->eof"
Signals the end of the \s-1HTML\s0 document.  Calling the \f(CW$p\fR\->eof method
outside a handler callback will flush any remaining buffered text
(which triggers the \f(CW\*(C`text\*(C'\fR event if there is any remaining text).
.Sp
Calling \f(CW$p\fR\->eof inside a handler will terminate parsing at that point
and cause \f(CW$p\fR\->parse to return a \s-1FALSE\s0 value.
This also terminates
parsing by \f(CW$p\fR\->\fIparse_file()\fR.
.Sp
After \f(CW$p\fR\->eof has been called, the \fIparse()\fR and \fIparse_file()\fR methods
can be invoked to feed new documents with the parser object.
.Sp
The return value from \fIeof()\fR is a reference to the parser object.
.PP
Most parser options are controlled by boolean attributes.
Each boolean attribute is enabled by calling the corresponding method