=item $p->strict_names

=item $p->strict_names( $bool )

By default, almost anything is allowed in tag and attribute names.
This is the behaviour of most popular browsers and allows us to parse
some broken tags with invalid attribute values like:

    <IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>

By default, "LIST]" is parsed as a boolean attribute, not as
part of the ALT value as was clearly intended. This is also what
Mozilla sees.

The official behaviour is enabled by enabling this attribute. If
enabled, it will cause the tag above to be reported as text
since "LIST]" is not a legal attribute name.

=item $p->unbroken_text

=item $p->unbroken_text( $bool )

By default, blocks of text are given to the text handler as soon as
possible (but the parser takes care always to break text at a
boundary between whitespace and non-whitespace so single words and
entities can always be decoded safely). This might create breaks that
make it hard to do transformations on the text. When this attribute is
enabled, blocks of text are always reported in one piece. This will
delay the text event until the following (non-text) event has been
recognized by the parser.

Note that the C<offset> argspec will give you the offset of the first
segment of text and C<length> is the combined length of the segments.
Since there might be ignored tags in between, these numbers can't be
used to directly index in the original document file.

=item $p->utf8_mode

=item $p->utf8_mode( $bool )

Enable this option when parsing raw undecoded UTF-8. This tells the
parser that the entities expanded for strings reported by C<attr>,
C<@attr> and C<dtext> should be expanded as decoded UTF-8 so they end
up compatible with the surrounding text.

If C<utf8_mode> is enabled then it is an error to pass strings
containing characters with code above 255 to the parse() method, and
the parse() method will croak if you try.

Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when UTF-8
encoded.
The character can also be represented by the entity
"&hearts;" or "&#x2665;". If we feed the parser:

    $p->parse("\xE2\x99\xA5&hearts;");

then C<dtext> will be reported as "\xE2\x99\xA5\x{2665}" without
C<utf8_mode> enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
The latter string is what you want.

This option is only available with perl-5.8 or better.

=item $p->xml_mode

=item $p->xml_mode( $bool )

Enabling this attribute changes the parser to allow some XML
constructs. This enables the behaviour controlled individually by
the C<case_sensitive>, C<empty_element_tags>, C<strict_names> and
C<xml_pic> attributes and also suppresses special treatment of
elements that are parsed as CDATA for HTML.

=item $p->xml_pic

=item $p->xml_pic( $bool )

By default, I<processing instructions> are terminated by ">". When
this attribute is enabled, processing instructions are terminated by
"?>" instead.

=back

As markup and text are recognized, handlers are invoked. The following
method is used to set up handlers for different events:

=over

=item $p->handler( event => \&subroutine, $argspec )

=item $p->handler( event => $method_name, $argspec )

=item $p->handler( event => \@accum, $argspec )

=item $p->handler( event => "" );

=item $p->handler( event => undef );

=item $p->handler( event );

This method assigns a subroutine, method, or array to handle an event.

Event is one of C<text>, C<start>, C<end>, C<declaration>, C<comment>,
C<process>, C<start_document>, C<end_document> or C<default>.

The C<\&subroutine> is a reference to a subroutine which is called to handle
the event.

The C<$method_name> is the name of a method of $p which is called to handle
the event.

The C<@accum> is an array that will hold the event information as
sub-arrays.

If the second argument is "", the event is ignored.
If it is undef, the default handler is invoked for the event.

The C<$argspec> is a string that describes the information to be reported
for the event. Any requested information that does not apply to a
specific event is passed as C<undef>.
If argspec is omitted, then it is left unchanged.

The return value from $p->handler is the old callback routine or a
reference to the accumulator array.

Any return values from handler callback routines/methods are always
ignored. A handler callback can request parsing to be aborted by
invoking the $p->eof method. A handler callback is not allowed to
invoke the $p->parse() or $p->parse_file() method. An exception will
be raised if it tries.

Examples:

    $p->handler(start => "start", 'self, attr, attrseq, text' );

This causes the "start" method of object $p to be called for 'start' events.
The callback signature is $p->start(\%attr, \@attr_seq, $text).

    $p->handler(start => \&start, 'attr, attrseq, text' );

This causes subroutine start() to be called for 'start' events.
The callback signature is start(\%attr, \@attr_seq, $text).

    $p->handler(start => \@accum, '"S", attr, attrseq, text' );

This causes 'start' event information to be saved in @accum.
The array elements will be ['S', \%attr, \@attr_seq, $text].

    $p->handler(start => "");

This causes 'start' events to be ignored. It also suppresses
invocations of any default handler for start events. It is in most
cases equivalent to $p->handler(start => sub {}), but is more
efficient. It is different from the empty-sub-handler in that
C<skipped_text> is not reset by it.

    $p->handler(start => undef);

This causes no handler to be associated with start events.
If there is a default handler it will be invoked.

=back

Filters based on tags can be set up to limit the number of events
reported. The main bottleneck during parsing is often the huge number
of callbacks made from the parser. Applying filters can improve
performance significantly.

The following methods control filters:

=over

=item $p->ignore_elements( @tags )

Both the C<start> event and the C<end> event as well as any events that
would be reported in between are suppressed. The ignored elements can
contain nested occurrences of themselves.
Example:

    $p->ignore_elements(qw(script style));

The C<script> and C<style> tags will always nest properly since their
content is parsed in CDATA mode. For most other tags
C<ignore_elements> must be used with caution since HTML is often not
I<well formed>.

=item $p->ignore_tags( @tags )

Any C<start> and C<end> events involving any of the tags given are
suppressed. To reset the filter (i.e. don't suppress any C<start> and
C<end> events), call C<ignore_tags> without an argument.

=item $p->report_tags( @tags )

Any C<start> and C<end> events involving any of the tags I<not> given
are suppressed. To reset the filter (i.e. report all C<start> and
C<end> events), call C<report_tags> without an argument.

=back

Internally, the system has two filter lists, one for C<report_tags>
and one for C<ignore_tags>, and both filters are applied. This
effectively gives C<ignore_tags> precedence over C<report_tags>.

Examples:

    $p->ignore_tags(qw(style));
    $p->report_tags(qw(script style));

results in only C<script> events being reported.

=head2 Argspec

Argspec is a string containing a comma-separated list that describes
the information reported by the event. The following argspec
identifier names can be used:

=over

=item C<attr>

Attr causes a reference to a hash of attribute name/value pairs to be
passed.

Boolean attributes' values are either the value set by
$p->boolean_attribute_value, or the attribute name if no value has been
set by $p->boolean_attribute_value.

This passes undef except for C<start> events.

Unless C<xml_mode> or C<case_sensitive> is enabled, the attribute
names are forced to lower case.

General entities are decoded in the attribute values and
one layer of matching quotes enclosing the attribute values is removed.

The Unicode character set is assumed for entity decoding.
With Perl version 5.6 or earlier only the Latin-1 range is supported, and
entities for characters outside the range 0..255 are left unchanged.

=item C<@attr>

Basically the same as C<attr>, but keys and values are passed as
individual arguments and the original sequence of the attributes is
kept. The parameters passed will be the same as the @attr calculated
here:

    @attr = map { $_ => $attr->{$_} } @$attrseq;

assuming $attr and $attrseq here are the hash and array passed as the
result of C<attr> and C<attrseq> argspecs.

This passes no values for events besides C<start>.

=item C<attrseq>

Attrseq causes a reference to an array of attribute names to be
passed. This can be useful if you want to walk the C<attr> hash in
the original sequence.

This passes undef except for C<start> events.

Unless C<xml_mode> or C<case_sensitive> is enabled, the attribute
names are forced to lower case.

=item C<column>

Column causes the column number of the start of the event to be passed.
The first column on a line is 0.

=item C<dtext>

Dtext causes the decoded text to be passed. General entities are
automatically decoded unless the event was inside a CDATA section or
was between literal start and end tags (C<script>, C<style>,
C<xmp>, and C<plaintext>).

The Unicode character set is assumed for entity decoding.
With Perl version 5.6 or earlier only the Latin-1 range is supported, and
entities for characters outside the range 0..255 are left unchanged.

This passes undef except for C<text> events.

=item C<event>

Event causes the event name to be passed.

The event name is one of C<text>, C<start>, C<end>, C<declaration>,
C<comment>, C<process>, C<start_document> or C<end_document>.

=item C<is_cdata>

Is_cdata causes a TRUE value to be passed if the event is inside a CDATA
section or between literal start and end tags (C<script>,
C<style>, C<xmp>, and C<plaintext>).

If the flag is FALSE for a text event, then you should normally
either use C<dtext> or decode the entities yourself before the text is
processed further.

=item C<length>

Length causes the number of bytes of the source text of the event to
be passed.

=item C<line>

Line causes the line number of the start of the event to be passed.
The first line in the document is 1. Line counting doesn't start
until at least one handler requests this value to be reported.

=item C<offset>

Offset causes the byte position in the HTML document of the start of
the event to be passed. The first byte in the document has offset 0.

=item C<offset_end>

Offset_end causes the byte position in the HTML document of the end of
the event to be passed. This is the same as C<offset> + C<length>.

=item C<self>

Self causes the current object to be passed to the handler. If the
handler is a method, this must be the first element in the argspec.

An alternative to passing self as an argspec is to register closures
that capture $self by themselves as handlers. Unfortunately this
creates circular references which prevent the HTML::Parser object
from being garbage collected. Using the C<self> argspec avoids this
problem.

=item C<skipped_text>

Skipped_text returns the concatenated text of all the events that have
been skipped since the last time an event was reported. Events might
be skipped because no handler is registered for them or because some
filter applies.
Skipped text also includes marked section markup,
since there are no events that can catch it.

If an C<"">-handler is registered for an event, then the text for this
event is not included in C<skipped_text>. Skipped text both before
and after the C<"">-event is included in the next reported
C<skipped_text>.

=item C<tag>

Same as C<tagname>, but prefixed with "/" if it belongs to an C<end>
event and "!" for a declaration. The C<tag> does not have any prefix
for C<start> events, and is in this case identical to C<tagname>.

=item C<tagname>

This is the element name (or I<generic identifier> in SGML jargon) for
start and end tags. Since HTML is case insensitive, this name is
forced to lower case to ease string matching.

Since XML is case sensitive, the tagname case is not changed when
C<xml_mode> is enabled. The same happens if the C<case_sensitive> attribute
is set.

The declaration type of declaration elements is also passed as a tagname,
even if that is a bit strange.
In fact, in the current implementation tagname is
identical to C<token0> except that the name may be forced to lower case.

=item C<token0>

Token0 causes the original text of the first token string to be
passed. This should always be the same as $tokens->[0].

For C<declaration> events, this is the declaration type.

For C<start> and C<end> events, this is the tag name.

For C<process> and non-strict C<comment> events, this is everything
inside the tag.

This passes undef if there are no tokens in the event.

=item C<tokenpos>

Tokenpos causes a reference to an array of token positions to be
passed. For each string that appears in C<tokens>, this array
contains two numbers.
The first number is the offset of the start of
the token in the original C<text> and the second number is the length
of the token.

Boolean attributes in a C<start> event will have (0,0) for the
attribute value offset and length.

This passes undef if there are no tokens in the event (e.g., C<text>)
and for artificial C<end> events triggered by empty element tags.

If you are using these offsets and lengths to modify C<text>, you
should either work from right to left, or be very careful to calculate
the changes to the offsets.

=item C<tokens>

Tokens causes a reference to an array of token strings to be passed.
The strings are exactly as they were found in the original text,
no decoding or case changes are applied.

For C<declaration> events, the array contains each word, comment, and
delimited string starting with the declaration type.

For C<comment> events, this contains each sub-comment. If
$p->strict_comments is disabled, there will be only one sub-comment.

For C<start> events, this contains the original tag name followed by
the attribute name/value pairs. The values of boolean attributes will
be either the value set by $p->boolean_attribute_value, or the
attribute name if no value has been set by
$p->boolean_attribute_value.
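
As a rough illustration of how C<tokens> and C<tokenpos> line up for a
C<start> event, the following is a minimal sketch, not a definitive
recipe. The C<< api_version => 3 >> constructor option and the sample
tag are assumptions that do not come from this section, and the
anonymous handler exists only for this example:

    use strict;
    use warnings;
    use HTML::Parser ();

    my $p = HTML::Parser->new( api_version => 3 );

    # Ask for the raw token strings, their positions, and the source text.
    $p->handler( start => sub {
        my ($tokens, $tokenpos, $text) = @_;
        for my $i ( 0 .. $#$tokens ) {
            # tokenpos holds one (offset, length) pair per token.
            my ($offset, $length) = @{$tokenpos}[ 2 * $i, 2 * $i + 1 ];
            printf "%-10s offset=%d length=%d source=[%s]\n",
                $tokens->[$i], $offset, $length,
                substr( $text, $offset, $length );
        }
    }, 'tokens, tokenpos, text' );

    # Sample input (hypothetical): one valued and one boolean attribute.
    $p->parse('<a href="x.html" checked>');
    $p->eof;

For the boolean attribute, the value token is paired with offset 0 and
length 0, so the C<substr> call yields an empty string for it.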