📄 tree.pm

This software makes it easy to parse an HTML page into a tree.
##
package HTML::Tree;
#	mod/HTML/Tree/Tree.pm
#
#	Copyright (C) 1999  Paul J. Lucas
#
#	This program is free software; you can redistribute it and/or modify
#	it under the terms of the GNU General Public License as published by
#	the Free Software Foundation; either version 2 of the License, or
#	(at your option) any later version.
#
#	This program is distributed in the hope that it will be useful,
#	but WITHOUT ANY WARRANTY; without even the implied warranty of
#	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#	GNU General Public License for more details.
#
#	You should have received a copy of the GNU General Public License
#	along with this program; if not, write to the Free Software
#	Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
##

use strict;
use vars qw( @EXPORT @EXPORT_OK @ISA $VERSION );

require Exporter;
require DynaLoader;
require AutoLoader;

@EXPORT = qw();
@EXPORT_OK = qw();
@ISA = qw( Exporter DynaLoader );
$VERSION = '2.4.3';

bootstrap HTML::Tree $VERSION;

1;
__END__

########## End of Perl Part -- The rest is documentation ######################

=head1 NAME

C<HTML::Tree> - Perl extension for quickly parsing HTML files into trees

=head1 SYNOPSIS

 use HTML::Tree;

 $tree1 = HTML::Tree->from_file( 'file.html' );
 $aref  = $tree1->as_array();
 $tree2 = HTML::Tree->from_array( $aref );
 $str   = $tree2->as_string();
 $tree3 = HTML::Tree->from_string( $str );
 $tree3->write( 'new_file.html' );

then:

 sub visitor {
	my( $node, $depth, $is_end_tag ) = @_;
	# ...
 }

 $tree1->visit( \&visitor );

or:

 sub visitor {
	my( $hash_ref, $node, $depth, $is_end_tag ) = @_;
	# ...
 }

 %my_hash;
 # ...
 $tree1->visit( \%my_hash, \&visitor );

also:

 $aref = $node->children();
 $node->delete();
 $node = $node->find_if( \&predicate_function );
 $node = $node->find_name( 'name' );
 $bool = $node->is_element();
 $bool = $node->is_comment();
 $bool = $node->is_text();
 $name = $node->name();
 $text = $node->text();

=head1 DESCRIPTION

C<HTML::Tree> is a fast parser that parses an HTML file
into a tree structure like the HTML DOM (Document Object Model).
Once built, the nodes of the tree (elements and text from the HTML file)
can be traversed by a user-defined I<visitor> function
or compiled into an array-of-hashes data structure.

C<HTML::Tree> is very similar to
the C<HTML::Parser> and C<HTML::TreeBuilder> modules by Gisle Aas,
except that it:

=over 4

=item 1.

Is several times faster.
C<HTML::Tree> owes its speed to two things:
using mmap(2) to read the HTML file, bypassing conventional I/O and buffering,
and being written entirely in C++ as opposed to Perl.

=item 2.

Isn't a strict DTD (Document Type Definition) parser.
The goal is to parse HTML files fast,
not check for validity.
(You should check the validity of your HTML files with other tools I<before>
you put them on your web site anyway.)
For example,
C<HTML::Tree> couldn't care less what attributes a given HTML element has
just so long as the syntax is correct.
This is actually similar to browsers in that both are very permissive
in what they accept.

=item 3.

Offers simple conditional and looping mechanisms
assisting in the generation of dynamic HTML content.

=back

=head1 Methods

For the methods below,
the kind of node a method may be called on is indicated;
C<$node> means "any kind of node."
Calling a method for a node of the wrong kind is a fatal error.

=over 4

=item C<$parent_node = HTML::Tree-E<gt>from_file(> I<file_name> B<[>C<, {> I<param_hash_ref> C<}> B<]> C<)>

Parse the given HTML file
and return a reference to a new C<HTML::Tree> object.
If, for any reason, the file cannot be parsed
(file does not exist, insufficient permissions, etc.),
C<undef> is
returned.
Parameters that control how the data structure is built
may be passed via a reference to a hash.
If C<Include_Comments> is given with a non-zero value,
then comment nodes are included; otherwise, they are elided.

=item C<$array_ref = $node-E<gt>as_array(> B<[> C<{> I<param_hash_ref> C<}> B<]> C<)>

Returns a reference to an array-of-hashes data structure
representing the nodes in the HTML tree starting at the specified node.
Parameters that control how the data structure is built
may be passed via a reference to a hash.
The parameters are the same as for C<from_file()> above.

For example,
given this HTML:

	<a href="file.html">
	  Text A
	  <b>Text B</b>
	  <i>Text C</i>
	</a>

the C<Data::Dumper> representation of the resulting data structure would be:

	$ref = [
	    {
	        'name' => 'a',
	        'atts' => { 'href' => 'file.html' },
	        'content' => [
	            'Text A',
	            {
	                'name' => 'b',
	                'content' => [ 'Text B' ]
	            },
	            {
	                'name' => 'i',
	                'content' => [ 'Text C' ]
	            },
	        ]
	    }
	]

Every HTML element at the same "depth" or "level" is contained
in the same array, i.e., they are "siblings" in the tree.
The order of the elements in the array matches the order of the HTML elements
in the file.

A node is either a string (representing text or a comment)
or a reference to a hash (representing an HTML element).
Strings are tied scalars, so modifying them changes the underlying tree.
Strings in the HTML file that are entirely whitespace
are elided from the data structure.

A hash always has a C<name> key whose value is the name of the HTML element
and may also have an C<atts> key and/or a C<content> key.

The value of the C<atts> key is a reference to a tied hash where the hash
keys are attribute names and the hash values are the attribute values.
Attribute names are returned in lower case
(regardless of how they are in the HTML file).
Because the hash is tied,
assigning to
a hash element changes that attribute's value;
similarly, deleting an element deletes the attribute.

The value of the C<content> key is a reference to an array containing
all of the node's child nodes at the next level down.

B<Note>: Modifying the arrays themselves (adding elements, deleting, etc.)
does I<not> modify the underlying tree.
To do that, either use the C<children()> method
or "walk" the tree using a I<visitor> function.

=item C<$parent_node = HTML::Tree-E<gt>from_array(> I<array_ref> C<)>

Create a new C<HTML::Tree> object from a data structure
in the form returned by C<as_array()>.
If, for any reason, the data structure isn't in the right form,
the function will croak with an error message.

=item C<$string = $node-E<gt>as_string(> B<[> C<{> I<param_hash_ref> C<}> B<]> C<)>

Return the HTML text representation of the portion of the tree
starting at the given node as a single string.
Parameters that control how the HTML tree is converted to a string
may be passed via a reference to a hash.
If the C<Pretty_Print> parameter is given
with a value greater than or equal to zero,
then text nodes have leading and trailing whitespace removed,
are indented according to their depth,
and have a single newline appended.
All other nodes appear on lines by themselves
and are also indented according to their depth.
Indentation is done by spaces where the number of spaces at a given depth is
C<(Pretty_Print + depth) * 2>.

B<Note>:
pretty-printing is suspended inside C<E<lt>PREE<gt>> elements
to preserve the original formatting.

=item C<$parent_node = HTML::Tree-E<gt>from_string(> I<string> B<[>C<, {> I<param_hash_ref> C<}> B<]> C<)>

This is the same as C<from_file()> except that the HTML is parsed
from the given string rather than a file.

=item C<$value = $element_node-E<gt>att(> I<name> C<)>

Returns the value of the element node's I<name> attribute
or C<undef> if said node does not have one.
Attribute names B<must> be specified in lower case
(regardless of how they are in the HTML file).

=item
C<$element_node-E<gt>att(> I<name>C<, >I<new_value> C<)>

Sets the value of the element node's I<name> attribute to I<new_value>.
If I<new_value> is C<undef>, then the attribute is deleted.
Attribute names B<must> be specified in lower case
(regardless of how they are in the HTML file).
If no I<name> attribute existed, it is added.

=item C<$attributes_ref = $element_node-E<gt>atts()>

Returns a reference to a tied hash of all of an element node's
attribute/value pairs
or a reference to an empty hash if said node does not have any.
Attribute names are returned in lower case
(regardless of how they are in the HTML file).
Because the hash is tied,
assigning to a hash element changes that attribute's value;
similarly, deleting an element deletes the attribute.

=item C<$child_nodes_ref = $parent_node-E<gt>children()>

Returns a reference to a tied array of all of an element node's child nodes.
Because the array is tied, the Perl array manipulation functions
pop, push, shift, and unshift work and affect the structure of the HTML::Tree.
For example:

	$orphan = shift @{ $node1->children() };

"detaches" the first child node of $node1 from the tree structure
and returns a reference to it now as its own distinct HTML::Tree.
Conversely:

	push @{ $node2->children() }, $orphan;

"reattaches" the sub-tree but now at the end of the child nodes of $node2
elsewhere in the tree.

Additionally, a child node can also be replaced by assignment as in:

	$node->children()->[0] = expression

where I<expression> is one of:
a reference to a data structure in the form returned by C<as_array()>,
a reference to an HTML::Tree
(in which case the whole tree is "inserted"),
or a string (in which case the string is parsed as HTML).

=item C<$node-E<gt>delete()>

Delete the node and all of its child nodes, if any, from the tree.
Once deleted, the reference to the node B<must not> be used.

=item C<$node = $node-E<gt>find_if(> I<func_ref> C<)>

Find the first node for which the given predicate function is true
starting the find from the
given node.
Returns C<undef> if no such node is found.
Closures work well to generate the predicate function
since additional parameters can be used during the find.
For example:

	sub pred_att_re {
		my( $att, $re ) = @_;
		return sub {
			my $node = shift;
			return 0 unless $node->is_element();
			# att() returns undef when the attribute is absent
			my $val = $node->att( $att );
			return defined $val && $val =~ /$re/;
		}
	}

	$node = $html->find_if( pred_att_re( 'href', '\.jpg$' ) );

This would find an element node having an attribute C<href>
that matches the regular expression C<\.jpg$>.

=item C<$element_node = $node-E<gt>find_name(> I<name> C<)>

Find the first element node having the given name
starting the find from the given node.
The I<name> B<must> be specified in lower case.
Returns C<undef> if no such element node is found.
(This function is a special case of C<find_if()>
and is much faster for finding by name alone.)

=item C<$bool = $node-E<gt>is_comment()>

Returns true (1) only if the current node is a comment node;
false (0), otherwise.

=item C<$bool = $node-E<gt>is_text()>

Returns true (1) only if the current node is a text node; false (0), otherwise.
(If a node isn't a text node, it must be an element node.)

=item C<$name = $element_node-E<gt>name()>

Returns the HTML element name of an element node, e.g., C<title>.
All names are returned in lower case
(regardless of how they are in the HTML file).

=item C<$text = $text_node-E<gt>text(> B<[> I<new_text> B<]> C<)>

Returns the text of a text node as a string.
If I<new_text> is given, the text is set to that first.

=item C<$node-E<gt>visit( \&>I<visitor>C< )>

Traverse the HTML tree
by calling the I<visitor> function for every node
starting at the given node
previously returned by a constructor.

=item C<$node-E<gt>visit( \%>I<hash>C<, \&>I<visitor>C< )>

Same as the previous method
except that a hash reference is passed along
(see B<Arguments> below).

=item C<$success = $node-E<gt>write( >I<file_name>C< >B<[>C<, { >I<param_hash_ref>C< } >B<]>C< )>

Write the HTML text representation of the portion of the tree
starting at the given node as a single
string to a file.
Returns 1 upon success, 0 otherwise.
Parameters that control how the HTML is written
may be passed via a reference to a hash.
The C<Pretty_Print> parameter
has the same meaning as it does for C<as_string()>.

=back

=head1 The Visitor Function

The user supplies a I<visitor> function:
a Perl function that is called as each node is visited
(i.e., a "call-back")
during a depth-first tree traversal in document order.
For HTML elements that have end tags,
the I<visitor> function may be called more than once for a given node
based on the function's return value.
(See B<Return Value> below.)
Note that this occurs for such HTML elements
even if said element's end tag is optional
and was not present in the HTML file.

=head2 Arguments

=over 15

=item C<$hash_ref>

A reference to a hash that is passed only if the two-argument form
of the C<visit()> method is used.
This provides a mechanism for additional data
(or a blessed object)
to be passed to and among the calls to the I<visitor> function.
The argument is not used at all by C<HTML::Tree>.

=item C<$node>

A reference to the current node.

=item C<$depth>

An integer specifying how "deep" the node is in the tree.
(Depths start at zero.)

=item C<$is_end_tag>

True (1) only if the tag is an end tag of an HTML element;
false (0), otherwise.

=back

=head2 Return Value

The I<visitor> function is expected to return a Boolean value
(zero or non-zero for false or true, respectively).
There are two meanings for the return value:

=over 4

=item 1.

If the $is_end_tag argument is false,
returning false means:
do not visit any of the current node's child nodes,
i.e., skip them and proceed directly to the current node's next sibling
and also do not call the I<visitor> again for the end tag;
returning true means: do visit all child nodes
and call the I<visitor> again for the end tag.

=item 2.

If the $is_end_tag argument is true,
returning false means:
proceed normally to the next sibling;
returning true means:
loop back and repeat the visit cycle from the beginning
by revisiting the start tag of the
current element node
(case 1 above).

=back

=head1 EXAMPLE

Here is a sample visitor function that "pretty prints" an HTML file:

	sub visitor {
		my( $node, $depth, $is_end_tag ) = @_;
		print "    " x $depth;
		if ( $node->is_text() ) {
			my $text = $node->text();
			$text =~ s/(?:^\n|\n$)//g;
			print "$text\n";
			return 1;
		}
		if ( $is_end_tag ) {
			print "</", $node->name(), ">\n";
			return 0;
		}
		print '<', $node->name();
		my $atts = $node->atts();
		while ( my( $att, $val ) = each %{ $atts } ) {
			print " $att=\"$val\"";
		}
		print ">\n";
		return 1;
	}

=head1 NOTES

In order for an HTML file to be properly parsed,
scripting languages B<must> be "comment hidden" as in:

	<SCRIPT LANGUAGE="JavaScript">
	<!--
		... script goes here ...
	// -->
	</SCRIPT>

=head1 SEE ALSO

perl(1),
mmap(2),
Data::Dumper(3),
HTML::Parser(3),
HTML::TreeBuilder(3).

World Wide Web Consortium Document Object Model Working Group.
I<Document Object Model>,
December 1998.
C<http://www.w3.org/DOM/>

=head1 AUTHOR

Paul J. Lucas <I<pauljlucas@mac.com>>

=head1 HISTORY

The HTML parser of the C++ part of the module is derived from code in SWISH++,
a really fast file indexing and searching engine (also by the author).

=cut
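
=pod

The return-value rules described under B<Return Value> can be tried out
without the compiled module.
What follows is only a sketch under stated assumptions,
not the module's own traversal code:
C<walk()> is a hypothetical pure-Perl helper (it is not part of C<HTML::Tree>)
that walks a data structure in the form returned by C<as_array()>,
so the visitor receives plain strings and hash references
rather than node objects,
and every element is assumed to have an end tag.

	use strict;
	use warnings;

	# Hypothetical helper, not part of HTML::Tree: apply the documented
	# visitor return-value rules to an as_array()-style structure.
	sub walk {
		my( $nodes, $visitor, $depth ) = @_;
		$depth ||= 0;
		for my $node ( @{ $nodes } ) {
			if ( !ref $node ) {	# text: visited exactly once
				$visitor->( $node, $depth, 0 );
				next;
			}
			my $again = 1;
			while ( $again ) {
				$again = 0;
				# Start tag: false skips the children and the end-tag call.
				last unless $visitor->( $node, $depth, 0 );
				walk( $node->{content} || [], $visitor, $depth + 1 );
				# End tag: true repeats the cycle from the start tag.
				$again = $visitor->( $node, $depth, 1 );
			}
		}
	}

	my $tree = [
		{ name => 'ul', content => [
			{ name => 'li', content => [ 'item' ] },
		] },
	];

	my $passes = 0;
	my @log;
	walk( $tree, sub {
		my( $node, $depth, $is_end_tag ) = @_;
		if ( !ref $node ) {
			push @log, "text:$node";
			return 1;
		}
		my $name = $node->{name};
		if ( $is_end_tag ) {
			push @log, "/$name";
			# Ask for the <li> element to be visited two times in all.
			return $name eq 'li' && ++$passes < 2;
		}
		push @log, $name;
		return 1;
	} );

	# prints: ul li text:item /li li text:item /li /ul
	print join( ' ', @log ), "\n";

The C<li> element appears twice in the output
because its first end-tag call returns true,
which is the looping mechanism mentioned in the B<DESCRIPTION>.

=cut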