extractorhtml.html

From "Web Crawler Open Source Code" · HTML code · 1,082 lines total · page 1 of 5

<CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#EACH_ATTRIBUTE_EXTRACTOR">EACH_ATTRIBUTE_EXTRACTOR</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#FRAME">FRAME</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#IFRAME">IFRAME</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#JAVASCRIPT">JAVASCRIPT</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#LIKELY_URI_PATH">LIKELY_URI_PATH</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A 
HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#LINK">LINK</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;int</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#MAX_ATTR_VAL_LENGTH">MAX_ATTR_VAL_LENGTH</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#NON_HTML_PATH_EXTENSION">NON_HTML_PATH_EXTENSION</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;long</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#numberOfCURIsHandled">numberOfCURIsHandled</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;long</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#numberOfLinksExtracted">numberOfLinksExtracted</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A 
HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#RELEVANT_TAG_EXTRACTOR">RELEVANT_TAG_EXTRACTOR</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) static&nbsp;java.lang.String</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#WHITESPACE">WHITESPACE</A></B></CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR></TABLE>&nbsp;<A NAME="fields_inherited_from_class_org.archive.crawler.framework.Processor"><!-- --></A><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#EEEEFF" CLASS="TableSubHeadingColor"><TH ALIGN="left"><B>Fields inherited from class org.archive.crawler.framework.<A HREF="../../../../org/archive/crawler/framework/Processor.html" title="class in org.archive.crawler.framework">Processor</A></B></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD><CODE><A HREF="../../../../org/archive/crawler/framework/Processor.html#ATTR_DECIDE_RULES">ATTR_DECIDE_RULES</A>, <A HREF="../../../../org/archive/crawler/framework/Processor.html#ATTR_ENABLED">ATTR_ENABLED</A>, <A HREF="../../../../org/archive/crawler/framework/Processor.html#attrDecideRules">attrDecideRules</A></CODE></TD></TR></TABLE>&nbsp;<A NAME="fields_inherited_from_class_org.archive.crawler.settings.ComplexType"><!-- --></A><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#EEEEFF" CLASS="TableSubHeadingColor"><TH ALIGN="left"><B>Fields inherited from class org.archive.crawler.settings.<A HREF="../../../../org/archive/crawler/settings/ComplexType.html" title="class in org.archive.crawler.settings">ComplexType</A></B></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD><CODE><A 
HREF="../../../../org/archive/crawler/settings/ComplexType.html#definition">definition</A>, <A HREF="../../../../org/archive/crawler/settings/ComplexType.html#definitionMap">definitionMap</A></CODE></TD></TR></TABLE>&nbsp;<A NAME="fields_inherited_from_class_org.archive.crawler.datamodel.CoreAttributeConstants"><!-- --></A><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#EEEEFF" CLASS="TableSubHeadingColor"><TH ALIGN="left"><B>Fields inherited from interface org.archive.crawler.datamodel.<A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html" title="interface in org.archive.crawler.datamodel">CoreAttributeConstants</A></B></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD><CODE><A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_ANNOTATIONS">A_ANNOTATIONS</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_CONTENT_DIGEST">A_CONTENT_DIGEST</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_CONTENT_TYPE">A_CONTENT_TYPE</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_CREDENTIAL_AVATARS_KEY">A_CREDENTIAL_AVATARS_KEY</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_DELAY_FACTOR">A_DELAY_FACTOR</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_DISTANCE_FROM_SEED">A_DISTANCE_FROM_SEED</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_DNS_FETCH_TIME">A_DNS_FETCH_TIME</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_DNS_SERVER_IP_LABEL">A_DNS_SERVER_IP_LABEL</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_ETAG_HEADER">A_ETAG_HEADER</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_FETCH_BEGAN_TIME">A_FETCH_BEGAN_TIME</A>, <A 
HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_FETCH_COMPLETED_TIME">A_FETCH_COMPLETED_TIME</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_FETCH_HISTORY">A_FETCH_HISTORY</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_FORCE_RETIRE">A_FORCE_RETIRE</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_HERITABLE_KEYS">A_HERITABLE_KEYS</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_HTML_BASE">A_HTML_BASE</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_HTTP_PROXY_HOST">A_HTTP_PROXY_HOST</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_HTTP_PROXY_PORT">A_HTTP_PROXY_PORT</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_HTTP_TRANSACTION">A_HTTP_TRANSACTION</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_LAST_MODIFIED_HEADER">A_LAST_MODIFIED_HEADER</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_LOCALIZED_ERRORS">A_LOCALIZED_ERRORS</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_META_ROBOTS">A_META_ROBOTS</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_MINIMUM_DELAY">A_MINIMUM_DELAY</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_MIRROR_PATH">A_MIRROR_PATH</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_PREREQUISITE_URI">A_PREREQUISITE_URI</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_REFERENCE_LENGTH">A_REFERENCE_LENGTH</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_RETRY_DELAY">A_RETRY_DELAY</A>, <A 
HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_RRECORD_SET_LABEL">A_RRECORD_SET_LABEL</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_RUNTIME_EXCEPTION">A_RUNTIME_EXCEPTION</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_SOURCE_TAG">A_SOURCE_TAG</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#A_STATUS">A_STATUS</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#HEADER_TRUNC">HEADER_TRUNC</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#LENGTH_TRUNC">LENGTH_TRUNC</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#TIMER_TRUNC">TIMER_TRUNC</A>, <A HREF="../../../../org/archive/crawler/datamodel/CoreAttributeConstants.html#TRUNC_SUFFIX">TRUNC_SUFFIX</A></CODE></TD></TR></TABLE>&nbsp;<!-- ======== CONSTRUCTOR SUMMARY ======== --><A NAME="constructor_summary"><!-- --></A><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#CCCCFF" CLASS="TableHeadingColor"><TH ALIGN="left" COLSPAN="2"><FONT SIZE="+2"><B>Constructor Summary</B></FONT></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#ExtractorHTML(java.lang.String)">ExtractorHTML</A></B>(java.lang.String&nbsp;name)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#ExtractorHTML(java.lang.String, java.lang.String)">ExtractorHTML</A></B>(java.lang.String&nbsp;name,              java.lang.String&nbsp;description)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR></TABLE>&nbsp;<!-- ========== METHOD SUMMARY =========== --><A NAME="method_summary"><!-- --></A><TABLE BORDER="1" 
WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#CCCCFF" CLASS="TableHeadingColor"><TH ALIGN="left" COLSPAN="2"><FONT SIZE="+2"><B>Method Summary</B></FONT></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>&nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#extract(org.archive.crawler.datamodel.CrawlURI)">extract</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>(package private) &nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#extract(org.archive.crawler.datamodel.CrawlURI, java.lang.CharSequence)">extract</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi,        java.lang.CharSequence&nbsp;cs)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Run extractor.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;boolean</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#isHtmlExpectedHere(org.archive.crawler.datamodel.CrawlURI)">isHtmlExpectedHere</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Test whether this HTML is so unexpected (eg in place of a GIF URI) that it shouldn't be scanned for links.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" 
VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#processEmbed(org.archive.crawler.datamodel.CrawlURI, java.lang.CharSequence, java.lang.CharSequence)">processEmbed</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi,             java.lang.CharSequence&nbsp;value,             java.lang.CharSequence&nbsp;context)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#processEmbed(org.archive.crawler.datamodel.CrawlURI, java.lang.CharSequence, java.lang.CharSequence, char)">processEmbed</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi,             java.lang.CharSequence&nbsp;value,             java.lang.CharSequence&nbsp;context,             char&nbsp;hopType)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#processGeneralTag(org.archive.crawler.datamodel.CrawlURI, java.lang.CharSequence, java.lang.CharSequence)">processGeneralTag</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi,                  java.lang.CharSequence&nbsp;element,                  
java.lang.CharSequence&nbsp;cs)</CODE><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD ALIGN="right" VALIGN="top" WIDTH="1%"><FONT SIZE="-1"><CODE>protected &nbsp;void</CODE></FONT></TD><TD><CODE><B><A HREF="../../../../org/archive/crawler/extractor/ExtractorHTML.html#processLink(org.archive.crawler.datamodel.CrawlURI, java.lang.CharSequence, java.lang.CharSequence)">processLink</A></B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;curi,            java.lang.CharSequence&nbsp;value,            java.lang.CharSequence&nbsp;context)</CODE>
