<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><!--NewPage--><HTML><HEAD><!-- Generated by javadoc (build 1.5.0_06) on Wed Sep 27 16:03:14 PDT 2006 --><TITLE>org.archive.io.warc (Heritrix 1.10.1)</TITLE><META NAME="keywords" CONTENT="org.archive.io.warc package"><LINK REL ="stylesheet" TYPE="text/css" HREF="../../../../stylesheet.css" TITLE="Style"><SCRIPT type="text/javascript">function windowTitle(){    parent.document.title="org.archive.io.warc (Heritrix 1.10.1)";}</SCRIPT><NOSCRIPT></NOSCRIPT></HEAD><BODY BGCOLOR="white" onload="windowTitle();"><!-- ========= START OF TOP NAVBAR ======= --><A NAME="navbar_top"><!-- --></A><A HREF="#skip-navbar_top" title="Skip navigation links"></A><TABLE BORDER="0" WIDTH="100%" CELLPADDING="1" CELLSPACING="0" SUMMARY=""><TR><TD COLSPAN=2 BGCOLOR="#EEEEFF" CLASS="NavBarCell1"><A NAME="navbar_top_firstrow"><!-- --></A><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="3" SUMMARY="">  <TR ALIGN="center" VALIGN="top">  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../../overview-summary.html"><FONT CLASS="NavBarFont1"><B>Overview</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#FFFFFF" CLASS="NavBarCell1Rev"> &nbsp;<FONT CLASS="NavBarFont1Rev"><B>Package</B></FONT>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Class</FONT>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="package-use.html"><FONT CLASS="NavBarFont1"><B>Use</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="package-tree.html"><FONT CLASS="NavBarFont1"><B>Tree</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../../deprecated-list.html"><FONT CLASS="NavBarFont1"><B>Deprecated</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../../index-all.html"><FONT CLASS="NavBarFont1"><B>Index</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">   
 <A HREF="../../../../help-doc.html"><FONT CLASS="NavBarFont1"><B>Help</B></FONT></A>&nbsp;</TD>  </TR></TABLE></TD><TD ALIGN="right" VALIGN="top" ROWSPAN=3><EM></EM></TD></TR><TR><TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">&nbsp;<A HREF="../../../../org/archive/io/arc/package-summary.html"><B>PREV PACKAGE</B></A>&nbsp;&nbsp;<A HREF="../../../../org/archive/net/package-summary.html"><B>NEXT PACKAGE</B></A></FONT></TD><TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">  <A HREF="../../../../index.html?org/archive/io/warc/package-summary.html" target="_top"><B>FRAMES</B></A>  &nbsp;&nbsp;<A HREF="package-summary.html" target="_top"><B>NO FRAMES</B></A>  &nbsp;&nbsp;<SCRIPT type="text/javascript">  <!--  if(window==top) {    document.writeln('<A HREF="../../../../allclasses-noframe.html"><B>All Classes</B></A>');  }  //--></SCRIPT><NOSCRIPT>  <A HREF="../../../../allclasses-noframe.html"><B>All Classes</B></A></NOSCRIPT></FONT></TD></TR></TABLE><A NAME="skip-navbar_top"></A><!-- ========= END OF TOP NAVBAR ========= --><HR><H2>Package org.archive.io.warc</H2>Experimental WARC Writer and Readers.<P><B>See:</B><BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<A HREF="#package_description"><B>Description</B></A><P><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#CCCCFF" CLASS="TableHeadingColor"><TH ALIGN="left" COLSPAN="2"><FONT SIZE="+2"><B>Interface Summary</B></FONT></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A HREF="../../../../org/archive/io/warc/WARCConstants.html" title="interface in org.archive.io.warc">WARCConstants</A></B></TD><TD>WARC Constants used by readers and writers.</TD></TR></TABLE>&nbsp;<P><TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY=""><TR BGCOLOR="#CCCCFF" CLASS="TableHeadingColor"><TH ALIGN="left" COLSPAN="2"><FONT SIZE="+2"><B>Class Summary</B></FONT></TH></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A 
HREF="../../../../org/archive/io/warc/ExperimentalWARCWriter.html" title="class in org.archive.io.warc">ExperimentalWARCWriter</A></B></TD><TD><b>Experimental</b> WARC implementation.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A HREF="../../../../org/archive/io/warc/WARCReader.html" title="class in org.archive.io.warc">WARCReader</A></B></TD><TD>WARCReader.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A HREF="../../../../org/archive/io/warc/WARCReaderFactory.html" title="class in org.archive.io.warc">WARCReaderFactory</A></B></TD><TD>Factory for WARC Readers.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A HREF="../../../../org/archive/io/warc/WARCRecord.html" title="class in org.archive.io.warc">WARCRecord</A></B></TD><TD>A WARC file Record.</TD></TR><TR BGCOLOR="white" CLASS="TableRowColor"><TD WIDTH="15%"><B><A HREF="../../../../org/archive/io/warc/WARCWriterPool.html" title="class in org.archive.io.warc">WARCWriterPool</A></B></TD><TD>A pool of WARCWriters.</TD></TR></TABLE>&nbsp;<P><A NAME="package_description"><!-- --></A><H2>Package org.archive.io.warc Description</H2><P>Experimental WARC Writer and Readers.  Code and specification are subject to change with no guarantees of backward compatibility: i.e. newer readers may not be able to parse WARCs written with older writers. This code, with noted exceptions, is a loose implementation of parts of the (unreleased and unfinished) <a href="http://archive-access.sourceforge.net/warc/warc_file_format.html">WARC File Format (Version 0.9)</a>. Deviations from 0.9, outlined below in the section <i>Deviations from Spec.</i>, are to be proposed as amendments to the specification.  Since the new spec. 
revision will likely be named version 0.10, code in this package writes WARCs of version 0.10 -- not 0.9.<h2>Tools</h2><p>Initial implementations of <code>Arc2Warc</code> and <code>Warc2Arc</code> tools can be found in the package above this one, at <A HREF="../../../../org/archive/io/Arc2Warc.html" title="class in org.archive.io"><CODE>Arc2Warc</CODE></A> and <A HREF="../../../../org/archive/io/Warc2Arc.html" title="class in org.archive.io"><CODE>Warc2Arc</CODE></A> respectively.  Pass <code>--help</code> to learn how to use each tool.<h2>Implementation Notes</h2><h3>Unique ID Generator</h3><p>WARC requires a GUID for each record written. A configurable unique ID <A HREF="../../../../org/archive/uid/GeneratorFactory.html" title="class in org.archive.uid"><CODE>GeneratorFactory</CODE></A> (it can be configured to use alternate unique ID generators) was added with a default of <A HREF="../../../../org/archive/uid/UUIDGenerator.html" title="class in org.archive.uid"><CODE>UUIDGenerator</CODE></A>.  The default implementation generates <a href="http://en.wikipedia.org/wiki/UUID">UUIDs</a> (using Java 5 <code>java.util.UUID</code>) with the <code>urn</code> scheme [See <a href="http://www.ietf.org/rfc/rfc4122.txt">RFC4122</a>].</p><h3><A HREF="../../../../org/archive/util/anvl/package-summary.html"><CODE>ANVL</CODE></A></h3><p>The ANVL RFC822-like format is used for writing <code>Named Fields</code> in WARCs and occasionally for metadata. An implementation was added at <A HREF="../../../../org/archive/util/anvl/package-summary.html"><CODE>org.archive.util.anvl</CODE></A>.</p><h3><a name="deviations">Deviations from Spec.</a></h3><p>The deviations below from spec. 0.9 are to be proposed as spec. 
amendments, with the new revision likely to be 0.10 (verbal agreement between John, Gordon, and Stack at the <i>La Honda</i> meeting, August 8th, 2006).</p><h3>mimetype in header line</h3><p>Allow full mimetypes in the header line as per RFC2045 rather than the current, shriveled mimetype that allows only type and subtype.  This will mean mimetypes are allowed <i>parameters</i>: e.g. <code>text/plain; charset=UTF-8</code> or <code>application/http; msgtype=request</code>.  Allowing full mimetypes, we can support the following scenarios without further amendment to the specification and without parsers having to resort to <code>metadata</code> records or to custom <code>Named Fields</code> to figure out how to interpret the payload:<ul><li>Consider the case where an archiving organization would store everything related to a capture as one record with a mimetype of <code>multipart/mixed; boundary=RECORD-ID</code>.  An example record might comprise the parts <code>Content-Type: application/http; msgtype=request</code>, <code>Content-Type: application/http; msgtype=response</code>, and <code>Content-Type: text/xml+rdf</code> (for metadata).</li><li>Or, an archiving institution would store a capture with <code>multipart/alternative</code> ranging from most basic (or 'desiccated' in Kunze-speak) -- perhaps a <code>text/plain</code> rendition of a PDF capture -- through to <code>best</code>, the actual PDF binary itself.</li></ul></p><p>To support full mimetypes, we must allow for whitespace between parameters and allow that parameter values themselves might include whitespace ('quoted-string'). 
The WARC Writer converts any embedded carriage-returns and newlines to a single space.</p><h3>Swap position of recordid and mimetype in the header line</h3><p>Because of the above amendment allowing full mimetypes on the header line, and since the mimetype may now include whitespace, we ease the parse by moving the mimetype to the last position on the header line and the recordid to second-from-last.</p><h3>Use application/http instead of message/http</h3><p>The message type has a line length maximum of 1000 characters absent a <code>Content-Transfer-Encoding</code> header set to <code>BINARY</code>. (See the definition of message/http for talk of adherence to MIME <code>message</code> line limits: see 19.1 Internet Media Type message/http and application/http in <a href="http://www.faqs.org/rfcs/rfc2616.html">RFC2616</a>).</p><h3>Miscellaneous</h3><p>When writing WARCs, the <code>resource</code> record type is chosen as the core record that all others associate to: i.e. all others have a <code>Related-Record-ID</code> that points back to the <code>resource</code>.</p><h2>Suggested Spec. Amendments</h2><p>Apart from the <a href="#deviations">deviations</a> listed above, the changes below are also suggested by Stack:</p><h3>Drop response record type</h3><p><code>resource</code> is sufficient. Let the mimetype distinguish whether the capture comes with response headers or not (as per the comment at the end of <i>8.1 HTTP and HTTPS</i>, which allows that, if there are no response headers, the resource record type and page mimetype be used rather than the response type plus a mimetype of message/http: the difference in record types is not needed to distinguish between the two types of capture).</p><p>Are there other capture methods that would require a response record, that don't have a mimetype that includes response headers and content?
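<p>As a rough illustration of the record ID and mimetype handling described in the implementation notes and deviations above, the following Java sketch builds a <code>urn:uuid:</code> record ID with <code>java.util.UUID</code> and flattens embedded carriage-returns and newlines in a mimetype before placing it last on the header line. The class and method names (<code>WarcHeaderSketch</code>, <code>newRecordId</code>, <code>normalizeMimetype</code>, <code>headerTail</code>) are hypothetical and not part of this package. Collapsing each run of CR/LF characters into one space, and emitting only the last two header-line fields, are assumptions made for the example rather than a description of what the writer actually does.</p>

```java
import java.util.UUID;

// Hypothetical helper for illustration only; not part of org.archive.io.warc.
class WarcHeaderSketch {

    // Builds a record ID as a UUID in the urn scheme described by RFC 4122.
    static String newRecordId() {
        return "urn:uuid:" + UUID.randomUUID();
    }

    // Replaces each run of embedded CR/LF characters with one space.
    // Assumption: the real writer may instead treat each character separately.
    static String normalizeMimetype(String mimetype) {
        return mimetype.replaceAll("[\\r\\n]+", " ").trim();
    }

    // Returns only the last two header-line fields:
    // recordid second-from-last, mimetype (which may contain spaces) last.
    static String headerTail(String recordId, String mimetype) {
        return recordId + " " + normalizeMimetype(mimetype);
    }

    public static void main(String[] args) {
        String mimetype = "application/http;\r\nmsgtype=request";
        System.out.println(headerTail(newRecordId(), mimetype));
    }
}
```

<p>With the mimetype in last position, a parser can split the leading fields on whitespace and treat the remainder of the line as the mimetype, which is the parsing convenience the swap is meant to provide.</p>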
