📄 package-summary.html
SMTP has a rich MIME set to describe responses. Its request is pretty much unrecordable. NNTP and FTP are similar. Because of the rich MIME, there is no need for a special response type here.</p>
<p>Related, do we need the <code>request</code> record? Does it only make sense for HTTP?</p>
<p>This proposal is contentious. Gordon drew a scenario where response would be needed to distinguish local from remote capture: if an archiving institution purposefully archived without recording headers, or if the payload itself was an archived record. In opposition, it was suggested that should an institution choose to capture in this 'unusual' mode, crawl metadata could be consulted to disambiguate how the capture was done (to be further investigated; in general, the definition of record types still needs work).</p>
<h3>Edits</h3>
<p>Below are suggested edits. Changes are not substantive.</p>
<h4>Allow multiple instances of a single Named Parameter</h4>
<p>Allow multiple instances of the same Named Parameter in any one Named Parameter block. E.g. multiple <code>Related-Record-ID</code>s could prove of use. The spec mentions this in the <i>8.1 HTTP and HTTPS</i> section, but it belongs better in the <i>5.2 Named Parameters</i> preamble.</p>
<p>Related, add to the <code>Named Field</code> section a note on bidirectional <code>Related-Record-ID</code>.</p>
<h4>Miscellaneous</h4>
<p>'LaHonda' below refers to the meeting of John, Gordon and Stack at LaHonda Cafe on 16th St. on August 8th, 2006.</p>
<ul>
<li>Leave off 9.2 GZIP extra fields. It is a big section on implementing an option that has little to do with WARCing. AGREED at LaHonda.</li>
<li>But we need to mark gzipped files as being WARC, i.e. that there is one GZIP member per resource. It's useful so readers know how to invoke GZIP (whether it has to be done once to get at any record, or just per record). Suggest adding a GZIP extra field in the HEAD of the GZIP member that says 'WARC' (ARC has such a thing currently).
NOT NECESSARY per LaHonda meeting.</li>
<li>IP-Address for a dns resource is the DNS Server. Add a note to this effect in 8.2 DNS.</li>
<li>Section 6. is truncated -- missing text. What was intended here? SEE ISO DOC.</li>
<li>In-line the ANVL definition (from Kunze). Related, can labels have CTLs such as CRLF (shouldn't)? When it says 'control-chars', does this include UNICODE control characters (should)? CHAR is described as ASCII/UTF-8 but they are not the same (should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda.</li>
<li>Fix examples. Use output of the experimental ARC Writer.</li>
<li>Fix ambiguity in the spec pertaining to 'smallest possible anvl-fields' not cited by Mads Alhof Kristiansen in <a href="ftp://ftp.diku.dk/diku/semantics/papers/D-548.pdf">Digital Preservation using the WARC File Format</a>.</li>
</ul>
<h2>Open Issues</h2>
<ul>
<li>Should we allow freeform creation of custom Named Fields if they have a MIME-like 'X-' or somesuch prefix?</li>
<li>Nothing on header-line encoding (Section 11 says UTF-8). For completeness it should say US-ASCII or UTF-8, no control-chars (especially CR or LF), etc.</li>
<li><code>warcinfo</code>
<ul>
<li>What to use for a scheme? Using UUID as per G's suggestion.</li>
<li>In the past we used to get the filename from this URL header field when we were unsure of the filename or it was unavailable (we're reading a Stream). Won't be able to do that with a UUID for the URL. So, introducing a new (optional) warcinfo Named Field 'Filename' that will be used when the warcinfo is put at the start of a file.</li>
<li>Also, how to populate a description of the crawl into warcinfo? A 'Documentation' <code>Named Field</code> with a list of URLs that can be assumed to exist somewhere in the current WARC set (we'd have to make the crawler go get them at the start of a crawl).</li>
<li>I don't want to repeat the crawl description for every WARC. How to have this warcinfo point at an original? <code>related-record-id</code> seems insufficient.</li>
<li>If the crawler config. changes, can I just write a warcinfo with the differences?
How to express them? Or is this better as metadata about a warcinfo?</li>
</ul>
</li>
<li><code>revisit</code>
<ul>
<li>What to write? Use a description field or just expect this info to be present in the warcinfo? The example has a request header (inside XML). Better to use an associated <code>request</code> record for this kind of info?</li>
<li><code>Related-Record-ID</code> (RRID) of the original is likely an onerous requirement. Envisioning an implementation where we'd write <code>revisit</code> records, we'd write such a record where content was judged the same or where the date since last fetch had not changed. If we're to write the RRID, then we'd have to maintain a table keyed by URL with a value of page hash or of last modified-date plus the associated RRID (the actual RRID URL, not a hash).</li>
</ul>
</li>
<li>Should we allow a <code>Description</code> <code>Named Field</code>? E.g. I add an order file as a metadata record and associate it with a <code>warcinfo</code> record. The Description field could say "This is Heritrix Order file". Same for seeds. The alternative is custom XML packaging (Scheme could describe fields such as 'order' file) or ANVL packaging using ANVL 'comments'.</li>
<li>Section 11, why was it we said we don't need a parameter or explicit subtype for the special gzip WARC format? I don't remember. A reader needs to know when it's reading a stream. A client would like to know so it can write the stream to disk with the right suffix? Recap. (Perhaps it was looking at the MAGIC bytes -- if it starts with GZIP MAGIC and includes extra fields that denote it WARC, that's sufficient?).</li>
<li>Section 7, on truncation, on 7.1, suggest values -- 'time', 'length' -- but allow free form description? Leave off the 'superior method of indicating truncation' paragraph. This qualifier could be added to all sections of the doc -- that a subsequent revision of any aspect of the doc. will be superior.
Rather than <code>End-Length</code>, like MIME, the last record could have <code>Segment-Number-Total</code>, a count of all segments that make up the complete record.</li>
</ul>
<p>From LaHonda, discussion of the <code>revisit</code> type. The definition was tightened some by saying revisit is used when you chose not to store the capture. It was thought possible that it NOT require a pointer back to an original. Suggested it might have a similarity judgment header -- <code>similiarity-value</code> -- with values between 0 and 1. Might also have <code>analysis-method</code> and <code>description</code>. Possible methods discussed included: URI same, length same, hash of content same, judgement based off content of HTTP HEAD request, etc. Possible payloads might be: nothing, a diff, the hash obtained, etc.</p>
<h2>Unimplemented</h2>
<ul>
<li>4.2 <code>response</code>. May not be needed.</li>
<li>Record Segmentation (4.8 <code>continuation</code> record type and the 5.2 <code>Segment-*</code> Named Parameters). Future TODO.</li>
<li>4.7 <code>conversion</code> type. Future TODO.</li>
<li>9.2 GZIP extra field to mark this gzip as a list of GZIP members rather than a big gzip bundle. Future TODO.</li>
</ul>
<h2>TODOs</h2>
<ul>
<li>Unit tests using <code>multipart/*</code> (JavaMail) reading and writing records? Try <code>record-id</code> as part boundary.</li>
<li>Performance: Need to add Record-based buffering.
GZIP'd streams have some buffering because of the deflater but could probably do w/ more.</li>
</ul>
<HR>
Copyright &copy; 2003-2006 Internet Archive. All Rights Reserved.
</BODY></HTML>
