<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>13.&nbsp;Internet Archive ARC files</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix developer documentation"><link rel="up" href="index.html" title="Heritrix developer documentation"><link rel="prev" href="ar01s12.html" title="12.&nbsp;Writing a Statistics Tracker"><link rel="next" href="apa.html" title="A.&nbsp;Future changes in the API"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">13.&nbsp;Internet Archive ARC files</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="ar01s12.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="apa.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="arcs"></a>13.&nbsp;Internet Archive ARC files</h2></div></div></div><p>By default, Heritrix writes everything it crawls to disk using <a href="http://crawler.archive.org/apidocs/org/archive/crawler/writer/ARCWriterProcessor.html" target="_top">ARCWriterProcessor</a>. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described here: <a href="http://www.archive.org/web/researcher/ArcFileFormat.php" target="_top">Arc File Format</a>. Heritrix writes version 1 ARC files.</p><p>By default, Heritrix writes <span class="emphasis"><em>compressed</em></span> version 1 ARC files. The compression is done with gzip, but rather than compressing the ARC as a whole, each ARC record is gzipped individually. 
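</p><p>The per-record gzip layout can be illustrated with a short Python sketch. It is not Heritrix code: the helper names write_members and read_member and the placeholder payloads are assumptions made for illustration, and only the standard gzip and zlib modules are used. Each record is compressed as its own gzip member, the members are written back to back, and a reader that knows the byte offset of a member can inflate that one record alone.</p>

```python
import gzip
import zlib


def write_members(records):
    """Gzip each record separately and concatenate the resulting members.

    Returns the combined bytes and the byte offset at which each member starts.
    """
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))       # remember where this member begins
        blob += gzip.compress(record)   # one complete gzip member per record
    return bytes(blob), offsets


def read_member(blob, offset):
    """Inflate only the gzip member that starts at `offset`."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the first member; any bytes that
    # follow are left untouched in inflater.unused_data.
    return inflater.decompress(blob[offset:])


# Placeholder payloads, not real ARC records (hypothetical data).
records = [b"record one\n", b"record two\n", b"record three\n"]
blob, offsets = write_members(records)

# The concatenation is itself a valid gzip stream and inflates as a whole.
whole = gzip.decompress(blob)

# Random access: inflate just the second record from its offset.
second = read_member(blob, offsets[1])
```

<p>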
All gzipped records are concatenated to make up a file of multiple gzipped members. This concatenation, it turns out, is a legal gzip file; you can give it to gzip and it will decompress each record in turn. It is a convenient compression technique because it allows seeking directly to a single record and decompressing only that record. Otherwise, the whole ARC would first have to be uncompressed to get at any one record. </p><p>Before the release of Heritrix 1.0, an amendment was made to the ARC file version 1 format to allow extra metadata to be written into the first record of an ARC file. This extra metadata is written as XML. The XML Schema used by metadata instance documents can be found at <a href="http://archive.org/arc/1.0/arc.xsd" target="_top">http://archive.org/arc/1.0/arc.xsd</a>. The schema is documented <a href="http://archive.org/arc/1.0/arc.html" target="_top">here</a>.</p><p>If the extra XML metadata is present, the second '&lt;reserved&gt;' field of the second line of a version 1 ARC file is changed from '0' to '1'; that is, ARCs with XML metadata are version '1.1'.</p><p>If present, the ARC file metadata record body will contain at least the following fields (later revisions to the ARC format may add other fields):<div class="orderedlist"><ol type="1"><li><p>Software and software version used to create the ARC file. Example: 'heritrix 0.7.1 http://crawler.archive.org'.</p></li><li><p>The IP of the host that created the ARC file. Example: '103.1.0.3'.</p></li><li><p>The hostname of the host that created the ARC file. Example: 'debord.archive.org'.</p></li><li><p>Contact name of the crawl operator. Default value is 'admin'.</p></li><li><p>The http-header 'user-agent' field from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. 
Example: 'os-heritrix/0.7.0 (+http://localhost.localdomain)'.</p></li><li><p>The http-header 'from' field from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. Example: 'webmaster@localhost.localdomain'.</p></li><li><p>The 'description' from the crawl-job order file. Example: 'Heritrix integrated selftest'.</p></li><li><p>The robots honoring policy. Example: 'classic'.</p></li><li><p>Organization on whose behalf the operator is running the crawl. Example: 'Internet Archive'.</p></li><li><p>The recipient of the crawl ARC resource, if known. Example: 'Library of Congress'.</p></li></ol></div></p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcnaming"></a>13.1.&nbsp;ARC File Naming</h3></div></div></div><p>When Heritrix creates ARC files, it names them using the following template: <pre class="programlisting">        &lt;OPERATOR SPECIFIED&gt; '-' &lt;12 DIGIT TIMESTAMP&gt; '-' &lt;SERIAL NO.&gt; '-' &lt;FQDN HOSTNAME&gt; '.arc' | '.gz'        </pre>... where &lt;OPERATOR SPECIFIED&gt; is any operator-specified text, &lt;SERIAL NO.&gt; is a zero-padded 5-digit number and &lt;FQDN HOSTNAME&gt; is the fully qualified domain name of the host on which the crawl was run.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcreader"></a>13.2.&nbsp;Reading arc files</h3></div></div></div><p><a href="http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html" target="_top">ARCReader</a> can be used to read ARC files. 
It has a command-line interface that can print metadata in a pseudo <a href="http://www.archive.org/web/researcher/example_cdx.php" target="_top">CDX format</a> and fetch individual ARC records by random access (the command-line interface is described in the <a href="http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html#main(java.lang.String[])" target="_top">main method javadoc</a> comments).</p><p><a href="http://netarchive.dk/website/sources/index-en.php" target="_top">Netarchive.dk</a> have also developed ARC reading and writing tools.</p><p>Tom Emerson of Basis Technology has put up a project on SourceForge to host a BSD-licensed C++ ARC reader called <a href="http://sourceforge.net/projects/libarc/" target="_top">libarc</a> (it has since been moved to <a href="http://archive-access.sourceforge.net/" target="_top">archive-access</a>).</p><p>The French National Library (BnF) has also released a GPL Perl/C ARC reader. See <a href="http://crawler.archive.org/cgi-bin/wiki.pl?BnfArcTools" target="_top">BAT</a> for documentation and where to download it.</p><p>See <a href="http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/" target="_top">Hedaern</a> for Python readers/writers and for a skeleton webapp that allows querying by timestamp+date as well as full-text search of ARC content.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcwriter"></a>13.3.&nbsp;Writing arc files</h3></div></div></div><p>Here is an example ARC writer application: <a href="http://nwatoolset.sourceforge.net/docs/NedlibToARC/" target="_top">Nedlib To ARC conversion</a>. 
It rewrites <a href="http://www.csc.fi/sovellus/nedlib/index.phtml.en" target="_top">nedlib</a> files as ARCs.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="searching_arcs"></a>13.4.&nbsp;Searching ARCS</h3></div></div></div><p>To search ARCs, see the <a href="http://archive-access.sourceforge.net/projects/nutch/" target="_top">NutchWAX</a>+<a href="http://archive-access.sourceforge.net/projects/wera/" target="_top">WERA</a> bundle.</p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="ar01s12.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="apa.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">12.&nbsp;Writing a Statistics Tracker&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;A.&nbsp;Future changes in the API</td></tr></table></div></body></html>