<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>13. Internet Archive ARC files</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix developer documentation"><link rel="up" href="index.html" title="Heritrix developer documentation"><link rel="prev" href="ar01s12.html" title="12. Writing a Statistics Tracker"><link rel="next" href="apa.html" title="A. Future changes in the API"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">13. Internet Archive ARC files</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="ar01s12.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="apa.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="arcs"></a>13. Internet Archive ARC files</h2></div></div></div><p>By default, Heritrix writes everything it crawls to disk using <a href="http://crawler.archive.org/apidocs/org/archive/crawler/writer/ARCWriterProcessor.html" target="_top">ARCWriterProcessor</a>. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described here: <a href="http://www.archive.org/web/researcher/ArcFileFormat.php" target="_top">Arc File Format</a>. Heritrix writes version 1 ARC files.</p><p>By default, Heritrix writes <span class="emphasis"><em>compressed</em></span> version 1 ARC files. The compression is done with gzip, but rather than compressing the ARC as a whole, each ARC record is gzipped individually. The gzipped records are then concatenated to make up a file of multiple gzip members. 
This concatenation, it turns out, is a legal gzip file; you can give it to gzip and it will decompress each record in turn. It's a convenient compression technique because it allows seeking directly to a single record and decompressing that record only. Otherwise, the whole ARC would first have to be uncompressed to get at any one record.</p><p>Before the release of Heritrix 1.0, an amendment was made to the ARC file version 1 format to allow extra metadata to be written into the first record of an ARC file. This extra metadata is written as XML. The XML Schema used by metadata instance documents can be found at <a href="http://archive.org/arc/1.0/arc.xsd" target="_top">http://archive.org/arc/1.0/arc.xsd</a>. The schema is documented <a href="http://archive.org/arc/1.0/arc.html" target="_top">here</a>.</p><p>If the extra XML metadata is present, the second '&lt;reserved&gt;' field of the second line of version 1 ARC files is changed from '0' to '1'; that is, ARCs with XML metadata are version '1.1'.</p><p>If present, the ARC file metadata record body will contain at least the following fields (later revisions to the ARC may add other fields):<div class="orderedlist"><ol type="1"><li><p>Software and software version used to create the ARC file. Example: 'heritrix 0.7.1 http://crawler.archive.org'.</p></li><li><p>The IP of the host that created the ARC file. Example: '103.1.0.3'.</p></li><li><p>The hostname of the host that created the ARC file. Example: 'debord.archive.org'.</p></li><li><p>Contact name of the crawl operator. Default value is 'admin'.</p></li><li><p>The http-header 'user-agent' field from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. Example: 'os-heritrix/0.7.0 (+http://localhost.localdomain)'.</p></li><li><p>The http-header 'from' field from the crawl-job order file. This field is recorded here in the metadata only until the day ARCs record the HTTP request made. 
Example: 'webmaster@localhost.localdomain'.</p></li><li><p>The 'description' from the crawl-job order file. Example: 'Heritrix integrated selftest'.</p></li><li><p>The Robots honoring policy. Example: 'classic'.</p></li><li><p>Organization on whose behalf the operator is running the crawl. Example: 'Internet Archive'.</p></li><li><p>The recipient of the crawl ARC resource if known. Example: 'Library of Congress'.</p></li></ol></div></p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcnaming"></a>13.1. ARC File Naming</h3></div></div></div><p>When Heritrix creates ARC files, it names them using the following template: <pre class="programlisting"> &lt;OPERATOR SPECIFIED&gt; '-' &lt;12 DIGIT TIMESTAMP&gt; '-' &lt;SERIAL NO.&gt; '-' &lt;FQDN HOSTNAME&gt; '.arc' | '.gz' </pre>... where &lt;OPERATOR SPECIFIED&gt; is any operator-specified text, &lt;SERIAL NO.&gt; is a zero-padded 5-digit number, and &lt;FQDN HOSTNAME&gt; is the fully-qualified domain name of the host on which the crawl was run.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcreader"></a>13.2. Reading arc files</h3></div></div></div><p><a href="http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html" target="_top">ARCReader</a> can be used to read ARC files. 
It has a command-line interface that can be used to print out meta info in a pseudo <a href="http://www.archive.org/web/researcher/example_cdx.php" target="_top">CDX format</a> and to fetch individual ARC records by random access (the command-line interface is described in the <a href="http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html#main(java.lang.String[])" target="_top">main method javadoc</a> comments).</p><p><a href="http://netarchive.dk/website/sources/index-en.php" target="_top">Netarchive.dk</a> have also developed ARC reading and writing tools.</p><p>Tom Emerson of Basis Technology has put up a project on SourceForge to host a BSD-licensed C++ ARC reader called <a href="http://sourceforge.net/projects/libarc/" target="_top">libarc</a> (it has since been moved to <a href="http://archive-access.sourceforge.net/" target="_top">archive-access</a>).</p><p>The French National Library (BnF) has also released a GPL Perl/C ARC reader. See <a href="http://crawler.archive.org/cgi-bin/wiki.pl?BnfArcTools" target="_top">BAT</a> for documentation and download locations.</p><p>See <a href="http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/hedaern/" target="_top">Hedaern</a> for Python readers/writers and for a skeleton webapp that allows querying by timestamp+date as well as full-text search of ARC content.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="arcwriter"></a>13.3. Writing arc files</h3></div></div></div><p>Here is an example ARC writer application: <a href="http://nwatoolset.sourceforge.net/docs/NedlibToARC/" target="_top">Nedlib To ARC conversion</a>. It rewrites <a href="http://www.csc.fi/sovellus/nedlib/index.phtml.en" target="_top">nedlib</a> files as ARCs.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="searching_arcs"></a>13.4. 
Searching ARCS</h3></div></div></div><p>Check out the <a href="http://archive-access.sourceforge.net/projects/nutch/" target="_top">NutchWAX</a>+<a href="http://archive-access.sourceforge.net/projects/wera/" target="_top">WERA</a> bundle.</p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="ar01s12.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="apa.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">12. Writing a Statistics Tracker </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> A. Future changes in the API</td></tr></table></div></body></html>