outside.html

      <code class="literal">recover.gz</code>. If a crawl crashes, the recover.gz journal can be used to approximately recreate the state of the crawler at the time of the crash. Recovery can take a long time in some cases, but is usually much quicker than repeating a crawl.</p><p>To run the recovery process, relaunch the crashed crawler. Create a new crawl order job based on the crawl that crashed. If you choose the "recover-log" link from the list of completed jobs in the 'Based on recovery' page, the new job will automatically be set up to use the original job's recovery journal to bootstrap its Frontier state (of completed and queued URIs). Further, if the recovered job attempts to reuse any already-full 'logs' or 'state' directories, new paths for these directories will be chosen with as many '-R' suffixes as are necessary to specify a new, empty directory.</p><p>(If you simply base your new job on the old job without using the 'Recover' link, you must manually enter the full path of the original crawl's recovery journal into the <code class="literal">recover-path</code> setting, near the end of all settings. You must also adjust the 'logs' and 'state' directory settings if they were specified using absolute paths that would cause the new crawl to reuse the directories of the original job.)</p><p>After making any further adjustments to the crawl settings, submit the new job. The submission will hang for a long time while the recover.gz file is read in its entirety by the frontier. (This can take hours for a crawl that has run for a long time, and during this time the crawler control panel will appear idle, with no job pending or in progress, even though the machine will be busy.) Eventually the submission and crawl job launch should complete, and the crawl should pick up from close to where the crash occurred. 
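<p>The '-R' suffixing described above can be sketched as follows (an illustrative shell reimplementation, not Heritrix's actual code; the <code class="literal">logs</code> directory here is just an example):</p>

```shell
#!/bin/sh
# Illustrative sketch of "-R" suffixing: keep appending "-R" to a
# directory name until the result names a missing or empty directory.
pick_fresh_dir() {
  d="$1"
  while [ -e "$d" ] && [ -n "$(ls -A "$d" 2>/dev/null)" ]; do
    d="${d}-R"
  done
  printf '%s\n' "$d"
}

mkdir -p logs && touch logs/crawl.log   # simulate an already-full 'logs' dir
pick_fresh_dir logs                     # prints: logs-R
```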
There is no marking in the logs that this crawl was started by reading a recovery log, so be sure to note in the crawl journal that this was done.</p><p>The recovery log is gzipped because it gets very large otherwise, and because of the repetition of terms it compresses very well. On abnormal termination of the crawl job, gzip will report <code class="literal">unexpected end of file</code> if you try to ungzip the recover.gz file: gzip is complaining that the file write was abnormally terminated. But the recover.gz file will still be of use in restoring the frontier, at least up to where the gzip file went bad (gzip compresses in 32k blocks; the worst loss would be the last 32k of gzipped data).</p><p>Java's gzip support (up through at least Java 1.5/5.0) can compress arbitrarily large input streams, but has problems decompressing any stream whose output is larger than 2GB. Attempting to recover a crawl whose recovery log would be over 2GB uncompressed triggers a FatalConfigurationException alert with the detail message "Recover.log problem: java.io.IOException: Corrupt GZIP trailer". Heritrix will accept either compressed or uncompressed recovery log files, so a workaround is to first uncompress the recovery log using another, non-Java tool (such as the 'gunzip' available in Linux and Cygwin), then refer to this uncompressed recovery log when recovering. 
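<p>The truncation behaviour and the non-Java decompression workaround can be demonstrated with throwaway files (a sketch; the file names are examples only):</p>

```shell
#!/bin/sh
# Simulate a recovery journal whose write was cut off by a crash.
seq 1 100000 | gzip -c > recover.gz        # stand-in for a full journal
head -c 20000 recover.gz > truncated.gz    # mimic an abnormally ended write

# gzip's integrity test fails on the truncated journal, as described above.
gzip -t truncated.gz 2>/dev/null || echo "integrity check failed (expected)"

# Decompressing outside Java salvages everything up to the damaged block;
# gunzip exits non-zero because the trailer is missing, but the prefix of
# the data is written out and remains usable for frontier recovery.
gunzip -c truncated.gz > recover.log 2>/dev/null || true
head -n 1 recover.log   # prints: 1
```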
(Reportedly, Java 6.0 "Mustang" will fix this Java bug with un-gzipping large files.)</p><p>See also the related recovery facility, <a href="outside.html#checkpoint" title="9.4.&nbsp;Checkpointing">Section&nbsp;9.4, &ldquo;Checkpointing&rdquo;</a>, below, for an alternate recovery mechanism.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="checkpoint"></a>9.4.&nbsp;Checkpointing</h3></div></div></div><p>When checkpointing [<a href="glossary.html#checkpointing">Checkpointing</a>], the crawler writes a representation of its current state to a directory under <code class="literal">checkpoints-path</code>, named for the checkpoint. Checkpointed state includes serializations of the main crawler objects, copies of the current set of bdbje log files, etc. The idea is that the checkpoint directory contains all that is required to recover a crawler. Checkpointing also rotates off the crawler logs, including the <code class="literal">recover.gz</code> log, if enabled. Log files are NOT copied to the checkpoint directory. They are left under the logs directory but are distinguished by a suffix. The suffix is the checkpoint name (e.g. <code class="literal">crawl.log.000031</code>, where <code class="literal">000031</code> is the checkpoint name).</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Currently, only the BdbFrontier using the bdbje-based already-seen or the bloom filter already-seen is checkpointable.</p></div><p>To run a checkpoint, click the checkpoint button in the UI or invoke <code class="literal">checkpoint</code> from JMX. This launches a thread that runs through the following steps: if crawling, pause the crawl; run the actual checkpoint; and, if the crawler was crawling when the checkpoint was invoked, resume the crawl. 
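<p>The log rotation described above can be sketched as follows (an illustrative imitation of the renaming, not Heritrix code; the log file names and checkpoint name are examples):</p>

```shell
#!/bin/sh
# After a checkpoint, crawler logs stay in the logs directory but gain
# the checkpoint name as a suffix, e.g. crawl.log -> crawl.log.000031.
mkdir -p logs
touch logs/crawl.log logs/local-errors.log
CP_NAME=000031                      # example checkpoint name
for f in logs/*.log; do
  mv "$f" "$f.$CP_NAME"
done
ls logs   # crawl.log.000031 and local-errors.log.000031
```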
Depending on the size of the crawl, checkpointing can take some time; often the step that takes longest is pausing the crawl, waiting on threads to get into a paused, checkpointable state. While checkpointing, the status will show as <code class="literal">CHECKPOINTING</code>. When the checkpoint has completed, the crawler will resume crawling (or, if it was in the PAUSED state when checkpointing was invoked, it will return to the PAUSED state).</p><p>Recovery from a checkpoint has much in common with the recovery of a crawl using the recover.log (see <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>). To recover, create a job. Then, before launching, set the <code class="literal">crawl-order/recover-path</code> to point at the checkpoint directory you want to recover from. Alternatively, browse to the <code class="literal">Jobs-&gt;Based on a recovery</code> screen and select the checkpoint you want to recover from. After clicking, a new job will be created that takes the old job's (end-of-crawl) settings and autofills the recover-path with the right directory path. (The renaming of the logs and <code class="literal">crawl-order/state-path</code> "state" directories so they do not clash with the old ones, as described above in <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>, is also done.) The first thing recovery does is copy the saved-off bdbje log files into place. Again, recovery can take time: an hour or more for a crawl of millions of URIs.</p><p>Checkpointing is currently experimental. The recover-log technique is tried-and-true. 
Once checkpointing is proven reliable, faster, and more comprehensive, it will become the preferred method of recovering a crawler.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10E84"></a>9.4.1.&nbsp;Expert mode: Fast checkpointing</h4></div></div></div><p>The bulk of the time spent checkpointing is taken up copying off the bdbje logs. For example, for a crawl that had downloaded 18 million items and discovered more than 130 million (bloom filter), checkpointing took about 100 minutes to complete, of which more than 90 minutes were spent copying the ~12k bdbje log files (only one disk was involved). Set the log level on <code class="literal">org.archive.util.FileUtils</code> to FINE to watch the java bdbje log file-copy.</p><p>Since copying off bdbje log files can take hours, we've added an <span class="emphasis"><em>expert mode</em></span> checkpoint that bypasses bdbje log copying. The upside is that your checkpoint completes promptly -- in minutes, even if the crawl is large -- but the downside is that recovery takes more work: to recover from a checkpoint, the bdbje log files need to be manually assembled in the checkpoint's <code class="literal">bdb-logs</code> subdirectory. You'll know which bdbje log files make up the checkpoint because Heritrix writes the list of bdbje logs into the checkpoint directory in a file named <code class="literal">bdbj-logs-manifest.txt</code>. To prevent bdbje from removing log files that might be needed to assemble a checkpoint made at some time in the past, when running expert-mode checkpointing we configure bdbje not to delete logs when it is finished with them; instead, bdbje gives logs it is no longer using a
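<p>The manual assembly step described above can be sketched as follows (a hypothetical shell walk-through; the checkpoint and state paths, and the fabricated manifest and log files, are examples only):</p>

```shell
#!/bin/sh
# Copy every bdbje log named in bdbj-logs-manifest.txt into the
# checkpoint's bdb-logs subdirectory, ready for recovery.
CKPT=checkpoints/checkpoint-000031   # example checkpoint directory
STATE=state                          # example bdbje state directory
mkdir -p "$CKPT/bdb-logs" "$STATE"

# Fabricate a manifest and the logs it names, purely for demonstration.
printf '00000001.jdb\n00000002.jdb\n' > "$CKPT/bdbj-logs-manifest.txt"
touch "$STATE/00000001.jdb" "$STATE/00000002.jdb"

# The assembly itself: copy each manifest entry into bdb-logs.
while IFS= read -r f; do
  cp "$STATE/$f" "$CKPT/bdb-logs/"
done < "$CKPT/bdbj-logs-manifest.txt"
ls "$CKPT/bdb-logs"   # the two .jdb logs named in the manifest
```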
