to use it. See also <a href="outside.html#mon_com" title="9.5. Remote Monitoring and Control">Section 9.5, “Remote Monitoring and Control”</a>.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="recover"></a>9.3. Recovery of Frontier State and recover.gz</h3></div></div></div><p>During normal running, the Heritrix Frontier by default keeps a journal. The journal is kept in the job's logs directory and is named <code class="literal">recover.gz</code>. If a crawl crashes, the recover.gz journal can be used to approximately recreate the state of the crawler at the time of the crash. Recovery can take a long time in some cases, but is usually much quicker than repeating a crawl.</p><p>To run the recovery process, relaunch the crashed crawler and create a new crawl order job based on the crawl that crashed. If you choose the "recover-log" link from the list of completed jobs in the 'Based on recovery' page, the new job will automatically be set up to use the original job's recovery journal to bootstrap its Frontier state (of completed and queued URIs). Further, if the recovered job attempts to reuse any already-full 'logs' or 'state' directories, new paths for these directories will be chosen, with as many '-R' suffixes appended as are necessary to arrive at a new, empty directory.</p><p>(If you simply base your new job on the old job without using the 'Recover' link, you must manually enter the full path of the original crawl's recovery journal into the <code class="literal">recover-path</code> setting, near the end of all settings. You must also adjust the 'logs' and 'state' directory settings if they were specified using absolute paths that would cause the new crawl to reuse the directories of the original job.)</p><p>After making any further adjustments to the crawl settings, submit the new job. The submission will hang for a long time while the recover.gz file is read in its entirety by the Frontier.
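</p>

<p>Before submitting the recovered job, it may help to sanity-check that the journal is readable and gauge its size, since the Frontier reads it in its entirety. Below is a sketch using a simulated journal; a real journal lives in the job's logs directory, and the entry format shown here is illustrative.</p>

```shell
# Simulate a small recovery journal, then inspect it the way you
# would before a real recovery (entry format is illustrative).
mkdir -p logs
printf 'F+ http://example.com/1\nF+ http://example.com/2\n' | gzip -c > logs/recover.gz

# Peek at the journal's tail, and count entries to gauge how long
# the Frontier rebuild will take on relaunch.
gunzip -c logs/recover.gz | tail -n 5
gunzip -c logs/recover.gz | wc -l
```

<p>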
(This can take hours for a crawl that has run for a long time, and during this time the crawler control panel will appear idle, with no job pending or in progress, but the machine will be busy.) Eventually the submission and crawl-job launch should complete, and the crawl should pick up from close to where the crash occurred. There is no marker in the logs indicating that this crawl was started by reading a recovery journal, so be sure to note that this was done in the crawl journal.</p><p>The recovery log is gzipped because it would otherwise grow very large, and because of the repetition of terms it compresses very well. If the crawl job terminates abnormally and you try to ungzip the recover.gz file, gzip will report <code class="literal">unexpected end of file</code>: it is complaining that the file write was abnormally terminated. But the recover.gz file will still be of use in restoring the Frontier, at least up to the point where the gzip file went bad (gzip compresses in 32k blocks, so the worst loss would be the last 32k of gzipped data).</p><p>Java's gzip support (up through at least Java 1.5/5.0) can compress arbitrarily large input streams, but has problems decompressing any stream whose output is larger than 2GB. Attempting to recover a crawl whose recovery log would uncompress to over 2GB triggers a FatalConfigurationException alert, with the detail message "Recover.log problem: java.io.IOException: Corrupt GZIP trailer". Heritrix will accept either compressed or uncompressed recovery log files, so a workaround is to first uncompress the recovery log using a non-Java tool (such as the 'gunzip' available in Linux and Cygwin), then refer to this uncompressed recovery log when recovering. (Reportedly, Java 6.0 "Mustang" will fix this Java bug with un-gzipping large files.)</p><p>See also the related recovery facility, <a href="outside.html#checkpoint" title="9.4. Checkpointing">Section 9.4, “Checkpointing”</a>, for an alternate recovery mechanism.
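</p>

<p>The gunzip workaround above can be sketched as follows. The journal here is simulated, and real paths will differ:</p>

```shell
# Simulate a gzipped recovery journal (contents are illustrative).
mkdir -p logs
printf 'F+ http://example.com/1\n' | gzip -c > logs/recover.gz

# Uncompress with a native (non-Java) gzip to sidestep the Java
# >2GB decompression bug. Heritrix accepts either compressed or
# uncompressed journals, so the recover-path setting can then
# reference the plain-text file instead of the .gz.
gunzip -c logs/recover.gz > logs/recover
cat logs/recover
```

<p>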
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="checkpoint"></a>9.4. Checkpointing</h3></div></div></div><p>When checkpointing [<a href="glossary.html#checkpointing">Checkpointing</a>], the crawler writes a representation of its current state to a directory under <code class="literal">checkpoints-path</code>, named for the checkpoint. Checkpointed state includes serializations of the main crawler objects, copies of the current set of bdbje log files, etc. The idea is that the checkpoint directory contains all that is required to recover a crawler. Checkpointing also rotates off the crawler logs, including the <code class="literal">recover.gz</code> log, if enabled. Log files are NOT copied to the checkpoint directory; they are left under the logs directory but are distinguished by a suffix. The suffix is the checkpoint name (e.g. <code class="literal">crawl.log.000031</code>, where <code class="literal">000031</code> is the checkpoint name). </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Currently, only the BdbFrontier using the bdbje-based already-seen or the bloom filter already-seen is checkpointable. </p></div><p>To run a checkpoint, click the checkpoint button in the UI or invoke <code class="literal">checkpoint</code> from JMX. This launches a thread that runs through the following steps: if crawling, pause the crawl; run the actual checkpoint; if the crawler was crawling when the checkpoint was invoked, resume the crawl. Depending on the size of the crawl, checkpointing can take some time; often the step that takes longest is pausing the crawl, waiting on threads to get into a paused, checkpointable state. While checkpointing, the status will show as <code class="literal">CHECKPOINTING</code>. When the checkpoint has completed, the crawler will resume crawling (or, if it was in the PAUSED state when checkpointing was invoked, it will return to the PAUSED state).
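</p>

<p>Because rotated logs stay in the logs directory with the checkpoint name appended as a suffix, one checkpoint's logs can be listed with an ordinary glob. A sketch with simulated file names following the <code class="literal">crawl.log.000031</code> pattern described above:</p>

```shell
# Simulate logs rotated off by two checkpoints (the suffix is the
# checkpoint name, per the crawl.log.000031 example above).
mkdir -p logs
touch logs/crawl.log logs/crawl.log.000030 logs/crawl.log.000031

# List only the copy rotated off by checkpoint 000031.
ls logs/crawl.log.000031

# Or all rotated crawl.log copies, across checkpoints.
ls logs/crawl.log.0*
```

<p>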
</p><p>Recovery from a checkpoint has much in common with the recovery of a crawl using the recover.log (see <a href="outside.html#recover" title="9.3. Recovery of Frontier State and recover.gz">Section 9.3, “Recovery of Frontier State and recover.gz”</a>). To recover, create a job. Then, before launching, set <code class="literal">crawl-order/recover-path</code> to point at the checkpoint directory you want to recover from. Alternatively, browse to the <code class="literal">Jobs->Based on a recovery</code> screen and select the checkpoint you want to recover from. After clicking, a new job will be created that takes the old job's (end-of-crawl) settings and autofills the recover-path with the right directory path. (The renaming of the logs and <code class="literal">crawl-order/state-path</code> "state" directories so they do not clash with the old ones, as described above in <a href="outside.html#recover" title="9.3. Recovery of Frontier State and recover.gz">Section 9.3, “Recovery of Frontier State and recover.gz”</a>, is also done.) The first thing recovery does is copy the saved-off bdbje log files into place. Again, recovery can take time -- an hour or more for a crawl of millions of URIs. </p><p>Checkpointing is currently experimental; the recover-log technique is tried-and-true. Once checkpointing is proven reliable, faster, and more comprehensive, it will become the preferred method of recovering a crawler. </p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10E7D"></a>9.4.1. Expert mode: Fast checkpointing</h4></div></div></div><p>The bulk of checkpointing time is taken up by copying off the bdbje log files. For example, checkpointing a crawl that had downloaded 18 million items (it had discovered more than 130 million, using the bloom filter) took about 100 minutes to complete, of which more than 90 minutes were spent copying the ~12k bdbje log files (with only one disk involved). Set the log level on <code class="literal">org.archive.util.FileUtils</code> to FINE to watch the bdbje log file-copy.
</p><p>Since copying off bdbje log files can take hours, we've added an <span class="emphasis"><em>expert mode</em></span> checkpoint that bypasses bdbje log copying. The upside is that your checkpoint completes promptly -- in minutes, even if the crawl is large -- but the downside is recovery takes