outside.html
to use it. See also <a href="outside.html#mon_com" title="9.5.&nbsp;Remote Monitoring and Control">Section&nbsp;9.5, &ldquo;Remote Monitoring and Control&rdquo;</a>.</p></div></div>

<div class="sect2" lang="en"><div class="titlepage"><div><div>
<h3 class="title"><a name="recover"></a>9.3.&nbsp;Recovery of Frontier State and recover.gz</h3>
</div></div></div>

<p>During normal running, the Heritrix Frontier by default keeps a journal. The journal is kept
in the job's logs directory and is named <code class="literal">recover.gz</code>. If a crawl
crashes, the recover.gz journal can be used to recreate, approximately, the state of the crawler
at the time of the crash. Recovery can take a long time in some cases, but is usually much
quicker than repeating a crawl.</p>

<p>To run the recovery process, relaunch the crashed crawler. Create a new crawl order job based
on the crawl that crashed. If you choose the "recover-log" link from the list of completed jobs
in the 'Based on recovery' page, the new job will automatically be set up to use the original
job's recovery journal to bootstrap its Frontier state (of completed and queued URIs). Further,
if the recovered job attempts to reuse any already-full 'logs' or 'state' directories, new paths
for these directories will be chosen with as many '-R' suffixes as are necessary to specify a
new empty directory.</p>

<p>(If you simply base your new job on the old job without using the 'Recover' link, you must
manually enter the full path of the original crawl's recovery journal into the
<code class="literal">recover-path</code> setting, near the end of all settings. You must also
adjust the 'logs' and 'state' directory settings if they were specified using absolute paths
that would cause the new crawl to reuse the directories of the original job.)</p>

<p>After making any further adjustments to the crawl settings, submit the new job. The
submission will hang for a long time as the recover.gz file is read in its entirety by the
frontier. (This can take hours for a crawl that has run for a long time, and during this time
the crawler control panel will appear idle, with no job pending or in progress, but the machine
will be busy.) Eventually the submit and crawl job launch should complete. The crawl should pick
up from close to where the crash occurred. There is no marking in the logs that this crawl was
started by reading a recover log (be sure to note in the crawl journal that this was done).</p>

<p>The recovery log is gzipped because it gets very large otherwise, and because of the
repetition of terms it compresses very well. On abnormal termination of the crawl job, gzip will
report <code class="literal">unexpected end of file</code> if you try to ungzip the recover.gz
file: gzip is complaining that the file write was abnormally terminated. But the recover.gz file
will still be of use in restoring the frontier at least to the point where the gzip file went
bad (gzip compresses in 32KB blocks; the worst loss would be the last 32KB of gzipped data).</p>
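<p>As a quick sanity check on a journal left by a crashed crawl, the following minimal sketch
reads a (possibly truncated) recover.gz line by line until the gzip stream gives out, showing
roughly how much of the journal is still readable. It is an illustration only, not part of
Heritrix; the journal path is an assumption, so point it at the recover.gz under your own job's
logs directory.</p>

<pre class="programlisting">
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class RecoverLogPeek {
    public static void main(String[] args) throws IOException {
        // Hypothetical location; use the recover.gz under your job's logs directory.
        String path = "jobs/my-crashed-job/logs/recover.gz";
        long lines = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)), "UTF-8"))) {
            while (in.readLine() != null) {
                lines++;
            }
            System.out.println("Journal intact, " + lines + " lines.");
        } catch (IOException e) {
            // An abnormally terminated crawl typically ends here (e.g. "Unexpected end
            // of ZLIB input stream"); everything read up to this point is still usable.
            System.out.println("Journal truncated after " + lines + " lines: " + e.getMessage());
        }
    }
}
</pre>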
<p>Java's Gzip support (up through at least Java 1.5/5.0) can compress arbitrarily large input
streams, but has problems decompressing any stream whose output is larger than 2GB. Attempting
to recover a crawl whose recovery log would be over 2GB when uncompressed triggers a
FatalConfigurationException alert with the detail message "Recover.log problem:
java.io.IOException: Corrupt GZIP trailer". Heritrix will accept either compressed or
uncompressed recovery log files, so a workaround is to first uncompress the recovery log using
another, non-Java tool (such as the 'gunzip' available in Linux and Cygwin), then refer to this
uncompressed recovery log when recovering. (Reportedly, Java 6.0 "Mustang" will fix this bug in
un-gzipping large files.)</p>

<p>See also the related recovery facility, <a href="outside.html#checkpoint" title="9.4.&nbsp;Checkpointing">Section&nbsp;9.4, &ldquo;Checkpointing&rdquo;</a>, below, for an alternate
recovery mechanism.</p>
</div>

<div class="sect2" lang="en"><div class="titlepage"><div><div>
<h3 class="title"><a name="checkpoint"></a>9.4.&nbsp;Checkpointing</h3>
</div></div></div>

<p>When checkpointing [<a href="glossary.html#checkpointing">Checkpointing</a>], the crawler
writes a representation of its current state to a directory under
<code class="literal">checkpoints-path</code>, named for the checkpoint. Checkpointed state
includes serialization of the main crawler objects, copies of the current set of bdbje log
files, etc. The idea is that the checkpoint directory contains all that is required to recover a
crawler. Checkpointing also rotates off the crawler logs, including the
<code class="literal">recover.gz</code> log, if enabled. Log files are NOT copied to the
checkpoint directory. They are left under the logs directory but are distinguished by a suffix:
the checkpoint name (e.g. <code class="literal">crawl.log.000031</code>, where
<code class="literal">000031</code> is the checkpoint name).</p>

<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3>
<p>Currently, only the BdbFrontier using the bdbje-based already-seen or the bloom filter
already-seen is checkpointable.</p></div>

<p>To run a checkpoint, click the checkpoint button in the UI or invoke
<code class="literal">checkpoint</code> from JMX. This launches a thread that runs through the
following steps: if crawling, pause the crawl; run the actual checkpoint; if the crawl was
running when the checkpoint was invoked, resume it. Depending on the size of the crawl,
checkpointing can take some time; often the step that takes longest is pausing the crawl,
waiting on threads to get into a paused, checkpointable state. While checkpointing, the status
will show as <code class="literal">CHECKPOINTING</code>. When the checkpoint has completed, the
crawler will resume crawling (or, if it was in the PAUSED state when checkpointing was invoked,
it will return to the PAUSED state).</p>
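<p>The <code class="literal">checkpoint</code> operation can also be triggered from a plain JMX
client. The sketch below is only an illustration of that idea, not Heritrix's documented JMX
surface: the service URL, port, and MBean <code class="literal">ObjectName</code> are
placeholders, so inspect your running instance (for example with jconsole) to find the real
names before relying on it.</p>

<pre class="programlisting">
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CheckpointViaJmx {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and port; match them to how your Heritrix JVM exposes JMX.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8849/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Hypothetical ObjectName; query the MBean server for the real crawl-job bean.
            ObjectName crawlJob = new ObjectName(
                    "org.archive.crawler:type=CrawlJob,name=my-job");
            // Invoke the no-argument checkpoint operation on the running job.
            mbsc.invoke(crawlJob, "checkpoint", new Object[0], new String[0]);
        } finally {
            connector.close();
        }
    }
}
</pre>

<p>If JMX authentication is enabled, pass a credentials map to
<code class="literal">JMXConnectorFactory.connect</code> instead of
<code class="literal">null</code>.</p>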
<p>Recovery from a checkpoint has much in common with the recovery of a crawl using the
recover.log (see <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>).
To recover, create a job. Then, before launching, set the
<code class="literal">crawl-order/recover-path</code> to point at the checkpoint directory you
want to recover from. Alternatively, browse to the <code class="literal">Jobs-&gt;Based on a
recovery</code> screen and select the checkpoint you want to recover from. After clicking, a new
job will be created that takes the old job's (end-of-crawl) settings and autofills the
recover-path with the right directory path (the renaming of the logs and
<code class="literal">crawl-order/state-path</code> "state" directories so they do not clash
with the old ones, as described above in <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>, is also done). The first thing recovery does is copy the saved-off
bdbje log files into place. Again, recovery can take time -- an hour or more for a crawl in the
millions.</p>

<p>Checkpointing is currently experimental. The recover-log technique is tried-and-true. Once
checkpointing is proven reliable, faster, and more comprehensive, it will become the preferred
method of recovering a crawler.</p>

<div class="sect3" lang="en"><div class="titlepage"><div><div>
<h4 class="title"><a name="N10E7D"></a>9.4.1.&nbsp;Expert mode: Fast checkpointing</h4>
</div></div></div>

<p>The bulk of checkpointing time is taken up copying off the bdbje logs. For example,
checkpointing a crawl that had downloaded 18 million items -- and had discovered more than
130 million (bloom filter) -- took about 100 minutes to complete, of which 90-plus minutes were
spent copying the ~12k bdbje log files (only one disk was involved). Set the log level on
<code class="literal">org.archive.util.FileUtils</code> to FINE to watch the bdbje log file
copies; a sketch of one way to do that follows.</p>
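<p>One hedged way to raise that level, assuming the crawler's logging is configured through
<code class="literal">java.util.logging</code>: add a line such as
<code class="literal">org.archive.util.FileUtils.level = FINE</code> to the logging properties
file the JVM is started with, or use the programmatic equivalent below (the wrapper class and
method names here are illustrative only).</p>

<pre class="programlisting">
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class FileCopyTrace {
    /** Raise org.archive.util.FileUtils to FINE so each bdbje log-file copy is logged. */
    public static void enable() {
        Logger log = Logger.getLogger("org.archive.util.FileUtils");
        log.setLevel(Level.FINE);
        // The handler must also pass FINE records through, or nothing extra appears.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        log.addHandler(handler);
    }
}
</pre>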

<p>Since copying off bdbje log files can take hours, we've added an
<span class="emphasis"><em>expert mode</em></span> checkpoint that bypasses bdbje log copying.
The upside is that your checkpoint completes promptly -- in minutes, even if the crawl is
large -- but the downside is that recovery takes

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -