
📄 bdbfrontier.java

Page 1 of 2
/* BdbFrontier
 *
 * $Id: BdbFrontier.java 5440 2007-08-28 05:19:52Z gojomo $
 *
 * Created on Sep 24, 2004
 *
 *  Copyright (C) 2004 Internet Archive.
 *
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 *
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 *
 * Heritrix is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser Public License for more details.
 *
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 */
package org.archive.crawler.frontier;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.TreeSet;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.management.AttributeNotFoundException;

import org.apache.commons.collections.Closure;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.datamodel.UriUniqFilter;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.framework.FrontierMarker;
import org.archive.crawler.framework.exceptions.FatalConfigurationException;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;
import org.archive.crawler.util.BdbUriUniqFilter;
import org.archive.crawler.util.BloomUriUniqFilter;
import org.archive.crawler.util.CheckpointUtils;
import org.archive.crawler.util.DiskFPMergeUriUniqFilter;
import org.archive.crawler.util.MemFPMergeUriUniqFilter;
import org.archive.queue.StoredQueue;
import org.archive.util.ArchiveUtils;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseException;

/**
 * A Frontier using several BerkeleyDB JE Databases to hold its record of
 * known hosts (queues), and pending URIs.
 *
 * @author Gordon Mohr
 */
public class BdbFrontier extends WorkQueueFrontier implements Serializable {
    // be robust against trivial implementation changes
    private static final long serialVersionUID = ArchiveUtils
        .classnameBasedUID(BdbFrontier.class, 1);

    private static final Logger logger =
        Logger.getLogger(BdbFrontier.class.getName());

    /** all URIs scheduled to be crawled */
    protected transient BdbMultipleWorkQueues pendingUris;

    /** all URI-already-included options available to be chosen */
    private String[] AVAILABLE_INCLUDED_OPTIONS = new String[] {
            BdbUriUniqFilter.class.getName(),
            BloomUriUniqFilter.class.getName(),
            MemFPMergeUriUniqFilter.class.getName(),
            DiskFPMergeUriUniqFilter.class.getName()};

    /** URI-already-included to use (by class name) */
    public final static String ATTR_INCLUDED = "uri-included-structure";

    private final static String DEFAULT_INCLUDED =
        BdbUriUniqFilter.class.getName();

    /** whether to dump all pending URIs to crawl.log when the crawl ends */
    public final static String ATTR_DUMP_PENDING_AT_CLOSE =
        "dump-pending-at-close";
    private final static Boolean DEFAULT_DUMP_PENDING_AT_CLOSE =
        Boolean.FALSE;

    /**
     * Constructor.
     * @param name Name of this Frontier.
     */
    public BdbFrontier(String name) {
        this(name, "BdbFrontier. "
            + "A Frontier using BerkeleyDB Java Edition databases for "
            + "persistence to disk.");
        Type t = addElementToDefinition(new SimpleType(ATTR_INCLUDED,
                "Structure to use for tracking already-seen URIs. Non-default " +
                "options may require additional configuration via system " +
                "properties.", DEFAULT_INCLUDED, AVAILABLE_INCLUDED_OPTIONS));
        t.setExpertSetting(true);
        t = addElementToDefinition(new SimpleType(ATTR_DUMP_PENDING_AT_CLOSE,
                "Whether to dump all URIs waiting in queues to crawl.log " +
                "when a crawl ends. May add a significant delay to " +
                "crawl termination. Dumped lines will have a zero (0) " +
                "status.", DEFAULT_DUMP_PENDING_AT_CLOSE));
        t.setExpertSetting(true);
    }

    /**
     * Create the BdbFrontier
     *
     * @param name
     * @param description
     */
    public BdbFrontier(String name, String description) {
        super(name, description);
    }

    /**
     * Create the single object (within which is one BDB database)
     * inside which all the other queues live.
     *
     * @return the created BdbMultipleWorkQueues
     * @throws DatabaseException
     */
    private BdbMultipleWorkQueues createMultipleWorkQueues()
    throws DatabaseException {
        return new BdbMultipleWorkQueues(this.controller.getBdbEnvironment(),
            this.controller.getBdbEnvironment().getClassCatalog(),
            this.controller.isCheckpointRecover());
    }

    @Override
    protected void initQueuesOfQueues() {
        if(this.controller.isCheckpointRecover()) {
            // do not setup here; take/init from deserialized frontier
            return;
        }
        // small risk of OutOfMemoryError: if 'hold-queues' is false,
        // readyClassQueues may grow in size without bound
        readyClassQueues = new LinkedBlockingQueue<String>();

        try {
            Database inactiveQueuesDb = this.controller.getBdbEnvironment()
                    .openDatabase(null, "inactiveQueues",
                            StoredQueue.databaseConfig());
            inactiveQueues = new StoredQueue<String>(inactiveQueuesDb,
                    String.class, null);
            Database retiredQueuesDb = this.controller.getBdbEnvironment()
                    .openDatabase(null, "retiredQueues",
                            StoredQueue.databaseConfig());
            retiredQueues = new StoredQueue<String>(retiredQueuesDb,
                    String.class, null);
        } catch (DatabaseException e) {
            throw new RuntimeException(e);
        }

        // small risk of OutOfMemoryError: in large crawls with many
        // unresponsive queues, an unbounded number of snoozed queues
        // may exist
        snoozedClassQueues = Collections.synchronizedSortedSet(new TreeSet<WorkQueue>());
    }

    protected Queue<String> reinit(Queue<String> q, String name) {
        try {
            // restore the inner Database/StoredSortedMap of the queue
            Database db = this.controller.getBdbEnvironment()
                .openDatabase(null, name, StoredQueue.databaseConfig());

            StoredQueue<String> queue;
            if(q instanceof StoredQueue) {
                queue = (StoredQueue<String>) q;
                queue.hookupDatabase(db, String.class, null);
            } else {
                // recovery of older checkpoint; copy to StoredQueue
                queue = new StoredQueue<String>(db,String.class,
                        this.controller.getBdbEnvironment().getClassCatalog());
                queue.addAll(q);
            }
            return queue;
        } catch (DatabaseException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Create a UriUniqFilter that will serve as record
     * of already seen URIs.
     *
     * @return A UriUniqFilter that will serve as a record of already seen URIs
     * @throws IOException
     */
    protected UriUniqFilter createAlreadyIncluded() throws IOException {
        UriUniqFilter uuf;
        String c = null;
        try {
            c = (String)getAttribute(null, ATTR_INCLUDED);
        } catch (AttributeNotFoundException e) {
            // Do default action if attribute not in order.
        }
        // TODO: avoid all this special-casing; enable some common
        // constructor interface usable for all alt implementations
        if (c != null && c.equals(BloomUriUniqFilter.class.getName())) {
            uuf = this.controller.isCheckpointRecover()?
                    deserializeAlreadySeen(BloomUriUniqFilter.class,
                        this.controller.getCheckpointRecover().getDirectory()):
                    new BloomUriUniqFilter();
        } else if (c!=null && c.equals(MemFPMergeUriUniqFilter.class.getName())) {
            // TODO: add checkpointing for MemFPMergeUriUniqFilter
            uuf = new MemFPMergeUriUniqFilter();
        } else if (c!=null && c.equals(DiskFPMergeUriUniqFilter.class.getName())) {
            // TODO: add checkpointing for DiskFPMergeUriUniqFilter
            uuf = new DiskFPMergeUriUniqFilter(controller.getScratchDisk());
        } else {
