
📄 mirrorwriterprocessor.java

📁 Heritrix is an open-source, extensible web crawler project. Heritrix is designed to strictly follow the exclusion directives in robots.txt files and META robots tags.
💻 JAVA
📖 Page 1 of 5
/* MirrorWriter
 *
 * $Id: MirrorWriterProcessor.java 4654 2006-09-25 20:19:54Z paul_jack $
 *
 * Created on 2004 October 26
 *
 * Copyright (C) 2004 Internet Archive.
 *
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 *
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 *
 * Heritrix is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser Public License for more details.
 *
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */
package org.archive.crawler.writer;

import java.io.File;
import java.io.FileOutputStream;
import java.io.FilenameFilter;
import java.io.IOException;
import java.text.NumberFormat;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.management.AttributeNotFoundException;

import org.archive.crawler.datamodel.CoreAttributeConstants;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.crawler.settings.ListType;
import org.archive.crawler.settings.RegularExpressionConstraint;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.StringList;
import org.archive.crawler.settings.Type;
import org.archive.io.RecordingInputStream;
import org.archive.io.ReplayInputStream;
import org.archive.net.UURI;
import org.archive.util.IoUtils;

/**
   Processor module that writes the results of successful fetches to
   files on disk.

   Writes contents of one URI to one file on disk.  The files are
   arranged in a directory hierarchy based on the URI paths.  In that sense
   they mirror the file hierarchy that might exist on the servers.
   <p>
   There are a number of issues involved:
   <ul>
   <li>
   URIs can have arbitrary length, but file systems have length constraints.
   </li>
   <li>
   URIs can contain characters that file systems prohibit.
   </li>
   <li>
   URI paths are case-sensitive, but some file systems are case-insensitive.
   </li>
   </ul>
   This class tries very hard to map each URI into a file system path that
   obeys all file system constraints and yet reasonably represents
   the original URI.
   <p>
   There would normally be a single instance of this class per Heritrix
   instance. This class is thread-safe; any number of threads can be in its
   innerProcess method at once. However, conflicts can still arise in the file
   system. For example, if several threads try to create the same directory at
   the same time, only one can win. Therefore, there should be at most one
   access to a server at a given time.

   @author Howard Lee Gayle
*/
public class MirrorWriterProcessor
extends Processor implements CoreAttributeConstants {

    private static final long serialVersionUID = 301407556928389168L;

    /**
     * Key to use asking settings for case sensitive option.
     */
    public static final String ATTR_CASE_SENSITIVE = "case-sensitive";

    /**
     * Key to use asking settings for character map.
     */
    public static final String ATTR_CHAR_MAP = "character-map";

    /**
     * Key to use asking settings for content type map.
     */
    public static final String ATTR_CONTENT_TYPE_MAP = "content-type-map";

    /**
     * Key to use asking settings for dot begin replacement.
     */
    public static final String ATTR_DOT_BEGIN = "dot-begin";

    /**
     * Key to use asking settings for dot end replacement.
     */
    public static final String ATTR_DOT_END = "dot-end";

    /**
     * Key to use asking settings for directory file.
     */
    public static final String ATTR_DIRECTORY_FILE = "directory-file";

    /**
     * Key to use asking settings for host directory option.
     */
    public static final String ATTR_HOST_DIRECTORY = "host-directory";

    /**
     * Key to use asking settings for host map.
     */
    public static final String ATTR_HOST_MAP = "host-map";

    /**
     * Key to use asking settings for maximum file system path length.
     */
    public static final String ATTR_MAX_PATH_LEN = "max-path-length";

    /**
     * Key to use asking settings for maximum file system path segment length.
     */
    public static final String ATTR_MAX_SEG_LEN = "max-segment-length";

    /**
     * Key to use asking settings for base directory path value.
     */
    public static final String ATTR_PATH = "path";

    /**
     * Key to use asking settings for port directory option.
     */
    public static final String ATTR_PORT_DIRECTORY = "port-directory";

    /**
     * Key to use asking settings for suffix at end option.
     */
    public static final String ATTR_SUFFIX_AT_END = "suffix-at-end";

    /**
     * Key to use asking settings for too-long directory.
     */
    public static final String ATTR_TOO_LONG_DIRECTORY = "too-long-directory";

    /**
     * Key to use asking settings for underscore set.
     */
    public static final String ATTR_UNDERSCORE_SET = "underscore-set";

    /** Default value for ATTR_DOT_BEGIN.*/
    private static final String DEFAULT_DOT_BEGIN = "%2E";

    /** Default maximum file system path length.*/
    private static final int DEFAULT_MAX_PATH_LEN = 1023;

    /** Default maximum file system path segment length.*/
    private static final int DEFAULT_MAX_SEG_LEN = 255;

    /** Default value for ATTR_TOO_LONG_DIRECTORY.*/
    private static final String DEFAULT_TOO_LONG_DIRECTORY = "LONG";

    /** An empty Map.*/
    private static final Map<String,String> EMPTY_MAP
     = Collections.unmodifiableMap(new TreeMap<String,String>());

    /**
       Regular expression matching a file system path segment.
       The intent is one or more non-file-separator characters.
       The backslash is to quote File.separator if it's also backslash.
    */
    private static final String PATH_SEGMENT_RE =
        "[^\\" + File.separator + "]+";

    /**
       Regular expression constraint on ATTR_TOO_LONG_DIRECTORY.
       The intent is one non-file-separator character,
       followed by zero or more characters.
       The backslash is to quote File.separator if it's also backslash.
    */
    private static final String TOO_LONG_DIRECTORY_RE =
        "[^\\" + File.separator + "].*";

    /**
     * Logger.
     */
    private static final Logger logger =
        Logger.getLogger(MirrorWriterProcessor.class.getName());

    /**
     * @param name Name of this processor.
     */
    public MirrorWriterProcessor(String name) {
        super(name, "MirrorWriter processor. " +
            "A writer that writes each URL to a file on disk named for " +
            "a derivative of the URL.");
        Type e; // Current element.

        addElementToDefinition(new SimpleType(ATTR_CASE_SENSITIVE,
            "True if the file system is case-sensitive, like UNIX. "
            + "False if the file system is case-insensitive, "
            + "like Macintosh HFS+ and Windows.",
            Boolean.TRUE));
        addElementToDefinition(new StringList(ATTR_CHAR_MAP,
            "This list is grouped in pairs. "
            + "The first string in each pair must have a length of one. "
            + "If it occurs in a URI path, "
            + "it is replaced by the second string in the pair. "
            + "For UNIX, no character mapping is normally needed. "
            + "For Macintosh, the recommended value is [: %%3A]. "
            + "For Windows, the recommended value is "
            + "[' ' %%20  &quot; %%22  * %%2A  : %%3A  < %%3C "
            + "\\> %%3E ? %%3F  \\\\ %%5C  ^ %%5E  | %%7C]."));
        addElementToDefinition(new StringList(ATTR_CONTENT_TYPE_MAP,
            "This list is grouped in pairs. "
            + "If the content type of a resource begins (case-insensitive) "
            + "with the first string in a pair, the suffix is set to "
            + "the second string in the pair, replacing any suffix that may "
            + "have been in the URI.  For example, to force all HTML files "
            + "to have the same suffix, use [text/html html]."));
        e = addElementToDefinition(new SimpleType(ATTR_DIRECTORY_FILE,
            "Implicitly append this to a URI ending with '/'.",
            "index.html"));
        e.addConstraint(new RegularExpressionConstraint(PATH_SEGMENT_RE,
            Level.SEVERE, "This must be a simple file name."));
        e = addElementToDefinition(new SimpleType(ATTR_DOT_BEGIN,
            "If a segment starts with '.', the '.' is replaced by this.",
            DEFAULT_DOT_BEGIN));
        e.addConstraint(new RegularExpressionConstraint(PATH_SEGMENT_RE,
            Level.SEVERE,
            "This must not be empty, and must not contain " + File.separator));
        addElementToDefinition(new SimpleType(ATTR_DOT_END,
            "If a directory name ends with '.' it is replaced by this.  "
            + "For all file systems except Windows, '.' is recommended.  "
            + "For Windows, %%2E is recommended.",
            "."));
        addElementToDefinition(new StringList(ATTR_HOST_MAP,
            "This list is grouped in pairs. "
            + "If a host name matches (case-insensitive) the first string "
            + "in a pair, it is replaced by the second string in the pair.  "
            + "This can be used for consistency when several names are used "
            + "for one host, for example "
            + "[12.34.56.78 www42.foo.com]."));
        addElementToDefinition(new SimpleType(ATTR_HOST_DIRECTORY,
            "Create a subdirectory named for the host in the URI.",
            Boolean.TRUE));
        addElementToDefinition(new SimpleType(ATTR_PATH,
            "Top-level directory for mirror files.", "mirror"));

        // TODO: Add a new Constraint subclass so ATTR_MAX_PATH_LEN and
        // ATTR_MAX_SEG_LEN can be constrained to reasonable values.
        addElementToDefinition(new SimpleType(ATTR_MAX_PATH_LEN,
            "Maximum file system path length.",
            new Integer(DEFAULT_MAX_PATH_LEN)));
        addElementToDefinition(new SimpleType(ATTR_MAX_SEG_LEN,
            "Maximum file system path segment length.",
            new Integer(DEFAULT_MAX_SEG_LEN)));
        addElementToDefinition(new SimpleType(ATTR_PORT_DIRECTORY,
            "Create a subdirectory named for the port in the URI.",
            Boolean.FALSE));
        addElementToDefinition(new SimpleType(ATTR_SUFFIX_AT_END,
            "If true, the suffix is placed at the end of the path, "
            + "after the query (if any).  If false, the suffix is placed "
            + "before the query.",
            Boolean.TRUE));
        e = addElementToDefinition(new SimpleType(ATTR_TOO_LONG_DIRECTORY,
            "If all the directories in the URI would exceed, "
            + "or come close to exceeding, the file system maximum "
            + "path length, then they are all replaced by this.",
            DEFAULT_TOO_LONG_DIRECTORY));
        e.addConstraint(new RegularExpressionConstraint(TOO_LONG_DIRECTORY_RE,
            Level.SEVERE, "This must be relative and not empty."));
        addElementToDefinition(new StringList(ATTR_UNDERSCORE_SET,
            "If a directory name appears (case-insensitive) in this list "
            + "then an underscore is placed before it.  "
            + "For all file systems except Windows, this is not needed.  "
            + "For Windows, the following is recommended: "
            + "[com1 com2 com3 com4 com5 com6 com7 com8 com9 "
            + "lpt1 lpt2 lpt3 lpt4 lpt5 lpt6 lpt7 lpt8 lpt9 "
            + "con nul prn]."));
    }

    protected void innerProcess(CrawlURI curi) {
        if (!curi.isSuccess()) {
            return;
        }
        UURI uuri = curi.getUURI(); // Current URI.

        // Only http and https schemes are supported.
        String scheme = uuri.getScheme();
        if (!"http".equalsIgnoreCase(scheme)
                && !"https".equalsIgnoreCase(scheme)) {
            return;
        }
        RecordingInputStream recis = curi.getHttpRecorder().getRecordedInput();
        if (0L == recis.getResponseContentLength()) {
            return;
        }

        String baseDir = null; // Base directory.
        String baseSeg = null; // ATTR_PATH value.
        try {
            baseSeg = (String) getAttribute(ATTR_PATH, curi);
        } catch (AttributeNotFoundException e) {
            logger.warning(e.getLocalizedMessage());
            return;
        }
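
The setting descriptions for character-map, dot-begin and underscore-set each describe a rewrite of one path segment. As a rough illustration only, the three rules could be combined as in the sketch below. The class SegmentMapperSketch and its mapSegment method are hypothetical names invented for this example and are not part of Heritrix; the listing above stops before the real mapping code, so this may differ from what MirrorWriterProcessor actually does. The sketch also assumes the underscore set holds lower-case names and that the replacement strings are plain percent-escapes such as %3A.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class SegmentMapperSketch {

    private SegmentMapperSketch() {
    }

    /**
     * Rewrites one URI path segment so it is safer to use as a directory name.
     *
     * @param segment       one path segment, without separators
     * @param charMap       single characters and their replacements
     *                      (assumed shape of the character-map pairs)
     * @param dotBegin      replacement for a leading '.' (dot-begin)
     * @param underscoreSet reserved names, assumed lower-case (underscore-set)
     */
    static String mapSegment(String segment,
                             Map<Character, String> charMap,
                             String dotBegin,
                             Set<String> underscoreSet) {
        StringBuilder sb = new StringBuilder(segment.length());
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (i == 0 && c == '.') {
                // dot-begin rule: a leading '.' is replaced.
                sb.append(dotBegin);
                continue;
            }
            // character-map rule: replace prohibited characters.
            String replacement = charMap.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        String mapped = sb.toString();
        // underscore-set rule: case-insensitive match, prefix with '_'.
        if (underscoreSet.contains(mapped.toLowerCase(Locale.ROOT))) {
            mapped = "_" + mapped;
        }
        return mapped;
    }
}
```

With a map containing ':' to %3A, a dot-begin value of %2E, and an underscore set containing con, the segment a:b would become a%3Ab, .hidden would become %2Ehidden, and CON would become _CON. Length limits (max-path-length, max-segment-length) and the case-sensitive option are deliberately left out of this sketch.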
