📄 httpdocmanager.java

📁 Genuine web crawler source code. Please read it carefully and write up what you learn.
💻 JAVA
package net.matuschek.http;

/*********************************************
    Copyright (c) 2001 by Daniel Matuschek
*********************************************/

import java.io.IOException;
import java.net.URL;

/**
 * An HttpDocManager processes, stores, and optionally caches HttpDoc objects.
 * It is used by the WebRobot to store the retrieved documents.
 * @see net.matuschek.http.AbstractHttpDocManager
 *
 * @author Daniel Matuschek
 * @version $Id: HttpDocManager.java,v 1.3 2003/02/27 18:40:19 oliver_schmidt Exp $
 */

public interface HttpDocManager {

  /**
   * Processes a document without storing it. An implementation may process
   * the document immediately, or collect URLs and process them later.
   * Most documents should be stored (for reruns), but not all of them need to
   * be processed (for example, you may only want to process PDF documents).
   *
   * @param doc the HttpDoc object to process; may be null
   * @exception DocManagerException if an error occurs while processing
   * the document
   */
  void processDocument(HttpDoc doc) throws DocManagerException;

  /**
   * Stores a document, usually somewhere persistent (file system,
   * database, ...). An implementation may also choose not to store the whole
   * document, but to extract information from it and process that
   * information instead.
   * Most documents should be stored (for reruns), but not all of them need to
   * be processed (for example, you may only want to process PDF documents).
   *
   * @param doc the HttpDoc object to store; may be null
   * @exception DocManagerException if an error occurs while storing
   * the document
   */
  void storeDocument(HttpDoc doc) throws DocManagerException;

  /**
   * Removes a document from the cache.
   *
   * @param url the URL of the document to remove
   */
  public void removeDocument(URL url);
  
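  /**
   * Returns the URL of a stored document with the same content, or null
   * if there is none.
   *
   * @param doc the HttpDoc whose content is compared with the stored documents
   * @return the URL of the duplicate document as a String, or null
   * @throws IOException if an I/O error occurs while looking for a duplicate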
   */
  public String findDuplicate(HttpDoc doc) throws IOException;
  
  
  /**
   * If an HttpDocManager stores the complete HttpDocs, it can be used as a
   * cache, and this method gives access to the cached objects. An
   * HttpDocManager that cannot be used as a cache should always return null.
   *
   * @param u the URL of the document to look up
   * @return the cached HttpDoc for this URL, or null
   */
  HttpDoc retrieveFromCache(URL u);

  /**
   * Should be called if the instance is not used any more.
   * Some resources might need to be released.
   */
  public void finish();

}
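
The contract described in the comments above (store, look up by URL, find a document with identical content, remove, release) can be illustrated with a small in-memory sketch. It is not the library's implementation and does not implement HttpDocManager itself, because HttpDoc and DocManagerException are not shown in this file. The class InMemoryDocManagerSketch, the stand-in type Doc, and the helper url(...) are hypothetical names introduced only for this example. Two behaviours are assumptions rather than documented facts: findDuplicate ignores a stored entry that has the same URL as the argument, and duplicates are found by a naive byte-for-byte scan.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class InMemoryDocManagerSketch {

  /** Hypothetical stand-in for HttpDoc: a URL plus the raw response body. */
  public static final class Doc {
    public final URL url;
    public final byte[] content;

    public Doc(URL url, byte[] content) {
      this.url = url;
      this.content = content;
    }
  }

  // Keyed by the URL's string form, because URL.equals()/hashCode()
  // may trigger host name resolution.
  private final Map<String, Doc> byUrl = new LinkedHashMap<>();

  /** Stores the whole document; a null document is ignored. */
  public void storeDocument(Doc doc) {
    if (doc != null) {
      byUrl.put(doc.url.toExternalForm(), doc);
    }
  }

  /** Removes the document stored for this URL, if any. */
  public void removeDocument(URL url) {
    byUrl.remove(url.toExternalForm());
  }

  /**
   * Returns the URL of a stored document with identical content, or null.
   * Assumption: an entry with the same URL as doc does not count as a duplicate.
   */
  public String findDuplicate(Doc doc) {
    if (doc == null) {
      return null;
    }
    String self = doc.url.toExternalForm();
    for (Map.Entry<String, Doc> e : byUrl.entrySet()) {
      if (!e.getKey().equals(self)
          && Arrays.equals(e.getValue().content, doc.content)) {
        return e.getKey();
      }
    }
    return null;
  }

  /** Returns the stored document for this URL, or null if there is none. */
  public Doc retrieveFromCache(URL u) {
    return byUrl.get(u.toExternalForm());
  }

  /** Releases everything held by this instance. */
  public void finish() {
    byUrl.clear();
  }

  /** Hypothetical helper: builds a URL and wraps the checked exception. */
  public static URL url(String s) {
    try {
      return URI.create(s).toURL();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    InMemoryDocManagerSketch mgr = new InMemoryDocManagerSketch();
    byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
    Doc a = new Doc(url("http://example.com/a"), body);
    Doc b = new Doc(url("http://example.com/b"), body.clone());

    mgr.storeDocument(a);
    System.out.println(mgr.findDuplicate(b));          // URL of a
    System.out.println(mgr.retrieveFromCache(a.url) == a);
    mgr.removeDocument(a.url);
    System.out.println(mgr.findDuplicate(b));          // no duplicate left
    mgr.finish();
  }
}
```

The linear scan keeps the sketch short and obviously correct. A manager that holds many documents would more likely index a content digest, at the cost of keeping that index consistent when documents are removed.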
