
📄 nonpagelinklog.java

📁 A spider written in Java that works very well. It holds up under heavy load and collects on the order of 100 million items per day.
💻 JAVA
package cn.yicha.subject.spider.writer;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;

import cn.yicha.subject.spider.store.HeaderContent;


public class NonPageLinkLog {
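	// File.separator keeps the "log" directory prefix working on non-Windows systems too.
	protected static final String _FILE_NAME = "log" + File.separator + "nonpagelinks";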

	protected static final String _SUFFIX_NAME = ".log";

	protected static final String _SPLIT_TAG = "\t";

	protected static final String _SITE_SPLIT_TAG = "\n|\r\n"; // can't be changed
	protected static final String _SITE_SPLIT_TAG_FOR_WTITE = "\n"; // can't be changed

	protected static final String _FINISH_TAG = "###";

	protected static boolean _haveInit = false;

	protected static FileWriter _fw;

	protected static File _f;

	private NonPageLinkLog() {

	}

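	public static void init(String kind, String siteID) {
		try {
			String n = _FILE_NAME + "_" + now() + "_" + kind + "_" + siteID
					+ _SUFFIX_NAME;
			_f = new File(n);
			// Make sure the log directory exists before creating the file.
			File dir = _f.getParentFile();
			if (dir != null) {
				dir.mkdirs();
			}
			_f.createNewFile();
			_fw = new FileWriter(_f);
		} catch (Exception e) {
			System.out.println("non-page link log error");
			e.printStackTrace();
			// Exit with a non-zero status so the failure is visible to the caller.
			System.exit(1);
		}
		_haveInit = true;
	}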

//	public static String getFileName(String kind, String siteID) {
//		return _f.getAbsolutePath();
//	}
	
	// Timestamp used in the log file name, e.g. 2007-01-31-23-59-59.
	protected static String now() {
		SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss");
		return sdf.format(new Date());
	}

	public static HeaderContent[] praseHeaderLog(String logFileName) {
		ArrayList<HeaderContent> alHeader = new ArrayList<HeaderContent>();
		System.out.println("start to open previous log file...please wait...");
		// try-with-resources closes the reader even if parsing fails.
		try (BufferedReader reader = new BufferedReader(new FileReader(logFileName))) {
			String linkProps;
			// Stop at end of file or at the first empty line.
			while ((linkProps = reader.readLine()) != null && linkProps.length() > 0) {
				String[] props = linkProps.split(_SPLIT_TAG);
				if (props.length < 4) {
					// Skip truncated lines, e.g. one cut short by a crash mid-write.
					continue;
				}
				HeaderContent hc = new HeaderContent();
				hc.set_uri(props[0]);
				hc.set_lmt(props[1]);
				hc.set_contentType(props[2]);
				hc.set_contentLen(props[3]);
				alHeader.add(hc);
			}
		} catch (Exception e) {
			System.out.println("load non-page link log error");
			e.printStackTrace();
			System.exit(1);
		}

		return alHeader.toArray(new HeaderContent[0]);
	}

	public static synchronized void add(HeaderContent hc) {
		if (!_haveInit)
			return;
		if (hc == null)
			return;

		try {
			_fw.write(hc.get_uri() + _SPLIT_TAG);
			// _fw.write(hc.get_date() + "***");
			_fw.write(hc.get_lmt() + _SPLIT_TAG);
			_fw.write(hc.get_contentType() + _SPLIT_TAG);
			_fw.write(hc.get_contentLen() + _SITE_SPLIT_TAG_FOR_WTITE);
			_fw.flush();
		} catch (Exception e) {
			System.out.println("non-page link log write error");
			e.printStackTrace();
			// Non-zero exit status signals the write failure.
			System.exit(1);
		}
	}

	public static synchronized void tagFinish() {
		if (!_haveInit)
			return;
		HeaderContent hc = new HeaderContent(_FINISH_TAG, _FINISH_TAG,
				_FINISH_TAG, _FINISH_TAG);
		add(hc);
	}

	public static FileWriter get_fw() {
		return _fw;
	}

	public static void set_fw(FileWriter _fw) {
		NonPageLinkLog._fw = _fw;
	}
}
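
The record layout that add() writes and praseHeaderLog() reads back can be sketched on its own, without the spider's HeaderContent class. The following is a minimal illustration, not the project's code: the class LogLineFormatSketch and its methods format, parse and isFinishMarker are hypothetical names, and only the tab separator and the "###" finish tag are taken from the listing above.

```java
/**
 * Hypothetical, self-contained sketch of the tab-separated log line
 * used by NonPageLinkLog: uri, last-modified time, content type, content length.
 */
public class LogLineFormatSketch {
    // Same values as _SPLIT_TAG and _FINISH_TAG in the listing above.
    static final String SPLIT_TAG = "\t";
    static final String FINISH_TAG = "###";

    /** Joins the four header fields into one log line (without the trailing newline). */
    static String format(String uri, String lmt, String contentType, String contentLen) {
        return String.join(SPLIT_TAG, uri, lmt, contentType, contentLen);
    }

    /** Splits a log line into its four fields; returns null for malformed lines. */
    static String[] parse(String line) {
        // The -1 limit keeps trailing empty fields instead of silently dropping them.
        String[] props = line.split(SPLIT_TAG, -1);
        return props.length == 4 ? props : null;
    }

    /** True when every field equals the finish tag, as written by tagFinish(). */
    static boolean isFinishMarker(String line) {
        String[] props = parse(line);
        if (props == null) {
            return false;
        }
        for (String p : props) {
            if (!FINISH_TAG.equals(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String line = format("http://example.com/a.zip",
                "Mon, 01 Jan 2007 00:00:00 GMT", "application/zip", "1024");
        System.out.println(line.replace(SPLIT_TAG, " | "));
        System.out.println(isFinishMarker(
                format(FINISH_TAG, FINISH_TAG, FINISH_TAG, FINISH_TAG)));
    }
}
```

A reader of a previous log could use a check like isFinishMarker to tell a completed crawl from one that was interrupted, assuming the finish record is always the last line written.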
