
// File: araleutilities.java
// Project: a web crawler (Java)
package org.flaviotordini.arale;

import java.util.*;
import java.net.*;
import java.io.*;

/**
 *  Collection of static methods used by Arale.
 *
 *@author     Flavio Tordini
 *@created    26 November 2001
 */
public class AraleUtilities {

    /**
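     *  Opens a connection to the given URL and checks that it is usable.
     *
     *@param  url  the URL to connect to
     *@return      an open HttpURLConnection whose response code is 200, 301 or 302,
     *             or null if the connection is not HTTP, the response code is
     *             anything else, or the connection attempt fails
     *@since       26 November 2001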
     */
    public static HttpURLConnection getValidConnection(URL url) {
        HttpURLConnection httpurlconnection = null;

        try {
            URLConnection urlconnection = url.openConnection();
            urlconnection.connect();

            if (!(urlconnection instanceof HttpURLConnection)) {
                Arale.logger.log("Not an http connection: " + url);
                // urlconnection.disconnect();
                return null;
            }

            httpurlconnection = (HttpURLConnection) urlconnection;
            // httpurlconnection.setFollowRedirects(true);

            int responsecode = httpurlconnection.getResponseCode();

            switch (responsecode) {
                // here valid codes!
                case HttpURLConnection.HTTP_OK:
                case HttpURLConnection.HTTP_MOVED_PERM:
                case HttpURLConnection.HTTP_MOVED_TEMP:
                    break;
                default:
                    Arale.logger.log("Invalid response code: " + responsecode + " " + url);
                    httpurlconnection.disconnect();
                    return null;
            }
        } catch (IOException ioexception) {
            Arale.logger.log("unable to connect: " + ioexception);
            if (httpurlconnection != null) {
                httpurlconnection.disconnect();
            }
            return null;
        }

        return httpurlconnection;
    }


    /**
     *  Returns an InputStream from a connection. Makes up to 3 attempts.
     *
     *@param  connection  a connection
     *@return             an InputStream, or null if every attempt fails
     *@since              26 November 2001
     */
    public static InputStream getSafeInputStream(HttpURLConnection connection) {

        InputStream inputstream = null;

        // up to 3 attempts; stop at the first one that succeeds
        for (int i = 0; i < 3; i++) {
            try {
                inputstream = connection.getInputStream();
                break;
            } catch (IOException ioexception1) {
                Arale.logger.log("error opening connection " + ioexception1);
            }
        }

        return inputstream;
    }


    /**
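     *  Downloads the body of a connection as a String.
     *
     *@param  connection  an open connection to read from
     *@return             the content read (possibly partial if an error occurs
     *                    while reading), or null if the stream cannot be opened
     *@since              26 November 2001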
     */
    public static String downloadStringResource(HttpURLConnection connection) {
        // Arale.logger.log("fetchURL started");

        InputStream inputstream = getSafeInputStream(connection);
        if (inputstream == null) {
            return null;
        }

        // load the stream into a StringBuffer; the reader uses the platform default charset
        InputStreamReader isr = new InputStreamReader(inputstream);
        StringBuffer content = new StringBuffer();

        try {
            char buf[] = new char[Arale.STREAM_BUFFER_SIZE];
            int cnt = 0;
            while ((cnt = isr.read(buf, 0, Arale.STREAM_BUFFER_SIZE)) != -1) {
                content.append(buf, 0, cnt);
            }
        } catch (IOException ioexception) {
            Arale.logger.log(ioexception);
        } finally {
            // close the reader (and the stream it wraps) even if reading failed
            try {
                isr.close();
            } catch (IOException ioexception) {
                Arale.logger.log(ioexception);
            }
        }

        return content.toString();
    }


    /**
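     *  Maps a URL to a file below the output directory. The local path is the
     *  host, followed by ";port" when the URL has an explicit port, followed
     *  by the URL path.
     *
     *@param  basepath  the output directory
     *@param  url       the URL to map
     *@return           the local file corresponding to the URL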
     */
    public static File URLToLocalFile(File basepath, URL url) {

        String strPort;
        int port = url.getPort();
        if (port != -1) {
            strPort = ";" + port;
        } else {
            strPort = "";
        }

        String localpathname = url.getHost() + strPort + url.getPath();

        // the absolute local path (inside the output directory)
        File localfile = new File(basepath, localpathname);
        // Arale.logger.log("localfile: " + localfile);

        return localfile;
    }


    /**
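     *  Builds the URL under which a resource is saved locally. The query
     *  string, if any, is appended to the path with '!' in place of '?', and
     *  ".html" is added unless the result already ends in ".html" or ".htm".
     *
     *@param  contextualURL  wrapper holding the URL to transform
     *@return                the transformed URL, or null if it is malformed
     *@since                 26 November 2001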
     */
    public static URL transformURL(ContextualURL contextualURL) {

        URL transformed_url = null;

        String query = contextualURL.url.getQuery();
        if (query == null) {
            query = "";
        } else {
            query = "!" + query;
        }

        String pathname = contextualURL.url.getPath() + query;

        // if (contextualURL.scannable) {
        String pathname_lower = pathname.toLowerCase();
        if (!(pathname_lower.endsWith(".html") || pathname_lower.endsWith(".htm"))) {
            pathname = pathname + ".html";
        }
        // }

        try {
            transformed_url =
                    new URL(contextualURL.url.getProtocol(), contextualURL.url.getHost(),
                    contextualURL.url.getPort(), pathname);

        } catch (MalformedURLException e) {
            Arale.logger.log(e);
        }

        // System.out.println(transformed_url.toString());

        return transformed_url;
    }


    /**
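     *  Scans content leftwards from pos (inclusive) for the nearest character
     *  contained in limitchars.
     *
     *@param  content     the text to scan
     *@param  pos         index to start from; must be a valid index of content
     *@param  limitchars  the set of delimiter characters
     *@return             the index just after the delimiter found, or -1 if none
     *@since              26 November 2001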
     */
    public static int findLeftLimit(String content, int pos, String limitchars) {

        int lpos = -1;

        for (int i = pos; i > -1; i--) {
            char ch = content.charAt(i);
            if (limitchars.indexOf(ch) != -1) {
                lpos = i + 1;
                break;
            }
        }

        return lpos;
    }


    /**
     *  Scans content rightwards from pos (inclusive) for the nearest character
     *  contained in limitchars.
     *
     *@param  content     the text to scan
     *@param  pos         index to start from
     *@param  limitchars  the set of delimiter characters
     *@return             the index of the delimiter found, or -1 if none
     *@since              26 November 2001
     */
    public static int findRightLimit(String content, int pos, String limitchars) {
        int rpos = -1;

        for (int i = pos; i < content.length(); i++) {
            char ch = content.charAt(i);
            if (limitchars.indexOf(ch) != -1) {
                rpos = i;
                break;
            }
        }

        return rpos;
    }


    /**
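     *  Returns the part of url's path that follows context's path, provided
     *  context's path is an ancestor directory of url's path.
     *
     *@param  context  the URL whose path is used as the base directory
     *@param  url      the URL to make relative
     *@return          the remainder of url's path after context's path, or
     *                 null if context's path is not an ancestor of url's path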
     */
    public static String buildRelativePath(URL context, URL url) {
        String relativepath = null;

        File url_file = new File(url.getPath());
        File context_file = new File(context.getPath());

        File f = url_file;
        while ((f = f.getParentFile()) != null) {
            if (context_file.equals(f)) {
                String context_filepath = context_file.getPath();
                String url_filepath = url_file.getPath();
                relativepath = url_filepath.substring(context_filepath.length());
                break;
            }
        }

        // Arale.logger.log("new relative path: " + relativepath);
        return relativepath;
    }


    /**
     *  Streams to disk, creating the parent directories if needed.
     *
     *@param  inputstream  input stream
     *@param  outputfile   file to write to
     *@return              the number of bytes written
     *@since               26 November 2001
     */
    public static long writeToDisk(InputStream inputstream, File outputfile) {
        long writtenBytesCount = 0;

        if (inputstream == null) {
            Arale.logger.log("inputstream is null: " + outputfile);
            return writtenBytesCount;
        }

        FileOutputStream fileoutputstream = null;
        try {
            // create local directory (getParentFile() is null for a bare file name)
            File parent = outputfile.getParentFile();
            if (parent != null) {
                parent.mkdirs();
            }

            fileoutputstream = new FileOutputStream(outputfile);

            byte abyte0[] = new byte[Arale.STREAM_BUFFER_SIZE];
            int j;
            while ((j = inputstream.read(abyte0)) >= 0) {
                fileoutputstream.write(abyte0, 0, j);
                writtenBytesCount += j;
            }
        } catch (FileNotFoundException fnfe) {
            Arale.logger.log(fnfe);
        } catch (IOException ioexception) {
            Arale.logger.log(ioexception);
        } finally {
            // close both streams even if the copy failed
            try {
                if (fileoutputstream != null) {
                    fileoutputstream.close();
                }
                inputstream.close();
            } catch (IOException ioexception) {
                Arale.logger.log(ioexception);
            }
        }

        return writtenBytesCount;
    }


    /**
     *  Tests whether a string contains at least one of the given tokens.
     *
     *@param  string  the string to search
     *@param  tokens  a collection of String tokens
     *@return         true if any token occurs in string, false otherwise
     */
    public static boolean StringContainsToken(String string, Collection tokens) {
        boolean result = false;
        for (Iterator i = tokens.iterator(); i.hasNext(); ) {
            String token = (String) i.next();
            if (string.indexOf(token) != -1) {
                result = true;
                break;
            }
        }
        return result;
    }


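    /**
     *  Rewrites a link so that it matches the locally saved file name: every
     *  '?' becomes '!', and ".html" is appended unless the link already ends
     *  in ".html" or ".htm".
     *
     *@param  link  the link to transform
     *@return       the transformed link
     */
    public static String transformLink(String link) {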
        String transformed_link = null;

        transformed_link = link.replace('?','!');

        String transformed_link_lower = transformed_link.toLowerCase();
        if (!(transformed_link_lower.endsWith(".html") || transformed_link_lower.endsWith(".htm"))) {
            transformed_link = transformed_link + ".html";
        }

        // Arale.logger.log("transformed_link: " + transformed_link);
        return transformed_link;
    }

}
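
The string helpers above can be tried without the rest of Arale. The following is a minimal sketch, not part of the original source: the class name `LimitDemo` and the helper `extractQuoted` are hypothetical, and the three copied methods are condensed restatements of `findLeftLimit`, `findRightLimit` and `transformLink` from the listing. It shows how the two limit scanners might be combined to pull a quoted link out of markup before rewriting it for local storage.

```java
public class LimitDemo {

    // Condensed copy of findLeftLimit: index just after the nearest
    // delimiter at or to the left of pos, or -1 if there is none.
    static int findLeftLimit(String content, int pos, String limitchars) {
        for (int i = pos; i > -1; i--) {
            if (limitchars.indexOf(content.charAt(i)) != -1) {
                return i + 1;
            }
        }
        return -1;
    }

    // Condensed copy of findRightLimit: index of the nearest delimiter
    // at or to the right of pos, or -1 if there is none.
    static int findRightLimit(String content, int pos, String limitchars) {
        for (int i = pos; i < content.length(); i++) {
            if (limitchars.indexOf(content.charAt(i)) != -1) {
                return i;
            }
        }
        return -1;
    }

    // Condensed copy of transformLink: '?' becomes '!' and ".html" is
    // appended unless the link already ends in ".html" or ".htm".
    static String transformLink(String link) {
        String transformed = link.replace('?', '!');
        String lower = transformed.toLowerCase();
        if (!(lower.endsWith(".html") || lower.endsWith(".htm"))) {
            transformed = transformed + ".html";
        }
        return transformed;
    }

    // Hypothetical helper (an assumption, not in the original): returns the
    // text between the quote characters that surround position pos.
    static String extractQuoted(String content, int pos) {
        String quotes = "\"'";
        int left = findLeftLimit(content, pos, quotes);
        int right = findRightLimit(content, pos, quotes);
        if (left == -1 || right == -1 || left > right) {
            return null;
        }
        return content.substring(left, right);
    }

    public static void main(String[] args) {
        String html = "<a href=\"page.php?id=3\">link</a>";
        String link = extractQuoted(html, html.indexOf("page"));
        System.out.println(link);                 // page.php?id=3
        System.out.println(transformLink(link));  // page.php!id=3.html
    }
}
```

Note that `findLeftLimit` returns the position after the delimiter while `findRightLimit` returns the delimiter's own position, so the pair can be passed straight to `substring`.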
