📄 testpageparsermanager2.java

📁 A collection system that crawls data for a vertical search engine
💻 JAVA
/*
 * *****************************************************
 * Copyright (c) 2005 IIM Lab. All  Rights Reserved.
 * Created by xuehao at 2005-10-12
 * Contact: zxuehao@mail.ustc.edu.cn
 * *****************************************************
 */

package org.indigo.tests.parser;

import java.util.ArrayList;

import org.indigo.pages.CollectedIdsPage;
import org.indigo.pages.CollectedPage;
import org.indigo.pages.VisitPage;
import org.indigo.parser.PageParserManager;
import org.indigo.parser.Parser;

import junit.framework.TestCase;

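/**
 * Exercises PageParserManager end to end: walks the paginated list pages,
 * collects record ids from each one, then prints the items extracted from
 * every detail page. The test prints results and makes no assertions.
 */
public class TestPageParserManager2 extends TestCase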
{
        public void testPageParserManager2()
        {
            String url = "http://www.ahnw.gov.cn/scxx/gqrx/?page=1&lb=1%C5%A9%B8%B1%B2%FA%C6%B7&key=&r=a&SortName=";

            VisitPage vPage = new VisitPage("page");
            vPage.setBeginUrl(url);
            vPage.setParameters(1, 3, 1);

            CollectedPage cPage = new CollectedPage("id");
            cPage.setBeginUrl( "http://www.ahnw.gov.cn/scxx/gqrx/content.asp?id=" );

            String idFront, idBack;
            idFront = "<a href=\"javascript:opencontent(";
            idBack = ");";
            CollectedIdsPage idsPage = new CollectedIdsPage( idFront, idBack );
            idsPage.setVisitPage( vPage );

            Parser parser = new Parser();
            PageParserManager pageMag = new PageParserManager(true);
            pageMag.setParser(parser);

            String startStr, endStr;

            startStr = "<td colspan=\"2\" class=\"br\" height=\"30\" bgcolor=\"#E8F3F7\" align=\"center\"><b>";
            endStr = "</b></td>";
            pageMag.addField(startStr, endStr);

            startStr = "<td class=\"br\" height=\"24\" bgcolor=\"#E8F3F7\" width=\"341\">";
            endStr = "</td>";
            pageMag.addField(startStr, endStr);

            startStr = "<td class=\"br\" height=\"24\" bgcolor=\"#E8F3F7\" width=\"309\">";
            endStr = "</td>";
            pageMag.addField(startStr, endStr);

            startStr = "<td class=\"br\" height=\"12\" bgcolor=\"#E8F3F7\" colspan=\"2\">";
            endStr = "</td>";
            pageMag.addField(startStr, endStr);

            startStr = "<td class=\"br\" height=\"8\" bgcolor=\"#E8F3F7\" width=\"341\">";
            endStr = "</td>";
            pageMag.addField( startStr, endStr );
            
            startStr = "<td class=\"br\" height=\"8\" bgcolor=\"#E8F3F7\" width=\"309\">";
            endStr = "</td>";
            pageMag.addField( startStr, endStr );
            
            String aItem = null;
            
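            // Walk the list pages; the loop ends when getNextVisitLink() returns null.
            url = vPage.getCurrentLink();
            while( url != null )
            {
                // Collect the record ids found on the current list page.
                idsPage.setUrl( url );
                ArrayList ids = idsPage.getIds();

                for( int i = 0; i < ids.size(); i++ )
                {
                    String id = (String) ids.get(i);
                    // Use a separate variable so the list-page url is not overwritten.
                    String detailUrl = cPage.getCollectedUrl( id );
                    System.out.println( detailUrl );

                    pageMag.setCollectedUrl( detailUrl );
                    pageMag.open();
                    try
                    {
                        // Print every item extracted from the detail page.
                        do
                        {
                            aItem = pageMag.getAItem();
                            if (aItem != null)
                                System.out.println(aItem);

                        } while (aItem != null);
                    }
                    finally
                    {
                        // Release the page even if extraction fails.
                        pageMag.close();
                    }
                }
                url = vPage.getNextVisitLink();
            }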
                   
            System.out.println("over");
        }
        
}
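
The test above configures both `CollectedIdsPage` and `PageParserManager.addField` with a pair of literal start and end strings. The internals of those classes are not shown here, so the following is only a minimal sketch of how delimiter-based extraction of this kind could work. The class `DelimiterExtractor` and its method `extractAll` are hypothetical names introduced for illustration and are not part of the `org.indigo` packages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.indigo: a sketch of start/end delimiter extraction.
public class DelimiterExtractor {

    /**
     * Returns every substring of html that sits between an occurrence of
     * start and the next occurrence of end, in document order.
     */
    public static List<String> extractAll(String html, String start, String end) {
        List<String> matches = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = html.indexOf(start, from);
            if (s < 0) {
                break; // no further start delimiter
            }
            int contentStart = s + start.length();
            int e = html.indexOf(end, contentStart);
            if (e < 0) {
                break; // unterminated field; stop rather than guess
            }
            matches.add(html.substring(contentStart, e).trim());
            from = e + end.length(); // resume scanning after this field
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample markup shaped like the id links the test looks for.
        String html = "<a href=\"javascript:opencontent(101);\">A</a>"
                    + "<a href=\"javascript:opencontent(102);\">B</a>";
        List<String> ids = extractAll(html, "<a href=\"javascript:opencontent(", ");");
        System.out.println(ids); // prints [101, 102]
    }
}
```

Plain `indexOf` scanning keeps the sketch dependency-free, but it is sensitive to any change in the page markup, such as a reordered attribute or extra whitespace inside the start string.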
