📄 xmltablewriter.java

📁 A data collection system that crawls data for vertical search engines
💻 JAVA
/*
 * *****************************************************
 * Copyright (c) 2005 IIM Lab. All  Rights Reserved.
 * Created by xuehao at 2005-10-12
 * Contact: zxuehao@mail.ustc.edu.cn
 * *****************************************************
 */

package org.indigo.xml;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;

import org.indigo.parser.ItemParser;
import org.indigo.util.MainConfig;
import org.indigo.util.TaskProperties;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Verifier;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

/**
 * Writes collected data to an XML file.
 * @author wbz
 *
 */
public class XmlTableWriter
{
    
    private Document itsDoc=null;
    private Element itsRows=null;
    private String itsXmlName=null,itsTaskName;
    private HashMap itsCols=null;
    
    private HashMap itsDefaultCols=null;
    private HashMap itsDefaultValues=null;
    public XmlTableWriter( String taskName, String xmlName )
    {
        
        itsTaskName = taskName;
        itsXmlName = xmlName;
        itsCols = new HashMap();
        
        // Create the parent directory if the path names one; a bare file
        // name has no parent directory to create.
        File dir = new File( xmlName ).getParentFile();
        if( dir!=null )
            dir.mkdirs();
        
        open();
        close();
    }
    public String getFileName()
    {
        return itsXmlName;
    }
    /**
     * Builds the XML document in memory, then reads some parameters
     * from the task template and writes them into the document.
     *
     */
    private void open()
    {
        Element root,definition,data;
        root = new Element( "Table" );
        itsDoc = new Document( root );
        
        definition = new Element( "Definition" );
        data = new Element( "Data" );
        
        root.addContent( definition );
        root.addContent( data );
        itsRows = new Element( "rows");
        data.addContent( itsRows );
        
        TaskProperties props = new TaskProperties();
        props.open( itsTaskName );
        
        String tableType=null;
        tableType = props.getProperty( "TableType" );
        if( tableType==null || tableType.equals("") )
            tableType = "unknown";
        definition.addContent( new Element("tabletype").setText( tableType ) );
        
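        String str = null;
        int ruleCount = 0,i;
        str = props.getProperty("RuleCount");
        // A missing or blank RuleCount is treated as zero columns.
        if( str!=null && !str.trim().equals("") )
            ruleCount = Integer.parseInt( str.trim() );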
        for( i=0; i<ruleCount; i++ )
        {
            str = props.getProperty( "EngName"+(i+1));
            if( str==null || str.equals("") )
                str = "unknown";
            itsCols.put( "col"+(i+1), str );
            definition.addContent( new Element("col"+(i+1)).setText(str) );
//            System.out.println( "available i="+i );
        }
        
        // The page_url column is declared only when ShowUrl is "true".
        str = MainConfig.getInstance().getProperty( "ShowUrl" );
        if( str!=null && str.equalsIgnoreCase("true") )
        {
            itsCols.put( "col"+(i+1), "page_url" );
            definition.addContent( new Element("col"+(i+1)).setText("page_url") );
        }

        // Step past the page_url slot: default columns start at col(ruleCount+2).
        i += 2;
        int j=1;
        
        while( props.isContainKey( "DefaultEngName"+j) )
        {
            if( itsDefaultCols==null )
                itsDefaultCols = new HashMap();
            if( itsDefaultValues==null )
                itsDefaultValues = new HashMap();
            
            str = props.getProperty( "DefaultEngName"+j );
            itsDefaultCols.put( "col"+i, str );
            definition.addContent( new Element("col"+i).setText(str) );
            
            str = props.getProperty( "DefaultValue"+j );
            itsDefaultValues.put( "DefaultValue"+i, str );
            
            i++;
            j++;
        }
        ///////        
    }
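    /**
     * Appends one collected item to the XML document as a new row.
     * @param aItem  the collected data, with fields separated by ItemParser.GAP_TOKEN.
     */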
    public void appendData( String aItem )
    {
//    	System.out.println( aItem );
        
        int i,j=0;
        String str=aItem,sub,colName;

        Element row = new Element( "row" );
        itsRows.addContent( row );

        // Fields are separated by GAP_TOKEN; the n-th field goes to column "col"+n.
        // A field whose column is not defined is skipped.
        i = str.indexOf( ItemParser.GAP_TOKEN );
        while( i!=-1 )
        {
            sub = str.substring( 0, i );
            colName = (String) itsCols.get( "col"+(j+1) );
            j++;
            if( colName!=null )
                row.addContent( new Element(colName).setText( stripInvalidXmlChars(sub) ) );
            str = str.substring( i+1 );
            i = str.indexOf( ItemParser.GAP_TOKEN );
        }
        // The text after the last GAP_TOKEN is the final field.
        colName = (String) itsCols.get( "col"+(j+1) );
        j++;
        if( colName!=null )
            row.addContent( new Element(colName).setText( stripInvalidXmlChars(str) ) );

        // Default columns are numbered from one past the last field.
        i = j+1;

        if( itsDefaultCols!=null )
        {
            while( itsDefaultCols.containsKey("col"+i) )
            {
                colName = (String)itsDefaultCols.get( "col"+i );
                sub = (String)itsDefaultValues.get( "DefaultValue"+i );
                row.addContent( new Element(colName).setText(sub) );
                i++;
            }
        }
    }
    /**
     * Returns the text with characters that are not legal in XML removed.
     */
    private static String stripInvalidXmlChars( String text )
    {
        StringBuffer buf = new StringBuffer( text.length() );
        for( int k=0; k<text.length(); k++ )
        {
            if( Verifier.isXMLCharacter(text.charAt(k)) )
                buf.append( text.charAt(k) );
        }
        return buf.toString();
    }
    /**
     * Writes the in-memory XML document to disk once collection
     * for the URLs specified by a template has finished.
     *
     */
    public void close()
    {
        Format format = Format.getCompactFormat();
        format.setEncoding( "gb2312" );
        format.setIndent( "    " );
        
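        XMLOutputter XMLout = new XMLOutputter( format );
        // try-with-resources closes the stream even if output() fails.
        try( FileOutputStream out = new FileOutputStream( itsXmlName ) )
        {
            XMLout.output( itsDoc, out );
        } catch (FileNotFoundException e)
        {
            // The target file could not be created or opened.
            e.printStackTrace();
        } catch (IOException e)
        {
            // Writing or closing the stream failed.
            e.printStackTrace();
        }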
    }
}
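
The splitting and filtering done in appendData can be tried without JDOM or the rest of the project. The sketch below is a standalone illustration, not part of the original class: `XmlFieldSketch`, `splitFields`, `stripInvalidXmlChars`, and the tab separator are all assumed names and values (the real `ItemParser.GAP_TOKEN` is not shown in this file), and the character check is written directly from the XML 1.0 `Char` production rather than calling `Verifier.isXMLCharacter`.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlFieldSketch
{
    // Hypothetical separator; the real value of ItemParser.GAP_TOKEN is not shown here.
    static final char GAP = '\t';

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar( int cp )
    {
        return cp==0x9 || cp==0xA || cp==0xD
            || (cp>=0x20 && cp<=0xD7FF)
            || (cp>=0xE000 && cp<=0xFFFD)
            || (cp>=0x10000 && cp<=0x10FFFF);
    }

    // Drops every code point that XML 1.0 does not allow in a document.
    static String stripInvalidXmlChars( String text )
    {
        StringBuilder buf = new StringBuilder( text.length() );
        for( int k=0; k<text.length(); )
        {
            int cp = text.codePointAt( k );
            if( isXmlChar(cp) )
                buf.appendCodePoint( cp );
            k += Character.charCount( cp );
        }
        return buf.toString();
    }

    // Splits an item on GAP and cleans each field; the text after the
    // last separator is kept as the final field, as appendData does.
    static List<String> splitFields( String item )
    {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int i = item.indexOf( GAP );
        while( i!=-1 )
        {
            fields.add( stripInvalidXmlChars( item.substring(start, i) ) );
            start = i+1;
            i = item.indexOf( GAP, start );
        }
        fields.add( stripInvalidXmlChars( item.substring(start) ) );
        return fields;
    }

    public static void main( String[] args )
    {
        // prints [TitleA, 42, http://example.com/page]
        System.out.println( splitFields( "Title\u0000A\t42\thttp://example.com/page" ) );
    }
}
```

Searching from a moving start index avoids re-copying the remaining string on every field, which is one possible alternative to the repeated substring calls in appendData.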
