📄 ictclascaller.java

package org.apache.nutch.analysis;

/**
 * <p>Title: ICTCLAS Caller</p>
 * <p>Description: performs Chinese word segmentation. Do not change the package name, the class name,
 * or the native method names; otherwise the JNI calls will fail.
 * ICTCLAS itself has many robustness problems, so when calling segSentence make sure the String
 * argument is not too long and contains no garbled characters. A very large number of calls
 * (e.g. processing tens of GB of data) may cause a memory overflow,
 * so this class is basically only suitable for smaller data sets (relative to tens of GB).
 * Give the JVM enough heap and stack space at run time,
 * e.g. java -Xmx500m caomo.ICTCLASCaller
 * This class needs WordSegDll.dll and a Data directory in the working directory (Data contains the dictionaries needed by the DLL).
 * The content of the Data directory can be found at http://www.nlp.org.cn/project/project.php?proj_id=6
 * </p>
 * <p>Copyright: Copyright (c) 2004. WORDSEGDLL is just a JNI wrapper around ICTCLAS, so its usage follows the copyright of ICTCLAS.
 * For source code and more usage details, please see http://www.nlp.org.cn/project/project.php?proj_id=6 </p>
 * <p>Company: BUAA SEI (Software Engineering Institute, Beihang University) http://sei.buaa.edu.cn </p>
 * @author Yong-gang Cao http://spaces.msn.com/caomo
 * @version 1.0
 */

public class ICTCLASCaller {
  // Load the DLL that exports the functions callable from Java
  static {
    System.load(
        "C:\\WINDOWS\\system32\\WordSegDll.dll"); // put the DLL at this path
//   System.loadLibrary("WordSegDll");  // alternative to the line above: takes the library name without ".dll" and needs the DLL on java.library.path; see the Java documentation.
  }

  public native String segSentence(String sentence);

  public ICTCLASCaller() {
  }
  
  /*
   * This adjustment strategy is used with the ICTCLASCaller segmenter and also applies to any
   * segmentation algorithm that only inserts spaces between Chinese characters.
   * Used for nutch highlight.
   * see http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200504.mbox/%3cBAY15-F1F4FCC11C5093A1C4E5F4DA330@phx.gbl%3e
   * @author cyz http://spaces.msn.com/caoyuzhong
   */
  public void adjustPosition(char[] originalBuf,int originalLength,
                              char[] modifiedBuf,int modifiedLength,
                              int[] adjustPos, int newPosition){
      
      for(int i=0;i<newPosition;i++){ // the part of a token left in the buffer by the previous pass always maps one-to-one
          adjustPos[i]=i;
      }
      int modifiedPos=newPosition;
      int originalPos=0;
      while( (modifiedPos<modifiedLength)&&(originalPos<originalLength) ){
          // skip spaces in both buffers
          while( (originalPos<originalLength)&&(originalBuf[originalPos]==' ') ){
              originalPos++;
          }
          while( (modifiedPos<modifiedLength)&&(modifiedBuf[modifiedPos]==' ') ){
              adjustPos[modifiedPos]=0;
              modifiedPos++;
          }
          if( (originalPos>=originalLength) || (modifiedPos>=modifiedLength) ){
               if( (originalPos>=originalLength) && (modifiedPos>=modifiedLength) ){
                   break;  // matched successfully
               }else if(modifiedPos>=modifiedLength){
                   break; // matched successfully
               }
               else{ // a mismatch occurred; this should not normally happen
                   while(modifiedPos<modifiedLength){
                       adjustPos[modifiedPos] = 0; // even then, make sure no string or array index goes out of bounds
                       modifiedPos++;
                   }
                   break; // the original buffer is exhausted, so do not read past its end below
               }
          }

          char c1=originalBuf[originalPos];
          char c2=modifiedBuf[modifiedPos];
          if(c1==c2){
              adjustPos[modifiedPos]=originalPos+newPosition;
              originalPos++;
              modifiedPos++;
          }
          else{
              adjustPos[modifiedPos]=originalPos+newPosition;
              modifiedPos++;
          }
      }

      while (modifiedPos < modifiedLength) {
          adjustPos[modifiedPos] = 0; // make sure no string or array index goes out of bounds
          modifiedPos++;
      }

  }


  public static void main(String[] args) {
    //for testing purpose
    ICTCLASCaller test = new ICTCLASCaller();
    /*
    // testing system properties
    Properties props = System.getProperties();
    Enumeration names = props.propertyNames();
    while (names.hasMoreElements()) {
      String key = (String) names.nextElement();
      String value = props.getProperty(key);
      System.out.println(key + ":" + value);
    }
    // testing current directory
    Process pr;
    try {
      pr = Runtime.getRuntime().exec("cmd /c dir");
      BufferedReader dis = new BufferedReader(new InputStreamReader(pr.getInputStream()));
      String str;
      while ( (str = dis.readLine()) != null) {
        System.out.println(str);
      }
    }
    catch (IOException ex) {
      System.out.print(ex.getMessage());
    }
    */
    //testing segmentation
    System.out.print(test.segSentence("happy-birthday yeah! 张华平和曹勇刚学生病的样子"));//just for fun!:P
  }
}
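
The position mapping done by adjustPosition can be tried without the native DLL. The following is only a minimal sketch, not the author's code: the class AdjustPositionSketch and the method mapPositions are hypothetical names, it assumes newPosition is 0 (no partial token carried over from a previous pass), and the segmented text is a made-up example of what a segmenter might return for a fragment of the test sentence in main (written with Unicode escapes so the source stays ASCII).

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch of the mapping idea behind adjustPosition,
// assuming newPosition == 0. It does not load or call WordSegDll.dll.
public class AdjustPositionSketch {

    // For each index of the segmented text, returns the index of the matching
    // character in the original text. Inserted spaces map to 0, as in adjustPosition.
    static int[] mapPositions(char[] original, char[] segmented) {
        int[] map = new int[segmented.length]; // Java zero-initialises the array
        int o = 0; // cursor in the original text
        int s = 0; // cursor in the segmented text
        while (s < segmented.length && o < original.length) {
            // skip spaces on both sides
            while (o < original.length && original[o] == ' ') {
                o++;
            }
            while (s < segmented.length && segmented[s] == ' ') {
                map[s] = 0;
                s++;
            }
            if (o >= original.length || s >= segmented.length) {
                break; // one side is exhausted; remaining entries stay 0
            }
            map[s] = o;
            if (original[o] == segmented[s]) {
                o++; // characters agree, so advance in the original as well
            }
            s++;
        }
        return map;
    }

    public static void main(String[] args) {
        // Seven Chinese characters taken from the test sentence in main
        char[] original =
            "\u5f20\u534e\u5e73\u548c\u66f9\u52c7\u521a".toCharArray();
        // Assumed segmenter output: the same characters with two inserted spaces
        char[] segmented =
            "\u5f20\u534e\u5e73 \u548c \u66f9\u52c7\u521a".toCharArray();
        // prints [0, 1, 2, 0, 3, 0, 4, 5, 6]
        System.out.println(Arrays.toString(mapPositions(original, segmented)));
    }
}
```

Each non-space entry points back to the character's offset in the unsegmented text, which is what a highlighter needs in order to mark the right span in the original page.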