package org.apache.nutch.analysis;
/**
 * <p>Title: ICTCLAS Caller</p>
 * <p>Description: Performs Chinese word segmentation.
 * Do not change the package name, the class name, or the name of the native method;
 * otherwise the JNI call will fail.
 * ICTCLAS itself has a number of robustness problems: when calling segSentence, make sure
 * the string argument is not too long and contains no garbled characters. A very large number
 * of calls (for example, when processing tens of gigabytes of data) may exhaust memory,
 * so this class is essentially suitable only for smaller data sets.
 * Give the JVM enough memory (heap and stack) at run time,
 * e.g. java -Xmx500m caomo.ICTCLASCaller
 * This class needs WordSegDll.dll and a Data directory in the working directory
 * (the directory contains the dictionaries needed by the DLL).
 * The contents of the Data directory can be found at http://www.nlp.org.cn/project/project.php?proj_id=6
 * </p>
 * <p>Copyright: Copyright (c) 2004. WordSegDll is just a JNI wrapper around ICTCLAS, so its use is subject to the ICTCLAS copyright.
 * For the source code and more usage details, see http://www.nlp.org.cn/project/project.php?proj_id=6 </p>
 * <p>Company: BUAA SEI (北京航空航天大学软件工程研究所, Software Engineering Institute, Beihang University) http://sei.buaa.edu.cn </p>
 * @author Yong-gang Cao http://spaces.msn.com/caomo
 * @version 1.0
 */
public class ICTCLASCaller {
    // Load the DLL that exports the functions callable from Java.
    static {
        // Put the DLL at this location, or adjust the path.
        System.load("C:\\WINDOWS\\system32\\WordSegDll.dll");
        // System.loadLibrary("WordSegDll"); // Can replace the call above, but the DLL must then be
        // on java.library.path; note that loadLibrary takes the library name without the ".dll" suffix.
    }

    public native String segSentence(String sentence);

    public ICTCLASCaller() {
    }
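
    /**
     * This adjustment strategy is used for the ICTCLASCaller segmenter, and it also applies to any
     * segmentation algorithm that only inserts spaces between Chinese characters.
     * Used for Nutch highlighting.
     * See http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200504.mbox/%3cBAY15-F1F4FCC11C5093A1C4E5F4DA330@phx.gbl%3e
     * @author cyz http://spaces.msn.com/caoyuzhong
     */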
    public void adjustPosition(char[] originalBuf, int originalLength,
                               char[] modifiedBuf, int modifiedLength,
                               int[] adjustPos, int newPosition) {
        // The part of a token left in the buffer by the previous pass maps one-to-one.
        for (int i = 0; i < newPosition; i++) {
            adjustPos[i] = i;
        }
        int modifiedPos = newPosition;
        int originalPos = 0;
        while ((modifiedPos < modifiedLength) && (originalPos < originalLength)) {
            // Skip spaces in the original text.
            while ((originalPos < originalLength) && (originalBuf[originalPos] == ' ')) {
                originalPos++;
            }
            // Skip spaces in the segmented text; they are mapped to position 0.
            while ((modifiedPos < modifiedLength) && (modifiedBuf[modifiedPos] == ' ')) {
                adjustPos[modifiedPos] = 0;
                modifiedPos++;
            }
            if (modifiedPos >= modifiedLength) {
                break; // Matched successfully.
            }
            if (originalPos >= originalLength) {
                // Mismatch: the original text ran out first. This should not normally happen.
                // Leave the loop before either buffer is read out of bounds; the loop after
                // this one zeroes the remaining positions.
                break;
            }
            // Both branches of the original comparison record the same position;
            // the original index advances only when the characters match.
            adjustPos[modifiedPos] = originalPos + newPosition;
            if (originalBuf[originalPos] == modifiedBuf[modifiedPos]) {
                originalPos++;
            }
            modifiedPos++;
        }
        // Zero any remaining positions so that callers never see stale values.
        while (modifiedPos < modifiedLength) {
            adjustPos[modifiedPos] = 0;
            modifiedPos++;
        }
    }

    public static void main(String[] args) {
        // For testing purposes.
        ICTCLASCaller test = new ICTCLASCaller();
        /*
        // Test the system properties (needs java.util.Properties and java.util.Enumeration).
        Properties props = System.getProperties();
        // for(int i=0;i<props.keys()
        // "enum" is a reserved word since Java 5, so the variable is named "names" here.
        Enumeration names = props.propertyNames();
        while (names.hasMoreElements()) {
            String key = (String) names.nextElement();
            String value = props.getProperty(key);
            System.out.println(key + ":" + value);
        }
        // Test the current directory (needs java.io.BufferedReader, InputStreamReader and IOException).
        Process pr;
        try {
            pr = Runtime.getRuntime().exec("cmd /c dir");
            BufferedReader dis = new BufferedReader(new InputStreamReader(pr.getInputStream()));
            String str;
            while ((str = dis.readLine()) != null) {
                System.out.println(str);
            }
        }
        catch (IOException ex) {
            System.out.print(ex.getMessage());
        }
        */
        // Test segmentation.
        System.out.print(test.segSentence("happy-birthday yeah! 张华平和曹勇刚学生病的样子")); // just for fun! :P
    }
}
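
The position mapping that adjustPosition computes can be tried without the native DLL. The following is a minimal sketch, not part of the original source: the class name AdjustPositionDemo and the helper map are hypothetical, it assumes newPosition is 0, and it assumes the segmenter only inserts spaces. The sample text is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.Arrays;

// Hypothetical demo class; it does not load WordSegDll.dll or call ICTCLAS.
class AdjustPositionDemo {

    /**
     * Maps every index of the segmented text back to an index in the original text.
     * Spaces in the segmented text map to 0, mirroring adjustPosition with newPosition == 0.
     */
    static int[] map(String original, String segmented) {
        char[] src = original.toCharArray();
        char[] seg = segmented.toCharArray();
        int[] adjustPos = new int[seg.length];
        int srcPos = 0;
        int segPos = 0;
        while (segPos < seg.length) {
            // Skip spaces that were already present in the original text.
            while (srcPos < src.length && src[srcPos] == ' ') {
                srcPos++;
            }
            if (seg[segPos] == ' ' || srcPos >= src.length) {
                // A space, or nothing left to match against: map to 0.
                adjustPos[segPos++] = 0;
            } else if (seg[segPos] == src[srcPos]) {
                // Matching characters: record the position and advance both indexes.
                adjustPos[segPos++] = srcPos++;
            } else {
                // Unexpected character: record the position and advance only the segmented index.
                adjustPos[segPos++] = srcPos;
            }
        }
        return adjustPos;
    }

    public static void main(String[] args) {
        // Four Chinese characters, and the same text with one space inserted after the second.
        String original = "\u5317\u4eac\u5927\u5b66";
        String segmented = "\u5317\u4eac \u5927\u5b66";
        System.out.println(Arrays.toString(map(original, segmented))); // prints [0, 1, 0, 2, 3]
    }
}
```

Each entry gives, for one character of the segmented text, the index of the corresponding character in the original text, which is what a highlighter needs in order to mark offsets in the unsegmented page.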