</p>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=463>
<p style="margin-top: 0; margin-bottom: 0">
<font face="华文楷体"><a href="#evaluation">ICTCLAS Performance Evaluation</a></font>
<ul>
<li>
<p style="margin-top: 0; margin-bottom: 0"><font face="华文楷体"> <a href="#973">ICTCLAS Results in the 973 Evaluation</a></font></li>
<li>
<p style="margin-top: 0; margin-bottom: 0"><font face="华文楷体"><a href="#acl">Results of the First International Chinese Word Segmentation Bakeoff</a></font></li>
</ul>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=463>
<p style="margin-top: 0; margin-bottom: 0">
<font face="华文楷体"><a href="#Dairly">ICTCLAS Milestones</a></font>
</p>
</TD>
</TR>
</TBODY>
</TABLE>
<P><BR>[Hua-Ping Zhang 2003 : Chapter 1 - Introduction / 1]
</P>
<h2><a name="background"><span style="font-family:黑体;mso-ascii-font-family:
Arial"></a>Background</span></h2>
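<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"MS Song\, \000B\, Beijing";mso-hansi-font-family:"MS Song\, \000B\, Beijing""> <font color="#800000">A word is the smallest meaningful linguistic unit that can be used independently</font></span><font color="#800000"><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">,</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> but written Chinese uses the character as its basic unit, and nothing explicitly marks the boundaries between words. Chinese lexical analysis is therefore both the foundation of and the key to Chinese information processing. Any system that processes Chinese content will lose considerable accuracy without a good Chinese lexical analyzer behind it. Specifically, the main application areas of Chinese lexical analysis include:</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>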
</o:p>
</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Information retrieval (search engines)</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>
</o:p>
</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Machine translation</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>
</o:p>
</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Text classification, summarization, and filtering</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Information extraction</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-hansi-font-family:"Times New Roman"">Other fields related to Chinese content processing</span></font></p>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"Times New Roman";mso-hansi-font-family:"Times New Roman""><font color="#800000">Chinese lexical analysis is also a very hard problem. The main difficulties are:</font></span></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
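mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Word segmentation: Chinese words are not separated by spaces, so words must be correctly identified within a continuous string of characters. For example, “的确切” can be segmented as “的确/切” or as “的/确切”, and “马上” can be a single word meaning “right away” or two words, “马/上”, meaning “on the horse”. Ambiguities of this kind are very common in Chinese and seriously interfere with word segmentation;</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>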
</o:p>
</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
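mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Unknown (out-of-vocabulary) word recognition: no dictionary can list every word. Large numbers of personal names, place names, organization names, transliterated foreign names, and newly coined words, such as “王小山”, “十里堡”, “北京计算机研究所”, “瓦杰帕依”, and “非典”, have to be recognized automatically by software. In Chinese these unknown words have no spaces to mark their boundaries, and they are built from ordinary characters that carry meanings of their own, so recognizing them is very difficult;</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>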
</o:p>
</span></font></p>
<p class="MsoNormal" style="margin-left:63.0pt;text-indent:-21.0pt;mso-list:l5 level1 lfo7;
tab-stops:list 63.0pt"><span lang="EN-US" style="font-family:
Wingdings"><font color="#800000">l<span style="font:7.0pt "Times New Roman"">
</span></font></span><font color="#800000"><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
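mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Part-of-speech tagging: many Chinese words belong to more than one part of speech. For example, “领导” can be a verb (“to lead”) or a noun (“leader”), so assigning the correct tag to every word is also difficult.</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>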
</o:p>
</span></font></p>
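<p class="MsoNormal">To make the segmentation ambiguity described above concrete, the following minimal Python sketch enumerates every way to split a string into dictionary words. The function name <code>segmentations</code> and the toy lexicon are assumptions made purely for illustration; they are not part of ICTCLAS, which must additionally choose the best candidate among such splits.</p>

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon: an assumption for illustration, not the ICTCLAS dictionary.
LEXICON = {"的", "的确", "确切", "切", "马", "上", "马上"}

if __name__ == "__main__":
    for phrase in ("的确切", "马上"):
        for words in segmentations(phrase, LEXICON):
            print("/".join(words))
```

<p class="MsoNormal">With this toy lexicon each phrase has two valid segmentations, so a dictionary lookup alone cannot decide between them; some scoring model is needed.</p>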
<span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\', '\000B\', Beijing; mso-hansi-font-family: 'MS Song\', '\000B\', Beijing"><font color="#800000">
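Although Chinese lexical analysis has been studied for a long time, many application systems still lack an integrated analyzer that addresses all of the problems above. Building on years of accumulated research, the Institute of Computing Technology of the Chinese Academy of Sciences spent two years developing the Chinese lexical analysis system ICTCLAS (Institute
of Computing Technology, Chinese Lexical Analysis System). The system has taken several first places in well-known evaluations in China and abroad, including the evaluation organized by China's 973 expert group and the evaluation organized by SIGHAN, the international special interest group on Chinese language processing. It has been widely influential and has been adopted in research, teaching, and commercial systems at many well-known universities, research institutes, and companies in China and abroad, with good economic and social returns.</font></span>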
<p>>><span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\'', '\000B \'', Beijing; mso-hansi-font-family: 'MS Song\'', '\000B \'', Beijing"><a href="#chap1">Back</a>
</span>>|<span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\'', '\000B \'', Beijing; mso-hansi-font-family: 'MS Song\'', '\000B \'', Beijing"><a href="#header">Top</a></span></p>
<h2><span style="font-family:黑体;mso-ascii-font-family:
Arial"><A NAME="background"></A></span><span style="font-family: 黑体; mso-ascii-font-family: Arial">
</span><a name="ICTCLAS"><span lang="EN-US">ICTCLAS</span></a><span style="font-family: 黑体; mso-ascii-font-family: Arial; mso-bookmark: _Toc51384088"> Overview</span></h2>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">ICTCLAS</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> is distinguished above all by its use of a </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">Hierarchical
Hidden Markov Model</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> (HHMM), which brings the main problems of Chinese lexical analysis (word segmentation, unknown word recognition, and part-of-speech tagging) into a single, complete theoretical framework in order to achieve the best overall results.</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">
<o:p>
</o:p>
</span></p>
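<p class="MsoNormal">As a rough illustration of how a hidden Markov model can resolve part-of-speech ambiguity, the sketch below runs the Viterbi algorithm over a single-layer, first-order HMM. It is not the hierarchical model that ICTCLAS uses; the function name <code>viterbi</code>, the tag labels, and every probability are invented for this example.</p>

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start[t] * emit[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximizes the path probability into t.
            prev = max(tags, key=lambda p: best[-1][p][0] * trans[p][t])
            score = best[-1][prev][0] * trans[prev][t] * emit[t].get(word, 0.0)
            column[t] = (score, prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model: all numbers are assumptions chosen for illustration only.
TAGS = ["r", "n", "v"]  # pronoun, noun, verb
START = {"r": 0.5, "n": 0.4, "v": 0.1}
TRANS = {
    "r": {"r": 0.1, "n": 0.2, "v": 0.7},
    "n": {"r": 0.1, "n": 0.3, "v": 0.6},
    "v": {"r": 0.3, "n": 0.6, "v": 0.1},
}
EMIT = {
    "r": {"他们": 1.0},
    "n": {"领导": 0.3, "公司": 0.4, "讲话": 0.3},
    "v": {"领导": 0.5, "讲话": 0.5},
}

if __name__ == "__main__":
    for sentence in (["他们", "领导", "公司"], ["领导", "讲话"]):
        print(list(zip(sentence, viterbi(sentence, TAGS, START, TRANS, EMIT))))
```

<p class="MsoNormal">Under these toy numbers “领导” is tagged as a verb in the first sentence and as a noun in the second, because the surrounding tags change which path is most probable. Plain probabilities are multiplied here for readability; a sketch meant for longer inputs would work with log probabilities instead.</p>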
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"MS Song\, \000B\, Beijing";mso-hansi-font-family:"MS Song\, \000B\, Beijing"">The system provides Chinese word segmentation, part-of-speech tagging, named entity recognition, new word detection, and user dictionary support.</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>
</o:p>
</span></p>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"MS Song\, \000B\, Beijing";mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Key features: it is written in </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">C/C++</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> and runs on the </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">Linux</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">, FreeBSD, and </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">Windows</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> families of </span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">operating systems. </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">ICTCLAS</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> comes in </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">GB2312</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> and </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">BIG5</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> versions, which handle Simplified and Traditional Chinese respectively. It supports the widely recognized segmentation and part-of-speech standards, including the ICT part-of-speech tag set </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">ICTPOS3.0</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">, the Peking University standard, the University of Pennsylvania standard, the State Language Commission standard, and the standards of Academia Sinica (Taiwan) and City University of Hong Kong. Users can define their own output tag set and output format, and can request multiple best results as needed. All functional modules can be detached and recombined.</span></p>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"MS Song\, \000B\, Beijing";mso-hansi-font-family:"MS Song\, \000B\, Beijing"">The ICT Chinese lexical analysis system </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">ICTCLAS</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> also ships with a complete </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">API</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> suite </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">(</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">including dynamic link libraries, static link libraries, library functions for </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">Linux</span><span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\', '\000B\', Beijing; mso-hansi-font-family: 'MS Song\', '\000B\', Beijing"> and FreeBSD</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing"">, and a </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">COM</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> component</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">)</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> together with the corresponding probabilistic dictionaries. Developers can call </span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing"">ICTCLAS</span><span style="font-family:宋体;mso-ascii-font-family:"MS Song\, \000B\, Beijing";
mso-hansi-font-family:"MS Song\, \000B\, Beijing""> directly from their own systems and build higher-level applications on top of its word segmentation and part-of-speech tagging.</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>
</o:p>
</span></p>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span style="font-family:宋体;mso-ascii-font-family:
"MS Song\, \000B\, Beijing";mso-hansi-font-family:"MS Song\, \000B\, Beijing"">Engineers and researchers in related fields are welcome to use the system and to send us their feedback.</span><span lang="EN-US" style="font-family:"MS Song\, \000B\, Beijing""><o:p>
</o:p>
</span></p>
<p>>><span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\'', '\000B \'', Beijing; mso-hansi-font-family: 'MS Song\'', '\000B \'', Beijing"><a href="#chap1">Back</a>
</span>>|<span style="font-family: 宋体; mso-ascii-font-family: 'MS Song\'', '\000B \'', Beijing; mso-hansi-font-family: 'MS Song\'', '\000B \'', Beijing"><a href="#header">Top</a></span></p>
<h2><a name="evaluation"><span lang="EN-US">ICTCLAS</span></a><span style="font-family: 黑体; mso-ascii-font-family: Arial; mso-bookmark: _Toc51384094"> Performance Evaluation</span></h2>
<h3><a name="973"><span lang="EN-US" style="mso-bookmark: OLE_LINK1">ICTCLAS</span></a><span style="mso-bookmark: OLE_LINK1"><span style="font-family:
宋体;mso-ascii-font-family:"Times New Roman";mso-hansi-font-family:"Times New Roman""> Results in the </span><span lang="EN-US">973</span><span style="font-family:宋体;mso-ascii-font-family:
"Times New Roman";mso-hansi-font-family:"Times New Roman""> Evaluation</span></span></h3>
<p class="MsoNormal" style="text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt"><span lang="EN-US" style="mso-bidi-font-size:10.5pt;
font-family:宋体;mso-hansi-font-family:"Times New Roman";mso-font-kerning:0pt">On July 6, 2002, ICTCLAS took part in the </span><span style="font-family:宋体">national <span lang="EN-US">973 English-Chinese machine translation project's phase-two </span></span><span style="mso-bidi-font-size:10.5pt;font-family:宋体;mso-hansi-font-family:"Times New Roman";
mso-font-kerning:0pt">open</span><span style="font-family:宋体"> evaluation. The results are as follows:<span lang="EN-US"><o:p>
</o:p>