📄 第二章 中文文本分类的关键技术.htm
</o:OLEObject>
</xml><![endif]--><o:p></o:p></SPAN></P>
<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">where </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1037 style="WIDTH: 36pt; HEIGHT: 18pt" o:ole="" type = "#_x0000_t75"
coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image023.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=48 height=24
src="第二章%20中文文本分类的关键技术.files/image024.gif" v:shapes="_x0000_i1037"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1037"
DrawAspect="Content" ObjectID="_1205238713">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the weight of term </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">t</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> in document </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1038 style="WIDTH: 12pt; HEIGHT: 15.75pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image025.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=16 height=21
src="第二章%20中文文本分类的关键技术.files/image026.gif" v:shapes="_x0000_i1038"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1038"
DrawAspect="Content" ObjectID="_1205238714">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">, </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1039 style="WIDTH: 38.25pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image027.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=51 height=24
src="第二章%20中文文本分类的关键技术.files/image028.gif" v:shapes="_x0000_i1039"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1039"
DrawAspect="Content" ObjectID="_1205238715">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the frequency of term </SPAN><I><SPAN
lang=EN-US style="FONT-SIZE: 12pt">t</SPAN></I><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> in document </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1040 style="WIDTH: 12pt; HEIGHT: 15.75pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image025.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=16 height=21
src="第二章%20中文文本分类的关键技术.files/image026.gif" v:shapes="_x0000_i1040"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1040"
DrawAspect="Content" ObjectID="_1205238716">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">, </SPAN><B><I><SPAN
lang=EN-US style="FONT-SIZE: 12pt">N</SPAN></I></B><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the total number of training documents, </SPAN><I><SPAN
lang=EN-US style="FONT-SIZE: 12pt">n</SPAN></I><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the dimension of the vector, </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1041 style="WIDTH: 9.75pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image029.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=13 height=24
src="第二章%20中文文本分类的关键技术.files/image016.gif" v:shapes="_x0000_i1041"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1041"
DrawAspect="Content" ObjectID="_1205238717">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the feature term corresponding to component </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">i</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> of the vector, </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1042 style="WIDTH: 15pt; HEIGHT: 18.75pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image030.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=20 height=25
src="第二章%20中文文本分类的关键技术.files/image031.gif" v:shapes="_x0000_i1042"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1042"
DrawAspect="Content" ObjectID="_1205238718">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the number of training documents that contain </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1043 style="WIDTH: 9.75pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image029.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=13 height=24
src="第二章%20中文文本分类的关键技术.files/image016.gif" v:shapes="_x0000_i1043"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1043"
DrawAspect="Content" ObjectID="_1205238719">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">, </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1044 style="WIDTH: 12.75pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image032.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=17 height=24
src="第二章%20中文文本分类的关键技术.files/image033.gif" v:shapes="_x0000_i1044"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1044"
DrawAspect="Content" ObjectID="_1205238720">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the number of training documents that contain </SPAN><I><SPAN
lang=EN-US style="FONT-SIZE: 12pt">t</SPAN></I><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">, and the denominator is a normalization factor</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"> </SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">that keeps the weight of every feature term within </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">[0</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">, </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">1]</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">.</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><o:p></o:p></SPAN></P>
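<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN lang=EN-US
style="FONT-SIZE: 12pt">The symbol definitions above can be turned into a minimal Python sketch. It assumes the commonly used cosine-normalized form, in which each raw weight is the term frequency multiplied by log(N / n_t + 0.01) and the denominator is the square root of the sum of squared raw weights in the document. The 0.01 smoothing constant and the function name <I>tfidf_weights</I> are assumptions made for illustration and may differ from the exact formula used in this chapter.</SPAN></P>

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (already segmented into words).
    Assumed form: tf * log(N / n_t + 0.01), divided by the Euclidean
    norm of the document's raw weights. The 0.01 is an assumption.
    """
    n_docs = len(docs)  # N: total number of training documents
    # n_t: number of training documents that contain term t
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weighted = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): frequency of term t in this document
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t] + 0.01) for t in tf}
        # Normalization factor: keeps every weight within [0, 1]
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted


docs = [["a", "b", "a"], ["b", "c"]]
weights = tfidf_weights(docs)
```

<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN lang=EN-US
style="FONT-SIZE: 12pt">In this toy collection the term "a" occurs twice in the first document and in no other document, so it receives a larger weight there than "b", which occurs in both documents.</SPAN></P>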
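<P class=MsoNormal
style="TEXT-INDENT: 24.1pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
lang=EN-US style="FONT-SIZE: 12pt">TF-IDF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> weighting as a text representation rests on the following assumption: the terms most useful for distinguishing text categories are those that occur frequently enough in documents of their own category and rarely enough in documents of the other categories in the collection. The vector space model has two advantages. First, it formalizes the content of a text as a point in a multidimensional space, given as a vector over the real numbers, which makes natural-language documents easier to compute with and to manipulate. Second, it assigns weights to feature terms, and adjusting a term's weight reflects how relevant the term is to the document that contains it, which partly overcomes the shortcomings of the traditional Boolean model</SPAN><SUP><SPAN
lang=EN-US style="FONT-SIZE: 12pt">[28]</SPAN></SUP><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">. It also has two drawbacks. It relies too heavily on what makes a document "distinctive" and so neglects the characteristics that documents share. It also describes text only at the lexical level and ignores the relationships between terms of similar meaning within a text.</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><o:p></o:p></SPAN></P>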
<H2><A name=_Toc122844534></A><A name=_Toc118729807></A><A
name=_Toc117686909><SPAN style="mso-bookmark: _Toc118729807"><SPAN
style="mso-bookmark: _Toc122844534"><SPAN
style="FONT-WEIGHT: normal; FONT-SIZE: 14pt; LINE-HEIGHT: 173%; FONT-FAMILY: 黑体; mso-ascii-font-family: 宋体; mso-hansi-font-family: 宋体; mso-bidi-font-weight: bold">§</SPAN></SPAN></SPAN></A><SPAN
style="mso-bookmark: _Toc117686909"><SPAN
style="mso-bookmark: _Toc118729807"><SPAN
style="mso-bookmark: _Toc122844534"><SPAN lang=EN-US
style="FONT-SIZE: 14pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt">2.3</SPAN></SPAN></SPAN></SPAN><SPAN
style="mso-bookmark: _Toc117686909"><SPAN
style="mso-bookmark: _Toc118729807"><SPAN
style="mso-bookmark: _Toc122844534"><SPAN
style="FONT-SIZE: 14pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt"> Text Feature Extraction</SPAN></SPAN></SPAN></SPAN><SPAN
lang=EN-US
style="FONT-SIZE: 14pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt"><o:p></o:p></SPAN></H2>
<P class=MsoNormal
style="TEXT-INDENT: 24pt; LINE-HEIGHT: 20pt; mso-char-indent-count: 2.0; mso-line-height-rule: exactly"><SPAN
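style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The term space of a text collection has very high dimensionality, and different terms contribute unequally to the content of a text. The weight of each term in the text therefore has to be measured, and only terms whose weight exceeds a given threshold are kept as keywords that characterize the content. Keyword extraction is also called text feature extraction, and it can mitigate overfitting to some extent.<SPAN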
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
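<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN lang=EN-US
style="FONT-SIZE: 12pt">The thresholding step described above can be sketched as follows. The function name <I>select_keywords</I>, the sample weights, and the threshold value 0.1 are hypothetical and chosen only for illustration; in practice the threshold would be tuned on the training data.</SPAN></P>

```python
def select_keywords(weights, threshold):
    """Keep the terms whose weight exceeds threshold, highest weight first.

    weights maps each term of one document to its weight.
    Ties are broken alphabetically so the result is deterministic.
    """
    kept = [(term, w) for term, w in weights.items() if w > threshold]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept]


# Hypothetical weights for a single document
sample_weights = {"classification": 0.82, "text": 0.55, "the": 0.03}
keywords = select_keywords(sample_weights, threshold=0.1)
print(keywords)  # ['classification', 'text']
```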
<P class=MsoNormal
style="TEXT-INDENT: 24pt; LINE-HEIGHT: 20pt; mso-char-indent-count: 2.0; mso-line-height-rule: exactly"><SPAN
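style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">Statistical pattern recognition uses feature parameters to represent a pattern as a vector in feature space and then classifies it with a discriminant function. Feature extraction means analyzing the raw data to find the essential features that best reflect the pattern classes, and it becomes harder as the volume of data grows. Computational cost also rises sharply as the dimensionality increases, so the dimensionality of the feature space has to be reduced. Feature extraction and selection are therefore the key to this technique. Text feature extraction is in essence dimensionality reduction for high-dimensional data: the data are mapped into a lower-dimensional space by a transformation. The main problem with dimensionality reduction is that the transformation may obscure information in the original data, so that categories that were clearly distinct in the high-dimensional space become mixed together and hard to separate in the low-dimensional space. The key to the transformation is therefore to find a suitable mapping that carries the target information of the high-dimensional space into the low-dimensional space as faithfully as possible.<SPAN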
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
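<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN lang=EN-US
style="FONT-SIZE: 12pt">A toy example, not taken from the source, may help show why the choice of mapping matters. Two hypothetical classes of two-dimensional points differ only in their second coordinate. Projecting onto the first axis merges the classes, while projecting onto the second axis keeps them apart. The helper names <I>project</I> and <I>class_gap</I> are assumptions made for this sketch.</SPAN></P>

```python
def project(points, direction):
    """Map each point to one coordinate: its dot product with direction."""
    return [sum(p * d for p, d in zip(point, direction)) for point in points]


def class_gap(proj_a, proj_b):
    """Smallest distance between a projected point of A and one of B."""
    return min(abs(a - b) for a in proj_a for b in proj_b)


# Two hypothetical classes that differ only in the second coordinate
class_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
class_b = [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)]

# Projection onto the first axis discards the coordinate that separates them
gap_on_x = class_gap(project(class_a, (1.0, 0.0)), project(class_b, (1.0, 0.0)))
# Projection onto the second axis preserves the separation
gap_on_y = class_gap(project(class_a, (0.0, 1.0)), project(class_b, (0.0, 1.0)))

print(gap_on_x, gap_on_y)  # 0.0 5.0
```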
<P class=MsoNormal
style="TEXT-INDENT: 21.75pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">There are four approaches to feature extraction:<SPAN
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<P class=MsoNormal
style="TEXT-INDENT: 21.75pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The first uses a mapping or transformation to convert the original features into a smaller set of new features;<SPAN
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<P class=MsoNormal
style="TEXT-INDENT: 21.75pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The second selects the most representative features from among the original features;<SPAN
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<P class=MsoNormal
style="TEXT-INDENT: 21.75pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The third selects the most influential features on the basis of expert knowledge;<SPAN
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<P class=MsoNormal
style="TEXT-INDENT: 21.75pt; LINE-HEIGHT: 20pt; mso-line-height-rule: exactly"><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The fourth selects features by mathematical methods, finding those that carry the most classification information