📄 第二章 中文文本分类的关键技术.htm
style="WIDTH: 27pt; HEIGHT: 15.75pt" o:ole="" type = "#_x0000_t75" coordsize =
"21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image007.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=36 height=21
src="第二章%20中文文本分类的关键技术.files/image008.gif" v:shapes="_x0000_i1028"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1028"
DrawAspect="Content" ObjectID="_1205238702">
</o:OLEObject>
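</xml><![endif]--></SPAN> denote the probabilities that a term occurs in the relevant and the non-relevant text sets, respectively. The strength of the probabilistic model is that it rests on rigorous mathematical theory, which gives matching a sound theoretical basis, and that it applies the relevance feedback principle, from which theoretically better-founded methods can be developed. Its weaknesses are the extra storage and computation it requires and the difficulty of estimating its parameters.<SPAN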
lang=EN-US><o:p></o:p></SPAN></SPAN></P>
<H3><A name=_Toc122844533></A><A name=_Toc118729806></A><A
name=_Toc117686908><SPAN style="mso-bookmark: _Toc118729806"><SPAN
style="mso-bookmark: _Toc122844533"><SPAN
style="FONT-WEIGHT: normal; FONT-SIZE: 12pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt; mso-bidi-font-weight: bold">§</SPAN></SPAN></SPAN></A><st1:chsdate
w:st="on" IsROCDate="False" IsLunarDate="False" Day="30" Month="12"
Year="1899"><SPAN style="mso-bookmark: _Toc122844533"><SPAN
style="mso-bookmark: _Toc118729806"><SPAN
style="mso-bookmark: _Toc117686908"><SPAN lang=EN-US
style="FONT-SIZE: 12pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt">2.2.3</SPAN></SPAN></SPAN></SPAN></st1:chsdate><SPAN
style="mso-bookmark: _Toc122844533"><SPAN
style="mso-bookmark: _Toc118729806"><SPAN
style="mso-bookmark: _Toc117686908"><SPAN lang=EN-US
style="FONT-SIZE: 12pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt">
</SPAN></SPAN></SPAN></SPAN><SPAN style="mso-bookmark: _Toc122844533"><SPAN
style="mso-bookmark: _Toc118729806"><SPAN
style="mso-bookmark: _Toc117686908"><SPAN
style="FONT-SIZE: 12pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt">Vector Space Model (<SPAN
lang=EN-US>VSM</SPAN>)</SPAN></SPAN></SPAN></SPAN><SPAN
lang=EN-US
style="FONT-SIZE: 12pt; LINE-HEIGHT: 173%; FONT-FAMILY: 宋体; mso-bidi-font-size: 16.0pt"><o:p></o:p></SPAN></H3>
<P class=MsoNormal
style="LINE-HEIGHT: 20pt; TEXT-ALIGN: left; mso-line-height-rule: exactly; mso-layout-grid-align: none"
align=left><SPAN lang=EN-US style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体"><SPAN
style="mso-tab-count: 1"> </SPAN></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体">The vector space model was proposed by <SPAN
lang=EN-US>Salton</SPAN> in <SPAN
lang=EN-US>1968</SPAN> and has long been the classic computational model in information retrieval. The vector space model</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> generally uses words to carry the feature information of a text, and each word is called a feature term. In the model, every text is represented as a point in the vector space spanned by a set of normalized, mutually orthogonal term vectors, that is, formally, as a vector in a space of dimension </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1029 style="WIDTH: 9.75pt; HEIGHT: 11.25pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image009.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=13 height=15
src="第二章%20中文文本分类的关键技术.files/image010.gif" v:shapes="_x0000_i1029"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1029"
DrawAspect="Content" ObjectID="_1205238703">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">. The text is represented as:</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><o:p></o:p></SPAN></P>
<P class=MsoNormal
style="MARGIN-LEFT: 18pt; TEXT-ALIGN: left; tab-stops: list 36.0pt; mso-layout-grid-align: none"
align=left><SPAN lang=EN-US
style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape id=_x0000_i1030
style="WIDTH: 21pt; HEIGHT: 14.25pt" o:ole="" type = "#_x0000_t75" coordsize =
"21600,21600"><v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image011.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=28 height=19
src="第二章%20中文文本分类的关键技术.files/image012.gif" v:shapes="_x0000_i1030"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1030"
DrawAspect="Content" ObjectID="_1205238704">
</o:OLEObject>
</xml><![endif]--><SPAN style="mso-tab-count: 1"></SPAN><SUB><!--[if gte vml 1]><v:shape id=_x0000_i1031
style="WIDTH: 111pt; HEIGHT: 18pt" o:ole="" type = "#_x0000_t75" coordsize =
"21600,21600"><v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image013.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=148 height=24
src="第二章%20中文文本分类的关键技术.files/image014.gif" v:shapes="_x0000_i1031"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1031"
DrawAspect="Content" ObjectID="_1205238705">
</o:OLEObject>
</xml><![endif]--><o:p></o:p></SPAN></P>
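<P class=MsoNormal><SPAN lang=EN-US style="FONT-SIZE: 12pt">As a minimal sketch of this representation, the Python fragment below maps a segmented text to a list of (feature term, weight) pairs over a fixed vocabulary. The function name to_vector, the sample vocabulary and the sample tokens are hypothetical, and the weight is taken to be the raw count of the term purely for illustration; the source does not prescribe this code.</SPAN></P>
<PRE>
```python
from collections import Counter


def to_vector(tokens, vocabulary):
    """Map a tokenized text to (term, weight) pairs over a fixed vocabulary."""
    counts = Counter(tokens)
    # Illustrative assumption: the weight is the raw count of the term in the text.
    # Terms of the vocabulary that do not occur in the text get weight 0.
    return [(term, counts[term]) for term in vocabulary]


# Hypothetical feature terms and a hypothetical, already segmented text.
vocabulary = ["text", "classification", "vector", "model"]
tokens = ["vector", "model", "text", "vector"]

doc_vector = to_vector(tokens, vocabulary)
print(doc_vector)
```
</PRE>
<P class=MsoNormal><SPAN lang=EN-US style="FONT-SIZE: 12pt">Every text is mapped onto the same ordered vocabulary, so two texts can be compared component by component.</SPAN></P>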
<P class=MsoNormal
style="MARGIN-LEFT: 0.05pt; TEXT-INDENT: -18pt; LINE-HEIGHT: 20pt; TEXT-ALIGN: left; tab-stops: list 18.0pt; mso-char-indent-count: -1.5; mso-line-height-rule: exactly; mso-layout-grid-align: none; mso-para-margin-left: -1.71gd"
align=left><SPAN lang=EN-US style="FONT-SIZE: 12pt"><SPAN
style="mso-tab-count: 2">
</SPAN></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">where </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1032 style="WIDTH: 9.75pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image015.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=13 height=24
src="第二章%20中文文本分类的关键技术.files/image016.gif" v:shapes="_x0000_i1032"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1032"
DrawAspect="Content" ObjectID="_1205238706">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is a feature term and </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1033 style="WIDTH: 14.25pt; HEIGHT: 18pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image017.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=19 height=24
src="第二章%20中文文本分类的关键技术.files/image018.gif" v:shapes="_x0000_i1033"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1033"
DrawAspect="Content" ObjectID="_1205238707">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the weight of that feature term in text </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1034 style="WIDTH: 11.25pt; HEIGHT: 14.25pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image019.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=15 height=19
src="第二章%20中文文本分类的关键技术.files/image020.gif" v:shapes="_x0000_i1034"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1034"
DrawAspect="Content" ObjectID="_1205238708">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">. The weight of a feature term measures how much the term contributes to describing the content of the text. The larger the weight, the more the term counts in the text, that is, the better it reflects the content of text </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape
id=_x0000_i1035 style="WIDTH: 11.25pt; HEIGHT: 14.25pt" o:ole="" type =
"#_x0000_t75" coordsize = "21600,21600"> <v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image019.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=15 height=19
src="第二章%20中文文本分类的关键技术.files/image020.gif" v:shapes="_x0000_i1035"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1035"
DrawAspect="Content" ObjectID="_1205238709">
</o:OLEObject>
</xml><![endif]--></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">. Term frequency is usually used as the weight of a feature term. It comes in two forms: absolute term frequency is the number of times a word occurs in the text; relative term frequency is the normalized term frequency, scaled so that the squares of all vector components sum to </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">1</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">. Relative term frequency is mostly computed with the </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">TF-IDF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> (</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">Term Frequency-Inverse Document
Frequency</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">) formula. The </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">TF-IDF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> formula was proposed by </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">Salton</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> and </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">McGill</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> in </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">1983</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> for the vector space information retrieval paradigm</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><o:p></o:p></SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> (</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">Vector Space Information Retrieval
Paradigm</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">) as a method of representing text features. Here </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">TF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the term frequency, the number of times a feature term occurs in a given text; </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">IDF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> is the inverse document frequency, an indicator of how widely a feature term occurs across the texts of a collection, counted per text: the more texts contain the term, the lower its value. A widely used </SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt">TF-IDF</SPAN><SPAN
style="FONT-SIZE: 12pt; FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'"> formula is as follows:</SPAN><SPAN
lang=EN-US style="FONT-SIZE: 12pt"><o:p></o:p></SPAN></P>
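<P class=MsoNormal><SPAN lang=EN-US style="FONT-SIZE: 12pt">To make the TF and IDF definitions concrete, the Python sketch below computes TF-IDF weights for a small collection and normalizes each text vector so that its squared weights sum to 1. It assumes the plain variant tf × log(N / df) followed by length normalization, which may differ in detail (for example in smoothing constants) from the formula given in this section; the function name tfidf_vectors and the sample texts are hypothetical.</SPAN></P>
<PRE>
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one length-normalized TF-IDF weight dict per tokenized text."""
    n_docs = len(docs)
    # Document frequency: the number of texts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # absolute term frequency: raw count in this text
        # Assumed variant: weight = tf * log(N / df), without smoothing.
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Normalize so the squared weights sum to 1; an all-zero vector stays zero.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical, already segmented texts.
docs = [
    ["vector", "model", "text"],
    ["text", "classification"],
    ["vector", "vector", "retrieval"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```
</PRE>
<P class=MsoNormal><SPAN lang=EN-US style="FONT-SIZE: 12pt">In the first text, the term "model" occurs in only one text of the collection and therefore receives a larger weight than "vector" or "text", each of which occurs in two texts.</SPAN></P>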
<P class=MsoNormal style="TEXT-INDENT: 24pt"><SPAN lang=EN-US
style="FONT-SIZE: 12pt"><SUB><!--[if gte vml 1]><v:shape id=_x0000_i1036
style="WIDTH: 216.75pt; HEIGHT: 63pt" o:ole="" type = "#_x0000_t75" coordsize =
"21600,21600"><v:imagedata o:title="" src =
"第二章%20中文文本分类的关键技术.files/image021.wmz"></v:imagedata></v:shape><![endif]--><![if !vml]><img width=289 height=84
src="第二章%20中文文本分类的关键技术.files/image022.gif" v:shapes="_x0000_i1036"><![endif]></SUB><!--[if gte mso 9]><xml>
<o:OLEObject Type="Embed" ProgID="Equation.3" ShapeID="_x0000_i1036"
DrawAspect="Content" ObjectID="_1205238712">