首页 › 资源下载 › SQL Server › data preprocessing f › 源码查看

data preprocessing for tc.sql

来自「data preprocessing for text classificati」· SQL 代码 · 共 104 行

SQL

104 行

--1---------import from text to table reuter21587
--2---------step1:build dictionary using term extraction of SSIS(sql server integrated service)
select * from dic_test
--3--------- create dataset from reuter21587, add doc_id
select identity(int,1,1) as doc_id, topic, [text] into Dataset from reuters21578
select * from dataset

--4--------- using SSIS build Term Vector by using table dic(after stemming) or table dic_test(before stemming)
------------ destination table is vector(926,388) and vector_after_stemed_dic(694054)
select count(*) from vector_from_stemed_dic
--5---------stem dic
				select count(*) from dataset
				where [text] like '%accountant%'
				select * from dic_test
				where term like '%accountant%'
						----------- stemer normalize dictionary into table dic
						drop table #dic
						drop table dic
						select distinct dbo.fnporteralgorithm(term) as term into #dic 
						from dic_test
						--where score>5 ---can be different
						order by term 

						select identity(int,1,1) as id, term into dic from #dic
						order by id

--6-------stem vector (with index on term and id, 7'19'' too much!)without index(8'32'') build a new table vector_stem

select doc_id, dbo.fnporteralgorithm(term) as term, frequency as tf, topic as class
into vector_stem1 from vector
order by doc_id

--7-----build term vecter space from vector (merge the same term and sum the tf after stemming)
----remember change topic(ntext) to nvchar(max)
--drop table vector1

			select * from vector_stem1
			where doc_id=1 and term like '%doubt%'
			order by term
			select * from vector1
			where doc_id=1 and term like '%doubt%'
			order by term 



select doc_id, class as class, term, sum(tf) as tf
into vector1 from vector_stem1
group by doc_id, class, term
order by doc_id, class, term

---check the result(records of vector_stem have been reduced from 926,338 to 874,151)
select count(*) from vector1
select count(*) from vector_stem1


------added term word_id from table dic1 to term vector space.
--drop table vector2
select t.doc_id, t.class, d.id as wordid, t.term, t.tf into vector2
from vector1 t, dic d
where t.term=d.term
order by t.doc_id

select count(*) from vector2
where doc_id=4 --and  term like '%conver%'
order by term


------check the wrong stemer result of vector_stem, may caused by double stem
select * from vector2--_stem1
where doc_id=4 and  term like '%conver%'
order by term
select * from vector_stem
where doc_id=4 and  term like '%conver%'
order by term

----calculate document frequency......
drop table dic_freq
select wordid, term, count(tf)as doc_frq into dic_freq
from vector2
group by wordid, term
order by wordid, term

select * from dic where term not in(
select term from dic_freq)


----add doc_frq into term_vector_space table vecter3
select t.doc_id, t.class, t.wordid, t.term, t.tf, d.doc_frq into Vector3
from vector2 t, dic_freq d
where t.wordid=d.wordid

-------20 words do not appear in the vecotr3 and being deleted from dic
select * from vector3
where term in(
select term from dic where term not in (select term from dic_freq))



---calculate tf_idf......
select *, (1+log10(tf))*log10(21587/doc_frq) as tf_idf into vector4 from vector3

select * from vector4
order by tf_idf desc

data preprocessing for tc.sql - 源码说明

本页面展示了「data preprocessing for text classification.」中的 data preprocessing for tc.sql 源码文件，采用 SQL 编程语言编写，共 104 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。

虫虫下载站收录了大量与classification相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。

⌨️ 快捷键说明

复制代码Ctrl + C

搜索代码Ctrl + F

全屏模式F11

增大字号Ctrl + =

减小字号Ctrl + -

显示快捷键?