Posted by: ccipt (北方的狼), Board: DataMining
Subject: Web Content Mining
Posted from: Nanjing University Lily BBS (Thu Aug 23 17:10:53 2001)

The heterogeneity and lack of structure that permeate many of the
ever-expanding information sources on the World Wide Web, such as hypertext
documents, make automated discovery, organization, and management of Web-based
information difficult. Traditional search and indexing tools of the Internet
and the World Wide Web, such as Lycos, Alta Vista, WebCrawler, ALIWEB [Kos94],
MetaCrawler, and others, provide some comfort to users, but they generally
neither provide structural information nor categorize, filter, or interpret
documents. A recent study provides a comprehensive and statistically thorough
comparative evaluation of the most popular search tools [LS97].


In recent years these factors have prompted researchers to develop more
intelligent tools for information retrieval, such as intelligent Web agents,
as well as to extend database and data mining techniques to provide a higher
level of organization for semi-structured data available on the Web. We
summarize some of these efforts below.


Agent-Based Approach


The agent-based approach to Web mining involves the development of
sophisticated AI systems that can act autonomously or semi-autonomously on
behalf of a particular user, to discover and organize Web-based information.
Generally, the agent-based Web mining systems can be placed into the following
three categories:



1. Intelligent Search Agents

Several intelligent Web agents have been developed that search for relevant
information using characteristics of a particular domain (and possibly a user
profile) to organize and interpret the discovered information. For example,
agents such as Harvest [BDH94], FAQ-Finder [HBML95], Information Manifold
[KLSS95], OCCAM [KW96], and ParaSite [Spe97] rely either on pre-specified,
domain-specific information about particular types of documents, or on
hard-coded models of the information sources to retrieve and interpret
documents. Other agents, such as ShopBot [DEW96] and ILA (Internet Learning
Agent) [PE95], attempt to interact with and learn the structure of unfamiliar
information sources. ShopBot retrieves product information from a variety of
vendor sites using only general information about the product domain. ILA, on
the other hand, learns models of various information sources and translates
these into its own internal concept hierarchy.


2. Information Filtering/Categorization

A number of Web agents use various information retrieval techniques [FBY92]
and characteristics of open hypertext Web documents to automatically retrieve,
filter, and categorize them [CH97,BGMZ97,MS96,WP97,WVS96]. For example,
HyPursuit [WVS96] uses semantic information embedded in link structures as
well as document content to create cluster hierarchies of hypertext documents
and to structure an information space. BO (Bookmark Organizer) [MS96] combines
hierarchical clustering techniques and user interaction to organize a
collection of Web documents based on conceptual information.
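
To make the idea of hierarchical clustering concrete, here is a minimal
sketch of average-link agglomerative clustering over document token sets. It
is not the HyPursuit or BO algorithm: it looks at content only (no link
structure, no user interaction), and the names jaccard, tokenize, cluster, and
the threshold parameter are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def cluster(docs, threshold=0.2):
    """Average-link agglomerative clustering (illustrative sketch).

    docs maps a document name to its text. Clusters are merged greedily,
    most similar pair first, until no pair reaches the threshold.
    Returns a sorted list of sorted name lists.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    clusters = [[name] for name in sorted(docs)]

    def similarity(c1, c2):
        # Average pairwise Jaccard similarity between members of two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard(tokens[x], tokens[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)
```

Recording the order of merges instead of only the final partition would give
a cluster hierarchy rather than a flat grouping.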


3. Personalized Web Agents

Another category of Web agents includes those that obtain or learn user
preferences and discover Web information sources that correspond to these
preferences, and possibly those of other individuals with similar interests
(using collaborative filtering). A few recent examples of such agents include
WebWatcher [AFJM95], PAINT [OPW94], Syskill & Webert [PMB96], and others
[BSY95]. For example, Syskill & Webert is a system that utilizes a user
profile and learns to rate Web pages of interest using a Bayesian classifier.
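
As a rough illustration of rating pages with a Bayesian classifier, the sketch
below trains a naive Bayes model with add-one smoothing on pages a user has
labeled. It is not the Syskill & Webert implementation: the class name
NaiveBayesRater, the labels "hot" and "cold", and the whitespace tokenization
are all assumptions chosen for brevity.

```python
import math
from collections import Counter


class NaiveBayesRater:
    """Multinomial naive Bayes over page words (illustrative sketch)."""

    LABELS = ("hot", "cold")  # assumed rating labels

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.page_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        """Record one page the user has rated with the given label."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.page_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, text, label):
        """Log of prior times word likelihoods, with add-one smoothing."""
        total_pages = sum(self.page_counts.values())
        score = math.log(
            (self.page_counts[label] + 1) / (total_pages + len(self.LABELS))
        )
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for word in text.lower().split():
            if word in self.vocabulary:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denominator)
        return score

    def rate(self, text):
        """Return the label with the higher posterior score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))
```

A user profile in this sketch is simply the set of trained counts; a real
agent would also need feature selection and a way to fetch candidate pages.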


Database Approach


The database approaches to Web mining have generally focused on techniques for
integrating and organizing the heterogeneous and semi-structured data on the
Web into more structured and high-level collections of resources, such as in
relational databases, and using standard database querying mechanisms and data
mining techniques to access and analyze this information.


1. Multilevel Databases

Several researchers have proposed a multilevel database approach to organizing
Web-based information. The main idea behind these proposals is that the lowest
level of the database contains primitive semi-structured information stored in
various Web repositories, such as hypertext documents. At the higher level(s),
metadata or generalizations are extracted from lower levels and organized in
structured collections such as relational or object-oriented databases. For
example, Han et al. [ZH95] use a multi-layered database where each layer is
obtained via generalization and transformation operations performed on the
lower layers. Khosla et al. [KKS96] propose the creation and maintenance of
meta-databases at each information-providing domain and the use of a global
schema for the meta-database. King & Novak [KN96] propose the incremental
integration of a portion of the schema from each information source, rather
than relying on a global heterogeneous database schema. The ARANEUS system
[PA97] extracts relevant information from hypertext documents and integrates
it into higher-level derived Web hypertexts, which are generalizations of the
notion of database views.
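
The layering idea can be sketched in a few lines. In the sketch below, layer 0
holds raw hypertext, layer 1 holds metadata records extracted from it, and
layer 2 generalizes those records per host. This does not reproduce any of the
cited systems; the function names build_layer1 and build_layer2, the record
fields, and the example.org URLs used to exercise it are assumptions.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class _MetaExtractor(HTMLParser):
    """Collects the page title and outgoing link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_layer1(layer0):
    """Layer 0 (url -> raw HTML) to layer 1 (list of metadata records)."""
    records = []
    for url, html in sorted(layer0.items()):
        parser = _MetaExtractor()
        parser.feed(html)
        parser.close()
        records.append({
            "url": url,
            "host": urlparse(url).netloc,
            "title": parser.title.strip(),
            "link_count": len(parser.links),
        })
    return records


def build_layer2(layer1):
    """Layer 1 records to layer 2: one generalized summary per host."""
    summary = defaultdict(lambda: {"pages": 0, "links": 0})
    for record in layer1:
        summary[record["host"]]["pages"] += 1
        summary[record["host"]]["links"] += record["link_count"]
    return dict(summary)
```

Each higher layer is smaller and more structured than the one below it, so
layer 1 could be loaded into a relational table and queried directly.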


2. Web Query Systems

There have been many Web-based query systems and languages developed recently
that attempt to utilize standard database query languages such as SQL,
structural information about Web documents, and even natural language
processing to accommodate the types of queries that are used in World Wide Web
searches. We mention a few examples of these Web-based query systems here.
W3QL [KS95] combines structure queries, based on the organization of hypertext
documents, and content queries, based on information retrieval techniques.
WebLog [LSS96] is a logic-based query language for restructuring extracted
information from Web information sources. Lorel [QRS95] and UnQL
[BDS95,BDHS96] query heterogeneous and semi-structured information on the Web
using a labeled graph data model. TSIMMIS [CGMH94] extracts data from
heterogeneous and semi-structured information sources and correlates them to
generate an integrated database representation of the extracted information.
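
A labeled graph data model can be pictured as nodes joined by labeled edges,
with queries expressed as paths of labels. The sketch below is only meant to
show the shape of such a model; it does not use Lorel or UnQL syntax, and the
follow_path function and the sample restaurant data are hypothetical.

```python
# Each node maps to a list of (edge label, target) pairs. Targets with no
# entry of their own act as atomic values. The data is made up and is
# deliberately irregular: one address is a nested object, the other a string.
graph = {
    "root": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("name", "Chef Chu"), ("address", "a1")],
    "r2": [("name", "Saigon"), ("address", "Mountain View")],
    "a1": [("city", "Palo Alto")],
}


def follow_path(graph, start, labels):
    """Return the sorted targets reached from start via the label sequence."""
    frontier = {start}
    for label in labels:
        frontier = {
            target
            for node in frontier
            for edge_label, target in graph.get(node, [])
            if edge_label == label
        }
    return sorted(frontier)


print(follow_path(graph, "root", ["restaurant", "name"]))
print(follow_path(graph, "root", ["restaurant", "address", "city"]))
```

Because no schema is enforced, the second query simply returns nothing for the
restaurant whose address is a plain string, which is the kind of irregularity
semi-structured query languages are designed to tolerate.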

 

http://maya.cs.depaul.edu/~mobasher/webminer/survey/node3.html#SECTION00021000000000000000



--
FAMILY=(F)ATHER (A)ND (M)OTHER, (I) (L)OVE (Y)OU!


※ Source: Nanjing University Lily BBS http://bbs.nju.edu.cn [FROM: 202.100.5.132]
