📄 163-165.html
字号:
<HTML>
<HEAD>
<META name=vsisbn content="0849398010">
<META name=vstitle content="Industrial Applications of Genetic Algorithms">
<META name=vsauthor content="Charles Karr; L. Michael Freeman">
<META name=vsimprint content="CRC Press">
<META name=vspublisher content="CRC Press LLC">
<META name=vspubdate content="12/01/98">
<META name=vscategory content="Web and Software Development: Artificial Intelligence: Other">
<TITLE>Industrial Applications of Genetic Algorithms:Data Mining Using Genetic Algorithms</TITLE>
<!-- HEADER -->
<STYLE type="text/css">
<!--
A:hover {
color : Red;
}
-->
</STYLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<!--ISBN=0849398010//-->
<!--TITLE=Industrial Applications of Genetic Algorithms//-->
<!--AUTHOR=Charles Karr//-->
<!--AUTHOR=L. Michael Freeman//-->
<!--PUBLISHER=CRC Press LLC//-->
<!--IMPRINT=CRC Press//-->
<!--CHAPTER=9//-->
<!--PAGES=163-165//-->
<!--UNASSIGNED1//-->
<!--UNASSIGNED2//-->
<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="161-163.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="166-168.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>
<P><BR></P>
</P>
<P>There are many possible approaches for implementing high level search in a data mining environment. The following are some definitions used in the discussion of high level search strategies. Following the definitions are several high level search strategies that have been cited as popular alternatives for implementation in present day data mining environments [5]:
</P>
<P><I>Definitions:</I></P>
<DL>
<DD><B>•</B> <B>Search space:</B> A search space consists of a set of states (nodes) and a set of moves to neighboring states. Search spaces may consist of pre-existing data structures that represent states which can be visited during the search, and moves which guide the traversal of the search space.
<DD><B>•</B> <B>Search strategy:</B> A search strategy is a general approach to construct and process a search space. For example, a heuristic search strategy differs from an exhaustive search strategy by leaving part of the search space unsearched.
<DD><B>•</B> <B>Heuristic search:</B> A heuristic search generates and/or processes only part of a total search space. Hill climbing and Beam search, each described below, are considered heuristic search approaches. Given this definition, a genetic algorithm would be considered a heuristic search.
<DD><B>•</B> <B>Exhaustive (or Enumerative) search:</B> An exhaustive search is a search that can reach each node of a search space, ensuring the optimal solution. This approach is usually not feasible when the search space is extremely large, as is typically the case in data mining applications.
</DL>
<P><I>High Level Search Strategies:</I></P>
<DL>
<DD><B>•</B> <B>Hill climbing:</B> Hill climbing is a heuristic search strategy which retains one node after each step of search. At each step, it selects the best neighbor of the current state, according to an optimizing criterion. If each neighbor is worse than the current state, the search halts.
<DD><B>•</B> <B>Beam search:</B> Beam search is a heuristic search strategy similar to hill climbing. At each step, the best N partial solutions are retained for further search according to an optimizing criterion.
<DD><B>•</B> <B>Exhaustive (or Enumerative) search:</B> (described above)
<DD><B>•</B> <B>Hierarchical tree search:</B> A tree partitions a data set into a hierarchically ordered set of possible alternatives where each alternative on a hierarchical level is recursively divided into sub-concepts on the next lower hierarchical level. A hierarchical tree search algorithm, “Hierarchy Scan,” proposes an alternative to a traditional sequential scan that is typically used to search a tree [6].
</DL>
<P><FONT SIZE="+1"><B>PROBLEM STATEMENT</B></FONT></P>
<P>The problem addressed in this chapter consists of using a genetic algorithm to search a synthetically generated database in order to discover a relationship between data items. The synthetic database contains N transactions, where N varies based on the test being run. Each transaction, similar to transactions found in a retailer’s database, consists of a list of items purchased, where each item is one of 100 different items. Given a database with the characteristics just described, the goal is then to determine the four items that are most often purchased together.
</P>
<P>The remainder of this section discusses an overview of the problem implementation, the format of a synthetic data set, and a supporting program, the synthetic data generator.</P>
<P><FONT SIZE="+1"><B><I>Implementation Overview</I></B></FONT></P>
<P>If we are to find four related items out of a possible 100, this implies that there are 100 choose 4, or 3,921,225, combinations of items that we can potentially investigate. As previously mentioned, it is unreasonable to investigate all 3,921,225 combinations of potential relationships since, for each combination, the entire database (or a sufficiently large subset) has to be scanned to determine how often the four items appear together. To address this problem, a genetic algorithm is used to investigate a subset of the possible combinations. The intent was that by using a genetic-based search, the combination of four items that are most often purchased together could be determined much more efficiently than through an enumerative investigation of every possible combination. Genetic algorithm specifics on coding, fitness, and operators is given in the following section, “Genetic Algorithm Specifics.” Thus, an overview of the genetic algorithm in the data mining problem domain is presented next.
</P>
<P>Before performing any type of genetic search, a coding scheme was determined to represent the four potentially related items. To prepare for the search, X sets (the population size) of four items are picked at random and then coded into strings based on the coding scheme. These X strings represent generation 0 of the search. Then, for each of the X sets of four coded items, the entire database (or a sufficiently large subset) is scanned to generate a statistic (the fitness) as to how often the four items are purchased together. This fitness is based on the frequency with which the set of four items appear within the database under investigation. In addition, if a fractional number of the items in the item list appear within a transaction, the fitness reflects that fraction.</P>
<P>Once a fitness value is determined for each set of items in the population, the three genetic operators (reproduction, crossover, and mutation) are performed to generate a new population of strings. After performing reproduction, crossover, and mutation, if all goes as planned, each successive population should contain more highly fit strings, ultimately leading to a set of the four most frequently occurring items within the database.</P>
<P>In order to fully verify and analyze this genetic algorithm approach, simulations were run using several different fitness functions, crossover operators, mutation operators and genetic algorithm parameters. In addition, simulations were run using several different size data sets with varying characteristics (see the database configuration matrix in the section, “Results” for details). The synthetically generated data, as described in the next section, provided the advantage of being able to represent specific relationships within a data set and facilitated verification of the genetic algorithm-based search approach.</P><P><BR></P>
<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="161-163.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="166-168.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>
<hr width="90%" size="1" noshade>
<div align="center">
<font face="Verdana,sans-serif" size="1">Copyright © <a href="/reference/crc00001.html">CRC Press LLC</a></font>
</div>
<!-- all of the reference materials (books) have the footer and subfoot reveresed -->
<!-- reference_subfoot = footer -->
<!-- reference_footer = subfoot -->
</BODY>
</HTML>
<!-- END FOOTER -->
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -