166-168.html

遗传算法经典书籍-英文原版是研究遗传算法的很好的资料
HTML
字号:
<HTML>
<HEAD>
<META name=vsisbn content="0849398010">
<META name=vstitle content="Industrial Applications of Genetic Algorithms">
<META name=vsauthor content="Charles Karr; L. Michael Freeman">
<META name=vsimprint content="CRC Press">
<META name=vspublisher content="CRC Press LLC">
<META name=vspubdate content="12/01/98">
<META name=vscategory content="Web and Software Development: Artificial Intelligence: Other">




<TITLE>Industrial Applications of Genetic Algorithms:Data Mining Using Genetic Algorithms</TITLE>

<!-- HEADER -->

<STYLE type="text/css"> 
 <!--
 A:hover  {
 	color : Red;
 }
 -->
</STYLE>

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

<!--ISBN=0849398010//-->
<!--TITLE=Industrial Applications of Genetic Algorithms//-->
<!--AUTHOR=Charles Karr//-->
<!--AUTHOR=L. Michael Freeman//-->
<!--PUBLISHER=CRC Press LLC//-->
<!--IMPRINT=CRC Press//-->
<!--CHAPTER=9//-->
<!--PAGES=166-168//-->
<!--UNASSIGNED1//-->
<!--UNASSIGNED2//-->

<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="163-165.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="168-170.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>
<P><BR></P>
<P><FONT SIZE="+1"><B><I>Synthetic Data Sets and Data Set Generation</I></B></FONT></P>
<P>Since the focus of this chapter is on the development and verification of a genetic algorithm-based data mining search approach, an actual database with &#147;real&#148; data was not used nor was it required to demonstrate the usefulness of the genetic algorithm. Instead, &#147;synthetic&#148; data sets were generated to demonstrate the application of the genetic algorithm. As described earlier, a synthetically generated data set consists of X transactions. Each transaction consists of a transaction number (Trans_Num), the number of items purchased in the transaction (N), and a list of N items, where each item is specified by a number (i.e. 0=Pretzels, 1=Aspirin, 2=Beer, 3=Bread, etc.). Also, N can be any of 100 different items. The format of a transaction and an example of a transaction containing fifteen items are shown below:
</P>
<TABLE WIDTH="100%"><TR>
<TD WIDTH="30%">Transaction Format:
<TD>Trans_Num N Item<SUB><SMALL>0</SMALL></SUB> Item<SUB><SMALL>1</SMALL></SUB> &#133; Item<SUB><SMALL>N</SMALL></SUB>
<TR>
<TD>Example Transaction:
<TD>0 15 1 2 3 10 19 55 42 41 22 83 20 20 87 92 6
</TABLE>
<P>Synthetic data sets were generated using a C program that allowed the user, through command line input, to specify data characteristics. Allowing the user control over data set characteristics facilitated verification of the genetic algorithm search results, since known relationships were embedded within the data sets. Then, the relationships that were discovered by the genetic search were confirmed against the expected results.
</P>
<P>The synthetic data generator program required the user to provide at least one option to specify the number of transactions to generate. Other options allowed the user to specify individual items and the probability with which each of the specified items appeared within the synthetically generated data set.</P>
<P><FONT SIZE="+1"><B>GENETIC ALGORITHM SPECIFICS</B></FONT></P>
<P>The coding scheme, fitness function, and each of the three genetic operators (reproduction, crossover, and mutation) can be implemented in a variety of ways depending on the problem to which a genetic algorithm is being applied. Various implementation alternatives of these for the data mining problem examined in this chapter are discussed below.
</P>
<P><FONT SIZE="+1"><B><I>Parameter Coding</I></B></FONT></P>
<P>A main difference between genetic algorithms and more traditional optimization and search algorithms is that genetic algorithms work with a coding of the parameter set and not the parameters themselves. Thus, before any type of genetic search can be performed, a coding scheme must be determined to represent the parameters in the problem at hand. In the data mining problem addressed by this chapter, the parameters of interest are simply four, potentially related, item numbers. Therefore, a coding scheme for four item numbers was determined considering the following factors:
</P>
<DL>
<DD><B>&#149;</B>&nbsp;&nbsp;A multi-parameter coding, consisting of four sub-strings, is required to code each of the four items into a single string.
<DD><B>&#149;</B>&nbsp;&nbsp;Each sub-string needs to represent one of a hundred (0 through 99) possible items.
<DD><B>&#149;</B>&nbsp;&nbsp;There are 100 choose 4 (3,921,225) possible combinations of items that need to be represented.
</DL>
<P>Given these factors, several coding schemes were considered. First, a scheme using a base-10 coded sub-string was examined. This coding allows 100 items to be represented, and no codings correspond to non-existent items (all possible sub-string codings would represent valid item numbers, 0 through 99). This coding results in an 8-digit string where each digit (dx) is between 0 and 9, as follows:
</P>
<P><A NAME="Fig2"></A><A HREF="javascript:displayWindow('images/09-02.jpg',300,47)"><IMG SRC="images/09-02t.jpg"></A></P>
<P>A coding of this sort, however, limits the number of schemata that are available for the genetic algorithm to exploit. Therefore, since we want to maximize the number of schemata, a binary coding, as shown below was considered.
</P>
<P><A NAME="Fig3"></A><A HREF="javascript:displayWindow('images/09-03.jpg',656,54)"><IMG SRC="images/09-03t.jpg"></A></P>
<P>Here, each digit (b<SUB><SMALL>x</SMALL></SUB>) is a 1 or 0. One problem with this coding is that, in order to represent 100 items, a sub-string must consist of seven bits (2<SUP><SMALL>7</SMALL></SUP> = 128). This means that 28<SUP><SMALL>4</SMALL></SUP> (128-100=28 values in each sub-string), or 614,656 strings could be manipulated by the genetic algorithm which don&#146;t correspond to actual solutions.</P>
<P>Another possibility, when considering the fact that there are 3,921,225 combinations of items that need to be represented, was to code the items as a 22-bit binary string. This corresponds to 2<SUP><SMALL>22</SMALL></SUP>, or 4,194,304 possible solution representations, reducing the number of non-solution strings to 273,079 (4,194,304-3,921,225). Using this approach, however, requires a mapping from a &#147;combination number&#148; to the actual four items represented by that combination since the item numbers are not specifically embedded within the string. If this combination-mapping approach was used, it would be difficult for the genetic algorithm to exploit similarities between strings since two combination numbers would not necessarily have anything in common, even if the items represented by the combination numbers were similar. Therefore, this coding approach was deemed unacceptable.</P>
<P>Given these possibilities, and considering the implementation of the crossover operator which is discussed in the section, &#147;Reproduction, Crossover, and Mutation Operators,&#148; the decimal coding appeared to be the best alternative and was therefore implemented for the data mining problem.</P><P><BR></P>
<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="163-165.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="168-170.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>

<hr width="90%" size="1" noshade>
<div align="center">
<font face="Verdana,sans-serif" size="1">Copyright &copy; <a href="/reference/crc00001.html">CRC Press LLC</a></font>
</div>
<!-- all of the reference materials (books) have the footer and subfoot reveresed -->
<!-- reference_subfoot = footer -->
<!-- reference_footer = subfoot -->

</BODY>
</HTML>

<!-- END FOOTER -->
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -