📄 168-170.html
字号:
<HTML>
<HEAD>
<META name=vsisbn content="0849398010">
<META name=vstitle content="Industrial Applications of Genetic Algorithms">
<META name=vsauthor content="Charles Karr; L. Michael Freeman">
<META name=vsimprint content="CRC Press">
<META name=vspublisher content="CRC Press LLC">
<META name=vspubdate content="12/01/98">
<META name=vscategory content="Web and Software Development: Artificial Intelligence: Other">
<TITLE>Industrial Applications of Genetic Algorithms:Data Mining Using Genetic Algorithms</TITLE>
<!-- HEADER -->
<STYLE type="text/css">
<!--
A:hover {
color : Red;
}
-->
</STYLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<!--ISBN=0849398010//-->
<!--TITLE=Industrial Applications of Genetic Algorithms//-->
<!--AUTHOR=Charles Karr//-->
<!--AUTHOR=L. Michael Freeman//-->
<!--PUBLISHER=CRC Press LLC//-->
<!--IMPRINT=CRC Press//-->
<!--CHAPTER=9//-->
<!--PAGES=168-170//-->
<!--UNASSIGNED1//-->
<!--UNASSIGNED2//-->
<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="166-168.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="170-173.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>
<P><BR></P>
<P><FONT SIZE="+1"><B><I>Fitness function</I></B></FONT></P>
<P>If a string of four items is coded as just described, the fitness for such a string can be determined based on the frequency with which the set of four items appear within the transactions contained in the database under investigation. In addition, if a fractional number of the items in the item list appear within a transaction, the fitness value will also reflect that fraction. For example, if the item list for which a fitness value is to be determined is as follows:
</P>
<P><A NAME="Fig4"></A><A HREF="javascript:displayWindow('images/09-04.jpg',270,32)"><IMG SRC="images/09-04t.jpg"></A></P>
<P>and if the database, for the purpose of this discussion, contains only two transactions, as follows:
</P>
<TABLE WIDTH="100%"><TR>
<TD WIDTH="20%">transaction 1:
<TD WIDTH="80%">34 22 55 01 68 99 02 07 42 24
<TR>
<TD>transaction 2:
<TD>76 01 86 99 02 07 42 24
</TABLE>
<P>then transaction 1 contains all four items (01, 22, 68, and 07) and transaction 2 contains only two of the four items (01 and 07). If <I>I(t<SUB><SMALL>i</SMALL></SUB>)</I> represents the fraction of items that are contained within a single transaction <I>t<SUB><SMALL>i</SMALL></SUB></I>, then <I>I(t<SUB><SMALL>1</SMALL></SUB>)</I> would equal 1.0 and <I>I(t<SUB><SMALL>2</SMALL></SUB>)</I> would equal 0.5 (2/4). If all the <I>I(t<SUB><SMALL>i</SMALL></SUB>)</I>s are summed over all transactions, and this sum is divided by the total number of transaction, <I>N</I>, then a relationship (referred to as the <I>base fitness</I>) between the item set and the frequency with which the items appear within the database can be obtained. Thus, the base fitness value for the item combination of 01, 22, 68, and 07 for the example database of two transactions would be:</P>
<P ALIGN="CENTER"><IMG SRC="images/09-01d.jpg"></P>
<P>This yields a base fitness function that can be represented as follows:
</P>
<P>Assume:</P>
<TABLE WIDTH="100%"><TR>
<TD WIDTH="10%"><I>N</I>
<TD WIDTH="90%">= total number of transactions in the database file (sdg.db)
<TR>
<TD><I>t<SUB><SMALL>i</SMALL></SUB></I>
<TD>= transaction number
<TR>
<TD><I>I(t<SUB><SMALL>i</SMALL></SUB>)</I>
<TD>= fraction of items that are contained in <I>t<SUB><SMALL>i</SMALL></SUB></I>
<TR>
<TD><I>M</I>
<TD>= sum of all <I>I(t<SUB><SMALL>i</SMALL></SUB>)</I> over all transactions
</TABLE>
<P>then <I>M</I> is:</P>
<P ALIGN="CENTER"><IMG SRC="images/09-02d.jpg"></P>
<P>and the fitness, referred to as F1, for a given string of items is:
</P>
<P ALIGN="CENTER"><IMG SRC="images/09-03d.jpg"></P>
<P>Furthermore, in an effort to avoid possible premature convergence, and to promote competition between strings throughout a simulation, a scaled fitness, F2, was implemented and tested. F2 is expressed as,
</P>
<P ALIGN="CENTER"><IMG SRC="images/09-04d.jpg"></P>
<P>where <I>F2</I> is the scaled fitness, <I>F1</I> is the raw fitness, and <I>a</I> and <I>b</I> are determined based on the maximum and average raw fitness values. In the case of the data mining problem, the maximum possible fitness (if every transaction in the database contained all four items of interest) was 1.0 and the average raw fitness was found to be around 0.4.</P>
<P>In a third alternative, the fitness function was based on the following additional assumptions: Since the goal is to determine the four items that are most often purchased together, the fitness function should yield a higher value if more of the items in the item set are contained together within the database transactions. For example, if one item set yielded a base fitness value of 0.25 (implying that, on average, one of the four items in the item set were contained in each of the database transactions) and a second item set yielded a base fitness value of 0.75 (implying that, on average, three of the four items in the item set were contained in each of the database transactions), it may be desirable to give more than a simple linear emphasis to the second item set value since it is closer to the goal of four. Thus, a third fitness, referred to as F3, is expressed as</P>
<P ALIGN="CENTER"><IMG SRC="images/09-05d.jpg"></P>
<P>where <I>x</I> was chosen to be 3.</P>
<P>While this type of fitness expression gives more emphasis to item sets that are closer to the four item goal, it must be realized that a single occurrence of four items purchased together within a large database will not yield a higher fitness than three of the items purchased together many times. Since we are seeking trends within a large set of database transactions, it was felt to be more important to recognize major trends (such as three items purchased together many times) which might lead to a trend of four items purchased together.</P>
<P>The results of simulations run with each of the fitness expressions just described (F1, F2, and F3) are discussed in the section, “Fitness Function Simulation Results.” Those results show that fitness function F3 provided very good performance, better than both F1 and F2.</P>
<P><FONT SIZE="+1"><B><I>Reproduction, Crossover, and Mutation Operators</I></B></FONT></P>
<P>The reproduction, crossover, and mutation operators that were implemented to support the eight-digit decimal coding are discussed next.
</P>
<P><B>Reproduction Operator</B></P>
<P>A basic roulette wheel reproduction operator was implemented for this data mining application. In this type of reproduction, each string in the population is given a roulette wheel slot sized in proportion to its fitness. Then, by “spinning” the wheel N times (the population size), N new offspring are created for the next generation. By having a weighted, or “biased” roulette wheel, it is more probable that higher fit strings receive more copies in subsequent generations.
</P><P><BR></P>
<CENTER>
<TABLE BORDER>
<TR>
<TD><A HREF="166-168.html">Previous</A></TD>
<TD><A HREF="../ewtoc.html">Table of Contents</A></TD>
<TD><A HREF="170-173.html">Next</A></TD>
</TR>
</TABLE>
</CENTER>
<hr width="90%" size="1" noshade>
<div align="center">
<font face="Verdana,sans-serif" size="1">Copyright © <a href="/reference/crc00001.html">CRC Press LLC</a></font>
</div>
<!-- all of the reference materials (books) have the footer and subfoot reveresed -->
<!-- reference_subfoot = footer -->
<!-- reference_footer = subfoot -->
</BODY>
</HTML>
<!-- END FOOTER -->
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -