readme

From: a template library of commonly used data mining algorithms (C++ data mining template library for UNIX)

This directory contains pattern-specific source files. Here is a short description of
what each file contains:

1. iset_can_code.h: canonical_code class specialized for itemset. This class provides
a unique code for each pattern. For itemset, the coding is really simple: for every
itemset, an integer is used as its code. For a length-1 pattern, the vertex-id is its
code. For higher lengths, a negative integer is used as the code. Codes are generated
incrementally from a static integer (variable id_generator), initialized to -1.

2. iset_cand_gen.h: This file contains two important functions for itemset mining:
cand_gen and join.

cand_gen: After all level-1 candidates are found, recursive mining of higher-level
patterns is initiated through this routine. In this way, this routine is the driver for
the mining process. It takes a set of patterns that belong to the same equivalence class
and joins all possible pairs of patterns in that equivalence class family. In any
equivalence class, all patterns have exactly the same size k and share the same
(k-1)-length prefix. In every iteration, cand_gen produces a new equivalence class family
and proceeds recursively to get higher-length patterns, until it reaches an equivalence
class that contains no patterns. The search is always depth-first, which makes it
possible to obtain one equivalence class in each iteration.

example: Say, from level 1, we have {A, B, C, D, E} as frequent patterns. They belong to
an equivalence class with prefix = NULL and length = 1. Through the first iteration of
cand_gen() we get a new equivalence class {AB, AC, AD, AE} with prefix = A and
length = 2. Then cand_gen is called recursively on this family to get a new equivalence
class {ABC, ABD, ABE} that has prefix = AB and length = 3. The process continues until
we reach the equivalence class {ABCDE}, for which prefix = ABCD and length = 5.
Recursion returns at this point, starts again with the equivalence class {BC, BD, BE},
and proceeds in the same way.

Note that Fk_one holds the patterns that belong to one equivalence class, hence its type
name pat_fam. In each iteration, frequent patterns are stored in freq_pats, and the
frequency is determined by the back-end through the count_support class. That is why an
instance of the count_support class, cnt_sup, is also passed as a parameter. Minsup is
user-defined. You may notice that the second parameter, Fk_two, is passed but never used.
It is passed to keep the behavior of the cand_gen routine generic across all pattern
types. It has no use as long as we join in Fk x Fk fashion, since this kind of join
combines patterns from exactly one family. In Fk x F1 mining, as used in the graph mining
implementation, this parameter will be used.

join: It is simple: it takes two candidate patterns that belong to the same equivalence
class and produces a new candidate pattern. For the simple itemset case, only one pattern
is produced from the join, but we still return an array of candidate patterns, again to
keep the behavior of this routine generic, since other patterns such as sequences and
trees may generate more than one pattern from a join operation. Note that we always work
with a pointer to the pattern instead of the pattern itself, for efficiency reasons.

3. iset_iso_check.h: This file contains a routine called check_isomorphism().
Isomorphism checking is important for getting rid of duplicate patterns in the case where
different join operations produce the same candidate pattern. The canonical code of a
pattern becomes useful in this case. If we find a pattern whose canonical_code is not a
minimal code, we consider the pattern a duplicate and discard it. The return type of
this function is boolean (true if the pattern is valid, false if it is a duplicate).
You may have noticed that it always returns true in the case of itemset, because all
candidates are valid for itemset. In other words, the candidate generation process is
simple enough NOT to produce any duplicate candidates. This is not the case for more
complex patterns, and check_isomorphism is not a trivial routine for those.

4. iset_operators.h: It implements some operators for itemset patterns, such as '<'
and '<<'.

5. iset_tokenizer.h: Tokenizer class specialized for itemset. It provides routines to
tokenize the input data file. These routines are called from db_reader.h to populate all
level-1 patterns.

6. iset_vat.h: VAT class specialized for itemset patterns. It implements the vertical
database representation for a pattern and stores the VAT for a pattern. For itemset, the
VAT is really simple: it just stores the set of transaction-ids that contain this
pattern, typically in a std::vector. For complex patterns, VATs are more intricate.
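
The following is a minimal, self-contained sketch of the Fk x Fk join and the depth-first
cand_gen recursion described in item 2, using a sorted transaction-id list as the VAT
from item 6. It is not the library's code. The names Pattern, PatFamily and mine are
hypothetical, support is counted directly from the tid-list size instead of through
count_support, and join returns a single pattern by value rather than an array of
pattern pointers.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pattern together with its VAT.
struct Pattern {
    std::string items;      // e.g. "AB": one char per item, kept in sorted order
    std::vector<int> tids;  // sorted ids of the transactions containing the itemset
};

// One equivalence class: patterns of equal length sharing a (k-1)-prefix.
using PatFamily = std::vector<Pattern>;

// Join two patterns of the same class, with a ordered before b.
// The candidate is a's items plus b's last item; its tid-list is the
// intersection of both tid-lists. Assumes b.items is not empty.
Pattern join(const Pattern& a, const Pattern& b) {
    Pattern c;
    c.items = a.items + b.items.back();
    std::set_intersection(a.tids.begin(), a.tids.end(),
                          b.tids.begin(), b.tids.end(),
                          std::back_inserter(c.tids));
    return c;
}

// Depth-first driver: join each pattern with every later pattern of the
// class, keep the frequent results as a new class, and recurse into it.
void cand_gen(const PatFamily& fam, std::size_t minsup,
              std::vector<Pattern>& freq_pats) {
    for (std::size_t i = 0; i < fam.size(); ++i) {
        PatFamily next;  // class whose prefix is fam[i].items
        for (std::size_t j = i + 1; j < fam.size(); ++j) {
            Pattern c = join(fam[i], fam[j]);
            if (c.tids.size() >= minsup) {  // support = tid-list size
                freq_pats.push_back(c);
                next.push_back(std::move(c));
            }
        }
        if (!next.empty()) cand_gen(next, minsup, freq_pats);
    }
}

// Convenience wrapper (hypothetical): returns the frequent itemsets of
// length >= 2 in the order in which they are discovered.
std::vector<std::string> mine(const PatFamily& level1, std::size_t minsup) {
    std::vector<Pattern> freq_pats;
    cand_gen(level1, minsup, freq_pats);
    std::vector<std::string> names;
    for (const Pattern& p : freq_pats) names.push_back(p.items);
    return names;
}
```

For example, with level-1 patterns A = {1, 2, 3}, B = {1, 2}, C = {2, 3} and minsup = 2,
mine returns AB and AC. ABC is generated from the class {AB, AC} but occurs only in
transaction 2, so it is dropped. Because each candidate is built from exactly one pair
within one class, no itemset is generated twice, which matches the remark in item 3 that
check_isomorphism can always return true for itemset.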
