This is a template library of common data mining algorithms (a C++ data mining template library for UNIX).
This directory contains 5 test files. They test itemset, sequence, tree, graph and
multiset mining. For the sequence test, two variations exist (embedded, induced), and
for the tree test, four variations exist (ordered/unordered, embedded/induced). The user
can choose any of the variations with a command line parameter.

SOURCE TEST FILES:
==================
itemset_test.cpp: Testing itemset mining over string dataset.
sequence_test.cpp: Testing embedded/induced sequence mining over string dataset.
tree_test.cpp: Testing embedded/induced and ordered/unordered tree mining over string dataset.
graph_test.cpp: Testing undirected graph mining over string dataset.
multiset_test.cpp: Testing multiset mining over string dataset.

EXECUTABLE: (How to Run the Programs)
=====================================
Here we show how to run the programs. Parameter names are enclosed in < ... >; the user
must replace those names with actual values. Choice parameters are given in [ ... ], with
the possible values separated by '|'; the user must choose one of them. The parameter '-p'
should be used if the user wants to print the frequent patterns to stdout; otherwise only
the count of total frequent patterns will be displayed.

To perform Itemset Mining:
$ ./itemset_test -i <input_data_file> -s <support> [-p]

To perform Sequence Mining:
$ ./sequence_test -i <input_data_file> -s <support> [-I|-E] [-p]
choose '-I' if you want induced mining, or '-E' for embedded mining

To perform Tree Mining:
$ ./tree_test -i <input_data_file> -s <support> [-I|-E] [-O|-U] [-p]
choose -I if you want induced mining, or -E for embedded mining
choose -O if you want ordered tree mining, or -U for unordered tree mining

To perform Graph Mining:
$ ./graph_test -i <input_data_file> -s <support> [-p]

To perform Multiset Mining:
$ ./multiset_test -i <input_data_file> -s <support> [-p]

To use file-based back end storage, compile a different set of binaries
by running 'make file-based' in the test directory.
The binaries generated are itemset_test_file, sequence_test_file, tree_test_file
and graph_test_file.
NOTE: There is currently no file-based back end support for multisets.

The command line parameters for the above binaries are quite similar to those of the
binaries for the memory-based back end. For example, the command line for itemsets
with the memory-based back end (as shown above) is
$ ./itemset_test -i <input_data_file> -s <support> [-p]
whereas the command line for itemsets with the file-based back end is
$ ./itemset_test_file -i <input_data_file> -s <support> [-p] [-f file buffer in MB]
The additional parameter -f specifies the size of the memory buffer in MB. The default is 512 MB.

INPUT DATA FILE FORMAT:
=======================
In the data directory, example input files are given.

* For itemset, each row is a transaction:
<transaction_id> <timestamp> <# of elements> <element 1> <element 2> ... <element n>
(NOTE: The timestamp is ignored for itemsets; you can just duplicate the transaction id in that column.)

* For sequence (set of itemsets with total ordering), each row is a single transaction in the sequence:
<sequence_id> <timestamp> <# of elements> <element 1> <element 2> ... <element n>
For example, let us consider the sequence (2,3)->(0,1,2), where the itemset (2,3) has
timestamp 10 and the itemset (0,1,2) has timestamp 15. Let us assume that the sequence has
identifier 1. This sequence will be represented as follows:
1 10 2 2 3
1 15 3 0 1 2

* For tree, each row is a tree:
<transaction_number> <transaction_number> <# of elements> <element 1> <element 2> ... <element n>
(NOTE: For a tree, elements are listed in pre-order traversal order; for a leaf node a -1 is used.)

* For graph, multiple rows make up a graph. A graph starts with a row like below:
t # <graph_id>
Followed by the listing of vertices, each vertex in a row. Each vertex row is like:
v <vertex_id> <vertex_label>
Followed by the listing of edges, each edge in a row.
Each edge (v1, v2) is like:
e <vertex id of v1> <vertex id of v2> <edge label>
(NOTE: If the edges do not have any label, use the same number for all the edge labels.)

* For multiset, each row is a transaction. The format is exactly like itemsets, with
  the exception that items can repeat in the same transaction (from the definition of
  multisets).

INPUT DATA TYPE
===============
Before running DMTL, the user needs to define the input data type, as DMTL is templated
on the input data type. The template argument PAT_ST (check any of the .cpp files in the
test directory to see examples of how PAT_ST is used) defines the data types and the
storage data structure for the pattern. Our generic mining algorithm assumes an adjacency
list data structure for this, but users are free to choose any alternative data structure
as long as the interface remains the same. The PAT_ST data structure should be templated
with the types of vertex_label and edge_label. The current test programs are written with
std::string as the vertex_label type and int as the edge_label type. So, our PAT_ST
definition is like below:
typedef adj_list<std::string, int>      PAT_ST;
But if the vertices were labeled by, say, integers, the declaration would look like below:
typedef adj_list<int, int>      PAT_ST;

SNAPSHOT OF RUN WITH FILES IN data DIRECTORY
=============================================
ITEMSET RUN:
------------
data$ cat ../data/IS_str_toy.data
1 1 4 A C T W
2 2 3 C D W
3 3 4 A C T W
4 4 4 A C D W
5 5 5 A C D T W
6 6 3 C D T

test$./itemset_test -i ../data/IS_str_toy.data -s 4 -p
Command line parameters given were -
infile=../data/IS_str_toy.data
minsup=4
print flag=1
Size of length one patterns = 5
Size of level-1 = 5
FREQUENT PATTERNS -
A -- Support: 4
C -- Support: 6
T -- Support: 4
W -- Support: 5
D -- Support: 4
A C -- Support: 4
A W -- Support: 4
A C W -- Support: 4
C T -- Support: 4
C W -- Support: 5
C D -- Support: 4
The following statistics are purely for performance measurement
Wall clock time taken to read db in:0.000187159
TOTAL wall clock time taken:0.000746012
11 patterns found

Output Help for Itemset:
------------------------
In the above itemset mining run, we got 11 frequent patterns. Each frequent pattern is
printed with its support count on the same line. For example, the pattern {C, W} has a
support of 5, i.e. {C, W} appears in 5 transactions of the above dataset, given in the
file IS_str_toy.data.

SEQUENCE RUN:
------------
Consider the following file as input to the sequence miner.
data$ cat ../data/SEQ_str_toy.data
1 10 5 V D I I W
1 15 6 L L G T F D
1 20 4 D L G V
1 25 2 K E
2 15 3 V P P
2 20 4 L G S G
3 10 3 D I V
4 10 4 K E A K
4 20 3 D E K
4 25 4 I E F V

The file contains 4 sequences, which can be better visualized as:
Timestamp         10                    15                 20             25
            (V, D, I, I, W) --> (L, L, G, T, F, D) --> (D, L, G, V) --> (K, E)
                                (V, P, P) -----------> (L, G, S, G)
            (D, I, V)
            (K, E, A, K) ----------------------------> (D, E, K) -----> (I, E, F, V)

The command line and output are as follows:
test$./sequence_test -i ../data/SEQ_str_toy.data -s 2 -I -p
infile=../data/SEQ_str_toy.data
minsup=2
print flag=1
Induced flag=1
SIZE OF LEN 1 VATS = 8
Wall clock time taken to read db in:0.000519991
FREQUENT PATTERNS -
V -- Support: 4
D -- Support: 3
I -- Support: 3
L -- Support: 2
G -- Support: 2
F -- Support: 2
K -- Support: 2
E -- Support: 2
D -> V -- Support: 2
V -> L -- Support: 2
V -> G -- Support: 2
D -> F -- Support: 2
D -> E -- Support: 2
13 patterns found
The following statistics are purely for performance measurement
Wall clock time taken to read db in:0.000519991
TOTAL wall clock time taken:0.0017581

Interpreting Sequence Output :
-----------------------------
The command line parameters indicate that we are performing induced mining with an
absolute minimum support of 2. We got 13 frequent patterns. Each frequent pattern is
printed with its support count on the same line. For example, the pattern 'D -> E' has a
support of 2, i.e. 'D -> E' appears (without any other element in between) in 2 sequences
of the above dataset.

TREE RUN:
------------
Consider the following file as input to the tree miner.
data$ cat ../data/TREE_int_toy.data
1 1 5 1 2 -1 3 -1
2 2 9 3 2 -1 1 2 -1 3 -1 -1
3 3 3 2 1 -1
4 4 1 1
5 5 5 3 1 -1 2 -1
6 6 1 2
7 7 7 3 1 2 1 -1 -1 -1
8 8 3 1 3 -1
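
To make the tree rows above easier to read, here is a minimal C++ sketch that decodes one
row into a parent array. It is not part of DMTL: the names DecodedTree and decode_tree_row
are hypothetical, it handles integer labels only (as in TREE_int_toy.data), and it assumes
that a -1 means "return to the parent of the current node", which is how the toy rows read.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One decoded tree. Nodes are numbered in pre-order, starting at 0.
// labels[i] is the label of node i; parent[i] is the index of its parent,
// or -1 for the root.
struct DecodedTree {
    std::vector<int> labels;
    std::vector<int> parent;
};

// Hypothetical helper (not a DMTL function). Decodes one row of the form
//   <id> <id> <# of elements> <element 1> ... <element n>
// ASSUMPTION: an element equal to -1 means "go back up to the parent".
DecodedTree decode_tree_row(const std::string& row) {
    std::istringstream in(row);
    int id = 0, id_copy = 0, count = 0;
    if (!(in >> id >> id_copy >> count) || count < 0) {
        throw std::runtime_error("malformed tree row header");
    }

    DecodedTree tree;
    std::vector<int> path;  // node indices from the root down to the current node
    for (int i = 0; i < count; ++i) {
        int element = 0;
        if (!(in >> element)) {
            throw std::runtime_error("row has fewer elements than announced");
        }
        if (element == -1) {
            // Backtrack: leave the current node and return to its parent.
            if (path.empty()) {
                throw std::runtime_error("unbalanced -1 in tree row");
            }
            path.pop_back();
        } else {
            // A label after the path has emptied would start a second root.
            if (path.empty() && !tree.labels.empty()) {
                throw std::runtime_error("row encodes more than one root");
            }
            // New node: it becomes a child of the node at the end of the path.
            tree.parent.push_back(path.empty() ? -1 : path.back());
            tree.labels.push_back(element);
            path.push_back(static_cast<int>(tree.labels.size()) - 1);
        }
    }
    return tree;
}
```

For the second row of the file, "2 2 9 3 2 -1 1 2 -1 3 -1 -1", this yields the labels
3, 2, 1, 2, 3 with parent indices -1, 0, 0, 2, 2: a root labeled 3 with children labeled
2 and 1, where the node labeled 1 has children labeled 2 and 3.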