%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%  THE GREEDY EM ALGORITHM FOR MULTIPLE MOTIF DISCOVERY   %%%%%
%%%%%                                                         %%%%%
%%%%%               Kostas Blekas, 15 May 2001                %%%%%
%%%%%                Dept. of Computer Science                %%%%%
%%%%%             University of Ioannina, Greece              %%%%%
%%%%%                                                         %%%%%
%%%%%  Please contact kblekas@cc.uoi.gr in case of problems   %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The GreedyEM.zip file contains the following files:
filename                explanation
-----------------------------------------------------------
readme.txt              (this file)
artificial_seqs.txt     The artificial set of 10 sequences with 6 motifs
pr00058.txt             The PRINTS family PR00058 with 16 sequences
artificial_res.txt      Results obtained by applying the GreedyEM algorithm
                        to the artificial_seqs dataset (demo1)
pr00058_res.txt         Results obtained by applying the GreedyEM algorithm
                        to the PR00058 family (demo2)
read_seqs.m             Reads the file that stores the training sequences
                        and creates the training set of substrings
GreedyEM.m              The basic Greedy EM algorithm
kd_trees.m              Kd-tree technique for partitioning the set of n
kd_recurse.m            substrings into a set of C candidate models for the
bestpos.m               global search phase. The Ksi matrix is calculated
Ksi_matrix.m            for iterative use during candidate selection
candidate_selection.m   Finds the candidate model that maximizes the
                        log-likelihood and initializes the trial component
                        parameters (a' and probability matrix)
partial_EMsteps.m       Performs partial EM steps until convergence
Estep.m                 Expectation phase and
Mstep.m                 Maximization phase of the general EM algorithm
                        for likelihood maximization
take_res.m              Stores the results in the res.txt file
demo1.m                 Demo applying the algorithm to the artificial
                        dataset artificial_seqs.txt
demo2.m                 Demo applying the algorithm to the PR00058 family
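As a rough illustration of what read_seqs.m is described as doing (reading
FASTA sequences and creating the training set of substrings), the Python
sketch below builds every overlapping substring of length W. It is not the
package's MATLAB code: the names parse_fasta and make_substrings are
hypothetical, and the use of all overlapping windows is an assumption about
how the training set is formed.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def make_substrings(sequences, W):
    """Return every overlapping window of length W.

    Each entry is (sequence_index, start_position, substring).
    Sequences shorter than W contribute no windows.
    """
    if W < 1:
        raise ValueError("W must be a positive integer")
    windows = []
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - W + 1):
            windows.append((i, start, seq[start:start + W]))
    return windows


# Toy input (made up for illustration, not from the package's datasets).
example = ">seq1\nACDEFG\n>seq2\nACD\nEF\n"
records = parse_fasta(example)
substrings = make_substrings([seq for _, seq in records], W=4)
```

Under this assumption a sequence of length L yields L - W + 1 substrings, so
the size n of the training set grows with the total sequence length rather
than with the number of sequences.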
=====================
Follow the instructions:
steps
-----
[1]. Run the GreedyEM.m file to discover motifs of length W in the set of sequences.
     Provide the following inputs:
     - The motif length (W)
     - The maximum number of motifs to discover
     - The name of the sequence dataset file, in FASTA format
     - The value of the parameter T used for partitioning the input substrings.
       For better results, choose T = N (the number of training sequences).
[2]. Read the file res.txt for an explanation and statistics of the results (motifs found).
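The inputs listed in step [1] can be sanity-checked before a run. The Python
sketch below is not part of the package: summarize_inputs is a hypothetical
helper, and it assumes one substring per window position of length W. It
defaults T to N, following the recommendation above, and reports any
sequences too short to contain a motif of length W.

```python
def summarize_inputs(sequences, W, max_motifs, T=None):
    """Check the run parameters and summarize the resulting problem size."""
    N = len(sequences)
    if N == 0:
        raise ValueError("at least one sequence is required")
    if W < 1 or max_motifs < 1:
        raise ValueError("W and max_motifs must be positive integers")
    if T is None:
        T = N  # recommended choice: T = N (number of training sequences)
    if T < 1:
        raise ValueError("T must be a positive integer")
    # Sequences shorter than W cannot contain a length-W substring.
    too_short = [i for i, s in enumerate(sequences) if len(s) < W]
    # Assumed window count: L - W + 1 per sequence of length L.
    n_substrings = sum(max(len(s) - W + 1, 0) for s in sequences)
    return {
        "N": N,
        "W": W,
        "max_motifs": max_motifs,
        "T": T,
        "n_substrings": n_substrings,
        "too_short": too_short,
    }


# Toy sequences, made up for illustration.
summary = summarize_inputs(["ACDEFG", "ACDEF", "AC"], W=4, max_motifs=2)
```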