mast.doc

	[-sep]		score reverse complement DNA strand as a separate
			sequence
	[-norc]		do not score reverse complement DNA strand
	[-dna]		translate DNA sequences to protein
	[-comp]		adjust p-values and E-values for sequence composition
	[-rank <rank>]	print results starting with <rank> best (default: 1)
	[-smax <smax>]	print results for no more than <smax> sequences
			(default: all)
	[-ev <ev>]	print results for sequences with E-value < <ev>
			(default: 10)
	[-mt <mt>]	show motif matches with p-value < mt (default: 0.0001)
	[-w]		show weak matches (mt<p-value<mt*10) in angle brackets
	[-bfile <bfile>]	read background frequencies from <bfile>
	[-seqp]		use SEQUENCE p-values for motif thresholds
			(default: use POSITION p-values)
	[-mf <mf>]	print <mf> as motif file name
	[-df <df>]	print <df> as database name
	[-minseqs <minseqs>]	lower bound on number of sequences in db
	[-mev <mev>]+	use only motifs with E-values less than <mev>
	[-m <m>]+	use only motif(s) number <m> (overrides -mev)
	[-diag <diag>]	nominal order and spacing of motifs
	[-best]		include only the best motif in diagrams
	[-remcorr]	remove highly correlated motifs from query
	[-brief]	brief output--do not print documentation
	[-b]		print only sections I and II
	[-nostatus]	do not print progress report
	[-hit_list]	print machine-readable list of all hits only; implies -text


MAST: Motif Alignment and Search Tool

MAST is a tool for searching biological sequence databases for sequences
that contain one or more of a group of known motifs.

A motif is a sequence pattern that occurs repeatedly in a group of related
protein or DNA sequences. Motifs are represented as position-dependent
scoring matrices that describe the score of each possible letter at each
position in the pattern. Individual motifs may not contain gaps. Patterns
with variable-length gaps must be split into two or more separate motifs
before being submitted as input to MAST.

MAST takes as input a file containing the descriptions of one or more
motifs and searches a sequence database that you select for sequences that
match the motifs. The motif file can be the output of the MEME motif
discovery tool or any file in the appropriate format.

MAST outputs three things:

   1. The names of the high-scoring sequences sorted by the strength of the
      combined match of the sequence to all of the motifs in the group.
   2. Motif diagrams showing the order and spacing of the motifs within each
      matching sequence.
   3. Detailed annotation of each matching sequence showing the sequence
      and the locations and strengths of matches to the motifs.

MAST works by calculating match scores for each sequence in the database
compared with each of the motifs in the group of motifs you provide. For
each sequence, the match scores are converted into various types of
p-values and these are used to determine the overall match of the sequence
to the group of motifs and the probable order and spacing of occurrences
of the motifs in the sequence.

MAST outputs a file containing:

     * the version of MAST and the date it was built,
     * the reference to cite if you use MAST in your research,
     * a description of the database and motifs used in the search,
     * an explanation of the results,
     * high-scoring sequences--sequences matching the group of motifs
       above a stated level of statistical significance,
     * motif diagrams showing the order and spacing of occurrences of the
       motifs in the high-scoring sequences and
     * annotated sequences showing the positions and p-values of all motif
       occurrences in each of the high-scoring sequences.

Each section of the results file contains an explanation of how to
interpret it.


Match Scores

The match score of a motif to a position in a sequence is the sum of the
score from each column of the position-dependent scoring matrix
corresponding to the letter at that position in the sequence. For example,
if the sequence is

  TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
     ========

and the motif is represented by the position-dependent scoring matrix
(where each row of the matrix corresponds to a position in the motif)

  =========|=================================
  POSITION |   A        C        G        T
  =========|=================================
    1      | 1.447    0.188   -4.025   -4.095
    2      | 0.739    1.339   -3.945   -2.325
    3      | 1.764   -3.562   -4.197   -3.895
    4      | 1.574   -3.784   -1.594   -1.994
    5      | 1.602   -3.935   -4.054   -1.370
    6      | 0.797   -3.647   -0.814    0.215
    7      |-1.280    1.873   -0.607   -1.933
    8      |-3.076    1.035    1.414   -3.913
  =========|=================================

then the match score of the fourth position in the sequence (underlined)
would be found by summing the score for T in position 1, G in position 2
and so on until G in position 8. So the match score would be

    score = -4.095 + -3.945 + -3.895 + -1.994
          + -4.054 + -0.814 + -1.933 + 1.414
          = -19.316

The match scores for other positions in the sequence are calculated in the
same way. Match scores are only calculated if the match completely fits
within the sequence. Match scores are not calculated if the motif would
overhang either end of the sequence.


P-values

MAST reports all matches of a sequence to a motif or group of motifs in
terms of the p-value of the match.
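
Since every p-value is derived from match scores, the calculation in the
Match Scores section can be sketched in Python as follows. This is only an
illustration of the arithmetic, not MAST's own code; the names ALPHABET,
MATRIX, match_score and all_match_scores are hypothetical and chosen for
this example.

```python
# Position-dependent scoring matrix from the Match Scores example:
# one row per motif position, columns in the order A, C, G, T.
ALPHABET = "ACGT"
MATRIX = [
    [ 1.447,  0.188, -4.025, -4.095],
    [ 0.739,  1.339, -3.945, -2.325],
    [ 1.764, -3.562, -4.197, -3.895],
    [ 1.574, -3.784, -1.594, -1.994],
    [ 1.602, -3.935, -4.054, -1.370],
    [ 0.797, -3.647, -0.814,  0.215],
    [-1.280,  1.873, -0.607, -1.933],
    [-3.076,  1.035,  1.414, -3.913],
]


def match_score(sequence, start, matrix=MATRIX):
    """Score the motif placed at 1-based position `start`.

    Returns None when the motif would overhang either end of the
    sequence, since no match score is calculated in that case.
    """
    width = len(matrix)
    begin = start - 1
    if begin < 0 or begin + width > len(sequence):
        return None
    return sum(
        matrix[i][ALPHABET.index(sequence[begin + i])] for i in range(width)
    )


def all_match_scores(sequence, matrix=MATRIX):
    """Match scores for every position where the motif fits completely."""
    count = max(len(sequence) - len(matrix) + 1, 0)
    return [match_score(sequence, s, matrix) for s in range(1, count + 1)]


SEQ = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"
print(round(match_score(SEQ, 4), 3))  # -19.316, as in the worked example
```

The 43-letter example sequence has 36 positions where the 8-wide motif
fits, so all_match_scores(SEQ) returns 36 scores.
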
MAST considers the p-values of four types of events:

    position p-value: the match of a single position within a sequence to
    	a given motif,

    sequence p-value: the best match of any position within a sequence
    	to a given motif,

    combined p-value: the combined best matches of a sequence to a
    	group of motifs, and

    E-value: observing a combined p-value at least as small in a random
    	database of the same size.

All p-values are based on a random sequence model that assumes each
position in a random sequence is generated according to the average letter
frequencies of all sequences in the appropriate (peptide or nucleotide)
non-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) on
September 22, 1996.

This can be overridden in two ways:

	1) -bfile <bfile>
	The random model uses the letter frequencies given in <bfile>
	instead of the non-redundant database frequencies.
	The format of <bfile> is the same as that for the MEME -bfile option;
	see the MEME documentation for details.  Sample files are given in
	directory tests: tests/nt.freq and tests/na.freq.

	2) -comp
	The random model uses the letter frequencies in the current target
	sequence instead of the non-redundant database frequencies.  This
	causes p-values and E-values to be compensated individually for the
	actual composition of each sequence in the database.  This option
	can increase search time substantially due to the need to compute
	a different score distribution for each high-scoring sequence.

    Position p-value

    The p-value of a match of a given position within a sequence to a
    motif is defined as the probability of a randomly selected position in a
    randomly generated sequence having a match score at least as large
    as that of the given position.

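
The position p-value definition can be illustrated with a small Python
sketch that builds the exact distribution of match scores under a random
sequence model and sums the probability of all scores at least as large as
the observed one. Two assumptions are made purely for illustration: the
background is uniform (0.25 per nucleotide) rather than the non-redundant
database frequencies that MAST uses by default, and scores are scaled to
integers with three decimals. The function names are hypothetical, and
this is not a description of how MAST itself computes the distribution.

```python
from collections import defaultdict

# Scoring matrix from the Match Scores example (columns: A, C, G, T).
MATRIX = [
    [ 1.447,  0.188, -4.025, -4.095],
    [ 0.739,  1.339, -3.945, -2.325],
    [ 1.764, -3.562, -4.197, -3.895],
    [ 1.574, -3.784, -1.594, -1.994],
    [ 1.602, -3.935, -4.054, -1.370],
    [ 0.797, -3.647, -0.814,  0.215],
    [-1.280,  1.873, -0.607, -1.933],
    [-3.076,  1.035,  1.414, -3.913],
]

# ASSUMPTION: uniform letter frequencies, used here only as a stand-in
# for the background model (MAST's default frequencies are not reproduced).
BACKGROUND = [0.25, 0.25, 0.25, 0.25]


def score_distribution(matrix, background, scale=1000):
    """Exact distribution of the scaled integer match score.

    Motif positions are treated as independent draws from `background`,
    so the distribution is built one matrix row at a time.
    """
    dist = {0: 1.0}
    for row in matrix:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for score, freq in zip(row, background):
                nxt[total + round(score * scale)] += prob * freq
        dist = dict(nxt)
    return dist


def position_p_value(observed, matrix=MATRIX, background=BACKGROUND,
                     scale=1000):
    """Probability that a random position scores at least `observed`."""
    target = round(observed * scale)
    dist = score_distribution(matrix, background, scale)
    return sum(prob for total, prob in dist.items() if total >= target)


# p-value of the worked example score under the assumed uniform background.
print(position_p_value(-19.316))
```

Under this uniform background, the best possible score (11.810) can only
be produced by one of the 4^8 equally likely letter combinations, so its
position p-value is 0.25^8.
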
    Sequence p-value

    The p-value of a match of a sequence to a motif is defined as the
    probability of a randomly generated sequence of the same length
    having a match score at least as large as the largest match score of
    any position in the sequence.

    Combined p-value

    The p-value of a match of a sequence to a group of motifs is defined
    as the probability of a randomly generated sequence of the same
    length having sequence p-values whose product is at least as small
    as the product of the sequence p-values of the matches of the motifs
    to the given sequence.

    E-value

    The E-value of the match of a sequence in a database to a group
    of motifs is defined as the expected number of sequences in a random
    database of the same size that would match the motifs as well as the
    sequence does and is equal to the combined p-value of the sequence
    times the number of sequences in the database.


High-scoring Sequences

MAST lists the names and part of the descriptive text of all sequences
whose E-value is less than E. Sequences shorter than one or more of the
motifs are skipped. The sequences are sorted by increasing E-value. The
value of E is set to 10 for the web server but is user-selectable in the
downloadable version of MAST.


Motif Diagrams

Motif diagrams show the order and spacing of non-overlapping matches to
the motifs in each high-scoring sequence. Motif occurrences are determined
based on the position p-value of matches to the motif. Strong matches
(p-value < M) are shown in square brackets (`[ ]'), weak matches
(M < p-value < M