<TR> <TD>Y</TD> <TD><FONT COLOR=TURQUOISE>TURQUOISE</FONT></TD> </TR> </TABLE></CENTER>
<p>J. Kyte and R. Doolittle, 1982. "A Simple Method for Displaying the Hydropathic Character of a Protein", J. Mol. Biol. 157, 105-132.
<p>Summing the information content for each position in the motif gives the total information content of the motif (shown in parentheses to the left of the diagram). The total information content is approximately equal to the log likelihood ratio divided by the product of the number of occurrences and ln(2). The total information content gives a measure of the usefulness of the motif for database searches. For a motif to be useful for database searches, it must as a rule contain at least log<SUB>2</SUB>(<I>N</I>) bits of information, where <I>N</I> is the number of sequences in the database being searched. For example, to effectively search a database containing 100,000 sequences for occurrences of a <I>single</I> motif, the motif should have an information content (IC) of at least 16.6 bits. Motifs with lower information content are still useful when a family of sequences shares more than one motif, since they can be combined in <I>multiple</I> motif searches (using MAST).
<P><LI> <A NAME=consensus_doc2 HREF=#consensus1><H4>Multilevel Consensus Sequence</H4></A>
The multilevel consensus sequence corresponding to the motif is an aid in remembering and understanding the motif. It is calculated from the motif position-specific probability matrix as follows. Separately for each column of the motif, the letters in the alphabet are sorted in decreasing order by the probability with which they are expected to occur in that position of motif occurrences. The sorted letters are then printed vertically with the most probable letter on top. Only letters with probabilities of 0.2 or higher at that position in the motif are printed.
As an example, the multilevel consensus sequence of motif 1 in the sample output is:
<PRE>
<B>Multilevel</B>  <B>TTATGTGAACGACGTCACACT</B>
<B>consensus</B>   AA  T A G A GA     AA
<B>sequence</B>            T C TT     T
</PRE>
This multilevel consensus sequence says several things about the motif.
First, the most likely form of the motif can be read from the top line as <TT>TTATGTGAACGACGTCACACT</TT>.
Second, only the letter <TT>A</TT> has probability greater than 0.2 in position 3 of the motif, both <TT>T</TT> and <TT>A</TT> have probability greater than 0.2 in position 1, etc.
Third, a <I>rough approximation</I> of the motif can be made by converting the multilevel consensus sequence into a <A HREF=#regular_expression_doc2>regular expression</A> for the motif.
<UL> <TT>[TA][TA]AT[GT][T][GA]A[AGT]C[GAC]A[CGT][GAT]TCACA[CAT][TA]</TT></UL>
<P><LI> <A NAME=sites_doc2 HREF=#sites1><H4>Occurrences of the Motif</H4></A>
MEME displays the occurrences (sites) of the motif in the training set. The sites are shown aligned with each other, and the ten sequence positions preceding and following each site are also shown. Each site is identified by the name of the sequence where it occurs, the strand (if both strands of DNA sequences are being used), and the position in the sequence where the site begins. When the DNA strand is specified, `+' means the sequence in the training set, and `-' means the reverse complement of the training set sequence. (For `-' strands, the `start' position is actually the position on the <B>positive</B> strand where the site ends.)
The sites are listed in order of increasing <I>p</I>-value, so the most statistically significant sites come first. The <I>p</I>-value of a site is computed from the match score of the site with the <A HREF=#pssm>position specific scoring matrix</A> for the motif. The <I>p</I>-value gives the probability of a random string (generated from the background letter frequencies) having the same match score or higher.
(This is referred to as the <B>position <I>p</I>-value</B> by the MAST algorithm.)
<P><LI> <A NAME=diagrams_doc2 HREF=#diagrams1><H4>Block Diagrams of Motif Occurrences</H4></A>
The occurrences of the motif in the training set sequences are shown with MAST-style block diagrams. One diagram is printed for each sequence showing all the occurrences of the motif in that sequence. The sequences are sorted by the <B>lowest</B> <I>p</I>-value among all occurrences of the motif in a given sequence. (The <I>p</I>-value of an occurrence is the probability of a single random subsequence the length of the motif, generated according to the 0-order background model, having a score at least as high as the score of the occurrence.) When the DNA strand is specified, `+' means the motif appears from left to right on the sequence, and `-' means the motif appears from right to left on the complementary strand. A sequence position scale is shown at the end of each table of block diagrams. Very long sequences are shown with thick lines connecting the motifs and are <B>not</B> drawn to scale.
<P><LI> <A NAME=BLOCKS_doc2 HREF=#BLOCKS1><H4>Motif in BLOCKS format or FASTA format</H4></A>
For use with <A HREF="http://blocks.fhcrc.org/blocks">BLOCKS tools</A>, MEME prints the occurrences of the motif in BLOCKS format.
<P>You can convert these blocks to PSSMs (position-specific scoring matrices), LOGOS (color representations of the motifs), phylogeny trees and search them against a database of other blocks by pasting everything from the "BL" line to the "//" line (inclusive) into the <A HREF=http://blocks.fhcrc.org/blocks/process_blocks.html>Multiple Alignment Processor.</A> If you include the <B>-print_fasta</B> switch on the command line, MEME prints the motif sites in FASTA format instead of BLOCKS format.
<P><LI> <A NAME=pssm_doc2 HREF=#pssm1><H4>Position-Specific Scoring Matrix</H4></A>
The position-specific scoring matrix corresponding to the motif is printed for use by database search programs such as MAST.
This matrix is a log-odds matrix calculated by taking 100 times the log (base 2) of the ratio <I>p/f</I> at each position in the motif, where <I>p</I> is the probability of a particular letter at that position in the motif, and <I>f</I> is the background frequency of the letter (given in the <A HREF=#command_doc>command line summary</A> section). This is the same matrix that is used above in computing the <I>p</I>-values of the occurrences of the motif in the <A HREF=#sites_doc2>Occurrences of the Motif</A> and <A HREF=#diagrams_doc2>Block Diagrams of Motif Occurrences</A> sections. The scoring matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The scoring matrix is preceded by a line starting with "log-odds matrix:" and containing the length of the alphabet, width of the motif, number of characters in the training set, the scoring threshold (obsolete) and the motif <I>E</I>-value.
<P><B>Note:</B> The probability <I>p</I> used to compute the PSSM is <I>not</I> exactly the same as the corresponding value in the Position Specific Probability Matrix (PSPM). The values of <I>p</I> used to compute the PSSM take into account the motif prior, whereas the values in the PSPM are just the <I>observed</I> frequencies of letters in the motif sites.
<P><LI> <A NAME=pspm_doc2 HREF=#pspm1><H4>Position-Specific Probability Matrix</H4></A>
The motif itself is a position-specific probability matrix giving, for each position in the pattern, the observed frequency ("probability") of each possible letter.
The probability matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The motif is preceded by a line starting with "letter-probability matrix:" and containing the length of the alphabet, width of the motif, number of occurrences of the motif, and the <I>E</I>-value of the motif.
<p><b>Note:</b> Earlier versions of MEME gave the posterior probabilities--the probability after applying a prior on letter frequencies--rather than the observed frequencies. These versions of MEME also gave the number of <I>possible</I> positions for the motif rather than the actual number of occurrences. The output from these earlier versions of MEME can be distinguished by "n=" rather than "nsites=" in the line preceding the matrix.
<P><LI> <A NAME=regular_expression_doc2 HREF=#regular_expression1><H4>Regular Expression</H4></A>
This is the <A HREF=#consensus_doc2>multilevel consensus</A> expressed as a regular expression for convenience.
Regular expressions can be used to search sequences (using, for example, <A HREF="http://nar.oxfordjournals.org/cgi/content/full/33/suppl_2/W262">PatMatch</A>), but the search accuracy will usually be better with the PSSM (using, for example, <A HREF=http://meme.nbcr.net/meme/mast-intro.html>MAST</A>). MEME regular expressions are interpreted as follows: single letters match that letter; groups of letters in square brackets match any of the letters in the group.
<P><LI> <A NAME=motif-summary-doc2 HREF=#motif-summary><H4>Motif Summary Tiling</H4></A>
The motif summary tiling is done using the same algorithm as used by <A HREF=http://meme.nbcr.net/meme/mast-intro.html>MAST</A>. The motif occurrences shown in the motif summary <B>may not be exactly the same as those reported in each motif section</B> because only motif occurrences with a position <em>p</em>-value of 0.0001 or better that don't overlap other, more significant motif occurrences are shown. The format of the machine readable motif-summary is:
<pre>[sequence_name combined_<em>p</em>-value number_of_motif_occurrences [motif_number start_of_motif position_<em>p</em>-value]+]+</pre>
See the documentation for <A HREF=http://meme.nbcr.net/meme/mast-output.html>MAST output</A> for the definition of position and combined <em>p</em>-values.
</UL>
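<P>The multilevel consensus, regular expression, log-odds matrix and information-content rule of thumb described above can be illustrated with a minimal Python sketch. This is not MEME's own code. The function names, the toy three-position DNA matrix and the uniform background are assumptions made for this example. The log-odds values are rounded to integers and computed directly from the given probabilities, without the motif prior that MEME applies when it builds the PSSM.

```python
import math

ALPHABET = "ACGT"  # column order of the toy matrices below (assumed for this sketch)


def multilevel_consensus(pspm, alphabet=ALPHABET, min_prob=0.2):
    """For each motif position, list the letters whose probability is at
    least min_prob, most probable letter first."""
    levels = []
    for row in pspm:
        ranked = sorted(zip(alphabet, row), key=lambda pair: -pair[1])
        levels.append([letter for letter, p in ranked if p >= min_prob])
    return levels


def consensus_rows(levels):
    """Lay the per-position letter lists out as text rows, top row first."""
    depth = max(len(letters) for letters in levels)
    return ["".join(letters[i] if i < len(letters) else " " for letters in levels)
            for i in range(depth)]


def regular_expression(levels):
    """Single letters stand alone; several letters go in square brackets.
    Assumes every position keeps at least one letter, which always holds
    for a four-letter alphabet with min_prob <= 0.25."""
    return "".join(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]"
                   for letters in levels)


def log_odds_matrix(pspm, background, scale=100):
    """scale * log2(p / f) for every entry, rounded to an integer.
    Requires p > 0 everywhere; the toy matrix has no zero entries."""
    return [[round(scale * math.log2(p / f)) for p, f in zip(row, background)]
            for row in pspm]


def min_information_bits(n_sequences):
    """Rule-of-thumb minimum information content, log2(N) bits."""
    return math.log2(n_sequences)


# Toy motif: rows are positions, columns follow ALPHABET (A, C, G, T).
toy_pspm = [
    [0.30, 0.10, 0.10, 0.50],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.05, 0.50, 0.20],
]
uniform_background = [0.25, 0.25, 0.25, 0.25]

levels = multilevel_consensus(toy_pspm)
for line in consensus_rows(levels):
    print(line)
print(regular_expression(levels))  # [TA]A[GAT]
for row in log_odds_matrix(toy_pspm, uniform_background):
    print(row)
print(round(min_information_bits(100000), 1))  # 16.6
```

<P>The matrices are kept "sideways" as in the printed output, one row per motif position, so each function can walk the motif position by position.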