   one has a high quality reference alignment and wishes to keep it fixed
   while adding new sequences automatically.

  Terminal Gaps

   In the original Clustal V program, terminal gaps were penalised the
   same as all other gaps. This caused some ugly side effects, e.g.

acgtacgtacgtacgt                              acgtacgtacgtacgt
a----cgtacgtacgt  gets the same score as      ----acgtacgtacgt

   Now, terminal gaps are free. This is better on average and stops silly
   effects like single residues jumping to the edge of the alignment.
   However, it is not perfect. It does mean that if there should be a gap
   near the end of the alignment, the program may be reluctant to insert
   it, i.e.

cccccgggccccc                                              cccccgggccccc
ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---

   In the right hand case above, the terminal gap is free and may score
   higher than the left hand alignment. This can be prevented by lowering
   the gap opening and extension penalties. It is difficult to get this
   right all the time. Please watch the ends of your alignments.

  Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

   By default, the initial pairwise alignments are now carried out using
   a full dynamic programming algorithm. This is more accurate than the
   older hash/k-tuple based alignments (Wilbur and Lipman) but is MUCH
   slower. On a fast workstation you may not notice but on a slow box,
   the difference is extreme. You can easily switch the alignment method
   back to the older, faster one from the menus.

  Delaying alignment of distant sequences

   The user can set a cut off to delay the alignment of the most
   divergent sequences in a data set until all other sequences have been
   aligned. By default, this is set to 40% which means that if a sequence
   is less than 40% identical to any other sequence, its alignment will
   be delayed.
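
   The delay rule above can be illustrated with a small sketch. This is
   not Clustal W's own code. It assumes that "less than 40% identical to
   any other sequence" means that a sequence's best identity to every
   other sequence is below the cut off, and that the pairwise scores
   printed in the sample emma session later in this document are percent
   identities. The names PAIR_SCORES, best_identity and delayed_sequences
   are made up for the illustration.

```python
# Pairwise scores copied from the sample emma session in this document.
# ASSUMPTION: these scores are percent identities.
PAIR_SCORES = {
    (1, 2): 83, (1, 3): 43, (1, 4): 42, (1, 5): 24, (1, 6): 21, (1, 7): 14,
    (2, 3): 41, (2, 4): 43, (2, 5): 24, (2, 6): 19, (2, 7): 15,
    (3, 4): 87, (3, 5): 26, (3, 6): 29, (3, 7): 16,
    (4, 5): 26, (4, 6): 27, (4, 7): 12,
    (5, 6): 21, (5, 7): 7,
    (6, 7): 11,
}


def best_identity(seq, n_seqs, scores):
    """Highest score between sequence `seq` and any other sequence (1-based)."""
    return max(
        scores[tuple(sorted((seq, other)))]
        for other in range(1, n_seqs + 1)
        if other != seq
    )


def delayed_sequences(n_seqs, scores, cutoff=40.0):
    """Sequences whose best identity to all others is below `cutoff`.

    ASSUMPTION: this is one reading of the documented rule, not the
    program's actual implementation.
    """
    return [
        seq for seq in range(1, n_seqs + 1)
        if best_identity(seq, n_seqs, scores) < cutoff
    ]


if __name__ == "__main__":
    print(delayed_sequences(7, PAIR_SCORES))
```

   Under this reading, sequences 5, 6 and 7 fall below the default cut
   off, which is consistent with the three sequences that the sample
   session reports as delayed.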

  Iterative realignment/Reset gaps between alignments

   By default, if you align a set of sequences a second time (e.g. with
   changed gap penalties), the gaps from the first alignment are
   discarded. You can change this from the menus so that older gaps are
   kept between alignments. Keeping the gaps (not resetting them) and
   doing the full multiple alignment a second time can sometimes give
   better alignments. Sometimes, the alignment will converge on a
   better solution; sometimes the new alignment will be the same as the
   first. There can be a strange side effect: you can get columns of
   nothing but gaps introduced.

   Any gaps that are read in from the input file are always kept,
   regardless of the setting of this switch. If you read in a full
   multiple alignment, the "reset gaps" switch has no effect. The old
   gaps will remain and if you carry out a multiple alignment, any new
   gaps will be added in. If you wish to carry out a full new alignment
   of a set of sequences that are already aligned in a file you must
   input the sequences without gaps.

  Profile alignment

   By profile alignment, we simply mean the alignment of old
   alignments/sequences. In this context, a profile is just an existing
   alignment (or even a set of unaligned sequences; see below). This
   allows you to read in an old alignment (in any of the allowed input
   formats) and align one or more new sequences to it. From the profile
   alignment menu, you are allowed to read in 2 profiles. Either profile
   can be a full alignment OR a single sequence. In the simplest mode,
   you simply align the two profiles to each other. This is useful if you
   want to gradually build up a full multiple alignment.

   A second option is to align the sequences from the second profile, one
   at a time, to the first profile. This is done taking the underlying
   tree between the sequences into account.
   This is useful if you have a set of new sequences (not aligned) and
   you wish to add them all to an older alignment.

Changes to the phylogenetic tree calculations and some hints

  Improved distance calculations for protein trees

   The phylogenetic trees in Clustal W (the real trees that you calculate
   AFTER alignment; not the guide trees used to decide the branching
   order for multiple alignment) use the Neighbor-Joining method of
   Saitou and Nei based on a matrix of "distances" between all sequences.
   These distances can be corrected for "multiple hits". This is normal
   practice when accurate trees are needed. This correction stretches
   distances (especially large ones) to try to correct for the fact that
   OBSERVED distances (mean number of differences per site) greatly
   underestimate the actual number that happened during evolution.

   In Clustal V we used a simple formula to convert an observed distance
   to one that is corrected for multiple hits. The observed distance is
   the mean number of differences per site in an alignment (ignoring
   sites with a gap) and is therefore always between 0.0 (for identical
   sequences) and 1.0 (no residues the same at any site). These distances
   can be multiplied by 100 to give percent difference values. 100 minus
   percent difference gives percent identity. The formula we use to
   correct for multiple hits is from Motoo Kimura (Kimura, M. The Neutral
   Theory of Molecular Evolution, Camb. Univ. Press, 1983, page 75) and is:

   K = -ln(1 - D - D^2/5)

   where D is the observed distance and K is the corrected distance.

   This formula gives the mean number of estimated substitutions per site
   and, in contrast to D (the observed number), can be greater than 1
   i.e. more than one substitution per site, on average.
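
   As a minimal sketch, the conversions just described can be written
   out as follows. The function names percent_identity and
   kimura_distance are illustrative only and are not taken from the
   Clustal W source. The sketch evaluates the formula directly and
   raises an error where the argument of the logarithm is no longer
   positive.

```python
import math


def percent_identity(d):
    """Percent identity from an observed distance D (differences per site)."""
    return 100.0 - 100.0 * d


def kimura_distance(d):
    """Corrected distance K = -ln(1 - D - D^2/5) for an observed distance D.

    Illustrative helper only: it applies the formula as written in the
    text and does nothing else.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("observed distance D must lie between 0.0 and 1.0")
    arg = 1.0 - d - (d * d) / 5.0
    if arg <= 0.0:
        # The logarithm is undefined here, so the formula gives no answer.
        raise ValueError("formula is undefined for this value of D")
    return -math.log(arg)


if __name__ == "__main__":
    for d in (0.1, 0.3, 0.5, 0.7):
        k = kimura_distance(d)
        print(f"D = {d:.2f}  identity = {percent_identity(d):.0f}%  K = {k:.3f}")
```

   Multiplying the returned K by 100 expresses it as substitutions per
   100 residues.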
   For example, if you observe 0.8 differences per site (80% difference;
   20% identity), then the above formula predicts that there have been
   about 2.6 substitutions per site over the course of evolution since
   the 2 sequences diverged. This can also be expressed in PAM units by
   multiplying by 100 (mean number of substitutions per 100 residues).
   The PAM scale of evolution and its derivation/calculation comes from
   the work of Margaret Dayhoff and co-workers (the famous Dayhoff PAM
   series of weight matrices also came from this work). Dayhoff et al.
   constructed an elaborate model of protein evolution based on observed
   frequencies of substitution between very closely related proteins.
   Using this model, they derived a table relating observed distances to
   predicted PAM distances. Kimura's formula, above, is just a "curve
   fitting" approximation to this table. It is very accurate in the
   range 0.75 > D > 0.0 but becomes increasingly inaccurate at high D
   (>0.75) and fails completely at around D = 0.85.

   To circumvent this problem, we calculated all the values for K
   corresponding to D above 0.75 directly using the Dayhoff model and
   store these in an internal table, used by Clustal W. This table is
   declared in the file dayhoff.h and gives values of K for all D between
   0.75 and 0.93 in intervals of 0.001 i.e. for D = 0.750, 0.751, 0.752
   ...... 0.929, 0.930. For any observed D higher than 0.930, we
   arbitrarily set K to 10.0. This sounds drastic but with real
   sequences, distances of 0.93 (less than 7% identity) are rare. If your
   data set includes sequences with this degree of divergence, you will
   have great difficulty getting accurate trees by ANY method; the
   alignment itself will be very difficult (to construct and to
   evaluate).

   There are some important things to note. Firstly, this formula works
   well if your sequences are of average amino acid composition and if
   the amino acids substitute according to the original Dayhoff model.
   In other cases, it may be misleading. Secondly, it is based only on
   observed percent distance i.e. it does not DIRECTLY take conservative
   substitutions into account. Thirdly, the error on the estimated PAM
   distances may be VERY great for high distances; at very high distance
   (e.g. over 85%) it may give largely arbitrary corrected distances. In
   most cases, however, the correction is still worth using; the trees
   will be more accurate and the branch lengths will be more realistic.

   A far more sophisticated distance correction based on a full Dayhoff
   model, which DOES take conservative substitutions and actual amino
   acid composition into account, may be found in the PROTDIST program of
   the PHYLIP package. For serious tree makers, this program is highly
   recommended.

  TWO NOTES ON BOOTSTRAPPING...

   When you use the BOOTSTRAP in Clustal W to estimate the reliability of
   parts of a tree, many of the uncorrected distances may randomly exceed
   the arbitrary cut off of 0.93 (sequences only 7% identical) if the
   sequences are distantly related. This will happen randomly i.e. even
   if none of the pairs of sequences are less than 7% identical, the
   bootstrap samples may contain pairs of sequences that do exceed this
   cut off. In practice, this can happen with many data sets. It is not
   a serious problem if it happens rarely. If it does happen (you are
   warned when it happens and told how often the problem occurs), you
   should consider removing the most distantly related sequences and/or
   using the PHYLIP package instead.

   A further problem arises in almost exactly the opposite situation:
   when you bootstrap a data set which contains 3 or more sequences that
   are identical or almost identical. Here, the sets of identical
   sequences should be shown as a multifurcation (several sequences
   joining at the same part of the tree).
   Because the Neighbor-Joining method only gives strictly dichotomous
   trees (never more than 2 sequences join at one time), this cannot be
   exactly represented. In practice, this is NOT a problem as there will
   be some internal branches of zero length separating the sequences. If
   you display the tree with all branch lengths, you will still see a
   multifurcation. However, when you bootstrap the tree, only the
   branching orders are stored and counted. In the case of
   multifurcations, the exact branching order is arbitrary but the
   program will always get the same branching order, depending only on
   the input order of the sequences. In practice, this is only a problem
   in situations where you have a set of sequences where all of them are
   VERY similar. In this case, you can find very high support for some
   groupings which will disappear if you run the analysis with a
   different input order. Again, the PHYLIP package deals with this by
   offering a JUMBLE option to shuffle the input order of your sequences
   between each bootstrap sample.

Usage

   Here is a sample session with emma

% emma
Multiple alignment program - interface to ClustalW program
Input (gapped) sequence(s): globins.fasta
output sequence set [hbb_human.aln]:
Dendrogram (tree file) from clustalw output file [hbb_human.dnd]:

 CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: HBB_HUMAN       146 aa
Sequence 2: HBB_HORSE       146 aa
Sequence 3: HBA_HUMAN       141 aa
Sequence 4: HBA_HORSE       141 aa
Sequence 5: MYG_PHYCA       153 aa
Sequence 6: GLB5_PETMA      149 aa
Sequence 7: LGB2_LUPLU      153 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  83
Sequences (1:3) Aligned. Score:  43
Sequences (1:4) Aligned. Score:  42
Sequences (1:5) Aligned. Score:  24
Sequences (1:6) Aligned. Score:  21
Sequences (1:7) Aligned. Score:  14
Sequences (2:3) Aligned. Score:  41
Sequences (2:4) Aligned. Score:  43
Sequences (2:5) Aligned. Score:  24
Sequences (2:6) Aligned. Score:  19
Sequences (2:7) Aligned. Score:  15
Sequences (3:4) Aligned. Score:  87
Sequences (3:5) Aligned. Score:  26
Sequences (3:6) Aligned. Score:  29
Sequences (3:7) Aligned. Score:  16
Sequences (4:5) Aligned. Score:  26
Sequences (4:6) Aligned. Score:  27
Sequences (4:7) Aligned. Score:  12
Sequences (5:6) Aligned. Score:  21
Sequences (5:7) Aligned. Score:  7
Sequences (6:7) Aligned. Score:  11
Guide tree        file created:   [12345678A]
Start of Multiple Alignment
There are 6 groups
Aligning...
Group 1: Sequences:   2      Score:2194
Group 2: Sequences:   2      Score:2165
Group 3: Sequences:   4      Score:960
Group 4:                     Delayed
Group 5:                     Delayed
Group 6:                     Delayed
Sequence:5     Score:865
Sequence:6     Score:797
Sequence:7     Score:1044
Alignment Score 4164
GCG-Alignment file created      [12345678A]

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutset  [.] Sequence set filename
                                  and optional format (output USA)
  [-dendoutfile]       outfile    [*.emma] Dendrogram (tree file) from
                                  clustalw output file

   Additional (Optional) qualifiers (* if not always prompted):
   -onlydend           toggle     [N] Only produce dendrogram file
*  -dend               toggle     [N] Do alignment using an old dendrogram
*  -dendfile           infile     Dendrogram (tree file) from clustalw file
                                  (optional)
*  -pwmatrix           menu       [b] The scoring table which describes the
                                  similarity of each amino acid to each other.
                                  There are three 'in-built' series of weight
                                  matrices offered. Each consists of several
                                  matrices which work differently at different
                                  evolutionary distances. To see the exact
                                  details, read the documentation. Crudely, we
                                  store several matrices in memory, spanning
                                  the full range of amino acid distance (from
                                  almost identical sequences to highly
                                  divergent ones). For very similar sequences,
                                  it is best to use a strict weight matrix
                                  which only gives a high score to identities
                                  and the most favoured conservative
                                  substitutions. For more divergent sequences,
                                  it is appropriate to use 'softer' matrices
                                  which give a high score to many other
                                  frequent substitutions.
                                  1) BLOSUM (Henikoff). These matrices appear
                                  to be the best available for carrying out
                                  data base similarity (homology searches).
                                  The matrices used are: Blosum80, 62, 45 and
                                  30.
                                  2) PAM (Dayhoff). These have been extremely
                                  widely used since the late '70s. We use the
                                  PAM 120, 160, 250 and 350 matrices.
                                  3) GONNET. These matrices were derived
                                  using almost the same procedure as the